Spark-Docker-SQL - Troubleshooting Guide
🚨 Quick Emergency Fixes
🔥 "Everything is Broken" - Nuclear Option
# Stop everything and restart fresh
docker-compose down -v --remove-orphans
docker system prune -f
docker volume prune -f
docker-compose up -d --build
⚡ "Just Need to Restart" - Soft Reset
# Restart just the services
docker-compose restart
# Or restart specific service
docker-compose restart spark-master
🐳 Docker Issues
1. Container Won't Start
Symptom: docker-compose up fails or containers exit immediately
Common Causes & Solutions:
Port Already in Use
# Check what's using the port
lsof -i :8080 # For Spark UI
lsof -i :5432 # For PostgreSQL
lsof -i :8888 # For Jupyter
# Kill the process using the port
sudo kill -9 <PID>
# Or change ports in docker-compose.yml
ports:
  - "8081:8080"  # Use a different host port
Insufficient Memory
# Check Docker resource allocation
docker system info | grep -i memory
# Increase Docker memory limit (Docker Desktop):
# Settings → Resources → Memory → Increase to 8GB+
# For Linux, check available memory
free -h
Volume Mount Issues
# Check if directories exist
ls -la data/
ls -la notebooks/
# Create missing directories
mkdir -p data/raw data/processed data/features
mkdir -p notebooks config sql
# Fix permissions
sudo chown -R $USER:$USER data/ notebooks/ config/
chmod -R 755 data/ notebooks/ config/
2. Cannot Connect to Services
Symptom: "Connection refused" when accessing Spark UI or Jupyter
Solutions:
Check Container Status
# See which containers are running
docker-compose ps
# Check logs for specific service
docker-compose logs spark-master
docker-compose logs jupyter
docker-compose logs postgres
Network Issues
# Check if services are listening
docker-compose exec spark-master netstat -tlnp | grep 8080
docker-compose exec postgres netstat -tlnp | grep 5432
# Test connectivity between containers
docker-compose exec jupyter ping spark-master
docker-compose exec jupyter ping postgres
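If ping or netstat isn't available in the slim container images, a quick Python socket check run inside the Jupyter container works just as well (the container names and ports below are the ones used elsewhere in this guide):

```python
# Minimal TCP reachability check -- run from a notebook inside the Jupyter container
import socket

for host, port in [("spark-master", 7077), ("spark-master", 8080), ("postgres", 5432)]:
    try:
        with socket.create_connection((host, port), timeout=3):
            print(f"✅ {host}:{port} reachable")
    except OSError as e:
        print(f"❌ {host}:{port} unreachable ({e})")
```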
Firewall/Security Issues
# Disable firewall temporarily (Linux)
sudo ufw disable
# For macOS, check System Preferences → Security & Privacy
# For Windows, check Windows Defender Firewall
3. Out of Disk Space
Symptom: "No space left on device"
Solutions:
# Check disk usage
df -h
docker system df
# Clean up Docker resources
docker system prune -a --volumes
docker builder prune -a
# Remove unused images
docker image prune -a
# Clean up old containers
docker container prune
4. Docker Compose Version Issues
Symptom: "version not supported" or syntax errors
Solution:
# Check Docker Compose version
docker-compose --version
# Update Docker Compose (Linux)
sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
# For older versions, use version 3.7 instead of 3.8 in docker-compose.yml
⚡ Spark Issues
1. Spark Session Creation Fails
Symptom: Cannot connect to Spark master or session creation hangs
Common Causes & Solutions:
Master Not Running
# Check if Spark master is accessible
import requests
try:
    response = requests.get("http://localhost:8080", timeout=5)
    print("Spark master is running")
except requests.exceptions.RequestException:
    print("Cannot reach Spark master")
Wrong Master URL
# Try different master configurations
# For local development
spark = SparkSession.builder.master("local[*]").getOrCreate()
# For Docker cluster
spark = SparkSession.builder.master("spark://spark-master:7077").getOrCreate()
# Check from inside Jupyter container
spark = SparkSession.builder.master("spark://localhost:7077").getOrCreate()
Memory Configuration Issues
spark = (SparkSession.builder
         .appName("SmartCityIoTPipeline")
         .master("local[*]")
         .config("spark.driver.memory", "2g")    # Reduce if needed
         .config("spark.executor.memory", "1g")  # Reduce if needed
         .config("spark.driver.maxResultSize", "1g")
         .getOrCreate())
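If it is unclear whether the cluster master is reachable at all, a small fallback helper keeps notebooks usable either way. This is a sketch, not part of the project code (the master URL is the one from the Docker setup above), and note that an unreachable standalone master can take a minute of retries before the builder gives up:

```python
from pyspark.sql import SparkSession

def get_spark(app_name="SmartCityIoTPipeline", cluster_url="spark://spark-master:7077"):
    """Try the Docker cluster first; fall back to local mode if it can't be reached."""
    try:
        return (SparkSession.builder.appName(app_name)
                .master(cluster_url)
                .config("spark.driver.memory", "2g")
                .getOrCreate())
    except Exception as e:
        print(f"Cluster master unavailable ({e}), falling back to local[*]")
        return (SparkSession.builder.appName(app_name)
                .master("local[*]")
                .getOrCreate())

spark = get_spark()
```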
2. Out of Memory Errors
Symptom: Java heap space or GC overhead limit exceeded errors
Solutions:
Increase Memory Allocation
spark.conf.set("spark.driver.memory", "4g")
spark.conf.set("spark.executor.memory", "2g")
spark.conf.set("spark.driver.maxResultSize", "2g")
Optimize Data Processing
# Use sampling for large datasets
sample_df = large_df.sample(0.1, seed=42)
# Cache frequently used DataFrames
df.cache()
df.count() # Trigger caching
# Repartition data
df = df.repartition(4) # Fewer partitions for small datasets
# Use coalesce to reduce partitions
df = df.coalesce(2)
Process Data in Chunks
# Process data month by month
for month in range(1, 13):
    monthly_data = df.filter(F.month("timestamp") == month)
    # Process monthly_data
    monthly_data.unpersist()  # Free memory
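If the loop above still runs hot, another option (sketched here with assumed paths) is to write the data out partitioned by month once, then work on one partition at a time; partition pruning means each later read only touches that month's files:

```python
from pyspark.sql import functions as F

# Write once, partitioned by month (path is an example; adjust to your layout)
(df.withColumn("month", F.month("timestamp"))
   .write.mode("overwrite")
   .partitionBy("month")
   .parquet("data/processed/readings_by_month"))

# Reading a single month later only scans that partition
march = spark.read.parquet("data/processed/readings_by_month").filter(F.col("month") == 3)
```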
3. Slow Spark Jobs
Symptom: Jobs take very long time or appear to hang
Solutions:
Check Spark UI for Bottlenecks
- Open http://localhost:4040 (or 4041, 4042 if multiple sessions)
- Look at the Jobs tab for failed/slow stages
- Check Executors tab for resource usage
Optimize Partitioning
# Check current partitions
print(f"Partitions: {df.rdd.getNumPartitions()}")
# Optimal partitions = 2-3x number of cores
optimal_partitions = spark.sparkContext.defaultParallelism * 2
df = df.repartition(optimal_partitions)
Avoid Expensive Operations
# Avoid repeated .count() calls
count = df.count()
print(f"Records: {count}")
# Use .cache() for DataFrames used multiple times
df.cache()
# Avoid .collect() on large datasets
# Instead of:
all_data = df.collect() # BAD: loads all data to driver
# Use:
sample_data = df.limit(1000).collect() # GOOD: only sample
Optimize Joins
# Broadcast small DataFrames
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")
# Use appropriate join strategies
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
4. DataFrame Operations Fail
Symptom: AnalysisException or column-not-found errors
Solutions:
Check Schema and Column Names
# Print schema to see exact column names
df.printSchema()
# Show column names
print(df.columns)
# Check for case sensitivity
df.select([F.col(c) for c in df.columns if 'timestamp' in c.lower()])
Handle Null Values
# Check for nulls before operations
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()
# Drop nulls before joins
df_clean = df.na.drop(subset=['key_column'])
# Fill nulls with defaults
df_filled = df.na.fill({'numeric_col': 0, 'string_col': 'unknown'})
Fix Data Type Issues
# Cast columns to correct types
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))
df = df.withColumn("numeric_col", F.col("numeric_col").cast("double"))
# Handle string/numeric conversion errors
df = df.withColumn("safe_numeric",
F.when(F.col("string_col").rlike("^[0-9.]+$"),
F.col("string_col").cast("double")).otherwise(0))
🗄️ Database Connection Issues
1. Cannot Connect to PostgreSQL
Symptom: Connection refused or authentication failed
Solutions:
Check PostgreSQL Status
# Check if PostgreSQL container is running
docker-compose ps postgres
# Check PostgreSQL logs
docker-compose logs postgres
# Test connection from host
psql -h localhost -p 5432 -U postgres -d smartcity
From Jupyter/Spark Container
# Test database connection
import psycopg2
try:
    conn = psycopg2.connect(
        host="postgres",  # Use container name, not localhost
        port=5432,
        user="postgres",
        password="password",
        database="smartcity"
    )
    print("Database connection successful")
    conn.close()
except Exception as e:
    print(f"Connection failed: {e}")
Spark JDBC Connection
# Correct JDBC URL for Docker
jdbc_url = "jdbc:postgresql://postgres:5432/smartcity"
# Test Spark database connection
test_df = spark.read.format("jdbc") \
.option("url", jdbc_url) \
.option("dbtable", "(SELECT 1 as test) as test_table") \
.option("user", "postgres") \
.option("password", "password") \
.option("driver", "org.postgresql.Driver") \
.load()
test_df.show()
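Writing back to PostgreSQL uses the same options. A hedged example (the DataFrame and table name are placeholders, not part of the project) for pushing processed results into the database:

```python
(processed_df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/smartcity")
    .option("dbtable", "traffic_hourly_summary")  # example table name
    .option("user", "postgres")
    .option("password", "password")
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save())
```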
2. JDBC Driver Issues
Symptom: ClassNotFoundException: org.postgresql.Driver
Solutions:
Add JDBC Driver to Spark
spark = SparkSession.builder \
.appName("SmartCityIoTPipeline") \
.config("spark.jars.packages", "org.postgresql:postgresql:42.5.0") \
.getOrCreate()
Download Driver Manually
# Download PostgreSQL JDBC driver
cd /opt/bitnami/spark/jars/
wget https://jdbc.postgresql.org/download/postgresql-42.5.0.jar
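If the jar was downloaded manually instead of pulled via spark.jars.packages, point the session at it explicitly (the path below matches the wget command above):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SmartCityIoTPipeline")
         .config("spark.jars", "/opt/bitnami/spark/jars/postgresql-42.5.0.jar")
         .getOrCreate())
```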
📊 Data Loading Issues
1. File Not Found Errors
Symptom: FileNotFoundException or "path does not exist" errors
Solutions:
Check File Paths
import os
# Check if file exists
data_file = "data/raw/traffic_sensors.csv"
print(f"File exists: {os.path.exists(data_file)}")
# List directory contents
print(os.listdir("data/raw/"))
# Use absolute paths if needed
import os
abs_path = os.path.abspath("data/raw/traffic_sensors.csv")
df = spark.read.csv(abs_path, header=True, inferSchema=True)
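Keep in mind that inside the Jupyter container the project is mounted under /home/jovyan/work (see the volume mounts below), so host-style relative paths may not resolve. A small helper that tries both layouts is one way to make notebooks portable (a sketch, not project code):

```python
import os

def resolve_data_path(relative_path):
    """Return the first existing candidate for a data file (host vs. container layout)."""
    candidates = [relative_path, os.path.join("/home/jovyan/work", relative_path)]
    for path in candidates:
        if os.path.exists(path):
            return path
    raise FileNotFoundError(f"None of these paths exist: {candidates}")

df = spark.read.csv(resolve_data_path("data/raw/traffic_sensors.csv"),
                    header=True, inferSchema=True)
```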
Volume Mount Issues
# Check if volumes are mounted correctly
docker-compose exec jupyter ls -la /home/jovyan/work/data/
# Verify volume mounts in docker-compose.yml
volumes:
  - ./data:/home/jovyan/work/data
  - ./notebooks:/home/jovyan/work/notebooks
2. Schema Inference Problems
Symptom: Wrong data types or parsing errors
Solutions:
Explicit Schema Definition
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType
# Define explicit schema
schema = StructType([
    StructField("sensor_id", StringType(), False),
    StructField("timestamp", StringType(), False),  # Read as string first
    StructField("vehicle_count", IntegerType(), True),
    StructField("avg_speed", DoubleType(), True)
])
df = spark.read.csv("data/raw/traffic_sensors.csv",
header=True, schema=schema)
# Then convert timestamp
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))
Handle Different Date Formats
# Try different timestamp formats
df = df.withColumn("timestamp",
F.coalesce(
F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss"),
F.to_timestamp("timestamp", "MM/dd/yyyy HH:mm:ss"),
F.to_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
))
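When a handful of malformed rows are the culprit, reading in PERMISSIVE mode with a corrupt-record column makes them easy to isolate instead of failing the whole load. A sketch building on the explicit schema defined above (the extra column name is arbitrary):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructField, StringType

# Extend the explicit schema from above with a column that captures unparseable rows
schema_with_bad_rows = schema.add(StructField("_corrupt_record", StringType(), True))

df = (spark.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .csv("data/raw/traffic_sensors.csv", header=True, schema=schema_with_bad_rows))

# Cache before inspecting -- Spark restricts queries that touch only the corrupt-record column
bad_rows = df.filter(F.col("_corrupt_record").isNotNull()).cache()
bad_rows.show(truncate=False)
```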
3. Large File Loading Issues
Symptom: Out of memory when loading large files
Solutions:
Process Files in Chunks
# Note: maxRecordsPerFile only applies to writes -- Spark already reads large CSVs
# in parallel splits, so the main memory saver on read is skipping schema inference
def process_large_csv(file_path, schema):
    # Supplying a schema avoids the extra full-file scan that inferSchema triggers
    df = spark.read.csv(file_path, header=True, schema=schema)
    return df
# Or split large files manually
# split -l 100000 large_file.csv chunk_
Optimize File Format
# Convert to Parquet for better performance
df.write.mode("overwrite").parquet("data/processed/traffic_optimized.parquet")
# Read Parquet instead of CSV
df = spark.read.parquet("data/processed/traffic_optimized.parquet")
🔧 Environment Setup Issues
1. Python Package Conflicts
Symptom: ImportError or version conflicts
Solutions:
Check Package Versions
import sys
print(f"Python version: {sys.version}")
import pyspark
print(f"PySpark version: {pyspark.__version__}")
import pandas
print(f"Pandas version: {pandas.__version__}")
Rebuild Jupyter Container
# Rebuild with latest packages
docker-compose down
docker-compose build --no-cache jupyter
docker-compose up -d
Manual Package Installation
# Install packages in running container
docker-compose exec jupyter pip install package_name
# Or add to requirements.txt and rebuild
2. Jupyter Notebook Issues
Symptom: Kernel won't start or crashes frequently
Solutions:
Restart Jupyter Kernel
- In Jupyter: Kernel → Restart & Clear Output
Check Jupyter Logs
docker-compose logs jupyter
Increase Memory Limits
# In docker-compose.yml
jupyter:
  # ... other config
  deploy:
    resources:
      limits:
        memory: 4G
Clear Jupyter Cache
# Remove Jupyter cache
docker-compose exec jupyter rm -rf ~/.jupyter/
docker-compose restart jupyter
🚀 Performance Optimization Tips
1. Spark Configuration Tuning
# Optimal Spark configuration for development
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# Memory optimization -- spark.memory.fraction replaces the deprecated
# spark.executor.memoryFraction and must be set on the builder, not at runtime
spark.conf.set("spark.sql.shuffle.partitions", "200")  # Adjust based on data size
2. Data Processing Best Practices
# Cache DataFrames used multiple times
df.cache()
df.count() # Trigger caching
# Use appropriate file formats
# CSV (slowest) → JSON → Parquet (fastest)
# Partition data for better performance
df.write.partitionBy("year", "month").parquet("partitioned_data")
# Use column pruning: select only the columns you need as early as possible
df.select("col1", "col2").filter("col1 > 100")  # Catalyst pushes filters/projections down either way, but pruning early keeps plans small and readable
3. Memory Management
# Unpersist DataFrames when done
df.unpersist()
# Clear Spark context periodically
spark.catalog.clearCache()
# List registered tables/views (includes uncached ones; see the check below)
print(f"Registered tables: {spark.catalog.listTables()}")
🐞 Debugging Strategies
1. Enable Debug Logging
# Set log level for debugging
spark.sparkContext.setLogLevel("DEBUG") # Very verbose
spark.sparkContext.setLogLevel("INFO") # Moderate
spark.sparkContext.setLogLevel("WARN") # Minimal (default)
2. Inspect Data at Each Step
# Check DataFrame at each transformation
print(f"Step 1 - Rows: {df1.count()}, Columns: {len(df1.columns)}")
df1.show(5)
df2 = df1.filter(F.col("value") > 0)
print(f"Step 2 - Rows: {df2.count()}, Columns: {len(df2.columns)}")
df2.show(5)
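The same idea wrapped as a reusable helper keeps pipelines readable; note that every call triggers a count, so keep it to debugging sessions (just a convenience sketch):

```python
from pyspark.sql import functions as F

def checkpoint(df, label):
    """Print row/column counts and pass the DataFrame through unchanged."""
    print(f"{label} - Rows: {df.count()}, Columns: {len(df.columns)}")
    return df

df2 = checkpoint(df1.filter(F.col("value") > 0), "After value filter")
```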
3. Use Explain Plans
# See execution plan
df.explain(True)
# Check for expensive operations
df.explain("cost")
4. Sample Data for Testing
# Use small samples for development
sample_df = large_df.sample(0.01, seed=42) # 1% sample
# Limit rows for testing
test_df = df.limit(1000)
📋 Health Check Commands
Quick System Check Script
#!/bin/bash
echo "🔍 Smart City IoT Pipeline Health Check"
echo "======================================"
echo "📋 Docker Status:"
docker --version
docker-compose --version
echo "🐳 Container Status:"
docker-compose ps
echo "💾 Disk Usage:"
df -h
docker system df
echo "🧠 Memory Usage:"
free -h
echo "🌐 Network Connectivity:"
curl -s -o /dev/null -w "%{http_code}" http://localhost:8080 && echo " ✅ Spark UI accessible" || echo " ❌ Spark UI not accessible"
curl -s -o /dev/null -w "%{http_code}" http://localhost:8888 && echo " ✅ Jupyter accessible" || echo " ❌ Jupyter not accessible"
echo "🗄️ Database Status:"
docker-compose exec -T postgres pg_isready -U postgres && echo " ✅ PostgreSQL ready" || echo " ❌ PostgreSQL not ready"
echo "📁 Data Files:"
ls -la data/raw/ 2>/dev/null && echo " ✅ Raw data found" || echo " ❌ Raw data missing"
Python Health Check
def health_check():
    """Run comprehensive health check"""
    checks = {
        "spark_session": False,
        "database_connection": False,
        "data_files": False,
        "memory_usage": False
    }
    # Check Spark session
    try:
        spark.sparkContext.statusTracker()
        checks["spark_session"] = True
        print("✅ Spark session healthy")
    except Exception:
        print("❌ Spark session issues")
    # Check database
    try:
        test_df = spark.read.format("jdbc") \
            .option("url", "jdbc:postgresql://postgres:5432/smartcity") \
            .option("dbtable", "(SELECT 1) as test") \
            .option("user", "postgres") \
            .option("password", "password") \
            .load()
        test_df.count()
        checks["database_connection"] = True
        print("✅ Database connection healthy")
    except Exception as e:
        print(f"❌ Database issues: {e}")
    # Check data files
    try:
        import os
        required_files = [
            "data/raw/traffic_sensors.csv",
            "data/raw/air_quality.json",
            "data/raw/weather_data.parquet"
        ]
        missing_files = [f for f in required_files if not os.path.exists(f)]
        if not missing_files:
            checks["data_files"] = True
            print("✅ All data files present")
        else:
            print(f"❌ Missing files: {missing_files}")
    except Exception as e:
        print(f"❌ File check failed: {e}")
    # Check memory usage
    try:
        import psutil
        memory_percent = psutil.virtual_memory().percent
        if memory_percent < 80:
            checks["memory_usage"] = True
            print(f"✅ Memory usage OK: {memory_percent:.1f}%")
        else:
            print(f"⚠️ High memory usage: {memory_percent:.1f}%")
    except Exception:
        print("❓ Cannot check memory usage")
    overall_health = sum(checks.values()) / len(checks) * 100
    print(f"\n📊 Overall System Health: {overall_health:.1f}%")
    return checks
# Run health check
health_status = health_check()
🆘 When All Else Fails
Complete Environment Reset
# Nuclear option - complete reset
docker-compose down -v --remove-orphans
docker system prune -a --volumes
docker builder prune -a
# Remove all project data (CAUTION!)
rm -rf data/processed/* data/features/*
# Rebuild everything
docker-compose build --no-cache
docker-compose up -d
# Regenerate sample data
python scripts/generate_data.py
Get Help
- Check GitHub Issues: Look for similar problems in the project repository
- Stack Overflow: Search for Spark/Docker specific errors
- Spark Documentation: https://spark.apache.org/docs/latest/
- Docker Documentation: https://docs.docker.com/
Collect Diagnostic Information
# Gather system information for help requests
echo "System Information:" > diagnostic_info.txt
uname -a >> diagnostic_info.txt
docker --version >> diagnostic_info.txt
docker-compose --version >> diagnostic_info.txt
python --version >> diagnostic_info.txt
echo "Container Status:" >> diagnostic_info.txt
docker-compose ps >> diagnostic_info.txt
echo "Container Logs:" >> diagnostic_info.txt
docker-compose logs --tail=50 >> diagnostic_info.txt
echo "Disk Usage:" >> diagnostic_info.txt
df -h >> diagnostic_info.txt
docker system df >> diagnostic_info.txt
📚 Additional Resources
- Spark Tuning Guide: https://spark.apache.org/docs/latest/tuning.html
- Docker Best Practices: https://docs.docker.com/develop/best-practices/
- PySpark API Documentation: https://spark.apache.org/docs/latest/api/python/
- PostgreSQL Docker Guide: https://hub.docker.com/_/postgres
Remember: Most issues are environment-related. When in doubt, restart containers and check logs! 🔄