
Spark-Docker-SQL - Troubleshooting Guide

🚨 Quick Emergency Fixes

🔥 "Everything is Broken" - Nuclear Option

# Stop everything and restart fresh
docker-compose down -v --remove-orphans
docker system prune -f
docker volume prune -f
docker-compose up -d --build

⚡ "Just Need to Restart" - Soft Reset

# Restart just the services
docker-compose restart
# Or restart specific service
docker-compose restart spark-master

🐳 Docker Issues

1. Container Won't Start

Symptom: docker-compose up fails or containers exit immediately

Common Causes & Solutions:

Port Already in Use

# Check what's using the port
lsof -i :8080  # For Spark UI
lsof -i :5432  # For PostgreSQL
lsof -i :8888  # For Jupyter

# Kill the process using the port
sudo kill -9 <PID>

# Or change ports in docker-compose.yml
ports:
  - "8081:8080"  # Use different host port

Insufficient Memory

# Check Docker resource allocation
docker system info | grep -i memory

# Increase Docker memory limit (Docker Desktop):
# Settings → Resources → Memory → Increase to 8GB+

# For Linux, check available memory
free -h

Volume Mount Issues

# Check if directories exist
ls -la data/
ls -la notebooks/

# Create missing directories
mkdir -p data/raw data/processed data/features
mkdir -p notebooks config sql

# Fix permissions
sudo chown -R $USER:$USER data/ notebooks/ config/
chmod -R 755 data/ notebooks/ config/

2. Cannot Connect to Services

Symptom: "Connection refused" when accessing Spark UI or Jupyter

Solutions:

Check Container Status

# See which containers are running
docker-compose ps

# Check logs for specific service
docker-compose logs spark-master
docker-compose logs jupyter
docker-compose logs postgres

Network Issues

# Check if services are listening
docker-compose exec spark-master netstat -tlnp | grep 8080
docker-compose exec postgres netstat -tlnp | grep 5432

# Test connectivity between containers
docker-compose exec jupyter ping spark-master
docker-compose exec jupyter ping postgres

Firewall/Security Issues

# Disable firewall temporarily (Linux)
sudo ufw disable

# For macOS, check System Preferences → Security & Privacy

# For Windows, check Windows Defender Firewall

3. Out of Disk Space

Symptom: "No space left on device"

Solutions:

# Check disk usage
df -h
docker system df

# Clean up Docker resources
docker system prune -a --volumes
docker builder prune -a

# Remove unused images
docker image prune -a

# Clean up old containers
docker container prune

4. Docker Compose Version Issues

Symptom: "version not supported" or syntax errors

Solution:

# Check Docker Compose version
docker-compose --version

# Update Docker Compose (Linux)
sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

# For older versions, use version 3.7 instead of 3.8 in docker-compose.yml

⚡ Spark Issues

1. Spark Session Creation Fails

Symptom: Cannot connect to Spark master or session creation hangs

Common Causes & Solutions:

Master Not Running

# Check if the Spark master UI is reachable
import requests
try:
    response = requests.get("http://localhost:8080", timeout=5)
    print("Spark master is running")
except requests.exceptions.RequestException:
    print("Cannot reach Spark master")

Wrong Master URL

# Try different master configurations
from pyspark.sql import SparkSession

# For local development (no cluster needed)
spark = SparkSession.builder.master("local[*]").getOrCreate()

# From inside the Jupyter container, use the Docker service name
spark = SparkSession.builder.master("spark://spark-master:7077").getOrCreate()

# From the host machine (only if port 7077 is published in docker-compose.yml)
spark = SparkSession.builder.master("spark://localhost:7077").getOrCreate()

Memory Configuration Issues

spark = (SparkSession.builder
         .appName("SmartCityIoTPipeline")
         .master("local[*]")
         .config("spark.driver.memory", "2g")  # Reduce if needed
         .config("spark.executor.memory", "1g")  # Reduce if needed
         .config("spark.driver.maxResultSize", "1g")
         .getOrCreate())

2. Out of Memory Errors

Symptom: Java heap space or GC overhead limit exceeded

Solutions:

Increase Memory Allocation

# These only take effect at session creation (builder or spark-submit), not via spark.conf.set at runtime
spark.stop()  # stop the existing session first
spark = (SparkSession.builder
         .config("spark.driver.memory", "4g")
         .config("spark.executor.memory", "2g")
         .config("spark.driver.maxResultSize", "2g")
         .getOrCreate())

Optimize Data Processing

# Use sampling for large datasets
sample_df = large_df.sample(0.1, seed=42)

# Cache frequently used DataFrames
df.cache()
df.count()  # Trigger caching

# Repartition data
df = df.repartition(4)  # Fewer partitions for small datasets

# Use coalesce to reduce partitions
df = df.coalesce(2)

Process Data in Chunks

# Process data month by month
from pyspark.sql import functions as F

for month in range(1, 13):
    monthly_data = df.filter(F.month("timestamp") == month).cache()
    # ... process monthly_data ...
    monthly_data.unpersist()  # Free the cached data when done

3. Slow Spark Jobs

Symptom: Jobs take very long time or appear to hang

Solutions:

Check Spark UI for Bottlenecks

  • Open http://localhost:4040 (or 4041, 4042 if multiple sessions)
  • Look at the Jobs tab for failed/slow stages
  • Check Executors tab for resource usage
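
If you'd rather check from code, the same UI also serves Spark's monitoring REST API (assumed here on the default port 4040); a minimal sketch that lists jobs per application:

import requests

base = "http://localhost:4040/api/v1"  # use 4041/4042 for additional sessions
try:
    for app in requests.get(f"{base}/applications", timeout=5).json():
        jobs = requests.get(f"{base}/applications/{app['id']}/jobs", timeout=5).json()
        failed = [j for j in jobs if j["status"] == "FAILED"]
        print(f"{app['name']}: {len(jobs)} jobs, {len(failed)} failed")
except requests.exceptions.RequestException as e:
    print(f"Cannot reach the Spark UI REST API: {e}")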

Optimize Partitioning

# Check current partitions
print(f"Partitions: {df.rdd.getNumPartitions()}")

# Optimal partitions = 2-3x number of cores
optimal_partitions = spark.sparkContext.defaultParallelism * 2
df = df.repartition(optimal_partitions)

Avoid Expensive Operations

# Avoid repeated .count() calls
count = df.count()
print(f"Records: {count}")

# Use .cache() for DataFrames used multiple times
df.cache()

# Avoid .collect() on large datasets
# Instead of:
# all_data = df.collect()   # BAD: loads every row into the driver
# Use:
sample_data = df.limit(1000).collect()  # GOOD: only a small sample

Optimize Joins

# Broadcast small DataFrames
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")

# Use appropriate join strategies
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

4. DataFrame Operations Fail

Symptom: AnalysisException or column not found errors

Solutions:

Check Schema and Column Names

# Print schema to see exact column names
df.printSchema()

# Show column names
print(df.columns)

# Check for case sensitivity
df.select([F.col(c) for c in df.columns if 'timestamp' in c.lower()])

Handle Null Values

# Check for nulls before operations
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# Drop nulls before joins
df_clean = df.na.drop(subset=['key_column'])

# Fill nulls with defaults
df_filled = df.na.fill({'numeric_col': 0, 'string_col': 'unknown'})

Fix Data Type Issues

# Cast columns to correct types
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))
df = df.withColumn("numeric_col", F.col("numeric_col").cast("double"))

# Handle string/numeric conversion errors
df = df.withColumn("safe_numeric", 
    F.when(F.col("string_col").rlike("^[0-9.]+$"), 
           F.col("string_col").cast("double")).otherwise(0))

🗄️ Database Connection Issues

1. Cannot Connect to PostgreSQL

Symptom: Connection refused or authentication failed

Solutions:

Check PostgreSQL Status

# Check if PostgreSQL container is running
docker-compose ps postgres

# Check PostgreSQL logs
docker-compose logs postgres

# Test connection from host
psql -h localhost -p 5432 -U postgres -d smartcity

From Jupyter/Spark Container

# Test database connection
import psycopg2

try:
    conn = psycopg2.connect(
        host="postgres",  # Use container name, not localhost
        port=5432,
        user="postgres",
        password="password",
        database="smartcity"
    )
    print("Database connection successful")
    conn.close()
except Exception as e:
    print(f"Connection failed: {e}")

Spark JDBC Connection

# Correct JDBC URL for Docker
jdbc_url = "jdbc:postgresql://postgres:5432/smartcity"

# Test Spark database connection
test_df = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "(SELECT 1 as test) as test_table") \
    .option("user", "postgres") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

test_df.show()

2. JDBC Driver Issues

Symptom: ClassNotFoundException: org.postgresql.Driver

Solutions:

Add JDBC Driver to Spark

spark = SparkSession.builder \
    .appName("SmartCityIoTPipeline") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.5.0") \
    .getOrCreate()
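
Note that spark.jars.packages is only honored when the session is first created; if getOrCreate() returned an already-running session, the driver will not be on the classpath. A quick way to confirm what the running session was actually started with:

conf = spark.sparkContext.getConf()
print("spark.jars.packages =", conf.get("spark.jars.packages", "(not set)"))
print("spark.jars          =", conf.get("spark.jars", "(not set)"))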

Download Driver Manually

# Download PostgreSQL JDBC driver
cd /opt/bitnami/spark/jars/
wget https://jdbc.postgresql.org/download/postgresql-42.5.0.jar

📊 Data Loading Issues

1. File Not Found Errors

Symptom: FileNotFoundException or path does not exist

Solutions:

Check File Paths

import os

# Check if file exists
data_file = "data/raw/traffic_sensors.csv"
print(f"File exists: {os.path.exists(data_file)}")

# List directory contents
print(os.listdir("data/raw/"))

# Use absolute paths if needed
import os
abs_path = os.path.abspath("data/raw/traffic_sensors.csv")
df = spark.read.csv(abs_path, header=True, inferSchema=True)

Volume Mount Issues

# Check if volumes are mounted correctly
docker-compose exec jupyter ls -la /home/jovyan/work/data/

# Verify volume mounts in docker-compose.yml
volumes:
  - ./data:/home/jovyan/work/data
  - ./notebooks:/home/jovyan/work/notebooks

2. Schema Inference Problems

Symptom: Wrong data types or parsing errors

Solutions:

Explicit Schema Definition

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType

# Define explicit schema
schema = StructType([
    StructField("sensor_id", StringType(), False),
    StructField("timestamp", StringType(), False),  # Read as string first
    StructField("vehicle_count", IntegerType(), True),
    StructField("avg_speed", DoubleType(), True)
])

df = spark.read.csv("data/raw/traffic_sensors.csv", 
                   header=True, schema=schema)

# Then convert timestamp
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))

Handle Different Date Formats

# Try different timestamp formats
df = df.withColumn("timestamp", 
    F.coalesce(
        F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss"),
        F.to_timestamp("timestamp", "MM/dd/yyyy HH:mm:ss"),
        F.to_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
    ))

3. Large File Loading Issues

Symptom: Out of memory when loading large files

Solutions:

Process Files in Chunks

# Spark already reads large CSVs lazily and splits them into partitions, so
# "chunking" mostly means controlling partition count rather than rows per read
# (the maxRecordsPerFile option only applies when writing, not reading)
def process_large_csv(file_path, partitions=200):
    df = spark.read.csv(file_path, header=True, inferSchema=False)  # inferSchema=True scans the whole file
    return df.repartition(partitions)

# Or split large files manually before loading
# split -l 100000 large_file.csv chunk_

Optimize File Format

# Convert to Parquet for better performance
df.write.mode("overwrite").parquet("data/processed/traffic_optimized.parquet")

# Read Parquet instead of CSV
df = spark.read.parquet("data/processed/traffic_optimized.parquet")

🔧 Environment Setup Issues

1. Python Package Conflicts

Symptom: ImportError or version conflicts

Solutions:

Check Package Versions

import sys
print(f"Python version: {sys.version}")

import pyspark
print(f"PySpark version: {pyspark.__version__}")

import pandas
print(f"Pandas version: {pandas.__version__}")

Rebuild Jupyter Container

# Rebuild with latest packages
docker-compose down
docker-compose build --no-cache jupyter
docker-compose up -d

Manual Package Installation

# Install packages in running container
docker-compose exec jupyter pip install package_name

# Or add to requirements.txt and rebuild

2. Jupyter Notebook Issues

Symptom: Kernel won't start or crashes frequently

Solutions:

Restart Jupyter Kernel

  • In Jupyter: Kernel → Restart & Clear Output

Check Jupyter Logs

docker-compose logs jupyter

Increase Memory Limits

# In docker-compose.yml
jupyter:
  # ... other config
  deploy:
    resources:
      limits:
        memory: 4G

Clear Jupyter Cache

# Remove Jupyter cache (use an explicit path; ~ is not expanded by docker-compose exec)
docker-compose exec jupyter rm -rf /home/jovyan/.jupyter
docker-compose restart jupyter

🚀 Performance Optimization Tips

1. Spark Configuration Tuning

# Optimal Spark configuration for development
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Memory optimization (spark.memory.fraction must be set at session creation, e.g.
# SparkSession.builder.config("spark.memory.fraction", "0.8"); it cannot be changed at runtime)
spark.conf.set("spark.sql.shuffle.partitions", "200")  # Adjust based on data size

2. Data Processing Best Practices

# Cache DataFrames used multiple times
df.cache()
df.count()  # Trigger caching

# Use appropriate file formats
# CSV (slowest) → JSON → Parquet (fastest)

# Partition data for better performance
df.write.partitionBy("year", "month").parquet("partitioned_data")

# Select only the columns you need as early as possible (column pruning)
df.select("col1", "col2").filter("col1 > 100")

3. Memory Management

# Unpersist DataFrames when done
df.unpersist()

# Clear Spark context periodically
spark.catalog.clearCache()

# List registered tables; check per-table cache status with spark.catalog.isCached("table_name")
print(f"Tables: {spark.catalog.listTables()}")

🐞 Debugging Strategies

1. Enable Debug Logging

# Set log level for debugging
spark.sparkContext.setLogLevel("DEBUG")  # Very verbose
spark.sparkContext.setLogLevel("INFO")   # Moderate
spark.sparkContext.setLogLevel("WARN")   # Minimal (default)

2. Inspect Data at Each Step

# Check DataFrame at each transformation
print(f"Step 1 - Rows: {df1.count()}, Columns: {len(df1.columns)}")
df1.show(5)

df2 = df1.filter(F.col("value") > 0)
print(f"Step 2 - Rows: {df2.count()}, Columns: {len(df2.columns)}")
df2.show(5)

3. Use Explain Plans

# See execution plan
df.explain(True)

# Check for expensive operations
df.explain("cost")

4. Sample Data for Testing

# Use small samples for development
sample_df = large_df.sample(0.01, seed=42)  # 1% sample

# Limit rows for testing
test_df = df.limit(1000)

📋 Health Check Commands

Quick System Check Script

#!/bin/bash
echo "🔍 Smart City IoT Pipeline Health Check"
echo "======================================"

echo "📋 Docker Status:"
docker --version
docker-compose --version

echo "🐳 Container Status:"
docker-compose ps

echo "💾 Disk Usage:"
df -h
docker system df

echo "🧠 Memory Usage:"
free -h

echo "🌐 Network Connectivity:"
curl -s -o /dev/null -w "%{http_code}" http://localhost:8080 && echo " ✅ Spark UI accessible" || echo " ❌ Spark UI not accessible"
curl -s -o /dev/null -w "%{http_code}" http://localhost:8888 && echo " ✅ Jupyter accessible" || echo " ❌ Jupyter not accessible"

echo "🗄️ Database Status:"
docker-compose exec -T postgres pg_isready -U postgres && echo " ✅ PostgreSQL ready" || echo " ❌ PostgreSQL not ready"

echo "📁 Data Files:"
ls -la data/raw/ 2>/dev/null && echo " ✅ Raw data found" || echo " ❌ Raw data missing"

Python Health Check

def health_check():
    """Run comprehensive health check"""
    checks = {
        "spark_session": False,
        "database_connection": False,
        "data_files": False,
        "memory_usage": False
    }
    
    # Check Spark session
    try:
        spark.sparkContext.statusTracker()
        checks["spark_session"] = True
        print("✅ Spark session healthy")
    except Exception:
        print("❌ Spark session issues")
    
    # Check database
    try:
        test_df = spark.read.format("jdbc") \
            .option("url", "jdbc:postgresql://postgres:5432/smartcity") \
            .option("dbtable", "(SELECT 1) as test") \
            .option("user", "postgres") \
            .option("password", "password") \
            .load()
        test_df.count()
        checks["database_connection"] = True
        print("✅ Database connection healthy")
    except Exception as e:
        print(f"❌ Database issues: {e}")
    
    # Check data files
    try:
        import os
        required_files = [
            "data/raw/traffic_sensors.csv",
            "data/raw/air_quality.json", 
            "data/raw/weather_data.parquet"
        ]
        
        missing_files = [f for f in required_files if not os.path.exists(f)]
        if not missing_files:
            checks["data_files"] = True
            print("✅ All data files present")
        else:
            print(f"❌ Missing files: {missing_files}")
    except Exception as e:
        print(f"❌ File check failed: {e}")
    
    # Check memory usage
    try:
        import psutil
        memory_percent = psutil.virtual_memory().percent
        if memory_percent < 80:
            checks["memory_usage"] = True
            print(f"✅ Memory usage OK: {memory_percent:.1f}%")
        else:
            print(f"⚠️ High memory usage: {memory_percent:.1f}%")
    except Exception:
        print("❓ Cannot check memory usage")
    
    overall_health = sum(checks.values()) / len(checks) * 100
    print(f"\n📊 Overall System Health: {overall_health:.1f}%")
    
    return checks

# Run health check
health_status = health_check()

🆘 When All Else Fails

Complete Environment Reset

# Nuclear option - complete reset
docker-compose down -v --remove-orphans
docker system prune -a --volumes
docker builder prune -a

# Remove all project data (CAUTION!)
rm -rf data/processed/* data/features/*

# Rebuild everything
docker-compose build --no-cache
docker-compose up -d

# Regenerate sample data
python scripts/generate_data.py

Get Help

  1. Check GitHub Issues: Look for similar problems in the project repository
  2. Stack Overflow: Search for Spark/Docker specific errors
  3. Spark Documentation: https://spark.apache.org/docs/latest/
  4. Docker Documentation: https://docs.docker.com/

Collect Diagnostic Information

# Gather system information for help requests
echo "System Information:" > diagnostic_info.txt
uname -a >> diagnostic_info.txt
docker --version >> diagnostic_info.txt
docker-compose --version >> diagnostic_info.txt
python --version >> diagnostic_info.txt

echo "Container Status:" >> diagnostic_info.txt
docker-compose ps >> diagnostic_info.txt

echo "Container Logs:" >> diagnostic_info.txt
docker-compose logs --tail=50 >> diagnostic_info.txt

echo "Disk Usage:" >> diagnostic_info.txt
df -h >> diagnostic_info.txt
docker system df >> diagnostic_info.txt

📚 Additional Resources

Remember: Most issues are environment-related. When in doubt, restart containers and check logs! 🔄