# Spark-Docker-SQL - Troubleshooting Guide

## 🚨 Quick Emergency Fixes

### 🔥 "Everything is Broken" - Nuclear Option
```bash
# Stop everything and restart fresh
docker-compose down -v --remove-orphans
docker system prune -f
docker volume prune -f
docker-compose up -d --build
```

### ⚡ "Just Need to Restart" - Soft Reset
```bash
# Restart just the services
docker-compose restart
# Or restart a specific service
docker-compose restart spark-master
```

---

## 🐳 Docker Issues

### 1. Container Won't Start

#### **Symptom**: `docker-compose up` fails or containers exit immediately

#### **Common Causes & Solutions**:

**Port Already in Use**
```bash
# Check what's using the port
lsof -i :8080  # For Spark UI
lsof -i :5432  # For PostgreSQL
lsof -i :8888  # For Jupyter

# Kill the process using the port
sudo kill -9 <PID>

# Or change ports in docker-compose.yml
ports:
  - "8081:8080"  # Use different host port
```

**Insufficient Memory**
```bash
# Check Docker resource allocation
docker system info | grep -i memory

# Increase Docker memory limit (Docker Desktop):
# Settings → Resources → Memory → Increase to 8GB+

# For Linux, check available memory
free -h
```

**Volume Mount Issues**
```bash
# Check if directories exist
ls -la data/
ls -la notebooks/

# Create missing directories
mkdir -p data/raw data/processed data/features
mkdir -p notebooks config sql

# Fix permissions
sudo chown -R $USER:$USER data/ notebooks/ config/
chmod -R 755 data/ notebooks/ config/
```

### 2. Cannot Connect to Services

#### **Symptom**: "Connection refused" when accessing Spark UI or Jupyter

#### **Solutions**:

**Check Container Status**
```bash
# See which containers are running
docker-compose ps

# Check logs for specific service
docker-compose logs spark-master
docker-compose logs jupyter
docker-compose logs postgres
```

**Network Issues**
```bash
# Check if services are listening
docker-compose exec spark-master netstat -tlnp | grep 8080
docker-compose exec postgres netstat -tlnp | grep 5432

# Test connectivity between containers
docker-compose exec jupyter ping spark-master
docker-compose exec jupyter ping postgres
```

**Firewall/Security Issues**
```bash
# Disable firewall temporarily (Linux)
sudo ufw disable

# For macOS, check System Preferences → Security & Privacy

# For Windows, check Windows Defender Firewall
```

### 3. Out of Disk Space

#### **Symptom**: "No space left on device"

#### **Solutions**:
```bash
# Check disk usage
df -h
docker system df

# Clean up Docker resources
docker system prune -a --volumes
docker builder prune -a

# Remove unused images
docker image prune -a

# Clean up old containers
docker container prune
```

### 4. Docker Compose Version Issues

#### **Symptom**: "version not supported" or syntax errors

#### **Solution**:
```bash
# Check Docker Compose version
docker-compose --version

# Update Docker Compose (Linux)
sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

# For older versions, use version 3.7 instead of 3.8 in docker-compose.yml
```

---

## ⚡ Spark Issues

### 1. Spark Session Creation Fails

#### **Symptom**: `Cannot connect to Spark master` or session creation hangs

#### **Common Causes & Solutions**:

**Master Not Running**
```python
# Check if the Spark master UI is accessible
import requests

try:
    response = requests.get("http://localhost:8080", timeout=5)
    print("Spark master is running")
except requests.exceptions.RequestException:
    print("Cannot reach Spark master")
```

**Wrong Master URL**
```python
from pyspark.sql import SparkSession

# Try different master configurations
# For local development
spark = SparkSession.builder.master("local[*]").getOrCreate()

# For the Docker cluster (from inside the Jupyter container)
spark = SparkSession.builder.master("spark://spark-master:7077").getOrCreate()

# From the host machine (if port 7077 is published)
spark = SparkSession.builder.master("spark://localhost:7077").getOrCreate()
```

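If it is not obvious which endpoint applies to your setup, a quick TCP probe can narrow it down before building a session. This is only a reachability sketch using the Python standard library; the host/port pairs are assumptions based on the compose setup used in this guide:

```python
import socket

# Candidate master endpoints (adjust to match your docker-compose.yml)
candidates = [("spark-master", 7077), ("localhost", 7077)]

for host, port in candidates:
    try:
        # A successful TCP connection means the port is reachable from here
        with socket.create_connection((host, port), timeout=3):
            print(f"Reachable: spark://{host}:{port}")
    except OSError as err:
        print(f"Not reachable: spark://{host}:{port} ({err})")
```

A reachable port does not guarantee the master will accept the session, but an unreachable one rules that URL out quickly.
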
**Memory Configuration Issues**
```python
spark = (SparkSession.builder
    .appName("SmartCityIoTPipeline")
    .master("local[*]")
    .config("spark.driver.memory", "2g")        # Reduce if needed
    .config("spark.executor.memory", "1g")      # Reduce if needed
    .config("spark.driver.maxResultSize", "1g")
    .getOrCreate())
```

### 2. Out of Memory Errors

#### **Symptom**: `Java heap space` or `GC overhead limit exceeded`

#### **Solutions**:

**Increase Memory Allocation**
```python
# Driver and executor memory are fixed when the session is created;
# spark.conf.set() on a running session will not change them.
spark.stop()
spark = (SparkSession.builder
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "2g")
    .config("spark.driver.maxResultSize", "2g")
    .getOrCreate())
```

**Optimize Data Processing**
```python
# Use sampling for large datasets
sample_df = large_df.sample(0.1, seed=42)

# Cache frequently used DataFrames
df.cache()
df.count()  # Trigger caching

# Repartition data
df = df.repartition(4)  # Fewer partitions for small datasets

# Use coalesce to reduce partitions
df = df.coalesce(2)
```

**Process Data in Chunks**
```python
# Process data month by month
for month in range(1, 13):
    monthly_data = df.filter(F.month("timestamp") == month).cache()
    # ... process monthly_data ...
    monthly_data.unpersist()  # Free the cached memory
```

### 3. Slow Spark Jobs

#### **Symptom**: Jobs take a very long time or appear to hang

#### **Solutions**:

**Check Spark UI for Bottlenecks**
- Open http://localhost:4040 (or 4041, 4042 if multiple sessions are running)
- Look at the Jobs tab for failed/slow stages
- Check the Executors tab for resource usage

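If the UI is hard to reach from inside a container, the same information is exposed by Spark's monitoring REST API under `/api/v1`. A minimal sketch with `requests`; the port assumes the default driver UI at 4040:

```python
import requests

base = "http://localhost:4040/api/v1"  # bump the port if you have multiple sessions

# List applications attached to this UI, then pull job status for the first one
apps = requests.get(f"{base}/applications", timeout=5).json()
if apps:
    app_id = apps[0]["id"]
    jobs = requests.get(f"{base}/applications/{app_id}/jobs", timeout=5).json()
    for job in jobs:
        print(job["jobId"], job["status"], job.get("name", ""))
else:
    print("No applications found on this UI")
```
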
**Optimize Partitioning**
```python
# Check current partitions
print(f"Partitions: {df.rdd.getNumPartitions()}")

# Optimal partitions = 2-3x number of cores
optimal_partitions = spark.sparkContext.defaultParallelism * 2
df = df.repartition(optimal_partitions)
```

**Avoid Expensive Operations**
```python
# Avoid repeated .count() calls
count = df.count()
print(f"Records: {count}")

# Use .cache() for DataFrames used multiple times
df.cache()

# Avoid .collect() on large datasets
# Instead of:
# all_data = df.collect()  # BAD: loads the full dataset onto the driver

# Use:
sample_data = df.limit(1000).collect()  # GOOD: only a sample
```

**Optimize Joins**
```python
# Broadcast small DataFrames
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")

# Use appropriate join strategies
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```

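To confirm the broadcast actually took effect, look for a `BroadcastHashJoin` node in the physical plan (a `SortMergeJoin` means it fell back to a shuffle join). A quick check, reusing `result` from the snippet above:

```python
# Print the physical plan; a successful broadcast shows up as "BroadcastHashJoin"
result.explain()

# Optionally raise the automatic broadcast threshold (default ~10 MB) so small
# lookup tables broadcast without an explicit hint
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
```
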
### 4. DataFrame Operations Fail

#### **Symptom**: `AnalysisException` or column not found errors

#### **Solutions**:

**Check Schema and Column Names**
```python
# Print schema to see exact column names
df.printSchema()

# Show column names
print(df.columns)

# Check for case sensitivity
df.select([F.col(c) for c in df.columns if 'timestamp' in c.lower()])
```

**Handle Null Values**
```python
# Check for nulls before operations
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# Drop nulls before joins
df_clean = df.na.drop(subset=['key_column'])

# Fill nulls with defaults
df_filled = df.na.fill({'numeric_col': 0, 'string_col': 'unknown'})
```

**Fix Data Type Issues**
```python
# Cast columns to correct types
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))
df = df.withColumn("numeric_col", F.col("numeric_col").cast("double"))

# Handle string/numeric conversion errors
df = df.withColumn("safe_numeric",
    F.when(F.col("string_col").rlike("^[0-9.]+$"),
           F.col("string_col").cast("double")).otherwise(0))
```

---

## 🗄️ Database Connection Issues

### 1. Cannot Connect to PostgreSQL

#### **Symptom**: `Connection refused` or authentication failed

#### **Solutions**:

**Check PostgreSQL Status**
```bash
# Check if PostgreSQL container is running
docker-compose ps postgres

# Check PostgreSQL logs
docker-compose logs postgres

# Test connection from host
psql -h localhost -p 5432 -U postgres -d smartcity
```

**From Jupyter/Spark Container**
```python
# Test database connection
import psycopg2

try:
    conn = psycopg2.connect(
        host="postgres",  # Use container name, not localhost
        port=5432,
        user="postgres",
        password="password",
        database="smartcity"
    )
    print("Database connection successful")
    conn.close()
except Exception as e:
    print(f"Connection failed: {e}")
```

**Spark JDBC Connection**
```python
# Correct JDBC URL for Docker
jdbc_url = "jdbc:postgresql://postgres:5432/smartcity"

# Test Spark database connection
test_df = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "(SELECT 1 as test) as test_table") \
    .option("user", "postgres") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

test_df.show()
```

### 2. JDBC Driver Issues

#### **Symptom**: `ClassNotFoundException: org.postgresql.Driver`

#### **Solutions**:

**Add JDBC Driver to Spark**
```python
spark = SparkSession.builder \
    .appName("SmartCityIoTPipeline") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.5.0") \
    .getOrCreate()
```

**Download Driver Manually**
```bash
# Download PostgreSQL JDBC driver
cd /opt/bitnami/spark/jars/
wget https://jdbc.postgresql.org/download/postgresql-42.5.0.jar
```

---

## 📊 Data Loading Issues

### 1. File Not Found Errors

#### **Symptom**: `FileNotFoundException` or path does not exist

#### **Solutions**:

**Check File Paths**
```python
import os

# Check if file exists
data_file = "data/raw/traffic_sensors.csv"
print(f"File exists: {os.path.exists(data_file)}")

# List directory contents
print(os.listdir("data/raw/"))

# Use absolute paths if needed
abs_path = os.path.abspath("data/raw/traffic_sensors.csv")
df = spark.read.csv(abs_path, header=True, inferSchema=True)
```

**Volume Mount Issues**
```bash
# Check if volumes are mounted correctly
docker-compose exec jupyter ls -la /home/jovyan/work/data/

# Verify volume mounts in docker-compose.yml
volumes:
  - ./data:/home/jovyan/work/data
  - ./notebooks:/home/jovyan/work/notebooks
```

### 2. Schema Inference Problems

#### **Symptom**: Wrong data types or parsing errors

#### **Solutions**:

**Explicit Schema Definition**
```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType

# Define explicit schema
schema = StructType([
    StructField("sensor_id", StringType(), False),
    StructField("timestamp", StringType(), False),  # Read as string first
    StructField("vehicle_count", IntegerType(), True),
    StructField("avg_speed", DoubleType(), True)
])

df = spark.read.csv("data/raw/traffic_sensors.csv",
                    header=True, schema=schema)

# Then convert timestamp
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))
```

**Handle Different Date Formats**
```python
# Try different timestamp formats
df = df.withColumn("timestamp",
    F.coalesce(
        F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss"),
        F.to_timestamp("timestamp", "MM/dd/yyyy HH:mm:ss"),
        F.to_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
    ))
```

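Before overwriting the column, it can help to parse into a separate column first and inspect the raw values that failed every format. A short sketch under that assumption (here `timestamp` still holds the raw string):

```python
from pyspark.sql import functions as F

# Parse into a new column so unparseable raw values stay visible
df_checked = df.withColumn("timestamp_parsed",
    F.coalesce(
        F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss"),
        F.to_timestamp("timestamp", "MM/dd/yyyy HH:mm:ss")
    ))

# Distinct raw strings that failed every format
(df_checked
    .filter(F.col("timestamp_parsed").isNull() & F.col("timestamp").isNotNull())
    .select("timestamp")
    .distinct()
    .show(20, truncate=False))
```
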
### 3. Large File Loading Issues

#### **Symptom**: Out of memory when loading large files

#### **Solutions**:

**Process Files in Chunks**
```python
# Spark already reads CSVs lazily and splits them into partitions, so manual
# chunking is rarely needed. Avoid inferSchema on huge files (it scans the
# whole file); pass an explicit schema and repartition if memory is tight.
def process_large_csv(file_path, schema, num_partitions=200):
    df = spark.read.csv(file_path, header=True, schema=schema)
    return df.repartition(num_partitions)

# Or split large files manually
# split -l 100000 large_file.csv chunk_
```

**Optimize File Format**
```python
# Convert to Parquet for better performance
df.write.mode("overwrite").parquet("data/processed/traffic_optimized.parquet")

# Read Parquet instead of CSV
df = spark.read.parquet("data/processed/traffic_optimized.parquet")
```

---

## 🔧 Environment Setup Issues

### 1. Python Package Conflicts

#### **Symptom**: `ImportError` or version conflicts

#### **Solutions**:

**Check Package Versions**
```python
import sys
print(f"Python version: {sys.version}")

import pyspark
print(f"PySpark version: {pyspark.__version__}")

import pandas
print(f"Pandas version: {pandas.__version__}")
```

**Rebuild Jupyter Container**
```bash
# Rebuild with latest packages
docker-compose down
docker-compose build --no-cache jupyter
docker-compose up -d
```

**Manual Package Installation**
```bash
# Install packages in running container
docker-compose exec jupyter pip install package_name

# Or add to requirements.txt and rebuild
```

### 2. Jupyter Notebook Issues

#### **Symptom**: Kernel won't start or crashes frequently

#### **Solutions**:

**Restart Jupyter Kernel**
- In Jupyter: Kernel → Restart & Clear Output

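If the kernel comes back but the Spark session is left in a bad state, stopping and rebuilding the session from a fresh cell is sometimes enough. A minimal sketch; the app name and master URL echo the examples earlier in this guide:

```python
from pyspark.sql import SparkSession

# Stop the old session if one exists in this kernel
try:
    spark.stop()
except NameError:
    pass  # no session has been created yet

# Build a fresh session
spark = (SparkSession.builder
         .appName("SmartCityIoTPipeline")
         .master("spark://spark-master:7077")  # or "local[*]" for local runs
         .getOrCreate())
```
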
**Check Jupyter Logs**
```bash
docker-compose logs jupyter
```

**Increase Memory Limits**
```yaml
# In docker-compose.yml
jupyter:
  # ... other config
  deploy:
    resources:
      limits:
        memory: 4G
```

**Clear Jupyter Cache**
```bash
# Remove the Jupyter config/cache inside the container (path for jupyter/* images)
docker-compose exec jupyter rm -rf /home/jovyan/.jupyter/
docker-compose restart jupyter
```

---

## 🚀 Performance Optimization Tips

### 1. Spark Configuration Tuning

```python
# Useful Spark SQL settings for development (safe to set on a running session)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Shuffle parallelism
spark.conf.set("spark.sql.shuffle.partitions", "200")  # Adjust based on data size

# Note: memory settings such as spark.memory.fraction (the replacement for the
# deprecated spark.executor.memoryFraction) must be set at session creation time.
```

### 2. Data Processing Best Practices

```python
# Cache DataFrames used multiple times
df.cache()
df.count()  # Trigger caching

# Use appropriate file formats
# CSV (slowest) → JSON → Parquet (fastest)

# Partition data for better performance
df.write.partitionBy("year", "month").parquet("partitioned_data")

# Use column pruning: select only the columns you need, as early as possible
df.select("col1", "col2").filter("col1 > 100")
```

### 3. Memory Management

```python
# Unpersist DataFrames when done
df.unpersist()

# Clear all cached data periodically
spark.catalog.clearCache()

# List registered tables; spark.catalog.isCached(name) shows which are cached
print(spark.catalog.listTables())
```

---

## 🐞 Debugging Strategies

### 1. Enable Debug Logging

```python
# Set log level for debugging
spark.sparkContext.setLogLevel("DEBUG")  # Very verbose
spark.sparkContext.setLogLevel("INFO")   # Moderate
spark.sparkContext.setLogLevel("WARN")   # Minimal (default)
```

### 2. Inspect Data at Each Step

```python
# Check DataFrame at each transformation
print(f"Step 1 - Rows: {df1.count()}, Columns: {len(df1.columns)}")
df1.show(5)

df2 = df1.filter(F.col("value") > 0)
print(f"Step 2 - Rows: {df2.count()}, Columns: {len(df2.columns)}")
df2.show(5)
```

### 3. Use Explain Plans

```python
# See execution plan
df.explain(True)

# Check for expensive operations
df.explain("cost")
```

### 4. Sample Data for Testing

```python
# Use small samples for development
sample_df = large_df.sample(0.01, seed=42)  # 1% sample

# Limit rows for testing
test_df = df.limit(1000)
```

---

## 📋 Health Check Commands

### Quick System Check Script

```bash
#!/bin/bash
echo "🔍 Smart City IoT Pipeline Health Check"
echo "======================================"

echo "📋 Docker Status:"
docker --version
docker-compose --version

echo "🐳 Container Status:"
docker-compose ps

echo "💾 Disk Usage:"
df -h
docker system df

echo "🧠 Memory Usage:"
free -h

echo "🌐 Network Connectivity:"
curl -s -o /dev/null -w "%{http_code}" http://localhost:8080 && echo " ✅ Spark UI accessible" || echo " ❌ Spark UI not accessible"
curl -s -o /dev/null -w "%{http_code}" http://localhost:8888 && echo " ✅ Jupyter accessible" || echo " ❌ Jupyter not accessible"

echo "🗄️ Database Status:"
docker-compose exec -T postgres pg_isready -U postgres && echo " ✅ PostgreSQL ready" || echo " ❌ PostgreSQL not ready"

echo "📁 Data Files:"
ls -la data/raw/ 2>/dev/null && echo " ✅ Raw data found" || echo " ❌ Raw data missing"
```

### Python Health Check

```python
def health_check():
    """Run a comprehensive health check"""
    checks = {
        "spark_session": False,
        "database_connection": False,
        "data_files": False,
        "memory_usage": False
    }

    # Check Spark session
    try:
        spark.sparkContext.statusTracker()
        checks["spark_session"] = True
        print("✅ Spark session healthy")
    except Exception:
        print("❌ Spark session issues")

    # Check database
    try:
        test_df = spark.read.format("jdbc") \
            .option("url", "jdbc:postgresql://postgres:5432/smartcity") \
            .option("dbtable", "(SELECT 1) as test") \
            .option("user", "postgres") \
            .option("password", "password") \
            .option("driver", "org.postgresql.Driver") \
            .load()
        test_df.count()
        checks["database_connection"] = True
        print("✅ Database connection healthy")
    except Exception as e:
        print(f"❌ Database issues: {e}")

    # Check data files
    try:
        import os
        required_files = [
            "data/raw/traffic_sensors.csv",
            "data/raw/air_quality.json",
            "data/raw/weather_data.parquet"
        ]

        missing_files = [f for f in required_files if not os.path.exists(f)]
        if not missing_files:
            checks["data_files"] = True
            print("✅ All data files present")
        else:
            print(f"❌ Missing files: {missing_files}")
    except Exception as e:
        print(f"❌ File check failed: {e}")

    # Check memory usage
    try:
        import psutil
        memory_percent = psutil.virtual_memory().percent
        if memory_percent < 80:
            checks["memory_usage"] = True
            print(f"✅ Memory usage OK: {memory_percent:.1f}%")
        else:
            print(f"⚠️ High memory usage: {memory_percent:.1f}%")
    except Exception:
        print("❓ Cannot check memory usage")

    overall_health = sum(checks.values()) / len(checks) * 100
    print(f"\n📊 Overall System Health: {overall_health:.1f}%")

    return checks

# Run health check
health_status = health_check()
```

---

## 🆘 When All Else Fails

### Complete Environment Reset

```bash
# Nuclear option - complete reset
docker-compose down -v --remove-orphans
docker system prune -a --volumes
docker builder prune -a

# Remove all project data (CAUTION!)
rm -rf data/processed/* data/features/*

# Rebuild everything
docker-compose build --no-cache
docker-compose up -d

# Regenerate sample data
python scripts/generate_data.py
```

### Get Help

1. **Check GitHub Issues**: Look for similar problems in the project repository
2. **Stack Overflow**: Search for Spark/Docker specific errors
3. **Spark Documentation**: https://spark.apache.org/docs/latest/
4. **Docker Documentation**: https://docs.docker.com/

### Collect Diagnostic Information

```bash
# Gather system information for help requests
echo "System Information:" > diagnostic_info.txt
uname -a >> diagnostic_info.txt
docker --version >> diagnostic_info.txt
docker-compose --version >> diagnostic_info.txt
python --version >> diagnostic_info.txt

echo "Container Status:" >> diagnostic_info.txt
docker-compose ps >> diagnostic_info.txt

echo "Container Logs:" >> diagnostic_info.txt
docker-compose logs --tail=50 >> diagnostic_info.txt

echo "Disk Usage:" >> diagnostic_info.txt
df -h >> diagnostic_info.txt
docker system df >> diagnostic_info.txt
```

---

## 📚 Additional Resources

- **Spark Tuning Guide**: https://spark.apache.org/docs/latest/tuning.html
- **Docker Best Practices**: https://docs.docker.com/develop/best-practices/
- **PySpark API Documentation**: https://spark.apache.org/docs/latest/api/python/
- **PostgreSQL Docker Guide**: https://hub.docker.com/_/postgres

Remember: Most issues are environment-related. When in doubt, restart containers and check logs! 🔄