# Spark-Docker-SQL - Troubleshooting Guide

## 🚨 Quick Emergency Fixes

### 🔥 "Everything is Broken" - Nuclear Option
```bash
# Stop everything and restart fresh
docker-compose down -v --remove-orphans
docker system prune -f
docker volume prune -f
docker-compose up -d --build
```

### ⚡ "Just Need to Restart" - Soft Reset
```bash
# Restart just the services
docker-compose restart
# Or restart specific service
docker-compose restart spark-master
```

---

## 🐳 Docker Issues

### 1. Container Won't Start

#### **Symptom**: `docker-compose up` fails or containers exit immediately

#### **Common Causes & Solutions**:

**Port Already in Use**
```bash
# Check what's using the port
lsof -i :8080  # For Spark UI
lsof -i :5432  # For PostgreSQL
lsof -i :8888  # For Jupyter

# Kill the process using the port
sudo kill -9 <PID>

# Or change ports in docker-compose.yml
ports:
  - "8081:8080"  # Use different host port
```

**Insufficient Memory**
```bash
# Check Docker resource allocation
docker system info | grep -i memory

# Increase Docker memory limit (Docker Desktop):
# Settings → Resources → Memory → Increase to 8GB+

# For Linux, check available memory
free -h
```

**Volume Mount Issues**
```bash
# Check if directories exist
ls -la data/
ls -la notebooks/

# Create missing directories
mkdir -p data/raw data/processed data/features
mkdir -p notebooks config sql

# Fix permissions
sudo chown -R $USER:$USER data/ notebooks/ config/
chmod -R 755 data/ notebooks/ config/
```

### 2. Cannot Connect to Services

#### **Symptom**: "Connection refused" when accessing Spark UI or Jupyter

#### **Solutions**:

**Check Container Status**
```bash
# See which containers are running
docker-compose ps

# Check logs for specific service
docker-compose logs spark-master
docker-compose logs jupyter
docker-compose logs postgres
```

**Network Issues**
```bash
# Check if services are listening
docker-compose exec spark-master netstat -tlnp | grep 8080
docker-compose exec postgres netstat -tlnp | grep 5432

# Test connectivity between containers
docker-compose exec jupyter ping spark-master
docker-compose exec jupyter ping postgres
```

**Firewall/Security Issues**
```bash
# Disable firewall temporarily (Linux)
sudo ufw disable

# For macOS, check System Preferences → Security & Privacy

# For Windows, check Windows Defender Firewall
```

### 3. Out of Disk Space

#### **Symptom**: "No space left on device"

#### **Solutions**:
```bash
# Check disk usage
df -h
docker system df

# Clean up Docker resources
docker system prune -a --volumes
docker builder prune -a

# Remove unused images
docker image prune -a

# Clean up old containers
docker container prune
```

### 4. Docker Compose Version Issues

#### **Symptom**: "version not supported" or syntax errors

#### **Solution**:
```bash
# Check Docker Compose version
docker-compose --version

# Update Docker Compose (Linux)
sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

# For older versions, use version 3.7 instead of 3.8 in docker-compose.yml
```

---

## ⚡ Spark Issues

### 1. Spark Session Creation Fails

#### **Symptom**: `Cannot connect to Spark master` or session creation hangs

#### **Common Causes & Solutions**:

**Master Not Running**
```python
# Check if Spark master is accessible
import requests

try:
    response = requests.get("http://localhost:8080", timeout=5)
    print("Spark master is running")
except requests.exceptions.RequestException:
    print("Cannot reach Spark master")
```

**Wrong Master URL**
```python
# Try different master configurations
# For local development (no cluster needed)
spark = SparkSession.builder.master("local[*]").getOrCreate()

# For the Docker cluster, from inside the Jupyter container use the service name
spark = SparkSession.builder.master("spark://spark-master:7077").getOrCreate()

# From the host machine, use the port published by docker-compose
spark = SparkSession.builder.master("spark://localhost:7077").getOrCreate()
```

**Memory Configuration Issues**
```python
spark = (SparkSession.builder
    .appName("SmartCityIoTPipeline")
    .master("local[*]")
    .config("spark.driver.memory", "2g")      # Reduce if needed
    .config("spark.executor.memory", "1g")    # Reduce if needed
    .config("spark.driver.maxResultSize", "1g")
    .getOrCreate())
```

### 2. Out of Memory Errors

#### **Symptom**: `Java heap space` or `GC overhead limit exceeded`

#### **Solutions**:

**Increase Memory Allocation**
```python
# Driver/executor memory is fixed when the JVM starts: set it when the session is
# first created (spark.conf.set() at runtime has no effect for these settings).
spark = (SparkSession.builder
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "2g")
    .config("spark.driver.maxResultSize", "2g")
    .getOrCreate())
```

**Optimize Data Processing**
```python
# Use sampling for large datasets
sample_df = large_df.sample(0.1, seed=42)

# Cache frequently used DataFrames
df.cache()
df.count()  # Trigger caching

# Repartition data
df = df.repartition(4)  # Fewer partitions for small datasets

# Use coalesce to reduce partitions
df = df.coalesce(2)
```

**Process Data in Chunks**
```python
from pyspark.sql import functions as F

# Process data month by month
for month in range(1, 13):
    monthly_data = df.filter(F.month("timestamp") == month).cache()
    # ... process monthly_data ...
    monthly_data.unpersist()  # Free memory
```

### 3. Slow Spark Jobs

#### **Symptom**: Jobs take a very long time or appear to hang

#### **Solutions**:

**Check Spark UI for Bottlenecks**
- Open http://localhost:4040 (or 4041, 4042 if multiple sessions)
- Look at the Jobs tab for failed/slow stages
- Check the Executors tab for resource usage
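
If it is unclear which port the UI actually landed on (several sessions, remapped Docker ports), you can ask the running session directly. A minimal sketch, assuming an active `spark` session:

```python
# Ask the active session where its UI lives and what is currently running
sc = spark.sparkContext
print(f"Spark UI: {sc.uiWebUrl}")  # e.g. http://<driver-host>:4040

tracker = sc.statusTracker()
print(f"Active job IDs:   {tracker.getActiveJobsIds()}")
print(f"Active stage IDs: {tracker.getActiveStageIds()}")
```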

**Optimize Partitioning**
```python
# Check current partitions
print(f"Partitions: {df.rdd.getNumPartitions()}")

# Optimal partitions = 2-3x number of cores
optimal_partitions = spark.sparkContext.defaultParallelism * 2
df = df.repartition(optimal_partitions)
```

**Avoid Expensive Operations**
```python
# Avoid repeated .count() calls
count = df.count()
print(f"Records: {count}")

# Use .cache() for DataFrames used multiple times
df.cache()

# Avoid .collect() on large datasets
# Instead of:
all_data = df.collect()  # BAD: loads all data to driver

# Use:
sample_data = df.limit(1000).collect()  # GOOD: only sample
```

**Optimize Joins**
```python
# Broadcast small DataFrames
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")

# Use appropriate join strategies
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```

### 4. DataFrame Operations Fail

#### **Symptom**: `AnalysisException` or column not found errors

#### **Solutions**:

**Check Schema and Column Names**
```python
# Print schema to see exact column names
df.printSchema()

# Show column names
print(df.columns)

# Check for case sensitivity
df.select([F.col(c) for c in df.columns if 'timestamp' in c.lower()])
```

**Handle Null Values**
```python
# Check for nulls before operations
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# Drop nulls before joins
df_clean = df.na.drop(subset=['key_column'])

# Fill nulls with defaults
df_filled = df.na.fill({'numeric_col': 0, 'string_col': 'unknown'})
```

**Fix Data Type Issues**
```python
# Cast columns to correct types
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))
df = df.withColumn("numeric_col", F.col("numeric_col").cast("double"))

# Handle string/numeric conversion errors
df = df.withColumn("safe_numeric",
    F.when(F.col("string_col").rlike("^[0-9.]+$"),
           F.col("string_col").cast("double")).otherwise(0))
```

---

## 🗄️ Database Connection Issues

### 1. Cannot Connect to PostgreSQL

#### **Symptom**: `Connection refused` or authentication failed

#### **Solutions**:

**Check PostgreSQL Status**
```bash
# Check if PostgreSQL container is running
docker-compose ps postgres

# Check PostgreSQL logs
docker-compose logs postgres

# Test connection from host
psql -h localhost -p 5432 -U postgres -d smartcity
```

**From Jupyter/Spark Container**
```python
# Test database connection
import psycopg2

try:
    conn = psycopg2.connect(
        host="postgres",  # Use container name, not localhost
        port=5432,
        user="postgres",
        password="password",
        database="smartcity"
    )
    print("Database connection successful")
    conn.close()
except Exception as e:
    print(f"Connection failed: {e}")
```

**Spark JDBC Connection**
```python
# Correct JDBC URL for Docker
jdbc_url = "jdbc:postgresql://postgres:5432/smartcity"

# Test Spark database connection
test_df = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "(SELECT 1 as test) as test_table") \
    .option("user", "postgres") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

test_df.show()
```
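
If the read works but results also need to land in PostgreSQL, the write side uses the same options. A minimal sketch under the same assumed credentials; the table name `connection_test` is just a hypothetical scratch table:

```python
# Write a tiny DataFrame to verify the driver and INSERT permissions on the write path
test_write_df = spark.createDataFrame([(1, "ok")], ["id", "status"])

(test_write_df.write.format("jdbc")
    .option("url", jdbc_url)               # same URL as the read test above
    .option("dbtable", "connection_test")  # hypothetical scratch table
    .option("user", "postgres")
    .option("password", "password")
    .option("driver", "org.postgresql.Driver")
    .mode("overwrite")                     # replaces the table if it already exists
    .save())
```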

### 2. JDBC Driver Issues

#### **Symptom**: `ClassNotFoundException: org.postgresql.Driver`

#### **Solutions**:

**Add JDBC Driver to Spark**
```python
spark = SparkSession.builder \
    .appName("SmartCityIoTPipeline") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.5.0") \
    .getOrCreate()
```

**Download Driver Manually**
```bash
# Download PostgreSQL JDBC driver
cd /opt/bitnami/spark/jars/
wget https://jdbc.postgresql.org/download/postgresql-42.5.0.jar
```
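
After a manual download, point the session at the jar explicitly so the driver is on the classpath; a minimal sketch, assuming the path used in the snippet above:

```python
from pyspark.sql import SparkSession

# Reference the locally downloaded driver jar instead of pulling it from Maven
spark = (SparkSession.builder
    .appName("SmartCityIoTPipeline")
    .config("spark.jars", "/opt/bitnami/spark/jars/postgresql-42.5.0.jar")
    .getOrCreate())
```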

---

## 📊 Data Loading Issues

### 1. File Not Found Errors

#### **Symptom**: `FileNotFoundException` or path does not exist

#### **Solutions**:

**Check File Paths**
```python
import os

# Check if file exists
data_file = "data/raw/traffic_sensors.csv"
print(f"File exists: {os.path.exists(data_file)}")

# List directory contents
print(os.listdir("data/raw/"))

# Use absolute paths if needed
abs_path = os.path.abspath("data/raw/traffic_sensors.csv")
df = spark.read.csv(abs_path, header=True, inferSchema=True)
```

**Volume Mount Issues**
```bash
# Check if volumes are mounted correctly
docker-compose exec jupyter ls -la /home/jovyan/work/data/

# Verify volume mounts in docker-compose.yml
volumes:
  - ./data:/home/jovyan/work/data
  - ./notebooks:/home/jovyan/work/notebooks
```

### 2. Schema Inference Problems

#### **Symptom**: Wrong data types or parsing errors

#### **Solutions**:

**Explicit Schema Definition**
```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType

# Define explicit schema
schema = StructType([
    StructField("sensor_id", StringType(), False),
    StructField("timestamp", StringType(), False),  # Read as string first
    StructField("vehicle_count", IntegerType(), True),
    StructField("avg_speed", DoubleType(), True)
])

df = spark.read.csv("data/raw/traffic_sensors.csv",
                    header=True, schema=schema)

# Then convert timestamp
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))
```

**Handle Different Date Formats**
```python
# Try different timestamp formats
df = df.withColumn("timestamp",
    F.coalesce(
        F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss"),
        F.to_timestamp("timestamp", "MM/dd/yyyy HH:mm:ss"),
        F.to_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
    ))
```

### 3. Large File Loading Issues

#### **Symptom**: Out of memory when loading large files

#### **Solutions**:

**Process Files in Chunks**
```python
# For very large CSV files, limit how much data each input task reads at once
def process_large_csv(file_path, max_partition_bytes=32 * 1024 * 1024):
    # spark.sql.files.maxPartitionBytes controls the input split size (default 128 MB)
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(max_partition_bytes))
    return spark.read.csv(file_path, header=True, inferSchema=True)

# Note: maxRecordsPerFile is a *write* option, useful when saving output in chunks:
# df.write.option("maxRecordsPerFile", 10000).parquet("output_dir/")

# Or split large files manually before loading
# split -l 100000 large_file.csv chunk_
```

**Optimize File Format**
```python
# Convert to Parquet for better performance
df.write.mode("overwrite").parquet("data/processed/traffic_optimized.parquet")

# Read Parquet instead of CSV
df = spark.read.parquet("data/processed/traffic_optimized.parquet")
```

---

## 🔧 Environment Setup Issues

### 1. Python Package Conflicts

#### **Symptom**: `ImportError` or version conflicts

#### **Solutions**:

**Check Package Versions**
```python
import sys
print(f"Python version: {sys.version}")

import pyspark
print(f"PySpark version: {pyspark.__version__}")

import pandas
print(f"Pandas version: {pandas.__version__}")
```
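
A frequent hidden conflict is a `pyspark` package that does not match the Spark version running in the cluster containers. A quick check, assuming an active `spark` session:

```python
import pyspark

# The client-side PySpark and the cluster-side Spark should agree on major.minor
client_version = pyspark.__version__
cluster_version = spark.version
print(f"PySpark (client): {client_version} / Spark (cluster): {cluster_version}")

if client_version.split(".")[:2] != cluster_version.split(".")[:2]:
    print("⚠️ Version mismatch - align the pyspark package with the Spark image")
```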

**Rebuild Jupyter Container**
```bash
# Rebuild with latest packages
docker-compose down
docker-compose build --no-cache jupyter
docker-compose up -d
```

**Manual Package Installation**
```bash
# Install packages in running container
docker-compose exec jupyter pip install package_name

# Or add to requirements.txt and rebuild
```

### 2. Jupyter Notebook Issues

#### **Symptom**: Kernel won't start or crashes frequently

#### **Solutions**:

**Restart Jupyter Kernel**
- In Jupyter: Kernel → Restart & Clear Output
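
Note that restarting the kernel also throws away the SparkSession, so recreate it before re-running cells. A minimal sketch (configuration values are illustrative):

```python
from pyspark.sql import SparkSession

# After a kernel restart the old session is gone - build a fresh one
spark = (SparkSession.builder
    .appName("SmartCityIoTPipeline")
    .master("spark://spark-master:7077")  # or "local[*]" for a single-container setup
    .getOrCreate())

print(spark.version)  # quick sanity check that the new session is alive
```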

**Check Jupyter Logs**
```bash
docker-compose logs jupyter
```

**Increase Memory Limits**
```yaml
# In docker-compose.yml
jupyter:
  # ... other config
  deploy:
    resources:
      limits:
        memory: 4G
```

**Clear Jupyter Cache**
```bash
# Remove Jupyter cache
docker-compose exec jupyter rm -rf ~/.jupyter/
docker-compose restart jupyter
```

---

## 🚀 Performance Optimization Tips

### 1. Spark Configuration Tuning

```python
# Optimal Spark configuration for development
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Memory optimization
# Note: spark.executor.memoryFraction is not a valid setting in current Spark;
# use spark.memory.fraction instead, set at session creation time:
#   .config("spark.memory.fraction", "0.8")
spark.conf.set("spark.sql.shuffle.partitions", "200")  # Adjust based on data size
```

### 2. Data Processing Best Practices

```python
# Cache DataFrames used multiple times
df.cache()
df.count()  # Trigger caching

# Use appropriate file formats
# CSV (slowest) → JSON → Parquet (fastest)

# Partition data for better performance
df.write.partitionBy("year", "month").parquet("partitioned_data")

# Use column pruning
df.select("col1", "col2").filter("col1 > 100")  # Better than df.filter().select()
```
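
To see what the format choice costs on your own data, a rough timing sketch; it reuses the CSV and Parquet paths from the data-loading section above, and the wall-clock numbers are only indicative:

```python
import time

def timed_count(read_fn, path, label):
    # Force a full scan so the read is actually executed
    start = time.time()
    rows = read_fn(path).count()
    print(f"{label}: {rows} rows in {time.time() - start:.1f}s")

timed_count(lambda p: spark.read.csv(p, header=True, inferSchema=True),
            "data/raw/traffic_sensors.csv", "CSV")
timed_count(lambda p: spark.read.parquet(p),
            "data/processed/traffic_optimized.parquet", "Parquet")
```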

### 3. Memory Management

```python
# Unpersist DataFrames when done
df.unpersist()

# Clear all cached tables/DataFrames periodically
spark.catalog.clearCache()

# Inspect what is still registered in the catalog
print(f"Tables in catalog: {spark.catalog.listTables()}")
```

---

## 🐞 Debugging Strategies

### 1. Enable Debug Logging

```python
# Set log level for debugging
spark.sparkContext.setLogLevel("DEBUG")  # Very verbose
spark.sparkContext.setLogLevel("INFO")   # Moderate
spark.sparkContext.setLogLevel("WARN")   # Minimal (default)
```

### 2. Inspect Data at Each Step

```python
# Check DataFrame at each transformation
print(f"Step 1 - Rows: {df1.count()}, Columns: {len(df1.columns)}")
df1.show(5)

df2 = df1.filter(F.col("value") > 0)
print(f"Step 2 - Rows: {df2.count()}, Columns: {len(df2.columns)}")
df2.show(5)
```

### 3. Use Explain Plans

```python
# See execution plan
df.explain(True)

# Check for expensive operations
df.explain("cost")
```

### 4. Sample Data for Testing

```python
# Use small samples for development
sample_df = large_df.sample(0.01, seed=42)  # 1% sample

# Limit rows for testing
test_df = df.limit(1000)
```

---

## 📋 Health Check Commands

### Quick System Check Script

```bash
#!/bin/bash
echo "🔍 Smart City IoT Pipeline Health Check"
echo "======================================"

echo "📋 Docker Status:"
docker --version
docker-compose --version

echo "🐳 Container Status:"
docker-compose ps

echo "💾 Disk Usage:"
df -h
docker system df

echo "🧠 Memory Usage:"
free -h

echo "🌐 Network Connectivity:"
curl -s -o /dev/null -w "%{http_code}" http://localhost:8080 && echo " ✅ Spark UI accessible" || echo " ❌ Spark UI not accessible"
curl -s -o /dev/null -w "%{http_code}" http://localhost:8888 && echo " ✅ Jupyter accessible" || echo " ❌ Jupyter not accessible"

echo "🗄️ Database Status:"
docker-compose exec -T postgres pg_isready -U postgres && echo " ✅ PostgreSQL ready" || echo " ❌ PostgreSQL not ready"

echo "📁 Data Files:"
ls -la data/raw/ 2>/dev/null && echo " ✅ Raw data found" || echo " ❌ Raw data missing"
```

### Python Health Check

```python
def health_check():
    """Run comprehensive health check"""
    checks = {
        "spark_session": False,
        "database_connection": False,
        "data_files": False,
        "memory_usage": False
    }

    # Check Spark session
    try:
        spark.sparkContext.statusTracker()
        checks["spark_session"] = True
        print("✅ Spark session healthy")
    except Exception:
        print("❌ Spark session issues")

    # Check database
    try:
        test_df = spark.read.format("jdbc") \
            .option("url", "jdbc:postgresql://postgres:5432/smartcity") \
            .option("dbtable", "(SELECT 1) as test") \
            .option("user", "postgres") \
            .option("password", "password") \
            .load()
        test_df.count()
        checks["database_connection"] = True
        print("✅ Database connection healthy")
    except Exception as e:
        print(f"❌ Database issues: {e}")

    # Check data files
    try:
        import os
        required_files = [
            "data/raw/traffic_sensors.csv",
            "data/raw/air_quality.json",
            "data/raw/weather_data.parquet"
        ]

        missing_files = [f for f in required_files if not os.path.exists(f)]
        if not missing_files:
            checks["data_files"] = True
            print("✅ All data files present")
        else:
            print(f"❌ Missing files: {missing_files}")
    except Exception as e:
        print(f"❌ File check failed: {e}")

    # Check memory usage
    try:
        import psutil
        memory_percent = psutil.virtual_memory().percent
        if memory_percent < 80:
            checks["memory_usage"] = True
            print(f"✅ Memory usage OK: {memory_percent:.1f}%")
        else:
            print(f"⚠️ High memory usage: {memory_percent:.1f}%")
    except Exception:
        print("❓ Cannot check memory usage")

    overall_health = sum(checks.values()) / len(checks) * 100
    print(f"\n📊 Overall System Health: {overall_health:.1f}%")

    return checks

# Run health check
health_status = health_check()
```

---

## 🆘 When All Else Fails

### Complete Environment Reset

```bash
# Nuclear option - complete reset
docker-compose down -v --remove-orphans
docker system prune -a --volumes
docker builder prune -a

# Remove all project data (CAUTION!)
rm -rf data/processed/* data/features/*

# Rebuild everything
docker-compose build --no-cache
docker-compose up -d

# Regenerate sample data
python scripts/generate_data.py
```

### Get Help

1. **Check GitHub Issues**: Look for similar problems in the project repository
2. **Stack Overflow**: Search for Spark/Docker specific errors
3. **Spark Documentation**: https://spark.apache.org/docs/latest/
4. **Docker Documentation**: https://docs.docker.com/

### Collect Diagnostic Information

```bash
# Gather system information for help requests
echo "System Information:" > diagnostic_info.txt
uname -a >> diagnostic_info.txt
docker --version >> diagnostic_info.txt
docker-compose --version >> diagnostic_info.txt
python --version >> diagnostic_info.txt

echo "Container Status:" >> diagnostic_info.txt
docker-compose ps >> diagnostic_info.txt

echo "Container Logs:" >> diagnostic_info.txt
docker-compose logs --tail=50 >> diagnostic_info.txt

echo "Disk Usage:" >> diagnostic_info.txt
df -h >> diagnostic_info.txt
docker system df >> diagnostic_info.txt
```

---

## 📚 Additional Resources

- **Spark Tuning Guide**: https://spark.apache.org/docs/latest/tuning.html
- **Docker Best Practices**: https://docs.docker.com/develop/best-practices/
- **PySpark API Documentation**: https://spark.apache.org/docs/latest/api/python/
- **PostgreSQL Docker Guide**: https://hub.docker.com/_/postgres

Remember: Most issues are environment-related. When in doubt, restart containers and check logs! 🔄