# Smart City IoT Pipeline - Troubleshooting Guide

## 🚨 Quick Emergency Fixes

### 🔥 "Everything is Broken" - Nuclear Option
```bash
# Stop everything and restart fresh
docker-compose down -v --remove-orphans
docker system prune -f
docker volume prune -f
docker-compose up -d --build
```

### ⚡ "Just Need to Restart" - Soft Reset
```bash
# Restart just the services
docker-compose restart
# Or restart specific service
docker-compose restart spark-master
```

---

## 🐳 Docker Issues

### 1. Container Won't Start

#### **Symptom**: `docker-compose up` fails or containers exit immediately

#### **Common Causes & Solutions**:

**Port Already in Use**
```bash
# Check what's using the port
lsof -i :8080  # For Spark UI
lsof -i :5432  # For PostgreSQL
lsof -i :8888  # For Jupyter

# Kill the process using the port
sudo kill -9 <PID>

# Or change ports in docker-compose.yml
ports:
  - "8081:8080"  # Use different host port
```

**Insufficient Memory**
```bash
# Check Docker resource allocation
docker system info | grep -i memory

# Increase Docker memory limit (Docker Desktop):
# Settings → Resources → Memory → Increase to 8GB+

# For Linux, check available memory
free -h
```

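If memory pressure is the suspect, it helps to see which container is actually using it before raising any limits. A quick check with `docker stats`:

```bash
# Live per-container memory/CPU usage (Ctrl+C to exit)
docker stats

# One-shot snapshot, formatted for easier comparison
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.CPUPerc}}"
```
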
**Volume Mount Issues**
```bash
# Check if directories exist
ls -la data/
ls -la notebooks/

# Create missing directories
mkdir -p data/raw data/processed data/features
mkdir -p notebooks config sql

# Fix permissions
sudo chown -R $USER:$USER data/ notebooks/ config/
chmod -R 755 data/ notebooks/ config/
```

### 2. Cannot Connect to Services

#### **Symptom**: "Connection refused" when accessing Spark UI or Jupyter

#### **Solutions**:

**Check Container Status**
```bash
# See which containers are running
docker-compose ps

# Check logs for specific service
docker-compose logs spark-master
docker-compose logs jupyter
docker-compose logs postgres
```

**Network Issues**
```bash
# Check if services are listening
docker-compose exec spark-master netstat -tlnp | grep 8080
docker-compose exec postgres netstat -tlnp | grep 5432

# Test connectivity between containers
docker-compose exec jupyter ping spark-master
docker-compose exec jupyter ping postgres
```

**Firewall/Security Issues**
```bash
# Disable firewall temporarily (Linux)
sudo ufw disable

# For macOS, check System Preferences → Security & Privacy

# For Windows, check Windows Defender Firewall
```

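Disabling the firewall is a blunt instrument; a less drastic alternative is to open only the ports this stack uses. A minimal `ufw` sketch, assuming the default ports from this guide (8080 Spark UI, 7077 Spark master, 8888 Jupyter, 5432 PostgreSQL):

```bash
# Allow only the ports used by the pipeline instead of disabling ufw
sudo ufw allow 8080/tcp   # Spark UI
sudo ufw allow 7077/tcp   # Spark master
sudo ufw allow 8888/tcp   # Jupyter
sudo ufw allow 5432/tcp   # PostgreSQL
sudo ufw reload
```
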
### 3. Out of Disk Space

#### **Symptom**: "No space left on device"

#### **Solutions**:
```bash
# Check disk usage
df -h
docker system df

# Clean up Docker resources
docker system prune -a --volumes
docker builder prune -a

# Remove unused images
docker image prune -a

# Clean up old containers
docker container prune
```

### 4. Docker Compose Version Issues

#### **Symptom**: "version not supported" or syntax errors

#### **Solution**:
```bash
# Check Docker Compose version
docker-compose --version

# Update Docker Compose (Linux)
sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

# For older versions, use version 3.7 instead of 3.8 in docker-compose.yml
```

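For reference, the change mentioned above is just the top-level `version` key in `docker-compose.yml`; a minimal sketch (service content elided):

```yaml
version: "3.7"   # use "3.7" instead of "3.8" for older Docker Compose releases

services:
  spark-master:
    # ... existing service definition unchanged
```
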
---

## ⚡ Spark Issues

### 1. Spark Session Creation Fails

#### **Symptom**: `Cannot connect to Spark master` or session creation hangs

#### **Common Causes & Solutions**:

**Master Not Running**
```python
# Check if the Spark master web UI is reachable
import requests

try:
    response = requests.get("http://localhost:8080", timeout=5)
    print("Spark master is running")
except requests.exceptions.RequestException:
    print("Cannot reach Spark master")
```

**Wrong Master URL**
```python
# Try different master configurations
# For local development (no cluster)
spark = SparkSession.builder.master("local[*]").getOrCreate()

# From inside the Jupyter container (Docker cluster)
spark = SparkSession.builder.master("spark://spark-master:7077").getOrCreate()

# From the host, if port 7077 is published in docker-compose.yml
spark = SparkSession.builder.master("spark://localhost:7077").getOrCreate()
```

**Memory Configuration Issues**
```python
spark = (SparkSession.builder
         .appName("SmartCityIoTPipeline")
         .master("local[*]")
         .config("spark.driver.memory", "2g")      # Reduce if needed
         .config("spark.executor.memory", "1g")    # Reduce if needed
         .config("spark.driver.maxResultSize", "1g")
         .getOrCreate())
```

### 2. Out of Memory Errors

#### **Symptom**: `Java heap space` or `GC overhead limit exceeded`

#### **Solutions**:

**Increase Memory Allocation**
```python
# Driver and executor memory are fixed once the JVMs start, so set them when
# creating the session (restart the kernel first if one is already running).
spark = (SparkSession.builder
         .config("spark.driver.memory", "4g")
         .config("spark.executor.memory", "2g")
         .config("spark.driver.maxResultSize", "2g")
         .getOrCreate())
```

**Optimize Data Processing**
```python
# Use sampling for large datasets
sample_df = large_df.sample(0.1, seed=42)

# Cache frequently used DataFrames
df.cache()
df.count()  # Trigger caching

# Repartition data
df = df.repartition(4)  # Fewer partitions for small datasets

# Use coalesce to reduce partitions
df = df.coalesce(2)
```

**Process Data in Chunks**
```python
# Process data month by month
for month in range(1, 13):
    monthly_data = df.filter(F.month("timestamp") == month)
    # Process monthly_data
    monthly_data.unpersist()  # Free memory
```

### 3. Slow Spark Jobs

#### **Symptom**: Jobs take a very long time or appear to hang

#### **Solutions**:

**Check Spark UI for Bottlenecks**
- Open http://localhost:4040 (or 4041, 4042 if multiple sessions are running)
- Look at the Jobs tab for failed/slow stages
- Check the Executors tab for resource usage (see the sketch below for a scripted check)

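If you prefer checking from a notebook instead of the browser, the Spark UI also exposes its data through a monitoring REST API under `/api/v1`. A minimal sketch, assuming the driver UI is reachable on localhost:4040 (adjust the host/port to your setup):

```python
import requests

# Poll the Spark UI monitoring REST API for job status
base = "http://localhost:4040/api/v1"

try:
    apps = requests.get(f"{base}/applications", timeout=5).json()
    for app in apps:
        jobs = requests.get(f"{base}/applications/{app['id']}/jobs", timeout=5).json()
        running = [j for j in jobs if j.get("status") == "RUNNING"]
        print(f"{app['name']}: {len(jobs)} jobs total, {len(running)} running")
except requests.exceptions.RequestException as e:
    print(f"Could not reach the Spark UI REST API: {e}")
```
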
**Optimize Partitioning**
```python
# Check current partitions
print(f"Partitions: {df.rdd.getNumPartitions()}")

# Optimal partitions = 2-3x number of cores
optimal_partitions = spark.sparkContext.defaultParallelism * 2
df = df.repartition(optimal_partitions)
```

**Avoid Expensive Operations**
```python
# Avoid repeated .count() calls
count = df.count()
print(f"Records: {count}")

# Use .cache() for DataFrames used multiple times
df.cache()

# Avoid .collect() on large datasets
# Instead of:
all_data = df.collect()  # BAD: loads all data to driver

# Use:
sample_data = df.limit(1000).collect()  # GOOD: only sample
```

**Optimize Joins**
```python
# Broadcast small DataFrames
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")

# Use appropriate join strategies
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```

### 4. DataFrame Operations Fail

#### **Symptom**: `AnalysisException` or column not found errors

#### **Solutions**:

**Check Schema and Column Names**
```python
# Print schema to see exact column names
df.printSchema()

# Show column names
print(df.columns)

# Check for case sensitivity
df.select([F.col(c) for c in df.columns if 'timestamp' in c.lower()])
```

**Handle Null Values**
```python
# Check for nulls before operations
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# Drop nulls before joins
df_clean = df.na.drop(subset=['key_column'])

# Fill nulls with defaults
df_filled = df.na.fill({'numeric_col': 0, 'string_col': 'unknown'})
```

**Fix Data Type Issues**
```python
# Cast columns to correct types
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))
df = df.withColumn("numeric_col", F.col("numeric_col").cast("double"))

# Handle string/numeric conversion errors
df = df.withColumn("safe_numeric",
                   F.when(F.col("string_col").rlike("^[0-9.]+$"),
                          F.col("string_col").cast("double")).otherwise(0))
```

---

## 🗄️ Database Connection Issues

### 1. Cannot Connect to PostgreSQL

#### **Symptom**: `Connection refused` or authentication failed

#### **Solutions**:

**Check PostgreSQL Status**
```bash
# Check if PostgreSQL container is running
docker-compose ps postgres

# Check PostgreSQL logs
docker-compose logs postgres

# Test connection from host
psql -h localhost -p 5432 -U postgres -d smartcity
```

**From Jupyter/Spark Container**
```python
# Test database connection
import psycopg2

try:
    conn = psycopg2.connect(
        host="postgres",  # Use container name, not localhost
        port=5432,
        user="postgres",
        password="password",
        database="smartcity"
    )
    print("Database connection successful")
    conn.close()
except Exception as e:
    print(f"Connection failed: {e}")
```

**Spark JDBC Connection**
```python
# Correct JDBC URL for Docker
jdbc_url = "jdbc:postgresql://postgres:5432/smartcity"

# Test Spark database connection
test_df = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "(SELECT 1 as test) as test_table") \
    .option("user", "postgres") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

test_df.show()
```

### 2. JDBC Driver Issues

#### **Symptom**: `ClassNotFoundException: org.postgresql.Driver`

#### **Solutions**:

**Add JDBC Driver to Spark**
```python
spark = SparkSession.builder \
    .appName("SmartCityIoTPipeline") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.5.0") \
    .getOrCreate()
```

**Download Driver Manually**
```bash
# Download PostgreSQL JDBC driver
cd /opt/bitnami/spark/jars/
wget https://jdbc.postgresql.org/download/postgresql-42.5.0.jar
```

---

## 📊 Data Loading Issues

### 1. File Not Found Errors

#### **Symptom**: `FileNotFoundException` or path does not exist

#### **Solutions**:

**Check File Paths**
```python
import os

# Check if the file exists
data_file = "data/raw/traffic_sensors.csv"
print(f"File exists: {os.path.exists(data_file)}")

# List directory contents
print(os.listdir("data/raw/"))

# Use absolute paths if needed
abs_path = os.path.abspath("data/raw/traffic_sensors.csv")
df = spark.read.csv(abs_path, header=True, inferSchema=True)
```

**Volume Mount Issues**
```bash
# Check if volumes are mounted correctly
docker-compose exec jupyter ls -la /home/jovyan/work/data/

# Verify volume mounts in docker-compose.yml
volumes:
  - ./data:/home/jovyan/work/data
  - ./notebooks:/home/jovyan/work/notebooks
```

### 2. Schema Inference Problems

#### **Symptom**: Wrong data types or parsing errors

#### **Solutions**:

**Explicit Schema Definition**
```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType

# Define explicit schema
schema = StructType([
    StructField("sensor_id", StringType(), False),
    StructField("timestamp", StringType(), False),  # Read as string first
    StructField("vehicle_count", IntegerType(), True),
    StructField("avg_speed", DoubleType(), True)
])

df = spark.read.csv("data/raw/traffic_sensors.csv",
                    header=True, schema=schema)

# Then convert timestamp
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))
```

**Handle Different Date Formats**
```python
# Try different timestamp formats
df = df.withColumn("timestamp",
                   F.coalesce(
                       F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss"),
                       F.to_timestamp("timestamp", "MM/dd/yyyy HH:mm:ss"),
                       F.to_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
                   ))
```

### 3. Large File Loading Issues

#### **Symptom**: Out of memory when loading large files

#### **Solutions**:

**Process Files in Chunks**
```python
# Spark already reads CSVs partition by partition; the usual memory killer is
# inferSchema=True, which forces an extra full pass over the file.
def process_large_csv(file_path, schema):
    # Supply an explicit schema instead of inferring it
    df = spark.read.csv(file_path, header=True, schema=schema)
    return df

# (Note: maxRecordsPerFile is a *write* option for splitting output files.)
# Or split large files manually before loading:
# split -l 100000 large_file.csv chunk_
```

**Optimize File Format**
```python
# Convert to Parquet for better performance
df.write.mode("overwrite").parquet("data/processed/traffic_optimized.parquet")

# Read Parquet instead of CSV
df = spark.read.parquet("data/processed/traffic_optimized.parquet")
```

---

## 🔧 Environment Setup Issues

### 1. Python Package Conflicts

#### **Symptom**: `ImportError` or version conflicts

#### **Solutions**:

**Check Package Versions**
```python
import sys
print(f"Python version: {sys.version}")

import pyspark
print(f"PySpark version: {pyspark.__version__}")

import pandas
print(f"Pandas version: {pandas.__version__}")
```

**Rebuild Jupyter Container**
```bash
# Rebuild with latest packages
docker-compose down
docker-compose build --no-cache jupyter
docker-compose up -d
```

**Manual Package Installation**
```bash
# Install packages in the running container
docker-compose exec jupyter pip install package_name

# Or add to requirements.txt and rebuild
```

### 2. Jupyter Notebook Issues

#### **Symptom**: Kernel won't start or crashes frequently

#### **Solutions**:

**Restart Jupyter Kernel**
- In Jupyter: Kernel → Restart & Clear Output

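If the kernel itself still responds and only the Spark session is wedged, it can be quicker to recreate the session from a cell than to restart the whole kernel. A minimal sketch, reusing the builder settings from earlier in this guide (adjust to your own configuration):

```python
from pyspark.sql import SparkSession

# Stop the stuck session and release its resources
spark.stop()

# Recreate it with the same settings used elsewhere in this guide
spark = (SparkSession.builder
         .appName("SmartCityIoTPipeline")
         .master("spark://spark-master:7077")
         .getOrCreate())

print(spark.version)  # quick sanity check on the new session
```
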
**Check Jupyter Logs**
```bash
docker-compose logs jupyter
```

**Increase Memory Limits**
```yaml
# In docker-compose.yml
jupyter:
  # ... other config
  deploy:
    resources:
      limits:
        memory: 4G
```

**Clear Jupyter Cache**
```bash
# Remove Jupyter cache (use the in-container path; an unquoted ~ would be
# expanded by the host shell)
docker-compose exec jupyter rm -rf /home/jovyan/.jupyter/
docker-compose restart jupyter
```

---

## 🚀 Performance Optimization Tips

### 1. Spark Configuration Tuning

```python
# Optimal Spark configuration for development
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Memory optimization (spark.memory.fraction is best set at session creation)
spark.conf.set("spark.memory.fraction", "0.8")
spark.conf.set("spark.sql.shuffle.partitions", "200")  # Adjust based on data size
```

### 2. Data Processing Best Practices

```python
# Cache DataFrames used multiple times
df.cache()
df.count()  # Trigger caching

# Use appropriate file formats
# CSV (slowest) → JSON → Parquet (fastest)

# Partition data for better performance
df.write.partitionBy("year", "month").parquet("partitioned_data")

# Use column pruning
df.select("col1", "col2").filter("col1 > 100")  # Select only needed columns early; Catalyst usually optimizes either order
```

### 3. Memory Management

```python
# Unpersist DataFrames when done
df.unpersist()

# Clear Spark context periodically
spark.catalog.clearCache()

# Monitor memory usage
print(f"Cached tables: {spark.catalog.listTables()}")
```

---

## 🐞 Debugging Strategies

### 1. Enable Debug Logging

```python
# Set log level for debugging
spark.sparkContext.setLogLevel("DEBUG")  # Very verbose
spark.sparkContext.setLogLevel("INFO")   # Moderate
spark.sparkContext.setLogLevel("WARN")   # Minimal (default)
```

### 2. Inspect Data at Each Step

```python
# Check DataFrame at each transformation
print(f"Step 1 - Rows: {df1.count()}, Columns: {len(df1.columns)}")
df1.show(5)

df2 = df1.filter(F.col("value") > 0)
print(f"Step 2 - Rows: {df2.count()}, Columns: {len(df2.columns)}")
df2.show(5)
```

### 3. Use Explain Plans

```python
# See execution plan
df.explain(True)

# Check for expensive operations
df.explain("cost")
```

### 4. Sample Data for Testing

```python
# Use small samples for development
sample_df = large_df.sample(0.01, seed=42)  # 1% sample

# Limit rows for testing
test_df = df.limit(1000)
```

---

## 📋 Health Check Commands

### Quick System Check Script

```bash
#!/bin/bash
echo "🔍 Smart City IoT Pipeline Health Check"
echo "======================================"

echo "📋 Docker Status:"
docker --version
docker-compose --version

echo "🐳 Container Status:"
docker-compose ps

echo "💾 Disk Usage:"
df -h
docker system df

echo "🧠 Memory Usage:"
free -h

echo "🌐 Network Connectivity:"
curl -s -o /dev/null -w "%{http_code}" http://localhost:8080 && echo " ✅ Spark UI accessible" || echo " ❌ Spark UI not accessible"
curl -s -o /dev/null -w "%{http_code}" http://localhost:8888 && echo " ✅ Jupyter accessible" || echo " ❌ Jupyter not accessible"

echo "🗄️ Database Status:"
docker-compose exec -T postgres pg_isready -U postgres && echo " ✅ PostgreSQL ready" || echo " ❌ PostgreSQL not ready"

echo "📁 Data Files:"
ls -la data/raw/ 2>/dev/null && echo " ✅ Raw data found" || echo " ❌ Raw data missing"
```

### Python Health Check

```python
def health_check():
    """Run comprehensive health check"""
    checks = {
        "spark_session": False,
        "database_connection": False,
        "data_files": False,
        "memory_usage": False
    }

    # Check Spark session
    try:
        spark.sparkContext.statusTracker()
        checks["spark_session"] = True
        print("✅ Spark session healthy")
    except Exception:
        print("❌ Spark session issues")

    # Check database
    try:
        test_df = spark.read.format("jdbc") \
            .option("url", "jdbc:postgresql://postgres:5432/smartcity") \
            .option("dbtable", "(SELECT 1) as test") \
            .option("user", "postgres") \
            .option("password", "password") \
            .option("driver", "org.postgresql.Driver") \
            .load()
        test_df.count()
        checks["database_connection"] = True
        print("✅ Database connection healthy")
    except Exception as e:
        print(f"❌ Database issues: {e}")

    # Check data files
    try:
        import os
        required_files = [
            "data/raw/traffic_sensors.csv",
            "data/raw/air_quality.json",
            "data/raw/weather_data.parquet"
        ]

        missing_files = [f for f in required_files if not os.path.exists(f)]
        if not missing_files:
            checks["data_files"] = True
            print("✅ All data files present")
        else:
            print(f"❌ Missing files: {missing_files}")
    except Exception as e:
        print(f"❌ File check failed: {e}")

    # Check memory usage
    try:
        import psutil
        memory_percent = psutil.virtual_memory().percent
        if memory_percent < 80:
            checks["memory_usage"] = True
            print(f"✅ Memory usage OK: {memory_percent:.1f}%")
        else:
            print(f"⚠️ High memory usage: {memory_percent:.1f}%")
    except Exception:
        print("❓ Cannot check memory usage")

    overall_health = sum(checks.values()) / len(checks) * 100
    print(f"\n📊 Overall System Health: {overall_health:.1f}%")

    return checks

# Run health check
health_status = health_check()
```

---

## 🆘 When All Else Fails

### Complete Environment Reset

```bash
# Nuclear option - complete reset
docker-compose down -v --remove-orphans
docker system prune -a --volumes
docker builder prune -a

# Remove all project data (CAUTION!)
rm -rf data/processed/* data/features/*

# Rebuild everything
docker-compose build --no-cache
docker-compose up -d

# Regenerate sample data
python scripts/generate_data.py
```

### Get Help

1. **Check GitHub Issues**: Look for similar problems in the project repository
2. **Stack Overflow**: Search for Spark/Docker specific errors
3. **Spark Documentation**: https://spark.apache.org/docs/latest/
4. **Docker Documentation**: https://docs.docker.com/

### Collect Diagnostic Information

```bash
# Gather system information for help requests
echo "System Information:" > diagnostic_info.txt
uname -a >> diagnostic_info.txt
docker --version >> diagnostic_info.txt
docker-compose --version >> diagnostic_info.txt
python --version >> diagnostic_info.txt

echo "Container Status:" >> diagnostic_info.txt
docker-compose ps >> diagnostic_info.txt

echo "Container Logs:" >> diagnostic_info.txt
docker-compose logs --tail=50 >> diagnostic_info.txt

echo "Disk Usage:" >> diagnostic_info.txt
df -h >> diagnostic_info.txt
docker system df >> diagnostic_info.txt
```

---

## 📚 Additional Resources

- **Spark Tuning Guide**: https://spark.apache.org/docs/latest/tuning.html
- **Docker Best Practices**: https://docs.docker.com/develop/best-practices/
- **PySpark API Documentation**: https://spark.apache.org/docs/latest/api/python/
- **PostgreSQL Docker Guide**: https://hub.docker.com/_/postgres

Remember: Most issues are environment-related. When in doubt, restart containers and check logs! 🔄