# Spark-Docker-SQL - Troubleshooting Guide

## 🚨 Quick Emergency Fixes

### 🔥 "Everything is Broken" - Nuclear Option
```bash
# Stop everything and restart fresh
docker-compose down -v --remove-orphans
docker system prune -f
docker volume prune -f
docker-compose up -d --build
```

### ⚡ "Just Need to Restart" - Soft Reset
```bash
# Restart just the services
docker-compose restart
# Or restart a specific service
docker-compose restart spark-master
```

---

## 🐳 Docker Issues

### 1. Container Won't Start

#### **Symptom**: `docker-compose up` fails or containers exit immediately

#### **Common Causes & Solutions**:

**Port Already in Use**
```bash
# Check what's using the port
lsof -i :8080  # For Spark UI
lsof -i :5432  # For PostgreSQL
lsof -i :8888  # For Jupyter

# Kill the process using the port
sudo kill -9 <PID>

# Or change ports in docker-compose.yml
ports:
  - "8081:8080"  # Use different host port
```

**Insufficient Memory**
```bash
# Check Docker resource allocation
docker system info | grep -i memory

# Increase Docker memory limit (Docker Desktop):
# Settings → Resources → Memory → Increase to 8GB+

# For Linux, check available memory
free -h
```

**Volume Mount Issues**
```bash
# Check if directories exist
ls -la data/
ls -la notebooks/

# Create missing directories
mkdir -p data/raw data/processed data/features
mkdir -p notebooks config sql

# Fix permissions
sudo chown -R $USER:$USER data/ notebooks/ config/
chmod -R 755 data/ notebooks/ config/
```

### 2. Cannot Connect to Services

#### **Symptom**: "Connection refused" when accessing Spark UI or Jupyter

#### **Solutions**:

**Check Container Status**
```bash
# See which containers are running
docker-compose ps

# Check logs for specific service
docker-compose logs spark-master
docker-compose logs jupyter
docker-compose logs postgres
```

**Network Issues**
```bash
# Check if services are listening
docker-compose exec spark-master netstat -tlnp | grep 8080
docker-compose exec postgres netstat -tlnp | grep 5432

# Test connectivity between containers
docker-compose exec jupyter ping spark-master
docker-compose exec jupyter ping postgres
```

**Firewall/Security Issues**
```bash
# Disable firewall temporarily (Linux)
sudo ufw disable

# For macOS, check System Preferences → Security & Privacy

# For Windows, check Windows Defender Firewall
```

### 3. Out of Disk Space

#### **Symptom**: "No space left on device"

#### **Solutions**:
```bash
# Check disk usage
df -h
docker system df

# Clean up Docker resources
docker system prune -a --volumes
docker builder prune -a

# Remove unused images
docker image prune -a

# Clean up old containers
docker container prune
```

### 4. Docker Compose Version Issues

#### **Symptom**: "version not supported" or syntax errors

#### **Solution**:
```bash
# Check Docker Compose version
docker-compose --version

# Update Docker Compose (Linux)
sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

# For older versions, use version 3.7 instead of 3.8 in docker-compose.yml
```

---

## ⚡ Spark Issues

### 1. Spark Session Creation Fails

#### **Symptom**: `Cannot connect to Spark master` or session creation hangs

#### **Common Causes & Solutions**:

**Master Not Running**
```python
# Check if the Spark master UI is accessible
import requests

try:
    response = requests.get("http://localhost:8080", timeout=5)
    print("Spark master is running")
except requests.exceptions.RequestException:
    print("Cannot reach Spark master")
```

**Wrong Master URL**
```python
from pyspark.sql import SparkSession

# Try different master configurations
# For local development
spark = SparkSession.builder.master("local[*]").getOrCreate()

# For the Docker cluster (from inside the Jupyter container)
spark = SparkSession.builder.master("spark://spark-master:7077").getOrCreate()

# From the host machine (if port 7077 is published)
spark = SparkSession.builder.master("spark://localhost:7077").getOrCreate()
```

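If it is not obvious which endpoint applies to your setup, a quick TCP probe can narrow it down before building a session. This is only a reachability sketch using the Python standard library; the host/port pairs are assumptions based on the compose setup used in this guide:

```python
import socket

# Candidate master endpoints (adjust to match your docker-compose.yml)
candidates = [("spark-master", 7077), ("localhost", 7077)]

for host, port in candidates:
    try:
        # A successful TCP connection means the port is reachable from here
        with socket.create_connection((host, port), timeout=3):
            print(f"Reachable: spark://{host}:{port}")
    except OSError as err:
        print(f"Not reachable: spark://{host}:{port} ({err})")
```

A reachable port does not guarantee the master will accept the session, but an unreachable one rules that URL out quickly.
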
**Memory Configuration Issues**
```python
spark = (SparkSession.builder
    .appName("SmartCityIoTPipeline")
    .master("local[*]")
    .config("spark.driver.memory", "2g")        # Reduce if needed
    .config("spark.executor.memory", "1g")      # Reduce if needed
    .config("spark.driver.maxResultSize", "1g")
    .getOrCreate())
```

### 2. Out of Memory Errors

#### **Symptom**: `Java heap space` or `GC overhead limit exceeded`

#### **Solutions**:

**Increase Memory Allocation**
```python
# Driver and executor memory are fixed when the session is created;
# spark.conf.set() on a running session will not change them.
spark.stop()
spark = (SparkSession.builder
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "2g")
    .config("spark.driver.maxResultSize", "2g")
    .getOrCreate())
```

**Optimize Data Processing**
```python
# Use sampling for large datasets
sample_df = large_df.sample(0.1, seed=42)

# Cache frequently used DataFrames
df.cache()
df.count()  # Trigger caching

# Repartition data
df = df.repartition(4)  # Fewer partitions for small datasets

# Use coalesce to reduce partitions
df = df.coalesce(2)
```

**Process Data in Chunks**
```python
# Process data month by month
for month in range(1, 13):
    monthly_data = df.filter(F.month("timestamp") == month).cache()
    # ... process monthly_data ...
    monthly_data.unpersist()  # Free the cached memory
```

### 3. Slow Spark Jobs

#### **Symptom**: Jobs take a very long time or appear to hang

#### **Solutions**:

**Check Spark UI for Bottlenecks**
- Open http://localhost:4040 (or 4041, 4042 if multiple sessions are running)
- Look at the Jobs tab for failed/slow stages
- Check the Executors tab for resource usage

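If the UI is hard to reach from inside a container, the same information is exposed by Spark's monitoring REST API under `/api/v1`. A minimal sketch with `requests`; the port assumes the default driver UI at 4040:

```python
import requests

base = "http://localhost:4040/api/v1"  # bump the port if you have multiple sessions

# List applications attached to this UI, then pull job status for the first one
apps = requests.get(f"{base}/applications", timeout=5).json()
if apps:
    app_id = apps[0]["id"]
    jobs = requests.get(f"{base}/applications/{app_id}/jobs", timeout=5).json()
    for job in jobs:
        print(job["jobId"], job["status"], job.get("name", ""))
else:
    print("No applications found on this UI")
```
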
**Optimize Partitioning**
```python
# Check current partitions
print(f"Partitions: {df.rdd.getNumPartitions()}")

# Optimal partitions = 2-3x number of cores
optimal_partitions = spark.sparkContext.defaultParallelism * 2
df = df.repartition(optimal_partitions)
```

**Avoid Expensive Operations**
```python
# Avoid repeated .count() calls
count = df.count()
print(f"Records: {count}")

# Use .cache() for DataFrames used multiple times
df.cache()

# Avoid .collect() on large datasets
# Instead of:
# all_data = df.collect()  # BAD: loads the full dataset onto the driver

# Use:
sample_data = df.limit(1000).collect()  # GOOD: only a sample
```

**Optimize Joins**
```python
# Broadcast small DataFrames
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")

# Use appropriate join strategies
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```

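To confirm the broadcast actually took effect, look for a `BroadcastHashJoin` node in the physical plan (a `SortMergeJoin` means it fell back to a shuffle join). A quick check, reusing `result` from the snippet above:

```python
# Print the physical plan; a successful broadcast shows up as "BroadcastHashJoin"
result.explain()

# Optionally raise the automatic broadcast threshold (default ~10 MB) so small
# lookup tables broadcast without an explicit hint
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
```
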
### 4. DataFrame Operations Fail

#### **Symptom**: `AnalysisException` or column not found errors

#### **Solutions**:

**Check Schema and Column Names**
```python
# Print schema to see exact column names
df.printSchema()

# Show column names
print(df.columns)

# Check for case sensitivity
df.select([F.col(c) for c in df.columns if 'timestamp' in c.lower()])
```

**Handle Null Values**
```python
# Check for nulls before operations
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# Drop nulls before joins
df_clean = df.na.drop(subset=['key_column'])

# Fill nulls with defaults
df_filled = df.na.fill({'numeric_col': 0, 'string_col': 'unknown'})
```

**Fix Data Type Issues**
```python
# Cast columns to correct types
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))
df = df.withColumn("numeric_col", F.col("numeric_col").cast("double"))

# Handle string/numeric conversion errors
df = df.withColumn("safe_numeric",
    F.when(F.col("string_col").rlike("^[0-9.]+$"),
           F.col("string_col").cast("double")).otherwise(0))
```

---

## 🗄️ Database Connection Issues

### 1. Cannot Connect to PostgreSQL

#### **Symptom**: `Connection refused` or authentication failed

#### **Solutions**:

**Check PostgreSQL Status**
```bash
# Check if PostgreSQL container is running
docker-compose ps postgres

# Check PostgreSQL logs
docker-compose logs postgres

# Test connection from host
psql -h localhost -p 5432 -U postgres -d smartcity
```

**From Jupyter/Spark Container**
```python
# Test database connection
import psycopg2

try:
    conn = psycopg2.connect(
        host="postgres",  # Use container name, not localhost
        port=5432,
        user="postgres",
        password="password",
        database="smartcity"
    )
    print("Database connection successful")
    conn.close()
except Exception as e:
    print(f"Connection failed: {e}")
```

**Spark JDBC Connection**
```python
# Correct JDBC URL for Docker
jdbc_url = "jdbc:postgresql://postgres:5432/smartcity"

# Test Spark database connection
test_df = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "(SELECT 1 as test) as test_table") \
    .option("user", "postgres") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

test_df.show()
```

### 2. JDBC Driver Issues

#### **Symptom**: `ClassNotFoundException: org.postgresql.Driver`

#### **Solutions**:

**Add JDBC Driver to Spark**
```python
spark = SparkSession.builder \
    .appName("SmartCityIoTPipeline") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.5.0") \
    .getOrCreate()
```

**Download Driver Manually**
```bash
# Download PostgreSQL JDBC driver
cd /opt/bitnami/spark/jars/
wget https://jdbc.postgresql.org/download/postgresql-42.5.0.jar
```

---

## 📊 Data Loading Issues

### 1. File Not Found Errors

#### **Symptom**: `FileNotFoundException` or path does not exist

#### **Solutions**:

**Check File Paths**
```python
import os

# Check if file exists
data_file = "data/raw/traffic_sensors.csv"
print(f"File exists: {os.path.exists(data_file)}")

# List directory contents
print(os.listdir("data/raw/"))

# Use absolute paths if needed
abs_path = os.path.abspath("data/raw/traffic_sensors.csv")
df = spark.read.csv(abs_path, header=True, inferSchema=True)
```

**Volume Mount Issues**
```bash
# Check if volumes are mounted correctly
docker-compose exec jupyter ls -la /home/jovyan/work/data/

# Verify volume mounts in docker-compose.yml
volumes:
  - ./data:/home/jovyan/work/data
  - ./notebooks:/home/jovyan/work/notebooks
```

### 2. Schema Inference Problems

#### **Symptom**: Wrong data types or parsing errors

#### **Solutions**:

**Explicit Schema Definition**
```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType

# Define explicit schema
schema = StructType([
    StructField("sensor_id", StringType(), False),
    StructField("timestamp", StringType(), False),  # Read as string first
    StructField("vehicle_count", IntegerType(), True),
    StructField("avg_speed", DoubleType(), True)
])

df = spark.read.csv("data/raw/traffic_sensors.csv",
                    header=True, schema=schema)

# Then convert timestamp
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))
```

**Handle Different Date Formats**
```python
# Try different timestamp formats
df = df.withColumn("timestamp",
    F.coalesce(
        F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss"),
        F.to_timestamp("timestamp", "MM/dd/yyyy HH:mm:ss"),
        F.to_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
    ))
```

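Before overwriting the column, it can help to parse into a separate column first and inspect the raw values that failed every format. A short sketch under that assumption (here `timestamp` still holds the raw string):

```python
from pyspark.sql import functions as F

# Parse into a new column so unparseable raw values stay visible
df_checked = df.withColumn("timestamp_parsed",
    F.coalesce(
        F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss"),
        F.to_timestamp("timestamp", "MM/dd/yyyy HH:mm:ss")
    ))

# Distinct raw strings that failed every format
(df_checked
    .filter(F.col("timestamp_parsed").isNull() & F.col("timestamp").isNotNull())
    .select("timestamp")
    .distinct()
    .show(20, truncate=False))
```
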
### 3. Large File Loading Issues

#### **Symptom**: Out of memory when loading large files

#### **Solutions**:

**Process Files in Chunks**
```python
# Spark already reads CSVs lazily and splits them into partitions, so manual
# chunking is rarely needed. Avoid inferSchema on huge files (it scans the
# whole file); pass an explicit schema and repartition if memory is tight.
def process_large_csv(file_path, schema, num_partitions=200):
    df = spark.read.csv(file_path, header=True, schema=schema)
    return df.repartition(num_partitions)

# Or split large files manually
# split -l 100000 large_file.csv chunk_
```

**Optimize File Format**
```python
# Convert to Parquet for better performance
df.write.mode("overwrite").parquet("data/processed/traffic_optimized.parquet")

# Read Parquet instead of CSV
df = spark.read.parquet("data/processed/traffic_optimized.parquet")
```

---

## 🔧 Environment Setup Issues

### 1. Python Package Conflicts

#### **Symptom**: `ImportError` or version conflicts

#### **Solutions**:

**Check Package Versions**
```python
import sys
print(f"Python version: {sys.version}")

import pyspark
print(f"PySpark version: {pyspark.__version__}")

import pandas
print(f"Pandas version: {pandas.__version__}")
```

**Rebuild Jupyter Container**
```bash
# Rebuild with latest packages
docker-compose down
docker-compose build --no-cache jupyter
docker-compose up -d
```

**Manual Package Installation**
```bash
# Install packages in running container
docker-compose exec jupyter pip install package_name

# Or add to requirements.txt and rebuild
```

### 2. Jupyter Notebook Issues

#### **Symptom**: Kernel won't start or crashes frequently

#### **Solutions**:

**Restart Jupyter Kernel**
- In Jupyter: Kernel → Restart & Clear Output

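If the kernel comes back but the Spark session is left in a bad state, stopping and rebuilding the session from a fresh cell is sometimes enough. A minimal sketch; the app name and master URL echo the examples earlier in this guide:

```python
from pyspark.sql import SparkSession

# Stop the old session if one exists in this kernel
try:
    spark.stop()
except NameError:
    pass  # no session has been created yet

# Build a fresh session
spark = (SparkSession.builder
         .appName("SmartCityIoTPipeline")
         .master("spark://spark-master:7077")  # or "local[*]" for local runs
         .getOrCreate())
```
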
**Check Jupyter Logs**
```bash
docker-compose logs jupyter
```

**Increase Memory Limits**
```yaml
# In docker-compose.yml
jupyter:
  # ... other config
  deploy:
    resources:
      limits:
        memory: 4G
```

**Clear Jupyter Cache**
```bash
# Remove the Jupyter config/cache inside the container (path for jupyter/* images)
docker-compose exec jupyter rm -rf /home/jovyan/.jupyter/
docker-compose restart jupyter
```

---

## 🚀 Performance Optimization Tips

### 1. Spark Configuration Tuning

```python
# Useful Spark SQL settings for development (safe to set on a running session)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Shuffle parallelism
spark.conf.set("spark.sql.shuffle.partitions", "200")  # Adjust based on data size

# Note: memory settings such as spark.memory.fraction (the replacement for the
# deprecated spark.executor.memoryFraction) must be set at session creation time.
```

### 2. Data Processing Best Practices

```python
# Cache DataFrames used multiple times
df.cache()
df.count()  # Trigger caching

# Use appropriate file formats
# CSV (slowest) → JSON → Parquet (fastest)

# Partition data for better performance
df.write.partitionBy("year", "month").parquet("partitioned_data")

# Use column pruning: select only the columns you need, as early as possible
df.select("col1", "col2").filter("col1 > 100")
```

### 3. Memory Management

```python
# Unpersist DataFrames when done
df.unpersist()

# Clear all cached data periodically
spark.catalog.clearCache()

# List registered tables; spark.catalog.isCached(name) shows which are cached
print(spark.catalog.listTables())
```

---

## 🐞 Debugging Strategies

### 1. Enable Debug Logging

```python
# Set log level for debugging
spark.sparkContext.setLogLevel("DEBUG")  # Very verbose
spark.sparkContext.setLogLevel("INFO")   # Moderate
spark.sparkContext.setLogLevel("WARN")   # Minimal (default)
```

### 2. Inspect Data at Each Step

```python
# Check DataFrame at each transformation
print(f"Step 1 - Rows: {df1.count()}, Columns: {len(df1.columns)}")
df1.show(5)

df2 = df1.filter(F.col("value") > 0)
print(f"Step 2 - Rows: {df2.count()}, Columns: {len(df2.columns)}")
df2.show(5)
```

### 3. Use Explain Plans

```python
# See execution plan
df.explain(True)

# Check for expensive operations
df.explain("cost")
```

### 4. Sample Data for Testing

```python
# Use small samples for development
sample_df = large_df.sample(0.01, seed=42)  # 1% sample

# Limit rows for testing
test_df = df.limit(1000)
```

---

## 📋 Health Check Commands

### Quick System Check Script

```bash
#!/bin/bash
echo "🔍 Smart City IoT Pipeline Health Check"
echo "======================================"

echo "📋 Docker Status:"
docker --version
docker-compose --version

echo "🐳 Container Status:"
docker-compose ps

echo "💾 Disk Usage:"
df -h
docker system df

echo "🧠 Memory Usage:"
free -h

echo "🌐 Network Connectivity:"
curl -s -o /dev/null -w "%{http_code}" http://localhost:8080 && echo " ✅ Spark UI accessible" || echo " ❌ Spark UI not accessible"
curl -s -o /dev/null -w "%{http_code}" http://localhost:8888 && echo " ✅ Jupyter accessible" || echo " ❌ Jupyter not accessible"

echo "🗄️ Database Status:"
docker-compose exec -T postgres pg_isready -U postgres && echo " ✅ PostgreSQL ready" || echo " ❌ PostgreSQL not ready"

echo "📁 Data Files:"
ls -la data/raw/ 2>/dev/null && echo " ✅ Raw data found" || echo " ❌ Raw data missing"
```

### Python Health Check

```python
def health_check():
    """Run a comprehensive health check"""
    checks = {
        "spark_session": False,
        "database_connection": False,
        "data_files": False,
        "memory_usage": False
    }

    # Check Spark session
    try:
        spark.sparkContext.statusTracker()
        checks["spark_session"] = True
        print("✅ Spark session healthy")
    except Exception:
        print("❌ Spark session issues")

    # Check database
    try:
        test_df = spark.read.format("jdbc") \
            .option("url", "jdbc:postgresql://postgres:5432/smartcity") \
            .option("dbtable", "(SELECT 1) as test") \
            .option("user", "postgres") \
            .option("password", "password") \
            .option("driver", "org.postgresql.Driver") \
            .load()
        test_df.count()
        checks["database_connection"] = True
        print("✅ Database connection healthy")
    except Exception as e:
        print(f"❌ Database issues: {e}")

    # Check data files
    try:
        import os
        required_files = [
            "data/raw/traffic_sensors.csv",
            "data/raw/air_quality.json",
            "data/raw/weather_data.parquet"
        ]

        missing_files = [f for f in required_files if not os.path.exists(f)]
        if not missing_files:
            checks["data_files"] = True
            print("✅ All data files present")
        else:
            print(f"❌ Missing files: {missing_files}")
    except Exception as e:
        print(f"❌ File check failed: {e}")

    # Check memory usage
    try:
        import psutil
        memory_percent = psutil.virtual_memory().percent
        if memory_percent < 80:
            checks["memory_usage"] = True
            print(f"✅ Memory usage OK: {memory_percent:.1f}%")
        else:
            print(f"⚠️ High memory usage: {memory_percent:.1f}%")
    except Exception:
        print("❓ Cannot check memory usage")

    overall_health = sum(checks.values()) / len(checks) * 100
    print(f"\n📊 Overall System Health: {overall_health:.1f}%")

    return checks

# Run health check
health_status = health_check()
```

---

## 🆘 When All Else Fails

### Complete Environment Reset

```bash
# Nuclear option - complete reset
docker-compose down -v --remove-orphans
docker system prune -a --volumes
docker builder prune -a

# Remove all project data (CAUTION!)
rm -rf data/processed/* data/features/*

# Rebuild everything
docker-compose build --no-cache
docker-compose up -d

# Regenerate sample data
python scripts/generate_data.py
```

### Get Help

1. **Check GitHub Issues**: Look for similar problems in the project repository
2. **Stack Overflow**: Search for Spark/Docker specific errors
3. **Spark Documentation**: https://spark.apache.org/docs/latest/
4. **Docker Documentation**: https://docs.docker.com/

### Collect Diagnostic Information

```bash
# Gather system information for help requests
echo "System Information:" > diagnostic_info.txt
uname -a >> diagnostic_info.txt
docker --version >> diagnostic_info.txt
docker-compose --version >> diagnostic_info.txt
python --version >> diagnostic_info.txt

echo "Container Status:" >> diagnostic_info.txt
docker-compose ps >> diagnostic_info.txt

echo "Container Logs:" >> diagnostic_info.txt
docker-compose logs --tail=50 >> diagnostic_info.txt

echo "Disk Usage:" >> diagnostic_info.txt
df -h >> diagnostic_info.txt
docker system df >> diagnostic_info.txt
```

---

## 📚 Additional Resources

- **Spark Tuning Guide**: https://spark.apache.org/docs/latest/tuning.html
- **Docker Best Practices**: https://docs.docker.com/develop/best-practices/
- **PySpark API Documentation**: https://spark.apache.org/docs/latest/api/python/
- **PostgreSQL Docker Guide**: https://hub.docker.com/_/postgres

Remember: Most issues are environment-related. When in doubt, restart containers and check logs! 🔄