AWS EC2 & Spark Cluster Snippets

Partial download of parquet directory

aws s3 sync {source} {destination} --exclude="*" --include="part-0000*"
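
The filters are applied in order and a later filter overrides an earlier one, so --exclude="*" drops everything and --include="part-0000*" brings back only the matching files. A rough shell model of that rule, using made-up file names and shell globbing as a stand-in for the AWS CLI's own pattern matching (the selected function is hypothetical, not part of the AWS CLI):

```shell
# Simplified model: each filter is checked in order, the last match wins.
selected() {
  name=$1
  keep=yes
  case $name in *) keep=no ;; esac             # --exclude="*"
  case $name in part-0000*) keep=yes ;; esac   # --include="part-0000*"
  [ "$keep" = yes ]
}

# Made-up file names in the style of a Spark parquet output directory.
for f in _SUCCESS part-00000-abc.snappy.parquet part-00009-abc.snappy.parquet part-00010-abc.snappy.parquet; do
  if selected "$f"; then echo "sync $f"; else echo "skip $f"; fi
done
```

With five-digit part numbers this keeps part-00000 to part-00009 only. Adding --dryrun to the real aws s3 sync command lists what would be copied without transferring anything.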

Save job logs as output file

spark-submit job.py 2>&1 | tee logs
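
2>&1 sends stderr into the same pipe as stdout, so tee writes both to the file. One catch is that the pipeline normally reports tee's exit status, not the job's. A minimal sketch of both points, using a hypothetical fake_job function in place of spark-submit and a temporary file in place of logs:

```shell
# Stand-in for `spark-submit job.py` (hypothetical): prints to both streams, then fails.
fake_job() {
  echo "rows written: 42"
  echo "WARN executor lost" >&2
  return 3
}

log=$(mktemp)   # temporary file; the command above writes to `logs` instead

# Enable pipefail where the shell supports it (bash, zsh, ksh), so the
# pipeline reports the job's exit status rather than tee's.
if (set -o pipefail) 2>/dev/null; then set -o pipefail; fi

status=0
fake_job 2>&1 | tee "$log" >/dev/null || status=$?

echo "pipeline exit status: $status"
cat "$log"   # contains the stdout line and the stderr line
```

Use tee -a to append to an existing log file instead of overwriting it.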

Fix out of memory error when working locally

pyspark --driver-memory 10g --master local[1]

Using a single thread

pyspark --master local

Using all threads (master as worker)

pyspark --master local[*]
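
local[*] starts one worker thread per logical core. A quick way to see what that number is likely to be on the current machine (getconf reports online processors; the JVM can see fewer inside a container with CPU limits, so treat it as an estimate):

```shell
# Number of logical cores currently online (works on Linux and macOS).
cores=$(getconf _NPROCESSORS_ONLN)
echo "local[*] would use about $cores threads"
```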

Change temp/shuffle/work dir to a different (external) disk

pyspark --conf spark.local.dir=/data

Load Hadoop Cloud

pyspark --packages org.apache.spark:spark-hadoop-cloud_2.12:3.5.4

Use IPython for the PySpark shell

PYSPARK_DRIVER_PYTHON=ipython pyspark

Check file system disk size

df -h
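
To check only the filesystem that holds a given directory, for example the one passed as spark.local.dir, df accepts a path. A small sketch, with target as a placeholder directory:

```shell
target=.   # placeholder; e.g. /data
# -P gives one line per filesystem, -k reports 1024-byte blocks.
avail_kb=$(df -Pk "$target" | awk 'NR==2 {print $4}')
echo "free space on the filesystem holding $target: ${avail_kb} KiB"
```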

Scrolling up and down in tmux

Ctrl-b [ (default prefix; then arrow keys or PgUp/PgDn to scroll, q to exit)

Update submodule

git submodule update --remote