AWS EC2 & Spark Cluster Snippets
AWS EC2 Instance
Spark / Hadoop
Snippets
Partial download of parquet directory
aws s3 sync {source} {destination} --exclude="*" --include="part-0000*"
Save job logs as output file
spark-submit job.py 2>&1 | tee logs
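The redirection in the command above can be checked without a cluster. The sketch below is a minimal illustration, assuming a hypothetical `fake_job` function standing in for `spark-submit job.py` and a temporary file as the log. It shows that `2>&1` merges stderr into stdout before `tee` writes the log.

```shell
# Hypothetical stand-in for `spark-submit job.py`: writes to both streams.
fake_job() {
  echo "result on stdout"
  echo "warning on stderr" >&2
}

# Temporary log file used only for this illustration.
log_file="$(mktemp)"

# 2>&1 sends stderr into the same pipe as stdout, so tee sees both:
# it prints them to the terminal and writes them to the log file.
fake_job 2>&1 | tee "$log_file"

# Count how many of the two messages reached the log.
captured=$(grep -c -e "stdout" -e "stderr" "$log_file")
echo "lines captured: $captured"
```

Note that the exit status of a pipeline is that of its last command (`tee` here), not of the job. In bash, `set -o pipefail` makes the pipeline report a failing job instead.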
Fix out of memory error when working locally
pyspark --driver-memory 10g --master "local[1]"
Using a single thread
pyspark --master local
Using all threads (master as worker)
pyspark --master "local[*]"
Change temp/shuffle/work dir to a different (external) disk
pyspark --conf spark.local.dir=/data
Load the Hadoop Cloud module (package version should match the installed Spark version)
pyspark --packages org.apache.spark:spark-hadoop-cloud_2.12:3.5.4
Use IPython for the PySpark shell
PYSPARK_DRIVER_PYTHON=ipython pyspark
Misc
Check file system disk size
df -h
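To check one location only, for example a disk intended for Spark's temp/shuffle files, `df` also accepts a path. The sketch below is a minimal example, assuming a placeholder `target_dir` (not from the original notes) and the POSIX `-P` output format, where the fourth column of the second line is the available space.

```shell
# Placeholder path for illustration; replace with the disk you care about.
target_dir="${TMPDIR:-/tmp}"

# -P gives one line per filesystem in POSIX format, -k reports 1024-byte blocks.
# Line 2, column 4 is the space available on the filesystem holding target_dir.
avail_kb=$(df -Pk "$target_dir" | awk 'NR==2 {print $4}')

echo "Available on $target_dir: ${avail_kb} KiB"
```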
Scrolling up and down in tmux
- Press CTRL + B, then Page Up to enter copy mode
- Press Page Up or Page Down to navigate
- Press q to exit copy mode
Update submodule
git submodule update --remote