Elasticsearch Parallel Bulk
Most of the AWS EC2 instance have a single thread per vCPU. This means that any multi-threading operations can’t be used to optimize the performance.
elasticsearch.helpers.parallel_bulk uses ThreadPool
to speed up ElasticSearch bulk ingest. Since this parallelization can’t be applied multiple vCPUs on AWS EC2 instances, the thread_count
parameter is non-functional.
At best, one can selectively use instances with 2 threads per vCPUs. Supported CPU options for Amazon EC2 instance types
Better yet, the ingest step could be parallelized by using multiprocessing.Pool
. However, the ElasticSearch client object cannot be easily pickled and it’s recommended to maintain only a single instance per Python shell. This approach could be explored more, but the one can simply launch multiple Python shells to run the bulk ingests in parallel.