Tuesday, May 26, 2015

Cost optimization through performance improvement of S3DistCp

We reduced the cost of running our production cluster by about 60% by shrinking the cluster from 15 nodes to 6 nodes. This was made possible by performance tuning of the AWS utility S3DistCp. This post provides the details.

What is S3DistCp

Apache DistCp is an open-source tool you can use to copy large amounts of data. DistCp uses MapReduce to copy in a distributed manner—sharing the copy, error handling, recovery, and reporting tasks across several servers. For more information about the Apache DistCp open source project, go to http://hadoop.apache.org/docs/r1.2.1/distcp.html.
S3DistCp is an extension of DistCp that is optimized to work with AWS, particularly Amazon S3.

Issue

By default, the AWS S3DistCp command can consume a very large share of the cluster's resources to transfer even a small file!
As mentioned above, S3DistCp uses MapReduce to copy files in a distributed manner. By default, a MapReduce job creates approximately as many reducers as there are available container slots in the cluster.
Hive optimizes this behavior by estimating the required number of reducers from the input data size. However, S3DistCp uses the default MapReduce logic and triggers as many reducers as there are available slots in the cluster.
Because of this, S3DistCp ends up creating a huge number of reducers on large clusters.

For example, in one instance it created 1 mapper and 498 reducers to transfer 694 KB of data!
This behavior consumes a lot of cluster resources and eventually reduces the overall throughput of the cluster.
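To make the contrast concrete, here is a minimal sketch of a size-based reducer estimate in the spirit of what Hive does. It is an illustration, not Hive's actual code: the function name is hypothetical, and the two constants are assumed values modeled on Hive's hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max settings, whose defaults vary by version.

```python
import math

# Assumed, illustrative values; the real Hive defaults depend on the version.
BYTES_PER_REDUCER = 256 * 1024 * 1024  # modeled on hive.exec.reducers.bytes.per.reducer
MAX_REDUCERS = 1009                    # modeled on hive.exec.reducers.max


def estimate_reducers(input_bytes,
                      bytes_per_reducer=BYTES_PER_REDUCER,
                      max_reducers=MAX_REDUCERS):
    """Hypothetical helper: one reducer per chunk of input, at least 1, capped."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


# The 694 KB transfer from the example above needs a single reducer.
print(estimate_reducers(694 * 1024))    # 1
# A 10 GiB input maps to 40 reducers with the assumed 256 MiB chunk size.
print(estimate_reducers(10 * 1024**3))  # 40
```

With an estimate like this, a 694 KB copy would get 1 reducer rather than one reducer per free container slot.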



Passing the argument "-D mapreduce.job.reduces=${no_of_reduces}" sets the number of reducers explicitly, so a given S3DistCp operation launches only as many reducers as it needs.
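As a rough sketch, the argument could be assembled into a command line as shown below. Several things here are assumptions rather than details from our setup: the launcher name "s3-dist-cp" (on some EMR versions S3DistCp is run as a jar step instead), the example bucket and HDFS paths, and the helper function itself. The -D option is placed before the tool's own options, following the usual Hadoop generic-options convention.

```python
import shlex


def build_s3distcp_command(src, dest, reducers, launcher=("s3-dist-cp",)):
    """Hypothetical helper: build an S3DistCp argument list with a fixed reducer count."""
    if reducers < 1:
        raise ValueError("reducers must be at least 1")
    return [
        *launcher,                                   # assumed launcher; adjust for your cluster
        "-D", f"mapreduce.job.reduces={reducers}",   # the setting discussed above
        "--src", src,
        "--dest", dest,
    ]


# Placeholder source and destination, not real locations.
cmd = build_s3distcp_command("s3://example-bucket/input/", "hdfs:///data/input/", 5)
print(shlex.join(cmd))
```

Building the arguments as a list and joining them only for display avoids quoting mistakes if the command is later passed to subprocess.run.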


After we capped the number of reducers at 5 for S3DistCp jobs, the load on the cluster dropped and throughput increased significantly.

Immediately after deployment of fix

Immediately after the fix was deployed, both the number of spikes in cluster CPU utilization and the number of containers running in the cluster dropped significantly.



After 10 days

Within 10 days of deploying the S3DistCp fix, cluster performance had improved significantly. (The green block in the graph indicates the improvement due to the fix.)
We reduced the total cluster size in production from 15 nodes to 6 nodes.
As a result, our average cluster utilization improved. (The yellow line in the graph indicates the improvement in utilization as we gradually reduced the node count from 15 to 6.)
We are monitoring cluster utilization and plan to gradually bring the cluster size down to about 3 to 4 nodes in the coming days.
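The node counts above translate into savings roughly as follows. This is a back-of-the-envelope sketch that assumes cost scales linearly with the number of nodes (same instance type, same running hours); the function name is only for illustration.

```python
def node_reduction_pct(before, after):
    """Percentage reduction in node count, used as a proxy for cost (assumes linear cost)."""
    return 100.0 * (before - after) / before


# Going from 15 to 6 nodes matches the roughly 60% saving reported above.
print(node_reduction_pct(15, 6))  # 60.0

# The planned target of 3 to 4 nodes would correspond to about 73% to 80%.
for target in (4, 3):
    print(target, round(node_reduction_pct(15, target), 1))
```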

- Sarang Anajwala
