Thursday, February 26, 2015

Hive - dynamic partitions performance issue with S3/EMR

Problem:
We use Hive in Amazon EMR to parse our logs. Any insert query to insert data into a table where number of partitions are high (8000+ in our case), takes huge amount of time because the insert operation loads partitions every time.
This load partitions is done by making one hit per partition to S3 and one hit to metastore!
In our case, for one of the jobs I deciphered the logs - about 95 mins out of 102 mins are wasted in loading partitions.... which means almost 93% of time is wasted in loading partitions!

Following are the logs which suggest loading of partitions -
5538245 [main] INFO  hive.ql.metadata.Hive  - New loading path = s3://<path>/dth=2011-09-04-01 with partSpec {dth=2011-09-04-01}
5538697 [main] INFO  hive.ql.metadata.Hive  - New loading path = s3://<path>/dth=2015-01-01-15 with partSpec {dth=2015-01-01-15}
5539151 [main] INFO  hive.ql.metadata.Hive  - New loading path = s3://<path>/dth=2014-11-16-04 with partSpec {dth=2014-11-16-04}
5539661 [main] INFO  hive.ql.metadata.Hive  - New loading path = s3://<path>/dth=2014-08-16-19 with partSpec {dth=2014-08-16-19}
5540109 [main] INFO  hive.ql.metadata.Hive  - New loading path = s3://<path>/dth=2014-12-15-06 with partSpec {dth=2014-12-15-06}
…………………………………
…………………………………
6152836 [main] INFO  org.apache.hadoop.hive.ql.exec.Task  - Loading partition {dth=2014-12-04-18}
6152888 [main] INFO  org.apache.hadoop.hive.ql.exec.Task  - Loading partition {dth=2015-01-13-19}
6152941 [main] INFO  org.apache.hadoop.hive.ql.exec.Task  - Loading partition {dth=2014-12-27-11}
6152994 [main] INFO  org.apache.hadoop.hive.ql.exec.Task  - Loading partition {dth=2014-08-25-14}
6153046 [main] INFO  org.apache.hadoop.hive.ql.exec.Task  - Loading partition {dth=2014-08-28-18}


Solution:
For all insert (/overwrite) queries, insert should be performed into a new(empty) table pointing to a temporary staging location. Once the insert statement completes, the data file should be copied using a cp command (hdfs cp/distcp) command to final destination.
This would ensure that the insert query completes quickly as the destination table for insert query is empty and no partitions will have to be loaded.
File copy operation should also complete quickly as it will be done outside hive and no partition loading will be involved.

As we are copying data file for new partition directly on file system and not through hive, we will have to recover these new partitions to make sure they are added into hive metastore otherwise this new partitions will not be visible to any next job in workflow which uses this table as input.

In our case, we observed performance for a particular job increase by almost 90%. Run-time for the job reduced from 4+ hours to 25-30 minutes.

- Sarang

No comments:

Post a Comment