Oak.... Green.... Java.... BigData

Following are couple of important configuration changes that can improve the performance of hive queries (specially insert queries) great deal.

`hive.stats.autogather`

By default 'hive.stats.autogather' is 'true'. This setting governs collection of statistics for newly created tables/partitions.
Following are the stats collected:

Number of rows
Number of files
Size in Bytes

For newly created tables and/or partitions (that are populated through the INSERT OVERWRITE command), statistics are automatically computed by default.

In case of tables with high number of partitions, this can have big impact on performance.

It is advisable to disable this setting with set hive.stats.autogather=false;

The stats can be collected when required using following query:

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]

  COMPUTE STATISTICS 

  [FOR COLUMNS]          -- (Note: Hive 0.10.0 and later.)

  [NOSCAN];

`hive.fileformat.check`

This property governs whether to check file format or not when loading data files.

In our case, where the format of data file is governed by our own processes, checking file format every time we load data file may not add any value.

Hence, it is recommended to disable the file format check to gain some performance.

set hive.fileformat.check=false;

Performance

More than 50% of Performance improvement is observed by using these configuration changes. The time taken by hive queries for metering reduced from >100 minutes to <50 minutes.

References

https://cwiki.apache.org/confluence/display/Hive/StatsDev
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

Oak.... Green.... Java.... BigData

Sunday, November 16, 2014

Hive Performance Improvement - statistics and file format checks

`hive.stats.autogather`

`hive.fileformat.check`

Performance

References

Thursday, October 2, 2014

Hive Query to get 95th Percentiled ranked item

Friday, September 26, 2014

Collections in CQL3 - How they are stored

Saturday, September 13, 2014

How to Stop compaction for specific keyspace in cassandra

Friday, September 12, 2014

Compaction Strategy in cassandra

Size Tiered Compaction

Level Tiered Compaction

Comparison at high level

Thursday, September 4, 2014

JVM Command: JInfo

Wednesday, August 20, 2014

Oracle SQL Developer - Java version issue

Thursday, August 14, 2014

HDFS Read Caching: cache-locality

Sunday, June 22, 2014

How to externalize property files of dependent libraries using maven and package it in a tar file alongside jar file