Following are couple of important configuration changes that
can improve the performance of hive queries (specially insert queries)
great deal.
By default 'hive.stats.autogather' is 'true'. This setting governs collection of statistics for newly created tables/partitions.
Following are the stats collected:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
hive.stats.autogather
By default 'hive.stats.autogather' is 'true'. This setting governs collection of statistics for newly created tables/partitions.Following are the stats collected:
- Number of rows
- Number of files
- Size in Bytes
In case of tables with high number of partitions, this can have big impact on performance.
It is advisable to disable this setting with
set hive.stats.autogather=
false
;
The stats can be collected when required using following query:
ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
COMPUTE STATISTICS
[FOR COLUMNS] -- (Note: Hive
0.10
.
0
and later.)
[NOSCAN];
hive.fileformat.check
This property governs whether to check file format or not when loading data files.
In
our case, where the format of data file is governed by our own
processes, checking file format every time we load data file may not add
any value.
Hence, it is recommended to disable the file format check to gain some performance.
set hive.fileformat.check=false;
Performance
More than 50% of Performance improvement is observed by using these configuration changes. The time taken by hive queries for metering reduced from >100 minutes to <50 minutes.References
https://cwiki.apache.org/confluence/display/Hive/StatsDevhttps://cwiki.apache.org/confluence/display/Hive/Configuration+Properties