We store all our data in S3. We create external tables pointing to the data in S3 and run hive queries on these tables.
In one of the use cases, we needed to update a table incrementally with incoming data.
The query was something like this –
INSERT OVERWRITE TABLE main_table1SELECT distinct my_fieldFROM(SELECT my_fieldFROM new_table1UNION ALLSELECT my_fieldFROM main_table1) s;
Basically what we are doing here is that we union new data with existing data and eliminate duplicates from this union.
The expected behavior is that the ‘main_table1’ should get updated on each run with new data from ‘new_table’.
What we observed is this query was overwriting ‘main_table1’ with the data of ‘new_table’. That means on each run, we will lose all the old data and only new data will remain in the table.
The behavior is due to a bug in EMR’s S3 library. S3NativeFileSystem class deletes the S3 file pointed by ‘main_table1’ while preparing query plan itself!
Even simple ‘Explain’ statement for this insert-overwrite-query deletes the data file in S3! This can result in SERIOUS DATA LOSS!
Use a staging area (tmp table) to store your results and then copy the result from tmp table to main table.
CREATE TABLE tmp_table….INSERT OVERWRITE TABLE tmp_tableSELECT distinct my_fieldFROM(SELECT my_fieldFROM new_table1UNION ALLSELECT my_fieldFROM main_table1) s;INSERT OVERWRITE main_table1SELECT my_field from tmp_table;
There is one problem with this solution though. If you stop the last INSERT OVERWRITE statement (INSERT OVERWRITE main_table1 SELECT my_field from tmp_table;) before it completes successfully, you lose all your data! (Remember? - The S3 file is deleted while preparing query plan itself!)
Use ‘INSERT INTO’ instead of ‘INSERT Overwrite’.
INSERT INTO main_table1SELECT t1.my_field FROM new_table1 t1WHERE t1.my_field NOT IN (SELECT t2.my_field FROM main_table1);
This will ensure that you incrementally update the ‘main_table1’ without using INSERT OVERWRITE!