odd behavior of ntile

2016-09-25 Thread Rex X
Hi All, I run the following Hive query: create table2 as select id, ntile(6) over (partition by city order by price) as price_tile, ntile(3) over (partition by city order by discount) as discount_tile, ntile(6) over (partition by city order by number) as number_tile from table1; Table1 contains 8 million
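
For readers unfamiliar with ntile, here is a minimal Python sketch of the standard SQL NTILE bucketing rule, which Hive is assumed to follow: rows already sorted by the ORDER BY key are split into n buckets, and when the row count is not divisible by n the earliest buckets each receive one extra row. The function name and sample data are hypothetical and are not taken from the thread.

```python
def ntile(sorted_rows, n):
    """Return a 1-based bucket number for each row in an ordered partition.

    Assumes the rows are already sorted by the ORDER BY key and that
    the engine follows the standard NTILE rule (an assumption here).
    """
    total = len(sorted_rows)
    base, extra = divmod(total, n)  # 'extra' buckets get one additional row
    tiles = []
    for bucket in range(1, n + 1):
        size = base + (1 if bucket <= extra else 0)
        tiles.extend([bucket] * size)
    return tiles


# Hypothetical partition of 8 prices for one city, split into 6 tiles.
prices = [10, 20, 30, 40, 50, 60, 70, 80]
for price, tile in zip(prices, ntile(prices, 6)):
    print(price, tile)
```

Note that the ordering among rows with equal ORDER BY values is not fixed by this rule, so tied rows may land in different tiles from run to run.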

Re: populating Hive table periodically from files on HDFS

2016-09-25 Thread Eugene Koifman
Have you considered Hive Streaming? (https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest) It's built for exactly this use case. Both Flume and Storm are integrated with it and write directly to your target table. Eugene From: Mich Talebzadeh mailto:mich.talebza...@gmail.com

Re: populating Hive table periodically from files on HDFS

2016-09-25 Thread Mich Talebzadeh
Thanks, I agree. I think using INSERT OVERWRITE to repopulate the data in the partition is bulletproof, with nothing left behind. Performance looks good as well. When creating partitions by date it seems to be more effective to partition by a single string of ‘-MM-DD’ rather than use a multi-depth

Re: populating Hive table periodically from files on HDFS

2016-09-25 Thread Jörn Franke
I think what you propose makes sense. If you did a delta load you would not gain much performance benefit (most likely you will have less performance because you need to figure out what has changed, have the typical issues of distributed systems that some changes may arrive later, error handling

populating Hive table periodically from files on HDFS

2016-09-25 Thread Mich Talebzadeh
Hi, I have trade data delivered through Kafka and Flume as CSV files to HDFS. There are 100 prices every 2 seconds, so in a minute there are 3,000 new rows, 180K rows an hour and in a day 4,320,000 new rows. Flume creates a new sub directory partition every day in the format -MM-DD like prices/20
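
As a quick check of the volumes, the arithmetic from the stated rate of 100 prices every 2 seconds can be worked through as below. This is only a sketch; the constant and variable names are illustrative and do not come from the thread.

```python
# Stated ingest rate: 100 price rows arrive every 2 seconds.
ROWS_PER_BATCH = 100
BATCH_INTERVAL_SECONDS = 2

rows_per_minute = ROWS_PER_BATCH * 60 // BATCH_INTERVAL_SECONDS  # 3,000
rows_per_hour = rows_per_minute * 60                              # 180,000
rows_per_day = rows_per_hour * 24                                 # 4,320,000

print(rows_per_minute, rows_per_hour, rows_per_day)
```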