Hi All,
I run the following Hive query:
create table table2 as
select
id,
ntile(6) over (partition by city order by price) as price_tile,
ntile(3) over (partition by city order by discount) as discount_tile,
ntile(6) over (partition by city order by number) as number_tile
from table1;
Table1 contains 8 million rows.
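In case it is useful, a quick sanity check on the result once the CTAS finishes
(using the table and column names from the query above) is to count rows per tile:

-- within each city ntile(6) gives the six tiles roughly equal row counts,
-- so summed over all cities each tile should hold about 1/6 of the rows
select price_tile, count(*) as rows_in_tile
from table2
group by price_tile
order by price_tile;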
Have you considered Hive Streaming?
(https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest)
It's built for exactly such a use case.
Both Flume and Storm are integrated with it and write directly to your target
table.
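For streaming ingest to work, the target table has to be stored as ORC, bucketed,
and have transactions enabled. A minimal sketch (table, column, and bucket choices
here are placeholders, not from this thread):

-- sketch only: Hive Streaming requires a bucketed, ORC, transactional table
create table prices_stream (
  ticker string,
  price  double,
  ts     string
)
partitioned by (trade_date string)
clustered by (ticker) into 4 buckets
stored as orc
tblproperties ('transactional' = 'true');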
Eugene
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Thanks
I agree. I think using INSERT OVERWRITE to repopulate data in the partition
is bullet-proof, with nothing left behind. Performance looks good as well.
When creating partitions by date it seems to be more effective to partition
by a single string of ‘YYYY-MM-DD’ rather than use a multi-depth partition hierarchy.
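A rough sketch of that pattern (table, column, and staging names below are
placeholders, not taken from this thread):

-- single-string date partition, repopulated in full with INSERT OVERWRITE;
-- whatever was in the partition before is replaced, nothing is left behind
create table prices (
  ticker string,
  price  double,
  ts     string
)
partitioned by (trade_date string)
stored as orc;

insert overwrite table prices partition (trade_date = '2016-01-25')
select ticker, price, ts
from staging_prices
where trade_date = '2016-01-25';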
I think what you propose makes sense. If you do a delta load you do not gain
much in performance (most likely you will lose performance, because you need
to figure out what has changed, deal with the typical issue of distributed
systems that some changes may arrive late, handle errors, and so on).
Hi,
I have trade data delivered through Kafka and Flume as CSV files to HDFS.
There are 100 prices every 2 seconds, so in a minute there are 3,000 new
rows, 180,000 rows an hour, and in a day 4,320,000 new rows.
Flume creates a new sub-directory partition every day in the format
YYYY-MM-DD, like prices/20
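For reference, one way to expose those daily Flume directories to Hive is an
external table partitioned by the same date string, registering each day's
directory as a partition (paths and names here are assumptions):

-- external table over the CSV files Flume writes to HDFS;
-- each daily sub-directory is added as one partition
create external table prices_csv (
  ticker string,
  price  double,
  ts     string
)
partitioned by (trade_date string)
row format delimited fields terminated by ','
location '/data/prices';

alter table prices_csv
  add if not exists partition (trade_date = '2016-01-25')
  location '/data/prices/2016-01-25';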