Hi,

I have trade data delivered through Kafka and Flume as CSV files to HDFS.
There are 100 prices every 2 seconds, so there are 3,000 new rows a minute,
180,000 rows an hour and 4,320,000 new rows a day.
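
The arithmetic behind these volumes can be checked with a few lines of
Python. This is only a sanity check on the feed rate quoted above; the
constant names are mine.

```python
# Feed rate taken from the message: 100 prices every 2 seconds.
PRICES_PER_BATCH = 100
BATCH_INTERVAL_SECONDS = 2

rows_per_second = PRICES_PER_BATCH // BATCH_INTERVAL_SECONDS
rows_per_minute = rows_per_second * 60
rows_per_hour = rows_per_minute * 60
rows_per_day = rows_per_hour * 24

print(rows_per_minute, rows_per_hour, rows_per_day)
```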

Flume creates a new subdirectory every day, named in the format
YYYY-MM-DD, for example prices/2015-09-25 on HDFS.

An external Hive table is pointed at the new directory each day by simply
altering its location:

ALTER TABLE ${DATABASE}.externalMarketData set location
'hdfs://rhes564:9000/data/prices/${TODAY}';

This means that the external Hive table only points to the current
directory.
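
For illustration, a minimal Python sketch of how the daily statement could
be rendered from a date. The function name and the database name "test" are
hypothetical; only the table name and the HDFS root come from the statement
above. The real job presumably substitutes ${DATABASE} and ${TODAY} in the
shell.

```python
from datetime import date

# HDFS root taken from the ALTER TABLE statement in this message.
HDFS_ROOT = "hdfs://rhes564:9000/data/prices"


def alter_location_sql(database: str, day: date) -> str:
    """Render the statement that repoints the external table at one day's directory."""
    today = day.strftime("%Y-%m-%d")  # same YYYY-MM-DD layout Flume uses
    return (
        f"ALTER TABLE {database}.externalMarketData "
        f"SET LOCATION '{HDFS_ROOT}/{today}';"
    )


# "test" is a placeholder database name, not one from the message.
print(alter_location_sql("test", date(2015, 9, 25)))
```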

The target internal table in Hive is partitioned by DateStamp = "YYYY-MM-DD":

PARTITIONED BY (DateStamp  string)

To populate the Hive table, a cron job runs every 15 minutes and simply does:

INSERT OVERWRITE TABLE ${DATABASE}.marketData PARTITION (DateStamp =
"${TODAY}")
SELECT
'''''''''''''''''''''''''
)
FROM ${DATABASE}.externalMarketData

So effectively every 15 minutes *today's partition* is overwritten by new
data from the external table.

This seems to be OK.

The other option is to add only the rows that have arrived since the last
run, using INSERT INTO with a WHERE clause that excludes rows already
present in the target table.
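
One way this second option might look is an anti-join against today's
partition. The Python sketch below only renders such a statement; it is not
what the cron job currently runs. The key columns (ticker, timeCreated) are
hypothetical, since the real column list is not shown here, and SELECT e.*
assumes the external table's columns line up with the target's
non-partition columns.

```python
def incremental_insert_sql(database, today, key_columns):
    """Render an INSERT INTO that appends only rows not yet in today's partition.

    key_columns must uniquely identify a row; the names used in the example
    call below are assumptions, not taken from the real table.
    """
    keys = ", ".join(key_columns)
    join_on = " AND ".join(f"t.{c} = e.{c}" for c in key_columns)
    return (
        f"INSERT INTO TABLE {database}.marketData "
        f"PARTITION (DateStamp = '{today}')\n"
        f"SELECT e.*\n"
        f"FROM {database}.externalMarketData e\n"
        f"LEFT OUTER JOIN (\n"
        f"  SELECT {keys}\n"
        f"  FROM {database}.marketData\n"
        f"  WHERE DateStamp = '{today}'\n"
        f") t ON ({join_on})\n"
        # Rows with no match in the target are the new ones.
        f"WHERE t.{key_columns[0]} IS NULL"
    )


# Placeholder database and key columns, for illustration only.
print(incremental_insert_sql("test", "2015-09-25", ["ticker", "timeCreated"]))
```

Unlike INSERT OVERWRITE, this appends to the partition, so each run adds
more files and still has to read the existing partition for the join. It
also does not remove duplicates that are already inside the external
directory itself.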

Any other suggestions?


Thanks









Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
