Re: Help on loading data stream to hive table.

2014-01-07 Thread Chen Wang
Alan, the reason I am trying to write to the same file is that I don't want to persist each entry as a small file to HDFS. That would make Hive loading very inefficient, right? (Although I could do file merging in a separate job.) My current thought is that I probably could set up a timer (say 6 min) …
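
A minimal sketch of that timer idea, assuming a Storm bolt that buffers records in memory and flushes them as one batch when a tick tuple arrives (the class name, flush interval, and the flushToHdfs stub are illustrative, not code from this thread):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import backtype.storm.Config;
    import backtype.storm.Constants;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;

    // Buffers incoming records and writes them out in one batch every few
    // minutes, so HDFS sees a few larger writes instead of many tiny files.
    public class BufferedHdfsBolt extends BaseRichBolt {
      private static final int FLUSH_INTERVAL_SECS = 360;  // ~6 minutes
      private transient List<String> buffer;
      private transient OutputCollector collector;

      public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.buffer = new ArrayList<String>();
      }

      public void execute(Tuple tuple) {
        if (isTickTuple(tuple)) {
          flushToHdfs(buffer);   // one write per interval, not per record
          buffer.clear();
        } else {
          buffer.add(tuple.getString(0));
        }
        collector.ack(tuple);
      }

      private boolean isTickTuple(Tuple t) {
        return Constants.SYSTEM_COMPONENT_ID.equals(t.getSourceComponent())
            && Constants.SYSTEM_TICK_STREAM_ID.equals(t.getSourceStreamId());
      }

      private void flushToHdfs(List<String> records) {
        // Open, write to, and close the partition's file here; the rest of
        // the thread discusses where that file should live.
      }

      // Ask Storm to send this bolt a tick tuple on the flush interval.
      public Map<String, Object> getComponentConfiguration() {
        Config conf = new Config();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, FLUSH_INTERVAL_SECS);
        return conf;
      }

      public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }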

Re: Help on loading data stream to hive table.

2014-01-07 Thread Peyman Mohajerian
You may find Summingbird relevant; I'm still investigating it: https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird On Tue, Jan 7, 2014 at 11:39 AM, Alan Gates wrote: > I am not wise enough in the ways of Storm to tell you how you should > partition data across bolts. However, …

Re: Help on loading data stream to hive table.

2014-01-07 Thread Alan Gates
I am not wise enough in the ways of Storm to tell you how you should partition data across bolts. However, there is no need in Hive for all data for a partition to be in the same file, only in the same directory. So if each bolt creates a file for each partition and then all those files are placed in the same directory, …
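
As a rough illustration of that point, each bolt task could open its own uniquely named file inside the partition's directory with the HDFS API along these lines (the warehouse path, the dt= partition column, and the file-name pattern are assumptions for the example, not anything prescribed by Hive):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PartitionFileWriter {
      // Hive only requires that every file for a partition live in that
      // partition's directory; file names inside it are arbitrary, so each
      // bolt task can safely own its own file.
      public static FSDataOutputStream openForTask(String tableDir,
                                                   String partitionValue,
                                                   int taskId) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        // e.g. /user/hive/warehouse/events/dt=2014010712/part-bolt-00003
        Path file = new Path(tableDir + "/dt=" + partitionValue,
                             String.format("part-bolt-%05d", taskId));
        return fs.create(file, false);  // exactly one writer per file
      }
    }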

Re: Help on loading data stream to hive table.

2014-01-06 Thread Chen Wang
Alan, the problem is that the data is partitioned by epoch time hourly, and I want all data belonging to that partition to be written into one file named after that partition. How can I share the file writer across different bolts? Should I route data within the same partition to the same bolt? Thanks
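
One way to route all data for a partition to the same bolt is a fields grouping on the partition key. A small sketch, assuming the stream carries a field called "partition" and using made-up component ids:

    import backtype.storm.topology.IRichBolt;
    import backtype.storm.topology.IRichSpout;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class TopologyWiring {
      // A fields grouping hashes on the "partition" field, so every tuple
      // carrying the same partition value reaches the same writer-bolt task.
      public static TopologyBuilder build(IRichSpout eventSpout, IRichBolt writerBolt) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", eventSpout, 2);
        builder.setBolt("writer", writerBolt, 4)
               .fieldsGrouping("events", new Fields("partition"));
        return builder;
      }
    }

The usual caveat is that a hot partition then funnels into a single task, so throughput depends on how evenly the partition key is distributed.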

Re: Help on loading data stream to hive table.

2014-01-03 Thread Alan Gates
You shouldn’t need to write each record to a separate file. Each Storm bolt should be able to write to its own file, appending records as it goes. As long as you only have one writer per file this should be fine. You can then close the files every 15 minutes (or whatever works for you) and …
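
A minimal sketch of such a per-bolt writer, assuming a plain HDFS output stream that is closed and reopened under a new name on a fixed interval (the directory layout and file names are illustrative only):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // One instance per bolt task: a single open file at a time, rotated on a
    // fixed interval so the closed files can be picked up by Hive (or merged).
    public class RotatingWriter {
      private static final long ROTATE_MS = 15 * 60 * 1000L;  // 15 minutes
      private final FileSystem fs;
      private final String dir;
      private FSDataOutputStream out;
      private long openedAt;

      public RotatingWriter(String dir) throws IOException {
        this.fs = FileSystem.get(new Configuration());
        this.dir = dir;
        rotate();
      }

      public synchronized void write(String record) throws IOException {
        if (System.currentTimeMillis() - openedAt > ROTATE_MS) {
          rotate();
        }
        out.write((record + "\n").getBytes("UTF-8"));
      }

      private void rotate() throws IOException {
        if (out != null) {
          out.close();  // only closed files should be handed to Hive
        }
        openedAt = System.currentTimeMillis();
        // e.g. records-1389115192000; any unique name within the directory works
        out = fs.create(new Path(dir, "records-" + openedAt), false);
      }
    }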