You shouldn’t need to write each record to a separate file.  Each Storm bolt 
should be able to write to its own file, appending records as it goes.  As 
long as you only have one writer per file this should be fine.  You can then 
close the files every 15 minutes (or whatever works for you) and have a 
separate job that creates a new partition in your Hive table with the files 
created by your bolts.  
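
As a rough sketch of that pattern (Python for brevity, with local files standing in for HDFS): the class name, the writer id, the `batch` partition column, and the table name below are all made up for illustration and are not part of Storm or Hive. Each writer appends to its own file and rolls it over after a fixed interval. The helper builds the kind of ADD PARTITION statement the separate job could run once the closed files have been moved into a per-batch directory.

```python
import os
import time


class RotatingRecordWriter:
    """Appends records to one file per writer and rolls it over on a timer."""

    def __init__(self, out_dir, writer_id, rotate_secs=15 * 60, clock=time.time):
        self.out_dir = out_dir
        self.writer_id = writer_id    # hypothetical: e.g. one id per bolt task
        self.rotate_secs = rotate_secs
        self.clock = clock            # injectable so rotation can be tested
        self.closed_files = []        # finished files, ready for the Hive job
        self._fh = None
        self._path = None
        self._opened_at = None
        os.makedirs(out_dir, exist_ok=True)

    def _open(self):
        self._opened_at = self.clock()
        name = "%s-%d.log" % (self.writer_id, int(self._opened_at))
        self._path = os.path.join(self.out_dir, name)
        self._fh = open(self._path, "a", encoding="utf-8")

    def write(self, record):
        # Roll over first if the current file is older than the interval.
        if self._fh is not None and self.clock() - self._opened_at >= self.rotate_secs:
            self.close()
        if self._fh is None:
            self._open()
        self._fh.write(record.rstrip("\n") + "\n")

    def close(self):
        if self._fh is not None:
            self._fh.close()
            self.closed_files.append(self._path)
            self._fh = None


def add_partition_sql(table, partition_value, location):
    """Build the statement the separate job would run for one closed batch."""
    return ("ALTER TABLE %s ADD IF NOT EXISTS PARTITION (batch='%s') LOCATION '%s'"
            % (table, partition_value, location))
```

Because the writer id is part of the file name, no two writers ever share a file, which is the one-writer-per-file condition above. Note that rotation here is only checked when a record arrives; a real bolt would probably also want to close idle files on a timer.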

Alan.

On Jan 2, 2014, at 11:58 AM, Chen Wang <chen.apache.s...@gmail.com> wrote:

> Guys,
> I am using Storm to read a data stream from our socket server, entry by
> entry, and then write the entries to files: one entry per file.  At some
> point, I need to import the data into my Hive table. There are several
> approaches I could think of:
> 1. Write directly to the Hive HDFS file whenever I get an entry (from our
> socket server). The problem is that this could be very inefficient, since
> we have a huge data stream, and I would not want to write to Hive HDFS one
> entry at a time.
> Or
> 2. I can write the entries to files (normal files or HDFS files) on disk,
> and then have a separate job merge those small files into a big one, and
> then load it into the Hive table.
> The problem with this is: a) how can I merge small files into big files for
> Hive? b) what is the best file size to upload to Hive?
> 
> I am seeking advice on both approaches, and appreciate your insight.
> Thanks,
> Chen
> 


