Hi guys,

I am using Storm to read a data stream from our socket server, entry by entry, and write each entry to its own file. At some point I need to import that data into my Hive table. There are two approaches I can think of:

1. Write each entry directly to the Hive table's HDFS location as it arrives from the socket server. The problem is that this could be very inefficient: the stream volume is huge, and I would rather not write to HDFS one entry at a time.

2. Write the entries to files on disk (local files or HDFS files), then run a separate job that merges the small files into big ones and loads them into the Hive table. My questions with this approach are: a) how can I merge the small files into big files for Hive (see the sketch below for what I have in mind), and b) what is the best file size to load into Hive?
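To make approach 2 concrete, here is a minimal sketch of the merge step as I currently imagine it, assuming I am on Hadoop 2.x (where FileUtil.copyMerge is still available) and that the small files land in a time-bucketed staging directory on HDFS. The paths, class name, and table name below are hypothetical placeholders, not our actual setup:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class EntryFileMerger {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical layout: Storm writes one entry per file under a
            // time-bucketed staging directory; this merge job runs afterwards.
            Path smallFiles = new Path("/staging/entries/2014-01-01-00");
            Path merged = new Path("/staging/merged/entries-2014-01-01-00.txt");

            // copyMerge concatenates every file under the source directory into
            // a single destination file; "true" deletes the small source files
            // once the merge succeeds, and the last argument is an optional
            // separator string written after each file (null = none).
            FileUtil.copyMerge(fs, smallFiles, fs, merged, true, conf, null);
        }
    }

After that, I would run LOAD DATA INPATH '/staging/merged/entries-2014-01-01-00.txt' INTO TABLE my_table to move the merged file into the warehouse directory. As for b), my understanding is that each file should be at least one HDFS block (64 to 128 MB with default settings), since lots of sub-block files waste NameNode memory and spawn one map task each, but please correct me if that is wrong.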
I am seeking advice on both approaches and would appreciate your insight. Thanks, Chen