*Is there a way to increase the file/block size beyond 1MB?*

*Thank you!*
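For reference, the 1MB outputs are usually one small file per reducer rather than a SequenceFile block-size limit, so the usual fix is to merge the small output files or cut the reducer count. A sketch of settings that may help (the size values below are illustrative, and hive.merge behavior varies by Hive version and execution engine):

SET hive.merge.mapfiles=true;                -- merge small outputs of map-only jobs
SET hive.merge.mapredfiles=true;             -- merge small outputs of map-reduce jobs
SET hive.merge.size.per.task=256000000;      -- target size of each merged file (~256MB)
SET hive.merge.smallfiles.avgsize=128000000; -- merge when average output file is below ~128MB

-- Alternatively, fewer reducers means fewer, larger output files:
SET hive.exec.reducers.bytes.per.reducer=1000000000;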
On Mon, Sep 26, 2016 at 7:50 PM, Arun Patel <arunp.bigd...@gmail.com> wrote:

> Thanks Dudu and Gopal.
>
> I tried HAR files and it works.
>
> I want to use a sequence file because I want to expose the data through a
> table (filename and content columns). *Can this be done for HAR files?*
>
> This is what I am doing to create a sequence file:
>
> create external table raw_files (raw_data string) location
> '/user/myid/myfiles';
> create table files_seq (key string, value string) stored as sequencefile;
> insert overwrite table files_seq
> select REGEXP_EXTRACT(INPUT__FILE__NAME, '.*/(.*)/(.*)', 2) as
> file_name, CONCAT_WS(' ', COLLECT_LIST(raw_data)) as raw_data
> from raw_files group by INPUT__FILE__NAME;
>
> It works well, but I am seeing 1MB files in the files_seq directory. I am
> using the parameters below. *Is there a way to increase the file/block
> size?*
>
> SET hive.exec.compress.output=true;
> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
> SET mapred.output.compression.type=BLOCK;
>
>
> On Fri, Sep 23, 2016 at 7:16 PM, Gopal Vijayaraghavan <gop...@apache.org>
> wrote:
>
>>
>> > Is there a way to create an external table on a directory, extract
>> 'key' as file name and 'value' as file content and write to a sequence
>> file table?
>>
>> Do you care that it is a sequence file?
>>
>> The HDFS HAR format was invented for this particular problem; check
>> whether the "hadoop archive" command works for you and offers a
>> filesystem abstraction.
>>
>> Otherwise, there's always the old Mahout "seqdirectory" job, which is
>> great if you have, say, .jpg files and want to pack them for HDFS to
>> handle better (like GPS tiles).
>>
>> Cheers,
>> Gopal
>>
>>
>>
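On the HAR question above ("Can this be done for HAR files?"): HAR is exposed as a Hadoop filesystem, so once an archive is built with something like "hadoop archive -archiveName myfiles.har -p /user/myid myfiles /user/myid/archive", an external table can usually point at a har:// location. A sketch, assuming that archive path (illustrative); whether INPUT__FILE__NAME reports the per-file names inside the archive the same way is worth verifying:

create external table raw_files_har (raw_data string)
location 'har:///user/myid/archive/myfiles.har/myfiles';

-- Then the same filename/content query as in the sequence file approach:
select REGEXP_EXTRACT(INPUT__FILE__NAME, '.*/(.*)/(.*)', 2) as file_name,
       CONCAT_WS(' ', COLLECT_LIST(raw_data)) as raw_data
from raw_files_har
group by INPUT__FILE__NAME;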