Hi

INPUT
=====
Hive can handle gz files out of the box with no additional configuration.
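
For illustration, a minimal sketch of an external table over gzip-compressed text files. The table name, columns, delimiter, and bucket path are all hypothetical, and the URI scheme (s3n here) depends on how your Hadoop installation is set up. For plain text tables the codec is picked from the .gz file extension, so the DDL carries no compression setting.

```sql
-- Hypothetical table, columns, and S3 path; adjust all of them to your data.
CREATE EXTERNAL TABLE web_logs (
  ts  STRING,
  msg STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
-- Directory that holds the .gz files (assumed layout)
LOCATION 's3n://example-bucket/logs/';

-- Queried like any other table; the .gz files are decompressed on read.
SELECT COUNT(*) FROM web_logs;
```
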
OUTPUT
======
If you want Hive to write compressed output files (say gz), add the following at the beginning of your Hive SQL:

SET hive.exec.compress.output=true;
-- this will create at most 16 gzip files as the output of your Hive query
SET mapred.reduce.tasks=16;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

Side note (may or may not be relevant to you, but nevertheless):
You may know that GZIP is not splittable. Unless you have a definite reason to use GZIP (for example, multiple lines in a log file actually constitute one logical object or record), I would recommend LZO. A little bit of plumbing is required, since LZO is no longer shipped with Hadoop out of the box, but it is pretty straightforward. Remember to use the LZO indexer to create an index for your output so that the LZO files can be split going forward.

Thanks
sanjay

From: Panshul Whisper <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, May 2, 2013 6:00 AM
To: "[email protected]" <[email protected]>
Subject: external table or gz compressed file

Hello,

Can somebody please explain to me, or point me in the right direction on, how Hive handles gz compressed files?

If I create an external table pointing to a .gz compressed file stored on AWS S3, does Hive copy the file to HDFS and decompress it before it uses the file, or does it use the file directly?

If we use an uncompressed file stored on S3, does Hive still copy the file to HDFS, or does it read records directly from S3?

Please help me understand how this works.

Thanking You,

--
Regards,
Ouch Whisper
010101010101
