bzip2 (which is splittable) or the Snappy codec (inside block-compressed
SequenceFiles) will be very useful for that.
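
For example, a rough sketch (untested; logs_gz and logs_bz2 are placeholder
table names) that rewrites the gz-backed data into splittable bzip2 output:

  hive -e "
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
    -- rewrite the existing table into a new, splittable copy
    INSERT OVERWRITE TABLE logs_bz2 SELECT * FROM logs_gz;
  "

For Snappy you'd swap in org.apache.hadoop.io.compress.SnappyCodec and store
the table as block-compressed SequenceFiles
(SET mapred.output.compression.type=BLOCK;), since raw Snappy files aren't
splittable on their own.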

- Alex

On Wed, Nov 2, 2011 at 11:00 AM, Martin Kuhn <martin.k...@affinitas.de> wrote:

> You could try using splittable LZO compression instead:
> https://github.com/kevinweil/hadoop-lzo (a gz file can't be split)
>
>
> > We have multiple terabytes of data (currently in gz format, approx. 2 GB
> > per file). What is the best way to load that data into Hadoop?
>
> > We have seen that loading a gz file (especially via Hive's LOAD DATA LOCAL
> > INPATH ....) takes around 12 seconds, but once we decompress it (to around
> > 4~5 GB) it takes 8 minutes to load the file.
>
> > We want these files to be processed by multiple mappers on Hadoop, not by
> > a single one.
>
> > What would be the best way to load these files into Hive/HDFS so that they
> > take less time to load and can be processed by multiple mappers?
>
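
Regarding the hadoop-lzo suggestion above: after installing the codec you
also have to index the .lzo files, otherwise they still won't be split. A
rough sketch (untested; jar path and file names are made up):

  # compress locally and push to HDFS
  lzop big_file.tsv
  hadoop fs -put big_file.tsv.lzo /user/data/

  # build the split index so MapReduce can run multiple mappers per file
  hadoop jar /usr/lib/hadoop/lib/hadoop-lzo.jar \
      com.hadoop.compression.lzo.DistributedLzoIndexer /user/data/big_file.tsv.lzo

This writes a big_file.tsv.lzo.index file next to the data.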



-- 
Alexander Lorenz
http://mapredit.blogspot.com

