bzip2 or the Snappy codec would be very useful for that.
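For example, a rough, untested sketch (file names and the HDFS path are just placeholders):

    # recompress a .gz file as bzip2, which Hadoop can split across mappers
    gunzip -c input.gz | bzip2 -c > input.bz2
    # push the result into HDFS where Hive can load it
    hadoop fs -put input.bz2 /user/hive/warehouse/mytable/

- Alex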
On Wed, Nov 2, 2011 at 11:00 AM, Martin Kuhn <martin.k...@affinitas.de> wrote:

> You could try to use splittable LZO compression instead:
> https://github.com/kevinweil/hadoop-lzo (a gz file can't be split)
>
> > We have multiple terabytes of data (currently in gz format, approx. 2 GB
> > per file). What is the best way to load that data into Hadoop?
> >
> > We have seen that loading a gz file (especially using Hive's LOAD DATA
> > LOCAL INPATH ...) takes around 12 seconds, but when we decompress it
> > (around 4-5 GB) it takes 8 minutes to load the file.
> >
> > We want these files to be processed by multiple mappers on the Hadoop
> > cluster, not by a single one.
> >
> > What would be the best way to load these files into Hive/HDFS so that
> > loading takes less time and multiple mappers can process the files?

--
Alexander Lorenz
http://mapredit.blogspot.com