Re: Handling LZO files

2015-12-04 Thread Jörn Franke
You can pack several small files into one Hadoop archive (HAR). The other alternative is to set the split size of the execution engine (Tez, MR, ...), which you probably do not want to do at a global level. In general, one should replace XML, JSON etc. with Avro where possible, and then use formats such as ORC or Parquet for analytics.
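
For illustration, a minimal sketch of the per-query alternative: setting the split size only for the current Hive session instead of globally. The property names are the standard MapReduce and Tez ones; the byte values are purely illustrative:

    -- Per-session only; the global configuration stays untouched.
    -- 256 MB max / 128 MB min splits, illustrative values.
    SET mapreduce.input.fileinputformat.split.maxsize=268435456;
    SET mapreduce.input.fileinputformat.split.minsize=134217728;
    -- Tez groups splits through its own properties:
    SET tez.grouping.max-size=268435456;
    SET tez.grouping.min-size=134217728;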

Re: Handling LZO files

2015-12-03 Thread Jörn Franke
Your Hive version is too old. You may also want to use another execution engine. I think your problem might then be related to external tables, for which the parameters you set probably do not apply. I once had the same problem, but I needed to change the block size at the Hadoop level (hdfs-site.xml).
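
As a point of reference, the block size lives in hdfs-site.xml as dfs.blocksize and is a client-side write property, so it can in principle also be overridden per Hive session; a hedged sketch, with an illustrative value:

    -- Only affects files Hive writes after this point; it does not
    -- re-block data (such as the existing LZO files) already in HDFS.
    SET dfs.blocksize=268435456;  -- 256 MB, illustrative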

Re: Handling LZO files

2015-12-03 Thread Harsha HN
Hi Franke, it's a 100+ node cluster; roughly 2 TB of memory and 1000+ vCores were available when I ran my job, so infrastructure is not the problem here. The Hive version is 0.13. As for ORC or Parquet: that would require us to load 5 years of LZO data into ORC or Parquet format. Though it might be more performance-efficient, …

Re: Handling LZO files

2015-12-03 Thread Jörn Franke
How many nodes and cores, and how much memory, do you have? What Hive version? Do you have the opportunity to use Tez as an execution engine? Usually I use external tables only for reading the raw files and inserting them into a table in ORC or Parquet format for doing analytics. This is much more performant than JSON.
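
A minimal sketch of that pattern, assuming a hypothetical external table json_events_ext already sits over the raw LZO JSON (all names here are placeholders):

    -- Materialize the external data into a managed ORC table once,
    -- then run all analytic queries against the ORC copy.
    CREATE TABLE events_orc STORED AS ORC
    AS
    SELECT * FROM json_events_ext;

Queries against the ORC table then benefit from columnar storage and predicate pushdown, which is where the performance gain over raw JSON comes from.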

Handling LZO files

2015-12-03 Thread Harsha HN
Hi, we have LZO-compressed JSON files in our HDFS locations. I am creating an "External" table on the data in HDFS for the purpose of analytics. There are 3 LZO-compressed part files of sizes 229.16 MB, 705.79 MB and 157.61 MB respectively, along with their index files. When I run a count(*) query on the table, …
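
For context, the DDL for such a table typically looks roughly like the sketch below. The input/output format classes are the ones commonly paired with hadoop-lzo, the JSON SerDe is one common choice, and the columns and location are placeholders, so treat this as an assumed setup rather than the exact table in question:

    -- Placeholder columns and path; the .index files next to the
    -- part files are what let the LZO input format create splits.
    CREATE EXTERNAL TABLE json_events_ext (
      id STRING,
      payload STRING
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS
      INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION '/path/to/lzo/json';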