Your Hive version is too old. You may also want to use another execution 
engine. I think your problem might be related to external tables, for which 
the parameters you set probably do not apply. I once had the same problem, but I 
needed to change the block size at the Hadoop level (hdfs-site.xml) or at the 
Hive level (hive-site.xml). It was definitely not possible as part of a Hive 
session (set ...). I would need to check the documentation.
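For reference, a cluster-level change like that would look roughly like this in 
hdfs-site.xml (the value below is purely illustrative, not a recommendation):

<property>
  <name>dfs.blocksize</name>
  <!-- illustrative only: the 30 MB value from your set attempt, in bytes -->
  <value>31457280</value>
</property>
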
In any case, loading it into ORC or Parquet makes a lot of sense, but only 
with a recent Hive version and Tez or Spark as the execution engine.
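A rough sketch of what I mean (the table names are made up, and the engine 
setting assumes Tez is actually available on your cluster):

set hive.execution.engine=tez;

-- create an ORC copy of the data behind the existing external table
CREATE TABLE events_orc STORED AS ORC
AS SELECT * FROM events_json_ext;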

> On 03 Dec 2015, at 14:58, Harsha HN <99harsha.h....@gmail.com> wrote:
> Hi Franke,
> 
> It's a 100+ node cluster. Roughly 2 TB of memory and 1000+ vCores were available 
> when I ran my job, so infrastructure is not a problem here. 
> The Hive version is 0.13.
> 
> As for ORC or Parquet, that would require us to load 5 years of LZO data into 
> ORC or Parquet format. Though it might be more efficient for performance, it 
> increases data redundancy. 
> But we will explore that option. 
> 
> Currently I want to understand why I am unable to scale up the mappers.
> 
> Thanks,
> Harsha
> 
>> On Thu, Dec 3, 2015 at 7:02 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>> 
>> How many nodes, cores and memory do you have?
>> What Hive version?
>> 
>> Do you have the opportunity to use tez as an execution engine?
>> Usually I use external tables only for reading the data and inserting it into 
>> a table in ORC or Parquet format for doing analytics.
>> This is much more performant than JSON or any other text-based format.
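>> A minimal sketch of that pattern (table names and columns are invented, just 
>> to show the shape of it):
>> 
>> CREATE TABLE events_parquet (event_time string, payload string)
>>   STORED AS PARQUET;
>> INSERT OVERWRITE TABLE events_parquet
>>   SELECT event_time, payload FROM events_json_ext;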
>> 
>>> On 03 Dec 2015, at 14:20, Harsha HN <99harsha.h....@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> We have LZO-compressed JSON files in our HDFS locations. I am creating an 
>>> "External" table over the data in HDFS for the purpose of analytics. 
>>> 
>>> There are 3 LZO-compressed part files of sizes 229.16 MB, 705.79 MB, and 
>>> 157.61 MB respectively, along with their index files. 
>>> 
>>> When I run a count(*) query on the table, I observe only 10 mappers, which 
>>> causes a performance bottleneck. 
>>> 
>>> I even tried the following (aiming for a 30 MB split):
>>> 1) set mapreduce.input.fileinputformat.split.maxsize=31457280;
>>> 2) set dfs.blocksize=31457280;
>>> But I am still getting 10 mappers.
>>> 
>>> Can you please guide me in fixing this?
>>> 
>>> Thanks,
>>> Sree Harsha
> 
