Hi Franke,

It's a 100+ node cluster. Roughly 2 TB of memory and 1000+ vCores were available
when I ran my job, so infrastructure is not the problem here.
The Hive version is 0.13.

About ORC or Parquet: that would require us to load 5 years of LZO data into ORC
or Parquet format. Though it might be more efficient performance-wise, it
increases data redundancy.
But we will explore that option, along the lines of the sketch below.
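To illustrate, a minimal sketch of what such a conversion might look like (the
table and column names here are hypothetical, not our actual schema):

    CREATE TABLE events_orc (
      event_id STRING,
      payload  STRING
    )
    STORED AS ORC;

    -- Read from the existing external table (hypothetical name) and
    -- rewrite the data into the ORC-backed table.
    INSERT OVERWRITE TABLE events_orc
    SELECT event_id, payload
    FROM events_lzo_json_external;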

Currently I want to understand why I am unable to scale up the number of
mappers. The settings I tried are in my mail quoted below; a consolidated
sketch follows.
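For reference, the split-related knobs in play, consolidated in one place. The
hive.input.format line and the table name are illustrative assumptions on my
part, not something from the original job, and whether they actually change the
mapper count for these LZO-indexed files is still to be verified:

    -- Hive 0.13 defaults to CombineHiveInputFormat, which can merge small
    -- splits; with plain HiveInputFormat the max-split setting is more
    -- likely to take effect.
    set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
    set mapreduce.input.fileinputformat.split.maxsize=31457280;  -- ~30 MB max split
    set mapreduce.input.fileinputformat.split.minsize=1;         -- allow small splits
    -- Note: dfs.blocksize only affects files written after it is set,
    -- not the splitting of files that already exist.
    select count(*) from events_lzo_json_external;               -- hypothetical name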

Thanks,
Harsha

On Thu, Dec 3, 2015 at 7:02 PM, Jörn Franke <jornfra...@gmail.com> wrote:

>
> How many nodes, cores and memory do you have?
> What Hive version?
>
> Do you have the opportunity to use tez as an execution engine?
> Usually I use external tables only for reading the data and then inserting it
> into a table in ORC or Parquet format for doing analytics.
> This is much more performant than JSON or any other text-based format.
>
> On 03 Dec 2015, at 14:20, Harsha HN <99harsha.h....@gmail.com> wrote:
>
> Hi,
>
> We have LZO-compressed JSON files in our HDFS locations. I am creating an
> "External" table on the data in HDFS for the purpose of analytics.
>
> There are 3 LZO-compressed part files of sizes 229.16 MB, 705.79 MB, and
> 157.61 MB respectively, along with their index files.
>
> When I run a count(*) query on the table, I observe only 10 mappers, which is
> causing a performance bottleneck.
>
> I even tried the following (going for a 30 MB split):
>  1)  set mapreduce.input.fileinputformat.split.maxsize=31457280;
>
> 2) set dfs.blocksize=31457280;
>
> But I am still getting only 10 mappers.
>
> Can you please guide me in fixing this?
>
> Thanks,
> Sree Harsha
>
>
