Your Hive version is too old. You may also want to use another execution engine. I think your problem might be related to the external table, for which the parameters you set probably do not apply. I once had the same problem, but I needed to change the block size at the Hadoop level (hdfs-site.xml) or at the Hive level (hive-site.xml); it was definitely not possible as part of a Hive session (set ...). I would need to check the documentation. In any case, loading it into ORC or Parquet makes a lot of sense, but only with a recent Hive version and Tez or Spark as the execution engine.
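To make the ORC route concrete, roughly something like the following. This is only a sketch: events_orc, json_events and the column names are placeholders, and it assumes your external table already has a JSON SerDe (or similar) so the columns can be selected directly.

  SET hive.execution.engine=tez;     -- only if Tez is actually installed; otherwise keep the default engine

  -- placeholder schema; use the real columns of your external JSON table
  CREATE TABLE events_orc (
    user_id    STRING,
    event_time STRING,
    payload    STRING
  )
  STORED AS ORC;

  -- one-off load from the external LZO/JSON table into ORC
  INSERT OVERWRITE TABLE events_orc
  SELECT user_id, event_time, payload
  FROM json_events;                  -- placeholder name for your existing external table

After that, analytics run against events_orc instead of the external table; the raw LZO data can stay where it is, so the redundancy is limited to the columns you actually query.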
> On 03 Dec 2015, at 14:58, Harsha HN <99harsha.h....@gmail.com> wrote:
>
> Hi Franke,
>
> It's 100+ node cluster. Roughly 2TB memory and 1000+ vCores were available
> when I ran my job. So infrastructure is not a problem here.
> Hive version is 0.13
>
> About ORC or PARQUET, requires us to load 5 years of LZO data in ORC or
> PARQUET format. Though it might be performance efficient, it increases data
> redundancy. But we will explore that option.
>
> Currently I want to understand when I am unable to scale up mappers.
>
> Thanks,
> Harsha
>
>> On Thu, Dec 3, 2015 at 7:02 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> How many nodes, cores and memory do you have?
>> What hive version?
>>
>> Do you have the opportunity to use tez as an execution engine?
>> Usually I use external tables only for reading them and inserting them into
>> a table in Orc or parquet format for doing analytics.
>> This is much more performant than json or any other text-based format.
>>
>>> On 03 Dec 2015, at 14:20, Harsha HN <99harsha.h....@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> We have LZO compressed JSON files in our HDFS locations. I am creating an
>>> "External" table on the data in HDFS for the purpose of analytics.
>>>
>>> There are 3 LZO compressed part files of size 229.16 MB, 705.79 MB, 157.61
>>> MB respectively along with their index files.
>>>
>>> When I run count(*) query on the table I observe only 10 mappers causing
>>> performance bottleneck.
>>>
>>> I even tried following, (going for 30MB split)
>>> 1) set mapreduce.input.fileinputformat.split.maxsize=31457280;
>>> 2) set dfs.blocksize=31457280;
>>> But still I am getting 10 mappers.
>>>
>>> Can you please guide me in fixing the same?
>>>
>>> Thanks,
>>> Sree Harsha
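PS on the mapper count itself: the following is only a sketch of what I would try, not something I have verified on 0.13. It assumes you are using the hadoop-lzo input formats (they must be on the Hive classpath); the table name and location are placeholders.

  -- keep one mapper per split instead of letting Hive combine splits
  SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
  -- target ~30 MB splits
  SET mapreduce.input.fileinputformat.split.maxsize=31457280;

  -- declare the LZO input format explicitly so the .index files can be used for splitting
  CREATE EXTERNAL TABLE json_events_lzo (json STRING)   -- placeholder table/column
  STORED AS
    INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION '/path/to/your/lzo/json';                    -- placeholder path

If the splits still do not get smaller, then the block-size change at the HDFS/Hive configuration level I mentioned above would be the next thing to look at.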