Hi Franke,

It's a 100+ node cluster. Roughly 2 TB of memory and 1000+ vCores were available when I ran my job, so infrastructure is not the problem here. The Hive version is 0.13.
About ORC or Parquet: that would require us to reload 5 years of LZO data in ORC or Parquet format. Though it might be more performant, it duplicates data we already store. We will explore that option, but for now I want to understand why I am unable to scale up the mappers (see the sketches after the quoted thread below).

Thanks,
Harsha

On Thu, Dec 3, 2015 at 7:02 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> How many nodes, cores and memory do you have?
> What Hive version?
>
> Do you have the opportunity to use Tez as an execution engine?
> Usually I use external tables only for reading them and inserting them
> into a table in ORC or Parquet format for doing analytics.
> This is much more performant than JSON or any other text-based format.
>
> On 03 Dec 2015, at 14:20, Harsha HN <99harsha.h....@gmail.com> wrote:
>
> Hi,
>
> We have LZO-compressed JSON files in our HDFS locations. I am creating an
> "external" table on the data in HDFS for the purpose of analytics.
>
> There are 3 LZO-compressed part files of size 229.16 MB, 705.79 MB and
> 157.61 MB respectively, along with their index files.
>
> When I run a count(*) query on the table, I observe only 10 mappers,
> causing a performance bottleneck.
>
> I even tried the following (going for a 30 MB split):
> 1) set mapreduce.input.fileinputformat.split.maxsize=31457280;
>
> 2) set dfs.blocksize=31457280;
>
> But still I am getting 10 mappers.
>
> Can you please guide me in fixing this?
>
> Thanks,
> Sree Harsha
>
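
For reference, a rough HiveQL sketch of the two directions discussed above. Table names (events_json_lzo, events_orc) are hypothetical, and the diagnosis behind the split settings (LZO splittability and split combining) is an assumption, not a confirmed cause.

Increasing the mapper count, assuming the external table is declared with the hadoop-lzo input format so the .index files are actually used for splitting:

-- The default CombineHiveInputFormat merges small splits per node/rack,
-- which can hold the mapper count down; the plain HiveInputFormat keeps
-- one mapper per underlying split.
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
-- Split-size caps only take effect if the files are splittable, i.e. the
-- table's INPUTFORMAT is com.hadoop.mapred.DeprecatedLzoTextInputFormat.
set mapreduce.input.fileinputformat.split.maxsize=31457280;
set mapreduce.input.fileinputformat.split.minsize=1;

Converting to ORC as Jörn suggests, as a one-off CTAS (partitioning and compression choices would still need tuning):

-- Materialize the external JSON/LZO table into ORC for analytics.
CREATE TABLE events_orc STORED AS ORC
AS SELECT * FROM events_json_lzo;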