You can pack several small files into one Hadoop archive (HAR). The other
alternative is to set the split size of the execution engine (Tez, MR, ...),
which you probably do not want to do on a global level. In general, one should
replace XML, JSON etc. with Avro where possible and then use ORC or Parquet for
the analytical tables.
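For example, the split size can be set per session rather than globally; the
values below are only illustrative and the table name is a placeholder (the HAR
itself would be created outside Hive with the "hadoop archive" command):

    -- upper and lower bound for the input split size, in bytes
    SET mapreduce.input.fileinputformat.split.maxsize=268435456;   -- 256 MB
    SET mapreduce.input.fileinputformat.split.minsize=134217728;   -- 128 MB
    -- with Tez as the execution engine, split grouping is controlled here
    SET tez.grouping.max-size=268435456;
    SET tez.grouping.min-size=134217728;
    SELECT count(*) FROM events_json_ext;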
Your Hive version is too old. You may also want to use another execution
engine. I think your problem might then be related to external tables, for which
the parameters you set probably do not apply. I once had the same problem, but I
needed to change the block size on the Hadoop level (hdfs-site.xml).
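That is the dfs.blocksize property in hdfs-site.xml; the 256 MB value below is
only an example, and it only affects files written after the change:

    <property>
      <name>dfs.blocksize</name>
      <value>268435456</value>
    </property>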
Hi Franke,
It's a 100+ node cluster. Roughly 2 TB of memory and 1000+ vCores were available
when I ran my job, so infrastructure is not the problem here.
Hive version is 0.13.
As for ORC or Parquet: that would require us to load 5 years of LZO data into ORC
or Parquet format. Though it might be performance-efficient,
How many nodes, cores and memory do you have?
What Hive version?
Do you have the opportunity to use Tez as an execution engine?
Usually I use external tables only for reading the data and inserting it into a
table in ORC or Parquet format for doing analytics.
This is much more performant than JSON.
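A rough sketch of that pattern (the table and column names are made up):

    -- managed table in ORC for the analytical queries
    CREATE TABLE events_orc (
      id      BIGINT,
      payload STRING
    )
    STORED AS ORC;

    -- read the external JSON table once and rewrite the data into ORC
    INSERT OVERWRITE TABLE events_orc
    SELECT id, payload
    FROM events_json_ext;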
Hi,
We have LZO-compressed JSON files in our HDFS locations. I am creating an
"External" table on the data in HDFS for the purpose of analytics.
There are 3 LZO-compressed part files of sizes 229.16 MB, 705.79 MB and 157.61
MB respectively, along with their index files.
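(For context, the table definition is roughly the sketch below; the SerDe and
input-format classes shown are the usual hadoop-lzo / OpenX JSON SerDe
combination, and the column and path names are placeholders rather than the
exact DDL.)

    CREATE EXTERNAL TABLE events_json_ext (
      id      BIGINT,
      payload STRING
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS
      INPUTFORMAT  'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION '/data/events/json_lzo';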
When I run a count(*) query on t