Hi all,

We are compacting Snappy-compressed sequence files in our input stream, on which we run Hive queries. We are generally working towards files of 50+ GB.
We see some unexpected behaviour: even smaller files, for example 5.2 GB, get only a single mapper in a Hive query using the default CombineHiveInputFormat. Switching to HiveInputFormat gives us 21 mappers for the same query, which is what we would expect with 256 MB block sizes.

What would be the drawbacks of switching from CombineHiveInputFormat to HiveInputFormat? I would imagine more splits, and therefore more mappers, when there are many smaller files on the same node. In our case that would not happen often, and we have enough resources to handle the extra mappers.

Is this thinking correct? Are there any other drawbacks?

Best regards,
Rob
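
P.S. For reference, here is the rough arithmetic behind the 21 mappers we expect. It is only a sketch: it assumes one split per 256 MB block and that "5.2 GB" means 5.2 * 1024^3 bytes. The helper name `expected_splits` is made up for illustration, and the real split count also depends on the input format and its split-size settings.

```python
def expected_splits(file_size_bytes: int, block_size_bytes: int) -> int:
    """Approximate number of block-sized splits needed to cover one file.

    Assumption: one split per block, with a final partial block counted
    as its own split. Real input formats may merge or resize splits.
    """
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding issues.
    return -(-file_size_bytes // block_size_bytes)


MB = 1024 ** 2
GB = 1024 ** 3

file_size = int(5.2 * GB)   # the 5.2 GB example file
block_size = 256 * MB       # our HDFS block size

print(expected_splits(file_size, block_size))
```

5.2 GB is about 20.8 blocks of 256 MB, which rounds up to the 21 mappers we see with HiveInputFormat.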