All,

I have a cluster running in AWS on 19 EC2 nodes with Hive 3.1.0 installed,
using Tez as the default execution engine. I have a table that points to
about 3.6 GB of partitioned data located on S3. When the data is in just two
files (one per partition), a simple COUNT(*) select takes 300+ seconds and uses
only 2 mappers. When the data is spread across 80 files in two partitions, the
runtime drops sharply to 30 seconds and the query uses 40-50 mappers. With the
data in HDFS, I see the faster runtime whether it is in two files or 80.

It seems like we are choosing the number of mappers here based on the number of 
files, but only if the files are located in S3. Can someone confirm this?

If this is the case, is there a JIRA tracking a fix, or documentation on why
it has to work this way?

If not, how can I make sure we use more mappers in cases like the one above?

Thanks!

David McGinnis
