All,

I have a cluster running in AWS on 19 EC2 nodes with Hive 3.1.0 installed, using Tez as the default execution engine. I have a table that points to about 3.6 GB of partitioned data located on S3. When the data is in just two files (one per partition), a simple SELECT COUNT(*) takes 300+ seconds and uses only 2 mappers. When the same data is spread across 80 files in the two partitions, the runtime drops to 30 seconds and the query uses 40-50 mappers. When the data is in HDFS, I see the faster runtime whether it is in two files or 80.
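As a rough illustration of the numbers above, here is a small sketch that compares two simplified models of split counting: one split per file, and files cut into fixed-size chunks. The function name and the 128 MB chunk size are assumptions made for illustration only, not values read from the cluster, and the model ignores Tez split grouping, which can merge small splits into fewer tasks.

```python
import math


def expected_splits(file_sizes_mb, split_size_mb=None):
    """Return a rough split count for the given file sizes (in MB).

    split_size_mb=None models "one split per file"; otherwise each file
    is cut into chunks of at most split_size_mb. This is a simplified
    model, not the actual Hive/Tez split computation.
    """
    if split_size_mb is None:
        return len(file_sizes_mb)
    return sum(max(1, math.ceil(size / split_size_mb)) for size in file_sizes_mb)


TOTAL_MB = 3.6 * 1024  # about 3.6 GB, as described above

two_files = [TOTAL_MB / 2] * 2       # one file per partition
eighty_files = [TOTAL_MB / 80] * 80  # same data spread over 80 files

ASSUMED_SPLIT_MB = 128  # hypothetical chunk size, for illustration only

print(expected_splits(two_files))                       # one split per file
print(expected_splits(two_files, ASSUMED_SPLIT_MB))     # files cut into chunks
print(expected_splits(eighty_files, ASSUMED_SPLIT_MB))  # small files, one chunk each
```

Under these assumptions the two-file layout yields 2 splits in the per-file model and 30 when cut into chunks, so the 2 mappers observed on S3 match the per-file model rather than the chunked one.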
It seems like the number of mappers is being chosen based on the number of files, but only when the files are located in S3. Can someone confirm this? If this is the case, is there a JIRA tracking a fix, or documentation on why it has to be this way? If not, how can I make sure we use more mappers in cases like the one above?

Thanks!
David McGinnis