How exactly is the number of mappers to be launched calculated? Are the file format and compression taken into account (256MB of compressed data expands to many more MB once the mapper decompresses it)?
I've created a couple of ORC files (no compression, 1 file = 1 table) with different stripe size settings: 256, 128, 64 and 16MB. Their sizes are, respectively: 327,814,200; 413,030,657; 413,030,290; 433,481,175.

When I run a query (SELECT * FROM … ORDER BY) over those tables, the number of map tasks launched is, respectively: 1, 2, 2, 2. I would expect the count to follow my chunk size (256MB), so always 2, since 256MB is a multiple of every stripe size I've chosen.

After I change the engine to TEZ it gets even more interesting. The number of mappers is then, respectively: 2, 2, 4, 13. Why is it different?

Also, when I examine the source table files using the orcdump utility, I can see that the number of stripes is not consistent with the declared stripe size. The stripe counts are, respectively: 8, 118, 118, 118. Is the number of mappers based on the declared stripe size (DDL = Hive metastore) rather than on the file itself?

~Maciek
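
P.S. For reference, here is the arithmetic behind my expectation of 2 mappers per table, as a minimal sketch. It assumes the file sizes above are in bytes, that 256MB means 256 * 1024 * 1024 bytes, and that one mapper is launched per chunk with the last partial chunk rounded up. That rule is only my naive assumption, not necessarily what Hive or Tez actually do, and `expected_mappers` is a name I made up for illustration.

```python
import math

# Assumption: the 256MB chunk size expressed in bytes.
CHUNK_SIZE = 256 * 1024 * 1024

# File sizes (assumed to be bytes), keyed by the stripe size in MB
# that was used when writing each table.
file_sizes = {
    256: 327_814_200,
    128: 413_030_657,
    64: 413_030_290,
    16: 433_481_175,
}


def expected_mappers(file_size, chunk_size=CHUNK_SIZE):
    """Naive estimate: one mapper per chunk, rounding the last chunk up."""
    return math.ceil(file_size / chunk_size)


# Expected mapper count per table under the naive rule above.
expected = {
    stripe_mb: expected_mappers(size) for stripe_mb, size in file_sizes.items()
}

for stripe_mb, count in expected.items():
    print(f"stripe {stripe_mb}MB: expected {count} mapper(s)")
```

Under this naive rule every table comes out at 2 mappers, which matches neither the 1, 2, 2, 2 I see on MapReduce nor the 2, 2, 4, 13 I see on TEZ.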