> While our StorageHandler does utilize a SERDE that correctly returns
> SerDeStats, it seems like the optimizer is ignoring these values.

AFAIK, the stats impl is assumed to be approximate and aggregate, and is never used for setting up execution.

> Would anyone know how to correctly set these values?

<https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/ColumnarSplitSizeEstimator.java#L48>

That's where the sizes are read out for distribution (i.e. grouping of splits, etc.). Implementing ColumnarSplit in your split object should do the trick, or wrapping the split in an impl that does.

This is a departure from MapReduce, which uses the file size instead. With columnar formats, the total amount of data read out of the file varies as the number of selected columns goes up or down, and so does how much would be shuffled out.

Cheers,
Gopal
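
A minimal sketch of the ColumnarSplit suggestion above, not taken from the Hive source. To keep it compilable without Hive on the classpath, it declares a local stand-in for the `ColumnarSplit` interface (in Hive it lives in `org.apache.hadoop.hive.ql.io`). `StorageHandlerSplit`, `SplitSizeSketch`, and `estimatedSize` are hypothetical names, and the fallback to the full length when the projection size is not positive is an assumption of this sketch. A real split would also need to be a Hadoop `InputSplit` and serializable.

```java
// Local stand-in for Hive's ColumnarSplit interface, declared here only so
// the sketch compiles on its own.
interface ColumnarSplit {
    long getColumnarProjectionSize();
}

// Hypothetical split for a custom StorageHandler. It reports both the full
// size of the underlying data and the size of the projected columns only.
class StorageHandlerSplit implements ColumnarSplit {
    private final long totalBytes;
    private final long projectedBytes;

    StorageHandlerSplit(long totalBytes, long projectedBytes) {
        this.totalBytes = totalBytes;
        this.projectedBytes = projectedBytes;
    }

    // Full size, analogous to what a file-size-based scheme would use.
    long getLength() {
        return totalBytes;
    }

    // Bytes expected to be read for the selected columns.
    @Override
    public long getColumnarProjectionSize() {
        return projectedBytes;
    }
}

class SplitSizeSketch {
    // Simplified imitation of a split size estimator: prefer the columnar
    // projection size when the split provides one, else use the full length.
    static long estimatedSize(Object split, long fullLength) {
        if (split instanceof ColumnarSplit) {
            long projected = ((ColumnarSplit) split).getColumnarProjectionSize();
            if (projected > 0) {
                return projected;
            }
        }
        return fullLength;
    }

    public static void main(String[] args) {
        StorageHandlerSplit split = new StorageHandlerSplit(1_000_000L, 150_000L);
        System.out.println(estimatedSize(split, split.getLength()));
    }
}
```

The point of the design is that split grouping sees the projected size rather than the full size, so a query selecting few columns gets grouped as the smaller amount of work it actually is.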