> While our StorageHandler does utilize a SERDE that correctly returns
> SerDeStats, it seems like the optimizer is ignoring these values.

AFAIK, the stats impl is assumed to be approximate & aggregate and is
never used for setting up execution.

> Would anyone know how to correctly set these values?

<https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/ColumnarSplitSizeEstimator.java#L48>


That's where the sizes are read out for distribution (i.e. for grouping
of splits etc.).

Implementing ColumnarSplit in your split object, or wrapping your split in
an impl of it, should do the trick.
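
A rough sketch of what that could look like, assuming the interface has a
single getColumnarProjectionSize() method (check this against your Hive
version). The ColumnarSplit declared here is only a local stand-in for
org.apache.hadoop.hive.ql.io.ColumnarSplit so the example compiles on its
own; ProjectedSplit, SplitSizeDemo and the even per-column size split are
hypothetical, and a real split would also extend your existing InputSplit.

```java
// Local stand-in for org.apache.hadoop.hive.ql.io.ColumnarSplit (assumed shape).
interface ColumnarSplit {
    long getColumnarProjectionSize();
}

// Hypothetical split for a custom StorageHandler. In real code this would
// also extend the InputSplit/FileSplit type your handler already returns.
final class ProjectedSplit implements ColumnarSplit {
    private final long totalBytes;      // raw size of the underlying data
    private final int totalColumns;     // columns stored in the split
    private final int selectedColumns;  // columns the query actually reads

    ProjectedSplit(long totalBytes, int totalColumns, int selectedColumns) {
        this.totalBytes = totalBytes;
        this.totalColumns = totalColumns;
        this.selectedColumns = selectedColumns;
    }

    long getLength() {
        return totalBytes;
    }

    // Assumption for illustration: every column takes an equal share of bytes.
    @Override
    public long getColumnarProjectionSize() {
        if (totalColumns <= 0) {
            return totalBytes;
        }
        return totalBytes * selectedColumns / totalColumns;
    }
}

final class SplitSizeDemo {
    // Prefer the projected size; fall back to the raw length when it is
    // not positive (a simplified version of what the estimator appears to do).
    static long estimate(ProjectedSplit split) {
        long projected = split.getColumnarProjectionSize();
        return projected > 0 ? projected : split.getLength();
    }

    public static void main(String[] args) {
        System.out.println(estimate(new ProjectedSplit(1000L, 10, 2)));
    }
}
```

If your format keeps real per-column sizes, summing those for the projected
columns would be a better estimate than the even split used above.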

This is a departure from MapReduce, which uses the file size instead. With
columnar formats, the amount of data actually read out of the file varies
as the number of selected columns goes up or down, and so does how much
gets shuffled out.

Cheers,
Gopal
 

 

