>
> This brings up another question/issue - there doesn't seem to be a way to
> partition cached tables the way you can partition, say, a Hive table.
> For example, we would like to partition the overall dataset (233M rows,
> 9.2 GB) by (product, coupon) so that when we run one of these queries,
> Spark won't have to scan all the data, just the partition named in the
> query, e.g., (FNM30, 3.0).
>

If you sort the data on the columns of interest before caching, we keep
min/max statistics for each cached batch, which lets us do similar data
skipping automatically.
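
For what it's worth, a minimal sketch of that pattern in Scala is below. The
source path, view name, and the SparkSession-based API (Spark 2.x+) are my
assumptions for illustration, not from this thread:

    import org.apache.spark.sql.SparkSession

    object CachedTableSkipping {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("cached-table-data-skipping")
          .getOrCreate()

        // Hypothetical source path; substitute your own dataset.
        val loans = spark.read.parquet("/data/loans")

        // Sort on the filter columns before caching so each in-memory
        // columnar batch covers a narrow (product, coupon) range; the cache
        // keeps min/max statistics per batch, and filters on those columns
        // can then skip batches whose ranges cannot match.
        val clustered = loans.orderBy("product", "coupon").cache()
        clustered.createOrReplaceTempView("loans_cached")

        // Only batches whose min/max ranges admit (FNM30, 3.0) are scanned.
        spark.sql(
          """SELECT count(*)
            |FROM loans_cached
            |WHERE product = 'FNM30' AND coupon = 3.0""".stripMargin
        ).show()

        spark.stop()
      }
    }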
