> This brings up another question/issue - there doesn't seem to be a way to
> partition cached tables in the same way you can partition, say, a Hive
> table. For example, we would like to partition the overall dataset (233m
> rows, 9.2 GB) by (product, coupon) so when we run one of these queries
> Spark won't have to scan all the data, just the partition from the query,
> e.g., (FNM30, 3.0).
If you order the data on the columns of interest before caching, we keep min/max statistics per cached batch that let us do similar data skipping automatically.
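As a rough sketch of what that looks like (the table name `loans` and the columns `product`/`coupon` are just assumptions taken from the example above, and this uses the DataFrame API rather than whatever the original code looked like):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("cached-data-skipping").getOrCreate()

// Assumed source table and column names.
val loans = spark.table("loans")

// Sort on the columns you filter by, then cache. The in-memory columnar
// cache records min/max statistics for each cached batch, so batches whose
// value ranges exclude the predicate can be skipped at query time.
val sorted = loans.sort("product", "coupon")
sorted.cache()
sorted.count()  // force materialization of the cache

// This query should now skip most cached batches instead of scanning all rows.
sorted.filter(col("product") === "FNM30" && col("coupon") === 3.0).count()
```

The sort is what makes the statistics useful: if rows for a given (product, coupon) are scattered across every batch, each batch's min/max range covers almost everything and nothing gets skipped.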