Hi,

I have a Spark job whose output DataFrame contains a column named Id, which
is a GUID string.
We will use Id to filter the data in another Spark application, so it should
be a partition key.

I found these two methods on the Internet:

1.
DataFrame.write.partitionBy("Id") would help, but the possible value space
for a GUID is too big; I would prefer a range partition that splits the data
evenly into 100 partitions.
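To make the "100 partitions evenly" idea concrete, here is a minimal sketch (plain Python, not a Spark API) of one possible approach: hash each GUID into one of 100 buckets and partition on that derived column instead of the raw Id. The `guid_bucket` helper is hypothetical, not something Spark provides.

```python
import uuid

# Hypothetical helper: map a GUID string to one of 100 buckets, so the job
# could write with partitionBy("bucket") instead of partitioning on raw Id.
def guid_bucket(guid_str: str, num_buckets: int = 100) -> int:
    # uuid.UUID(...).int gives a stable 128-bit integer for the GUID string
    return uuid.UUID(guid_str).int % num_buckets

# In the Spark job this could be applied as a UDF before writing, e.g.
#   df.withColumn("bucket", bucket_udf("Id")).write.partitionBy("bucket").parquet(path)
```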

2.
Another way is DataFrame.repartition("Id"), but the result seems to live
only in memory; once it is saved and then loaded from another Spark
application, do we need to repartition it again?

More generally, what is the relationship between Parquet partitions and
DataFrame.repartition?
E.g.
The Parquet data is stored physically under /year=X/month=Y. I load this
data into a DataFrame, then call DataFrame.repartition("Id") and run this
query:
df.filter("year=2016 and month=5 and Id='xxxxxxxx'")
Will Parquet folder pruning still work? Or, since the data has already been
repartitioned by Id, does it need to scan all year/month combinations?
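To make the "folder pruning" part of the question concrete, here is a toy sketch (plain Python, not Spark internals) of how Hive-style directory names let a year/month filter select folders without opening any Parquet file inside them; `prune_dirs` is a made-up helper for illustration only.

```python
# Toy model of partition pruning: Hive-style directory names encode the
# partition values, so a filter on year and month can select directories
# without reading the Parquet files they contain.
def prune_dirs(dirs, year, month):
    wanted = f"year={year}/month={month}"
    # endswith gives an exact match on the last two path components,
    # so month=1 does not accidentally match month=12
    return [d for d in dirs if d.rstrip("/").endswith(wanted)]

dirs = [
    "/data/year=2015/month=12",
    "/data/year=2016/month=4",
    "/data/year=2016/month=5",
]
print(prune_dirs(dirs, 2016, 5))  # only /data/year=2016/month=5 survives
```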

Thanks
