Hi, I have a Spark job that outputs a DataFrame containing a column named Id, which is a GUID string. We will use Id to filter the data in another Spark application, so it should be a partition key.
I found these two methods on the Internet:

1. DataFrame.write.partitionBy("Id") would help, but the value space of a GUID is far too large to partition on directly; I would rather range-partition the Ids into 100 evenly sized partitions.
2. The other option is DataFrame.repartition("Id"), but that partitioning seems to exist only in memory: once the data is saved and then loaded from another Spark application, do we need to repartition it again?

More generally, what is the relationship between Parquet partitions and DataFrame.repartition? For example, suppose the Parquet data is stored physically under /year=X/month=Y. I load this data into a DataFrame, call DataFrame.repartition("Id"), and then run this query:

df.filter("year=2016 and month=5 and Id='xxxxxxxx'")

Will Parquet folder pruning still work? Or, since the data has already been repartitioned by Id, does Spark have to scan all year/month combinations?

Thanks
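To make the "100 evenly sized partitions" idea concrete, here is a minimal sketch of the bucketing approach, using plain Python as a stand-in. In Spark you would add a derived bucket column (e.g. with something like pmod(hash(col("Id")), 100)) and partitionBy that column instead of the raw Id; the helper name below is hypothetical and Spark's hash function (Murmur3) would assign different bucket numbers, but the even-spread property is the same.

```python
import uuid

NUM_BUCKETS = 100  # target number of partitions, as in the question

def bucket_for(guid: str) -> int:
    # Hypothetical helper: a GUID parsed by uuid.UUID exposes its value as a
    # 128-bit integer; taking it modulo NUM_BUCKETS maps random GUIDs to
    # buckets roughly uniformly. This only illustrates the idea -- Spark's
    # own hash() would produce different (but similarly uniform) buckets.
    return uuid.UUID(guid).int % NUM_BUCKETS

# The all-zero GUID has integer value 0, so it lands in bucket 0.
print(bucket_for("00000000-0000-0000-0000-000000000000"))  # -> 0

# Random GUIDs spread across all 100 buckets fairly evenly:
buckets = [bucket_for(str(uuid.uuid4())) for _ in range(10_000)]
print(len(set(buckets)))
```

Writing the data partitioned by such a bucket column keeps the directory count at 100 regardless of how many distinct GUIDs exist, and the reading application can prune to a single bucket by computing the same function on the Id it is looking for.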