Hello, I am trying to use `data_frame.write.partitionBy("day").save("dataset.parquet")` to write a dataset partitioned by day.
I would like to run a Spark job to process, e.g., one month:

`dataset.parquet/day=2017-01-01/` ...

and then run another Spark job to add another month using the same folder structure, giving me:

`dataset.parquet/day=2017-01-01/` ... `dataset.parquet/day=2017-02-01/` ...

However:

- with save mode "overwrite", when I process the second month, all of `dataset.parquet/` gets removed and I lose whatever was already computed for the previous month;
- with save mode "append", I can't get idempotence: if I run the job for a given month twice, I get duplicate data in all the subfolders for that month.

Is there a way to "append" in terms of the subfolders created by `partitionBy`, but overwrite within each such partition?

Any help would be appreciated. Thanks!
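For reference, here is a minimal PySpark sketch of what I am doing (the data frames and values are made up for illustration; the write calls are the ones described above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical January data: one row per day.
january = spark.createDataFrame(
    [("2017-01-01", 1.0), ("2017-01-02", 2.0)],
    ["day", "value"],
)

# First job: write January, partitioned by day.
january.write.partitionBy("day").mode("overwrite").save("dataset.parquet")

# Second job: add February.
february = spark.createDataFrame(
    [("2017-02-01", 3.0)],
    ["day", "value"],
)

# With mode "overwrite", the whole dataset.parquet/ path is wiped,
# so the January subfolders written above are lost.
february.write.partitionBy("day").mode("overwrite").save("dataset.parquet")

# With mode "append", January survives, but re-running this job
# appends the same rows again into day=2017-02-01/.
february.write.partitionBy("day").mode("append").save("dataset.parquet")
```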