Correction - dataDF.write.partitionBy("year", "month", "date").mode(SaveMode.Append).text("s3://data/test2/events/")
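For reference, a minimal sketch of the delete-then-append sequence described below, assuming the AWS Java SDK v1 for the S3 calls (the thread only says "S3 api") and using placeholder partition values; the bucket/prefix split mirrors s3://data/test2/events/ but is otherwise hypothetical:

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import com.amazonaws.services.s3.model.ListObjectsV2Request
    import org.apache.spark.sql.SaveMode
    import scala.collection.JavaConverters._

    // placeholder partition values; adjust to the date the job is (re)running for
    val bucket = "data"
    val prefix = "test2/events/year=2016/month=07/date=26/"

    val s3 = AmazonS3ClientBuilder.defaultClient()

    // delete every object under the partition prefix (listing is paginated)
    var request = new ListObjectsV2Request().withBucketName(bucket).withPrefix(prefix)
    var listing = s3.listObjectsV2(request)
    var more    = true
    while (more) {
      listing.getObjectSummaries.asScala.foreach(obj => s3.deleteObject(bucket, obj.getKey))
      more = listing.isTruncated
      if (more) {
        request = request.withContinuationToken(listing.getNextContinuationToken)
        listing = s3.listObjectsV2(request)
      }
    }

    // with the partition cleared, Append no longer risks duplicated data
    dataDF.write
      .partitionBy("year", "month", "date")
      .mode(SaveMode.Append)
      .text("s3://data/test2/events/")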
On Tue, Jul 26, 2016 at 10:59 AM, Yash Sharma <yash...@gmail.com> wrote:

> Based on the behavior of Spark [1], Overwrite mode will delete all your
> data when you try to overwrite a particular partition.
>
> What I did:
> - Use the S3 API to delete all partitions
> - Use the Spark DataFrame writer in Append mode [2]
>
> 1. http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-deletes-all-existing-partitions-in-SaveMode-Overwrite-Expected-behavior-td18219.html
>
> 2. dataDF.write.partitionBy("year", "month", "date").mode(SaveMode.Overwrite).text("s3://data/test2/events/")
>
> On Tue, Jul 26, 2016 at 9:37 AM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>
>> I probably should have been more specific about the code we are using, which
>> is something like
>>
>> val df = ....
>> df.write.mode("append or overwrite here").partitionBy("date").saveAsTable("my_table")
>>
>> Unless there is something like what I described in the native API, I will
>> probably take the approach of making an S3 API call to wipe out that
>> partition before the job starts, but it would be nice to not have to
>> incorporate another step in the job.
>>
>> Pedro
>>
>> On Mon, Jul 25, 2016 at 5:23 PM, RK Aduri <rkad...@collectivei.com> wrote:
>>
>>> You can have a temporary file to capture the data that you would like to
>>> overwrite, and swap that with the existing partition whose data you want
>>> to wipe away. The swap can be done by a simple rename of the partition;
>>> then just repair the table to pick up the new partition.
>>>
>>> I am not sure if that addresses your scenario.
>>>
>>> On Jul 25, 2016, at 4:18 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>>>
>>> What would be the best way to accomplish the following behavior:
>>>
>>> 1. There is a table which is partitioned by date.
>>> 2. A Spark job runs on a particular date; we would like it to wipe out
>>> all data for that date. This makes the job idempotent and lets us rerun
>>> a job that failed without fear of duplicated data.
>>> 3. Preserve data for all other dates.
>>>
>>> I am guessing that overwrite would not work here, or if it does it is
>>> not guaranteed to stay that way, but I am not sure. If that's the case,
>>> is there a good/robust way to get this behavior?
>>>
>>> --
>>> Pedro Rodriguez
>>> PhD Student in Distributed Machine Learning | CU Boulder
>>> UC Berkeley AMPLab Alumni
>>>
>>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>>> Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience
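For completeness, a rough sketch of the partition-swap approach RK Aduri describes above (write to a staging directory, rename it over the live partition, then repair the table). The staging path, the date value, the Hive-backed table named my_table, and the use of a Spark 2.x SparkSession (use sqlContext.sql on 1.x) are assumptions; note also that a rename on S3 is a copy-plus-delete under the hood, so the swap is not atomic:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // hypothetical layout: one directory per date partition under the table root
    val tableRoot   = "s3://data/test2/events"
    val livePart    = new Path(s"$tableRoot/date=2016-07-26")
    val stagingPart = new Path(s"$tableRoot/.staging_date=2016-07-26")

    // 1. write the fresh data for this date into the staging directory
    df.write.mode("overwrite").text(stagingPart.toString)

    // 2. swap: remove the old partition directory and rename the staging one into place
    val fs = livePart.getFileSystem(spark.sparkContext.hadoopConfiguration)
    if (fs.exists(livePart)) fs.delete(livePart, true) // recursive delete
    fs.rename(stagingPart, livePart)

    // 3. let the metastore pick up the (re)created partition
    spark.sql("MSCK REPAIR TABLE my_table")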