You run compaction, i.e. save the modified/deleted records in a dedicated delta file. Every now and then you merge the original and delta file and create a new version. When querying before compaction, you need to check both the original and the delta file. I don't think ORC needs Tez for this, but Tez probably improves performance.
> On 17 Jan 2017, at 17:21, Michael Segel <msegel_had...@hotmail.com> wrote:
>
> Hi,
> While the parquet file is immutable and the data sets are immutable, how does sparkSQL handle updates or deletes?
> I mean if I read in a file using SQL in to an RDD, mutate it, e.g. delete a row, and then persist it, I now have two files. If I reread the table back in … will I see duplicates or not?
>
> The larger issue is how to handle mutable data in a multi-user / multi-tenant situation and using Parquet as the storage.
>
> Would this be the right tool?
>
> W.R.T ORC files, mutation is handled by Tez.
>
> Thanks in Advance,
>
> -Mike

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org