That works for me.

On Wed, Jul 3, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
> How about 9AM PDT on Friday, 5 July then?
>
> On Wed, Jul 3, 2019 at 10:55 AM Owen O'Malley <owen.omal...@gmail.com> wrote:
>
>> I'd like to call in, but I'm out Thursday. Friday would work except
>> 11am to 1pm PDT.
>>
>> .. Owen
>>
>> On Wed, Jul 3, 2019 at 10:42 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>>> I'm available Thursday and Friday this week as well, but it's a
>>> holiday in the US, so some people may be out. If there are no
>>> objections from anyone who would like to attend, then I'm up for that.
>>>
>>> On Wed, Jul 3, 2019 at 10:40 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>>
>>>> I apologize for the delay on my side. I'll still have to go through
>>>> the last emails. I am available on Thursday/Friday this week, and it
>>>> would be great to sync.
>>>>
>>>> Thanks,
>>>> Anton
>>>>
>>>> On 3 Jul 2019, at 01:29, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>>>
>>>> Sorry I didn't get back to this thread last week. Let's try to have
>>>> a video call to sync up on this next week. What days would work for
>>>> everyone?
>>>>
>>>> rb
>>>>
>>>> On Fri, Jun 21, 2019 at 9:06 AM Erik Wright <erik.wri...@shopify.com> wrote:
>>>>
>>>>> With regard to operation values, currently they are:
>>>>>
>>>>> - append: data files were added and no files were removed.
>>>>> - replace: data files were rewritten with the same data; i.e.,
>>>>>   compaction, changing the data file format, or relocating data files.
>>>>> - overwrite: data files were deleted and added in a logical
>>>>>   overwrite operation.
>>>>> - delete: data files were removed and their contents logically
>>>>>   deleted.
>>>>>
>>>>> If deletion files (with or without data files) are appended to the
>>>>> dataset, will we consider that an `append` operation? If so, and
>>>>> deletion and/or data files are appended while whole files are also
>>>>> deleted, will we consider that an `overwrite`?
>>>>>
>>>>> Given that the only apparent purpose of the operation field is to
>>>>> optimize snapshot expiration, the above seems to meet its needs. An
>>>>> incremental reader can also skip `replace` snapshots, but no others.
>>>>> Once it decides to read a snapshot, I don't think there's any
>>>>> difference in how it processes the data for the
>>>>> append/overwrite/delete cases.
>>>>>
>>>>> On Thu, Jun 20, 2019 at 8:55 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>>> I don't see that we need [sequence numbers] for file/offset-deletes,
>>>>>> since they apply to a specific file. They're not harmful, but they
>>>>>> don't seem relevant.
>>>>>>
>>>>>> These delete files will probably contain a path and an offset and
>>>>>> could contain deletes for multiple files. In that case, the
>>>>>> sequence number can be used to eliminate delete files that don't
>>>>>> need to be applied to a particular data file, just like the column
>>>>>> equality deletes. Likewise, it can be used to drop the delete files
>>>>>> when there are no data files with an older sequence number.
>>>>>>
>>>>>> I don't understand the purpose of the min sequence number, nor what
>>>>>> the "min data seq" is.
>>>>>>
>>>>>> The min sequence number would be used for pruning delete files
>>>>>> without reading all the manifests to find out if there are old data
>>>>>> files. If no manifest with data for a partition contains a file
>>>>>> older than some sequence number N, then any delete file with a
>>>>>> sequence number < N can be removed.
>>>>>>
>>>>> OK, so the minimum sequence number is an attribute of manifest
>>>>> files. Sounds good. It can likely permit us to optimize compaction
>>>>> operations as well (i.e., you can easily limit the operation to a
>>>>> subset of manifest files as long as they are the oldest ones).
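>>>>>
>>>>> To make sure we mean the same rules, here is how I would sketch the
>>>>> two prune checks (rough Python, every name invented for
>>>>> illustration, not a real API):
>>>>>
>>>>>     from dataclasses import dataclass
>>>>>
>>>>>     @dataclass
>>>>>     class DataFile:
>>>>>         seq: int  # sequence number of the snapshot that added this file
>>>>>
>>>>>     @dataclass
>>>>>     class DeleteFile:
>>>>>         seq: int  # sequence number of the snapshot that added the deletes
>>>>>
>>>>>     def delete_applies(delete: DeleteFile, data: DataFile) -> bool:
>>>>>         # A delete can only affect rows that existed when it was
>>>>>         # written, so skip delete files older than the data file.
>>>>>         # (I'm assuming ties apply; whether seq == seq counts is a
>>>>>         # detail we'd need to pin down.)
>>>>>         return data.seq <= delete.seq
>>>>>
>>>>>     def delete_is_expired(delete: DeleteFile, min_data_seq: int) -> bool:
>>>>>         # min_data_seq: smallest data-file sequence number recorded
>>>>>         # across the manifests for this partition. If every live data
>>>>>         # file is newer, the delete file can never match and can be
>>>>>         # dropped.
>>>>>         return delete.seq < min_data_seq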
>>>>>
>>>>>> The "min data seq" is the minimum sequence number of a data file.
>>>>>> That seems like what we actually want for the pruning I described
>>>>>> above.
>>>>>>
>>>>> I would expect a data file (appended rows or deletions by column
>>>>> value) to have a single sequence number that applies to the whole
>>>>> file. Even a delete-by-file-and-offset file can do with only a
>>>>> single sequence number (which must be larger than the sequence
>>>>> numbers of all deleted files). Why do we need a "minimum" data
>>>>> sequence per file?
>>>>>
>>>>>> Off the top of my head, [supporting non-key deletes] requires
>>>>>> adding additional information to the manifest file, indicating the
>>>>>> columns that are used for the deletion. Only equality would be
>>>>>> supported; if multiple columns were used, they would be combined
>>>>>> with boolean-and. I don't see anything too tricky about it.
>>>>>>
>>>>>> Yes, exactly. I actually phrased it wrong initially. I think it
>>>>>> would be simple to extend the equality deletes to do this. We just
>>>>>> need a way to have global scope, not just partition scope.
>>>>>>
>>>>> I don't think anything special needs to be done with regard to the
>>>>> scoping/partitioning of delete files. When scanning one or more data
>>>>> files, one must also consider any and all deletion files that could
>>>>> apply to them. The only way to prune a deletion file from
>>>>> consideration is when:
>>>>>
>>>>> 1. All of your data files have at least one partition column in
>>>>>    common.
>>>>> 2. The deletion file is also partitioned on that column (at least).
>>>>> 3. The value sets of the data files do not overlap the value sets
>>>>>    of the deletion files in that column.
>>>>>
>>>>> So given a dataset of sessions that is partitioned by device form
>>>>> factor and date, for example, you could have a delete
>>>>> (user_id=9876) in a deletion file that is not partitioned, and it
>>>>> would be "in scope" for all of those data files.
>>>>>
>>>>> If you had the same dataset partitioned by hash(user_id), and your
>>>>> deletes were _also_ partitioned by hash(user_id), you would be able
>>>>> to prune those deletes while scanning the sessions.
>>>>>
>>>>>> If we add this on a per-deletion-file basis, it is not clear if
>>>>>> there is any relevance in preserving the concept of a unique row ID.
>>>>>>
>>>>>> Agreed. That's why I've been steering us away from the debate about
>>>>>> whether keys are unique or not. Either way, a natural-key delete
>>>>>> must delete all of the records it matches.
>>>>>>
>>>>>> I would assume that the maximum sequence number should appear in
>>>>>> the table metadata.
>>>>>>
>>>>>> Agreed.
>>>>>>
>>>>>> [W]ould you make it optional to assign a sequence number to a
>>>>>> snapshot? "Replace" snapshots would not need one.
>>>>>>
>>>>>> The only requirement is that it is monotonically increasing. If one
>>>>>> isn't used, we don't have to increment. I'd say it is up to the
>>>>>> implementation to decide. I would probably increment it every time
>>>>>> to avoid errors.
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix
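
Erik's scoping rules further up the thread make sense to me. For anyone
skimming: his three conditions (a common partition column, a deletion
file partitioned on it, and non-overlapping values) come down to roughly
the check below. This is a loose Python sketch with invented names, not
anything from the spec; with one partition value per file, the "value
sets do not overlap" test reduces to simple inequality.

    def can_prune(data_partition: dict, delete_partition: dict) -> bool:
        # Each argument maps partition column -> value for one file. An
        # unpartitioned deletion file has an empty dict, shares no
        # columns, and is therefore always in scope.
        shared = set(data_partition) & set(delete_partition)
        # Safe to skip the deletion file only if some shared partition
        # column has provably non-overlapping values.
        return any(data_partition[c] != delete_partition[c] for c in shared)

    # sessions partitioned by form factor and date, unpartitioned delete:
    can_prune({"form_factor": "phone", "date": "2019-06-20"}, {})  # False
    # both sides partitioned by hash(user_id):
    can_prune({"user_id_hash": 17}, {"user_id_hash": 4})           # True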