Hello everyone,

I am finding it slightly hard to find slots where most of the folks across time zones are available and comfortable to join, so I will hold off for now and keep this thread updated. Apologies for any inconvenience.
Regards,
Prashant Singh

On Thu, Oct 13, 2022 at 5:57 PM Prashant Singh <prashant010...@gmail.com> wrote:

> Thanks for the feedback, Péter.
>
> > Do I understand correctly that the main issue is that the concurrent compactions and writes (with deletes/updates) cause conflicts?
>
> Yes, the main issue we are trying to solve is the conflicts happening between maintenance processes and other writes.
>
> Regarding the Hive approach you suggested: as you already pointed out, it has drawbacks, the major ones being a spec change and a performance penalty, so we wanted to avoid that.
>
> In the proposed approach, we wanted to utilize existing functionality to achieve this by storing this extra info: if a writer has this info (the process running against the partition), it can benefit from it; otherwise it continues working as it did without this change.
>
> Regards,
> Prashant Singh
>
> On Fri, Oct 7, 2022 at 10:39 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
>> One more thing:
>> - In Hive, we order the data files based on the original fileName and rowNum. This helps when reading the delete files, as we do not need to keep delete file data in memory. Iceberg tables can be sorted, so we either have to keep the already-read delete data in memory or reread the delete files when the order of the rows changes.
>>
>> I do not think we would like to sacrifice query performance for table maintenance.
>>
>> On Thu, Oct 6, 2022, 22:26 Prashant Singh <prashant010...@gmail.com> wrote:
>>
>>> Hello all,
>>>
>>> I was OOO and just saw the mail.
>>>
>>> Thanks, Ryan and Peter, for the feedback; I will address it and update the doc accordingly.
>>>
>>> As some of us are not available in the proposed slot, and we also need some time to address the feedback, I will move this meeting to next week and propose some slots accordingly (I will reach out to interested folks via Slack as well to find the slots). Apologies for any inconvenience.
>>>
>>> Regards,
>>> Prashant Singh
>>>
>>> On Wed, Oct 5, 2022 at 11:17 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> Do I understand correctly that the main issue is that the concurrent compactions and writes (with deletes/updates) cause conflicts?
>>>>
>>>> In Hive we do compactions in a different way, by storing the original fileName/rowId (translated to Iceberg concepts) in the compacted files. This way, when a concurrent delete comes in later, any follow-up query can still find the corresponding delete and omit the row from the result.
>>>>
>>>> The big advantage of this approach is that compactions can happen in the background without any interference with concurrent queries.
>>>>
>>>> There are several drawbacks:
>>>> - This would be a table format change!
>>>> - Re-adding a file with the same fileName becomes even more problematic
>>>> - The size of the files will grow
>>>> - Queries become somewhat more complex, as we need to implement a different delete file lookup for compacted files
>>>>
>>>> Do we see this issue as important enough to merit the above changes/added complexity?
>>>>
>>>> Thanks,
>>>> Peter
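To make the Hive-style idea above concrete, here is a minimal, hypothetical sketch (the RowLineage and CompactedFileReader names are illustrative only, not Hive or Iceberg APIs) of how a reader could filter rows in a compacted file when every row carries its original fileName/rowNum, so that position deletes written against the original files still apply:

    // Illustrative only: hypothetical types, not part of Hive or Iceberg.
    import java.util.Map;
    import java.util.Set;

    // Each row in a compacted file remembers where it originally lived.
    record RowLineage(String originalFile, long originalPos) {}

    class CompactedFileReader {
      // Delete index: original data file path -> positions deleted in that file.
      private final Map<String, Set<Long>> positionDeletes;

      CompactedFileReader(Map<String, Set<Long>> positionDeletes) {
        this.positionDeletes = positionDeletes;
      }

      // A row survives unless a delete targets its original (file, position).
      boolean isLive(RowLineage lineage) {
        Set<Long> deleted = positionDeletes.get(lineage.originalFile());
        return deleted == null || !deleted.contains(lineage.originalPos());
      }
    }

If the compacted file is also ordered by (originalFile, originalPos), the delete index can be streamed rather than held fully in memory, which is the ordering concern raised in the "One more thing" note above.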
>>>> On Wed, Oct 5, 2022, 01:21 Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> I won't be able to make it to the discussion, so I wanted to share a few thoughts here ahead of time.
>>>>>
>>>>> I'm fairly skeptical that this is the right approach. A locking scheme that requires participation is going to require a significant change to the way we think about concurrency. And a locking scheme at the partition granularity is going to be difficult to set up.
>>>>>
>>>>> Also, I don't think that the design doc covers these issues in enough detail. I think there are some gaps with significant questions, like how to proceed when a lock check has been done but another process with higher priority comes in. It seems like, even ignoring the partition granularity problem and assuming that we have writers that all participate, combining priority with locking creates a situation where a process can think it holds the lock but does not, because another process preempted it.
>>>>>
>>>>> I think some of these could be resolved by making this locking scheme informational but still using the existing method to handle concurrency. But does that actually fix the problem?
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Mon, Oct 3, 2022 at 12:56 PM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Wing,
>>>>>>
>>>>>> Great to have you on board; I really appreciate your feedback so far on the proposal. Looking forward to more in the discussion.
>>>>>>
>>>>>> Regards,
>>>>>> Prashant
>>>>>>
>>>>>> On Tue, Oct 4, 2022 at 12:45 AM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
>>>>>>
>>>>>>> Prashant, I just saw Jack's post mentioning that you're in India Time. Obviously daytime Pacific is not convenient for you. I'm fine with 9 pm Pacific.
>>>>>>>
>>>>>>> On Mon, Oct 3, 2022 at 12:09 PM Wing Yew Poon <wyp...@cloudera.com> wrote:
>>>>>>>
>>>>>>>> Hi Prashant,
>>>>>>>> I am very interested in this proposal and would like to attend this meeting. Friday, October 7 is fine with me; I can do 9 pm Pacific Time if that is what works for you (I don't know what time zone you're in), although any time between 2 and 6 pm would be more convenient.
>>>>>>>> Thanks,
>>>>>>>> Wing Yew
>>>>>>>>
>>>>>>>> On Mon, Oct 3, 2022 at 11:58 AM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Ryan,
>>>>>>>>>
>>>>>>>>> Should I go ahead and schedule this for around 10/7 at 9:00 PM PST? Will that work?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Prashant Singh
>>>>>>>>>
>>>>>>>>> On Fri, Sep 30, 2022 at 9:21 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> Prashant, great to see the PR for rollback on conflict! I'll take a look at that one. Friday 10/7 after 1:30 PM works for me. Looking forward to the discussion!
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 30, 2022 at 6:38 AM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello folks,
>>>>>>>>>>>
>>>>>>>>>>> I was planning to host a discussion on this proposal <https://docs.google.com/document/d/1pSqxf5A59J062j9VFF5rcCpbW9vdTbBKTmjps80D-B0/edit> somewhere around late next week.
>>>>>>>>>>>
>>>>>>>>>>> Please let me know your availability if you are interested in attending, and I will schedule the (online) meeting accordingly.
>>>>>>>>>>>
>>>>>>>>>>> Meanwhile, I have a PR <https://github.com/apache/iceberg/pull/5888> out as well, to roll back compaction on conflict detection (an approach that came up as an alternative to the proposal in a sync). Appreciate your feedback here as well.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Prashant Singh
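As a rough sketch of the "roll back compaction on conflict" idea, here is one way a maintenance job could yield to conflicting writers using public Iceberg Spark action APIs. This is an assumption about the general pattern, not necessarily how the linked PR implements it; it also assumes the conflict surfaces as a ValidationException or CommitFailedException from the rewrite, and the "day" partition column and CompactionRunner name are placeholders:

    // Sketch only: abandon the compaction on conflict instead of failing the other writer.
    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.RewriteDataFiles;
    import org.apache.iceberg.exceptions.CommitFailedException;
    import org.apache.iceberg.exceptions.ValidationException;
    import org.apache.iceberg.expressions.Expressions;
    import org.apache.iceberg.spark.actions.SparkActions;
    import org.apache.spark.sql.SparkSession;

    class CompactionRunner {
      // Compacts a single partition and treats a conflict as "the other writer wins".
      static void compactDay(SparkSession spark, Table table, String day) {
        try {
          RewriteDataFiles.Result result = SparkActions.get(spark)
              .rewriteDataFiles(table)
              .filter(Expressions.equal("day", day)) // limit the scope to one partition
              .execute();
          System.out.println("Rewrote " + result.rewrittenDataFilesCount() + " data files");
        } catch (ValidationException | CommitFailedException e) {
          // A concurrent write (e.g. a delete touching the rewritten files) conflicted;
          // discard the compaction work and let the other job's commit stand.
          System.out.println("Compaction abandoned due to conflict: " + e.getMessage());
        }
      }
    }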
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 17, 2022 at 6:25 PM Prashant Singh <prashant010...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>
>>>>>>>>>>>> We have been working on a proposal [link <https://docs.google.com/document/d/1pSqxf5A59J062j9VFF5rcCpbW9vdTbBKTmjps80D-B0/edit#>] to determine the precedence between two or more concurrently running jobs in case of conflicts.
>>>>>>>>>>>>
>>>>>>>>>>>> Please take some time to review the proposal.
>>>>>>>>>>>>
>>>>>>>>>>>> We would appreciate any feedback on this from the community!
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Prashant Singh
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular