Alexey,

For me, it does not matter whether it is an IEP, an umbrella ticket or a single issue. The most important thing is the assignee :)
On Thu, Oct 3, 2019 at 11:59 AM Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:

> Anton, do you think we should file a single ticket for this or should we go
> with an IEP? As of now, the change does not look big enough for an IEP to me.
>
> Thu, Oct 3, 2019 at 11:18, Anton Vinogradov <a...@apache.org>:
>
> > Alexey,
> >
> > Sounds good to me.
> >
> > On Thu, Oct 3, 2019 at 10:51 AM Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:
> >
> > > Anton,
> > >
> > > Switching a partition to and from the SHRINKING state will require
> > > intricate synchronization in order to properly determine the start
> > > position for historical rebalance without PME.
> > >
> > > I would still go with an offline-node approach, but instead of cleaning
> > > the persistence, we can do effective defragmentation when the node is
> > > offline because we are sure that there is no concurrent load. After the
> > > defragmentation completes, we bring the node back to the cluster and
> > > historical rebalance will kick in automatically. It will still require
> > > manual node restarts, but since the data is not removed, there are no
> > > additional risks. Also, this will be an excellent solution for those who
> > > can afford downtime and execute the defragment command on all nodes in
> > > the cluster simultaneously - this will be the fastest way possible.
> > >
> > > --AG
> > >
> > > Mon, Sep 30, 2019 at 09:29, Anton Vinogradov <a...@apache.org>:
> > >
> > > > Alexei,
> > > > >> stopping fragmented node and removing partition data, then
> > > > >> starting it again
> > > >
> > > > That's exactly what we're doing to solve the fragmentation issue.
> > > > The problem here is that we have to perform N/B restart-rebalance
> > > > operations (N - cluster size, B - backups count) and it takes a lot of
> > > > time, with a risk of losing the data.
> > > >
> > > > On Fri, Sep 27, 2019 at 5:49 PM Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
> > > >
> > > > > Probably this should be allowed via the public API; actually, this
> > > > > is the same as manual rebalancing.
> > > > >
> > > > > Fri, Sep 27, 2019 at 17:40, Alexei Scherbakov <alexey.scherbak...@gmail.com>:
> > > > >
> > > > > > The poor man's solution for the problem would be stopping the
> > > > > > fragmented node and removing partition data, then starting it
> > > > > > again, allowing full state transfer already without deletes.
> > > > > > Rinse and repeat for all owners.
> > > > > >
> > > > > > Anton Vinogradov, would this work for you as a workaround?
> > > > > >
> > > > > > Thu, Sep 19, 2019 at 13:03, Anton Vinogradov <a...@apache.org>:
> > > > > >
> > > > > >> Alexey,
> > > > > >>
> > > > > >> Let's combine your and Ivan's proposals.
> > > > > >>
> > > > > >> >> vacuum command, which acquires exclusive table lock, so no
> > > > > >> >> concurrent activities on the table are possible.
> > > > > >> and
> > > > > >> >> Could the problem be solved by stopping a node which needs to be
> > > > > >> >> defragmented, clearing persistence files and restarting the node?
> > > > > >> >> After rebalancing the node will receive all data back without
> > > > > >> >> fragmentation.
> > > > > >>
> > > > > >> How about having a special partition state, SHRINKING?
> > > > > >> This state should mean that the partition is unavailable for reads
> > > > > >> and updates but should keep its update counters and should not be
> > > > > >> marked as lost, renting or evicted.
> > > > > >> In this state we are able to iterate over the partition and apply
> > > > > >> its entries to another file in a compact way.
> > > > > >> Indices should be updated during the copy-on-shrink procedure or at
> > > > > >> the shrink completion.
> > > > > >> Once the shrunk file is ready we should replace the original
> > > > > >> partition file with it and mark it as MOVING, which will start the
> > > > > >> historical rebalance.
> > > > > >> Shrinking should be performed during low-activity periods, but even
> > > > > >> if we find that activity was high and historical rebalance is not
> > > > > >> suitable, we may just remove the file and use regular rebalance to
> > > > > >> restore the partition (this will also lead to a shrink).
> > > > > >>
> > > > > >> BTW, it seems we are able to implement partition shrink in a cheap
> > > > > >> way.
> > > > > >> We may just use the rebalancing code to apply the fat partition's
> > > > > >> entries to the new file.
> > > > > >> So, 3 stages here: local rebalance, indices update and global
> > > > > >> historical rebalance.
> > > > > >>
> > > > > >> On Thu, Sep 19, 2019 at 11:43 AM Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:
> > > > > >>
> > > > > >> > Anton,
> > > > > >> >
> > > > > >> > > >> The solution which Anton suggested does not look easy
> > > > > >> > > >> because it will most likely significantly hurt performance
> > > > > >> > > Mostly agree here, but what drop do we expect? What price are
> > > > > >> > > we ready to pay?
> > > > > >> > > Not sure, but it seems some vendors are ready to pay, for
> > > > > >> > > example, a 5% drop for this.
> > > > > >> >
> > > > > >> > 5% may be a big drop for some use-cases, so I think we should look
> > > > > >> > at how to improve performance, not how to make it worse.
> > > > > >> >
> > > > > >> > > >> it is hard to maintain a data structure to choose "page from
> > > > > >> > > >> free-list with enough space closest to the beginning of the
> > > > > >> > > >> file".
> > > > > >> > > We can just split each free-list bucket in two and use the
> > > > > >> > > first for pages in the first half of the file and the second
> > > > > >> > > for the last.
> > > > > >> > > Only two buckets are required here since, during the file
> > > > > >> > > shrink, the first bucket's window will be shrunk too.
> > > > > >> > > It seems this gives us the same price on put: just use the
> > > > > >> > > first bucket in case it's not empty.
> > > > > >> > > The remove price (with merge) will be increased, of course.
> > > > > >> > >
> > > > > >> > > The compromise solution is to have priority put (to the first
> > > > > >> > > part of the file), keeping removal as is, and schedulable
> > > > > >> > > per-page migration for the rest of the data during the low
> > > > > >> > > activity period.
> > > > > >> >
> > > > > >> > Free lists are large and slow by themselves, it is expensive to
> > > > > >> > checkpoint and read them on start, so as a long-term solution I
> > > > > >> > would look into removing them. Moreover, not sure if adding yet
> > > > > >> > another background process will improve the codebase reliability
> > > > > >> > and simplicity.
> > > > > >> >
> > > > > >> > If we want to go the hard path, I would look at free page tracking
> > > > > >> > bitmap - a special bitmask page, where each page in an adjacent
> > > > > >> > block is marked as 0 if it has free space more than a certain
> > > > > >> > configurable threshold (say, 80%) - free, and 1 if less (full).
> > > > > >> > Some vendors have successfully implemented this approach, which
> > > > > >> > looks much more promising, but harder to implement.
> > > > > >> >
> > > > > >> > --AG
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Alexei Scherbakov
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Alexei Scherbakov
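
For reference, the free page tracking bitmap described in the quoted Sep 19 message could be sketched roughly as below. This is only an illustration of the idea, not Ignite code: the class name FreePageBitmap, its methods and the use of java.util.BitSet are assumptions made for the example, as is the choice to treat fresh pages as free. The 0 = free / 1 = full convention and the 80% threshold come from the message itself.

```java
import java.util.BitSet;

/**
 * Hypothetical sketch of a free page tracking bitmap (not Ignite code).
 * Bit i is 0 when page i has more free space than the threshold ("free")
 * and 1 otherwise ("full").
 */
class FreePageBitmap {
    private final BitSet full;          // a set bit means the page is full
    private final int pageCount;
    private final double freeThreshold; // e.g. 0.8 means "more than 80% free"

    FreePageBitmap(int pageCount, double freeThreshold) {
        this.pageCount = pageCount;
        this.freeThreshold = freeThreshold;
        // Assumption: all bits start clear, i.e. fresh pages count as free.
        this.full = new BitSet(pageCount);
    }

    /** Records the current free-space fraction (0.0 to 1.0) of a page. */
    void update(int pageIdx, double freeFraction) {
        if (pageIdx < 0 || pageIdx >= pageCount)
            throw new IndexOutOfBoundsException("pageIdx=" + pageIdx);
        full.set(pageIdx, freeFraction <= freeThreshold);
    }

    /** Returns the free page closest to the start of the file, or -1 if none. */
    int firstFreePage() {
        int idx = full.nextClearBit(0);
        return idx < pageCount ? idx : -1;
    }

    public static void main(String[] args) {
        FreePageBitmap bitmap = new FreePageBitmap(4, 0.8);
        bitmap.update(0, 0.10); // only 10% free, so marked full
        bitmap.update(1, 0.50); // 50% free is below the threshold, so full
        bitmap.update(2, 0.95); // 95% free, so stays free
        System.out.println(bitmap.firstFreePage()); // prints 2
    }
}
```

In this sketch, finding the page "closest to the beginning of the file" is a single scan for the first clear bit, so puts naturally gravitate towards the start of the file without a separate ordered structure. A real implementation would also have to persist the bitmask pages and keep them consistent with checkpoints, which the sketch ignores.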