Anton,

Switching a partition to and from the SHRINKING state will require intricate synchronization in order to properly determine the start position for historical rebalance without PME.
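For illustration only, here is a minimal sketch of the bookkeeping such a state switch implies: the update counter has to be captured together with the transition so that the position to start historical rebalance from is known. All names below (PartState, ShrinkingPartitionSketch, tryStartShrink, etc.) are hypothetical and are not Ignite code:

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.atomic.AtomicReference;

    /** Hypothetical partition states; only the ones relevant to the sketch. */
    enum PartState { OWNING, SHRINKING, MOVING }

    /** Toy model of a partition that remembers the counter to resume historical rebalance from. */
    class ShrinkingPartitionSketch {
        private final AtomicReference<PartState> state = new AtomicReference<>(PartState.OWNING);
        private final AtomicLong updateCounter = new AtomicLong();

        /** Counter value at the moment the partition stopped accepting updates. */
        private volatile long shrinkStartCounter = -1;

        /** An update is applied only while the partition is OWNING. */
        boolean tryUpdate() {
            if (state.get() != PartState.OWNING)
                return false;

            updateCounter.incrementAndGet();
            return true;
        }

        /** Freezes the partition for shrinking and captures the counter. */
        boolean tryStartShrink() {
            if (!state.compareAndSet(PartState.OWNING, PartState.SHRINKING))
                return false;

            // Simplification: a real implementation must wait for in-flight
            // updates to drain before reading the counter.
            shrinkStartCounter = updateCounter.get();
            return true;
        }

        /** After the compacted file replaces the original, the partition becomes
         *  MOVING and historical rebalance starts from the captured counter. */
        long switchToMoving() {
            state.set(PartState.MOVING);
            return shrinkStartCounter;
        }
    }

In a real system this would have to be coordinated with checkpointing and concurrent updaters, which is exactly where the intricate synchronization comes from.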
I would still go with an offline-node approach, but instead of cleaning the persistence we can do effective defragmentation while the node is offline, because we are sure that there is no concurrent load. After the defragmentation completes, we bring the node back to the cluster and historical rebalance will kick in automatically. It will still require manual node restarts, but since the data is not removed, there are no additional risks. Also, this will be an excellent solution for those who can afford downtime and execute the defragment command on all nodes in the cluster simultaneously - this will be the fastest way possible.

--AG

On Mon, Sep 30, 2019 at 09:29, Anton Vinogradov <a...@apache.org> wrote:

> Alexei,
>
> >> stopping fragmented node and removing partition data, then starting it again
>
> That's exactly what we're doing to solve the fragmentation issue.
> The problem here is that we have to perform N/B restart-rebalance operations (N - cluster size, B - backups count), and it takes a lot of time with risks of losing the data.
>
> On Fri, Sep 27, 2019 at 5:49 PM Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
>
> > Probably this should be allowed to be done using the public API; actually, this is the same as manual rebalancing.
> >
> > On Fri, Sep 27, 2019 at 17:40, Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
> >
> > > The poor man's solution for the problem would be stopping the fragmented node and removing its partition data, then starting it again, allowing full state transfer already without deletes.
> > > Rinse and repeat for all owners.
> > >
> > > Anton Vinogradov, would this work for you as a workaround?
> > >
> > > On Thu, Sep 19, 2019 at 13:03, Anton Vinogradov <a...@apache.org> wrote:
> > >
> > >> Alexey,
> > >>
> > >> Let's combine your and Ivan's proposals.
> > >>
> > >> >> vacuum command, which acquires exclusive table lock, so no concurrent activities on the table are possible.
> > >> and
> > >> >> Could the problem be solved by stopping a node which needs to be defragmented, clearing persistence files and restarting the node?
> > >> >> After rebalancing the node will receive all data back without fragmentation.
> > >>
> > >> How about having a special partition state, SHRINKING?
> > >> This state should mean that the partition is unavailable for reads and updates but should keep its update counters and should not be marked as lost, renting or evicted.
> > >> In this state we are able to iterate over the partition and apply its entries to another file in a compact way.
> > >> Indices should be updated during the copy-on-shrink procedure or at shrink completion.
> > >> Once the shrunk file is ready, we should replace the original partition file with it and mark the partition as MOVING, which will start the historical rebalance.
> > >> Shrinking should be performed during low-activity periods, but even if we find that activity was high and historical rebalance is not suitable, we may just remove the file and use regular rebalance to restore the partition (this will also lead to a shrink).
> > >>
> > >> BTW, it seems we are able to implement partition shrink in a cheap way.
> > >> We may just use the rebalancing code to apply the fat partition's entries to the new file.
> > >> So, 3 stages here: local rebalance, indices update and global historical rebalance.
> > >>
> > >> On Thu, Sep 19, 2019 at 11:43 AM Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:
> > >>
> > >> > Anton,
> > >> >
> > >> > > >> The solution which Anton suggested does not look easy because it will most likely significantly hurt performance
> > >> > > Mostly agree here, but what drop do we expect? What price are we ready to pay?
> > >> > > Not sure, but it seems some vendors are ready to pay, for example, a 5% drop for this.
> > >> >
> > >> > 5% may be a big drop for some use-cases, so I think we should look at how to improve performance, not how to make it worse.
> > >> >
> > >> > > >> it is hard to maintain a data structure to choose "page from free-list with enough space closest to the beginning of the file".
> > >> > > We can just split each free-list bucket into two and use the first for pages in the first half of the file and the second for the last half.
> > >> > > Only two buckets are required here since, during the file shrink, the first bucket's window will shrink too.
> > >> > > It seems this gives us the same price on put: just use the first bucket in case it's not empty.
> > >> > > The remove price (with merge) will increase, of course.
> > >> > >
> > >> > > The compromise solution is to have priority put (to the first part of the file), while keeping removal as is, and schedulable per-page migration for the rest of the data during low-activity periods.
> > >> >
> > >> > Free lists are large and slow by themselves, and they are expensive to checkpoint and read on start, so as a long-term solution I would look into removing them. Moreover, I am not sure that adding yet another background process will improve the codebase's reliability and simplicity.
> > >> >
> > >> > If we want to go down the hard path, I would look at a free-page tracking bitmap - a special bitmask page where each page in an adjacent block is marked as 0 if it has more free space than a certain configurable threshold (say, 80%) - free, and as 1 if less (full). Some vendors have successfully implemented this approach, which looks much more promising, but it is harder to implement.
> > >> >
> > >> > --AG
> > >
> > > --
> > > Best regards,
> > > Alexei Scherbakov
> >
> > --
> > Best regards,
> > Alexei Scherbakov
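As a purely illustrative sketch of the free-page tracking bitmap idea quoted above (the class and method names are invented and not Ignite code), one bit could be kept per page in a block, 0 meaning the page still has at least the configured share of free space and 1 meaning it is considered full:

    import java.util.BitSet;

    /** Toy free-page bitmap for one block of pages: a set bit means the page is "full". */
    class FreePageBitmapSketch {
        /** Configurable threshold from the discussion above, e.g. 80% free. */
        static final double FREE_THRESHOLD = 0.8;

        private final BitSet full;
        private final int pagesInBlock;

        FreePageBitmapSketch(int pagesInBlock) {
            this.pagesInBlock = pagesInBlock;
            this.full = new BitSet(pagesInBlock);
        }

        /** Called whenever a page's fill factor changes. */
        void onPageFillChanged(int pageIdx, double freeFraction) {
            // 0 (clear) = enough free space, 1 (set) = full.
            full.set(pageIdx, freeFraction < FREE_THRESHOLD);
        }

        /** First page with enough free space, or -1 if the whole block is full.
         *  Preferring the lowest index loosely matches the earlier goal of
         *  choosing a page closest to the beginning of the file. */
        int firstFreePage() {
            int idx = full.nextClearBit(0);
            return idx < pagesInBlock ? idx : -1;
        }
    }

A single bitmask page covering an adjacent block seems much cheaper to checkpoint and read on start than a free list, which appears to be the main attraction of the approach.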