Denis Mekhanikov,

I think at least one node (the coordinator, for example) should still write metadata synchronously, to protect against the following scenario:

tx creating new metadata is committed <- all nodes in the grid fail (powered off) <- async write to disk completes,

where <- means "happens before".

All other nodes could write asynchronously, either by using a separate thread or by not doing fsync (same effect).
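Roughly what I mean, as a sketch (the class and all names below are invented for illustration, this is not actual Ignite code):

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustration only: the coordinator persists metadata synchronously, other nodes offload the write. */
public class MetadataWritePolicy {
    private final ExecutorService asyncWriter = Executors.newSingleThreadExecutor();
    private final Path dir;
    private final boolean coordinator;

    public MetadataWritePolicy(Path dir, boolean coordinator) {
        this.dir = dir;
        this.coordinator = coordinator;
    }

    /** Called when a new binary type is registered. */
    public void onMetadata(int typeId, byte[] marshalled) {
        if (coordinator)
            write(typeId, marshalled, true);  // synchronous write with fsync, but only on one node
        else
            asyncWriter.submit(() -> write(typeId, marshalled, false)); // others release the caller immediately
    }

    private void write(int typeId, byte[] data, boolean fsync) {
        try {
            Path file = dir.resolve(typeId + ".bin");
            Files.write(file, data, StandardOpenOption.CREATE, StandardOpenOption.WRITE);

            if (fsync) {
                try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
                    ch.force(true); // make sure the bytes reach the device before we acknowledge
                }
            }
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

Whether the asynchronous path skips fsync entirely or just moves the write off the caller's thread does not matter much here; the point is that only one node pays the synchronous cost.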
Wed, 21 Aug 2019 at 19:48, Denis Mekhanikov <dmekhani...@gmail.com>:

> Alexey,
>
> I'm not suggesting to duplicate anything.
> My point is that the proper fix will be implemented in a relatively distant future. Why not improve the existing mechanism now instead of waiting for the proper fix?
> If we don't agree on doing this fix in master, I can do it in a fork and use it in my setup. So please let me know if you see any other drawbacks in the proposed solution.
>
> Denis
>
> On 21 Aug 2019, at 15:53, Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
>
> > Denis Mekhanikov,
> >
> > If we are still talking about a "proper" solution, the metastore (I mean, of course, the distributed one) is the way to go.
> >
> > It has a contract to store cluster-wide metadata in the most efficient way and can have any optimization for concurrent writing inside.
> >
> > I'm against creating some duplicating mechanism as you suggested. We do not need another piece of copy/paste code.
> >
> > Another possibility is to carry metadata along with the appropriate request if it's not found locally, but this is a rather big modification.
> >
> > Tue, 20 Aug 2019 at 17:26, Denis Mekhanikov <dmekhani...@gmail.com>:
> >
> >> Eduard,
> >>
> >> Usages will wait for the metadata to be registered and written to disk. No races should occur with such a flow.
> >> Or do you have some specific case in mind?
> >>
> >> I agree that using a distributed metastorage would be nice here.
> >> But this way we will kind of move back to the previous scheme with a replicated system cache, where metadata was stored before.
> >> Will the scheme with the metastorage be different in any way? Won't we decide to move back to discovery messages again after a while?
> >>
> >> Denis
> >>
> >> On 20 Aug 2019, at 15:13, Eduard Shangareev <eduard.shangar...@gmail.com> wrote:
> >>
> >>> Denis,
> >>> How would we deal with races between registration and metadata usages with such a fast fix?
> >>>
> >>> I believe that we need to move it to the distributed metastorage and await registration completeness if we can't find it (wait for the work in progress).
> >>> Discovery shouldn't wait for anything here.
> >>>
> >>> On Tue, Aug 20, 2019 at 11:55 AM Denis Mekhanikov <dmekhani...@gmail.com> wrote:
> >>>
> >>>> Sergey,
> >>>>
> >>>> Currently metadata is written to disk sequentially on every node. Only one node at a time is able to write metadata to its storage.
> >>>> Slowness accumulates when you add more nodes. The delay required to write one piece of metadata may not be that big, but if you multiply it by, say, 200, it becomes noticeable.
> >>>> But if we move the writing out of the discovery threads, then nodes will be doing it in parallel.
> >>>>
> >>>> I think it's better to block some threads from the striped pool for a little while rather than blocking discovery for the same period multiplied by the number of nodes.
> >>>>
> >>>> What do you think?
> >>>>
> >>>> Denis
> >>>>
> >>>> On 15 Aug 2019, at 10:26, Sergey Chugunov <sergey.chugu...@gmail.com> wrote:
> >>>>
> >>>>> Denis,
> >>>>>
> >>>>> Thanks for bringing this issue up; writing binary metadata from the discovery thread was really a tough decision to make.
> >>>>> I don't think that moving metadata to the metastorage is a silver bullet here, as this approach also has its drawbacks and is not an easy change.
> >>>>>
> >>>>> In addition to the workarounds suggested by Alexei, we have two choices for offloading the write operation from the discovery thread:
> >>>>>
> >>>>> 1. Your scheme with a separate writer thread and futures completed when the write operation is finished.
> >>>>> 2. A PME-like protocol with obvious complications like failover and asynchronous waiting for replies over the communication layer.
> >>>>>
> >>>>> Your suggestion looks easier from the code complexity perspective, but in my view it increases the chances of getting into starvation. Now, if some node faces really long delays during a write op, it is going to be kicked out of the topology by the discovery protocol. In your case it is possible that more and more threads from other pools may get stuck waiting on the operation future, which is also not good.
> >>>>>
> >>>>> What do you think?
> >>>>>
> >>>>> I also think that if we want to approach this issue systematically, we need to do a deep analysis of the metastorage option as well and finally choose which road we want to go.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> On Thu, Aug 15, 2019 at 9:28 AM Zhenya Stanilovsky <arzamas...@mail.ru.invalid> wrote:
> >>>>>
> >>>>>>
> >>>>>>>> 1. Yes, only on OS failures. In such a case data will be received from alive nodes later.
> >>>>>>
> >>>>>> What would the behavior be in the case of one node? I suppose someone can obtain cache data without unmarshalling the schema; what would happen to grid operability in this case?
> >>>>>>
> >>>>>>>> 2. Yes, for walmode=FSYNC writes to the metastore will be slow. But such a mode should not be used if you have more than two nodes in the grid because it has a huge impact on performance.
> >>>>>>
> >>>>>> Does WAL mode affect the metadata store?
> >>>>>>
> >>>>>>>> Wed, 14 Aug 2019 at 14:29, Denis Mekhanikov <dmekhani...@gmail.com>:
> >>>>>>>>
> >>>>>>>>> Folks,
> >>>>>>>>>
> >>>>>>>>> Thanks for showing interest in this issue!
> >>>>>>>>>
> >>>>>>>>> Alexey,
> >>>>>>>>>
> >>>>>>>>>> I think removing fsync could help to mitigate performance issues with current implementation
> >>>>>>>>>
> >>>>>>>>> Is my understanding correct that if we remove fsync, then discovery won't be blocked, data will be flushed to disk in the background, and loss of information will be possible only on OS failure? It sounds like an acceptable workaround to me.
> >>>>>>>>>
> >>>>>>>>> Will moving metadata to the metastore actually resolve this issue? Please correct me if I'm wrong, but we will still need to write the information to the WAL before releasing the discovery thread. If the WAL mode is FSYNC, then the issue will still be there. Or is it planned to abandon the discovery-based protocol altogether?
> >>>>>>>>>
> >>>>>>>>> Evgeniy, Ivan,
> >>>>>>>>>
> >>>>>>>>> In my particular case the data wasn't too big. It was a slow virtualised disk with encryption that made operations slow. Given that there are 200 nodes in the cluster, every node writes slowly, and this process is sequential, one piece of metadata is registered extremely slowly.
> >>>>>>>>>
> >>>>>>>>> Ivan, answering your other questions:
> >>>>>>>>>
> >>>>>>>>>> 2. Do we need persistent metadata for in-memory caches? Or is it so by accident?
> >>>>>>>>>
> >>>>>>>>> It should be checked whether it's safe to stop writing marshaller mappings to disk without losing any guarantees.
> >>>>>>>>> But anyway, I would like to have a property that would control this. If metadata registration is slow, then the initial cluster warmup may take a while. So, if we preserve metadata on disk, we will need to warm it up only once, and further restarts won't be affected.
> >>>>>>>>>
> >>>>>>>>>> Do we really need a fast fix here?
> >>>>>>>>>
> >>>>>>>>> I would like a fix that could be implemented now, since the activity of moving metadata to the metastore doesn't sound like a quick one. Having a temporary solution would be nice.
> >>>>>>>>>
> >>>>>>>>> Denis
> >>>>>>>>>
> >>>>>>>>>> On 14 Aug 2019, at 11:53, Павлухин Иван <vololo...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Denis,
> >>>>>>>>>>
> >>>>>>>>>> Several clarifying questions:
> >>>>>>>>>> 1. Do you have an idea why metadata registration takes so long? Poor disks? Too much data to write? Contention with disk writes by other subsystems?
> >>>>>>>>>> 2. Do we need persistent metadata for in-memory caches? Or is it so by accident?
> >>>>>>>>>>
> >>>>>>>>>> Generally, I think that it is possible to move metadata saving operations out of the discovery thread without losing the required consistency/integrity.
> >>>>>>>>>>
> >>>>>>>>>> As Alex mentioned, using the metastore looks like a better solution. Do we really need a fast fix here? (Are we talking about a fast fix?)
> >>>>>>>>>>
> >>>>>>>>>> Wed, 14 Aug 2019 at 11:45, Zhenya Stanilovsky <arzamas...@mail.ru.invalid>:
> >>>>>>>>>>>
> >>>>>>>>>>> Alexey, but in this case the customer needs to be informed that a whole-cluster crash (power off), for example of a 1-node cluster, could lead to partial data unavailability.
> >>>>>>>>>>> And maybe to further index corruption.
> >>>>>>>>>>> 1. Why does your meta take up a substantial size? Maybe a context leak?
> >>>>>>>>>>> 2. Could meta be compressed?
> >>>>>>>>>>>
> >>>>>>>>>>> Wednesday, 14 August 2019, 11:22 +03:00 from Alexei Scherbakov <alexey.scherbak...@gmail.com>:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Denis Mekhanikov,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Currently metadata is fsync'ed on write. This might be the cause of slow-downs in the case of metadata burst writes.
> >>>>>>>>>>>> I think removing fsync could help to mitigate performance issues with the current implementation until the proper solution is implemented: moving metadata to the metastore.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Tue, 13 Aug 2019 at 17:09, Denis Mekhanikov <dmekhani...@gmail.com>:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I would also like to mention that marshaller mappings are written to disk even if persistence is disabled.
> >>>>>>>>>>>>> So, this issue affects purely in-memory clusters as well.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Denis
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 13 Aug 2019, at 17:06, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> When persistence is enabled, binary metadata is written to disk upon registration. Currently it happens in the discovery thread, which makes processing of related messages very slow.
> >>>>>>>>>>>>>> There are cases when a lot of nodes and slow disks can make the registration of every binary type take several minutes. Plus, it blocks processing of other messages.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I propose starting a separate thread that will be responsible for writing binary metadata to disk. So, binary type registration will be considered finished before information about it is written to disk on all nodes.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The main concern here is data consistency in cases when a node acknowledges type registration and then fails before writing the metadata to disk.
> >>>>>>>>>>>>>> I see two parts of this issue:
> >>>>>>>>>>>>>> 1. Nodes will have different metadata after restarting.
> >>>>>>>>>>>>>> 2. If we write some data into a persisted cache and shut down the nodes faster than a new binary type is written to disk, then after a restart we won't have the binary type to work with.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The first case is similar to a situation when one node fails and after that a new type is registered in the cluster. This issue is resolved by the discovery data exchange: all nodes receive information about all binary types in the initial discovery messages sent by other nodes. So, once you restart a node, it will receive the information that it failed to finish writing to disk from the other nodes.
> >>>>>>>>>>>>>> If all nodes shut down before finishing writing the metadata to disk, then after a restart the type will be considered unregistered, so another registration will be required.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The second case is a bit more complicated. But it can be resolved by making the discovery thread on every node create a future that will be completed when writing to disk is finished. So, every node will have such a future, reflecting the current state of persisting the metadata to disk.
> >>>>>>>>>>>>>> After that, if some operation needs this binary type, it will need to wait on that future until flushing to disk is finished.
> >>>>>>>>>>>>>> This way discovery threads won't be blocked, but other threads that actually need this type will be.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Please let me know what you think about that.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Denis
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Best regards,
> >>>>>>>>>>>> Alexei Scherbakov
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Zhenya Stanilovsky
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Best regards,
> >>>>>>>>>> Ivan Pavlukhin
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Best regards,
> >>>>>>>> Alexei Scherbakov
> >>>>>>
> >>>>>> --
> >>>>>> Zhenya Stanilovsky
> >
> > --
> > Best regards,
> > Alexei Scherbakov

--
Best regards,
Alexei Scherbakov
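P.S. Just to make sure I read the future-based flow from the quoted proposal correctly, here is a rough sketch of how I understand it (all names below are invented for illustration, this is not a patch against Ignite):

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Sketch only: the discovery thread merely schedules the disk write,
 * a dedicated writer thread completes the per-type future,
 * and any operation that actually needs the type blocks on that future instead of on discovery.
 */
public class MetadataWriteFutures {
    private final Map<Integer, CompletableFuture<Void>> written = new ConcurrentHashMap<>();
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    /** Called from the discovery thread: schedule the write and return immediately. */
    public void onTypeRegistered(int typeId, byte[] meta) {
        CompletableFuture<Void> fut = written.computeIfAbsent(typeId, id -> new CompletableFuture<>());

        writer.submit(() -> {
            try {
                writeToDisk(typeId, meta); // the only slow part, now off the discovery thread
                fut.complete(null);
            }
            catch (Throwable t) {
                fut.completeExceptionally(t);
            }
        });
    }

    /** Called from striped-pool or user threads that actually need the type. */
    public void awaitWritten(int typeId) throws ExecutionException, InterruptedException {
        written.computeIfAbsent(typeId, id -> new CompletableFuture<>()).get();
    }

    private void writeToDisk(int typeId, byte[] meta) {
        // persist the metadata to the node's binary_meta directory; omitted in this sketch
    }
}

In this reading, the starvation concern raised earlier in the thread is exactly the awaitWritten() call: it moves the blocking from the discovery thread to whichever threads need the type.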