Eduard,

Usages will wait for the metadata to be registered and written to disk, so no races should occur with such a flow. Or do you have some specific case in mind?
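To make the intended flow a bit more concrete, here is a rough sketch of what I mean by "usages wait for the write future". All class and method names below are made up for illustration, this is not the actual Ignite code:

    import java.util.concurrent.*;

    /**
     * Illustrative sketch only: a dedicated writer thread persists binary metadata,
     * while threads that actually need the type wait on a per-type future.
     */
    class BinaryMetadataWriter {
        private final ConcurrentMap<Integer, CompletableFuture<Void>> writeFuts = new ConcurrentHashMap<>();
        private final ExecutorService writer = Executors.newSingleThreadExecutor();

        /** Called from the discovery thread: schedules the disk write and returns immediately. */
        CompletableFuture<Void> onMetadataReceived(int typeId, byte[] marshalledMeta) {
            CompletableFuture<Void> fut = writeFuts.computeIfAbsent(typeId, id -> new CompletableFuture<>());

            writer.submit(() -> {
                try {
                    writeToDisk(typeId, marshalledMeta); // Write + fsync happens here, off the discovery thread.

                    fut.complete(null);
                }
                catch (Throwable e) {
                    fut.completeExceptionally(e);
                }
            });

            return fut;
        }

        /**
         * Called from threads that actually use the type (striped pool, etc.):
         * they block until the metadata is durable on disk.
         */
        void awaitWritten(int typeId) throws ExecutionException, InterruptedException {
            CompletableFuture<Void> fut = writeFuts.get(typeId);

            if (fut != null)
                fut.get();
        }

        private void writeToDisk(int typeId, byte[] meta) {
            // The actual file write and fsync would go here.
        }
    }

So the discovery thread only registers the future and schedules the write, and only the operations that really need the type pay the price of waiting for the disk.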
I agree that using a distributed metastorage would be nice here. But this way we will essentially move back to the previous scheme with a replicated system cache, where metadata used to be stored. Will the metastorage-based scheme be different in any way? Won’t we decide to move back to discovery messages again after a while?

Denis

> On 20 Aug 2019, at 15:13, Eduard Shangareev <eduard.shangar...@gmail.com>
> wrote:
>
> Denis,
> How would we deal with races between registration and metadata usages with
> such fast-fix?
>
> I believe, that we need to move it to distributed metastorage, and await
> registration completeness if we can't find it (wait for work in progress).
> Discovery shouldn't wait for anything here.
>
> On Tue, Aug 20, 2019 at 11:55 AM Denis Mekhanikov <dmekhani...@gmail.com>
> wrote:
>
>> Sergey,
>>
>> Currently metadata is written to disk sequentially on every node. Only one
>> node at a time is able to write metadata to its storage.
>> Slowness accumulates when you add more nodes. A delay required to write
>> one piece of metadata may be not that big, but if you multiply it by say
>> 200, then it becomes noticeable.
>> But If we move the writing out from discovery threads, then nodes will be
>> doing it in parallel.
>>
>> I think, it’s better to block some threads from a striped pool for a
>> little while rather than blocking discovery for the same period, but
>> multiplied by a number of nodes.
>>
>> What do you think?
>>
>> Denis
>>
>>> On 15 Aug 2019, at 10:26, Sergey Chugunov <sergey.chugu...@gmail.com>
>> wrote:
>>>
>>> Denis,
>>>
>>> Thanks for bringing this issue up, decision to write binary metadata from
>>> discovery thread was really a tough decision to make.
>>> I don't think that moving metadata to metastorage is a silver bullet here
>>> as this approach also has its drawbacks and is not an easy change.
>>>
>>> In addition to workarounds suggested by Alexei we have two choices to
>>> offload write operation from discovery thread:
>>>
>>> 1. Your scheme with a separate writer thread and futures completed when
>>> write operation is finished.
>>> 2. PME-like protocol with obvious complications like failover and
>>> asynchronous wait for replies over communication layer.
>>>
>>> Your suggestion looks easier from code complexity perspective but in my
>>> view it increases chances to get into starvation. Now if some node faces
>>> really long delays during write op it is gonna be kicked out of topology
>> by
>>> discovery protocol. In your case it is possible that more and more
>> threads
>>> from other pools may stuck waiting on the operation future, it is also
>> not
>>> good.
>>>
>>> What do you think?
>>>
>>> I also think that if we want to approach this issue systematically, we
>> need
>>> to do a deep analysis of metastorage option as well and to finally choose
>>> which road we wanna go.
>>>
>>> Thanks!
>>>
>>> On Thu, Aug 15, 2019 at 9:28 AM Zhenya Stanilovsky
>>> <arzamas...@mail.ru.invalid> wrote:
>>>
>>>>
>>>>>
>>>>>> 1. Yes, only on OS failures. In such case data will be received from
>>>> alive
>>>>>> nodes later.
>>>> What behavior would be in case of one node ? I suppose someone can
>> obtain
>>>> cache data without unmarshalling schema, what in this case would be with
>>>> grid operability?
>>>>
>>>>>
>>>>>> 2. Yes, for walmode=FSYNC writes to metastore will be slow. But such
>>>> mode
>>>>>> should not be used if you have more than two nodes in grid because it
>>>> has
>>>>>> huge impact on performance.
>>>> Is wal mode affects metadata store ?
>>>>
>>>>>
>>>>>>
>>>>>> On Wed, 14 Aug 2019 at 14:29, Denis Mekhanikov < dmekhani...@gmail.com
>>>>> :
>>>>>>
>>>>>>> Folks,
>>>>>>>
>>>>>>> Thanks for showing interest in this issue!
>>>>>>>
>>>>>>> Alexey,
>>>>>>>
>>>>>>>> I think removing fsync could help to mitigate performance issues
>> with
>>>>>>> current implementation
>>>>>>>
>>>>>>> Is my understanding correct, that if we remove fsync, then discovery
>>>> won’t
>>>>>>> be blocked, and data will be flushed to disk in background, and loss
>> of
>>>>>>> information will be possible only on OS failure? It sounds like an
>>>>>>> acceptable workaround to me.
>>>>>>>
>>>>>>> Will moving metadata to metastore actually resolve this issue? Please
>>>>>>> correct me if I’m wrong, but we will still need to write the
>>>> information to
>>>>>>> WAL before releasing the discovery thread. If WAL mode is FSYNC, then
>>>> the
>>>>>>> issue will still be there. Or is it planned to abandon the
>>>> discovery-based
>>>>>>> protocol at all?
>>>>>>>
>>>>>>> Evgeniy, Ivan,
>>>>>>>
>>>>>>> In my particular case the data wasn’t too big. It was a slow
>>>> virtualised
>>>>>>> disk with encryption, that made operations slow. Given that there are
>>>> 200
>>>>>>> nodes in a cluster, where every node writes slowly, and this process
>> is
>>>>>>> sequential, one piece of metadata is registered extremely slowly.
>>>>>>>
>>>>>>> Ivan, answering to your other questions:
>>>>>>>
>>>>>>>> 2. Do we need a persistent metadata for in-memory caches? Or is it
>> so
>>>>>>> accidentally?
>>>>>>>
>>>>>>> It should be checked, if it’s safe to stop writing marshaller
>> mappings
>>>> to
>>>>>>> disk without loosing any guarantees.
>>>>>>> But anyway, I would like to have a property, that would control this.
>>>> If
>>>>>>> metadata registration is slow, then initial cluster warmup may take a
>>>>>>> while. So, if we preserve metadata on disk, then we will need to warm
>>>> it up
>>>>>>> only once, and further restarts won’t be affected.
>>>>>>>
>>>>>>>> Do we really need a fast fix here?
>>>>>>>
>>>>>>> I would like a fix, that could be implemented now, since the activity
>>>> with
>>>>>>> moving metadata to metastore doesn’t sound like a quick one. Having a
>>>>>>> temporary solution would be nice.
>>>>>>>
>>>>>>> Denis
>>>>>>>
>>>>>>>> On 14 Aug 2019, at 11:53, Павлухин Иван < vololo...@gmail.com >
>>>> wrote:
>>>>>>>>
>>>>>>>> Denis,
>>>>>>>>
>>>>>>>> Several clarifying questions:
>>>>>>>> 1. Do you have an idea why metadata registration takes so long? So
>>>>>>>> poor disks? So many data to write? A contention with disk writes by
>>>>>>>> other subsystems?
>>>>>>>> 2. Do we need a persistent metadata for in-memory caches? Or is it
>> so
>>>>>>>> accidentally?
>>>>>>>>
>>>>>>>> Generally, I think that it is possible to move metadata saving
>>>>>>>> operations out of discovery thread without loosing required
>>>>>>>> consistency/integrity.
>>>>>>>>
>>>>>>>> As Alex mentioned using metastore looks like a better solution. Do
>> we
>>>>>>>> really need a fast fix here? (Are we talking about fast fix?)
>>>>>>>>
>>>>>>>> On Wed, 14 Aug 2019 at 11:45, Zhenya Stanilovsky
>>>>>>> < arzamas...@mail.ru.invalid >:
>>>>>>>>>
>>>>>>>>> Alexey, but in this case customer need to be informed, that whole
>>>> (for
>>>>>>> example 1 node) cluster crash (power off) could lead to partial data
>>>>>>> unavailability.
>>>>>>>>> And may be further index corruption.
>>>>>>>>> 1. Why your meta takes a substantial size? may be context leaking ?
>>>>>>>>> 2. Could meta be compressed ?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Wednesday, 14 August 2019, 11:22 +03:00, Alexei Scherbakov <
>>>>>>> alexey.scherbak...@gmail.com >:
>>>>>>>>>>
>>>>>>>>>> Denis Mekhanikov,
>>>>>>>>>>
>>>>>>>>>> Currently metadata are fsync'ed on write. This might be the case
>> of
>>>>>>>>>> slow-downs in case of metadata burst writes.
>>>>>>>>>> I think removing fsync could help to mitigate performance issues
>>>> with
>>>>>>>>>> current implementation until proper solution will be implemented:
>>>>>>> moving
>>>>>>>>>> metadata to metastore.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, 13 Aug 2019 at 17:09, Denis Mekhanikov <
>>>> dmekhani...@gmail.com
>>>>>>>> :
>>>>>>>>>>
>>>>>>>>>>> I would also like to mention, that marshaller mappings are
>> written
>>>> to
>>>>>>> disk
>>>>>>>>>>> even if persistence is disabled.
>>>>>>>>>>> So, this issue affects purely in-memory clusters as well.
>>>>>>>>>>>
>>>>>>>>>>> Denis
>>>>>>>>>>>
>>>>>>>>>>>> On 13 Aug 2019, at 17:06, Denis Mekhanikov <
>>>> dmekhani...@gmail.com >
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi!
>>>>>>>>>>>>
>>>>>>>>>>>> When persistence is enabled, binary metadata is written to disk
>>>> upon
>>>>>>>>>>> registration. Currently it happens in the discovery thread, which
>>>>>>> makes
>>>>>>>>>>> processing of related messages very slow.
>>>>>>>>>>>> There are cases, when a lot of nodes and slow disks can make
>> every
>>>>>>>>>>> binary type be registered for several minutes. Plus it blocks
>>>>>>> processing of
>>>>>>>>>>> other messages.
>>>>>>>>>>>>
>>>>>>>>>>>> I propose starting a separate thread that will be responsible
>> for
>>>>>>>>>>> writing binary metadata to disk. So, binary type registration
>> will
>>>> be
>>>>>>>>>>> considered finished before information about it will is written
>> to
>>>>>>> disks on
>>>>>>>>>>> all nodes.
>>>>>>>>>>>>
>>>>>>>>>>>> The main concern here is data consistency in cases when a node
>>>>>>>>>>> acknowledges type registration and then fails before writing the
>>>>>>> metadata
>>>>>>>>>>> to disk.
>>>>>>>>>>>> I see two parts of this issue:
>>>>>>>>>>>> Nodes will have different metadata after restarting.
>>>>>>>>>>>> If we write some data into a persisted cache and shut down nodes
>>>>>>> faster
>>>>>>>>>>> than a new binary type is written to disk, then after a restart
>> we
>>>>>>> won’t
>>>>>>>>>>> have a binary type to work with.
>>>>>>>>>>>>
>>>>>>>>>>>> The first case is similar to a situation, when one node fails,
>> and
>>>>>>> after
>>>>>>>>>>> that a new type is registered in the cluster. This issue is
>>>> resolved
>>>>>>> by the
>>>>>>>>>>> discovery data exchange. All nodes receive information about all
>>>>>>> binary
>>>>>>>>>>> types in the initial discovery messages sent by other nodes. So,
>>>> once
>>>>>>> you
>>>>>>>>>>> restart a node, it will receive information, that it failed to
>>>> finish
>>>>>>>>>>> writing to disk, from other nodes.
>>>>>>>>>>>> If all nodes shut down before finishing writing the metadata to
>>>> disk,
>>>>>>>>>>> then after a restart the type will be considered unregistered, so
>>>>>>> another
>>>>>>>>>>> registration will be required.
>>>>>>>>>>>>
>>>>>>>>>>>> The second case is a bit more complicated. But it can be
>> resolved
>>>> by
>>>>>>>>>>> making the discovery threads on every node create a future, that
>>>> will
>>>>>>> be
>>>>>>>>>>> completed when writing to disk is finished. So, every node will
>>>> have
>>>>>>> such
>>>>>>>>>>> future, that will reflect the current state of persisting the
>>>>>>> metadata to
>>>>>>>>>>> disk.
>>>>>>>>>>>> After that, if some operation needs this binary type, it will
>>>> need to
>>>>>>>>>>> wait on that future until flushing to disk is finished.
>>>>>>>>>>>> This way discovery threads won’t be blocked, but other threads,
>>>> that
>>>>>>>>>>> actually need this type, will be.
>>>>>>>>>>>>
>>>>>>>>>>>> Please let me know what you think about that.
>>>>>>>>>>>>
>>>>>>>>>>>> Denis
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Alexei Scherbakov
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Zhenya Stanilovsky
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> Ivan Pavlukhin
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Best regards,
>>>>>> Alexei Scherbakov
>>>>>
>>>>
>>>>
>>>> --
>>>> Zhenya Stanilovsky
>>>>
>>
>>