Alexey, I still don’t fully understand whether, by using the metastore, we are going to stop using discovery for metadata registration or not. Could you clarify that point? Is it going to be a distributed metastore or a local one?
Are there any relevant JIRA tickets for this change?

Denis

> On 14 Aug 2019, at 19:37, Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
>
> Denis Mekhanikov,
>
> 1. Yes, only on OS failures. In such a case the data will be received from alive nodes later.
> 2. Yes, with walmode=FSYNC writes to the metastore will be slow. But that mode should not be used if you have more than two nodes in the grid, because it has a huge impact on performance.
>
> Wed, 14 Aug 2019 at 14:29, Denis Mekhanikov <dmekhani...@gmail.com>:
>
>> Folks,
>>
>> Thanks for showing interest in this issue!
>>
>> Alexey,
>>
>>> I think removing fsync could help to mitigate performance issues with the current implementation
>>
>> Is my understanding correct that if we remove fsync, then discovery won’t be blocked, data will be flushed to disk in the background, and loss of information will be possible only on OS failure? It sounds like an acceptable workaround to me.
>>
>> Will moving metadata to the metastore actually resolve this issue? Please correct me if I’m wrong, but we will still need to write the information to the WAL before releasing the discovery thread. If the WAL mode is FSYNC, then the issue will still be there. Or is it planned to abandon the discovery-based protocol altogether?
>>
>> Evgeniy, Ivan,
>>
>> In my particular case the data wasn’t too big. It was a slow virtualised disk with encryption that made operations slow. Given that there are 200 nodes in the cluster, every node writes slowly, and the process is sequential, one piece of metadata is registered extremely slowly.
>>
>> Ivan, answering your other questions:
>>
>>> 2. Do we need persistent metadata for in-memory caches? Or is it so accidentally?
>>
>> It should be checked whether it’s safe to stop writing marshaller mappings to disk without losing any guarantees.
>> But anyway, I would like to have a property that would control this.
>> If metadata registration is slow, then initial cluster warmup may take a while. So, if we preserve metadata on disk, we will need to warm it up only once, and further restarts won’t be affected.
>>
>>> Do we really need a fast fix here?
>>
>> I would like a fix that could be implemented now, since the activity of moving metadata to the metastore doesn’t sound like a quick one. Having a temporary solution would be nice.
>>
>> Denis
>>
>>> On 14 Aug 2019, at 11:53, Павлухин Иван <vololo...@gmail.com> wrote:
>>>
>>> Denis,
>>>
>>> Several clarifying questions:
>>> 1. Do you have an idea why metadata registration takes so long? Such poor disks? So much data to write? Contention with disk writes by other subsystems?
>>> 2. Do we need persistent metadata for in-memory caches? Or is it so accidentally?
>>>
>>> Generally, I think that it is possible to move metadata saving operations out of the discovery thread without losing the required consistency/integrity.
>>>
>>> As Alex mentioned, using the metastore looks like a better solution. Do we really need a fast fix here? (Are we talking about a fast fix?)
>>>
>>> Wed, 14 Aug 2019 at 11:45, Zhenya Stanilovsky <arzamas...@mail.ru.invalid>:
>>>>
>>>> Alexey, but in this case the customer needs to be informed that a crash (power off) of the whole cluster (for example, a single-node one) could lead to partial data unavailability.
>>>> And maybe further index corruption.
>>>> 1. Why does your metadata take up a substantial size? Maybe a context is leaking?
>>>> 2. Could the metadata be compressed?
>>>>
>>>>
>>>>> Wednesday, 14 August 2019, 11:22 +03:00 from Alexei Scherbakov <alexey.scherbak...@gmail.com>:
>>>>>
>>>>> Denis Mekhanikov,
>>>>>
>>>>> Currently metadata are fsync'ed on write. This might be the cause of slowdowns in case of metadata burst writes.
>>>>> I think removing fsync could help to mitigate performance issues with the current implementation until the proper solution is implemented: moving metadata to the metastore.
>>>>>
>>>>>
>>>>> Tue, 13 Aug 2019 at 17:09, Denis Mekhanikov <dmekhani...@gmail.com>:
>>>>>
>>>>>> I would also like to mention that marshaller mappings are written to disk even if persistence is disabled.
>>>>>> So, this issue affects purely in-memory clusters as well.
>>>>>>
>>>>>> Denis
>>>>>>
>>>>>>> On 13 Aug 2019, at 17:06, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> When persistence is enabled, binary metadata is written to disk upon registration. Currently it happens in the discovery thread, which makes processing of related messages very slow.
>>>>>>> There are cases when a lot of nodes and slow disks can make every binary type take several minutes to register. Plus, it blocks processing of other messages.
>>>>>>>
>>>>>>> I propose starting a separate thread that will be responsible for writing binary metadata to disk. So, binary type registration will be considered finished before information about it is written to disk on all nodes.
>>>>>>>
>>>>>>> The main concern here is data consistency in cases when a node acknowledges type registration and then fails before writing the metadata to disk.
>>>>>>> I see two parts of this issue:
>>>>>>> 1. Nodes will have different metadata after restarting.
>>>>>>> 2. If we write some data into a persisted cache and shut down nodes faster than a new binary type is written to disk, then after a restart we won’t have a binary type to work with.
>>>>>>>
>>>>>>> The first case is similar to a situation when one node fails, and after that a new type is registered in the cluster. This issue is resolved by the discovery data exchange.
>>>>>>> All nodes receive information about all binary types in the initial discovery messages sent by other nodes. So, once you restart a node, it will receive from other nodes the information that it failed to finish writing to disk.
>>>>>>> If all nodes shut down before finishing writing the metadata to disk, then after a restart the type will be considered unregistered, so another registration will be required.
>>>>>>>
>>>>>>> The second case is a bit more complicated. But it can be resolved by making the discovery threads on every node create a future that will be completed when writing to disk is finished. So, every node will have such a future, which will reflect the current state of persisting the metadata to disk.
>>>>>>> After that, if some operation needs this binary type, it will need to wait on that future until flushing to disk is finished.
>>>>>>> This way discovery threads won’t be blocked, but other threads that actually need this type will be.
>>>>>>>
>>>>>>> Please let me know what you think about that.
>>>>>>>
>>>>>>> Denis
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Best regards,
>>>>> Alexei Scherbakov
>>>>
>>>>
>>>> --
>>>> Zhenya Stanilovsky
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Ivan Pavlukhin
>>
>>
>
> --
>
> Best regards,
> Alexei Scherbakov
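
For reference, the future-based scheme proposed in the original message of this thread could look roughly like the sketch below. It is only an illustration under assumptions: `AsyncMetadataWriter`, `onTypeRegistered`, `awaitPersisted` and `isPersisted` are hypothetical names, not Ignite APIs, and an in-memory map with an artificial delay stands in for the slow disk. The discovery thread only schedules the write and gets a future back, while any thread that actually needs the type waits on that future.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch: these names are illustrative and are not Ignite APIs. */
final class AsyncMetadataWriter implements AutoCloseable {
    /** Single background thread that performs the slow writes in submission order. */
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    /** Latest write future per binary type id; completed once that write has finished. */
    private final Map<Integer, CompletableFuture<Void>> writeFutures = new ConcurrentHashMap<>();

    /** Stand-in for the on-disk metadata store (an assumption made for this sketch). */
    private final Map<Integer, byte[]> disk = new ConcurrentHashMap<>();

    /** Called from the discovery thread: schedules the write and returns without blocking. */
    CompletableFuture<Void> onTypeRegistered(int typeId, byte[] meta) {
        CompletableFuture<Void> fut = CompletableFuture.runAsync(() -> persist(typeId, meta), writer);
        writeFutures.put(typeId, fut);
        return fut;
    }

    /** Called by an operation that needs the type: blocks until the pending write completes. */
    void awaitPersisted(int typeId) {
        CompletableFuture<Void> fut = writeFutures.get(typeId);
        if (fut != null)
            fut.join();
    }

    /** Returns true if the metadata for the given type has reached the simulated disk. */
    boolean isPersisted(int typeId) {
        return disk.containsKey(typeId);
    }

    /** Simulates a slow disk write with a short sleep before storing the metadata. */
    private void persist(int typeId, byte[] meta) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while writing metadata", e);
        }
        disk.put(typeId, meta);
    }

    /** Stops accepting new writes; writes already submitted are still executed. */
    @Override
    public void close() {
        writer.shutdown();
    }
}
```

In this sketch a cache operation would call `awaitPersisted(typeId)` before writing data of that type into a persisted cache, so the slow disk delays only the threads that depend on the type. It does not cover the crash-before-write case discussed above, which would still rely on the discovery data exchange.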