Eduard,

Usages will wait for the metadata to be registered and written to disk. No
races should occur with such a flow.
Or do you have a specific case in mind?
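
To illustrate the flow I have in mind, here is a minimal sketch (class and
method names are illustrative, not the actual Ignite code): the discovery
thread only schedules the write and remembers a future, and any usage of the
type waits on that future before proceeding.

    import java.util.concurrent.*;

    class BinaryMetadataWriterSketch {
        // A single writer thread keeps writes ordered, but off the discovery thread.
        private final ExecutorService writer = Executors.newSingleThreadExecutor();

        // typeId -> future completed once the metadata is persisted to disk.
        private final ConcurrentMap<Integer, CompletableFuture<Void>> writeFutures =
            new ConcurrentHashMap<>();

        // Called from the discovery thread: schedules the write and returns immediately.
        void onMetadataRegistered(int typeId, byte[] metadata) {
            writeFutures.put(typeId,
                CompletableFuture.runAsync(() -> writeToDisk(typeId, metadata), writer));
        }

        // Called from a striped-pool (or any other) thread that actually needs the type.
        void onTypeUsage(int typeId) {
            CompletableFuture<Void> fut = writeFutures.get(typeId);

            if (fut != null)
                fut.join(); // Block the usage, not discovery, until the metadata is on disk.
        }

        private void writeToDisk(int typeId, byte[] metadata) {
            // Placeholder for the actual persist + fsync logic.
        }
    }

With such a flow a race between registration and usage turns into a wait on
the same node’s write future.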

I agree that using a distributed meta storage would be nice here.
But this way we would, in a sense, move back to the previous scheme with a
replicated system cache, where metadata used to be stored.
Will the scheme with the metastorage be different in any way? Won’t we decide
to move back to discovery messages again after a while?

Denis


> On 20 Aug 2019, at 15:13, Eduard Shangareev <eduard.shangar...@gmail.com> 
> wrote:
> 
> Denis,
> How would we deal with races between registration and metadata usages with
> such a fast fix?
> 
> I believe that we need to move it to the distributed metastorage and await
> registration completeness if we can't find it (wait for the work in progress).
> Discovery shouldn't wait for anything here.
> 
> On Tue, Aug 20, 2019 at 11:55 AM Denis Mekhanikov <dmekhani...@gmail.com>
> wrote:
> 
>> Sergey,
>> 
>> Currently metadata is written to disk sequentially on every node. Only one
>> node at a time is able to write metadata to its storage.
>> Slowness accumulates when you add more nodes. The delay required to write
>> one piece of metadata may not be that big, but if you multiply it by, say,
>> 200, then it becomes noticeable.
>> But if we move the writing out of the discovery threads, then nodes will be
>> doing it in parallel.
>> 
>> I think it’s better to block some threads from the striped pool for a
>> little while rather than block discovery for the same period multiplied
>> by the number of nodes.
>> 
>> What do you think?
>> 
>> Denis
>> 
>>> On 15 Aug 2019, at 10:26, Sergey Chugunov <sergey.chugu...@gmail.com>
>> wrote:
>>> 
>>> Denis,
>>> 
>>> Thanks for bringing this issue up; the decision to write binary metadata
>>> from the discovery thread was really a tough one to make.
>>> I don't think that moving metadata to the metastorage is a silver bullet
>>> here, as this approach also has its drawbacks and is not an easy change.
>>> 
>>> In addition to the workarounds suggested by Alexei, we have two choices for
>>> offloading the write operation from the discovery thread:
>>> 
>>>  1. Your scheme with a separate writer thread and futures completed when
>>>  the write operation is finished.
>>>  2. A PME-like protocol, with obvious complications like failover and
>>>  asynchronous waiting for replies over the communication layer.
>>> 
>>> Your suggestion looks easier from a code complexity perspective, but in my
>>> view it increases the chances of running into starvation. Now, if some node
>>> faces really long delays during the write op, it is going to be kicked out
>>> of the topology by the discovery protocol. In your case it is possible that
>>> more and more threads from other pools may get stuck waiting on the
>>> operation future, which is also not good.
>>> 
>>> What do you think?
>>> 
>>> I also think that if we want to approach this issue systematically, we need
>>> to do a deep analysis of the metastorage option as well and finally choose
>>> which road we want to go down.
>>> 
>>> Thanks!
>>> 
>>> On Thu, Aug 15, 2019 at 9:28 AM Zhenya Stanilovsky
>>> <arzamas...@mail.ru.invalid> wrote:
>>> 
>>>> 
>>>>> 
>>>>>> 1. Yes, only on OS failures. In such a case the data will be received
>>>>>> from alive nodes later.
>>>> What would the behavior be in the case of a single node? I suppose someone
>>>> could obtain cache data without the unmarshalling schema; what would happen
>>>> to grid operability in that case?
>>>> 
>>>>> 
>>>>>> 2. Yes, for walmode=FSYNC writes to the metastore will be slow. But such
>>>>>> a mode should not be used if you have more than two nodes in the grid,
>>>>>> because it has a huge impact on performance.
>>>> Does the WAL mode affect the metadata store?
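>>>> 
>>>> For reference, a minimal sketch of where walmode=FSYNC is set (a standard
>>>> Ignite persistence configuration; whether this mode also governs the
>>>> binary metadata writes is exactly what I'm asking):
>>>> 
>>>> import org.apache.ignite.configuration.DataStorageConfiguration;
>>>> import org.apache.ignite.configuration.IgniteConfiguration;
>>>> import org.apache.ignite.configuration.WALMode;
>>>> 
>>>> IgniteConfiguration cfg = new IgniteConfiguration();
>>>> 
>>>> // Every WAL write waits for fsync in this mode.
>>>> DataStorageConfiguration storageCfg = new DataStorageConfiguration()
>>>>     .setWalMode(WALMode.FSYNC);
>>>> 
>>>> storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
>>>> 
>>>> cfg.setDataStorageConfiguration(storageCfg);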
>>>> 
>>>>> 
>>>>>> 
>>>>>> Wed, 14 Aug 2019 at 14:29, Denis Mekhanikov <dmekhani...@gmail.com>:
>>>>>> 
>>>>>>> Folks,
>>>>>>> 
>>>>>>> Thanks for showing interest in this issue!
>>>>>>> 
>>>>>>> Alexey,
>>>>>>> 
>>>>>>>> I think removing fsync could help to mitigate performance issues
>>>>>>>> with the current implementation
>>>>>>> 
>>>>>>> Is my understanding correct that if we remove fsync, then discovery won’t
>>>>>>> be blocked, data will be flushed to disk in the background, and loss of
>>>>>>> information will be possible only on an OS failure? It sounds like an
>>>>>>> acceptable workaround to me.
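>>>>>>> 
>>>>>>> Just to make sure we mean the same thing, here is a generic java.nio
>>>>>>> illustration of that difference (not the actual Ignite storage code;
>>>>>>> path and buf are assumed to be provided by the caller):
>>>>>>> 
>>>>>>> import java.nio.ByteBuffer;
>>>>>>> import java.nio.channels.FileChannel;
>>>>>>> import java.nio.file.Path;
>>>>>>> import static java.nio.file.StandardOpenOption.*;
>>>>>>> 
>>>>>>> void persistMetadata(Path path, ByteBuffer buf) throws Exception {
>>>>>>>     try (FileChannel ch = FileChannel.open(path, CREATE, WRITE)) {
>>>>>>>         ch.write(buf);
>>>>>>>         // The fsync that currently blocks the discovery thread.
>>>>>>>         // Dropping it leaves flushing to the OS, so the data is lost
>>>>>>>         // only if the OS fails before the background flush happens.
>>>>>>>         ch.force(true);
>>>>>>>     }
>>>>>>> }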
>>>>>>> 
>>>>>>> Will moving metadata to the metastore actually resolve this issue?
>>>>>>> Please correct me if I’m wrong, but we will still need to write the
>>>>>>> information to the WAL before releasing the discovery thread. If the WAL
>>>>>>> mode is FSYNC, then the issue will still be there. Or is it planned to
>>>>>>> abandon the discovery-based protocol altogether?
>>>>>>> 
>>>>>>> Evgeniy, Ivan,
>>>>>>> 
>>>>>>> In my particular case the data wasn’t too big. It was a slow virtualised
>>>>>>> disk with encryption that made operations slow. Given that there are 200
>>>>>>> nodes in the cluster, every node writes slowly, and this process is
>>>>>>> sequential, one piece of metadata is registered extremely slowly.
>>>>>>> 
>>>>>>> Ivan, answering your other questions:
>>>>>>> 
>>>>>>>> 2. Do we need persistent metadata for in-memory caches? Or is it so
>>>>>>>> by accident?
>>>>>>> 
>>>>>>> It should be checked whether it’s safe to stop writing marshaller mappings
>>>>>>> to disk without losing any guarantees.
>>>>>>> But anyway, I would like to have a property that would control this. If
>>>>>>> metadata registration is slow, then the initial cluster warmup may take a
>>>>>>> while. So, if we preserve metadata on disk, then we will need to warm it
>>>>>>> up only once, and further restarts won’t be affected.
>>>>>>> 
>>>>>>>> Do we really need a fast fix here?
>>>>>>> 
>>>>>>> I would like a fix that could be implemented now, since the activity of
>>>>>>> moving metadata to the metastore doesn’t sound like a quick one. Having a
>>>>>>> temporary solution would be nice.
>>>>>>> 
>>>>>>> Denis
>>>>>>> 
>>>>>>>> On 14 Aug 2019, at 11:53, Павлухин Иван <vololo...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Denis,
>>>>>>>> 
>>>>>>>> Several clarifying questions:
>>>>>>>> 1. Do you have an idea why metadata registration takes so long? Such
>>>>>>>> poor disks? So much data to write? Contention with disk writes from
>>>>>>>> other subsystems?
>>>>>>>> 2. Do we need persistent metadata for in-memory caches? Or is it so
>>>>>>>> by accident?
>>>>>>>> 
>>>>>>>> Generally, I think that it is possible to move metadata-saving
>>>>>>>> operations out of the discovery thread without losing the required
>>>>>>>> consistency/integrity.
>>>>>>>> 
>>>>>>>> As Alex mentioned, using the metastore looks like a better solution. Do
>>>>>>>> we really need a fast fix here? (Are we talking about a fast fix?)
>>>>>>>> 
>>>>>>>> Wed, 14 Aug 2019 at 11:45, Zhenya Stanilovsky
>>>>>>>> <arzamas...@mail.ru.invalid>:
>>>>>>>>> 
>>>>>>>>> Alexey, but in this case the customer needs to be informed that a
>>>>>>>>> whole-cluster crash (power off), for example of a 1-node cluster, could
>>>>>>>>> lead to partial data unavailability.
>>>>>>>>> And maybe to further index corruption.
>>>>>>>>> 1. Why does your meta take up a substantial size? Maybe a context leak?
>>>>>>>>> 2. Could the meta be compressed?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Wednesday, 14 August 2019, 11:22 +03:00 from Alexei Scherbakov
>>>>>>>>>> <alexey.scherbak...@gmail.com>:
>>>>>>>>>> 
>>>>>>>>>> Denis Mekhanikov,
>>>>>>>>>> 
>>>>>>>>>> Currently metadata is fsync'ed on write. This might be the cause of
>>>>>>>>>> slow-downs in the case of metadata burst writes.
>>>>>>>>>> I think removing fsync could help to mitigate performance issues with
>>>>>>>>>> the current implementation until a proper solution is implemented:
>>>>>>>>>> moving metadata to the metastore.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Tue, 13 Aug 2019 at 17:09, Denis Mekhanikov <dmekhani...@gmail.com>:
>>>>>>>>>> 
>>>>>>>>>>> I would also like to mention that marshaller mappings are written to
>>>>>>>>>>> disk even if persistence is disabled.
>>>>>>>>>>> So, this issue affects purely in-memory clusters as well.
>>>>>>>>>>> 
>>>>>>>>>>> Denis
>>>>>>>>>>> 
>>>>>>>>>>>> On 13 Aug 2019, at 17:06, Denis Mekhanikov <dmekhani...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi!
>>>>>>>>>>>> 
>>>>>>>>>>>> When persistence is enabled, binary metadata is written to disk upon
>>>>>>>>>>>> registration. Currently it happens in the discovery thread, which
>>>>>>>>>>>> makes processing of related messages very slow.
>>>>>>>>>>>> There are cases when a lot of nodes and slow disks can make the
>>>>>>>>>>>> registration of every binary type take several minutes. Plus, it
>>>>>>>>>>>> blocks processing of other messages.
>>>>>>>>>>>> 
>>>>>>>>>>>> I propose starting a separate thread that will be responsible for
>>>>>>>>>>>> writing binary metadata to disk. So, binary type registration will
>>>>>>>>>>>> be considered finished before the information about it is written
>>>>>>>>>>>> to disk on all nodes.
>>>>>>>>>>>> 
>>>>>>>>>>>> The main concern here is data consistency in cases when a node
>>>>>>>>>>>> acknowledges type registration and then fails before writing the
>>>>>>>>>>>> metadata to disk.
>>>>>>>>>>>> I see two parts of this issue:
>>>>>>>>>>>> 1. Nodes will have different metadata after restarting.
>>>>>>>>>>>> 2. If we write some data into a persisted cache and shut down nodes
>>>>>>>>>>>> faster than a new binary type is written to disk, then after a
>>>>>>>>>>>> restart we won’t have a binary type to work with.
>>>>>>>>>>>> 
>>>>>>>>>>>> The first case is similar to a situation when one node fails, and
>>>>>>>>>>>> after that a new type is registered in the cluster. This issue is
>>>>>>>>>>>> resolved by the discovery data exchange. All nodes receive
>>>>>>>>>>>> information about all binary types in the initial discovery messages
>>>>>>>>>>>> sent by other nodes. So, once you restart a node, it will receive the
>>>>>>>>>>>> information that it failed to finish writing to disk from the other
>>>>>>>>>>>> nodes.
>>>>>>>>>>>> If all nodes shut down before finishing writing the metadata to disk,
>>>>>>>>>>>> then after a restart the type will be considered unregistered, so
>>>>>>>>>>>> another registration will be required.
>>>>>>>>>>>> 
>>>>>>>>>>>> The second case is a bit more complicated. But it can be resolved by
>>>>>>>>>>>> making the discovery threads on every node create a future that will
>>>>>>>>>>>> be completed when writing to disk is finished. So, every node will
>>>>>>>>>>>> have such a future, reflecting the current state of persisting the
>>>>>>>>>>>> metadata to disk.
>>>>>>>>>>>> After that, if some operation needs this binary type, it will need to
>>>>>>>>>>>> wait on that future until flushing to disk is finished.
>>>>>>>>>>>> This way discovery threads won’t be blocked, but other threads that
>>>>>>>>>>>> actually need this type will be.
>>>>>>>>>>>> 
>>>>>>>>>>>> Please let me know what you think about that.
>>>>>>>>>>>> 
>>>>>>>>>>>> Denis
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> 
>>>>>>>>>> Best regards,
>>>>>>>>>> Alexei Scherbakov
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Zhenya Stanilovsky
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> Ivan Pavlukhin
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> 
>>>>>> Best regards,
>>>>>> Alexei Scherbakov
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Zhenya Stanilovsky
>>>> 
>> 
>> 
