Re: Asynchronous registration of binary metadata

Denis Mekhanikov Wed, 14 Aug 2019 09:53:48 -0700

Alexey, 

I still don’t fully understand whether using the metastore means we stop
using discovery for metadata registration. Could you clarify that point?
Is it going to be a distributed metastore or a local one?


Are there any relevant JIRA tickets for this change?

Denis

> On 14 Aug 2019, at 19:37, Alexei Scherbakov <alexey.scherbak...@gmail.com> 
> wrote:
> 
> Denis Mekhanikov,
> 
> 1. Yes, only on OS failures. In that case the data will be received from
> live nodes later.
> 2. Yes, for walmode=FSYNC writes to the metastore will be slow. But that mode
> should not be used if you have more than two nodes in the grid, because it
> has a huge impact on performance.
> 
> Wed, 14 Aug 2019 at 14:29, Denis Mekhanikov <dmekhani...@gmail.com>:
> 
>> Folks,
>> 
>> Thanks for showing interest in this issue!
>> 
>> Alexey,
>> 
>>> I think removing fsync could help to mitigate performance issues with
>>> current implementation
>> 
>> Is my understanding correct, that if we remove fsync, then discovery won’t
>> be blocked, and data will be flushed to disk in background, and loss of
>> information will be possible only on OS failure? It sounds like an
>> acceptable workaround to me.
>> 
>> Will moving metadata to metastore actually resolve this issue? Please
>> correct me if I’m wrong, but we will still need to write the information to
>> WAL before releasing the discovery thread. If WAL mode is FSYNC, then the
>> issue will still be there. Or is it planned to abandon the discovery-based
>> protocol altogether?
>> 
>> Evgeniy, Ivan,
>> 
>> In my particular case the data wasn’t too big. The disk was a slow
>> virtualised one with encryption, which made operations slow. Given that
>> there are 200 nodes in the cluster, every node writes slowly, and the
>> process is sequential, one piece of metadata is registered extremely slowly.
>> 
>> Ivan, answering your other questions:
>> 
>>> 2. Do we need persistent metadata for in-memory caches? Or is it that way
>>> by accident?
>> 
>> It should be checked whether it’s safe to stop writing marshaller mappings
>> to disk without losing any guarantees.
>> But anyway, I would like to have a property, that would control this. If
>> metadata registration is slow, then initial cluster warmup may take a
>> while. So, if we preserve metadata on disk, then we will need to warm it up
>> only once, and further restarts won’t be affected.
>> 
>>> Do we really need a fast fix here?
>> 
>> I would like a fix, that could be implemented now, since the activity with
>> moving metadata to metastore doesn’t sound like a quick one. Having a
>> temporary solution would be nice.
>> 
>> Denis
>> 
>>> On 14 Aug 2019, at 11:53, Павлухин Иван <vololo...@gmail.com> wrote:
>>> 
>>> Denis,
>>> 
>>> Several clarifying questions:
>>> 1. Do you have an idea why metadata registration takes so long? Are the
>>> disks that slow? Is there that much data to write? Is there contention
>>> with disk writes by other subsystems?
>>> 2. Do we need persistent metadata for in-memory caches? Or is it that way
>>> by accident?
>>> 
>>> Generally, I think that it is possible to move metadata saving
>>> operations out of the discovery thread without losing the required
>>> consistency/integrity.
>>> 
>>> As Alex mentioned, using the metastore looks like a better solution. Do we
>>> really need a fast fix here? (Are we talking about a fast fix?)
>>> 
>>> Wed, 14 Aug 2019 at 11:45, Zhenya Stanilovsky
>>> <arzamas...@mail.ru.invalid>:
>>>> 
>>>> Alexey, but in this case customers need to be informed that a crash
>>>> (power off) of the whole cluster (for example, a 1-node cluster) could
>>>> lead to partial data unavailability,
>>>> and maybe to further index corruption.
>>>> 1. Why does your metadata take up a substantial size? Maybe some context
>>>> is leaking?
>>>> 2. Could the metadata be compressed?
>>>> 
>>>> 
>>>>> Wednesday, 14 August 2019, 11:22 +03:00 from Alexei Scherbakov
>>>>> <alexey.scherbak...@gmail.com>:
>>>>> 
>>>>> Denis Mekhanikov,
>>>>> 
>>>>> Currently metadata is fsync'ed on write. This might be the cause of
>>>>> slowdowns during bursts of metadata writes.
>>>>> I think removing fsync could help to mitigate performance issues with
>>>>> the current implementation until a proper solution is implemented:
>>>>> moving metadata to the metastore.
>>>>> 
>>>>> 
>>>>> Tue, 13 Aug 2019 at 17:09, Denis Mekhanikov <dmekhani...@gmail.com>:
>>>>> 
>>>>>> I would also like to mention that marshaller mappings are written to
>>>>>> disk even if persistence is disabled.
>>>>>> So, this issue affects purely in-memory clusters as well.
>>>>>> 
>>>>>> Denis
>>>>>> 
>>>>>>> On 13 Aug 2019, at 17:06, Denis Mekhanikov < dmekhani...@gmail.com >
>>>>>> wrote:
>>>>>>> 
>>>>>>> Hi!
>>>>>>> 
>>>>>>> When persistence is enabled, binary metadata is written to disk upon
>>>>>>> registration. Currently this happens in the discovery thread, which
>>>>>>> makes processing of related messages very slow.
>>>>>>> There are cases where a large number of nodes and slow disks make the
>>>>>>> registration of every binary type take several minutes. It also blocks
>>>>>>> processing of other messages.
>>>>>>> 
>>>>>>> I propose starting a separate thread that will be responsible for
>>>>>>> writing binary metadata to disk. Binary type registration will then be
>>>>>>> considered finished before information about it is written to disk on
>>>>>>> all nodes.
>>>>>>> 
>>>>>>> The main concern here is data consistency in cases when a node
>>>>>>> acknowledges type registration and then fails before writing the
>>>>>>> metadata to disk.
>>>>>>> I see two parts of this issue:
>>>>>>> 1. Nodes will have different metadata after restarting.
>>>>>>> 2. If we write some data into a persisted cache and shut down nodes
>>>>>>> faster than a new binary type is written to disk, then after a restart
>>>>>>> we won’t have a binary type to work with.
>>>>>>> 
>>>>>>> The first case is similar to a situation when one node fails, and
>>>>>>> after that a new type is registered in the cluster. This issue is
>>>>>>> resolved by the discovery data exchange. All nodes receive information
>>>>>>> about all binary types in the initial discovery messages sent by other
>>>>>>> nodes. So, once you restart a node, it will receive the information
>>>>>>> that it failed to finish writing to disk from other nodes.
>>>>>>> If all nodes shut down before finishing writing the metadata to disk,
>>>>>>> then after a restart the type will be considered unregistered, so
>>>>>>> another registration will be required.
>>>>>>> 
>>>>>>> The second case is a bit more complicated. But it can be resolved by
>>>>>>> making the discovery thread on every node create a future that will be
>>>>>>> completed when writing to disk is finished. So, every node will have
>>>>>>> such a future, reflecting the current state of persisting the metadata
>>>>>>> to disk.
>>>>>>> After that, if some operation needs this binary type, it will need to
>>>>>>> wait on that future until flushing to disk is finished.
>>>>>>> This way discovery threads won’t be blocked, but other threads that
>>>>>>> actually need this type will be.
>>>>>>> 
>>>>>>> Please let me know what you think about that.
>>>>>>> 
>>>>>>> Denis
>>>>>> 
>>>>>> 
>>>>> 
>>>>> --
>>>>> 
>>>>> Best regards,
>>>>> Alexei Scherbakov
>>>> 
>>>> 
>>>> --
>>>> Zhenya Stanilovsky
>>> 
>>> 
>>> 
>>> --
>>> Best regards,
>>> Ivan Pavlukhin
>> 
>> 
> 
> -- 
> 
> Best regards,
> Alexei Scherbakov
