Denis Mekhanikov,

I think at least one node (the coordinator, for example) should still write metadata synchronously, to protect against the following scenario:

tx creating new metadata is committed <- all nodes in the grid fail (powered off) <- async write to disk completes,

where <- means "happens before".

All other nodes could write asynchronously, either by using a separate thread or by not doing fsync (same effect).
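Roughly what I mean, as a sketch (the class and all names below are invented for illustration, this is not actual Ignite code):

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustration only: the coordinator persists metadata synchronously, other nodes offload the write. */
public class MetadataWritePolicy {
    private final ExecutorService asyncWriter = Executors.newSingleThreadExecutor();
    private final Path dir;
    private final boolean coordinator;

    public MetadataWritePolicy(Path dir, boolean coordinator) {
        this.dir = dir;
        this.coordinator = coordinator;
    }

    /** Called when a new binary type is registered. */
    public void onMetadata(int typeId, byte[] marshalled) {
        if (coordinator)
            write(typeId, marshalled, true);  // synchronous write with fsync, but only on one node
        else
            asyncWriter.submit(() -> write(typeId, marshalled, false)); // others release the caller immediately
    }

    private void write(int typeId, byte[] data, boolean fsync) {
        try {
            Path file = dir.resolve(typeId + ".bin");
            Files.write(file, data, StandardOpenOption.CREATE, StandardOpenOption.WRITE);

            if (fsync) {
                try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
                    ch.force(true); // make sure the bytes reach the device before we acknowledge
                }
            }
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

Whether the asynchronous path skips fsync entirely or just moves the write off the caller's thread does not matter much here; the point is that only one node pays the synchronous cost.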
Wed, 21 Aug 2019 at 19:48, Denis Mekhanikov <dmekhani...@gmail.com>:

> Alexey,
>
> I'm not suggesting to duplicate anything.
> My point is that the proper fix will be implemented in a relatively distant future. Why not improve the existing mechanism now instead of waiting for the proper fix?
> If we don't agree on doing this fix in master, I can do it in a fork and use it in my setup. So please let me know if you see any other drawbacks in the proposed solution.
>
> Denis
>
> On 21 Aug 2019, at 15:53, Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
>
> > Denis Mekhanikov,
> >
> > If we are still talking about a "proper" solution, the metastore (I mean, of course, the distributed one) is the way to go.
> >
> > It has a contract to store cluster-wide metadata in the most efficient way and can have any optimization for concurrent writing inside.
> >
> > I'm against creating some duplicating mechanism as you suggested. We do not need another piece of copy/paste code.
> >
> > Another possibility is to carry metadata along with the appropriate request if it's not found locally, but this is a rather big modification.
> >
> > Tue, 20 Aug 2019 at 17:26, Denis Mekhanikov <dmekhani...@gmail.com>:
> >
> >> Eduard,
> >>
> >> Usages will wait for the metadata to be registered and written to disk. No races should occur with such a flow.
> >> Or do you have some specific case in mind?
> >>
> >> I agree that using a distributed metastorage would be nice here.
> >> But this way we will kind of move back to the previous scheme with a replicated system cache, where metadata was stored before.
> >> Will the scheme with the metastorage be different in any way? Won't we decide to move back to discovery messages again after a while?
> >>
> >> Denis
> >>
> >> On 20 Aug 2019, at 15:13, Eduard Shangareev <eduard.shangar...@gmail.com> wrote:
> >>
> >>> Denis,
> >>> How would we deal with races between registration and metadata usages with such a fast fix?
> >>>
> >>> I believe that we need to move it to the distributed metastorage and await registration completeness if we can't find it (wait for the work in progress).
> >>> Discovery shouldn't wait for anything here.
> >>>
> >>> On Tue, Aug 20, 2019 at 11:55 AM Denis Mekhanikov <dmekhani...@gmail.com> wrote:
> >>>
> >>>> Sergey,
> >>>>
> >>>> Currently metadata is written to disk sequentially on every node. Only one node at a time is able to write metadata to its storage.
> >>>> Slowness accumulates when you add more nodes. The delay required to write one piece of metadata may not be that big, but if you multiply it by, say, 200, it becomes noticeable.
> >>>> But if we move the writing out of the discovery threads, then nodes will be doing it in parallel.
> >>>>
> >>>> I think it's better to block some threads from the striped pool for a little while rather than blocking discovery for the same period multiplied by the number of nodes.
> >>>>
> >>>> What do you think?
> >>>>
> >>>> Denis
> >>>>
> >>>> On 15 Aug 2019, at 10:26, Sergey Chugunov <sergey.chugu...@gmail.com> wrote:
> >>>>
> >>>>> Denis,
> >>>>>
> >>>>> Thanks for bringing this issue up; writing binary metadata from the discovery thread was really a tough decision to make.
> >>>>> I don't think that moving metadata to the metastorage is a silver bullet here, as this approach also has its drawbacks and is not an easy change.
> >>>>>
> >>>>> In addition to the workarounds suggested by Alexei, we have two choices for offloading the write operation from the discovery thread:
> >>>>>
> >>>>> 1. Your scheme with a separate writer thread and futures completed when the write operation is finished.
> >>>>> 2. A PME-like protocol with obvious complications like failover and asynchronous waiting for replies over the communication layer.
> >>>>>
> >>>>> Your suggestion looks easier from the code complexity perspective, but in my view it increases the chances of getting into starvation. Now, if some node faces really long delays during a write op, it is going to be kicked out of the topology by the discovery protocol. In your case it is possible that more and more threads from other pools may get stuck waiting on the operation future, which is also not good.
> >>>>>
> >>>>> What do you think?
> >>>>>
> >>>>> I also think that if we want to approach this issue systematically, we need to do a deep analysis of the metastorage option as well and finally choose which road we want to go.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> On Thu, Aug 15, 2019 at 9:28 AM Zhenya Stanilovsky <arzamas...@mail.ru.invalid> wrote:
> >>>>>
> >>>>>>
> >>>>>>>> 1. Yes, only on OS failures. In such a case data will be received from alive nodes later.
> >>>>>>
> >>>>>> What would the behavior be in the case of one node? I suppose someone can obtain cache data without unmarshalling the schema; what would happen to grid operability in this case?
> >>>>>>
> >>>>>>>> 2. Yes, for walmode=FSYNC writes to the metastore will be slow. But such a mode should not be used if you have more than two nodes in the grid because it has a huge impact on performance.
> >>>>>>
> >>>>>> Does WAL mode affect the metadata store?
> >>>>>>
> >>>>>>>> Wed, 14 Aug 2019 at 14:29, Denis Mekhanikov <dmekhani...@gmail.com>:
> >>>>>>>>
> >>>>>>>>> Folks,
> >>>>>>>>>
> >>>>>>>>> Thanks for showing interest in this issue!
> >>>>>>>>>
> >>>>>>>>> Alexey,
> >>>>>>>>>
> >>>>>>>>>> I think removing fsync could help to mitigate performance issues with current implementation
> >>>>>>>>>
> >>>>>>>>> Is my understanding correct that if we remove fsync, then discovery won't be blocked, data will be flushed to disk in the background, and loss of information will be possible only on OS failure? It sounds like an acceptable workaround to me.
> >>>>>>>>>
> >>>>>>>>> Will moving metadata to the metastore actually resolve this issue? Please correct me if I'm wrong, but we will still need to write the information to the WAL before releasing the discovery thread. If the WAL mode is FSYNC, then the issue will still be there. Or is it planned to abandon the discovery-based protocol altogether?
> >>>>>>>>>
> >>>>>>>>> Evgeniy, Ivan,
> >>>>>>>>>
> >>>>>>>>> In my particular case the data wasn't too big. It was a slow virtualised disk with encryption that made operations slow. Given that there are 200 nodes in the cluster, every node writes slowly, and this process is sequential, one piece of metadata is registered extremely slowly.
> >>>>>>>>>
> >>>>>>>>> Ivan, answering your other questions:
> >>>>>>>>>
> >>>>>>>>>> 2. Do we need persistent metadata for in-memory caches? Or is it so by accident?
> >>>>>>>>>
> >>>>>>>>> It should be checked whether it's safe to stop writing marshaller mappings to disk without losing any guarantees.
> >>>>>>>>> But anyway, I would like to have a property that would control this. If metadata registration is slow, then the initial cluster warmup may take a while. So, if we preserve metadata on disk, we will need to warm it up only once, and further restarts won't be affected.
> >>>>>>>>>
> >>>>>>>>>> Do we really need a fast fix here?
> >>>>>>>>>
> >>>>>>>>> I would like a fix that could be implemented now, since the activity of moving metadata to the metastore doesn't sound like a quick one. Having a temporary solution would be nice.
> >>>>>>>>>
> >>>>>>>>> Denis
> >>>>>>>>>
> >>>>>>>>>> On 14 Aug 2019, at 11:53, Павлухин Иван <vololo...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Denis,
> >>>>>>>>>>
> >>>>>>>>>> Several clarifying questions:
> >>>>>>>>>> 1. Do you have an idea why metadata registration takes so long? Poor disks? Too much data to write? Contention with disk writes by other subsystems?
> >>>>>>>>>> 2. Do we need persistent metadata for in-memory caches? Or is it so by accident?
> >>>>>>>>>>
> >>>>>>>>>> Generally, I think that it is possible to move metadata saving operations out of the discovery thread without losing the required consistency/integrity.
> >>>>>>>>>>
> >>>>>>>>>> As Alex mentioned, using the metastore looks like a better solution. Do we really need a fast fix here? (Are we talking about a fast fix?)
> >>>>>>>>>>
> >>>>>>>>>> Wed, 14 Aug 2019 at 11:45, Zhenya Stanilovsky <arzamas...@mail.ru.invalid>:
> >>>>>>>>>>>
> >>>>>>>>>>> Alexey, but in this case the customer needs to be informed that a whole-cluster crash (power off), for example of a 1-node cluster, could lead to partial data unavailability.
> >>>>>>>>>>> And maybe to further index corruption.
> >>>>>>>>>>> 1. Why does your meta take up a substantial size? Maybe a context leak?
> >>>>>>>>>>> 2. Could meta be compressed?
> >>>>>>>>>>>
> >>>>>>>>>>> Wednesday, 14 August 2019, 11:22 +03:00 from Alexei Scherbakov <alexey.scherbak...@gmail.com>:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Denis Mekhanikov,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Currently metadata is fsync'ed on write. This might be the cause of slow-downs in the case of metadata burst writes.
> >>>>>>>>>>>> I think removing fsync could help to mitigate performance issues with the current implementation until the proper solution is implemented: moving metadata to the metastore.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Tue, 13 Aug 2019 at 17:09, Denis Mekhanikov <dmekhani...@gmail.com>:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I would also like to mention that marshaller mappings are written to disk even if persistence is disabled.
> >>>>>>>>>>>>> So, this issue affects purely in-memory clusters as well.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Denis
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 13 Aug 2019, at 17:06, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> When persistence is enabled, binary metadata is written to disk upon registration. Currently it happens in the discovery thread, which makes processing of related messages very slow.
> >>>>>>>>>>>>>> There are cases when a lot of nodes and slow disks can make the registration of every binary type take several minutes. Plus, it blocks processing of other messages.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I propose starting a separate thread that will be responsible for writing binary metadata to disk. So, binary type registration will be considered finished before information about it is written to disk on all nodes.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The main concern here is data consistency in cases when a node acknowledges type registration and then fails before writing the metadata to disk.
> >>>>>>>>>>>>>> I see two parts of this issue:
> >>>>>>>>>>>>>> 1. Nodes will have different metadata after restarting.
> >>>>>>>>>>>>>> 2. If we write some data into a persisted cache and shut down the nodes faster than a new binary type is written to disk, then after a restart we won't have the binary type to work with.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The first case is similar to a situation when one node fails and after that a new type is registered in the cluster. This issue is resolved by the discovery data exchange: all nodes receive information about all binary types in the initial discovery messages sent by other nodes. So, once you restart a node, it will receive the information that it failed to finish writing to disk from the other nodes.
> >>>>>>>>>>>>>> If all nodes shut down before finishing writing the metadata to disk, then after a restart the type will be considered unregistered, so another registration will be required.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The second case is a bit more complicated. But it can be resolved by making the discovery thread on every node create a future that will be completed when writing to disk is finished. So, every node will have such a future, reflecting the current state of persisting the metadata to disk.
> >>>>>>>>>>>>>> After that, if some operation needs this binary type, it will need to wait on that future until flushing to disk is finished.
> >>>>>>>>>>>>>> This way discovery threads won't be blocked, but other threads that actually need this type will be.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Please let me know what you think about that.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Denis
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Best regards,
> >>>>>>>>>>>> Alexei Scherbakov
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Zhenya Stanilovsky
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Best regards,
> >>>>>>>>>> Ivan Pavlukhin
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Best regards,
> >>>>>>>> Alexei Scherbakov
> >>>>>>
> >>>>>> --
> >>>>>> Zhenya Stanilovsky
> >
> > --
> > Best regards,
> > Alexei Scherbakov

--
Best regards,
Alexei Scherbakov
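P.S. Just to make sure I read the future-based flow from the quoted proposal correctly, here is a rough sketch of how I understand it (all names below are invented for illustration, this is not a patch against Ignite):

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Sketch only: the discovery thread merely schedules the disk write,
 * a dedicated writer thread completes the per-type future,
 * and any operation that actually needs the type blocks on that future instead of on discovery.
 */
public class MetadataWriteFutures {
    private final Map<Integer, CompletableFuture<Void>> written = new ConcurrentHashMap<>();
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    /** Called from the discovery thread: schedule the write and return immediately. */
    public void onTypeRegistered(int typeId, byte[] meta) {
        CompletableFuture<Void> fut = written.computeIfAbsent(typeId, id -> new CompletableFuture<>());

        writer.submit(() -> {
            try {
                writeToDisk(typeId, meta); // the only slow part, now off the discovery thread
                fut.complete(null);
            }
            catch (Throwable t) {
                fut.completeExceptionally(t);
            }
        });
    }

    /** Called from striped-pool or user threads that actually need the type. */
    public void awaitWritten(int typeId) throws ExecutionException, InterruptedException {
        written.computeIfAbsent(typeId, id -> new CompletableFuture<>()).get();
    }

    private void writeToDisk(int typeId, byte[] meta) {
        // persist the metadata to the node's binary_meta directory; omitted in this sketch
    }
}

In this reading, the starvation concern raised earlier in the thread is exactly the awaitWritten() call: it moves the blocking from the discovery thread to whichever threads need the type.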