Re: Ignite index corruption issue -> unrecoverable cluster

Denis Magda Thu, 07 Feb 2019 11:55:47 -0800

Stan, great, thanks for sharing the knowledge!

Prachi, could you please document this on readme and share the docs in the
thread?
https://issues.apache.org/jira/browse/IGNITE-11252


--
Denis Magda


On Thu, Feb 7, 2019 at 6:33 AM Stanislav Lukyanov <stanlukya...@gmail.com>
wrote:

> Denis,
>
> When an index is corrupted, you just need to remove the index.bin file of
> the affected cache.
> After that, when the node starts it will rebuild the indexes.
> The performance of SQL queries will be low until the index is rebuilt,
> so you need to be cautious.
>
> The main problem is recognizing that the indexes are corrupted.
> Usually one needs to analyze the exception stack trace to find this out,
> and it requires some familiarity with the Ignite code base.
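
A first pass over the log can at least be automated. The sketch below is only an illustration: it scans log lines for CorruptedTreeException, the exception class that appears in the stack trace quoted further down in this thread. The function name and the default signature tuple are assumptions made for the example; a hit is a hint to look closer, not proof of index corruption, and other corruption cases may surface under different exceptions.

```python
def find_corruption_lines(log_lines, signatures=("CorruptedTreeException",)):
    """Return (line_number, line) pairs that mention any of the signatures.

    `signatures` is an assumed, non-exhaustive list; extend it with whatever
    exception names turn out to matter in your own logs.
    """
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if any(signature in line for signature in signatures):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Any iterable of lines works, so an open log file can be passed directly; the matching lines still need to be read by someone who can interpret the stack trace.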
>
> The TODO lists I can come up with are:
>
> # Recovering from an index corruption
> ## Applicable if
> It is known that an index of a cache is corrupted, but the main data
> (partition files and WAL) is fine.
>
> ## Steps to recover
> 1. Stop the node
> 2. Delete index.bin of the affected caches (path is
> db/<consistent_id>/cache-<cache_name>/index.bin)
> 3. Start the node
> - Note: At this point the node is active in the cluster but doesn’t have
> indexes.
> It means that it serves SQL queries but their performance can be low.
> Avoid running SQL queries on large tables at this point.
> 4. Wait for message “Finished indexes rebuilding for cache <cache_name>”
> in the Ignite log
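
Steps 2 and 4 above could be scripted roughly as follows. This is a minimal sketch under stated assumptions: `storage_root` is assumed to be the directory that contains `db/`, the function names are hypothetical, and the log check assumes the message and the cache name appear on the same log line. It defaults to a dry run so nothing is deleted by accident; the node must already be stopped (step 1) before running it for real.

```python
from pathlib import Path

REBUILD_MARKER = "Finished indexes rebuilding for cache"


def index_file(storage_root, consistent_id, cache_name):
    """Build the index.bin path from the layout quoted in step 2."""
    return Path(storage_root) / "db" / consistent_id / f"cache-{cache_name}" / "index.bin"


def remove_index_files(storage_root, consistent_id, cache_names, dry_run=True):
    """Return the index.bin files that exist; delete them unless dry_run is set."""
    found = []
    for name in cache_names:
        path = index_file(storage_root, consistent_id, name)
        if path.is_file():
            found.append(path)
            if not dry_run:
                path.unlink()
    return found


def rebuild_finished(log_text, cache_name):
    """True if some log line has the step 4 message together with the cache name."""
    return any(
        REBUILD_MARKER in line and cache_name in line
        for line in log_text.splitlines()
    )
```

Running `remove_index_files(...)` once with the default `dry_run=True` lists what would be removed; passing `dry_run=False` performs the deletion.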
>
> # Recovering from a persistent storage corruption
> ## Applicable if
> A part of the persistent storage (partition files, checkpoint markers or
> WAL) was corrupted
> and there is no other way to recover it, but there are healthy copies of
> all data on other nodes.
>
> ## Steps to recover
> 1. Stop the node
> 2. Delete all persistence files of the node (best to clear Ignite working
> directory, storage directory, WAL and WAL archive directories)
> 3. Make sure consistentId is explicitly set in the configuration of the
> node
> - If it isn’t, look up the generated consistentId using control.sh and set
> it explicitly in the config or via IGNITE_CONSISTENT_ID (2.8+ only)
> 4. Start the node
> 5. Wait for messages <Finished rebalancing cache> for all caches
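
Steps 2 and 3 of this procedure could be combined into a guarded cleanup like the sketch below. Everything here is an assumption made for illustration: the function names are hypothetical, the directories are passed in by the caller because their locations depend on the node's configuration, and the consistentId check is a plain text heuristic for a Spring XML property named `consistentId` (it does not cover the IGNITE_CONSISTENT_ID variable). The sketch runs the check from step 3 before deleting anything, a deliberately conservative reordering.

```python
import shutil
from pathlib import Path


def has_explicit_consistent_id(config_xml):
    """Heuristic: the Spring XML text sets a property named consistentId."""
    return 'name="consistentId"' in config_xml


def clear_persistence(directories, config_xml, dry_run=True):
    """Return the directories that exist; remove them unless dry_run is set.

    Refuses to do anything if consistentId does not look explicitly configured.
    """
    if not has_explicit_consistent_id(config_xml):
        raise ValueError("consistentId is not set explicitly; fix the config first")
    existing = [Path(d) for d in directories if Path(d).is_dir()]
    if not dry_run:
        for directory in existing:
            shutil.rmtree(directory)
    return existing
```

As with the index procedure, a dry run first shows which directories would be removed. This is only safe when healthy copies of all data exist on other nodes, as the "Applicable if" section states.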
>
>
> We could have more fine-grained ways to handle data corruption once we
> address issues from the
> “Starting with missing PDS pieces” thread, create a WAL and/or partition
> file recovery tool,
> and allow the WAL to contain records for a missing cache (say, when we have
> deleted the corrupted files of a single cache), etc.
>
> Stan
>
> From: Denis Magda
> Sent: February 7, 2019, 3:12
> To: dev; Stanislav Lukyanov
> Subject: Re: Ignite index corruption issue -> unrecoverable cluster
>
> Stan,
>
> Thanks for starting the "Starting with missing PDS pieces" thread, which
> promises to embed usability changes into the source code. In the meantime,
> could you propose a TODO list for recovering from index corruption and
> similar scenarios? I know that you're experienced in that, and it would be
> great to document the procedures until the code is modified.
>
> -
> Denis
>
>
> On Wed, Jan 30, 2019 at 1:02 PM Denis Magda <dma...@apache.org> wrote:
>
> > Dmitry,
> >
> > Thanks, the FAQ section might make sense but, as practice shows, it's
> > hard to get recommendations even for questions like this one :)
> >
> > Ignite experts, please chime in: the project fails with data corruption
> > periodically, and we have to explain how to work around it until an issue
> > is resolved.
> >
> > -
> > Denis
> >
> >
> > On Wed, Jan 30, 2019 at 11:55 AM Dmitriy Pavlov <dpav...@apache.org>
> > wrote:
> >
> >> Denis,
> >>
> >> BTW one case of corruption is fixed here,
> >> https://issues.apache.org/jira/browse/IGNITE-11030
> >>
> >> I still need a review from Ignite Native Persistence Experts. I feel it is
> >> really important to apply such fixes.
> >>
> >> Sincerely,
> >> Dmitriy Pavlov
> >>
> >> On Thu, Jan 24, 2019 at 16:29, Dmitriy Pavlov <dpav...@apache.org> wrote:
> >>
> >> > Denis, what do you think about a more general idea of creating FAQs for
> >> > Ignite users?
> >> >
> >> > What if experts place their answers on a wiki page once and then
> >> > develop answers for frequent problems?
> >> >
> >> > Then, before diving into researching each problem, experienced community
> >> > members would ask users to check the FAQ first.
> >> >
> >> > Sincerely,
> >> > Dmitriy Pavlov
> >> >
> >> > P.S. Here is an article that the Apache guides reference:
> >> > http://www.catb.org/~esr/faqs/smart-questions.html - one of the
> >> > actions required from users is to search for information.
> >> >
> >> > On Thu, Jan 24, 2019 at 01:55, Denis Magda <dma...@gridgain.com> wrote:
> >> >
> >> >> Another data/index corruption issue:
> >> >>
> >> >> https://stackoverflow.com/questions/54295401/ignite-transaction-failure-not-recoverable-with-persistance
> >> >>
> >> >> It's suggested to clean index.bin to be able to recover the cluster.
> >> >> Folks,
> >> >> let's prepare a list of actions to do if a cluster becomes unrecoverable
> >> >> due to a data or index corruption issue. What should we do depending on
> >> >> the exception:
> >> >>
> >> >>    - Remove index.bin if X or Y or Z
> >> >>    - etc
> >> >>
> >> >>
> >> >> --
> >> >> Denis Magda
> >> >>
> >> >>
> >> >> On Sun, Dec 30, 2018 at 10:06 AM Denis Magda <dma...@gridgain.com> wrote:
> >> >>
> >> >> > Ignite SQL and memory experts,
> >> >> >
> >> >> > The following issue was reported on SO:
> >> >> >
> >> >> > https://stackoverflow.com/questions/53979106/ignite-corruptedtreeexception-leads-to-cluster-failure
> >> >> >
> >> >> > The stack trace starts with the message below; more details are in
> >> >> > that forum:
> >> >> >
> >> >> > [SEVERE][data-streamer-stripe-2-#15][GridDhtAtomicCache] <MyCache>
> >> >> > Unexpected exception during cache update
> >> >> > org.h2.message.DbException: General error: "class
> >> >> > org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
> >> >> > Runtime failure on row: Row@75ab6623[ key: CacheKey [idHash=242632156,
> >> >> > hash=-841684964, parentId=-8607237606486310912, hour=9,
> >> >> > id=-8607237528489033728, date=2018-09-09 00:00:00.0], val: CacheValue
> >> >> > [idHash=843227122, hash=-801894604, ....
> >> >> >
> >> >> > Let's see if it's addressed in the latest release. Also, the user
> >> >> > asked a reasonable question - how to recover? Yes, it's possible to
> >> >> > use GridGain snapshots if they were created beforehand, but I remember
> >> >> > some discussions around a recovery tool.
> >> >> >
> >> >> > --
> >> >> > Denis
> >> >> >
> >> >>
> >> >
> >>
> >
>
>
