Stan, great, thanks for sharing the knowledge! Prachi, could you please document this on readme and share the docs in the thread? https://issues.apache.org/jira/browse/IGNITE-11252
--
Denis Magda

On Thu, Feb 7, 2019 at 6:33 AM Stanislav Lukyanov <stanlukya...@gmail.com> wrote:

> Denis,
>
> When an index is corrupted you just need to remove the index.bin file of
> the affected cache.
> After that, when the node starts it will rebuild the indexes.
> The performance of the SQL queries will be low until the index is rebuilt,
> so you need to be cautious.
>
> The main problem is to understand that the indexes are corrupted.
> Usually one needs to analyze the exception stack trace to find this out,
> and it requires some familiarity with the Ignite code base.
>
> The TODO lists I can come up with are:
>
> # Recovering from an index corruption
> ## Applicable if
> It is known that an index of a cache is corrupted, but the main data
> (partition files and WAL) is fine.
>
> ## Steps to recover
> 1. Stop the node
> 2. Delete index.bin of the affected caches (path is
>    db/<consistent_id>/cache-<cache_name>/index.bin)
> 3. Start the node
>    - Note: At this point the node is active in the cluster but doesn’t
>      have indexes.
>      It means that it serves SQL queries but their performance can be low.
>      Avoid running SQL queries on large tables at this point
> 4. Wait for message “Finished indexes rebuilding for cache <cache_name>”
>    in the Ignite log
>
> # Recovering from a persistent storage corruption
> ## Applicable if
> A part of the persistent storage (partition files, checkpoint markers or
> WAL) was corrupted and there is no other way to recover it, but there are
> healthy copies of all data on other nodes.
>
> ## Steps to recover
> 1. Stop the node
> 2. Delete all persistence files of the node (best to clear Ignite working
>    directory, storage directory, WAL and WAL archive directories)
> 3. Make sure consistentId is explicitly set in the configuration of the
>    node
>    - If it isn’t, look up the generated consistentId using control.sh and
>      set it explicitly in the config or via IGNITE_CONSISTENT_ID (2.8+ only)
> 4. Start the node
> 5. Wait for messages <Finished rebalancing cache> for all caches
>
>
> We could have more fine-grained ways to handle data corruption once we
> address issues from the “Starting with missing PDS pieces” thread, create
> a WAL and/or partition files recovery tool, allow records in the WAL for
> a missing cache (say, we deleted corrupted files of a single cache), etc.
>
> Stan
>
> From: Denis Magda
> Sent: February 7, 2019, 3:12
> To: dev; Stanislav Lukyanov
> Subject: Re: Ignite index corruption issue -> unrecoverable cluster
>
> Stan,
>
> Thanks for starting the "Starting with missing PDS pieces" thread, which
> promises to embed usability changes into the source code. In the meantime,
> could you propose a TODO list for recovering from index corruption and
> similar scenarios? I know that you're experienced in that and it will be
> great to document the procedures until the code is modified.
>
> -
> Denis
>
>
> On Wed, Jan 30, 2019 at 1:02 PM Denis Magda <dma...@apache.org> wrote:
>
> > Dmitry,
> >
> > Thanks, the FAQ section might make sense but, as the practice shows, it's
> > hard to get recommendations even for questions like this one :)
> >
> > Ignite experts, please chime in, the project fails with data corruption
> > periodically and we have to explain how to work around it until an issue
> > is resolved.
> >
> > -
> > Denis
> >
> >
> > On Wed, Jan 30, 2019 at 11:55 AM Dmitriy Pavlov <dpav...@apache.org>
> > wrote:
> >
> >> Denis,
> >>
> >> BTW one case of corruption is fixed here,
> >> https://issues.apache.org/jira/browse/IGNITE-11030
> >>
> >> I still need a review from Ignite Native Persistence Experts. I feel it
> >> is really important to apply such fixes.
> >>
> >> Sincerely,
> >> Dmitriy Pavlov
> >>
> >> Thu, Jan 24, 2019 at 16:29, Dmitriy Pavlov <dpav...@apache.org>:
> >>
> >> > Denis, What do you think about a more general idea of creating FAQs
> >> > for Ignite users?
> >> >
> >> > What if experts placed their answers on a wiki page once and then
> >> > developed answers for frequent problems?
> >> >
> >> > And before diving into researching each problem, experienced community
> >> > members will ask users to check the FAQ first?
> >> >
> >> > Sincerely,
> >> > Dmitriy Pavlov
> >> >
> >> > P.S. here is an article that Apache guides reference:
> >> > http://www.catb.org/~esr/faqs/smart-questions.html - one of the
> >> > actions required of users is to search for information.
> >> >
> >> > Thu, Jan 24, 2019 at 01:55, Denis Magda <dma...@gridgain.com>:
> >> >
> >> >> Another data/index corruption issue:
> >> >>
> >> >> https://stackoverflow.com/questions/54295401/ignite-transaction-failure-not-recoverable-with-persistance
> >> >>
> >> >> It's suggested to clean index.bin to be able to recover the cluster.
> >> >> Folks, let's prepare a list of actions to do if a cluster becomes
> >> >> unrecoverable due to a data or index corruption issue. What should we
> >> >> do depending on an exception:
> >> >>
> >> >> - Remove index.bin if X or Y or Z
> >> >> - etc
> >> >>
> >> >>
> >> >> --
> >> >> Denis Magda
> >> >>
> >> >>
> >> >> On Sun, Dec 30, 2018 at 10:06 AM Denis Magda <dma...@gridgain.com>
> >> >> wrote:
> >> >>
> >> >> > Ignite SQL and memory experts,
> >> >> >
> >> >> > The following issue was reported on SO:
> >> >> >
> >> >> > https://stackoverflow.com/questions/53979106/ignite-corruptedtreeexception-leads-to-cluster-failure
> >> >> >
> >> >> > The stack trace starts with the message below, more details are in
> >> >> > that forum:
> >> >> >
> >> >> > [SEVERE][data-streamer-stripe-2-#15][GridDhtAtomicCache] <MyCache>
> >> >> > Unexpected exception during cache update
> >> >> > org.h2.message.DbException: General error: "class
> >> >> > org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
> >> >> > Runtime failure on row: Row@75ab6623[ key: CacheKey [idHash=242632156,
> >> >> > hash=-841684964, parentId=-8607237606486310912, hour=9,
> >> >> > id=-8607237528489033728, date=2018-09-09 00:00:00.0], val: CacheValue
> >> >> > [idHash=843227122, hash=-801894604, ....
> >> >> >
> >> >> > Let's see if it's addressed in the latest release. Also, the user
> >> >> > asked a reasonable question - how to recover? Yes, it's possible to
> >> >> > use snapshots of GridGain if they were created beforehand but I
> >> >> > remember some discussions around a recovery tool.
> >> >> >
> >> >> > --
> >> >> > Denis
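
The index recovery steps from Stan's message in this thread (stop the node, remove index.bin, start the node, wait for the rebuild message) could be supported by a small helper like the one below. This is only a sketch under assumptions, not an Ignite tool: the function names, the `storage_root` parameter (the directory that contains `db/`), and the choice to rename index.bin to index.bin.bak instead of deleting it are all hypothetical. The consistent ID is expected exactly as it appears in the directory name on disk, and the log check only looks for the message text quoted in the thread followed by the cache name, since the exact log line format is not given here.

```python
import re
from pathlib import Path

# Message text quoted in the thread; the rest of the log line format is assumed unknown.
REBUILD_MARKER = "Finished indexes rebuilding for cache"


def index_file(storage_root, consistent_id, cache_name):
    """Build <storage_root>/db/<consistent_id>/cache-<cache_name>/index.bin."""
    return Path(storage_root) / "db" / consistent_id / f"cache-{cache_name}" / "index.bin"


def set_aside_indexes(storage_root, consistent_id, cache_names, dry_run=True):
    """Return the index.bin files found for the given caches.

    With dry_run=False each file is renamed to index.bin.bak (a hypothetical
    safety choice; the thread simply says to delete it). Run this only while
    the node is stopped.
    """
    found = []
    for name in cache_names:
        path = index_file(storage_root, consistent_id, name)
        if path.is_file():
            found.append(path)
            if not dry_run:
                path.rename(path.with_name("index.bin.bak"))
    return found


def caches_still_rebuilding(log_lines, cache_names):
    """Return the caches with no 'Finished indexes rebuilding' line yet."""
    pending = set(cache_names)
    for line in log_lines:
        _, marker, tail = line.partition(REBUILD_MARKER)
        if not marker:
            continue
        for name in list(pending):
            # Match the cache name as a whole token after the marker text.
            if re.search(rf"(?<![\w.-]){re.escape(name)}(?![\w.-])", tail):
                pending.discard(name)
    return pending
```

A typical use would be to call `set_aside_indexes(...)` with the default `dry_run=True` first to see which files would be touched, repeat with `dry_run=False` once the node is stopped, and after the restart feed the Ignite log lines to `caches_still_rebuilding(...)` until it returns an empty set before running SQL queries on large tables.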