Do we have a real reproducer for thread unsafe behavior, which causes data inconsistency ? AFAIK all required parts of txn processing are already properly linearized, and other parts are ready to be processed in parallel (like txn recovery)
пн, 19 июн. 2023 г. в 22:25, Anton Vinogradov <a...@apache.org>: > Folks, idea to synchronize all methods unfortunately failed :( > 1) TxState has 4 implementations, a lot of changes are required > 2) IgniteTxEntry is not synchronized as well ... > 3) And even IgniteInternalTx implementations (10000+ lines) are not > synchronized as well ... > It seems to be unreal to refactor this properly. > > Also, the methods synchronization is just provides current data read > guarantee, not a thread safety. > > If I understand correctly, the only proper fix is to keep everything > unsynchronized, but guarantee every tx processing only at one thread at the > same time + data visibility. > Possible fix is to process same tx at the same thread each time, but we > already found that tx can be created at the user thread, and can be, for > example, suspended or committed from the user thread again. So, seems, it's > impossible to provide such guatantee. > > But, the possible solution is to wrap each tx processing with some lock or > synchronize section, like: > synchronize(tx){ > val aaa = tx.getAAA(); > tx.updateXXX(); > tx.updateYYY(); > } > This will guarantee fields visibility as we as strict tx processing, step > by step. > Single lock/synchronize should not cause the perfomance problem, I think. > > But, this may cause a deadlock it case some such executions will require > another at the other threads, but related to the same tx. > > And the current question is: > Do we expect that Ignite is not required to process something related to > the same tx at different threads simultaneously? > > On Wed, May 24, 2023 at 4:11 PM Anton Vinogradov <a...@apache.org> wrote: > > > >> could you please point to this at code, it may be not needed after the > > fix and can bring the performance growth. > > BTW, found the trick. > > Still necessary to keep copying. > > > > On Wed, May 24, 2023 at 2:44 PM Anton Vinogradov <a...@apache.org> wrote: > > > >> Andrey, > >> > >> Thanks for the tip. > >> Of course, I'll benchmark the fix before the merge. > >> > >> According to the comment, > >> >> and return entries copy form tx state in order to avoid > >> ConcurrentModificationException. > >> , could you please point to this at code, it may be not needed after the > >> fix and can bring the performance growth. > >> > >> >> I believe that mentioned invariants were broken later but ... > >> >> ... this state should be accessed mostly from one thread > >> Code was never designed to fit this statement. > >> For example, the most of cctx.tm().newTx(...) calls dated by 2014 > (which > >> means "before 2014"). > >> Currently, allwost all tx creations happen not at the striped pool as > >> well as tx preparations. > >> Only 1/2 of the messages now striped correctly. > >> Of course, it's theoretically possible to process tx at the same thread > >> each time, but, global refactoring with a performance drop is required > in > >> this case, I think. > >> > >> My current Idea is to finish synchronization started by you. > >> I've pepared the fix [1], got the visa and going to benchmark it. > >> > >> [1] https://github.com/apache/ignite/pull/10732/files > >> > >> On Tue, May 23, 2023 at 8:54 PM Andrey Gura <ag...@apache.org> wrote: > >> > >>> Please, run benchmarks after fixing the problem. E.g. replacing HashMap > >>> to > >>> ConcurrentHashMap can significantly affect performance. > >>> > >>> See for example comments to IGNITE-2968 issue ( > >>> > >>> > https://issues.apache.org/jira/browse/IGNITE-2968?focusedCommentId=15415170&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15415170 > >>> ). > >>> > >>> I believe that mentioned invariants were broken later but in general I > >>> agree with Alexey, this state should be accessed mostly from one > thread. > >>> Exceptional cases should be synchronized or redesigned. E.g. if metrics > >>> read a transaction's state I prefer remove these metrics or ignore some > >>> inaccuracy then performance reducing. > >>> > >>> > >>> > >>> > >>> On Fri, May 19, 2023 at 7:32 PM Ivan Daschinsky <ivanda...@gmail.com> > >>> wrote: > >>> > >>> > >> Tx processing is supposed to be thread bound by hashing the > version > >>> to a > >>> > partition > >>> > This invariant is violated in many places. The most notorious example > >>> is tx > >>> > recovery. > >>> > > >>> > Another example: I just added an assertion that checks tId of a > creator > >>> > thread with tId of an accessor thread. > >>> > TxMultiCacheAsyncOpsTest fails immediately on processing of a tx > >>> prepare > >>> > request. Looks like a big issue, IMO > >>> > > >>> > > >>> > пт, 19 мая 2023 г. в 19:11, Alexei Scherbakov < > >>> > alexey.scherbak...@gmail.com > >>> > >: > >>> > > >>> > > Tx processing is supposed to be thread bound by hashing the version > >>> to a > >>> > > partition, see methods like [1] > >>> > > If for some cases this invariant is broken, this should be fixed. > >>> > > > >>> > > [1] > >>> > > > >>> > > >>> > org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareRequest#partition > >>> > > > >>> > > пт, 19 мая 2023 г. в 15:57, Anton Vinogradov <a...@apache.org>: > >>> > > > >>> > > > Igniters, > >>> > > > > >>> > > > My team was faced with node failure [1] because of non-threadsafe > >>> > > > collections usage. > >>> > > > > >>> > > > IgniteTxStateImpl's fields > >>> > > > - activeCacheIds > >>> > > > - txMap > >>> > > > are not thread safe, but are widely used from different threads > >>> without > >>> > > the > >>> > > > proper sync. > >>> > > > > >>> > > > The main question is ... why? > >>> > > > > >>> > > > According to the research, we have no guarantee that tx will be > >>> > processed > >>> > > > at the single thread. > >>> > > > It may be processed at the several! threads at the striped pool > >>> and at > >>> > > the > >>> > > > tx recovery thread as well. > >>> > > > > >>> > > > Thread at the striped pool will be selected by the message's > >>> > partition() > >>> > > > method, which can be calculated like this: > >>> > > > - return keys != null && !keys.isEmpty() ? > keys.get(0).partition() > >>> : > >>> > -1; > >>> > > > - return U.safeAbs(version().hashCode()); > >>> > > > - ..., > >>> > > > so, no guarantee it is processed at the same thread (proven by > >>> tests). > >>> > > > > >>> > > > Seems, we MAY lose the data. > >>> > > > For example, ignoring some or all keys from txMap at commit. > >>> > > > > >>> > > > If anyone knows why this is not a problem (I mean sync lack, not > >>> data > >>> > > loss) > >>> > > > or how to fix this properly, please give me a hint, or correct my > >>> > > > conclusions if necessary. > >>> > > > > >>> > > > [1] https://issues.apache.org/jira/browse/IGNITE-19445 > >>> > > > > >>> > > > >>> > > > >>> > > -- > >>> > > > >>> > > Best regards, > >>> > > Alexei Scherbakov > >>> > > > >>> > > >>> > > >>> > -- > >>> > Sincerely yours, Ivan Daschinskiy > >>> > > >>> > >> > -- Best regards, Alexei Scherbakov