[
https://issues.apache.org/jira/browse/IGNITE-12793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080557#comment-17080557
]
Sergey Chugunov commented on IGNITE-12793:
------------------------------------------
[~macrergate],
I reproduced the issue with your test, it seems this deadlock isn't related to
any recent changes and was always around since moving binary metadata exchange
to discovery protocol.
It seems to me that the best solution here is to introduce additional
GridIoPolicy with semantics "always create separate thread for message handler
execution".
Given that exchange of MetadataRequest/Response messages shouldn't be very
intensive this solution won't create too much pressure on client side.
What do you think?
> Deadlock in the System Pool on Metadata processing
> --------------------------------------------------
>
> Key: IGNITE-12793
> URL: https://issues.apache.org/jira/browse/IGNITE-12793
> Project: Ignite
> Issue Type: Bug
> Affects Versions: 2.8, 2.7.6
> Reporter: Sergey Kosarev
> Assignee: Sergey Chugunov
> Priority: Major
> Fix For: 2.9
>
> Attachments: ignite-12793-threaddump.txt
>
>
> I've recently tried to apply Ilya's idea
> (https://issues.apache.org/jira/browse/IGNITE-12663) of minimizing thread
> pools and tried to set system pool to 3 in my own tests.
> It caused deadlock on a client node and I think it can happen not only on
> such small pool values.
> Details are following:
> I'm not using persistence currently (if it matters).
> On the client note I use ignite compute to call a job on every server node
> (there are 3 server nodes in the tests).
> Then I've found in logs:
> {{[10:55:21] : [Step 1/1] [2020-03-13 10:55:21,773]
> {grid-timeout-worker-#8}
> [WARN] [o.a.i.i.IgniteKernal] - Possible thread pool starvation detected (no
> task completed in last 30000ms, is system thread pool size large enough?)
> [10:55:21] : [Step 1/1] ^-- System thread pool [active=3, idle=0, qSize=9]}}
> I see in threaddumps that all 3 system pool workers do the same - processing
> of job responses:
> {{ "sys-#26" #605 daemon prio=5 os_prio=0 tid=0x0000000064a0a800 nid=0x1f34
> waiting on condition [0x000000007b91d000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
> at
> org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177)
> at
> org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:140)
> at
> org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryProcessorImpl.metadata(CacheObjectBinaryProcessorImpl.java:749)
> at
> org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryProcessorImpl$1.metadata(CacheObjectBinaryProcessorImpl.java:250)
> at
> org.apache.ignite.internal.binary.BinaryContext.metadata(BinaryContext.java:1169)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.getOrCreateSchema(BinaryReaderExImpl.java:2005)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.<init>(BinaryReaderExImpl.java:285)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.<init>(BinaryReaderExImpl.java:184)
> at
> org.apache.ignite.internal.binary.BinaryUtils.doReadObject(BinaryUtils.java:1797)
> at
> org.apache.ignite.internal.binary.BinaryUtils.deserializeOrUnmarshal(BinaryUtils.java:2160)
> at
> org.apache.ignite.internal.binary.BinaryUtils.doReadCollection(BinaryUtils.java:2091)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize0(BinaryReaderExImpl.java:1914)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize(BinaryReaderExImpl.java:1714)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.readField(BinaryReaderExImpl.java:1982)
> at
> org.apache.ignite.internal.binary.BinaryFieldAccessor$DefaultFinalClassAccessor.read0(BinaryFieldAccessor.java:702)
> at
> org.apache.ignite.internal.binary.BinaryFieldAccessor.read(BinaryFieldAccessor.java:187)
> at
> org.apache.ignite.internal.binary.BinaryClassDescriptor.read(BinaryClassDescriptor.java:887)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize0(BinaryReaderExImpl.java:1762)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize(BinaryReaderExImpl.java:1714)
> at
> org.apache.ignite.internal.binary.BinaryUtils.doReadObject(BinaryUtils.java:1797)
> at
> org.apache.ignite.internal.binary.BinaryUtils.deserializeOrUnmarshal(BinaryUtils.java:2160)
> at
> org.apache.ignite.internal.binary.BinaryUtils.doReadCollection(BinaryUtils.java:2091)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize0(BinaryReaderExImpl.java:1914)
> at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize(BinaryReaderExImpl.java:1714)
> at
> org.apache.ignite.internal.binary.GridBinaryMarshaller.deserialize(GridBinaryMarshaller.java:306)
> at
> org.apache.ignite.internal.binary.BinaryMarshaller.unmarshal0(BinaryMarshaller.java:100)
> at
> org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:80)
> at
> org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10493)
> at
> org.apache.ignite.internal.processors.task.GridTaskWorker.onResponse(GridTaskWorker.java:828)
> at
> org.apache.ignite.internal.processors.task.GridTaskProcessor.processJobExecuteResponse(GridTaskProcessor.java:1134)
> }}
> As I found analyzing this stack trace, unmarshalling a user object the first
> time(per type) causes Binary metadata request (despite I've registered this
> type in BinaryConfiguration.setTypeConfiguration)
> And all this futures will be completed after consequent
> MetadataResponseMessage will be received and processed on the client node.
> But MetadataResponseMessage(GridTopic.TOPIC_METADATA_REQ) is also processing
> in system pool. (I see that method GridIoManager#processRegularMessage routes
> it to the System Pool).
> So it causes deadlock as the Sytem Pool is already full.
> *Update*
> I reproduced this on master on standard system pool size(see comments). Full
> threaddump is attached.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)