Gopal V created HDFS-8070:
-----------------------------

             Summary: ShortCircuitShmManager goes into dead mode, stopping all operations
                 Key: HDFS-8070
                 URL: https://issues.apache.org/jira/browse/HDFS-8070
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: caching
    Affects Versions: 2.8.0
            Reporter: Gopal V


The HDFS ShortCircuitShm layer keeps the task locked up during multi-threaded 
split generation.

I hit this immediately after I upgraded the data, so I wonder if the 
ShortCircuitShm wire protocol has trouble when a 2.8.0 DN talks to a 2.7.0 
client?

{code}
2015-04-06 00:04:30,780 INFO [ORC_GET_SPLITS #3] orc.OrcInputFormat: ORC pushdown predicate: leaf-0 = (IS_NULL ss_sold_date_sk)
expr = (not leaf-0)
2015-04-06 00:04:30,781 ERROR [ShortCircuitCache_SlotReleaser] shortcircuit.ShortCircuitCache: ShortCircuitCache(0x29e82045): failed to release short-circuit shared memory slot Slot(slotIdx=2, shm=DfsClientShm(a86ee34576d93c4964005d90b0d97c38)) by sending ReleaseShortCircuitAccessRequestProto to /grid/0/cluster/hdfs/dn_socket.  Closing shared memory segment.
java.io.IOException: ERROR_INVALID: there is no shared memory segment registered with shmId a86ee34576d93c4964005d90b0d97c38
        at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache$SlotReleaser.run(ShortCircuitCache.java:208)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2015-04-06 00:04:30,781 INFO [ORC_GET_SPLITS #5] orc.OrcInputFormat: ORC pushdown predicate: leaf-0 = (IS_NULL ss_sold_date_sk)
expr = (not leaf-0)
2015-04-06 00:04:30,781 WARN [ShortCircuitCache_SlotReleaser] shortcircuit.DfsClientShmManager: EndpointShmManager(172.19.128.60:50010, parent=ShortCircuitShmManager(5e763476)): error shutting down shm: got IOException calling shutdown(SHUT_RDWR)
java.nio.channels.ClosedChannelException
        at org.apache.hadoop.util.CloseableReferenceCount.reference(CloseableReferenceCount.java:57)
        at org.apache.hadoop.net.unix.DomainSocket.shutdown(DomainSocket.java:387)
        at org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.shutdown(DfsClientShmManager.java:378)
        at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache$SlotReleaser.run(ShortCircuitCache.java:223)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
2015-04-06 00:04:30,783 INFO [ORC_GET_SPLITS #7] orc.OrcInputFormat: ORC pushdown predicate: leaf-0 = (IS_NULL cs_sold_date_sk)
expr = (not leaf-0)
2015-04-06 00:04:30,785 ERROR [ShortCircuitCache_SlotReleaser] shortcircuit.ShortCircuitCache: ShortCircuitCache(0x29e82045): failed to release short-circuit shared memory slot Slot(slotIdx=4, shm=DfsClientShm(a86ee34576d93c4964005d90b0d97c38)) by sending ReleaseShortCircuitAccessRequestProto to /grid/0/cluster/hdfs/dn_socket.  Closing shared memory segment.
java.io.IOException: ERROR_INVALID: there is no shared memory segment registered with shmId a86ee34576d93c4964005d90b0d97c38
        at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache$SlotReleaser.run(ShortCircuitCache.java:208)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}

Looks like a double free-fd condition?

{code}
2015-04-02 18:58:47,653 [DataXceiver for client unix:/grid/0/cluster/hdfs/dn_socket [Passing file descriptors for block BP-942051088-172.18.0.41-1370508013893:blk_1076973408_1099515627985]] INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unregistering SlotId(3bd7fd9aed791e95acfb5034e6617d83:0) because the requestShortCircuitFdsForRead operation failed.
2015-04-02 18:58:47,653 [DataXceiver for client unix:/grid/0/cluster/hdfs/dn_socket [Passing file descriptors for block BP-942051088-<ip>-1370508013893:blk_1076973408_1099515627985]] INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_FDS, blockid: 1076973408, srvID: ba7b6f19-47e0-4b86-af50-23981649318c, success: false
2015-04-02 18:58:47,654 [DataXceiver for client unix:/grid/0/cluster/hdfs/dn_socket [Passing file descriptors for block BP-942051088-172.18.0.41-1370508013893:blk_1076973408_1099515627985]] ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: cn060-10.l42scl.hortonworks.com:50010:DataXceiver error processing REQUEST_SHORT_CIRCUIT_FDS operation  src: unix:/grid/0/cluster/hdfs/dn_socket dst: <local>
java.io.EOFException
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitFds(DataXceiver.java:352)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitFds(Receiver.java:187)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:89)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
        at java.lang.Thread.run(Thread.java:745)
{code}
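
To make the double-release guess above concrete, here is a minimal toy model of the sequence it implies: the server side drops a shared memory segment, and a later slot release for that segment is rejected with ERROR_INVALID. This is only a sketch under that assumption. ToyShmRegistry, ToyDoubleRelease and tryRelease are hypothetical names for illustration, not HDFS classes, and whether the DataNode really unregisters the segment before the client's SlotReleaser runs is exactly what still needs investigating.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the server-side table of registered segments.
// This is NOT Hadoop code; it only models the suspected ordering problem.
class ToyShmRegistry {
    private final Set<String> registered = new HashSet<>();

    synchronized void register(String shmId) {
        registered.add(shmId);
    }

    // Models the server dropping a segment, e.g. after a failed fd request.
    synchronized void unregister(String shmId) {
        registered.remove(shmId);
    }

    // Rejects a release for a segment that is no longer registered.
    synchronized void releaseSlot(String shmId, int slotIdx) throws IOException {
        if (!registered.contains(shmId)) {
            throw new IOException("ERROR_INVALID: there is no shared memory "
                + "segment registered with shmId " + shmId);
        }
        // A real implementation would mark slot slotIdx as free here.
    }
}

class ToyDoubleRelease {
    // Returns "ok" on success, or the error message on failure.
    static String tryRelease(ToyShmRegistry registry, String shmId, int slotIdx) {
        try {
            registry.releaseSlot(shmId, slotIdx);
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        ToyShmRegistry registry = new ToyShmRegistry();
        String shmId = "a86ee34576d93c4964005d90b0d97c38";
        registry.register(shmId);
        System.out.println(tryRelease(registry, shmId, 2)); // segment still registered
        registry.unregister(shmId);                          // server drops the segment
        System.out.println(tryRelease(registry, shmId, 4)); // late release is rejected
    }
}
```

In this model every release that arrives after the unregister fails the same way, which would match seeing the error repeated for slotIdx=2 and slotIdx=4 on the same shmId.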

Investigating more, since the exact exception from the DataNode call is not 
logged.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
