[jira] [Created] (HDFS-10861) Refactor StatefulStripeReader and PositionStripeReader, use ECChunk version decode API

2016-09-13 Thread SammiChen (JIRA)
SammiChen created HDFS-10861:


 Summary: Refactor StatefulStripeReader and PositionStripeReader, 
use ECChunk version decode API
 Key: HDFS-10861
 URL: https://issues.apache.org/jira/browse/HDFS-10861
 Project: Hadoop HDFS
  Issue Type: Task
Reporter: SammiChen
Assignee: SammiChen


Refactor StatefulStripeReader and PositionStripeReader to use the ECChunk 
version of the decode API. After this refactoring, the code will be very 
close to the ideal state for the next step: employing the ErasureCoder API 
instead of the RawErasureCoder API.
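
For context, here is a minimal sketch of the direction this refactoring 
takes. It is illustrative only, not the actual patch: the decoder is assumed 
to be constructed elsewhere, and the buffer handling is simplified.

{noformat}
// Minimal sketch, not the actual patch: the stripe readers wrap their
// ByteBuffers in ECChunk objects and call the ECChunk overload of
// RawErasureDecoder.decode, instead of manipulating raw buffers directly.
import java.nio.ByteBuffer;
import org.apache.hadoop.io.erasurecode.ECChunk;
import org.apache.hadoop.io.erasurecode.rawcoder.RawErasureDecoder;

public class ECChunkDecodeSketch {
  static void decodeStripe(RawErasureDecoder decoder,
      ByteBuffer[] inputBuffers, int[] erasedIndexes,
      ByteBuffer[] outputBuffers) throws java.io.IOException {
    // Wrap the raw buffers; a null input marks a missing/erased unit.
    ECChunk[] inputs = new ECChunk[inputBuffers.length];
    for (int i = 0; i < inputBuffers.length; i++) {
      inputs[i] =
          inputBuffers[i] == null ? null : new ECChunk(inputBuffers[i]);
    }
    ECChunk[] outputs = new ECChunk[outputBuffers.length];
    for (int i = 0; i < outputBuffers.length; i++) {
      outputs[i] = new ECChunk(outputBuffers[i]);
    }
    // The ECChunk-based decode call replaces direct ByteBuffer handling.
    decoder.decode(inputs, erasedIndexes, outputs);
  }
}
{noformat}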



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



Apache Hadoop qbt Report: trunk+JDK8 on Linux/x86

2016-09-13 Thread Apache Jenkins Server
For more details, see 
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/163/

[Sep 12, 2016 3:26:08 PM] (raviprak) HADOOP-13587. distcp.map.bandwidth.mb is overwritten even when
[Sep 12, 2016 6:10:00 PM] (aw) HADOOP-13341. Deprecate HADOOP_SERVERNAME_OPTS; replace with
[Sep 12, 2016 10:45:26 PM] (aengineer) HDFS-10821. DiskBalancer: Report command support with multiple nodes.
[Sep 12, 2016 11:17:08 PM] (liuml07) HADOOP-13588. ConfServlet should respect Accept request header.
[Sep 12, 2016 11:40:11 PM] (jing9) HDFS-10858. FBR processing may generate incorrect
[Sep 13, 2016 4:25:06 AM] (yzhang) HDFS-10657. testAclCLI.xml setfacl test should expect mask r-x. (John
[Sep 13, 2016 5:50:03 AM] (aajisaka) HDFS-10856. Update the comment of




-1 overall


The following subsystems voted -1:
asflicense unit


The following subsystems voted -1 but were configured to be filtered/ignored:
cc checkstyle javac javadoc pylint shellcheck shelldocs whitespace


The following subsystems are considered long running:
(runtime bigger than 1h  0m  0s)
unit


Specific tests:

Failed junit tests:

   hadoop.contrib.bkjournal.TestBootstrapStandbyWithBKJM
   hadoop.hdfs.server.namenode.ha.TestInitializeSharedEdits
   hadoop.yarn.server.nodemanager.containermanager.queuing.TestQueuingContainerManager
   hadoop.yarn.server.applicationhistoryservice.webapp.TestAHSWebServices
   hadoop.yarn.server.TestMiniYarnClusterNodeUtilization
   hadoop.yarn.server.TestContainerManagerSecurity
   hadoop.contrib.bkjournal.TestBootstrapStandbyWithBKJM

   cc:
      https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/163/artifact/out/diff-compile-cc-root.txt  [4.0K]

   javac:
      https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/163/artifact/out/diff-compile-javac-root.txt  [168K]

   checkstyle:
      https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/163/artifact/out/diff-checkstyle-root.txt  [16M]

   pylint:
      https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/163/artifact/out/diff-patch-pylint.txt  [16K]

   shellcheck:
      https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/163/artifact/out/diff-patch-shellcheck.txt  [20K]

   shelldocs:
      https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/163/artifact/out/diff-patch-shelldocs.txt  [16K]

   whitespace:
      https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/163/artifact/out/whitespace-eol.txt  [12M]
      https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/163/artifact/out/whitespace-tabs.txt  [1.3M]

   javadoc:
      https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/163/artifact/out/diff-javadoc-javadoc-root.txt  [2.2M]

   unit:
      https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/163/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt  [152K]
      https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/163/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt  [36K]
      https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/163/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-applicationhistoryservice.txt  [12K]
      https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/163/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-tests.txt  [268K]
      https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/163/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-nativetask.txt  [124K]
      https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/163/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs_src_contrib_bkjournal.txt  [8.0K]

   asflicense:
      https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/163/artifact/out/patch-asflicense-problems.txt  [4.0K]

Powered by Apache Yetus 0.4.0-SNAPSHOT   http://yetus.apache.org




[jira] [Created] (HDFS-10862) Typos in 7 log messages

2016-09-13 Thread Mehran Hassani (JIRA)
Mehran Hassani created HDFS-10862:
-

 Summary: Typos in 7 log messages
 Key: HDFS-10862
 URL: https://issues.apache.org/jira/browse/HDFS-10862
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Mehran Hassani
Priority: Trivial


I am conducting research on log-related bugs. I am building a tool to fix 
repetitive yet simple patterns of bugs related to logs, and typos in log 
messages are one of the recurring patterns, so I made a tool to find typos 
in log statements (a minimal sketch of the scanning approach follows the 
list below). During my experiments, I found the following typos in Hadoop 
HDFS:

In file 
/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java,
LOG.info((success ? "S" : "Uns") + "uccessfully sent block report 0x" 
+ Long.toHexString(reportId) + ", containing " + reports.length + " storage 
report(s), of which we sent " + numReportsSent + "." + " The reports had " 
+ totalBlockCount + " total blocks and used " + numRPCs + " RPC(s). This 
took " + brCreateCost + " msec to generate and " + brSendCost + " msecs for 
RPC and NN processing." + " Got back " + ((nCmds == 0) ? "no commands" : 
((nCmds == 1) ? "one command: " + cmds.get(0) : (nCmds + " commands: " 
+ Joiner.on("; ").join(cmds)))) + "."), 
uccessfully should be successfully

In file 
/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataXceiverServer.java,
LOG.info("Balancing bandwith is " + bandwidth + " bytes/s"), 
bandwith should be bandwidth

In file 
/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeList.java,
FsDatasetImpl.LOG.info("The volume " + v + " is closed while " + "addng 
replicas, ignored."), 
addng should be adding

In file 
/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/CancelDelegationTokenServlet.java,
LOG.info("Exception while cancelling token. Re-throwing. ", e), 
cancelling should be canceling

In file 
/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java,
NameNode.LOG.info("Caching file names occuring more than " + threshold + " 
times"), 
occuring should be occurring

In file 
/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java,
LOG.info("NNStorage.attemptRestoreRemovedStorage: check removed(failed) " 
+ "storarge. removedStorages size = " + removedStorageDirs.size()), 
storarge should be storage

In file 
/hadoop-hdfs-project/hadoop-hdfs-nfs/src/main/java/org/apache/hadoop/hdfs/nfs/nfs3/RpcProgramNfs3.java,
LOG.info("Partical read. Asked offset: " + offset + " count: " + count + " 
and read back: " + readCount + " file size: " + attrs.getSize()), 
Partical should be Partial
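
For reference, here is a minimal sketch of the kind of scan the tool 
performs. It is illustrative only: the actual research tool is not shown, 
and the misspelling dictionary and regex below are assumptions seeded from 
the findings above.

{noformat}
// Illustrative sketch of a line-based scan for misspelled words inside
// LOG.* string literals. The typo dictionary and the regex are assumptions.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class LogTypoScanner {
  // Known misspelling -> correction pairs (seeded from the findings above).
  private static final Map<String, String> TYPOS = new HashMap<>();
  static {
    TYPOS.put("bandwith", "bandwidth");
    TYPOS.put("addng", "adding");
    TYPOS.put("occuring", "occurring");
    TYPOS.put("storarge", "storage");
    TYPOS.put("Partical", "Partial");
  }

  // Matches the first string literal of a LOG.<level>( call on a line.
  private static final Pattern LOG_LITERAL =
      Pattern.compile("LOG\\.\\w+\\(.*?\"([^\"]*)\"");

  public static void main(String[] args) throws IOException {
    try (Stream<Path> files = Files.walk(Paths.get(args[0]))) {
      files.filter(p -> p.toString().endsWith(".java")).forEach(p -> {
        try {
          for (String line : Files.readAllLines(p)) {
            Matcher m = LOG_LITERAL.matcher(line);
            while (m.find()) {
              for (Map.Entry<String, String> e : TYPOS.entrySet()) {
                if (m.group(1).contains(e.getKey())) {
                  System.out.println(p + ": '" + e.getKey()
                      + "' should be '" + e.getValue() + "'");
                }
              }
            }
          }
        } catch (IOException ignored) {
          // Skip files that cannot be read.
        }
      });
    }
  }
}
{noformat}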






Active NameNode Image Download and Edit Log Roll Stuck in FileChannel Force/Truncate

2016-09-13 Thread Joey Paskhay
Reposting here to see if any of the HDFS developers have some good insight
into this.

The deep dive is in the original message below. The gist of it is that after
upgrading to 2.7.2 on a ~260 node cluster, the active NN's fsimage download
and edit log rolls seem to get stuck in native FileChannel.force calls
(sometimes FileChannel.truncate). This leads to the ZKFC health monitor
failing (all the RPC handler threads back up waiting for the
FSNamesystem.fsLock to be released by the edit log roll process), and the
active NN gets killed.

It happens occasionally when the system is idle (about once a day) but very
frequently when we run DistCp (every 20-30 minutes). We believe we saw this
every month or two on 2.2.1 (the logs/files have rolled over since the last
occurrence, so we can't confirm it was the exact same issue), but with 2.7.2
it is much more frequent.

Any help or guidance would be much appreciated.

Thanks,
Joey


Hey there,

We're in the process of upgrading our Hadoop cluster from 2.2.1 to 2.7.2
and currently testing 2.7.2 in our pre-prod/backup cluster. We're seeing a
lot of active NameNode failovers (sometimes as often as every 30 minutes),
especially when we're running DistCp to copy data from our production
cluster for users to test with. We had seen similar failovers occasionally
while running 2.2.1, but not nearly as often (once every month or two).
We haven't been able to verify it's the exact same root cause on the 2.2.1
version, since files/logs have rolled over since the last time it happened.

So here's the chain of events we've found so far. Hoping someone can
provide further direction.

The standby NameNode's checkpointing process succeeds locally and issues
the image PUT request in TransferFsImage.uploadImage. The active NameNode
finishes downloading the fsimage.ckpt file, but when it tries to issue
the fos.getChannel().force(true) call in TransferFsImage.receiveFile it
seems to get stuck in native code. The standby NameNode then gets a
SocketTimeoutException -- it happens 60 seconds after the last modification
time we see in the "stat" output for the fsimage.ckpt file that the active
NameNode pulled down.
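
For readers unfamiliar with the call in question, here is a minimal sketch
of the receive-then-force pattern (hypothetical names; the real
TransferFsImage.receiveFile does considerably more). force(true) is
effectively an fsync of data and metadata: it blocks until the kernel
reports the dirty pages are on disk, which is exactly where our threads sit.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

class ReceiveFileSketch {
  // Hypothetical, simplified version of the pattern described above.
  static void receiveToDisk(InputStream in, FileOutputStream fos)
      throws IOException {
    byte[] buf = new byte[64 * 1024];
    int n;
    while ((n = in.read(buf)) > 0) {
      fos.write(buf, 0, n);       // data first lands in the OS page cache
    }
    fos.getChannel().force(true); // blocks until data and metadata hit disk
  }
}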

Right after this happens (~30 sec after the last modification to the
fsimage.ckpt file) we see a similar issue with the edit log roll. The
standby NameNode's EditLogTailer triggers the rolling of the edit log on the
active NameNode. We see the active NameNode enter its rollEditLog process,
and then either the endCurrentLogSegment call gets stuck in
EditLogFileOutputStream.close on the fc.truncate(fc.position()) call, or the
startLogSegment call gets stuck in EditLogFileOutputStream.flushAndSync on
the fc.force(true) call. Both get stuck in native code. Looking at the last
modification time in the "stat" output of the edits file, we see that the
standby NameNode's RPC call times out 20 seconds later.
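
Again with hypothetical names, a minimal sketch of the two blocking calls
named above (not the actual EditLogFileOutputStream code): close() truncates
away the preallocated tail of the segment, and flushAndSync() forces the
buffered edits to disk, so both paths end in a blocking native call.

import java.io.IOException;
import java.nio.channels.FileChannel;

class EditLogSegmentSketch {
  private final FileChannel fc;

  EditLogSegmentSketch(FileChannel fc) {
    this.fc = fc;
  }

  // endCurrentLogSegment path: drop the preallocated tail, then close.
  void closeSegment() throws IOException {
    fc.truncate(fc.position()); // blocks in native ftruncate
    fc.close();
  }

  // startLogSegment path: make the new segment durable on disk.
  void flushAndSync() throws IOException {
    fc.force(true);             // blocks in native fsync
  }
}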

The rollEditLog ends up holding onto the FSNamesystem's write lock on
fsLock, and this causes all other RPC calls to pile up trying to acquire
read locks until ZKFC times out on the health monitor and signals for the
NameNode to be killed. We patched the SshFenceByTcpPort code to issue a
kill -3 to get a thread dump before it kills the active NameNode.

We're running CentOS 6 with an ext4 filesystem (w/ noatime) on kernel
2.6.32. The fsimage file is typically ~7.2GB and the edits files are
typically ~1MB-2MB. The cluster running 2.7.2 has 256 nodes. We're on
JDK 1.8.0_92 (compiled against it too, with a few JDK8-specific patches).

See the relevant stacks below of the FileChannel code getting stuck in the
native code. I can also provide the full thread dumps and any relevant
configs, if needed.

Tried looking in JIRA and online but didn't see anything directly related.
Any insight as to whether this is a bug in Hadoop or if it's a side-effect
of something else? When the cluster is mostly idle, everything seems fine.
Our dev/test clusters haven't had any issues with the upgrades but they're
only 10 nodes or less and have little load.

Thanks!
Joey

Example of both getting stuck in force calls:

"641242166@qtp-1147805316-11" #869 daemon prio=5 os_prio=0 tid=0x7fd9c3bc4800 nid=0x2f37 runnable [0x7fb7f6c8d000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.FileDispatcherImpl.force0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76)
        at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:388)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.receiveFile(TransferFsImage.java:530)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.handleUploadImageRequest(TransferFsImage.java:132)
        at org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:488)
        at org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:458)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apach

Re: Active NameNode Image Download and Edit Log Roll Stuck in FileChannel Force/Truncate

2016-09-13 Thread Kihwal Lee
Is the system busy with I/O when it happens? Were there any other I/O 
activities preceding the event? In your case, DistCp could have generated 
extra edits as well as namenode daemon and audit log entries. Depending on 
configuration, dirty pages can pile up quite a bit on Linux systems with 
large memory and cause extreme I/O delays when they finally hit the drive. 
The fsimage upload might be contributing to that. But we haven't seen any 
issues like that: in one of our large clusters (5000+ nodes, 2.7.3-ish, 
JDK 8), rollEdits() consistently takes less than 30ms.
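
A quick way to check whether raw fsync latency on the NN disks is the 
culprit is a standalone probe along these lines (hypothetical code, not part 
of Hadoop): write a chunk, time FileChannel.force(true), and look for long 
tails while your DistCp load is running.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class FsyncProbe {
  public static void main(String[] args) throws IOException {
    // Usage: java FsyncProbe /path/on/nn/disk/probe.dat
    try (RandomAccessFile raf = new RandomAccessFile(args[0], "rw")) {
      FileChannel fc = raf.getChannel();
      ByteBuffer buf = ByteBuffer.allocate(1 << 20); // 1 MB of zeros
      for (int i = 0; i < 60; i++) {
        buf.clear();
        fc.write(buf);
        long t0 = System.nanoTime();
        fc.force(true); // the same call the stuck NN threads are in
        System.out.printf("force #%d took %d ms%n", i,
            (System.nanoTime() - t0) / 1000000);
      }
    }
  }
}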
Kihwal

 From: Joey Paskhay
 To: hdfs-dev@hadoop.apache.org
 Sent: Tuesday, September 13, 2016 12:06 PM
 Subject: Active NameNode Image Download and Edit Log Roll Stuck in FileChannel Force/Truncate

[jira] [Resolved] (HDFS-10378) FSDirAttrOp#setOwner throws ACE with misleading message

2016-09-13 Thread John Zhuge (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge resolved HDFS-10378.
---
Resolution: Invalid

In {{HDFS-10378-unit.patch}}, the super user creates {{CHILD_DIR1}}, so it 
makes sense for a setOwner call by a non-super user to throw an ACE with 
"Permission denied".
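
For illustration, here is a minimal sketch of the scenario (hypothetical 
names and paths; this is not the actual unit patch):

{noformat}
// Illustrative sketch only, not HDFS-10378-unit.patch: the super user
// creates a directory, then a non-super, non-owner user calls setOwner on
// it, and an AccessControlException is the correct outcome.
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.AccessControlException;
import org.apache.hadoop.security.UserGroupInformation;
import static org.junit.Assert.fail;

public class SetOwnerAceSketch {
  void run(Configuration conf) throws Exception {
    FileSystem superFs = FileSystem.get(conf);   // runs as the super user
    Path child = new Path("/test/childDir1");
    superFs.mkdirs(child);                       // inode owned by super user

    UserGroupInformation ugi = UserGroupInformation
        .createUserForTesting("user1", new String[] { "group1" });
    FileSystem userFs = ugi.doAs(
        (PrivilegedExceptionAction<FileSystem>) () -> FileSystem.get(conf));
    try {
      userFs.setOwner(child, "user1", "group1"); // neither owner nor super
      fail("expected AccessControlException");
    } catch (AccessControlException expected) {
      // "Permission denied. user=... is not the owner of inode=..."
    }
  }
}
{noformat}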

> FSDirAttrOp#setOwner throws ACE with misleading message
> ---
>
> Key: HDFS-10378
> URL: https://issues.apache.org/jira/browse/HDFS-10378
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Minor
>  Labels: supportability
> Attachments: HDFS-10378-unit.patch, HDFS-10378.001.patch, 
> HDFS-10378.002.patch, HDFS-10378.003.patch
>
>
> Calling {{setOwner}} as a non-super user does trigger an
> {{AccessControlException}}; however, the message "Permission denied.
> user=user1967821757 is not the owner of inode=child" is wrong. The expected
> message is: "Non-super user cannot change owner".
> Output of patched unit test {{TestPermission.testFilePermission}}:
> {noformat}
> 2016-05-06 16:45:44,915 [main] INFO  security.TestPermission 
> (TestPermission.java:testFilePermission(280)) - GOOD: got 
> org.apache.hadoop.security.AccessControlException: Permission denied. 
> user=user1967821757 is not the owner of inode=child1
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkOwner(FSPermissionChecker.java:273)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:250)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1642)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1626)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkOwner(FSDirectory.java:1595)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.setOwner(FSDirAttrOp.java:88)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setOwner(FSNamesystem.java:1717)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setOwner(NameNodeRpcServer.java:835)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setOwner(ClientNamenodeProtocolServerSideTranslatorPB.java:481)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:665)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2423)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2419)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1755)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2417)
> {noformat}
> Will upload the unit test patch shortly.


