[jira] [Created] (HDFS-10869) Unused method checkId() in InodeId.java file

2016-09-19 Thread Jagadesh Kiran N (JIRA)
Jagadesh Kiran N created HDFS-10869:
---

 Summary: Unused method checkId() in InodeId.java file 
 Key: HDFS-10869
 URL: https://issues.apache.org/jira/browse/HDFS-10869
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Jagadesh Kiran N


The following method in InodeId.java is not used anywhere; we can remove the
code:

{code}
public static void checkId(long requestId, INode inode)
    throws FileNotFoundException {
  if (requestId != HdfsConstants.GRANDFATHER_INODE_ID
      && requestId != inode.getId()) {
    throw new FileNotFoundException(
        "ID mismatch. Request id and saved id: " + requestId + " , "
            + inode.getId() + " for file " + inode.getFullPathName());
  }
}
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



Apache Hadoop qbt Report: trunk+JDK8 on Linux/x86

2016-09-19 Thread Apache Jenkins Server
For more details, see 
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/169/

[Sep 18, 2016 5:55:18 PM] (Arun Suresh) YARN-5637. Changes in NodeManager to 
support Container rollback and




-1 overall


The following subsystems voted -1:
asflicense unit


The following subsystems voted -1 but
were configured to be filtered/ignored:
cc checkstyle javac javadoc pylint shellcheck shelldocs whitespace


The following subsystems are considered long running:
(runtime bigger than 1h  0m  0s)
unit


Specific tests:

Failed junit tests :

   hadoop.yarn.server.applicationhistoryservice.webapp.TestAHSWebServices 
   hadoop.yarn.server.TestMiniYarnClusterNodeUtilization 
   hadoop.yarn.server.TestContainerManagerSecurity 
  

   cc:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/169/artifact/out/diff-compile-cc-root.txt
  [4.0K]

   javac:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/169/artifact/out/diff-compile-javac-root.txt
  [168K]

   checkstyle:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/169/artifact/out/diff-checkstyle-root.txt
  [16M]

   pylint:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/169/artifact/out/diff-patch-pylint.txt
  [16K]

   shellcheck:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/169/artifact/out/diff-patch-shellcheck.txt
  [20K]

   shelldocs:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/169/artifact/out/diff-patch-shelldocs.txt
  [16K]

   whitespace:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/169/artifact/out/whitespace-eol.txt
  [11M]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/169/artifact/out/whitespace-tabs.txt
  [1.3M]

   javadoc:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/169/artifact/out/diff-javadoc-javadoc-root.txt
  [2.2M]

   unit:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/169/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-applicationhistoryservice.txt
  [12K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/169/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-tests.txt
  [268K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/169/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-nativetask.txt
  [124K]

   asflicense:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/169/artifact/out/patch-asflicense-problems.txt
  [4.0K]

Powered by Apache Yetus 0.4.0-SNAPSHOT   http://yetus.apache.org




[jira] [Created] (HDFS-10870) Wrong dfs.namenode.acls.enabled default in HdfsPermissionsGuide.apt.vm

2016-09-19 Thread John Zhuge (JIRA)
John Zhuge created HDFS-10870:
-

 Summary: Wrong dfs.namenode.acls.enabled default in 
HdfsPermissionsGuide.apt.vm
 Key: HDFS-10870
 URL: https://issues.apache.org/jira/browse/HDFS-10870
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.6.0
Reporter: John Zhuge
Assignee: John Zhuge
Priority: Trivial


Wrong {{dfs.namenode.acls.enabled = true}} in {{HdfsPermissionsGuide.apt.vm}}. 
The default should be {{false}} as correctly stated in the description and in 
{{DFS_NAMENODE_ACLS_ENABLED_DEFAULT}}:
{code}
 * <<<dfs.namenode.acls.enabled>>>

   Set to true to enable support for HDFS ACLs (Access Control Lists).  By
   default, ACLs are disabled.  When ACLs are disabled, the NameNode rejects
   all attempts to set an ACL.
{code}

{code:title=DFSConfigKeys.java}
  public static final String  DFS_NAMENODE_ACLS_ENABLED_KEY =
      "dfs.namenode.acls.enabled";
  public static final boolean DFS_NAMENODE_ACLS_ENABLED_DEFAULT = false;
{code}
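For operators, the practical consequence of the {{false}} default is that ACL support must be enabled explicitly. A typical {{hdfs-site.xml}} entry (a sketch; the property name is taken from the {{DFSConfigKeys}} snippet above) looks like:

```xml
<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>true</value>
  <description>Set to true to enable support for HDFS ACLs. The
  default is false; while ACLs are disabled, the NameNode rejects
  all attempts to set an ACL.</description>
</property>
```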






[jira] [Created] (HDFS-10871) DiskBalancerWorkItem should not import jackson relocated by htrace

2016-09-19 Thread Masatake Iwasaki (JIRA)
Masatake Iwasaki created HDFS-10871:
---

 Summary: DiskBalancerWorkItem should not import jackson relocated 
by htrace
 Key: HDFS-10871
 URL: https://issues.apache.org/jira/browse/HDFS-10871
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 3.0.0-alpha1
Reporter: Masatake Iwasaki


Compiling trunk against upstream htrace fails, since upstream htrace does not
bundle {{org.apache.htrace.fasterxml.jackson.annotation.JsonInclude}}.






Permission bit 12 in getFileInfo response

2016-09-19 Thread John Zhuge
Hi Gurus,

Does anyone know the meaning of bit 12 in the permission field of the
"getFileInfo" response? To my understanding, bit 9 is the sticky bit, above
the lower 9 bits for user/group/other.

In the following trace, the "perm" field is 4584, i.e., 10750 in octal:

16/09/15 15:54:56 TRACE ipc.ProtobufRpcEngine: 1: Response <-
NAMENODE:8020: getFileInfo {fs { fileType: IS_DIR path: "" length: 0
permission { perm: 4584 } owner: "USER" group: "supergroup"
modification_time: 1473884314570 access_time: 0 block_replication: 0
blocksize: 0 fileId: 8798130 childrenNum: 1 storagePolicy: 0 }}

Thanks,
John Zhuge
Software Engineer, Cloudera


Re: Permission bit 12 in getFileInfo response

2016-09-19 Thread Chris Nauroth
Hello John,

That is the ACL bit.  The NameNode toggles on the ACL bit in getFileInfo 
responses for inodes that have ACL entries attached to them.  On the client 
side, this results in calls to FsPermission#getAclBit returning true.

The purpose of the ACL bit is to help client applications identify files and 
directories that have ACL entries attached.  One specific example where this is 
useful is in the output of the file system shell "ls" command.  (See 
org.apache.hadoop.fs.shell.Ls#processPath.)  If the ACL bit is turned on, then 
this is how the shell decides to append a '+' character after the basic 
permissions, so the end user knows that ACL entries are present.  If the ACL 
bit didn’t exist, then applications like this would have to be implemented with 
a more costly FileSystem#getAclStatus call, in addition to the existing 
getFileInfo RPC.

Test cases defined in FSAclBaseTest check for the presence of the ACL bit where 
expected.

--Chris Nauroth

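The bit layout Chris describes can be checked directly against the perm value from the trace. A minimal, self-contained sketch (the bit positions assumed here are the ones stated in this thread; see FsPermissionExtension in the Hadoop source for the authoritative encoding):

```java
// Decode the "perm" field (4584) from the getFileInfo trace above.
// Assumes: low 9 bits = user/group/other, bit 9 = sticky bit,
// bit 12 = ACL bit (per FsPermissionExtension).
public class PermBits {
    public static void main(String[] args) {
        int perm = 4584;
        int aclBit = 1 << 12;
        int stickyBit = 1 << 9;
        System.out.println(Integer.toOctalString(perm));          // 10750
        System.out.println((perm & aclBit) != 0);                 // true  -> ACL entries present
        System.out.println((perm & stickyBit) != 0);              // false -> sticky bit not set
        System.out.println(Integer.toOctalString(perm & 0777));   // 750   -> rwxr-x---
    }
}
```

So 4584 decimal is 10750 octal: the leading 1 is the ACL bit, the sticky/setuid/setgid digit is 0, and the basic permissions are 750.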




Apache Hadoop qbt Report: trunk+JDK8 on Linux/ppc64le

2016-09-19 Thread Apache Jenkins Server
For more details, see 
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/99/

[Sep 18, 2016 5:55:18 PM] (Arun Suresh) YARN-5637. Changes in NodeManager to 
support Container rollback and
[Sep 19, 2016 9:03:06 AM] (varunsaxena) YARN-5577. [Atsv2] Document object 
passing in infofilters with an
[Sep 19, 2016 9:08:01 AM] (jianhe) YARN-3141. Improve locks in




-1 overall


The following subsystems voted -1:
compile unit


The following subsystems voted -1 but
were configured to be filtered/ignored:
cc javac


The following subsystems are considered long running:
(runtime bigger than 1h  0m  0s)
unit


Specific tests:

Failed junit tests :

   hadoop.hdfs.TestBlockStoragePolicy 
   hadoop.hdfs.tools.offlineImageViewer.TestOfflineImageViewer 
   hadoop.hdfs.web.TestWebHdfsTimeouts 
   hadoop.hdfs.server.namenode.ha.TestHAAppend 
   hadoop.yarn.server.nodemanager.recovery.TestNMLeveldbStateStoreService 
   hadoop.yarn.server.nodemanager.TestNodeManagerShutdown 
   hadoop.yarn.server.timeline.TestRollingLevelDB 
   hadoop.yarn.server.applicationhistoryservice.webapp.TestAHSWebServices 
   hadoop.yarn.server.timeline.TestTimelineDataManager 
   hadoop.yarn.server.timeline.TestLeveldbTimelineStore 
   hadoop.yarn.server.timeline.recovery.TestLeveldbTimelineStateStore 
   hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore 
   
hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer 
   hadoop.yarn.server.timelineservice.storage.common.TestRowKeys 
   hadoop.yarn.server.timelineservice.storage.common.TestKeyConverters 
   hadoop.yarn.server.timelineservice.storage.common.TestSeparator 
   hadoop.yarn.server.resourcemanager.recovery.TestLeveldbRMStateStore 
   hadoop.yarn.server.resourcemanager.TestRMRestart 
   hadoop.yarn.server.resourcemanager.TestResourceTrackerService 
   hadoop.yarn.server.TestMiniYarnClusterNodeUtilization 
   hadoop.yarn.server.TestContainerManagerSecurity 
   hadoop.yarn.client.api.impl.TestNMClient 
   hadoop.yarn.server.timeline.TestLevelDBCacheTimelineStore 
   hadoop.yarn.server.timeline.TestOverrideTimelineStoreYarnClient 
   hadoop.yarn.server.timeline.TestEntityGroupFSTimelineStore 
   hadoop.yarn.server.timelineservice.storage.TestHBaseTimelineStorage 
   
hadoop.yarn.server.timelineservice.storage.flow.TestHBaseStorageFlowRunCompaction
 
   hadoop.yarn.server.timelineservice.storage.flow.TestHBaseStorageFlowRun 
   
hadoop.yarn.server.timelineservice.storage.TestPhoenixOfflineAggregationWriterImpl
 
   
hadoop.yarn.server.timelineservice.reader.TestTimelineReaderWebServicesHBaseStorage
 
   
hadoop.yarn.server.timelineservice.storage.flow.TestHBaseStorageFlowActivity 
   hadoop.yarn.applications.distributedshell.TestDistributedShell 
   hadoop.mapred.TestShuffleHandler 
   hadoop.mapreduce.v2.hs.TestHistoryServerLeveldbStateStoreService 
   hadoop.mapred.TestMiniMRWithDFSWithDistinctUsers 

Timed out junit tests :

   org.apache.hadoop.hdfs.server.datanode.TestFsDatasetCache 
   org.apache.hadoop.mapred.TestMRIntermediateDataEncryption 
   org.apache.hadoop.mapred.TestMROpportunisticMaps 
  

   compile:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/99/artifact/out/patch-compile-root.txt
  [308K]

   cc:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/99/artifact/out/patch-compile-root.txt
  [308K]

   javac:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/99/artifact/out/patch-compile-root.txt
  [308K]

   unit:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/99/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
  [196K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/99/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
  [52K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/99/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-applicationhistoryservice.txt
  [52K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/99/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-timelineservice.txt
  [20K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/99/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
  [72K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/99/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-tests.txt
  [268K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/99/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-client.txt
  [12K]
   
https://builds.apach

Re: Permission bit 12 in getFileInfo response

2016-09-19 Thread John Zhuge
Thanks Chris!  Silly me, didn't look at "FsPermissionExtension".

John Zhuge
Software Engineer, Cloudera



[jira] [Created] (HDFS-10873) Add histograms for FSNamesystemLock Metrics

2016-09-19 Thread Erik Krogen (JIRA)
Erik Krogen created HDFS-10873:
--

 Summary: Add histograms for FSNamesystemLock Metrics
 Key: HDFS-10873
 URL: https://issues.apache.org/jira/browse/HDFS-10873
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: namenode
Reporter: Erik Krogen
Assignee: Erik Krogen


It could be useful to have full histograms of how long operations hold the
namesystem lock, in addition to just rate information. This will, however,
emit a large number of metrics and likely require more coordination, so it
may be best to make this separately configurable from the simpler
namesystem lock metrics.






[jira] [Created] (HDFS-10872) Add MutableRate metrics for FSNamesystemLock operations

2016-09-19 Thread Erik Krogen (JIRA)
Erik Krogen created HDFS-10872:
--

 Summary: Add MutableRate metrics for FSNamesystemLock operations
 Key: HDFS-10872
 URL: https://issues.apache.org/jira/browse/HDFS-10872
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: namenode
Reporter: Erik Krogen
Assignee: Erik Krogen


Add metrics for FSNamesystemLock operations to see, overall, how long each
operation holds the lock. Use MutableRate metrics for now.
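The core of the idea is to measure the lock hold time around each operation and feed the duration into a rate metric. A minimal, self-contained sketch (class and method names here are hypothetical; an actual patch would record into Hadoop's org.apache.hadoop.metrics2.lib.MutableRate rather than this hand-rolled accumulator):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: time how long each operation holds the write
// lock, and accumulate count and total duration the way a rate metric
// would. Stand-in for Hadoop's MutableRate.
public class LockHoldRateSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private long numOps;
    private long totalNanos;

    private synchronized void record(long nanos) {
        numOps++;
        totalNanos += nanos;
    }

    public synchronized long numOps() {
        return numOps;
    }

    public synchronized double avgHoldMicros() {
        return numOps == 0 ? 0.0 : (totalNanos / 1000.0) / numOps;
    }

    // Run an operation under the write lock, measuring the hold time.
    public void withWriteLock(Runnable op) {
        lock.writeLock().lock();
        long start = System.nanoTime();
        try {
            op.run();
        } finally {
            long held = System.nanoTime() - start;
            lock.writeLock().unlock();
            record(held);
        }
    }
}
```

Note the duration is captured from just after lock acquisition to just before release, so queueing time waiting for the lock is excluded from the hold-time metric.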






[jira] [Created] (HDFS-10874) libhdfs++: Public API headers should not depend on internal implementation

2016-09-19 Thread James Clampffer (JIRA)
James Clampffer created HDFS-10874:
--

 Summary: libhdfs++: Public API headers should not depend on 
internal implementation
 Key: HDFS-10874
 URL: https://issues.apache.org/jira/browse/HDFS-10874
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: James Clampffer


Public headers need to do some combination of the following: stop including
parts of the implementation, forward-declare bits of the implementation where
absolutely needed, or pull the implementation into include/hdfspp if it's
inseparable.

Example:
If you want to use the C++ API and put only include/hdfspp on the include
path, you'll get an error when you include include/hdfspp/options.h, because
that header in turn includes common/uri.h.

Related to the work described in HDFS-10787.






Re: Active NameNode Image Download and Edit Log Roll Stuck in FileChannel Force/Truncate

2016-09-19 Thread Joey Paskhay
Hi Kihwal,

Thank you for the response. This led to us deep diving into the system
level I/O behavior (with help from some resident ops experts), and we
definitely identified some bottlenecks. We also realized we had recently
increased the RAM on the NN servers, so the allowable dirty page cache size
would've increased proportionally.

We've been working on the disk partition setup and tuning the dirty page
ratio settings and have been seeing some improvements. Definitely seems
like we're on the right track now. We'll be pushing hard to get SSDs in
these as well.

Apologies for the false alarm/distraction. Thank you again for your help!

Joey

On Tue, Sep 13, 2016 at 11:12 AM, Kihwal Lee  wrote:

> Is the system busy with I/O when it happens? Any other I/O activities
> preceding the event? In your case DistCp could have generated extra edits
> and also namenode daemon and audit log entries.  Depending on
> configuration, dirty pages can pile up quite a bit on Linux systems with a
> large memory and cause extreme I/O delays when they hit the drive. fsimage
> uploading might be contributing to that. But we haven't seen any issues
> like that. In one of large clusters (5000+ node, 2.7.3ish, jdk8),
> rollEdits() takes less than 30ms consistently.
>
> Kihwal
>
>
> --
> *From:* Joey Paskhay 
> *To:* hdfs-dev@hadoop.apache.org
> *Sent:* Tuesday, September 13, 2016 12:06 PM
> *Subject:* Active NameNode Image Download and Edit Log Roll Stuck in
> FileChannel Force/Truncate
>
> Reposting here to see if any of the HDFS developers have some good insight
> into this.
>
> Deep dive is in the below original message. The gist of it is after
> upgrading to 2.7.2 on a ~260 node cluster, the active NN's fsimage download
> and edit logs roll seem to get stuck in native FileChannel.force calls
> (sometimes FileChannel.truncate). This leads to the ZKFC health monitor
> failing (all the RPC handler threads back up waiting for the
> FSNamesystem.fsLock to be released by the edit log roll process), and the
> active NN gets killed.
>
> Happens occasionally when the system is idle (once a day) but very
> frequently when we run DistCp (every 20-30 minutes). We believe we saw this
> every month or two on 2.2.1 (logs/files rolled over since last time so
> can't confirm exact same issue), but with 2.7.2 it seems to be much more
> frequent.
>
> Any help or guidance would be much appreciated.
>
> Thanks,
> Joey
>
>
> Hey there,
>
> We're in the process of upgrading our Hadoop cluster from 2.2.1 to 2.7.2
> and currently testing 2.7.2 in our pre-prod/backup cluster. We're seeing a
> lot of active NameNode failovers (sometimes as often as every 30 minutes),
> especially when we're running DistCp to copy data from our production
> cluster for users to test with. We had seen similar failovers occasionally
> while running 2.2.1, but not nearly as often (once every month or two).
> Haven't been able to verify it's the exact same root cause in the 2.2.1
> version since files/logs have rolled over since the last time it happened.
>
> So here's the chain of events we've found so far. Hoping someone can
> provide further direction.
>
> The standby NameNode's checkpointing process succeeds locally and issues
> the image PUT request in TransferFsImage.uploadImage. The active NameNode
> finishes downloading the fsimage.ckpt file, but when it tries to issue
> the fos.getChannel().force(true) call in TransferFsImage.receiveFile it
> seems to get stuck in native code. The standby NameNode then gets a
> SocketTimeoutException -- it happens 60 seconds after the last modification
> time we see in the "stat" output for the fsimage.ckpt file that the active
> NameNode pulled down.
>
> Right after the time this is happening (~30 sec after the last modification
> to the fsimage.ckpt file) we see a similar issue with the edit log roll.
> The standby NameNode's EditLogTailer triggers the rolling of the edit log
> on the active NameNode. We see the active NameNode enter its rollEditLog
> process, and will either see the endCurrentLogSegment call get stuck in
> EditLogFileOutputStream.close on the fc.truncate(fc.position()) call or the
> startLogSegment call get stuck in EditLogFileOutputStream.flushAndSync on
> the fc.force(true) call. They both get stuck in the native code. Looking at
> the last modification time in the "stat" output of the edits file, we see
> that 20 seconds later the standby NameNode's RPC call times out.
>
> The rollEditLog ends up holding onto the FSNamesystem's write lock on
> fsLock, and this causes all other RPC calls to pile up trying to acquire
> read locks until ZKFC times out on the health monitor and signals for the
> NameNode to be killed. We patched the SshFenceByTcpPort code to issue a
> kill -3 to get a thread dump before it kills the active NameNode.
>
> We're running on CentOS 6 using ext4 FS (w/ noatime) using kernel 2.6.32.
> The fsimage file is typically ~7.2GB and the edits files are typically
> ~

[jira] [Created] (HDFS-10875) Optimize du -x to cache intermediate result

2016-09-19 Thread Xiao Chen (JIRA)
Xiao Chen created HDFS-10875:


 Summary: Optimize du -x to cache intermediate result
 Key: HDFS-10875
 URL: https://issues.apache.org/jira/browse/HDFS-10875
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: snapshots
Affects Versions: 2.8.0
Reporter: Xiao Chen
Assignee: Xiao Chen


As [~jingzhao] pointed out in HDFS-8986, we can save a 
{{computeContentSummary4Snapshot}} call in 
{{INodeDirectory#computeContentSummary}}.


