Thank you, Arpit Agarwal, for notifying me about this user-list mail. Yes, the heap pressure is introduced by the block-layout migration. And right, heap usage is high only during the upgrade; once the upgrade is done, heap usage goes back to normal. I have seen this issue on many clusters (20+), but it is only noticeable on large datanodes (those holding millions of blocks).

IIUC, the high heap usage is present from 2.6.0 and 2.6.1 onwards [HDFS-6482 <https://issues.apache.org/jira/browse/HDFS-6482> (2.6.0) and HDFS-7443 <https://issues.apache.org/jira/browse/HDFS-7443> (2.6.1)]:

2.6.0 => https://github.com/apache/hadoop/blame/branch-2.6.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java#L1060
2.6.1 => https://github.com/apache/hadoop/blame/branch-2.6.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataStorage.java#L1067

As long as the datanode has enough memory, there is no issue. However, if the DN is configured with a smaller heap, CMS struggles a lot because there is no memory left to reclaim. Multiple users reported a datanode running for more than 1 hour without completing block migration (for 3.3M blocks with a 6GB heap). It just spends its time in GC cycles, with an overall JVM pause of ~37 minutes (it just hangs).

I think we need to revisit these two jiras, mainly HDFS-6482 <https://issues.apache.org/jira/browse/HDFS-6482>.

Sorry if my understanding is wrong :)

-Karthik

On Wed, Oct 7, 2020 at 8:47 AM Kihwal Lee <kih...@verizonmedia.com.invalid> wrote:

> We haven't experienced anything like that up to 2.8. We are still in the
> process of stabilizing 2.10 as we upgrade some of the bigger clusters. We
> will know soon how 2.10 datanodes behave under heavy load and storage
> utilization.
>
> If you are seeing a significant change, it might be something post-2.8 or
> even post-2.10.
>
> Kihwal
>
> On Tue, Oct 6, 2020 at 5:09 PM Wei-Chiu Chuang <weic...@cloudera.com>
> wrote:
>
> > Sorry for not being specific.
> > I was referring to HDFS-8791
> > <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_HDFS-2D8791&d=DwMFaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=dAJ657NT-13Zjdb3zsUQxFoymNFB0SJd_2OTmE5mCR4&m=M36liML4Z0UBfc0vLFzg_C0fN_jTaH_ZbUGM_0Mnwjo&s=ukaowpvXdF0_o7i-UHB4046_L5Qyd0ZkEP9D778DM9c&e=>
> > (block ID-based DN storage layout can be very slow for datanode on ext4)
> > where it is in 2.8 and above.
> >
> > As I understand it, the increased heap usage only occurs during upgrade.
> > No issue afterwards.
> >
> > My experience was based on CDH5 to CDH6 upgrade (Hadoop 2.6 -> Hadoop 3.0)
> > and HDP2 to HDP3 (Hadoop 2.7 -> Hadoop 3.1) upgrade. It is nearly
> > impossible to tell which commit increases heap usage worse during upgrade.
> >
> > On Tue, Oct 6, 2020 at 3:01 PM Kihwal Lee <kih...@verizonmedia.com> wrote:
> >
> >> Which layout change are you referring to? The only layout change I know
> >> of was done in 2.7, IIRC. We backported that to 2.6 and did not see any
> >> adverse effects at that time.
> >>
> >> Is datanode using more heap all the time? Or is it running into trouble
> >> when generating full block reports?
> >>
> >> Kihwal
> >>
> >> On Mon, Oct 5, 2020 at 1:40 PM Wei-Chiu Chuang
> >> <weic...@cloudera.com.invalid> wrote:
> >>
> >>> We experienced this issue on CDH6 and HDP3, so roughly Hadoop 3.0.x and
> >>> 3.1.x.
> >>> Hermanth experienced the same issue on Hadoop 3.1.1 as well (HDFS-15569
> >>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_HDFS-2D15569&d=DwIBaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=b6gUZYewojO-9YMJdyeI_g&m=itpohwgKPN5qoauYyyMxhGSnasaP3LLbbMVezETEenA&s=kgWYVv2utuAyPWBhv0KVH8ZZGJqQBMvUM7dZ8J0jaa8&e=>)
> >>>
> >>> On Mon, Oct 5, 2020 at 11:03 AM Igor Dvorzhak <i...@google.com> wrote:
> >>>
> >>> > What Hadoop 3 version do you use?
> >>> >
> >>> > On Mon, Oct 5, 2020 at 10:03 AM Wei-Chiu Chuang <weic...@apache.org>
> >>> > wrote:
> >>> >
> >>> >> I have anecdotally learned of multiple data points where during the
> >>> >> upgrading from Hadoop 2 to Hadoop 3, DN heap usage increases to the point
> >>> >> where it goes OOM.
> >>> >>
> >>> >> Don't have much logs for this issue, but I suspect it's caused by the
> >>> >> layout change added in Hadoop 2.8.0.
> >>> >>
> >>> >> Does anyone else observe the same issue and how do you mitigate this? For
> >>> >> now we suggested increasing DN heap size prior to upgrade as part of
> >>> >> pre-upgrade checklist.
> >>> >>
> >>> >> Thanks,
> >>> >> Wei-Chiu
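
As a rough illustration of the mitigation mentioned at the bottom of this thread (raising the DN heap before the upgrade), a hadoop-env.sh change might look like the sketch below. The 12g value is purely an assumed placeholder, not a figure recommended by anyone on this thread; the only data point given here is that a 6GB heap was too small for 3.3M blocks. HADOOP_DATANODE_OPTS is the Hadoop 2 variable and HDFS_DATANODE_OPTS is its Hadoop 3 name. Managed distributions such as CDH or HDP normally expose the DN heap through their own configuration tooling instead, so this is only a sketch for a plain Apache Hadoop install.

```shell
# hadoop-env.sh (sketch only; the 12g heap value is an assumed placeholder)

# Hadoop 2.x: append to any existing datanode options so this -Xmx comes last.
export HADOOP_DATANODE_OPTS="${HADOOP_DATANODE_OPTS:-} -Xmx12g"

# Hadoop 3.x: same idea, using the renamed per-daemon variable.
export HDFS_DATANODE_OPTS="${HDFS_DATANODE_OPTS:-} -Xmx12g"
```

Since the extra heap is only needed while the layout migration runs, the setting could be reverted once the upgrade has finished and heap usage is back to normal.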