Thanks, yes, you're right: the Jetty 6 default is a 200-second timeout. According to this Stack Overflow article, the Jetty 9 default is 30 seconds. I think it makes sense to override Jetty's default timeout and set it to 200 seconds. I'll file a JIRA.
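As a stopgap until that lands, the hadoop.http.idle_timeout.ms switch from HADOOP-15696 (quoted below) could be raised by hand. A minimal sketch, assuming the property is set in core-site.xml on the affected daemons and that 200000 ms (matching the Jetty 6 value of 200 seconds) is the value wanted; both the file placement and the value are illustrative assumptions, not a tested recommendation:

```xml
<!-- core-site.xml (assumed location for this setting) -->
<property>
  <name>hadoop.http.idle_timeout.ms</name>
  <!-- 200000 ms = 200 s, mirroring the Jetty 6 _maxIdleTime quoted below; illustrative value -->
  <value>200000</value>
</property>
```

The daemons would most likely need a restart to pick up the change, and the open question below about lingering sockets applies to this workaround as well.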
Other things to consider:
1. fsck servlet? (Somehow I suspect this is related to the socket timeout reported in HDFS-7175 <https://issues.apache.org/jira/browse/HDFS-7175>.)
2. webhdfs, httpfs? --> We've also received reports that webhdfs can time out, so having a longer timeout makes sense here.
3. kms? Will the longer timeout cause more lingering sockets?

On Mon, Dec 7, 2020 at 10:23 PM Jason Wen <zhenshan....@workday.com.invalid> wrote:

> Hi Wei Chiu,
>
> We also observed the same issue when the NN replays large edit logs from the JN.
> It looks like in Jetty 6 the default max idle timeout is 200 seconds.
>
> public abstract class AbstractConnector extends AbstractBuffers implements Connector
> {
>     ....
>     protected int _maxIdleTime=200000;
>     ....
> }
>
> Thanks,
> Jason
>
> On 12/7/20, 9:51 PM, "Wei-Chiu Chuang" <weic...@apache.org> wrote:
>
> Hi community,
>
> I want to share with you this observation.
>
> We received several case reports that users sometimes experience a
> JournalNode timeout when the NN requests edits from the JN. The end result is
> that (both!) NNs crash after the timeout (10 seconds).
>
> It seems to happen only to Hadoop 3 users (CDH6 and HDP3). While
> HADOOP-15696 <https://issues.apache.org/jira/browse/HADOOP-15696>
> offered a configurable switch for you to increase hadoop.http.idle_timeout.ms,
> this looks like a regression in Hadoop 3, and the NN shouldn't simply crash
> because the JN is slightly slow. It looks to me like a 10-second timeout for
> fetching edits from the JN is simply too low.
>
> I believe this is a regression caused when we updated Jetty from 6 to 9 in
> Hadoop 3 (HADOOP-10075 <https://issues.apache.org/jira/browse/HADOOP-10075>).
> We replaced SelectChannelConnector.setLowResourceMaxIdleTime()
> with ServerConnector.setIdleTimeout(), but they aren't the same.
>
> http://archive.eclipse.org/jetty/7.0.0.RC0/apidocs/org/eclipse/jetty/server/nio/SelectChannelConnector.html#getLowResourcesMaxIdleTime()
>
> https://www.eclipse.org/jetty/javadoc/9.4.26.v20200117/org/eclipse/jetty/server/AbstractConnector.html#setIdleTimeout(long)
>
> Does anyone know the behavior back in Hadoop 2/Jetty 6? Does it use
> Jetty's default idle time, which is 300 seconds?