sebastian-nagel opened a new issue, #1899: URL: https://github.com/apache/stormcrawler/issues/1899
### Version

main branch

### Describe what's wrong

The OkHttp protocol uses `topology.message.timeout.secs` (default 300 seconds) to define a `completionTime` after which an ongoing request is cancelled. However, the completion time is not reliably checked when the data arrives in very small chunks at intervals shorter than the 10 seconds defined by `http.timeout`.

The issue was originally discovered in Nutch. Please see [NUTCH-3174](https://issues.apache.org/jira/browse/NUTCH-3174) for more details and further analysis.

### Error message and/or stacktrace

In a topology, the fetch of the robots.txt takes forever. The Java stack after 30 minutes:

```
"FetcherThread #5" #138 daemon prio=5 os_prio=0 cpu=78.52ms elapsed=1825.86s tid=0x000074e2340cfd40 nid=0x1bb runnable [0x000074e17d8fd000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.Net.poll([email protected]/Native Method)
	at sun.nio.ch.NioSocketImpl.park([email protected]/Unknown Source)
	at sun.nio.ch.NioSocketImpl.timedRead([email protected]/Unknown Source)
	at sun.nio.ch.NioSocketImpl.implRead([email protected]/Unknown Source)
	at sun.nio.ch.NioSocketImpl.read([email protected]/Unknown Source)
	at sun.nio.ch.NioSocketImpl$1.read([email protected]/Unknown Source)
	at java.net.Socket$SocketInputStream.read([email protected]/Unknown Source)
	at okio.internal.DefaultSocket$SocketSource.read(DefaultSocket.kt:124)
	at okio.RealBufferedSource.request(RealBufferedSource.kt:232)
	at okio.RealBufferedSource.require(RealBufferedSource.kt:225)
	at okio.RealBufferedSource.readHexadecimalUnsignedLong(RealBufferedSource.kt:411)
	at okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.readChunkSize(Http1ExchangeCodec.kt:485)
	at okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.read(Http1ExchangeCodec.kt:464)
	at okhttp3.internal.connection.Exchange$ResponseBodySource.read(Exchange.kt:346)
	at okio.RealBufferedSource.request(RealBufferedSource.kt:232)
	at org.apache.stormcrawler.protocol.okhttp.HttpProtocol.toByteArray(HttpProtocol.java:486)
	at org.apache.stormcrawler.protocol.okhttp.HttpProtocol.getProtocolOutput(HttpProtocol.java:429)
	at org.apache.stormcrawler.protocol.HttpRobotRulesParser.getRobotRulesSet(HttpRobotRulesParser.java:135)
	at org.apache.stormcrawler.protocol.RobotRulesParser.getRobotRulesSet(RobotRulesParser.java:207)
	at org.apache.stormcrawler.protocol.AbstractHttpProtocol.getRobotRules(AbstractHttpProtocol.java:159)
	at org.apache.stormcrawler.bolt.FetcherBolt$FetcherThread.run(FetcherBolt.java:540)
```

The tuple is failed after about 10 minutes:

```
2026-05-07 14:51:09.253 o.a.s.o.p.AggregationSpout I/O dispatcher 4 [INFO] [spout #4] OpenSearch query returned 1 hits from 1 buckets in 41 msec with 0 already being processed. Took 41.0 msec per doc on average.
...
2026-05-07 15:00:39.151 o.a.s.o.p.AbstractSpout Thread-32-spout-executor[17, 17] [INFO] [spout #7] Fail for http://cbhjhlccfkqdpknyu.org/
```

### How to reproduce

The easiest way to reproduce it is to call the protocol's main method:

```
$> cat /tmp/crawler-conf-test.yaml
config:
  http.content.limit: 1048576
  http.content.partial.as.trimmed: true
  http.store.headers: true
  http.timeout: 10000
  topology.message.timeout.secs: 60 # for testing use a lower value than the default (300 seconds)
  protocols: "http,https"
  http.protocol.implementation: org.apache.stormcrawler.protocol.okhttp.HttpProtocol
  https.protocol.implementation: org.apache.stormcrawler.protocol.okhttp.HttpProtocol
  http.protocol.versions:
    - "h2"
    - "http/1.1"
  http.trust.everything: true
  http.agent.name: "MyTestBot"
  http.agent.version: "3.0"
  http.agent.description: ""
  http.agent.url: ""
  http.agent.email: ""

$> java -cp .../stormcrawler-core-3.5.2-SNAPSHOT.jar:... \
     org.apache.stormcrawler.protocol.okhttp.HttpProtocol \
     -f /tmp/crawler-conf-test.yaml http://cbhjhlccfkqdpknyu.org/
...
[main] INFO org.apache.stormcrawler.protocol.okhttp.HttpProtocol - Using protocol versions: [h2, http/1.1]
[main] INFO org.apache.stormcrawler.protocol.okhttp.HttpProtocol - Using connection pool with max. 5 idle connections and 300 sec. connection keep-alive time
```

The Java program hangs without further output.

### Additional context

_No response_
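
For illustration only, here is a minimal stdlib-only sketch of the failure mode and of one way to bound the total read time. It is not the StormCrawler code and not a proposed patch: `TrickleStream`, `readWithDeadline` and all parameters are hypothetical names. The assumption is that the server sends one small chunk per interval, so a per-read timeout (the role `http.timeout` plays) never fires, and only an overall deadline checked after every underlying read stops the transfer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class DeadlineReadSketch {

    /** Hypothetical stand-in for a server that trickles one byte per interval. */
    static final class TrickleStream extends InputStream {
        private final long intervalMillis;
        private int remaining;

        TrickleStream(int totalBytes, long intervalMillis) {
            this.remaining = totalBytes;
            this.intervalMillis = intervalMillis;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) {
                return -1; // end of stream
            }
            try {
                Thread.sleep(intervalMillis); // each byte arrives "just in time"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted", e);
            }
            remaining--;
            return 'x';
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            int c = read();
            if (c == -1) {
                return -1;
            }
            b[off] = (byte) c;
            return 1; // never more than one byte per call
        }
    }

    /**
     * Reads until end of stream or until maxMillis have elapsed, whichever
     * comes first. The deadline is re-checked after every single read, so a
     * stream of tiny chunks cannot keep the loop alive indefinitely. A single
     * read that blocks is still bounded only by the per-read timeout.
     */
    static byte[] readWithDeadline(InputStream in, long maxMillis) throws IOException {
        long deadline = System.nanoTime() + maxMillis * 1_000_000L;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (System.nanoTime() - deadline < 0) {
            int n = in.read(buf, 0, buf.length);
            if (n == -1) {
                break;
            }
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // 1000 bytes at 20 ms each would take about 20 s; the deadline is 200 ms.
        byte[] partial = readWithDeadline(new TrickleStream(1000, 20), 200);
        System.out.println("read " + partial.length + " bytes before the deadline");
    }
}
```

If the loop instead asked a buffering layer for a full 8 KiB block before checking the clock, it would stay inside that one call for as long as the bytes keep trickling in, which is how I read the stack trace above (blocked in `okio.RealBufferedSource.request`). OkHttp also provides `OkHttpClient.Builder.callTimeout(...)`, which bounds the whole call including reading the response body; whether that fits the partial-content handling here is something I have not checked.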
