sebastian-nagel opened a new issue, #1899:
URL: https://github.com/apache/stormcrawler/issues/1899

   ### Version
   
   main branch
   
   ### Describe what's wrong
   
   The OkHttp protocol uses `topology.message.timeout.secs` (default 300 seconds) to define a `completionTime` after which an ongoing request is cancelled.
   
   However, the completion time is not reliably enforced when the data arrives in very small chunks at intervals shorter than the 10 seconds defined by `http.timeout`: every chunk arrives before the read timeout expires, so the read never times out.
   
   The issue was originally discovered in Nutch. Please see [NUTCH-3174](https://issues.apache.org/jira/browse/NUTCH-3174) for more details and further analysis.
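
The failure mode can be illustrated outside of OkHttp. The following is a minimal sketch, not StormCrawler's code: `TricklingStream` is a hypothetical stand-in for a server that delivers one byte per read, and `readWithDeadline` checks an overall deadline after every read that returns instead of relying on a per-read timeout. All class names, method names and timing values are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;
import java.util.concurrent.TimeUnit;

public class DeadlineRead {

    /** Hypothetical slow source: hands out one byte per read call after a delay. */
    static final class TricklingStream extends InputStream {
        private final long delayMillis;
        private int remaining;

        TricklingStream(int totalBytes, long delayMillis) {
            this.remaining = totalBytes;
            this.delayMillis = delayMillis;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) {
                return -1; // end of stream
            }
            try {
                Thread.sleep(delayMillis); // stands in for a pause shorter than the read timeout
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new InterruptedIOException("interrupted while waiting for data");
            }
            remaining--;
            return 'x';
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            int c = read();
            if (c == -1) {
                return -1;
            }
            b[off] = (byte) c;
            return 1; // never more than one byte per call
        }
    }

    /** Reads until end of stream, or fails once maxMillis have elapsed overall. */
    static byte[] readWithDeadline(InputStream in, long maxMillis) throws IOException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxMillis);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            // Checked after every read that returns, however few bytes it delivered.
            if (System.nanoTime() - deadline > 0) {
                throw new InterruptedIOException(
                        "overall deadline exceeded after " + out.size() + " bytes");
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // 1000 bytes at 20 ms each would need about 20 s; the 200 ms deadline stops the read early.
        try {
            readWithDeadline(new TricklingStream(1000, 20), 200);
            System.out.println("completed");
        } catch (InterruptedIOException e) {
            System.out.println("aborted: " + e.getMessage());
        }
    }
}
```

A check of this kind only helps if control returns to the loop after each chunk. A call that blocks until a larger buffer has been filled would not reach the check while data keeps trickling in.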
   
   ### Error message and/or stacktrace
   
   In a topology, fetching the robots.txt takes forever. The Java stack trace of the fetcher thread after 30 minutes:
   ```
   "FetcherThread #5" #138 daemon prio=5 os_prio=0 cpu=78.52ms elapsed=1825.86s tid=0x000074e2340cfd40 nid=0x1bb runnable  [0x000074e17d8fd000]
      java.lang.Thread.State: RUNNABLE
           at sun.nio.ch.Net.poll([email protected]/Native Method)
           at sun.nio.ch.NioSocketImpl.park([email protected]/Unknown Source)
           at sun.nio.ch.NioSocketImpl.timedRead([email protected]/Unknown Source)
           at sun.nio.ch.NioSocketImpl.implRead([email protected]/Unknown Source)
           at sun.nio.ch.NioSocketImpl.read([email protected]/Unknown Source)
           at sun.nio.ch.NioSocketImpl$1.read([email protected]/Unknown Source)
           at java.net.Socket$SocketInputStream.read([email protected]/Unknown Source)
           at okio.internal.DefaultSocket$SocketSource.read(DefaultSocket.kt:124)
           at okio.RealBufferedSource.request(RealBufferedSource.kt:232)
           at okio.RealBufferedSource.require(RealBufferedSource.kt:225)
           at okio.RealBufferedSource.readHexadecimalUnsignedLong(RealBufferedSource.kt:411)
           at okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.readChunkSize(Http1ExchangeCodec.kt:485)
           at okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.read(Http1ExchangeCodec.kt:464)
           at okhttp3.internal.connection.Exchange$ResponseBodySource.read(Exchange.kt:346)
           at okio.RealBufferedSource.request(RealBufferedSource.kt:232)
           at org.apache.stormcrawler.protocol.okhttp.HttpProtocol.toByteArray(HttpProtocol.java:486)
           at org.apache.stormcrawler.protocol.okhttp.HttpProtocol.getProtocolOutput(HttpProtocol.java:429)
           at org.apache.stormcrawler.protocol.HttpRobotRulesParser.getRobotRulesSet(HttpRobotRulesParser.java:135)
           at org.apache.stormcrawler.protocol.RobotRulesParser.getRobotRulesSet(RobotRulesParser.java:207)
           at org.apache.stormcrawler.protocol.AbstractHttpProtocol.getRobotRules(AbstractHttpProtocol.java:159)
           at org.apache.stormcrawler.bolt.FetcherBolt$FetcherThread.run(FetcherBolt.java:540)
   ```
   
   The tuple is failed after about 10 minutes:
   ```
   2026-05-07 14:51:09.253 o.a.s.o.p.AggregationSpout I/O dispatcher 4 [INFO] [spout #4]  OpenSearch query returned 1 hits from 1 buckets in 41 msec with 0 already being processed. Took 41.0 msec per doc on average.
   ...
   2026-05-07 15:00:39.151 o.a.s.o.p.AbstractSpout Thread-32-spout-executor[17, 17] [INFO] [spout #7]   Fail for http://cbhjhlccfkqdpknyu.org/
   ```
   
   
   ### How to reproduce
   
   The easiest way to reproduce the issue is to call the main method of the protocol class:
   
   ```
   $> cat /tmp/crawler-conf-test.yaml
   config:
     http.content.limit: 1048576
     http.content.partial.as.trimmed: true
     http.store.headers: true
     http.timeout: 10000
     topology.message.timeout.secs: 60  # for testing use a lower value than the default (300 seconds)
     protocols: "http,https"
     http.protocol.implementation: org.apache.stormcrawler.protocol.okhttp.HttpProtocol
     https.protocol.implementation: org.apache.stormcrawler.protocol.okhttp.HttpProtocol
     http.protocol.versions:
       - "h2"
       - "http/1.1"
     http.trust.everything: true
     http.agent.name: "MyTestBot"
     http.agent.version: "3.0"
     http.agent.description: ""
     http.agent.url: ""
     http.agent.email: ""
   
   $> java -cp .../stormcrawler-core-3.5.2-SNAPSHOT.jar:... \
        org.apache.stormcrawler.protocol.okhttp.HttpProtocol \
        -f /tmp/crawler-conf-test.yaml http://cbhjhlccfkqdpknyu.org/
   ...
   [main] INFO org.apache.stormcrawler.protocol.okhttp.HttpProtocol - Using protocol versions: [h2, http/1.1]
   [main] INFO org.apache.stormcrawler.protocol.okhttp.HttpProtocol - Using connection pool with max. 5 idle connections and 300 sec. connection keep-alive time
   ```
   
   The Java program hangs without further output.
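
The reproduction above relies on an external host that may stop responding this way. As a local stand-in, the following sketch uses the JDK's built-in `com.sun.net.httpserver.HttpServer` to send a chunked response one byte at a time, with pauses shorter than `http.timeout`. It is an assumption that this mimics the behaviour of the host above, and it has not been verified to trigger the hang. `TrickleServer`, the chunk count and the pause length are hypothetical.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class TrickleServer {

    /** Starts a local server that sends `chunks` one-byte chunks, pausing after each one. */
    static HttpServer start(int port, int chunks, long pauseMillis) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/", exchange -> {
            // A response length of 0 selects chunked transfer encoding.
            exchange.sendResponseHeaders(200, 0);
            try (OutputStream body = exchange.getResponseBody()) {
                for (int i = 0; i < chunks; i++) {
                    body.write('x');
                    body.flush(); // push the byte out instead of buffering it
                    Thread.sleep(pauseMillis);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("usage: java TrickleServer <port>");
            return;
        }
        // Assumed values: 10,000 chunks with 5 s pauses, i.e. below an http.timeout of 10 s.
        HttpServer server = start(Integer.parseInt(args[0]), 10_000, 5_000);
        System.out.println("Serving on http://127.0.0.1:" + server.getAddress().getPort() + "/");
    }
}
```

Started with `java TrickleServer 8080`, the URL `http://127.0.0.1:8080/` could then be passed to the protocol main method in place of the external one.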
   
   ### Additional context
   
   _No response_

