sebastian-nagel opened a new pull request, #1900:
URL: https://github.com/apache/stormcrawler/pull/1900

   To force cancellation of the request:
   - set the [OkHttp call timeout](https://square.github.io/okhttp/5.x/okhttp/okhttp3/-ok-http-client/-builder/call-timeout.html) to `topology.message.timeout.secs` (if not -1)
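
   As a rough illustration of the rule above (not the code in this PR), the sketch below derives a call timeout from the Storm setting and skips it when the value is -1. The class `CallTimeoutSketch` and the helper `resolveCallTimeout` are hypothetical names, and treating a missing or non-numeric value like -1 is an assumption. It uses only the JDK; with OkHttp on the classpath the resulting `Duration` would be handed to `OkHttpClient.Builder#callTimeout(Duration)`.

   ```java
   import java.time.Duration;
   import java.util.Map;
   import java.util.Optional;

   public class CallTimeoutSketch {

       /** Storm setting named in the PR description. */
       public static final String MESSAGE_TIMEOUT_KEY = "topology.message.timeout.secs";

       /**
        * Hypothetical helper: returns the call timeout to apply, or empty when
        * the setting is -1. Assumption: a missing or non-numeric value is
        * treated the same way, i.e. no call timeout is configured.
        */
       public static Optional<Duration> resolveCallTimeout(Map<String, Object> conf) {
           Object value = conf.get(MESSAGE_TIMEOUT_KEY);
           if (!(value instanceof Number)) {
               return Optional.empty();
           }
           long secs = ((Number) value).longValue();
           if (secs == -1) {
               return Optional.empty();
           }
           return Optional.of(Duration.ofSeconds(secs));
       }

       public static void main(String[] args) {
           Map<String, Object> conf = Map.of(MESSAGE_TIMEOUT_KEY, 60);
           // With OkHttp available, this is where the value would be passed
           // to OkHttpClient.Builder#callTimeout(Duration).
           resolveCallTimeout(conf)
                   .ifPresent(t -> System.out.println("call timeout: " + t));
       }
   }
   ```

   Returning an `Optional` keeps the "no timeout" case explicit, so the caller leaves the builder untouched instead of passing a sentinel value.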
   
   Additional changes:
   - set the TrimmedReason to `TIME` if OkHttp throws an InterruptedIOException
   - log the reason why the response is trimmed
   - add a type parameter to `MutableObject` usages
   - replace calls to the deprecated method `getValue()`
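
   The first two items can be pictured with the following JDK-only sketch. It is not the code in this PR: `TrimOnTimeoutSketch`, its `TrimmedReason` enum (only `TIME` is named above), the `Result` holder, and the `timingOutAfter` test stream are hypothetical stand-ins. The assumption is that a call timeout surfaces as an `InterruptedIOException` while the body is being read, and that the bytes received so far are kept.

   ```java
   import java.io.ByteArrayOutputStream;
   import java.io.IOException;
   import java.io.InputStream;
   import java.io.InterruptedIOException;
   import java.io.UncheckedIOException;

   public class TrimOnTimeoutSketch {

       /** Hypothetical stand-in; only TIME is named in the PR description. */
       public enum TrimmedReason { NOT_TRIMMED, TIME }

       /** Hypothetical holder for the bytes read and why reading stopped. */
       public static final class Result {
           public final byte[] content;
           public final TrimmedReason reason;

           Result(byte[] content, TrimmedReason reason) {
               this.content = content;
               this.reason = reason;
           }
       }

       /** Reads the stream; a timeout keeps the partial content, marked TIME. */
       public static Result readBody(InputStream in) {
           ByteArrayOutputStream out = new ByteArrayOutputStream();
           byte[] buf = new byte[8192];
           try {
               int n;
               while ((n = in.read(buf)) != -1) {
                   out.write(buf, 0, n);
               }
           } catch (InterruptedIOException e) {
               // Log the reason why the content is trimmed.
               System.err.println(
                       "content trimmed to " + out.size() + " (reason: TIME)");
               return new Result(out.toByteArray(), TrimmedReason.TIME);
           } catch (IOException e) {
               throw new UncheckedIOException(e);
           }
           return new Result(out.toByteArray(), TrimmedReason.NOT_TRIMMED);
       }

       /** Test stream: delivers the given bytes, then simulates a timeout. */
       public static InputStream timingOutAfter(byte[] data) {
           return new InputStream() {
               private int pos = 0;

               @Override
               public int read() throws IOException {
                   if (pos < data.length) {
                       return data[pos++] & 0xff;
                   }
                   throw new InterruptedIOException("simulated timeout");
               }
           };
       }
   }
   ```

   Catching `InterruptedIOException` before the general `IOException` is what lets a timeout be reported as a trimmed response rather than as a failed fetch.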
   
   So far, the solution has only been verified using the Protocol main method:
   ```
   $> java -cp .../stormcrawler-core-3.5.2-SNAPSHOT.jar:... \
        org.apache.stormcrawler.protocol.okhttp.HttpProtocol \
        -f /tmp/crawler-conf-test.yaml http://cbhjhlccfkqdpknyu.org/
   ...
   [main] INFO org.apache.stormcrawler.protocol.okhttp.HttpProtocol - Using protocol versions: [h2, http/1.1]
   [main] INFO org.apache.stormcrawler.protocol.okhttp.HttpProtocol - Using connection pool with max. 5 idle connections and 300 sec. connection keep-alive time
   [Thread-0] WARN org.apache.stormcrawler.protocol.okhttp.HttpProtocol - HTTP content trimmed to 10 (reason: TIME)
   [Thread-0] WARN crawlercommons.robots.SimpleRobotRulesParser - Problem processing robots.txt for http://cbhjhlccfkqdpknyu.org/
   [Thread-0] WARN crawlercommons.robots.SimpleRobotRulesParser -   Unknown line in robots.txt file (size 10): DQEPigDriE
   [Thread-0] WARN org.apache.stormcrawler.protocol.okhttp.HttpProtocol - HTTP content trimmed to 10 (reason: TIME)
   http://cbhjhlccfkqdpknyu.org/
   robots allowed: true
   robots requests: 1
   sitemaps identified: 0
   date: Thu, 07 May 2026 14:02:24 GMT
   server: nginx/1.21.6
   transfer-encoding: chunked
   _protocol_versions_: http/1.1
   metrics.dns.resolution.msec: 4
   http.trimmed.reason: time
   keep-alive: timeout=20
   _request.headers_: GET / HTTP/1.1
   User-Agent: MyTestBot/3.0
   Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
   Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
   Accept-Encoding: zstd, br, gzip
   Host: cbhjhlccfkqdpknyu.org
   Connection: Keep-Alive
   
   
   http.trimmed: true
   _request.time_: 1778162544171
   content-type: application/octet-stream
   connection: keep-alive
   _response.ip_: 216.218.185.162
   _response.headers_: HTTP/1.1 200 OK
   Server: nginx/1.21.6
   Date: Thu, 07 May 2026 14:02:24 GMT
   Content-Type: application/octet-stream
   Transfer-Encoding: chunked
   Connection: keep-alive
   Keep-Alive: timeout=20
   
   
   
   status code: 200
   content length: 10
   fetched in : 60002 msec
   ```

