rzo1 opened a new issue, #1890:
URL: https://github.com/apache/stormcrawler/issues/1890
# Summary
HttpProtocol.getProtocolOutput(...) only calls page.content() when the
initial HTTP response status is in the FETCHED (2xx) bucket. For Single-Page
Applications that ship a stub document with a non-2xx status (e.g. 404) and
then hydrate the real content via JavaScript, the rendered DOM is thrown away
and ProtocolResponse.content is empty, defeating the whole point of using a
headless browser as the protocol.
# Reproduction
1. Build & run the Playwright protocol standalone against a
Scrivito-powered page:
```
cd external/playwright
mvn clean compile
mvn exec:java \
-Dexec.mainClass=org.apache.stormcrawler.protocol.playwright.HttpProtocol \
-Dexec.classpathScope=compile \
-Dexec.args='-f playwright-conf.yaml -b https://www.hs-heilbronn.de/de/angewandte-informatik'
```
(playwright-conf.yaml needs http.agent.name, http.robots.file.skip: true,
and playwright.load.event: networkidle set.)
2. Observe: status code: 404; content length: 0.
No content is dumped, even though opening the same URL in a real
browser displays a fully rendered page titled "Angewandte Informatik".
3. curl https://www.hs-heilbronn.de/de/angewandte-informatik confirms the
origin returns HTTP/2 404 with a Scrivito bootstrap shell ("Your Scrivito
powered site is loading ..."); the real content is fetched at runtime from
api.scrivito.com.
# Expected behavior
When the headless browser successfully renders a page, the rendered DOM
should be returned as content, regardless of the initial HTTP status code. The
originating HTTP status (404 here) should still be reported in
ProtocolResponse.statusCode / metadata for diagnostics, but downstream
consumers should be able to access the rendered HTML.
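
To make the difference concrete, here is a minimal, self-contained sketch that contrasts the two behaviours. It is only a model of the decision, not StormCrawler's actual code: the class `RenderedContentSketch`, the `Response` holder, and both method names are hypothetical, and the rendered HTML is passed in as a plain string instead of coming from page.content().

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical, simplified model of the content-capture decision; not the real HttpProtocol. */
public class RenderedContentSketch {

    /** Minimal stand-in for the two response fields discussed in this issue. */
    static final class Response {
        final int statusCode;
        final byte[] content;

        Response(int statusCode, byte[] content) {
            this.statusCode = statusCode;
            this.content = content;
        }
    }

    /** Models the behaviour described in the Summary: rendered HTML is kept only for 2xx. */
    static Response currentBehaviour(int status, String renderedHtml) {
        boolean fetched = status >= 200 && status < 300;
        byte[] content = fetched
                ? renderedHtml.getBytes(StandardCharsets.UTF_8)
                : new byte[0];
        return new Response(status, content);
    }

    /** Models the requested behaviour: keep the rendered HTML and still report the original status. */
    static Response expectedBehaviour(int status, String renderedHtml) {
        return new Response(status, renderedHtml.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Placeholder markup standing in for the DOM after client-side rendering.
        String html = "<html><head><title>Angewandte Informatik</title></head><body></body></html>";

        Response current = currentBehaviour(404, html);
        Response expected = expectedBehaviour(404, html);

        System.out.println("current:  status=" + current.statusCode
                + " content length=" + current.content.length);
        System.out.println("expected: status=" + expected.statusCode
                + " content length=" + expected.content.length);
    }
}
```

In both cases the 404 is preserved for diagnostics; only the expected variant leaves the rendered markup available to downstream consumers. Whether this should be the default or sit behind a configuration option is left open here.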