rzo1 opened a new issue, #1890:
URL: https://github.com/apache/stormcrawler/issues/1890

                                                                               
# Summary

HttpProtocol.getProtocolOutput(...) only calls page.content() when the
initial HTTP response status is in the FETCHED (2xx) bucket. For Single-Page
Applications that ship a stub document with a non-2xx status (e.g. 404) and
then hydrate the real content via JavaScript, the rendered DOM is discarded
and ProtocolResponse.content is empty, which defeats the purpose of using a
headless browser as the protocol.
                            
# Reproduction

1. Build and run the Playwright protocol standalone against a
   Scrivito-powered page:
                     
   ```
   cd external/playwright
   mvn clean compile
   mvn exec:java \
     -Dexec.mainClass=org.apache.stormcrawler.protocol.playwright.HttpProtocol \
     -Dexec.classpathScope=compile \
     -Dexec.args='-f playwright-conf.yaml -b https://www.hs-heilbronn.de/de/angewandte-informatik'
   ```
   
   (playwright-conf.yaml needs http.agent.name, http.robots.file.skip: true,
   and playwright.load.event: networkidle set.)

2. Observe: status code: 404; content length: 0. No content is dumped, even
   though opening the same URL in a real browser displays a fully rendered
   page titled "Angewandte Informatik".

3. curl https://www.hs-heilbronn.de/de/angewandte-informatik confirms that
   the origin returns HTTP/2 404 with a Scrivito bootstrap shell ("Your
   Scrivito powered site is loading ..."); the real content is fetched at
   runtime from api.scrivito.com.
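
For reference, the three settings named in step 1 might look like this in playwright-conf.yaml. This is a sketch only: the top-level `config:` key is an assumption based on the usual StormCrawler configuration layout, and the agent name is a hypothetical placeholder, not the value used in the run above.

```yaml
# Sketch of playwright-conf.yaml; only the three key names come from the report.
config:
  # Hypothetical placeholder; substitute your own crawler's agent name.
  http.agent.name: "example-crawler"
  # Value as given in the report.
  http.robots.file.skip: true
  # Value as given in the report.
  playwright.load.event: "networkidle"
```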
                   
                                                                                
                                                                                
      
# Expected behavior

When the headless browser successfully renders a page, the rendered DOM
should be returned as content, regardless of the initial HTTP status code.
The original HTTP status (404 here) should still be reported in
ProtocolResponse.statusCode / metadata for diagnostics, but downstream
consumers should be able to access the rendered HTML.
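
The difference between the current and the expected behavior can be sketched in plain Java. This is not the StormCrawler code: `RenderedContentSketch`, `Response`, `currentBehavior` and `proposedBehavior` are hypothetical names, and the `Supplier<String>` merely stands in for the `page.content()` call mentioned above. It also assumes that a rendering failure surfaces as a runtime exception.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

/** Illustrative sketch only; none of these names exist in StormCrawler. */
final class RenderedContentSketch {

    /** Minimal stand-in for a protocol response: status code plus content bytes. */
    static final class Response {
        final int statusCode;
        final byte[] content;

        Response(int statusCode, byte[] content) {
            this.statusCode = statusCode;
            this.content = content;
        }
    }

    /** Behavior described in the summary: the DOM is read only for 2xx statuses. */
    static Response currentBehavior(int status, Supplier<String> renderedDom) {
        boolean fetched = status >= 200 && status < 300;
        byte[] content = fetched
                ? renderedDom.get().getBytes(StandardCharsets.UTF_8)
                : new byte[0];
        return new Response(status, content);
    }

    /** Expected behavior: always try to read the DOM, keep the original status. */
    static Response proposedBehavior(int status, Supplier<String> renderedDom) {
        String html;
        try {
            html = renderedDom.get();
        } catch (RuntimeException e) {
            // Assumption: a failed render throws; fall back to empty content.
            html = null;
        }
        byte[] content = html == null
                ? new byte[0]
                : html.getBytes(StandardCharsets.UTF_8);
        return new Response(status, content);
    }

    public static void main(String[] args) {
        // Stand-in for the DOM the browser holds after hydration.
        Supplier<String> dom = () -> "<html><title>Angewandte Informatik</title></html>";

        Response before = currentBehavior(404, dom);
        Response after = proposedBehavior(404, dom);

        System.out.println("current:  status=" + before.statusCode
                + " length=" + before.content.length);
        System.out.println("proposed: status=" + after.statusCode
                + " length=" + after.content.length);
    }
}
```

In both variants the 404 is passed through unchanged, so status-based handling downstream keeps working; only the content differs.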

