The GitHub Actions job "Java CI with Maven" on stormcrawler.git/feature/466 has succeeded. Run started by GitHub user rzo1 (triggered by rzo1).
Head commit for run:
372e8f8ec751dcfda78cc875ec5946d5d59f56b9 / Richard Zowalla <[email protected]>
#466 - Handle text/plain content in JSoupParserBolt

text/plain content requires no markup parsing, so instead of raising an error (the default with jsoup.treat.non.html.as.error) the bolt now uses the decoded content directly as the extracted text and emits no outlinks.

The plain-text path does not run the TextExtractor (there is no markup to extract from), so the two size-related knobs are read in prepare() and applied directly: empty text when textextractor.no.text is set, truncated to textextractor.skip.after otherwise. substring keeps the original layout, which is the point of a .txt; http.content.limit remains the bound for the raw fetched bytes. The include/exclude knobs require markup and have no effect.

Adds unit tests for the verbatim, skip.after-truncation and no.text cases and documents the behaviour and bounds in configuration.adoc and internals.adoc.

Closes #466

Report URL: https://github.com/apache/stormcrawler/actions/runs/27607704404

With regards,
GitHub Actions via GitBox
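
For illustration only, the plain-text handling described in the commit message above could be sketched roughly as follows. This is not the code from the commit: the class PlainTextSketch, the method extractPlainText and its parameters are hypothetical stand-ins for the values read from textextractor.no.text and textextractor.skip.after, and treating a negative skipAfter as "no limit" is an assumption.

```java
public final class PlainTextSketch {

    private PlainTextSketch() {}

    /**
     * Returns the text to emit for a text/plain document.
     *
     * @param decoded   the fetched content, already decoded to a String
     * @param noText    hypothetical stand-in for textextractor.no.text
     * @param skipAfter hypothetical stand-in for textextractor.skip.after;
     *                  a negative value is assumed to mean "no limit"
     */
    static String extractPlainText(String decoded, boolean noText, int skipAfter) {
        if (noText) {
            // No text wanted at all: emit an empty string.
            return "";
        }
        if (skipAfter >= 0 && decoded.length() > skipAfter) {
            // substring keeps whitespace and line breaks of the kept prefix intact.
            return decoded.substring(0, skipAfter);
        }
        // Within the limit (or no limit): use the content verbatim.
        return decoded;
    }

    public static void main(String[] args) {
        String body = "line one\n  indented line two\n";
        System.out.println(extractPlainText(body, false, -1).equals(body));
        System.out.println(extractPlainText(body, false, 8));
        System.out.println(extractPlainText(body, true, 8).isEmpty());
    }
}
```

Note that String.substring counts UTF-16 code units, so a cut placed exactly between the two halves of a surrogate pair would split that character; whether the actual change guards against this is not stated in the commit message.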
