The GitHub Actions job "Java CI with Maven" on stormcrawler.git/feature/466 has 
succeeded.
Run started by GitHub user rzo1 (triggered by rzo1).

Head commit for run:
372e8f8ec751dcfda78cc875ec5946d5d59f56b9 / Richard Zowalla <[email protected]>
#466 - Handle text/plain content in JSoupParserBolt

text/plain content requires no markup parsing, so instead of raising an error
(the default with jsoup.treat.non.html.as.error) the bolt now uses the decoded
content directly as the extracted text and emits no outlinks.

The plain-text path does not run the TextExtractor (there is no markup to
extract from), so the two size-related knobs are read in prepare() and applied
directly: empty text when textextractor.no.text is set, truncated to
textextractor.skip.after otherwise. substring keeps the original layout, which
is the point of a .txt; http.content.limit remains the bound for the raw
fetched bytes. The include/exclude knobs require markup and have no effect.
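
As an illustration only, the size handling described above could be sketched
as follows. This is not the code from the commit: the class PlainTextSketch,
the method plainText and its parameters are hypothetical names, and treating a
negative skip.after value as "no limit" is an assumption.

```java
// Illustrative sketch only; class, method and parameter names are hypothetical.
class PlainTextSketch {

    /**
     * Applies the two size-related knobs to already-decoded text/plain content.
     *
     * @param decoded   the decoded content of the document
     * @param noText    stand-in for the value of textextractor.no.text
     * @param skipAfter stand-in for the value of textextractor.skip.after
     */
    static String plainText(String decoded, boolean noText, int skipAfter) {
        if (noText) {
            // No text is kept at all when no.text is set.
            return "";
        }
        // Assumption: a negative value means "no limit".
        if (skipAfter >= 0 && decoded.length() > skipAfter) {
            // substring cuts at a character offset and leaves the kept
            // part untouched, so whitespace and line breaks survive.
            return decoded.substring(0, skipAfter);
        }
        return decoded;
    }
}
```

In this sketch, plainText(content, false, -1) returns the content unchanged,
and a limit larger than the content length has no effect either.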

Adds unit tests for the verbatim, skip.after-truncation and no.text cases and
documents the behaviour and bounds in configuration.adoc and internals.adoc.

Closes #466

Report URL: https://github.com/apache/stormcrawler/actions/runs/27607704404

With regards,
GitHub Actions via GitBox
