[ https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-3814:
------------------------------
Priority: Minor (was: Critical)
> Extracted text from HTML file does not exclude newline chars from body
> ----------------------------------------------------------------------
>
> Key: TIKA-3814
> URL: https://issues.apache.org/jira/browse/TIKA-3814
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.3.0
> Reporter: Sai Konuri
> Priority: Minor
> Attachments: bug.html, image-2022-07-06-19-08-30-437.png,
> image-2022-07-06-19-09-54-534.png
>
>
> When a newline character ('\n') appears within the text of a <span>, <p>,
> <text>, etc., the extracted text still contains those newlines instead of
> excluding them.
> A sample HTML file (bug.html) is attached.
>
> {*}Expected{*}:
> !image-2022-07-06-19-08-30-437.png!
>
> {*}Actual{*}:
> !image-2022-07-06-19-09-54-534.png!
>
>
> This is the code I am using to extract the text of the HTML file:
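> {code:java}
> import java.io.InputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.sax.BodyContentHandler;
>
> AutoDetectParser parser = new AutoDetectParser();
> BodyContentHandler handler = new BodyContentHandler();
> Metadata metadata = new Metadata();
> // Load bug.html from the classpath and print the extracted body text.
> try (InputStream stream =
>         this.getClass().getClassLoader().getResourceAsStream("bug.html")) {
>     parser.parse(stream, handler, metadata);
>     System.out.println(handler);
> }
> {code}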
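
One possible workaround for the behavior reported above, independent of Tika itself, is to collapse whitespace runs in the extracted string after parsing. The following is only a minimal sketch: the class name WhitespaceNormalizer and the method normalize are hypothetical and not part of Tika, and it assumes that replacing every run of whitespace (including paragraph breaks) with a single space is acceptable for the caller.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: post-processes extracted text.
final class WhitespaceNormalizer {

    // Matches one or more whitespace characters (space, tab, \n, \r, \f, \u000B).
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    private WhitespaceNormalizer() {
    }

    // Replaces each whitespace run with a single space and trims both ends.
    // Assumption: paragraph breaks do not need to be preserved.
    static String normalize(String extracted) {
        if (extracted == null) {
            return "";
        }
        return WHITESPACE_RUN.matcher(extracted).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Stand-in for handler.toString() from the snippet in the report.
        String extracted = "Some text inside\na span element\n";
        System.out.println(normalize(extracted));
    }
}
```

With the snippet from the report, this would be applied as normalize(handler.toString()). Note that the \s class used here does not match non-breaking spaces, so those would be left untouched.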
--
This message was sent by Atlassian Jira
(v8.20.10#820010)