[ 
https://issues.apache.org/jira/browse/TIKA-4770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18093316#comment-18093316
 ] 

Hudson commented on TIKA-4770:
------------------------------

ABORTED: Integrated in Jenkins build Tika » tika-main-jdk17 #1453 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk17/1453/])
TIKA-4770: Add a Markdown parser with structured, lossless XHTML output (#2922) 
(github: 
[https://github.com/apache/tika/commit/aca20dc144a398420bc2193ad97303c81fe2fff1])
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-datauri-commons/src/test/java/org/apache/tika/parser/datauri/DataURISchemeParserTest.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/markdown/MarkdownParserTest.java
* (edit) tika-bom/pom.xml
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-datauri-commons/pom.xml
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-datauri-commons/src/main/java/org/apache/tika/parser/datauri/DataURIScheme.java
* (delete) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/java/org/apache/tika/parser/html/DataURISchemeParserTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/pom.xml
* (delete) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/main/java/org/apache/tika/parser/html/DataURISchemeParseException.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-datauri-commons/src/main/java/org/apache/tika/parser/datauri/DataURISchemeParseException.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-datauri-commons/src/main/java/org/apache/tika/parser/datauri/DataURISchemeUtil.java
* (delete) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/main/java/org/apache/tika/parser/html/DataURISchemeUtil.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/config/TikaEncodingDetectorTest.java
* (delete) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/main/java/org/apache/tika/parser/html/DataURIScheme.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/markdown/MarkdownParser.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/pom.xml
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/pom.xml
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/resources/test-documents/testMARKDOWN.md
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/resources/test-documents/testMARKDOWN_dataURI.md


> Add a Markdown parser with structured XHTML output
> --------------------------------------------------
>
>                 Key: TIKA-4770
>                 URL: https://issues.apache.org/jira/browse/TIKA-4770
>             Project: Tika
>          Issue Type: Task
>          Components: parser
>            Reporter: Kristian Rickert
>            Priority: Major
>
> Markdown files are already detected as {{text/markdown}} ({{\*.md}} / 
> {{\*.markdown}} globs in tika-mimetypes.xml), but no parser claims the type, 
> so they fall through to {{TXTParser}} and come back as flat text.
> Add a {{MarkdownParser}} to {{tika-parser-text-module}} using commonmark-java 
> (already a Tika dependency behind {{ToMarkdownContentHandler}}, TIKA-4730). 
> It parses the markdown AST and emits structured XHTML: {{h1..h6}}, 
> {{ul}}/{{ol}}/{{li}}, {{blockquote}}, {{pre}}/{{code}} with a language class, 
> GFM tables as {{table}}/{{thead}}/{{tbody}}/{{tr}}/{{th}}/{{td}} with 
> alignment, {{em}}/{{strong}}/{{del}}, links, images, {{hr}}. Raw HTML in the 
> source is emitted as escaped text so nothing is injected. Encoding detection 
> is done via {{AutoDetectReader}}, consistent with {{TXTParser}}.
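
As a rough illustration of the "raw HTML is emitted as escaped text" behaviour described above, here is a minimal, stdlib-only sketch. It is not the code from the commit: the actual {{MarkdownParser}} is described as walking a commonmark-java AST, and would presumably write to a SAX handler rather than build strings. The class and method names ({{MarkdownXhtmlSketch}}, {{escapeText}}, {{heading}}) are hypothetical and exist only for this example.

```java
// Hypothetical sketch, not the TIKA-4770 implementation.
// Shows the idea of wrapping parsed text in structural XHTML elements
// while escaping any markup characters found in the source text.
public class MarkdownXhtmlSketch {

    // Escape the characters that are significant in XHTML text content,
    // so raw HTML from the markdown source shows up as literal text.
    static String escapeText(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wrap heading text in h1..h6; the level is clamped to that range
    // (an assumption made for this sketch).
    static String heading(int level, String text) {
        int l = Math.max(1, Math.min(6, level));
        return "<h" + l + ">" + escapeText(text) + "</h" + l + ">";
    }

    public static void main(String[] args) {
        // prints: <h2>Use &lt;script&gt; &amp; friends</h2>
        System.out.println(heading(2, "Use <script> & friends"));
    }
}
```

When output goes through a SAX {{ContentHandler}}, text passed to {{characters()}} is normally escaped by the serializer, so a hand-written escape step like this one would likely be unnecessary there.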



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
