Kristian Rickert created TIKA-4770:
--------------------------------------
Summary: Add a Markdown parser with structured XHTML output
Key: TIKA-4770
URL: https://issues.apache.org/jira/browse/TIKA-4770
Project: Tika
Issue Type: Task
Components: parser
Reporter: Kristian Rickert
Markdown files are already detected as {{text/markdown}} ({{\*.md}} /
{{\*.markdown}} globs in tika-mimetypes.xml), but no parser claims the type, so
they fall through to {{TXTParser}} and come back as flat text.
Add a {{MarkdownParser}} to {{tika-parser-text-module}} using commonmark-java
(already a Tika dependency behind {{ToMarkdownContentHandler}}, TIKA-4730). It
parses the Markdown source into an AST and emits structured XHTML: {{h1..h6}},
{{ul}}/{{ol}}/{{li}}, {{blockquote}}, {{pre}}/{{code}} with a language class,
GFM tables as {{table}}/{{thead}}/{{tbody}}/{{tr}}/{{th}}/{{td}} with
alignment, {{em}}/{{strong}}/{{del}}, links, images, {{hr}}. Raw HTML in the
source is emitted as escaped text so no markup is injected into the output.
Encoding detection uses {{AutoDetectReader}}, consistent with {{TXTParser}}.
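
To illustrate the raw-HTML handling described above, here is a minimal, stdlib-only sketch of what "emitted as escaped text" could look like in the serialized output. The class name {{RawHtmlEscapeSketch}} and the {{escape}} helper are hypothetical and are not part of the proposed parser. In a SAX-based parser the raw HTML literal would more likely be passed to {{characters()}} and escaped by the serializing handler, so treat this only as a picture of the end result.

```java
public class RawHtmlEscapeSketch {

    /**
     * Hypothetical helper: escapes the characters that are significant in
     * XHTML text content so raw HTML from the Markdown source shows up as
     * literal text instead of markup.
     */
    public static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                case '"':
                    out.append("&quot;");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Raw HTML as it might appear inline in a Markdown file.
        String rawHtml = "<script>alert(\"x\")</script>";
        // prints: &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;
        System.out.println(escape(rawHtml));
    }
}
```

The ampersand is handled like any other character in a single pass, so input that is already escaped (for example {{&lt;}}) is escaped again rather than passed through, which keeps the output a faithful textual copy of the source.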