[
https://issues.apache.org/jira/browse/TIKA-4770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18093095#comment-18093095
]
ASF GitHub Bot commented on TIKA-4770:
--------------------------------------
krickert opened a new pull request, #2922:
URL: https://github.com/apache/tika/pull/2922
## Summary
`.md` files are already detected as `text/markdown` (globs in
`tika-mimetypes.xml`), but no parser claims the type, so they fall through to
`TXTParser` and come back as flat text — headings, tables, and code fences all
collapse into an undifferentiated string.
This adds a `MarkdownParser` to `tika-parser-text-module`. It uses
commonmark-java, which is already a Tika dependency (it backs
`ToMarkdownContentHandler`, TIKA-4730), to parse the markdown AST and emit
structured XHTML:
| Markdown | XHTML |
|---|---|
| `#`..`######` / setext | `h1`..`h6` |
| lists (incl. GFM tight lists) | `ul` / `ol` (with `start` when not 1) / `li` |
| fenced / indented code | `pre`/`code` with `class="language-x"` (+ `data-info` for any extra fence info) |
| GFM tables | `table`/`thead`/`tbody`/`tr`/`th`/`td` with `align` |
| emphasis / strong / GFM strikethrough | `em` / `strong` / `del` |
| links, images | `a href title`, `img src alt title` |
| block quotes, thematic breaks | `blockquote`, `hr` |
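
To make the fenced-code row concrete, here is a minimal stdlib-only sketch of how a fence info string could be split into the two attributes. It is not the code from this PR: `FenceInfoAttributes` and `forInfo` are hypothetical names, and the sketch assumes that `data-info` carries whatever follows the first whitespace-separated word of the info string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of the PR: illustrates the fence-info mapping only. */
class FenceInfoAttributes {

    /**
     * Maps a fence info string to attributes for the code element.
     * Assumption: the first word is the language, the remainder goes to data-info.
     */
    static Map<String, String> forInfo(String info) {
        Map<String, String> attrs = new LinkedHashMap<>();
        String trimmed = info == null ? "" : info.trim();
        if (trimmed.isEmpty()) {
            return attrs; // indented code or a bare fence: no attributes
        }
        // Split into at most two parts: language word and everything after it.
        String[] parts = trimmed.split("\\s+", 2);
        attrs.put("class", "language-" + parts[0]);
        if (parts.length > 1) {
            attrs.put("data-info", parts[1]);
        }
        return attrs;
    }

    public static void main(String[] args) {
        System.out.println(forInfo("java title=Example.java"));
        System.out.println(forInfo("python"));
        System.out.println(forInfo(""));
    }
}
```

Keeping the remainder in a separate attribute means a consumer that only reads `class` still sees a clean language name, while nothing from the info string is dropped.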
## Fidelity and safety
- **No content loss**: every literal the commonmark AST carries reaches the
output — including image alt text with code spans, ordered-list start numbers,
and full code-fence info strings. Only markdown *syntax presentation*
(bullet/fence/emphasis delimiter characters) is normalized, identical to
commonmark's reference `HtmlRenderer`.
- **Raw HTML in the source is emitted as escaped text** — preserved, but
never injected into the XHTML stream.
- **Encoding detection** via `AutoDetectReader`, the same idiom as
`TXTParser`; detected charset lands in `Content-Type`/`Content-Encoding`.
- Registered via `@TikaComponent`, same as the other text-module parsers. No
MIME changes needed.
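
The raw-HTML point can be illustrated without Tika. The sketch below is an assumption about the general mechanism, not the parser's actual code: when a literal is sent through SAX `characters()` rather than parsed as markup, the serializer escapes it, so the text is preserved but cannot introduce elements. `RawHtmlAsText` and `emitParagraph` are hypothetical names; only JDK classes are used.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical demo class, not part of the PR. */
class RawHtmlAsText {

    /** Emits the literal as character data inside a p element and returns the serialized XML. */
    static String emitParagraph(String literal) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            handler.startDocument();
            handler.startElement("", "p", "p", new AttributesImpl());
            // The raw HTML is passed as text, never as startElement/endElement events.
            char[] chars = literal.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        // The angle brackets of the script tag come out as entity references.
        System.out.println(emitParagraph("<script>alert(1)</script>"));
    }
}
```

Because escaping happens in the serializer, the handler never needs its own HTML sanitizer: anything that arrives as character data stays character data.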
Because the emitted vocabulary matches what `ToMarkdownContentHandler`
consumes, a markdown document round-trips markdown → XHTML → markdown (there's
a test for it).
## Relationship to other work
Independent of the gRPC Document-contract PR (#2921) — this is the input
direction (`.md` files into Tika); that PR is the output direction. They share
only the commonmark library.
## Test plan
- [x] `MarkdownParserTest`: 9 tests — structure, GFM tables with alignment
sections, raw-HTML escaping, ordered-list start numbers, code-span alt text,
fence info preservation, charset detection with non-ASCII content, markdown
round-trip
- [x] full `tika-parser-text-module` test suite green (no regressions in
TXT/CSV parsers)
- [x] `apache-rat:check` green
- [ ] CI
> Add a Markdown parser with structured XHTML output
> --------------------------------------------------
>
> Key: TIKA-4770
> URL: https://issues.apache.org/jira/browse/TIKA-4770
> Project: Tika
> Issue Type: Task
> Components: parser
> Reporter: Kristian Rickert
> Priority: Major
>
> Markdown files are already detected as {{text/markdown}} ({{\*.md}} /
> {{\*.markdown}} globs in tika-mimetypes.xml), but no parser claims the type,
> so they fall through to {{TXTParser}} and come back as flat text.
> Add a {{MarkdownParser}} to {{tika-parser-text-module}} using commonmark-java
> (already a Tika dependency behind {{ToMarkdownContentHandler}}, TIKA-4730).
> It parses the markdown AST and emits structured XHTML: {{h1..h6}},
> {{ul}}/{{ol}}/{{li}}, {{blockquote}}, {{pre}}/{{code}} with a language class,
> GFM tables as {{table}}/{{thead}}/{{tbody}}/{{tr}}/{{th}}/{{td}} with
> alignment, {{em}}/{{strong}}/{{del}}, links, images, {{hr}}. Raw HTML in the
> source is emitted as escaped text so nothing is injected. Encoding detection
> via {{AutoDetectReader}}, consistent with {{TXTParser}}.