[jira] [Commented] (TIKA-4770) Add a Markdown parser with structured XHTML output

ASF GitHub Bot (Jira) Wed, 01 Jul 2026 18:39:17 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18093095#comment-18093095
 ]


ASF GitHub Bot commented on TIKA-4770:
--------------------------------------

krickert opened a new pull request, #2922:
URL: https://github.com/apache/tika/pull/2922

   ## Summary
   
   `.md` files are already detected as `text/markdown` (globs in 
`tika-mimetypes.xml`), but no parser claims the type, so they fall through to 
`TXTParser` and come back as flat text — headings, tables, and code fences all 
collapse into an undifferentiated string.
   
   This adds a `MarkdownParser` to `tika-parser-text-module` using 
commonmark-java — already a Tika dependency (it backs 
`ToMarkdownContentHandler`, TIKA-4730) — that parses the markdown AST and emits 
structured XHTML:
   
   | Markdown | XHTML |
   |---|---|
   | `#`..`######` / setext | `h1`..`h6` |
   | lists (incl. GFM tight lists) | `ul` / `ol` (with `start` when not 1) / `li` |
   | fenced / indented code | `pre`/`code` with `class="language-x"` (+ `data-info` for any extra fence info) |
   | GFM tables | `table`/`thead`/`tbody`/`tr`/`th`/`td` with `align` |
   | emphasis / strong / GFM strikethrough | `em` / `strong` / `del` |
   | links, images | `a href title`, `img src alt title` |
   | block quotes, thematic breaks | `blockquote`, `hr` |
   
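As a rough illustration of the fenced-code row above, the sketch below splits a fence info string into a `language-x` class plus a `data-info` attribute. It uses only the JDK and is not the PR's code: the class and method names (`FenceInfoSketch`, `codeAttributes`, `escapeAttr`) are hypothetical, and it assumes `data-info` carries whatever follows the first word of the info string.

```java
final class FenceInfoSketch {

    /** Escapes the characters that are unsafe inside a double-quoted attribute value. */
    static String escapeAttr(String s) {
        // '&' must be replaced first so the entities added below are not escaped again.
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /**
     * Builds the attribute string for a code element from a fence info string.
     * Assumption: the first word is the language, the remainder goes to data-info.
     */
    static String codeAttributes(String info) {
        String trimmed = (info == null) ? "" : info.trim();
        if (trimmed.isEmpty()) {
            return ""; // indented code, or a fence without an info string
        }
        // Split into at most two parts: the language word and everything after it.
        String[] parts = trimmed.split("\\s+", 2);
        StringBuilder attrs = new StringBuilder();
        attrs.append(" class=\"language-").append(escapeAttr(parts[0])).append('"');
        if (parts.length == 2) {
            attrs.append(" data-info=\"").append(escapeAttr(parts[1])).append('"');
        }
        return attrs.toString();
    }

    public static void main(String[] args) {
        System.out.println("<code" + codeAttributes("java title=Example.java") + ">");
    }
}
```

Keeping the remainder in a separate attribute means the language class stays a single token while the rest of the info string is still carried through.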
   ## Fidelity and safety
   
   - **No content loss**: every literal the commonmark AST carries reaches the 
output — including image alt text with code spans, ordered-list start numbers, 
and full code-fence info strings. Only markdown *syntax presentation* 
(bullet/fence/emphasis delimiter characters) is normalized, identical to 
commonmark's reference `HtmlRenderer`.
   - **Raw HTML in the source is emitted as escaped text** — preserved, but 
never injected into the XHTML stream.
   - **Encoding detection** via `AutoDetectReader`, the same idiom as 
`TXTParser`; detected charset lands in `Content-Type`/`Content-Encoding`.
   - Registered via `@TikaComponent`, same as the other text-module parsers. No 
MIME changes needed.
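To make the raw-HTML point above concrete, here is a minimal JDK-only sketch of what "emitted as escaped text" means for the serialized output. The names (`RawHtmlEscapeSketch`, `escapeText`) are hypothetical and this is not the PR's implementation; in a SAX pipeline the same effect usually comes from passing the literal as character data and letting the serializer escape it.

```java
final class RawHtmlEscapeSketch {

    /** Escapes markup-significant characters so the literal survives as visible text. */
    static String escapeText(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The raw HTML literal from the markdown source is kept, but only as text.
        System.out.println("<p>" + escapeText("<script>alert(1)</script>") + "</p>");
    }
}
```

The source characters are all still present in the output, which matches the "preserved, but never injected" wording: nothing from the markdown input can open or close an element in the XHTML stream.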
   
   Because the emitted vocabulary matches what `ToMarkdownContentHandler` 
consumes, a markdown document round-trips markdown → XHTML → markdown (there's 
a test for it).
   
   ## Relationship to other work
   
   Independent of the gRPC Document-contract PR (#2921) — this is the input 
direction (`.md` files into Tika); that PR is the output direction. They share 
only the commonmark library.
   
   ## Test plan
   
   - [x] `MarkdownParserTest`: 9 tests — structure, GFM tables with alignment 
sections, raw-HTML escaping, ordered-list start numbers, code-span alt text, 
fence info preservation, charset detection with non-ASCII content, markdown 
round-trip
   - [x] full `tika-parser-text-module` test suite green (no regressions in 
TXT/CSV parsers)
   - [x] `apache-rat:check` green
   - [ ] CI
   




> Add a Markdown parser with structured XHTML output
> --------------------------------------------------
>
>                 Key: TIKA-4770
>                 URL: https://issues.apache.org/jira/browse/TIKA-4770
>             Project: Tika
>          Issue Type: Task
>          Components: parser
>            Reporter: Kristian Rickert
>            Priority: Major
>
> Markdown files are already detected as {{text/markdown}} ({{\*.md}} / 
> {{\*.markdown}} globs in tika-mimetypes.xml), but no parser claims the type, 
> so they fall through to {{TXTParser}} and come back as flat text.
> Add a {{MarkdownParser}} to {{tika-parser-text-module}} using commonmark-java 
> (already a Tika dependency behind {{ToMarkdownContentHandler}}, TIKA-4730). 
> It parses the markdown AST and emits structured XHTML: {{h1..h6}}, 
> {{ul}}/{{ol}}/{{li}}, {{blockquote}}, {{pre}}/{{code}} with a language class, 
> GFM tables as {{table}}/{{thead}}/{{tbody}}/{{tr}}/{{th}}/{{td}} with 
> alignment, {{em}}/{{strong}}/{{del}}, links, images, {{hr}}. Raw HTML in the 
> source is emitted as escaped text so nothing is injected. Encoding detection 
> via {{AutoDetectReader}}, consistent with {{TXTParser}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
