krickert opened a new pull request, #2922:
URL: https://github.com/apache/tika/pull/2922

   ## Summary
   
   `.md` files are already detected as `text/markdown` (globs in 
`tika-mimetypes.xml`), but no parser claims the type, so they fall through to 
`TXTParser` and come back as flat text — headings, tables, and code fences all 
collapse into an undifferentiated string.
   
   This adds a `MarkdownParser` to `tika-parser-text-module` using 
commonmark-java — already a Tika dependency (it backs 
`ToMarkdownContentHandler`, TIKA-4730) — that parses the markdown AST and emits 
structured XHTML:
   
   | Markdown | XHTML |
   |---|---|
   | `#`..`######` / setext | `h1`..`h6` |
   | lists (incl. GFM tight lists) | `ul` / `ol` (with `start` when not 1) / `li` |
   | fenced / indented code | `pre`/`code` with `class="language-x"` (+ `data-info` for any extra fence info) |
   | GFM tables | `table`/`thead`/`tbody`/`tr`/`th`/`td` with `align` |
   | emphasis / strong / GFM strikethrough | `em` / `strong` / `del` |
   | links, images | `a href title`, `img src alt title` |
   | block quotes, thematic breaks | `blockquote`, `hr` |
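
As an illustration of the code-fence and ordered-list rows above, here is a minimal JDK-only sketch of how those attributes could be derived. It is not taken from this PR: `MarkdownAttrSketch` and both method names are hypothetical, and it assumes that `data-info` carries whatever follows the first word of the info string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not code from this PR.
final class MarkdownAttrSketch {

    /** Maps a fence info string to attributes for the code element. */
    static Map<String, String> codeAttributes(String info) {
        Map<String, String> attrs = new LinkedHashMap<>();
        String trimmed = info == null ? "" : info.trim();
        if (trimmed.isEmpty()) {
            return attrs; // indented code or a bare fence: no attributes
        }
        // Assumption: the first whitespace-delimited word names the language,
        // and anything after it is the "extra" fence info.
        String[] parts = trimmed.split("\\s+", 2);
        attrs.put("class", "language-" + parts[0]);
        if (parts.length == 2) {
            attrs.put("data-info", parts[1]);
        }
        return attrs;
    }

    /** Emits a start attribute only when the list does not begin at 1. */
    static Map<String, String> orderedListAttributes(int start) {
        Map<String, String> attrs = new LinkedHashMap<>();
        if (start != 1) {
            attrs.put("start", Integer.toString(start));
        }
        return attrs;
    }
}
```

Under these assumptions, a fence opened with `java {linenos=true}` yields `class="language-java"` plus `data-info="{linenos=true}"`, and a list that begins at 1 gets no `start` attribute.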
   
   ## Fidelity and safety
   
   - **No content loss**: every literal the commonmark AST carries reaches the 
output — including image alt text with code spans, ordered-list start numbers, 
and full code-fence info strings. Only markdown *syntax presentation* 
(bullet/fence/emphasis delimiter characters) is normalized, identical to 
commonmark's reference `HtmlRenderer`.
   - **Raw HTML in the source is emitted as escaped text**: it is preserved, but 
never injected into the XHTML stream as markup.
   - **Encoding detection** via `AutoDetectReader`, the same idiom as 
`TXTParser`; detected charset lands in `Content-Type`/`Content-Encoding`.
   - Registered via `@TikaComponent`, same as the other text-module parsers. No 
MIME changes needed.
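
The raw-HTML guarantee in the list above can be illustrated with JDK-only SAX code. This is a sketch, not the PR's implementation: it assumes the parser hands raw HTML to the content handler as character data, the JDK's built-in serializer stands in for Tika's XHTML handler, and `RawHtmlAsTextSketch` is a hypothetical name.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration only; not code from this PR.
final class RawHtmlAsTextSketch {

    /** Wraps the given raw HTML in a p element, passing it as character data. */
    static String emitParagraph(String rawHtml) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer()
                    .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            handler.startDocument();
            handler.startElement("", "p", "p", new AttributesImpl());
            // Character data, not startElement calls: the serializer escapes
            // the angle brackets, so no markup from the source is injected.
            char[] chars = rawHtml.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
            handler.endDocument();
            return out.toString();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("SAX serialization failed", e);
        }
    }
}
```

Calling `RawHtmlAsTextSketch.emitParagraph("<script>alert(1)</script>")` produces output in which the tag appears as `&lt;script&gt;` text rather than as a `script` element, so the source is preserved without becoming markup.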
   
   Because the emitted vocabulary matches what `ToMarkdownContentHandler` 
consumes, a markdown document round-trips markdown → XHTML → markdown (there's 
a test for it).
   
   ## Relationship to other work
   
   Independent of the gRPC Document-contract PR (#2921) — this is the input 
direction (`.md` files into Tika); that PR is the output direction. They share 
only the commonmark library.
   
   ## Test plan
   
   - [x] `MarkdownParserTest`: 9 tests — structure, GFM tables with alignment 
sections, raw-HTML escaping, ordered-list start numbers, code-span alt text, 
fence info preservation, charset detection with non-ASCII content, markdown 
round-trip
   - [x] full `tika-parser-text-module` test suite green (no regressions in 
TXT/CSV parsers)
   - [x] `apache-rat:check` green
   - [ ] CI
   

