rawatsaurav01 opened a new pull request, #180: URL: https://github.com/apache/pdfbox/pull/180
This PR proposes adding native Markdown extraction support to Apache PDFBox, so that PDF content can be converted to Markdown without an intermediate HTML conversion.

**Description:**
Apache PDFBox currently supports HTML extraction through `PDFText2HTML`. Getting Markdown from that output requires an extra step: converting the HTML to Markdown with an external tool such as CopyDown. To remove that step, we suggest adding native Markdown extraction support to Apache PDFBox.

**Sample Code Comparison:**

**Current Process:**
```java
File pdfFile = new File("sample/sample.pdf");
File mdFile = new File("sample/sample.md");

PDFText2HTML pdfText2HTML = new PDFText2HTML();
CopyDown copyDown = new CopyDown();

try (PDDocument pdDocument = Loader.loadPDF(pdfFile)) {
    // Two passes: PDF -> HTML (PDFBox), then HTML -> Markdown (CopyDown)
    Files.writeString(mdFile.toPath(), copyDown.convert(pdfText2HTML.getText(pdDocument)));
}
```

**Proposed Process:**
```java
File pdfFile = new File("sample/sample.pdf");
File mdFile = new File("sample/sample.md");

PDFText2Markdown pdfText2Markdown = new PDFText2Markdown();

try (PDDocument pdDocument = Loader.loadPDF(pdfFile)) {
    // Single pass: PDF -> Markdown, no external converter
    Files.writeString(mdFile.toPath(), pdfText2Markdown.getText(pdDocument));
}
```

**Benefits:**
1. **Streamlined Workflow:** Direct PDF to Markdown conversion without relying on external tools.
2. **Performance Improvement:** Skipping the intermediate HTML string and the second conversion pass should reduce resource consumption, especially for large PDF files.
3. **Enhanced User Experience:** Markdown output is a common use case, so supporting it directly improves overall usability.

**Proposed Changes:**
Introduce `PDFText2Markdown` in Apache PDFBox to provide native Markdown extraction.

**Compatibility:**
The change is additive: existing PDFBox functionality, including `PDFText2HTML`, stays backward compatible.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org