rawatsaurav01 opened a new pull request, #180:
URL: https://github.com/apache/pdfbox/pull/180

   
   This PR proposes adding native Markdown extraction to Apache PDFBox, so that 
PDF content can be converted to Markdown without an intermediate HTML 
conversion.
   
   **Description:**
   Currently, Apache PDFBox supports HTML extraction through `PDFText2HTML`. 
Producing Markdown therefore requires an extra step: converting the HTML to 
Markdown with an external tool such as CopyDown. To remove that step, we suggest 
adding native Markdown extraction support to Apache PDFBox.
   
   **Sample Code Comparison:**
   
   **Current Process:**
   
   ```java
   File pdfFile = new File("sample/sample.pdf");
   File mdFile = new File("sample/sample.md");
   PDFText2HTML pdfText2HTML = new PDFText2HTML();
   CopyDown copyDown = new CopyDown();
   try (PDDocument pdDocument = Loader.loadPDF(pdfFile)) {
       Files.writeString(mdFile.toPath(), copyDown.convert(pdfText2HTML.getText(pdDocument)));
   }
   ```
   
   **Proposed Process:**
   ```java
   File pdfFile = new File("sample/sample.pdf");
   File mdFile = new File("sample/sample.md");
   PDFText2Markdown pdfText2Markdown = new PDFText2Markdown();
   try (PDDocument pdDocument = Loader.loadPDF(pdfFile)) {
       Files.writeString(mdFile.toPath(), pdfText2Markdown.getText(pdDocument));
   }
   ```
   
   **Benefits:**
   1. **Streamlined Workflow:** Direct PDF to Markdown conversion without 
relying on external tools.
   2. **Performance Improvement:** No intermediate HTML string has to be 
generated and then parsed again, which should reduce resource consumption, 
especially for large PDF files.
   3. **Enhanced User Experience:** Markdown output is available from a single 
class, used the same way as the existing HTML extraction.
   
   **Proposed Changes:**
   Introduce `PDFText2Markdown` in Apache PDFBox to provide native Markdown 
extraction.
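
To illustrate the kind of mapping such a class would have to perform, here is a minimal, self-contained sketch. It is not the code in this PR and does not use the PDFBox API: `MarkdownRunSketch`, `Run`, `style` and `render` are hypothetical names, and guessing bold or italic from the font name is only an assumed heuristic.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch, not part of PDFBox: turns styled text runs into Markdown. */
final class MarkdownRunSketch {

    /** A run of text together with the name of the font it was drawn with. */
    static final class Run {
        final String text;
        final String fontName;

        Run(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    /** Wraps one run in emphasis markers guessed from its font name (assumed heuristic). */
    static String style(Run run) {
        String name = run.fontName == null ? "" : run.fontName.toLowerCase(Locale.ROOT);
        boolean bold = name.contains("bold");
        boolean italic = name.contains("italic") || name.contains("oblique");
        String marker = bold && italic ? "***" : bold ? "**" : italic ? "*" : "";
        // Escape literal asterisks so they are not read as emphasis markers.
        String escaped = run.text.replace("*", "\\*");
        return marker + escaped + marker;
    }

    /** Joins the styled runs with single spaces. */
    static String render(List<Run> runs) {
        StringBuilder out = new StringBuilder();
        for (Run run : runs) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(style(run));
        }
        return out.toString();
    }
}
```

A real implementation would take its runs from the text positions that PDFBox reports during extraction, and would also need to handle headings, lists and whitespace at run boundaries, which this sketch ignores.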
   
   **Compatibility:**
   Ensure backward compatibility with existing PDFBox functionalities while 
seamlessly adding Markdown extraction.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: dev-h...@pdfbox.apache.org