Re: [PR] Enable Native Markdown Extraction in Apache PDFBox [pdfbox]

via GitHub Sat, 03 Feb 2024 04:34:26 -0800


lehmi commented on code in PR #180:
URL: https://github.com/apache/pdfbox/pull/180#discussion_r1477059422



##########
tools/src/main/java/org/apache/pdfbox/tools/PDFText2Markdown.java:
##########
@@ -0,0 +1,375 @@
+//package com.pdfexample;
+
+package org.apache.pdfbox.tools;
+
+
+import org.apache.pdfbox.pdmodel.PDDocument;
+import org.apache.pdfbox.pdmodel.font.PDFontDescriptor;
+import org.apache.pdfbox.text.PDFTextStripper;
+import org.apache.pdfbox.text.TextPosition;
+
+import java.io.IOException;
+import java.util.*;
+
+/**
+ * Wrap stripped text in simple HTML, trying to form HTML paragraphs. Paragraphs
+ * broken by pages, columns, or figures are not mended.
+ *
+ * @author John J Barton
+ *
+ */
+public class PDFText2Markdown extends PDFTextStripper{
+    private static final int INITIAL_PDF_TO_HTML_BYTES = 8192;
+
+    private final FontState fontState = new FontState();
+    /**
+     * Constructor.
+     * @throws IOException If there is an error during initialization.
+     */
+    public PDFText2Markdown() throws IOException
+    {
+        setLineSeparator(LINE_SEPARATOR);
+        setParagraphStart(LINE_SEPARATOR);
+        setParagraphEnd(LINE_SEPARATOR);
+        setPageStart(LINE_SEPARATOR);
+        setPageEnd(LINE_SEPARATOR);
+        setArticleStart(LINE_SEPARATOR);
+        setArticleEnd(LINE_SEPARATOR);
+    }
+
+    @Override
+    protected void startDocument(PDDocument document) throws IOException
+    {
+        StringBuilder buf = new StringBuilder(INITIAL_PDF_TO_HTML_BYTES);
+
+        super.writeString(buf.toString());
+    }
+
+
+    /**
+     * This method will attempt to guess the title of the document using
+     * either the document properties or the first lines of text.
+     *
+     * @return returns the title.
+     */
+    protected String getTitle()
+    {
+        String titleGuess = document.getDocumentInformation().getTitle();
+        if(titleGuess != null && titleGuess.length() > 0)
+        {
+            return titleGuess;
+        }
+        else
+        {
+            Iterator<List<TextPosition>> textIter = getCharactersByArticle().iterator();
+            float lastFontSize = -1.0f;
+
+            StringBuilder titleText = new StringBuilder();
+            while (textIter.hasNext())
+            {
+                for (TextPosition position : textIter.next())
+                {
+                    float currentFontSize = position.getFontSize();
+                    //If we're past 64 chars we will assume that we're past the title
+                    //64 is arbitrary
+                    if (Float.compare(currentFontSize, lastFontSize) != 0 || titleText.length() > 64)
+                    {
+                        if (titleText.length() > 0)
+                        {
+                            return titleText.toString();
+                        }
+                        lastFontSize = currentFontSize;
+                    }
+                    if (currentFontSize > 13.0f)
+                    { // most body text is 12pt
+                        titleText.append(position.getUnicode());
+                    }
+                }
+            }
+        }
+        return "";
+    }
+
+
+    /**
+     * Write out the article separator (div tag) with proper text direction
+     * information.
+     *
+     * @param isLTR true if direction of text is left to right
+     * @throws IOException
+     *             If there is an error writing to the stream.
+     */
+    @Override
+    protected void startArticle(boolean isLTR) throws IOException
+    {
+        if (isLTR)
+        {
+            super.writeString(LINE_SEPARATOR);
+        }
+        else
+        {
+            super.writeString(LINE_SEPARATOR);
+        }
+    }
+

Review Comment:
   Both branches do the same thing, so the if/else can be removed.
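
   A minimal sketch of what that simplification could look like. To keep it compilable without PDFBox on the classpath, `StripperStandIn` is a hypothetical stand-in for the few `PDFTextStripper` members used here (`LINE_SEPARATOR`, `writeString`, `startArticle`); `MarkdownStripperSketch` is likewise an illustrative name, not the class from the PR.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical stand-in for the parts of PDFTextStripper used by the override,
// so the sketch compiles on its own. Not the real PDFBox class.
class StripperStandIn {
    protected static final String LINE_SEPARATOR = System.lineSeparator();
    protected final Writer output;

    StripperStandIn(Writer output) {
        this.output = output;
    }

    protected void writeString(String text) throws IOException {
        output.write(text);
    }

    protected void startArticle(boolean isLTR) throws IOException {
        // no-op in the stand-in
    }
}

// Illustrative subclass showing the override with the redundant if/else removed.
class MarkdownStripperSketch extends StripperStandIn {
    MarkdownStripperSketch(Writer output) {
        super(output);
    }

    @Override
    protected void startArticle(boolean isLTR) throws IOException {
        // Both directions wrote the same separator, so isLTR is not consulted.
        super.writeString(LINE_SEPARATOR);
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        MarkdownStripperSketch stripper = new MarkdownStripperSketch(out);
        stripper.startArticle(true);
        stripper.startArticle(false);
        // Each call writes exactly one line separator, regardless of direction.
        System.out.println(out.toString().equals(LINE_SEPARATOR + LINE_SEPARATOR));
    }
}
```

   The Javadoc for the method would also need updating if the direction parameter no longer affects the output.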



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

