[jira] [Created] (PDFBOX-5613) uncorrent paragraph split

Key Hutu (Jira) Sat, 27 May 2023 17:28:07 -0700

Key Hutu created PDFBOX-5613:
--------------------------------

             Summary: uncorrent paragraph split
                 Key: PDFBOX-5613
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5613
             Project: PDFBox
          Issue Type: Improvement
          Components: Parsing, Text extraction
    Affects Versions: 2.0.1
            Reporter: Key Hutu
         Attachments: Daily Report.pdf


When I use PDFBox to extract text paragraph by paragraph, the paragraphs are split incorrectly: the boundaries in the output do not match the paragraphs in the attached PDF.

<code>
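import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Joins the lines of a paragraph with a space and ends each paragraph with a newline.
public class PDFParagraphTextStripper extends PDFTextStripper {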
    public PDFParagraphTextStripper() throws IOException {
        this.setLineSeparator(" ");
        this.setParagraphStart("");
        this.setParagraphEnd(this.LINE_SEPARATOR);
        this.setPageStart("");
        this.setPageEnd("");
        this.setArticleStart(this.LINE_SEPARATOR);
        this.setArticleEnd(this.LINE_SEPARATOR);
    }

}

public class PdfParser {
    private static final String dataPath = "D:\\IdeaProject\\PdfParser\\PdfParser\\data\\";
    public static void main(String[] args) {
        String fileName = "Daily Report.pdf";
        try {
            extract_pdfbox(dataPath + fileName);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void extract_pdfbox(String filePath) throws Exception {
        File file = new File(filePath);
        // try-with-resources closes the document even if extraction throws
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper pdfTextStripper = new PDFParagraphTextStripper();
            String text = pdfTextStripper.getText(document);
            System.out.println(text);
        }
    }
}
</code>

<output>
Daily Report  1) which language is your text in? - English  
2) some examples of sentences containing  
addresses you'd want to pick up - Data are  
contarct documents, it contains addresses in  
different formates(of different  
countries),some are comma saperated, some  
are new line saperated etc 3) perhaps  
examples of mistakes - currently en model  
of SpaCy is even not able to tag entities  
clearly 4) Are you training your own model  
or are you using a model as is? - tried as it is  
but very poor in results to need to know a  
generic approach to train own model. any  
</output>
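
For comparison, here is a rough sketch of the kind of split that might be expected from the output above. It assumes each numbered item (`1)`, `2)`, ...) should start its own paragraph, which the report does not state explicitly. It does not use PDFBox at all: it only post-processes the extracted string. The class name `NumberedParagraphSplitter` and the marker pattern are hypothetical, chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class NumberedParagraphSplitter {
    // Assumed marker: whitespace followed by "<digits>) ", e.g. " 2) ".
    // The lookahead keeps the marker itself at the start of the next paragraph.
    private static final Pattern ITEM_START = Pattern.compile("\\s+(?=\\d+\\)\\s)");

    /** Collapses all line breaks, then starts a new paragraph before each "N) " marker. */
    public static List<String> split(String extracted) {
        String flat = extracted.replaceAll("\\s+", " ").trim();
        List<String> paragraphs = new ArrayList<>();
        for (String part : ITEM_START.split(flat)) {
            if (!part.isEmpty()) {
                paragraphs.add(part);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // First lines of the output shown above, including the trailing spaces.
        String sample = "Daily Report  1) which language is your text in? - English  \n"
                + "2) some examples of sentences containing  \n"
                + "addresses you'd want to pick up";
        split(sample).forEach(System.out::println);
    }
}
```

This is only a text-level heuristic: any digits followed by `)` inside running text would also trigger a split, so it illustrates the expected grouping and is not a substitute for correct paragraph detection in the stripper itself.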



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
