Key Hutu created PDFBOX-5613:
--------------------------------

             Summary: incorrect paragraph split
                 Key: PDFBOX-5613
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5613
             Project: PDFBox
          Issue Type: Improvement
          Components: Parsing, Text extraction
    Affects Versions: 2.0.1
            Reporter: Key Hutu
         Attachments: Daily Report.pdf

When I use PDFBox to extract paragraph text from the attached PDF, the paragraphs are split incorrectly.

<code>
// PDFParagraphTextStripper.java
import java.io.IOException;

import org.apache.pdfbox.text.PDFTextStripper;

// Joins the lines of a paragraph with spaces and ends each paragraph with a line break.
public class PDFParagraphTextStripper extends PDFTextStripper {
    public PDFParagraphTextStripper() throws IOException {
        this.setLineSeparator(" ");
        this.setParagraphStart("");
        this.setParagraphEnd(this.LINE_SEPARATOR);
        this.setPageStart("");
        this.setPageEnd("");
        this.setArticleStart(this.LINE_SEPARATOR);
        this.setArticleEnd(this.LINE_SEPARATOR);
    }
}

// PdfParser.java
import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfParser {
    private static final String dataPath = "D:\\IdeaProject\\PdfParser\\PdfParser\\data\\";

    public static void main(String[] args) {
        String fileName = "Daily Report.pdf";
        try {
            extract_pdfbox(dataPath + fileName);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void extract_pdfbox(String filePath) throws Exception {
        File file = new File(filePath);
        // try-with-resources closes the document even if extraction throws
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper pdfTextStripper = new PDFParagraphTextStripper();
            String text = pdfTextStripper.getText(document);
            System.out.println(text);
        }
    }
}
</code>

<output>
Daily Report 1) which language is your text in? - English 2) some examples of sentences containing addresses you'd want to pick up - Data are contarct documents, it contains addresses in different formates(of different countries),some are comma saperated, some are new line saperated etc 3) perhaps examples of mistakes - currently en model of SpaCy is even not able to tag entities clearly 4) Are you training your own model or are you using a model as is? - tried as it is but very poor in results to need to know a generic approach to train own model. any
</output>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)