[ https://issues.apache.org/jira/browse/PDFBOX-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726889#comment-17726889 ]
Tilman Hausherr commented on PDFBOX-5613:
-----------------------------------------

Please retry with the current version, 2.0.28. Also, why these weird settings? And what did you expect to happen?

> uncorrent paragraph split
> -------------------------
>
>                 Key: PDFBOX-5613
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5613
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, Text extraction
>    Affects Versions: 2.0.1
>            Reporter: Key Hutu
>            Priority: Major
>         Attachments: Daily Report.pdf
>
>
> When I use PDFBox to extract paragraph text, I get incorrect paragraph information.
> <code>
> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> public class PDFParagraphTextStripper extends PDFTextStripper {
>     public PDFParagraphTextStripper() throws IOException {
>         // Join the lines of a paragraph with a space; end each paragraph with a newline
>         this.setLineSeparator(" ");
>         this.setParagraphStart("");
>         this.setParagraphEnd(this.LINE_SEPARATOR);
>         this.setPageStart("");
>         this.setPageEnd("");
>         this.setArticleStart(this.LINE_SEPARATOR);
>         this.setArticleEnd(this.LINE_SEPARATOR);
>     }
> }
>
> public class PdfParser {
>     private static final String dataPath = "D:\\IdeaProject\\PdfParser\\PdfParser\\data";
>
>     public static void main(String[] args) {
>         String fileName = "Daily Report.pdf";
>         try {
>             // dataPath has no trailing separator, so add one before the file name
>             extract_pdfbox(dataPath + File.separator + fileName);
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
>
>     private static void extract_pdfbox(String filePath) throws Exception {
>         File file = new File(filePath);
>         PDDocument document = PDDocument.load(file);
>         PDFTextStripper pdfTextStripper = new PDFParagraphTextStripper();
>         String text = pdfTextStripper.getText(document);
>         System.out.println(text);
>         document.close();
>     }
> }
> </code>
> <output>
> Daily Report 1) which language is your text in?
> - English
> 2) some examples of sentences containing
> addresses you'd want to pick up - Data are
> contarct documents, it contains addresses in
> different formates(of different
> countries),some are comma saperated, some
> are new line saperated etc 3) perhaps
> examples of mistakes - currently en model
> of SpaCy is even not able to tag entities
> clearly 4) Are you training your own model
> or are you using a model as is? - tried as it is
> but very poor in results to need to know a
> generic approach to train own model. any
> </output>
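
To read the quoted output, it may help to see how the separator settings in the reporter's stripper combine. The sketch below is a simplified, self-contained model and not PDFBox's implementation: it assumes the lines have already been grouped into paragraphs (that grouping is what the report is about) and only shows each paragraph being joined with the configured line separator and closed with the paragraph end string. The names `SeparatorModel` and `render` are hypothetical, and the exact whitespace PDFBox emits around paragraph breaks may differ.

```java
import java.util.Arrays;
import java.util.List;

final class SeparatorModel {
    // Simplified model (assumption, not PDFBox code): join the lines of each
    // paragraph with lineSeparator and wrap it in paragraphStart/paragraphEnd.
    static String render(List<List<String>> paragraphs, String lineSeparator,
                         String paragraphStart, String paragraphEnd) {
        StringBuilder out = new StringBuilder();
        for (List<String> paragraph : paragraphs) {
            out.append(paragraphStart);
            out.append(String.join(lineSeparator, paragraph));
            out.append(paragraphEnd);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical grouping of lines into two paragraphs
        List<List<String>> paragraphs = Arrays.asList(
            Arrays.asList("1) which language is your text in?", "- English"),
            Arrays.asList("2) some examples of sentences containing",
                          "addresses you'd want to pick up"));
        // Same settings as the quoted stripper: " " between lines,
        // empty paragraph start, newline as paragraph end
        System.out.print(render(paragraphs, " ", "", System.lineSeparator()));
    }
}
```

Under this model every line break in the output marks a paragraph boundary that the stripper detected, so the breaks in the quoted output would reflect where the paragraph heuristics fired rather than where the PDF's visual lines end.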