[jira] [Commented] (PDFBOX-5613) uncorrent paragraph split

Michael Klink (Jira) Sun, 28 May 2023 05:47:27 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726936#comment-17726936
 ]


Michael Klink commented on PDFBOX-5613:
---------------------------------------

As the PDF in question is tagged, you may want to use its structure tags during extraction instead of relying on layout heuristics.

The document is tagged like this:

{noformat}
<Document>
<P>
<Span>
Daily Report
</Span>
<Span>
 
</Span>
</P>
<P>

1) which language is your text in? - English 2) some examples of sentences 
containing addresses you'd want to pick up - Data are contarct documents, it 
contains addresses in different formates(of different countries),some are comma 
saperated, some are new line saperated etc 3) perhaps examples of mistakes - 
currently en model of SpaCy is even not able to tag entities clearly 4) Are you 
training your own model or are you using a model as is? - tried as it is but 
very poor in results to need to know a generic approach to train own model. any 
referance code will be helpfu;  Can you please edit your question to add what 
you wrote in your last comment (that was what I was trying to do by asking all 
of them). And please do add actual examples and not just "addresses are in 
different formats", that doesn't really help us understand what you are facing. 
I have added a link on how to train a SpaCy NER model in my answer. It's very 
well documented on their website
；
 Please look at my comment to add more information to your post. Based on the 
information you provided, here are my remarks: 
</P>
<L>
<LI>
<LBody>

• SpaCy is trained to find locations, not addresses per se 
</LBody>
</LI>
</L>
<P>

If you use a "common" language, SpaCy is trained using WikiNER data, where 
locations aren't addresses but more like geographical places like city names, 
country names etc. So it's quite normal to not be able to detect full 
addresses. 
</P>
<P>
<Span>

</Span>
<Span>
You likely need to train your own entity recognizer. They detail how to do this 
on their website, including code samples: 
</Span>
<Link>
?org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDObjectReference@66498326
?org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDObjectReference@cad498c
<Span>
https://spacy.io/usage/training#ner
</Span>
</Link>
<Span>
 
</Span>
</P>
<L>
<LI>
<LBody>

• Don't underestimate SpaCy's rule-based matching 
</LBody>
</LI>
</L>
<P>
<Span>

</Span>
<Span>
Is it a fancy neural network? No. Does it matter? Also no. SpaCy allows you to 
create 
</Span>
<Link>
?org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDObjectReference@1e6454ec
<Span>
rules to find entities
</Span>
</Link>
<Span>
 and in cases like addresses which are generally following a pattern across 
entities. 
</Span>
</P>
<P>
<Span>
 
</Span>
</P>
</Document>
{noformat}

(Ah, I see my simple implementation does not correctly inspect links: it prints the PDObjectReference entries inside <Link> instead of resolving them.)
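
To illustrate what "using tags in extraction" could look like, here is a minimal sketch. It does not use the PDFBox API: TagNode is a hypothetical in-memory stand-in for a structure element, and treating <P> and <LBody> as the paragraph-level tags is my assumption. In PDFBox you would start from the structure tree root of the document catalog and still have to map marked-content IDs to their text, which this sketch leaves out. The point is only that all text below one paragraph-level tag becomes one paragraph, wherever the line breaks fall.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a structure element; NOT a PDFBox class. */
final class TagNode {
    final String type;   // structure type, e.g. "Document", "P", "Span", "LBody"
    final String text;   // text directly attached to this node, or null
    final List<TagNode> kids = new ArrayList<>();

    TagNode(String type, String text) {
        this.type = type;
        this.text = text;
    }

    /** Adds a child and returns this node so that calls can be chained. */
    TagNode add(TagNode kid) {
        kids.add(kid);
        return this;
    }
}

final class TaggedParagraphSketch {
    /** Assumption for this sketch: P and LBody mark paragraph-level units. */
    static boolean isParagraphLevel(String type) {
        return "P".equals(type) || "LBody".equals(type);
    }

    /** Returns one string per non-empty paragraph-level element, in document order. */
    static List<String> paragraphs(TagNode root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(TagNode node, List<String> out) {
        if (isParagraphLevel(node.type)) {
            StringBuilder sb = new StringBuilder();
            appendText(node, sb);
            // Line breaks inside a paragraph are layout only: collapse them to spaces.
            String paragraph = sb.toString().replaceAll("\\s+", " ").trim();
            if (!paragraph.isEmpty()) {
                out.add(paragraph);
            }
            return;
        }
        for (TagNode kid : node.kids) {
            collect(kid, out);
        }
    }

    /** Concatenates the text of a node and all its descendants (Span, Link, ...). */
    private static void appendText(TagNode node, StringBuilder sb) {
        if (node.text != null) {
            sb.append(node.text);
        }
        for (TagNode kid : node.kids) {
            appendText(kid, sb);
        }
    }

    public static void main(String[] args) {
        // A small tree shaped like part of the dump above.
        TagNode doc = new TagNode("Document", null)
            .add(new TagNode("P", null)
                .add(new TagNode("Span", "Daily Report"))
                .add(new TagNode("Span", " ")))
            .add(new TagNode("L", null)
                .add(new TagNode("LI", null)
                    .add(new TagNode("LBody", "• Don't underestimate SpaCy's rule-based matching"))))
            .add(new TagNode("P", null)
                .add(new TagNode("Span", "SpaCy allows you to\ncreate "))
                .add(new TagNode("Link", null)
                    .add(new TagNode("Span", "rules to find entities"))));
        for (String paragraph : paragraphs(doc)) {
            System.out.println(paragraph);
        }
    }
}
```

With this approach the heading "Daily Report" ends up as a paragraph of its own instead of being merged into the following text, because it sits in its own <P> element.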

> uncorrent paragraph split
> -------------------------
>
>                 Key: PDFBOX-5613
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5613
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.1, 2.0.28
>            Reporter: Key Hutu
>            Priority: Major
>         Attachments: Daily Report.pdf
>
>
> When I use PDFBox to extract paragraph text, I get incorrect paragraph boundaries:
> {code}
> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> public class PDFParagraphTextStripper extends PDFTextStripper {
>      public PDFParagraphTextStripper() throws IOException {
>          // Join the lines of a paragraph with a space and end each paragraph with a line break.
>          this.setLineSeparator(" ");
>          this.setParagraphStart("");
>          this.setParagraphEnd(this.LINE_SEPARATOR);
>          this.setPageStart("");
>          this.setPageEnd("");
>          this.setArticleStart(this.LINE_SEPARATOR);
>          this.setArticleEnd(this.LINE_SEPARATOR);
>      }
> }
> public class PdfParser {
>     private static final String dataPath = "D:\\IdeaProject\\PdfParser\\PdfParser\\data";
>     public static void main(String[] args) {
>          String fileName = "Daily Report.pdf";
>          try {
>               // Build the path with File so that a separator is placed between directory and file name.
>               extract_pdfbox(new File(dataPath, fileName).getPath());
>          } catch (Exception e) {
>               e.printStackTrace();
>          }
>     }
>     private static void extract_pdfbox(String filePath) throws Exception {
>          File file = new File(filePath);
>          // try-with-resources closes the document even if extraction fails.
>          try (PDDocument document = PDDocument.load(file)) {
>               PDFTextStripper pdfTextStripper = new PDFParagraphTextStripper();
>               String text = pdfTextStripper.getText(document);
>               System.out.println(text);
>          }
>     }
> }
> {code}
> {noformat}
> Daily Report 1) which language is your text in? - English 
> 2) some examples of sentences containing 
> addresses you'd want to pick up - Data are 
> contarct documents, it contains addresses in 
> different formates(of different 
> countries),some are comma saperated, some 
> are new line saperated etc 3) perhaps 
> examples of mistakes - currently en model 
> of SpaCy is even not able to tag entities 
> clearly 4) Are you training your own model 
> or are you using a model as is? - tried as it is 
> but very poor in results to need to know a 
> generic approach to train own model. any 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
