[jira] [Updated] (PDFBOX-5613) uncorrent paragraph split

Key Hutu (Jira) Sat, 27 May 2023 17:31:04 -0700


     [ https://issues.apache.org/jira/browse/PDFBOX-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Key Hutu updated PDFBOX-5613:
-----------------------------
    Description: 
When I use PDFBox to extract paragraph text, I get incorrect paragraph breaks.

<code>
// PDFParagraphTextStripper.java
import java.io.IOException;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFParagraphTextStripper extends PDFTextStripper {
    public PDFParagraphTextStripper() throws IOException {
        // Join the lines of a paragraph with a space and end each paragraph with a newline.
        this.setLineSeparator(" ");
        this.setParagraphStart("");
        this.setParagraphEnd(this.LINE_SEPARATOR);
        this.setPageStart("");
        this.setPageEnd("");
        this.setArticleStart(this.LINE_SEPARATOR);
        this.setArticleEnd(this.LINE_SEPARATOR);
    }
}

// PdfParser.java
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfParser {
    private static final String dataPath =
            "D:\\IdeaProject\\PdfParser\\PdfParser\\data\\";

    public static void main(String[] args) {
        String fileName = "Daily Report.pdf";
        try {
            extract_pdfbox(dataPath + fileName);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void extract_pdfbox(String filePath) throws Exception {
        File file = new File(filePath);
        // try-with-resources closes the document even if extraction throws.
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper pdfTextStripper = new PDFParagraphTextStripper();
            String text = pdfTextStripper.getText(document);
            System.out.println(text);
        }
    }
}
</code>

<output>
Daily Report 1) which language is your text in? - English 
2) some examples of sentences containing 
addresses you'd want to pick up - Data are 
contarct documents, it contains addresses in 
different formates(of different 
countries),some are comma saperated, some 
are new line saperated etc 3) perhaps 
examples of mistakes - currently en model 
of SpaCy is even not able to tag entities 
clearly 4) Are you training your own model 
or are you using a model as is? - tried as it is 
but very poor in results to need to know a 
generic approach to train own model. any 
</output>
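
To make the expected result concrete, here is a minimal sketch of a post-processing step that re-splits the extracted text so that each numbered item ("1)", "2)", ...) starts its own paragraph. It assumes the intended paragraphs are the numbered items, which the report does not state explicitly. The class name ParagraphResplitter and the method splitAtNumberedItems are hypothetical and not part of PDFBox; the sketch uses only the Java standard library and does not change how PDFTextStripper detects paragraphs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: re-splits already extracted text.
class ParagraphResplitter {
    // Split on whitespace that is directly followed by digits, ")" and a space,
    // e.g. the gap before "2) ". The lookahead keeps the marker in the next part.
    private static final Pattern ITEM_START = Pattern.compile("\\s+(?=\\d+\\)\\s)");

    static List<String> splitAtNumberedItems(String extracted) {
        // Collapse line breaks and runs of spaces left over from extraction.
        String flat = extracted.replaceAll("\\s+", " ").trim();
        List<String> paragraphs = new ArrayList<>();
        for (String part : ITEM_START.split(flat)) {
            if (!part.isEmpty()) {
                paragraphs.add(part);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // First lines of the output shown above.
        String sample = "Daily Report 1) which language is your text in? - English \n"
                + "2) some examples of sentences containing \n"
                + "addresses you'd want to pick up";
        for (String paragraph : splitAtNumberedItems(sample)) {
            System.out.println(paragraph);
        }
    }
}
```

With this assumption, "Daily Report" becomes its own paragraph and each numbered question is kept on one line. A marker-based split like this would also break on text such as "see 3) above" inside a sentence, so it is only an illustration of the desired output, not a general fix.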

  was:
When I use PDFBox to extract paragraph text, I get incorrect paragraph breaks.

<code>
public class PDFParagraphTextStripper extends PDFTextStripper {
    public PDFParagraphTextStripper() throws IOException {
        this.setLineSeparator(" ");
        this.setParagraphStart("");
        this.setParagraphEnd(this.LINE_SEPARATOR);
        this.setPageStart("");
        this.setPageEnd("");
        this.setArticleStart(this.LINE_SEPARATOR);
        this.setArticleEnd(this.LINE_SEPARATOR);
    }

}

public class PdfParser {
    private static final String dataPath = 
"D:\\IdeaProject\\PdfParser\\PdfParser\\data\\";
    public static void main(String[] args) {
        String fileName = "Daily Report.pdf";
        try {
            extract_pdfbox(dataPath + fileName);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void extract_pdfbox(String filePath) throws Exception {
        File file = new File(filePath);
        PDDocument document = PDDocument.load(file);
        PDFTextStripper pdfTextStripper = new PDFParagraphTextStripper();
        String text = pdfTextStripper.getText(document);
        System.out.println(text);
        document.close();
    }
}
</code>

<output>
Daily Report  1) which language is your text in? - English  
2) some examples of sentences containing  
addresses you'd want to pick up - Data are  
contarct documents, it contains addresses in  
different formates(of different  
countries),some are comma saperated, some  
are new line saperated etc 3) perhaps  
examples of mistakes - currently en model  
of SpaCy is even not able to tag entities  
clearly 4) Are you training your own model  
or are you using a model as is? - tried as it is  
but very poor in results to need to know a  
generic approach to train own model. any  
</output>


> uncorrent paragraph split
> -------------------------
>
>                 Key: PDFBOX-5613
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5613
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, Text extraction
>    Affects Versions: 2.0.1
>            Reporter: Key Hutu
>            Priority: Major
>         Attachments: Daily Report.pdf



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
