URL for the PDF file: http://www.aclweb.org/anthology-new/P/P02/P02-1046.pdf
On Thu, May 10, 2012 at 1:26 AM, 叶严杰 <huoyanyo...@gmail.com> wrote:
> I tried to get text from a PDF with PDFBox using stripper.getText (see the
> code below). The PDF is attached as a file, and the error output is included
> below. Is there any way to solve this bug?
>
> Regards
>
> *Code*
> public void read()
> {
>     PDDocument document = null;
>     FileInputStream is = null;
>     try {
>         is = new FileInputStream(file);
>         PDFParser parser = new PDFParser(is);
>         parser.parse();
>         document = parser.getPDDocument();
>         PDFTextStripper stripper = new PDFTextStripper();
>         content = stripper.getText(document);
>     } catch (FileNotFoundException e) {
>         e.printStackTrace();
>     } catch (IOException e) {
>         e.printStackTrace();
>     } finally {
>         if (is != null) {
>             try {
>                 is.close();
>             } catch (IOException e) {
>                 e.printStackTrace();
>             }
>         }
>         if (document != null) {
>             try {
>                 document.close();
>             } catch (IOException e) {
>                 e.printStackTrace();
>             }
>         }
>     }
> }
>
> *Bug Info*
> Exception in thread "main" java.lang.NumberFormatException: For input string: "dup"
>     at java.lang.NumberFormatException.forInputString(Unknown Source)
>     at java.lang.Integer.parseInt(Unknown Source)
>     at java.lang.Integer.parseInt(Unknown Source)
>     at org.apache.pdfbox.pdmodel.font.PDType1Font.getEncodingFromFont(PDType1Font.java:344)
>     at org.apache.pdfbox.pdmodel.font.PDType1Font.determineEncoding(PDType1Font.java:280)
>     at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:181)
>     at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:83)
>     at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:152)
>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:108)
>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
>     at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:243)
>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>     at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:242)
>     at get.read(get.java:33)
>     at get.main(get.java:60)
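
The trace shows Integer.parseInt being handed the token "dup" inside PDType1Font.getEncodingFromFont. Entries in a Type 1 font's encoding array conventionally look like "dup 65 /A put", so one plausible explanation (an assumption, not confirmed against the PDFBox source) is that the embedded font contains an encoding line whose second token is not a number. The sketch below is not PDFBox code; the class name Type1EncodingSketch, the method parseEncoding, and the malformed sample line are all hypothetical. It only illustrates how such a line triggers the exception and how a tolerant parser could skip it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Illustrative sketch only; this is NOT the PDFBox implementation.
final class Type1EncodingSketch {

    // Parses lines of the form "dup <code> /<glyphName> put" into a
    // code -> glyph name map. Lines whose code token is not an integer
    // are skipped instead of letting NumberFormatException propagate.
    static Map<Integer, String> parseEncoding(String[] lines) {
        Map<Integer, String> encoding = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("dup")) {
                continue; // not an encoding entry
            }
            // Make sure "/" always starts its own token, e.g. "67/C" -> "67 /C".
            StringTokenizer tokens = new StringTokenizer(trimmed.replace("/", " /"));
            if (tokens.countTokens() < 3) {
                continue; // too short to hold "dup", a code, and a glyph name
            }
            tokens.nextToken(); // consume the leading "dup"
            String codeToken = tokens.nextToken();
            String nameToken = tokens.nextToken();
            try {
                int code = Integer.parseInt(codeToken);
                if (nameToken.startsWith("/")) {
                    encoding.put(code, nameToken.substring(1));
                }
            } catch (NumberFormatException e) {
                // codeToken was something like "dup"; ignore this entry.
            }
        }
        return encoding;
    }

    public static void main(String[] args) {
        String[] lines = {
            "dup 65 /A put",
            "dup dup 66 /B put", // hypothetical malformed entry
            "dup 67/C put"
        };
        System.out.println(parseEncoding(lines));
    }
}
```

Note that NumberFormatException is an unchecked exception, so the catch (IOException e) block in the read() method quoted above does not intercept it, which is why the trace ends in main. Until the font handling itself is fixed, catching RuntimeException around the getText call would at least let the program continue with other files.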