PDF parse failing to capture entire text

Jack Park Fri, 04 Jan 2013 12:01:17 -0800

I am parsing a two-column scientific paper. One column reads:

The effect of muscle a-tocopherol concentration
(induced by dietary treatment) on TBARS at different
storage times was evaluated (Figure 2). There was a
linear effect (P < 0·001) of muscle a-tocopherol
concentration on TBARS on day 0, but a linear plus
quadratic effect on the following days (P < 0·001).
Also in this case the linear plus quadratic effect
indicated an exponential response, which was fitted
in each case as follows:



The parser (code below) returns this:

The effect of m
(induced by dietar
storage times was
linear effect (P <
concentration on T
quadratic effect o
Also in this case
indicated an expo
in each case as foll


On other lines, characters at the left are missing, as if the parser
had started partway into the line. Case in point:

ted storage (L = linear effect, P < 0·001;
 P< 0·001). The data were adjusted to a
l equation (solid line) as indicated in

is the fragment extracted from:

Figure 2 Relationship between a-tocopherol concentration
and lipid oxidation (assessed by the concentration of
thiobarbituric acid reactive substances, TBARS, mg
malonaldehyde per kg muscle) in longissimus lumborum
muscle of Manchego lambs after 0 (u), 3 (n), 6 (s) and 9
(l) days of refrigerated storage (L = linear effect, P< 0·001;
Q = quadratic effect, P<0·001). The data were adjusted to a
linear or exponential equation (solid line) as indicated in
the text.
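
To make the two symptoms above easier to report, here is a minimal sketch that compares an extracted line with the line as it appears in the PDF and says which side was clipped. TruncationCheck, Kind, classify and classifyAll are hypothetical names made up for this illustration; none of them belong to Tika or PDFBox. Whitespace is removed before comparing, on the assumption that spacing can differ between the PDF and the extracted text (as in "P< 0" versus "P < 0" above).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper; not part of Tika or PDFBox. */
final class TruncationCheck {

    enum Kind { COMPLETE, RIGHT_CLIPPED, LEFT_CLIPPED, MISMATCH }

    /** Drops all whitespace so spacing differences do not affect the comparison. */
    private static String squeeze(String s) {
        return s.replaceAll("\\s+", "");
    }

    /** Classifies one extracted line against the expected line from the PDF. */
    static Kind classify(String expected, String extracted) {
        String e = squeeze(expected);
        String x = squeeze(extracted);
        if (e.equals(x)) return Kind.COMPLETE;
        if (x.isEmpty()) return Kind.MISMATCH;           // nothing to compare
        if (e.startsWith(x)) return Kind.RIGHT_CLIPPED;  // end of the line was lost
        if (e.endsWith(x)) return Kind.LEFT_CLIPPED;     // start of the line was lost
        return Kind.MISMATCH;
    }

    /** Classifies lines pairwise, up to the length of the shorter list. */
    static List<Kind> classifyAll(List<String> expected, List<String> extracted) {
        int n = Math.min(expected.size(), extracted.size());
        List<Kind> kinds = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            kinds.add(classify(expected.get(i), extracted.get(i)));
        }
        return kinds;
    }

    public static void main(String[] args) {
        // First symptom: the right-hand part of the line is missing.
        System.out.println(classify(
                "The effect of muscle a-tocopherol concentration",
                "The effect of m"));                         // prints RIGHT_CLIPPED
        // Second symptom: the left-hand part of the line is missing.
        System.out.println(classify(
                "linear or exponential equation (solid line) as indicated in",
                "l equation (solid line) as indicated in")); // prints LEFT_CLIPPED
    }
}
```

With the samples above, the body-text column comes out as RIGHT_CLIPPED and the figure caption as LEFT_CLIPPED, which suggests the text is being cut at a column boundary rather than dropped at random.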

The paper itself is found by following the link from here:
http://openagricola.nal.usda.gov/Record/IND23271089

(I will send the file offlist if needed; it's 64k)

The code used is this:

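    import java.io.File;
    import java.io.StringWriter;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.WriteOutContentHandler;

    Parser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    File f = new File("volume_73_part_3_p451-457.pdf");
    TikaInputStream tis = TikaInputStream.get(f);
    StringWriter writer = new StringWriter();
    WriteOutContentHandler handler = new WriteOutContentHandler(writer);
    parser.parse(tis, handler, metadata, new ParseContext());
    System.out.println(handler.toString());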

My questions are these:

Can Tika (PdfBox) correctly parse multi-column content?
What am I missing?

Many thanks in advance.
Jack
