OK, thanks for looking into it anyway!

On Mon, Aug 10, 2015 at 6:59 PM, Andreas Lehmkuehler <andr...@lehmi.de> wrote:
> On 10.08.2015 at 18:48, Gilad Denneboom wrote:
>
>> I guessed it was something like that... Do you think it's because it was
>> generated with iText?
>
> Sorry, but I don't know anything about the internals of iText or possible
> bugs in older versions.
>
> BR
> Andreas
>
>> On Mon, Aug 10, 2015 at 6:35 PM, Andreas Lehmkuehler <andr...@lehmi.de>
>> wrote:
>>
>>> Hi,
>>>
>>> On 10.08.2015 at 13:22, Gilad Denneboom wrote:
>>>
>>>> Hi Andreas,
>>>>
>>>> Of course the output itself is different, but I would expect that the
>>>> underlying text each tool processes would be the same, and it's not.
>>>> Have a look at the first line in the PrintTextLocations output file:
>>>> String[472.89,54.0 fs=10.0 xscale=10.0 height=7.21 space=2.5 width=2.7799988]:
>>>> It is repeated, with exactly the same information, 12 times throughout
>>>> the output, lines 1, 91, 181, 271, 361, 451, 541, 631, 721, 811, 901 and
>>>> 991.
>>>>
>>>> Why would the same information be processed 12 times in a single run?
>>>
>>> The PDF contains a lot of redundant information, e.g. the header is
>>> repeated several times (I didn't count them but I guess it's 12 times).
>>> PDFTextStripper eliminates overlapping text/characters and
>>> PrintTextLocations doesn't.
>>>
>>> BR
>>> Andreas
>>>
>>>> Gilad
>>>>
>>>> On Mon, Aug 10, 2015 at 12:18 PM, Andreas Lehmkühler <andr...@lehmi.de>
>>>> wrote:
>>>>
>>>>> Hi Gilad,
>>>>>
>>>>> sorry for the late answer ....
>>>>>
>>>>> I'm not sure what you're expecting. You are using 2 totally different
>>>>> approaches to process a PDF. PrintTextLocations provides a lot of
>>>>> additional information for every piece of text, which may vary from one
>>>>> character up to whole words or lines of text. Consequently the output
>>>>> has to be totally different and of course much bigger than the output
>>>>> of a simple text extraction.
>>>>>
>>>>> BR
>>>>> Andreas
>>>>>
>>>>> Gilad Denneboom <gilad.denneb...@gmail.com> wrote on 10 August 2015 at
>>>>> 10:05:
>>>>>
>>>>>> No one has any ideas?
>>>>>>
>>>>>> On Thu, Aug 6, 2015 at 5:49 PM, Gilad Denneboom
>>>>>> <gilad.denneb...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I'm looking for advice on a problem I'm encountering where the output
>>>>>>> of PDFTextStripper and PrintTextLocations is dramatically different
>>>>>>> when processing the same file.
>>>>>>> For some reason, the output of PrintTextLocations is 12 times longer
>>>>>>> than that of PDFTextStripper, i.e. the entire text is printed out 12
>>>>>>> times, instead of just once.
>>>>>>>
>>>>>>> I'm attaching the file in question, as well as the output produced
>>>>>>> using both methods via Google Drive... Hopefully it will come through.
>>>>>>>
>>>>>>> I'd appreciate any ideas as to what might be causing this issue (I'm
>>>>>>> guessing there's something wrong with the structure of the file), and
>>>>>>> of course any possible solutions.
>>>>>>>
>>>>>>> Thanks in advance, Gilad.
>>>>>>>
>>>>>>> PS. I'm using 1.8.10.
>>>>>>>
>>>>>>> output problem.zip
>>>>>>> <https://drive.google.com/file/d/0B_eBFHMNjkhseTVaQ0FxSkdmZUE/view?usp=drive_web>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
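
For anyone finding this thread later: Andreas's explanation (PDFTextStripper eliminates overlapping text, PrintTextLocations doesn't) can be illustrated with a small stand-alone sketch. It does not use PDFBox at all. `Glyph` and `OverlapFilter` are hypothetical names made up for this example, the character "A" is a placeholder, and the tolerance of one third of the glyph width is an assumption, so treat it as a rough picture of the idea and not as PDFBox's actual implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one positioned piece of text; NOT a PDFBox class.
final class Glyph {
    final String text;
    final float x;
    final float y;
    final float width;

    Glyph(String text, float x, float y, float width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

final class OverlapFilter {
    // Keeps the first occurrence of each glyph and drops later glyphs with
    // the same text drawn at (almost) the same position.
    static List<Glyph> dropOverlapping(List<Glyph> glyphs) {
        Map<String, List<Glyph>> seenByText = new HashMap<>();
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            List<Glyph> seen = seenByText.computeIfAbsent(g.text, k -> new ArrayList<>());
            float tolerance = g.width / 3.0f; // assumed threshold, chosen for illustration
            boolean duplicate = false;
            for (Glyph s : seen) {
                if (Math.abs(s.x - g.x) < tolerance && Math.abs(s.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                seen.add(g);
                kept.add(g);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        // The same glyph drawn 12 times at the coordinates quoted in the thread.
        for (int i = 0; i < 12; i++) {
            glyphs.add(new Glyph("A", 472.89f, 54.0f, 2.78f));
        }
        // A second glyph with the same text further along the line; it must survive.
        glyphs.add(new Glyph("A", 480.0f, 54.0f, 2.78f));
        List<Glyph> kept = dropOverlapping(glyphs);
        System.out.println(glyphs.size() + " glyphs in, " + kept.size() + " kept");
    }
}
```

A tool that prints every glyph as it is drawn would report all 13 entries here, while one that filters overlaps first would report only the distinct positions, which matches the 12-fold difference described above.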