OK, thanks for looking into it, anyway!

On Mon, Aug 10, 2015 at 6:59 PM, Andreas Lehmkuehler <andr...@lehmi.de>
wrote:

> On 10.08.2015 at 18:48, Gilad Denneboom wrote:
>
>> I guessed it was something like that... Do you think it's because it was
>> generated with iText?
>>
> Sorry, but I don't know anything about the internals of iText or possible
> bugs in older versions.
>
> BR
> Andreas
>
>
>
>> On Mon, Aug 10, 2015 at 6:35 PM, Andreas Lehmkuehler <andr...@lehmi.de>
>> wrote:
>>
>>> Hi,
>>>
>>> On 10.08.2015 at 13:22, Gilad Denneboom wrote:
>>>
>>>> Hi Andreas,
>>>>
>>>> Of course the output itself is different, but I would expect that the
>>>> underlying text each tool processes would be the same, and it's not.
>>>> Have a look at the first line in the PrintTextLocations output file:
>>>> String[472.89,54.0 fs=10.0 xscale=10.0 height=7.21 space=2.5 width=2.7799988]:
>>>> It is repeated, with exactly the same information, 12 times throughout
>>>> the output, lines 1, 91, 181, 271, 361, 451, 541, 631, 721, 811, 901 and
>>>> 991.
>>>>
>>>> Why would the same information be processed 12 times in a single run?
>>>>
>>> The PDF contains a lot of redundant information, e.g. the header is
>>> repeated several times (I didn't count them, but I guess it's 12 times).
>>> PDFTextStripper eliminates overlapping text/characters and
>>> PrintTextLocations doesn't.
>>>
>>> BR
>>> Andreas
>>>
>>>
>>>> Gilad
>>>>
>>>> On Mon, Aug 10, 2015 at 12:18 PM, Andreas Lehmkühler <andr...@lehmi.de>
>>>> wrote:
>>>>
>>>>> Hi Gilad,
>>>>>
>>>>> Sorry for the late answer...
>>>>>
>>>>> I'm not sure what you're expecting. You are using two totally different
>>>>> approaches to process a PDF. PrintTextLocations provides a lot of
>>>>> additional information for every piece of text, which may vary from one
>>>>> character up to whole words or lines of text. Consequently the output has
>>>>> to be totally different and of course much bigger than the output of a
>>>>> simple text extraction.
>>>>>
>>>>> BR
>>>>> Andreas
>>>>>
>>>>> On 10 August 2015 at 10:05, Gilad Denneboom <gilad.denneb...@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>> No one has any ideas?
>>>>>>
>>>>>> On Thu, Aug 6, 2015 at 5:49 PM, Gilad Denneboom <gilad.denneb...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I'm looking for advice on a problem I'm encountering where the output of
>>>>>>> PDFTextStripper and PrintTextLocations is dramatically different when
>>>>>>> processing the same file.
>>>>>>> For some reason, the output of PrintTextLocations is 12 times longer than
>>>>>>> that of PDFTextStripper, i.e. the entire text is printed out 12 times,
>>>>>>> instead of just once.
>>>>>>>
>>>>>>> I'm attaching the file in question, as well as the output produced using
>>>>>>> both methods via Google Drive... Hopefully it will come through.
>>>>>>>
>>>>>>> I'd appreciate any ideas as to what might be causing this issue (I'm
>>>>>>> guessing there's something wrong with the structure of the file), and of
>>>>>>> course any possible solutions.
>>>>>>>
>>>>>>> Thanks in advance, Gilad.
>>>>>>>
>>>>>>> PS. I'm using 1.8.10.
>>>>>>>
>>>>>>> output problem.zip
>>>>>>> <https://drive.google.com/file/d/0B_eBFHMNjkhseTVaQ0FxSkdmZUE/view?usp=drive_web>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
>>>>> For additional commands, e-mail: users-h...@pdfbox.apache.org
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>
>
>
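
Andreas's explanation in the quoted thread (the same header text is drawn several times at the same position, and PDFTextStripper drops the overlapping copies while PrintTextLocations reports every one) can be illustrated with a small stand-alone sketch. This is not PDFBox's actual implementation: the Glyph class, the suppressDuplicates method and the tolerance value are hypothetical names invented for illustration, and the sample coordinates are simply the ones from the output line quoted in the thread. In PDFBox itself the corresponding behaviour is, as far as I know, toggled with PDFTextStripper.setSuppressDuplicateOverlappingText(boolean).

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapFilterSketch {

    /** Hypothetical stand-in for a positioned character; not a PDFBox class. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Keeps a glyph only if no already-kept glyph has the same text within
     * the given tolerance on both axes (an assumed, simplified rule).
     */
    static List<Glyph> suppressDuplicates(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.text.equals(g.text)
                        && Math.abs(k.x - g.x) < tolerance
                        && Math.abs(k.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        // Simulate one header character drawn 12 times at the same position.
        for (int i = 0; i < 12; i++) {
            glyphs.add(new Glyph("H", 472.89f, 54.0f));
        }
        // A different character further along the same line.
        glyphs.add(new Glyph("e", 480.0f, 54.0f));

        // The unfiltered count corresponds to reporting every glyph;
        // the filtered count corresponds to dropping overlapping copies.
        System.out.println("raw: " + glyphs.size());
        System.out.println("filtered: " + suppressDuplicates(glyphs, 1.0f).size());
    }
}
```

A tool that reports every glyph it encounters would see the raw list, while one that filters overlapping copies would see the shorter list, which would be consistent with one output being roughly 12 times longer than the other.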
