RE: Stream parsing issue in multi-stream page

Esteban R Mon, 05 Feb 2018 08:14:22 -0800

Thanks for your answer. But I really need to process the streams one by one (a 
special requirement in my project).


Anyway, your answer gave me an idea for detecting the issue: I can compare the 
tokens from the individual streams with the tokens from pdPage.getContents(). 
That means processing everything twice, but it is still useful.
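
A minimal sketch of that comparison, assuming each token has already been 
converted to text (for example with String.valueOf(token) on the objects from 
parser.getTokens()). The class and method names are hypothetical and not part 
of PDFBox; the sample data is the token output quoted further down in this 
thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: it only compares token text.
class StreamSplitCheck {

    // Returns the zero-based index of the first token that differs between the
    // concatenated per-stream tokens and the whole-page tokens, or -1 if the
    // two sequences are identical.
    static int firstMismatch(List<List<String>> perStreamTokens,
                             List<String> pageTokens) {
        List<String> joined = new ArrayList<>();
        for (List<String> tokens : perStreamTokens) {
            joined.addAll(tokens);
        }
        int common = Math.min(joined.size(), pageTokens.size());
        for (int i = 0; i < common; i++) {
            if (!joined.get(i).equals(pageTokens.get(i))) {
                return i;
            }
        }
        // Same prefix: the sequences match only if they also have equal length.
        return joined.size() == pageTokens.size() ? -1 : common;
    }

    public static void main(String[] args) {
        // Tokens from parsing each stream on its own.
        List<String> first = Arrays.asList(
            "COSName{F1}", "COSInt{10}", "PDFOperator{Tf}", "PDFOperator{BT}",
            "COSInt{40}", "COSFloat{764.138}", "PDFOperator{Td}",
            "COSInt{0}", "COSFloat{-12.138}", "PDFOperator{Td}",
            "COSArray{[]}");
        List<String> second = Arrays.asList(
            "COSString{CD}", "COSNull{}", "PDFOperator{TJ}", "PDFOperator{ET}");

        // Tokens from parsing the combined page contents.
        List<String> page = Arrays.asList(
            "COSName{F1}", "COSInt{10}", "PDFOperator{Tf}", "PDFOperator{BT}",
            "COSInt{40}", "COSFloat{764.138}", "PDFOperator{Td}",
            "COSInt{0}", "COSFloat{-12.138}", "PDFOperator{Td}",
            "COSArray{[COSString{CD}]}", "PDFOperator{TJ}", "PDFOperator{ET}");

        // Prints 10: the array token is where the two sequences diverge.
        System.out.println(firstMismatch(Arrays.asList(first, second), page));
    }
}
```

A result other than -1 would mean that some command spans a stream boundary, 
so the per-stream tokens should not be written back as they are.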

Any other ideas are welcome.

Esteban
________________________________
From: Maruan Sahyoun <[email protected]>
Sent: Monday, February 5, 2018 03:25 PM
To: [email protected]
Subject: Re: Stream parsing issue in multi-stream page

Hi,



> On 05.02.2018 at 15:43, Esteban R <[email protected]> wrote:
>
> Hello. I need to rewrite a PDPage with many streams, one stream at a time (I 
> apply some transformations, and there is a special requirement to handle 
> each stream separately). Parsing (and pdfdebug) returns "wrong" tokens if a 
> command begins at the end of the first stream and ends at the beginning of 
> the next one. I'm using pdfbox-2.0.8.
>
> Rewriting the stream with those tokens produces a corrupted page.
> How can we rewrite the page without corrupting it?
> Or, at least, how can we detect this kind of failure?
>
> Please find a simplified example here:
> http://www.filedropper.com/out3unc
>
> The first stream is:
> /F1 10 Tf
> BT
> 40 764.138 Td
> 0 -12.138 Td
> [
>
> and the second one is:
> (CD) ] TJ
> ET
>
> In this case, running the following code:
>        Iterator<PDStream> itStreams = pdPage.getContentStreams();
>        while (itStreams.hasNext()) {
>            PDStream pdstream = itStreams.next();
>            PDFStreamParser parser = new PDFStreamParser(pdstream.toByteArray());
>            parser.parse();
>            List<Object> tokens = parser.getTokens();
>            for (Object token: tokens){
>                System.out.println("Token: "+token);
>            }
>        }
>

Instead of using pdPage.getContentStreams() and parsing each stream 
individually, use pdPage.getContents() and read all of the content into a 
byte[]. You can then pass that array to PDFStreamParser.

That will give you this output:

Token: COSName{F1}
Token: COSInt{10}
Token: PDFOperator{Tf}
Token: PDFOperator{BT}
Token: COSInt{40}
Token: COSFloat{764.138}
Token: PDFOperator{Td}
Token: COSInt{0}
Token: COSFloat{-12.138}
Token: PDFOperator{Td}
Token: COSArray{[COSString{CD}]}
Token: PDFOperator{TJ}
Token: PDFOperator{ET}

BR
Maruan


> shows:
> Token: COSName{F1}
> Token: COSInt{10}
> Token: PDFOperator{Tf}
> Token: PDFOperator{BT}
> Token: COSInt{40}
> Token: COSFloat{764.138}
> Token: PDFOperator{Td}
> Token: COSInt{0}
> Token: COSFloat{-12.138}
> Token: PDFOperator{Td}
> Token: COSArray{[]}                    !!!!! empty array detected, end of first stream
> Token: COSString{CD}                 !!!!! beginning of second stream
> Token: COSNull{}                         !!!!! closing "]"
> Token: PDFOperator{TJ}
> Token: PDFOperator{ET}
>
>
> Esteban


