Hello Andrea,

There is no "strict" mode. We process even the crappiest PDFs, because so many people insist that their files are OK ("but they display with Adobe Reader!").

All you could do is insist on PDF/A; then PDFBox preflight would be there for you. But we don't have anything that checks a file against the full PDF specification.

Checking that a PDF starts with "%PDF" should be easy; you don't need PDFBox for that.
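
A minimal sketch of such a check in plain Java, with no PDFBox involved. The class and method names (PdfHeaderCheck, startsWithPdfHeader) are made up for illustration. The check looks only at the first four bytes, so a file with anything in front of "%PDF" is rejected.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of PDFBox.
final class PdfHeaderCheck {
    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** True only if the data begins with "%PDF" at offset 0 (no leading junk tolerated). */
    static boolean startsWithPdfHeader(byte[] data) {
        if (data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads just the first four bytes of the file and applies the same check. */
    static boolean startsWithPdfHeader(Path file) throws IOException {
        byte[] head = new byte[PDF_MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    return false; // file is shorter than the marker
                }
                filled += n;
            }
        }
        return startsWithPdfHeader(head);
    }
}
```

Running this before handing the file to the parser rejects inputs such as the ZIP-wrapped PDF discussed in this thread, which the parser itself would accept.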

Tilman

On 10.06.2016 at 12:29, Andrea Canu wrote:
Hi Tilman, you are correct!
My file is a zip file which contains three signed PDF documents.

But now I'm in trouble again.

When I read this stream with the two classes PDDocument and PDFParser, I'm not
able to detect whether some "header junk" has been skipped by the parser. In
that case, every PDSignature extracted from the resulting PDDocument refers to
a byte range with invalid offsets.

Is it possible to read the PDF stream in a "strict mode"? This capability
could be useful to detect whether a PDF is not "clean".

Alternatively, the PDDocument class could provide a new method that
returns the signed content for a given PDSignature.


Andrea

P.S.
No, the signature validation I'm referring to is performed by a well-known
commercial library.

On Thu, Jun 9, 2016 at 6:24 PM, Tilman Hausherr <[email protected]>
wrote:

Hello Andrea,

I disagree - IMHO your PDF is incorrect. "PK" means that it is a ZIP file.
Apparently with an uncompressed PDF in it (yes, ZIP can have uncompressed
files). Of course one could adjust the offsets, but this wouldn't be right:
the PDF has been modified, the PK header has been added. Try renaming that
file and then click on it to confirm my theory that it is really a ZIP file.
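
The same theory can also be tested in code. The sketch below is plain Java; the class and method names (ZipMagicCheck, looksLikeZip, describeEntries) are hypothetical. It assumes the usual ZIP local file header signature, the bytes "PK" followed by 0x03 0x04, and uses java.util.zip.ZipFile to list the entries and whether each one is stored uncompressed.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration only.
final class ZipMagicCheck {
    /** True if the data begins with the ZIP local file header signature "PK\003\004". */
    static boolean looksLikeZip(byte[] head) {
        return head.length >= 4
                && head[0] == 0x50   // 'P'
                && head[1] == 0x4B   // 'K'
                && head[2] == 0x03
                && head[3] == 0x04;
    }

    /** Lists each entry name together with its compression method. */
    static List<String> describeEntries(File file) throws IOException {
        List<String> out = new ArrayList<>();
        try (ZipFile zip = new ZipFile(file)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String method = entry.getMethod() == ZipEntry.STORED ? "stored" : "deflated";
                out.add(entry.getName() + " (" + method + ")");
            }
        }
        return out;
    }
}
```

An entry reported as "stored" sits in the archive byte for byte, which is why a "%PDF-" header can still be found inside the ZIP file.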

(I suspect you'll tell me that it validates with Adobe Reader. If so, then
I'd say Adobe is wrong. I just tried adding "XXXX" in front of a file with
Notepad++, and Adobe does not report that the file was modified.)

The good thing is that there is no bug in COSFilterInputStream (I was
afraid of that), so I'll use getSignedContent() in the signature example
instead of the code I have now.

Tilman


On 09.06.2016 at 10:45, Andrea Canu wrote:

Hi Tilman
thank you for your answer.

The PDF is a real document so I can't share it, but I can give you an
extract:

Those are the first 1044 bytes of the document.
--------------------------------------------------------------



*PK      ¹Js: ¼àð3£ 3£ <   CAACT-00-00-08 document.pdf*%PDF-1.6

%âãÏÓ
3582 0 obj
<</Linearized 1/L 697139/O 3585/E 118808/N 42/T 625450/H [ 1000 1986]>>
endobj

xref
3582 34
0000000016 00000 n
0000003154 00000 n
0000003481 00000 n
0000003680 00000 n
0000004019 00000 n
0000004048 00000 n
0000004265 00000 n
0000004495 00000 n
0000004765 00000 n
0000004950 00000 n
0000006189 00000 n
0000007372 00000 n
0000007629 00000 n
0000060752 00000 n
0000061525 00000 n
0000062245 00000 n
0000062284 00000 n
0000062509 00000 n
0000062740 00000 n
0000062819 00000 n
0000064540 00000 n
0000064945 00000 n
0000065082 00000 n
0000065306 00000 n
0000065606 00000 n
0000072471 00000 n
0000075166 00000 n
0000078960 00000 n
0000079194 00000 n
0000079411 00000 n
0000118645 00000 n
0000118722 00000 n
0000002986 00000 n
0000001000 00000 n
trailer
<</Size 3616/Prev 625437/XRefStm 2986/Root 3583 0 R/Info 3580 0

R/ID[<A71F76F2A24FB6D888EDCB04CB86B815><6CCE97BD63E74F479ED22F39881647F0>]>>
startxref
0
%%EOF

.....
--------------------------------------------------------------

I would like to draw your attention to the first 60 bytes.
Those bytes are stripped out by the *COSParser* parser, skipped like
garbage.
The method that skips those bytes is:

COSParser.parseHeader(PDF_HEADER, PDF_DEFAULT_VERSION)

....
private static final String PDF_HEADER = "%PDF-";


I've noticed that I also have to skip those 60 bytes manually from the
*pdfInputStream* before calling the method

signature.getSignedContent(*pdfInputStream*)

That way, the digest of the returned byte array matches the hash inside
the signature.
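
The manual skip described above could be sketched roughly as follows. This is plain Java and does not reproduce PDFBox's own header parsing; the names PdfHeaderLocator, headerOffset and stripLeadingJunk are hypothetical, and only the "%PDF-" marker is searched for (not "%FDF-").

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class PdfHeaderLocator {
    private static final byte[] PDF_HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Offset of the first "%PDF-" marker in the data, or -1 if there is none. */
    static int headerOffset(byte[] data) {
        outer:
        for (int i = 0; i + PDF_HEADER.length <= data.length; i++) {
            for (int j = 0; j < PDF_HEADER.length; j++) {
                if (data[i + j] != PDF_HEADER[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** Copy of the data starting at the "%PDF-" marker; the bytes in front are dropped. */
    static byte[] stripLeadingJunk(byte[] data) {
        int offset = headerOffset(data);
        if (offset < 0) {
            throw new IllegalArgumentException("no %PDF- marker found");
        }
        return Arrays.copyOfRange(data, offset, data.length);
    }
}
```

Wrapping the result in a ByteArrayInputStream gives a stream that can be passed to signature.getSignedContent(...), with the ByteRange offsets counted from the "%PDF-" marker. A nonzero headerOffset is also a simple signal that the file is not "clean".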


Andrea


On Wed, Jun 8, 2016 at 6:06 PM, Tilman Hausherr <[email protected]>
wrote:

On 08.06.2016 at 13:27, Andrea Canu wrote:
Hi guys
I want to ask you about the correct way to get the signed content from
the signature.
Until now I've used this method of the PDSignature class:

signature.getSignedContent(*pdfInputStream*)

With this method I'm able to extract from the *pdfInputStream* the
byte array of the signed content, based on the signature's ByteRange.

I've noticed that if I try to verify the signature based on that
byte array, the verification sometimes fails unexpectedly!

Hello Andrea,
Can you share the PDF (upload it)?

I doubt your theory re: bug in COSParser. I'd rather check whether there is a
bug in COSFilterInputStream.

If you can't share the PDF, then please extract the bytes "the hard
way":

                      // needs: import java.io.RandomAccessFile;
                      // extract the signed content described by the
                      // /ByteRange COSArray: [offset1 len1 offset2 len2]
                      int[] byteRange = sig.getByteRange();
                      byte[] buf = new byte[byteRange[1] + byteRange[3]];
                      try (RandomAccessFile raf = new RandomAccessFile(infile, "r"))
                      {
                          raf.seek(byteRange[0]);
                          // first chunk goes to the start of buf
                          raf.readFully(buf, 0, byteRange[1]);
                          raf.seek(byteRange[2]);
                          // second chunk follows directly after the first
                          raf.readFully(buf, byteRange[1], byteRange[3]);
                      }

This code is not fully correct, because /ByteRange might have more than 4
elements. So have a look at it to be sure.
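
For the case of more than 4 elements, a possible generalisation is sketched below. It assumes the array always consists of offset/length pairs and works on the complete file contents held in memory; the class and method names (ByteRangeExtractor, extract) are hypothetical and not part of PDFBox.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper for illustration only.
final class ByteRangeExtractor {
    /**
     * Concatenates the chunks described by byteRange, which is read as
     * [offset1 len1 offset2 len2 ...] with any number of pairs.
     */
    static byte[] extract(byte[] file, int[] byteRange) {
        if (byteRange.length == 0 || byteRange.length % 2 != 0) {
            throw new IllegalArgumentException("ByteRange must hold offset/length pairs");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < byteRange.length; i += 2) {
            int offset = byteRange[i];
            int length = byteRange[i + 1];
            // long arithmetic avoids int overflow in the bounds check
            if (offset < 0 || length < 0 || (long) offset + length > file.length) {
                throw new IllegalArgumentException("pair " + (i / 2) + " lies outside the file");
            }
            out.write(file, offset, length);
        }
        return out.toByteArray();
    }
}
```

With the file read via Files.readAllBytes and the array taken from sig.getByteRange(), the result can be compared byte for byte with the output of getSignedContent.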

Then compare the byte array "buf" with the one from getSignedContent.

Another possible reason for the failure is that there are different
signature methods. See the code at


https://svn.apache.org/viewvc/pdfbox/branches/2.0/examples/src/main/java/org/apache/pdfbox/examples/signature/ShowSignature.java?view=markup

I didn't use getSignedContent() there, but I think I should. So I'd be
very interested to find out whether there is a bug there.

Tilman


Now, looking at the COSParser class, I've found this method:
COSParser.parseHeader


While trying to find the document's correct header, this method is able to
skip some garbage in the PDF document by looking for the markers "%PDF-" and
"%FDF-".

I've noticed that the signature verification succeeds if I skip that
garbage during the signed-content extraction.

My question is:
Why isn't this garbage handling also present in the getSignedContent
code?

The workaround I found is to skip that garbage manually from the
*pdfInputStream*, but now the problem is how to correctly calculate the
offset for the *pdfInputStream*.

Any suggestions?

Kind regards
Andrea.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


