I do not understand the purpose of your exercise. Do you want to write your own PDF signature validation application? Do you not trust Adobe Reader to do it right? Or do you want to satisfy your curiosity, see what the digest looks like, and manually compute and verify it? The solution for each of these cases is very different. If you tell us exactly what you want to accomplish, we can try to help you.
Thanks for your reply.
So I basically want to see what the digest looks like and to manually compute and verify it. I'm already extracting the certificates and validating them.
I've been trying to understand how it works.
I had this original PDF file, which I signed with Adobe Reader.
Then I compared the content of the original PDF file and the signed one, and realized it changes a lot after the signature process (it doesn't just add a pkcs7 object to the file).
So, if I hash the content of the signed PDF file, according to the byterange (therefore excluding the pkcs7 object), it will not match the original one because, apparently, the content changes in a lot of different places.
Is there something which is not clear in 32000-1, or are you trying to do this without a detailed understanding of 32000-1?
You cannot rely on the digest being in a certain place in the PDF. If you want to manually verify the digest in a PDF signature, here's what you need to do.
1. Open PDF in a Text Editor.
2. Find Signature Dictionary for your signature.
3. Get the Hex String which is the value of the /Contents entry in the Signature Dictionary.
4. Convert the hex string to a binary string and discard the trailing zeros. Remember that in a hex string each byte is represented by two characters, and the last one might be a zero. So, when you discard zeros, make sure that what is left has an even number of hex characters.
5. Use one of the commercially available BER viewers (you can also find free BER viewers on the Web) to convert the binary string to its ASN.1 representation.
6. Analyze the BER-decoded PKCS#7 signature object (RFC 2315 describes it) and find the digest that you are looking for in it. It is an OCTET STRING.
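Steps 2 to 4 above can also be scripted. The following is only a minimal sketch, not a full PDF parser: the function names `extract_contents_hex` and `hex_to_der` are made up for illustration, and the code assumes the file holds a single signature whose /Contents value is written as a hex string in angle brackets. The resulting bytes are what you would feed to a BER viewer in step 5.

```python
import re


def extract_contents_hex(pdf_bytes: bytes) -> str:
    """Return the hex digits of the first /Contents <...> hex string found.

    Assumption: the first such string belongs to the signature dictionary.
    Page dictionaries usually hold '/Contents N 0 R', which does not match.
    """
    match = re.search(rb"/Contents\s*<([0-9A-Fa-f\s]*)>", pdf_bytes)
    if match is None:
        raise ValueError("no /Contents hex string found")
    # Whitespace is allowed inside a PDF hex string, so drop it.
    return re.sub(rb"\s+", b"", match.group(1)).decode("ascii")


def hex_to_der(hex_string: str) -> bytes:
    """Discard the zero padding (step 4) and convert the rest to bytes."""
    stripped = hex_string.rstrip("0")
    # If an odd number of digits is left, the last real byte ended in a
    # zero nibble, so put one zero back to keep whole bytes.
    if len(stripped) % 2:
        stripped += "0"
    return bytes.fromhex(stripped)
```

Note that stripping zeros this way would also drop a genuine trailing 00 byte in the signature; reading the length from the outer DER header is a more robust way to find where the object ends.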
If you want to programmatically validate a signature, you need to write code that does all that. Signature validation includes much more than checking the digest. You need to build the chain, validate each certificate in the chain, check revocation for each certificate in the chain, etc. RFC 5280 is the guide for what to do.
This is the PDF sample I'm working on: bit.ly/1oR8XHK
I extracted the /Contents value and used an ASN.1 parser to check what the digest value is, obtaining bit.ly/1kcbZFK. The digest value is "77908DA519EF898F66166CC0ACE6B82461A6DE87BE00BA5A702EAB0C263678BE". Then I erased the /Contents value from the PDF and digested the whole document with the SHA-256 algorithm (the same one that was used for the signature), obtaining "C2F281B16FB896E39BE7CFA2A4ABE3C8DDDDA81FE284CFB2BD22933DA3A429B2", which is different.
Any clue why?
The digest value that you get after BER-decoding the /Contents string is encrypted with the signer's private key. The encrypted content may also include authenticated attributes. When you calculate the digest, you need to calculate it according to the ByteRange values, not over the whole document.
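Hashing according to the ByteRange could look roughly like the sketch below. It is an illustration under assumptions, not a validator: `byte_range_digest` is a made-up name, the code picks the first /ByteRange array it finds, and it reads the array as pairs of (offset, length). The ByteRange normally leaves out the entire /Contents hex string, angle brackets included, so blanking only the hex digits and hashing everything else gives a different result.

```python
import hashlib
import re


def byte_range_digest(pdf_bytes: bytes, algorithm: str = "sha256") -> str:
    """Hash only the bytes covered by the first /ByteRange array found."""
    match = re.search(rb"/ByteRange\s*\[([\d\s]+)\]", pdf_bytes)
    if match is None:
        raise ValueError("no /ByteRange array found")
    numbers = [int(n) for n in match.group(1).split()]
    if not numbers or len(numbers) % 2:
        raise ValueError("ByteRange must hold (offset, length) pairs")
    digest = hashlib.new(algorithm)
    # Feed each covered segment to the hash in file order.
    for offset, length in zip(numbers[0::2], numbers[1::2]):
        digest.update(pdf_bytes[offset:offset + length])
    return digest.hexdigest().upper()
```

Called as `byte_range_digest(open("signed.pdf", "rb").read())` (the file name is a placeholder), it returns an uppercase hex string that can be compared with the digest found in the decoded /Contents.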
If you can extract the /Contents value from within the signature dictionary in the signed PDF file, and you've managed to hex-decode it back into a binary CMS object, then the encrypted digest is the very last part of the CMS object, as per RFC 3852.