xfrapp
Extract embedded xml from PDF/A-3b (also creation)

Aug 19, 2013 4:19 AM

Hello there,

 

In the context of a research project, we are trying to extract embedded XML from a PDF/A-3b document programmatically.

The project deals with establishing a new invoicing standard (ZUGFeRD: ferd-net.de, German only). Invoices are expressed as XML, which is embedded in a PDF/A file.

What we are trying to achieve is extraction of the XML via Java code. For testing purposes, we currently use a third-party SDK to extract the invoice XML by calling an .exe file and then picking up the results in Java.

 

I currently have only one valid example file that this SDK can process. To get more data, I used the trial version of Acrobat Pro to alter the embedded XML file. More specifically, I deleted the embedded file, added a new XML file, and used Preflight to make the PDF conform to PDF/A-3b. Although the file seems to have the same properties as the original, the extraction SDK can no longer process it. Since experimenting with Acrobat does not seem to get me anywhere, I am now looking into extracting the data from the PDF myself.

 

Is there an existing implementation, library, or solution for extracting this data in a Java context? The few third-party tools I found are all based on a .NET/Windows-native environment. I have heard rumors that Adobe offers tools for extracting embedded data from PDF/A; is that true?

What about the other way around? Is it possible to embed XML into a PDF via Java, given that there already is a PDF file we can attach it to?

 

Thank you for reading; I appreciate any help or input!

Greetings,

Florian

 
Replies
  • Aug 19, 2013 4:52 AM   in reply to xfrapp

    There are many Java-based libraries available that can work with embedded files.  Adobe offers one (licensable from Datalogics) and there are others.

     
  • Aug 19, 2013 6:44 AM   in reply to xfrapp

    Hi Florian,

     

    I would look for general purpose PDF libraries that can open a PDF and access data objects in it.

     

    All in all it is not too difficult to get to the embedded XML, once you have a library that can access and read data structures/data objects inside a PDF file. Some understanding of the inner workings of PDF data structures will help you get the job done (e.g. read the section about embedded files in the PDF standard / ISO 32000-1, as well as the chapter about PDF syntax).
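To illustrate the traversal described above, here is a minimal Java sketch. It does not parse a real PDF: the nested `Map`/`List` values and the `PdfStream` class are hypothetical stand-ins for whatever object model a general-purpose PDF library exposes, and `invoice.xml` is a placeholder name. What it does follow is the layout ISO 32000-1 describes for embedded files: the catalog's `/Names` dictionary holds an `/EmbeddedFiles` name tree, whose leaves pair a name with a file specification dictionary, whose `/EF` entry points via `/F` to the embedded file stream. The sketch assumes the stream uses at most a single `/FlateDecode` filter.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class EmbeddedFileWalker {

    /** Hypothetical model of a PDF stream object: its dictionary plus the raw stored bytes. */
    static final class PdfStream {
        final Map<String, Object> dict;
        final byte[] raw;
        PdfStream(Map<String, Object> dict, byte[] raw) { this.dict = dict; this.raw = raw; }
    }

    /** Returns name -> decoded bytes for every file reachable via Catalog /Names /EmbeddedFiles. */
    static Map<String, byte[]> extract(Map<String, Object> catalog) {
        Map<String, byte[]> out = new LinkedHashMap<>();
        Object names = catalog.get("Names");
        if (names instanceof Map) {
            Object tree = ((Map<?, ?>) names).get("EmbeddedFiles");
            if (tree instanceof Map) walk((Map<?, ?>) tree, out);
        }
        return out;
    }

    /** A name tree node has /Kids (child nodes) or /Names (alternating name, file specification). */
    private static void walk(Map<?, ?> node, Map<String, byte[]> out) {
        Object kids = node.get("Kids");
        if (kids instanceof List) {
            for (Object kid : (List<?>) kids) {
                if (kid instanceof Map) walk((Map<?, ?>) kid, out);
            }
        }
        Object pairs = node.get("Names");
        if (pairs instanceof List) {
            List<?> list = (List<?>) pairs;
            for (int i = 0; i + 1 < list.size(); i += 2) {
                Object spec = list.get(i + 1);
                if (!(spec instanceof Map)) continue;
                Object ef = ((Map<?, ?>) spec).get("EF");
                if (!(ef instanceof Map)) continue;
                Object stream = ((Map<?, ?>) ef).get("F"); // only /F is checked in this sketch
                if (stream instanceof PdfStream) {
                    out.put(String.valueOf(list.get(i)), decode((PdfStream) stream));
                }
            }
        }
    }

    /** Inflates the stream if its /Filter is FlateDecode; otherwise returns the bytes unchanged. */
    static byte[] decode(PdfStream s) {
        if (!"FlateDecode".equals(s.dict.get("Filter"))) return s.raw;
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(s.raw);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported stream data");
                }
                bos.write(buf, 0, n);
            }
            return bos.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalStateException("stream is not valid zlib data", e);
        } finally {
            inflater.end();
        }
    }

    /** Test helper: zlib-compresses data the way a FlateDecode stream stores it. */
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) bos.write(buf, 0, deflater.deflate(buf));
        deflater.end();
        return bos.toByteArray();
    }

    /** Builds a made-up catalog with one embedded XML file, standing in for a parsed PDF. */
    static Map<String, Object> sampleCatalog(String fileName, String xml) {
        Map<String, Object> streamDict = new LinkedHashMap<>();
        streamDict.put("Filter", "FlateDecode");
        PdfStream stream = new PdfStream(streamDict, deflate(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, Object> fileSpec =
                Map.of("Type", "Filespec", "F", fileName, "EF", Map.of("F", stream));
        Map<String, Object> leaf = Map.of("Names", List.of(fileName, fileSpec));
        Map<String, Object> root = Map.of("Kids", List.of(leaf));
        return Map.of("Type", "Catalog", "Names", Map.of("EmbeddedFiles", root));
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = extract(sampleCatalog("invoice.xml", "<Invoice/>"));
        files.forEach((name, bytes) ->
                System.out.println(name + ": " + new String(bytes, StandardCharsets.UTF_8)));
    }
}
```

With a real PDF library, `sampleCatalog` would be replaced by the library's own access to the document catalog; the walk itself should look much the same.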

     

    Olaf

     

     

     


    --

    Olaf Druemmer | Managing Director | callas software GmbH | Schoenhauser Allee 6/7 | 10119 Berlin

    Tel +49.30.4439031-0 | Fax +49.30.4416402 | o.druemmer@callassoftware.com | www.callassoftware.com

    Charlottenburg District Court (Amtsgericht Charlottenburg), HRB 59615 | Managing Directors: Olaf Drümmer, Ulrich Frotscher

     