34 Replies Latest reply: Feb 9, 2009 5:51 AM by (tyagigaurav)

    PDF  to XML conversion

      Hi all,

      I asked in the Acrobat SDK forum about using the Acrobat Standard SDK for automatic PDF to XML conversion on a server. I was told that the Acrobat license does not permit that and that I need to use Adobe LiveCycle ES instead.

      I just need to automatically convert incoming PDF files to XML on a server (automating Acrobat Standard's "SaveAs XML" function).

      Could you please tell me which LiveCycle component can do this for me, and also give me its approximate price?

      Thanks very much for your help,
      Arash
        • 1. Re: PDF  to XML conversion
          Jasmin Charbonneau techies
          What exactly are you trying to get in that XML file: the data from the PDF document, the metadata, etc.?

          Can you be a little more specific? There might be different products for different things. I did a quick test with SaveAs XML from Acrobat and the output seems to contain only the metadata, but I'd like to confirm.

          Thanks,

          Jasmin
          • 5. Re: PDF  to XML conversion
            Community Member
            Hi Jasmin,

            Thanks for your reply. I need the content of the PDF file converted to XML. For example, in the XML file that I get from Acrobat's SaveAs XML function, paragraphs are tagged with <p?>, tables are tagged with <table?>, table cells are tagged with <TD?>, table rows are tagged with <TR?>, and so on. There are a few PDF to XML/HTML tools, but none of them does a clean conversion. For example, they might break the content of a single cell into two parts and tag each part as an individual cell. These kinds of bugs cause problems when I want to extract information from the XML file via Natural Language Processing.

            I have a small VB program that can be run from the command line. It takes a PDF file as input, converts it to XML using Acrobat Standard, and saves the generated XML file on the local drive. I want this program to run on a server and convert all incoming PDF files to XML. But I was told that the Acrobat Standard licence does not allow me to use Acrobat on a server for automatic batch conversion.

            Thanks very much for your help.

            Following is part of an XML file generated by the SaveAs XML function in Acrobat Standard 8.0:

            (I added question marks to the tags manually so they would not be interpreted as HTML tags)

            <Table?>
            <TR?>
            <TH?>Master of Business Administration (MBA)Approved Course Schedule Information </TH?>
            </TR?>
            <TR?>
            <TH?>Module Title </TH?>
            <TD?><Figure? ActualText?="Leading Effectively ">
            <ImageData? src=""/>
            LeadingLeading EEffeffectivelyctively</Figure?>
            </TD?>
            </TR?>
            <TR?>
            <TH?>Number of Credits </TH?>
            <TD?>25 Credits </TD?>
            </TR?>
            <TR?>
            <TH?>Subject Status </TH?>
            <TD?>Mandatory </TD?>
            </TR?>
            <TR?>
            <TH?>Quantity of Learning Experience </TH?>
            <TD?>400 hours, broken down over Stages 1 and 2 as follows Directed Study 85 Independent Study 315 Total 400 </TD?>
            </TR?>
            <TR?>
            <TH?>Allocation of Marks </TH?>
            <TD?>C/A </TD?>
            <TD?>Project </TD?>
            <TD?>Practical </TD?>
            <TD?>Final </TD?>
            </TR?>
            <TR?>
            <TD?>0 </TD?>
            <TD?>100% </TD?>
            <TD/?>
            <TD?>0 </TD?>
            </TR?>
            </Table?>
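To illustrate why clean row and cell tagging matters for downstream extraction, here is a minimal sketch of reading such a tagged table with the standard Java DOM parser. It assumes the question marks have been removed so the file is well-formed XML, and that the tags are named Table, TR, TH and TD as in the sample above. The class name TaggedTableReader is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TaggedTableReader {

    /** Returns one list of trimmed cell texts per TR element, in document order. */
    public static List<List<String>> readRows(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        List<List<String>> rows = new ArrayList<>();
        NodeList trs = doc.getElementsByTagName("TR");
        for (int r = 0; r < trs.getLength(); r++) {
            List<String> cells = new ArrayList<>();
            NodeList children = trs.item(r).getChildNodes();
            for (int c = 0; c < children.getLength(); c++) {
                Node child = children.item(c);
                if (child instanceof Element) {
                    String tag = ((Element) child).getTagName();
                    // Header cells (TH) and data cells (TD) are both kept.
                    if (tag.equals("TH") || tag.equals("TD")) {
                        cells.add(child.getTextContent().trim());
                    }
                }
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Table>"
                + "<TR><TH>Number of Credits </TH><TD>25 Credits </TD></TR>"
                + "<TR><TH>Subject Status </TH><TD>Mandatory </TD></TR>"
                + "</Table>";
        for (List<String> row : readRows(xml)) {
            System.out.println(row);
        }
    }
}
```

If a converter splits one cell into two TD elements, the row lists above get an extra entry and no longer line up with the header row, which is the kind of error described in this reply.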
            • 6. Re: PDF  to XML conversion
              Jasmin Charbonneau techies
              So is this really a html representation of the PDF.

              PDFG should be the product that can convert PDF into HTML. The operation is ExportPDF.

              Jasmin
              • 7. Re: PDF  to XML conversion
                Hi,
                My requirement is to send the XML to a server and let another application there use it.

                I am trying to use ASP to receive this XML and save it on the server, but I don't know how. Is it possible?
                Thanks,
                merlin
                • 8. Re: PDF  to XML conversion
                  Jasmin Charbonneau techies
                  You can definitely post XML to an ASP page from a PDF. You need to add a submit button to the form and specify the location you want to post the information to. You also need to specify what you want to post (PDF, XML, or XDP); in your case, XML.

                  In your ASP page, you can then get the XML out of the request object.

                  Is that what you're trying to do?

                  Jasmin
                  • 9. Re: PDF  to XML conversion
                    Community Member
                    Hi Jasmin,

                    I am using Acrobat Pro 6, and when I defined the submit button I only saw FDF, HTML, XFDF and PDF. If I want to get XML, I guess I could use the XFDF option. However, what should I do in ASP to get this XFDF file? I read something about the request object, but I don't know how to use it. Could you please let me know the syntax I need to use?
                    Thank you very much.
                    • 10. Re: PDF  to XML conversion
                      Jasmin Charbonneau techies
                      This is a Java example, but the ASP should be very similar.

                      import java.io.IOException;
                      import javax.servlet.ServletException;
                      import javax.servlet.ServletInputStream;
                      import javax.servlet.http.HttpServletRequest;

                      // Reads the whole request body (for example, the submitted XFDF) into a byte array.
                      // Returns null when the request declares no content.
                      public static byte[] getRequestBufferAsBytes(HttpServletRequest request)
                              throws IOException, ServletException
                      {
                          int contentLength = request.getContentLength();
                          if (contentLength <= 0)
                              return null;

                          byte[] content = new byte[contentLength];
                          ServletInputStream input = request.getInputStream();
                          int nRead = 0;
                          while (nRead < contentLength)
                          {
                              // read() may return fewer bytes than requested, so keep looping.
                              int n = input.read(content, nRead, contentLength - nRead);
                              if (n < 0)
                              {
                                  // The stream ended early; without this check the loop would never finish.
                                  throw new IOException("Request body ended after " + nRead
                                          + " of " + contentLength + " bytes");
                              }
                              nRead += n;
                          }
                          return content;
                      }

                      Jasmin
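Once the request body is available as bytes, the XFDF can be read with any XML parser. The following is a minimal sketch, not a complete XFDF reader: it assumes flat (non-nested) fields in the usual XFDF layout, where each field element has a name attribute and a value child. The class name XfdfFields and the sample field names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XfdfFields {

    /** Maps each field name to its first value; nested fields are not handled. */
    public static Map<String, String> parse(byte[] xfdf) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // The body comes from the network, so refuse DOCTYPE declarations.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xfdf));

        Map<String, String> fields = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagNameNS("*", "field");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element field = (Element) nodes.item(i);
            NodeList values = field.getElementsByTagNameNS("*", "value");
            if (values.getLength() > 0) {
                fields.put(field.getAttribute("name"), values.item(0).getTextContent());
            }
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<xfdf xmlns=\"http://ns.adobe.com/xfdf/\" xml:space=\"preserve\">"
                + "<fields>"
                + "<field name=\"firstName\"><value>Merlin</value></field>"
                + "<field name=\"city\"><value>Ottawa</value></field>"
                + "</fields></xfdf>";
        System.out.println(parse(sample.getBytes(StandardCharsets.UTF_8)));
    }
}
```

In a servlet, the byte array returned by getRequestBufferAsBytes would be passed straight to parse; an ASP page would do the equivalent with its own XML parser.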
                      • 11. Re: PDF  to XML conversion
                        Community Member
                        Jasmin,

                        I will try to convert it to ASP.

                        Another question: what if I call my web service from a button in the PDF? My web service is defined on Apache Axis. Is the command soap.connect(myURL) valid from PDF JavaScript to connect to Apache Axis? Is the document wrapped for SOAP when I use the soap command? I am afraid I have more questions; I am new to this.

                        Thank you very much for your answers
                        M
                        • 12. Re: PDF  to XML conversion
                          Community Member
                          Hi, I want to know how you did the conversion of PDF to XML using Acrobat Standard. Did you use the Acrobat JavaScript object?
                          Please help, as I need a function that properly converts PDF to XML.

                          Private oapp As New Acrobat.AcroApp
                          Private oavdoc As New Acrobat.AcroAVDoc
                          Private odoc As New Acrobat.AcroPDDoc
                          Dim ojs As Object
                          Dim osaveas As Object
                          Dim input As String

                          input = "d:\document\time.pdf"

                          If oavdoc.Open(input, Path.GetFileName(input)) Then
                              odoc = oavdoc.GetPDDoc()
                              ojs = odoc.GetJSObject()
                              ojs.Saveas("/c/test.xml", "com.adobe.acrobat.xml-1-00")

                          End If
                          odoc.Close()
                          oapp.Exit()

                          Here's the code in VB.NET, but I'm getting a security error in Acrobat Professional. Please help me out.
                          • 13. Re: PDF  to XML conversion
                            Community Member
                            Hi Asish,

                            I was struggling with this security error for a couple of days myself! For security reasons (which I do not remember), you are not allowed to save the converted document in the root directory. Change your path from "/c/test.xml" to something like "/c/test/text.xml" and it should work all right.

                            Good luck
                            Arash
                            • 14. Re: PDF  to XML conversion
                              Hi,
                              I am having the same issue converting a PDF file to XML. This has to be done in pure Java. So far I get the PDF file name by selecting the file in a file chooser and storing it in a variable, and then I tried PDFBox. The file is converted, but not to XML: when I open the converted file it contains plain text, and only the extension has changed from .pdf to .xml. Please let me know how to achieve this.
                              I have pasted the code I used below.

                              import java.io.File;
                              import java.io.FileOutputStream;
                              import java.io.IOException;
                              import java.io.OutputStreamWriter;
                              import java.io.Writer;
                              import java.net.MalformedURLException;
                              import java.net.URL;

                              import org.pdfbox.pdmodel.PDDocument;
                              import org.pdfbox.pdmodel.encryption.AccessPermission;
                              import org.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;
                              import org.pdfbox.util.PDFText2HTML;
                              //import org.pdfbox.util.PDFTextStripper;
                              import org.pdfbox.pdmodel.font.PDFont.* ;
                              import org.pdfbox.util.PDFTextStripper;
                              import org.pdfbox.util.*;
                              import org.pdfbox.pdmodel.*;
                              import com.activegrid.util.AGObject;
                              //import executesqljavaxsd.types.*;
                              import com.activegrid.util.Logger;
                              import com.activegrid.data.DataService;
                              import java.util.List;
                              import java.util.ArrayList;

                              public class pdf2xml {
                                  public static final String DEFAULT_ENCODING =
                                      null;
                                      //"ISO-8859-1";
                                      //"ISO-8859-6"; //arabic
                                      //"US-ASCII";
                                      //"UTF-8";
                                      //"UTF-16";
                                      //"UTF-16BE";
                                      //"UTF-16LE";

                                  private static final String PASSWORD = "-password";
                                  private static final String ENCODING = "-encoding";
                                  private static final String CONSOLE = "-console";
                                  private static final String START_PAGE = "-startPage";
                                  private static final String END_PAGE = "-endPage";
                                  private static final String SORT = "-sort";
                                  private static final String HTML = "-html"; // jjb - added simple HTML output
                                  private static String a;

                                  public static void main( String[] args ) throws Exception
                                  {

                                  }

                                  public static void abc(String inp)throws Exception
                                  {
                                      boolean toConsole = false;
                                      boolean toHTML = false;
                                      boolean sort = false;
                                      String password = "";
                                      String encoding = DEFAULT_ENCODING;
                                      String pdfFile = inp;
                                      String textFile = "C:/tut.xml";//file to store in XML format
                                      int startPage = 1;
                                      int endPage = Integer.MAX_VALUE;
                                      a ="txt";
                                      if( pdfFile == null )
                                      {
                                          usage();
                                      }
                                      else
                                      {

                                          Writer output = null;
                                          PDDocument document = null;
                                          try
                                          {
                                              try
                                              {
                                                  //basically try to load it from a url first and if the URL
                                                  //is not recognized then try to load it from the file system.
                                                  URL url = new URL( pdfFile );
                                                  document = PDDocument.load( url );
                                                  String fileName = url.getFile();
                                                  if( textFile == null && fileName.length() >4 )
                                                  {
                                                      File outputFile =
                                                          new File( fileName.substring( 0, fileName.length() -4 ) + ".txt" );
                                                      textFile = outputFile.getName();
                                                  }
                                              }
                                              catch( MalformedURLException e )
                                              {
                                                  document = PDDocument.load( pdfFile );
                                                  if( textFile == null && pdfFile.length() >4 )
                                                  {
                                                      textFile = pdfFile.substring( 0, pdfFile.length() -4 ) + ".txt";
                                                  }
                                              }

                                              //document.print();
                                              if( document.isEncrypted() )
                                              {
                                                  StandardDecryptionMaterial sdm = new StandardDecryptionMaterial( password );
                                                  document.openProtection( sdm );
                                                  AccessPermission ap = document.getCurrentAccessPermission();

                                                  if( ! ap.canExtractContent() )
                                                  {
                                                      throw new IOException( "You do not have permission to extract text" );
                                                  }
                                              }
                                              if( toConsole )
                                              {
                                                  output = new OutputStreamWriter( System.out );
                                              }
                                              else
                                              {
                                                  if( encoding != null )
                                                  {
                                                      output = new OutputStreamWriter(
                                                          new FileOutputStream( textFile ), encoding );
                                                  }
                                                  else
                                                  {
                                                      //use default encoding
                                                      output = new OutputStreamWriter(
                                                          new FileOutputStream( textFile ) );
                                                  }
                                              }

                                              PDFTextStripper stripper = null;
                                              if(toHTML)
                                              {
                                                  stripper = new PDFText2HTML();
                                              }
                                              else
                                              {
                                                  stripper = new PDFTextStripper();
                                              }
                                              stripper.setSortByPosition( sort );
                                              stripper.setStartPage( startPage );
                                              stripper.setEndPage( endPage );
                                              stripper.writeText( document, output );
                                          }
                                          finally
                                          {
                                              if( output != null )
                                              {
                                                  output.close();
                                              }
                                              if( document != null )
                                              {
                                                  document.close();
                                              }
                                          }
                                      }
                                  }

                                  /**
                                   * This will print the usage requirements and exit.
                                   */
                                  private static void usage()
                                  {
                                      System.err.println( "Usage: java org.pdfbox.ExtractText [OPTIONS] <PDF file> [Text File]\n" +
                                          " -password <password> Password to decrypt document\n" +
                                          " -encoding <output encoding> (ISO-8859-1,UTF-16BE,UTF-16LE,...)\n" +
                                          " -console Send text to console instead of file\n" +
                                          " -html Output in HTML format instead of raw text\n" +
                                          " -sort Sort the text before writing\n" +
                                          " -startPage <number> The first page to start extraction(1 based)\n" +
                                          " -endPage <number> The last page to extract(inclusive)\n" +
                                          " <PDF file> The PDF document to use\n" +
                                          " [Text File] The file to write the text to\n"
                                          );
                                      System.exit( 1 );
                                  }

                                  public static java.lang.String pdf2xml(java.lang.String inp) {
                                      try{
                                          abc(inp);
                                      }
                                      catch( Exception e){}
                                      a = inp;
                                      java.lang.String out = a;

                                      // your custom code goes here

                                      return out;
                                  }

                              }

                              Please reply ASAP.
                              regards
                              yuvaraj
                              • 15. Re: PDF  to XML conversion
                                Community Member
                                You should post this question in the PDFBox forum.
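One likely explanation for the result in the previous reply: PDFTextStripper writes plain text, whatever extension the output file has. To get XML, the extracted text has to be serialized as XML explicitly. A minimal sketch using only the standard Java XML APIs, assuming the text has already been extracted into a String and that blank lines separate paragraphs. The class name TextToXml and the element names document and p are hypothetical, and this does not recover tables or other structure.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class TextToXml {

    /** Wraps each blank-line-separated block of text in a p element. */
    public static String wrap(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element root = doc.createElement("document");
        doc.appendChild(root);

        // \R matches any line break; two breaks with optional whitespace between
        // them are treated as a paragraph boundary.
        for (String block : text.split("\\R\\s*\\R")) {
            String trimmed = block.trim();
            if (!trimmed.isEmpty()) {
                Element p = doc.createElement("p");
                p.setTextContent(trimmed); // the serializer escapes &, < and so on
                root.appendChild(p);
            }
        }

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(wrap("Module Title\n\nTom & Jerry"));
    }
}
```

With PDFBox, the input string would come from the text stripper instead of a literal. Building the XML through a DOM rather than string concatenation keeps special characters in the PDF text from producing malformed output.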
                                • 16. Re: PDF  to XML conversion
                                  How do I convert a PDF document into XML format? Please help me with the code; I need it.
                                  • 17. Re: PDF  to XML conversion
                                    Jasmin Charbonneau techies
                                    We can get the XML data out of a PDF, but we can't convert the PDF to XML.

                                    Jasmin
                                    • 18. Re: PDF  to XML conversion
                                      Community Member
                                      How do I get the XML data out of a PDF?
                                      • 19. Re: PDF  to XML conversion
                                        Jasmin Charbonneau techies
                                        Using the LiveCycle Form Data Integration's Export data or LiveCycle Form's processFormSubmission.

                                        Jasmin
                                        • 20. Re: PDF  to XML conversion
                                          Jasmin Charbonneau techies
                                          I just want to rectify something I said in a post earlier:

                                          "...but we can't convert the PDF to XML"

                                          We actually CAN convert PDF to XML using ExportPDF( ) and ExportPDF2( ) operations in GeneratePDF service.

                                          Sorry for the confusion.

                                          Jasmin
                                          • 21. Re: PDF  to XML conversion
                                            How do I convert a PDF file to XML?
                                            • 22. Re: PDF  to XML conversion
                                              Hi,
                                              Can anyone please tell me how to convert a PDF file into an XML file? Thank you.
                                              • 23. PDF to XML conversion
                                                (Aandi_Inston) Community Member
                                                > Can anyone please tell me how to convert a PDF file into an XML file? Thank you.

                                                What sort of XML file? What PDF content?


                                                Aandi Inston
                                                • 24. Re: PDF  to XML conversion
                                                  Community Member
                                                  Hi Inston, actually I have received an application form from a company. It is in PDF format, and they have told us to fill in the form and send it back in .xml format. I wanted to know how to convert that .pdf file into an .xml file. I am using Adobe 9. Can you give me your email address? I will mail the form to you. Please help me.
                                                  • 25. PDF to XML conversion
                                                    (Aandi_Inston) Community Member
                                                    Sorry, no email. Messages are all public so people can benefit in
                                                    future.

                                                    The people who sent you the form need to explain what you need to do.
                                                    For instance, there may be a button on the form to click.

                                                    Aandi Inston
                                                    • 26. Re: PDF  to XML conversion
                                                      Hi Aandi

                                                      Is there a way to convert PDF to XML? If yes, how do I do that?

                                                      Thanks
                                                      Jayashree
                                                      • 27. Re: PDF  to XML conversion
                                                        pguerett techies
                                                        Are you looking to get the data out of the form in an XML format, or are you trying to convert the whole PDF to XML?
                                                        • 28. Re: PDF  to XML conversion
                                                          A C Jones
                                                          Re Jasmin's answer from 10:18am Jul 23, 07 PST (#6 of 27): "So is this really a html representation of the PDF. PDFG should be the product that can convert PDF into HTML. The operation is ExportPDF." What version of PDFG would that be? We have v6.0 and it doesn't have the ExportPDF function. We just need to convert a PDF to HTML, keeping tables etc. intact. Help please!
                                                          • 29. Re: PDF  to XML conversion
                                                            pguerett techies
                                                            The HTML conversion in PDF/G was added in version 8
                                                            • 30. Re: PDF  to XML conversion
                                                              I have a PDF form and I want to get the data from the form in XML format. I assume that is doable using the Export feature of Adobe LiveCycle.
                                                              By default, however, every cell becomes an element in this XML, and I want some cells as attributes instead.
                                                              Is it doable?
                                                              Please advise.
                                                              Thanks!
                                                              • 31. Re: PDF  to XML conversion
                                                                pguerett techies
                                                                No, by default that is the way it works. You could apply a style sheet to reformat the XML the way you want.
                                                                • 32. Re: PDF  to XML conversion
                                                                  Community Member
                                                                  Thanks, Paul.

                                                                  Do you have any online resources that could help with learning and applying style sheets to reformat XML?
                                                                  Some links, etc.
                                                                  Thank you!
                                                                  • 33. Re: PDF  to XML conversion
                                                                    pguerett techies
                                                                    There are plenty of sites out there. Simply do a search for XSLT and you will find plenty of material.
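As a starting point for the style sheet approach discussed in this exchange, here is a minimal sketch that applies an XSLT 1.0 style sheet from Java using the standard javax.xml.transform API. It copies the document unchanged except that the child elements of each row element become attributes. The element names form1, row, name and qty are hypothetical stand-ins for whatever the exported form data actually contains, and the class name CellsToAttributes is also an assumption.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class CellsToAttributes {

    // Identity transform plus one override: children of <row> become attributes.
    private static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match='row'>"
        + "<row>"
        + "<xsl:for-each select='*'>"
        + "<xsl:attribute name='{name()}'><xsl:value-of select='.'/></xsl:attribute>"
        + "</xsl:for-each>"
        + "</row>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    public static String transform(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String exported = "<form1><row><name>Widget</name><qty>3</qty></row></form1>";
        System.out.println(transform(exported));
    }
}
```

To turn only some cells into attributes, the select in the for-each could be narrowed (for example to a single named child), with the remaining children copied through by the identity template.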
                                                                    • 34. Re: PDF  to XML conversion
                                                                      Hi all,

                                                                      I want to know whether there is some code to extract the XSD from a PDF form in Java using iText.

                                                                      Thanks and regards,
                                                                      Gaurav