Does your PDF contain actual form fields?
It does not—it is a PDF of a web page with some data on it that we want to transform into Excel for analysis (cutting/pasting puts the columns out of order). What do form fields do when trying to convert PDF to Excel?
Form fields are not converted, so if your entire document is made up of form fields, you would not get any information in the resulting Excel file.
I assume that the file you are trying to convert does not contain real text - very likely because somebody converted all text to outlines in order to make it harder to extract data. Can you select text and copy and paste it into e.g. a Word document?
Converting from PDF to Word, Excel or any other format is one of the most complex things you can try to do with a PDF file. It works very well in some cases; in other cases the output has very little to do with the original file. The key to success is that the PDF file needs to be "tagged", which means that it contains structural information (headings, paragraphs, tables and so on) about the content that is displayed in the file. The best way to make sure that a PDF file is tagged correctly is by using the PDFMaker in Acrobat to create the PDF file from Word or Excel (that's the Acrobat ribbon or toolbar).
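If you want a quick way to guess whether a file is tagged before trying the export, here is a rough sketch in Python. It relies on the fact that a tagged PDF's document catalog carries a /StructTreeRoot entry, and simply scans the raw bytes for that name. The function names are my own, and this is only a heuristic: in newer PDFs the catalog can sit inside a compressed object stream, in which case the scan reports "not tagged" even though the file is. Treat a negative result as "unknown" rather than proof.

```python
from pathlib import Path


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw PDF bytes mention a structure tree.

    A tagged PDF's catalog has a /StructTreeRoot entry. This check can
    miss it when the catalog is stored in a compressed object stream.
    """
    return b"/StructTreeRoot" in pdf_bytes


def file_looks_tagged(path: str) -> bool:
    """Convenience wrapper (hypothetical helper) that reads a file from disk."""
    return looks_tagged(Path(path).read_bytes())
```

Calling file_looks_tagged("export.pdf") on one of your web-page PDFs (the file name is just an example) would give a first hint as to whether the export has any structure to work with.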
Sometimes it helps to save the PDF file as a set of high resolution (e.g. 600dpi) images, then import these images back into Acrobat, run OCR and then export to Word or Excel again.
There are other tools available that can convert PDF to Excel. Whenever I come across a file that does not want to behave (and I don't want to go through the process for converting to an image and importing again), I give Tabula (http://tabula.technology) a chance. However, because you are not getting anything, I suspect that this will not work either.
Hi Karl, I appreciate your detailed response. I think I’m understanding what you mean—we hired this company to host data for us and they are only willing to do one data dump per month, which really doesn’t work for my team since we are in an iterative pilot phase of the project and need to be modifying our procedures based on the data outcomes.
When I exported the PDF as a Word document, it came out a bit garbled but was mostly legible—but if I’m understanding you correctly, I think what you’re saying about converting actual text to outlines is correct, because all of the words and numbers are little image type boxes (rather than being able to move a cursor through them freely).
I just tried copying the things we need into a Word document and I’m able to manipulate it a bit there, but ultimately need to pass it on in an Excel version. This is a possible workaround, but unfortunately it would be very time consuming since we have to get data from ~40 different sites, and then manually manipulate them all.
Since we are creating this PDF from a website that is hosting our data, it must not be ‘tagged’ as you have explained.
I also found out from a team member that he has previously been able to do this with Adobe Acrobat 10, but my other colleague and I are having difficulty with Acrobat DC and Acrobat 9.
We’ll have to find a workaround. Thank you.
If you get data in Word, chances are that the document has not been converted to outlines. You may want to give a third-party tool a try, such as Tabula (http://tabula.technology). Sometimes I get better results with Tabula, but it's very slow.
You can also try to export the PDF files as a series of high resolution TIFF images (e.g. 600dpi) and then import these images back into Acrobat as a PDF file. Then run OCR and try to export again.
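If you end up falling back on copy and paste, the column reordering at least does not have to be done by hand for every site. Below is a minimal sketch in Python, assuming the pasted text is tab-separated and has a header row; the column names (Site, Date, Score) are made up for illustration and would need to match your real data. It writes CSV, which Excel opens directly.

```python
import csv
import io


def reorder_columns(pasted_text, wanted_order, delimiter="\t"):
    """Return CSV text with the columns arranged in wanted_order.

    Assumes the first line of pasted_text is a header row. Columns not
    listed in wanted_order are dropped; listed columns that are missing
    from the input come out as empty cells.
    """
    reader = csv.DictReader(io.StringIO(pasted_text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=wanted_order,
        extrasaction="ignore",  # silently skip columns we did not ask for
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Example with hypothetical column names: the paste came out as
# Score, Site, Date but the analysis sheet expects Site, Date, Score.
pasted = "Score\tSite\tDate\n7\tA\t2020-01-01\n9\tB\t2020-01-02\n"
fixed = reorder_columns(pasted, ["Site", "Date", "Score"])
```

Saving the returned text to a .csv file per site and opening those in Excel would replace the manual rearranging, although it does not help if the paste itself loses or merges cells.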