I've got around 400 PDFs of varying length that I'd like to batch convert to a spreadsheet. The best-case scenario would be this: take the content of the first page of each PDF and put the text into a spreadsheet, with each PDF's text in its own single cell. Do you have any ideas on how to do this?
Here's my own idea so far:
1) Delete all pages except the first one in a batch. I've tried the Action Wizard, but it seems to work only if all the PDFs have the same number of pages, which they don't. Is there any way to overcome this?
2) Batch convert the PDFs to XML. This I can do, and the OCR seems to do quite a good job. However, the text is spread out over multiple cells in the spreadsheet. Is there any way to tell Acrobat to put all the information in a single cell?
3) Merge the XML documents into a single spreadsheet. This should be fairly simple, I think, so no worries on that one.
Any help with the first two steps? Or other ideas on how to achieve this?
Thank you :-)
I think the way to do it is by using a script, like this:
- Use an Action to process all the files.
- Perform OCR on the first page of each file.
- Extract the first page's text and save it into a global variable using a
- When the Action is complete, run a separate script to export the value of
that variable to a text file, which can then be opened using Excel.
Thank you, try67. My solution ended up being to batch convert the PDFs to .txt files and then import them into Excel using VBA.
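For anyone who would rather script that last import step than write VBA, here is a minimal Python sketch of the same idea. It is not the VBA solution mentioned above, just one possible alternative using only the standard library. It assumes the batch conversion has already produced one .txt file per PDF in a single folder; the function name `texts_to_csv` and the folder and file names in the usage note are hypothetical.

```python
import csv
from pathlib import Path


def texts_to_csv(txt_dir, csv_path, encoding="utf-8"):
    """Write one row per .txt file: the file name, then its full text in one cell.

    Assumes the .txt files were already produced by a batch conversion step.
    Returns the number of files written.
    """
    rows = []
    for txt_file in sorted(Path(txt_dir).glob("*.txt")):
        # errors="replace" keeps the run going if a file has stray bytes.
        text = txt_file.read_text(encoding=encoding, errors="replace")
        rows.append([txt_file.name, text.strip()])

    # newline="" lets the csv module control line endings. The writer quotes
    # any field containing commas or line breaks, so multi-line text stays in
    # a single cell. "utf-8-sig" adds a BOM, which helps Excel detect UTF-8.
    with open(csv_path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "text"])
        writer.writerows(rows)
    return len(rows)
```

Calling `texts_to_csv("converted_txt", "first_pages.csv")` would then give a CSV file that Excel can open directly, with one row per source file. Keep in mind that Excel limits a cell to 32,767 characters, so very long text files may need trimming first.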