24 Replies Latest reply on Apr 30, 2008 12:25 AM by (debjani_dasgupta)

    Extract Text via Applescript

      Hi all,

      I am trying to set up an AppleScript to convert a group of PDFs to plain text. I have the script below, which almost works, but I can't get it to save the text file with a variable as the filename.

      The script below works fine if I use the 'tester' variable, which explicitly sets the filename. However, if I use the 'newName' variable, which is built from the original filename, the script fails with the following error:
      Adobe Acrobat Professional got an error: "this.saveAs(\"/C7s.txt\", \"com.adobe.acrobat.plain-text\");" doesn't understand the do script message.

      Script is:
      tell application "Finder"

          set theFolder to choose folder with prompt "Select a folder of PDFs:"
          set theFiles to every file of theFolder

          repeat with a from 1 to length of theFiles
              set theFile to file ((item a of theFiles) as string)
              set fileExtension to the name extension of theFile
              set the fileName to the name of theFile
              set the newName to text 1 thru -((length of fileExtension) + 2) of the fileName & ".txt"
              set tester to "myDoc.txt"
              tell application "Adobe Acrobat Professional"
                  open theFile
                  tell document 1
                      --save the file out as a text file using JavaScript
                      do script "this.saveAs(\"/" & newName & "\", \"com.adobe.acrobat.plain-text\");"
                  end tell
                  close document 1
              end tell
          end repeat
      end tell

      Any help would be much appreciated.

      James
      P.S. I am running Mac OS X 10.4.10 and Acrobat 8
        • 1. Re: Extract Text via Applescript
          Level 1
          I have managed to solve this and have put the final script up here in case you are trying to do the same thing:

          tell application "Finder"

              set theFolder to choose folder with prompt "Select a folder of PDFs:"
              set theFiles to every file of theFolder

              repeat with a from 1 to length of theFiles
                  set theFile to file ((item a of theFiles) as string)
                  set fileExtension to the name extension of theFile
                  set the fileName to the name of theFile
                  set the newName to text 1 thru -((length of fileExtension) + 2) of the fileName & ".txt"
                  set {text:newName} to newName as string

                  tell application "Adobe Acrobat Professional"
                      open theFile
                      tell document 1
                          --save the file out as a text file using JavaScript
                          do script "this.saveAs(\"/" & newName & "\", \"com.adobe.acrobat.plain-text\");"
                      end tell
                      close document 1
                  end tell

              end repeat
          end tell
          • 2. Re: Extract Text via Applescript
            Is this possible for other formats as well? Postscript, EPS?
            • 3. Re: Extract Text via Applescript
              (Aandi_Inston) Level 1
              When it comes to text extraction, the Acrobat SDK is concerned with
              automating Acrobat, which only deals with PDF. You could automate
              converting the PostScript and EPS to PDF, however.

              Aandi Inston
              • 4. Re: Extract Text via Applescript
                Level 1
                I was more interested in automating the conversion of a PDF to PostScript or EPS, or, for that matter, to any of the formats that Acrobat supports in its export function. Any thoughts on this?
                • 5. Re: Extract Text via Applescript
                  Level 1
                  Hi Chris,

                  The JavaScript call I used (this.saveAs with "com.adobe.acrobat.plain-text") accepts other conversion IDs as well. If you look at the document linked below and search for "plain-text", you should see all the other options, including EPS.

                  http://www.adobe.com/devnet/acrobat/pdfs/Acro6JS.pdf
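
                  As a rough sketch of how to see which conversion IDs a given Acrobat install accepts, the Acrobat JavaScript property app.fromPDFConverters holds the ID strings that saveAs takes. The snippet below assumes the application name used earlier in this thread, and assumes that "do script" works at the application level and hands the JavaScript result back as text. Neither is guaranteed on every OS and Acrobat combination.

                  ```applescript
                  -- Sketch only: ask Acrobat's JavaScript engine for its export conversion IDs.
                  -- Assumption: the application is named "Adobe Acrobat Professional" (Acrobat 8).
                  -- Assumption: "do script" returns the JavaScript result as text.
                  tell application "Adobe Acrobat Professional"
                      set converterList to do script "app.fromPDFConverters.join(\", \");"
                  end tell

                  -- Shows the list in Script Editor's result pane
                  return converterList
                  ```

                  Any ID in that list can then be tried in place of "com.adobe.acrobat.plain-text" in the saveAs call above.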

                  regards

                  james
                  • 6. Re: Extract Text via Applescript
                    Hi James,

                    I needed to do the same thing (Convert PDFs to text).

                    I used your script (the corrected one)

                    I still get the error that you mentioned while posting your first message

                    What could be the problem?

                    I'm using Mac OS 10.5.2 and Acrobat 8.0

                    Thanks

                    Dheeraj
                    • 7. Re: Extract Text via Applescript
                      Level 1
                      Hi Dheeraj,

                      I seem to be getting the same message on Mac OS X 10.5 as well. You have two options:

                      1. Automator in Leopard has an option to extract text from PDFs, which is far simpler than any script!
                      2. I have come across PDF2Office. I have tested the demo version, and it seems to do this and much more.
                      http://www.recosoft.com/products/pdf2office/

                      Hope this helps

                      Regards

                      James
                      • 8. Re: Extract Text via Applescript
                        Level 1
                        Dear James,

                        Thanks for the help.

                        I had managed to figure out the "Automator solution" before reading your message.

                        Will check out pdf2office as well.

                        Thanks once again.

                        Dheeraj
                        • 9. Re: Extract Text via Applescript
                          A few export options:

                          property Default_Location : (path to desktop as Unicode text) as alias
                          --
                          set Input_Folder to choose folder default location Default_Location with prompt "Select a folder of PDFs" without invisibles
                          --
                          tell application "Finder"
                              set File_List to (files of Input_Folder whose name extension is "pdf")
                              repeat with This_File in File_List
                                  set The_File to This_File as alias
                                  tell application "Adobe Acrobat 7.0 Professional"
                                      activate
                                      open The_File
                                      set Doc_Name to name of document 1
                                      set Base_Name to my getBaseName(Doc_Name)
                                      set New_File_Path to (Default_Location as string) & Base_Name & ".txt"
                                      save front document to file New_File_Path using conversion "com.adobe.acrobat.accesstext"
                                      close front document
                                  end tell
                              end repeat
                          end tell
                          --
                          on getBaseName(fName)
                              set baseName to fName
                              repeat with idx from 1 to (length of fName)
                                  if (item idx of fName = ".") then
                                      set baseName to (items 1 thru (idx - 1) of fName) as string
                                      exit repeat
                                  end if
                              end repeat
                              return baseName
                          end getBaseName

                          property Default_Location : (path to desktop as Unicode text) as alias
                          --
                          set Input_Folder to choose folder default location Default_Location with prompt "Select a folder of PDFs" without invisibles
                          --
                          tell application "Finder"
                              set File_List to (files of Input_Folder whose name extension is "pdf")
                              repeat with This_File in File_List
                                  set The_File to This_File as alias
                                  tell application "Adobe Acrobat 7.0 Professional"
                                      activate
                                      open The_File
                                      set Doc_Name to name of document 1
                                      set Base_Name to my getBaseName(Doc_Name)
                                      set New_File_Path to (Default_Location as string) & Base_Name & ".html"
                                      save front document to file New_File_Path using conversion "com.adobe.acrobat.html-3-20"
                                      close front document
                                  end tell
                              end repeat
                          end tell
                          --
                          on getBaseName(fName)
                              set baseName to fName
                              repeat with idx from 1 to (length of fName)
                                  if (item idx of fName = ".") then
                                      set baseName to (items 1 thru (idx - 1) of fName) as string
                                      exit repeat
                                  end if
                              end repeat
                              return baseName
                          end getBaseName

                          property Default_Location : (path to desktop as Unicode text) as alias
                          --
                          set Input_Folder to choose folder default location Default_Location with prompt "Select a folder of PDFs" without invisibles
                          --
                          tell application "Finder"
                              set File_List to (files of Input_Folder whose name extension is "pdf")
                              repeat with This_File in File_List
                                  set The_File to This_File as alias
                                  tell application "Adobe Acrobat 7.0 Professional"
                                      activate
                                      open The_File
                                      set Doc_Name to name of document 1
                                      set Base_Name to my getBaseName(Doc_Name)
                                      set New_File_Path to (Default_Location as string) & Base_Name & ".html"
                                      save front document to file New_File_Path using conversion "com.adobe.acrobat.html-4-01-css-1-00"
                                      close front document
                                  end tell
                              end repeat
                          end tell
                          --
                          on getBaseName(fName)
                              set baseName to fName
                              repeat with idx from 1 to (length of fName)
                                  if (item idx of fName = ".") then
                                      set baseName to (items 1 thru (idx - 1) of fName) as string
                                      exit repeat
                                  end if
                              end repeat
                              return baseName
                          end getBaseName

                          property Default_Location : (path to desktop as Unicode text) as alias
                          --
                          set Input_Folder to choose folder default location Default_Location with prompt "Select a folder of PDFs" without invisibles
                          --
                          tell application "Finder"
                              set File_List to (files of Input_Folder whose name extension is "pdf")
                              repeat with This_File in File_List
                                  set The_File to This_File as alias
                                  tell application "Adobe Acrobat 7.0 Professional"
                                      activate
                                      open The_File
                                      set Doc_Name to name of document 1
                                      set Base_Name to my getBaseName(Doc_Name)
                                      set New_File_Path to (Default_Location as string) & Base_Name & ".jpg"
                                      save front document to file New_File_Path using conversion "com.adobe.acrobat.jpeg"
                                      close front document
                                  end tell
                              end repeat
                          end tell
                          --
                          on getBaseName(fName)
                              set baseName to fName
                              repeat with idx from 1 to (length of fName)
                                  if (item idx of fName = ".") then
                                      set baseName to (items 1 thru (idx - 1) of fName) as string
                                      exit repeat
                                  end if
                              end repeat
                              return baseName
                          end getBaseName

                          property Default_Location : (path to desktop as Unicode text) as alias
                          --
                          set Input_Folder to choose folder default location Default_Location with prompt "Select a folder of PDFs" without invisibles
                          --
                          tell application "Finder"
                              set File_List to (files of Input_Folder whose name extension is "pdf")
                              repeat with This_File in File_List
                                  set The_File to This_File as alias
                                  tell application "Adobe Acrobat 7.0 Professional"
                                      activate
                                      open The_File
                                      set Doc_Name to name of document 1
                                      set Base_Name to my getBaseName(Doc_Name)
                                      set New_File_Path to (Default_Location as string) & Base_Name & ".jpg"
                                      save front document to file New_File_Path using conversion "com.adobe.acrobat.jp2k"
                                      close front document
                                  end tell
                              end repeat
                          end tell
                          --
                          on getBaseName(fName)
                              set baseName to fName
                              repeat with idx from 1 to (length of fName)
                                  if (item idx of fName = ".") then
                                      set baseName to (items 1 thru (idx - 1) of fName) as string
                                      exit repeat
                                  end if
                              end repeat
                              return baseName
                          end getBaseName

                          property Default_Location : (path to desktop as Unicode text) as alias
                          --
                          set Input_Folder to choose folder default location Default_Location with prompt "Select a folder of PDFs" without invisibles
                          --
                          tell application "Finder"
                              set File_List to (files of Input_Folder whose name extension is "pdf")
                              repeat with This_File in File_List
                                  set The_File to This_File as alias
                                  tell application "Adobe Acrobat 7.0 Professional"
                                      activate
                                      open The_File
                                      set Doc_Name to name of document 1
                                      set Base_Name to my getBaseName(Doc_Name)
                                      set New_File_Path to (Default_Location as string) & Base_Name & ".doc"
                                      save front document to file New_File_Path using conversion "com.adobe.acrobat.doc"
                                      close front document
                                  end tell
                              end repeat
                          end tell
                          --
                          on getBaseName(fName)
                              set baseName to fName
                              repeat with idx from 1 to (length of fName)
                                  if (item idx of fName = ".") then
                                      set baseName to (items 1 thru (idx - 1) of fName) as string
                                      exit repeat
                                  end if
                              end repeat
                              return baseName
                          end getBaseName

                          property Default_Location : (path to desktop as Unicode text) as alias
                          --
                          set Input_Folder to choose folder default location Default_Location with prompt "Select a folder of PDFs" without invisibles
                          --
                          tell application "Finder"
                              set File_List to (files of Input_Folder whose name extension is "pdf")
                              repeat with This_File in File_List
                                  set The_File to This_File as alias
                                  tell application "Adobe Acrobat 7.0 Professional"
                                      activate
                                      open The_File
                                      set Doc_Name to name of document 1
                                      set Base_Name to my getBaseName(Doc_Name)
                                      set New_File_Path to (Default_Location as string) & Base_Name & ".txt"
                                      save front document to file New_File_Path using conversion "com.adobe.acrobat.plain-text"
                                      close front document
                                  end tell
                              end repeat
                          end tell
                          --
                          on getBaseName(fName)
                              set baseName to fName
                              repeat with idx from 1 to (length of fName)
                                  if (item idx of fName = ".") then
                                      set baseName to (items 1 thru (idx - 1) of fName) as string
                                      exit repeat
                                  end if
                              end repeat
                              return baseName
                          end getBaseName

                          property Default_Location : (path to desktop as Unicode text) as alias
                          --
                          set Input_Folder to choose folder default location Default_Location with prompt "Select a folder of PDFs" without invisibles
                          --
                          tell application "Finder"
                              set File_List to (files of Input_Folder whose name extension is "pdf")
                              repeat with This_File in File_List
                                  set The_File to This_File as alias
                                  tell application "Adobe Acrobat 7.0 Professional"
                                      activate
                                      open The_File
                                      set Doc_Name to name of document 1
                                      set Base_Name to my getBaseName(Doc_Name)
                                      set New_File_Path to (Default_Location as string) & Base_Name
                          • 10. Re: Extract Text via Applescript
                            Level 1
                            Dear Mark

                            Thanks for the detailed response and the scripts.

                            I needed to export a bunch of PDFs to "plain text". Leopard's Automator extracts text as "Access text", so I tried using your script.

                            While the accesstext option of your script works perfectly, the plain-text version gives me the message "Adobe Acrobat does not understand the save command".

                            I checked the Acrobat Javascript reference library and the script syntax seems to be correct.

                            What could be the reason for the error message? I'm using Acrobat 8.0 Professional and Mac OS 10.5.2

                            Cheers

                            Dheeraj
                            • 11. Re: Extract Text via Applescript
                              (Aandi_Inston) Level 1
                              >While the accesstext option of your script works perfectly, the plain-text version gives me the message "Adobe Acrobat does not understand the save command".

                              You are converting the line

                              save front document to file New_File_Path using conversion
                              "com.adobe.acrobat.accesstext"

                              and nothing else? What conversion string are you using, and where is
                              that string documented?

                              Aandi Inston
                              • 12. Re: Extract Text via Applescript
                                Level 1
                                Works OK for me with Acrobat 7.0.9 on Mac OS X 10.4.11. Leopard will be some time off for me.
                                • 13. Re: Extract Text via Applescript
                                  Level 1
                                  Hi Dheeraj

                                  Just wondering what the difference is between accesstext and plain text. I just did an extract with Automator and it looks like plain text output to me.

                                  James
                                  • 14. Re: Extract Text via Applescript
                                    MarkWalsh Level 4
                                    Check http://www.bluem.net/downloads/pdftotext_en/

                                    I have used it before and it worked very well for my needs.
                                    • 15. Re: Extract Text via Applescript
                                      Level 1
                                      Hi James,

                                      Frankly, even I do not know the difference between Access Text and Plain Text.

                                      However, I'm trying to convert a PDF in a foreign language (the Indian language Gujarati).

                                      The purpose is to extract certain specific information for which we've written code to convert the extracted text to English language.

                                      When we export to Access Text, we get a perfectly formatted output but the code that extracts the text to English encounters certain problems (like treating two different letters as one and the same - when translated to English).

                                      When we export to Plain text it gives us a document which is not perfectly formatted but works fine otherwise. Solving the formatting problem is easier than solving the other problem.

                                      Hence the effort to get plain-text output.

                                      Thanks for the inputs anyway.

                                      Cheers

                                      Dheeraj
                                      • 16. Re: Extract Text via Applescript
                                        Level 1
                                        Hi Mark,

                                        Thanks for the bluem.net link.

                                        I've downloaded and installed the package.

                                        As you may have realised (reading my previous post) I'm trying to extract text from a document that's not written in English. (It's actually a Gujarati - an Indian language - document).

                                        This software does not seem to work for the Gujarati document

                                        Thanks for the link once again.

                                        Dheeraj

                                        PS: Can it convert a bunch of documents in one go, say a folder of PDFs?
                                        • 17. Re: Extract Text via Applescript
                                          Level 1
                                          Can you export using Automator's Extract Text action, then batch convert to plain text using TextEdit or even TextWrangler?

                                          I assume it is the Gujarati character set that is throwing the access text off.
                                          • 18. Re: Extract Text via Applescript
                                            Level 1
                                            Hi James,

                                            Sorry. I'm not a technology professional - though I'm comfortable using tech.

                                            Could you explain how I could do what you suggest (batch convert in TextWrangler or TextEdit)?

                                            Would appreciate it much.
                                            • 19. Re: Extract Text via Applescript
                                              MarkWalsh Level 4
                                              Sorry, I have never used pdftotext with any language except English. I had assumed it would work with whatever text was in the document.

                                              Also, it doesn't seem to work if it is passed multiple paths. You would probably have to loop through each one.
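
                                              A minimal sketch of that per-file loop, assuming a command-line pdftotext is installed. The path in pdftotextPath is hypothetical and needs to be changed to wherever the tool actually lives on your machine. Given only an input path, pdftotext normally writes a .txt file next to the PDF.

                                              ```applescript
                                              -- Sketch only: run pdftotext once per PDF instead of passing several paths.
                                              -- Assumption: the install path below is hypothetical; adjust it for your system.
                                              property pdftotextPath : "/usr/local/bin/pdftotext"

                                              set pdfFolder to choose folder with prompt "Select a folder of PDFs:"

                                              -- Let the Finder resolve each PDF to an alias
                                              set pdfAliases to {}
                                              tell application "Finder"
                                                  set pdfFiles to (files of pdfFolder whose name extension is "pdf")
                                                  repeat with pdfFile in pdfFiles
                                                      set end of pdfAliases to (pdfFile as alias)
                                                  end repeat
                                              end tell

                                              -- One shell call per file, with both paths quoted for the shell
                                              repeat with pdfAlias in pdfAliases
                                                  set pdfPath to POSIX path of (contents of pdfAlias)
                                                  do shell script (quoted form of pdftotextPath) & " " & (quoted form of pdfPath)
                                              end repeat
                                              ```

                                              Quoting each path with "quoted form of" keeps file names containing spaces from being split into several arguments.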
                                              • 20. Re: Extract Text via Applescript
                                                Level 1
                                                I'm not sure if this will help, as the language is probably causing the issue, but I have created a little Automator workflow that extracts the text, places it in TextEdit, then saves and closes it. You could try that to see if it helps.

                                                www.hakoonamatata.co.uk/extractText.zip

                                                Regards
                                                • 21. Re: Extract Text via Applescript
                                                  Level 1
                                                  Hi James

                                                  Thanks for the automator workflow.

                                                  It appears that TextEdit does not differentiate between plain text and access text, and Automator seems to extract in access text format.

                                                  So I'm back to square one.

                                                  Thanks for the help anyway

                                                  Dheeraj
                                                  • 22. Re: Extract Text via Applescript
                                                    I am not sure if this helps you. You can go through Re-Add Selected Tracks as Podcast v1.1. This script recursively searches user-selected folders for files not added to iTunes and creates a text file listing their file paths. Optionally, this text file can be saved as an M3U file and imported by iTunes to add the found files. Requires Mac OS 10.4 or better.

                                                    Martin Fowl