0 Replies Latest reply on Jul 10, 2015 12:00 PM by mrwilhale

    Indexing certain PDFs fails

    mrwilhale Level 1

      Hey Group.


      I have about 36,000 files on a Windows Server 2008 machine running ColdFusion 11 Enterprise, Update 5.


      I love Solr indexing for its speed, but I am having issues with some of the docs, PDFs especially. These are documents of a legal nature, so I cannot share them, but the problem is pretty straightforward.


      I get the error: "Could not index the file [path here] .pdf in SOLR. Check the exception for more details: An error occurred during the extracttext operation of the cfpdf tag."


      When I run cfpdf with action="extracttext" on the file, I get an "invalid document [path] specified for source or directory" error:

      <cfpdf action="extracttext" source="http://localhost/[path]" name="mypdf">


      When I run the same tag with useStructure="false":

      <cfpdf action="extracttext" useStructure="false" source="http://localhost/[path]" name="mypdf">

      and dump the variable, I get all of the text along with what looks like poorly formed XML (closing tags are missing).

      I don't really care if that is how I get the data, as it is only used to let the user know which document contains the subject of their search.

      Things I know about the file:

      It opens in Acrobat.

      It was created with Acrobat PDFMaker 10.1 for Word.

      All dates are present.

      It claims to be PDF version 1.4 (Acrobat 4.x).

      It is 3 MB.

      Is there a way to tell CF11 to retry on failure of that document, ignoring the structure?
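
      To make the question concrete, here is a rough, untested sketch of the manual fallback I would like to have happen automatically. It assumes the file can be reached as a local path (pdfPath is a placeholder variable), that a Solr collection exists (the name "legaldocs" is made up), and that indexing the extracted text as a custom entry with cfindex is an acceptable substitute for indexing the file itself. I have not confirmed that type="string" behaves any differently from the default on the problem files.

```cfml
<!--- Placeholder values: pdfPath and the collection name "legaldocs" are assumptions. --->
<cfset pdfPath = "[path]">

<cftry>
    <!--- First attempt: normal extraction, which honors the document structure. --->
    <cfpdf action="extracttext" type="string" source="#pdfPath#" name="pdfText">
    <cfcatch type="any">
        <!--- Retry once, ignoring the structure. If this also fails, the error propagates. --->
        <cfpdf action="extracttext" type="string" useStructure="false" source="#pdfPath#" name="pdfText">
    </cfcatch>
</cftry>

<!--- Index the extracted text directly, so cfindex never has to parse the PDF. --->
<cfindex collection="legaldocs"
         action="update"
         type="custom"
         key="#pdfPath#"
         title="#getFileFromPath(pdfPath)#"
         body="#pdfText#">
```

      Doing this by hand for every failing file among 36,000 is what I am hoping to avoid, which is why I am asking whether the indexer itself can be told to retry this way.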