0 Replies Latest reply on Jul 10, 2015 12:00 PM by mrwilhale

    Indexing certain PDFs fails

    mrwilhale

      Hey Group.

       

      I have about 36,000 files on a Windows 2008 server, running CF11 Enterprise Update 5.

      I love SOLR indexing for its speed, but I am having issues with some of the docs, PDFs especially. These are documents of a legal nature, so I cannot share them, but the problem is pretty straightforward.

       

      I get this error: "Could not index the file [path here] .pdf in SOLR. Check the exception for more details: An error occurred during the extracttext operation of the cfpdf tag."

      When I run cfpdf extracttext on the file, I get "invalid document [path] specified for source or directory":

      <cfpdf action="extracttext" source="http://localhost/[path]" name="mypdf">

       

      When I run the same with useStructure="false"

      <cfpdf action="extracttext" useStructure="false" source="http://localhost/[path]" name="mypdf">

      and dump the variable, I get all of the text along with what looks like poorly formed XML (closing tags are missing).

      I don't really care if that is how I get the data, as it is only used to let the user know which document contains the subject of their search.


      Things I know:

      it opens in Acrobat

      was created with Acrobat PDFMaker 10.1 for Word

      all dates are present

      claims to be PDF Version 1.4 (Acrobat 4.x)

      it is 3 MB


      Is there a way to tell CF11 to retry on failure of that document, ignoring the structure?
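
      To make the question concrete, here is a rough sketch of the fallback done by hand, outside of the built-in file indexing. It is only an illustration under assumptions: the pdfPath value and the "legaldocs" collection name are hypothetical placeholders, and I have not confirmed that indexing the extracted text as a custom record is the recommended approach.

```cfml
<!--- Hypothetical values for illustration only; substitute your own file and collection. --->
<cfset pdfPath = "C:\docs\example.pdf">
<cfset collectionName = "legaldocs">

<cftry>
    <!--- First attempt: default extraction, which uses the document structure --->
    <cfpdf action="extracttext" source="#pdfPath#" name="pdfText">
    <cfcatch type="any">
        <!--- Fallback: retry the same file while ignoring the structure --->
        <cfpdf action="extracttext" useStructure="false" source="#pdfPath#" name="pdfText">
    </cfcatch>
</cftry>

<!--- Assumption: index the extracted text as a custom record keyed by the file path --->
<cfindex action="update"
         collection="#collectionName#"
         type="custom"
         key="#pdfPath#"
         title="#getFileFromPath(pdfPath)#"
         body="#toString(pdfText)#">
```

      If the second cfpdf call also throws, the exception propagates, so a real version would probably wrap the fallback in its own cftry and log the file name.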


      Thanks