I have 36,000 files or so on a Windows 2008 server, running CF11 Enterprise Update 5.
I love SOLR indexing for its speed, but I am having issues with some of the docs, PDFs especially. These are documents of a legal nature, so I cannot share them, but the problem is pretty straightforward.
I get the error: "Could not index the file [path here] .pdf in SOLR. Check the exception for more details: An error occurred during the extracttext operation of the cfpdf tag."
When I run cfpdf extracttext on the file myself, I get "invalid document [path] specified for source or directory."
<cfpdf action="extracttext" source="http://localhost/[path]" name="mypdf">
When I run the same with useStructure="false":
<cfpdf action="extracttext" useStructure="false" source="http://localhost/[path]" name="mypdf">
and dump the variable, I get all of the text along with what looks like poorly formatted XML (closing tags missing).
I don't really care if that is how I get the data, as it is only used to let the user know which document contains the subject of their search.
Things I know about the file:
It opens in Acrobat
It was created with Acrobat PDFMaker 10.1 for Word
All dates are present
It claims to be PDF Version 1.4 (Acrobat 4.x)
It is 3 MB
Is there a way to tell CF11 to retry on failure of that document, ignoring the structure?
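I am not aware of a setting that makes cfindex do this on its own, so the workaround I have in mind is to do the retry myself: extract the text with cfpdf inside a cftry, fall back to useStructure="false" when that throws, and hand the result to cfindex as a custom entry. This is only a rough sketch of that idea. The variable names (pdfPath, pdfText), the collection name "legalDocs", and the choice of type="string" for plain-text output are my own assumptions, not something I have confirmed against the failing files.

```cfml
<!--- pdfPath is a placeholder; the real path stays whatever the file's location is --->
<cfset pdfPath = "[path]">

<cftry>
    <!--- First attempt: default extraction, which uses the document structure --->
    <cfpdf action="extracttext" type="string" source="#pdfPath#" name="pdfText">
    <cfcatch type="any">
        <!--- Retry while ignoring the structure, the variant that worked when run by hand --->
        <cfpdf action="extracttext" type="string" useStructure="false" source="#pdfPath#" name="pdfText">
    </cfcatch>
</cftry>

<!--- Index the extracted text as a custom entry instead of letting cfindex parse the PDF.
      "legalDocs" is a hypothetical collection name. --->
<cfindex action="update"
         collection="legalDocs"
         type="custom"
         key="#pdfPath#"
         title="#getFileFromPath(pdfPath)#"
         body="#pdfText#">
```

If the second cfpdf call also throws, the exception propagates as usual, so a file that fails both ways would still need to be logged and skipped by whatever loop calls this.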