I'm using CFINDEX to feed Solr with PDF files.
While monitoring disk activity, I see that ColdFusion's CFINDEX only reads about 200 KB from disk while indexing a 1 MB PDF (102 pages).
I see similar patterns for other large PDF files.
I can use Solr to search for anything on the first 38 pages, but after that I get zero hits.
Are there any size limitations in CFINDEX? Anything I can tweak?
(I already tried maxFieldLength in solrconfig.)
Any ideas?
ColdFusion 9.01 Standard Ed.
Sorry to take a while to get back to you: I've been a bit busy in the evenings this week.
Um... yeah... I get the same thing. It seems to only index the first 38 or so pages for me.
I've knocked together some stand-alone code that replicates this, in case anyone else can test it too:
<!--- createCollection.cfm --->
<cftry>
<cfcollection action="delete" collection="scratch">
<cfcatch>
<!--- ignore the error if the collection doesn't exist yet --->
</cfcatch>
</cftry>
<cfcollection
action = "create"
collection = "scratch"
path ="#server.coldfusion.rootDir#\collections"
engine = "solr"
>
<!--- indexCollection.cfm --->
<cfindex
action = "refresh"
collection = "scratch"
key = "#expandPath('.')#"
type = "path"
extensions = ".pdf"
>
<!--- createPdf.cfm --->
<cfparam name="URL.file">
<cfparam name="URL.text">
<cfparam name="URL.size">
<cfdocument format="PDF" filename="#expandPath('./')##URL.file#" overwrite="true">
<cfset sPadding = "padding">
<cfset iSize = 0>
<cfset iPaddingLen = len(sPadding) + 1 + len(createUuid())>
<cfloop condition="true">
<cfset sThisPadding = sPadding & " " & createUuid()>
<cfoutput>#sThisPadding#</cfoutput>
<cfset iSize += iPaddingLen>
<cfif iSize GT URL.size * 1024>
<cfbreak>
</cfif>
</cfloop>
<cfoutput>#URL.text#</cfoutput>
</cfdocument>
<!--- search.cfm --->
<cfparam name="URL.search">
<cfsearch collection="scratch" name="q" criteria="#URL.search#">
<cfdump var="#q#">
Save all those into a directory, then run:
createCollection.cfm
createPdf.cfm?size=75&file=large1.pdf&text=locate
createPdf.cfm?size=85&file=large2.pdf&text=locate
indexCollection.cfm
search.cfm?search=locate
The PDFs created are only around 120-130kB apiece, but are 34 and 39 pages respectively. Neither in size nor in length are they very big.
I only get a match in large1.pdf
If I peg back the large2.pdf to be size=83, its page count drops back to within 38, and I start getting it coming back in the search results too.
I dunno if this is a limitation of the dev edition of CF, or it's a fairly horrible bug...
Were you running on a dev edition, or a licensed one?
--
Adam
Adam, thanks a lot for verifying this BUG in CFINDEX.
I'm running a licensed version of ColdFusion 9.01 Standard Ed.
I can test this on a licensed Enterprise Ed. as well, but since the CF Developer Ed. is basically the same I don't think that's really necessary.
I will file a bug report to Adobe and hope for some sort of solution.
Maybe I'll switch to Apache Tika in the meantime.
//Anders
Do you need special permission to vote? I just tried using the user name and password I have for the forum, and I can't get in.
We plan to start a project soon where we will have hundreds of PDFs online for our SOP system. Several of them are over 50 pages, and we need them to be indexed.
Thanks
I'd vote if I could. I get Invalid credentials when I try to use my Adobe login.
Create a new login?
I have to concede I have no idea which login they want on the bugbase... I've got 2-4 logins for various parts of the Adobe/Macromedia site but they've all got the same credentials, so I dunno which is which... but I thought it was the one for the main Adobe site; so not the one for these forums (if that was what you were trying).
There are plenty of issues in the bugbase that could do with community support, so I would always encourage people to do what they can to enable themselves to vote for them (there is no proof that Adobe pays attention to the votes, btw, but it can't hurt ;-). I vote for a lot of issues if only so I get an update email if they ever get around to doing something about them (I have received two emails this week about issues being resolved, so it does happen!).
--
Adam
I've just received email notification that it's been fixed. Sadly they never say when the fix will be available (or whether it'll be a hotfix, bundled in an updater or in the next release of CF).
Still: at least they've dealt with it, and it demonstrates that putting stuff in the bug tracker is worthwhile!
--
Adam
Aha. Ray Camden has been on the case with this issue, and has turned up some interesting info: http://www.coldfusionjedi.com/index.cfm/2011/8/22/Indexing-PDFs-with-Solr-Read-this-tip
Apparently this is by design.
--
Adam
Ah, that solved it!
I had already tried modifying the <maxFieldLength> field under the <indexDefaults> section in solrconfig, with no result,
but I found it repeated further down, and changing the same field under the <mainIndex> section made a difference.
It's rather strange that these limitations are applied by default, when Adobe's own PDF format is frequently used for large documents.
Thank you Adam!
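For anyone landing here later, the change described above would look roughly like the fragment below. This is only a sketch: the surrounding elements are trimmed, the 10000 shown under <indexDefaults> is what I believe the stock Solr config ships with (an assumption, check your own file), and the raised value under <mainIndex> is purely illustrative, not a number taken from this thread. <maxFieldLength> caps how many tokens get indexed per field, which would explain why text past a certain page never turns up in searches.

```xml
<!-- solrconfig.xml (fragment) for the collection being indexed -->
<indexDefaults>
  <!-- Changing only this setting made no difference in this thread.
       10000 is assumed to be the stock value. -->
  <maxFieldLength>10000</maxFieldLength>
</indexDefaults>

<mainIndex>
  <!-- Raising this one is what made the later pages searchable.
       2147483647 is an illustrative value, not one from the thread. -->
  <maxFieldLength>2147483647</maxFieldLength>
</mainIndex>
```

You will most likely need to restart the Solr service and re-run the CFINDEX refresh afterwards, since documents that were already indexed would have been truncated at the old limit.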