Indexing large PDFs

Aug 9, 2011 7:09 AM

I'm using CFINDEX to feed Solr with PDF files.

While monitoring disk activity, I see that ColdFusion's CFINDEX only reads about 200 KB from disk while indexing a 1 MB PDF (102 pages).

I see similar patterns for other large PDF files.

I can use Solr to search for anything on the first 38 pages, but anything after that gets 0 results.

Are there any size limitations in CFINDEX? Anything I can tweak?

 

(I already tried the maxfieldsize in solrconfig.)

 

 

Any ideas?

ColdFusion 9.0.1 Standard Edition.

 
Replies
  • Aug 14, 2011 7:40 AM   in reply to XIntelligence

    Does this happen for any large PDF file, or just a specific one?  Perhaps if it's just the one, there's some sort of corruption or something "unexpected" at the boundary you're seeing?

     

    --

    Adam

     
  • Aug 18, 2011 1:52 PM   in reply to XIntelligence

    Sorry to take a while to get back to you: I've been a bit busy in the evenings this week.

     

    Um... yeah... I get the same thing.  It seems to only index the first 38 or so pages for me.

     

    I've knocked together some stand-alone code that replicates this, in case anyone else can test it too:

     

    <!--- createCollection.cfm --->

    <!--- Drop any previous scratch collection; ignore the error if it doesn't exist yet --->
    <cftry>
        <cfcollection action="delete" collection="scratch">
        <cfcatch>
        </cfcatch>
    </cftry>
    <cfcollection
        action     = "create"
        collection = "scratch"
        path       = "#server.coldfusion.rootDir#\collections"
        engine     = "solr"
    >

     

    <!--- indexCollection.cfm --->

    <!--- Index every PDF in the directory this template lives in --->
    <cfindex
        action     = "refresh"
        collection = "scratch"
        key        = "#expandPath('.')#"
        type       = "path"
        extensions = ".pdf"
    >

     

    <!--- createPdf.cfm --->

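    <!--- file: output file name; text: search term written after the padding; size: approximate amount of padding text, in KB --->
    <cfparam name="URL.file">
    <cfparam name="URL.text">
    <cfparam name="URL.size" type="numeric">
    <cfdocument format="PDF" filename="#expandPath('./')##URL.file#" overwrite="true">
        <cfset sPadding = "padding">
        <cfset iSize = 0>
        <!--- Characters counted per pass: the word, a space, and a UUID --->
        <cfset iPaddingLen = len(sPadding) + 1 + len(createUuid())>
        <!--- Write padding until the running total passes URL.size KB --->
        <cfloop condition="true">
            <cfset sThisPadding = sPadding & " " & createUuid()>
            <cfoutput>#sThisPadding#</cfoutput>
            <cfset iSize += iPaddingLen>
            <cfif iSize GT URL.size * 1024>
                <cfbreak>
            </cfif>
        </cfloop>
        <!--- The search term goes last, so it sits at the very end of the document --->
        <cfoutput>#URL.text#</cfoutput>
    </cfdocument>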

     

    <!--- search.cfm --->

    <cfparam name="URL.search">
    <cfsearch collection="scratch" name="q" criteria="#URL.search#">
    <cfdump var="#q#">

     

    Save all of those into one directory, then run the following in order:

    createCollection.cfm

    createPdf.cfm?size=75&file=large1.pdf&text=locate

    createPdf.cfm?size=85&file=large2.pdf&text=locate

    indexCollection.cfm

    search.cfm?search=locate

     

    The PDFs created are only around 120-130 KB apiece, and are 34 and 39 pages respectively.  Neither is very big, in file size or in page count.

     

    I only get a match in large1.pdf.

    If I peg large2.pdf back to size=83, its page count drops to 38 or fewer, and it starts coming back in the search results too.

     

    I dunno if this is a limitation of the dev edition of CF, or if it's a fairly horrible bug...

     

    Were you running on a dev edition, or a licensed one?

     

    --

    Adam

     
  • Aug 19, 2011 12:50 AM   in reply to XIntelligence

    > I will file a bug report to Adobe

     

    Cool.  If you report back with the bug ref, I'll vote for it.

     

    --

    Adam

     
  • Aug 19, 2011 1:55 AM   in reply to XIntelligence

    Voted.

     

    Cheers for doing that.

     

    --

    Adam

     
  • Aug 19, 2011 9:31 AM   in reply to XIntelligence

    Do you need special permission to vote?  I just tried using the user name and password I have for the forum, and I can't get in.

    We plan to start a project soon that will put hundreds of PDFs online for our SOP system.  Several of them are over 50 pages, and we need them to be indexed.

     

    Thanks

     
  • Aug 19, 2011 9:54 AM   in reply to XIntelligence

    I'd vote if I could.  I get Invalid credentials when I try to use my Adobe login. 

     
  • Aug 19, 2011 3:24 PM   in reply to bloodbanker

    > I'd vote if I could.  I get Invalid credentials when I try to use my Adobe login.

     

    Create a new login?

     

    I have to concede I have no idea which login they want on the bugbase... I've got 2-4 logins for various parts of the Adobe/Macromedia site but they've all got the same credentials, so I dunno which is which... but I thought it was the one for the main Adobe site, so not the one for these forums (if that's what you were trying).

     

    There are plenty of issues in the bugbase that could do with community support, so I would always encourage people to do what they can to enable themselves to vote for them (there is no proof that Adobe pays attention to the votes, btw, but it can't hurt ;-).  I vote for a lot of issues, if only so I get an update email if they ever get around to doing something about them (I have received two emails this week about issues being resolved, so it does happen!).

     

    --

    Adam

     
  • Aug 22, 2011 1:51 AM   in reply to XIntelligence

    I've just received email notification that it's been fixed.  Sadly they never say when the fix will be available (or whether it'll be a hotfix, bundled in an updater or in the next release of CF).

     

    Still: at least they've dealt with it, and it demonstrates that putting stuff in the bug tracker is worthwhile!

     

    --

    Adam

     
  • Aug 22, 2011 4:42 AM   in reply to Adam Cameron.

    Aha.  Ray Camden has been on the case with this issue, and has turned up some interesting info: http://www.coldfusionjedi.com/index.cfm/2011/8/22/Indexing-PDFs-with-Solr-Read-this-tip

     

    Apparently this is by design.
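
    One possible reading of "by design", offered as an assumption rather than something this thread confirms: Lucene and Solr of this era default maxFieldLength to 10,000, meaning only the first 10,000 tokens of a field are indexed. The sketch below checks how the repro's numbers would sit against that cap. estimateTokens is a hypothetical helper (not part of the repro files), and 5 tokens per chunk is a guess (the word "padding" plus four hyphen-separated UUID groups); the real count depends on how the collection's analyzer splits the text, and on whether ColdFusion's bundled solrconfig.xml keeps the default.

    ```cfm
    <!--- tokenEstimate.cfm (hypothetical, not one of the original repro files) --->
    <cfscript>
    // Lucene's default cap on indexed tokens per field. Whether ColdFusion's
    // bundled solrconfig.xml uses this value is an assumption.
    maxFieldLength = 10000;

    // Mirrors the loop in createPdf.cfm: every pass counts 43 characters
    // ("padding" + a space + a 35-character createUuid() value), and the
    // loop stops once the running total exceeds sizeKb * 1024.
    function estimateTokens(sizeKb, tokensPerChunk) {
        var chunkLen = len("padding") + 1 + len(createUuid());
        var chunks   = int((arguments.sizeKb * 1024) / chunkLen) + 1;
        // + 1 for the search term written after the padding
        return chunks * arguments.tokensPerChunk + 1;
    }

    sizes = [75, 83, 85];
    for (i = 1; i <= arrayLen(sizes); i++) {
        est = estimateTokens(sizes[i], 5); // 5 tokens per chunk is a guess
        writeOutput("size=" & sizes[i] & ": ~" & est & " tokens, "
            & (est > maxFieldLength ? "over" : "under") & " the cap<br>");
    }
    // size=75: ~8936 tokens, under the cap
    // size=83: ~9886 tokens, under the cap
    // size=85: ~10126 tokens, over the cap
    </cfscript>
    ```

    Under those guesses, the search term in the size=83 file falls just inside the first 10,000 tokens and the one in the size=85 file falls just outside, which would match the results reported earlier in the thread. Treat that as suggestive only: a different tokens-per-chunk figure moves the threshold.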

     

    --
    Adam

     
  • Aug 24, 2011 12:28 AM   in reply to XIntelligence

    Cool.  It was all Ray & the dudes from Adobe that came up with the info here, not me: I just echoed it back.

     

    Anyway, good that things are working now.  And it's good to know this snippet of info for "next time".

     

    --

    Adam

     