CQ search and Apache Tika

Aug 10, 2012 3:56 AM

Tags: #cq5.4

I have a requirement to let website users search within the content of PDF and Word documents. The site is planned to be built on CQ5.5. I believe CQ integrates with Apache Tika to achieve this (see the full-text extraction section at http://dev.day.com/docs/en/crx/current/developing/searching_in_crx.html).

 

I am polling this group to check whether anyone has used this feature in another project. How good or bad is it? Any lessons learnt?

 

Also, has anyone successfully used any of the other features that the native Lucene search in CQ provides? I am especially interested in spell check, stemming, synonym matching, and similarity matching.

 

Thanks in advance for all your responses.

 
Replies
  • Sep 13, 2012 4:25 AM, in reply to Anoop_Kumar

    Hi,

     

    I have the same requirement. How do we enable full-text extraction for DAM search in CQ 5.5? By default it does not search the content of PDFs or other supported documents; it searches only the asset metadata.

     

    Thanks,

    Deepikaa

     
  • Sep 14, 2012 6:16 AM, in reply to Deepikaa Nagesh

    @Deepikaa: Please install the 5.5 update1 package and then reindex.

     

    @Anoop

     

    -  Stemming: the Porter stemmer is the default, rather than a dictionary-backed stemmer. With the Porter stemmer, both "Country" and "Countries" stem to "countri". If that does not suit you, you can write your own Analyzer implementation; another workaround would be to use QueryParser for search results.

    -  Spell check: the spellchecker dictionary is built from the words contained in your site's content. That is probably the best-suited dictionary for your site and should handle cases where users misspell a product name. In other words, you should not need to change the dictionary; if you did want to, you would have to implement custom code.

    -  Synonyms: IIRC, to use the synonym lookup you prefix the search term with the tilde (~) character; the mechanism itself is configured in workspace.xml under the SearchIndex element.

    -  The new indexing rules are described at http://wiki.apache.org/jackrabbit/IndexingConfiguration
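To see why "Country" and "Countries" end up as the same term, here is a minimal sketch of just two rules from the Porter algorithm (step 1a, plural stripping, and step 1c, trailing y to i). It is not the stemmer that ships with Lucene or CQ, it skips all the other Porter steps, and the class and method names are made up for illustration.

```java
import java.util.Locale;

// Illustrative subset of the Porter algorithm (steps 1a and 1c only).
// Hypothetical class; not the Lucene/CQ implementation.
class PorterSketch {

    // Step 1a: strip plural suffixes.
    static String step1a(String w) {
        if (w.endsWith("sses")) return w.substring(0, w.length() - 2); // sses -> ss
        if (w.endsWith("ies"))  return w.substring(0, w.length() - 2); // ies  -> i
        if (w.endsWith("ss"))   return w;                              // ss   -> ss
        if (w.endsWith("s"))    return w.substring(0, w.length() - 1); // s    -> (removed)
        return w;
    }

    // Porter's vowel definition: a, e, i, o, u, or y preceded by a consonant.
    static boolean isVowel(String w, int i) {
        char c = w.charAt(i);
        if ("aeiou".indexOf(c) >= 0) return true;
        return c == 'y' && i > 0 && !isVowel(w, i - 1);
    }

    static boolean containsVowel(String stem) {
        for (int i = 0; i < stem.length(); i++) {
            if (isVowel(stem, i)) return true;
        }
        return false;
    }

    // Step 1c: a trailing y becomes i when the rest of the word contains a vowel.
    static String step1c(String w) {
        if (w.endsWith("y") && containsVowel(w.substring(0, w.length() - 1))) {
            return w.substring(0, w.length() - 1) + "i";
        }
        return w;
    }

    static String stem(String word) {
        return step1c(step1a(word.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        System.out.println(stem("Country"));   // countri
        System.out.println(stem("Countries")); // countri
    }
}
```

Because both forms are indexed as "countri", a search for either one matches documents containing the other, which is usually what you want from stemming.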

     

    Some examples are at http://dev.day.com/docs/en/crx/current/developing/searching_in_crx.html
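For the spell check and synonym features, the queries look roughly like the strings built below. This is only a sketch: the helper class and its method names are hypothetical, the XPath forms are the ones I remember from the Jackrabbit search documentation, and both features only work if the corresponding provider is configured on the SearchIndex element. Executing the strings needs a JCR session and QueryManager, which are not shown here.

```java
// Hypothetical helper that only builds Jackrabbit XPath query strings.
// Running them requires a JCR QueryManager (not shown) and a SearchIndex
// configured with a spell checker / synonym provider.
final class JackrabbitQuerySketch {

    // Wrap a term in single quotes, doubling embedded quotes as XPath
    // string literals require. Assumption: no further escaping of the
    // full-text expression itself is attempted here.
    static String quote(String term) {
        return "'" + term.replace("'", "''") + "'";
    }

    // Spell check: the result has a single row whose rep:spellcheck()
    // column holds the suggestion, if there is one.
    static String spellcheck(String text) {
        return "/jcr:root[rep:spellcheck(" + quote(text) + ")]/(rep:spellcheck())";
    }

    // Synonym search: the tilde prefix asks for synonyms of the term too.
    static String synonym(String term) {
        return "//element(*, nt:base)[jcr:contains(., " + quote("~" + term) + ")]";
    }

    public static void main(String[] args) {
        System.out.println(spellcheck("jackrabit"));
        System.out.println(synonym("quick"));
    }
}
```

With a session at hand, each string would be passed to the QueryManager's createQuery method with the XPath query language constant and then executed.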

     