I have a requirement to let website users search within the content of PDF and Word documents. The site is planned to be built on CQ 5.5. I believe CQ integrates with Apache Tika to achieve this (look for full-text extraction at http://dev.day.com/docs/en/crx/current/developing/searching_in_crx.html).
I am polling this group to check whether we have used this feature on any other project. How good or bad is it? Any lessons learnt from it?
Also, have we successfully used any of the other features that the native Lucene search in CQ provides? I am especially interested in spell check, stemming, synonym matching, and similarity matching.
Thanks in advance for all your responses.
I have the same requirement too. How do we enable full-text extraction in CQ 5.5 DAM search? By default it does not search the content of PDFs or any other supported document type; it searches only the metadata of the asset.
@Deepikaa: Please install the 5.5 update1 package and then reindex.
- Stemming: the default is the Porter stemmer rather than a dictionary-backed stemmer. With the Porter stemmer, both "Country" and "Countries" stem to "countri". If that does not suit you, you can write your own Analyzer implementation; another workaround would be to use QueryParser for search results.
- Spell check: the spellchecker dictionary is built from the words contained in your site's content. This might be the optimal setup, since it should handle cases where users misspell your product names. In other words, you should not need to change the dictionary, and if you did want to, you would have to implement custom code to do that.
- Synonyms: IIRC, the synonym lookup is triggered by prefixing the search term with the tilde (~) character, and the synonym mechanism itself is configured in workspace.xml under the SearchIndex element.
- Indexing rules: the new indexing configuration is described at http://wiki.apache.org/jackrabbit/IndexingConfiguration
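
To illustrate the stemming point above, here is a minimal sketch of why "Country" and "Countries" end up as the same token. It implements only steps 1a and 1c of the Porter algorithm (plural removal and the terminal y to i rule), which is enough for this pair of words; it is not the full stemmer and not the code CQ or Lucene actually runs. The class name PorterPluralSketch is made up for this example.

```java
import java.util.Locale;

// Hypothetical illustration class: covers Porter steps 1a and 1c only,
// NOT the complete Porter algorithm and NOT Lucene's implementation.
public final class PorterPluralSketch {
    private PorterPluralSketch() {}

    // Porter's vowel definition: a, e, i, o, u, plus y when preceded by a consonant.
    private static boolean isVowel(String s, int i) {
        char c = s.charAt(i);
        if ("aeiou".indexOf(c) >= 0) {
            return true;
        }
        return c == 'y' && i > 0 && !isVowel(s, i - 1);
    }

    private static boolean containsVowel(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (isVowel(s, i)) {
                return true;
            }
        }
        return false;
    }

    // Step 1a: SSES -> SS, IES -> I, SS -> SS, S -> (removed).
    static String step1a(String w) {
        if (w.endsWith("sses") || w.endsWith("ies")) {
            return w.substring(0, w.length() - 2);
        }
        if (w.endsWith("ss")) {
            return w;
        }
        if (w.endsWith("s")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    // Step 1c: a terminal y becomes i when the part before it contains a vowel.
    static String step1c(String w) {
        if (w.endsWith("y") && containsVowel(w.substring(0, w.length() - 1))) {
            return w.substring(0, w.length() - 1) + "i";
        }
        return w;
    }

    // Lower-cases the word, then applies the two steps in order.
    public static String stem(String word) {
        return step1c(step1a(word.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        for (String w : new String[] {"Country", "Countries"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

Running main prints "Country -> countri" and "Countries -> countri", so a search for either form matches documents containing the other. The flip side is that the indexed token is not a dictionary word, which is the usual reason for swapping in a custom Analyzer.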