I just tried this on CQ 5.5 update 2.1. Content of uploaded PDFs shows up in my search results (full word search). I'm not sure which CQ version you are running.
CQ uses Apache Tika to extract text from PDFs:
Thanks, Jayan. Can you give me a specific example that you get to work, say, on the Geometrixx site that I can duplicate?
For example, these do not return results for me:
That text, "Product Sample Datasheet Blue", is lifted from page 1 of the PDF called GeoCube Datasheet.
I tested this on a fresh quickstart install, so there's nothing in the config that has been changed from default.
These quickstart installs appear to be version 5.5.0. I did not install the current 2.1 service pack on them as we don't have the service pack running on our production sites either.
I'm running into similar problems. Did you get any answer or did you find anything?
Any help would be greatly appreciated.
We never got a definitive answer, but we have suspicions that it was due to having upgraded from CQ 5.4 to 5.5. It seems that the libraries used for the indexing changed during that version upgrade. When I took our application and installed it on a pristine 5.5 installation, the PDF indexing worked. It was only our existing installations (two staging, two production) that did not work. So at least we know it's not our application or CQ in general.
Sadly, we don't have the resources to rebuild our servers, and we also ran into a separate problem that would prevent us from using the indexing anyway. It seems that there is no way to prevent cross-site results if you have multiple sites on the same CQ install and they each have their own sections in the DAM where the PDF files are stored. It would take some custom code to get around the issue, it seems.
For example, you have site A and site B.
/content/a <- Main site A content for pages
/content/dam/a <- Site A's files in the DAM
There is no stock way, that I am aware of, to keep searches on site A from turning up PDF results from /content/dam/b (for site B), and vice versa. That's enough to keep us from using it - a total deal breaker.
Michael - For the former issue, you should contact DayCare. For the latter issue, you can easily search just within a path, e.g. in XPath: /jcr:root/content/dam/b//element(*, dam:Asset)[jcr:contains(., "term")].
For the former issue we did contact DayCare, and that is the best we got out of them. A theory.
As I recall (it's been some months since I looked at the code), the stock search component functionality in CQ5 did not support multiple paths being specified at once. It would be easy enough to offer a search that worked in the proper DAM section, and one that worked in the proper site content folder, but not both.
If you are saying that the search component could accept an XPath that included both /content/a and /content/dam/a in one, then that would do the trick. I was not aware that the component could accept an XPath, since the developers of the site had it using a plain string like "/content/a". Specifying "/content/a|/content/dam/a" doesn't seem to work in the component. I'm not a full-time CQ dev by any means and I haven't been working on that project for a few months, so I'd be glad to know I missed something....
Michael - I believe the foundation search component can only have a single search path. This is, however, a limitation of this component, not the search index.
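For what it's worth, here is a minimal sketch of how a custom component might get around the single-path limit: build one path-scoped XPath query per root (site pages and the site's DAM folder) and run them separately, merging the hits afterwards. The class and method names are hypothetical, the cq:Page node type for page content is an assumption (whether it matches full text depends on the index aggregation configuration), and the escaping only covers apostrophes in the term, not the rest of the jcr:contains search syntax or ISO 9075 encoding of unusual path names. Each string would then be passed to the JCR QueryManager as an XPath query.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CQ: builds path-scoped full-text XPath queries.
public class ScopedSearchQueries {

    // Escape a term for a single-quoted XPath string literal
    // (an apostrophe is written as two apostrophes).
    public static String escapeTerm(String term) {
        return term.replace("'", "''");
    }

    // Build one full-text query restricted to a single repository path,
    // following the pattern quoted earlier in the thread.
    public static String buildQuery(String rootPath, String nodeType, String term) {
        return "/jcr:root" + rootPath + "//element(*, " + nodeType + ")"
                + "[jcr:contains(., '" + escapeTerm(term) + "')]";
    }

    // One query per root of a site: pages under /content/<site>,
    // assets under /content/dam/<site>. Node types are assumptions.
    public static List<String> buildSiteQueries(String site, String term) {
        List<String> queries = new ArrayList<String>();
        queries.add(buildQuery("/content/" + site, "cq:Page", term));
        queries.add(buildQuery("/content/dam/" + site, "dam:Asset", term));
        return queries;
    }

    public static void main(String[] args) {
        // Print the two queries a search on site "a" would run.
        for (String query : buildSiteQueries("a", "datasheet")) {
            System.out.println(query);
        }
    }
}
```

Running two queries and merging the results sidesteps the question of whether the repository's XPath dialect accepts a union of paths in a single statement, at the cost of having to combine and re-sort the hits yourself.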
Yeah, I buy that. But for our purposes it means that we couldn't make use of the indexing until we'd write a new search component. That's not likely to happen any time soon. It's just a shame that the out of the box functionality is limited in that way, since the search component otherwise works well enough.
Thanks for the quick answer.
On my end, seeing that I couldn't get anything to work in my JSP, I did basic tests in the Asset Manager interface (e.g., a full-text search for a single word included in the PDFs that come with the Geometrixx sample) with a pristine install of CQ. It worked in CQ 5.4 but not in CQ 5.5, which seems to be slightly different from what you saw.
Justin, do you have any clue on that one?
Laurent - I have seen reports from time to time about full text indexing not working consistently with the provided sample documents. To be clear, I've never had problems in this area with 5.5; I just have heard about other people having problems. As I said to Michael, I would suggest filing a DayCare ticket. I understand that they have some simple steps to take to fix this problem (assuming it is the same issue as other people have run into).