CQ 5

A Step-by-Step Guide to Indexing CQ with Nutch

In my previous post, Integrating Apache Solr with Adobe CQ / AEM, I talked about the various Solr / CQ integration approaches. In this post, we will index the Geometrixx Media site using Apache Nutch. The integration described here is meant for those with little or no experience with Apache Solr and/or Apache Nutch. Since I am a Mac user all steps assume a UNIX environment. 

Read More

Integrating Apache Solr with Adobe CQ / AEM

Recently, I have been noticing a bit of interest by the CQ community regarding CQ / Solr integration. However, as most people have pointed out, there isn't a clear path detailed anywhere. Given the interest, I will be posting regularly on the subject. This first post will stay relatively high-level and discuss the possible integration points.

There are really two areas that should be considered when integrating Solr with CQ: indexing content and searching content. For the most part, you can treat these as two independent efforts.

Indexing CQ Content

Over the past 6 months I have experimented with multiple approaches to indexing CQ content in Solr. Each approach has its respective strengths and weaknesses.

  1. Crawl your site using an external crawler.
  2. Create one or more CQ servlets to serialize your content into a Solr JSON or Solr XML Update format.
  3. Create an observer within CQ to listen for page modifications and trigger indexing operations to Solr.

Using an External Crawler

Using an external crawler such as Nutch or Heritrix is perhaps the simplest way to start indexing your CQ content; however, it does have its drawbacks. Using a crawler involves working with unstructured content in the form of mainly HTML documents. While most crawlers do a decent job extracting the content body, title, url, description, keywords and other metadata, you typically need to define a strategy for extracting other useful data points to drive functionality such as faceting. Extracting this information can be achieved in several ways: use an external document processing framework (recommended), use Solr's Update Request Processor (not recommended), use Solr's tokenizers for basic extraction, etc.

The other drawback with this approach is that it uses a pull approach to indexing content. There are ways around this; however, using a crawler typically means that you will be sacrificing real-time indexing.

CQ Servlets & Solr Update JSON/XML

Another possible approach is to create one or more CQ servlets that produces a dump of your CQ content using Solr's Update JSON or Update XML format. The advantage here is that you are working with structured content and have full access to CQ's APIs for querying JCR content. An external cron job can then be used to fetch this page using curl and post it to Solr.

A variation of this approach is to use a selector to render a page in either the Solr JSON or XML update format. 

CQ Observer

Using a CQ observer provides the tightest integration with Solr and as such provides real-time indexing capabilities. Like the CQ Servlet approach, it simplifies content extraction since you are working with structured data. There are several methods for implementing an observer. Refer to Event Handling in CQ by Nicolas Peltier. My personal preference is listening to Page Events and Replication Events using Sling Eventing. In this approach once you receive an event, such as page modification, you can use the SolrJ API to update the Solr index.

Searching CQ

Once you have your CQ content indexed in Solr you will need a search interface. While there are several approaches for building search experiences against Solr, the most popular approach is to use Solr's Java API, SolrJ. For client-side integration, ajax-solr is a great choice.

Lastly, I need to shamelessly plug an upcoming integration for CQ and Solr by headwire.com, Inc, aptly named CQ Solr Search. This integration offers support for building search interfaces using search components built on ajax-solr as well as a configurable CQ observer for real-time Solr indexing. We will be introducing the first public implementation on CQ Blueprints. Our intent is to provide one place for searching all CQ/Sling/JCR content on the web.

Upcoming

Based on the community feedback, please stay tuned for the following. 

  1. CQ Solr Search by headwire.com, Inc. - (Not yet available)
  2. A Step-by-Step Guide to Indexing CQ with Nutch (Coming soon)
  3. A Steb-by-Step Guide to Indexing CQ with CQ Servlets (Coming soon)
  4. A Step-by-Step Guide to Indexing CQ using an Observer (Coming soon)