A Step-by-Step Guide to Indexing CQ with Nutch

In my previous post, Integrating Apache Solr with Adobe CQ / AEM, I talked about the various Solr / CQ integration approaches. In this post, we will index the Geometrixx Media site using Apache Nutch. The integration described here is meant for those with little or no experience with Apache Solr and/or Apache Nutch. Since I am a Mac user, all steps assume a UNIX environment.

The high-level deployment and configuration process is as follows: 

  1. Install and start a vanilla version of CQ 5.6. This should be a CQ Publish instance running on localhost on port 4503.
  2. Download and start the quick start distribution of Apache Solr 4.5. 
  3. Download and configure Apache Nutch 1.7.
  4. Initiate a Nutch crawl of the Geometrixx Media site. 

CQ 5.6 Publish

Begin by starting a CQ 5.6 publish instance. This article assumes that you are running CQ locally and that it is listening on port 4503. It further assumes that you have the default sample content installed (i.e. Geometrixx). We are using a publish instance instead of an author instance as it simplifies the Nutch crawler configuration if we do not need to worry about authentication.

Apache Solr 4.5

Next, download Apache Solr 4.5.0 and unpack the distribution to a temporary location. We will use ~/nutch-solr-example as the temporary working directory.

$ mkdir ~/nutch-solr-example
$ cd ~/nutch-solr-example
$ wget http://archive.apache.org/dist/lucene/solr/4.5.0/solr-4.5.0.tgz
$ tar -xzf solr-4.5.0.tgz

We will come back to the Solr configuration as part of the Nutch installation and configuration.

Apache Nutch Installation & Solr Configuration

Download the binary version of Apache Nutch and save it to the same temporary location (~/nutch-solr-example). Then, unpack it.

$ wget http://apache.cs.utah.edu/nutch/1.7/apache-nutch-1.7-bin.tar.gz
$ tar -xzf apache-nutch-1.7-bin.tar.gz

Nutch comes with a sample Solr schema suitable for indexing documents fetched by Nutch. Copy it over the schema of the example Solr core to configure Solr. Note: We will use the example Solr core, collection1, as a starting point rather than creating a core from scratch.

$ cp apache-nutch-1.7/conf/schema-solr4.xml solr-4.5.0/example/solr/collection1/conf/schema.xml

Then add the _version_ field within the <fields> element in solr-4.5.0/example/solr/collection1/conf/schema.xml.

<field name="_version_" type="long" indexed="true" stored="true"/>

Also change the schema name in the root XML element from nutch to geometrixx-media. Once these changes have been made, save schema.xml.
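If you prefer not to edit schema.xml by hand, the two changes above can be scripted. The following is only a sketch: patch_schema is a hypothetical helper name, and it assumes that the root element carries name="nutch" and that the opening <fields> tag sits on its own line, as described above. It skips the file if a _version_ field is already present.

```shell
#!/bin/sh
# patch_schema FILE (hypothetical helper): rename the schema to
# geometrixx-media and add the _version_ field after the <fields> tag.
patch_schema() {
    schema=$1
    # Do nothing if the field was already added.
    if grep -q 'name="_version_"' "$schema"; then
        echo "_version_ already present in $schema" >&2
        return 0
    fi
    awk '
        # Rename the schema in the root element.
        /<schema / { sub(/name="nutch"/, "name=\"geometrixx-media\"") }
        { print }
        # Emit the new field right after the opening fields tag.
        /<fields>/ { print "    <field name=\"_version_\" type=\"long\" indexed=\"true\" stored=\"true\"/>" }
    ' "$schema" > "${schema}.new" && mv "${schema}.new" "$schema"
}

# Run from ~/nutch-solr-example; skipped if the file is not there.
schema_file=solr-4.5.0/example/solr/collection1/conf/schema.xml
if [ -f "$schema_file" ]; then
    patch_schema "$schema_file"
fi
```

It is still worth opening schema.xml afterwards to confirm that both changes landed where you expect.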

Next, rename the collection1 Solr core to geometrixx-media:

$ mv solr-4.5.0/example/solr/collection1 solr-4.5.0/example/solr/geometrixx-media

Now, edit solr-4.5.0/example/solr/geometrixx-media/core.properties and replace name=collection1 with name=geometrixx-media.
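The core.properties change can also be made from the command line. This is a minimal sketch, assuming the file contains a line that reads exactly name=collection1; set_core_name is a hypothetical helper name. It writes to a temporary file and moves it into place rather than using sed -i, whose syntax differs between Mac and Linux.

```shell
#!/bin/sh
# set_core_name FILE OLD NEW (hypothetical helper): replace the line
# name=OLD with name=NEW in a core.properties file.
set_core_name() {
    props_file=$1
    old_name=$2
    new_name=$3
    sed "s/^name=${old_name}\$/name=${new_name}/" "$props_file" \
        > "${props_file}.new" && mv "${props_file}.new" "$props_file"
}

# Run from ~/nutch-solr-example after renaming the core directory;
# skipped if the file is not there.
props=solr-4.5.0/example/solr/geometrixx-media/core.properties
if [ -f "$props" ]; then
    set_core_name "$props" collection1 geometrixx-media
    cat "$props"
fi
```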

At this point, we have a single Solr core named geometrixx-media that is suitable for indexing documents crawled by Nutch. 

Lastly, move into the quick start directory and start Solr.

$ cd solr-4.5.0/example
$ java -jar start.jar

Open a web browser and visit http://localhost:8983/solr/ to verify that the Solr installation was successful.  

Nutch Configuration

Now that we have Solr running, we need to configure Nutch to crawl the Geometrixx Media site. 

Begin by opening another terminal window and moving into our temporary directory. 

$ cd ~/nutch-solr-example/

Edit apache-nutch-1.7/conf/nutch-site.xml and define a user agent for our crawler; Nutch will not fetch pages until http.agent.name is set. Add the following to the <configuration> XML element.

<property>
    <name>http.agent.name</name>
    <value>Geometrixx Media Crawler</value>
</property>

Next, we need to restrict the scope of our crawl to only the Geometrixx Media site. Edit apache-nutch-1.7/conf/regex-urlfilter.txt and replace the last two lines:

# accept anything else
+.

with: 

# Accept only the Geometrixx Media site
+.*/content/geometrixx-media/.*
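Before crawling, it can help to sanity-check which URLs the new accept rule lets through. The sketch below is only an approximation: url_in_scope is a hypothetical helper that tests a URL against the same pattern with grep -E, which uses POSIX extended regular expressions rather than the Java regular expressions Nutch uses. For a pattern this simple the two should agree. It also ignores the earlier rules in regex-urlfilter.txt, which Nutch still applies first.

```shell
#!/bin/sh
# url_in_scope URL (hypothetical helper): succeed if URL matches the
# accept rule +.*/content/geometrixx-media/.* from regex-urlfilter.txt.
# grep -E only approximates how Nutch evaluates the rule.
url_in_scope() {
    printf '%s\n' "$1" | grep -Eq '.*/content/geometrixx-media/.*'
}

# Try the rule against one URL inside the site and one outside it.
for url in \
    "http://localhost:4503/content/geometrixx-media/en.html" \
    "http://localhost:4503/content/geometrixx/en.html"
do
    if url_in_scope "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

The first URL is reported as accepted and the second as rejected, since only the first contains /content/geometrixx-media/.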

We also need to define a seed file that will act as the entry point for our crawler. Nutch uses a simple text file for its seeds, with one URL per line.

$ mkdir apache-nutch-1.7/urls
$ echo "http://localhost:4503/content/geometrixx-media/en.html" > apache-nutch-1.7/urls/seed.txt

For convenience, let's create a simple wrapper script for crawling and indexing the Geometrixx Media site. Create a script called apache-nutch-1.7/index-geometrixx.sh with the following contents:

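#!/bin/sh
# Crawl the seed URLs and index the fetched pages into Solr.
# Run from the apache-nutch-1.7 directory: bin/nutch and urls are relative paths.
HOST=localhost
PORT=8983
CORE=geometrixx-media
# -depth: link depth to follow from the seeds; -topN: maximum pages fetched per level.
bin/nutch crawl urls -dir crawl -depth 4 -topN 15 -solr "http://${HOST}:${PORT}/solr/${CORE}"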

Make the script executable, move into the apache-nutch-1.7 directory and crawl the site. 

$ chmod 750 apache-nutch-1.7/index-geometrixx.sh
$ cd apache-nutch-1.7
$ ./index-geometrixx.sh

Hopefully, congratulations are in order (unless I fat-fingered something in my instructions, or you did). After a few minutes of crawling, the documents should be in your Solr index.

You can verify this by navigating to http://localhost:8983/solr/#/geometrixx-media and selecting the Query tab on the left. Then, click the Execute Query button.
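The same check can be run from the terminal. This is a rough sketch, assuming the core is served by the example /select request handler and that curl is installed; solr_count_url and count_docs are hypothetical helper names. The query matches every document (q=*:*) but asks for zero rows, so the response carries only the total count.

```shell
#!/bin/sh
# solr_count_url HOST PORT CORE (hypothetical helper): print a select
# URL that matches all documents but returns no rows.
solr_count_url() {
    printf 'http://%s:%s/solr/%s/select?q=*:*&rows=0&wt=json\n' "$1" "$2" "$3"
}

# count_docs (hypothetical helper): fetch the URL with curl.
# Requires Solr to be running on localhost:8983.
count_docs() {
    curl -s "$(solr_count_url localhost 8983 geometrixx-media)"
}
```

Calling count_docs after the crawl should print a JSON response whose numFound value is the number of documents in the index.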

I hope that this (very) basic introduction to indexing CQ with Nutch is enough to get you started with your Solr integration. Again, take a look at SolrJ and ajax-solr for building your front-end.