A Step-by-Step Guide to Indexing CQ with Nutch

In my previous post, Integrating Apache Solr with Adobe CQ / AEM, I talked about the various Solr / CQ integration approaches. In this post, we will index the Geometrixx Media site using Apache Nutch. The integration described here is meant for those with little or no experience with Apache Solr and/or Apache Nutch. Since I am a Mac user, all steps assume a UNIX environment.

Integrating Apache Solr with Adobe CQ / AEM

Recently, I have noticed a fair amount of interest from the CQ community in CQ / Solr integration. However, as most people have pointed out, there isn't a clear path detailed anywhere. Given the interest, I will be posting regularly on the subject. This first post stays relatively high-level and discusses the possible integration points.

There are really two areas that should be considered when integrating Solr with CQ: indexing content and searching content. For the most part, you can treat these as two independent efforts.

Indexing CQ Content

Over the past 6 months I have experimented with multiple approaches to indexing CQ content in Solr. Each approach has its respective strengths and weaknesses.

  1. Crawl your site using an external crawler.
  2. Create one or more CQ servlets to serialize your content into a Solr JSON or Solr XML Update format.
  3. Create an observer within CQ to listen for page modifications and trigger indexing operations to Solr.

Using an External Crawler

Using an external crawler such as Nutch or Heritrix is perhaps the simplest way to start indexing your CQ content; however, it does have its drawbacks. Using a crawler involves working with unstructured content in the form of mainly HTML documents. While most crawlers do a decent job extracting the content body, title, url, description, keywords and other metadata, you typically need to define a strategy for extracting other useful data points to drive functionality such as faceting. Extracting this information can be achieved in several ways: use an external document processing framework (recommended), use Solr's Update Request Processor (not recommended), use Solr's tokenizers for basic extraction, etc.

The other drawback with this approach is that it uses a pull approach to indexing content. There are ways around this; however, using a crawler typically means that you will be sacrificing real-time indexing.

CQ Servlets & Solr Update JSON/XML

Another possible approach is to create one or more CQ servlets that produce a dump of your CQ content using Solr's Update JSON or Update XML format. The advantage here is that you are working with structured content and have full access to CQ's APIs for querying JCR content. An external cron job can then fetch this dump using curl and post it to Solr.

A variation of this approach is to use a selector to render a page in either the Solr JSON or XML update format. 
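
To make the update format concrete, here is a minimal sketch in plain Java of the kind of Solr XML update message (an add element wrapping a doc with named fields) that such a servlet or selector might emit. The class name, the field names (id, title) and the sample page path are hypothetical. A real servlet would read its values from the JCR, and the field names would need to match your Solr schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: serializes one page's fields into Solr's XML update format.
class SolrUpdateXml {

    // Escape the XML special characters so field values cannot break the message.
    static String escape(String value) {
        StringBuilder out = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Build an <add><doc>...</doc></add> message from field name/value pairs.
    static String toAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
               .append(escape(field.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        // Assumed fields: the page path doubles as the unique key.
        Map<String, String> page = new LinkedHashMap<>();
        page.put("id", "/content/example/en/home");
        page.put("title", "Events & Shows");
        System.out.println(toAddXml(page));
    }
}
```

The resulting string is what the cron job described above would fetch and post to Solr.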

CQ Observer

Using a CQ observer provides the tightest integration with Solr and as such provides real-time indexing capabilities. Like the CQ Servlet approach, it simplifies content extraction since you are working with structured data. There are several methods for implementing an observer. Refer to Event Handling in CQ by Nicolas Peltier. My personal preference is listening to Page Events and Replication Events using Sling Eventing. In this approach once you receive an event, such as page modification, you can use the SolrJ API to update the Solr index.

Searching CQ

Once you have your CQ content indexed in Solr you will need a search interface. While there are several approaches for building search experiences against Solr, the most popular approach is to use Solr's Java API, SolrJ. For client-side integration, ajax-solr is a great choice.

Lastly, I need to shamelessly plug an upcoming integration for CQ and Solr by headwire.com, Inc, aptly named CQ Solr Search. This integration offers support for building search interfaces using search components built on ajax-solr as well as a configurable CQ observer for real-time Solr indexing. We will be introducing the first public implementation on CQ Blueprints. Our intent is to provide one place for searching all CQ/Sling/JCR content on the web.

Upcoming

Based on the community feedback, please stay tuned for the following. 

  1. CQ Solr Search by headwire.com, Inc. - (Not yet available)
  2. A Step-by-Step Guide to Indexing CQ with Nutch (Coming soon)
  3. A Step-by-Step Guide to Indexing CQ with CQ Servlets (Coming soon)
  4. A Step-by-Step Guide to Indexing CQ using an Observer (Coming soon)

Deploying the FAST ESP Search API to CQ 5.5

This post is dedicated to any OSGi developer who has endured the pain of wrapping a third-party JAR in order to deploy it to an OSGi container.

In this post we will deploy the FAST ESP Java Search API to CQ 5. Since Microsoft does not provide an OSGi bundle for this API, we will create our own using the technique described on the CQ Blueprints post, Deploying 3rd Party Libraries.

The high-level approach is as follows:

  1. Download the FAST ESP Java Search API (version 5.3.0.6) from Microsoft Connect and upload it to your 3rd party Nexus repository. I assume that the readers of this post are familiar with Nexus and have their own repository.
  2. Create a Maven project to create the wrapped version of the API.
  3. Deploy the wrapped version of the API to your Nexus repository.
  4. Deploy the wrapped version of the API to CQ via the Felix console.
  5. Add the wrapped version of the API as a dependency to your Maven project.
  6. Update your CQ instance to allow sun.io to be exported as part of the Felix system bundle from the framework classloader.

Adding a 3rd Party JAR (esp-searchapi.jar) to Nexus

It is recommended that you add a proxy repository to http://repository.opencastproject.org/nexus/content/groups/public/ as this repository has the Xalan and Xerces artifacts used by this article.

  1. Log in as the admin user to your Nexus repository (e.g., http://localhost:8081/nexus/).
  2. Select Repositories and click 3rd party repository.
  3. Click the Artifact Upload tab and enter the following information:

    GAV Definition: GAV Parameters
    Group: no.fast
    Artifact: esp-searchapi
    Version: 5.3.0.6
    Packaging: jar

  4. Click the Select Artifact(s) to Upload… button and browse to the location of the FAST ESP Java Search API (i.e., esp-searchapi.jar).
  5. Once selected, click the Add Artifact button followed by the Upload Artifact(s) button.
  6. If successful, you should now have a vanilla version of the FAST ESP Java Search API that can be included as a dependency by Maven. This dependency will be used in the next step.

<dependency>
  <groupId>no.fast</groupId>
  <artifactId>esp-searchapi</artifactId>
  <version>5.3.0.6</version>
</dependency>

Create a Maven Project to Build the Wrapped JAR

Create the following POM. Please note: the dependencies listed in the POM below were defined by trial and error. I had many unsuccessful deployments to Apache Felix with failed dependencies. In the end, the list of embedded dependencies included Xalan, Xerces and Log4j. Most of the remaining dependencies, such as HttpClient and the javax.* packages, were satisfied by Felix. In fact, the only dependency that was not satisfied was the sun.io package. I solved this by allowing Felix to export and load the sun.io packages from the framework class loader.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

	<modelVersion>4.0.0</modelVersion>

	<groupId>no.fast</groupId>
	<artifactId>esp-search-api-wrapped</artifactId>
	<version>5.3.0.6</version>
	<packaging>bundle</packaging>

	<name>FAST ESP Search API</name>
	<description>An OSGi version of FAST ESP Search API</description>

	<properties>
		<esp-searchapi.version>5.3.0.6</esp-searchapi.version>
	</properties>

	<dependencies>
		<dependency>
			<groupId>org.apache.xalan</groupId>
			<artifactId>com.springsource.org.apache.xalan</artifactId>
			<version>2.7.1</version>
		</dependency>
		<dependency>
			<groupId>org.apache.xerces</groupId>
			<artifactId>com.springsource.org.apache.xerces</artifactId>
			<version>2.9.1</version>
		</dependency>
		<dependency>
			<groupId>log4j</groupId>
			<artifactId>log4j</artifactId>
			<version>1.2.15</version>
		</dependency>
		<dependency>
			<groupId>no.fast</groupId>
			<artifactId>esp-searchapi</artifactId>
			<version>5.3.0.6</version>
		</dependency>
	</dependencies>

	<build>
		<pluginManagement>
			<plugins>
				<plugin>
					<groupId>org.apache.felix</groupId>
					<artifactId>maven-bundle-plugin</artifactId>
					<version>2.3.5</version>
					<extensions>true</extensions>
				</plugin>
			</plugins>
		</pluginManagement>
		<plugins>
			<plugin>
				<groupId>org.apache.felix</groupId>
				<artifactId>maven-bundle-plugin</artifactId>
				<configuration>
					<instructions>
						<Import-Package>javax.*,sun.io.*,org.apache.commons.httpclient.*,org.apache.commons.logging.*</Import-Package>
						<Embed-Dependency>*;scope=compile|runtime</Embed-Dependency>
						<Embed-Directory>OSGI-INF/lib</Embed-Directory>
						<Embed-Transitive>true</Embed-Transitive>
						<_exportcontents>
							com.fastsearch.esp.search.*;version=${esp-searchapi.version}
						</_exportcontents>
					</instructions>
				</configuration>
			</plugin>
		</plugins>
	</build>
</project>

Run mvn clean install. This should produce a file called esp-search-api-wrapped-5.3.0.6.jar in your target directory.

Similar to before, upload this artifact to your 3rd party Nexus repository using the following:

GAV Definition: GAV Parameters
Group: no.fast
Artifact: esp-searchapi-wrapped
Version: 5.3.0.6
Packaging: jar

You should now be able to use the new wrapped version of the API in your Maven POM by adding the following dependency.

<dependency>
  <groupId>no.fast</groupId>
  <artifactId>esp-searchapi-wrapped</artifactId>
  <version>5.3.0.6</version>
</dependency>

Deploy the esp-searchapi-wrapped-5.3.0.6.jar to CQ via the Felix console.

Lastly, edit yourcqinstance/crx-quickstart/sling.properties and add the following line. This will allow Felix to export sun.io and make it available from the framework classloader.

org.osgi.framework.system.packages.extra=sun.io

Once this change is made, restart CQ 5.
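
One caveat: org.osgi.framework.system.packages.extra holds a comma-separated list of packages. If your sling.properties already defines it, a second line with the same key would typically replace the existing value instead of extending it, so sun.io should be appended to the existing list. The following is a small sketch of that merge in plain Java. The helper class and the existing package name (com.example.existing) are made up for illustration, and the sketch assumes the entries contain no quoted commas.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper: adds a package to a comma-separated package list without duplicating it.
class SystemPackagesExtra {

    static final String KEY = "org.osgi.framework.system.packages.extra";

    // Returns the current list with pkg appended, keeping the original order.
    static String withPackage(String currentValue, String pkg) {
        Set<String> packages = new LinkedHashSet<>();
        if (currentValue != null) {
            for (String entry : currentValue.split(",")) {
                if (!entry.trim().isEmpty()) {
                    packages.add(entry.trim());
                }
            }
        }
        packages.add(pkg);
        return String.join(",", packages);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the contents of sling.properties; the existing value is made up.
        Properties props = new Properties();
        props.load(new StringReader(KEY + "=com.example.existing\n"));
        props.setProperty(KEY, withPackage(props.getProperty(KEY), "sun.io"));
        System.out.println(KEY + "=" + props.getProperty(KEY));
    }
}
```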

CQ5 WebDAV Support for Windows 7 64-bit

After a long break from working in the content management space, I returned to my CMS roots with a focus on CQ5. As a novice CQ5 developer, I've been chipping away at CQ5 recipes such as: As a developer, I would like to access the CRX via WebDAV on my Windows 7 workstation. Simple question, right? Wrong. As it turns out, Windows 7 64-bit does not make mapping a WebDAV resource easy. Sure, there were claims that applying KB907306 would do the trick. This didn't work. There were instructions on mucking with the registry. Really, this isn't the 90s. No thank you. Oh, wait...there are third-party freeware packages such as BitKinex. Again, no thank you. Lastly, there were some articles around changing the authentication scheme from Basic Authentication to Digest Authentication. Why can't I have native support?

Enough with the rant. I recently had a good experience building a command line WebDAV client called cadaver under Linux (CentOS). As a command line guy, I already had Cygwin running under Windows 7. Sure enough, Cygwin offers a cadaver package under All > Web.

If you are running Windows 7, need WebDAV support, and don't mind using the command line, try cadaver out. Once installed, connecting to the CRX is pretty painless.

  1. Launch Cygwin
  2. Create a file called ~/.netrc and include the following lines. This will allow you to interact with the CRX without being prompted for a username and password.
    machine localhost
    login admin
    password admin
    
  3. Run cadaver.
    $ cadaver http://localhost:4502/crx/repository/crx.default
    
  4. You should now receive a shell to interact with the CRX. Most of the commands are similar to a command line FTP client (ls, cd, get, etc.). Simply type help for a list of available commands.

 

Apache Felix - Example 3 Continued: Testing Service Unregistration

 

Apache Felix Tutorial Example 3 walks the user through an implementation of a simple dictionary service. At the end of the tutorial, it describes a use case in which the dictionary service is unregistered. Under this scenario, the code in the example should throw a null pointer exception.

I was curious how to force this use case and came up with the following additional steps to unregister the service while the client is using it. We just need access to another shell so that we can have the client running in one shell and stop the service in a second shell.

  1. Begin by downloading the Remote Shell bundle under the subprojects section of the download page.
  2. Install and start the Remote Shell bundle. This will provide remote access to the Felix shell via Telnet.
    start file:/path/to/download/org.apache.felix.shell.remote-1.1.2.jar
  3. Open your preferred Telnet application, such as PuTTY, and set the host to localhost and the port to 6666.
  4. In your original shell (not the remote) start the example 3 client.
    g! lb | grep Client
    START LEVEL 1
    ID|State |Level|Name
    12|Resolved | 1|Dictionary Client (1.0.0)
    g! start 12
    Enter a blank line to exit.
    Enter a word:
    testing
    No match found!
  5. Find the ID of the English dictionary service and stop it.

    g! lb | grep English
    10|Resolved | 1|English dictionary (1.0.0)
    g! stop 10
  6. Lastly, go back to the shell running the dictionary client and enter a new word. You should now receive a null pointer exception!
    Enter a word:
    testing2
    org.osgi.framework.BundleException: Activator start error in bundle [12].
    at org.apache.felix.framework.Felix.activateBundle(Felix.java:2027)
    at org.apache.felix.framework.Felix.startBundle(Felix.java:1895)
    at org.apache.felix.framework.BundleImpl.start(BundleImpl.java:944)
    at org.apache.felix.gogo.command.Basic.start(Basic.java:729)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.felix.gogo.runtime.Reflective.invoke(Reflective.java:137)
    at org.apache.felix.gogo.runtime.CommandProxy.execute(CommandProxy.java:82)
    at org.apache.felix.gogo.runtime.Closure.executeCmd(Closure.java:477)
    at org.apache.felix.gogo.runtime.Closure.executeStatement(Closure.java:403)
    at org.apache.felix.gogo.runtime.Pipe.run(Pipe.java:108)
    at org.apache.felix.gogo.runtime.Closure.execute(Closure.java:183)
    at org.apache.felix.gogo.runtime.Closure.execute(Closure.java:120)
    at org.apache.felix.gogo.runtime.CommandSessionImpl.execute(CommandSessionImpl.java:89)
    at org.apache.felix.gogo.shell.Console.run(Console.java:62)
    at org.apache.felix.gogo.shell.Shell.console(Shell.java:203)
    at org.apache.felix.gogo.shell.Shell.gosh(Shell.java:128)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.felix.gogo.runtime.Reflective.invoke(Reflective.java:137)
    at org.apache.felix.gogo.runtime.CommandProxy.execute(CommandProxy.java:82)
    at org.apache.felix.gogo.runtime.Closure.executeCmd(Closure.java:477)
    at org.apache.felix.gogo.runtime.Closure.executeStatement(Closure.java:403)
    at org.apache.felix.gogo.runtime.Pipe.run(Pipe.java:108)
    at org.apache.felix.gogo.runtime.Closure.execute(Closure.java:183)
    at org.apache.felix.gogo.runtime.Closure.execute(Closure.java:120)
    at org.apache.felix.gogo.runtime.CommandSessionImpl.execute(CommandSessionImpl.java:89)
    at org.apache.felix.gogo.shell.Activator.run(Activator.java:75)
    at java.lang.Thread.run(Unknown Source)
    Caused by: java.lang.NullPointerException
    at com.gastongonzalez.felix.tutorial.example3.Activator.start(Unknown Source)
    at org.apache.felix.framework.util.SecureAction.startActivator(SecureAction.java:641)
    at org.apache.felix.framework.Felix.activateBundle(Felix.java:1977)
    ... 32 more
    java.lang.NullPointerException

Apache Felix - Bundle Installation and NoClassDefFoundError

 

I am on day two of a long journey to master OSGi. As such, I have been working through the OSGi tutorials on the Apache Felix site. As I continue down this path, I thought I would document the issues that I encounter as a novice, in the hope of helping others at the same level.

This afternoon, I tried installing a new bundle (felix-tutorial-example5.jar) and received a java.lang.NoClassDefFoundError exception for the ServiceTracker class. As it turns out, I forgot to import the org.osgi.util.tracker package in my manifest. So next time you receive a NoClassDefFoundError exception when installing a bundle, check that you have all of your import packages defined.

The Solution:

Import-Package: org.osgi.framework,
 org.osgi.util.tracker,
 com.gastongonzalez.felix.tutorial.example2.service
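
A check like this can also be automated before installing a bundle. The following is a minimal sketch that uses only the JDK's java.util.jar.Manifest to test whether a manifest imports a given package. The class and method names are hypothetical. Note that in a real MANIFEST.MF each continuation line of a header must begin with a single space.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Manifest;

// Hypothetical pre-install check: does the manifest import a given package?
class ImportPackageCheck {

    static boolean imports(Manifest manifest, String pkg) {
        String header = manifest.getMainAttributes().getValue("Import-Package");
        if (header == null) {
            return false;
        }
        // Naive split on commas: a quoted version range leaves fragments behind,
        // but those fragments never equal a package name, so the check still holds.
        for (String clause : header.split(",")) {
            // Drop attributes such as ";version=..." before comparing.
            String name = clause.split(";")[0].trim();
            if (name.equals(pkg)) {
                return true;
            }
        }
        return false;
    }

    // Parse manifest text; in practice you would read META-INF/MANIFEST.MF from the JAR.
    static Manifest parse(String text) {
        try {
            return new Manifest(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Manifest manifest = parse(
            "Manifest-Version: 1.0\n"
            + "Import-Package: org.osgi.framework,\n"
            + " org.osgi.util.tracker\n");
        System.out.println(imports(manifest, "org.osgi.util.tracker"));
    }
}
```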

The Problem:

g! felix:start file:/c:/temp/felix-tutorial-example5.jar
org.osgi.framework.BundleException: Activator start error in bundle [28].
at org.apache.felix.framework.Felix.activateBundle(Felix.java:2027)
at org.apache.felix.framework.Felix.startBundle(Felix.java:1895)
at org.apache.felix.framework.BundleImpl.start(BundleImpl.java:944)
at org.apache.felix.gogo.command.Basic.start(Basic.java:729)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.felix.gogo.runtime.Reflective.invoke(Reflective.java:137)
at org.apache.felix.gogo.runtime.CommandProxy.execute(CommandProxy.java:82)
at org.apache.felix.gogo.runtime.Closure.executeCmd(Closure.java:477)
at org.apache.felix.gogo.runtime.Closure.executeStatement(Closure.java:403)
at org.apache.felix.gogo.runtime.Pipe.run(Pipe.java:108)
at org.apache.felix.gogo.runtime.Closure.execute(Closure.java:183)
at org.apache.felix.gogo.runtime.Closure.execute(Closure.java:120)
at org.apache.felix.gogo.runtime.CommandSessionImpl.execute(CommandSessionImpl.java:89)
at org.apache.felix.gogo.shell.Console.run(Console.java:62)
at org.apache.felix.gogo.shell.Shell.console(Shell.java:203)
at org.apache.felix.gogo.shell.Shell.gosh(Shell.java:128)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.felix.gogo.runtime.Reflective.invoke(Reflective.java:137)
at org.apache.felix.gogo.runtime.CommandProxy.execute(CommandProxy.java:82)
at org.apache.felix.gogo.runtime.Closure.executeCmd(Closure.java:477)
at org.apache.felix.gogo.runtime.Closure.executeStatement(Closure.java:403)
at org.apache.felix.gogo.runtime.Pipe.run(Pipe.java:108)
at org.apache.felix.gogo.runtime.Closure.execute(Closure.java:183)
at org.apache.felix.gogo.runtime.Closure.execute(Closure.java:120)
at org.apache.felix.gogo.runtime.CommandSessionImpl.execute(CommandSessionImpl.java:89)
at org.apache.felix.gogo.shell.Activator.run(Activator.java:75)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NoClassDefFoundError: org/osgi/util/tracker/ServiceTracker
at com.gastongonzalez.felix.tutorial.example5.Activator.start(Unknown Source)
at org.apache.felix.framework.util.SecureAction.startActivator(SecureAction.java:641)
at org.apache.felix.framework.Felix.activateBundle(Felix.java:1977)
... 32 more
Caused by: java.lang.ClassNotFoundException: org.osgi.util.tracker.ServiceTracker not found by[28]
at org.apache.felix.framework.BundleWiringImpl.findClassOrResourceByDelegation(BundleWiringImpl.java:1460)
at org.apache.felix.framework.BundleWiringImpl.access$400(BundleWiringImpl.java:72)
at org.apache.felix.framework.BundleWiringImpl$BundleClassLoader.loadClass(BundleWiringImpl.java:1843)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 35 more
java.lang.NoClassDefFoundError: org/osgi/util/tracker/ServiceTracker

 

FAST ESP Error: Document summary too short, couldn't unpack

The FAST ESP error, Document summary too short, couldn't unpack, occasionally occurs after performing an index profile update. In the past, my solution to this problem involved executing a manual cold update. While this works every time, it does require that you refeed all of your content. Definitely not the ideal solution when dealing with a production system.

I discovered a TechNet posting this morning that offers another solution that's not destructive. The following are the steps described in the post:

  1. Stop the QR Server (nctrl stop qrserver).
  2. Delete the %FASTSEARCH%\var\qrserver\webcluster\15100\cache_cs directory.
  3. Start the QR Server (nctrl start qrserver).
  4. Stop Search (nctrl stop search-1).
  5. Delete the %FASTSEARCH%\var\searchctrl directory.
  6. Start Search (nctrl start search-1).
  7. Repeat the above steps on any systems in the cluster running a QR Server or Search component.

A big thanks to Rob Va for the solution!


FAST Search Engineer's Guide: An Alternate Approach to SBC Absolute Position Boosting

Given the relatively closed nature of FAST ESP, there are a number of areas that are not well documented. This makes our jobs as search engineers difficult when we are required to solve a problem in one of these undocumented or lightly documented areas.

In March 2011 I wrote a white paper, FAST Search Engineer's Guide: An Alternate Approach to SBC Absolute Position Boosting, that describes a method for applying SBC-style Top 10 or Absolute Position boosts without the use of SBC.