Supporting Multi-Term Synonyms in hybris 5.4 / Solr 4.6.1

Recently, I was working with a client on a hybris 5.4 implementation and was asked to import their synonyms from their current platform. Easy enough, right? Wrong. The out-of-the-box synonym integration allows business users to define multi-term synonyms on the "from-side" in the hMC; however, at the time of this writing, Solr 4.x does not natively support multi-term synonyms on the from-side of the synonym mapping. For example, if we mapped a from-side phrase (e.g., classic gaming console) to a to-side phrase (e.g., nintendo entertainment system), the hMC would silently accept the synonym definition, but it would have no effect on the Solr side.

How do we solve this problem given that Solr does not yet support multi-term synonyms? Luckily, Ted Sullivan from Lucidworks has a great article, "Solution for multi-term synonyms in Lucene/Solr using the Auto Phrasing TokenFilter", that addresses it. You should read Ted Sullivan’s post before continuing with this one. At a high level, he provides a workaround through a custom token filter (AutoPhrasingTokenFilter) and a custom query parser (AutoPhrasingQParserPlugin). The former accepts a list of multi-term phrases (i.e., the from-side of our multi-term synonym) that need to be preserved as a single token for the synonym mapping. To continue with our example, we define our multi-term phrase in autophrases.txt:

classic gaming console

And, per Ted Sullivan’s recommendation, we will use the character ‘x’ as our whitespace replacement character. With this in place, we can add the following to our synonym file, synonyms.txt:

classicxgamingxconsole,nintendo entertainment system
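
To make the collapsing step concrete, here is a minimal plain-Java sketch of the idea behind auto phrasing. It is not Ted Sullivan's implementation (the real filter operates on a Lucene token stream); the class name AutoPhraseSketch and the method collapse are hypothetical and exist only for illustration. It assumes simple whitespace tokenization and lowercasing, and it prefers the longest matching phrase at each position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: simulates auto phrasing on a plain string.
class AutoPhraseSketch {

    // Collapses each known phrase found in `text` into one token, joining its
    // words with `replacement` (e.g. 'x'). Other tokens are passed through.
    static String collapse(String text, List<String> phrases, char replacement) {
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] tokens = trimmed.split("\\s+");
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            // Find the longest phrase that starts at position i.
            int bestLen = 0;
            for (String phrase : phrases) {
                String[] p = phrase.trim().toLowerCase().split("\\s+");
                if (p.length > bestLen
                        && i + p.length <= tokens.length
                        && Arrays.equals(p, Arrays.copyOfRange(tokens, i, i + p.length))) {
                    bestLen = p.length;
                }
            }
            if (bestLen > 0) {
                // Emit the whole phrase as a single token.
                out.add(String.join(String.valueOf(replacement),
                        Arrays.copyOfRange(tokens, i, i + bestLen)));
                i += bestLen;
            } else {
                out.add(tokens[i]);
                i++;
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("classic gaming console");
        // prints "classicxgamingxconsole bundle"
        System.out.println(collapse("Classic gaming console bundle", phrases, 'x'));
    }
}
```

The single token classicxgamingxconsole is what the synonym filter then sees, which is why the from-side entry in synonyms.txt is written in its collapsed form.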

This all works great on the indexing side; however, the OOTB Solr query parser splits the query on whitespace before analysis, so the token filter only ever sees single terms at query time. To address this, Ted Sullivan has provided a custom query parser (AutoPhrasingQParserPlugin) that is aware of the phrases in autophrases.txt as well as the whitespace replacement character.

As Ted points out, the Solr Analysis console incorrectly suggests that you do not need the Auto Phrasing query parser. He is correct, but I still created an additional field type definition (text_en_autophrase_tester) that is never used by any field in my schema and is intended only for running auto phrasing tests on both the query and index sides.

Auto Phrasing with synonyms in action

hybris 5.4 integration

Again, please read "Solution for multi-term synonyms in Lucene/Solr using the Auto Phrasing TokenFilter" first. There are a couple of challenges when integrating the Auto Phrasing TokenFilter / Query Parser directly with hybris 5.4:

  1. hybris 5.4 ships with Solr 4.6.1 and the GitHub code for the Auto Phrasing TokenFilter was written for Solr 4.10.3.
  2. hybris uses the eDismax query parser by default, so we need a way to support both query parsers: eDismax and autophrasingParser.

The version mismatch can be solved in one of two ways:

  1. Upgrade the Solr server from 4.6.1 to 4.10.3. I would be cautious with this approach; however, I performed a number of initial search tests (searches, indexing, synonym/stopword management, Commerce Search Cockpit testing) and everything worked without issue. It should be noted that this upgrade may not be supported by hybris. You’ve been warned.
  2. Fork the GitHub project and downgrade the Auto Phrasing TokenFilter to Solr 4.6.1. This is the approach I went with to minimize risk. My version of the GitHub project is available as a fork. I suggest that you clone my version from the hybris-5.4_solr-4.6.1 branch and build the JAR from source.

Supporting both query parsers is also easily solvable. My recommendation is to implement a custom SolrQueryPostProcessor that rewrites the Solr query to add a nested query. Nested queries are extremely powerful and allow you to embed a query of another type within a query string. For example, let's assume that our Solr query is:

classic gaming console

By default, this would use the eDismax query parser. We can modify this query in our SolrQueryPostProcessor to also run it through the Auto Phrasing query parser as a nested query:

classic gaming console OR _query_:"{!autophrasingParser df=name_text_en}classic gaming console" 

Should you need to support multi-term synonyms on other fields, simply add another OR clause with the desired field name:

classic gaming console OR _query_:"{!autophrasingParser df=name_text_en}classic gaming console" OR _query_:"{!autophrasingParser df=body_text_en}classic gaming console"
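
The rewriting itself is plain string manipulation, so it can be sketched independently of hybris. The helper below is a minimal sketch, not the code from my project: the class name NestedAutoPhraseQuery and its methods are hypothetical, and the wiring into a SolrQueryPostProcessor is deliberately left out. It assumes the parser is registered under the name autophrasingParser, as in the examples above, and it escapes backslashes and double quotes so that user input cannot break out of the quoted nested query.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: appends one nested auto phrasing clause per field.
class NestedAutoPhraseQuery {

    // Escapes characters that would terminate or corrupt the quoted nested query.
    static String escapeForQuotes(String query) {
        return query.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Keeps the original (eDismax) query and ORs in a nested query per field.
    static String build(String userQuery, List<String> fields) {
        StringBuilder sb = new StringBuilder(userQuery);
        String escaped = escapeForQuotes(userQuery);
        for (String field : fields) {
            sb.append(" OR _query_:\"{!autophrasingParser df=")
              .append(field)
              .append('}')
              .append(escaped)
              .append('"');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the two-field query shown above.
        System.out.println(build("classic gaming console",
                Arrays.asList("name_text_en", "body_text_en")));
    }
}
```

In a post processor you would read the current query string from the Solr query, pass it through build with the fields that use the auto phrasing field type, and set the result back on the query.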

Conclusion

The Auto Phrasing TokenFilter is a great way to improve your search experience in hybris. While the approach described here works nicely, it is decoupled from the OOTB hMC synonym management. This means that single-term synonyms are managed through the hMC, while multi-term synonyms are managed in Solr via synonyms.txt and autophrases.txt. If I had the time, I would move synonym management into the Backoffice and add full support for both single-term and multi-term synonyms in one frictionless experience.

Stay tuned for upcoming posts on advanced hybris / Solr integrations!