Why search is hard

Getting search right is hard. The sentence “search can be added as a well-performing feature to your existing product quickly” is one of the many myths about search. The truth is: adding search to a product is not a simple enable-and-forget operation.

Furthermore, there is no “once set up, search will work the same way forever”, forget about it. If your data changes over time, the probability that the search algorithm needs to be updated, “enhanced”, or tuned again is pretty high. Being able to find relevant information, documents, or products through search depends a lot on knowing the structure of the data in your collection, its findability, and how to configure your search engine. These things are not as easy as one might think. In this blog post I’d like to share my most important lessons learned on how to improve search.

Strategies to optimize search

For one of the projects I have been working on at Liip, the goal was to “improve” its search. Here are some search optimization strategies we used on that project, including some pitfalls and hints. Being able to plan how to implement a search system from the beginning is usually an advantage. Improving an already running system might require some more effort.

Being able to find relevant information in a collection is mostly about knowing (or defining) the structure of the data in your collection, the data’s findability, and how to configure and tune your search library.

Note: in the following, the term "document" is used for the piece of information that can be indexed and retrieved; a document has multiple "properties" (like a title, an abstract, and so on). The word "improve" is used in the sense of answering client questions like “why is product X on top of the list, that is not correct!”, “product Y is not found!”, or “the search doesn’t work!” (yes, sometimes that’s how clients formulate questions :D ).

1. Be able to Explain

Being able to answer why a document is found (and in a specific position) as a result of a search is of crucial importance when working with a search engine and a big dataset. It allows you to debug how the relevance score is computed for a particular result, and to avoid some “what’s going on”ℱ exclamations.

This lets the developer identify some easy-to-fix cases, like wrong or missing data, or issues in the analysis pipeline applied at indexing or query time.
Search engines like Elasticsearch and Solr can return a detailed description of the steps and values involved in the relevance score computation of search results, by enabling an “explain” flag on the requests sent to the search engine.
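For Elasticsearch this is a single flag on the search request (Solr offers a similar debug option). Here is a minimal sketch, assuming a local Elasticsearch instance and a hypothetical "products" index with a "title" field:

```python
import requests

# Hypothetical index and query, for illustration only.
body = {
    "explain": True,
    "query": {"match": {"title": "fitness"}},
}

response = requests.post("http://localhost:9200/products/_search", json=body)

for hit in response.json()["hits"]["hits"]:
    # With "explain": true each hit carries an "_explanation" tree that
    # breaks the final score down into its individual contributions.
    print(hit["_id"], hit["_score"])
    print(hit["_explanation"])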

Lesson learned: search is not a black box (and it should not be treated as one)! To properly tune the matching algorithms of your search engine you must understand how it works and how the relevance score of a search result is computed. Knowing how the relevance score is computed is (almost) a must.

2. Test and Monitor your changes

Changes applied to the search and matching algorithm, such as the fields used for searching, boosting factors, stemming, or the whole text analysis pipeline (see Solr's Analyzers or the Elasticsearch analysis documentation), must be closely tested and monitored.

Being able to reproduce the users’ searches with live data is important, as is being able to run A/B tests and comparisons between the new and the old search algorithms. On the topic of evaluation measures there is a vast amount of work and many models (precision/recall, F-measure, DCG/nDCG, etc.). Which ones to use and how to choose them depends on the data, the project, and sometimes on your budget.
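As a rough illustration (not the evaluation setup of our project), here is what computing precision@k and nDCG@k for a single query could look like, given a ranked list of document ids and some hypothetical relevance judgements:

```python
import math

def precision_at_k(ranking, relevant_ids, k):
    """Fraction of the top-k results that are judged relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant_ids) / k

def ndcg_at_k(ranking, judgements, k):
    """Normalised Discounted Cumulative Gain with graded relevance judgements."""
    def dcg(gains):
        return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))
    gains = [judgements.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal = sorted(judgements.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if ideal else 0.0

# Hypothetical judgements for one query: document id -> graded relevance.
judgements = {"doc-1": 3, "doc-3": 2, "doc-9": 1}
old_ranking = ["doc-3", "doc-7", "doc-1", "doc-9"]
new_ranking = ["doc-1", "doc-3", "doc-9", "doc-7"]

print("precision@4:", precision_at_k(new_ranking, set(judgements), 4))
print("nDCG@4 old vs new:",
      ndcg_at_k(old_ranking, judgements, 4),
      ndcg_at_k(new_ranking, judgements, 4))
```

Running such measures on a fixed set of queries before and after each change gives you the comparison baseline mentioned above.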

Lesson learned: being able to monitor the effects of your changes should be the first milestone when working on “improving” a search feature of a product. Not only to be able to present your improvements, but also to track any regression introduced during the process.

3. Start small

According to the Elasticsearch documentation, using a “catchall” (_all) field, where all properties of a document are indexed together, is NOT a good way to start implementing a search feature (the field was deprecated in v6.x and finally removed in the newest releases). Apart from being a bad idea for the query and document length normalization applied during the scoring process, it defeats the possibility of defining boosting factors for specific properties.

The hint here is: start small, and search only in a few selected properties of your documents, chosen from the ones containing the most specific information. This allows you to control the search results better and limits the bad-data factor for collections with very diverse contents. Using a high number of properties might increase recall while reducing the precision of your results (see the Precision and Recall measures).
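As a sketch of what “searching only in a few selected properties” can look like, an Elasticsearch multi_match query restricted to a handful of boosted fields might be sent like this (index and field names are illustrative, not the ones from the project):

```python
import requests

# Illustrative index and field names; the boosts give the title more weight.
body = {
    "query": {
        "multi_match": {
            "query": "weight training",
            "fields": ["title^3", "abstract^2", "category"],
        }
    }
}

response = requests.post("http://localhost:9200/products/_search", json=body)
print(response.json()["hits"]["total"])
```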

In the project I was working on, the number of properties used for searching was very high, and that led to bad results being returned for some searches. The poor quality of the results was due to some of the properties used. In one case the contents of a headline property were very broad and did not contain any text about the document itself: for some documents the property was just a single word, while for others it consisted of a very, very broad text (a case of trash in, trash out).

A second case was the usage of the categorization associated with documents. The search was (naively) using properties from all the categories associated with a document, with no selection and no differentiation of the boosting depending on the depth in the category tree.

Let’s consider the following category paths: (1) “Books > Health, Fitness & Dieting > Psychology & Counseling > Developmental Psychology”, and (2) “Books > Health, Fitness & Dieting > Exercise & Fitness > Weight Training”. A search query for “Fitness” would match the “Health, Fitness & Dieting” category, leading to all documents belonging to that category being retrieved: documents related to “Developmental Psychology” were among the results, even though they were not really relevant. Such a decision was probably taken to increase the number of documents retrieved, maybe in connection with a different category structure available at the beginning of the project.

Lesson learned: Properly choosing and limiting the properties used for searching is paramount! Choose document properties that carry the most important and specific information, to avoid false positive results.

4. Speak the Language đŸ‡©đŸ‡Ș đŸ‡«đŸ‡· 🇼đŸ‡č 🇹🇭

It is common to use language-specific analysis on document properties to improve the recall of the search results. The most commonly applied technique is stemming. Stemming reduces words to their “stem” (an example: “runner”, “running” and “runs” all become “run”), allowing searches to match words that share the same stem. The stemming process in information retrieval works by applying language-specific rules to derive the stem of a word. Different stemmers might exist for a specific language: from the ones labelled as light or “less aggressive” (which stem fewer word forms) to the ones with a very limited scope (handling only plurals and singulars).
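A quick way to see what a stemmer actually does to your text is the Elasticsearch _analyze API (Solr’s admin analysis screen serves the same purpose). A minimal sketch, assuming a local instance and the built-in "german" analyzer:

```python
import requests

# Run a sample sentence through the built-in "german" analyzer and print
# the resulting tokens (i.e. what actually ends up in the index).
body = {"analyzer": "german", "text": "Die Organisation plant viele Reisen"}

response = requests.post("http://localhost:9200/_analyze", json=body)
print([token["token"] for token in response.json()["tokens"]])
```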

Stemming for multiple languages

One pitfall encountered in our case-project was in the text analysis performed on properties containing text with words in multiple languages. In our case, the German stemmer was applying German-language rules to English words, and the results were confusing to say the least, leading to tons of false positives.

Stemming edge-cases

Stemmers, since they work with fixed rules, might also have edge cases. We hit some of those with the German stemmer: as an example, we discovered terms like “reisen” (to travel) being stemmed to “reis” (rice), or “organization” being stemmed to “organ”. Solr and Elasticsearch can mark specific words as keywords and tell the stemmer not to process them. Have a look at the KeywordMarkerFilter’s configuration and usages (see: Solr and Elasticsearch). Another possible solution is to define the correct stemming for those words; Solr and Elasticsearch implement the StemmerOverrideFilter for such cases (see the Solr StemmerOverrideFilterFactory and Elasticsearch Stemmer Override Token Filter documentation).
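Putting the two options together, the Elasticsearch index settings could look roughly like this (a sketch with illustrative filter names and example words, not the project’s actual configuration):

```python
import requests

# Illustrative index settings: protect some words from the German stemmer
# and pin the stem of others explicitly.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "protected_words": {
                    "type": "keyword_marker",
                    "keywords": ["reisen"],  # never stemmed
                },
                "stem_overrides": {
                    "type": "stemmer_override",
                    "rules": ["organization => organization"],  # keep as-is
                },
                "german_stemmer": {"type": "stemmer", "language": "light_german"},
            },
            "analyzer": {
                "german_custom": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "protected_words",
                        "stem_overrides",
                        "german_stemmer",
                    ],
                }
            },
        }
    }
}

requests.put("http://localhost:9200/documents", json=settings)
```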

Lesson learned: check the tokens produced by the stemmers used in your pipelines, try to use the “light” versions of them and be aware of edge cases, in particular in text with mixed languages.

5. Accents normalization and ASCII folding

When applying language-specific analysis it is common to lowercase and “normalize” accented letters (e.g. removing the accents from letters). In German it is also accepted to rewrite letters with umlauts by removing the umlaut and appending an “e” to the letter (e.g. MĂŒnchen can be written as Muenchen).

This can lead to issues if not properly handled, such as:

  • Words with different meanings when written with or without umlauts or accents are considered the same word, like “tauschen” (to swap) and “tĂ€uschen” (to deceive);
  • No match if using the alternate writing of words with umlauts: MĂŒnchen is not found when searching for Muenchen;
  • Prefix (or part of the word) searches will not work: the word TemporĂ€r will not be matched by a prefix search with Temporae*.

In the case-project, we initially implemented the above (faulty) pipeline, and that led to mismatches in the search results. In German, searching for Kuchen (cake, pie) also matched documents containing KĂŒchen (kitchens).

Lesson learned: our analysis pipeline now handles umlauts with the “-e” transformation (Ă€ becomes ae) instead of simply stripping them. This allows us to differentiate words with and without umlauts, and to support prefix queries.
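One way to implement this in Elasticsearch is a mapping character filter that rewrites umlauts to their “-e” spelling before tokenization; the sketch below uses illustrative index and analyzer names:

```python
import requests

# Illustrative settings: rewrite umlauts to their "-e" spelling before
# tokenization, instead of stripping the accents completely.
settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "umlaut_expansion": {
                    "type": "mapping",
                    "mappings": [
                        "Ă€ => ae", "ö => oe", "ĂŒ => ue", "ß => ss",
                        "Ă„ => Ae", "Ö => Oe", "Ü => Ue",
                    ],
                }
            },
            "analyzer": {
                "german_umlaut": {
                    "char_filter": ["umlaut_expansion"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}

requests.put("http://localhost:9200/documents_de", json=settings)
```

With this analyzer, Kuchen and KĂŒchen produce different tokens, while MĂŒnchen and Muenchen produce the same one.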

6. Regional words: manage dialects in search

One of the beauties of Switzerland in terms of search is the presence of 4 national languages and many more dialects at the regional level (and some at the city level too). Furthermore, it is common for people to use words from one of the other languages when speaking or when using the search.
One of the challenges for our case-project was to handle the 3 most used languages in Switzerland (German, French and Italian), taking into consideration the different regional dialects and their specific words, too.
The project uses 3 indexes, one per language, to simplify retrieval and to keep the document-length normalization applied during the relevance score computation consistent.

Manage language dialects in searches

The analysis pipeline implemented in the improved search makes use (at query time) of a set of regional synonyms per language. As an example, for German searches synonyms are used to “translate” Swiss German terms into High German words.
Those terms were acquired by (1) analysing the searches issued by the users on the system and their subsequent refinements, and (2) applying some local intelligence (simply asking various colleagues: “how do you write the word X in your region?” :-) ). Fun fact: Swiss German has no rules dictating how to spell a word, and in some cases the list of synonyms for a High German word is composed of 10 different Swiss German spellings.
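In Elasticsearch, query-time synonyms can be expressed as a synonym filter that is only part of the search analyzer, so the indexed documents stay untouched. A sketch with a couple of example Swiss German mappings (the real lists are much longer, and all names are illustrative):

```python
import requests

# Illustrative mapping: the synonym filter is only part of the search
# analyzer, so documents are indexed without it.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "swiss_german_synonyms": {
                    "type": "synonym_graph",
                    "synonyms": [
                        "velo => fahrrad",
                        "znacht => abendessen",
                    ],
                }
            },
            "analyzer": {
                "german_index": {"tokenizer": "standard", "filter": ["lowercase"]},
                "german_search": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "swiss_german_synonyms"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "german_index",
                "search_analyzer": "german_search",
            }
        }
    },
}

requests.put("http://localhost:9200/documents_de", json=settings)
```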

Lesson learned: monitor the terms entered by the users in the search to discover regional/national words or new trending keywords; adapt your documents or synonyms accordingly.

Conclusions

Improving search is quite a long and challenging process, as it involves many moving parts on both the search side (text analysis pipelines, query matching, result relevance computation, and so on) and the data side (document structure, missing or irrelevant contents, ...).

Having a good strategy from the beginning helps to quickly assess the current search performance and to catch some of the main issues and pitfalls that could cause a bad search experience. I hope this post helps you avoid some of the pitfalls we encountered while analysing one of our projects.

This is by no means an exhaustive list of pitfalls and hints for improving a search feature (it does not cover, for example, integrating LtR, NLP, or NER algorithms). What are your “favourite” pitfalls and hints when dealing with search? Please share them :-)

Credits: photo by Anthony Martino on Unsplash