Tag: search · Liip Blog

Metadata-based search vs. Primary-data search

https://www.liip.ch/fr/blog/metadata-based-search-vs-primary-data-search · Wed, 29 Aug 2018

Let's consider two examples:

  • Anna wants to compare real estate prices in Bern and Zurich. She stumbles upon the open data portal of Switzerland. There is a dataset from the Statistical Office of the City of Zurich covering real estate prices; unfortunately, there is no such dataset for Bern. Anna will try to find the data using the search.
  • Thomas wants to have a look at the outcomes of the recent elections ("Ständeratswahlen") in Zürich and wants to know how each of the prominent politicians performed.

We assume that a user of such a portal is interested in certain data but doesn't know where to find it. They are neither familiar with the portal nor "power users" with special knowledge about the data or metadata.

To make things more exciting we have written this blog post with two authors: Stefan Oderbolz and Thomas Ebermann. Stefan will try to convince you that a search based on metadata will help Anna find the right datasets to compare real estate prices. Thomas will try to convince you that a search on primary data works better and will allow him to quickly find the right dataset.

We rolled a die to decide who writes which part, but we will both try as hard as we can to convince you, the reader, that our opinion is the best. Buckle up and enjoy the ride.

Why we need Metadata-based search

by Stefan Oderbolz

Metadata-based search means that the search engine powering a portal has indexed documents based on a specified metadata schema. Documents have metadata fields like title, description, keywords, or temporal and spatial coverage.

In its most basic form, a query for "Zurich real estate" will return all documents whose metadata values match "Zurich", "real" and "estate".
"Zurich" is covered by the spatial coverage field. "Real" and "estate" (or "real estate") are found either as keywords or as part of the title and description of the dataset.

Anna’s search will return 2 results:

  • The aforementioned dataset of the Statistical Office of the City of Zurich
  • A dataset of the Canton of Zurich that contains the real estate data for the whole canton (including the City of Zurich)

No further results are shown. The analogous search “Bern real estate” doesn’t return any datasets.

This example shows the strength of the metadata-based search: if the metadata is of good quality, you get the correct datasets. And if the search does not return anything, you can be confident that the dataset does not exist (by the way: this would be a good time to start a "data request" for this data, so next time you search, you'll find both datasets right away).

I think this is an important message: getting zero results is actually a good thing. You know it’s not there. This heavily relies on the assumption that the metadata is good and correctly indexed.

from kiwi.concept, https://www.flickr.com/photos/kiwikoncepts/41246808615, CC BY 2.0

A neatly organized catalogue helps the users to browse it, without even entering search terms. You can show categories for your datasets, to give information about what kind of things you can find on the portal.

Screenshot from opendata.swiss

With keywords on each dataset you can even do this on a more fine-grained level. A user might discover similar datasets that share the same keyword (see screenshot below). This way a user can "move" seamlessly through the catalogue and discover the available datasets.

Screenshot from opendata.swiss

Last but not least, the metadata-based approach forces data publishers to deliver high-quality metadata for each dataset. Otherwise nobody will find and use the datasets. This incentive is important to keep in mind. If we simply outsource the task to the computer, we lose all the valuable knowledge the publishers have of their data. For the closed world of a data portal, this is a source we cannot afford to lose.

The catch-all primary-data approach might be the right choice for something vast like the web, but not for the small-ish, precise area of a data catalogue. But I’ll let Thomas take it from here and convince you otherwise.

Why we will need primary-data search

By Thomas Ebermann

Let's go back in time. Let's go way back, even before Pokémon Go was popular, maybe even to a time when Pokémon didn't exist at all. In that distant past, somewhere around 1996, we lived in a different world. A world without Trump and chaos. A world where things were somewhat ordered.

If you wanted a book you went to your local library – and if you had an old library like me, one without a computer, you could go to a place that was weird even then, called an index catalogue, where you could look up books by a certain author or browse books of a certain genre, for example fantasy literature. If you felt a bit adventurous you could also wander through the library, only to find that all of the books were nicely put into categories and alphabetized by the author's name. It was a nice experience. But it was also a tedious one.

A well sorted library

The web was no different. We were living at a time when looking for a website was really easy. You just had to do the same thing you did in your library: go to an index catalogue – I mean Yahoo – and then select the right category, for example Recreation > Health and Sports, and there you would find all of the websites about soccer or American football. It was simple, it was effective, life was good.

Source archive.org

But of course someone had to ruin it. It was us; we just liked making websites too much. More and more websites emerged, and soon there was no simple, fast way to categorize them all. It was like a library that, instead of receiving 10 new books a day, all of a sudden received 1 million a day. Nobody could categorize it all, nobody could read it all.

Source archive.org

And then Google came to save us. It was drastic. They threw away all of the categories, folders and myriads of subfolders that librarians had worked so hard on. Instead they just gave us a search box. That was it. And the most ridiculous thing happened: people actually liked it. The results were relevant and fast. It was almost magic, like a librarian that knew it all. Like a librarian that had actually read all the books and knew what was inside them.

from https://www.internetlivestats.com/internet-users/

Moving forward 20 years, here we stand, and I still feel reminded of my good old library when I think of opendata.swiss. Looking at the nearly 7000 datasets on opendata.swiss, it makes me proud how fast the number of datasets has risen. In 2016 the number of datasets was nearly half that; I still remember the website having a big two...something on the front page.

While we cannot simply assume that the number of open datasets will rise as quickly as the number of websites or internet users, I still expect that sooner rather than later we will have 100'000 open datasets worldwide. At that point it will definitely be a burden to find these datasets via a catalogue. We will definitely rely on the big search box more.

We will probably expect even more from that search box. Similar to Google, which has become a librarian that has read all the websites, we will probably also want a librarian who has read all of our datasets.

So when I type "Limmatstrasse" into that box, I will somehow expect to find every dataset that has to do with Limmatstrasse, with the really popular ones at the top of my search results and the less popular ones further down.

While I eventually might want to facet my search, for example to only see datasets related to politics, I might just as well enter "Limmatstrasse Kantonsratswahlen 2018 Mauch" or something into the box and find what I need when I am looking for a dataset containing the candidates and some sort of breakdown by region.

Voting results from https://www.stadt-zuerich.ch/portal/de/index/politik_u_recht/abstimmungen_u_wahlen/vergangene_termine/180304/resultate_erneuerungswahlen.html

Being a lazy person I might expect that a click on that search result will take me directly to the relevant rows of the dataset, just to verify that it is the right thing I want.

Yet none of these things are possible when relying only on metadata for my search. First of all, I will probably get lost in the catalogue when trying to go through 100'000 datasets. Second, I probably won't find even one dataset containing Limmatstrasse, because nobody cared to enter myriads of different streets into the metadata. It's just not practical. The same goes for all involved candidates: nobody has the time or resources to annotate the dataset that thoroughly. Finally, it's simply impossible to point me at the right row in a dataset when all I have is some metadata.

No results for Limmatstrasse

So while everybody who submitted their dataset did a fairly good job at annotating it, it's simply not enough to fulfill my needs. I need a librarian who, similarly to Google back in the '90s, has a radically different approach. I need a librarian who has read all of the datasets in the library and can point me to the right one, even if my query is rather fuzzy. In other words, I need a search that has indexed all of the primary data.

Conclusion: best of both worlds

So there we are: you have seen the high-flying arguments from both sides, while each of us has swept the negative aspects of his solution under the table. So here they are:

  • Downsides of Metadata for search:
    • Metadata is relevant when you want to make sense of the primary data, but it will never be as rich as the primary data itself. It simply does not contain every aspect a user might be searching for.
    • There is a constant dissonance between what the users are searching for and how we tag things (e.g. “weather” vs. meteodata, or “Lokale Informationen” vs. Zürich)
  • Downsides of Primary Data for search:
    • On the other hand, primary data might match a lot of the search terms a user enters, but it is simply not good for abstraction (e.g. "I want all the data from all Swiss cities").
    • Creating such ontologies from primary data is very difficult: automatically sorting datasets into categories like health or politics based on their primary data alone is hard.
    • Using only primary data we might also run into the problem of relevancy. When a user searches for a very generic keyword like Zürich, finds myriads of results that contain the word Zürich, and yet cannot facet the search down to only political results, that is frustrating.

Precision and Recall

So of course from our perspective a perfect search will have to embrace both worlds. To formalize that a bit let's think about recall and precision.

  • Recall: How many of the relevant datasets have been found? (If 10 are potentially relevant but the search returns only one, that's a low recall.)
  • Precision: How many of the returned datasets are relevant? (If 10 datasets have been returned but only 1 is actually relevant, that's a low precision.)

So in an ideal world we would want a search to have both, but the reality today looks more like this:

                      Precision   Recall
Metadata Search       High        Low
Primary-data Search   Low         High
Combined Approach     High        High

So while metadata search has a high precision, because you only get what you search for, it lacks in recall, often not finding all of the relevant datasets just because they have been tagged badly. On the other hand, a primary-data search gives you a high recall, e.g. returning all of the datasets that have the word "Zürich" somewhere in them, but has a low precision because most of the search results are probably not really relevant for you.

There are two further points where the primary-data and metadata approaches differ. On one hand, indexing primary data allows us to search for "Limmatstrasse Kantonsratswahlen 2018 Mauch", giving us very fine-grained information retrieval. On the other hand, primary data alone is not useful for "browsing" a catalogue. With metadata, searching for "Politics" or "Votings" gives us a rather broad result set, yet using those tags to browse into "Politik" and "Abstimmungen" gives us a much wider overview of available datasets that goes beyond our little search.

                           Good for                          Poor for
Metadata information       Browsing the catalogue            Highly detailed search queries
Primary-data information   Highly detailed search queries    Browsing the catalogue

That's why we think that in the future we should embrace indexing the primary data of our datasets while combining it smartly with the metadata information, to really get the best of both worlds. While this might not be easy, especially achieving both high precision and high recall, we think it is a challenge worth taking on. I am very sure that it will improve the overall user experience. After all, we want all these precious datasets to be found and used.
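To make that combination a bit more concrete, here is a minimal sketch of what such a query could look like in Solr, using the edismax parser to boost matches in metadata fields over matches in the indexed primary data (the field names are made up for illustration, and line breaks are added for readability):

select?defType=edismax
      &q=Limmatstrasse Kantonsratswahlen 2018
      &qf=title^10 keywords^8 description^4 primary_data_text^1
      &fl=id,title,description

A match in the title or keywords then ranks a dataset much higher than a match buried somewhere in its primary data, while datasets that can only be found through their primary data still show up in the result list.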

Drupal SearchAPI and result grouping

https://www.liip.ch/fr/blog/drupal-searchapi-result-grouping · Mon, 24 Oct 2016

In this blog post I will present how, in a recent e-commerce project built on top of Drupal 7 (the former version of the Drupal CMS), we made Drupal 7, SearchAPI and Commerce play together to efficiently retrieve grouped results from Solr in SearchAPI, with no indexed data duplication.

We used the SearchAPI and the FacetAPI modules to build a search index for products; so far so good: available products and product variations can be searched and filtered, also by using a set of pre-defined facets. In a subsequent request, a new need arose from our project owner: provide a list of products where the results should include, in addition to the product details, a picture of one of the available product variations, while keeping the ability to apply facets on products for the listing. Furthermore, the product variation picture displayed in the list must also match the filter applied by the user, with the aim of not confusing users and providing a better user experience.

An example use case here is simple: allow users to get the list of available products and be able to filter them by the color/size/etc. fields of the available product variations, while displaying a picture of an actually available variation, not a sample picture.

For the sake of simplicity and consistency with Drupal's Commerce module terminology, I will use the term “Product” to refer to any product-variation, while the term “Model” will be used to refer to a product.

Solr Result Grouping

We decided to use Solr (the well-known, fast and efficient search engine built on top of the Apache Lucene library) as the backend of the eCommerce platform: the reason lies not only in its full-text search features, but also in the possibility to build a fast retrieval system for the huge number of products we were expecting to be available online.

To solve the request about the display of product models, facets and available products, I intended to use a Solr feature called Result Grouping, as it seemed suitable for our case: Solr is able to return just a subset of results by grouping them by a single-valued field (previously indexed, of course). The facets can then be configured to be computed from the grouped set of results, from the ungrouped items, or just from the first result of each group.
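To give a rough idea, a grouped request sent to Solr looks something like this (the field names are examples, not the ones from our project, and line breaks are added for readability):

select?q=*:*
      &fq=type:product
      &group=true
      &group.field=model_id
      &group.limit=1
      &group.ngroups=true
      &group.truncate=true
      &facet=true
      &facet.field=color

With group.truncate=true the facet counts are computed on one document per group, while leaving it at false keeps the facets based on all matching products.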

This handy Solr feature can be used in combination with the SearchAPI module by installing the SearchAPI Grouping module. The module allows returning results grouped by a single-valued field, while keeping the facets built on all the results matched by the query; this behavior is configurable.

That allowed us to:

  • group the available products by the referenced model and return just one model;
  • compute the attribute's facets on the entire collection of available products;
  • reuse the data in the product index for multiple views based on different grouping settings.

Result Grouping in SearchAPI

Due to some limitations of the SearchAPI module and its query building components, such a plan was not doable with the current configuration, as it would have required us to create a copy of the product index just to apply a specific Result Grouping configuration for each view.

The reason is that the features of the SearchAPI Grouping module are implemented on top of the "Alterations and Processors" functionality of SearchAPI. Those are a set of specific functions that can be configured and invoked both at indexing time and at query time by the SearchAPI module. In particular, Alterations allow programmatically altering the content sent to the underlying index, while Processor code is executed when a search query is built and executed and when the results are returned.
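For reference, this is roughly how a Processor announces itself to SearchAPI in Drupal 7; the module and class names here are made up:

/**
 * Implements hook_search_api_processor_info().
 *
 * Sketch only: registers a hypothetical grouping processor with Search API.
 */
function mymodule_search_api_processor_info() {
  return array(
    'mymodule_grouping' => array(
      'name' => t('Result grouping'),
      'description' => t('Groups the results by a single-valued field.'),
      'class' => 'MyModuleGroupingProcessor',
    ),
  );
}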

Those functions can be defined and configured only per-index.

As visible in the following picture, the SearchAPI Grouping module can only be configured in the index configuration, not per query.

SearchAPI: processor settings

Image 1: SearchAPI configuration for the Grouping Processor.

As the SearchAPI Grouping module is implemented as a SearchAPI Processor (as it needs to be able to alter the query sent to Solr and to handle the returned results), it would force us to create a new index for each different configuration of the result grouping.

Such a limitation would introduce a lot of (useless) data duplication in the index, with a consequent decrease in performance when products are saved and later indexed in multiple indexes.

In particular, the duplication is more evident as the changes performed by the Processor are merely an alteration of:

  1. the query sent to Solr;
  2. the handling of the raw data returned by Solr.

This shows that there is no need to index the same data multiple times.

Since the possibility to define per-query Processors sounded really promising and such a feature could be used extensively in the same project, a new module was implemented and published on Drupal.org: the SearchAPI Extended Processors module (thanks to SearchAPI's maintainer, DrunkenMonkey, for the help and review :) ).

The Drupal SearchAPI Extended Processor

The new module extends the standard SearchAPI behavior for Processors and lets admins configure the execution of SearchAPI Processors per query and not only per index.

By using the new module, any index can now be used with multiple different Processor configurations; no new indexes are needed, thus avoiding data duplication.

The new configuration is exposed, as visible in the following picture, while editing a SearchAPI view under “Advanced > Query options”.

The SearchAPI Processors can be altered and redefined for the given view; a checkbox allows completely overriding the current index settings rather than just providing additional Processors.

Drupal SearchAPI: view's extended processor settings

Image 2: View's “Query options” with the SearchAPI Extended Processors module.

Conclusion: the new SearchAPI Extended Processors module has now been used for a few months in a complex eCommerce project at Liip and allowed us to easily implement new search features without the need to create multiple separate indexes.

We are able to index product data in one single (and compact) Solr index and use it with different grouping strategies to build product listings, model listings and model-category navigation pages without duplicating any data.

Since all those listings leverage the Solr FilterQuery (fq) query parameter (see cwiki.apache.org/confluence/display/solr/Common+Query+Parameters) to filter the correct set of products to be displayed, Solr can make use of its internal caches, specifically the filterCache, to speed up subsequent searches and facets. This aspect, in addition to the usage of only one index, allows the caches to be shared among multiple listings, which would not be possible if separate indexes were used.

For further information, questions or curiosity, drop me a line; I will be happy to help you configure Drupal SearchAPI and Solr for your needs.

Search: Get past English with Solr

https://www.liip.ch/fr/blog/search-get-past-english-with-solr · Mon, 05 Oct 2015

Implementing a great search feature for an English website is already quite a task. When you add accented characters like you have in French, things tend to get messy. What about more exotic languages like Japanese and Chinese?

When we tried to implement a search engine for a multilingual website with articles in Chinese, Japanese and Korean, despite not knowing those languages at all, we quickly noticed that our search engine was performing really poorly. On some occasions it wasn't even returning an article we had specifically copied a word from.

We had to do a lot of research to understand what was happening. Here is a compilation of what we found along the way, in the hope that you won't have to go down the same path as we did!

Since our project uses Solr, this post will concentrate on how to use the described techniques with it. The version used is Solr 4.5, but this should work on newer versions, and most of it will also work on Solr 3 with only minor adaptations.

At first glance, returning results for a search query can seem easy, but pretty soon you realize that everyone has a different way of expressing things. You also have to deal with spelling mistakes, synonyms, conjugated verbs, etc.

Fortunately, a lot of intelligent people have already resolved those common issues, they're often just a few keystrokes away.

Stop words

A stop word is a word too common in the language to bring any meaningful addition to the search. English examples are "the", "that", "and", "or".

Those words usually appear in every text and thus do not help at all when searching. You can easily find databases of stop words for various languages on the Internet. Solr ships with its own lists.

Usually you apply stop word filtering both at indexing and at query time. You can easily do that with Solr by adding a filter to your analyzers:

<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />

Depending on the particular field treated in your articles, you might want to add stop words specific to your dataset.

Synonyms

Not everyone uses the same words to describe the same things; this is great for poetry, but it is the bane of search engines. Luckily, as for stop words, you can find ready-to-use synonym lists on the Internet, and they are just as easy to use with Solr:

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

Contrary to stop word filtering, however, you usually apply synonym filters only at query time to avoid cluttering your index.

I also encourage you to extend your synonym database with words that are specific to your field.
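To give an idea of the format, a synonyms.txt file is just a plain text file where comma-separated terms are treated as equivalent and an arrow maps variants onto one canonical term:

# equivalent terms, expanded in both directions
bike, bicycle, velo
# map spelling variants onto one canonical term
i-pod, i pod => ipod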

Spelling mistakes

You could use synonym filtering to catch common spelling mistakes, but it would quickly become cumbersome to maintain a complete list of errors that way. Most modern search engines integrate spelling correction in their core.

Solr also has a nifty feature that operates exactly like the "Did you mean" suggestion you can sometimes see on the Google search page. It uses some rules and the document corpus to propose alternative queries to your users.

Describing how to implement this with Solr in detail is out of scope for this post, but you can find documentation in the official wiki.
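Just to give a very rough idea of what is involved (the field name here is an assumption and details vary between Solr versions), you register a SpellCheckComponent in solrconfig.xml:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">text</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
</searchComponent>

The component then has to be added to your request handler, and at query time you pass spellcheck=true&spellcheck.collate=true to get "Did you mean" style suggestions back.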

Stemming

Words can be used in their singular or plural forms and verbs can be conjugated. This makes the job of the search engine really difficult. For example, if your user is looking for "How to cut a diamond", you probably want to propose the article "Diamond cutting".

The words "How", "to" and "a" will already be considered stop words, so no problem here; however, you want to have a match for "cutting". This is where stemming comes into play.

Stemming is the action of keeping only the relevant part of each word (the stem); in this case it means that in most scenarios "cutting" and "cut" can be considered identical. In French, you would likewise consider "coupure", "coupe", "coupez" and "couper" as identical.

Stemming is often activated for both indexing and query analysis. There are multiple stemming filters in Solr, some more aggressive than others; usually the SnowballPorterFilterFactory is used for English:

<filter class="solr.SnowballPorterFilterFactory" language="English"/>

But it may be too aggressive for other languages; this is why specific stemmers exist:

GermanLightStemFilterFactory, FrenchLightStemFilterFactory, JapaneseKatakanaStemFilterFactory, …

Phonetic matching

In some cases, for example when searching proper nouns, people will write something that is phonetically close but does not have the same spelling. There is a wide range of algorithms that can be used in this case.

There is dedicated documentation available: Solr Phonetic Matching.
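As with the other techniques, it typically boils down to adding a filter to your analyzer, for example the Double Metaphone encoder, where inject="true" keeps the original token next to its phonetic code:

<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>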

ASCII Folding

In some cases, for example when you want to sort results or when dealing with proper nouns, you want to get rid of all the accented letters and non-ASCII characters.

Solr has a filter just for that:

<filter class="solr.ASCIIFoldingFilterFactory"/>

This will transform "Björn Ångström" into "Bjorn Angstrom", allowing results to be sorted without issues.

If you want a more consistent folding across all the Unicode planes, you can use the ICUFoldingFilterFactory.

It is however not recommended to use those filters on the fields used for searching because it could break other filters like stemming and will probably lead to less precise results.

Working with ideograms

Once you leave the known ground of the Latin alphabet and its related languages, things start to get more complicated. Many of the assumptions behind the way we commonly approach search and text comprehension no longer hold.

As a disclaimer, I am not a Japanese, Chinese or Korean speaker, so anything I say concerning those languages is to be taken with a grain of salt; it is only what I could gather from my readings on the subject. If you have access to someone knowledgeable about those languages, I can only advise you to speak with them to improve your configuration even further.

The first difficulty we stumbled upon is that there are not really words you can base your search on. There are no spaces in the sentences, only a chain of ideograms. Usually, search engines split sentences into words in order to apply the various techniques we saw earlier; this is not directly possible with ideogram-based languages.

The usual solution to this issue is to use n-grams.

n-grams

A n-gram is a sequence of n items, in our case ideograms, from a text. Let's make this clear using an example. Say we have the sentence “To be or not to be.” and we want to create n-gram of words from it, the result will be:

  • 2-gram or bigram: “to be”, “be or”, “or not”, “not to”, “to be”.
  • 3-gram or trigram: “to be or”, “be or not”, “or not to”, “not to be”.

Now apply this transformation not to words but ideograms and you can start using a search engine like Solr to search your articles.

There is a generic filter in Solr to create n-grams:

<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="4"/>

But in our case, Solr has a special tokenizer for CJK (Chinese, Japanese, Korean) languages:

<tokenizer class="solr.CJKTokenizerFactory"/>

If you use this tokenizer, Solr will automatically create bigrams without the need for a specific filter.

Once you have applied either the filter or the tokenizer to generate "words", you can apply the other techniques seen above to improve your search results.

It is important to note that n-gramming generates many terms per document and thus greatly increases the size of the index, impacting performance. Sadly, as far as we could find out, there is no real alternative for Chinese and Korean. Japanese documents can, however, be treated using morphological analysis.

Morphological analysis

Since a sentence in Japanese may use up to four different alphabets, simple n-gramming often changes meaning and semantics. You can see the "Japanese Linguistics in Lucene and Solr" presentation for more details about those issues. Fortunately, Solr integrates a morphological analyzer for Japanese: Kuromoji.

Morphological analysis tries to split Japanese into "words" that are meaningful in each sentence. It uses a statistical model to do so, and thus there can be errors, but in any case the result won't be worse than simple n-gramming.

Kuromoji is shipped as a tokenizer and a series of filters, so you can use it as you would any other Solr feature we saw earlier:

<tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
<filter class="solr.JapaneseBaseFormFilterFactory"/>
<filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />

Language specific text types

We just described some of the techniques that can be used to improve search results. Some of them can only be used with specific languages, others can be applied to all of them. There is also a lot of tuning that can be done based on each language's specificities.

Luckily enough, a lot of people from around the world took the time to create specific Solr field types for some of the most used languages. Those are tailored to cater to each language's peculiarities in the best way possible.

You can find those field types in the Solr example schema. I highly recommend that you first check whether there is an already existing field type for the language you have to index and start from that.

The example schema is pretty big, but it is also a well of best practices and knowledge when it comes to Solr; use it as much as possible. I promise you will save a lot of time.
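To give an idea of what such a field type looks like, this is roughly the French text type from the example schema, combining several of the filters described above (check your own Solr version for the exact definition):

<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
  </analyzer>
</fieldType>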

Conclusion

We have just scratched the surface of what is possible with Solr, and we haven't even started to talk about fine-tuning.

Most of the filters presented in this post have parameters that you can tweak to ensure great results for your users. It is also really important not to underestimate what good knowledge of the specific business vocabulary can do when used to craft synonym and stop word lists.

To go further, I recommend you start with the dedicated page of the Solr documentation about analyzing specific languages, which should be up to date with the latest techniques you can use for a whole list of languages: Solr Language Analysis.

In particular, have a look at the Language-Specific Factories part, which lists all the filters and tokenizers that are specific to each language.

As a parting note, the order of filters can greatly impact performance and results. Usually you put filters removing words first (stop word filters, for example), then normalizing ones and finally stemming. You can also apply some filters only at query time, as explained in the "Synonyms" part.

I hope this post can help you provide great search results to your customers, and if you have any advice or techniques that you would like to share, please leave a comment!

Magento Config Search

https://www.liip.ch/fr/blog/8062 · Thu, 21 May 2015

Working with the Magento configuration is always a chore. To make a change, you have to find the necessary section, then open it, then open its subsection, then its sub-subsection and probably some more… and only after a dozen clicks can you finally change the configuration item you were after.

Once, while testing new functionality, after a routine search for some rarely used settings, I finally got tired and decided to improve this process a little bit.

My attention was drawn to the search field at the top of the admin panel. The existing 'admin search' already knew how to search among customers, products and orders, so why not add configuration sections to this list as well? As always, I expected that I would have to rewrite basic functionality to achieve the desired result, but instead I found a nice way to extend it.

The configuration for the search is stored in the file below:

app/code/core/Mage/Adminhtml/etc/config.xml

<?xml version="1.0"?>
<config>
    <!-- ... -->
    <adminhtml>
        <!-- ... -->
        <global_search>
            <products>
                <class>adminhtml/search_catalog</class>
                <acl>catalog</acl>
            </products>
            <customers>
                <class>adminhtml/search_customer</class>
                <acl>customer</acl>
            </customers>
            <sales>
                <class>adminhtml/search_order</class>
                <acl>sales</acl>
            </sales>
        </global_search>
    </adminhtml>
</config>

So the injection was very easy: I only had to add my own search model to the adminhtml/global_search node.
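For illustration, a custom module can hook in with a config.xml along these lines (the module alias, node name and ACL path are made up here and have to match your own setup):

<?xml version="1.0"?>
<config>
    <adminhtml>
        <global_search>
            <configuration>
                <class>mymodule/search_config</class>
                <acl>system/config</acl>
            </configuration>
        </global_search>
    </adminhtml>
</config>

The model referenced there only has to follow the pattern of the core admin search models: it receives the query via magic setters and hands its matches back through setResults(). A rough, hypothetical sketch:

class My_Module_Model_Search_Config extends Varien_Object
{
    public function load()
    {
        $results = array();
        if (!$this->hasQuery()) {
            $this->setResults($results);
            return $this;
        }

        // hypothetical helper: look up config field labels (including translations)
        // matching $this->getQuery(), limited by getStart()/getLimit()
        foreach ($this->_findMatchingFields($this->getQuery()) as $field) {
            $results[] = array(
                'id'          => 'config/' . $field['path'],
                'type'        => Mage::helper('adminhtml')->__('Configuration'),
                'name'        => $field['label'],
                'description' => $field['path'],
                'url'         => Mage::helper('adminhtml')->getUrl(
                    'adminhtml/system_config/edit',
                    array('section' => $field['section'])
                ),
            );
        }

        $this->setResults($results);
        return $this;
    }
}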

My ConfigSearch model looks for matches in the config field labels and also takes translations into account. The search result shows the full config field label and its full path. When matches are found, you'll see something like this:

search

When you click on one of the displayed results, you'll be redirected to the page with this configuration section opened. The field will also be highlighted, so you can find and edit it even more quickly. Task is done!

search_result

I hope that this extension will help you a little bit with the navigation in the Magento Admin panel.

source code

Using a powerful and full-featured search engine on mobile platforms

https://www.liip.ch/fr/blog/using-a-powerful-and-full-featured-search-engine-on-mobile-platforms · Mon, 15 Dec 2014

When Xamarin meets Lucene…

Introduction

As soon as we are dealing with a larger amount of data, it can be complicated to find what you are actually looking for. Obviously, we can ease the task of finding information by structuring our data and by offering an intuitive user interface.

Nonetheless, there are several scenarios where a search engine can come in handy.

Probably the best example is our good old friend the Internet. Information is stored and obtained in various ways, and it is an immense and ever-growing collection of information resources. If you do not know exactly what you are looking for, your search engine of choice is an essential helper to point you in the right direction.

Implementing search capabilities in your desktop application is no rocket science, because you can rely on powerful search engines that do the difficult work for you. It is more a matter of configuration than of implementing complex algorithms yourself. Especially as software grows, handcrafted search functionality is simply not satisfying anymore.

What do I expect from a "good" search engine? First the obvious: return the most accurate data for what I am looking for. It should find my information even if I misspell it (we all make mistakes). It should suggest similar results, and it should do all that fast. Pretty basic needs, but quite some work if you have to implement this from scratch.

A search engine for mobile?

Some months ago, we had the opportunity to work on a very interesting project. The goal was to build a rich product catalog with enhanced search features that runs on mobile devices. Back then, our data was stored in a closed SAP environment, and the easiest option would have been to create a web service that provides the data (and handles all the searching, filtering, etc.). However, one challenging requirement was the offline capability: once the data has been synchronized with the device, it needs to be searchable without an internet connection. This means that we needed a client-side search engine, and so the journey began…

One problem in finding a search engine for mobile devices is the diversity of programming languages. Assuming that you have an application that runs on iOS and Android, you also need a search engine that is supported by both platforms. We did some research to find mobile-optimized search engines, without much success. In the meantime it became clear that we had to write our application in C# using Xamarin, since our client wanted to maintain the codebase themselves afterwards.

Note: The Xamarin platform enables developers to write iOS, Android, Mac and Windows apps with native user interfaces using C#. Xamarin utilizes Mono, a free and open source project to run Microsoft .NET applications on different platforms. You can re-use your existing C# code and share a significant amount across device platforms.

How mobile is your .NET?

Xamarin provides this convenient tool called the .Net Mobility Scanner. It shows you how much of your .NET code can run on other operating systems.

Suddenly we had this funny idea to scan an existing .Net search engine we usually use for desktop applications.

We scanned Lucene.Net and the result was quite interesting: 95% of the code is ready for mobilization! For iOS and Android, it even reached 99% compatibility. There was actually only one piece of code which was not supported, due to a System.Configuration dependency – nothing critical.

You can find the scan results here:

scan.xamarin.com/Home/Report?scanId=021c2372-6760-4104-9a3c-07400d51c82e

Note: Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. It has been ported to other programming languages including C#, targeting .NET runtime users.

The scan results raised several questions. Could we adapt Lucene.Net to actually run on mobile devices? Would it perform well? Is it stable enough? Did others already try it out? One thing was clear, we all agreed on giving it a try.

I looked around but found only one guy on Twitter who had used this library for iOS & Android projects. He told me that it actually works fine, but memory consumption was always a bit of a problem. Nonetheless, we wanted to try it out ourselves, so we downloaded the Lucene.Net source code and quickly fixed the 1% issue with the System.Configuration dependency. Everything was ready for extensive testing.

Make your data searchable

In order to make your data searchable, the first thing you need to do is build an index. Lucene stores its data as documents containing fields of text. You can basically index everything that contains textual information. Take your data, create a document for each item with certain fields, and save these documents to a physical directory on your filesystem.

Here is a simplified example of how this could look (using Lucene.Net v3.0.3):

public void BuildIndex ()
{
  var indexPath = Path.Combine(
    Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments),
    "index"
  );
  var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
  var indexDirectory = FSDirectory.Open(indexPath);
  var writer = new IndexWriter(indexDirectory, analyzer, IndexWriter.MaxFieldLength.LIMITED);
  var data = new List<Data>()
  {
    new Data { Id = 0, Text = "Introducing the Technology" },
    new Data { Id = 1, Text = "Xamarin meets Lucene" },
    new Data { Id = 2, Text = "A full-featured search engine for mobile" },
    new Data { Id = 3, Text = "Make your data searchable" }
  };

  foreach (var row in data)
  {
    Document doc = new Document();
    doc.Add(new Field("Id", row.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("Text", row.Text, Field.Store.YES, Field.Index.ANALYZED));
    writer.AddDocument(doc);
  }

  writer.Optimize();
  writer.Commit ();
  writer.Dispose();
  indexDirectory.Dispose();
}

This is our small Data class:

public class Data
{
    public int Id { get; set; }
    public string Text { get; set; }
}

If you want to try it out yourself, you'll need to download and install Xamarin (xamarin.com/download) and the slightly modified Lucene.Net library (github.com/chrigu-ebert/Xamarin-Lucene.Net).

Indexing in Lucene.Net

You just indexed a couple of documents with Lucene.Net, yeah! Let's have a look at the code example above. There are a couple of important things you need to keep in mind.

We open an index directory using FSDirectory.Open, in which Lucene will store its indexed data. If you open a directory or file, you should always close it by calling the corresponding Dispose method: indexDirectory.Dispose(). If you don't do this, you might corrupt your index because of locked files. The same applies to the IndexWriter, which actually writes data into the directory.

You might have noticed that the IndexWriter needs an analyzer instance, in our case the StandardAnalyzer. When you want to insert data into a Lucene index, or when you want to get the data back out of the index you will need to use an Analyzer to do this. Lucene provides many different analyzer classes such as:

  • SimpleAnalyzer
  • StandardAnalyzer
  • StopAnalyzer
  • WhiteSpaceAnalyzer

There are ones for working with different languages, ones which determine how words are treated (and which words are ignored) or how whitespace is handled. Understanding analyzers is somewhat tricky, and as we do not want to lose time, we simply use the StandardAnalyzer. It works very well, especially on English content.

Last but not least, we loop over our data, create a new Document for each item and pass it to the IndexWriter. Each Document contains a set of fields which hold the data that we want to make searchable. Normally we store field content as a string, but there is also a NumericField type which is very powerful if you search by numeric ranges.

It is important to understand the Field attributes especially the store and index values to avoid common mistakes:

name: The name of the field, used to build queries later.

value: The string representation of your data.

store: Specifies whether you want to store the value of the field in the index. It does not affect indexing or searching with Lucene; it just tells Lucene whether it should act as a datastore for the values in that field. If you use Field.Store.YES, the value of that field will be included in your search result documents. If you store your data in a database and only use the Lucene index for searching, you can get away with Field.Store.NO on all of your fields. However, if you are using the index as storage as well, you will want Field.Store.YES.

index: Controls how the field is indexed:

  • Field.Index.ANALYZED: index the tokens produced by running the field's value through an Analyzer. This makes a lot of sense for longer texts, but you might run into problems if you try to sort analyzed fields or want exact matches on single terms (e.g. unique IDs).
  • Field.Index.ANALYZED_NO_NORMS: like ANALYZED, but additionally disables the storing of norms. No norms means that a few bytes are saved by not storing some normalization data, which is used for boosting and field-length normalization. The benefit is less memory usage, as norms take up one byte of RAM per indexed field for every document in the index during searching. Only use this flag if you are sure you are not using that normalization data.
  • Field.Index.NO: the field will not be indexed and is therefore unsearchable. You can combine Index.NO with Store.YES to store a value that you don't want to be searchable.
  • Field.Index.NOT_ANALYZED: index the field's value without running it through an Analyzer, so it is stored as a single term and can be searched as such. This is useful for unique IDs like product numbers or if you want to sort results by this field.
  • Field.Index.NOT_ANALYZED_NO_NORMS: index the field's value without an Analyzer and also disable the storing of norms.

Finally, you'll have to call writer.Commit() to persist the changes. It is always a good idea to use writer.Optimize() from time to time to restructure your index and improve search performance. If your index is getting bigger, the optimization can take some time (several seconds).

Are you still with me? At this point you hopefully understand how you can make your data searchable. Get yourself a cookie, congrats!

Searching in Lucene.Net

Searching data using Lucene is incredibly powerful. I could write books just about that but this is not the goal of this blog post. We will do some really simple searches to explain the basics. Based on this you can build your own, amazingly complex queries.

We will use the following helper method to execute basic queries:

public List<Data> GetDataForQuery(Query query, int limit = 50)
{
  var data = new List<Data>();
  var indexPath = Path.Combine(
    Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments),
    "index"
  );
  var indexDirectory = FSDirectory.Open(indexPath);
  using (var searcher = new IndexSearcher(indexDirectory))
  {
    var hits = searcher.Search(query, limit);
    Console.WriteLine(hits.TotalHits + " result(s) found for query: " + query.ToString());
    foreach (var scoreDoc in hits.ScoreDocs)
    {
      var document = searcher.Doc(scoreDoc.Doc);
      data.Add(new Data()
      {
        Id = int.Parse(document.Get("Id")),
        Text = document.Get("Text")
      });
    }
  }
  indexDirectory.Dispose();
  return data;
}

We simply use FSDirectory and IndexSearcher to open our index and perform Lucene queries. We loop over the results and return them as a list of Data objects. As you might have noticed, similar to the IndexWriter, we have to explicitly Dispose the IndexSearcher (automatically done here thanks to the using statement) and the indexDirectory.

Since we stored both field values during indexing (Field.Store.YES), we can retrieve them using document.Get("FieldName"). Our helper method takes two arguments: a Lucene Query object, which will be explained below, and a limit parameter. The hits variable contains a property called TotalHits, which gives you the total number of documents that match your query. If you have thousands of documents stored in your index, it doesn't make sense to return them all. Usually it is enough to return a certain subset (limit) while knowing that there are probably more results.

Getting all documents becomes as simple as that:

var query = new MatchAllDocsQuery();
var data = GetDataForQuery(query);

The following example shows how to find a document by id using a TermQuery:

var term = new Term("Id", 2);
var query = new TermQuery(term);
var data = GetDataForQuery(query);
if(data.Any())
{
  //writesAfull-featuredsearchengineformobile
  Console.WriteLine(data.FirstOrDefault().Text);
}

The following example shows a common mistake. We're trying to use a PrefixQuery to find documents which have a Text-field that starts with “Intro”:

// this does not return any data!
var term = new Term("Text", "Intro");
var query = new PrefixQuery(term);
var data = GetDataForQuery(query);

Actually, we do have a document with the text "Introducing the Technology", so this should work. But since our Text field is analyzed using the StandardAnalyzer, the text is tokenized and in this case stored in lowercase. If you change your term to new Term("Text", "intro"), it will return the document.

An easier way is to use the Lucene query syntax and the QueryParser:

var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "Text", analyzer);
var query = parser.Parse("Xamarin");
var data = GetDataForQuery(query);

Because we use the same analyzer (StandardAnalyzer) as we did while indexing the documents, the sample code above returns our document.

You can perform wildcard queries using an asterisk (*) at the end of a word:

// this will match the word Technology
var query = parser.Parse("Tec*gy");

The query parser can do a lot more. Using a tilde character (~) at the end of a word, indicates a fuzzy query:

// this will match "Xamarin", assuming that we're a little drunk
var query = parser.Parse("amixarin~");

Summary

As you can see, indexing and searching data is actually pretty simple. The examples above are just scratching the surface. As soon as you start combining queries using BooleanQuery and giving weight to certain fields, it starts to get really serious. So far we didn't even talk about filtering and sorting.
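To give a small taste, combining two term queries with different weights could look like this (in Lucene.Net 3.0.3 the clause enum is called Occur; older versions use BooleanClause.Occur):

// "search" must occur; "mobile" is optional but boosts the score when present
var boolQuery = new BooleanQuery();
boolQuery.Add(new TermQuery(new Term("Text", "search")), Occur.MUST);
boolQuery.Add(new TermQuery(new Term("Text", "mobile")) { Boost = 2.0f }, Occur.SHOULD);
var data = GetDataForQuery(boolQuery);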

I strongly suggest you give it a try. We have worked for months on a project using Lucene and Xamarin together and indexed thousands of documents. The performance and the possibilities are simply amazing.

If you are curious and already tried it out, you could also have a look at the Linq to Lucene project. I didn't try it out on a mobile device so far but it helps a lot to get started.

The code examples are tested on Lucene.Net version 3.0.3. Things might have changed significantly on older/newer versions. The stable version on apache.org didn't change for quite a while. If you want to get the latest version which is under active development, you can clone the Github repository (links below).

Links

Ecostar Elastica/FOQElasticaBundle

https://www.liip.ch/fr/blog/ecostar-elasticafoqelasticabundle · Thu, 31 Jan 2013

As you might remember, Lukas and I started working on some changes to the Elastica library and the Symfony2 bundle FOQElasticaBundle during a hackday. You might also remember that we were not entirely happy with our solution for the infinite nesting levels in the mappings configuration of the bundle. Also, we got some feedback from other developers on our pull requests to both the library and the bundle. In order to clean up our code and respond to the feedback, I asked for some innovation budget and got it. Thanks for that! :)

So, what did I do exactly with this budget:

Elastica

During the hackday we enhanced the Elastica library with serializer support, so you can not just pass data arrays to it to be written to the Elasticsearch index, but also whole objects that are then serialized and written to the index. At first we simply passed a serializer object to the library, assuming it had a public method called 'serialize'. We also added support for serializer groups, following the example of the JMSSerializer.

After some discussion with other developers and the maintainer of the library we decided that it would be a better solution to just pass a PHP callable to the library. This gives the developer the possibility to use whatever serializer he/she likes and to configure it properly.

The pull request for these changes was already merged to the master branch of the library.

At the same time I also adapted the FOQElasticaBundle to give the developer the possibility to make use of this new serializer support of the elastica library. I also implemented a default serializer callable using the JMSSerializer. This pull request is at the moment still waiting for feedback.

FOQElasticaBundle

Another change we made to the FOQElasticaBundle during the hackday was to try to add support for an infinite number of nesting levels in the mapping configurations. The solution we found back then worked, but it had some drawbacks: e.g. the error messages given to the developer when using a wrong value for a certain configuration variable were not clear, because with that solution we didn't find a way to generate the full path to the wrong variable. We created a pull request anyway and could discuss better solutions with other developers.

The solution we came up with in the end is the following. Instead of trying to build a configuration tree that accepts an infinite number of nesting levels in the mappings, we generate a fixed tree based on the current configuration. This means that we pass the configuration array to the DependencyInjectionConfiguration class and figure out, based on the array, how many nested levels are necessary and which ones. Based on that we then build a fixed configuration tree exactly matching the current configuration. That way we also get the nice automatically generated error messages when passing a wrong value to one of the configuration variables.

The pull request is still waiting for feedback.

Why a project switched from Google Search Appliance to Zend_Lucene

https://www.liip.ch/fr/blog/why-a-project-switched-from-google-search-appliance-to-zend_lucene · Thu, 13 Jan 2011

Google technology does a good job when searching the wild and treacherous realms of the public internet. However, the commercial Google Search Appliance (GSA), sold for searching intranet websites, did not convince me at all. For a client, we first had to integrate the GSA; later we reimplemented the search with Zend_Lucene. Here are some thoughts comparing the two search solutions.

This post became rather lengthy. If you just want the summary of my pro and con for GSA versus Lucene, scroll right to the end :-)

In a project we got to take over, the customer had already bought a GSA (the "cheap" one – only about $20'000). There was a list of wishes from the client on how to optimally integrate the appliance into their websites:

  • Limit access to authorized users
  • Index all content in all languages
  • Filter content by target group (information present as META in the HTML headers)
  • Show a box with results from their employee directory

GSA Software

The GSA caused problems with most of those requests.

When you activate access protection, the GSA makes a HEAD request on the first 20 or so search results for each single search request, to check whether that user has the right to see each document. As our site has no individual visibility requirements, we did not need that. But there is no way to deactivate this check, resulting in unnecessary load on the web server. We ended up catching the GSA HEAD requests quite early and just sending a Not Modified response without looking further into the request.
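In PHP terms the workaround was little more than an early exit in the front controller, along these lines (the user agent check shown here is an assumption; check what your GSA actually sends):

// answer the GSA authorization HEAD requests before bootstrapping the application
if ($_SERVER['REQUEST_METHOD'] === 'HEAD'
    && isset($_SERVER['HTTP_USER_AGENT'])
    && strpos($_SERVER['HTTP_USER_AGENT'], 'gsa-crawler') !== false) {
    header('HTTP/1.1 304 Not Modified');
    exit;
}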

The GSA completely ignores the language declaration (whether in a META tag, in the lang attribute or inside the HTML head) and uses its own heuristics. This might be fine for the public internet, where many sites declare their content to be in the server installation language even if it is not – but in a controlled environment we can make sure those headers are correct. We talked to Google support about this, but they could only confirm that it's not possible. This was annoying, as the heuristics were wrong, for example when part of a page's content was in another language.

The spider component got tripped up by some bugs in the website we needed to index. We found that the same parameter got repeated over and over in a URL. Those cycles led to the same page being indexed many times and the limit of 500'000 indexed pages filling up. This is of course a bug in the web server, but we found no way to help the GSA not stumble over it.

Filtering by meta information would work. But we have binary documents like PDF, Word and so on, and there was no way to set the meta information for those documents. The query requiredfields=gsahintview:group1|-gsahintview should trigger a filter saying: either we have the meta information with a specific value, or no meta at all. However, Google confirmed that this combination of filter expressions is not possible. They at least updated their documentation to explain the restrictions.

The only thing that really worked without hassle was the search box. You can configure the GSA to request data from the web server and return an XML fragment that is integrated into the search result page.

Support by Google was a very positive aspect. They answered fast and without fuss, and were motivated to help. They seemed competent – so I assume that when they did not propose alternatives but simply said there is no such feature, there really was no alternative for our feature requests.

GSA Hardware

The Google hardware, however, was a real nuisance. You get the appliance as a standard-sized server to put into the rack. Having the hardware locally makes sense: it won't use external bandwidth for indexing and you can be more confident about your confidential data. But during the two years we used the GSA, there were three hardware failures. As part of the setup test, our hoster checks whether the system works properly by unplugging the whole machine. While this is of course not good for the data, the hardware should survive it. The GSA did not and had to be sent in for repair. There were two more hardware issues – one was simply a RAM module signalling an error. But as the hoster is not allowed to open the box, even such a simple repair took quite a while. Our client did not want to buy more than one appliance, as they are rather expensive, so you usually do not have a replacement ready. With any other server, the hoster can fix the system rather fast or, in the worst case, just re-install it from backups. With the GSA there is no such redundancy.

The GSA is not only closed on the hardware level. You also have no shell access to the system, so all configuration has to be done in the web interface. Versioning of that information can only be done by exporting and, if needed, re-importing the complete configuration. I like to have all relevant settings in version control for better tracking.

Zend Lucene

The GSA license runs for two years. After that period, another twenty-something thousand dollars have to be paid if you want to keep support. At that point we discussed the situation with our client and decided to invest a bit more than the license cost and move to an environment where we have more control and redundancy. The new search uses the Zend_Lucene component to build the indexes. As everything is PHP here, the indexer uses the website framework itself to render the pages and build the indexes.

  • We run separate instances of the indexer process for each web site and each language, each building one index. In the beginning we had one script to build all indexes, but a PHP script running for over 24 hours was not very reliable – and we wanted to use the power of the multi-core machine, as each PHP instance is single-threaded and Lucene's text analysis is rather CPU intensive.
  • We did not want to touch existing code that changes content, so we don't risk breaking normal operations in case something is wrong with Lucene. Every hour, a cronjob looks for new or changed documents and updates the index. Every weekend, all indexes are rebuilt and – after a sanity check – replace the old indexes. Deleting content does not trigger Lucene either; until the index is rebuilt, the result page generation simply ignores result items that no longer exist in the database.
  • For binary documents, we use Linux programs to convert the files into plain text that is then analyzed by Lucene (see the code fragment below) – except for docx and friends (the XML formats of Microsoft Office 2007), which are supported natively:
    • .msg, .txt, .html: cat
    • .doc, .dot: antiword (worked better than catdoc)
    • .rtf: catdoc
    • .ppt: catppt (part of catdoc package)
    • .pdf: pdftotext (part of xpdf)
    • We ignore zip files, although PHP would allow us to open them.
  • All kinds of meta information can be specified during indexing, which solves the language specification issue. As the database knows the language of each document, even binary documents are indexed in the correct language (see the indexing sketch after this list).
  • The indexes are copied to each server (opening them over the shared NFS file server is not possible, as Zend_Lucene wants to lock the files and NFS does not support that). This provides redundancy in case a server crashes, and the integration test server can run its own copy and index the test database.
  • We were able to fine-tune ranking relevance based on the type and age of content.
  • To improve matching of similar word forms, we use stemming filters. We chose php-stemmer and are quite happy with it.
  • If we run into performance problems, we could switch to Java Lucene for handling search requests, as the binary index format is compatible between Zend_Lucene and Java Lucene.
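
To illustrate the indexing side, here is a minimal sketch using the Zend_Search_Lucene API. The index path, the field names and the $page record are placeholders, not our actual CMS objects:

<?php
// Minimal indexing sketch; assumes Zend Framework 1 is on the include path.
// $page is a hypothetical record from the CMS database.
require_once 'Zend/Search/Lucene.php';

// one index per web site and language, e.g. data/index/www-de
$index = Zend_Search_Lucene::create('data/index/www-de');

$doc = new Zend_Search_Lucene_Document();

// stored but not tokenized: used for the result link and for filtering
$doc->addField(Zend_Search_Lucene_Field::Keyword('url', $page->url));
$doc->addField(Zend_Search_Lucene_Field::Keyword('language', $page->language));
$doc->addField(Zend_Search_Lucene_Field::Keyword('changed', $page->changedAt));

// tokenized and searchable; the body is not stored to keep the index small
$doc->addField(Zend_Search_Lucene_Field::Text('title', $page->title, 'UTF-8'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('body', $page->renderedText, 'UTF-8'));

$index->addDocument($doc);
$index->commit();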

Indexing about 50'000 documents takes about a full day, with parallel scripts keeping the CPU cores pretty busy. But our web servers are bored over the weekend anyway. If this became an issue, we could buy a separate server for searching, as you have with the GSA – and the hardware of that server would probably be more reliable and could be fixed by our hoster.

The resulting indexes are only a couple of megabytes in size. So even though Zend_Lucene has to load the index for each search request, it is quite fast: loading the index takes about 50 ms of the request, and I assume the file system cache keeps the files in memory.
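
The search side then looks roughly like this sketch (again with a placeholder index path, query and field names):

<?php
// Minimal search sketch; assumes the index layout from the indexing example.
require_once 'Zend/Search/Lucene.php';

// open the existing index – this is the part that costs about 50 ms
$index = Zend_Search_Lucene::open('data/index/www-de');

// parse the user's query and run the search
$query = Zend_Search_Lucene_Search_QueryParser::parse('annual report', 'UTF-8');
$hits  = $index->find($query);

foreach ($hits as $hit) {
    // $hit->score is the relevance; stored fields are available as properties
    echo $hit->score, ' ', $hit->title, ' ', $hit->url, "\n";
}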

Zend_Lucene worked out quite well for us, although today I would probably use Apache Solr to save some work, especially for reading documents and for stemming.

Code fragment for reading binary files as plain text:

// map file extensions to shell commands that print the file as plain text
$map = array('ppt' => 'catppt %filename% 2>/dev/null',
             'pdf' => 'pdftotext -enc UTF-8 %filename% - 2>/dev/null', // the "-" tells pdftotext to write to stdout
             'txt' => 'cat %filename% 2>/dev/null',
             // ... further mappings: doc/dot => antiword, rtf => catdoc, msg/html => cat
             );

if (! file_exists($filename))
    throw new Exception("File does not exist: '$filename'");

$type = pathinfo($filename, PATHINFO_EXTENSION);
if (! isset($map[$type]))
    throw new Exception("Unsupported document type: '$type'");

// build the conversion command for this file type and run it
$filename = escapeshellarg($filename);
$cmd = str_replace('%filename%', $filename, $map[$type]);
$output = array(); $status = 0;
exec($cmd, $output, $status);
if ($status != 0)
    throw new Exception("Converting $filename: exit status $status");

// exec() returns the output line by line; glue it back together
return implode("\n", $output);

Conclusions

Google Search Appliance

Pro:

  • Reputation with the client and acceptance by users, as it's a known brand

  • Good ranking algorithms for text, including stemming

  • Responsive and helpful support

Con:

– Closed “black box” system

– You are not allowed to fix the hardware yourself

– No redundancy unless you buy several boxes

– Missing options to tailor to our use case (use HTML language information, request pages, filter flexibility)

– Significant price tag for the license, plus still quite some work to customize the GSA and adapt your own systems

Zend_Lucene

Pro:

  • Very flexible to build exactly what we needed

  • The problematic web framework causes fewer problems here, as we can iterate over content lists instead of parsing URLs to spider the site

  • Well documented, and there is a community (though we have little experience with it, as we did not have questions)

  • No arbitrary limitation on the number of pages in an index

  • Proved reliable for two years now

  • If performance ever becomes an issue, we can switch to Java Lucene while keeping the PHP indexer

Con:

– In-depth programming needed

– Thus a higher risk of bugs

– More to build by hand than with the GSA – but for us this was still less than the license costs plus the work to customize the GSA and adapt our own systems to it.

]]>