At Liip, we develop an API that provides product information and other data for Migros, the biggest retailer in Switzerland. Let us explain the core principles behind it, the challenges we faced, and how we solved them (or haven't yet).
Data of the API is stored in an ElasticSearch cluster. On top of that, we have a couple of application servers that run a Symfony2 application. The requests all first go through a Varnish reverse proxy cache. The Symfony2 application also provides commands that are run by cronjobs to import data from a number of systems at Migros into the ElasticSearch cluster. The main data source is their Product Information Management System (PIM), which is where product information for the whole company is entered and managed.
The goal of the API is three-fold:
- Provide fast access and full-text search to underlying product information.
- Provide consistent and documented product data (and more) to the API users. Abstract the complexity of the underlying data.
- Provide a “quick view” of the data in the heterogeneous Migros system landscape. It usually takes one development iteration (2 weeks) to provide access to new data.
Over the course of development, a lot of things changed and a lot of others became irrelevant. However, we tried to stick to a few core principles, listed below. Every time a new feature request comes up, we check it against those core principles to see whether it makes sense.
No data master
The API does not hold any data that cannot be recovered from external sources. This helps us be more efficient at developing new features as it provides a safety net in case something goes wrong. All the data that we store in ElasticSearch can be recovered from external sources. If the worst-case scenario of a fatal cluster crash happens, we can recover everything in a few hours.
Should the ElasticSearch cluster be a bottleneck, we can add more nodes to the cluster. Should the application be a bottleneck, we can add more application servers horizontally. For more on this, see the blog post on ElasticSearch performance.
As the API has to provide data to a lot of different applications, often built by different people/companies, it is meant to be used as a server-to-server API and is not directly accessible by clients. Since all application servers are in the same internal network, we have almost infinite bandwidth at our disposal, which gives us the luxury to be able to add data to the API without worrying about bandwidth.
Nevertheless, we return the full set of product data fields only when retrieving a single product, not in the calls that return a list of products. Also, potentially long data fields that are not used for searching/faceting are kept separate and returned through different API calls.
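The idea above can be sketched as follows. This is a minimal illustration, not our actual implementation: the field names and the split between list, detail, and separate calls are made up for the example.

```python
# Illustrative product document; field names are hypothetical.
PRODUCT = {
    "id": "1234",
    "name": "Organic Milk 1l",
    "price": {"value": 1.95, "currency": "CHF"},
    "description": "A long marketing text ...",  # long field, detail call only
    "nutrition_table": {"energy_kj": 272},       # long field, separate API call
}

# Lightweight fields returned in product list calls.
LIST_FIELDS = {"id", "name", "price"}

def list_view(product):
    """Return only the lightweight fields used in product lists."""
    return {k: v for k, v in product.items() if k in LIST_FIELDS}

def detail_view(product):
    """Return all fields except the ones served by dedicated calls."""
    return {k: v for k, v in product.items() if k != "nutrition_table"}
```

Lists stay small enough to page through quickly, while long fields that few clients need don't bloat the common calls.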
The Web API layer allows us to keep the main API clean and avoid micro-optimizations for specific use cases. In the end, it allowed us to focus on bringing new data and new features to the API more quickly. At the beginning of the project there was a discussion about using an API header to switch between different API outputs; we are glad we didn't go that way. It is not the role of the API to know which client needs which data.
Fields of the API should be self-documenting as often as possible. It should not require more than reading the JSON response to understand the meaning of each field. If a field cannot be described in one sentence, then it's probably too complex and should be simplified/split.
Referential integrity never
The API does not guarantee referential integrity between API calls. For example, it could be that a shopping list contains a product that does not exist any more. Or it might be that a discount references a product that is not visible.
In case a resource does not exist, the client gets an answer with HTTP status 404 and is responsible for handling that properly. It would be really hard and time-consuming to guarantee referential integrity between calls and between APIs. Some of the data in the underlying APIs, as well as some of the imported data, is not validated.
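The client-side contract can be sketched as follows: a reference may point at a product that no longer exists, and the client drops it instead of failing. The in-memory `catalog` dict is a stand-in for the real HTTP API, with a missing key playing the role of an HTTP 404 response; names are illustrative.

```python
# Stand-in for the product API: a missing key means "404 Not Found".
catalog = {
    "p1": {"id": "p1", "name": "Organic Milk"},
    "p3": {"id": "p3", "name": "Rye Bread"},
}

def fetch_product(product_id):
    """Return the product, or None when it does not exist (a 404)."""
    return catalog.get(product_id)

def resolve_shopping_list(product_ids):
    """Resolve a shopping list, silently skipping vanished products.

    The API does not guarantee that every referenced product still
    exists, so a dangling reference is a normal condition, not an error.
    """
    products = (fetch_product(pid) for pid in product_ids)
    return [p for p in products if p is not None]
```

A shopping list referencing `["p1", "p2", "p3"]` resolves to two products; the vanished `p2` is simply not shown.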
Clients decide which data is relevant
Do not rely on imports to maintain data consistency. It can happen that an import does not run because some external source is not working, or just because sometimes sh*t happens.
Example: For the Migros discounts, a critical part of the whole API, we were relying on an import to set a discounted flag on the products once a discount becomes valid on them. That import failed a few times, resulting in discounts not being displayed at the right moment. We changed that by storing the discount object with a start date and end date on the product itself, letting the client decide when the product becomes discounted.
As a side effect, this has the advantage that the product can be cached for a longer period of time, since no re-index is required when the start or end date is passed.
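A minimal sketch of that client-side check: the product document carries the discount together with its validity window, so no re-index is needed when the window opens or closes. Field names and the document shape are illustrative, not the actual API schema.

```python
from datetime import datetime, timezone

# Illustrative product document with an embedded discount window.
product = {
    "id": "1234",
    "price": 4.90,
    "discount": {
        "price": 3.50,
        "start": "2014-06-01T00:00:00+00:00",
        "end": "2014-06-15T00:00:00+00:00",
    },
}

def current_price(product, now=None):
    """Return the discounted price while the discount is valid,
    else the regular price. The client, not an import job, decides."""
    now = now or datetime.now(timezone.utc)
    discount = product.get("discount")
    if discount:
        start = datetime.fromisoformat(discount["start"])
        end = datetime.fromisoformat(discount["end"])
        if start <= now < end:
            return discount["price"]
    return product["price"]
```

Because the decision happens at read time, the cached product document stays valid across the start and end of the discount.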
The second part of this blog focuses on the parts of the API that we struggle with. Some of them might be worth an entire post of their own, but we'll just list them briefly here.
The API calls are cached by Varnish, so most calls don't hit the application. However, cache invalidation is tricky: on a paged list of products, if an update changes a product's position in the list, all pages need to be invalidated. If we fail to do this, some pages would be cached with the old order while others already show the new one. Say a product at the beginning of the list vanishes: the product at position 1 of page 2 moves to position 10 of page 1. If page 2 is still cached, that product would be repeated when paging through the list.
To solve this issue, we plan to stop caching list requests. Instead, we'll make them return a list of Edge Side Include (ESI) tags with the product IDs, which Varnish then requests individually. This way, products are always stored individually in the cache and invalidation becomes easy: just invalidate the updated products.
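A sketch of what such a list response could look like. The URL scheme is hypothetical; the point is that the list call emits one ESI include tag per product ID instead of embedding product data, and Varnish assembles the page from individually cached fragments.

```python
def render_product_list(product_ids):
    """Render a product list as ESI include tags, one per product.

    Varnish resolves each tag against the product detail endpoint,
    so purging one product only evicts that single cached fragment;
    the list page reassembles itself from fresh fragments.
    """
    tags = [f'<esi:include src="/products/{pid}" />' for pid in product_ids]
    return "\n".join(tags)
```

With this scheme, invalidating product 7 never forces a re-render of every page that happens to contain it.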
Something you probably want to know when building an API is: how do clients use it? What parts are not understood or misused? Could we warn them about it?
Well, we don't really have a solution for that yet, but we are evaluating things like ApiGee, 3scale, or even collecting and analyzing our own logs. We'll keep you posted.
On a daily basis, we need to answer questions such as: Why is the price of this product xx.- CHF? Where does this name come from? As the data we give out never originates in the API itself, and as the same field can potentially come from a number of different underlying APIs, we need to be able to answer those questions.
For this, we are currently developing a set of debugging tools that help retrieve the data from the underlying sources and compare it with the data we give out. We are not yet able to tell which data source a particular field comes from, but that would be a logical next step. Had we known how often we would need to answer such questions, we would probably have built that earlier.
Error management & alerting
As the system imports millions of entries per day, from many different systems, a lot of information but also a lot of errors get logged. Sometimes they are real errors and sometimes they are false positives. Finding out whether an error is worth alerting someone is tricky. As I am writing this, my inbox is filling up with alert emails because of a problem we introduced a few hours ago.
There are a bunch of tools that help manage this, but they all do slightly different things. At the moment we use NewRelic, Graylog, Pingdom, and emails sent to a mailing list directly from the background cron jobs. We are looking into ways to lower the amount of duplicate alerting emails.
We are trying several things. We are building status tools that report the state of the imports: when they last ran and how many items they processed. We also log errors in the database for reporting purposes. On top of that, we triage incoming alerts and either fix the underlying problem or change the log level if appropriate.
Collateral damage on deployments
As the API becomes bigger and more complex, each deployment gets trickier and needs more attention. We are thinking about splitting the API into smaller chunks that can be deployed individually. This way, parts of the API that don't change often won't risk being taken down by a faulty deployment. Another advantage is that the more business-critical parts of the API can be scaled up and down independently.
The goals of the API are mostly reached: Migros gives all sorts of development partners access to the API, and it is embedded in a large number of applications, ranging from the Migros App to a few micro-sites used for marketing campaigns.
At every development iteration, we integrate new data and try to tackle the few challenges we have. Step by step, the API gets better and we learn new things with each new deployment. It is not easy to handle the rising complexity of the API, and the core principles help us keep that under control.
More to follow…