<?xml version="1.0" encoding="utf-8"?>
<!-- generator="Kirby" -->
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">

  <channel>
    <title>Mot-cl&#233;: deep-learning &#183; Blog &#183; Liip</title>
    <link>https://www.liip.ch/fr/blog/tags/deep-learning</link>
    <generator>Kirby</generator>
    <lastBuildDate>Wed, 15 Aug 2018 00:00:00 +0200</lastBuildDate>
    <atom:link href="https://www.liip.ch" rel="self" type="application/rss+xml" />

        <description>Articles du blog Liip avec le mot-cl&#233; &#8220;deep-learning&#8221;</description>
    
        <language>fr</language>
    
        <item>
      <title>Face detection - An overview and comparison of different solutions</title>
      <link>https://www.liip.ch/fr/blog/face-detection-an-overview-and-comparison-of-different-solutions-part1</link>
      <guid>https://www.liip.ch/fr/blog/face-detection-an-overview-and-comparison-of-different-solutions-part1</guid>
      <pubDate>Wed, 15 Aug 2018 00:00:00 +0200</pubDate>
      <description><![CDATA[<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/baguettebox.js/1.10.0/baguetteBox.min.css">
<style>
article.article figure a { color: transparent; }
.wysiwyg > figure { margin-top: -1rem; }
.wysiwyg figcaption { font-style: italic; margin-top: -20px; }
div.note { background-color: white; border-left: 2px solid currentColor; font-style: italic; padding: 0.5rem 20px; }
div.note p { margin: 0; }
</style>
<h2>Part 1: SaaS vendors</h2>
<div class="note">
  <p>This article is the first part of a series. Make sure to <a href="https://www.liip.ch/en/blog/tags/data-services.rss">subscribe</a> to receive future updates!<br><strong>TLDR:</strong> If you want to use the API's as fast as possible, directly check out my code on <a href="https://github.com/dpacassi/face-detection">GitHub</a>.</p>
</div>
<p>Have you ever needed face detection?<br />
Maybe to improve image cropping, to ensure that a profile picture really contains a face or simply to find the images in your dataset containing people <em>(well, faces in this case)</em>.<br />
Which face detection SaaS vendor would be the best for <em>your project</em>? Let’s have a deeper look into the differences in <strong>success rates</strong>, <strong>pricing</strong> and <strong>speed</strong>.</p>
<p>In this blog post I'll be analyzing the face detection APIs of:</p>
<ul>
<li><a href="https://aws.amazon.com/rekognition/">Amazon Rekognition</a></li>
<li><a href="https://cloud.google.com/vision/">Google Cloud Vision API</a></li>
<li><a href="https://www.ibm.com/watson/services/visual-recognition/">IBM Watson Visual Recognition</a></li>
<li><a href="https://azure.microsoft.com/en-us/services/cognitive-services/face/">Microsoft Face API</a></li>
</ul>
<h2>How does face detection work anyway?</h2>
<p>Before we dive into our analysis of the different solutions, let’s understand how face detection works today in the first place.</p>
<h3>The Viola–Jones Face Detection</h3>
<p>It’s the year 2001. Wikipedia is being launched by Jimmy Wales and Larry Sanger, the Netherlands becomes the first country in the world to make same-sex marriage legal and the world witnesses one of the most tragic terror attacks ever.<br />
At the same time, two bright minds, Paul Viola and Michael Jones, come together to start a revolution in computer vision.</p>
<p>Until 2001, face detection didn’t work very precisely nor very quickly. That was, until the <a href="https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/viola-cvpr-01.pdf">Viola-Jones Face Detection Framework</a> was proposed, which not only had a high success rate in detecting faces but could also do it in real time.</p>
<p>While face and object recognition challenges have existed since the ’90s, they surely boomed even more after the Viola–Jones paper was released.</p>
<h3>Deep Convolutional Neural Networks</h3>
<p>One such challenge is the <a href="http://www.image-net.org/challenges/LSVRC/">ImageNet Large Scale Visual Recognition Challenge</a>, which has been running since 2010. While in the first two years the top teams were mostly working with a combination of Fisher Vectors and Support Vector Machines, <strong>2012 changed everything</strong>.</p>
<p>The team from the University of Toronto (consisting of Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton) used a <strong>deep convolutional neural network</strong> for object detection for the first time. They scored first place with an error rate of 15.4%, while the second-placed team had a 26.2% error rate!<br />
A year later, in 2013, <strong>every team</strong> in the top 5 was using a deep convolutional neural network.</p>
<p>So, <strong>how does such a network work?</strong><br />
An easy-to-understand video was published by Google earlier this year:</p>
<figure class="embed-responsive embed-responsive--16/9"><iframe src="//youtube.com/embed/OcycT1Jwsns" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen="true"></iframe></figure>
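<p>To make the two core operations of such a network a bit more tangible, here is a toy sketch in plain Python of a single convolution followed by max pooling. The edge filter is hand-picked purely for illustration; a real network learns its filter values during training (and the code used for this post is PHP, not Python):</p>
<pre><code class="language-python">def convolve2d(image, kernel):
    """'Valid' 2D convolution (strictly speaking cross-correlation,
    as in most deep learning libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep the strongest response per region."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

# A tiny 5x5 "image" with a dark-to-bright vertical edge,
# and a hand-picked vertical edge filter:
image = [[0, 0, 1, 1, 1]] * 5
edge_filter = [[1, 0, -1]] * 3

feature_map = convolve2d(image, edge_filter)  # [[-3, -3, 0]] * 3
pooled = max_pool(feature_map)                # [[-3]]
</code></pre>
<p>The filter responds strongly exactly where the edge sits; stacking many such learned filters and pooling steps is what makes up a deep convolutional network.</p>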
<h3>What do Amazon, Google, IBM and Microsoft use today?</h3>
<p>Since then, not much has changed. Today’s vendors still use deep convolutional neural networks, though probably combined with other deep learning techniques.<br />
Obviously, they don’t publish exactly how their visual recognition techniques work. The information I found was:</p>
<ul>
<li>Amazon: <a href="https://aws.amazon.com/rekognition/faqs/">Deep Neural Networks</a></li>
<li>Google: <a href="https://youtu.be/OcycT1Jwsns?t=2m41s">Convolutional Neural Network</a></li>
<li>IBM: <a href="https://www.ibm.com/cloud/watson-visual-recognition">Deep Learning algorithms</a></li>
<li>Microsoft: <a href="https://docs.microsoft.com/en-us/azure/cognitive-services/face/overview">Face algorithms</a></li>
</ul>
<p>While they all sound very similar, there are some differences in the results.<br />
Before we test them, let’s have a look at the pricing models first though!</p>
<h2>Pricing</h2>
<p><a href="https://aws.amazon.com/rekognition/pricing/">Amazon</a>, <a href="https://cloud.google.com/vision/pricing">Google</a> and <a href="https://azure.microsoft.com/en-us/pricing/details/cognitive-services/face-api/">Microsoft</a> have a similar pricing model, meaning that with increasing usage the price per detection drops.<br />
With <a href="https://www.ibm.com/cloud/watson-visual-recognition/pricing">IBM</a> however, you always pay the same price per API call after your free tier usage volume is exhausted.<br />
<a href="https://azure.microsoft.com/en-us/pricing/details/cognitive-services/face-api/">Microsoft</a> provides the best free tier, allowing you to process <strong>30'000 images</strong> per month for <strong>free</strong>.<br />
If you need more detections though, you have to use their standard tier, where you pay from the first image on.</p>
<h3>Price comparison</h3>
<p>That being said, let’s calculate the costs for three different profile types.</p>
<ul>
<li>Profile A: Small startup/business processing 1’000 images per month</li>
<li>Profile B: Digital vendor with lots of images, processing 100’000 images per month</li>
<li>Profile C: Data center processing 10’000’000 images per month</li>
</ul>
<table>
<thead>
<tr>
<th></th>
<th style="text-align: right;">Amazon</th>
<th style="text-align: right;">Google</th>
<th style="text-align: right;">IBM</th>
<th style="text-align: right;">Microsoft</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Profile A</strong></td>
<td style="text-align: right;">$1.00 USD</td>
<td style="text-align: right;"><strong>Free</strong></td>
<td style="text-align: right;"><strong>Free</strong></td>
<td style="text-align: right;"><strong>Free</strong></td>
</tr>
<tr>
<td><strong>Profile B</strong></td>
<td style="text-align: right;"><strong>$100.00 USD</strong></td>
<td style="text-align: right;">$148.50 USD</td>
<td style="text-align: right;">$396.00 USD</td>
<td style="text-align: right;"><strong>$100.00 USD</strong></td>
</tr>
<tr>
<td><strong>Profile C</strong></td>
<td style="text-align: right;">$8’200.00 USD</td>
<td style="text-align: right;">$10’498.50 USD</td>
<td style="text-align: right;">$39’996.00 USD</td>
<td style="text-align: right;"><strong>$7’200.00 USD</strong></td>
</tr>
</tbody>
</table>
<p>Looking at the numbers, for small customers there’s not much of a difference in pricing. While Amazon charges you starting from the first image, having 1’000 images processed still only costs one dollar. However, if you don’t want to pay anything, then Google, IBM or Microsoft will be the way to go.</p>
<div class="note">
  <p><strong>Note:</strong> Amazon offers a free tier on which you can process 5’000 images per month for the <strong>first 12 months for free</strong>! After this 12-month trial however, you’ll start paying from the first image.</p>
</div>
<h4>Large API usage</h4>
<p>If you really need to process millions of images, it's important to compare how every vendor scales.<br />
Here's a list of the <strong>minimum</strong> price you pay for the API usage after a certain amount of images.</p>
<ul>
<li>IBM always charges you $4.00 USD per 1’000 images (no volume scaling)</li>
<li>Google scales down to $0.60 USD (per 1’000 images) after the 5’000’000th image</li>
<li>Amazon scales down to $0.40 USD (per 1’000 images) after the 100’000’000th image</li>
<li>Microsoft scales down to $0.40 USD (per 1’000 images) after the 100’000’000th image</li>
</ul>
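<p>As a sanity check, the table above can be reproduced with a small tiered-pricing helper. The tier boundaries below (first 1'000 images free per month, Google dropping from $1.50 to $0.60 per 1'000 images after the 5'000'000th image) are reconstructed from the numbers in this post; the vendors' actual price sheets may differ, so treat this as an illustration:</p>
<pre><code class="language-python">def tiered_cost(images, free, tiers):
    """Monthly cost in USD. `tiers` is a list of (image_count_upper_bound,
    price_per_1000) pairs; the last bound may be None for "unlimited"."""
    cost, billed = 0.0, free
    for bound, price_per_1000 in tiers:
        top = images if bound is None else min(images, bound)
        if top > billed:
            cost += (top - billed) / 1000 * price_per_1000
            billed = top
    return cost

def google_cost(images):
    # $1.50 per 1'000 up to 5M images, then $0.60 per 1'000
    return tiered_cost(images, 1_000, [(5_000_000, 1.50), (None, 0.60)])

def ibm_cost(images):
    # flat $4.00 per 1'000, no volume scaling
    return tiered_cost(images, 1_000, [(None, 4.00)])

google_cost(100_000)    # Profile B: 148.5
ibm_cost(10_000_000)    # Profile C: 39996.0
</code></pre>
<p>These figures match the price comparison table above, which is a good sign that the reconstructed tiers are at least consistent with it.</p>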
<p>So, comparing prices, Microsoft (and Amazon) seem to be the winners.<br />
But can they also score in success rate, speed and integration? Let’s find out!</p>
<h2>Hands on! Let’s try out the different API’s</h2>
<p>Enough theory and numbers, let’s dive into coding! You can find all code used here in my <a href="https://github.com/dpacassi/face-detection">GitHub repository</a>.</p>
<h3>Setting up our image dataset</h3>
<p>First things first. Before we scan images for faces, let’s set up our image dataset.<br />
For this blog post I’ve downloaded 33 images from <a href="https://www.pexels.com/">pexels.com</a>, many thanks to the contributors/photographers of the images and also to Pexels!<br />
The images have been committed to the GitHub repository, so you don't need to search for any images if you simply want to start playing with the API's.</p>
<h3>Writing a basic test framework</h3>
<p>Framework might be the wrong word, as my custom code only consists of two classes. However, these two classes help me to easily analyze image (meta)data and keep as little code as possible in the different implementations.</p>
<p>A very short description: The <a href="https://github.com/dpacassi/face-detection/blob/master/src/FaceDetectionClient.php">FaceDetectionClient</a> class holds general information about where the images are stored, vendor details and all processed images (as <a href="https://github.com/dpacassi/face-detection/blob/master/src/FaceDetectionImage.php">FaceDetectionImage</a> objects).</p>
<h3>Comparing the vendors’ SDKs</h3>
<p>As I’m most familiar with PHP, I've decided to stick to PHP for this test. I still want to point out which SDKs each vendor provides (as of today):</p>
<table>
<thead>
<tr>
<th><a href="https://aws.amazon.com/rekognition/resources/">Amazon</a></th>
<th><a href="https://cloud.google.com/vision/docs/libraries">Google</a></th>
<th><a href="https://www.ibm.com/watson/developercloud/visual-recognition/api/v3/curl.html?curl#introduction">IBM</a></th>
<th><a href="https://docs.microsoft.com/en-us/azure/cognitive-services/face/quickstarts/csharp">Microsoft</a></th>
</tr>
</thead>
<tbody>
<tr>
<td><ul><li>Android</li><li>JavaScript</li><li>iOS</li><li>Java</li><li>.NET</li><li>Node.js</li><li>PHP</li><li>Ruby</li><li>Python</li></ul></td>
<td><ul><li>C#</li><li>Go</li><li>Java</li><li>Node.js</li><li>PHP</li><li>Python</li><li>Ruby</li><li>cURL examples</li></ul></td>
<td><ul><li>Node.js</li><li>Java</li><li>Python</li><li>cURL examples</li></ul></td>
<td><ul><li>C#</li><li>Go</li><li>Java</li><li>JavaScript</li><li>Node.js</li><li>PHP</li><li>Python</li><li>Ruby</li><li>cURL examples</li></ul></td>
</tr>
</tbody>
</table>
<div class="note">
  <p><strong>Note:</strong> Microsoft doesn't actually provide any SDKs; they do offer code examples for the technologies listed above though.</p>
</div>
<p>If you’ve read the lists carefully, you might have noticed that IBM not only offers the fewest SDKs but also no SDK for PHP.<br />
However, that wasn’t a big issue for me as they provide cURL examples which helped me to easily write <a href="https://github.com/dpacassi/face-detection/blob/master/src/solutions/ibm-watson-visual-recognition/VisualRecognitionV3.php">37 lines</a> of code for a (very basic) IBM Visual Recognition client class.</p>
<h3>Integrating the vendors API’s</h3>
<p>Getting the SDKs is easy. Even easier with Composer. However, I did notice some things that could be improved to make a developer’s life easier.</p>
<h4>Amazon</h4>
<p>I've started with the Amazon Rekognition API. Going through their <a href="https://aws.amazon.com/documentation/sdk-for-php/">documentation</a>, I felt a bit lost at the beginning. Not only did I miss some basic examples (or was I just unable to find them?), but I also had the feeling that I had to click quite a few times until I found what I was looking for. In one case I even gave up and simply got the information by directly inspecting their SDK source code.<br />
On the other hand, it could just be me? <strong>Let me know</strong> if Amazon Rekognition was easy (or difficult) for you to integrate!</p>
<div class="note">
  <p><strong>Note:</strong> While Google and IBM return the bounding box <strong>coordinates</strong>, Amazon returns them as a <strong>ratio</strong> of the overall image width/height.
    I have no idea why that is, but it's not a big deal. You can write a helper function to get the coordinates from the ratio, <a href="https://github.com/dpacassi/face-detection/blob/master/src/FaceDetectionImage.php#L292">just as I did</a>.</p>
</div>
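<p>To illustrate the note above, here is a small sketch of such a helper in Python (my actual helper is PHP). The <code>Width</code>/<code>Height</code>/<code>Left</code>/<code>Top</code> keys follow Rekognition's <code>BoundingBox</code> format; the example values are made up:</p>
<pre><code class="language-python">def ratio_box_to_pixels(box, image_width, image_height):
    """Convert a ratio-based bounding box (as returned by Amazon Rekognition)
    into absolute pixel coordinates."""
    return {
        "left":   round(box["Left"] * image_width),
        "top":    round(box["Top"] * image_height),
        "width":  round(box["Width"] * image_width),
        "height": round(box["Height"] * image_height),
    }

# Hypothetical box on a 1920x1080 image:
box = {"Left": 0.25, "Top": 0.10, "Width": 0.20, "Height": 0.30}
ratio_box_to_pixels(box, 1920, 1080)
# -> {'left': 480, 'top': 108, 'width': 384, 'height': 324}
</code></pre>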
<h4>Google</h4>
<p>Next came Google. In comparison with Amazon, they do provide <a href="https://cloud.google.com/vision/docs/libraries#client-libraries-install-php">examples</a>, which helped me a lot! Or maybe I was just already in the <em>“investigating different SDKs”</em> mindset.<br />
Whatever the case may be, integrating the SDK felt a lot simpler, and I also needed fewer clicks to retrieve the information I was looking for.</p>
<h4>IBM</h4>
<p>As stated before, IBM doesn’t (yet?) provide an SDK for PHP. However, with the provided cURL examples, I had a custom client set up in no time. There’s not much you can do wrong when a cURL example is provided to you!</p>
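<p>For illustration, a similarly minimal client can be sketched in Python. The endpoint path and the <code>api_key</code>/<code>version</code> query parameters reflect IBM's v3 cURL examples from the time of writing and may well have changed since (IBM later moved to IAM token authentication), so treat this as a sketch rather than a reference. No request is sent here; the function only builds the request URL, with the image going into the POST body as multipart form data:</p>
<pre><code class="language-python">from urllib.parse import urlencode

API_BASE = "https://gateway.watsonplatform.net/visual-recognition/api"

def detect_faces_url(api_key, version="2018-03-19"):
    """Build the v3 detect_faces request URL; the image itself is sent
    as multipart form data in the POST body."""
    query = urlencode({"api_key": api_key, "version": version})
    return f"{API_BASE}/v3/detect_faces?{query}"

detect_faces_url("my-api-key")
</code></pre>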
<h4>Microsoft</h4>
<p>Looking at Microsoft's code example for PHP (which uses Pear's <a href="http://pear.php.net/package/HTTP_Request2">HTTP_Request2</a> package), I ended up writing my own client for Microsoft's Face API.<br />
<em>I guess I'm simply a <strong>cURL</strong> person.</em></p>
<h2>Inter-rater reliability</h2>
<p>Before we compare the different face detection APIs, let's first scan the images ourselves! How many faces would a <strong>human</strong> be able to detect?<br />
If you've already had a look at my dataset, you might have seen a few images containing <em>tricky</em> faces. What do I mean by <em>&quot;tricky&quot;</em>? Well, when you e.g. only see a small part of a face and/or the face is at an uncommon angle.</p>
<h3>Time for a little experiment</h3>
<p>I went over all images and wrote down how many faces I thought I had detected. I would use this number to calculate a vendor's success rate for an image and see if it was able to detect as many faces as I did.<br />
However, setting the expected number of faces solely by myself seemed a bit too biased. I needed more opinions.<br />
So I kindly asked three coworkers to go through my images and tell me how many faces <strong>they</strong> would detect.<br />
The only task I gave them was <em>&quot;Tell me how many faces, and not heads, you're able to detect&quot;</em>. I didn't define any rules; I wanted to give them every imaginable freedom for this task.</p>
<h3>What is a face?</h3>
<p>When I went through the images detecting faces, I just counted every face of which at least <em>around</em> a quarter was visible. Interestingly, my coworkers came up with slightly different definitions of a face.</p>
<ul>
<li>Coworker 1: I've also counted faces which I mostly wasn't able to see. But I did see the body, so my mind told me that <strong>there is a face</strong></li>
<li>Coworker 2: If I was able to see the eyes, nose and mouth, I've counted it as a face</li>
<li>Coworker 3: I've only counted faces which I would be able to recognize in another image again</li>
</ul>
<h4>Example image #267885</h4>
<figure><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/dataset/267885.jpg"><figure><img src="https://liip.rokka.io/www_inarticle/d9c45c/267885.jpg" alt=""></figure></a>
<figcaption>My coworkers and I detected 10, 13, 16 and 16 faces respectively in this image. I've decided to continue with the average, thus 14.</figcaption></figure>
<p>It was very interesting to me to see how everyone came up with different techniques regarding face detection.<br />
That being said, I've used the average face count of my results and the ones from my coworkers to set the <em>expected number of faces detected</em> for an image.</p>
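<p>In code, deriving the expected face count per image boils down to a one-liner; using the four counts for image #267885 from the caption above:</p>
<pre><code class="language-python"># Expected faces = average of the four human counts, rounded to an integer.
counts = [10, 13, 16, 16]  # me + three coworkers
expected_faces = round(sum(counts) / len(counts))  # 13.75 -> 14
</code></pre>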
<h2>Comparing the results</h2>
<p>Now that we have the dataset and the code set up, let’s process all images by all competitors and compare the results.<br />
My <code>FaceDetectionClient</code> class also comes with a handy CSV export which provides some analytical data.</p>
<p>This is the first impression I've received:</p>
<table>
<thead>
<tr>
<th></th>
<th style="text-align: right;">Amazon</th>
<th style="text-align: right;">Google</th>
<th style="text-align: right;">IBM</th>
<th style="text-align: right;">Microsoft</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Total faces detected</strong></td>
<td style="text-align: right;"><strong>99 / 188</strong><br /><strong>(52.66 %)</strong></td>
<td style="text-align: right;">76 / 188<br />(40.43 %)</td>
<td style="text-align: right;">74 / 188<br />(39.36 %)</td>
<td style="text-align: right;">33 / 188<br />(17.55 %)</td>
</tr>
<tr>
<td><strong>Total processing time (ms)</strong></td>
<td style="text-align: right;">57007</td>
<td style="text-align: right;">43977</td>
<td style="text-align: right;">72004</td>
<td style="text-align: right;"><strong>40417</strong></td>
</tr>
<tr>
<td><strong>Average processing time (ms)</strong></td>
<td style="text-align: right;">1727</td>
<td style="text-align: right;">1333</td>
<td style="text-align: right;">2182</td>
<td style="text-align: right;"><strong>1225</strong></td>
</tr>
</tbody>
</table>
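<p>The success rates in the table follow directly from the raw face counts; a quick check in Python:</p>
<pre><code class="language-python"># Faces detected per vendor, out of 188 expected faces across all 33 images.
detected = {"Amazon": 99, "Google": 76, "IBM": 74, "Microsoft": 33}
EXPECTED_TOTAL = 188

success_rates = {vendor: round(count / EXPECTED_TOTAL * 100, 2)
                 for vendor, count in detected.items()}
# {'Amazon': 52.66, 'Google': 40.43, 'IBM': 39.36, 'Microsoft': 17.55}
</code></pre>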
<h3>Very low success rates?</h3>
<p>Amazon was able to detect 52.66 % of the faces defined, Google 40.43 %, IBM 39.36 % and Microsoft even just 17.55 %.<br />
How come the <em>low</em> success rates? Well, first off, I do have lots of tricky images in my dataset.<br />
And secondly, we should not forget that we, as humans, have a couple of million years’ worth of evolutionary context to help us understand what something is.<br />
While many people believe that we've mastered face detection in tech already, there's still room for improvement!</p>
<h3>The need for speed</h3>
<p>While Amazon was able to detect the most faces, Google’s and Microsoft’s processing times were clearly faster than the others’. However, on average they still need <strong>longer than one second</strong> to process one image from our dataset.<br />
Sending the image data from our computer/server to another server surely takes its toll on performance.</p>
<div class="note">
<p><strong>Note:</strong> We’ll find out in the next part of the series if (local) open source libraries could do the same job faster.</p>
</div>
<h3>Groups of people with (relatively) small faces</h3>
<p>After analyzing the images, Amazon seems to be quite good at detecting faces in groups of people where the faces are <em>(relatively)</em> small.</p>
<h4>A small excerpt</h4>
<table>
<thead>
<tr>
<th>Image #</th>
<th style="text-align: right;">Amazon<br />(faces detected)</th>
<th style="text-align: right;">Google<br />(faces detected)</th>
<th style="text-align: right;">IBM<br />(faces detected)</th>
<th style="text-align: right;">Microsoft<br />(faces detected)</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>109919</strong></td>
<td style="text-align: right;"><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/amazon-rekognition/109919.jpg">15</a></td>
<td style="text-align: right;"><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/google-cloud-vision-api/109919.jpg">10</a></td>
<td style="text-align: right;"><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/ibm-watson-visual-recognition/109919.jpg">8</a></td>
<td style="text-align: right;"><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/microsoft-azure-face-api/109919.jpg">8</a></td>
</tr>
<tr>
<td><strong>34692</strong></td>
<td style="text-align: right;"><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/amazon-rekognition/34692.jpg">10</a></td>
<td style="text-align: right;"><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/google-cloud-vision-api/34692.jpg">8</a></td>
<td style="text-align: right;"><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/ibm-watson-visual-recognition/34692.jpg">6</a></td>
<td style="text-align: right;"><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/microsoft-azure-face-api/34692.jpg">8</a></td>
</tr>
<tr>
<td><strong>889545</strong></td>
<td style="text-align: right;"><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/amazon-rekognition/889545.jpg">10</a></td>
<td style="text-align: right;"><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/google-cloud-vision-api/889545.jpg">4</a></td>
<td style="text-align: right;"><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/ibm-watson-visual-recognition/889545.jpg">none</a></td>
<td style="text-align: right;"><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/microsoft-azure-face-api/889545.jpg">none</a></td>
</tr>
</tbody>
</table>
<h4>Example image #889545 by Amazon</h4>
<figure><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/amazon-rekognition/889545.jpg"><figure><img src="https://liip.rokka.io/www_inarticle/3b5d54/amazon-889545.jpg" alt=""></figure></a>
<figcaption>Amazon was able to detect 10 faces in this image, while Google only found 4, IBM 0 and Microsoft 0.</figcaption></figure>
<h3>Different angles, incomplete faces</h3>
<p>So, does this mean that IBM is simply worse than its competitors? Not at all. While Amazon might be good at detecting small faces in group photos, IBM has another strength:<br />
<strong>Difficult images</strong>.</p>
<p>What do I mean by that? Well, images where the head is at an uncommon angle or maybe not shown completely.<br />
Here are three examples from our dataset for which IBM was the <strong>sole vendor</strong> to detect the face.</p>
<h4>Example image #356147 by IBM</h4>
<figure><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/ibm-watson-visual-recognition/356147.jpg"><figure><img src="https://liip.rokka.io/www_inarticle/6b4a47/ibm-356147.jpg" alt=""></figure></a>
<figcaption>Image with a face only detected by IBM.</figcaption></figure>
<h4>Example image #403448 by IBM</h4>
<figure><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/ibm-watson-visual-recognition/403448.jpg"><figure><img src="https://liip.rokka.io/www_inarticle/d50d19/ibm-403448.jpg" alt=""></figure></a>
<figcaption>Image with a face only detected by IBM.</figcaption></figure>
<h4>Example image #761963 by IBM</h4>
<figure><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/ibm-watson-visual-recognition/761963.jpg"><figure><img src="https://liip.rokka.io/www_inarticle/9308a5/ibm-761963.jpg" alt=""></figure></a>
<figcaption>Image with a face only detected by IBM.</figcaption></figure>
<h3>Bounding boxes</h3>
<p>Yes, the resulting bounding boxes differ as well.<br />
Amazon, IBM and Microsoft are very similar here and return the bounding box of a person’s face.<br />
Google is slightly different and focuses not on someone’s face but on the complete head <em>(which makes more sense to me?)</em>.</p>
<h4>Example image #933964 by Google</h4>
<figure><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/google-cloud-vision-api/933964.jpg"><figure><img src="https://liip.rokka.io/www_inarticle/a853e2/google-933964.jpg" alt=""></figure></a>
<figcaption>Google returns bounding boxes covering most of the head, not just the face.</figcaption></figure>
<h4>Example image #34692 by Microsoft</h4>
<figure><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/microsoft-azure-face-api/34692.jpg"><figure><img src="https://liip.rokka.io/www_inarticle/9487a2/microsoft-34692.jpg" alt=""></figure></a>
<figcaption>Microsoft (as well as IBM and Amazon) focus on the face instead of the head.</figcaption></figure>
<p>What is your opinion on this? Should an API return the bounding boxes to the person's face or to the person's head?</p>
<h3>False positives</h3>
<p>Even though our dataset was quite small (33 images), it contains two images on which face detection failed for some vendors.</p>
<h4>Example image #167637 by Amazon</h4>
<figure><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/amazon-rekognition/167637.jpg"><figure><img src="https://liip.rokka.io/www_inarticle/dd23f8/amazon-167637.jpg" alt=""></figure></a>
<figcaption>Find the face!</figcaption></figure>
<p>In this <em>(nice)</em> picture of a band, Amazon and Google both failed to detect the face of the front man and detected his <strong>tattoo(!)</strong> instead. Microsoft didn't detect any face at all.<br />
Only IBM succeeded and correctly detected the front man’s face (and not his tattoo).<br />
<strong>Well played IBM!</strong></p>
<h4>Example image #948199 by Google</h4>
<figure><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/google-cloud-vision-api/948199.jpg"><figure><img src="https://liip.rokka.io/www_inarticle/ab7c0e/google-948199.jpg" alt=""></figure></a>
<figcaption>Two-Face, is that you?</figcaption></figure>
<p>In this image, Google somehow detected two faces in the same region. Or maybe the network sees something that is invisible to us, which would be even scarier.</p>
<h3>Wait, there is more!</h3>
<p>You can find the <strong>complete dataset</strong> with 33 source images, 4x 33 processed images and the metadata CSV export on <a href="https://github.com/dpacassi/face-detection">GitHub</a>.<br />
Not only that: if you clone the repository and enter your <strong>API keys</strong>, you can even process your own dataset!<br />
Last but not least, if you know of any other face detection API, feel free to <strong>send me a pull request</strong> to include it in the repository!</p>
<h2>How come the different results?</h2>
<p>As stated at the beginning of this blog post, none of the vendors completely reveal how they implemented face detection.<br />
Let’s pretend for a second that they use the same algorithms and network configuration - they could still end up with different results depending on the <strong>training data</strong> they used to train their neural network.</p>
<p>There might also be some wrappers around the neural networks. Maybe IBM simply rotates the image three times and processes it four times in total to also find uncommon face angles?<br />
<em>We may never find out.</em></p>
<h2>A last note</h2>
<p>Please keep in mind that I only focused on <strong>face detection</strong>. It’s not to be confused with <strong>face recognition</strong> (which can tell whether a certain face belongs to a certain person), and I also didn’t dive deeper into other features the APIs may provide.<br />
Amazon, for example, tells you if someone is smiling, has a beard or has their eyes open or closed. Google can tell you the likelihood that someone is surprised or wearing headwear. IBM tries to provide an approximate age range of a person, including their likely gender. And Microsoft can tell you if a person is wearing any <strong>makeup</strong>.</p>
<p>The above points are only a few examples of what these vendors can offer. If you need more than just basic face detection, I highly recommend reading and testing their specs according to your purpose.</p>
<h2>Conclusion</h2>
<p>So, which vendor is now <strong>the best</strong>? There is really no right answer to this. Every vendor has its strengths and weaknesses. But for <em>“common”</em> images, Amazon, Google and IBM should do a pretty good job.<br />
Microsoft didn't really convince me though. With 33 out of 188 faces detected, they had the lowest success rate of all four vendors.</p>
<h4>Example image #1181562 by Google</h4>
<figure><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/google-cloud-vision-api/1181562.jpg"><figure><img src="https://liip.rokka.io/www_inarticle/5de205/google-1181562.jpg" alt=""></figure></a>
<figcaption>For "common" images, Amazon, Google and IBM will be able to detect all faces.</figcaption></figure>
<h4>Example image #1181562 by Microsoft</h4>
<figure><a href="https://raw.githubusercontent.com/dpacassi/face-detection/master/example-images/microsoft-azure-face-api/1181562.jpg"><figure><img src="https://liip.rokka.io/www_inarticle/8deaf6/microsoft-1181562.jpg" alt=""></figure></a>
<figcaption style="font-weight: bold;">Microsoft, y u no detect faces?</figcaption></figure>
<h3>What about OpenCV and other open source alternatives?</h3>
<p>This question will be answered in the next part of this series. Feel free to subscribe to our <a href="https://www.liip.ch/en/blog/tags/data-services.rss">data science RSS feed</a> to receive related updates in the future and thank you so much for reading!</p>
<script src="https://cdnjs.cloudflare.com/ajax/libs/baguettebox.js/1.10.0/baguetteBox.min.js"></script>
<script type="text/javascript">
baguetteBox.run('article.article figure');
</script>]]></description>
<enclosure url="http://liip.rokka.io/www_card_2/30b0a3/amazon-109919-header.jpg" length="858636" type="image/jpeg" />
          </item>
        <item>
      <title>Zoo Pokedex Part 2: Hands on with Keras and Resnet50</title>
      <link>https://www.liip.ch/fr/blog/zoo-pokedex-part-2-hands-on-with-keras-and-resnet50</link>
      <guid>https://www.liip.ch/fr/blog/zoo-pokedex-part-2-hands-on-with-keras-and-resnet50</guid>
      <pubDate>Tue, 07 Aug 2018 00:00:00 +0200</pubDate>
      <description><![CDATA[<h3>Short Recap from Part 1</h3>
<p>In the <a href="https://www.liip.ch/en/blog/poke-zoo-or-making-deep-learning-tell-oryxes-apart-from-lamas-in-a-zoo-part-1-the-idea-and-concepts">last blog post</a> I briefly discussed the potential of using deep learning to build a zoo Pokedex app that could motivate zoo goers to engage with the animals and the information. We also discussed the <a href="http://image-net.org">ImageNet competition</a> and how deep learning has drastically changed the image recognition game. We went over the two main tricks that deep learning architectures use, namely convolutions and pooling, which allow such networks to perform extremely well. Last but not least, we realized that all you have to do these days is stand on the shoulders of giants by using existing networks (e.g. Resnet50) to write applications with similar state-of-the-art precision. So in this blog post, it’s finally time to put these giants to work for us.</p>
<h3>Goal</h3>
<p>The goal is to write an image detection app that can distinguish the animals in our zoo. For obvious reasons I will keep our zoo really small: it contains only two types of animals:</p>
<ul>
<li>Oryxes and</li>
<li>Llamas (why there is a second L in English is beyond my comprehension).</li>
</ul>
<figure><img src="https://liip.rokka.io/www_inarticle/8c74f3/lamavsoryx.jpg" alt=""></figure>
<p>Why those animals? Well, they seem fluffy, but mostly because the original imagenet competition does not contain them. So it represents a quite realistic scenario: a zoo with animals that need to be distinguished, and existing deep learning networks that have not been trained on them. I picked these two kinds of animals more or less at random, just to have something to show. (Actually, I checked that the Zürich Zoo has both, so I can take our little app and test it in real life, but that's already part of the third blog post on this topic.)</p>
<h3>Getting the data</h3>
<p>Getting data is easier than ever in the age of the internet. In the '90s I would probably have had to go to some archive, or even worse take my own camera and shoot lots and lots of pictures of these animals to use as training material. Today I can just ask Google to show me some. But wait - if you have actually tried using Google Image search as a resource, you will realize that downloading images from it in bulk is a pain in the ass. The image API is highly limited in terms of what you can get for free, and writing scrapers that download such images is not really fun. That's why I went to the competition instead and used Microsoft's cognitive services to download images for each animal. </p>
<h3>Downloading image data from Microsoft</h3>
<p>Microsoft offers quite a convenient image search API via their <a href="https://azure.microsoft.com/en-us/services/cognitive-services/">cognitive services</a>. You can sign up there to get a free tier for a couple of days, which should be enough to get you started. All you basically need is an API key; then you can already start downloading images to create your datasets. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/b79e82/microsoft.jpg" alt=""></figure>
<pre><code class="language-ruby "># Code to download images via Microsoft cognitive api
require 'httparty'
require 'fileutils'

API_KEY = "##############"
SEARCH_TERM = "alpaka"
QUERY = "alpaka"
API_ENDPOINT  = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"
FOLDER = "datasets"
BATCH_SIZE = 50
MAX = 1000

# Make the dir
FileUtils::mkdir_p "#{FOLDER}/#{SEARCH_TERM}"

# Make the request
headers = {'Ocp-Apim-Subscription-Key' =&gt; API_KEY}
query = {"q": QUERY, "offset": 0, "count": BATCH_SIZE}
puts("Searching for #{SEARCH_TERM}")
response = HTTParty.get(API_ENDPOINT,:query =&gt; query,:headers =&gt; headers)
total_matches = response["totalEstimatedMatches"]

i = 0
while response["nextOffset"] != nil &amp;&amp; i &lt; MAX
    response["value"].each do |image|
        i += 1
        content_url = image["contentUrl"]
        ext = content_url.scan(/jpg$|gif$|png$/)[0]
        file_name = "#{FOLDER}/#{SEARCH_TERM}/#{i}.#{ext}"
        next if ext == nil
        next if File.file?(file_name)
        begin
            puts("Offset #{response["nextOffset"]}. Downloading #{content_url}")
            r = HTTParty.get(content_url)
            File.open(file_name, 'wb') { |file| file.write(r.body) }
        rescue
            puts "Error fetching #{content_url}"
        end
    end
    query = {"q": QUERY, "offset": i+BATCH_SIZE, "count": BATCH_SIZE}
    response = HTTParty.get(API_ENDPOINT,:query =&gt; query,:headers =&gt; headers)
end</code></pre>
<p>The Ruby code above simply uses the API in batches, downloads llamas and oryxes into separate directories, and names the files accordingly. What you don’t see is that I then went through these folders by hand and removed images that were not really the animal but, for example, a fluffy shoe that showed up in the search results. I also de-duped each folder. You can scan the images quickly on your Mac using the thumbnail preview, or use an image browser you are familiar with to do the job. </p>
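<p>The de-duping can also be partly automated. Here is a minimal sketch in plain Python (stdlib only; the folder layout is whatever your download script produced) that groups files by a hash of their content:</p>
<pre><code class="language-python"># Find files with byte-identical content in a folder
import hashlib
import os

def find_duplicates(folder):
    seen = {}        # content hash -> first path seen with that content
    duplicates = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, 'rb') as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen:
            duplicates.append(path)  # same bytes as an earlier file
        else:
            seen[digest] = path
    return duplicates</code></pre>
<p>Anything the function returns can safely be deleted. Note however that re-encoded or resized copies of the same picture have different bytes and will slip through a plain content hash, so a final visual pass is still worthwhile.</p>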
<h3>Problem with not enough data</h3>
<p>Ignoring probable copyright issues (am I allowed to train my neural network on copyrighted material?) and depending on what you want to achieve, you might run into the problem that it’s not really that easy to gather 500 or 5000 images of oryxes and llamas. Also, to make things a bit challenging, I tried to see if it was possible to train the neural networks using only 100 examples of each animal, while using roughly 50 examples to validate the accuracy of the networks. </p>
<p>Normally everyone would tell you that you definitely need more image material, because deep learning networks need a lot of data to become useful. But in our case we are going to use two dirty tricks to try to get away with our really small collection: data augmentation and reuse of already pre-trained networks. </p>
<h3>Image data generation</h3>
<p>A really handy trick that is prevalent everywhere now is to take the images you already have and alter them slightly in an artificial way: rotating them, changing the perspective, zooming in on them. What you end up with is that instead of one image of a llama, you have 20 pictures of that animal, each slightly different from the original. This trick lets you create more variation without actually having to download more material. It works quite well, but is definitely inferior to simply having more data.  </p>
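<p>As a toy illustration of what such a transformation does, here is a horizontal flip written out in plain Python on a tiny made-up "image", represented as a list of rows of gray values:</p>
<pre><code class="language-python"># A horizontal flip: mirror every row of pixels left-to-right
def horizontal_flip(image):
    return [list(reversed(row)) for row in image]

image = [[10, 20, 30],
         [40, 50, 60]]
flipped = horizontal_flip(image)  # the left edge becomes the right edge</code></pre>
<p>Rotations, shears and zooms are just more elaborate remappings of the same kind, and we don't have to write any of them ourselves, as we'll see next.</p>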
<p>We will be using <a href="http://keras.io">Keras</a>, a deep learning library on top of TensorFlow, which we have used before in <a href="https://www.liip.ch/en/blog/tensorflow-and-tflearn-or-can-deep-learning-predict-if-dicaprio-could-have-survived-the-titanic">other</a> blog posts to <a href="https://www.liip.ch/en/blog/sentiment-detection-with-keras-word-embeddings-and-lstm-deep-learning-networks">create a good sentiment detection</a>. In the domain of image recognition Keras can really show its strength, since it already has built-in methods to do image data generation for us, without having to involve any third-party tools. </p>
<pre><code class="language-python"># Creating an image data generator
from keras.preprocessing.image import ImageDataGenerator
from keras.applications.resnet50 import preprocess_input

train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input,
    shear_range=0.2, zoom_range=0.2, horizontal_flip=True)</code></pre>
<p>As you can see above, we have created an image data generator that uses shearing, zooming and horizontal flipping to change our llama pictures. We don’t do a vertical flip, for example, because it's rather unrealistic that you will hold your phone upside down. Depending on the type of images (e.g. aerial photography), different transformations might or might not make sense.</p>
<pre><code class="language-python"># Creating variations to show you some examples
from keras.preprocessing.image import load_img, img_to_array

img = load_img('data/train/alpaka/Alpacca1.jpg')
x = img_to_array(img)
x = x.reshape((1,) + x.shape)  # the generator expects a batch dimension
i = 0
for batch in train_datagen.flow(x, batch_size=1,
                          save_to_dir='preview', save_prefix='alpacca', save_format='jpeg'):
    i += 1
    if i &gt; 20:
        break  # otherwise the generator would loop indefinitely</code></pre>
<figure><img src="https://liip.rokka.io/www_inarticle/31a080/variations.png" alt=""></figure>
<p>Now, if you want to use that generator in your model directly, you can use the convenient flow-from-directory method, where you can even define the target size, so you don’t have to scale down your training images with an external library. </p>
<pre><code class="language-python"># Flow from directory method (train_data_dir, sz and batch_size
# are defined elsewhere in the notebook)
train_generator = train_datagen.flow_from_directory(train_data_dir,
    target_size=(sz, sz),
    batch_size=batch_size, class_mode='binary')</code></pre>
<h3>Using Resnet50</h3>
<p>In order to finally step on the shoulders of giants, we can simply import the resnet50 model that we talked about earlier. <a href="http://ethereon.github.io/netscope/#/gist/db945b393d40bfa26006">Here</a> is a detailed description of each layer and <a href="https://arxiv.org/pdf/1512.03385.pdf">here is the matching paper</a> that describes it in detail. While there are <a href="https://keras.io/applications/">different alternatives that you might also use</a>, the resnet50 model has a fairly high accuracy while not being too “big”, in comparison to the computationally expensive <a href="http://www.robots.ox.ac.uk/~vgg/">VGG</a> network architecture.</p>
<p>On a side note: the name “res” comes from residual. A residual can be understood as a subtraction of the features learned from the input at each layer. ResNet has a very neat trick that allows deeper networks to learn from residuals by “short-circuiting” them with the deeper layers, i.e. directly connecting the input of an n-th layer to some (n+x)-th layer. This short-circuiting has been proven to make training easier: it helps with the problem of degrading accuracy, where networks that are too deep become exponentially harder to train. </p>
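<p>The shortcut idea fits in a few lines of Python. The following is a sketch of the concept only, with a made-up one-parameter "layer", not of the actual ResNet50 blocks:</p>
<pre><code class="language-python"># A toy residual block: the output is F(x) + x
def layer(x, weight):
    # stand-in for some learned transformation F(x)
    return [weight * v for v in x]

def residual_block(x, weight):
    # the block only has to learn the residual F(x);
    # the input x is added back unchanged via the shortcut
    return [fx + v for fx, v in zip(layer(x, weight), x)]</code></pre>
<p>Notice that if the layer learns nothing (weight 0), the block collapses to the identity, which is exactly what makes very deep stacks of such blocks easier to train than plain layers.</p>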
<pre><code class="language-python"># Importing ResNet50 into Keras
from keras.applications.resnet50 import ResNet50
base_model = ResNet50(weights='imagenet')</code></pre>
<figure><img src="https://liip.rokka.io/www_inarticle/3b54cc/comparison.jpg" alt=""></figure>
<p>As you can see above, importing the network is dead easy in Keras. It might take a while to download, though. Notice that we are downloading the weights too, not only the architecture.</p>
<h3>Training existing models</h3>
<p>The next part is the exciting one: now we finally get to train the existing networks on our own data. The simple but ineffective approach would be to download, or just re-build, the architecture of a successful network and train it with our data. The problem with that approach is that we only have 100 images per class, which is not even remotely close to enough data to train such networks well enough to be useful. </p>
<p>Instead we will try another technique (which I somewhat stole from the <a href="https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html">great keras blog</a>): We will freeze all weights of the downloaded network and add three final layers at the end of the network and then train those. </p>
<h3>Freezing the base model</h3>
<p>Why is this useful, you might ask: by freezing all of the existing layers of the resnet50 network, we only have to train the final layers. This makes sense, since the imagenet task is about recognizing everyday objects in everyday photographs, and the network is already very good at recognising “basic” features such as legs, eyes, circles, heads, etc. All of this “smartness” is already encoded in the weights (see the last blog post). If we threw these weights away, we would lose these nice, smart properties. Instead we glue another pooling layer and a dense layer at the very end, followed by a sigmoid activation layer, which is what we need to distinguish between our two classes. That is, by the way, why it says “include_top=False” in the code: we don't include the final 1000-classes layer that was used for the imagenet competition. If you want to read up on the different alternatives to resnet50, you will find them <a href="https://keras.io/applications/">here</a>.</p>
<pre><code class="language-python"># Adding three layers on top of the network
from keras.applications.resnet50 import ResNet50
from keras.layers import GlobalAveragePooling2D, Dense

base_model = ResNet50(weights='imagenet', include_top=False)
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(1, activation='sigmoid')(x)</code></pre>
<p>Finally we can now re-train the network with our own image material and hope for it to turn out quite useful. I had some trouble finding an optimizer that produced proper results; usually you will have to experiment with the learning rate to find a configuration whose accuracy improves during the training phase.</p>
<pre><code class="language-python"># Freezing all the original weights and compiling the network
from keras import optimizers
from keras.models import Model

optimizer = optimizers.RMSprop(lr=0.00001, rho=0.9, epsilon=None, decay=0.0)
model = Model(inputs=base_model.input, outputs=predictions)
for layer in base_model.layers: layer.trainable = False
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
model.fit_generator(train_generator, train_generator.n // batch_size, epochs=3, workers=4,
        validation_data=validation_generator, validation_steps=validation_generator.n // batch_size)</code></pre>
<p>The training shouldn’t take long, even when you are using just a CPU instead of a GPU and the output might look something like this:</p>
<figure><img src="https://liip.rokka.io/www_inarticle/dc208a/training.png" alt=""></figure>
<p>You’ll notice that we reached an accuracy of 71% which isn’t too bad, given that we have only 100 original images of each class. </p>
<h3>Fine-tuning</h3>
<p>One thing that we might do now is unfreeze some of the very last layers in the network and re-train it again, allowing those layers to change slightly. We do this in the hope that a bit more “wiggle-room”, without changing most of the actual weights, might give us better results. </p>
<pre><code class="language-python "># Make the very last layers trainable
split_at = 140
for layer in model.layers[:split_at]: layer.trainable = False
for layer in model.layers[split_at:]: layer.trainable = True
model.compile(optimizer=optimizers.RMSprop(lr=0.00001, rho=0.9, epsilon=None, decay=0.0), loss='binary_crossentropy', metrics=['accuracy'])    
model.fit_generator(train_generator, train_generator.n // batch_size, epochs=1, workers=3,
        validation_data=validation_generator, validation_steps=validation_generator.n // batch_size)</code></pre>
<figure><img src="https://liip.rokka.io/www_inarticle/2e0324/improvement.png" alt=""></figure>
<p>And indeed it helped our model to go from 71% accuracy to 82%! You might want to play around with the learning rates a bit, or maybe split at a different depth, in order to tweak results. But generally I think that just adding more images would be the easiest way to achieve 90% accuracy.  </p>
<h3>Confusion matrix</h3>
<p>In order to see how well our model is doing, we might also compute a confusion matrix, i.e. count the true positives, true negatives, false positives and false negatives. </p>
<pre><code class="language-python"># Calculating confusion matrix
from sklearn.metrics import confusion_matrix

r = next(validation_generator)  # one batch: (images, true labels)
probs = model.predict(r[0])
classes = []
for prob in probs:
    if prob &lt; 0.5:
        classes.append(0)
    else:
        classes.append(1)
# Note: scikit-learn's convention is confusion_matrix(y_true, y_pred),
# so this matrix is transposed with respect to the usual convention.
cm = confusion_matrix(classes, r[1])
cm</code></pre>
<p>As you can see above, I simply took the first batch from the validation generator (i.e. the images of which we know whether they show an alpaca or an oryx) and then used the <a href="http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py">confusion matrix from scikit-learn</a> to output something. In the example below we see that 28 resp. 27 images of each class were labeled correctly, while 4 resp. 5 images were misclassified. I would say that’s quite a good result, given that we used so little data.</p>
<pre><code class="language-python">#example output of confusion matrix
array([[28,  5],
       [ 4, 27]])</code></pre>
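<p>From such a matrix, overall accuracy drops out directly: the diagonal holds the correctly labeled images, everything off it is an error. Using the numbers above:</p>
<pre><code class="language-python"># Computing accuracy from the confusion matrix above
cm = [[28, 5],
      [4, 27]]

correct = cm[0][0] + cm[1][1]        # diagonal: correctly labeled images
total = sum(sum(row) for row in cm)  # all validated images
accuracy = correct / total           # 55 of 64, roughly 86%</code></pre>
<p>This is in the same ballpark as the roughly 82% validation accuracy reported by Keras during fine-tuning; keep in mind we only looked at a single validation batch here.</p>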
<h3>Use the model to predict images</h3>
<p>Last but not least, we can of course finally use the model to predict whether an animal in our little zoo is an oryx or an alpaca. </p>
<pre><code class="language-python"># Helper function to display images
import numpy as np
import matplotlib.pyplot as plt
from keras.preprocessing import image

def load_image(img_path, show=False):
    img = image.load_img(img_path, target_size=(224, 224))
    img_tensor = image.img_to_array(img)             # (height, width, channels)
    img_tensor = np.expand_dims(img_tensor, axis=0)  # (1, height, width, channels): the model expects a batch dimension

    if show:
        plt.imshow(img_tensor[0]/255)  # imshow expects values in the range [0, 1]
        plt.axis('off')
        plt.show()

    return img_tensor

# Load two sample images
oryx = load_image("data/valid/oryx/106.jpg", show=True)
alpaca = load_image("data/valid/alpaca/alpaca102.jpg", show=True)
model.predict(alpaca)
model.predict(oryx)</code></pre>
<figure><img src="https://liip.rokka.io/www_inarticle/6d2129/prediction.png" alt=""></figure>
<p>As you can see in the output, our model successfully labeled the alpaca as an alpaca since the value was less than 0.5 and the oryx as an oryx, since the value was &gt; 0.5. Hooray! </p>
<h3>Conclusion or What’s next?</h3>
<p>I hope this blog post was useful to you and showed that you don’t really need much to get started with deep learning for image recognition. I know that our example zoo pokedex is really small at this point, but I don’t see a reason (apart from lack of time and resources) why it should be a problem to scale from our 2 animals to 20 or 200. </p>
<p>On the technical side, now that we have a model running that’s kind of useful, it would be great to find out how to use it on a smartphone, e.g. the iPhone, to finally have a pokedex that we can really try out in the wild. I will cover that bit in the third part of the series, showing you how to export existing models to Apple mobile phones using the <a href="https://developer.apple.com/machine-learning/">CoreML</a> technology. As always I am looking forward to your comments and corrections, and I'd like to point you to the ipython notebook that you can download <a href="https://github.com/plotti/zoo/blob/master/Zoo%20prediction.ipynb">here</a>.</p>]]></description>
                  <enclosure url="http://liip.rokka.io/www_card_2/460c46/adorable-adult-animals-1040396.jpg" length="2036526" type="image/jpeg" />
          </item>
        <item>
      <title>Poke-Zoo - How to use deep learning image recognition to tell oryxes apart from llamas in a zoo</title>
      <link>https://www.liip.ch/fr/blog/poke-zoo-or-making-deep-learning-tell-oryxes-apart-from-lamas-in-a-zoo-part-1-the-idea-and-concepts</link>
      <guid>https://www.liip.ch/fr/blog/poke-zoo-or-making-deep-learning-tell-oryxes-apart-from-lamas-in-a-zoo-part-1-the-idea-and-concepts</guid>
      <pubDate>Wed, 18 Jul 2018 00:00:00 +0200</pubDate>
<description><![CDATA[<p>We’ve all witnessed the hype in 2016 when people started hunting pokemons in “real life” with the app Pokémon GO. It was one of the apps with the <a href="http://www.businessofapps.com/data/pokemon-go-statistics/">fastest rise</a> in user base and, for a while, with a higher addiction rate than crack - correction: I mean Candy Crush. Comparing it to technologies like the telephone or email, <a href="http://blog.interactiveschools.com/blog/50-million-users-how-long-does-it-take-tech-to-reach-this-milestone">it only took 19 days to reach 50 million users</a>, vs. 75 years for the telephone. </p>
<h3>Connecting the real with the digital world</h3>
<p>You might be wondering why I am reminiscing about old apps; we have certainly all moved on since the Pokemon GO hype in 2016 and are doing other serious things now. True, but I think the idea of “collecting” virtual things that are bound to real-life locations was a great one, and that we will want to build more of it in the future. That’s why Pokemon is the starting point for this blog post. In fact, if you are young enough to have watched the Pokemon series, you are probably familiar with the idea of the pokedex. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/22bd5c/pokedex.jpg" alt=""></figure>
<h3>The idea</h3>
<p>The pokedex was a small device that Ash (the main character) could use to look up information about certain pokemons in the animated series. He used it now and then to look up some facts about them. We have seen how popular Pokemon GO became by connecting the real with the digital world, so why not take the idea of the pokedex and apply it in a real-world scenario, or:</p>
<p><strong><em> What if we had such an app to distinguish not pokemons but animals in the zoo? </em></strong></p>
<h2>The Zoo-Pokedex</h2>
<p>Imagine a scenario where kids have an app on their parents’ mobile phones - the zoo-pokedex. They start it up when entering a zoo and then go exploring. When they are at a cage they point the phone's camera at it and try to film the animal. The app recognizes which animal they are seeing and gives them additional information on it as a reward. </p>
<p>Instead of perceiving the zoo as an educational place where you have to go from cage to cage, observe the animal and absorb the info material, you could send the kids out there and let them “capture” all the animals with their Zoo-Pokedex. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/c89545/pokedexzoo.jpg" alt=""></figure>
<p>Let’s have a look at the classic ways of learning about animals in a zoo:</p>
<ul>
<li>Reading a plaque in the zoo feels boring and dated</li>
<li>Reading an info booklet you got at the cashier feels even worse</li>
<li>Bringing your own book about animals might be fun, when comparing the pictures of animals in the book with the real ones, but there is no additional information</li>
<li>Having a QR code at the cage that you need to scan will never feel exciting or fun</li>
<li>Having a list of animals in an app that I can tap on to get more info could be fun, but more for parents in order to appear smart before their kids, giving them facts about the animal</li>
</ul>
<p>Now imagine the zoo-pokedex: you really need to explore the zoo in order to get information. In cases where the animal's area is big and it can retreat, you need to wait in front of it to take a picture. That takes endurance and perseverance. It might even be the case that you don’t get to see it and have to come back. When the animal appears in front of you, you need to be quick - there is maybe even an element of surprise and excitement - you need to get that one picture of the animal in order to check off the challenge. Speaking of challenges, why not make it a challenge to have seen every animal in the zoo? That would definitely mean you need to come back multiple times, take your time, and go home having ticked off 4-5 animals per visit. This experience encourages you to come back and try again next time. And each time you learn something, you go home with a sense of accomplishment.  </p>
<p>That would definitely be quite interesting, but how could such a device work? Well, we would definitely use the phone’s camera, and we could train a deep learning network to recognize the animals that are present in the zoo. </p>
<p>So imagine a kid walking up to an area, trying to spot the animal, pointing the phone at it - and then, magically, a green check-mark appears next to it in the app. We could display some additional info like where the animals are originally from, what they eat, when they sleep etc., and those infos would definitely feel much more entertaining than just reading them off a boring info plaque.</p>
<h3>How to train the pokedex to distinguish new animals</h3>
<p>Nice idea, you say, but how am I going to build that magical device that recognizes animals, especially the “weird” ones, e.g. the oryx in the title :). The answer is … of course … deep learning. </p>
<p>In recent years you have probably noticed the rise of deep learning in different areas of machine learning and noticed their practical applications in your everyday life. In fact I have covered a couple of these practical applications such as <a href="https://www.liip.ch/en/blog/sentiment-detection-with-keras-word-embeddings-and-lstm-deep-learning-networks">state of the art sentiment detection</a> or <a href="https://www.liip.ch/en/blog/tensorflow-and-tflearn-or-can-deep-learning-predict-if-dicaprio-could-have-survived-the-titanic">survival rates for structured data</a> or <a href="https://www.liip.ch/en/blog/betti-bossi-recipe-assistant-prototype-with-automatic-speech-recognition-asr-and-text-to-speech-tts-on-socket-io">automatic speech recognition</a> and <a href="https://www.liip.ch/en/blog/recipe-assistant-prototype-with-asr-and-tts-on-socket-io-part-3-developing-the-prototype">text to speech applications</a> in our blog. </p>
<h3>Deep learning image categorization task</h3>
<p>The area we need for our little zoo-pokedex is image categorization. Image categorization has advanced tremendously in recent years, due to deep learning outperforming all other machine learning approaches (see below). One good indicator of this movement is the yearly <a href="http://www.image-net.org">imagenet competition</a>, which lets machine learning algorithms compete on the best way of finding out what can be seen in an image. The task is simple: there are 1000 categories of everyday objects, such as cats, elephants and tea kettles, and millions of images that need to be mapped to one of these categories. The algorithm that makes the fewest errors wins. Below is an example of the output on sample images; you’ll notice that the algorithm displays the label to which it thinks the image belongs. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/351c6b/imagenet.jpg" alt=""></figure>
<p>Now, this ILSVRC competition has been going on for a couple of years, and while the improvements have been astonishing each year, deep learning appeared on the horizon with a big bang in 2012 and 2013. As you can see in the image below, the number of state-of-the-art solutions exploded and outperformed all other approaches in this area. It even goes so far that the algorithms are better at telling the contents apart than a competing human group. This super-human ability of deep learning networks in these areas is what the hype is all about. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/46be78/solutions.jpg" alt=""></figure>
<h3>How does it work?</h3>
<p>In this blog post I don’t want to get too technical, but just show you how two easy concepts, convolution (kernels) and pooling, are applied in a smart way to achieve outstanding results in image recognition tasks with deep learning. I won't go into the details of how deep learning actually learns (updating of weights, backpropagation), but abstract all of this away. In fact, if you have 20 minutes and are a visual learner, I definitely recommend the video below, which does an extremely good job of explaining the concepts behind it:</p>
<figure class="embed-responsive embed-responsive--16/9"><iframe src="//youtube.com/embed/aircAruvnKk" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen="true"></iframe></figure>
<p>Instead I will quickly cover the two basic tricks that are used to make things really work. </p>
<p>We’ll start by looking at a common representation of a deep learning network, and you’ll notice that two words appear a lot there, namely convolution and pooling. While it seems obvious that the image data has to travel through these layers from left to right, it would be good to know what these layers actually do. </p>
<h3>Convolutions and Kernels</h3>
<p>If you are not a native speaker, you’ve probably never heard the word convolution before and might be quite puzzled when you hear it. For me it also sounded like some magic procedure that apparently does something very complicated and makes the deep learning work :). </p>
<p>After getting into the field, I realized that it's basically an image transformation that is almost 20 years old (see e.g. the Prentice Hall book "Computer Vision" by Shapiro) and present in your everyday image editing software. Things like sharpening an image, blurring it, or finding edges are basically convolutions. It's a process of sliding a small matrix, e.g. 3x3, over each pixel of your image, multiplying the matrix values with the pixel and its neighbours, and collecting the results of that manipulation in a new image.</p>
<p>To make this concept more understandable I stole some <a href="http://setosa.io/ev/image-kernels/">examples</a> of how a 3x3 matrix, also called a kernel, transforms an image after being applied to every pixel in your image. </p>
<p>In the image below, the kernel gives you the top edges in your image. The numbers in the grey boxes represent the gray image values (from 0, black, to 255, white) and the little numbers after the X represent how these values are weighted when added together. If you change these numbers you get another transformation. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/edbbdd/top-edge.jpg" alt=""></figure>
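<p>The multiply-and-add can be written out by hand. Here is a small Python sketch that applies a 3x3 kernel at a single pixel position of a made-up grayscale image; the kernel values are the top-edge example from above, not anything learned:</p>
<pre><code class="language-python"># Convolve a 3x3 kernel centred on pixel (row, col)
def apply_kernel(image, kernel, row, col):
    total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            total += image[row + dr][col + dc] * kernel[dr + 1][dc + 1]
    return total

top_edge = [[ 1,  1,  1],
            [ 0,  0,  0],
            [-1, -1, -1]]

# Bright rows above a dark row: a strong top edge at the centre pixel
image = [[255, 255, 255],
         [255, 255, 255],
         [  0,   0,   0]]
value = apply_kernel(image, top_edge, 1, 1)  # large positive response</code></pre>
<p>On a perfectly uniform patch the positive and negative kernel weights cancel and the response is zero, which is what makes this kernel an edge detector.</p>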
<p>Here is another set of numbers in the 3x3 matrix that will blur your image. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/1ae7fe/blur.jpg" alt=""></figure>
<p>Now, the traditional way of creating such “filters” is to hand-tune these numbers to achieve the desired result. With some logical thinking you can easily come up with filters that sharpen or blur an image and then apply them. But how are these applied in the context of deep learning?</p>
<p>With deep learning we do things the other way round: we let the neural network find filters that are useful with regard to the final result. So, for example, to tell a zebra apart from an elephant it would really be useful to have a filter that detects diagonal edges - if the image has diagonal edges, e.g. the stripes of the zebra, it's probably not an elephant. So we train the network on our training images of zebras and elephants and let it learn these filters, or kernels, on its own. If the emerging kernels are helpful for the task, they have a tendency to stay; if not, they keep on updating themselves until they become useful. </p>
<p>So one layer that applies such filters or kernels or convolutions is called a convolutional layer. And now comes another cool property. If you keep on stacking such layers on top of each other, each of these layers will find own filters that are helpful. And on top of that each of these filters will become more and more complicated and be able to detect more detailed features.</p>
<figure><img src="https://liip.rokka.io/www_inarticle/9904f6/layer.jpg" alt=""></figure>
<p>In the image above (which is from a seminal <a href="https://arxiv.org/pdf/1311.2901.pdf">paper</a>), you see gray boxes and images. A great way to show these filters is to show the activations, or convolutions, which are the gray boxes. The images are samples that “trigger” these filters the most. Or, said the other way round, these are images that these filters detect well. </p>
<p>So for example in the first layer you’ll notice that the network detects mostly vertical, horizontal and diagonal edges. In the second layer it's already a bit “smarter” and able to detect round things, e.g. eyes or corners of frames. In the third layer it's smarter still and able to detect not only round things but things that look like car tires, for example. This layering often goes on and on; some networks have over 200 of these layers. That's why they are called deep. Now you know. Usually, adding more and more of these layers makes the network better at detecting things, but it also makes it slower and sometimes less able to generalize to things it has not seen yet.  </p>
<h3>Pooling</h3>
<p>The second word that you will see a lot in those architectures is pooling. Here the trick is really simple: you look at a couple of pixels next to each other, e.g. a 2x2 block, and simply take the biggest value - this is called max-pooling. In the image below this trick has been applied to each colored 2x2 area, and the output is a much smaller image. Now why are we doing this?</p>
<p>The answer is simple: in order to be size invariant. We scale the image down multiple times in order to be able to detect a zebra that is really close to the camera as well as one that is only visible in the far distance.</p>
<figure><img src="https://liip.rokka.io/www_inarticle/52da95/pooling.jpg" alt=""></figure>
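<p>Max-pooling itself is just a few lines of numpy - here is a sketch of the 2x2 case described above (the input values are made up for illustration):</p>
<pre><code class="language-python">import numpy as np

def max_pool_2x2(img):
    """2x2 max-pooling with stride 2: keep the largest value of each block."""
    h, w = img.shape
    # Trim to even dimensions, split into 2x2 blocks, take the max per block.
    return img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image = np.array([[1, 3, 2, 1],
                  [4, 2, 1, 0],
                  [5, 1, 9, 2],
                  [0, 6, 3, 8]])

pooled = max_pool_2x2(image)
print(pooled)
# [[4 2]
#  [6 9]]</code></pre>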
<h3>Putting things together</h3>
<p>After this short excursion into the two main building blocks of state-of-the-art deep learning networks, we have to ask how we are going to use these tricks to detect our animals in the zoo.</p>
<p>While a few years ago you would have had to write a lot of code and hire a whole machine learning team for this task, today you can stand on the shoulders of giants. Thanks to the ImageNet competitions (and thanks to Google, Microsoft and other research teams constantly publishing new research) we can use some of these pretrained networks to do the job for us. What does this mean?</p>
<p>The networks that are often used in these competitions can be obtained freely (in fact they come pre-bundled with deep-learning frameworks such as <a href="https://github.com/pytorch/pytorch">PyTorch</a>), and you can use these networks without any tuning to categorize your images into the 1000 categories used in the competition. As you can see in the image below, the bigger the network in terms of layers, the better it performs - but also the slower it is and the more data it needs to be trained.</p>
<figure><img src="https://liip.rokka.io/www_inarticle/3b54cc/comparison.jpg" alt=""></figure>
<h3>Outlook - Part 2: How to train state-of-the-art image recognition networks to categorize new material</h3>
<p>The cool thing is that in the next blog post we will take these pretrained networks and teach them new tricks - in our case, to tell a llama apart from an oryx for our zoo pokedex. So basically we will train these networks to recognize things they have never been trained on. Obviously we will need training data, and we will have to find a way to teach them new stuff without “destroying” their ability to be really good at detecting common things.</p>
<p>Finally, after that blog post I hope to leave you with at least one takeaway: the demystification of deep learning networks in the image recognition domain. Whenever you see those weird architecture drawings of image recognition deep learning networks with steps saying “convolution” and “pooling”, you’ll hopefully know that this magic sauce is not that magic after all. It’s just a very smart way of applying those rather old techniques to achieve outstanding results.</p>
                  <enclosure url="http://liip.rokka.io/www_card_2/b498f0/animals-assorted-background-953211.jpg" length="541852" type="image/jpeg" />
          </item>
        <item>
      <title>Sentiment detection with Keras, word embeddings and LSTM deep learning networks</title>
      <link>https://www.liip.ch/fr/blog/sentiment-detection-with-keras-word-embeddings-and-lstm-deep-learning-networks</link>
      <guid>https://www.liip.ch/fr/blog/sentiment-detection-with-keras-word-embeddings-and-lstm-deep-learning-networks</guid>
      <pubDate>Fri, 04 May 2018 00:00:00 +0200</pubDate>
      <description><![CDATA[<h3>Overview SaaS</h3>
<p>Sentiment detection has become a bit of a commodity; the big five vendors in particular offer it as a service. Google offers an <a href="https://cloud.google.com/natural-language/docs/sentiment-tutorial">NLP API</a> with sentiment detection. Microsoft offers sentiment detection through its <a href="https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/">Azure</a> platform. IBM has come up with a solution called <a href="https://www.ibm.com/watson/services/tone-analyzer/">Tone Analyzer</a> that tries to get the &quot;tone&quot; of a message, which goes a bit beyond sentiment detection. Amazon offers a solution called <a href="https://aws.amazon.com/de/blogs/machine-learning/detect-sentiment-from-customer-reviews-using-amazon-comprehend/">Comprehend</a> that runs on AWS as a Lambda. Facebook surprisingly doesn't offer an API or an open source project here, although they are the ones with user-generated content where people often are not <a href="https://www.nzz.ch/digital/facebook-fremdenfeindlichkeit-hass-kommentare-ld.1945">so nice</a> to each other; interestingly, they don't offer page owners any assistance in this specific matter either.</p>
<p>Beyond the big five there are a few noteworthy companies like <a href="https://aylien.com">Aylien</a> and <a href="https://monkeylearn.com">Monkeylearn</a> that are worth checking out.</p>
<h3>Overview Open Source Solutions</h3>
<p>Of course there are open source solutions and libraries that offer sentiment detection too.<br />
Generally these tools offer more than just sentiment analysis. Most of the SaaS solutions outlined above, as well as the open source libraries, cover a vast range of different NLP tasks:</p>
<ul>
<li>part of speech tagging (e.g. &quot;going&quot; is a verb), </li>
<li>stemming (finding the &quot;root&quot; of a word, e.g. am, are, is -&gt; be), </li>
<li>noun phrase extraction (e.g. car is a noun), </li>
<li>tokenization (e.g. splitting text into words, sentences), </li>
<li>word inflections (e.g. what's the plural of atlas), </li>
<li>spelling correction and translation. </li>
</ul>
<p>I'd like to point you to Python's <a href="http://text-processing.com/demo/sentiment/">NLTK library</a>, <a href="http://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis">TextBlob</a>, <a href="https://www.clips.uantwerpen.be/pages/pattern-en#sentiment">Pattern</a>, R's <a href="https://cran.r-project.org/web/packages/tm/index.html">Text Mining</a> module and Java's <a href="http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html">LingPipe</a> library. Finally, I encourage you to have a look at the latest <a href="https://spacy.io">Spacy NLP suite</a>, which doesn't offer sentiment detection per se but has great NLP capabilities.</p>
<p>If you are looking for more options I encourage you to take a look at the full list that I have compiled in our <a href="http://datasciencestack.liip.ch/#nlp">data science stack</a>. </p>
<h3>Let's get started</h3>
<p>So you see, when you need sentiment analysis in your web or mobile app, you already have a myriad of options to get started. Of course you might build something yourself if your language is not supported or if you have legal compliance requirements to meet when it comes to data privacy.</p>
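<p>To give a flavour of what &quot;building something yourself&quot; can mean in its most naive form, here is a toy lexicon-based scorer (the word lists are invented for this illustration - the SaaS solutions and libraries above, and the network we build below, are of course far more sophisticated):</p>
<pre><code class="language-python"># A toy lexicon-based sentiment scorer - purely illustrative,
# the word lists are made up for this example.
POSITIVE = {"good", "great", "liked", "fun", "brilliant"}
NEGATIVE = {"bad", "terrible", "boring", "awful", "weak"}

def sentiment(text):
    """Score in [-1, 1]: fraction of positive minus negative words."""
    words = text.lower().split()
    hits = [(w in POSITIVE) - (w in NEGATIVE) for w in words]
    return sum(hits) / len(words) if words else 0.0

print(sentiment("a great and fun movie"))      # 0.4
print(sentiment("terrible acting, bad plot"))  # -0.5</code></pre>
<p>The obvious weakness of this approach - no handling of negation, context or word order - is exactly what the LSTM model below addresses.</p>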
<p>Let me walk you through all of the steps needed to build a well-working sentiment detection with <a href="https://keras.io">Keras</a> and <a href="https://de.wikipedia.org/wiki/Long_short-term_memory">long short-term memory networks</a>. Keras is a very popular Python deep learning library, similar to <a href="http://tflearn.org">TFlearn</a>, that allows you to create neural networks without writing too much boilerplate code. LSTM networks are a special form of network architecture that is especially useful for text tasks, which I am going to explain later.</p>
<figure><img src="https://liip.rokka.io/www_inarticle/30a13b/keras.png" alt=""></figure>
<h3>Step 1: Get the data</h3>
<p>Being a big movie nerd, I have chosen to classify IMDB reviews as positive or negative for this example. Conveniently, the IMDB sample already comes with the Keras <a href="https://keras.io/datasets/">datasets</a> library, so you don't have to download anything. If you are interested though, not a lot of people know that IMDB offers its <a href="https://www.imdb.com/interfaces/">own datasets</a> which can be <a href="https://datasets.imdbws.com">downloaded</a> publicly. Among those we are interested in the ones that contain movie reviews which have been marked by hand as either positive or negative.</p>
<pre><code class="language-python">#download the data
from keras.datasets import imdb 
top_words = 5000 
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)</code></pre>
<p>The code above does a couple of things at once: </p>
<ol>
<li>It downloads the data </li>
<li>It keeps only the 5000 most frequent words across all reviews </li>
<li>It splits the data into a test and a training set. </li>
</ol>
<figure><img src="https://liip.rokka.io/www_inarticle/fb9a1c/processed.png" alt=""></figure>
<p>If you look at the data you will realize it has already been pre-processed: all words have been mapped to integers, and the integers represent the words sorted by their frequency. Representing a dataset like this is very common in text analysis. So 4 represents the 4th most used word, 5 the 5th most used word and so on... The integer 1 is reserved for the start marker, the integer 2 for an unknown word and 0 for padding.</p>
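<p>This frequency-rank encoding with its reserved integers can be sketched in a few lines of plain Python (the mini corpus is made up for illustration):</p>
<pre><code class="language-python">from collections import Counter

# A toy corpus standing in for the IMDB reviews.
corpus = ["the movie was great", "the movie was bad", "the plot was thin"]

# Rank words by frequency; real ranks start at 3 because
# 0 = padding, 1 = start marker, 2 = unknown word.
counts = Counter(w for doc in corpus for w in doc.split())
rank = {w: i + 3 for i, (w, _) in enumerate(counts.most_common())}

def encode(doc):
    # 1 = &lt;START&gt;, unknown words map to 2 = &lt;UNK&gt;
    return [1] + [rank.get(w, 2) for w in doc.split()]

print(rank)
print(encode("the movie was awesome"))  # "awesome" is unknown -&gt; 2</code></pre>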
<p>If you want to peek at the reviews yourself and see what people have actually written, you can reverse the process too:</p>
<pre><code class="language-python">#reverse lookup
import keras
INDEX_FROM = 3  # load_data offsets all word indices by 3; 0-2 are reserved
word_to_id = keras.datasets.imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["&lt;PAD&gt;"] = 0
word_to_id["&lt;START&gt;"] = 1
word_to_id["&lt;UNK&gt;"] = 2
id_to_word = {value:key for key,value in word_to_id.items()}
print(' '.join(id_to_word[id] for id in X_train[0]))</code></pre>
<p>The output might look like something like this:</p>
<pre><code class="language-python">&lt;START&gt; this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert &lt;UNK&gt; is an amazing actor and now the same being director &lt;UNK&gt; father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for &lt;UNK&gt; and would recommend it to everyone to watch and the fly &lt;UNK&gt; was amazing really cried at the end it was so sad and you know w</code></pre>
<h3>One-hot encoder</h3>
<p>If you want to do the same with your own text (my example below uses some work reviews in German), you can use Keras' built-in &quot;one-hot&quot; encoding feature, which allows you to encode your documents as integers. The method is quite useful since it removes punctuation (e.g. !&quot;#$%&amp;...), splits sentences into words by space and transforms the words into lowercase.</p>
<pre><code class="language-python">#one hot encode your documents
from numpy import array
from keras.preprocessing.text import one_hot
docs = ['Gut gemacht',
        'Gute arbeit',
        'Super idee',
        'Perfekt erledigt',
        'exzellent',
        'naja',
        'Schwache arbeit.',
        'Nicht gut',
        'Miese arbeit.',
        'Hätte es besser machen können.']
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)</code></pre>
<p>Although the encoding will not be sorted like in our example before (e.g. lower numbers representing more frequent words), this will still give you a similar output:</p>
<pre><code>[[18, 6], [35, 39], [49, 46], [41, 39], [25], [16], [11, 39], [6, 18], [21, 39], [15, 23, 19, 41, 25]]</code></pre>
<h3>Step 2: Preprocess the data</h3>
<p>Since the reviews differ heavily in length, we trim each review to its first 500 words. We need text samples of the same length in order to feed them into our neural network. Reviews shorter than 500 words are padded with zeros. Keras, being super nice, offers a set of <a href="https://keras.io/preprocessing/text/">preprocessing</a> routines that do this for us easily.</p>
<pre><code class="language-python"># Truncate and pad the review sequences 
from keras.preprocessing import sequence 
max_review_length = 500 
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length) 
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length) </code></pre>
<figure><img src="https://liip.rokka.io/www_inarticle/27e1ad/padded.png" alt=""></figure>
<p>As you can see above (I've output the padded array as a pandas dataframe for visibility), a lot of the reviews are padded with 0 at the front, which means that the review is shorter than 500 words.</p>
<h3>Step 3: Build the model</h3>
<p>Surprisingly, we are already done with the data preparation and can start building our model.</p>
<pre><code class="language-python"># Build the model 
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
embedding_vector_length = 32 
model = Sequential() 
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length)) 
model.add(LSTM(100)) 
model.add(Dense(1, activation='sigmoid')) 
model.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy']) 
print(model.summary()) </code></pre>
<p>The two most important things in our code are the following:</p>
<ol>
<li>The Embedding layer and </li>
<li>The LSTM Layer. </li>
</ol>
<p>Let's cover what both of them do.</p>
<h3>Word embeddings</h3>
<p>The embedding layer will learn a word embedding for all the words in the dataset. It takes three arguments: the input dimension, i.e. the vocabulary size (in our case the 5000 top words); the output dimension, i.e. the vector space in which the words will be embedded (we have chosen 32 dimensions, so a vector of length 32 holds each word's coordinates); and the input length, in our case 500 words per review.</p>
<p>There are also pre-trained word embeddings (e.g. <a href="https://nlp.stanford.edu/projects/glove/">GloVe</a> or <a href="https://radimrehurek.com/gensim/models/word2vec.html">Word2Vec</a>) that you can download so that you don't have to train your embeddings yourself. These word embeddings are based on specialized algorithms that each do the embedding a bit differently, but we won't cover that here.</p>
<p>How can you imagine what an embedding actually is? Generally, words that have a similar meaning in context should be embedded next to each other. Below is an example of word embeddings in a two-dimensional space:</p>
<figure><img src="https://liip.rokka.io/www_inarticle/88d44e/embeddings.png" alt=""></figure>
<p>Why should we even care about word embeddings? Because it is a really useful trick. If we were to feed our reviews into a neural network and just one-hot encode them, we would have very sparse representations of our texts. Why? Let us have a look at the sentence &quot;I do my job&quot; in a &quot;bag of words&quot; representation with a vocabulary of 1000: a vector that holds 1000 words (each column is one word) has four ones in it (one for <strong>I</strong>, one for <strong>do</strong>, one for <strong>my</strong> and one for <strong>job</strong>) and 996 zeros. So it would be very sparse. This makes learning from it difficult, because we would need 1000 input neurons, each representing the occurrence of one word in our sentence.</p>
<p>In contrast, with a word embedding we can fold these 1000 words into as many dimensions as we want, in our case 32. This means we just have an input vector of 32 values instead of 1000. The word &quot;I&quot; would be some vector with values (0.4, 0.5, 0.2, ...) and the same would happen with the other words. With a word embedding like this, we just need 32 input neurons.</p>
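<p>The difference between the two representations can be made concrete in numpy (the word indices below are hypothetical, and the embedding matrix is random here - a trained one would contain learned coordinates):</p>
<pre><code class="language-python">import numpy as np

vocab_size, embedding_dim = 1000, 32
sentence = [7, 42, 99, 500]  # hypothetical indices for "I do my job"

# Bag of words: a 1000-long vector with four ones and 996 zeros.
bow = np.zeros(vocab_size)
bow[sentence] = 1.0

# Embedding: each word becomes a dense 32-dimensional vector.
embedding_matrix = np.random.rand(vocab_size, embedding_dim)
dense = embedding_matrix[sentence]

print(bow.shape, int(bow.sum()))  # (1000,) 4
print(dense.shape)                # (4, 32)</code></pre>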
<h3>LSTMs</h3>
<p>Recurrent neural networks are networks used for &quot;things&quot; that happen recurrently, one after the other (e.g. time series, but also words). Long Short-Term Memory networks (LSTM) are a specific type of recurrent neural network (RNN) capable of learning the relationships between elements in an input sequence - in our case, words. So our next layer is an LSTM layer with 100 memory units.</p>
<p>LSTM networks maintain a state and so overcome the vanishing gradient problem in recurrent neural networks (basically the problem that when you make a network deep enough, the information for learning will &quot;vanish&quot; at some point). I do not want to go into detail about how they actually work, but <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">this post</a> delivers a great visual explanation of the building blocks of LSTMs.</p>
<p>So the output of our embedding layer is a 500 by 32 matrix: each word is represented by its position in those 32 dimensions, and the sequence of 500 words is what we feed into the LSTM network.</p>
<p>Finally, we have a dense layer with a single node and a sigmoid activation as the output.</p>
<p>Since we only want to decide whether a review is positive or negative, we use binary_crossentropy as the loss function. The optimizer is the standard one (adam) and the metric is the standard accuracy metric.</p>
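<p>Binary crossentropy itself is simple enough to compute by hand: for a true label y (0 or 1) and a predicted probability p, the loss for one sample is -(y*log(p) + (1-y)*log(1-p)). A quick sketch:</p>
<pre><code class="language-python">import math

def binary_crossentropy(y_true, p):
    """Loss for one sample: low when the prediction matches the label."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# A confident correct prediction costs little...
print(binary_crossentropy(1, 0.9))  # ~0.105
# ...a confident wrong one costs a lot.
print(binary_crossentropy(1, 0.1))  # ~2.303</code></pre>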
<p>By the way, if you want, you can build the sentiment analysis without LSTMs by simply replacing the LSTM layer with a Flatten layer:</p>
<pre><code class="language-python">#Replace the LSTM by a flatten layer
from keras.layers import Flatten
#model.add(LSTM(100)) 
model.add(Flatten()) </code></pre>
<h3>Step 4: Train the model</h3>
<p>After defining the model Keras gives us a summary of what we have built. It looks like this:</p>
<pre><code class="language-python">#Summary from Keras
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
=================================================================
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None</code></pre>
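<p>You can check these parameter counts by hand with the standard formulas: an embedding layer has vocabulary size times embedding dimensions weights, an LSTM has four gates of units * (inputs + units) + units weights each, and a dense layer has inputs * outputs + outputs:</p>
<pre><code class="language-python">top_words, embedding_dim, lstm_units = 5000, 32, 100

embedding_params = top_words * embedding_dim                                # 160000
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)  # 53200
dense_params = lstm_units * 1 + 1                                           # 101

print(embedding_params + lstm_params + dense_params)  # 213301</code></pre>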
<p>To train the model we simply call the fit function, supply it with the training data and also tell it which data to use for validation. That is really useful because we get everything in one call.</p>
<pre><code class="language-python">#Train the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64) </code></pre>
<p>The training of the model might take a while, especially when you are running it on a CPU instead of a GPU. While the model trains, you want to watch the loss function: it should constantly go down, which shows that the model is improving. We let the model see the dataset 3 times, defined by the epochs parameter. The batch size defines how many samples the model sees at once - in our case 64 reviews.</p>
<figure><img src="https://liip.rokka.io/www_inarticle/1868ba/training.png" alt=""></figure>
<p>To observe the training you can fire up TensorBoard, which runs in the browser and gives you a lot of different analytics, in particular the loss curve in real time. To do so, type in your console:</p>
<pre><code class="language-bash">sudo tensorboard --logdir=/tmp</code></pre>
<h3>Step 5: Test the model</h3>
<p>Once we have finished training the model we can easily test its accuracy. Keras provides a very handy function to do that:</p>
<pre><code class="language-python">#Evaluate the model
scores = model.evaluate(X_test, y_test, verbose=0) 
print("Accuracy: %.2f%%" % (scores[1]*100))</code></pre>
<p>In our case the model achieved an accuracy of around 90%, which is excellent given the difficult task. By the way, if you are wondering what the result would have been with the Flatten layer: it is also around 90%. So in this case I would use <a href="https://en.wikipedia.org/wiki/Occam%27s_razor">Occam's razor</a> and, when in doubt, go with the simpler model.</p>
<h3>Step 6: Predict something</h3>
<p>Of course, in the end we want to use our model in an application, so we want it to create predictions. In order to do so we need to translate our sentence into the corresponding word integers and then pad it to match our data. We can then feed it into our model and see how it thinks we liked or disliked the movie.</p>
<pre><code class="language-python">#predict sentiment from reviews
bad = "this movie was terrible and bad"
good = "i really liked the movie and had fun"
for review in [good, bad]:
    tmp = [word_to_id[word] for word in review.split(" ")]
    tmp_padded = sequence.pad_sequences([tmp], maxlen=max_review_length)
    print("%s. Sentiment: %s" % (review, model.predict(tmp_padded)[0][0]))

# i really liked the movie and had fun. Sentiment: 0.715537
# this movie was terrible and bad. Sentiment: 0.0353295</code></pre>
<p>In this case a value close to 0 means the sentiment is negative and a value close to 1 means it's a positive review. You can also use &quot;model.predict_classes&quot; to directly get the classes positive and negative.</p>
<h3>Conclusion or what’s next?</h3>
<p>So we have built quite a cool sentiment analysis for IMDB reviews that predicts whether a movie review is positive or negative with 90% accuracy. With this we are already <a href="https://en.wikipedia.org/wiki/Sentiment_analysis">quite close</a> to industry standards. This means that, in comparison to a <a href="https://www.liip.ch/en/blog/whats-your-twitter-mood">quick prototype</a> that a colleague of mine built a few years ago, we could potentially improve on it now. The big benefit of our self-built solution compared with a SaaS solution on the market is that we own our data and model. We can deploy this model on our own infrastructure and use it as often as we like; Google or Amazon never get to see sensitive customer data, which might be relevant for certain business cases. We can also train it on German or even Swiss German, given that we find a nice dataset - or simply build one ourselves.</p>
<p>As always, I am looking forward to your comments and insights! As usual, you can download the IPython notebook with the code <a href="https://github.com/plotti/keras_sentiment/blob/master/Imdb%20Sentiment.ipynb">here</a>.</p>
<p>P.S. The people from Monkeylearn contacted me and pointed out that they have written quite an extensive introduction to sentiment detection: <a href="https://monkeylearn.com/sentiment-analysis/">https://monkeylearn.com/sentiment-analysis/</a> - I point you to it in case you want to read up on the general concepts.</p>
                  <enclosure url="http://liip.rokka.io/www_card_2/674f1c/clamp-clips-close-up-160824.jpg" length="2751344" type="image/jpeg" />
          </item>
    
  </channel>
</rss>
