<?xml version="1.0" encoding="utf-8"?>
<!-- generator="Kirby" -->
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">

  <channel>
    <title>Mot-cl&#233;: asr &#183; Blog &#183; Liip</title>
    <link>https://www.liip.ch/fr/blog/tags/asr</link>
    <generator>Kirby</generator>
    <lastBuildDate>Tue, 12 Jun 2018 00:00:00 +0200</lastBuildDate>
    <atom:link href="https://www.liip.ch" rel="self" type="application/rss+xml" />

        <description>Articles du blog Liip avec le mot-cl&#233; &#8220;asr&#8221;</description>
    
        <language>fr</language>
    
        <item>
      <title>Recipe Assistant Prototype with ASR and TTS on Socket.IO - Part 3 Developing the prototype</title>
      <link>https://www.liip.ch/fr/blog/recipe-assistant-prototype-with-asr-and-tts-on-socket-io-part-3-developing-the-prototype</link>
      <guid>https://www.liip.ch/fr/blog/recipe-assistant-prototype-with-asr-and-tts-on-socket-io-part-3-developing-the-prototype</guid>
      <pubDate>Tue, 12 Jun 2018 00:00:00 +0200</pubDate>
      <description><![CDATA[<p>Welcome to part three of three in our mini blog post series on how to build a recipe assistant with automatic speech recognition and text to speech to deliver a hands-free cooking experience. In the first blog post we gave you a hands-on <a href="https://www.liip.ch/en/blog/betti-bossi-recipe-assistant-prototype-with-automatic-speech-recognition-asr-and-text-to-speech-tts-on-socket-io">market overview</a> of existing SaaS and open source TTS solutions; in the second post we put the user at the center, covering the <a href="https://www.liip.ch/en/blog/recipe-assistant-prototype-with-asr-and-tts-on-socket-io-part-2-ux-workshop">usability aspects of dialog-driven apps</a> and how to create a good conversation flow. Finally it's time to get our hands dirty and show you some code. </p>
<h3>Prototyping with Socket.IO</h3>
<p>Although we envisioned the final app as a mobile app running on a phone, it was much faster for us to build a small Socket.IO web application that basically mimics how the app might work on a mobile device. Socket.IO is not the newest tool in the shed, but it was great fun to work with because it is really easy to set up. All you need is a JS library on the HTML side that you tell to connect to the server, which in our case is a simple Python Flask micro-webserver app.</p>
<pre><code class="language-html">&lt;!-- Socket.IO integration in the HTML page --&gt;
...
&lt;script src="https://cdnjs.cloudflare.com/ajax/libs/socket.io/2.1.0/socket.io.js"&gt;&lt;/script&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;script&gt;
$(document).ready(function(){
    var socket = io.connect('http://' + document.domain + ':' + location.port);
    socket.on('connect', function() {
        console.log("Connected recipe");
        socket.emit('start');
    });
    ...</code></pre>
<p>The code above connects to our Flask server and emits the start message, signalling that our audio service can start reading the first step. Depending on the messages we receive, we can quickly alter the DOM or trigger other behaviour in near real time, which is very handy.</p>
<p>To make it work on the server side, all you need in the Flask app is a <a href="https://flask-socketio.readthedocs.io">Python library</a> that you integrate into your application, and you are ready to go:</p>
<pre><code class="language-python"># socket.io in flask
from flask_socketio import SocketIO, emit
socketio = SocketIO(app)

...

#listen to messages 
@socketio.on('start')
def start_thread():
    global thread
    if not thread.is_alive():
        print("Starting Thread")
        thread = AudioThread()
        thread.start()

...

#emit some messages
socketio.emit('ingredients', {"ingredients": "xyz"})
</code></pre>
<p>In the code excerpt above we start a thread that is responsible for handling our audio processing. It starts when the web server receives the start message from the client, signalling that the client is ready to lead a conversation with the user. </p>
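<p>The AudioThread class itself is not shown in the excerpt. Here is a minimal sketch of how such a thread could be structured, reusing the thread_stop_event flag that also appears in the main loop further below; the class name comes from the excerpt, but the body is illustrative only:</p>

```python
import threading

# Stop flag used by the conversation loop (the excerpt below calls it
# thread_stop_event as well).
thread_stop_event = threading.Event()

class AudioThread(threading.Thread):
    """Skeleton of the background thread that drives the conversation."""

    def run(self):
        # The real loop records audio, calls the ASR service and emits
        # socket.io messages; this sketch only shows the structure.
        while not thread_stop_event.is_set():
            self.step()

    def step(self):
        # Placeholder for one listen/react cycle; here it stops right away.
        thread_stop_event.set()
```

<p>Starting and stopping then works exactly as in the Flask handler: create the thread, call start(), and set the event to end the loop.</p>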
<h3>Automatic speech recognition and state machines</h3>
<p>The main part of the application is simply a while loop in the thread that listens to what the user has to say. Whenever we change the state of our application, it displays the next recipe step and reads it out loud. We’ve sketched out the flow of the states in the diagram below. It is a mainly linear conversation flow, with the one difference that we sometimes branch off, for example to remind the user to preheat the oven or to take things out of it. This way we can potentially save the user time, or at least offer a convenience that they don’t get from a “classic” recipe on paper. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/dc0509/flow.png" alt=""></figure>
<p>The automatic speech recognition (see below) works with <a href="https://wit.ai">wit.ai</a> in the same manner as I showed in my recent <a href="https://www.liip.ch/en/blog/speech-recognition-with-wit-ai">blog post</a>. Have a look there to read up on the technology behind it and to find out how the RecognizeSpeech class works. In a nutshell, we record 2 seconds of audio locally, send it over a REST API to <a href="https://wit.ai">Wit.ai</a> and wait for it to be turned into text. While this is convenient from a developer’s side - there is little code to write and a ready-made service to use - the downside is reduced usability: it introduces roughly 1-2 seconds of lag for sending the data, processing it and receiving the results. Ideally the ASR should take place on the mobile device itself to introduce as little lag as possible. </p>
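<p>For illustration, the essence of such a RecognizeSpeech call is one HTTP request plus response parsing. Here is a hedged sketch - the endpoint and headers follow wit.ai's documented speech API, but the function names, the placeholder token and the injectable post argument are mine, not the actual class from the blog post:</p>

```python
def extract_text(payload):
    """Pull the transcription out of a wit.ai JSON response.

    Older API versions return it under "_text", newer ones under "text";
    fall back to an empty string if neither is present.
    """
    return payload.get("_text") or payload.get("text") or ""

def recognize_speech_sketch(wav_path, token, post=None):
    """POST a local wav file to wit.ai and return the recognized text."""
    if post is None:  # injectable for testing
        import requests
        post = requests.post
    headers = {
        "Authorization": "Bearer " + token,
        "Content-Type": "audio/wav",
    }
    with open(wav_path, "rb") as f:
        resp = post("https://api.wit.ai/speech", headers=headers, data=f)
    return extract_text(resp.json())
```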
<pre><code class="language-python">#abbreviated main thread

self.states = ["people","ingredients","step1","step2","step3","step4","step5","step6","end"]
while not thread_stop_event.isSet():
    socketio.emit("showmic") # show the microphone symbol in the frontend signalling that the app is listening
    text = recognize.RecognizeSpeech('myspeech.wav', 2) #the speech recognition is hidden here :)
    socketio.emit("hidemic") # hide the mic, signaling that we are processing the request

    if self.state == "people":
        ...
        if intro_not_played:
            self.play(recipe["about"])
            self.play(recipe["persons"])
            intro_not_played = False
        persons = re.findall(r"\d+", text)
        if len(persons) != 0:
            self.state = self.states[self.states.index(self.state)+1]
        ...
    if self.state == "ingredients":
        ...
        if intro_not_played:
            self.play(recipe["ingredients"])
            intro_not_played = False
        ...
        if "weiter" in text:
            self.state = self.states[self.states.index(self.state)+1]
        elif "zurück" in text:
            self.state = self.states[self.states.index(self.state)-1]
        elif "wiederholen" in text:
            intro_not_played = True #repeat the loop
        ...
</code></pre>
<p>As we see above, depending on the state we are in, we play the right TTS audio to the user and then progress to the next state. Each step also listens for whether the user wants to go forward (weiter), go backward (zurück) or repeat the step (wiederholen), because they might have misheard. </p>
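<p>Since the forward/backward/repeat handling repeats in every state, it can be factored into one helper. A small sketch (the function name is mine; the German commands and the intro_not_played-style repeat flag mirror the excerpt above):</p>

```python
def next_state(states, current, text):
    """Map a recognized utterance to the next state.

    Returns a (state, replay_intro) tuple; replay_intro mirrors the
    intro_not_played flag from the main loop above.
    """
    i = states.index(current)
    if "weiter" in text and i < len(states) - 1:
        return states[i + 1], True
    if "zurück" in text and i > 0:
        return states[i - 1], True
    if "wiederholen" in text:
        return current, True
    return current, False
```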
<p>The first prototype solution that I am showing above is not perfect though, as we are not using a wake-up word. Instead we periodically offer the user a chance to give us input. The main drawback is that when the user speaks at a moment when no input is expected, we might not record it and will consequently be unable to react. Additionally, sending audio back and forth to the cloud creates a rather sluggish experience. I would be much happier to have the ASR part on the client directly, especially since we are mainly listening for 3-4 navigational words. </p>
<h3>TTS with Slowsoft</h3>
<p>Finally, you may have noticed the play method in the code above. That's where the TTS is hidden. As you see below, we first show the speaker symbol in the application, signalling that now is the time to listen. We then send the text to Slowsoft via their API and in our case define the dialect &quot;CHE-gr&quot; as well as the speed and pitch of the output.</p>
<pre><code class="language-python">#play function
    def play(self,text):
        socketio.emit('showspeaker')
        headers = {'Accept': 'audio/wav','Content-Type': 'application/json', "auth": "xxxxxx"}
        with open("response.wav", "wb") as f: 
            resp = requests.post('https://slang.slowsoft.ch/webslang/tts', headers = headers, data = json.dumps({"text":text,"voiceorlang":"gsw-CHE-gr","speed":100,"pitch":100}))
            f.write(resp.content)
            os.system("mplayer response.wav")</code></pre>
<p>The text snippets are simply parts of the recipe. I tried to cut them into digestible parts, where each part contains roughly one action. Having the recipe already structured in the <a href="http://open-recipe-format.readthedocs.io/en/latest/topics/tutorials/walkthrough.html">open recipe</a> format helps a lot here, because we don't need to do any manual processing before sending the data. </p>
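<p>To illustrate, with a structured recipe, turning it into speakable snippets is a simple traversal. The dict layout below is hypothetical and only mirrors the fields used in the excerpts above, not the actual open recipe YAML schema:</p>

```python
def recipe_to_snippets(recipe):
    """Flatten a structured recipe into the order it is spoken in."""
    snippets = [recipe["about"], recipe["persons"], recipe["ingredients"]]
    snippets.extend(recipe["steps"])  # one play() call per snippet
    return snippets
```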
<h3>Wakeup-word</h3>
<p>We took our prototype for a spin and realized in our experiments that a wake-up word is a must. We simply couldn’t time our input correctly to enter it while the app was listening, which was a big pain point for the user experience. </p>
<p>I know that nowadays smart speakers like Alexa or Google Home provide their own wake-up word, but we wanted to have our own. Is that even possible? Well, you have different options here. You could train a deep network from scratch with <a href="https://www.tensorflow.org/mobile/tflite/">tensorflow-lite</a> or create your own model by following this tutorial on <a href="https://www.tensorflow.org/tutorials/audio_recognition">simple</a> speech recognition with TensorFlow. Yet the main drawback is that you might need a lot (and I mean A LOT, as in 65 thousand samples) of audio samples. That is not really feasible for most users. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/088ea4/snowboy.png" alt=""></figure>
<p>Luckily you can also take an existing deep network and train it to understand YOUR wake-up words. That means it will not generalize as well to other persons, but maybe that is not much of a problem. You might as well think of it as a feature: your assistant only listens to you and not your kids :). A solution of this form exists under the name <a href="https://snowboy.kitt.ai">snowboy</a>, where a couple of ex-Googlers created a startup that lets you create your own wake-up words and then download the resulting models. That is exactly what I did for this prototype. All you need to do is go to the snowboy website and provide three samples of your wake-up word; it then computes a model that you can download. You can also use their <a href="http://docs.kitt.ai/snowboy/#restful-api-calls">REST API</a> to do this - the idea being that you can include this phase directly in your application, making it very convenient for a user to set up their own wake-up word. </p>
<pre><code class="language-python">#wakeup class 

import snowboydecoder
import sys
import signal

class Wakeup():
    def __init__(self):
        self.detector = snowboydecoder.HotwordDetector("betty.pmdl", sensitivity=0.5)
        self.interrupted = False
        self.wakeup()

    def signal_handler(self, signal, frame):
        self.interrupted = True

    def interrupt_callback(self):
        return self.interrupted

    def custom_callback(self):
        self.interrupted = True
        self.detector.terminate()
        return True

    def wakeup(self):
        self.interrupted = False
        self.detector.start(detected_callback=self.custom_callback, interrupt_check=self.interrupt_callback,sleep_time=0.03)
        return self.interrupted
</code></pre>
<p>All you need then is to create a Wakeup instance from whatever app you include it in. In the code above you’ll notice that we load our downloaded model there (“betty.pmdl”); the rest of the methods are there to interrupt the wakeup method once we hear the wake-up word.</p>
<p>We then included this class in our main application as a blocking call, meaning that whenever we hit the part where we are supposed to listen for the wake-up word, we remain there until we hear it:</p>
<pre><code class="language-python">#integration into main app
...
            #record
            socketio.emit("showear")
            wakeup.Wakeup()
            socketio.emit("showmic")
            text = recognize.RecognizeSpeech('myspeech.wav', 2)
…</code></pre>
<p>So you noticed in the code above that we included the <em>wakeup.Wakeup()</em> call, which now waits until the user has spoken the word; only after that do we record 2 seconds of audio and send it off for processing with wit.ai. In our testing that improved the user experience tremendously. You also see that we signal the listening state to the user via graphical cues: a little ear when the app is listening for the wake-up word, and a microphone when the app is listening for your commands. </p>
<h3>Demo</h3>
<p>So, finally time to show you the tech demo. It gives you an idea of how such an app might work and hopefully also gives you a starting point for new ideas and further improvements. While it's definitely not perfect, it does its job and allows me to cook hands-free :). Mission accomplished! </p>
<figure class="embed-responsive embed-responsive--16/9"><iframe src="//player.vimeo.com/video/270594859" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen="true"></iframe></figure>
<h2>What's next?</h2>
<p>In the first part of this blog post series we gave quite an <a href="https://www.liip.ch/en/blog/betti-bossi-recipe-assistant-prototype-with-automatic-speech-recognition-asr-and-text-to-speech-tts-on-socket-io">extensive overview</a> of the current capabilities of TTS systems. While we saw an abundance of options on the commercial side, we sadly didn’t find the same number of sophisticated projects on the open source side. I hope this imbalance evens out in the future, especially with the strong IoT movement and the need for these kinds of technologies as an underlying stack for all kinds of smart assistant projects. Here is an <a href="https://www.kickstarter.com/projects/seeed/respeaker-an-open-modular-voice-interface-to-hack?lang=de">example</a> of a Kickstarter project for a small speaker with built-in open source ASR and TTS.</p>
<p>In the <a href="https://www.liip.ch/en/blog/recipe-assistant-prototype-with-asr-and-tts-on-socket-io-part-2-ux-workshop">second blog post</a>, we discussed the user experience of audio-centered assistants. We realized that going audio-only might not always provide the best user experience, especially when the user is presented with a number of alternatives to choose from. This was especially the case in the exploration phase, where you have to select a recipe, and in the cooking phase, where the user needs to go through the list of ingredients. Given that <a href="https://www.amazon.de/Amazon-Echo-2nd-Generation-Anthrazit-Stoff-/dp/B06ZXQV6P8">Alexas</a>, <a href="https://www.apple.com/homepod/">HomePods</a> and <a href="https://www.digitec.ch/de/s1/product/google-home-weiss-grau-multiroom-system-6421169">Google Home</a> smart boxes are on their way to taking over the audio-based home assistant area, I think their usage will only make sense in a number of domains that are very simple to navigate, as in “Alexa, play me something from Jamiroquai”. In more difficult domains, such as cooking, mobile phones might be an interesting alternative, especially since they are much more portable (they are mobile after all), offer a screen, and almost every person already has one. </p>
<p>Finally, in the last part of the series I have shown you how to integrate a number of solutions - wit.ai for ASR, Slowsoft for TTS, snowboy for the wake-up word, and socket.io and flask for prototyping - into a nicely working prototype of a hands-free cooking assistant. I have uploaded the code on GitHub, so feel free to play around with it to sketch your own ideas. A next step for us could be taking the prototype to the next level by really building it as an app for iPhone or Android, and especially improving the speed of the ASR. Here we might use the existing <a href="https://developer.apple.com/machine-learning/">coreML</a> or <a href="https://www.tensorflow.org/mobile/tflite/">TensorFlow Lite</a> frameworks, or check how well we could use the built-in ASR capabilities of the devices. As a final key takeaway, we realized that building a hands-free recipe assistant is definitely something different from simply having the mobile phone read the recipe out loud for you. </p>
<p>As always I am looking forward to your comments and insights and hope to update you on our little project soon.</p>]]></description>
                  <enclosure url="http://liip.rokka.io/www_card_2/3cb44c/gadget-google-assistant-google-home-1072851.jpg" length="560390" type="image/jpeg" />
          </item>
        <item>
      <title>Recipe Assistant Prototype with ASR and TTS on Socket.IO - Part 2 UX Workshop</title>
      <link>https://www.liip.ch/fr/blog/recipe-assistant-prototype-with-asr-and-tts-on-socket-io-part-2-ux-workshop</link>
      <guid>https://www.liip.ch/fr/blog/recipe-assistant-prototype-with-asr-and-tts-on-socket-io-part-2-ux-workshop</guid>
      <pubDate>Mon, 04 Jun 2018 00:00:00 +0200</pubDate>
      <description><![CDATA[<p>Welcome to part two of three in our mini blog post series on how to build a recipe assistant with automatic speech recognition and text to speech to deliver a hands-free cooking experience. In the last <a href="https://www.liip.ch/en/blog/betti-bossi-recipe-assistant-prototype-with-automatic-speech-recognition-asr-and-text-to-speech-tts-on-socket-io">blog post</a> we provided you with an exhaustive hands-on text to speech (TTS) market review; now it's time to put the user at the center. </p>
<h3>Workshop: Designing a user experience without a screen</h3>
<p>Although the screen used to dominate the digital world, more options are emerging thanks to the rapid improvement of these technologies. Most mobile users have used or heard of Siri from Apple iOS or Amazon Echo, and almost <a href="https://techcrunch.com/2018/01/12/39-million-americans-now-own-a-smart-speaker-report-claims/">60 million Americans</a> apparently already own a smart speaker. Still unheard of until recently, smart voice-based assistants are now changing our lives quickly. This means that user experience has to think beyond screen-based interfaces. UX has always been about the holistic experience in the context the user is in, and with speech recognition and speech as the main input source, UX work is needed to prevent potential usability issues in the interaction. </p>
<p>Yuri participated in our innoday workshop as a UX designer; her goal was to help the team define a recipe assistant with ASR and TTS that helps the user cook recipes in the kitchen without using their hands and is enjoyable to use. In this blog post Yuri helped me write down our UX workshop steps. </p>
<h3>Ideation</h3>
<p>We started off with a brainstorming of our long term and short term vision and then wrote down our ideas and thoughts on post-its. We then grouped the ideas into three organically emerging topics: Business, Technology and User needs. I took the liberty of highlighting some of the aspects that came to mind:</p>
<ul>
<li>User 
<ul>
<li>Privacy: Users might not want to have their voice samples saved on some google server. Click <a href="https://myactivity.google.com/myactivity?restrict=vaa%20speech">here</a> to listen to all your samples, if you have an Android phone. </li>
<li>Alexa vs. Mobile, or is audio only enough?: We spent a lot of discussion thinking about whether a cookbook could work in an audio-only mode. We were aware that there is, for example, an <a href="https://www.amazon.de/Chefkoch-GmbH/dp/B0733CWP3Q/ref=sr_1_1?s=digital-skills&amp;ie=UTF8&amp;qid=1526581717&amp;sr=1-1&amp;keywords=chefkoch">Alexa Skill</a> from Chefkoch, but somehow the low rating made us suspicious that users might need some minimal visual orientation. An app can show you the ingredients or some visual clues about what to do in a certain step - and who doesn't like those delicious pictures in recipes that lure you into giving a recipe a try?</li>
<li>Conversational Flow: An interesting aspect that is easy to overlook was how to design the conversational flow to allow the user enough flexibility when going through each step of the recipe while not being too rigid.</li>
<li>Wakeup Word: The so-called wake-up word is a crucial part of every ASR system; it triggers the start of the recording. I've written about it in a recent <a href="https://www.liip.ch/en/blog/speech-recognition-with-wit-ai">blog post</a>.</li>
<li>Assistant Mode: Working with audio also opens interesting opportunities for features that are rather unusual in normal apps. We thought of a spoken audio alert where the app notifies you to take the food out of the oven. Something that might feel very helpful, or very annoying, depending on how it is done.</li>
</ul></li>
<li>Technology
<ul>
<li>Structured Data: Interestingly, we soon realized that breaking down a cooking process means we need to structure our data better than as simple text. An example is simply multiplying the ingredients by the number of people. An interesting project in this area is the <a href="http://open-recipe-format.readthedocs.io/en/latest/topics/tutorials/walkthrough.html">open recipe</a> format, which defines a YAML layout to hold all the necessary data in a structured way. </li>
<li>Lag and Usability: Combining TTS with ASR is an interesting opportunity to combine different solutions in one product, but it also poses the problem of time lags when two different cloud-based systems have to work together. </li>
</ul></li>
<li>Business
<ul>
<li>Tech and Cooking: Maybe a silly idea, but we definitely thought that as men it would feel much cooler to use a tech gadget to cook the meal, instead of a boring cookbook. </li>
</ul></li>
</ul>
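<p>The ingredient-scaling point from the Structured Data note above is easy to sketch: once quantities are stored as numbers rather than free text, adapting a recipe to the number of diners is a single multiplication. The field names below are illustrative, not the open recipe format itself:</p>

```python
def scale_ingredients(ingredients, base_persons, target_persons):
    """Scale each quantity from the recipe's base serving count."""
    factor = target_persons / base_persons
    return [
        {"name": ing["name"], "unit": ing["unit"],
         "amount": ing["amount"] * factor}
        for ing in ingredients
    ]
```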
<figure><img src="https://liip.rokka.io/www_inarticle/2a3561/stickies.jpg" alt=""></figure>
<h3>User journey</h3>
<p>From there we took on the question: “How might we design an assistant that allows for cooking without looking at recipe on the screen several times, since the users’ hands and eyes are busy with cooking.”</p>
<p>We sketched the user journey as a full spectrum of activities that go beyond just cooking, and can be described as:</p>
<ul>
<li>Awareness of the recipes and its interface on App or Web</li>
<li>Shopping ingredients according to selected recipe</li>
<li>Cooking</li>
<li>Eating</li>
<li>After eating </li>
</ul>
<figure><img src="https://liip.rokka.io/www_inarticle/2feb6a/journey.png" alt=""></figure>
<p>Due to the limited time of an inno-day, we decided to focus on the cooking phase only, while acknowledging that this phase is definitely part of a much bigger user journey, where some parts, such as exploration, might be hard to tackle with an audio-only assistant. We did, though, explore the cooking step of the journey and break it down into its own sub-steps. For example: </p>
<ul>
<li>Cooking
<ul>
<li>Preparation</li>
<li>Select intended Recipe to cook</li>
<li>Select number of portions to cook</li>
<li>Check whether the user has all the ingredients ready</li>
</ul></li>
<li>Progress
<ul>
<li>Prepare ingredients</li>
<li>The actual cooking (boiling, baking, etc)</li>
<li>Seasoning and garnishing </li>
<li>Setting on a table</li>
</ul></li>
</ul>
<p>This meant that our cooking assistant needs to inform the user when each new sub-step starts and to introduce the next steps in an easy, unobtrusive way. It also has to track the multiple starts and stops of small actions during cooking, for example to remind the user to preheat the oven at an early point in time, when the user might not yet be thinking of that future step (see below).</p>
<figure><img src="https://liip.rokka.io/www_inarticle/acad97/steps.png" alt=""></figure>
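<p>Such a preheat reminder can be sketched with a timer that fires independently of the main conversation flow. This only illustrates the idea; the function name and wiring are hypothetical:</p>

```python
import threading

def schedule_reminder(delay_seconds, message, speak):
    """Speak a reminder after `delay_seconds`, e.g. "preheat the oven".

    `speak` is whatever plays TTS output in the app. The returned timer
    can be cancelled with timer.cancel() if the user finishes early.
    """
    timer = threading.Timer(delay_seconds, speak, args=(message,))
    timer.daemon = True
    timer.start()
    return timer
```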
<h3>User experience with a screen vs. no screen</h3>
<p>Although we were at first keen on building an audio-only interface, we found that a quick visual overview helps make the process faster and easier. For example, the ingredients can be viewed at a glance on the mobile screen without listening to every single one read out by the app. As a result we decided that a combination of minimal screen output and voice output would smooth out potential usability problems. </p>
<p>Since the user navigates with their voice via easy input options like “back”, “stop”, “forward” and “repeat”, we decided to also show the step the user is currently in on the screen. This feedback helps the user solve small errors or simply orient themselves more easily. </p>
<p>During the UX prototyping phase, we also realised that we should visually highlight the moments when the user is expected to speak and when they are expected to listen. That's why, immediately after a question from the app, we show an icon with a microphone meaning “Please tell me your answer!”. In a similar way we also show an audio icon when we want the user to listen carefully. Finally, since we didn’t want the assistant to listen to audio permanently, but only after a so-called “wake-up word”, we show a little ear icon, signalling that the assistant is now listening for this wake-up word. </p>
<p>While those micro-interactions and visual cues helped us streamline the user experience, we still think these areas are central to the user experience and should be improved in a next iteration. </p>
<h3>Conclusion and what's next</h3>
<p>I enjoyed that instead of starting to write code right away, we first sat together and sketched out the concept, writing sticky notes with the ideas and comments that came to mind. I also enjoyed having a mixed group, with UX people, developers, data scientists and project owners sitting at one table. Although our ambitious goal for the day was to deliver a prototype able to read recipes to the user, we ran out of time and I couldn’t code the prototype that day; in exchange, I think we gathered very valuable insights into which user experiences work without a screen and which don’t. We realized that going totally without a screen is much harder than it seems. It is crucial for the user experience that users have enough orientation to know where they are in the process, so that they don’t feel lost or confused. </p>
<p>In the final and third blog post of this mini series I will finally provide you with the details on how to write a simple flask and socket.io based prototype that combines automatic speech recognition, text to speech and wake-up-word detection to create a hands-free cooking experience.</p>]]></description>
                  <enclosure url="http://liip.rokka.io/www_card_2/f634d1/blur-cellphone-close-up-196644.jpg" length="1411274" type="image/jpeg" />
          </item>
        <item>
      <title>Recipe Assistant Prototype with Automatic Speech Recognition (ASR) and Text to Speech (TTS) on Socket.IO - Part 1 TTS Market Overview</title>
      <link>https://www.liip.ch/fr/blog/betti-bossi-recipe-assistant-prototype-with-automatic-speech-recognition-asr-and-text-to-speech-tts-on-socket-io</link>
      <guid>https://www.liip.ch/fr/blog/betti-bossi-recipe-assistant-prototype-with-automatic-speech-recognition-asr-and-text-to-speech-tts-on-socket-io</guid>
      <pubDate>Mon, 28 May 2018 00:00:00 +0200</pubDate>
      <description><![CDATA[<h2>Intro</h2>
<p>In one of our monthly innodays, where we try out new technologies and different approaches to old problems, we had the idea to collaborate with another company. Slowsoft is a provider of text to speech (TTS) solutions; to my knowledge they are the only ones able to generate Swiss German speech synthesis in various Swiss accents. We thought it would be a cool idea to combine this with our existing automatic speech recognition (ASR) expertise and build a cooking assistant that you can operate completely hands-free. So no more touching your phone with your dirty fingers only to check again how many eggs you need for that cake. We decided it would be great to go with some recipes from a famous Swiss cookbook provider. </p>
<h2>Overview</h2>
<p>Generally, there are quite a few text to speech solutions out there on the market. In this first blog post I would like to give you a short overview of the available options. In the follow-up posts I will then describe the insights we arrived at in the UX workshop and how we combined wit.ai with the solution from Slowsoft in a quick and dirty web-app prototype built on socket.io and flask. </p>
<p>But first let us get an overview over existing text to speech (TTS) solutions. To showcase the performance of existing SaaS solutions I've chosen a random recipe from Betty Bossi and had it read by them:</p>
<pre><code class="language-text">Ofen auf 220 Grad vorheizen. Broccoli mit dem Strunk in ca. 1 1/2 cm dicke Scheiben schneiden, auf einem mit Backpapier belegten Blech verteilen. Öl darüberträufeln, salzen.
Backen: ca. 15 Min. in der Mitte des Ofens.
Essig, Öl und Dattelsirup verrühren, Schnittlauch grob schneiden, beigeben, Vinaigrette würzen.
Broccoli aus dem Ofen nehmen. Einige Chips mit den Edamame auf dem Broccoli verteilen. Vinaigrette darüberträufeln. Restliche Chips dazu servieren. </code></pre>
<h3>But first: How does TTS work?</h3>
<p>The classical way works like this: you record at least dozens of hours of raw speaker material in a professional studio. Depending on your use case, the material can range from navigation instructions to jokes. The next trick is called &quot;unit selection&quot;: the recorded speech is sliced into a high number (10k - 500k) of elementary components called <a href="https://en.wikipedia.org/wiki/Phone">phones</a>, so that they can be recombined into new words that the speaker never recorded. The recombination of these components is not an easy task, because their characteristics depend on the neighboring phonemes and on the accentuation or <a href="https://en.wikipedia.org/wiki/Prosody">prosody</a>, which in turn depend a lot on the context. The problem is to find the combination of units that satisfies the input text and the accentuation and can be joined together without generating glitches. The raw input text is first translated into a phonetic transcription, which then serves as the input for selecting the right units from the database; these are then concatenated into a waveform. Below is a great example from Apple's Siri <a href="https://machinelearning.apple.com/2017/08/06/siri-voices.html">engineering team</a> showing how the slicing takes place. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/3096e9/components.png" alt=""></figure>
<p>Using the <a href="https://en.wikipedia.org/wiki/Viterbi_algorithm">Viterbi</a> algorithm, the units are then concatenated in such a way that they create the lowest &quot;cost&quot;, the cost resulting from selecting each unit and from concatenating two units together. Below is a great conceptual graphic from Apple's engineering blog showing this cost estimation. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/166653/cost.png" alt=""></figure>
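<p>To make the idea concrete, here is a toy sketch of such a Viterbi search over candidate units. This is my own illustration, not Apple's actual code: the unit names and costs are made up, and a real system would derive target and join costs from acoustic features.</p>
<pre><code class="language-python"># Toy unit-selection search: pick one candidate unit per phone so that the
# sum of target costs (how well a unit fits) and join costs (how smoothly two
# units concatenate) is minimal.
def viterbi_units(target_costs, join_cost):
    """target_costs: list over phone positions of {unit: target cost};
    join_cost(a, b): cost of concatenating unit a before unit b."""
    # best[u] = (cost of the cheapest path ending in unit u, that path)
    best = {u: (c, [u]) for u, c in target_costs[0].items()}
    for position in target_costs[1:]:
        nxt = {}
        for unit, tcost in position.items():
            prev, (pcost, path) = min(
                best.items(),
                key=lambda kv: kv[1][0] + join_cost(kv[0], unit))
            nxt[unit] = (pcost + join_cost(prev, unit) + tcost, path + [unit])
        best = nxt
    return min(best.values())  # (total cost, chosen unit sequence)

# Two candidates for "s", two for "p", one for "i" - dummy numbers.
costs = [{'s1': 0.1, 's2': 0.6}, {'p1': 0.4, 'p2': 0.2}, {'i1': 0.3}]
join = lambda a, b: 0.0  # pretend every concatenation is glitch-free
total, units = viterbi_units(costs, join)
print(units)  # ['s1', 'p2', 'i1']</code></pre>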
<p>Now, in contrast to the classical way of doing TTS, <a href="http://josesotelo.com/speechsynthesis/">new methods based on deep learning</a> have emerged, where deep neural networks are used to drive the unit selection. If you are interested in how these new systems work in detail, I highly recommend the <a href="https://machinelearning.apple.com/2017/08/06/siri-voices.html">engineering blog entry</a> describing how Apple created the Siri voice. As a final note, I'd like to add that there is also a format called <a href="https://de.wikipedia.org/wiki/Speech_Synthesis_Markup_Language">Speech Synthesis Markup Language</a> (SSML) that allows users to manually specify the prosody for TTS systems; this can be used, for example, to put an emphasis on certain words, which is quite handy. So enough with the boring theory, let's have a look at the available solutions.</p>
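<p>To give you an idea, a minimal SSML document might look like the sketch below. The <code>speak</code>, <code>emphasis</code> and <code>break</code> tags are part of the SSML standard, but which tags are actually honored varies from one TTS provider to the next, so check your provider's documentation.</p>
<pre><code class="language-xml">&lt;speak&gt;
  Broccoli mit dem Strunk in &lt;emphasis level="strong"&gt;anderthalb&lt;/emphasis&gt;
  Zentimeter dicke Scheiben schneiden.
  &lt;break time="500ms"/&gt;
  Backen: ca. 15 Minuten in der Mitte des Ofens.
&lt;/speak&gt;</code></pre>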
<h2>SaaS / Commercial</h2>
<h3>Google TTS</h3>
<p>When thinking about SaaS solutions, the first thing that comes to mind these days is obviously Google's <a href="https://cloud.google.com/text-to-speech/">TTS solution</a>, which they used to showcase the Google Assistant's capabilities at this year's Google I/O conference. Have a look <a href="https://www.youtube.com/watch?v=d40jgFZ5hXk">here</a> if you haven't been wowed today yet. When you go to their website, I highly encourage you to try out their demo with a German text of your choice. It really works well - the only downside for us was that it's not really Swiss German. I doubt that they will offer it for such a small user group - but who knows. I've taken a recipe, had Google read it, and frankly liked the output. </p>
<figure class="embed-responsive embed-responsive--16/9"><iframe src="//player.vimeo.com/video/270423560" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen="true"></iframe></figure>
<h3>Azure Cognitive Services</h3>
<p>Microsoft also offers TTS as part of their Azure <a href="https://azure.microsoft.com/en-us/services/cognitive-services/speech/">cognitive services</a> (ASR, intent detection, TTS). Similar to Google, having ASR and TTS from one provider definitely has the benefit of saving us one roundtrip, since normally you would need to perform the following trips:</p>
<ol>
<li>Send audio data from client to server, </li>
<li>Get response to client (dispatch the message on the client)</li>
<li>Send our text to be transformed to speech (TTS) from client to server </li>
<li>Get the response on client. Play it to the user.</li>
</ol>
<p>Having ASR and TTS in one place reduces it to:</p>
<ol>
<li>ASR From client to server. Process it on the server. </li>
<li>TTS response to client. Play it to the user.</li>
</ol>
<p>Judging by the speech synthesis quality, I personally think that Microsoft's solution doesn't sound as great as Google's. But have a listen for yourself. </p>
<figure class="embed-responsive embed-responsive--16/9"><iframe src="//player.vimeo.com/video/270423598" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen="true"></iframe></figure>
<h3>Amazon Polly</h3>
<p>Amazon - having placed their bets on Alexa - of course has a sophisticated TTS solution, which they call <a href="https://console.aws.amazon.com/polly/home/SynthesizeSpeech">Polly</a>. I love the name :). To get where they are now, they acquired a startup called Ivona back in 2013, which was producing state-of-the-art TTS solutions at the time. Having tried it, I liked the soft tone and the fluency of the results. Have a listen yourself:</p>
<figure class="embed-responsive embed-responsive--16/9"><iframe src="//player.vimeo.com/video/270423539" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen="true"></iframe></figure>
<h3>Apple Siri</h3>
<p>Apple offers TTS as part of their iOS SDK under the name <a href="https://developer.apple.com/sirikit/">SiriKit</a>. I haven't had the chance yet to play with it in depth. Wanting to try it out, I made the error of thinking that Apple's TTS solution on the desktop is the same as SiriKit - yet SiriKit is nothing like the built-in TTS on macOS. Still, for a bit of a laugh you can try the really poor built-in TTS on your MacBook by simply running this on the command line:</p>
<pre><code class="language-bash">say -v fred "Ofen auf 220 Grad vorheizen. Broccoli mit dem Strunk in ca. 1 1/2 cm dicke Scheiben schneiden, auf einem mit Backpapier belegten Blech verteilen. Öl darüberträufeln, salzen.
Backen: ca. 15 Min. in der Mitte des Ofens."</code></pre>
<p>While the output sounds awful, below is the same text read by Siri on the newest iOS 11.3, which shows you how far TTS systems have evolved over the last few years. Sorry for the bad quality, but somehow it seems impossible to turn off the external microphone when recording on an iPhone. </p>
<figure class="embed-responsive embed-responsive--16/9"><iframe src="//player.vimeo.com/video/270441878" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen="true"></iframe></figure>
<h3>IBM Watson</h3>
<p>In this arms race, IBM also offers a TTS system, including a way to define the prosody manually using the <a href="https://de.wikipedia.org/wiki/Speech_Synthesis_Markup_Language">SSML markup standard</a>. Compared to the alternatives presented above, I didn't like their output much, since it sounded rather artificial. But give it a try for yourself.</p>
<figure class="embed-responsive embed-responsive--16/9"><iframe src="//youtube.com/embed/2Er2xl7MPBo" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen="true"></iframe></figure>
<h3>Other commercial solutions</h3>
<p>Finally, there are also competitors beyond the obvious ones, such as <a href="https://www.nuance.com">Nuance</a> (formerly ScanSoft - originating from Xerox research). Despite their page promising a <a href="http://ttssamples.syntheticspeech.de/ttsSamples/nuance-zoe-news-1.mp3">lot</a>, I found the quality of their German TTS to be a bit lacking. </p>
<figure class="embed-responsive embed-responsive--16/9"><iframe src="//player.vimeo.com/video/270423596" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen="true"></iframe></figure>
<p>Facebook doesn't offer a TTS solution yet - maybe they have placed their bets on virtual reality instead. Other notable solutions are <a href="http://www.acapela-group.com/">Acapela</a>, <a href="http://www.innoetics.com">Innoetics</a>, <a href="http://www.onscreenvoices.com">TomWeber Software</a>, <a href="https://www.aristech.de/de/">Aristech</a> and <a href="https://slowsoft.ch">Slowsoft</a> for Swiss German TTS.</p>
<h2>OpenSource</h2>
<p>Instead of providing the same kind of overview for the open source area, I think it's easier to list a few projects and provide a sample of their synthesis. Many of these projects are academic in nature and don't come with all the bells, whistles and fancy APIs of the commercial products, but with some dedication they could definitely do the job. </p>
<ul>
<li><a href="http://espeak.sourceforge.net">Espeak</a>. <a href="http://ttssamples.syntheticspeech.de/ttsSamples/espeak-s1.mp3">sample</a> - My personal favorite. </li>
<li><a href="http://www.speech.cs.cmu.edu/flite/index.html">Festival / Flite</a> - Festival comes from the University of Edinburgh; Flite is CMU's lightweight port, focused on portability. No sample.</li>
<li><a href="http://mary.dfki.de">Mary</a>. From the German Research Center for Artificial Intelligence (DFKI). <a href="http://ttssamples.syntheticspeech.de/ttsSamples/pavoque_s1.mp3">sample</a></li>
<li><a href="http://tcts.fpms.ac.be/synthesis/mbrola.html">Mbrola</a> from the University of Mons. <a href="http://ttssamples.syntheticspeech.de/ttsSamples/de7_s1.mp3">sample</a></li>
<li><a href="http://tundra.simple4all.org/demo/index.html">Simple4All</a> - an EU-funded project. <a href="http://ttssamples.syntheticspeech.de/ttsSamples/simple4all_s1.mp3">sample</a></li>
<li><a href="https://mycroft.ai">Mycroft</a>. More of an open source assistant, but runs on the Raspberry Pi.</li>
<li><a href="https://mycroft.ai/documentation/mimic/">Mimic</a>. Only the TTS from the Mycroft project. No sample available.</li>
<li>Mozilla has published over 500 hours of material in their <a href="https://voice.mozilla.org/de/data">common voice project</a>. Based on this data they offer a deep learning ASR project <a href="https://github.com/mozilla/DeepSpeech">Deep Speech</a>. Hopefully they will offer TTS based on this data too someday. </li>
<li><a href="http://josesotelo.com/speechsynthesis/">Char2Wav</a> from the University of Montreal (who, btw., maintain the Theano library). <a href="http://josesotelo.com/speechsynthesis/files/wav/pavoque/original_best_bidirectional_text_0.wav">sample</a></li>
</ul>
<p>Overall, my feeling is that, unfortunately, most of the open source systems have not yet caught up with the commercial ones. I can only speculate about the reasons: it probably takes a significant amount of good raw audio data to produce comparable results, plus a lot of fine-tuning of the final model for each language. For an elaborate overview of TTS systems, especially the ones that work in German, I highly recommend checking out the <a href="http://ttssamples.syntheticspeech.de">extensive list</a> that Felix Burkhardt from the Technical University of Berlin has compiled. </p>
<p>That sums up the market overview of commercial and open source solutions. Overall I was quite amazed how fluent some of these solutions sounded and think the technology is ready to really change how we interact with computers. Stay tuned for the next blog post where I will explain how we put one of these solutions to use to create a hands free recipe reading assistant.</p>]]></description>
                  <enclosure url="http://liip.rokka.io/www_card_2/29d939/baking-bread-knife-brown-162786.jpg" length="2948380" type="image/jpeg" />
          </item>
        <item>
      <title>Speech recognition with wit.ai</title>
      <link>https://www.liip.ch/fr/blog/speech-recognition-with-wit-ai</link>
      <guid>https://www.liip.ch/fr/blog/speech-recognition-with-wit-ai</guid>
      <pubDate>Tue, 13 Mar 2018 00:00:00 +0100</pubDate>
<description><![CDATA[<p>Speech recognition is here to stay. Google Home, Amazon Echo/Dot and Apple HomePod devices are storming our living rooms. Assistants on mobile phones such as Siri or the Google Assistant have reached a point where they actually become reasonably useful. So we might ask ourselves: can we put this technology to other uses than asking Alexa to put beer on the shopping list, or Microsoft Cortana for directions? Not much is actually needed to create your own piece of software with speech recognition, so let's get started! </p>
<h2>Overview</h2>
<p>If you want to have your own speech recognition, there are three options: </p>
<ol>
<li>You can hack Alexa to do things, but you might be limited in possibilities</li>
<li>You can use one of the integrated solutions such as <a href="https://www.kickstarter.com/projects/seeed/respeaker-an-open-modular-voice-interface-to-hack?lang=de">ReSpeaker</a>, which allows you more flexibility and has a microphone array and speech recognition built in.</li>
<li>Or you use just a Raspberry Pi or your laptop. That's the option I am going to talk about in this article. Oh, btw., <a href="https://www.liip.ch/en/blog/webspeech-apis">here</a> is a blog post from Pascal - another Liiper - showing how to do ASR in the browser. </li>
</ol>
<h2>Speech Recognition (ASR) as Opensource</h2>
<p>If you want to build your own device, you can make use of excellent open source projects like <a href="https://cmusphinx.github.io">CMU Sphinx</a>, <a href="https://mycroft.ai/get-mycroft/">Mycroft</a>, <a href="https://github.com/Microsoft/CNTK">CNTK</a>, <a href="https://github.com/kaldi-asr/kaldi">kaldi</a>, <a href="https://github.com/mozilla/DeepSpeech">Mozilla DeepSpeech</a> or <a href="https://keenresearch.com">KeenASR</a>. These can be deployed locally, often already run on a Raspberry Pi, and have the benefit that no data has to be sent over the Internet in order to recognize what you've just said - so there is no lag between saying something and the reaction of your device (we'll cover this issue later). The drawback might be the quality of the speech recognition and the ease of use. You might be wondering why it is hard to get speech recognition right. The short answer is data. The longer answer follows:</p>
<h3>In a nutshell - how does speech recognition work?</h3>
<p>Normally (<a href="http://www.cs.toronto.edu/~graves/icml_2006.pdf">original paper here</a>) the idea is that you have a <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">recurrent neural network</a> (RNN), a deep learning network in which the current state influences the next state. You then feed 20-40 ms slices of audio, previously transformed into a <a href="https://de.wikipedia.org/wiki/Spektrogramm">spectrogram</a>, as input into the RNN. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/114078/spectrogram-edison.jpg" alt="A spectrogram"></figure>
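<p>As a rough sketch of that preprocessing step, the snippet below slices a mono signal into 20 ms frames and takes the magnitude spectrum of each frame. It assumes numpy and 16 kHz audio; real ASR pipelines add overlapping windows and mel filter banks on top of this.</p>
<pre><code class="language-python"># Slice audio into 20 ms frames and compute a magnitude spectrum per frame.
import numpy as np

def spectrogram_frames(samples, rate=16000, frame_ms=20):
    frame_len = int(rate * frame_ms / 1000)        # 20 ms at 16 kHz = 320 samples
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    window = np.hanning(frame_len)                 # taper the frame edges
    return np.abs(np.fft.rfft(frames * window, axis=1))

audio = np.random.randn(16000)                     # one second of noise
spec = spectrogram_frames(audio)
print(spec.shape)                                  # (50, 161): 50 frames, 161 bins</code></pre>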
<p>An RNN is useful for language tasks in particular because each letter influences the likelihood of the next. When you say &quot;speech&quot;, for example, the chance of saying &quot;ch&quot; after you've said &quot;spee&quot; is quite high (&quot;speed&quot; might be an alternative too). Each 20 ms slice is transformed into a letter, and we might end up with a letter sequence like &quot;sss_pee_eech&quot;, where &quot;_&quot; means that nothing was recognized (it also keeps the two &quot;e&quot;s of &quot;speech&quot; from being merged). After combining runs of the same letter into one and removing the blanks, we end up - if we're lucky - with the word &quot;speech&quot;, among other candidates like &quot;spech&quot;, &quot;spich&quot;, &quot;sbitsch&quot;, etc. Because the word &quot;speech&quot; appears more often in written text, we'll go for that. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/862924/bildschirmfoto-2018-03-14-um-11-10-02.jpg" alt="A RNN for speech recognition"></figure>
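<p>That collapsing step fits into a few lines of Python. This is only the greedy decoding part, with &quot;_&quot; standing in for the blank symbol; note how the blank between the two &quot;e&quot;s is exactly what keeps them from being merged:</p>
<pre><code class="language-python"># Collapse a frame-by-frame letter sequence into a word:
# merge runs of identical letters, then drop the blank symbol '_'.
from itertools import groupby

def ctc_collapse(letters, blank='_'):
    merged = (letter for letter, _ in groupby(letters))  # 'sss' becomes 's'
    return ''.join(letter for letter in merged if letter != blank)

print(ctc_collapse('sss_pee_eech'))  # speech</code></pre>
<p>Without the blank between the &quot;e&quot;s, the same collapse would yield one of the wrong candidates: <code>ctc_collapse('sss_peeeech')</code> gives &quot;spech&quot;.</p>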
<p>Where is the problem now? Well, as a private person you will not have the millions of speech samples that are needed to train such a neural network. On the other hand, everything you say to your phone is collected by the likes of Amazon and Google and used as training material. You don't believe me? Here is everything you have <a href="https://myactivity.google.com/myactivity?restrict=vaa speech">ever said</a> to your Android phone. So what are your options? You can still use one of the open source libraries that come with a pre-trained model, but these models have often only been trained for English; if you want to make them work for German or even Swiss German, you'd have to train them yourself. If you just want to get started, you can use a speech-recognition-as-a-service provider. </p>
<h2>Speech Recognition as a Service</h2>
<p>If you feel like using a speech recognition service, it might surprise you that most startups in this area have been bought up by the giants: Google bought <a href="https://dialogflow.com">api.ai</a> and Facebook bought another startup working in this field, <a href="http://wit.ai">wit.ai</a>. Of course, the other big five companies have their own speech services too: Microsoft has <a href="https://azure.microsoft.com/services/cognitive-services/speech/">cognitive services in Azure</a> and IBM has speech recognition built into <a href="https://www.ibm.com/watson/services/speech-to-text/">Watson</a>. Feel free to choose one for yourself - from my experience their performance is quite similar. In this example I went with wit.ai.</p>
<h2>Speech recognition with wit.ai</h2>
<p>For a fun little project, &quot;Heidi - the smart radio&quot;, at the <a href="https://www.hackdays.ch">SRF Hackathon</a> (btw., Heidi scored 9th out of 30 :)), I decided to build a smart little radio that listens to what you are saying: you tell the radio which station you want to hear, and it plays it. That's about it. All you need to build a prototype is a microphone and a speaker, so let's get started.</p>
<h2>Get the audio</h2>
<p>First you will have to get the audio from your microphone, which can be done quite nicely with Python and <a href="http://people.csail.mit.edu/hubert/pyaudio/">pyaudio</a>. The idea here is to create a never-ending loop that always records 4 seconds of speech and then saves them to a file. In order to send the data to wit.ai, we read the file back and send it as a POST request. Btw., we will do the recording in mono. </p>
<pre><code class="language-python">#speech.py
import pyaudio
import wave

def record_audio(RECORD_SECONDS, WAVE_OUTPUT_FILENAME):
    #--------- SETTING PARAMS FOR OUR AUDIO FILE ------------#
    FORMAT = pyaudio.paInt16    # format of wave
    CHANNELS = 1                # no. of audio channels
    RATE = 44100                # frame rate
    CHUNK = 1024                # frames per buffer
    #--------------------------------------------------------#

    # creating PyAudio object
    audio = pyaudio.PyAudio()

    # open a new stream for microphone
    # It creates a PortAudio Stream Wrapper class object
    stream = audio.open(format=FORMAT,channels=CHANNELS,
                        rate=RATE, input=True,
                        frames_per_buffer=CHUNK)

    #----------------- start of recording -------------------#
    print("Listening...")

    # list to save all audio frames
    frames = []

    for i in range(int(RATE / CHUNK * RECORD_SECONDS)):
        # read audio stream from microphone
        data = stream.read(CHUNK)
        # append audio data to frames list
        frames.append(data)

    #------------------ end of recording --------------------#   
    print("Finished recording.")

    stream.stop_stream()    # stop the stream object
    stream.close()          # close the stream object
    audio.terminate()       # terminate PortAudio

    #------------------ saving audio ------------------------#

    # create wave file object
    waveFile = wave.open(WAVE_OUTPUT_FILENAME, 'wb')

    # settings for wave file object
    waveFile.setnchannels(CHANNELS)
    waveFile.setsampwidth(audio.get_sample_size(FORMAT))
    waveFile.setframerate(RATE)
    waveFile.writeframes(b''.join(frames))

    # closing the wave file object
    waveFile.close()

def read_audio(WAVE_FILENAME):
    # function to read audio(wav) file
    with open(WAVE_FILENAME, 'rb') as f:
        audio = f.read()
    return audio

def RecognizeSpeech(AUDIO_FILENAME, num_seconds = 5):

    # record audio of specified length in specified audio file
    record_audio(num_seconds, AUDIO_FILENAME)

    # reading audio
    audio = read_audio(AUDIO_FILENAME)

    # send `audio` off to wit.ai here and return the transcript
    # (the request itself is shown in the next section)

if __name__ == "__main__":
    while True:
        text =  RecognizeSpeech('myspeech.wav', 4)</code></pre>
<p>Ok now you should have a myspeech.wav file in your folder that gets replaced with the newest recording every 4 seconds. We need to send it to wit.ai to find out what we've actually said. </p>
<h2>Transform it into text</h2>
<p>There is <a href="https://wit.ai/docs">extensive documentation</a> for wit.ai. I will use the <a href="https://wit.ai/docs/http/20170307">HTTP API</a>, which you can simply try out with curl. To help you get started, I thought I'd write a small script that shows some of its capabilities. Generally, all you need is an access token from wit.ai that you send in the headers, plus the audio data you want transformed into text; you will receive a text representation of it back. </p>
<pre><code class="language-python">#recognize.py
import requests
import json

def read_audio(WAVE_FILENAME):
    # function to read audio(wav) file
    with open(WAVE_FILENAME, 'rb') as f:
        audio = f.read()
    return audio

API_ENDPOINT = 'https://api.wit.ai/speech'
ACCESS_TOKEN = 'XXXXXXXXXXXXXXX'

# get a sample of the audio that we recorded before. 
audio = read_audio("myspeech.wav")

# defining headers for HTTP request
headers = {'authorization': 'Bearer ' + ACCESS_TOKEN,
           'Content-Type': 'audio/wav'}

#Send the request as post request and the audio as data
resp = requests.post(API_ENDPOINT, headers = headers,
                         data = audio)

#Get the text
data = json.loads(resp.content)
print(data)</code></pre>
<p>So after recording something into your &quot;.wav&quot; file, you can send it off to wit.ai and receive an answer:</p>
<pre><code class="language-bash">python recognize.py
{u'entities': {}, u'msg_id': u'0vqgXgfW8mka9y4fi', u'_text': u'Hallo Internet'}</code></pre>
<h2>Understanding the intent</h2>
<p>Nice, it understood my gibberish! Now the only thing left is to understand the <strong><em>intent</em></strong> of what we actually want. For this, wit.ai has created an interface to figure out what the text was about. Different providers <a href="https://medium.com/@abraham.kang/understanding-the-differences-between-alexa-api-ai-wit-ai-and-luis-cortana-2404ece0977c">differ</a> quite a bit in how they model intent, but for wit.ai it is nothing more than fiddling around with the GUI. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/12a718/interface.jpg" alt="Teaching wit.ai our patterns"></figure>
<p>As you can see in the screenshot, wit comes with a couple of predefined entity types, such as age_of_person, amount_of_money, datetime, duration, email, etc. What you basically do is mark the word you are particularly interested in with your mouse - for example the radio station &quot;srf1&quot; - and assign it to a matching entity type. If you can't find one, you can simply create one, such as &quot;radiostation&quot;. You can then use the textbox to enter further examples and formulations and mark the entity, in order to &quot;train&quot; wit to recognize it in different contexts. It works to a certain extent, but don't expect too much of it. If you are happy with the results, you can try it through the API.</p>
<pre><code class="language-python">#intent.py

import requests
import json
API_ENDPOINT = 'https://api.wit.ai/speech'
ACCESS_TOKEN = 'XXXXXXXXXXXXXXX'

headers = {'authorization': 'Bearer ' + ACCESS_TOKEN}

# Send the text
text = "Heidi spiel srf1."
resp = requests.get('https://api.wit.ai/message?&amp;q=(%s)' % text, headers = headers)

#Get the text
data = json.loads(resp.content)
print(data)</code></pre>
<p>So when you run it you might get:</p>
<pre><code class="language-bash">python intent.py
{u'entities': {u'radiostation': [{u'confidence': 1, u'type': u'value', u'value': u'srf1'}]}, u'msg_id': u'0CPCCSKNcZy42SsPt', u'_text': u'(Heidi spiel srf1.)'}</code></pre>
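<p>Pulling the recognized station out of that response is best done defensively, since the entities dict may well be empty when wit didn't recognize anything:</p>
<pre><code class="language-python"># Extract the first radiostation value from a wit.ai response dict,
# falling back to None when no station was recognized.
def extract_station(data):
    stations = data.get('entities', {}).get('radiostation', [])
    return stations[0]['value'] if stations else None

resp = {'entities': {'radiostation': [{'confidence': 1, 'type': 'value',
                                       'value': 'srf1'}]},
        '_text': '(Heidi spiel srf1.)'}
print(extract_station(resp))  # srf1</code></pre>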
<h2>Obey</h2>
<p>Nice, it understood our radio station! There is not much left to do other than actually play the station. I've used a hacky mplayer call to just play something, but the sky is the limit here.</p>
<pre><code class="language-python">...
if radiostation == "srf1" :
        os.system("mplayer http://stream.srg-ssr.ch/m/drs1/aacp_96")
...</code></pre>
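<p>If you want it slightly less hacky, you can map station names to stream URLs and launch the player with subprocess instead of os.system. Only the srf1 URL is from the snippet above; further stations are left for you to fill in.</p>
<pre><code class="language-python"># Map station names to stream URLs and launch mplayer without blocking.
import subprocess

STREAMS = {
    'srf1': 'http://stream.srg-ssr.ch/m/drs1/aacp_96',
    # add further stations and their stream URLs here
}

def play(station):
    url = STREAMS.get(station)
    if url is None:
        print('Unknown station: %s' % station)
        return None
    # Popen returns immediately, so the radio can keep listening while playing
    return subprocess.Popen(['mplayer', url])</code></pre>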
<h2>Conclusion</h2>
<p>That was easy, wasn't it? Well yes, but I omitted one problem: our little smart radio is not very convenient because it feels very laggy. It has to listen for 4 seconds first, then transmit the data to wit and wait until wit has recognized it, then figure out the intent, and finally play the radio station. That takes a while - not really long, maybe 1-2 seconds, but we humans are quite sensitive to such lags. If you say the voice command at exactly the right moment while it is listening, you might be lucky; otherwise you might end up having to repeat your command multiple times just to hit the right slot. So what is the solution?</p>
<p>The solution comes in the form of a so-called &quot;wake word&quot;. It's a keyword that the device constantly listens for, and the reason why you always have to say &quot;Alexa&quot; first if you want something from it. Once a device picks up its wake word, it starts to record what you say after the keyword and transmits this bit to the cloud for processing and storage. In order to pick up the keyword fast, most of these devices do the speech recognition for the keyword on the device itself, and only send the data off to the cloud afterwards. Some companies, like Google, went even further and put the <a href="https://arxiv.org/pdf/1603.03185.pdf">whole ML model</a> on the mobile phone in order to have a faster response rate and, as a bonus, to work offline too. </p>
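<p>The control flow of such a wake-word loop can be sketched as below. The detector function is only a placeholder for a small on-device keyword spotter; the point is that only what comes after the wake word would be sent to the cloud:</p>
<pre><code class="language-python"># Sketch of the wake-word pattern: cheap local detection first,
# expensive cloud ASR only for what follows the keyword.
def assistant_loop(chunks, is_wake_word, send_to_cloud):
    awake = False
    command = []
    for chunk in chunks:
        if not awake:
            awake = is_wake_word(chunk)   # runs locally, all the time
        else:
            command.append(chunk)         # everything after the wake word
    return send_to_cloud(command) if command else None

# Simulated run with text chunks standing in for audio:
heard = ['noise', 'heidi', 'spiel', 'srf1']
result = assistant_loop(heard, lambda c: c == 'heidi', lambda cmd: ' '.join(cmd))
print(result)  # spiel srf1</code></pre>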
<h2>What's next?</h2>
<p>Although the &quot;magic&quot; behind the scenes of automatic speech recognition systems is quite complicated, it's easy to use speech recognition as a service. On the other hand, the market is already quite saturated with different devices at quite affordable prices, so there is really not much to win by creating your own device in such a competitive market. Yet it might be interesting to use open source ASR solutions in existing systems where there is a need for confidentiality. I am sure not every user wants their speech data to end up in a Google data center when using a third-party app. </p>
<p>On the other hand, offering devices at affordable prices turns out to be a good strategy for the big players: not only do they collect more training data this way - which makes their speech recognition even better - but eventually they control a very private channel to the consumer, namely speech. After all, it's hard to find an easier way of buying things than just <a href="https://xkcd.com/1807/">saying it out loud</a>.</p>
<p>For all other applications, it depends on what you want to achieve. If you are a media company and want to be present on these devices, which will probably soon replace our old radios, then you should start <a href="https://developer.amazon.com/docs/ask-overviews/build-skills-with-the-alexa-skills-kit.html">developing</a> so-called <a href="https://www.amazon.de/b?ie=UTF8&amp;node=10068460031">&quot;skills&quot;</a> for each of these systems. The discussion on the pros and cons of smart speakers is already <a href="https://medienwoche.ch/2018/03/07/interaktive-medienbeschallung-aus-dem-intelligenten-lautsprecher/">ongoing</a>. </p>
<p>For websites, this new technology might finally bring an improvement for impaired people, as modern browsers increasingly <a href="https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API">support ASR directly</a> in the client. So it might not take too long until the old paradigm in web development shifts from &quot;mobile first&quot; to &quot;speech first&quot;. We will see what the future holds.</p>
                  <enclosure url="http://liip.rokka.io/www_card_2/43ad37/pexels-photo-595804.jpg" length="6542136" type="image/jpeg" />
          </item>
    
  </channel>
</rss>
