<?xml version="1.0" encoding="utf-8"?>
<!-- generator="Kirby" -->
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">

  <channel>
    <title>Mot-cl&#233;: data stack &#183; Blog &#183; Liip</title>
    <link>https://www.liip.ch/fr/blog/tags/data+stack</link>
    <generator>Kirby</generator>
    <lastBuildDate>Mon, 16 Apr 2018 00:00:00 +0200</lastBuildDate>
    <atom:link href="https://www.liip.ch" rel="self" type="application/rss+xml" />

        <description>Articles du blog Liip avec le mot-cl&#233; &#8220;data stack&#8221;</description>
    
        <language>fr</language>
    
        <item>
      <title>The Data Science Stack 2018</title>
      <link>https://www.liip.ch/fr/blog/the-data-science-stack-2018</link>
      <guid>https://www.liip.ch/fr/blog/the-data-science-stack-2018</guid>
      <pubDate>Mon, 16 Apr 2018 00:00:00 +0200</pubDate>
<description><![CDATA[<p>More than one year ago I sat down and went through my various GitHub stars and browser bookmarks to compile what I then called the Data Science stack. It was basically an exhaustive collection of tools, some of which I use on a daily basis, while others I had only heard of. The outcome was a big PDF poster which you can download <a href="https://www.liip.ch/en/blog/data-stack">here</a>. </p>
<p>The good thing about it was that every tool I had in mind could be found there somewhere, and like a map I could instantly see which category it belonged to. As a bonus I was able to identify my personal white spots on the map. The bad thing about it was that as soon as I had compiled the list, it was out of date. So I transferred the collection into a Google Sheet, and whenever a new tool emerged on my horizon I added it there. Since then, in almost a year, I have added 102 tools to it. </p>
<h2>From PDF to Data Science Stack website</h2>
<p>While it would be OK to release another PDF of the stack year after year, I thought it might be a better idea to turn this into a website where everybody can add tools.<br />
So without further ado I present you the <a href="http://datasciencestack.liip.ch">http://datasciencestack.liip.ch</a> page. Its goal is still to provide orientation like the PDF did, but without ever becoming stale. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/2dd1be/front.png" alt="frontpage"></figure>
<p><strong>Adding Tools: </strong>Adding tools to my Google Sheet felt a bit lonesome, so I asked others internally to add tools whenever they found new ones too. Finally, when moving away from the old Google Sheet and opening our collection process to everybody, I added a little button on the website that allows everybody to add tools to the appropriate category themselves. Just send us the name, link and a quick description, and we will add it after a quick sanity check. The goal is to gather user generated input too! I am also thinking about turning the website into a “github awesome” repository, so that adding tools can be done in a more programmer-friendly way. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/855d2c/add.png" alt="adding tools for everyone"></figure>
<p><strong>Search:</strong> When entering new tools, I realized that I was not sure whether a tool already existed on the page, and since tools are hidden away after the first five, the CTRL+F approach didn’t really work. That's why the website now has a little search box to check if a tool is already in our list. If not, just add it to the appropriate category. </p>
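<p>Under the hood, such a search only needs a case-insensitive substring match over the tool names. A minimal sketch in Python (the tool list and function here are hypothetical illustrations, not the site's actual code):</p>

```python
# Hypothetical in-memory tool collection, mirroring the site's categories.
TOOLS = [
    {"name": "Kafka", "category": "Message Queues"},
    {"name": "Keras", "category": "Deep Learning"},
    {"name": "Kibana", "category": "Business Intelligence"},
]

def search_tools(query, tools=TOOLS):
    """Return all tools whose name contains the query, case-insensitively."""
    q = query.strip().lower()
    return [t for t in tools if q in t["name"].lower()]

print(search_tools("kaf"))  # matches only Kafka
```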
<p><strong>Mailing List:</strong> If you are a busy person and want to stay on top of things, I would not expect you to regularly check back and search for changed entries. This is why I decided to send out a quarterly mailing that contains the new tools we have added since our last data science stack update. This helps you to quickly reconnect to this important topic and maybe also to discover a data science gem you have not heard of yet. </p>
<p><strong>JSON download:</strong> Some people asked me for the raw data of the PDF, and at that time I was not able to give it to them quickly enough. That's why I added a JSON route that allows you to simply download the whole collection as a JSON file and create your own visualizations, maps or stacks with the tools that we have collected. Maybe something cool is going to come out of this. </p>
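<p>To give an idea of what you could do with such a dump: assuming the JSON route returns a flat list of records with a category and a name (the exact URL and field names are my assumption, so check the real payload first), counting tools per category takes only a few lines of Python:</p>

```python
import json
from collections import Counter

# Hypothetical sample of the payload; in practice you would download the
# real JSON file from datasciencestack.liip.ch and adapt the field names.
payload = json.loads("""
[
  {"category": "Data Sources", "name": "Piwik"},
  {"category": "Analysis", "name": "scikit-learn"},
  {"category": "Analysis", "name": "Keras"}
]
""")

# Count tools per category as a starting point for your own chart.
per_category = Counter(entry["category"] for entry in payload)
print(per_category.most_common())
```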
<p><strong>Communication:</strong> Scanning through such a big list of options can sometimes feel a bit overwhelming, especially since we don’t really provide any additional info or orientation on the site. That’s why I added multiple ways of contacting us, in case you are just right now searching for a solution for your business. I took the liberty to also link our blog posts that are tagged with machine learning at the bottom of the page, because often we make use of the tools in these. </p>
<p><strong>Zebra integration:</strong> Although it's nowhere visible on the website, I have hooked up the data science stack to our internal “technology database” system, called Zebra (actually Zebra does a lot more, but for us the technology part is relevant). Whenever someone enters a new technology into our technology db, it is automatically added for review to the data science stack. Like this we are basically tapping into the collective knowledge of all of our company's employees. The screenshot below gives a glimpse of our tech db on Zebra, capturing not only the tool itself but also the common feelings towards it. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/680599/zebra.png" alt="Zebra integration"></figure>
<h2>Insights from collecting tools for one more year</h2>
<p>Furthermore, I would like to share the questions that guided me in researching each area and the insights that I gathered in the year of maintaining this list. Below you see a little chart showing the categories to which I added the most tools in the last year. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/caabb8/graphs2.png" alt="overview"></figure>
<h3>Data Sources</h3>
<p>One of the remaining questions for us is which tools offer good and legally compliant ways to capture user interaction. Instead of taking Google Analytics as the norm, we are always on the lookout for new and fresh solutions in this area. Besides Heatmap Analytics, another new category I added is “Tag Management”. Regarding the classic website analytics solutions, I was quite surprised that there are still quite a lot of new solutions popping up. I added a whole lot of solutions, and entirely new categories like mobile analytics and app store analytics, after discovering the great GitHub awesome list of analytics solutions <a href="https://github.com/onurakpolat/awesome-analytics">here</a>.</p>
<figure><img src="https://liip.rokka.io/www_inarticle/2aff10/sources2.png" alt="data sources"></figure>
<h3>Data Processing</h3>
<p>How can we initially clean or transform the data? How and where can we store the logs that are created by these transformation events? And where can we obtain additional valuable data? Here I’ve added quite a few tools in the ETL area and in the message queue category. It looks like eventually I will need to split up the “message queue” category into multiple ones, because it feels like that one drawer in the kitchen where everything ends up in a big mess. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/732417/processing.png" alt="data processing"></figure>
<h3>Database</h3>
<p>What options are out there to store the data? How can we search through it? How can we access data sources efficiently? Here I mainly added a few specialized solutions, such as databases focused on storing time series or graph/network data. I might have missed something, but I feel that there is no new paradigm shift on the horizon right now (like graph-oriented, NoSQL, column-oriented or NewSQL databases once were). It is probably in the area of big data where most of the new tools emerged. An awesome list that goes beyond our collection can be found <a href="https://github.com/onurakpolat/awesome-bigdata">here</a>.</p>
<figure><img src="https://liip.rokka.io/www_inarticle/64a802/database.png" alt="database"></figure>
<h3>Analysis</h3>
<p>Which stats packages are available to analyze the data? What frameworks are out there to do machine learning, deep learning, computer vision, natural language processing? Obviously, the high momentum of deep learning leads to many new entries in this category. In the “general” category I’ve added quite a few entries, showing that there is still huge momentum in the various areas of machine learning beyond deep learning alone. Interestingly, I did not find any new stats software packages, probably hinting that the paradigm of these one-size-fits-all solutions is over. The party is probably taking place in the cloud, where the big five have constantly added more and more specialized machine learning solutions, for example for text, speech, image, video or chatbot/assistant related tasks, just to name a few. At least those were the areas where I added most of the new tools. Going beyond the focus on Python, there is the awesome <a href="https://github.com/josephmisiti/awesome-machine-learning">list</a> that covers solutions for almost every programming language. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/ab1df5/analysis.png" alt="analysis"></figure>
<h3>Visualization, Dashboards, and Applications</h3>
<p>What happens with the results? What options do we have to visually communicate them? How do we turn those visualizations into dashboards or entire applications? Which additional ways to communicate with users besides reports/emails are out there? Surprisingly, I’ve only added a few new entries here, maybe because I happened to be quite thorough in researching this area last year, or simply because the time of JS visualizations popping up left and right has cooled off a bit and the existing solutions are maturing. Yet this awesome <a href="https://github.com/fasouto/awesome-dataviz">list</a> shows that development in this area is still far from cooling off. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/cd8663/viz.png" alt="visualization"></figure>
<h3>Business Intelligence</h3>
<p>What solutions exist that try to integrate data sourcing, data storage, analysis and visualization in one package? What BI solutions are out there for big data? Are there platforms/solutions that offer more of a flexible data-scientist approach (e.g. free choice of methods, models, transformations)? Here I mainly added platforms in the cloud; it seems only logical to offer fewer desktop-oriented BI solutions, given their constrained computational power and the high complexity of maintaining BI systems on premise. Although business intelligence solutions are less community and open source driven than the other stacks, there are also <a href="https://github.com/thenaturalist/awesome-business-intelligence">awesome lists</a> where people curate those solutions. </p>
<figure><img src="https://liip.rokka.io/www_inarticle/7fa6f4/bi.png" alt="business intelligence"></figure>
<p>You might have noticed that I tried to slip an awesome list on GitHub into almost every category, to encourage you to look more in depth into each area. If you want to spend days of your life discovering awesome things, I strongly suggest you check out the collections of awesome lists <a href="https://github.com/jnv/lists">here</a> or <a href="https://github.com/sindresorhus/awesome">here</a>.</p>
<h3>Conclusion or what's next?</h3>
<p>I realized that keeping the list up to date in some areas seems almost impossible, while others gradually mature over time and the number of new tools in those areas is easy to keep track of. I also had to recognize that maintaining an exhaustive and always up-to-date list across those 5 broad categories is quite a challenge. That's why I went out to get help. I looked for people in our company particularly interested in one of these areas and nominated them technology ambassadors for that part of the stack. Their task will be to add new tools whenever they pop up on their horizon. </p>
<p>I have also come to the conclusion that the stack is quite useful for offering customers a bit of an overview at the beginning of a journey. It adds value to just know what popular solutions are out there and start digging around yourself. Yet separating the more mature tools from the experimental ones, or knowing which open source solutions have a good community behind them, is quite a hard task for somebody without experience. Somehow it would be great to highlight “the pareto principle” in this stack by pointing out only a handful of solutions and saying you will be fine when you use those. Yet I also have to acknowledge that this will not replace a good consultation in the long run. </p>
<p>Already looking towards the improvement of this collection, I think that each tool needs some sort of scoring: while there are plain vanilla tools that are mature and do the job, there are also highly specialized, very experimental tools that offer help in a very niche area only. While this information is somewhat buried in my head, it would be good to make it explicit on the website. Here I highly recommend what Thoughtworks has come up with in their <a href="https://www.thoughtworks.com/radar">technology radar</a>. Although their radar goes well beyond our little domain of data services, it offers a great way to differentiate tools, namely into four categories: </p>
<ul>
<li>Adopt: We feel strongly that the industry should be adopting these items. We see them when appropriate on our projects. </li>
<li>Trial: Worth pursuing. It is important to understand how to build up this capability. Enterprises should try this technology on a project that can handle the risk. </li>
<li>Assess: Worth exploring with the goal of understanding how it will affect your enterprise. </li>
<li>Hold: Proceed with caution.</li>
</ul>
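<p>Applied to our stack, each tool entry would simply carry one of these four rings next to its category. A minimal sketch of how such a scoring could be modelled (the names and data here are mine, purely illustrative, not the site's or Thoughtworks' actual code):</p>

```python
from enum import Enum

class Ring(Enum):
    """The four assessment rings of the Thoughtworks technology radar."""
    ADOPT = "adopt"
    TRIAL = "trial"
    ASSESS = "assess"
    HOLD = "hold"

# Hypothetical assessments; on the website these would come from votes.
assessments = {
    "pandas": Ring.ADOPT,
    "BlinkDB": Ring.ASSESS,
}

def tools_in_ring(ring, assessments=assessments):
    """List all tools placed in a given ring, alphabetically."""
    return sorted(name for name, r in assessments.items() if r is ring)
```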
<figure><img src="https://liip.rokka.io/www_inarticle/37daaf/radar.png" alt="Technology radar"></figure>
<p>Assessing tools according to these criteria is no easy task - Thoughtworks does it by nominating a high profile jury that votes regularly on these tools. With 4500 employees, I am sure that their assessment is a representative sample of the industry. For us and our stack, a first step would be to adopt this differentiation, fill it out myself and then get other Liipers to vote on these categories. To a certain degree we have already started this task internally in our tech db, where each employee can record a common feeling towards a tool. </p>
<p>Concluding this blogpost, I realized that the simple task of “just” having a list with relevant tools for each area seemed quite easy at the start. The more I think about it, and the more experience I collect in maintaining this list, the more I realize that eventually such a list grows into a knowledge and technology management system. While such systems have their benefits (e.g. in onboarding or quickly finding experts in an area), I feel that turning this list into one would mean walking down a rabbit hole from which I might never re-emerge. Let’s see what the next year will bring.</p>]]></description>
                  <enclosure url="http://liip.rokka.io/www_card_2/2dd1be/front.jpg" length="4200904" type="image/jpeg" />
          </item>
        <item>
      <title>The Data Stack &#8211; Download the most complete overview of the data centric landscape.</title>
      <link>https://www.liip.ch/fr/blog/data-stack</link>
      <guid>https://www.liip.ch/fr/blog/data-stack</guid>
      <pubDate>Mon, 13 Feb 2017 00:00:00 +0100</pubDate>
<description><![CDATA[<p>(Web-)Developers are used to stacks, most prominent among them probably the LAMP stack or the more current MEAN stack. Of course there are plenty around, but on the other hand, I have not heard many data scientists talking about data stacks – maybe because we think that in a lot of cases all you need is some Python, a CSV file, pandas, and scikit-learn to do the job.</p>
<p>But when we sat down recently with our team, I realized that we indeed use a myriad of different tools, frameworks, and SaaS solutions. I thought it would be useful to organize them in a meaningful data stack. I have not only included the tools we are using; I sat down and started researching. It turned into an extensive list, a.k.a. the <strong>data stack PDF</strong>. This poster will:</p>
<ul>
<li>provide an overview of solutions available in the 5 layers (Sources, Processing, Storage, Analysis, Visualization)</li>
<li>offer you a way to discover new tools and</li>
<li>offer orientation in a very densely populated area</li>
</ul>
<p>So without further ado, here is my data stack overview <a href="http://bit.ly/data_stack">Click to open PDF</a>. Feel free to share it with your friends too.</p>
<figure><a href="http://bit.ly/data_stack"><img src="https://liip.rokka.io/www_inarticle/d702df/liip-data-stack.jpg" alt=""></a></figure>
<p>Liip data stack version 1.0</p>
<h2><a href="http://liip.to/data_stack">Click here to get notified by email when I release version 2.0 of the data stack.</a></h2>
<p>Let me lay out some of the questions that guided me in researching each area and throw in my two cents on each of them:</p>
<ul>
<li><strong>Data Sources:</strong>  Where does our data usually come from? For us, it's websites with sophisticated event tracking. But for some projects the data has to be scraped, comes from social media outlets or comes from <a href="https://blog.liip.ch/archive/2016/10/17/counting-people-stairs-particle-photon-node-js.html">IoT devices</a>.</li>
<li><strong>Data Processing:</strong>  How can we initially clean or transform the data? How and where can we store the logs that those events create? And where can we obtain additional valuable data?</li>
<li><strong>Database:</strong>  What options are out there to store the data? How can we search through it? How can we access big data sources efficiently?</li>
<li><strong>Analysis:</strong>  Which stats packages are available to analyze the data? Which frameworks are out there to do machine learning, deep learning, computer vision, natural language processing?</li>
<li><strong>Visualization, Dashboards, and Applications:</strong>  What happens with the results? What options do we have to visually communicate them? How do we turn those visualizations into dashboards or whole applications? Which additional ways of communicating with the user beside reports/emails are out there?</li>
<li><strong>Business Intelligence:</strong>  What solutions are out there that try to integrate data sourcing, data storage, analysis and visualization in one package? What BI solutions are out there for big data? Are there platforms/solutions that offer more of a flexible data-scientist approach?</li>
</ul>
<h3>My observations when compiling the list:</h3>
<h4>Data Sources</h4>
<ul>
<li>For scrapers, there are actually quite a lot of open source projects out there that work really well, probably because those are used mostly by developers.</li>
<li>While there are quite a few software-as-a-service solutions with slightly different focus, capturing website data in most cases is done via Google Analytics, although Piwik offers a nice on-premise alternative.</li>
<li>We have been <a href="https://blog.liip.ch/archive/2016/10/17/counting-people-stairs-particle-photon-node-js.html">experimenting quite a bit</a> with IoT devices and analytics, and it turns out that there are quite a few integrated data-collection and analysis software-as-a-service solutions out there, although you are always able to use your own (see later) solutions.</li>
<li>Social media data comes either from the platforms themselves via an API (which is probably the default for most projects), or from one of the convenient data providers out there that allow you to ingest social media data across all platforms.</li>
</ul>
<h4>Data Processing</h4>
<ul>
<li>While there are excellent open source logging services like Graylog or Logstash, it can sometimes save a lot of time to use the pricey SaaS solutions, because their vendors have solved all the quirks and tiny problems that open source solutions sometimes have.</li>
<li>While there are some quite old and mature open source solutions (e.g. RabbitMQ or Kafka) in the message queues or streams category, it turned out that there are a lot of new open source stream analytics solutions (Impala, Flink or Flume) on the market, and almost all of the big four (Microsoft, Google, Facebook, Amazon) offer their own approaches.</li>
<li>The data cleansing or transformation category is quite a mixed bag. While on one hand there are a number of very mature industry standard solutions (e.g. Talend), there are also alternatives for end users that allow them to simply clean their data without any programming knowledge (e.g. Trifacta or OpenRefine).</li>
</ul>
<h4>Databases</h4>
<ul>
<li>Databases: If, like me, you haven't followed the development in the database area closely, you might think that solutions fall either into the SQL (e.g. MySQL) or the NoSQL (e.g. MongoDB) bucket. But apparently a LOT has been going on here; probably among the most notable are the graph based databases (e.g. Neo4J) and the column oriented databases (e.g. Hana or MonetDB) that offer much better performance for BI tasks. There are also some recent, experimental, highly promising solutions like databases on the GPU (e.g. MapD) or ones that only sample the whole dataset (e.g. BlinkDB).</li>
<li>The distributed big data ecosystem: It is mostly populated by mature projects from the Apache foundation that integrate quite well into the Hadoop ecosystem. Worth mentioning are of course the distributed machine learning solutions for large scale processing like Spark or Mahout, which are really handy. There are also a lot of mature options like Cloudera or Hortonworks that offer out-of-the-box integrations.</li>
<li>In-Memory Databases or Search: Of course the first thing that comes to mind is elastic(search), which has proved over the years to be a reliable solution. Overall the area is populated by quite a lot of stable open source projects (e.g. Lucene or Solr), while on the other hand you can now directly tap into search as a service (e.g. AzureSearch or CloudSearch) from the major vendors. The most interesting projects I will try to follow are the fast in-memory database Exasol and its “competitor” VoltDB.</li>
</ul>
<h4>Analysis / ML Frameworks</h4>
<ul>
<li>Deep Learning Frameworks: Obviously, on one hand, you will find the low-level frameworks like Tensorflow, Torch, and Theano here. But on the other hand, there are also high-level alternatives that build upon those, like TFLearn (which has been integrated into Tensorflow now) or Keras, which allow you to make progress faster with less coding, but without being able to control all the details. Finally, there are also alternatives to hosting these solutions yourself, in services like the Google ML platform.</li>
<li>Statistics software packages: While a long time ago you could only choose from commercial solutions like SPSS, Matlab or SAS, nowadays there is really a myriad of open source solutions out there. Whole ecosystems have developed around languages like Python, R and Julia. But even without programming, you can analyze data quite efficiently with tools like Rapidminer, Orange or Rattle. For me, nothing beats the combination of pandas and an IPython notebook.</li>
<li>General ML libraries: I put the focus here mainly on the Python ecosystem, although the <a href="https://blog.liip.ch/archive/2015/10/08/machine-learning-on-google-analytics.html">other ones</a> are probably as diverse as this one. With scipy, numpy and scikit-learn we've got a one-stop shop for all your ML needs, but nowadays there are also libraries that take care of hyperparameter optimization (e.g. REP) or model selection (AutoML). So here, again, you can choose your level of immersion yourself.</li>
<li>Computer vision: While you will find a lot of open source libraries that rely on OpenCV, a myriad of awesome SaaS solutions (e.g. Google CV, Microsoft CV) from big vendors has popped up in the last few years. These will probably beat everything you might hastily build over a weekend, but they are going to cost you a bit. The deep learning movement has really made computer vision, object detection etc. accessible to anyone.</li>
<li>Natural language processing: Here I noticed a similar movement. We used NLP libraries to process social media data (e.g. <a href="https://blog.liip.ch/archive/2016/06/07/whats-your-twitter-mood.html">sentiment</a> analysis) and found that there are really great open source projects and libraries out there. While there are various options for text processing (e.g. natural for Node.js, nltk for Python, or CoreNLP from Stanford), it is deep learning and the SaaS products built upon it that have really made natural language processing available to anyone. I am very impressed with the results of these tools, although I doubt that we will come anywhere close in the next few years to computers really understanding us. After all, it's the holy grail of AI.</li>
</ul>
<h4>Dashboards / Visualization</h4>
<ul>
<li>Visualization: I was really surprised how many JS libraries are out there that allow you to do the fanciest data visualizations in the browser. I mean, it's great to have those solid libraries like ggplot or matplotlib, or the fancy ones like bokeh or seaborn, but if you want to communicate your results to the user in a periodic way, you will need to go through the mobile/browser. I guess we have to thank the strong D3 community for the great developments in this area, but there are also a lot of awesome SaaS and open source solutions that go way beyond just visualization, like Shiny for R or Redash, and feel more like a business intelligence solution.</li>
<li>Dashboards: I am personally a big fan of dashing.io because it is simply free and it's in Ruby, but plotly has really surprised me as a very useful tool to just create a dashboard without hassle. There is a myriad of SaaS solutions out there that I stumbled upon when researching this field, which I will have to try. I am not sure if they will all hold up to the shiny expectations that those websites sell.</li>
<li>Bot Frameworks: Although I think of bots or agents more as a way of interacting with a user, I have put them into the visualization area because they didn't fit in anywhere else. P-Brain.ai, Wit.ai or botpress turn out to be a really fast way to get started here when you just want to build a (Slack) bot. Given the hype around chatbots, however, I am not sure if they will be able to deliver the right results.</li>
</ul>
<h4>Business Intelligence</h4>
<ul>
<li>Business Intelligence: I thought I knew more or less all the alternatives that are out there. But having researched a bit, boy was I surprised to find how much is actually out there. Basically, every vendor of the big four has a very mature solution. Yet I found it really hard to distinguish between the different SaaS solutions, maybe because of the marketing talk, or maybe because they just all do the same thing. It's interesting to see how business intelligence solutions potentially offer the capabilities of the aforementioned data stack, but given the variety of solutions in each layer, I think more and more people will be tempted to pick and choose instead of buying an expensive all-in-one solution. There are, however, open source alternatives, of which some feel quite mature (e.g. Kibana or Metabase) while others are quite small but really useful (e.g. Blazer). Also, don't judge me too hard if I put Tableau in there; some may say it's just a visualization tool, others perceive it as a BI solution – I think the boundaries are really blurry in this terrain.</li>
<li>BI on Hadoop: I had to introduce this category because I discovered that a lot of solutions are particularly tailored to working on the Hadoop stack. It's great to see that there are options out there and I am eager to explore this terrain in the future.</li>
<li>Data Science Platforms: What I also noticed is that data scientists are becoming a target group of integrated business intelligence solutions or data science platforms. I had some experience with BigML and Snowplow before, but it turns out that there are a lot of different platforms popping up that might make your life much easier, for example when it comes to deploying your models (e.g. Yhat) or having a totally automated way of learning models (e.g. Datarobot). I am really excited to see what will pop up here in the future.</li>
</ul>
<p>What I realized is that this task of creating an overview of the different tools and solutions in the data-centric area will never be complete. Even while writing this blog post I had to add 14 more tools to the list. And I am aware that I might have missed some major tools out there, simply because it's hard to be unbiased when researching.</p>
<p>That is why I created a little email list that you can sign up to, and I will send you the updated version of this stack sometime this year. So sign up to stay up to date (I promise I will not spam you), and let me know in the comments about new solutions, how you would have segmented this field, or what your favorite tools are.</p>
<p><a href="http://liip.to/data_stack">Click here to get notified by email when I release version 2.0 of the data stack.</a></p>]]></description>
                  <enclosure url="http://liip.rokka.io/www_card_2/f8f743/book-stack-books-bookshop-264635.jpg" length="459561" type="image/jpeg" />
          </item>
    
  </channel>
</rss>
