On the 21st and 22nd of September we attended the RE-WORK Deep Learning Summit in London. This event was focused on the development of deep learning techniques and business applications based on deep learning. Facebook and Amazon presented their Natural Language Processing research using neural networks, which was a recurring theme at the summit. Bayesian Neural Networks and generative models were the other main topics discussed. Read More
A small delegation of BigData Republic consultants went to the Flink Forward conference in Berlin. One of the key statements at this conference that resonated was: “ING is an IT company with a banking license.” In a world where data is increasingly important for staying ahead of the competition, excellent IT infrastructure and data-driven software are key to realizing a data-driven vision. This realization was also clear in the statement that Flink jobs should be viewed as applications in their own right, delivering business value. Most talks were more technical and did not always have a clear pointer to the actual business value delivered. In this blog we give a brief overview of what we view as the highlights of this conference. Read More
Think about all the machines you use during a year, all of them, from a toaster every morning to an airplane every summer holiday. Now imagine that, from now on, one of them would fail every day. What impact would that have? The truth is that we are surrounded by machines that make our lives easier, but we also become more and more dependent on them. Therefore, the quality of a machine is based not only on how useful and efficient it is, but also on how reliable it is. And together with reliability comes maintenance. Read More
Big Data aims to automate the delivery of actionable business insights from data. To do this, you often end up wanting diverse data sources, large data sets, and vast computational power. However, most of these are symptoms of an approach, not prerequisites for the goal.
This often leads to higher management focusing on the tools used by the competition, instead of on why the competition is using those tools in the first place and what steps are required to end up in the same league.
To emphasize this, I thought it would be instructive to go through the points from the 2013 post Why does SCRUM fail? and lazily replace “SCRUM” with “Big Data”.
A Genetic Algorithm (GA) belongs to a family of algorithms known as Evolutionary Computing. These algorithms are all inspired by biological evolution and are used for various optimization problems. In this post we discuss what types of projects use Genetic Algorithms and provide a step-by-step example.
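To make the step-by-step idea concrete, here is a minimal sketch of a GA applied to the classic “OneMax” toy problem (maximize the number of 1-bits in a bitstring). The problem, parameters, and function names are illustrative, not taken from the post:

```python
import random

random.seed(42)  # for reproducibility

GENOME_LEN = 20    # bits per individual
POP_SIZE = 30      # individuals per generation
GENERATIONS = 40
MUTATION_RATE = 0.02

def fitness(genome):
    # OneMax fitness: count the 1-bits; higher is better.
    return sum(genome)

def tournament(pop, k=3):
    # Tournament selection: the fittest of k randomly drawn individuals.
    return max(random.sample(pop, k), key=fitness)

def crossover(a, b):
    # Single-point crossover of two parent genomes.
    point = random.randint(1, GENOME_LEN - 1)
    return a[:point] + b[point:]

def mutate(genome):
    # Flip each bit independently with a small probability.
    return [1 - bit if random.random() < MUTATION_RATE else bit
            for bit in genome]

# Initialize a random population, then evolve it generation by generation.
population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(tournament(population),
                                   tournament(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print(fitness(best))  # close to GENOME_LEN after convergence
```

The same selection–crossover–mutation loop carries over to real problems; only the genome encoding and the fitness function change.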
There’s no Scala Days conference without a keynote by Martin Odersky himself. This year he spoke about his current work: Dotty, the new Scala compiler that will be part of Scala 3. The first release candidate was released just hours before the keynote and comes with the compiler itself (dotc), a REPL (doti), a doc tool (dotd) and an IDE. It implements Microsoft’s Language Server Protocol, enabling it to serve several front ends: VS Code and Emacs (IntelliJ support is in the works). With Dotty, IDEs can use the regular compiler as the presentation compiler.
For about a year I have been fully submerged in everything regarding Big Data: working with various tools and techniques and throwing a bit of data science into the mix. I realized there is a high entry barrier for organizations that want to start turning their (dormant) data into something useful. With this in mind, I wanted to look at how some of the leaders and early adopters in Big Data are tackling these barriers, and whether the barriers will become easier (or harder) to handle in the future. Luckily, BigData Republic gave two colleagues and me the opportunity to visit the DataWorks Summit in Munich this April, providing some inside information.
The number of Dutch organizations proactively engaged in data science is growing enormously. Large organizations can afford dedicated data science teams that develop models and applications on on-premise infrastructure or via cloud providers. Medium-sized organizations, however, usually start by setting up data science activities within the existing business intelligence department. External consultancy firms can be brought in at this stage to work with the business on turning data science use cases into Proof-of-Concepts (POCs), making the value and opportunities of data-driven working visible to the business. A quickly provisioned cloud environment, for example on Amazon Web Services or Microsoft Azure, is a good basis for this. The ease with which large storage capacity and computing power can be deployed without upfront investment lends itself perfectly to this kind of project. There are, however, a number of pitfalls: after these projects end, the step towards a more professional data science environment usually does not get priority, with the result that the POC environment gradually becomes the ‘standard environment’ for all data science activities within the organization. This not only creates risks for security, efficiency, and maintainability, but also leads to unnecessarily high costs.
In this blog we cover the two most important points of attention when professionalizing a data science environment in the cloud: security and costs. We indicate where the risks lie when a quickly assembled POC environment unintentionally or unknowingly takes on a life of its own, and outline solutions that BigData Republic implements in practice at its clients.
Knowledge of the uncertainty in an algorithm’s predictions is paramount for anyone doing serious predictive analytics for their business. Predictions are never absolute, and it is imperative to know how much they can vary. If you want to know the passenger volume for each flight, you also need to know by how many passengers the prediction may be off. Or suppose you predict disembarking times: there is, of course, a big difference between a prediction that has a 95% chance of being within half an hour of the actual value, and one with a potential error of 10 hours!
Here, I present a customized cost function for applying the well-known xgboost regressor to quantile regression. Xgboost, or Extreme Gradient Boosting, is a very successful and powerful tree-based algorithm. Because of the nature of the gradient and Hessian of the quantile regression cost function, xgboost is known to heavily underperform on it. I show that by adding a randomized component to a smoothed gradient, quantile regression can be applied successfully, and that this method can outperform the GradientBoostingRegressor algorithm from the popular scikit-learn package.