RE-WORK Deep Learning Summit London

On the 21st and 22nd of September we attended the RE-WORK Deep Learning Summit in London. The event focused on the development of deep learning techniques and on business applications built on deep learning. Facebook and Amazon presented their Natural Language Processing research using neural networks, a recurring theme throughout the summit. Bayesian Neural Networks and generative models were the other main topics discussed at the summit.

Flink Forward Berlin 2017 – An overview

A small delegation of BigData Republic consultants went to the Flink Forward conference in Berlin. One of the key statements that resonated at this conference was: “ING is an IT company with a banking license.” In a world where data is increasingly important to stay ahead of the competition, excellent IT infrastructure and data-driven software are key to realizing a data-driven vision. This realization was also reflected in the statement that Flink jobs should be viewed as applications in their own right, delivering business value. Most talks were more technical and did not always point clearly to the actual business value delivered. In this blog we give a brief overview of what we view as the highlights of this conference.

Machine learning for predictive maintenance: where to start?

Think about all the machines you use during a year, all of them, from a toaster every morning to an airplane every summer holiday. Now imagine that, from now on, one of them fails every day. What impact would that have? The truth is that we are surrounded by machines that make our lives easier, but we also become more and more dependent on them. Therefore, the quality of a machine is based not only on how useful and efficient it is, but also on how reliable it is. And together with reliability comes maintenance.

Why do Big Data projects fail and how can you make them succeed?

Big Data aims to automate the delivery of actionable business insights from data. To achieve this, you often end up wanting diverse data sources, large data sets and a vast amount of computational power. However, these are mostly symptoms of an approach, not prerequisites for the goal.

This often leads to higher management focusing on the tools used by the competition, instead of on why the competition is using those tools in the first place and what steps are required to end up in the same league.

To emphasize this, I thought it would be easy to go through the points from the 2013 post Why does SCRUM fail? and lazily substitute SCRUM with Big Data to highlight my point.

Key takeaways from the Scala Days keynote

There’s no Scala Days conference without a keynote by Martin Odersky himself. This year he spoke about his current work: Dotty, the new Scala compiler that will be part of Scala 3. The first release candidate was published just hours before the keynote and comes with the compiler itself (dotc), a REPL (doti), a doc tool (dotd) and an IDE. It implements the Microsoft Language Server Protocol, enabling it to serve several front ends: VS Code and Emacs (IntelliJ support is in the works). With Dotty, IDEs can use the regular compiler as the presentation compiler.

How to obtain advanced probabilistic predictions for your data science use case

Many data science use cases involve predicting a continuous quantity. For instance, a grid operator might want to predict the energy consumption of a group of households for next week. To deliver these predictions, the Big Data Scientist applies machine learning algorithms to a large collection of features, such as family size, weather forecasts, property value and last week's consumption levels. There are many use cases of this type, for example predicting sales numbers, hotel rooms booked, money transfers or the time-to-failure of critical components. But what number do we actually want our algorithm to output?
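
To make the question concrete, here is a minimal sketch of the difference between a single point estimate and a probabilistic answer. It uses scikit-learn's GradientBoostingRegressor with its built-in quantile loss on made-up data; the data, the model choice and the 10%/90% quantile levels are illustrative assumptions, not the setup from the full post.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Made-up data standing in for, say, household energy consumption.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

# A point prediction: the default squared-error loss targets the conditional mean.
mean_model = GradientBoostingRegressor().fit(X, y)

# A probabilistic answer: the 10% and 90% conditional quantiles together
# bound an 80% prediction interval instead of a single number.
lo_model = GradientBoostingRegressor(loss="quantile", alpha=0.1).fit(X, y)
hi_model = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(X, y)

X_new = np.array([[5.0]])
print("point estimate:", mean_model.predict(X_new))
print("80% interval: ", lo_model.predict(X_new), "to", hi_model.predict(X_new))
```

For the grid operator, the interval, not the point estimate, is what says how far next week's consumption could plausibly deviate.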

Peeking into the Big Data future: Lessons learned from the DataWorks Summit in Munich

For about a year I have been fully immersed in everything regarding Big Data, working with various tools and techniques and throwing a bit of data science into the mix. I realized there is a high entry barrier for organizations that want to start turning their (dormant) data into something useful. With this knowledge, I wanted to look at how some of the leaders and early adopters in Big Data are tackling these barriers, and whether they will become easier (or harder) to handle in the future. Luckily, BigData Republic gave me and two colleagues the opportunity to visit the DataWorks Summit in Munich this April, providing some inside information.

Data science platforms in the cloud: from POC to production

The number of Dutch organizations proactively engaged in data science is growing rapidly. Large organizations can afford dedicated data science teams, which develop models and applications on on-premise infrastructure or via cloud providers. Medium-sized organizations, however, usually start by setting up data science activities within the existing business intelligence department. External consultancy firms can be brought in at this stage to work out data science use cases with the business into Proof-of-Concepts (POCs), making the value and opportunities of data-driven working visible to the business. A quickly deployed cloud environment, for example on Amazon Web Services or Microsoft Azure, is a good basis for this. The ease with which large storage capacity and computing power can be deployed without upfront investment lends itself perfectly to this kind of project. There are, however, a number of pitfalls: after such a project, the step towards a more professional data science environment usually does not get priority, with the result that the POC environment gradually becomes the 'standard environment' for all data science activities within the organization. This not only creates risks for security, efficiency and maintainability, but also leads to unnecessarily high costs.

In this blog we cover the two most important points of attention when professionalizing a data science environment in the cloud: security and costs. We point out where the risks lie when a quickly assembled POC environment unintentionally or unknowingly takes on a life of its own, and we outline solutions that BigData Republic implements in practice at its clients.

Regression prediction intervals with XGBoost

Knowledge of the uncertainty in an algorithm's predictions is paramount for anyone who wishes to do serious predictive analytics for their business. Predictions are never absolute, and it is imperative to know the potential variation. If you wish to know the passenger volume for each flight, you also need to know by how many passengers the prediction may be off. Someone else might decide to predict disembarking times. There is, of course, a difference between a prediction on a scale of a few hours that has a 95% chance of being correct to within half an hour, and one with a potential error of 10 hours!

Here, I present a customized cost function for applying the well-known XGBoost regressor to quantile regression. XGBoost, or Extreme Gradient Boosting, is a very successful and powerful tree-based algorithm. Because of the nature of the gradient and Hessian of the quantile regression cost function, XGBoost is known to heavily underperform on it. I show that by adding a randomized component to a smoothed gradient, quantile regression can be applied successfully. I show that this method can outperform the GradientBoostingRegressor algorithm from the popular scikit-learn package.
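
The full derivation and benchmark are in the post itself; the sketch below only illustrates how such a customized objective plugs into xgboost's custom-objective hook. It assumes a linear smoothing of the pinball-loss gradient around its kink plus small Gaussian noise; the smoothing width `delta`, the noise scale and the constant Hessian values are my illustrative choices, not necessarily those used in the post.

```python
import numpy as np
import xgboost as xgb

def make_quantile_obj(q, delta=1.0, noise=1e-2, seed=0):
    """Build a custom objective for the q-th quantile (pinball loss).

    The raw pinball loss has a piecewise-constant gradient and a zero
    Hessian, so Newton-style boosting stalls. This illustrative variant
    linearly smooths the gradient within `delta` of the kink, adds a
    small random perturbation, and uses a matching constant Hessian.
    """
    rng = np.random.RandomState(seed)

    def obj(preds, dtrain):
        e = dtrain.get_label() - preds                    # residuals y - y_hat
        grad = np.where(e > 0, -q, 1.0 - q)               # raw pinball gradient
        hess = np.full_like(grad, 1e-6)                   # ~zero away from the kink
        inside = np.abs(e) < delta
        grad[inside] = 0.5 - q - e[inside] / (2 * delta)  # linear smoothing
        hess[inside] = 1.0 / (2 * delta)
        grad += noise * rng.normal(size=grad.shape)       # randomized component
        return grad, hess

    return obj

# Toy data; in practice use your own features and labels.
rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=1000)
dtrain = xgb.DMatrix(X, label=y)

# Train a model for the 90th percentile.
booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                    num_boost_round=200, obj=make_quantile_obj(q=0.9))
print(booster.predict(xgb.DMatrix(X[:5])))
```

The key point is that the smoothing gives the booster usable curvature where the raw loss has none, while the randomized component keeps the updates from getting stuck on the piecewise-constant gradient.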
