Why do Big Data projects fail, and how can you make them succeed?

Big data, Data science

Big Data aims to automate the delivery of actionable business insights from data. To achieve this, you often end up wanting diverse data sources, large data sets and a vast amount of computational power. However, these are mostly symptoms of a particular approach, not prerequisites of the goal.

This often leads higher management to focus on the tools used by the competition, instead of on why the competition uses those tools in the first place and what steps are required to end up in the same league.

To emphasize this, I thought it would be instructive to go through the points from the 2013 post Why does SCRUM fail? and lazily substitute SCRUM with Big Data.

Read More

Genetic Algorithms in Practice

Data science, Deep learning, Evolutionary Computing

A Genetic Algorithm (GA) is part of a family of algorithms known as Evolutionary Computing. These algorithms are all inspired by biological evolution and are used for various optimization tasks. In this post we discuss what types of projects have used Genetic Algorithms and provide a step-by-step example.

The basics of a GA are as follows: candidate solutions to a problem are randomly initialized and together form the first population. Each individual (candidate solution) receives a fitness score based on how well it solves the problem. A selection step follows, in which the fittest individuals have a better chance of survival. Via crossover and mutation, new individuals are created and form the next population. These steps can be repeated numerous times, until an individual is found that meets the acceptance criteria.
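
In Python, this generic loop could be sketched as follows (a minimal sketch; init, fitness, select, crossover and mutate are placeholders for the problem-specific pieces, which we fill in for the knapsack problem below):

```python
def genetic_algorithm(init, fitness, select, crossover, mutate,
                      pop_size=100, generations=1000):
    """Generic GA loop: initialize a population, then repeatedly
    select parents, recombine and mutate to form new populations."""
    population = [init() for _ in range(pop_size)]
    for _ in range(generations):
        population = [
            mutate(crossover(select(population), select(population)))
            for _ in range(pop_size)
        ]
    return max(population, key=fitness)
```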

As an example, NASA created a space antenna using a GA, which had to meet a difficult set of requirements. They claimed: “The current practice of designing and optimizing antennas by hand is limited in its ability to develop new and better antenna designs because it requires significant domain expertise and is both time and labor intensive”. The individuals (random antennas) were represented as a list of operations, such as forward(length, radius) or rotate-x(angle). Both the operations and their parameters could be mutated. The fitness score was mostly a combination of wide beamwidth for a circularly polarized wave and wide impedance bandwidth.

To illustrate the workings of a Genetic Algorithm in depth, we use a simpler case than NASA’s antenna: the knapsack problem. Given a set of items, each with a certain weight and value, the total value of the items that fit in a fixed-size knapsack has to be maximized.

This optimization problem is NP-hard, meaning that no algorithm is known that finds the optimal solution in polynomial time. Genetic algorithms typically tackle such NP-hard problems by approximating the optimal solution.

The first step in creating a GA is representing the problem in a way that is suitable for applying genetic operators. For the knapsack problem this is rather easy: the knapsack is represented as a variable-length array of item identifiers. The fitness score is the summed value of the items in the knapsack. If a knapsack exceeds the carrying limit, its fitness score is 0.
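
As a sketch of this representation in Python (the item set and carrying limit below are made up for illustration):

```python
# Hypothetical problem instance: items as (weight, value) pairs.
ITEMS = [(12, 4), (2, 2), (1, 2), (1, 1), (4, 10),
         (3, 5), (7, 13), (5, 8), (9, 9), (6, 3)]
CAPACITY = 20  # carrying limit of the knapsack

def fitness(knapsack):
    """Summed value of the items; 0 if the carrying limit is exceeded."""
    weight = sum(ITEMS[i][0] for i in knapsack)
    value = sum(ITEMS[i][1] for i in knapsack)
    return value if weight <= CAPACITY else 0
```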

Step 2 is creating an initial population: N knapsacks are filled with random items until the weight limit is reached.
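
A possible initializer, reusing the hypothetical ITEMS and CAPACITY from the sketch above:

```python
import random

def random_knapsack():
    """Add random distinct items until no remaining item fits."""
    knapsack, weight = [], 0
    for i in random.sample(range(len(ITEMS)), len(ITEMS)):
        if weight + ITEMS[i][0] <= CAPACITY:
            knapsack.append(i)
            weight += ITEMS[i][0]
    return knapsack

population = [random_knapsack() for _ in range(100)]  # N = 100
```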

Step 3 is determining the fitness scores and making the transition to the next population. Two individuals are chosen pseudo-randomly (a higher fitness score means a higher selection probability). The first genetic operator applied is crossover, the analogue of giving birth: a new individual is formed from the combined genes of the parents. It is not obvious, however, which parts of the parents to use, since two constraints need to be satisfied: the knapsack cannot contain duplicate items and it should not exceed the weight limit. We say “should” rather than “cannot” because it is a design choice whether invalid individuals can be “falsely” born into the world; if they are, new random individuals take their place in the new population, just as in the first population. One way to implement this selection and crossover is sketched below (in this sketch we choose to enforce both constraints during crossover, so no individuals are “falsely” born):
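
```python
import random

def select(population):
    """Roulette-wheel selection: higher fitness, higher probability."""
    weights = [fitness(k) for k in population]
    if sum(weights) == 0:          # all individuals invalid: pick uniformly
        return random.choice(population)
    return random.choices(population, weights=weights, k=1)[0]

def crossover(parent_a, parent_b):
    """Refill a child from the parents' combined genes; the set union
    avoids duplicate items, the weight check keeps the child valid."""
    genes = list(set(parent_a) | set(parent_b))
    random.shuffle(genes)
    child, weight = [], 0
    for i in genes:
        if weight + ITEMS[i][0] <= CAPACITY:
            child.append(i)
            weight += ITEMS[i][0]
    return child
```
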
The second operator is mutation. Mutation can be broken down into three operators, of which one is randomly chosen to apply: adding a (valid) item, removing an item, or replacing an item.
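
These three mutation operators could be sketched as:

```python
import random

def mutate(knapsack):
    """Randomly add a (valid) item, remove an item, or replace an item."""
    knapsack = list(knapsack)
    op = random.choice(["add", "remove", "replace"])
    if op in ("remove", "replace") and knapsack:
        knapsack.pop(random.randrange(len(knapsack)))
    if op in ("add", "replace"):
        weight = sum(ITEMS[i][0] for i in knapsack)
        fits = [i for i in range(len(ITEMS))
                if i not in knapsack and weight + ITEMS[i][0] <= CAPACITY]
        if fits:
            knapsack.append(random.choice(fits))
    return knapsack
```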

After the creation of a new population, the steps repeat. For the knapsack problem a single iteration computes in only a few milliseconds, so the number of repetitions can be enormous, which in practice yields a very good approximation of the optimal solution.
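
With the sketches above in place, running the algorithm amounts to plugging the knapsack pieces into the generic loop from the start of this post:

```python
best = genetic_algorithm(init=random_knapsack, fitness=fitness,
                         select=select, crossover=crossover, mutate=mutate)
print(best, fitness(best))
```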

Recent developments in deep learning, and especially Convolutional Neural Networks (CNNs), have caused an increased interest in Evolutionary Computing. Google wrote a blog post showing their research in this field. The “architecture” of a CNN is the specification of its layers and their parameters. The search space of architectures is immense: a 10-layer network can have up to ~10^10 variations. Many researchers study new types of architectures and the optimization of architectures, but no architecture is known to be optimal. Evolutionary Computing can optimize these architectures by exploring the number, types and order of the layers. The parameters of the layers and the global parameters of the architecture, such as the learning rate and learning function, can also be optimized. One issue is the computing power required to optimize these architectures: the fitness score of an architecture is determined by training and evaluating it on a certain task, which for a typical GA has to be repeated thousands or millions of times.
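
As a toy illustration of what such an encoding could look like (the layer types and parameter ranges below are invented for the example; the expensive part, training each candidate network to obtain its fitness, is omitted):

```python
import random

# Hypothetical search space: each layer is a (type, parameters) pair.
LAYER_CHOICES = {
    "conv":  {"filters": [16, 32, 64], "kernel": [3, 5]},
    "pool":  {"size": [2, 3]},
    "dense": {"units": [64, 128, 256]},
}

def random_layer():
    kind = random.choice(list(LAYER_CHOICES))
    params = {k: random.choice(v) for k, v in LAYER_CHOICES[kind].items()}
    return (kind, params)

def mutate_architecture(arch):
    """Add, remove or re-parameterize a layer; the number, types and
    order of the layers are all subject to evolution this way."""
    arch = list(arch)
    op = random.choice(["add", "remove", "change"])
    if op == "add" or not arch:
        arch.insert(random.randrange(len(arch) + 1), random_layer())
    elif op == "remove":
        arch.pop(random.randrange(len(arch)))
    else:
        arch[random.randrange(len(arch))] = random_layer()
    return arch
```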

To summarize: Evolutionary Computing uses techniques derived from natural processes to optimize problems that have no known optimal solution. It has been applied in a variety of projects and has recently seen a resurgence in deep learning research, of which we will definitely see more in the upcoming years.

Students from Radboud University used an Evolutionary Algorithm to form a target image using small pieces of a base image. A live demo can be found here!

BigData Republic provides these types of Big Data solutions. We are experienced in advanced predictive modeling and in deploying large-scale big data pipelines. Interested in what we can do for you? Feel free to contact us.

Key takeaways from the Scala Days keynote

Data engineering

There’s no Scala Days conference without a keynote by Martin Odersky himself. This year he spoke about his current work: Dotty. Dotty is the new Scala compiler that will be part of Scala 3. The first release candidate was released just hours before the keynote and comes with the compiler itself (dotc), a REPL (doti), a doc tool (dotd) and an IDE. It implements Microsoft’s Language Server Protocol, enabling it to serve several front ends: VS Code and Emacs (IntelliJ support is in the works). With Dotty, IDEs can use the regular compiler as the presentation compiler.

Read More