Data Science Central RSS Feed

Digital Strategy Series Part I: Creating a Data Strategy that Delivers Value

Oh, the strategy pundits hate me! It’s not because I’m tall, good looking and from Iowa (well, 2 out of 3 ain’t bad), it’s because I think Strategy as a “Discipline” is way overblown. I won’t go as far as the Harvard Business Review to state that “Strategy is Dead”, but the days of carefully defining a strategy (typically done in the ivory towers of the puzzle palace) and then commanding all the little soldiers to follow the strategy script are over!

Former World Heavyweight Boxing Champion Mike Tyson famously said that everyone has a plan, until they get punched in the mouth. And I think the traditional strategy function finds itself in the same predicament – everyone has a strategy, until they get punched in the mouth.

The Internet and Globalization have mitigated the economic, operational and cultural impediments traditionally associated with time and distance. We are an intertwined global economy, and now we realize (the hard way) that when someone sneezes in some part of the world, everyone everywhere gets sick. We are constantly getting punched in the mouth, and while we may not be sure from whence that punch might come next (pandemic, economic crisis, financial meltdown, climate change, catastrophic storms), trust me when I say that in a continuously transforming and evolving world, there are more punches coming our way.

So, my next two blogs are going to discuss: How does one develop and adapt data and AI strategies in a world of continuous change and transformation? It’s not that strategy is dead (though at times Strategy does look like an episode of the “Walking Dead”); it’s that strategy – like every other part of the organization and the world – needs to operate in an environment of continuous change and transformation.

Let’s dive into the subject, and let’s get those strategy pundits all wound up!

Organizations spend hundreds of millions of dollars acquiring data as they deploy operational systems such as ERP, CRM, SCM, SFA, BFA, eCommerce, social media, mobile and now IoT. Then they spend even more outrageous sums of money to maintain all of that data, whose most immediate benefit is regulatory, compliance and management reporting. No wonder CIOs have an almost singular mandate to reduce those data management costs (hello, cloud). Data is a cost to be minimized when the only “value” one gets from that data is regulatory, compliance and management risk reduction.

Organizations have a tragic mismatch in their investments in acquiring and managing data versus their investments in monetizing data (see Figure 1).


Figure 1: Mismatched Data Acquisition versus Data Monetization Strategies

As my followers know, I preach almost non-stop about the unique economic characteristics of data – data never wears out, never depletes and can be used across an unlimited number of use cases at near zero marginal cost. But if data possesses this outsized economic potential, why is there a tragic mismatch in organizations’ investments in acquiring and managing data versus their investments in monetizing data?

As in most organizational transformations, step 1 starts by reframing the conversation.

Many moons ago, I stated that “Organizations do not need a Big Data strategy; they need a business strategy that incorporates Big Data” (hence my nickname the Dean of Big Data).


Figure 2: Big Data more an Economics than Technology Conversation

To quote the Harvard Business Review (and they can’t possibly be wrong): “The problem is that, in many cases, big data is not used well. Companies are better at collecting data – about their customers, about their products, about competitors – than analyzing that data and designing strategy around it.[1]”

Too many organizations are making Big Data, and now IOT, an IT project. Instead, think of the mastery of big data and IOT as a strategic business capability that enables organizations to exploit the power of data with advanced analytics to uncover new sources of customer, product and operational value that can power the organization’s business and operational models (see Figure 3).


Figure 3: “5 Steps to Building a Big Data Business Strategy”

To exploit the unique economic value of data, organizations need a Business Strategy that uses advanced analytics to interrogate/torture the data to uncover detailed customer, product, service and operational insights that can be used to optimize key operational processes, mitigate compliance and cyber-security risks, uncover new revenue opportunities and create a more compelling, more differentiated customer experience.

But exactly how does one accomplish this?

By focusing on becoming value-driven, not data-driven.

The value of data isn’t having it (data-driven). The value of data is using it to derive and drive new sources of “wealth” (value-driven). To exploit the economic potential of data, Executives must transition from a data-driven mindset to a value-driven one that focuses on exploiting data to uncover new sources of customer, product and operational value (see Figure 4).


Figure 4: Value Engineering Framework

The heart of the Value Engineering Framework in Figure 4 is the collaboration between the different stakeholders to identify, validate, value and prioritize the Decisions (Use Cases) that support the organization’s key business initiatives. It is these Use Cases that help you determine which data is most relevant (because not all data is of equal value) – to differentiate the signal from the noise buried in the data.

Data may be the new oil and one of the most valuable resources in the world, but it is the analytic insights buried in the data that will determine the winners and losers in the 21st century.

If you don’t care if your data lake turns into a data swamp, then just go ahead and toss your data into your unmanageable gaggle of data repositories. Mission Accomplished! BUT if you seek to exploit the unique characteristics of data – assets that never deplete, never wear out and can be used across an infinite number of use cases at zero marginal cost – then transform your data lake into a “collaborative value creation” platform that supports the capture, refinement and re-use of your data assets across the organization (see Figure 5).


Figure 5: “Unlearn to Unleash Your Data Lake”

From the “Disposable Technology: A Concept Whose Time Has Come” blog about modern digital companies, we learned two important Data Lake lessons:

  • Lesson #1: Use the organization’s key use cases to drive organizational alignment on identifying, capturing and operationalizing new sources of customer, product and operational value buried in the data.
  • Lesson #2: Don’t implement a rigid technology architecture that interferes with Lesson #1.

Modern digital companies realize that they are in the data monetization business, and the data lake is the platform for driving that data monetization.

In summary:

  • While most organizations have a data strategy, there is a tragic misalignment between the investments made in data acquisition and storage versus the investments made in data monetization.
  • Data may be the new oil and one of the most valuable resources in the world, but it is the analytic insights buried in the data that will determine the winners and losers in the 21st century.
  • Organizations need to embrace a value engineering approach to close the data strategy versus data monetization realization gap.
  • How do you differentiate signal from noise within the data? The business and operational use cases will tell what in the data is important, and what is not.
  • If you focus on leveraging data to become value-driven, then the role of your data lake to power your business and operational initiatives will change FOR-EV-ER.

[1] “Companies Love Big Data but Lack the Strategy to Use It Effectively,” Harvard Business Review


Differential ML on TensorFlow and Colab

Brian Huge and I just posted a working paper following six months of research and development on function approximation by artificial intelligence (AI) in Danske Bank. One major finding was that training machine learning (ML) models for regression (i.e. prediction of values, not classes) may be massively improved when the gradients of training labels wrt training inputs are available. Given those differential labels, we can write simple, yet unreasonably effective training algorithms, capable of learning accurate function approximations with remarkable speed and accuracy from small datasets, in a stable manner, without the need of additional regularization or optimization of hyperparameters, e.g. by cross-validation.

In this post, we briefly summarize these algorithms under the name differential machine learning, highlighting the main intuitions and benefits and commenting on the TensorFlow implementation code. All the details are found in the working paper, the online appendices and the Colab notebooks.

In the context of financial Derivatives pricing approximation, training sets are simulated with Monte-Carlo models. Each training example is simulated on one Monte-Carlo path, where the label is the final payoff of a transaction and the input is the initial state vector of the market. Differential labels are the pathwise gradients of the payoff wrt the state and are efficiently computed with Automatic Adjoint Differentiation (AAD). For this reason, differential machine learning is particularly effective in finance, although it is also applicable in all other situations where high-quality first-order derivatives wrt training inputs are available.

Models are trained on augmented datasets of not only inputs and labels but also differentials, by minimization of the combined cost of prediction errors on values and derivatives.


The value and derivative labels are given. We compute predicted values by inference, as customary, and predicted derivatives by backpropagation. Although the methodology applies to architectures of arbitrary complexity, we discuss it here in the context of vanilla feedforward networks in the interest of simplicity.
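In symbols, the setup described above can be sketched as follows. This is a reconstruction in generic notation, not the formulas from the paper: the weight \lambda balancing the two error terms and the symbols m, w and \hat{y} are assumptions introduced here for illustration.

```latex
% Augmented training set of m examples: inputs, labels and differential labels
\left\{ \left( x^{(i)},\; y^{(i)},\; \frac{\partial y^{(i)}}{\partial x^{(i)}} \right) \right\}_{i=1}^{m}

% Combined cost on values and derivatives (\lambda is an assumed weight)
C(w) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}(x^{(i)}; w) - y^{(i)} \right)^2
     + \lambda \, \frac{1}{m} \sum_{i=1}^{m}
       \left\| \frac{\partial \hat{y}}{\partial x}(x^{(i)}; w)
             - \frac{\partial y^{(i)}}{\partial x^{(i)}} \right\|^2
```

Setting \lambda = 0 recovers the conventional mean squared error on values alone.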

Recall vanilla feedforward equations:


where the notations are standard and specified in the paper (index 3 is for consistency with the paper).
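For reference, one common way of writing the equations of a vanilla feedforward network with L layers is sketched below. The symbols (z_l for pre-activations, g_l for activations, w_l and b_l for weights and biases, g_0 taken as the identity) are assumptions and may not match the paper's notation exactly.

```latex
z_0 = x, \qquad
z_l = g_{l-1}(z_{l-1}) \, w_l + b_l \quad (l = 1, \dots, L), \qquad
y = z_L
```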

All the code in this post is extracted from the demonstration notebook, which also includes comments and practical implementation details.


Below is a TensorFlow (1.x) implementation of the feedforward equations. We chose to write matrix operations explicitly in place of high-level Keras layers to highlight equations in code. We chose softplus activation; ELU is another alternative. For reasons explained in the paper, the activation must be continuously differentiable, ruling out e.g. ReLU and SELU.


Derivatives of output wrt inputs are predicted with backpropagation. Recall that backpropagation equations are derived as adjoints of the feedforward equations (see our tutorial for a refresher). In code, the key fact is that the derivative of softplus is sigmoid.


Once again, we wrote backpropagation equations explicitly in place of a call to tf.gradients(). We chose to do it this way, first, to highlight equations in code again, and also, to avoid nesting layers of backpropagation during training, as seen next. For the avoidance of doubt, replacing this code by one call to tf.gradients() works too.

Next, we combine feedforward and backpropagation in one network, which we call a twin network: a neural network of twice the depth, capable of simultaneously predicting values and derivatives for twice the computation cost.
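The idea of a twin network can be sketched in a few lines of plain NumPy. This is not the TensorFlow 1.x code from the notebook: the function name `twin_network`, the layer sizes and the random weights are hypothetical, chosen only to show a feedforward pass followed by an explicit backpropagation pass that returns derivatives of the output wrt the inputs.

```python
import numpy as np


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)


def sigmoid(z):
    # Derivative of softplus.
    return 1.0 / (1.0 + np.exp(-z))


def twin_network(x, weights, biases):
    """Return predicted values y (m, 1) and derivatives dy/dx (m, n)."""
    # Feedforward half: z_1 = x w_1 + b_1, then z_l = softplus(z_{l-1}) w_l + b_l.
    zs = [x]
    z = x @ weights[0] + biases[0]
    zs.append(z)
    for w, b in zip(weights[1:], biases[1:]):
        z = softplus(z) @ w + b
        zs.append(z)
    y = z

    # Backpropagation half: start from dy/dy = 1 and apply the adjoint of
    # each feedforward step, from the output layer back to the inputs.
    zbar = np.ones_like(y)
    for layer in range(len(weights) - 1, 0, -1):
        zbar = (zbar @ weights[layer].T) * sigmoid(zs[layer])
    dydx = zbar @ weights[0].T
    return y, dydx


# Hypothetical example: 2 inputs, hidden layers of 4 and 3 units, 1 output.
rng = np.random.default_rng(0)
sizes = [2, 4, 3, 1]
weights = [rng.normal(size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=(1, b)) for b in sizes[1:]]
x = rng.normal(size=(5, 2))
y, dydx = twin_network(x, weights, biases)
```

A quick sanity check for a sketch like this is to compare `dydx` against central finite differences of `y`; the two should agree to several decimal places.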



The twin network is beneficial in two ways. After training, it efficiently predicts values and derivatives given inputs in applications where derivatives predictions are desirable. In finance, for example, they are sensitivities of prices to market state variables, also called Greeks (because traders give them Greek letters), and also correspond to hedge ratios.

The twin network is also a fundamental construct for differential training. The combined cost function is computed by inference through the twin network, predicting values and derivatives. The gradients of the cost function are computed by backpropagation through the twin network, including the backpropagation part, silently conducted by TensorFlow as part of its optimization loop. Recall the standard training loop for neural networks:


The differential training loop is virtually identical, save for the definition of the cost function, now combining mean squared errors on values and derivatives.
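As a minimal sketch of such a combined cost, assuming predictions and labels are already available as arrays, one could write the following. The function name `differential_cost` and the weight `lam` are assumptions made for illustration; the notebook may scale or normalize the two terms differently.

```python
import numpy as np


def differential_cost(y_pred, y_true, dydx_pred, dydx_true, lam=1.0):
    """Mean squared error on values plus lam times MSE on derivatives."""
    value_mse = np.mean((y_pred - y_true) ** 2)
    derivative_mse = np.mean((dydx_pred - dydx_true) ** 2)
    return value_mse + lam * derivative_mse


# Tiny hand-made example: two samples with two inputs each.
y_pred = np.array([[1.0], [2.0]])
y_true = np.array([[0.0], [2.0]])
dydx_pred = np.array([[1.0, 0.0], [0.0, 1.0]])
dydx_true = np.zeros((2, 2))
cost = differential_cost(y_pred, y_true, dydx_pred, dydx_true, lam=2.0)
```

With `lam=0.0` the function reduces to the standard cost of the conventional training loop.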


TensorFlow differentiates the twin network seamlessly behind the scenes for the needs of optimization. It doesn’t matter that part of the network is itself a backpropagation. This is just another sequence of matrix operations, which TensorFlow differentiates without difficulty.

The rest of the notebook deals with standard data preparation, training and testing and the application to a couple of textbook datasets in finance: European calls in Black & Scholes, and basket options in correlated Bachelier. The results demonstrate the unreasonable effectiveness of differential deep learning.


In the online appendices, we explored applications of differential machine learning to other kinds of ML models, like basis function regression and principal component analysis (PCA), with equally remarkable results.

Differential training imposes a penalty on incorrect derivatives in the same way that conventional regularization like ridge/Tikhonov favours small weights. Contrary to conventional regularization, differential ML effectively mitigates overfitting without introducing bias. Hence, there is no bias-variance tradeoff or necessity to tweak hyperparameters by cross-validation. It just works.

Differential machine learning is more similar to data augmentation, which in turn may be seen as a better form of regularization. Data augmentation is consistently applied e.g. in computer vision with documented success. The idea is to produce multiple labelled images from a single one, e.g. by cropping, zooming, rotation or recolouring. In addition to extending the training set for negligible cost, data augmentation teaches the ML model important invariances. Similarly, derivative labels not only increase the amount of information in the training set for a very small cost (as long as they are computed with AAD) but also teach ML models the shape of pricing functions.


Antoine Savine



Bayesian Machine Learning (Part 8)


Mean Field Approximation

Have you ever asked why we need to calculate the exact posterior distribution?

To answer this question, let us go –

Back to the Basics!

Let us revisit the basic Bayes’ rule.

Posterior = (Likelihood x Prior) / Evidence

Now, to compute the posterior, we face the following issues:

  • If the prior is conjugate to the likelihood, it is easy to compute the posterior; otherwise it is not
  • It is really very hard to compute the evidence

So, what if we try to approximate our posterior?

Will it impact our results? Let us check with an example.

Let us suppose we have a probabilistic distribution as shown in the figure:


The computation of the exact posterior of the above distribution is very difficult.

Now suppose we try to approximate the above distribution with a Gaussian distribution which looks as follows:


From the above figure, we can say that the error is very small, as the Gaussian distribution does the job well. Also, in machine learning problems, we do not really need an exact value of the posterior as long as we can get relative values for decision making.

From the above example, we saw that if we have a good approximate distribution for the posterior, we are good to go!

The next question is how to compute this approximate posterior distribution.

Computing the Approximate Distribution

To compute the approximate distribution, we consider a family of distributions and then try to minimize the KL divergence between the approximate distribution and the actual posterior.

Mathematically speaking,


KL Divergence :


Expanding the above equation, we get :



Since q(z) is a probability distribution (it integrates to 1) and log Z is a constant for given data, the log Z term plays no role when we minimize the KL divergence between the approximate posterior distribution and the actual distribution.
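The steps above can be written out as follows. This is a standard reconstruction, not the original figures: Q denotes the chosen family, p^{*}(z) the exact posterior, \hat{p}(z) its unnormalized version (likelihood times prior) and Z the evidence, all of which are notational assumptions.

```latex
% Objective: the member of the family Q closest to the posterior
q^{*} = \arg\min_{q \in Q} \; KL\left( q(z) \,\|\, p^{*}(z) \right)

% KL divergence
KL\left( q \,\|\, p^{*} \right) = \int q(z) \log \frac{q(z)}{p^{*}(z)} \, dz

% Expanding with p^{*}(z) = \hat{p}(z) / Z
KL\left( q \,\|\, p^{*} \right)
  = \int q(z) \log \frac{q(z)}{\hat{p}(z)} \, dz + \log Z \int q(z) \, dz
  = \int q(z) \log \frac{q(z)}{\hat{p}(z)} \, dz + \log Z
```

The last equality uses \int q(z) \, dz = 1, so the evidence only shifts the objective by a constant and never has to be computed.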

So, now we will understand a methodology to do this approximation.
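Before turning to that methodology, the idea of picking the closest member of a family can be sketched numerically. In the hypothetical example below, the target is deliberately an unnormalized Gaussian (so the right answer is known in advance), the family is a small grid of Gaussians, and the KL objective is computed against the unnormalized target only. The function names, grid values and integration range are all assumptions made for illustration.

```python
import math


def log_unnormalised_target(z):
    # Hypothetical target: 3 * exp(-(z - 1)^2 / (2 * 0.5^2)). The factor 3
    # stands in for an unknown normalizing constant.
    return math.log(3.0) - (z - 1.0) ** 2 / (2.0 * 0.5 ** 2)


def log_gaussian(z, mean, std):
    return -((z - mean) ** 2) / (2.0 * std ** 2) - math.log(std * math.sqrt(2.0 * math.pi))


def kl_to_unnormalised(mean, std, lo=-5.0, hi=7.0, steps=2400):
    """Midpoint-rule estimate of the integral of q * (log q - log p_hat)."""
    dz = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * dz
        log_q = log_gaussian(z, mean, std)
        total += math.exp(log_q) * (log_q - log_unnormalised_target(z)) * dz
    return total


def best_gaussian(means, stds):
    # Exhaustive search over the (small) family of candidate Gaussians.
    candidates = [(kl_to_unnormalised(m, s), m, s) for m in means for s in stds]
    return min(candidates)


best_kl, best_mean, best_std = best_gaussian([0.0, 0.5, 1.0, 1.5], [0.25, 0.5, 1.0])
```

The search recovers the mean and standard deviation of the target even though its normalizing constant was never used, which is exactly why the log Z term can be ignored during minimization.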

Mean Field Approximation

The name contains the word ‘mean’; we will see its importance later in the blog. The word ‘field’ comes from the theory of electromagnetism in physics, as this approximation methodology incorporates the impact of nearby neighbors in making the decision, thus incorporating the field effects of all neighbors.

So, let us start learning the concept behind this approximation.

Step 1 : Suppose the posterior distribution has n dimensions; then we will consider a total of n different distributions to approximate the posterior.

Step 2 : All the n different distributions are multiplied together to approximate the posterior, i.e. q(z) = q(z1)*q(z2) … q(zn).


Step 3 : Now the expression of the KL divergence is differentiated to minimize the distance between the posterior and the approximation. We use the coordinate descent method, which differentiates the expression w.r.t. one variable, updates the expression, then differentiates w.r.t. another variable, and so on.

Mathematical Derivation

KL divergence expression :


Now, we only need to differentiate w.r.t. the kth factor q(zk); all the other dimensions are held constant, as per coordinate descent.

Taking q(zk) out as a common factor and treating the remaining multipliers as constants, the second term, which involves P(z*), becomes an expected value taken over q(z1)*q(z2) … q(zi), where i != k.

The final expression we get is :

Now, with a little rearrangement, the equation takes the form of a KL divergence again, and differentiating it and equating it to zero amounts to equating the numerator and the denominator. The final expression is as follows:


Because it is the mean impact of the remaining dimensions that yields the optimal solution for the kth dimension, we call it mean field.
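For readers who want the symbols, the derivation can be sketched as below. This is the standard textbook form rather than a copy of the original figures; q_i denotes the factor for dimension i, q_{-k} the product of all factors except the kth, and p^{*} the posterior, all of which are notational assumptions.

```latex
% Factorized (mean field) approximation
q(z) = \prod_{i=1}^{n} q_i(z_i)

% KL divergence as a function of the k-th factor only
KL\left( q \,\|\, p^{*} \right)
  = \int q_k(z_k) \log q_k(z_k) \, dz_k
  - \int q_k(z_k) \, E_{q_{-k}}\left[ \log p^{*}(z) \right] dz_k
  + \text{const}

% Optimal k-th factor, holding all the others fixed
\log q_k(z_k) = E_{q_{-k}}\left[ \log p^{*}(z) \right] + \text{const}
```

The expectation over all other factors is the "mean" effect of the remaining dimensions on dimension k.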

Practical understanding

Now, what you just saw may look very scary, right?

Let us take a working example to understand the concept of mean field approximation in a more practical manner.

Grey scale image de-noising is a very popular use case of mean-field variational inference (MFVI). It uses 2 concepts – the Ising model and approximating the Ising model using mean field approximation. Below are the details:

Ising Model : a mathematical model of ferromagnetism in statistical mechanics.

Core Idea behind Merging two theories : ferromagnetic substances, at their molecular level, can have 2 types of spins – clockwise or anticlockwise. If the spins of all the molecules are the same and aligned, the object starts behaving as a magnet. The spin of any molecule is significantly influenced by its neighboring molecules. This influence varies from substance to substance. Thus the spin of each molecule is probabilistic. The Ising model defines this behavior mathematically.

Now, we can consider our image as a 2-D sheet of a substance and every pixel of the image as a molecule. In a binary image, these pixels can take only 2 values, i.e. black and white, analogously -1 and 1. This image can be de-noised by applying the same theory, i.e. each pixel takes its value according to its neighbors. And so we can apply the Ising model to correct the spins/black-white/-1,+1 values of the image, if we control the neighbor impact property of the image.

Now, the joint probabilistic distribution of the Ising model is impossible to evaluate analytically, thus we use Mean Field Approximation theory to evaluate it.
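A minimal sketch of this idea on a tiny binary image is shown below. It is not the code behind the bird example: the function name `mean_field_denoise`, the `coupling` and `evidence` weights, and the 6x6 test image are all assumptions. Each pixel keeps a mean value in [-1, 1] that is repeatedly updated from the means of its four neighbors plus a pull towards the observed noisy pixel.

```python
import math


def mean_field_denoise(noisy, coupling=1.0, evidence=1.0, sweeps=5):
    """Mean field updates for an Ising prior with a per-pixel data term.

    noisy    -- list of rows with entries -1 or +1 (the observed image)
    coupling -- strength of the pull towards the neighbors' mean values
    evidence -- strength of the pull towards the observed pixel
    """
    rows, cols = len(noisy), len(noisy[0])
    mu = [[float(v) for v in row] for row in noisy]  # start at the observation
    for _ in range(sweeps):
        for r in range(rows):
            for c in range(cols):
                neighbour_sum = 0.0
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        neighbour_sum += mu[rr][cc]
                # Coordinate update: the mean of a +/-1 variable is a tanh.
                mu[r][c] = math.tanh(coupling * neighbour_sum + evidence * noisy[r][c])
    # Threshold the means to get the de-noised black/white image.
    return [[1 if m >= 0 else -1 for m in row] for row in mu]


# Hypothetical image: left half black (-1), right half white (+1),
# with one pixel flipped in each half to simulate noise.
clean = [[-1] * 3 + [1] * 3 for _ in range(6)]
noisy = [row[:] for row in clean]
noisy[1][1] = 1
noisy[4][4] = -1
denoised = mean_field_denoise(noisy)
```

Raising `coupling` relative to `evidence` makes the result smoother but risks erasing fine detail, which is the "neighbor impact" knob mentioned above.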

Example :

I tried this approach on a grey image of a bird by adding a lot of noise to it and tried to retrieve as much as possible from it. Below are my results :

Noisy Image:


De-noised Image:


That is the kind of potential I am talking about, i.e. in the noisy image you can see practically nothing, and in the de-noised image we see a bird!

Please read more about it, to try this cool stuff.




Data Stewardship in an AI-Driven Ecosystem: Introducing WhiteNoise

WhiteNoise is the newly available Differential Privacy System by OpenDP. The intent of Differential Privacy is to preserve the security of personally identifiable information & protect against database reconstruction attacks. The methods provided by the WhiteNoise system are part of a toolkit that enables researchers to readily enhance their existing processes with differential privacy techniques.

The Differential Privacy System

The Differential Privacy system provides the checks, balances and controls to enable researchers and analysts to balance model accuracy with data privacy constraints within a machine learning model. The system allows users to:

  • Control Data – Throttle the completeness of data access requests via user budget
  • Add Noise – Inject noise into the data set (the noise level, controlled by epsilon, is inversely related to accuracy)
  • Report – Provide privatized data and transparency into results
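The budget idea in the first bullet can be sketched without any library. The class below is purely illustrative and is not part of the WhiteNoise API: the name `PrivacyBudget` and its methods are hypothetical, and it assumes basic sequential composition, where the epsilons of successive queries simply add up.

```python
class PrivacyBudget:
    """Tracks cumulative epsilon spent by one user (sequential composition)."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def remaining(self):
        return self.total_epsilon - self.spent

    def charge(self, epsilon):
        """Reserve epsilon for a query; return False if the budget is exhausted."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            return False  # request throttled: not enough budget left
        self.spent += epsilon
        return True


# Hypothetical usage: a total budget of 1.0 allows two queries at 0.4 each.
budget = PrivacyBudget(total_epsilon=1.0)
first = budget.charge(0.4)
second = budget.charge(0.4)
third = budget.charge(0.4)
```

Once the budget is spent, further requests are refused, which is how access to the data is throttled.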


Who is OpenDP?

OpenDP is an open source collaboration to build a system of tools for enabling privacy-protective analysis of sensitive personal data. Originating as a joint endeavor between Harvard’s Privacy Tools Project and Microsoft, today the project includes contributions from industry, academia & government.

Read more about Differential Privacy and OpenDP in the OpenDP White Paper

White Noise Library

The OpenDP White Noise library is available on GitHub as an open source library of differentially private algorithms for statistical analysis and machine learning on sensitive data. Current capabilities assume the researcher is trusted by the data owner and enable the creation of differentially private results over tabular data stores using SQL, the implementation of a shared privacy budget within a service to compose queries, and differential privacy analysis for black-box stochastic testing.

The WhiteNoise methods are written in Rust and can be leveraged in Python via Protocol Buffers (protobuf). Additional work is being done to make capabilities also available in R.


WhiteNoise Core Components comprise the libraries necessary for performing the statistical evaluations and analysis for differential privacy.

  • Analysis – graphical representation of computations
  • Validator – Rust library for validating that an analysis is differentially private
  • Runtime – Rust reference runtime
  • Bindings – Python language bindings

WhiteNoise System Components comprise the mechanisms to interact with the Core components.

  • Data Access – Python library to intercept and process SQL queries (supports PostgreSQL, SQL Server, Spark, Presto, Pandas)
  • Service – REST endpoint to serve requests or queries against shared data sources
  • Evaluator – enables support of Privacy, Accuracy, Utility & Bias Tests

System :: Data Access

Within an AI driven organization, how do we make data more readily available for analysis and insights across the organization, while also protecting the private data of our customers and clients? The WhiteNoise SQL DataReader is a wrapper available in Python that triggers the addition of white noise to a data set when calling the PrivateReader method. Details below:


  1. Parser – Takes query + metadata
  2. Validator – The validator checks that query meets requirements for differential privacy
  3. Rewriter – Modifies the query to enforce bounds
  4. Database Execution – Query is executed by target database engine
  5. Postprocessing – Aggregates are submitted to the differential privacy algorithms to create differentially private results

Let’s see what we can do with the WhiteNoise library and the Titanic train data set. The code snippet below is an example of how easy it is to invoke the SQL Data Reader. The researcher needs to import the WhiteNoise module, which exposes the opendp.whitenoise.metadata and opendp.whitenoise.sql libraries. The PrivateReader wraps around the PandasReader, which in turn reads from a Pandas data frame.


Highlighted above are steps to call the reader. You must first instantiate the PandasReader object against the data source, then instantiate the WhiteNoise PrivateReader against the PandasReader object. The query can then be passed through to the WhiteNoise PrivateReader object. As demonstrated, the WhiteNoise wrapper is easily implemented and is extensible across PostgreSQL, SQL Server, Spark, Presto and Pandas.


When viewing results from the PrivateReader, it should be noted that subsequent calls to the PrivateReader will return slightly different results. Results should vary randomly around the true values, as random noise is injected on each call.


What’s in a YAML file?

Both the PandasReader object and the WhiteNoise PrivateReader object take the CollectionMetadata object and read in metadata about datasets.

Income, Age, Sex & medical characteristics are all attributes that may be considered personally identifiable and private.

When performing analysis on personally identifiable data sets, which fields are private? Which fields are necessary for the experiment being performed by the researcher? Who should have access to these fields?

Organizations can leverage metadata files to define which fields require a budget and set upper and lower bounds for variables.

Metadata can be constructed in YAML or directly in code

To the right is an example YAML file for the Titanic training set.

System :: Service

The WhiteNoise libraries include methods for auto generating the REST client and models for APIs via swagger definition. The APIs provide access to the core modules: Dataset, Linear Regression, Logistic Regression, SQL & Synthetic Data.

Core :: Analysis & Validator + System :: Evaluator

Within the Core libraries are statistical methods to impose differential privacy within commonly utilized statistical calculations for the aggregates: mean, variance, moment & covariance. Data is privatized by returning data sets that have been perturbed by Gaussian, Geometric or Laplace noise. Mechanisms for each are available, as well as methods enabling digitizing, filtering & transforming data sets.

In experiments, the size of a data set can also be considered private or the researcher may work with datasets of unknown size. Core libraries provide methods for clamping, resizing, calculating n and nullity checks. Clamping is a method to define upper and lower bounds for variables, resizing adds records to a dataset to mask population size, n is the estimated sample size and nullity is the ability to determine whether the validator can guarantee results.

The example below adds Gaussian noise to the ‘Age’ values within our data set. The privacy usage can be adjusted by the epsilon value passed. A larger epsilon means less privacy, while a smaller epsilon means more privacy.
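To see how clamping and epsilon interact without depending on the WhiteNoise API, here is a minimal library-free sketch. It swaps in the Laplace mechanism rather than the Gaussian mechanism mentioned above, and it assumes the number of records is public and that neighboring data sets differ by replacing one record. The function names and the sample ages are hypothetical.

```python
import random


def clamp(values, lower, upper):
    """Force every value into [lower, upper] so the sensitivity is bounded."""
    return [min(max(v, lower), upper) for v in values]


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Differentially private mean of clamped values (Laplace mechanism)."""
    clamped = clamp(values, lower, upper)
    true_mean = sum(clamped) / len(clamped)
    # Replacing one record moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clamped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical ages; a smaller epsilon gives a noisier (more private) answer.
ages = [22, 38, 26, 35, 54]
rng = random.Random(42)
strong_privacy = private_mean(ages, 0, 100, epsilon=0.1, rng=rng)
weak_privacy = private_mean(ages, 0, 100, epsilon=10.0, rng=rng)
```

The noise scale is the sensitivity divided by epsilon, so multiplying epsilon by 100 shrinks the typical error by the same factor, at the cost of a weaker privacy guarantee.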


Post-analysis, the evaluator allows us to analyze and evaluate whether our analysis satisfies Differential Privacy standards:

  • Privacy Test – Does the report meet differential privacy conditions?
  • Accuracy Test – Does the reliability of the report fall within the upper and lower bounds?
  • Utility Test – Are the confidence bounds close enough to the data? Are we maximizing privacy?
  • Bias Test – Is the distribution of repeated queries balanced?

We’ve seen how the WhiteNoise Package enables a system for protecting personally identifiable information. Want to learn more about Responsible ML and WhiteNoise? OpenDP’s GitHub repository includes Python notebooks to get you started with the WhiteNoise libraries. Last week, Microsoft Build 2020 announced new Responsible ML and WhiteNoise capabilities. You can watch Salil Vadhan, Leader of Harvard’s Privacy Tools Project on the Channel 9 AI-Show with Seth Juarez, and hear about the Science Behind WhiteNoise. You can also view a post on Responsible ML and the Importance of Building AI Systems That We Can Trust, posted by Eric Boyd Corporate VP, Azure AI.

Additionally, Global AI Community was given a preview of Responsible ML tools. Eva Pardi provides an overview of Interpret-Text, an open source package that enables developers to understand their models’ behavior and the reasons behind individual predictions. Willem Meints discusses FairLearn, a Python package that enables developers to assess and address fairness and observed unfairness within their models. Finally, Sammy Deprez takes you into the magical world of ConfidentialML, with Microsoft SEAL and OpenEnclave.

About the Author:

Alicia is a Microsoft AI MVP and authors a blog focused on cloud data learning resources. She is also an organizer for Global AI Bootcamp – Houston Edition, a Microsoft AI sponsored event. Alicia is active in the PASS User Group community, and enjoys speaking on #AI and #SQLServer topics.

Alicia has been in the Database/BI services industry for 10+ years. Alicia has earned Microsoft Certified: Azure Solutions Architect Expert & MCSE: Data Management & Analytics. She also holds certifications in both Azure and AWS Big Data.


The 17 equations that changed the course of history

This article was written by Andy Kiersz.

From Ian Stewart’s book, these 17 math equations changed the course of human history

  • A 2013 book by mathematician and science author Ian Stewart looked at 17 mathematical equations that shaped our understanding of the world.
  • From basic geometry to our understanding of how the physical world works to the theories underlying the internet and our financial systems, these equations have changed human history.

Mathematics is all around us, and it has shaped our understanding of the world in countless ways.

In 2013, mathematician and science author Ian Stewart published a book on “17 Equations That Changed The World.”




Making a case for Cloud-Based Tools for Event Management

If you are only vaguely familiar with events or have only attended a handful of them in your life, you would still know that they are an incredibly challenging thing to pull off. Even if the scale of the event is small, there’s just so much that goes into pulling off a successful event. To start with, you need the venue, the registrations, the food, the attendees, their tickets, the seating, the flow of people across the event, and so much more, all in order. Considering that this is not even the complete list of the factors and details involved in events, it should not be hard to imagine why they would be deemed a ‘complex’ endeavor. But that’s not to say it can’t be done; of course it can. But event planners are starting to feel an urgent need for solutions that would make them and their operations more efficient and effective.


And that’s precisely where cloud-based solutions come in. Cloud-based event management solutions are designed in such a manner that they can offer invaluable assistance to the entire team right from the start. Be it sending out emails to attendees or updates to your team, such tools fit right in. And what else can they do for event managers and professionals in this trade? A lot more than you’d be inclined to imagine; so, we compiled a list of some of the more compelling benefits of cloud-based tools for event planning and management.

  1. Seamless access to information and data: With cloud-based event management software, all your data and information is stored in a centralized database. This makes it incredibly easy to access the information and to disseminate it to attendees. So, be it your employees who need the latest update on the progress of individual tasks or attendees who need to know, say, the list of events, the schedule, or the directions, all of it can be shared with the right people without a fuss.
  2. Online payments: Many people forget that payment is an integral part of the events business as well, and that offering customers seamless means of payment is essential to giving them an excellent experience. When you use payment devices linked to such solutions, it becomes easier to process payments, fill in the requisite data, and more.
  3. Enhanced efficiency: One of the most significant pain points for event management companies is that they struggle to enhance the effectiveness of their processes. With a cloud-based solution, you gain advantages like the ability to access information on the go, track processes via mobile devices, and more. All of these factors play a critical role in boosting the company’s efficiency.

Suffice it to say that web-based software for event planning and management can do wonders for a business that seeks not merely to operate profitably, but also to make a difference in the market via novel experiences and more.


Speedup by 10x the Hyperparameter tuning of ML applications on Kubeflow using FPGAs

Kubernetes is a great system for handling clusters of containers (whether on cloud or on-premise), but deploying and managing containerized applications for ML can be a challenging task.

Kubeflow is known as a machine learning toolkit for Kubernetes. It is an open source project used for making deployments of machine learning workflows on Kubernetes simple, portable, and scalable. It is used by data scientists and ML engineers who want to build, experiment, test and serve their ML workloads to various environments.

Some of the main components that make Kubeflow useful include:

  • Central Dashboard, that provides access to the other main components through a UI
  • Notebook Servers, to set up Jupyter notebooks
  • Katib, for automated tuning of ML model’s hyperparameters
  • Pipelines, for building end-to-end ML workflows, based on containers.

Until now, the Kubeflow community has presented applications on CPUs or GPUs. FPGAs can be used to speed up ML applications, but so far their integration and deployment have been hard.

The InAccel FPGA manager makes deploying FPGAs and integrating them with higher-level programming frameworks much easier. With InAccel’s FPGA Kubernetes plugin, applications can be easily accelerated without worrying about resource management and utilization of the FPGA cluster.

A complete guide on how to set up a machine learning application using FPGAs with Kubeflow on any existing Kubernetes cluster is provided in the Tutorial Labs.

Hyperparameter tuning is the process of optimizing the hyperparameter values to maximize the predictive accuracy of the model. If you don’t use Katib or a similar system for hyperparameter tuning, you need to run many training jobs yourself, manually adjusting the hyperparameters to find the optimal values.
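To make the manual alternative concrete, here is a minimal sketch of the loop that a tuner automates: sample hyperparameters, run a training job, keep the best result. The `train_and_score` function is a hypothetical stand-in for a real training job, and the parameter names and ranges are illustrative assumptions rather than Katib or XGBoost defaults.

```python
import random

def train_and_score(max_depth, learning_rate):
    """Hypothetical stand-in for a training job; returns a fake accuracy."""
    # Toy objective that peaks at max_depth=6, learning_rate=0.1 (an assumption).
    return 1.0 - 0.01 * abs(max_depth - 6) - abs(learning_rate - 0.1)

def random_search(n_trials, seed=0):
    """Run n_trials random trials and return the best parameters and score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "max_depth": rng.randint(2, 10),
            "learning_rate": rng.uniform(0.01, 0.3),
        }
        score = train_and_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search(n_trials=20)
print(best_params, round(best_score, 3))
```

Every trial in this loop is a full training run, which is why the per-trial training time dominates the cost of tuning.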

Searching for the best parameters takes important time away from other stages of the data science lifecycle. Tools that merely monitor and automate this repetitive training process are therefore not enough; the training itself needs to be accelerated, so that professionals can concentrate more on stages such as business understanding and data mining.

XGBoost is a powerful machine learning library that has recently been dominating applied machine learning, and it makes building a predictive model quite easy. Improving the model is harder, however: the many parameters require careful hyperparameter tuning to fully leverage its advantages over other algorithms.

InAccel previously released an IP core for accelerated XGBoost on FPGAs. This IP core helped demonstrate the advantages of FPGAs in the domain of ML and offered the data science community the chance to experiment with, deploy and utilize FPGAs to speed up their ML workloads. With Python, Java and Scala APIs provided, engineers do not need to change their code at all or worry about configuring FPGAs.

With Katib, running your own experiment takes three steps.

  1. Package your training code in a Docker container image and make the image available in a registry.
  2. Define the experiment specifications from Katib’s ‘Submit’ UI or with a YAML configuration file, setting the parameters you want to tune, the objective metric, the number of trials and more. Finally, allocate the resources to enable the FPGA-accelerated execution:
    xilinx/aws-vu9p-f1-04261818: 1
  3. Run the experiment from the Katib UI by submitting the YAML file and monitor the trials.
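The resource allocation in step 2 can be pictured as a container spec fragment. The sketch below builds one as a Python dictionary, assuming the standard Kubernetes `resources.limits` layout for extended resources. Only the resource name and the `inaccel/jupyter:scipy` image come from this article; the container name and the command are hypothetical placeholders.

```python
import json

# Extended-resource name taken from the article.
FPGA_RESOURCE = "xilinx/aws-vu9p-f1-04261818"

def trial_container(image, command, fpga_count=1):
    """Build a container spec fragment that requests FPGA devices."""
    return {
        "name": "training",  # hypothetical container name
        "image": image,
        "command": command,
        # Standard Kubernetes layout: extended resources go under limits.
        "resources": {"limits": {FPGA_RESOURCE: fpga_count}},
    }

container = trial_container(
    image="inaccel/jupyter:scipy",                      # image named in the article
    command=["python3", "train.py", "--max-depth=6"],   # hypothetical command
)
print(json.dumps(container, indent=2))
```

The same fragment would normally be written directly in the YAML file; the dictionary form is used here only to keep the example checkable.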


SVHN is a real-world image dataset, obtained from house numbers in Google Street View images. It consists of 99289 samples and 10 classes, with every sample being a 32-by-32 RGB image (3072 features).

The training code for step 1 can be found on GitHub and is included inside the inaccel/jupyter:scipy Docker image.

After defining the parameters, the search algorithm, the metrics and the other trial specifications, we create a TrialTemplate YAML. In this file we:

  • run the training application inside the container,
  • define the algorithm’s tree_method,
  • get the accelerator’s bitstream for XGBoost and
  • allocate the resources as we mentioned previously.

For a CPU-only implementation we just need to change the tree_method to exact, hist etc.

Finally, we submit the experiment and navigate to the monitor screen.

  • CPU-only training plot:


  • FPGA-accelerated training plot:


In the above plots we see the objective metrics, accuracy and time, along with the three hyperparameters we chose to tune. We can keep the best combination of them, inspect further details or retry with another experiment. We notice that the accuracy is the same on both executions, but the CPU-only training takes 1100 seconds on average, while the FPGA-accelerated one lasts only 245 seconds. This means that InAccel XGBoost achieves a speedup of about 4.5x on this use case.
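The speedup figure follows directly from the two average trial times. A short worked calculation, using only the numbers reported above:

```python
cpu_seconds = 1100   # average CPU-only trial time reported above
fpga_seconds = 245   # average FPGA-accelerated trial time reported above

speedup = cpu_seconds / fpga_seconds
time_saved = 1 - fpga_seconds / cpu_seconds

print(f"speedup: {speedup:.2f}x")                 # speedup: 4.49x
print(f"time saved per trial: {time_saved:.0%}")  # time saved per trial: 78%
```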

You will find a step-by-step tutorial here.

The following video also presents a complete walkthrough on how to submit a new experiment using Katib and highlights the extra steps needed for the FPGA deployment along with a small comparison of CPU and FPGA execution times.


Vangelis Gkiastas

ML Engineer

Copyright: InAccel, Inc.


New Algorithm For Density Estimation and Noise Reduction

KernelML – Hierarchical Density Factorization

  • Approximate any empirical distribution
  • Build a parameterized density estimator
  • Outlier detection and dataset noise reduction

The solution I came up with is incorporated into a Python package, KernelML. The example code can be found here.

My solution uses the following:

  1. Particle Swarm/Genetic Optimizer
  2. Multi-Agent Approximation using IID Kernels
  3. Reinforcement Learning

Most kernels have hyper-parameters that control the mean and variation of the distribution. While these parameters are potentially differentiable, I decided against using a gradient-based method. The gradients for the variance parameters can potentially vanish, and constraining the variance makes the parameters non-differentiable. It makes sense to use a mixed integer or particle swarm strategy to optimize the kernels’ hyper-parameters. I decided to use a uniform distribution kernel because of its robustness to outliers in higher dimensions.

Over the past year, I’ve independently developed an optimization algorithm to solve non-linear, constrained optimization problems. It is by no means perfect, but building it from scratch allowed me to 1) make modifications based on the task and 2) better understand the problem I was trying to solve.

My initial approach used a multi-agent strategy to simultaneously fit any multi-variate distribution. The agents, in this case the kernels, were independent and identically distributed (IID). I made an algorithm, called density factorization, to fit an arbitrary number of agents to a distribution. The optimization approach and details can be found here. The video below shows a frame-by-frame example of how the solution might look over the optimization procedure.

[embedded content]

This algorithm seemed to perform well on non-sparse, continuous distributions. One problem was that the algorithm used IID kernels, which is an issue when modeling skewed data: every kernel has the same 1/K weight, where K is the number of kernels. In theory, hundreds of kernels could be optimized at once, but this solution lacked efficiency and granularity.

I decided to use a hierarchical, reinforcement style approach to fitting the empirical multi-variate distribution. The initial reward, R_0, was the empirical distribution, and the discounted reward, R_1, represented the data points not captured by the initial multi-agent algorithm at R_0. Equation (1) shows the update process for the reward.


Here, p is the percentage of unassigned data points, R is the empirical distribution at step t, U is the empirical distribution for the unassigned data points, and lambda is the discount factor. The reward update is the product of p and lambda.

This works because, by definition, the space between data points increases as the density decreases. As data points in under-populated regions are clustered, the cluster sizes will increase to capture new data points. The reward update is less than the percentage of unassigned data points, which allows the denser regions to be represented multiple times before moving to the less dense regions.

The algorithm uses pre-computed rasters to approximate the empirical density, which means that the computational complexity depends on the number of dimensions, not the number of data points. The example below shows how the estimated and empirical distribution might look for the 2-D case.


After fitting the initial density factorization algorithm, the reward is updated by some discount factor to improve the reward for the data points that have not been captured. The plot below shows how the empirical distribution might look after a few updates.


The samples in each cluster must be greater than the minimum-leaf-sample parameter. This parameter prevents clusters from accidentally modeling outliers by chance. This is mostly an issue in higher dimensional space due to the curse of dimensionality. If a cluster does not meet this constraint, it is pruned from the cluster solution. This process continues until 1) a new cluster solution does not capture new data points or 2) >99% of the data points have been captured (this threshold is also adjustable).

As the input space increases in dimensionality, the Euclidean distance between data points increases. For example, for an input space that contains uniform random variables, the distance between data points increases by a factor of sqrt(D), where D is the number of dimensions.
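The sqrt(D) growth is easy to check numerically. The sketch below, which is an illustration and not part of KernelML, draws points uniformly from the unit hypercube and prints the mean pairwise Euclidean distance along with that distance divided by sqrt(D). The sample size and seed are arbitrary choices.

```python
import math
import random

def mean_pairwise_distance(dim, n_points=200, seed=0):
    """Mean Euclidean distance between uniform random points in [0, 1]^dim."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total, pairs = 0.0, 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += math.dist(points[i], points[j])
            pairs += 1
    return total / pairs

for dim in (1, 4, 16, 64):
    d = mean_pairwise_distance(dim)
    # The last column stays roughly constant once dim is large.
    print(dim, round(d, 3), round(d / math.sqrt(dim), 3))
```

For large D the ratio settles near sqrt(1/6), about 0.41, because the expected squared difference of two uniform variables on [0, 1] is 1/6 per dimension.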

To create a presentable example, the curse of dimensionality will be simulated in 2-D. This can be achieved by creating an under-sampled (sparse) training dataset and an over-sampled validation dataset. Two of the clusters were moved closer together to make cluster separation more difficult.

The density can be estimated by counting the number of clusters assigned to a data point. The solution is parameterized so it can be applied to the validation dataset after training. The plot below shows the histogram of the density estimate after running the model on the training dataset.
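As a rough illustration of the counting step, the sketch below represents each fitted cluster as an axis-aligned box, given by a center and a half-width per dimension, and counts how many boxes contain each point. The box representation is an assumption made for this example, standing in for a fitted uniform kernel; it is not KernelML's internal data structure.

```python
def density_estimate(points, clusters):
    """Count, for each point, how many clusters contain it.

    Each cluster is a hypothetical (center, half_width) pair describing an
    axis-aligned box, standing in for a fitted uniform kernel.
    """
    counts = []
    for point in points:
        count = 0
        for center, half_width in clusters:
            inside = all(
                abs(x - c) <= h for x, c, h in zip(point, center, half_width)
            )
            if inside:
                count += 1
        counts.append(count)
    return counts

clusters = [
    ((0.0, 0.0), (1.0, 1.0)),
    ((0.5, 0.5), (1.0, 1.0)),
    ((3.0, 3.0), (0.5, 0.5)),
]
points = [(0.2, 0.3), (1.2, 1.2), (3.1, 2.9), (5.0, 5.0)]
print(density_estimate(points, clusters))  # [2, 1, 1, 0]
```

Because the clusters are stored as parameters, the same function can be applied unchanged to a validation set.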


The density can be used to visualize the denser areas of the data. The green rings show the true distributions’ two standard deviation threshold. The plot below visualizes the density for the training dataset.


Outliers can be defined by a percentile, e.g. the 5th or 10th, of the density estimate. The plot below shows the outliers defined by the 10th percentile. The green rings show the true distributions’ two standard deviation threshold.
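A minimal sketch of percentile-based flagging, assuming the density estimate is available as a list of per-point values. The helper names and the sample densities are hypothetical, and the percentile uses simple linear interpolation.

```python
def percentile(values, q):
    """Linear-interpolation percentile of values, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

def flag_outliers(densities, q=10):
    """Mark points whose density is at or below the q-th percentile."""
    threshold = percentile(densities, q)
    return [d <= threshold for d in densities]

densities = [5, 7, 0, 6, 8, 1, 9, 6, 7, 5, 6]  # made-up density counts
flags = flag_outliers(densities, q=10)
print([i for i, is_outlier in enumerate(flags) if is_outlier])  # [2, 5]
```

Since the density here is a count, ties at the threshold are common, so the flagged share can exceed the nominal percentile.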


The plot below shows the histogram of the density estimate for the validation dataset.


The plot below visualizes the density for the validation dataset. The green rings show the true distributions’ two standard deviation threshold.


The outliers, defined by the 10th percentile, are visualized below. The green rings show the true distributions’ two standard deviation threshold.


This particular use case focused on outlier detection. However, the algorithm also provides cluster assignments and density estimates for each data point. Other outlier detection methods, e.g. local outlier factor (LOF), can produce similar results in terms of outlier detection. LOF depends on the number of nearest neighbors and the contamination parameter. While it is easy to tune LOF’s parameters in 2-D, it is not so easy in multiple dimensions. Hierarchical density factorization provides a robust method to fit multi-variate distributions without the need for extensive hyper-parameter tuning. While the algorithm’s complexity does not depend on the number of data points, it is still relatively slow, and many improvements can be made to its efficiency and speed. The example notebook includes a comparison to LOF and a multivariate example using the Pokemon dataset.

Originally posted here.


Wind Turbine Surface Damage Detection using Deep Learning Algorithm

Timely detection of surface damage on wind turbine blades is imperative for minimizing downtime and avoiding possible catastrophic structural failures. A large number of high-resolution images of wind turbines are routinely taken from drones and subsequently analyzed by experts to identify imminent damage. Automated analysis of these inspection images with the help of deep learning algorithms can reduce inspection and maintenance costs.

I used publicly available drone inspection images of the “Nordtank” turbine taken over the years 2017 and 2018. The dataset is hosted within the Mendeley public dataset repository.

Using Labelimg, I manually annotated each image that contained at least one wind turbine. Annotating the images in Labelimg creates an XML file corresponding to each image. These XML files must be converted to CSV and then to TFRecords. The full code for this step is available in my notebook.

To simplify the illustration, I restricted the model to three defect types: Vortex Panel, Vortex Panel Missing Teeth, and Surface Damage. A few of the labeled images are shown below. In total, 320 images were used for training (182 of which contained positive samples) and 63 for testing.

I randomly split the images into separate training and testing folders and generated annotation files from the annotation XML, with bounding box coordinates in the specific format needed by the Faster R-CNN model. Each annotation file contains the image name with full path, the defect type and the bounding box coordinates for each image. There can be multiple rows for one image, since a single image can have more than one defect type.
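The conversion from annotation XML to one row per bounding box can be sketched as follows. This is a minimal example, not the notebook's code: it assumes Pascal VOC style XML with `object`, `name` and `bndbox` elements, and it assumes a row layout of path, x1, y1, x2, y2, class, which should be checked against what the training script's parser expects. The file path and sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

def voc_xml_to_rows(xml_text, image_path):
    """Turn one Pascal VOC style annotation into comma-separated rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        box = obj.find("bndbox")
        coords = [
            int(float(box.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        ]
        # Assumed column order: path, x1, y1, x2, y2, class name.
        rows.append(",".join([image_path, *map(str, coords), label]))
    return rows

sample = """<annotation>
  <filename>blade_001.JPG</filename>
  <object><name>Surface Damage</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>90</ymax></bndbox>
  </object>
  <object><name>Vortex Panel</name>
    <bndbox><xmin>200</xmin><ymin>40</ymin><xmax>260</xmax><ymax>80</ymax></bndbox>
  </object>
</annotation>"""

rows = voc_xml_to_rows(sample, "/data/train/blade_001.JPG")
for row in rows:
    print(row)
```

An image with two defects yields two rows, which matches the note above that one image can appear on several lines.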


Here are a few samples of annotated turbine images:






Faster R-CNN model

Faster R-CNN is an updated version of R-CNN (and Fast R-CNN). The structure is similar to Fast R-CNN, but the proposal part is replaced by a ConvNet.

Faster R-CNN architecture


The convolutional layers convert an image into high-level spatial features, called the feature map. A Region Proposal Network (RPN) runs on this feature map and estimates where objects could be located. ROI pooling then extracts the relevant features from the feature map for each proposed region, and, based on those features, the fully connected layers classify whether an object of a particular class is present in that region.

I adopted the Faster R-CNN implementation from keras-frcnn and modified the parameters, image resolutions and prediction output for this problem. In this Faster R-CNN model, the base network is ResNet. The RPN is built on the base layers, and the classifier is built on the base layers as well.

# base layers
shared_layers = nn.nn_base(img_input, trainable=True)
# define the RPN, built on the base layers
rpn = nn.rpn(shared_layers, num_anchors)
# define the classifier, built on the base layers
classifier = nn.classifier(shared_layers, roi_input, C.num_rois,
                           nb_classes=len(classes_count), trainable=True)
# defining the models + a model that holds both other models
model_rpn = Model(img_input, rpn[:2])
model_classifier = Model([img_input, roi_input], classifier)
model_all = Model([img_input, roi_input], rpn[:2] + classifier)

I used the Mean Average Precision (MAP) to measure the quality of prediction. MAP is commonly used in computer vision to evaluate object detection performance during inference. An object proposal is considered accurate only if it overlaps with the ground truth with more than a certain threshold. Intersection over Union (IoU) is used to measure the overlap of a prediction and the ground truth where ground truth refers to the original damages identified and annotated.


The IoU value corresponds to the ratio of the common area to the area of the union of the proposed detection and the ground truth (as shown in the image above).
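IoU divides the overlap area of the two boxes by the area of their union. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; the example boxes are made up:

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

prediction = (50, 50, 150, 150)
ground_truth = (100, 100, 200, 200)
print(round(iou(prediction, ground_truth), 3))  # 0.143
```

Here the overlap is 2,500 square pixels and the union is 17,500, so this prediction would fail a commonly used threshold such as 0.5.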

I used 182 annotated images for training and 63 for testing. For all image and model preprocessing, I used my Jupyter notebook (available on my GitHub), and for training and testing the Faster R-CNN model on a GPU, I used the Google Colab environment.




After creating the annotation file in the above format, start training with the script below:

python -o simple -p train_annotate10.txt

It might take a lot of time to train the model and get the weights, depending on the configuration of your machine. I suggest using the weights I got after training the model for around 60 epochs. You can download these weights from my GitHub.

Here is the plot of my training accuracy and losses for each epoch.



Now the training is completed, and I see the accuracy and loss converge at around 25 epochs. The classification accuracy is 98%, which is a great score. Let us start predicting on unannotated images. After moving the testing images to test_files, call the script below:

python -p test_files

Finally, the images with the detected objects will be saved in the “results_imgs” folder. Below are a few (full data available in my GitHub) examples of the predictions I got after implementing Faster R-CNN:




As the resulting images show, the model detected VG Panels, Missing Teeth, and Surface Damage in the validation image set with high probability, using the default training parameters. However, I found that many actual VG panels were not detected when the image was taken from a different angle or at a different zoom level. I expect that adding more samples will reduce this problem and improve the model’s performance.


The Faster R-CNN model did well on my sample data. Its Mean Average Precision (MAP) is good when predicting surface defects. However, the model did not detect all missing panel teeth scenarios, and I did not pursue this defect type further due to a lack of training images for it. This Faster R-CNN deep learning model should improve considerably when we add many more training samples and images taken in different climatic conditions. In addition, we need custom augmentation methods to generate many more training samples than standard augmentation techniques provide.

I will update my progress with another article soon. You can access the Jupyter notebook with full code for detecting wind turbine surface defects on my GitHub.


1. This is the research paper that inspired this article: Wind Turbine Surface Damage Detection by Deep Learning Aided Drone …

2. Data Source: DTU – Drone inspection images of wind turbine


Proxy Quantum Clouds and the JupyterHub


About 4 months ago, I wrote for this blog an article entitled “List of Quantum Clouds“. In that article, I listed 17 “quantum clouds”. By now, there are probably a few more. The “HWB” (hardware backed) quantum clouds of Dwave, Rigetti and IBM, offer access to already existing, on-line qc hardware. I’ll call all the other quantum clouds “proxy” quantum clouds. Proxy quantum clouds can run a quantum circuit on a cloud hosted simulator or they can relay that quantum circuit to a HWB quantum cloud. Out of all the proxy quantum clouds, two of them are backed by two hungry, cruel giants called Amazon and Microsoft. The rest are backed by ants.

Will proxy quantum clouds, including the ones run by the 2 cruel giants, ever become popular or profitable? It remains to be seen.

Proxy quantum clouds are certainly not a guaranteed slam dunk from a business perspective. Users might decide to simulate their quantum circuits more cheaply by running them on their own PCs. And, if users want to run their quantum circuits on real qc hardware, they might choose to eliminate the proxy, and send their quantum circuits directly to a HWB quantum cloud. Users might eliminate the proxy middleman because it is slowing things down and charging the users a toll fee to do so.

Another important reason why proxy quantum clouds may fail is open source software such as z2jk (Zero to Jupyterhub with Kubernetes). z2jk is free, open source software that allows anyone, even someone with almost zero understanding of the underlying Docker/Kubernetes technology, to set up their own jupyterhub cloud service in minutes.

A jupyterhub, in case you are not familiar with the term, is software that can ‘spawn’, for each of many users, a personal file system in that user’s browser. This personal file system can open, run and save on the cloud many jupyter notebooks. So far, all quantum clouds are, deep down, jupyterhubs.

z2jk is superbly well documented and maintained by UC Berkeley, where it is used to teach many courses. Furthermore, z2jk is very sophisticated. It uses the latest in Jupyter notebook and cloud technology (Docker containers and container orchestration via Kubernetes). Due to Kubernetes, z2jk can be run on anything from a single workstation to multiple cloud providers running concurrently (cloud providers such as Amazon AWS, Microsoft Azure, Google cloud, Digital Ocean, IBM cloud, etc.). Kubernetes magically and seamlessly balances the load among all servers. z2jk is also very flexible: it can be used to teach any data science or quantum computing course, or you can use it to set up your own jupyterhub quantum cloud service, only for yourself, or for you and a group of your peers.

z2jk is so good that it may single-handedly put the kibosh on all proxy quantum clouds. Goodbye Amazon and Microsoft quantum clouds. Goodbye proxy quantum clouds run by startups. We hardly knew ye.