Kdnuggets RSS Feed

Evidence Counterfactuals for explaining predictive models on Big Data


Big Data generated by people — such as social media posts, mobile phone GPS locations, and browsing history — provides enormous predictive value for AI systems. However, explaining how these models use the data to make predictions remains challenging. This explanation approach considers how a model would behave if it no longer had part of the original data to work with.

By Yanou Ramon, Applied Data Mining Research Group, U. of Antwerp.

Predictive models on Big Data: Mining a pool of evidence

Why did the model predict you’d be interested in this post, based on the hundreds of KDNuggets posts you read? Because you read the post about “explainable AI” and the post about “cracking open the black box”: if you had not read these posts, you would not have been predicted to be interested.

The above example is an imaginary “Evidence Counterfactual” for a model that predicts interest in this post based on your browsing behavior on KDNuggets (much like targeted online advertising works these days). In this post, you’ll learn more about the Evidence Counterfactual, an approach to explaining the decisions of any predictive system that uses Big Data.

More companies are tapping into a rich pool of humanly-generated data (also referred to as “behavioral big data”). Think of a person liking Instagram posts, visiting different locations captured by their mobile GPS, browsing web pages, searching Google, making online payments, connecting to other people on LinkedIn, writing reviews on Reddit or Goodreads, and so on. Mining these massive behavioral traces leads to artificial intelligence (AI) systems with very high predictive performance in a variety of application areas,1 ranging from finance and risk to marketing.

The goal of these AI systems is to predict a variable of interest from these data, such as creditworthiness, fraudulent behavior, personality traits, or product interest. The input data are characterized by a large number of small pieces of evidence that the model uses to predict the output variable. Let’s refer to this as the “evidence pool.” These pieces of evidence are either “present” for an instance (e.g., a person in the data set) or “missing,” and each instance only has a relatively small portion of evidence present. As Foster Provost explains in this talk,2 a predictive model can be thought of as an evidence-combining system, where all pieces of evidence that are present for an instance can be used by the model to make predictions.

To illustrate more clearly how to see behavioral big data as a “pool of evidence,” think of a model that uses location data of persons in New York City to predict whether someone is a tourist or an NY citizen. Out of all possible places to go to (the “evidence pool”), a person would only visit a small number of places each month (the “evidence of that person”). In a numerical data representation, each place is represented by a binary feature (see the columns in Figure 1), and the places someone visited get a corresponding nonzero value for that person. All places not visited by that person are “missing” pieces of evidence and get a corresponding zero value. In Figure 1, for example, Anna visited 85 of the 50,000 possible places used by the predictive model. She visited Times Square and Dumbo, for instance, but did not visit Columbia University, making that piece of evidence missing.
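The representation in Figure 1 can be sketched in a few lines. This is a minimal illustration of the binary encoding described above; the place names and visits are invented, not the actual data:

```python
# Sketch: representing a person's visited places as sparse binary features.
def to_binary_vector(visited, evidence_pool):
    """Map a person's visited places onto the model's feature space:
    1 = evidence present, 0 = evidence missing."""
    return [1 if place in visited else 0 for place in evidence_pool]

# A tiny "evidence pool"; the real one would have ~50,000 columns.
evidence_pool = ["Times Square", "Dumbo", "Columbia University", "Central Park"]
anna_visits = {"Times Square", "Dumbo"}

anna = to_binary_vector(anna_visits, evidence_pool)
print(anna)  # [1, 1, 0, 0] -> present, present, missing, missing
```

In practice such vectors are stored in a sparse matrix, since each instance has nonzero values for only a small fraction of the evidence pool.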

Figure 1: Location data represented as binary features.

Intuition behind the Evidence Counterfactual

It is not straightforward to interpret how predictive systems trained on behavioral footprint data make their decisions, either because of the modeling technique (which can be highly nonlinear, as with Deep Learning models), or the data (very high-dimensional and sparse), or both.

To understand the reasons behind individual model predictions, Evidence Counterfactuals (or simply “counterfactuals”) have been proposed. This explanation approach (to the best of our knowledge first proposed for predictive modeling in this paper3 to explain document classifications) is mainly inspired by causal reasoning, where the goal is to identify a causal relationship between two events: an event A causes another event B if we observe a difference in B’s value after changing A while keeping everything else constant.4

The Evidence Counterfactual shows a subset of the instance’s evidence (event A) that causally drives the model’s decision (event B). For any subset of evidence of an instance, we can imagine two worlds, identical in every way except that the evidence set is present in one world but not in the other. The first world is the “factual” world, whereas the unobserved world is the “counterfactual” world. The counterfactual outcome of the model is the hypothetical value of the output under an event that did not happen (e.g., a set of pieces of evidence is no longer present for that instance). The counterfactual explanation can then be defined as an irreducible set of evidence pieces such that, if they were no longer present for that instance, the model’s decision would be different. (We can also speak of “removing evidence” when making pieces of evidence “missing.”) Irreducibility means that removing only a proper subset of the explanation’s evidence would not change the model’s decision.

To clarify this definition, consider the following Evidence Counterfactual as an explanation for why Anna was predicted as a tourist in our running location data example:

IF Anna had not visited Times Square and Dumbo, THEN the model’s prediction changes from tourist to NY citizen.

The pieces of evidence {Times Square, Dumbo} are a subset of Anna’s evidence (all the places she visited). Removing only Times Square or only Dumbo from her visited locations would not be sufficient to change the predicted class (this is the irreducibility of the Evidence Counterfactual). The “factual world” is the one that is observed and includes all the places Anna visited. The “counterfactual world” that results in a predicted class change is identical to the factual world in every way, except that Anna did not visit Times Square and Dumbo.
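This IF-THEN statement, including irreducibility, can be checked mechanically. Here is a minimal sketch with a toy linear scorer; the place weights and threshold are invented for illustration only:

```python
# Invented weights for a toy "tourist vs. NY citizen" linear scorer.
WEIGHTS = {"Times Square": 0.6, "Dumbo": 0.5, "Grand Central": 0.1}
THRESHOLD = 0.5  # score >= threshold -> "tourist"

def predict(evidence):
    """Combine the present pieces of evidence into a class prediction."""
    score = sum(WEIGHTS.get(place, 0.0) for place in evidence)
    return "tourist" if score >= THRESHOLD else "NY citizen"

anna = {"Times Square", "Dumbo", "Grand Central"}
explanation = {"Times Square", "Dumbo"}

assert predict(anna) == "tourist"
assert predict(anna - explanation) == "NY citizen"  # removing both flips the class
for place in explanation:                           # irreducibility: removing
    assert predict(anna - {place}) == "tourist"     # either one alone does not
```

The explanation set is thus minimal with respect to the class change, which is exactly what the Evidence Counterfactual reports.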

An important advantage of counterfactuals, which immediately catches the eye in the above example, is that they do not require all features used in the model (the “evidence pool”), or even all the evidence of the instance (e.g., all places Anna visited), to be part of the explanation. This is especially interesting in the context of humanly-generated big data. How useful would an explanation be that shows the marginal contribution of each visited location to the prediction of being a tourist? Such an explanation would encompass hundreds of locations. The Evidence Counterfactual bypasses this issue by showing only the pieces of evidence that led to the decision (they are causal with respect to the decision) and that are relevant for that particular person (only locations visited by that person or, more generally, evidence of that particular instance, can be part of the explanation).

To illustrate how counterfactual explanations can be used to explain models on big data, consider the well-known 20 Newsgroups data5, where we want to predict whether a document is about a “Medical” topic. Figure 2a shows all the words used by the predictive model and the evidence (i.e., words) of each document. The counterfactual explanation of why document 01’s predicted topic is “Medical” is shown in Figure 2b. Seventeen words need to be removed from the document for the predicted topic to no longer be “Medical,” meaning there is quite some evidence behind the model’s decision.

Figure 2a: Words used by the predictive model and the evidence of each document.

Figure 2b: Counterfactual explanation for document 01’s predicted “Medical” topic.

Consider another model trained on the 20 Newsgroups data to predict documents with the topic “Atheism,” where we do not remove header data as a textual preprocessing step. Figure 3a/b shows how the Evidence Counterfactual can help to identify problems with the trained model. Even though document 01 was correctly classified, the header information is being used to differentiate documents with the topic “Atheism” from documents with other topics. This leads to predictions being made for arbitrary reasons that have no clear connection with the predicted topic (e.g., “psilink,” “p00261”). It is unlikely that this arbitrary information is useful when predicting topics of new documents. This example illustrates how Evidence Counterfactuals can be used for identifying issues with a predictive system (such as predictions being “right for the wrong reasons”) and how such explanations can be a starting point for improving the model and the data preprocessing.

Figure 3a: Evidence Counterfactual for the “Atheism” topic model.

Figure 3b: Header information surfacing in the “Atheism” topic explanations.

For more illustrations of counterfactuals for explaining models on behavioral big data, visit this GitHub repository. There are tutorials on explaining gender predictions from movie viewing data with a Logistic Regression and a Multilayer Perceptron model, and on explaining topic predictions from news documents with a Support Vector Machine with a linear kernel function.

Computing counterfactuals for binary classifiers

The huge dimensionality of behavioral data makes it infeasible to compute counterfactual explanations with a complete search algorithm, which would check all subsets of an instance’s evidence until an explanation is found.

Alternatively, a heuristic search algorithm can be used to find counterfactuals efficiently. In the original paper, a best-first search was proposed, based on the scoring function of the model (the open-source Python code is available on GitHub). The scoring function is used to first consider those subsets of evidence (features) whose removal (setting the feature values to zero) reduces the predicted score the most in the direction of the opposite class. These are the best-first feature combinations. There are at least two weaknesses to this strategy: 1) for some nonlinear models, removing a single feature does not change the predicted score at all, in which case the search algorithm picks a random feature in the first iteration. This can result in counterfactuals with too many features in the explanation set, or in a search time that grows exponentially with the number of search iterations. 2) The search time is very sensitive to the size of the counterfactual explanation: the more evidence that needs to be removed, the longer it takes the algorithm to find the explanation.
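The best-first idea can be sketched for a toy linear scoring model. This is a simplified greedy variant for intuition only, not the authors’ released implementation (which searches over partial subsets); the word weights and threshold are invented:

```python
# Toy linear scorer with invented word weights; the predicted class is
# "Medical" when the score reaches the threshold.
WEIGHTS = {"doctor": 0.4, "disease": 0.3, "patient": 0.3, "the": 0.0}
THRESHOLD = 0.5

def score(evidence):
    return sum(WEIGHTS[f] for f in evidence)

def best_first_counterfactual(evidence, score, threshold, max_size=10):
    """Greedily remove the feature whose removal lowers the predicted
    score the most, until the class flips or the size budget runs out."""
    removed, remaining = [], set(evidence)
    while score(remaining) >= threshold and len(removed) < max_size:
        best = min(remaining, key=lambda f: score(remaining - {f}))
        remaining.discard(best)
        removed.append(best)
    return removed if score(remaining) < threshold else None

explanation = best_first_counterfactual(["doctor", "disease", "patient", "the"],
                                        score, THRESHOLD)
# "doctor" is removed first; one more word is needed to flip the class.
```

Note how the greedy step stalls when no single removal changes the score (e.g., a flat region of a nonlinear model), which is exactly the first weakness described above.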

As an alternative to the best-first search, in this paper6 we proposed a search strategy that selects features to include in the explanation according to their overall importance for the predicted score. The importance weights can be computed by an additive feature attribution technique, such as the popular explanation technique LIME. The idea is that the more accurate the importance ranking is, the more likely it becomes that a counterfactual is found by removing features from the top of the ranking downward until the predicted class changes. The hybrid algorithm LIME-Counterfactual (LIME-C) appears to be a favorable alternative to the best-first search because of its overall good effectiveness (a high percentage of small counterfactuals found) and its efficiency. Another interesting upshot of this paper is that it resolves an important open question for importance-ranking methods (like LIME) on high-dimensional data: how many features should be shown to the user? For counterfactuals, the answer is the number of features needed to change the predicted class.
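The importance-ranked strategy can be sketched on a toy linear scorer. For a linear model, the coefficients of the present features are exact attributions, standing in here for the LIME estimates the paper uses; all weights are invented:

```python
# Toy linear scorer; its weights double as exact importance attributions.
WEIGHTS = {"doctor": 0.4, "disease": 0.3, "patient": 0.2, "the": 0.1}
THRESHOLD = 0.5

def score(evidence):
    return sum(WEIGHTS[f] for f in evidence)

def ranked_counterfactual(evidence, score, threshold, importance):
    """Remove features from the top of the importance ranking down
    until the predicted class changes."""
    ranked = sorted(evidence, key=lambda f: importance[f], reverse=True)
    remaining, removed = set(evidence), []
    for f in ranked:
        if score(remaining) < threshold:
            break
        remaining.discard(f)
        removed.append(f)
    return removed if score(remaining) < threshold else None

explanation = ranked_counterfactual(["doctor", "disease", "patient", "the"],
                                    score, THRESHOLD, importance=WEIGHTS)
print(explanation)  # ['doctor', 'disease'] -> exactly the features to show
```

The length of the returned list is also the answer to “how many features to show”: the number needed to change the predicted class.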

Other data and models

Evidence Counterfactuals can address various data types, from tabular data to textual data to image data. The focal issue is to define what it means for evidence to be “present” or “missing.” To compute counterfactuals, we thus need to define the notion of “removing evidence” or setting evidence to “missing.”

In this post, we focused on behavioral big data. These data are very sparse (the data matrix contains many zero values), so it makes sense to consider a piece of evidence “present” when the corresponding feature (e.g., a word or behavior) has a nonzero value. The absence of a piece of evidence is represented by a zero value for that feature.

For image data, the Evidence Counterfactual shows which parts of the image need to be “removed” to change the predicted class; removing parts of an image can correspond to setting the pixels to black or blurring that region.7 For tabular data (think of data that can be shown in a standard Excel file), which has both numerical and categorical variables, the “missingness” of a feature can correspond to replacing its value with the mean (for numerical features) or the mode (for categorical features).8
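A minimal sketch of this mean/mode replacement for tabular data; the column names and values are invented for illustration:

```python
import statistics

def remove_evidence(training_rows, row, features, numeric):
    """One common notion of 'removing' tabular evidence: replace a value
    with the training mean (numeric) or mode (categorical)."""
    replaced = dict(row)
    for f in features:
        column = [r[f] for r in training_rows]
        replaced[f] = statistics.mean(column) if f in numeric else statistics.mode(column)
    return replaced

training = [{"age": 20, "city": "NY"},
            {"age": 40, "city": "NY"},
            {"age": 30, "city": "LA"}]
anna = {"age": 20, "city": "LA"}

# Make both features "missing": age -> training mean, city -> training mode.
counterfactual_input = remove_evidence(training, anna, ["age", "city"], numeric={"age"})
```

Feeding `counterfactual_input` back into the model then yields the counterfactual outcome for this choice of missingness.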

Key takeaways

  • Predictive systems trained on humanly-generated Big Data have high predictive performance, but explaining them is challenging because of the modeling technique (e.g., Deep Learning), the dimensionality of the data, or both.
  • Explaining data-driven decisions is important for a variety of reasons (increase trust and acceptance, improve models, inspect misclassifications, aid in model use, gain insights, etc.), and for many different stakeholders (data scientists, managers, decision subjects, etc.).
  • The Evidence Counterfactual is an explanation approach that can be applied across many relevant applications and highlights a key subset of an instance’s evidence that led to a particular model decision. It shows a set of evidence such that, if this evidence were removed, the model’s decision would be different.

GitHub resource


  1. Junqué de Fortuny, E., Martens, D., Provost, F., Predictive Modeling with Big Data: Is Bigger Really Better?, Big Data, 1(4), pp. 215-226 (2013)
  2. Provost, F., Understanding decisions driven by big data: from analytics management to privacy-friendly cloaking devices, Keynote Lecture, Strata Europe (2014)
  3. Martens, D., Provost, F., Explaining data-driven document classifications, MIS Quarterly, 38(1), pp. 73-99 (2014)
  5. 20 Newsgroups data set:
  6. Ramon, Y., Martens, D., Provost, F., Evgeniou, T., Counterfactual Explanation Algorithms for Behavioral and Textual Data, arXiv:1912.01819 (2019). Available online
  7. Vermeire, T., Martens, D., Explainable Image Classification with Evidence Counterfactual, arXiv:2004.07511 (2020). Available online
  8. Fernandez, C., Provost, F., Han, X., Explaining data-driven decisions made by AI systems: the counterfactual approach, arXiv:2001.07417 (2019). Available online

Bios: Yanou Ramon graduated in 2018 as a business engineer from the University of Antwerp (Faculty of Business and Economics). She now works as a PhD student at the University of Antwerp under Professor David Martens (Applied Data Mining group). Her dissertation focuses on making it easier for humans to understand and interact with predictive models on Big Data by using (post-hoc) techniques to explain model decisions, both at the instance and the global level.

David Martens is a Professor at the University of Antwerp, where he heads the Applied Data Mining group. His work focuses on the development and application of data mining techniques for very high-dimensional (behavior) data and the use thereof in business domains such as risk, marketing, and finance. A key topic in his research relates to the ethical aspects of data science and the explainability of prediction models.


Data Science Central RSS Feed

Internet of Things: Learn to Know How It Benefits Your Business

Every nascent business needs to consider modern technology to gain a competitive advantage in today’s business world. Companies can leverage the power of the latest technologies, such as blockchain, the Internet of Things (IoT), or augmented reality, to be frontrunners in their industries.

Business entrepreneurs have to make sure they understand how modern technology can benefit their business in the long run. There are several questions businesses should consider to determine whether a particular technology is a good match:

  • Which technology should they consider for their business?
  • How can they leverage the benefits of a particular technology?
  • In which areas of the business should they introduce the technology first?
  • What will it cost to integrate the technology?

There are many other questions an individual business has to consider when learning about any specific technology. One of the leading technologies businesses can consider is the Internet of Things (IoT); this modern technology has much to offer businesses that adopt it well. Businesses should therefore put in the effort needed to learn this advanced technology, take their operations to a new level of success, and enjoy the benefits that come with it.

How IoT Is Reaching into Various Business Areas

The Internet of Things has become hugely popular thanks to its compelling improvements over other next-generation technologies, and it has become a buzzword in market research. It offers great scope for innovation; businesses that adopt it can grasp the numerous opportunities this advanced technology offers.

The number of short-range IoT devices worldwide was forecast at around 7.5 billion in 2018, and this number is predicted to reach 17.8 billion by 2024. The analytics and Internet of Things (IoT) market is expected to reach around 520 billion U.S. dollars by 2021, of which the systems-integration segment accounts for around 116.1 billion U.S. dollars. (Source)

More and more business startups are opting for IoT-based devices and solutions so they can carry out their work more systematically and accurately. Connected IoT devices give them quick access to everything within a matter of seconds. Due to the high potential of this technology, the global market for end-user solutions was forecast to reach around 212 billion U.S. dollars in size by 2019. The Internet of Things reached 100 billion dollars in market revenue in 2017 and is predicted to grow to around 1.6 trillion by 2025. (Source)

Top Internet of Things (IoT) Application Examples That Changed the World

The latest IDC report shows that corporate spending on IoT was estimated to reach around $745 billion by 2020, with spending on this advanced technology forecast to reach around $1 trillion. Both organizations and individuals are leveraging the benefits of the Internet of Things. They use various IoT-based solutions to automate almost all of their daily tasks and to monitor and control connected devices, which enhances convenience and efficiency.

The concept behind IoT is quite simple, but its implications are not. This technology is driving revolutionary change in almost every business sector, including manufacturing, healthcare, and many other industries, and most businesses are adopting it to gain competitive advantages in their field.

Business startups mainly focus on increasing operational efficiency: they automate their business tasks and use real-time data management to gain a competitive edge. This technology empowers them to develop and grow their business by providing more advanced and effective approaches.

Businesses can develop IoT-based applications to control and manage their operations more systematically. New applications, such as an UberEats clone, can be built with IoT technologies, which allow businesses to design and implement more robust risk-management strategies than ever before. Many other applications built on the Internet of Things are already in use around the world; the sections below explore some of them.

IoT Sensors

IoT sensors are digital sensors connected to circuit boards such as the Raspberry Pi 2. The boards are programmed to track a range of data collected from the sensor devices, such as temperature, pressure, carbon monoxide, vibration, and humidity.

IoT sensors differ from simple sensors in several ways: they can gather data and send it to a connected device at the same time. They give users seamless control over large amounts of data through automation, delivering actionable insights. Businesses can use this data to improve efficiency, reduce costs, enable predictive maintenance, and perform various tasks that boost their productivity.
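As a rough illustration of the “gather data and send it at the same time” idea, the sketch below turns sensor readings into structured JSON events with threshold alerts. The sensor names, thresholds, and readings are all invented examples:

```python
import json

# Invented alert thresholds for two simulated measures.
THRESHOLDS = {"temperature_c": 30.0, "humidity_pct": 70.0}

def to_event(sensor_id, reading):
    """Build a JSON event an IoT platform could act on, flagging any
    measure that crosses its configured threshold."""
    alerts = [k for k, v in reading.items() if v > THRESHOLDS.get(k, float("inf"))]
    return json.dumps({"sensor": sensor_id, "reading": reading, "alerts": alerts})

# A hot greenhouse triggers a temperature alert but not a humidity one.
event = to_event("greenhouse-01", {"temperature_c": 32.5, "humidity_pct": 55.0})
```

In a real deployment the event would be published to a message broker or HTTP endpoint rather than kept in memory.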


Tracking and Monitoring System

Most businesses these days use advanced IoT-powered solutions for tracking assets. These devices use radio frequency (RF) or GPS to monitor and track assets in real time, and they can be used for asset verification and long-range identification.


Effective Supply Chain

IoT helps supply chain managers make improved predictions via smart routing and rerouting algorithms. Connected devices provide instant facts about packages through RFID and GPS signals, which supports more effective decisions than ever before.

Applications developed with IoT technology can help business entrepreneurs mitigate uncertainty and risk throughout supply chain management. Smart supply-chain programs improve profitability, minimize variance, and reduce costs. Beyond this, advanced IoT-powered devices can help businesses with inventory management, fleet management, scheduled maintenance, and vendor relationships, improving business activities and performance to a great extent.


Smart Barcode Readers

IoT barcode readers support AI-driven digital signal processing, which helps retailers manage inventory much more accurately. These devices can optimize numerous operations, including logistics, warehousing, and retail. They use a cloud data connection to connect with other IoT devices and systems, which eases the inventory-management process.

IoT barcode readers can be integrated into shopping carts to manage and track them more accurately. They use AI-based sensors to detect when a product is added to or removed from the cart, and they automate the transfer of that data. This lets businesses offer an improved checkout experience to many customers at once while saving individual customers a substantial amount of time.


Smart Healthcare System

IoT has introduced numerous systems for the healthcare industry. Healthcare providers can use the technology to offer high-quality medical services through smart medical devices. IoT-enabled devices, also known as the Internet of Medical Things (IoMT), help monitor and support large amounts of data that enable more efficient clinical decisions, and they make almost all medical services accessible to almost everyone.


Advanced IoT medical devices let a doctor monitor individual patients remotely in real time. Such a device can detect an emergency, such as heart failure, a heart attack, or an asthma attack, and report it immediately to a physician, which can save the patient’s life.

IoT devices help patients and doctors collect data such as sugar levels, blood pressure, and weight. The devices store all of this data online, where it can be accessed from anywhere at any time. IoT automates the complete workflow, enabling accurate and effective healthcare services for all patients.

Internet of Things (IoT) for Business is More Powerful and Versatile

IoT has given us many new capabilities that make life easier and faster than ever before. Thanks to this developing technology, businesses can solve problems within a matter of seconds. Beyond this, it also helps us make transactions and purchases and take care of our health more efficiently and accurately than ever before.


Little Blue Pearl


Little Blue Pearl – named after our planet – has kept me occupied for several days during my 14-day Coronavirus isolation. It is fully functional although I suppose it needs a makeover in terms of its appearance. Like a child after summer break, this is going to be my show-and-tell. LBP is designed to process the pandemic data in a particular manner focusing mostly on the number of fatalities.

I created LBP because I believe that the new Coronavirus and many other types of dangerous diseases are here to stay. I will make my contribution to prop up our species in one of the few ways I know how. I begin with a normal chart of fatalities. The image below shows the chart for Canada.


LBP first converts the fatality data into a format that can be more easily accessed by a computer algorithm; basically, this means keeping the activity confined within predictable boundaries. I use a metric that I call the Exp-1, which I wrote about on the same day that I developed it.


To demonstrate the bounded nature of the converted data, consider the next chart containing triggers using fixed boundaries – as presented on the legend. On this chart – also generated by LBP – beneficial developments are indicated as more and bigger bumps appear on the top. The opposite is true if the bumps are on the bottom.


LBP’s primary mission is to determine the impacts of qualitative events on the quantitative metric. On the gradient of the chart below, the greatest amount of benefit occurs when the pattern is on the left. The particular event in question therefore seems associated with minimal benefit.


The crosswave pattern chart attempts to show relative benefit: e.g. are we better or worse off implementing or having the event? The chart shows that based on the data available, it seems we were better off not implementing or having the event. The legend explains the chart to some extent: UT-f = untreated forward; UT-b = untreated backwards; TR-f = treated forward; and TR-b = treated backwards. I have written about the crosswave differential over the years. I will just note at this point that it has been tested using controlled simulations.


The more specific or focused the data and events, the easier it is to determine and especially explain the connection. The particular event in this case is the stoppage of unnecessary business in Ontario. It should be noted that the stats for the entire country are fairly disassociated from regional and provincial realities: i.e. it might be unfair to evaluate the impacts of the stoppage since the metric is national rather than provincial. I am aware also that many fatalities have been connected to old age homes. It is not for me to question the exact reasons for the outcomes. The LBP merely points to the lack of evidence of effectiveness and possible signs to the contrary.

Possible Reasons for Lack of Effectiveness

There is no reason to question the underlying principle: social distancing is likely to prevent the transmission of disease. However, the event that I used is not about social distancing per se but rather the stoppage of non-essential business, which probably has the effect of directing people towards those operations that are considered essential. Consequently, more people – including those who are infected – gravitate towards these essential businesses. Infected individuals (asymptomatic or not) find themselves with few options; this might contribute to the accidental transmission of the disease through these businesses. The so-called incubator effect might therefore not simply be a spatial phenomenon but also a logistical one. This is only speculation at this point, given the absence of both quantitative metrics and qualitative event data.

An institutional solution – such as halting non-essential business – can be problematic if the events connected to it cannot be easily associated with the resulting metrics. LBP enables a systematic approach to connecting the dots. LBP takes line-by-line event data structured as follows for each line: (this_is_my_event). Theoretically, the number of lines is unimportant. It is therefore possible to incorporate a large amount of non-parametric or category-type data: e.g., codes associated with quality control and compliance checklists. If any jurisdiction would be interested in providing me with event data associated with their Coronavirus experience, I am offering to periodically review their distributions for free. This is meant to help me further develop LBP.

Kdnuggets RSS Feed

Easy Text-to-Speech with Python



Python comes with a lot of handy and easily accessible libraries and we’re going to look at how we can deliver text-to-speech with Python in this article.

By Dhilip Subramanian, Data Scientist and AI Enthusiast



Text-to-speech (TTS) technology reads digital text aloud. It can take the words on computers, smartphones, and tablets and convert them into audio. All kinds of text can be read aloud, including Word and Pages documents and online web pages. TTS can help kids who struggle with reading, and many tools and apps are available to convert text into speech.


Several APIs are available in Python for converting text to speech. One such API is Google Text-to-Speech, commonly known as the gTTS API. It is a very easy-to-use library that converts input text into an audio file that can be saved as an MP3. It supports several languages, and the speech can be delivered at either of two speeds, fast or slow. More details can be found here.

Convert Text into Speech


Import the gTTS library and the os module (the latter is used to play the converted audio):

from gtts import gTTS 
import os

Create the text we want to convert into audio:

text = "Global warming is the long-term rise in the average temperature of the Earth's climate system"

gTTS supports multiple languages (please refer to the documentation here). Select 'en' for English and store it in the language variable:

language = 'en'

Create an object called speech, passing the text and language to the engine. Setting slow = False tells the module that the converted audio should be delivered at the faster of the two speeds:

speech = gTTS(text = text, lang = language, slow = False)

Save the converted audio to an MP3 file named "text.mp3":

speech.save("text.mp3")

Play the converted file using the Windows command start followed by the name of the MP3 file:

os.system(“start text.mp3”)



text.mp3 file

The output of the above program is saved as text.mp3. Playing it, you should hear a voice saying, "Global warming is the long-term rise in the average temperature of the Earth's climate system."

Convert a Text File into Speech

Here, we convert a text file into speech by reading the file and passing its contents to the gTTS module.


Import the gTTS library and the os module:

from gtts import gTTS 
import os

Read the text file and store its contents in an object called file. The file name here is "draft.txt":

file = open("draft.txt", "r").read().replace("\n", " ")

Choose English as the language:

language = 'en'

Pass the file contents to the gTTS module and store the result in speech:

speech = gTTS(text = str(file), lang = language, slow = False)

Save the converted audio to an MP3 file named "voice.mp3":

speech.save("voice.mp3")

Play the MP3 file:

os.system("start voice.mp3")



Converted draft.txt file into voice.mp3

The draft.txt file is saved as voice.mp3. Play the MP3 file to listen to the text from draft.txt.


gTTS is an easy tool for converting text to voice, but it requires an internet connection to operate because it depends entirely on Google to fetch the audio data.
Thanks for reading. Keep learning and stay tuned for more!

Bio: Dhilip Subramanian is a Mechanical Engineer and has completed his Master’s in Analytics. He has 9 years of experience with specialization in various domains related to data including IT, marketing, banking, power, and manufacturing. He is passionate about NLP and machine learning. He is a contributor to the SAS community and loves to write technical articles on various aspects of data science on the Medium platform.

Original. Reposted with permission.


Kdnuggets RSS Feed

Linear algebra and optimization and machine learning: A textbook


This book teaches linear algebra and optimization as the primary topics of interest, and solutions to machine learning problems as applications of these methods. Therefore, the book also provides significant exposure to machine learning.

Sponsored Post.

Linear Algebra and Optimization for Machine Learning: A Textbook (Springer), authored by Charu C. Aggarwal, May 2020.

Table of Contents


PDF Download Link (Free for computers connected to subscribing institutions only). The PDF version has links for e-readers, and is preferable in terms of equation formatting to the Kindle version.

Buy hardcover from Springer or Amazon (for general public)

Buy low-cost paperback edition (MyCopy link on right appears only for computers connected to subscribing institutions)

A frequent challenge faced by beginners in machine learning is the extensive background required in linear algebra and optimization, which makes the learning curve very steep. This book therefore reverses the focus, teaching linear algebra and optimization as the primary topics of interest and treating solutions to machine learning problems as applications of these methods. As a result, the book also provides significant exposure to machine learning. The chapters of this book belong to two categories:

Linear algebra and its applications: These chapters focus on the basics of linear algebra together with their common applications to singular value decomposition, similarity matrices (kernel methods), and graph analysis. Numerous machine learning applications have been used as examples, such as spectral clustering, kernel-based classification, and outlier detection.

Optimization and its applications: Basic methods in optimization such as gradient descent, Newton’s method, and coordinate descent are discussed. Constrained optimization methods are introduced as well. Machine learning applications such as linear regression, SVMs, logistic regression, matrix factorization, recommender systems, and K-means clustering are discussed in detail. A general view of optimization in computational graphs is discussed together with its applications to backpropagation in neural networks.

Exercises are included both within the text of the chapters and at the end of the chapters. The book is written for a diverse audience, including graduate students, researchers, and practitioners. The book is available in both hardcopy (hardcover) and electronic versions. In case an electronic version is desired, it is strongly recommended to buy the PDF version (as opposed to the Kindle version, for which it is hard for Springer to control the layout and formatting of equations) at the following Springerlink pointer. For subscribing institutions, click from a computer directly connected to your institution's network to download the book for free. Springer uses the domain name of your computer to regulate access. To be eligible, your institution must subscribe to "e-book package english (Computer Science)" or "e-book package english (full collection)". If your institution is eligible, you will see a (free) "Download Book" button.

Data Science Central RSS Feed

Visualizing Survey Data

In this post, we will be visualizing survey data for the top survey question types. I’m Tyler Lubben, founder of VizualSurvey, a survey data cleaning tool that cleans your survey data from any survey software so you can visualize it in any of our Tableau, Power BI, Spotfire or Qlik Sense prebuilt dashboard templates. Let’s go ahead and start!

Let's start with an overview. Depending on where you get your survey data from, the structure of your data export will differ. This has been a huge pain point when trying to add the data to a BI or data visualization tool. The main issue is that each question type is structured differently, and sometimes in a less-than-optimal layout. Each of our survey charts will therefore be based on the survey question type, which drives the visualization we show. In this post, we are going to visualize survey data using Tableau because it is, in my opinion, the best tool for the job. We will go over the best survey charts for analyzing the top survey question types.

Geographic Question Type

The first survey question type we are going to look at is geographical data. I always recommend asking for a zip code when creating a survey because you can always get a free geographic file on the internet to bring in City, County, State, etc. When someone answers the zipcode question in your survey, they only have to enter five digits, which they are a lot less likely to mess up compared to the spelling of a City or State. If you look below, I have created a map in Tableau using the zip code from the survey results. The size and color of the circles are dependent on the distinct count of response id. If we joined in a geographic file, we could have brought in State or City and mapped those as well.
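The enrichment step described above, joining a free zip-code reference file to bring in State, can be sketched in plain Python; the zip-to-state mapping below is a tiny made-up excerpt rather than a real reference file:

```python
import csv
import io

# Hypothetical excerpt of a free zip-code reference file
geo_file = io.StringIO("zip,state\n10001,NY\n60601,IL\n94105,CA\n")
zip_to_state = {row["zip"]: row["state"] for row in csv.DictReader(geo_file)}

# Respondent answers to the zip-code question; unknown zips fall back cleanly
survey_zips = ["10001", "94105", "94105", "99999"]
states = [zip_to_state.get(z, "Unknown") for z in survey_zips]
print(states)  # ['NY', 'CA', 'CA', 'Unknown']
```

The same dictionary lookup generalizes to city or county columns in the reference file.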


Ranking Question Type

The next survey question type we are going to look at is the ranking question type. The ranking question type is a little more complicated. The example below is ranking competitors from high to low, based on who they think has the largest market share. They are ranked one to four based on whoever has the highest average score. The colors that split up the bars show you the distribution of the answers for each choice. This is a great way to visualize ranking question types because you see the overall rank, the average score, and the distribution.


NPS Question Type

The next question type is the Net Promoter Score question type. The NPS score is based on three buckets: detractors, passives, and promoters. In the chart below, you can see how many respondents are in each bucket and the NPS score for each group. This chart is split out by gender, but it could be shown in total as well. It gives a distribution in the background, just like the ranking question, but also gives us our NPS score.
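The arithmetic behind the NPS buckets is simple: ratings of 0-6 are detractors, 7-8 are passives, 9-10 are promoters, and NPS is the percentage of promoters minus the percentage of detractors. A minimal sketch with invented scores:

```python
def nps(scores):
    """Net Promoter Score from 0-10 ratings: %promoters - %detractors."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)

# Hypothetical sample: 4 promoters, 3 passives, 3 detractors
sample = [10, 9, 9, 10, 8, 7, 7, 3, 5, 6]
print(nps(sample))  # 40% promoters - 30% detractors = 10.0
```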


Numeric Question Type

The next one is the numeric question type. The visualization below is a little different than what you are probably used to. You usually see just an average, but here you can see the average, the distribution, outliers, and more. Each circle is an independent answer from our respondents; the larger the circle, the more people selected that exact number. This visualization is known as a box-and-whisker plot. You can see the upper and lower hinges and whiskers, the median, the average, and any outliers. So it gives you a lot of information beyond the average, which is only half the story.
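The quantities a box-and-whisker plot encodes can be computed directly with Python's standard statistics module; the response values below are invented, and the 1.5 * IQR whisker rule is the common convention, though individual tools may differ:

```python
import statistics

responses = [2, 3, 3, 4, 5, 5, 5, 6, 7, 9, 15]  # hypothetical numeric answers

median = statistics.median(responses)
q1, q2, q3 = statistics.quantiles(responses, n=4)  # box hinges; q2 is the median again
iqr = q3 - q1
# Whiskers conventionally extend to the last point within 1.5 * IQR of the box
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [r for r in responses if r < low_fence or r > high_fence]

print(median, q1, q3, outliers)
```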


Likert Question Type

The matrix question type, also known as Likert, is visualized in a stacked bar chart. In the visualization below, we have the different selection options going down our rows, and the color is based on the answer options for each. If you hover over it, you can see what the percent of total is as well as the total number of respondents. This is a very common chart type for this question type, which is why you see it so often. It’s simple yet effective.


Multiple Choice Question Type

Here we have a basic bar chart to show multiple-select and single-response survey question types. This one is straightforward but practical. It is effortless to create; you just have to watch out with the multiple-choice question type, because you must count distinct respondents instead of just counting rows, since a respondent can select more than one answer. In this chart, we are showing the percent of total and the actual number of respondents.
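The distinct-count caveat is easy to get wrong when multi-select answers are exported one row per respondent-answer pair. A small sketch with made-up rows:

```python
from collections import defaultdict

# Hypothetical export: one row per (respondent, selected answer) pair
rows = [
    ("r1", "Email"), ("r1", "Phone"),
    ("r2", "Email"),
    ("r3", "Phone"), ("r3", "Chat"),
]

respondents_per_answer = defaultdict(set)
for respondent_id, answer in rows:
    respondents_per_answer[answer].add(respondent_id)

# The percentage base is distinct respondents (3), not rows (5)
total_respondents = len({rid for rid, _ in rows})
for answer, rids in sorted(respondents_per_answer.items()):
    pct = 100.0 * len(rids) / total_respondents
    print(f"{answer}: {len(rids)} respondents ({pct:.0f}%)")
```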


Open Ended Question Type

Everybody kind of hates trying to visualize open-ended question types; you can't do much with them, to be honest. Here we created a word cloud, but it doesn't work well if you have a lot of long sentences or many different answers. You could do some grouping beforehand to make it look better, but that takes manual work. Open-ended is a very hard one, so I keep it simple here.
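A word cloud is essentially a word-frequency count, so the grouping step can start as simply as counting words after dropping stop words. A rough sketch (the answers and the tiny stop-word list are made up):

```python
from collections import Counter
import re

answers = [
    "The support team was very helpful",
    "Helpful staff, but slow support response",
    "Slow shipping",
]
stop_words = {"the", "was", "very", "but", "a", "and"}  # tiny illustrative list

words = []
for answer in answers:
    words += [w for w in re.findall(r"[a-z]+", answer.lower()) if w not in stop_words]

# The top counts are what a word cloud would render largest
print(Counter(words).most_common(3))
```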


So those are all of the popular survey question types and how I would visualize the survey results. Usually, no matter where you get your survey data from, you will have to spend hours cleaning it into a suitable format before you can even think of creating these charts. All of the charts you saw today are from VizualSurvey's Tableau dashboard survey template; you can connect your survey data to the templates for free using our free trial.

Originally posted here

Data Science Central RSS Feed

Machine Learning(ML) – Basic Terminologies in Context

Basic Terminology in Context – Machine learning should be treated as a culture in an organisation where business teams, managers and executives should have some basic knowledge of ML and its terminology. There are many online courses available which are designed for students, employees with little or no experience, managers, professionals and executives to give them a better understanding. This post is part 2 of Machine Learning (ML) – Basics you need to know.

Machine Learning Outlook

Today's machines are learning and performing tasks that in the past could only be done by humans, such as making judgments and decisions, playing games, etc. This is possible because machines can now analyse data, recognise patterns, and remember what they learn for future use. How to harness machine learning and its bundle of techniques in real-life business is still a challenge for many, though. Why is ML so good today? There are several reasons, including but not limited to the following:

  • The explosion of big data
  • Hunger for new business and revenue streams in this business shrinking times
  • Advancements in machine learning algorithms
  • Development of extremely powerful machines with high capacity & faster computing ability
  • Storage capacity

The primary goal of machine learning (ML) is to build an automated data model for analytical purposes. The objective is to build a system that learns from data via the applied algorithm. The output can be obtained by mapping inputs to outputs, detecting patterns/structure, or learning by a reward/punishment method.
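As a toy illustration of "mapping inputs to outputs," the sketch below memorizes labeled examples and predicts by the nearest neighbour, which is about the simplest way a system can learn from data; the data points are invented:

```python
def predict(training_data, x):
    """1-nearest-neighbour: the label of the closest training point wins."""
    nearest = min(training_data, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# Hypothetical labeled evidence: (hours of product browsing, bought?)
training_data = [(0.5, "no"), (1.0, "no"), (4.0, "yes"), (6.0, "yes")]

print(predict(training_data, 0.8))  # closest to 1.0 -> "no"
print(predict(training_data, 5.0))  # closest to 4.0 -> "yes"
```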

Deep Learning - Introduction to Artificial Neural Networks.

Not to be confused: AI is not machine learning; rather, machine learning is one technique from the AI bundle used to achieve artificial intelligence. Let's just say AI generally demonstrates creativity in the following traits:

  • Planning & Predicting
  • Learning & Adapting
  • Reasoning & Logic Building
  • Problem Solving & Avoiding
  • Knowledge Representation
  • Perception & Reasoning
  • Motion Detection and Manipulation

Machine learning helps to implement all of the above by using the correct methods and algorithms with the correct data sets. In earlier times, it was believed that human intelligence could be precisely described and that machines could simulate it with AI. Before a machine starts attempting such simulation, it needs to learn from lots of data.

Machine Learning – Integral Issues

Machine learning techniques are advancing almost daily, with the intention of bringing real value to today's businesses. They are revolutionizing the way we do business and how we decide what to improve. ML develops its own encompassing strategy from the experience it accumulates over time.

The machine learning hype, too much information on the internet, and the use of ML terms by almost every tech show have created a "misinformation epidemic" around ML. The whole conversation is centered on machine learning and on artificial neural networks working like the human brain.


Correct, basic information about machine learning is what both non-technical and technical business professionals need today. For business people, deep knowledge of machine learning is less important than knowledge of their own business when creating the required data models.

Machine learning enables a business to make data-driven, more informed, and intelligent decisions, provided it is fed correct data from the business rather than burdened with complex and unnecessary use of algorithms. Such decisions produce faster and better results compared to traditional approaches. There are many common mistakes that should be avoided in order to successfully embed machine learning in an analytics strategy for better business transformation. Some classic issues are:

Inadequate Infrastructure – As mentioned above, powerful machines with accelerated hardware, high storage capacity, and fast computing are a basic need in machine learning, yet this is often ignored. This is a fatal problem.

Data Quality Problems – The explosion of big data has created great hunger for new business and revenue streams in these shrinking times. Which data are correct and which are not is the primary issue to resolve against a business goal. For example, an insurance company selling dental insurance needs data on how customers maintain their dental hygiene – brushing habits, toothpaste brand, time spent brushing, and frequency – rather than blood reports and blood pressure data.

Correct skilled resources – Implementing machine learning without qualified data scientists is the biggest issue for any business today. The cost of skilled resources is high, supply is very low, and demand is huge.

Implementation without Strategy – Advancements in machine learning algorithms have created a tsunami-like environment in which businesses chase the buzzword. Unfortunately, the challenge is big, as not all corporates are able to decide which strategy to pick at this time of digital transformation. Not every business model needs complex machine learning, and proper analysis is required first: should machine learning be brought in as an advancement to analytics, or are simple regression models enough?

Machine Learning – Common Terms

Learning, in simple words, can be defined as "learn, validate, test, and repeat" for every new subject. In machines, it's more about the improvement of the algorithms' accuracy over time: the more data the model consumes, the better it gets. We have machine learning all around us – in our email inbox, on e-commerce sites, and even at banks. The fundamental terminology of ML often overlaps with statistics, but only for the basic, initial parts. How learning evolves, at what rate, and how it overwrites its previously learned knowledge are interesting things to look at.


  • Learning – As mentioned above, learning in simple words can be defined as "learn, validate, test, and repeat" for every new subject. The three key factors in machine learning are:
    • Learning Rate – Also called the step size, the learning rate is a tuning parameter that controls how far the optimisation algorithm moves towards a minimum of the loss function. In neural networks, it determines the size of the updates applied to the network's weights.
    • Learning Rate Schedule – Rather than using a constant learning rate, a schedule adjusts the learning rate during training according to a pre-defined plan, helping the network reach optimal performance. A common approach is to decrease the learning rate linearly from a larger value to a smaller one.
    • Learning Rate Decay – Gradually reducing the learning rate over time while moving towards a local minimum in gradient descent; this helps speed up the learning algorithm. The learning rate starts large and is slowly reduced to a smaller value.
  • Gradient – A vector that gives the direction of the maximum rate of change towards the optimal solution. If you are walking from the top of a hill to the bottom, the steepest direction is the fastest, optimal, and most logical route; the gradient provides that steepest direction. In machine learning, the goal is always to reach an optimal solution.
    • Vanishing Gradient – A problem during the training phase of neural networks: during backpropagation, gradient values become unstable and shrink towards zero, which affects the earlier layers of the network. It is a big limiting factor for the performance and accuracy of a neural network.
    • Exploding Gradient – Similar to the vanishing gradient problem, the exploding gradient problem results from instability of the gradients within a neural network during backpropagation, with gradient values growing uncontrollably large.
    • Gradient Clipping – A technique used to tame the instability of gradient values within a neural network, achieved by enforcing a threshold on the values the gradients can take.
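The gradient-related terms above can be seen working together in a few lines of gradient descent on a one-dimensional loss; the decay schedule and the clipping threshold below are arbitrary illustrative choices, not recommended settings:

```python
def minimize(grad, x, lr=0.5, decay=0.05, clip=2.0, steps=100):
    """Gradient descent with a learning-rate decay schedule and gradient clipping."""
    for step in range(steps):
        g = grad(x)
        g = max(-clip, min(clip, g))          # gradient clipping: bound the update
        current_lr = lr / (1 + decay * step)  # learning-rate decay schedule
        x -= current_lr * g                   # step towards the minimum
    return x

# Loss L(x) = (x - 3)^2 has gradient 2(x - 3) and its minimum at x = 3
x_min = minimize(lambda x: 2 * (x - 3), x=10.0)
print(round(x_min, 3))  # ≈ 3.0
```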

Most ML algorithms are applied with explicit teaching intentions, while deep-learning procedures are tweaked to learn on their own. ML terminology often overlaps with statistics, but only for the basic, initial parts. A learning algorithm normally has three stages: representation, evaluation, and optimisation. Reaching optimal accuracy and precision is always the dream.

Points to Note:

All credits, if any, remain with the original contributor. We have covered all the basics around machine learning. Machine learning is all about data, computing power, and algorithms for extracting information. In the previous post, we covered Generative Adversarial Networks, a family of artificial neural networks.

Books + Other readings Referred

  • Research through open internet, news portals, white papers and imparted knowledge via live conferences & lectures.
  • Lab and hands-on experience of @AILabPage (Self-taught learners group) members.

Feedback & Further Question

Do you have any questions about supervised learning or machine learning? Leave a comment or ask your question via email. I will try my best to answer it.

Conclusion – I particularly think that getting to know the types of machine learning algorithms helps to form a somewhat clearer picture. The answer to the question "What machine learning algorithm should I use?" is always "It depends." It depends on the size, quality, and nature of the data, and on the objective of the analysis: the more we interrogate the data, the more useful information comes out. It depends on how the math of the algorithm was translated into instructions for the computer you are using. And it depends on how much time you have. At AILabPage, we say machine learning is crystal clear and as easy as eating ice cream. It is not only for PhD aspirants; it's for you, us, and everyone.

Read original article here

Data Science Central RSS Feed

How Digital Customer Onboarding Is Improving Customer Retention For Banks


In this digital era, customers expect convenience from their banking institutions. When the customer is given an option to sign up for a bank account either by going to the bank or by sitting at home and doing it online, it’s easy to guess what option the vast majority prefer.

In many industries, digital customer onboarding is as easy as inputting the name and email address in the app. Usually, the customers don’t even consider reading terms and conditions if it isn’t about money. But when it comes to the banking institutions, there are so many onboarding requirements and government regulations. It is tricky but imperative for banks to create a convenient digital onboarding experience for their customers.

A smooth onboarding experience can be a differentiating factor between you and your competition. Many customers are likely to drop off the minute they find it cumbersome to fill in the details. A study reveals that nearly 40 per cent of Brits abandon the onboarding process because they find it difficult to provide a large volume of information in a banking app or website.

Certainly, if you streamline the onboarding process, you will achieve less customer churn and more happy customers. A seamless onboarding process will also help you gather more accurate information about your customers, which in turn helps you manage the customer lifecycle better. It also signals to your users that you are committed to providing convenient access to your services. With every digital onboarding process, the primary goals for the banking institution will be to:

  • Avoid unnecessary re-routing to physical channels
  • Offer convenient identity verification
  • Provide rapid access to the account and services
  • Provide a seamless customer experience

At this point in time, digital customer onboarding is becoming a trend, and it is likely that we will soon see it as an integral part of financial services. But there are a few pitfalls that businesses need to look out for.

But first, what really is digital customer onboarding? It is a process that offers customers a simple, fast, and easy way to provide personal information to the respective financial institution, including the information mandated by government regulatory bodies. It eliminates the need for customers to visit a bank branch, which is an especial burden for customers who work during bank hours.

The digital onboarding process currently seems like a must-have feature for banks, but it isn't easy for banking institutions to get it right. The following are some issues that must be addressed to realise a seamless onboarding experience for customers.

Too much friction

When onboarding asks for too much information and gets time-consuming, people either postpone or give up the process. Pulling already captured information, and enabling auto-fills can be an option to keep things simple and quite effortless. Eliminating the need for manual entry of photographs and ID proofs at banks is also a great way to streamline the onboarding process. It can be done by offering document uploading features within the app.

Regulatory Compliances

While digital onboarding focuses on providing convenience to the customers, the compliance requirements mandated by the regulatory bodies cannot lose priority. The digital banking platform should encompass top-notch encryption and comply with the geographically relevant KYC (Know Your Customer) and AML (Anti-Money Laundering) regulations. Digital platforms can also use two-factor authentication to validate access.

Moreover, there are other ways to ensure a secure and authenticated customer onboarding, which includes video conferencing. Many industry verticals are already using this technology to verify the customer IDs. One such example is Upwork.

Limitation of digital onboarding with current banking infrastructure

There are some external limitations such as rigid regulatory requirements, people-dependent processes, and complex legacy banking systems and technologies. Though the possibilities of achieving a seamless digital onboarding experience are high, it is vital to understand that the awesome front-end cannot be plugged into the old machinery. This means that a seamless customer onboarding is hard to realize without thinking about a digital transformation.

For starters, banks may need to look for fintech consultants who can evaluate the bank's existing digital infrastructure and CRM opportunities. The software vendor can carry out infrastructure audits to identify what system upgrades are needed and estimate the cost of implementation. Success in this process requires banks to stay on top of their budgeting. Banks also need good marketing strategies in place to promote the successful implementation of their digital transformation to existing customers and potential new customers.

Data Science Central RSS Feed

The subtle art of chatbot development- Client Requirements versus Client Expectations

Chatbots are virtual agents capable of emulating the conversation of a human. Chatbots are becoming very popular in providing online services or queries. Recently, chatbots have been gaining lots of limelight due to the development of Natural Language Processing (NLP) capabilities. Today, a chatbot can respond in a similar way a human agent would. Being devoid of emotions, a chatbot can be expected to offer the same quality of service throughout the day.

Despite the several advantages that chatbots can offer, careful consideration should be given to understanding the purpose of the chatbot. A chatbot can only emulate the conversation of a human; it is not a human. While many expect a chatbot to be able to answer every question it encounters, the chatbot is limited to the stories that were used to train it. A good storyboard depends strictly on the requirements set out by the client.

In this context, the human input (the client) is the most important component of the chatbot itself. Many people misinterpret the NLP capabilities of a chatbot as the ability to evolve on its own. Far from it: a chatbot evolves only in a cyclic manner, through the continuous inflow of data to which it is subjected. The data are refined and re-adjusted to expand the chatbot's stories so that it responds to a more extensive range of queries.


Source: chatbotsmagazine

The most important step in building a chatbot is to understand the purpose it serves and its target audience. Failure to adhere to this will result in diluted responses from the chatbot. The requirements must be well framed, and client input is essential: chatbot developers can only translate the requirements of the client into the stories used to train the chatbot. Furthermore, the limitations must also be set out so that the expectations and the requirements of the client match. Otherwise, the chatbot will be trapped in the development cycle forever without ever completing the whole cycle illustrated above.

Consider the example of Siri, the virtual assistant embedded in the iOS ecosystem, which underwent several iterations before being able to provide such an extensive range of responses. When a bot is first published, it needs to be continuously monitored; post-deployment, a bot may still be unresponsive to certain user conversations. The best way to monitor the bot is to monitor the user conversations. Several iterations are required to finally meet the expectations of the users, which are different from those of the client. The learning process depends on user interactions, not on client requirements; the client requirements only set the high-level objectives or purpose of the chatbot.

To conclude, it is important to differentiate between the requirements and the expectations of the client. A good chatbot depends on the client's requirements, not only on the ability of the chatbot itself.