Wikipedia defines Data Science as “A multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data”.
It further states that data science involves using the most powerful hardware, the most powerful programming systems, and the most efficient algorithms to solve problems.
There are a lot of key terms here: structured and unstructured data, powerful systems, efficient algorithms, and scientific methods, to name a few. Data science is not just about building predictive models; it’s much more than that. It requires creativity, the ability to work with and analyze large and diverse data, teamwork, and curiosity, among other things.
But why do we need data science? Why do we need to find insights from various data sources? Why do we need to work with unstructured data? Why do we need to have complex systems and processes? In short, why is data science becoming so popular? In 2012, HBR called data scientist “the sexiest job of the 21st century”, and it has been a buzzword ever since. So why did it become such a trend that everybody wants to become a data scientist?
Prior to the rise of the internet, the data we had was small in size and mostly structured. This data could be easily analyzed using simple BI tools. In fact, building a pivot table in Excel to summarize and analyze the data used to be enough, but the current scenario is totally different. Today, most of the data we have is unstructured or semi-structured. In fact, it is predicted that by 2020, 80% of the data will be unstructured. We now have big data and cloud systems, which give us the ability to store large amounts of diverse data over longer periods of time. Simple BI tools are not capable of processing this huge volume and variety of data. I still remember that a few years back Excel was limited to around 65k rows and some 250 columns. Although newer versions have increased this to 1m rows and 16k+ columns, there is no way that it can handle large datasets, today or 10 years down the line. So we do need more complex and advanced analytical tools and algorithms for processing, analyzing, and drawing meaningful insights out of data.
It’s all good that we have so much data, both structured and unstructured, and that we need complex systems and tools to analyze it, but why do we need to do all this? As they say, the devil is in the details. If, by extracting and analyzing a customer’s purchase behavior, we can recommend a product or service which might be useful, then it is not only going to increase our chances of cross-selling or upselling and so increase revenue, but it might also help us improve the customer experience. No doubt this data was available earlier, but now, with a vast amount and variety of other data to complement such predictions, we are able to train models more effectively and recommend products to customers with more precision.
Another example could be identifying if a customer will click on a particular advertisement or email campaign. By using the data from customer profile, past interactions and similar customer behavior history, we can be more accurate about our ability to influence customer decisions.
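To make the click example a bit more concrete, here is a minimal sketch of estimating click probability from the past behavior of similar customers. Everything in it is an assumption made for illustration: the segment names, the interaction history, and the `click_rates` helper are hypothetical, and a real campaign model would use far richer profile data.

```python
from collections import defaultdict

# Hypothetical past interactions: (customer_segment, clicked) pairs.
# Segment names and values are made up for illustration.
history = [
    ("student", 1), ("student", 0), ("student", 1), ("student", 1),
    ("retiree", 0), ("retiree", 0), ("retiree", 1),
    ("professional", 0), ("professional", 1),
]


def click_rates(interactions, prior_clicks=1, prior_views=2):
    """Estimate click probability per segment with simple additive smoothing.

    The smoothing keeps segments with very few observations from getting
    an extreme estimate of 0 or 1.
    """
    clicks = defaultdict(int)
    views = defaultdict(int)
    for segment, clicked in interactions:
        views[segment] += 1
        clicks[segment] += clicked
    return {
        seg: (clicks[seg] + prior_clicks) / (views[seg] + prior_views)
        for seg in views
    }


rates = click_rates(history)
# Pick the segment with the highest estimated click probability.
best_segment = max(rates, key=rates.get)
print(rates)
print("Most responsive segment:", best_segment)
```

The same idea scales up when the hand-made segments are replaced by a trained classifier that scores each customer individually.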
Those are simple cases of how we can use data science to bring more value to the business and the customer. The following are the domains where data science is creating a lot of value:
1) Marketing — upselling/cross selling, churn, predicting lifetime value of a customer
2) Sales — demand forecasting, discount offering
3) Medical/healthcare — disease prediction, medication effectiveness
4) Social media — sentiment analysis, digital marketing
5) Automation — self driving cars, drones, defective item identification in manufacturing
6) Travel — dynamic pricing, predicting flight delays
7) Credit & insurance — fraud & risk detection, claims prediction
The following image from Gartner explains how, as we move up the difficulty level, we increase the value we generate: we move from what happened, to why it happened, to what will happen, and finally to how we can make it happen. So it’s about moving from descriptive to prescriptive analytics and increasing the value in the process.
As we move from information to optimization, we need to move from descriptive to prescriptive analytics. If we want to create more value then we need to increase the difficulty level and focus on what will happen and how we can make it happen. In order to do this, we need more and more focus on data science which includes processing of hidden layers of data using complex systems and efficient algorithms.
What are the typical stages of a data science project?
A typical data science project has the following 4 stages:
1) Collect and prepare data —
This includes collecting data from various sources, conditioning and transforming it, and cleaning it so that a machine can understand it.
2) Analyze data —
This means finding significant patterns and trends using statistical methods. It includes visualizing the data using graphs such as bar charts, histograms, line graphs, and box plots, which can help us get a better picture of the data.
3) Suggest hypothesis and take action —
This is the stage where we build a model and take action based on the model output. It is considered to be the most exciting stage. It includes deciding what we are going to predict and what kind of problem it is: classification, regression, or clustering.
4) Interpreting the output —
This is the most important stage, where we interpret and analyze the model output. It requires explaining the model output to a non-technical business person. We build the model to solve a business problem, and a business person would not understand the technicalities of a model, so it’s important that the output is explained in a way that a layman is able to understand. When we build a model, we are looking for actionable insights which can be implemented. In this process, technical skills alone are not sufficient. One essential skill you need is the ability to tell a clear and actionable story. If your presentation does not trigger action in your audience, your communication was not effective. So here storytelling becomes a key skill.
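The four stages above can be sketched end to end on a toy dataset. This is only an illustration under stated assumptions: the records are invented, missing values are simply dropped, and the "model" is a one-variable least-squares line rather than anything a real project would necessarily use.

```python
from statistics import mean

# Hypothetical raw records: (monthly_visits, monthly_spend). Values are made up;
# None marks a missing field, as often happens in real source data.
raw = [(2, 40.0), (4, 80.0), (None, 55.0), (6, 120.0), (8, None), (10, 200.0)]

# Stage 1: prepare the data by dropping incomplete records.
clean = [(x, y) for x, y in raw if x is not None and y is not None]

# Stage 2: analyze the data with simple summary statistics.
xs = [x for x, _ in clean]
ys = [y for _, y in clean]
x_bar, y_bar = mean(xs), mean(ys)

# Stage 3: build a model, here a one-variable least-squares line.
slope = sum((x - x_bar) * (y - y_bar) for x, y in clean) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar


def predict(visits):
    """Predict monthly spend from monthly visits."""
    return intercept + slope * visits


# Stage 4: interpret the output in plain business language.
print(f"Each extra visit per month is associated with about {slope:.2f} more spend.")
print(f"A customer with 5 visits is expected to spend about {predict(5):.2f}.")
```

Note that the last stage prints a sentence a business person can act on instead of the raw coefficients.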
We have now seen the various stages of a data science project. Let’s discuss the “who” part of data science. Data science is usually considered a siloed work stream, but it actually is not. People think data scientists are all geeks who have their big glasses on all the time, sitting in a coffee shop trying to solve the world’s problems. Let’s break this myth.
As per an HBR article, a data science team should have the following talents:
1) Project management
2) Data wrangler
3) Data analysis
4) Subject expertise
5) Design
6) Storytelling
Project manager — a data science project is team work, and it requires collaboration with people across departments. A good project manager who employs a scrum-like methodology and has great organizational abilities and strong diplomacy skills will help to bridge cultural gaps by bringing disparate talents together at meetings and getting all team members to speak the same language.
Data wrangler — data wrangling, also referred to as data munging, is the process of transforming and mapping data from one “raw” form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as data science. It includes finding, cleaning, and structuring data, and creating and maintaining algorithms and other statistical engines. People with wrangling talent will look for opportunities to streamline operations, for example by building repeatable processes for multiple projects and templates for solid, predictable visual output that will jump-start the information-design process. In short, a data wrangler is someone who finds, cleans, and structures data so that it can be used for data science projects.
Data Analysis/ML Engineer/AI Specialist — this includes finding patterns and meaning in the data, setting hypotheses and testing them, and building predictive/machine learning models. A good data analyst is not the same as a data wrangler. A data analyst should be able to evaluate the data critically, and should be curious and innovative in applying statistical and machine learning techniques. A data analyst is someone who can make sense of any kind of data and help answer any problem statement.
Subject matter expert — data science can’t be separated from the business or management team. If a data science team is working on building a propensity model to target customers for a mortgage product, then someone from the business who has a deep understanding of the product will be an invaluable asset to the team. Also, people with knowledge of the business and the strategy will inform the project manager and the data analysts, and keep the team focused on business outcomes, not just on building the best statistical models. As I mentioned earlier, data science is team work, and each team member is needed for a successful outcome.
Design — you might have created an analysis which gives strong recommendations leading to high revenue generation or high cost reduction, or a model which has 99% accuracy, but until you present it in a format which management can understand, all of this is wasted. Effective visual communication is extremely important, and it means creating visuals which are simple, meaningful, and impactful. Data visualization is an art, and it is about more than colors, fonts, type of graph, and amount of information.
Storytelling — I will explain the power of storytelling using a recent experience I had in a training session. Our trainer asked, “Can you explain Newton’s law to me?” Nobody answered. The trainer asked another question: “Can you explain the theory of relativity?” Nobody answered. The trainer asked one last question: “Can you tell me the thirsty crow story?” Everybody said yes! It’s not that people didn’t know what Newton’s law or the theory of relativity is; those are technical concepts, and we humans tend to remember stories better. The reason is that stories are explained using visuals and are simple to understand. Children’s story books, for instance, are full of images, and that’s one reason we remember those stories so well. This is the power of storytelling. The ability to present data insights as stories will, more than anything else, help close the communication gap between algorithms and executives.
The biggest mistake organizations make is to expect a single data scientist to perform all the above roles, and this is one of the reasons data science projects fail to create value. Instead of expecting a data scientist to possess all these skills, we should look to create a data science team which includes experts from each of the above areas.
Data science is an exciting field to be in. It’s amazing to see how we can use data to make decisions about our customers, our organization, and our lives. Google, Facebook, Amazon, and Uber are data-driven organizations, and they are leading the way.
I am going to end this topic by talking about a few examples of how data science has helped across industries.
Sprint, a US-based telecom company, is using data to create better customer experiences. In 2014, Sprint had a customer churn rate of 2.3%, twice as high as that of its biggest competitors. The company was relying on customer experience agents, who in turn relied on their own judgement to comb through data on how best to serve the customer. Previously, an agent would look through more than 20 offers, trying to pick the best one while on the phone with the customer. Sprint knew it needed to get away from relying on its employees to make these split-second decisions. After implementing a data solution with AI capability, Sprint deployed predictive and self-learning analytics to identify customers at risk of churn and proactively provide personalized retention offers. As a result, Sprint reduced customer churn by 10% to historic lows, while also increasing its net promoter score by 40%, increasing customer upgrades eightfold, convincing 40% more customers to add a new line, and improving overall customer service agent satisfaction.
Another example comes from Royal Bank of Scotland (RBS) which used data to move from a sales-driven culture to being more of a trusted partner for the customer. The company went through a transformation focusing on becoming a more trusted advisor to the customer than a typical bank would be. For example, analytics helped the bank identify customers that were in need of financial advice. Now, when RBS sees a customer that’s continuously overdrawing their bank account, the bank will flag that customer and give them a call to provide financial advice.
But not everything is positive all the time. A data science application can lead to loopholes which need to be carefully investigated.
If the data used to train a predictive model is biased, then it is going to lead to biased outcomes. For example, a recent HBR article suggested that voice recognition, another application of data science, still has significant race and gender biases. Research by Dr. Tatman published by the North American Chapter of the Association for Computational Linguistics (NAACL) indicates that Google’s speech recognition is 13% more accurate for men than it is for women, and Google is regularly the highest performer among the speech recognition systems compared, which include Bing, AT&T, WIT, and IBM Watson.
These biases have serious consequences in people’s lives. For example, an Irish woman failed a spoken English proficiency test while trying to immigrate to Australia, despite being a highly educated native speaker of English. She got a score of 74 out of 90 for oral fluency. Sounds eerily familiar, right? This score is most likely a failure of the system.
Why does this bias exist? Disparities exist because of the way we’ve structured our data analysis, databases, and machine learning. Similar to how cameras are customized to photograph white faces, audio analysis struggles with breathier and higher-pitched voices. The underlying reason may be that databases have lots of white male data and less data on female and minority voices. For example, TED Talks are frequently analyzed by speech scientists, and 70% of TED speakers are male.
AI is therefore set up to fail. Machine learning is a technique that finds patterns within data. When you use speech recognition, the system is answering the question “given this audio data, which words best map onto this data, given the patterns and data in the database?” If the database has mostly white male voices, it will not perform as well with data it sees infrequently, such as female and other more diverse voices.
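One simple way to surface this kind of gap is to report accuracy separately for each group rather than as a single overall number. The sketch below assumes a hypothetical set of evaluation results; the groups, outcomes, and the `accuracy_by_group` helper are invented for illustration and do not reproduce the research cited above.

```python
from collections import defaultdict

# Hypothetical evaluation results: (speaker_group, was_transcribed_correctly).
# The groups and outcomes are invented for illustration only.
results = [
    ("male", True), ("male", True), ("male", True), ("male", False),
    ("male", True), ("male", True), ("male", True), ("male", True),
    ("female", True), ("female", False), ("female", True), ("female", False),
]


def accuracy_by_group(records):
    """Return (accuracy, sample_count) for each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, ok in records:
        total[group] += 1
        correct[group] += int(ok)
    return {g: (correct[g] / total[g], total[g]) for g in total}


report = accuracy_by_group(results)
overall = sum(ok for _, ok in results) / len(results)

print(f"overall accuracy: {overall:.2f}")
for group, (acc, n) in report.items():
    print(f"{group}: accuracy {acc:.2f} on {n} samples")
```

In this toy data the overall figure hides a much lower accuracy for the group with fewer samples, which is the pattern the paragraph above describes.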
Data can provide huge insights for companies, but making the most of the data being generated is no longer possible without the help of a data science team which is always working collaboratively in an agile manner and creates solutions using data which is fair and diverse.