Wednesday, 1 August 2018

A framework for continuous monitoring and real-time analytics



Disclaimer - The views, thoughts, and opinions expressed in the text belong solely to the author, and should not in any way be attributed to the author’s employer, or to the author as a representative, officer or employee of any organisation.

This article is an excerpt from my book “Practical Data Analysis: Using Open Source Tools & Techniques” (available on Amazon worldwide, iBook Store, and Barnes & Noble).

In today's highly competitive and heavily regulated business environment, no one can really afford to be reactive when it comes to managing one’s day-to-day business risks. Instead, it is now all about rapidly responding to risk events through ongoing, real-time monitoring, with advanced analytics to spot the anomalies and potential threats before they can cause any serious harm to the business.

An organisation’s ability to measure, monitor and optimise business processes also has a direct impact on revenue and customer satisfaction. Claims processing in healthcare, trade confirmations and settlements in financial institutions, and new service activation in telecommunications companies are all examples of complex business processes. Being able to correlate data across multiple systems and geographies to gain a real-time, end-to-end view of an entire business process, and to pinpoint and resolve problems quickly is therefore vital to remain competitive and profitable as an enterprise.

But this is easier said than done. In most medium to large enterprises, business processes flow across multiple diverse IT systems and manual touch-points, giving rise to a situation where data is scattered everywhere in large quantities - in client facing systems, operations and finance applications, obscure legacy systems, in databases and network folders, on message queues, and in system logs. Data is also locked away in unstructured content, such as Microsoft Excel documents, e-mails, and chat and voice messages. To make matters worse, more often than not, the content, quality, structure, and definitions of the data also vary from one system to another. Without a single view of relevant data, and a consistent understanding of its meaning enterprise-wide, how do you build an effective continuous monitoring and real-time analytics capability?

A framework for continuous monitoring and real-time analytics

Creating a real-time data architecture and using it to run streaming data analytics applications is a complicated undertaking. For starters, such systems don't come in a box, and setting them up is a complex process that requires piecing together various data processing technologies and analytical tools to meet the particular needs of the end users. Compounding the problem is the availability of a whole range of proprietary and open source big data technologies, each vying for a piece of the real-time and predictive analytics market. Quite often, we also get caught up in the technical details of these competing offerings and the marketing buzzwords, and lose sight of the big picture - i.e. what are we trying to achieve?

With these challenges in mind, I have been on the lookout for quite some time for a well-supported, open source, one-stop-shop type solution that can meet the continuous monitoring and real-time analytics needs of both large and small enterprises. In my opinion, the ideal solution would be one that meets the following requirements –

a) It must be able to automatically collect data in real-time, from any type of source and in any format, and transform the data (e.g. filter, enrich and standardise) on the fly.

b) It must provide a built-in mechanism for the storage and fast retrieval of lots and lots of structured and unstructured data (i.e. a schema-less NoSQL type database rather than a traditional relational database).

c) It must provide a mechanism to interrogate the data in real-time using built-in and configurable analytical models. We should also be able to interrogate the data using external tools, such as custom-built machine learning algorithms.

d) It must have configurable data visualisation and dashboard functionalities.

e) Finally, we should be able to put it all together with minimal development and maintenance costs.

This might sound like I am asking for a lot. But the good news is that I did find two open source platforms (although certain advanced features require a commercial licence) that pretty much meet these requirements - Elastic Stack and TICK Stack.

Elastic Stack is well suited for collating and analysing in real-time both structured and unstructured data, such as event logs (system logs, transaction records, trade booking etc.), free form texts (emails, web pages, twitter messages etc.) and documents (Word, PDF, Excel files etc.).
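As a quick illustration of what working with the Elastic Stack looks like, here is a minimal sketch (in Python, using the official elasticsearch client library) of indexing an event log record into a locally running Elasticsearch instance and querying it back. The index name and document fields are invented for the example.

from datetime import datetime
from elasticsearch import Elasticsearch

# Connect to a locally running Elasticsearch node (default port 9200)
es = Elasticsearch(["http://localhost:9200"])

# Index a sample event record (hypothetical index and fields)
doc = {
    "timestamp": datetime.utcnow().isoformat(),
    "system": "trade-booking",
    "message": "Trade T-1029 failed settlement validation",
    "severity": "ERROR",
}
es.index(index="event-logs", doc_type="_doc", body=doc)

# Search for ERROR events - results are available in near real-time
hits = es.search(index="event-logs", body={"query": {"match": {"severity": "ERROR"}}})
print(hits["hits"]["total"])

In a production deployment, data collection and on-the-fly transformation would typically be handled by Beats and Logstash rather than hand-written indexing code, with Kibana providing the dashboards.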


TICK Stack, on the other hand, has been designed from the ground up to facilitate the collection, storage and analysis of time series data (share prices, sensor data etc.) and metrics (sales figures, financials etc.) at scale.
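For comparison, here is an equally minimal sketch (again in Python, using the influxdb client library) of writing and querying a time series data point in InfluxDB, the storage engine at the heart of the TICK Stack. The database, measurement and tag names are invented for the example.

from influxdb import InfluxDBClient

# Connect to a locally running InfluxDB instance (default port 8086)
client = InfluxDBClient(host="localhost", port=8086, database="monitoring")
client.create_database("monitoring")

# Write a sample share price data point (hypothetical measurement and tags)
point = {
    "measurement": "share_price",
    "tags": {"ticker": "ACME", "exchange": "LSE"},
    "fields": {"price": 101.25},
}
client.write_points([point])

# Query the last hour of prices for this measurement
result = client.query("SELECT price FROM share_price WHERE time > now() - 1h")
print(list(result.get_points(measurement="share_price")))

In a full TICK deployment, Telegraf would handle the data collection, Chronograf the visualisation, and Kapacitor the alerting.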


In chapter nine of my book (Practical Data Analysis), I have summarised some of the key features that make Elastic Stack and TICK Stack the perfect tools for continuous monitoring and real-time analytics. Time permitting, I plan to write a few tutorials on how to use these tools in some real-world continuous monitoring use cases (keep an eye on my blog if you are interested). Nevertheless, the best way to learn about the amazing things you can do with these two open source tools is to install them on your laptop or desktop (Windows, Linux or Mac OS), load some of the freely available sample datasets, play around with their interactive data analytics and visualisation capabilities, and plug in your own machine learning or statistical models. In addition, there is also a huge collection of YouTube videos covering these two platforms that you can watch and learn from.

Dhiraj Bhuyan, 02 August 2018

Monday, 30 July 2018

Using sentiment analysis to gauge what your customers like or dislike


Disclaimer - The views, thoughts, and opinions expressed in the text belong solely to the author, and should not in any way be attributed to the author’s employer, or to the author as a representative, officer or employee of any organisation.

This article is an excerpt from my book “Practical Data Analysis: Using Open Source Tools & Techniques” (available on Amazon worldwide, iBook Store, and Barnes & Noble).

One of the earliest expressions of public opinion was rebellion. Peasants rebelled against oppressive regimes all throughout history. When the king saw his subjects in open rebellion, it was a pretty clear sign that the public support for his rule was eroding. Unpaid tax was another clue; when rulers saw their tax receipts dwindle and heard reports of tax collectors being killed, they knew that public opinion was turning against them. With the passing of time however, both the rulers and the ruled have learned of better ways to express their views and opinions - from freely held elections to participation in legislative activities, using media and communication, and non-violent rallies, protests and demonstrations.

By the turn of the 21st century, something even more phenomenal happened - the global Internet revolution exploded. From MySpace and LinkedIn to Facebook, Twitter, Instagram, Snapchat and WhatsApp, people found new and innovative ways of communicating and sharing with each other, around the clock and across great distances. The emergence of these social networking platforms and their ever-increasing user base has not only transformed forever the way society works, but also the nature and content of what people share and discuss. Today, whether you are the President of a nation or the organiser of a popular spring revolution, a celebrity or a common man, you can tweet your story and share your opinion about anything and everything, and have it instantly reach millions of people across the world, uncensored and unprejudiced. The ripple effect of this social media led transformation has now spread so far and wide that we are more opinionated as individuals and as a society than perhaps at any other time in our history.

In the not-so-distant past, due to the shortage of data, door-to-door surveys and opinion polls were the only real means to gauge the sentiments of the general public towards particular brands, goods, political views or ideologies. In this new information age, where thoughts and opinions are shared so prolifically, and where peer advice and recommendations are in plentiful supply, the challenge, paradoxically, is no longer the lack of data on public opinion, but how to make sense of too much of it. The sudden eruption of activity in the area of opinion mining and sentiment analysis has thus occurred as a response to the surge of interest in automated information-gathering systems that can answer the question - what do the general public think?

Sentiment analysis or opinion mining is a recently developed sub-branch of the study of Natural Language Processing (NLP) techniques, and covers the computational analysis of people’s individual or collective opinions and emotions towards particular brands, goods, political views or ideologies, with the objective of finding answers to some very pertinent questions such as:

- How did the market react to the increase in interest rate?
- What do people think about the individuals we do business with?
- How did the general public react to the changes in the tax law?
- What do the general public think about the election debate?
- Why has the sale of a product declined?
- Which features of a product do people like or dislike most?
- Which food retailers do people prefer?

The ability to ask and almost instantly get an answer (with a degree of certainty) to questions such as the ones above is extremely beneficial in many critical decision-making tasks, and in a variety of domains, such as poll forecasting, marketing, stock market trading and continuous credit risk assessment. However, sentiment analysis and opinion mining is still an area of active research, and as with any emerging technology, it is prone to some degree of error. Take for example the following sentence –

The train is late yet again ….. brilliant …..

Most humans would be able to quickly interpret that the individual was being sarcastic with the use of the word “brilliant”. Without contextual understanding, a sentiment analysis tool on the other hand may see the word “brilliant” and incorrectly classify the sentence as expressing a positive sentiment.
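You can see this failure mode for yourself with a few lines of Python. The sketch below uses VADER, a rule-based sentiment scorer that ships with the open source NLTK library (one tool among many - the choice here is mine, purely for illustration).

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-off download of the VADER lexicon

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("The train is late yet again ... brilliant ...")
print(scores)
# The 'compound' score will most likely come out positive, because the
# analyser sees the word "brilliant" but has no notion of sarcasm.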

Maybe one day, as machine learning algorithms evolve, sentiment analysis tools will become proficient in understanding the linguistic expressions of irony or sarcasm. But for now, we will have to live with the capabilities as well as the limitations of the existing sentiment analysis tools and techniques. Hence, if your use case requires 100% accuracy (in being able to correctly classify a sentiment or an opinion), the currently available sentiment analysis and opinion mining techniques may not be suitable for you. On the other hand, if you are more than happy with an accuracy rate of 70% to 90% (or even higher for certain domains, such as brief social media posts on Twitter and Facebook), you are in luck!

Unfortunately, there is not enough space in a post like this to do justice to these amazing technologies. Nevertheless, if you are curious about how you can use open source technology to gauge people’s sentiments and opinions based on what they are tweeting or posting, I have written a chapter on “Sentiment Analysis and Named Entity Recognition” in my book. If you want to go even further and build an enterprise grade platform (fully open source) that can stream tweets and chat messages in real-time, and give you a dashboard view of how people’s sentiments and opinions towards specific brands or topics are evolving over time, I would highly recommend reading chapter nine of my book on “Continuous Monitoring and Real-Time Analytics”.

Dhiraj Bhuyan, 30 July 2018

PS: If you enjoyed reading this post, you may like the following two articles as well.

Sunday, 29 July 2018

Can a machine learn our language?



Disclaimer - The views, thoughts, and opinions expressed in the text belong solely to the author, and should not in any way be attributed to the author’s employer, or to the author as a representative, officer or employee of any organisation.

This article is an extract from my book “Practical Data Analysis: Using Open Source Tools & Techniques” (available on Amazon worldwide, iBook Store, and Barnes & Noble).

Can a machine learn our language? It turns out that yes, a machine can learn our language, perhaps even better than us humans. After all, in our lifetime we can at best endeavour to become proficient in understanding only a handful of languages, whereas, given enough compute power and data, a machine can learn many languages in a matter of days. Not convinced? Well, let’s see if I can persuade you to think otherwise!!

On 14th of August 2013, three researchers from the Google Knowledge team (Tomas Mikolov, Ilya Sutskever and Quoc Le) published an article on Google’s Open Source blog - “Learning the meaning behind words”. This post stirred a lot of excitement, euphoria and bewilderment in the Machine Learning, artificial intelligence and natural language processing communities, because in that article was a very simple yet surreal message - Google scientists had successfully designed a computationally efficient neural network (called word2vec) that allows machines to grasp and numerically represent (in a series of numbers called word vectors) the “meaning” behind words and their semantic relationships, by simply reading what people are writing and posting. And just like that, a new era for computational linguistics was born.

While the notion of learning the meaning behind words and sentences by skimming through documents is nothing new (for example, Hollywood has over the years entertained Sci-Fi movie buffs by dramatising scenes of aliens travelling to our planet and scanning through documents on the internet to learn about the human race), Google’s word2vec concept has been so successful in improving and simplifying the existing state-of-the-art solutions to a number of natural language processing (NLP) problems that it is almost on the verge of replacing more traditional representations of words in computational linguistics. It is also widely cited as a member of the new wave of Machine Learning algorithms that have been causing a tectonic shift in the technology landscape in recent years.

But what makes Google’s word2vec so powerful as a tool in natural language processing tasks? To answer this question, let’s quickly explore some of the key features of word2vec.

a) Many Machine Learning algorithms (including deep learning networks) require their inputs to be numbers or vectors (lists of real numbers) of fixed length; they simply won’t work on strings or plain text. So, a natural language modelling technique like “word embedding” is typically used to map the words or phrases in a vocabulary to corresponding fixed-length vectors of real numbers. Google’s word2vec is a word embedding technique that not only maps each word in a vocabulary to a unique vector of real numbers, but also encodes in the same vector the meaning of the word and its semantic relationship with the other words in the vocabulary.
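As a toy illustration (using the open source gensim library rather than Google's original C implementation), one could train a small word2vec model and look up the vector for a word as follows. The corpus below is of course far too small to learn anything meaningful.

from gensim.models import Word2Vec

# A tiny, made-up corpus: a list of tokenised sentences
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "ate", "an", "apple"],
    ["the", "woman", "ate", "an", "orange"],
]

# Train a word2vec model with 50-dimensional vectors
# (gensim 3.x syntax; newer versions call this parameter vector_size)
model = Word2Vec(sentences, size=50, window=3, min_count=1)

# Every word in the vocabulary now maps to a fixed-length vector
print(model.wv["king"])  # a 50-dimensional vector of real numbers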

b) For a language model to be able to predict the meaning of a text, it needs to be aware of the contextual similarity of words. For instance, we tend to find fruit words (like apple or orange) in the context of where, how or why they are grown, picked, eaten and juiced; but you wouldn’t expect to find those same concepts in such close proximity to, say, the word automobile. The vectors created by word2vec preserve these similarities, so that words that regularly occur nearby in text will also be in close proximity in the vector space.

c) An interesting feature of word vectors is that because they are numerical representations of contextual similarities between words, they can be manipulated arithmetically just like any other vector. For example, if you use the word vectors from Google’s pre-trained word2vec model (more on this later) and apply the following vector arithmetic: king minus man plus woman, the resulting word vector is closest to the vector representation of the word queen!! The other way to look at this vector arithmetic is that the distance between the words man and woman in the word2vec vector space is the same as the distance between the words king and queen (understanding of the male and female relationship).

king - man + woman => queen

Similarly (understanding of country and capital relationship):

Paris + Germany - France => Berlin

What this means is that the word2vec representation of words is able to preserve the syntactic and semantic relationships between words such as gender, verb tense, and country and capital. If you think about it, this is truly remarkable as all of this knowledge simply comes from skimming through lots and lots of text with no other information provided about their semantics.
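With gensim, the king/queen arithmetic above takes a single line of code. The snippet below assumes model is a gensim Word2Vec model trained on a large corpus (the toy model from the earlier sketch is far too small for this to work).

# "king - man + woman": sum the 'positive' vectors, subtract the 'negative' one
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # on a well-trained model, the top match is ('queen', <similarity score>)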

d) In this age of social media and micro blogging, internet slang is constantly changing the way we speak and communicate, so much so that many of the phrases and much of the vocabulary spoken by today’s high school and college students are often difficult to understand. Apparently, they have their very own vocabulary that no dictionary recognises. In this context, word2vec is an extremely powerful word discovery tool, as it does not really care how a word is spelled or represented (e.g. as an emoji). As long as there is a sufficient volume of context words or texts to work with, word2vec can easily locate these new words, and uncover their meanings and relationships with other internet slang or mainstream vocabulary. As a side note though, millennials are not the only ones to use jargon and slang. In many professional domains (such as the financial markets), technical jargon is frequently used as shorthand by people in-the-know to make communication easier.

e) Google open sourced its implementation of the word2vec model alongside the academic paper that explains the model’s architecture. Google also published a pre-trained word2vec model containing 300-dimensional vectors for over three million words and phrases, created from a collection of Google News articles comprising approximately 100 billion words. These initiatives significantly accelerated the understanding and adoption of the word2vec concept by practitioners of Machine Learning, artificial intelligence and natural language processing. They also led to new innovations (such as document to vector or doc2vec, and sense to vector or sense2vec), centred around the practical use of word2vec in NLP applications.
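For reference, here is how one might load that pre-trained model with gensim. Note that the binary file published by Google is roughly 1.5 GB compressed, and loading it requires several gigabytes of RAM; the file path below is an assumption.

from gensim.models import KeyedVectors

# Load Google's pre-trained vectors (download GoogleNews-vectors-negative300.bin.gz first)
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True
)

print(model["queen"][:10])                  # first 10 of the 300 dimensions
print(model.most_similar("queen", topn=3))  # nearest neighbours in the vector space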

If you are interested in learning more about word2vec, and how it can be applied to natural language processing tasks such as automatically grouping documents based on their similarities, labelling documents as belonging to certain pre-defined categories (e.g. finance, politics, technology, or entertainment related), and automatically generating a summary of a document, read chapter six of my book “Practical Data Analysis: Using Open Source Tools & Techniques” (available on Amazon worldwide, iBook Store, and Barnes & Noble).

Dhiraj Bhuyan, 24 July 2018

Is it time to reboot our approach to fraud detection?



Disclaimer - The views, thoughts, and opinions expressed in the text belong solely to the author, and should not in any way be attributed to the author’s employer, or to the author as a representative, officer or employee of any organisation.

This article is an excerpt from my book “Practical Data Analysis: Using Open Source Tools & Techniques” (available on Amazon worldwide, iBook Store, and Barnes & Noble).

With advances in computer technology, online banking and e-commerce also comes increased vulnerability to fraud. Hackers and cyber criminals are continuously finding new ways to target their victims, from phishing attacks and stolen credit card details to creating false accounts. According to a recent report published by Financial Fraud Action UK (FFA UK), in the UK alone, financial fraud related losses totalled £768.8 million in 2016, eighty percent of which was attributed to payment card related fraud (see figure below). Prevented fraud during the same period totalled £1.38 billion. This represents incidents that were detected and prevented by the banks and card companies, and is equivalent to £6.40 in every £10 of attempted fraud being stopped. In other words, the UK banks and card companies were unable to detect and prevent, in time, 36% of attempted fraud in monetary terms. These figures clearly indicate that there is still much work to be done in the area of payment card fraud detection.

Yearly breakdown of fraud related losses on UK-issued cards.
(Source - FFA UK's "Fraud The Facts 2017" report)


The traditional approach to tackling the problem of payment card related fraud is to use a set of rigid rules and parameters to query transactions, and to direct the suspicious ones through to the fraud department for human review. Rules are extremely easy to understand, and are developed by domain experts and consultants who translate their experience and best practices into code to make automated decisions. But when a rules-based fraud detection system gets operationalised, one starts with, say, 100 fraud scenarios and 100 rules to handle them. As time goes by and the fraudsters change tactics, we encounter more and more fraud scenarios and start adding more rules to keep the number of false positives and negatives under control. There comes a point where nobody really knows or can measure how well the rules work or how many exceptions there are - this is the situation today with a lot of legacy hand-crafted, rules-based fraud and anomaly detection systems.
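To make the contrast concrete, a hand-crafted rule set in this style might look something like the sketch below (the fields and thresholds are invented for illustration - real systems run hundreds of such rules).

# A hypothetical rules-based fraud check: each rule is a hand-crafted predicate
RULES = [
    ("high_value", lambda txn: txn["amount"] > 5000),
    ("foreign_country", lambda txn: txn["country"] != txn["home_country"]),
    ("rapid_repeat", lambda txn: txn["txns_last_hour"] > 10),
]

def flag_for_review(txn):
    """Return the names of the rules that fired for this transaction."""
    return [name for name, rule in RULES if rule(txn)]

txn = {"amount": 7200, "country": "RU", "home_country": "GB", "txns_last_hour": 2}
print(flag_for_review(txn))  # ['high_value', 'foreign_country']

Every new fraud scenario means another entry in RULES, which is exactly how such systems grow beyond anyone's ability to reason about them.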

But do we have a better alternative? The answer is yes, we do, and it is called Machine Learning! Machine Learning is simply a form of artificial intelligence that enables computers to "learn" (i.e. progressively improve performance on a specific task) with data, without being explicitly programmed to do so. It is based around the idea that we should really just be able to give machines access to data and let them learn for themselves.

Traditional Programming vs. Machine Learning

But why is Machine Learning a better alternative to a rules-based expert system designed by domain specialists and consultants? It is simply because machines are much better than humans at identifying patterns in data (especially in the world of big data that we live in) and detecting anomalies in those patterns. They can also process very large datasets, and can recognise thousands of features of a user’s purchasing journey, instead of the few that can be captured by creating rules. This ability to see deep into the data and make concrete predictions for large volumes of transactions makes Machine Learning a very promising alternative to the traditional rules-based approach for detecting and preventing fraud.

But words are meaningless without proof! Hence, to demonstrate the power of Machine Learning techniques in detecting fraudulent transactions and anomalous events, in chapter seven (“Fraud Detection Using Machine Learning Techniques”) of my book (“Practical Data Analysis”), I used a real-world credit card fraud dataset (anonymised and freely available) to test the prediction accuracy of four Machine Learning techniques – Random Forest, Boosted Trees (XGBoost), Auto Encoder, and an Ensemble model. The table below summarises how a credit card fraud detection system built around these four Machine Learning techniques performed (a detailed narrative on how to design and implement them using open source tools is included in my book).
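For a flavour of what one of these models looks like in code, below is a minimal sketch of training a Random Forest classifier on a labelled credit card dataset of the same general shape (a CSV of anonymised feature columns plus a 0/1 fraud label; the file and column names are assumptions, and the 60/40 split mirrors the set-up described above - this is my own sketch, not the book's implementation).

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load the labelled transactions (assumed file name; 'Class' = 1 for fraud)
df = pd.read_csv("creditcard.csv")
X = df.drop(columns=["Class"])
y = df["Class"]

# 60% of the data for training, 40% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# On a heavily imbalanced dataset, precision and recall on the fraud
# class matter far more than raw accuracy
print(classification_report(y_test, model.predict(X_test)))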

Based on these results, the Ensemble model is clearly the winner - it was able to detect 83% of the fraudulent transactions, and only 9 out of every 100 transactions that the system flagged as fraudulent were genuine. Although the system failed to detect 17% of the fraudulent transactions, given the limited size of the dataset that I used to train these models (60% of two days’ credit card transactions), this is, in my opinion, still a very good result. With a bigger training dataset and some feature engineering, it is very likely that the prediction accuracy of these models would improve even further.


Dhiraj Bhuyan, 29 July 2017