Data mining is not a new concept; it is a proven technology that has emerged as a key factor in business decision-making. Numerous use cases and case studies demonstrate the capabilities of data mining and analysis. Yet we have also witnessed many implementation failures in this field, which can be attributed to technical challenges or limited capabilities, misplaced business priorities and even unclear business objectives. While some implementations overcome these challenges, others fail to deliver the right insights, or insights that are useful to the business. This article offers guidelines for implementing data mining projects successfully.
At its core, data mining serves two primary functions: description, which interprets the contents of a large database, and prediction, which finds insights such as patterns or relationships from known values. Before deciding on data mining techniques or tools, it is important to understand the business objectives and the value that data analysis is expected to create. Blending business understanding with technical capability is pivotal in making big data projects successful and valuable to their stakeholders.
Business need should be the driving force behind a data analysis strategy, with technology as the enabler. Business leaders should know which area of concern they want to tackle using data analytics. Common concerns across businesses include customer management, revenue growth and operational efficiency. Your business strategy should tackle one major problem at a time and break it down into smaller, specific use cases. Understanding the business value is crucial in identifying business problems and breaking them down further. It forms the base of your data strategy, which technology can then power to deliver the desired results. The details of business and technology strategy vary by industry and by company, but working through these steps can provide crucial insight into the need for big data analytics. Decision makers should ask these questions before embarking on this journey: Do I need this technology? How can it be beneficial? How can I evaluate its effectiveness? These assessments are imperative in defining the need for, and the long-term benefits of, a data analysis strategy.
Defining your business need for big data analysis:
A poor or undefined business need is often the cause of a failed implementation project. Businesses should set clear objectives for their big data strategy, such as identifying high-value customers to offer them specific products or services, or improving processes to optimize cost.
Once a business has set definite goals, the next task is to identify the key metrics or insights needed to fulfil them. CDOs can look for the most appropriate insights that can be offered as a service to accomplish the business objectives. Where there are multiple objectives, they can be prioritized by ease of implementation and impact on the business. Using a proof of concept (POC) to evaluate a hypothesis can reduce surprises when scaling up. A POC also helps in identifying gaps in the technical solution and in gauging how useful its insights are to the business.
Once the business need is clear, we can turn to the technical side of data mining. The data mining process primarily includes four crucial steps:
Data identification and acquisition is the first step towards a successful implementation. Understanding the business challenges that you are trying to solve helps in determining the sources and types of data to use. The data can take any form: it can be a subset of variables or a sample drawn from a larger database. The key data should relate directly to the business objective. We will discuss this further when selecting data mining techniques.
Data cleansing, or cleaning, is performed on the target data set to make it more effective for the data mining objectives. The process identifies inaccurate, incorrect or incomplete data and then replaces, modifies or removes it. This makes the data more complete and less error-prone, and therefore more relevant and effective.
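To make the cleansing step concrete, here is a minimal Python sketch. The records, field names and validity rules (ages between 0 and 120, no negative spend, median fill for missing ages) are all assumptions made up for illustration; real rules depend on your data and business objective.

```python
from statistics import median

# Hypothetical customer records; None marks a missing value.
raw_records = [
    {"customer_id": 1, "age": 34, "annual_spend": 1200.0},
    {"customer_id": 2, "age": None, "annual_spend": 850.0},  # incomplete
    {"customer_id": 3, "age": 290, "annual_spend": 400.0},   # implausible age
    {"customer_id": 4, "age": 51, "annual_spend": -50.0},    # invalid spend
    {"customer_id": 1, "age": 34, "annual_spend": 1200.0},   # duplicate
    {"customer_id": 5, "age": 46, "annual_spend": 300.0},
]


def clean(records, min_age=0, max_age=120):
    """Remove, modify or replace bad values (assumed rules, see lead-in)."""
    seen_ids = set()
    kept = []
    for rec in records:
        # Remove: skip records whose ID has already been seen.
        if rec["customer_id"] in seen_ids:
            continue
        seen_ids.add(rec["customer_id"])
        # Remove: a negative spend cannot be repaired here.
        if rec["annual_spend"] < 0:
            continue
        # Modify: treat an out-of-range age as missing.
        age = rec["age"]
        if age is not None and not (min_age <= age <= max_age):
            age = None
        kept.append({**rec, "age": age})

    # Replace: fill missing ages with the median of the known ages.
    known_ages = [r["age"] for r in kept if r["age"] is not None]
    fill = median(known_ages) if known_ages else None
    return [{**r, "age": fill if r["age"] is None else r["age"]} for r in kept]


cleaned = clean(raw_records)
for row in cleaned:
    print(row)
```

Whether to remove a record or repair it is a judgment call; the sketch removes what it cannot fix and fills in what it can.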
Data exploration is at the core of data mining activity. The main objective of this step is to identify the right data mining techniques and to select the best-suited algorithms for them. Well-known data mining techniques include association, classification, regression, segmentation and link analysis. Choosing among them is one of the most difficult decisions, and the choice should be guided by the business objective and the available data sets. Most projects use a combination of two or more techniques, depending on the scale of the project. The last leg of this activity is to perform the data mining itself, i.e., to search for patterns hidden in the data.
Data presentation is the final activity. It includes interpreting and evaluating the patterns and presenting them to users in a logical and understandable way. The main objective here is to present useful patterns and discard redundant or irrelevant ones. If any inconsistency is found, the data exploration step has to be revisited. Success is measured by how useful the insights are to decision makers and how relevant they are to the business objectives.
Identifying the data mining operations
Classification - This data mining function assigns data to different buckets or classes based on constraints. It is used on large data sets to predict the class label of a record from a labeled training data set. Business cases that use it include diagnosing a patient’s medical condition to select a treatment, placing individuals in credit groups based on their financial data and assigning loan applicants to credit risk categories. The most prominent classification algorithms are Naive Bayes, SVM (Support Vector Machines), the k-nearest neighbor classifier and ANN (Artificial Neural Network). Choosing the classification algorithm is crucial and at times confusing; it takes experts to evaluate which is best for a given project. For example, Naive Bayes is simple to implement but assumes that the input features are independent of one another, whereas an ANN can capture more complex relationships but typically requires more training data and processing time.
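As a rough illustration of classification, the sketch below implements a tiny k-nearest neighbor classifier in plain Python. The loan applicants, the two features (an income score and a debt ratio, both assumed to be pre-scaled to the 0-1 range) and the risk labels are invented for the example.

```python
from collections import Counter
from math import dist

# Hypothetical labeled training data: (income_score, debt_ratio) -> risk class.
training = [
    ((0.20, 0.90), "high_risk"),
    ((0.30, 0.80), "high_risk"),
    ((0.25, 0.70), "high_risk"),
    ((0.80, 0.20), "low_risk"),
    ((0.90, 0.10), "low_risk"),
    ((0.70, 0.30), "low_risk"),
]


def knn_predict(training, point, k=3):
    """Return the majority label among the k training points nearest to point."""
    nearest = sorted(training, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_predict(training, (0.75, 0.25)))  # low_risk
print(knn_predict(training, (0.30, 0.85)))  # high_risk
```

Because k-nearest neighbor relies on distances, features on very different scales should be normalized first, which is why the example assumes pre-scaled inputs.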
Regression - This operation is used to predict a real-valued (numeric) variable. Traditional data models are developed using statistical methods such as linear and logistic regression. Both classification and regression are used for prediction; the difference is that classification produces a category while regression produces a numeric value. Prominent examples of regression include estimating the crime rate of a city from different parameters, valuing property based on factors like location and floor area, and insurance scoring systems (as in auto insurance) that predict the likelihood of an insured person meeting with an accident. Popular regression algorithms include Generalized Linear Models (GLM) for linear regression and Support Vector Machines (SVM) for both linear and nonlinear regression.
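The property valuation example can be sketched with ordinary least squares on a single factor. The floor areas and prices below are invented (and deliberately lie on a straight line so the result is easy to check); a real model would use many more records and factors.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: floor area in square meters, price in thousands.
floor_area = [50, 70, 90, 110]
price = [150, 200, 250, 300]

slope, intercept = fit_line(floor_area, price)
predicted = slope * 100 + intercept  # estimated price for a 100 square meter property
print(slope, intercept, predicted)   # 2.5 25.0 275.0
```

Unlike the classification sketch, the output here is a number rather than a category, which is the distinction drawn above.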
Segmentation - The main objective here is to identify clusters of records that share the same behaviors. The clusters can be mutually exclusive and exhaustive, and can be organized into hierarchical categories. Segmentation is widely used in marketing to discover homogeneous groups of customers and segment them by lifestyle, geography and so on.
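One common way to form mutually exclusive and exhaustive clusters is k-means, which this article does not name but which serves as a simple illustration. The customer data (store visits per month, average basket value) and the starting centroids are assumptions chosen for the sketch.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-center."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers: (visits per month, average basket value).
customers = [(1, 20), (2, 25), (1, 22), (9, 80), (10, 85), (8, 78)]
centroids, clusters = k_means(customers, centroids=[customers[0], customers[3]])
print(clusters[0])  # occasional, low-spend customers
print(clusters[1])  # frequent, high-spend customers
```

Every customer ends up in exactly one cluster. Hierarchical categories would need a different method, such as agglomerative clustering.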
Link Analysis - Link analysis evaluates the connections or relationships between nodes or records. In marketing it is used for product affinity analysis, where the seller wants to know which items can be sold together. In insurance, it is used for fraud detection by identifying claim patterns through network visualization. This operation is mostly used in conjunction with segmentation analysis.
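For product affinity, a simple starting point is to count how often pairs of items appear in the same basket. The baskets below are invented, and "support" here simply means the share of baskets that contain both items.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
    {"milk", "bread"},
]


def pair_support(baskets):
    """Return the share of baskets in which each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


support = pair_support(baskets)
best_pair = max(support, key=support.get)
print(best_pair, support[best_pair])  # ('bread', 'butter') 0.6
```

Each pair can be read as a link between two product nodes, weighted by its support, which is the kind of network that link analysis then explores.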
Deviation - This operation detects deviations in the data caused by anomalies or exceptions. It is mostly used to find unusual patterns, changes in a time series over a fixed period, discrepancies from previous data and data points that do not belong to any cluster.
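A basic way to flag deviations is to measure how far each value lies from the mean in units of standard deviation (a z-score). The daily claim figures and the threshold of 2.0 are assumptions for this sketch; suitable thresholds vary with the data.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return (index, value) pairs whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing deviates
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical daily claim amounts with one unusual day.
daily_claims = [100, 102, 98, 101, 99, 250, 100]
outliers = find_outliers(daily_claims)
print(outliers)  # [(5, 250)]
```

A single extreme value inflates both the mean and the standard deviation, so on small samples a median-based measure may be more robust.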
Following a systematic approach to data mining implementation can greatly reduce the risk of project failure. Moreover, it can help business and technical folks determine the need for data analytics and choose the best tools and techniques. In my next blog, I will focus on identifying the most appropriate data analysis techniques for solving current business problems across industries.
What is your take on data mining and its applications? Let us know your views at email@example.com.