Statistics and their applications in data mining

Updated: 28 June 2024
Contributor: Jim Holdsworth

What is data mining?

Data mining is the use of machine learning and statistical analysis to uncover patterns and other valuable information from large data sets.

Given the evolution of machine learning (ML), data warehousing, and the growth of big data, the adoption of data mining, also known as knowledge discovery in databases (KDD), has rapidly accelerated over the last few decades. However, while this technology continuously evolves to handle data at a large scale, leaders still might face challenges with scalability and automation.

The data mining techniques that underpin data analyses can be deployed for two main purposes. They can either describe the target data set or they can predict outcomes by using machine learning algorithms.

These methods are used to organize and filter data, surfacing the most useful information, from fraud to user behaviors, bottlenecks and even security breaches. Using ML algorithms and artificial intelligence (AI) enables automation of the analysis, which can greatly speed up the process.

When combined with data analytics and visualization tools, such as Apache Spark, data mining software becomes more straightforward to use, and relevant insights can be extracted more quickly than ever. Advances in AI continue to expedite adoption across industries.

Benefits and challenges

Benefits

Discover hidden insights and trends: Data mining takes raw data and finds order in the chaos: seeing the forest for the trees. This can result in better-informed planning across corporate functions and industries, including advertising, finance, government, healthcare, human resources (HR), manufacturing, marketing, research, sales and supply chain management (SCM).

Save budget: Analyzing performance data from multiple sources can identify bottlenecks in business processes, speeding resolution and increasing efficiency.

Solve multiple challenges: Data mining is a versatile tool. Data from almost any source and any aspect of an organization can be analyzed to discover patterns and better ways of conducting business. Almost every department in an organization that collects and analyzes data can benefit from data mining.

Challenges

Complexity and risk: Useful insights require valid data, plus experts with coding experience. Knowledge of data mining languages including Python, R and SQL is helpful. An insufficiently cautious approach to data mining might result in misleading or dangerous results. Some consumer data used in data mining might be personally identifiable information (PII) which should be handled carefully to avoid legal or public relations issues.

Cost: For the best results, a wide and deep collection of data sets is often needed. If new information is to be gathered by an organization, setting up a data pipeline might represent a new expense. If data needs to be purchased from an outside source, that also imposes a cost.

Uncertainty: A major data mining effort might be well run but still produce unclear results, with no major benefit. Inaccurate data can also lead to incorrect insights, whether because the wrong data was selected or because the preprocessing was mishandled. Other risks include modeling errors or outdated data from a rapidly changing market.

Another potential problem is that results might appear valid but are in fact random and not to be trusted. It’s important to remember that “correlation is not causation.” A famous example of “data dredging” (seeing an apparent correlation and overstating its importance) was recently presented by blogger Tyler Vigen: “The price of Amazon.com stock closely matches the number of children named ‘Stevie’ from 2002 to 2022.” 1 But, of course, the naming of Stevies did not influence the stock price or vice versa. Data mining applications find the patterns, but human judgment is still needed to decide which patterns are meaningful.
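To make the point about spurious correlation concrete, here is a minimal sketch that computes the Pearson correlation coefficient for two series. The numbers are made up for illustration (they are not the real stock or baby-name figures), and the `pearson` helper is a hypothetical name.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up yearly figures: both series simply trend upward over time.
stock_price = [10, 14, 19, 23, 30, 34]
babies_named = [50, 58, 61, 70, 77, 85]

r = pearson(stock_price, babies_named)
print(f"correlation = {r:.3f}")
```

Any two series that rise steadily over the same period will score close to 1 here, which is why a high coefficient alone says nothing about whether one quantity drives the other.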

Data mining versus text mining versus process mining

Data mining is the overall process of identifying patterns and extracting useful insights from big data sets. This can be used to evaluate both structured and unstructured data to identify new information and is commonly used to analyze consumer behaviors for marketing and sales teams. For example, data mining methods can be used to observe and predict behaviors, including customer churn, fraud detection, market basket analysis and more.

Text mining—also known as text data mining—is a sub-field of data mining, intended to transform unstructured text into a structured format to identify meaningful patterns and generate novel insights. The unstructured data might include text from sources including social media posts, product reviews, articles, email or rich media formats such as video and audio files. Much of the publicly available data around the world is unstructured, making text mining a valuable practice.

Process mining sits at the intersection of business process management (BPM) and data mining. Process mining provides a way to apply algorithms to event log data to identify trends, patterns and details of how processes unfold. Process mining applies data science to discover bottlenecks, and then validate and improve workflows.

BPM generally collects data more informally through workshops and interviews and then uses software to document that workflow as a process map. Since the data that informs these process maps is often qualitative, process mining brings a more quantitative approach to a process problem, detailing the actual process through event data.

Information systems, such as enterprise resource planning (ERP) or customer relationship management (CRM) tools, provide an audit trail of processes from log data. Process mining uses this data from IT systems to assemble a process model or process graph. From there, organizations can examine the end-to-end process with the details and any variations outlined.

How data mining works

The data mining process involves several steps from data collection to visualization to extract valuable information from large data sets. Data mining techniques can be used to generate descriptions and predictions about a target data set.

Data scientists or business intelligence (BI) specialists describe data through their observations of patterns, associations and correlations. They also group data through classification and clustering methods, predict values through regression methods, and identify outliers for use cases such as spam detection.

Data mining usually includes five main steps: setting objectives, data selection, data preparation, model building and pattern mining, and evaluating results.

1. Set the business objectives: This can be the hardest part of the data mining process, and many organizations spend too little time on this important step. Even before the data is identified, extracted or cleaned, data scientists and business stakeholders can work together to define the precise business problem, which helps inform the data questions and parameters for a project. Analysts might also need to do more research to fully understand the business context.

2. Data selection: When the scope of the problem is defined, it is easier for data scientists to identify which set of data will help answer the pertinent questions to the business. They and the IT team can also determine where the data should be stored and secured.

3. Data preparation: The relevant data is gathered and cleaned to remove any noise, such as duplicates, missing values and outliers. Depending on the data set, an additional data management step might be taken to reduce the number of dimensions, as too many features can slow down any subsequent computation.

Data scientists look to retain the most important predictors to help ensure optimal accuracy within any model. Responsible data science means thinking about the model beyond the code and performance, because a model's results are heavily influenced by the data being used and how trustworthy that data is.

4. Model building and pattern mining: Depending on the type of analysis, data scientists might investigate any trends or interesting data relationships, such as sequential patterns, association rules or correlations. While high-frequency patterns have broader applications, sometimes the deviations in the data can be more interesting, highlighting areas of potential fraud. Predictive models can help assess future trends or outcomes. In the most sophisticated systems, predictive models can make real-time predictions for rapid responses to changing markets.

Deep learning algorithms might also be used to classify or cluster a data set depending on the available data. If the input data is labeled (such as in supervised learning), a classification model might be used to categorize data, or alternatively, a regression might be applied to predict the likelihood of a particular assignment. If the data set isn’t labeled (that is, unsupervised learning), the individual data points in the training set are compared to discover underlying similarities, clustering them based on those characteristics.

5. Evaluation of results and implementation of knowledge: When the data is aggregated, it can then be prepared for presentation, often by using data visualization techniques, so that the results can be evaluated and interpreted. Ideally, the final results are valid, novel, useful and understandable. When these criteria are met, decision-makers can use this knowledge to implement new strategies, achieving their intended objectives.
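The data preparation step (step 3 above) can be sketched in a few lines. This is a minimal illustration, assuming records arrive as dictionaries and that rows with missing values are simply dropped. The field names and the `clean` helper are hypothetical, and real projects often impute missing values instead of discarding them.

```python
# Hypothetical customer records with one missing value and one duplicate.
records = [
    {"id": 1, "age": 34, "spend": 120.0},
    {"id": 2, "age": None, "spend": 80.0},   # missing value
    {"id": 1, "age": 34, "spend": 120.0},    # exact duplicate of the first row
    {"id": 3, "age": 29, "spend": 95.0},
]

def clean(rows):
    """Drop exact duplicates and rows with missing values, keeping order."""
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen or any(value is None for value in row.values()):
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned

prepared = clean(records)
print([row["id"] for row in prepared])
```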

Data mining techniques

Here are some of the most popular types of data mining:

Association rules: An association rule is an if/then, rule-based method for finding relationships between variables in a data set. The strength of a relationship is measured by support and confidence. Support is how often the related items appear together in the data. Confidence is how often the “then” part holds in the cases where the “if” part is true.

These methods are frequently used for market basket analysis, enabling companies to better understand the relationships between different products, such as those that are frequently purchased together. Understanding customer habits enables businesses to develop better cross-selling strategies and recommendation engines.
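As a rough sketch of how support and confidence are computed for a market basket rule such as "if bread, then butter", consider the following. The baskets are invented for illustration, and the function names are hypothetical rather than taken from any particular library.

```python
# Hypothetical shopping baskets; each set is one transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Among baskets containing the antecedent ("if"), the fraction
    that also contain the consequent ("then")."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0  # the "if" part never occurs, so the rule is untestable
    return support(set(antecedent) | set(consequent), baskets) / base

rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Here bread and butter appear together in 3 of 5 baskets (support 0.60), and butter appears in 3 of the 4 baskets that contain bread (confidence 0.75).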

Classification: Classes of objects are predefined, as needed by the organization, with definitions of the characteristics that the objects have in common. This enables the underlying data to be grouped for easier analysis.

For example, a consumer product company might examine its couponing strategy by reviewing past coupon redemptions together with sales data, inventory stats and any consumer data on hand to find the best future campaign strategy.

Clustering: Closely related to classification, clustering reports similarities, but then also provides more groupings based on differences. Preset classifications for a soap manufacturer might include detergent, bleach, laundry softener, floor cleaner and floor wax; while clustering might create groups including laundry products and floor care.
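One common clustering algorithm is k-means, which the text above does not name, so treat it as an illustrative choice. The sketch below groups hypothetical two-dimensional points without any predefined classes. Seeding the centroids with the first k points is an assumption made to keep the example deterministic.

```python
import math

def kmeans(points, k, iterations=10):
    """Minimal k-means: returns (centroids, labels).
    Initial centroids are the first k points, which keeps the sketch
    deterministic; real tools use smarter seeding."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Hypothetical feature vectors forming two visibly separate groups.
points = [(1.0, 1.1), (0.9, 1.0), (5.0, 5.2), (1.2, 0.9), (5.1, 4.9), (4.8, 5.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which points end up sharing a label.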

Decision tree: This data mining technique uses classification or regression analytics to classify or predict potential outcomes based on a set of decisions. As the decision tree name suggests, it uses a tree-like visualization to represent the potential outcomes of these decisions.
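A full decision tree is built by repeatedly choosing the best question to ask at each node. The sketch below shows that single step for one numeric feature, using Gini impurity as the quality measure. Gini impurity is an assumption here (other criteria exist), and the churn data and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Try each midpoint between sorted distinct values as a threshold and
    return the (threshold, weighted_impurity) pair with the lowest impurity."""
    distinct = sorted(set(values))
    best = (None, float("inf"))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: monthly spend and whether the customer churned.
spend = [10, 15, 20, 45, 50, 60]
churned = ["yes", "yes", "yes", "no", "no", "no"]
threshold, impurity = best_split(spend, churned)
print(threshold, impurity)
```

A tree-building routine would apply `best_split` again to each side of the threshold until the groups are pure enough or a depth limit is reached.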

K-nearest neighbor (KNN): Also known as the KNN algorithm, K-nearest neighbor is a nonparametric algorithm that classifies data points based on their proximity and association to other available data. This algorithm assumes that similar data points are found near each other. As a result, it seeks to calculate the distance between data points, usually through Euclidean distance, and then it assigns a result based on the most frequent category among the nearest neighbors (for classification) or their average (for regression).
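The KNN description above translates almost directly into code. This is a minimal sketch for the classification case, assuming two numeric features per data point. The training examples and the `knn_classify` name are hypothetical.

```python
import math
from collections import Counter

def knn_classify(query, examples, k=3):
    """Return the most frequent label among the k training examples
    closest to `query` by Euclidean distance."""
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled points: (features, label).
examples = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.5), "B"), ((7.0, 6.0), "B"), ((6.5, 7.0), "B"),
]

prediction = knn_classify((2.0, 2.0), examples, k=3)
print(prediction)
```

An odd value of k is a common choice for two-class problems because it avoids tied votes.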

Neural networks: Primarily used for deep learning algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes. Each node is made up of inputs, weights, a bias (or threshold) and an output.

If that output value exceeds the set threshold, it “fires” or activates the node, passing data to the next layer in the network. Neural networks learn this mapping function through supervised learning, making adjustments based on the loss function through the process of gradient descent. When the loss is at or near zero, and stays low on data the model has not seen during training, an organization can be more confident in the model’s accuracy.
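To illustrate the pieces named above (inputs, weights, a bias, an output, and gradient descent on a loss), here is a sketch of a single node trained on a tiny labeled data set. A sigmoid activation, the log loss, the learning rate and the logical-AND data are all assumptions chosen to keep the example small; a real network stacks many such nodes in layers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def node_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, squashed into (0, 1)."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_node(samples, epochs=2000, learning_rate=0.5):
    """Train one node (two weights plus a bias) with gradient descent
    on the log loss. Returns (weights, bias)."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            # For sigmoid + log loss, the gradient with respect to the
            # weighted sum is simply (output - target).
            error = node_output(inputs, weights, bias) - target
            weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias

def fires(inputs, weights, bias, threshold=0.5):
    """The node "fires" when its output exceeds the threshold."""
    return node_output(inputs, weights, bias) > threshold

# Logical AND as a tiny supervised training set: (inputs, target).
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_node(samples)
predictions = [fires(inputs, weights, bias) for inputs, _ in samples]
print(predictions)
```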

Predictive analytics: By combining data mining with statistical modeling techniques and machine learning, historical data can be analyzed by using predictive analytics to create graphical or mathematical models intended to identify patterns, forecast future events and outcomes, and identify risks and opportunities.

Regression analysis: This technique discovers relationships in data by predicting outcomes based on predetermined variables. This can include decision trees and multivariate and linear regression. Results can be prioritized by the closeness of the relationship to help determine what data is most or least significant. An example would be for a soft drink manufacturer to estimate the needed inventory of drinks before the arrival of predicted hot summer weather.
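The soft drink example can be sketched as a simple linear regression fitted by ordinary least squares. The temperature and sales figures below are invented, and deliberately exactly linear so the result is easy to check by hand. The `fit_line` helper is a hypothetical name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical history: daily high temperature and cases of drinks sold.
temperatures = [20, 24, 28, 32, 36]
cases_sold = [110, 130, 150, 170, 190]

slope, intercept = fit_line(temperatures, cases_sold)
forecast = slope * 40 + intercept  # predicted demand on a 40-degree day
print(slope, intercept, forecast)
```

With this data the fitted line is 5 cases per degree plus a baseline of 10, giving a forecast of 210 cases. Real sales data would scatter around the line, and the size of that scatter indicates how far to trust the forecast.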

Data mining use cases

Data mining techniques are widely adopted by business intelligence and data analytics teams, helping them extract knowledge for their organization and industry. Some data mining use cases include:

Anomaly detection
While frequently occurring patterns in data can provide teams with valuable insights, observing data anomalies is also beneficial, assisting organizations with fraud detection, network intrusions and product defects. While this is a well-known use case within banking and other financial institutions, SaaS-based companies have also started to adopt these practices to eliminate fake user accounts from their data sets. Anomaly detection can also be an opportunity to find new and novel strategies or target markets that have been overlooked in the past.
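One simple way to flag anomalies is a z-score test: measure how many standard deviations each value sits from the mean and flag the extremes. This is only a sketch of that one approach. The transaction amounts, the threshold of 2.0 and the `find_anomalies` name are assumptions, and production fraud systems typically use more robust methods.

```python
import statistics

def find_anomalies(values, z_threshold=2.0):
    """Return the values lying more than `z_threshold` population
    standard deviations away from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / stdev > z_threshold]

# Hypothetical transaction amounts with one suspicious entry.
amounts = [42, 38, 45, 40, 39, 41, 44, 43, 40, 950]
anomalies = find_anomalies(amounts)
print(anomalies)
```

Because a large outlier inflates the mean and standard deviation it is compared against, median-based measures are often preferred when the data set is small.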

Assess risk
Organizations can more accurately locate and determine the scale of risk with data mining. Patterns and anomalies can be uncovered in the cybersecurity, finance and legal fields to pinpoint oversights or threats.

Focus on target markets
By searching across multiple databases to find close relationships, data mining can accurately connect behaviors and customer backgrounds with sales of specific items. This can enable more targeted campaigns to help boost sales.

Improve customer service
Customer issues can be discovered and fixed sooner if the full sum of customer actions—on-site, online, over mobile apps or on a telephone—can be reviewed with data mining. Customer service agents can have access to more complete and insightful information on the customers they serve.

Increase equipment uptime
Operational data mined from industrial equipment can help predict future performance and downtime, and enable the planning of preventive maintenance.

Operational optimization
Process mining uses data mining techniques to reduce costs across operational functions, enabling organizations to run more efficiently. This practice can help to identify costly bottlenecks and improve decision-making for business leaders.

Industry use cases

Customer service
Data mining can create a richer data source for customer service by helping to determine which factors most please the customers and what factors cause friction or dissatisfaction.

Education
Educational institutions have started to collect data to understand their student populations and which environments are conducive to success. With courses often using online platforms, they can use various dimensions and metrics to observe and evaluate performance, such as keystrokes, student profiles, classes attended and time spent.

Finance
When researching risk, financial institutions and banks often want to cast a wide net, to capture any factors that might negatively impact cash flow and retrieval. Data mining tools can be useful in finding and weighing a combination of factors that indicate a good or bad risk.

Healthcare
Data mining is a useful tool for the diagnosis of medical conditions—including the reading of scans and images—and then assists in the suggestion of beneficial treatments.

Human resources
Organizations can gain new insights into employee performance and satisfaction by analyzing multiple factors and finding patterns. Data can include start date, tenure, promotions, salary, training, peer performance, work delivery, use of benefits and travel.

Manufacturing
From raw materials to final delivery, all aspects of the manufacturing process can be analyzed to improve performance. What is the cost of materials and are there options? How efficient is production? Where are the bottlenecks? What are the quality issues and where do they arise, both internally and with customers?

Retail
By mining customer data and actions, retailers can identify the most productive campaigns, pricing, promotions, special product offers and successful cross-sells and up-sells.

Sales and marketing
Companies collect massive amounts of data about their customers and prospects. By observing consumer demographics, media responses and customer behavior, companies can use data to optimize their marketing campaigns, improving segmentation and targeting and customer loyalty programs, all helping to yield higher return on investment (ROI) on marketing efforts. Predictive analyses can also help teams set expectations with their stakeholders, providing yield estimates for any increases or decreases in marketing investment.

Social media
Analysis of user data can help uncover new editorial opportunities or new sources of advertising revenue for specific target audiences.

Supply chain management (SCM)
Using data mining, product managers can better predict demand, gear up production, adjust providers or adapt marketing efforts. Supply chain managers can better plan shipping and warehousing.
