# An Introduction to Data Mining

As the amount of data we gather and store every day increases, we need to look at how we can fully utilize the knowledge contained in it. How could we go about extracting information from datasets and applying it to improve software systems? This process is called data mining, and it is made up of various areas and techniques. Each can provide valuable insights, depending on the data available and the questions you want answered.

This post clarifies what is meant by data mining by comparing it to similar terms and explaining some of its major areas. I won't go into the details of any specific technique, but will instead provide a simple description of each field. The aim is to emphasise possible applications in our everyday lives and provide an introduction for anyone who is interested in pursuing data mining. Some common algorithms are listed for further reading.

## How is it different from Machine Learning?

Data mining and machine learning are two terms that are often confused, because data mining makes use of many machine learning techniques. The main difference lies in their goals. Machine learning aims to make predictions about new data based on what it has learned from previously seen inputs. Data mining draws on machine learning techniques to find patterns in a dataset and draw conclusions from them. Data mining also makes use of concepts from other fields such as Artificial Intelligence, Statistics and Database Systems. ^{1}

## Is this Big Data?

Data mining does not necessarily refer to Big Data. Big Data refers to mining datasets that are too big to store in memory on a single computer, or streams of data that arrive too fast to allow time-consuming calculations. To process datasets like these efficiently, traditional data mining techniques need to be modified. Examples include using distributed algorithms or reducing the complexity of the data with dimensionality reduction. With streams of data, one approach is to store a summary of the data already seen and update it as new data arrives. The techniques discussed below, though, are traditional data mining techniques and assume a single device can perform the processing.
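To make the idea of a stream summary a little more concrete, here is a minimal sketch that keeps a running count, mean and variance without storing the values themselves. It uses Welford's online algorithm, which is my choice for illustration and only one of many possible summaries; the class name and the sample readings are made up.

```python
class RunningSummary:
    """Keeps count, mean and variance of a stream without storing the values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared differences from the mean

    def update(self, value):
        # Welford's update: fold one new value into the summary
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # Population variance; 0.0 until at least one value has been seen
        return self._m2 / self.count if self.count else 0.0


# Hypothetical stream of readings, processed one at a time
summary = RunningSummary()
for reading in [4.0, 7.0, 13.0, 16.0]:
    summary.update(reading)
```

After the loop, `summary.mean` and `summary.variance` describe everything seen so far, and memory use stays constant no matter how long the stream runs.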

## Areas in Data Mining

### Association Rules

This area of data mining deals with defining rules that describe relationships between items in a dataset.

A common example of this is basket analysis. Let's say you have an online retail store and you want to find out which items are commonly bought together. Finding association rules involves identifying frequent sets of items and determining whether they are significant enough to be considered a rule. Significant rules (referred to as "interesting" rules) can be evaluated by their **support** (how many orders contain both items A and B) and **confidence** (of the orders that contain item A, how many also contain item B). A rule might look like this: if a user buys a printer, they are highly likely to buy ink as well. ^{2}
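As a rough illustration of support and confidence, the sketch below computes both for the printer and ink rule on a handful of made-up orders. The function names and the data are my own, and both measures are expressed as fractions of orders rather than raw counts.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    matches = sum(1 for basket in transactions if items <= basket)
    return matches / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    # Assumes the antecedent appears in at least one transaction
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical orders, each a set of purchased items
orders = [
    {"printer", "ink", "paper"},
    {"printer", "ink"},
    {"printer", "cable"},
    {"ink", "paper"},
    {"laptop", "cable"},
]

rule_support = support(orders, {"printer", "ink"})          # 2 of the 5 orders
rule_confidence = confidence(orders, {"printer"}, {"ink"})  # 2 of the 3 printer orders
```

Whether a rule counts as "interesting" then comes down to comparing these two numbers against thresholds you choose for your own data.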

In a big dataset with a large number of transactions, special techniques are needed to find association rules without having to store every pair in memory.

**Techniques:** Apriori algorithm, PCY algorithm

### Clustering

Clustering is used to discover groupings in data. This can be useful when you want to identify similar products or users based on a common set of attributes ^{3}. Clustering techniques define a measure of similarity, known as the distance, between data points. The lower the distance between two items, the more similar they are.

An example application of clustering is optimizing the placement of delivery centers to reduce delivery times and fuel consumption. The data points here would be customer coordinates, and the similarity measure could be defined as either the driving distance or the straight-line distance between points. Some clustering techniques, like k-means clustering, find the center point of each cluster, which can be used as a possible location for a delivery center. The density of a cluster helps prioritize groups: the more people there are in an area, the more value expanding there would add.
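To show roughly how k-means arrives at those center points, here is a minimal sketch of the standard assign-then-update loop on 2D coordinates, using straight-line distance. The coordinates are made up, and seeding the centers with the first `k` points is a simplification I chose to keep the run deterministic; real implementations usually pick starting centers more carefully.

```python
import math


def k_means(points, k, iterations=20):
    """Plain k-means on 2D points, using straight-line distance."""
    # Assumption: seed the centers with the first k points
    centers = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centers[i]))
            clusters[nearest].append(point)
        # Update step: move each center to the mean of its cluster
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centers, clusters


# Hypothetical customer coordinates forming two obvious groups
customers = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = k_means(customers, k=2)
```

Each entry in `centers` is a candidate delivery center location, and the length of the matching list in `clusters` gives a simple indication of how many customers it would serve.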

**Techniques:** k-means clustering, hierarchical clustering

### Classification

Where clustering tries to group items relative to each other, classification instead tries to fit data into predefined categories.

Take sentiment analysis for example, where you take a set of documents, tweets, etc. and determine whether the overall feeling towards something is positive or negative. Each item will be classified and put into a positive, negative or neutral group. The size of each group would inform us about the overall sentiment towards a topic.

Machine learning is very useful here, as an algorithm could learn to recognize members of each class, based on a training set or past guesses.
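As a small illustration of learning classes from a training set, the sketch below implements a bare-bones multinomial Naïve Bayes classifier for sentiment. The training sentences are invented, splitting on whitespace is a deliberate simplification, and I only use two classes (positive and negative) to keep it short.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count words per class from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts


def classify(text, word_counts, class_counts):
    """Return the class with the highest log-probability for `text`."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: how common the class is in the training set
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training set
training = [
    ("love this phone great battery", "positive"),
    ("great service really happy", "positive"),
    ("terrible battery awful screen", "negative"),
    ("awful service really unhappy", "negative"),
]
word_counts, class_counts = train(training)
first = classify("great phone happy", word_counts, class_counts)
second = classify("awful screen", word_counts, class_counts)
```

Counting how many items end up in each class across a large set of tweets or reviews would then give the overall sentiment described above.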

**Techniques:** Decision trees, Naïve Bayes

### Regression Analysis

Regression analysis is used to find relationships between attributes ^{4}. This allows us to determine what the effect would be if one attribute changes.

A real estate company might want to determine what effect the size of a property has on its price. A regression analysis technique would attempt to find the curve that best fits the data points on a graph of size against price. This could help guide valuations of new properties. Of course, this is an oversimplified example, as there would be many more variables involved in determining the price.
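For a single attribute, ordinary least squares reduces to fitting a straight line, which can be sketched in a few lines. The property figures below are made up, and I deliberately placed them exactly on a line so the result is easy to check by hand; real data would be noisier.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical properties: size in square metres, price in thousands of rand
sizes = [50, 80, 100, 120, 150]
prices = [900, 1350, 1650, 1950, 2400]

slope, intercept = fit_line(sizes, prices)

# Estimated price for a new 110 square metre property
estimate = slope * 110 + intercept
```

The slope is the answer to the original question: it estimates how much the price changes for each additional square metre.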

**Techniques:** Ordinary least squares, Principal component regression

### Anomaly Detection

Sometimes we need to find items that are unlike the rest of the items in a dataset, in order to identify unusual patterns. Anomaly detection aims to identify these outlier items.

One example of where this can be used is detecting sudden changes in a user's usage patterns. Consider a person who spends between R6 000 and R8 000 per month on their credit card around Cape Town, and one day spends R80 000 in a single transaction overseas. This would be worth investigating, as it could indicate fraud.

Many of the techniques mentioned above can be modified to identify data points that don't follow the general rules or patterns found in a dataset.
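As one possible way of adapting a nearest-neighbour idea to anomaly detection, the sketch below scores each value by its average distance to its `k` nearest neighbours and flags values whose score is far above the typical one. The spend amounts, the function names and the cutoff of five times the median score are all assumptions of mine, not a standard rule.

```python
from statistics import median


def knn_scores(values, k=2):
    """Score each value by its mean distance to its k nearest neighbours."""
    scores = []
    for i, value in enumerate(values):
        distances = sorted(abs(value - other) for j, other in enumerate(values) if j != i)
        scores.append(sum(distances[:k]) / k)
    return scores


def find_outliers(values, k=2, factor=5.0):
    """Flag values whose score is more than `factor` times the median score."""
    scores = knn_scores(values, k)
    cutoff = factor * median(scores)  # hypothetical threshold, tune for real data
    return [value for value, score in zip(values, scores) if score > cutoff]


# Hypothetical card spend amounts in rand
spend = [6200, 7100, 6800, 7900, 6500, 7400, 80000]
outliers = find_outliers(spend)
```

Here the R80 000 amount sits far from all of its neighbours, so it is the only value flagged, while the ordinary amounts all have close neighbours and low scores.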

**Techniques:** k-nearest neighbours, outliers in clustering algorithms

## Conclusion

In this post we looked at a few different ways in which knowledge can be gained from data. These techniques can be combined and extended to extract interesting information. More advanced techniques from Artificial Intelligence, such as Genetic Algorithms and Particle Swarm Optimization, can even be added to get high-quality results more efficiently.

The world is filled with a vast amount of data that could potentially provide valuable insights into various aspects of business. Data mining can help extract knowledge from this data, which helps businesses improve software and ultimately customer experience.