How does data preparation affect cluster analysis results? A case study

In a world of ubiquitous data, business decisions are increasingly based on advanced statistical analysis. But have you ever wondered what happens when the input data is not properly prepared? Even the best algorithms cannot guarantee reliable results in that case. Data transformation is a key step that can significantly affect the correctness and interpretation of analysis results.

In this article, we will discuss why data transformations are so important and what errors can result from their absence or misapplication, particularly in the context of cluster analysis.

What are data transformations?

Data transformation is the process of adjusting the raw values to meet the requirements of the analysis. This is particularly important in cluster analysis, where the quality and form of the input data determine how the observations are grouped. Algorithms such as k-means, hierarchical cluster analysis or DBSCAN calculate distances between points in a multidimensional space. If the data are not properly transformed, the results of the analysis may be incorrect, difficult to interpret or even misleading.

Data transformations usually involve the following steps (a short code sketch follows the list):

  • Data scaling, e.g. normalisation – bringing the values of variables into a comparable range (e.g. 0–1). When variables have different units of measurement or ranges, variables with large values (e.g. revenue in the thousands) may dominate the analysis and drown out variables with smaller values (e.g. number of transactions in the range 1–10). Clustering results will then be distorted, because distances between points depend mainly on that one variable, which may lead to an artificial division of customers solely on the basis of revenue while the other characteristics describing them are ignored.
  • Standardisation – transforming the data to have a specific mean and standard deviation (e.g. mean = 0, standard deviation = 1). Like normalisation, this eliminates differences in the distributions of variables that result from their different units of measurement and different variances.
  • Outlier removal – eliminating anomalies that may distort the results of the analysis. This is particularly important in the k-means method, because cluster centres (centroids) are calculated as arithmetic means, and the mean is very sensitive to outlying observations. The consequence can be clusters shifted towards the outliers, or even a separate cluster created for the outliers alone, while more significant patterns in the data are ignored. For example, in customer data a single customer with an unusually large revenue (e.g. 10 times the average) may cause an entire cluster to form around this one anomaly.
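As a minimal sketch of these three steps (assuming Python with NumPy and scikit-learn; the customer values below are purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative data: column 0 = revenue (in thousands), column 1 = transactions (1-10)
X = np.array([[1200.0, 3],
              [45000.0, 8],
              [800.0, 1],
              [2500.0, 10]])

# Scaling (min-max normalisation) - brings every variable into the 0-1 range
X_norm = MinMaxScaler().fit_transform(X)

# Standardisation - mean 0, standard deviation 1 for every variable
X_std = StandardScaler().fit_transform(X)

# Simple outlier screening - keep rows inside the 1st-99th percentile of revenue
# (on a sample this small the thresholds only demonstrate the mechanics)
low, high = np.percentile(X[:, 0], [1, 99])
X_trimmed = X[(X[:, 0] >= low) & (X[:, 0] <= high)]

print(X_norm.round(3))
print(X_std.round(3))
print(X_trimmed)
```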

Bad transformation

Not only the absence of transformations, but also the choice of an inappropriate transformation can lead to distorted results. An example of this is over-normalisation, which brings all variables into the same range (e.g. from 0 to 1) regardless of their actual interpretation. This can lead to the loss of important differences between observations that are central to the business context.

Example 1

Imagine that a company analyses its customer data in terms of two variables: weekly spend (£) and number of transactions per week (number of purchases).

Table 1. Original customer data: weekly spend (£) and number of weekly transactions for customers A, B, C and D.

We normalise each variable according to the formula:

X_norm = (X − X_min) / (X_max − X_min)

After normalisation, the differences between clients become less pronounced, especially for clients A and C.

Table 2. Normalised weekly spend and number of transactions for customers A, B, C and D.
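As a quick check (a sketch in Python; only customers A, B and C are reproduced here, using the raw values that appear in the distance calculations below, since customer D's figures are not repeated in the text):

```python
import numpy as np

# Weekly spend (£) and transactions per week for customers A, B and C
spend = np.array([10.0, 10000.0, 500.0])
transactions = np.array([1.0, 10.0, 5.0])

def min_max(x):
    """Min-max normalisation: (x - x_min) / (x_max - x_min)."""
    return (x - x.min()) / (x.max() - x.min())

print(min_max(spend).round(3))         # [0.    1.    0.049]
print(min_max(transactions).round(3))  # [0.    1.    0.444]
```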

Let’s see how this affects the distances between points in two-dimensional space.

Distances between customers (on the original data), calculated as Euclidean distances:

  • Customer A and Customer B:

d(A, B) = √((10 − 10000)² + (1 − 10)²) ≈ 9990.0

  • Customer A and Customer C:

d(A, C) = √((10 − 500)² + (1 − 5)²) ≈ 490.0

The differences are very clear – Customer B is far more distant from Customer A than Customer C is.

Distances between customers (after normalisation):

  • Customer A and Customer B:

d(A, B) = √((0.000 − 1.000)² + (0.000 − 1.000)²) ≈ 1.414

  • Customer A and Customer C:

d(A, C) = √((0.000 − 0.049)² + (0.000 − 0.444)²) ≈ 0.447

After normalisation, the differences between customers appear much smaller: the ratio of the two distances drops from roughly 20:1 to about 3:1, so a clustering algorithm may treat customers with radically different spending as comparatively similar.
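The same comparison can be reproduced in a few lines (a sketch assuming NumPy, using only the values for customers A, B and C shown above):

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance between two points."""
    return float(np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2)))

# Raw data: (weekly spend in £, transactions per week)
A, B, C = (10, 1), (10000, 10), (500, 5)
print(round(euclidean(A, B), 1))       # ~9990.0
print(round(euclidean(A, C), 1))       # ~490.0

# Min-max normalised data (Table 2)
A_n, B_n, C_n = (0.000, 0.000), (1.000, 1.000), (0.049, 0.444)
print(round(euclidean(A_n, B_n), 3))   # ~1.414
print(round(euclidean(A_n, C_n), 3))   # ~0.447
```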

This example shows that important differences in scale have been lost, which can lead to inappropriate customer groupings, with high and low spenders landing in the same clusters. The enormous gap between Customer A (£10/week) and Customer B (£10,000/week) is squeezed into the same 0–1 range as far smaller differences, even though these two customers represent extremely different target groups. A better solution in this situation would be to use either standardisation or a logarithmic transformation for variables with a large spread of values (e.g. spend).
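As a sketch of the suggested remedy (assuming NumPy; the base-10 logarithm is just one possible choice), a log transformation keeps the order-of-magnitude differences in spend visible without letting the raw scale dominate the distance calculation:

```python
import numpy as np

spend = np.array([10.0, 10000.0, 500.0])   # customers A, B, C

log_spend = np.log10(spend)
print(log_spend.round(2))   # [1.   4.   2.7] - A, B and C stay clearly ordered and separated
```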

Another problem is that failing to address outlying observations may result in the outliers dominating the analysis. The following example illustrates this well.

Example 2

We analyse monthly gas consumption among customers. Standard customers show a seasonal pattern: highest consumption in winter, lowest in summer. However, two large customers appear in the dataset who consume gas at rates many times higher than other customers.

Chart: monthly gas consumption (m³) over 12 months for customers 1–6 and two outlier customers ('Customer segmentation with a clear division on extreme outliers').

In the segmentation presented, we see how the presence of outliers forces artificial divisions that are not naturally reflected in the data. Normally, we would expect segments to result from actual consumption patterns – e.g. low, medium and high gas consumption. However, the presence of extremes (outliers) means that the boundaries between segments are deformed and the structure of the groups is no longer intuitive. Instead of a logical division, we have:

  • A broad category of ‘standard customers’, which includes both low and medium users. In reality these two groups should be treated separately, but the impact of the outliers has shifted the cluster centroids enough to blur the differences between them.
  • The first outlier, whose high usage makes it naturally incompatible with the standard group, but which could still be analysed as the upper end of the typical customer range.
  • An extreme outlier whose consumption is so far above everyone else’s that it completely dominates the chart and the analysis. This single customer shifts the averages and affects the classification of the other customers, changing their position in the segmentation.

Such a situation is problematic to analyse because it distorts the true picture of the customer base: instead of three natural segments, we have one broad segment and two artificially separated outliers. It also hinders business decision-making: a company may wrongly conclude that most customers fall into one group, when in fact the differences in their consumption are significant. Finally, it can lead to faulty marketing strategies – if a company develops an offer on the basis of this segmentation, it may misalign its services, for example by offering customers in the middle segment tariffs designed for customers with very low or very high consumption.

This is a good example of how outliers can force segmentations that, instead of reflecting the data correctly, merely accommodate the anomalies, leading to misinterpretations and poor decisions.
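The effect is easy to reproduce on synthetic data (a minimal sketch assuming scikit-learn; the consumption figures below are invented purely for illustration and are not the values behind the chart above): with the extreme customers present, k-means spends its clusters on the anomalies and lumps low and medium users into one broad 'standard' segment.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic monthly gas consumption (m3), purely illustrative: a seasonal profile
# scaled for 3 low users, 3 medium users and 2 extreme outliers
season = np.array([1.5, 1.4, 1.2, 1.0, 0.8, 0.6, 0.6, 0.7, 0.9, 1.1, 1.3, 1.5])
low      = [100 * season + rng.normal(0, 5, 12) for _ in range(3)]
medium   = [300 * season + rng.normal(0, 15, 12) for _ in range(3)]
outliers = [3000 * season, 10000 * season]
X = np.vstack(low + medium + outliers)

# With the outliers present, two of the three clusters are spent on the anomalies
# and the six standard customers end up in one broad segment
labels_raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("with outliers:   ", labels_raw)

# Setting the outliers aside lets the low/medium structure reappear
labels_trim = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:6])
print("without outliers:", labels_trim)
```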

How can these errors, caused by a missing or inappropriate transformation, be avoided?

First of all, before transforming, it is important to thoroughly understand the data you are working with and to perform an initial exploration. Graphs (e.g. histograms or box-and-whisker plots) will help to assess whether the data are homogeneous, or whether they have a skewed distribution or outliers. Outliers may be visible in box-and-whisker plots or in descriptive statistics (e.g. values beyond the 1st and 99th percentiles). However, it is worth considering whether outliers represent errors in the data or important information to be included in the analysis (e.g. VIP customers). It is also important to assess the range and units of measure of the different variables.
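In practice, this initial exploration can be a handful of descriptive statistics and plots (a sketch assuming pandas and matplotlib; the column names and values are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical customer data with the two variables from Example 1
df = pd.DataFrame({"spend": [10, 10000, 500, 250, 40],
                   "transactions": [1, 10, 5, 4, 2]})

# Ranges, units and skewness at a glance, including the 1st and 99th percentiles
print(df.describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99]))

# Histograms and box-and-whisker plots reveal skewed distributions and outliers
df.hist(figsize=(8, 3))
df.plot.box(subplots=True, figsize=(8, 3))
plt.show()
```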

Consideration should also be given to transformations that fit the algorithm. The k-means method uses the Euclidean distance, which is sensitive to the scale and range of the variables, so normalising or standardising the data (especially variables with a large scatter) is essential. In hierarchical methods, scaling also plays a key role, but different distance measures (e.g. Minkowski, Manhattan) can additionally be considered, and these may be less sensitive to the range of the variables.
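For hierarchical methods, switching the distance measure is a one-argument change (a sketch assuming SciPy and scikit-learn; 'cityblock' is the Manhattan distance, 'minkowski' with p=3 is one example of the general Minkowski metric, and the data points are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler

# Illustrative (spend, transactions) data; scaling still matters for the distances
X = np.array([[10, 1], [10000, 10], [500, 5], [250, 4]], dtype=float)
X_std = StandardScaler().fit_transform(X)

# Condensed distance matrices under different metrics
d_euclid = pdist(X_std, metric="euclidean")
d_manhat = pdist(X_std, metric="cityblock")        # Manhattan
d_minkow = pdist(X_std, metric="minkowski", p=3)   # Minkowski with p = 3

# Hierarchical clustering (average linkage) on the chosen distance matrix
Z = linkage(d_manhat, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```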

It is also worth trying different transformations. There is no one-size-fits-all solution, so it is worth comparing the analysis results obtained with different transformations (normalisation, standardisation, log transformation). When evaluating them, it is important to consider whether the results are in line with business intuition – whether the segments have a logical rationale and are a good representation of reality. Graphs that compare the results of the cluster analysis, and allow you to see whether a transformation has improved the quality of the division, can be very helpful. For multivariate data, dimension reduction techniques such as PCA, t-SNE or UMAP can be used.
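One way to run such a comparison side by side (a sketch; the silhouette score used here is one possible statistical quality measure, not something prescribed above, and the customer data are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative two-variable customer data (spend, transactions)
X = np.array([[10, 1], [10000, 10], [500, 5], [250, 4], [40, 2], [8000, 9]], dtype=float)

transforms = {
    "raw": X,
    "min-max": MinMaxScaler().fit_transform(X),
    "standardised": StandardScaler().fit_transform(X),
    "log": np.log1p(X),
}

# Run the same clustering under each transformation and compare a quality measure;
# scores computed in different feature spaces are only a rough heuristic, and the
# business plausibility of the segments still has to be judged separately
for name, Xt in transforms.items():
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xt)
    print(f"{name:>12}: silhouette = {silhouette_score(Xt, labels):.3f}")
```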

Summary

Data transformation is one of the key steps in data analysis, and its importance cannot be overstated. Without proper preparation, data can mislead analysts, resulting in the creation of illogical or artificial groupings, ignoring relevant patterns and, ultimately, making the wrong business decisions. Such mistakes can cost companies not only time and money, but also customer trust.

By taking a thoughtful approach to transformations, adapting them to the requirements of the algorithms used and testing different methods, the accuracy and reliability of the results can be increased. Properly transformed data opens the door to better interpretation of reality and accurate decisions that translate into business success.

Is your data ready for analysis? If you’re not sure, get in touch with an expert or ensure your data is robustly prepared before running analytical algorithms!

Author: Anna Wilk, Data Analysis Team Leader at StatSoft


