When you need to understand situations that seem to defy data analysis, you may be able to use techniques such as binary logistic regression. This article details how wine-tasting data and binary logistic regression yielded insight into factors that were important to a panel of experienced wine-tasters. The analysis illustrates that even factors that seem hard to measure, such as taste preferences, can be assessed with statistics if you choose the right analysis.
In this article, we will take a very unusual look at wine tasting. Although tastes vary from person to person and are probably unique (De gustibus non est disputandum: “In matters of taste, there can be no disputes”), some wines are better than others, and most people could probably tell a good wine from a bad one.
We are interested in using statistics to understand whether a wine that has, for instance, more sulphates or more chlorides would taste better. Based on that understanding, it could be possible to make a better wine. We will consider several variables, such as acidity, sulphur dioxide, and percentage of alcohol.
We have data from a panel of oenologists who tasted several types of white and red wines and provided binary assessments of quality—good (1) or poor (0)—for each. Here are the variables in our data set:
Our goal is to identify which of these many variables have a significant effect on wine quality.
Even very simple graphs can provide good indications of which variables might be important, and help us understand the structure of our data set. The bar chart below describes the relationship between types of wines (white or red) and the panel’s binary quality responses. The panel tasted more white wines than red, and since we can see that there is a larger proportion of 1 ratings for white wines, we can infer that the panel seems to prefer white wines:
This is interesting information, and is something we might want to consider later, but our primary objective is to evaluate the effects of pH, density, sulphates, alcohol, residual sugar, and other factors on wine quality. Do some of these variables have a significant effect on quality? If so, which ones?
We are interested in identifying variables that differ markedly between good wines and bad ones. Such variables might be good predictors of wine quality. The boxplots below illustrate the distribution of each variable for good and poor wines. Clearly, we have a lot of variables to consider, and using graphs to select those that have a noticeable effect on wine quality is far from easy.
Regression analysis lets us see how multiple factors affect an outcome, so it would seem to be an ideal method for examining the wine-tasting variables. However, recall that our panel simply rated each wine as either good or poor. This means we have binary rather than continuous response data, so we need to proceed with caution: using a standard regression or ANOVA to analyze a binary response is generally not a good idea.
Because binary data follow a binomial distribution rather than a normal, bell-shaped distribution, standard regression may produce predicted probabilities that are negative or larger than 100%. We might also get an unnecessarily complex model in which some spurious interactions seem to be significant. In addition, the variance of binary data is not constant: for a proportion p it equals p(1 - p), so the variability shrinks as the average proportion approaches the lower (0) or upper (1) limit. Therefore, effects that seem larger at particular factor settings might be due not to interactions with other factors, but to nonconstant variance.
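To make these two problems concrete, here is a minimal sketch using a small synthetic data set (the values are invented for illustration and are not the wine data). It fits an ordinary least-squares line to a 0/1 response, shows that the fitted line can predict "probabilities" below 0 and above 1, and then prints the variance p(1 - p) of a binary outcome for a few proportions.

```python
import numpy as np

# Synthetic illustration only: one predictor x and a 0/1 response y.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Ordinary least-squares line fitted directly to the binary response.
slope, intercept = np.polyfit(x, y, deg=1)

# Predictions from the straight line are not confined to [0, 1].
low_prediction = intercept + slope * (-2.0)
high_prediction = intercept + slope * 12.0
print("prediction at x = -2:", round(low_prediction, 3))
print("prediction at x = 12:", round(high_prediction, 3))

# The variance of a binary outcome with success probability p is p * (1 - p):
# it is largest at p = 0.5 and shrinks toward 0 near either limit.
for p in (0.05, 0.5, 0.95):
    print("p =", p, "variance =", round(p * (1 - p), 4))
```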
Fortunately, there’s a simple solution: since we have binary response data, we simply need to use binary logistic regression.
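As a rough sketch of what a binary logistic regression does, the Python code below fits one by Newton-Raphson iteration using only NumPy. This is an illustration, not Minitab's implementation: the function name `fit_logistic` and the synthetic data are assumptions made for the example. The key point is that the model works on the log-odds scale, so every predicted probability falls strictly between 0 and 1.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit a binary logistic regression by Newton-Raphson.

    Returns the coefficients with the intercept first.
    Illustrative helper, not a library function.
    """
    X1 = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))       # current fitted probabilities
        weights = p * (1 - p)                      # binomial variance at each point
        gradient = X1.T @ (y - p)
        hessian = X1.T @ (X1 * weights[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

# Synthetic illustration only (not the wine data).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0])

beta = fit_logistic(x, y)
new_x = np.array([-2.0, 12.0])
probabilities = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * new_x)))
print("coefficients (intercept, slope):", beta)
print("predicted probabilities at x = -2 and x = 12:", probabilities)
```

Even far outside the observed range of x, the predictions stay inside the (0, 1) interval, which is what a straight-line fit cannot guarantee.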
Before jumping into a regression analysis, we can use a Principal Components (multivariate) Analysis to detect collinearity or correlation among the variables. Identifying variables that are highly collinear—which can make one of the variables almost redundant in some cases—can help us select the best possible binary logistic regression model.
To understand whether some variables are correlated with one another, we could use a standard correlation analysis (Stat > Basic Statistics > Correlation in Minitab), but using a loading plot from a Principal Components Analysis offers a very clear visual illustration of these correlations. Such a plot is more explicit and shows whether some groups of correlated variables might be grouped together.
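Outside Minitab, the same idea can be sketched in a few lines of Python. The example below builds three synthetic variables (hypothetical stand-ins for alcohol, density, and pH, with density deliberately constructed to be negatively correlated with alcohol), computes the correlation matrix, and extracts the coefficients of the first two principal components, which are the coordinates a loading plot would display. The data and variable relationships are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins, NOT the article's data.
alcohol = rng.normal(10.5, 1.0, n)
density = 1.0 - 0.002 * alcohol + rng.normal(0.0, 0.0005, n)  # tied to alcohol
ph = rng.normal(3.2, 0.15, n)                                 # independent
data = np.column_stack([alcohol, density, ph])
names = ["alcohol", "density", "pH"]

# Correlation matrix of the variables (columns).
corr = np.corrcoef(data, rowvar=False)

# Principal components of the correlation matrix, largest eigenvalue first.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
loadings = eigenvectors[:, order[:2]]  # coefficients on PC1 and PC2

print("correlation(alcohol, density):", round(corr[0, 1], 3))
for name, (pc1, pc2) in zip(names, loadings):
    print(f"{name:8s} PC1 = {pc1: .3f}  PC2 = {pc2: .3f}")
```

Variables whose loadings point in the same or in opposite directions on the plot are strongly correlated; variables at roughly right angles are close to uncorrelated.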
In Minitab, go to Stat > Multivariate > Principal Components, enter the variables, select Graphs, and check Loading Plot. Our data yielded the following:
The loading plot from the Principal Components Analysis shows that:
Because of these strong collinearities, different models (that include different variables) may be equally acceptable in terms of prediction. This needs to be considered once a final model has been selected.
A standard practice in regression analysis is to start with the “full model,” one that includes all of the potentially significant factors for which you collected data. In this case, we begin the analysis by including all variables and all interactions between those variables and type of wine. Then we eliminate the term with the highest p-value. Since we know some variables are highly collinear and could influence one another, we eliminate only one term at a time, then rerun the regression using the reduced model.
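The elimination loop described above can be sketched as follows. This is a simplified illustration under several assumptions: the data are synthetic, the model contains main effects only (no interactions, so there is no hierarchy to respect), p-values are Wald p-values from a hand-written Newton-Raphson fit, and the helper names `fit_logistic` and `backward_eliminate` are invented for the example.

```python
import math
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson fit; returns coefficients and Wald p-values (intercept first)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        hessian = X1.T @ (X1 * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, X1.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X1 @ beta))
    hessian = X1.T @ (X1 * (p * (1 - p))[:, None])
    std_err = np.sqrt(np.diag(np.linalg.inv(hessian)))
    z = beta / std_err
    # Two-sided p-value from the standard normal distribution.
    p_values = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
    return beta, p_values

def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the term with the highest p-value, one at a time, refitting each time."""
    kept = list(range(X.shape[1]))
    while kept:
        _, p_values = fit_logistic(X[:, kept], y)
        term_p = p_values[1:]              # skip the intercept
        worst = int(np.argmax(term_p))
        if term_p[worst] < alpha:
            break                          # every remaining term is significant
        del kept[worst]                    # remove only one term, then refit
    return [names[i] for i in kept]

# Synthetic data: only x1 truly affects the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
true_probability = 1.0 / (1.0 + np.exp(-1.5 * X[:, 0]))
y = (rng.random(400) < true_probability).astype(float)

selected = backward_eliminate(X, y, ["x1", "x2", "x3"])
print("terms kept:", selected)
```

Refitting after each single removal matters when predictors are collinear, because dropping one term can change the p-values of the others.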
Ultimately, this iterative process leads us to the model below. It is quite complex, with many significant Wine-Type*variable interactions:
The factors and interactions that remain in the model are statistically significant (with p-values < 0.05), with two exceptions: Alcohol and Free SO2 both have high p-values, which would normally make them candidates for elimination. Because these terms are included in significant interactions, however, they should remain in the model.
With 15 terms, this model is far too difficult to understand and explain, but it does give us a clue to how we can delve deeper into these data to better understand which factors contribute most to good-tasting wine.
We have 5 significant interactions involving “type” in our model. This indicates that the effects of some variables differ significantly according to red or white wines. Remember also that our panel seemed to have a preference for white over red wines. Perhaps we should consider separate models for white and red wines. This would eliminate the need to include interactions between Types of Wine and other variables, which would greatly simplify the models.
We’ll analyze the white wine data first. As before, we’ll start from the full model and eliminate one factor at a time according to its p-value. This leads us to the following model:
This model includes only 6 terms, and the variables that remain in the model all have low p-values (less than or very close to 0.05). This model is easier to interpret since there are no interactions. Density, for example, seems to have a negative effect on taste because it has a negative coefficient, while pH has a positive effect.
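One common way to read the sign and size of a logistic regression coefficient is to convert it to an odds ratio by exponentiating it. The short sketch below uses hypothetical coefficient values chosen only to illustrate the arithmetic; they are not the values from the model above.

```python
import math

# Hypothetical coefficients for illustration; NOT the article's fitted values.
coefficients = {"density": -0.8, "pH": 0.6}

# exp(coefficient) is the factor by which the odds of a "good" rating are
# multiplied when the variable increases by one unit, other terms held fixed.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, ratio in odds_ratios.items():
    effect = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.2f} ({effect} the odds of a good rating)")
```

A negative coefficient always gives an odds ratio below 1, and a positive coefficient gives one above 1, which is why the sign alone tells us the direction of the effect.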
But how do we know this model is acceptable? Goodness of fit tests help us assess model adequacy. See the output from Minitab below:
The p-values for all three goodness-of-fit tests are well over 0.05, so we cannot reject the hypothesis that this model is adequate. That’s encouraging. We can also look at the number of concordant and discordant pairs in our model. The proportions of concordant and discordant pairs measure the level of agreement between the model predictions and the observations (in other words, how well the model reflects the observed data).
The proportion of concordant pairs is high. Again, this is encouraging.
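For readers who want to see how such pairs are counted, here is a minimal sketch. It assumes the usual definition: every wine rated 1 is paired with every wine rated 0, and a pair is concordant when the model gives the wine rated 1 the higher predicted probability, discordant when it gives it the lower one, and tied otherwise. The function name `count_pairs` and the small data set are invented for illustration.

```python
def count_pairs(ratings, predicted_prob):
    """Count concordant, discordant, and tied pairs of (good wine, poor wine)."""
    good = [p for r, p in zip(ratings, predicted_prob) if r == 1]
    poor = [p for r, p in zip(ratings, predicted_prob) if r == 0]
    concordant = discordant = tied = 0
    for p_good in good:
        for p_poor in poor:
            if p_good > p_poor:
                concordant += 1
            elif p_good < p_poor:
                discordant += 1
            else:
                tied += 1
    return concordant, discordant, tied

# Made-up ratings and predicted probabilities, for illustration only.
ratings = [1, 1, 0, 0, 1]
predicted_prob = [0.9, 0.6, 0.4, 0.7, 0.7]

concordant, discordant, tied = count_pairs(ratings, predicted_prob)
total = concordant + discordant + tied
print(f"concordant: {concordant}/{total}, discordant: {discordant}/{total}, tied: {tied}/{total}")
```

With three good wines and two poor wines there are six pairs; in this toy example four are concordant, one is discordant, and one is tied.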
A way to validate the model is to see how well the observed data match the model’s predicted probabilities. The standardized Delta graph checks for large differences between predicted probabilities based on our model and observed probabilities. The graph below shows that we do have some outliers, but on the whole it looks reasonable.
We followed the same process used to analyze the white-wine data—iteratively eliminating variables one at a time from the full model—to create a model for the red wines:
With only two factors, the model is fairly simple and small. We still need to look at the goodness-of-fit tests, however.
The Pearson and deviance tests give no cause for concern, but the p-value of the Hosmer-Lemeshow test is low. This suggests the model may not fit the data well.
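To give a sense of what the Hosmer-Lemeshow test measures, the sketch below computes its statistic: the observations are sorted by predicted probability and split into groups, and the observed number of good ratings in each group is compared with the number the model expects. This is a simplified illustration (the function name and the toy data are assumptions, and grouping details may differ from Minitab's). It returns only the statistic; the p-value would come from a chi-square distribution with (number of groups - 2) degrees of freedom.

```python
import numpy as np

def hosmer_lemeshow_statistic(ratings, predicted_prob, groups=10):
    """Hosmer-Lemeshow chi-square statistic (assumes no group ends up empty)."""
    y = np.asarray(ratings, dtype=float)
    p = np.asarray(predicted_prob, dtype=float)
    order = np.argsort(p)                      # sort by predicted probability
    statistic = 0.0
    for idx in np.array_split(order, groups):  # roughly equal-sized groups
        n = len(idx)
        expected = p[idx].sum()                # expected number of good ratings
        observed = y[idx].sum()                # observed number of good ratings
        statistic += (observed - expected) ** 2 / (expected * (1 - expected / n))
    return statistic

# Toy data with two groups of five wines each, for illustration only.
predicted_prob = [0.2] * 5 + [0.8] * 5

well_calibrated = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]   # 1 of 5 and 4 of 5 good
poorly_calibrated = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # 0 of 5 and 5 of 5 good

stat_good = hosmer_lemeshow_statistic(well_calibrated, predicted_prob, groups=2)
stat_poor = hosmer_lemeshow_statistic(poorly_calibrated, predicted_prob, groups=2)
print("statistic when observed matches expected:", round(stat_good, 6))
print("statistic when observed differs from expected:", round(stat_poor, 6))
```

A larger statistic means a bigger gap between observed and expected counts, which corresponds to a lower p-value and stronger evidence of lack of fit.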
Once again, we’ll create a standardized Delta graph to help validate the model. The graph indicates that we have an outlier in row 34, which might be causing the goodness-of-fit issue. To see if that’s the case, we can eliminate row 34 and rerun the whole analysis.
The new analysis, without data point 34, yields a very similar model. This revised model has the same variables, but slightly different coefficients:
This time the p-values are high for all goodness-of-fit tests, so we do not have a model adequacy issue:
Now let’s look at what Minitab tells us about concordant and discordant pairs:
The Minitab output above shows that the proportion of concordant pairs is high. Moreover, the Delta Beta graph of residuals does not reveal any major outlying observations:
Now that we have models for the red and white wines, we can see what the data tell us about the wine characteristics that influenced our panel’s ratings. For example, this scatterplot summarizes the relationship between the variables for red wines:
The scatterplot indicates that red wines with a larger alcohol percentage and larger fixed acidity content receive higher quality ratings.
The data set we used to build our models was just part of a larger data set that we had divided in two: a training data set to build the models, and a testing data set to validate them. Once we had our final models, we used them to predict the quality ratings for the testing data. When we compared these predictions with the actual panel results, we found 152 concordant results and 48 discordant results overall. Considering how difficult it is to analyze personal tastes, this is a very good result!
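The comparison on the testing data can be sketched as below. The helper `agreement_counts` and its 0.5 cutoff for turning a predicted probability into a good/poor prediction are assumptions for illustration (the cutoff actually used is not stated here), and the four-wine example is made up. The last lines simply turn the reported counts of 152 and 48 into an agreement rate.

```python
def agreement_counts(ratings, predicted_prob, cutoff=0.5):
    """Count predictions that agree or disagree with the panel's 0/1 ratings.

    The 0.5 cutoff is an assumption made for this sketch.
    """
    agree = sum(
        1 for r, p in zip(ratings, predicted_prob) if (p >= cutoff) == (r == 1)
    )
    return agree, len(ratings) - agree

# Made-up example: four test wines.
example_agree, example_disagree = agreement_counts([1, 0, 1, 0], [0.8, 0.3, 0.4, 0.6])
print("toy example:", example_agree, "agree,", example_disagree, "disagree")

# Counts reported for the testing data.
concordant = 152
discordant = 48
agreement_rate = concordant / (concordant + discordant)
print(f"agreement rate on the testing data: {agreement_rate:.0%}")
```

In other words, the models predicted the panel's verdict correctly for about three out of every four wines they had not seen before.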
So when you need to understand situations that, at least on the surface, defy data analysis, why not dig a little deeper by using techniques such as binary logistic regression? You can use a similar approach to what we did with this wine-tasting data to analyze marketing or sales data, to better understand customer preferences, and to gain insight into factors that are important—even if, like taste preferences, they seem hard to measure.