How to analyze a regression task in machine learning
In this post, you will read about the following topics:
- Gaussian kernel
- Probability density function (PDF)
- Kernel density estimation (KDE)
If you work with regression problems in machine learning, you need to analyze your final results to properly assess the model's performance.
You may ask: we already have metrics such as RMSE, MSE, R², and so on, so what more should we analyze about a machine learning model?
That was my question as well. When I started studying regressors and comparing different models, I ran into a problem: how can I present a model's performance clearly enough to make a solid comparison between models?
So I looked for a better approach for my experiments and found an answer, which I share in this post. But first, we have to review some basic topics.
Note: I have included the relevant Python code inline; for the complete code and a full example, please refer to my Kaggle and GitHub.
Gaussian Kernel
In machine learning, a kernel refers to the shape of the function being used. A Gaussian kernel is therefore a kernel with the shape of a Gaussian distribution; the Gaussian function is the one we commonly use to describe the normal distribution.
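To make this concrete, here is a minimal sketch of the Gaussian function in Python; the function name and parameter values are my own illustrative choices:

import numpy as np

def gaussian(x, mu=0.0, sigma=1.0):
    # Gaussian (normal) function with mean mu and standard deviation sigma
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-4, 4, 200)
y = gaussian(x)  # the familiar bell-shaped curve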
Probability density function
A probability density function (PDF) is the probability function of a continuous random variable: it returns the relative likelihood of the variable taking a given value. We use the PDF to obtain the probability that the variable falls within a range of values.
Let’s solve an example together to get a clear idea of the PDF.
Assume that the predicted values of one specific regression target usually fall between two and four. The probability that a prediction equals exactly three is zero: many predictions may be very close to three (say, 3.0001), but essentially none is exactly three. In contrast, the probability of a prediction falling between 3 and 3.20 is measurable; suppose it is about 3%. Similarly, the probability of a prediction falling between 3 and 3.21 is measurable and slightly larger. Computing the PDF in a moving window like this reveals the probability of the predicted values within each defined window.
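To put numbers on this, here is a small sketch using scipy.stats.norm; the distribution and its parameters are assumptions purely for illustration:

from scipy.stats import norm

# Hypothetical assumption: predictions follow a normal distribution
# centered at 3 with standard deviation 0.5.
dist = norm(loc=3.0, scale=0.5)

# P(X == 3) is zero for a continuous variable, but an interval
# probability such as P(3 <= X <= 3.20) is measurable:
p_interval = dist.cdf(3.20) - dist.cdf(3.0)
print(p_interval)  # roughly 0.155 under these assumed parameters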
Kernel density estimation
Knowing the probability density function, or PDF, we need a tool to estimate it for a continuous random variable. Kernel density estimation (KDE) is a non-parametric method for this purpose.
import scipy.stats as st
kde = st.gaussian_kde(dataset)  # dataset: array of shape (# dims, # data points)
It estimates the distribution of the data, and when we are dealing with a large number of data points, it is a good tool for analyzing the results.
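Here is a self-contained sketch with synthetic data; the sample and the grid are illustrative:

import numpy as np
import scipy.stats as st

# Draw 1,000 points from a normal distribution as an illustrative sample.
rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=1000)

# Fit the KDE and evaluate the estimated density on a grid of points.
kde = st.gaussian_kde(sample)
grid = np.linspace(-4, 4, 200)
density = kde(grid)  # estimated PDF values at each grid point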
Prediction
As the focus of this study is not model training, I skip that part; we assume we have already trained a regression model and have the predicted and true values. In the following, we use these two variables for the statistical tests and try to interpret them.
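For reference, here is a minimal sketch of the setup assumed below; the dataset, the model choice, and the variable names are illustrative, not the exact experiment:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Toy regression data; the analysis below only needs predicted and true values.
X, y = make_regression(n_samples=1500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
predicted_values = model.predict(X_test)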
Performance analysis
Gaussian KDE
Here, we want to evaluate the results of the regression model, for which we already have the predicted values. For that, we use the Gaussian KDE from `scipy.stats`.
import scipy.stats as st
Our analysis is two-dimensional, so we need an array with two rows containing the predicted and real values:
import numpy as np

x = predicted_values
y = y_test
xy = np.vstack((x, y))
The first row contains the predicted values:
xy[0, :] == x
Create the KDE object:
kernel = st.gaussian_kde(xy)
print(kernel)
>>> <scipy.stats._kde.gaussian_kde at 0x20a430ef7f0>
Next, we evaluate the created PDF on a 2-D grid built with NumPy. To do so, we generate a dense grid over the ranges of the predicted and real values and stack the two coordinate arrays together.
xx, yy = np.mgrid[x.min():x.max():100j, y.min():y.max():100j]
concat = np.dstack((xx, yy))
`concat` is a 3-D array of shape (100, 100, 2) holding the grid coordinates of the predicted and real values.
Time to evaluate the PDF densely over the grid:
z = np.apply_along_axis(kernel, 2, concat)
z = z.reshape(100, 100)
print(z.shape)
>>> (100, 100)
The final plot would look as follows:
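One possible way to render this 2-D density with Matplotlib; the colormap, labels, and figure size are my own choices:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7, 6))
contour = ax.contourf(xx, yy, z, levels=20, cmap="viridis")
fig.colorbar(contour, label="density")
ax.set_xlabel("predicted values")
ax.set_ylabel("real values")
plt.show()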
You can also draw this plot in a 3-D form, as shown below. The difference between the 2-D and 3-D plots is that I reshaped the kernel output to two dimensions for the 3-D plot.
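A sketch of the 3-D surface version, again with illustrative styling choices:

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection="3d")
ax.plot_surface(xx, yy, z, cmap="viridis")
ax.set_xlabel("predicted values")
ax.set_ylabel("real values")
ax.set_zlabel("density")
plt.show()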
or, alternatively, as a wireframe plot:
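A minimal wireframe sketch, assuming the same xx, yy, and z arrays as above; the stride values are illustrative:

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection="3d")
ax.plot_wireframe(xx, yy, z, rstride=5, cstride=5)
ax.set_xlabel("predicted values")
ax.set_ylabel("real values")
ax.set_zlabel("density")
plt.show()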
The main code is stored in the GitHub repository.
Hexbin plot
Another useful tool for this interpretation is the hexbin plot, since we have two numerical variables with many data points. It bins the data points into hexagons and indicates the count in each bin with a color bar.
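A minimal hexbin sketch using the same x and y arrays as above; the grid size and colormap are illustrative choices:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7, 6))
hb = ax.hexbin(x, y, gridsize=30, cmap="viridis")
fig.colorbar(hb, label="count")
ax.set_xlabel("predicted values")
ax.set_ylabel("real values")
plt.show()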
Conclusion
In this experiment, I considered a toy regression dataset with 1,500 instances and two targets. I trained an ML regression model on the dataset and applied the gaussian_kde tool to estimate the distribution of the predicted and real values.
I illustrated the comparison results with different approaches, including wireframe, 3-D, 2-D, and hexbin plots.