Data preparation


April 6, 2022 · 8 minute read
Data preparation, Data Mining

Description:

Machine learning algorithms are widely used for tasks such as classification and regression. To train a machine learning model, you need to feed it proper data so that the model performs well. But what counts as proper data from the machine learning algorithm's point of view?

Here we take a brief look at the role of data preprocessing and why it matters in the machine learning process.

Feature preprocessing:

Real-world datasets usually come with many issues: missing values, low-variance features, highly correlated features, and so on. We cannot feed a model raw data; even if it does not crash, it will perform poorly.

In this article, we briefly introduce some of the best-known techniques, such as feature selection and feature extraction.

Dealing with missing data:

Missing data is common in almost all datasets, whether caused by human error or by insufficient information in some cases. Some machine learning algorithms raise an error when they encounter missing values (e.g., None), while others silently fill them with 0 or some default value. To get the best results from a model, we need to fix this issue during the data preparation step. There are various approaches to the problem, and good domain knowledge is the most helpful asset, because the right replacement value differs from case to case. When only a small number of data points are missing, dropping the affected rows or columns can also be an option; this decision depends heavily on the kind and amount of data you are dealing with.

Replacing missing values with the median or mean, or estimating them with interpolation, can also help us overcome this problem.
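As a rough sketch of these options, the snippet below uses pandas to either drop or fill missing values. The DataFrame, its column names, and the chosen fill strategies are made-up assumptions for illustration, not a prescription.

```python
# A minimal sketch of handling missing values with pandas.
# The column names and values here are invented for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "income": [50000, 62000, np.nan, 58000],
    "city": ["Tehran", "Berlin", None, "Paris"],
})

# Option 1: drop rows that contain any missing value
# (reasonable only when few data points are missing).
df_dropped = df.dropna()

# Option 2: fill numerical columns with the median and
# categorical columns with the most frequent value.
df_filled = df.copy()
df_filled["age"] = df_filled["age"].fillna(df_filled["age"].median())
df_filled["income"] = df_filled["income"].fillna(df_filled["income"].median())
df_filled["city"] = df_filled["city"].fillna(df_filled["city"].mode()[0])

# Option 3: interpolate numerical columns
# (useful for ordered or time-series data).
df_interp = df.copy()
df_interp[["age", "income"]] = df_interp[["age", "income"]].interpolate()
```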

Feature selection:

Feature selection is a way to reduce the number of input variables when developing a predictive model. Reducing the number of input variables lowers the computational cost of modeling and, in some cases, improves the performance of the model.

Additionally, feature selection is helpful when we have many features and a lot of them are likely to be redundant, carry no useful information, or have so little variance that they are worthless to the model.

Statistics-based feature selection methods evaluate the relationship between each feature and the labels, and select the input variables that are most relevant to the target variable. The choice of statistical test depends on the data types of both the input and output variables: each can be numerical or categorical, which yields four different cases, and each scenario has its own suitable feature selection method.
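As a hedged example, the sketch below uses scikit-learn's SelectKBest for the case of numerical inputs and a categorical target, where the ANOVA F-test (f_classif) is a common choice. The dataset and the value of k are arbitrary choices for illustration.

```python
# A minimal sketch of statistics-based feature selection with scikit-learn,
# assuming numerical inputs and a categorical target (one of the four cases).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-score against the target.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("F-scores per feature:", selector.scores_)
print("Selected shape:", X_selected.shape)  # (150, 2)
```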

Feature extraction:

Principal component analysis (PCA) is one of the most widely used feature extraction algorithms for dimensionality reduction. It computes the principal components and uses them to perform a change of basis on the data, often keeping only the first few principal components and ignoring the rest.

Like any other dimensionality reduction algorithm, this method has some disadvantages too: the transformed data is not guaranteed to retain all the information present in the original data.
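A minimal sketch of PCA with scikit-learn is shown below. The dataset, the standardization step, and the choice of two components are assumptions for illustration; the explained variance ratio gives a rough idea of how much information the kept components retain.

```python
# A minimal sketch of PCA-based feature extraction with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Keep only the first two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of the original variance captured by each kept component.
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)  # (150, 2)
```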

Low variance features:

Suppose there is a feature column with the same value for all records. Does this feature add any value to the dataset or help improve model performance? Of course not.

Having a constant value for a feature is an extreme case of low variance, where the standard deviation along that column is zero. So you can choose a threshold for the standard deviation and drop columns whose standard deviation falls below it.
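The sketch below illustrates this idea with pandas; the toy columns and the threshold value are arbitrary choices for illustration.

```python
# A minimal sketch of dropping low-variance (near-constant) columns with pandas.
import pandas as pd

df = pd.DataFrame({
    "constant": [1, 1, 1, 1],         # std = 0, carries no information
    "almost_constant": [5, 5, 5, 6],  # very low std
    "useful": [10, 25, 3, 48],        # higher std
})

std_threshold = 0.6  # arbitrary threshold for illustration
low_std_cols = df.std()[df.std() < std_threshold].index
df_reduced = df.drop(columns=low_std_cols)

print("Dropped columns:", list(low_std_cols))   # ['constant', 'almost_constant']
print(df_reduced.columns.tolist())              # ['useful']
```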

Highly correlated features:

If some feature columns in the dataset are highly correlated with each other, they are almost redundant: out of each group of correlated columns, you can keep only the one that is most relevant to the target variable and drop the rest.
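A possible sketch with pandas is shown below. The synthetic columns and the 0.9 correlation threshold are assumptions for illustration; in practice you would keep, from each correlated group, the column most relevant to the target.

```python
# A minimal sketch of removing highly correlated features with pandas.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=100),  # nearly a duplicate of "a"
    "b": rng.normal(size=100),
})

# Absolute pairwise correlations; keep only the upper triangle
# so each pair of columns is considered once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column that is correlated above the threshold with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)

print("Dropped:", to_drop)           # ['a_copy']
print(df_reduced.columns.tolist())   # ['a', 'b']
```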