
Practical 2

Aim: Perform the following data pre-processing (feature selection/elimination) tasks using Python.

Theory: Feature selection is a method of reducing data dimensionality during predictive analysis. It uses a variable ranking technique to order and select variables, and the selection of features is independent of the classifiers used. Ranking expresses how useful and important each feature is expected to be for classification.

Dataset Description:

The Life Expectancy (WHO) dataset contains columns on factors that help predict life expectancy.
Total rows: 2938
Total columns: 22

Various data pre-processing techniques:

Univariate Feature Selection: Univariate feature selection examines each feature individually to determine the strength of its relationship with the response variable.

Recursive Feature Elimination: This method fits a model and removes the weakest feature(s) until the specified number of features is reached. RFE requires the number of features to keep to be specified in advance, although it is often not known how many features are actually useful.

PCA: This method is used for dimensionality reduction. It applies simple matrix operations from linear algebra and statistics to project the original data into the same number of dimensions or fewer.

Correlation Matrix: This matrix shows the relation between each and every pair of features in the dataset, i.e. it shows the relation of one feature with all other features, including itself.

Task 1 Univariate Feature Selection:

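A minimal sketch of univariate selection with scikit-learn's `SelectKBest`, scoring each feature independently against the target. Synthetic regression data stands in here for the Life Expectancy CSV, which is not bundled with this post; the choice of `k=5` is illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for the Life Expectancy data
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Score each feature individually against the target and keep the top 5
selector = SelectKBest(score_func=f_regression, k=5)
X_new = selector.fit_transform(X, y)

print(X_new.shape)            # (200, 5)
print(selector.get_support()) # boolean mask of the kept columns
```

Because each feature is scored in isolation, this method is fast but cannot account for interactions between features.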


Task 2 Recursive Feature Elimination:
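RFE can be sketched as follows with scikit-learn, wrapping a linear regression estimator; again synthetic data stands in for the actual dataset, and keeping 5 features is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Repeatedly fit the model and drop the weakest feature until 5 remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5, step=1)
rfe.fit(X, y)

print(rfe.support_)  # True for the features that were kept
print(rfe.ranking_)  # 1 = selected; larger values were eliminated earlier
```

Because a model is refit at every elimination step, RFE is more expensive than univariate scoring but takes feature interactions into account.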


Task 3 Heatmap:
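A heatmap of the correlation matrix can be produced with pandas and seaborn. The small synthetic frame below (with assumed column names like `Schooling` and `GDP`) merely stands in for the real dataset's columns.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Small synthetic frame standing in for the Life Expectancy columns
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Schooling": rng.normal(12, 2, 100),
    "GDP": rng.normal(5000, 1500, 100),
})
df["Life expectancy"] = 0.5 * df["Schooling"] + rng.normal(0, 1, 100)

corr = df.corr()  # pairwise Pearson correlations; the diagonal is 1.0
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.tight_layout()
plt.savefig("heatmap.png")
```

Features whose row in the heatmap shows a strong (positive or negative) correlation with the target column are candidates to keep; pairs of features strongly correlated with each other are candidates for removal as redundant.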

Task 4 PCA
PCA helped by keeping only the dominant components relevant to the target variable: our accuracy increased from 46% to 54%.
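The PCA step can be sketched like this: project the features onto a few principal components, then train a classifier on the projected data. Synthetic classification data and the choice of 5 components are assumptions for illustration; the accuracy printed here will not match the 46%/54% figures from the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Project onto the top 5 principal components (fit on training data only)
pca = PCA(n_components=5)
X_train_p = pca.fit_transform(X_train)
X_test_p = pca.transform(X_test)

model = LogisticRegression(max_iter=1000).fit(X_train_p, y_train)
acc = model.score(X_test_p, y_test)

print(pca.explained_variance_ratio_.sum())  # variance retained by the 5 components
print(acc)                                  # test accuracy on the projected data
```

Note that PCA is fit on the training split only and then applied to the test split, so no information from the test set leaks into the projection.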

Questions and Answers:

1. What is the impact on model performance for selecting only correlated features?
A. After selecting only the features with a strong correlation to the target variable, I found an increase in the accuracy of the model; it improved, even if only slightly.

2.  Amongst all methods, which method avoids overfitting and improves model performance?
A. RFE is better suited to avoiding overfitting because it is a wrapper method and therefore generalizes better on the data. Wrapper methods evaluate multiple models, adding and removing predictors to find the combination that maximizes model performance.

Here is the Google Colab link: link


