The New Method for Removing Highly Correlated Variables from Datasets
Abstract
Reducing data dimensionality is necessary for optimal model performance in machine learning. Two different approaches are used in practice to solve this problem. The basic idea of the first is to reduce dimensionality by removing highly correlated variables, since high correlation implies that multiple variables measure the same thing. This is done by removing all variables with high average correlation. In this way, variables are removed regardless of their significance to model accuracy, and as a result model accuracy can drop significantly.
The second approach aims at a better understanding of the data and of the relationship between the variables and the model outcome: the impact of each variable on the model outcome is quantified, and the variables are ranked according to these values. With this approach, the variables with the lowest importance are removed from the dataset, which can lead to an increase in the performance and accuracy of the final model.
In datasets with highly correlated variables (e.g. sets of spectroscopy data), the most important variables can be those with the highest average correlation, and removing those variables can significantly reduce the accuracy of the model. Based on these facts, a new method that uses the most important variables with the lowest correlation is proposed as a combination of the previous two. With this approach it is possible to reduce dataset dimensionality significantly, such that the remaining variables have small correlation.
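The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way the combination could work, under stated assumptions: variables are visited from most to least important, and a variable is kept only if its absolute correlation with every variable already kept stays below a threshold. The function names, the `max_corr` threshold, and the use of absolute correlation with the outcome as the importance score are all hypothetical choices made for illustration, not details taken from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)


def select_variables(columns, target, max_corr=0.9):
    """Keep important variables that are weakly correlated with each other.

    columns  -- dict mapping variable name to a list of values
    target   -- list of outcome values
    max_corr -- hypothetical threshold on absolute pairwise correlation
    """
    # Stand-in importance score (assumption): |correlation with the outcome|.
    # A model-based importance measure could be substituted here.
    importance = {
        name: abs(pearson(values, target)) for name, values in columns.items()
    }
    ranked = sorted(columns, key=lambda name: importance[name], reverse=True)

    kept = []
    for name in ranked:
        # Keep the variable only if it is not highly correlated
        # with any more important variable that was already kept.
        if all(abs(pearson(columns[name], columns[k])) < max_corr for k in kept):
            kept.append(name)
    return kept


# Toy data (made up for illustration): "b" is a near-duplicate of "a".
data = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [1.1, 2.0, 3.2, 3.9, 5.1, 6.0],
    "c": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "d": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
}
outcome = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
selected = select_variables(data, outcome, max_corr=0.9)
print(selected)
```

In this toy example the most important variable "a" is retained even though it is highly correlated with "b"; only the less important member of the correlated pair is dropped, which is the behaviour the abstract contrasts with filtering on average correlation alone.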
Keywords:
Machine learning / highly correlated data / variable importance / data dimensionality
Source:
CNN Tech 2018 "International Conference of Experimental and Numerical Investigations and New Technologies", Book of Abstracts, 2018, 13-13
Institution/group
Mašinski fakultet
TY - CONF
AU - Dragičević, Aleksandra
AU - Kosić, Boris
AU - Jeli, Zorana
PY - 2018
UR - https://machinery.mas.bg.ac.rs/handle/123456789/5606
C3 - CNN Tech 2018 "International Conference of Experimental and Numerical Investigations and New Technologies", Book of Abstracts
T1 - The New Method for Removing Highly Correlated Variables from Datasets
SP - 13-13
UR - https://hdl.handle.net/21.15107/rcub_machinery_5606
ER -
Dragičević, A., Kosić, B., & Jeli, Z. (2018). The New Method for Removing Highly Correlated Variables from Datasets. In CNN Tech 2018 "International Conference of Experimental and Numerical Investigations and New Technologies", Book of Abstracts, 13-13. https://hdl.handle.net/21.15107/rcub_machinery_5606