Performance comparison of nonlinear and linear regression algorithms coupled with different attribute selection methods for quantitative structure - retention relationships modelling in micellar liquid chromatography
Abstract
In micellar liquid chromatography (MLC), adding a surfactant to the mobile phase in excess alters the mobile phase's solubilising capacity and changes the properties of the stationary phase. As a consequence, predicting analyte retention in MLC mode is a challenging task. Mixed Quantitative Structure-Retention Relationships (QSRR) modelling is a powerful tool for estimating analyte retention. This study compares 48 successfully developed mixed QSRR models with respect to their ability to predict the retention of aripiprazole and its five impurities from molecular structures and factors that describe the Brij-acetonitrile system. The models were developed by automatically combining six attribute (feature) selection methods with eight predictive algorithms and optimizing the hyper-parameters. The feature selection methods were Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF), ReliefF, Multiple Linear Regression (MLR), Mutual Info and F-Regression. The predictive algorithms were Linear Regression (LR), Ridge Regression, Lasso Regression, Artificial Neural Networks (ANN), Support Vector Regression (SVR), Random Forest (RF), Gradient Boosted Trees (GBT) and k-Nearest Neighbours (k-NN). The data set for model building (78 cases in total) was obtained by conducting 13 experiments for each of the 6 analytes and recording the target responses. The experimental settings were defined by varying the concentration of Brij L23, the pH of the aqueous phase and the acetonitrile content of the mobile phase according to a Box-Behnken design. In addition to the chromatographic parameters, the pool of independent variables included 27 molecular descriptors from all major groups (physicochemical, quantum chemical, topological and spatial structural descriptors).
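The run count can be checked with a minimal sketch of a three-factor Box-Behnken design in coded units (-1, 0, +1). The single centre point is an assumption; it is consistent with the 13 experiments reported, but the abstract does not state it. The function name `box_behnken` and the factor order (Brij L23 concentration, pH, acetonitrile content) are illustrative only.

```python
from itertools import combinations, product


def box_behnken(n_factors, n_center=1):
    """Return Box-Behnken runs in coded units (-1, 0, +1).

    For every pair of factors, the pair takes all four (+/-1, +/-1)
    combinations while the remaining factors stay at their mid level (0).
    Centre points (all factors at 0) are appended at the end.
    """
    runs = []
    for i, j in combinations(range(n_factors), 2):
        for a, b in product((-1, 1), repeat=2):
            run = [0] * n_factors
            run[i], run[j] = a, b
            runs.append(tuple(run))
    runs.extend([(0,) * n_factors] * n_center)
    return runs


# Assumed factor order: Brij L23 concentration, pH, acetonitrile content.
design = box_behnken(3)   # one centre point is an assumption
n_analytes = 6            # aripiprazole and its five impurities
n_cases = len(design) * n_analytes

for run in design:
    print(run)
print(len(design), "runs x", n_analytes, "analytes =", n_cases, "cases")
```

With three factors, the three factor pairs contribute 12 edge-midpoint runs, and the centre point brings the total to 13 runs, hence 78 cases over the six analytes.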
The best model was chosen on the basis of the Root Mean Square Error (RMSE) and the cross-validation (CV) correlation coefficient (Q²). Interestingly, the comparative analysis indicated that changing the set of input variables had only a minor impact on the performance of the final models. In contrast, the regression algorithms differed greatly in their ability to learn the patterns present in the data. Testing many regression algorithms is therefore necessary in order to find the most suitable technique for model building. In this specific case, GBT-based models demonstrated the best ability to predict the retention factor in MLC mode. Steric factors and dipole-dipole interactions proved to be relevant to the observed retention behaviour. Although small in scale, this study is a promising starting point for comprehensive MLC retention prediction.
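The model grid and the two selection criteria can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. Q² is taken here as 1 - PRESS / SS_tot computed from cross-validated predictions, a common definition that the abstract does not spell out. The toy values in `y_obs` and `y_cv` are invented for the example and are not results from the study.

```python
from itertools import product
from math import sqrt

# Method names as listed in the abstract; the lists only label the grid.
SELECTORS = ["PCA", "NMF", "ReliefF", "MLR", "Mutual Info", "F-Regression"]
REGRESSORS = ["LR", "Ridge", "Lasso", "ANN", "SVR", "RF", "GBT", "k-NN"]

# Every selector paired with every regressor: 6 x 8 candidate models.
combos = list(product(SELECTORS, REGRESSORS))


def rmse(y_obs, y_pred):
    """Root Mean Square Error between observed and predicted values."""
    n = len(y_obs)
    return sqrt(sum((o - p) ** 2 for o, p in zip(y_obs, y_pred)) / n)


def q2(y_obs, y_cv):
    """Cross-validated Q^2 = 1 - PRESS / SS_tot (assumed definition).

    y_cv must hold predictions made for each case while that case was
    held out of training (e.g. by k-fold or leave-one-out CV).
    """
    mean_obs = sum(y_obs) / len(y_obs)
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_cv))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - press / ss_tot


# Invented toy data, for illustration only (not from the study).
y_obs = [1.0, 2.0, 3.0, 4.0]
y_cv = [1.1, 1.9, 3.2, 3.8]

print(len(combos), "candidate models")
print("RMSE =", round(rmse(y_obs, y_cv), 4))
print("Q2   =", round(q2(y_obs, y_cv), 4))
```

A lower RMSE and a Q² closer to 1 both indicate better predictive ability, so ranking the 48 candidates on these two values gives a simple way to pick the final model.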
Keywords:
Hyper-parameter optimization / Machine learning / Mixed QSRR / MLC / Molecular descriptors / Retention prediction
Source:
Journal of Chromatography A, 2020, 1623
Publisher:
- Elsevier
DOI: 10.1016/j.chroma.2020.461146
ISSN: 0021-9673
WoS: 000538809300005
Scopus: 2-s2.0-85084613665
Collections
Institution/Community
Pharmacy
URL: https://farfar.pharmacy.bg.ac.rs/handle/123456789/3585
Krmar, J., Vukićević, M., Kovačević, A., Protić, A., Zečević, M., & Otašević, B. (2020). Performance comparison of nonlinear and linear regression algorithms coupled with different attribute selection methods for quantitative structure - retention relationships modelling in micellar liquid chromatography. Journal of Chromatography A, 1623. Elsevier. https://doi.org/10.1016/j.chroma.2020.461146
Krmar J, Vukićević M, Kovačević A, Protić A, Zečević M, Otašević B. Performance comparison of nonlinear and linear regression algorithms coupled with different attribute selection methods for quantitative structure - retention relationships modelling in micellar liquid chromatography. Journal of Chromatography A. 2020;1623. doi:10.1016/j.chroma.2020.461146
Krmar, Jovana, Vukićević, Milan, Kovačević, Ana, Protić, Ana, Zečević, Mira, Otašević, Biljana, "Performance comparison of nonlinear and linear regression algorithms coupled with different attribute selection methods for quantitative structure - retention relationships modelling in micellar liquid chromatography" in Journal of Chromatography A, 1623 (2020), https://doi.org/10.1016/j.chroma.2020.461146.