Exploring Quantitative Structure-Activity Relationships (QSARs) of Non-Tri cyclic Cyclooxygenase-2 (COX-2) Inhibitors by MLR and PC-ANN

Quantitative structure–activity relationship study using principal component artificial neural network (PC-ANN) methodology was conducted to predict the inhibitory activities expressed as pIC 50 of 73 non-tri cyclic cyclooxygenase-2 (COX-2) inhibitors. The results obtained by MLR shows that the best two models are close to each other with regression coefficient of 0.85. These optimal models were further analyzed by PC-ANN and the best model obtained was with regression coefficient of 0.823 for the test set. The lowest prediction sum of squares (PRESS) value obtained for the prediction set is 4.727 which accounts for predictability of the model. Artificial neural networks provide improved models for heterogeneous data sets without splitting them into families. Both the external and cross-validation methods are used to validate the performances of the resulting models. Randomization test is employed to check the suitability of the models.


INTRODUCTION
Cyclooxygenase-2 (COX-2) inhibition has been one of the most investigated areas of research in the most recent decade owing to its essential role in relieving pain and other inflammatory conditions.Non-steroidal anti-inflammatory drugs (NSAIDs) are deeply used in the treatment of wide variety of inflammatory conditions.NSAIDs are anti-pyretic, analgesic activities and are prescribed as first choice in the treatment of arthritis, rheumatisms and other degenerative of inflammatory joint disease as well as reliving the pains of everyday life.From a historical point of view, the first NSAID with therapeutic benefits was aspirin, which has now been used for more than 100 years as a NSAID.However, these drugs are associated with high risk of gastrointestinal and renal adverse effects.NSAIDs act by inhibition of cyclooxygenase (COX), the enzyme involved in the biosynthesis of prostaglandins and prostacyclins from arachidonic acid.Prostaglandins are involved in physiological functions such as protection of the stomach mucosa, aggregation of platelets and regulation of kidney function.They also have pathological functions such as their involvement in inflammation, fever and pain.
Cyclooxygenase exists in at least two isoforms, namely, the constitutive cyclooxygenase-1 (COX-1) and the inducible cyclooxygenase-2 (COX-2).Inhibition of COX-1 is accountable for the adverse gastrointestinal and renal effects of NSAIDs while the inhibition of COX-2 accounts for NSAIDs therapeutic effects.All classical NSAIDs, such as aspirin and ibuprofen can inhibit both COX-1 and COX-2, but bind more strongly to COX-1.Selective COX-2 inhibitors have the same anti-inflammatory, anti-pyretic, and analgesic activities as do nonselective NSAIDs but without causing gastric ulceration, bleeding and perforation.
Increasing selectivity for COX-2 also increased toxicity, since the anti-thrombotic prostacyclin is formed by COX-2 and inhibiting its synthesis precipitated heart attacks.The problem of this side action has not yet been resolved.Is it possible to obtain the benefit of low gastrotoxicity and avoid the danger of a heart attack?Perhaps lumiracoxib has provided the solution.However, it now appears that by taking any NSAID, patients risk experiencing a heart attack [1][2][3][4][5][6].The review [7] provides the various structural classes of selective COX-2 inhibitors with special emphasis on their structure activity relationships.In this study, we concentrate on Non-Tricyclic compounds which lack the cyclic central core and have monocyclic or bicyclic structures.
The importance of developing selective COX-2 inhibitors is manifested by the deep efforts dedicated in this field that resulted in the synthesis of hundreds of compounds, which displayed activity against COX-2.In this study, we are concerned in designing a new set of COX-2-selective inhibitors based on simple but statistically sound Quantitative Structure Activity Relationship (QSAR) models whose parameters can be easily obtained by means of commonly available and less costly computational programs.
Quantitative structure activity relationship (QSAR) is the quantitative correlation of structural properties of a compound with its chemical, physical, pharmaceutical, or biological effect.Based on this assumption, many trials were made to correlate various physicochemical properties of a set of molecules with their experimentally known biological activity, and so QSAR goals are: (1) Prediction of the activity of untested molecules, depending on models developed using a series of molecules and (2) Constructing ideas about mechanism of action of a group of compounds leading to a design of new compounds of better activity and less toxicity.QSAR model development process is typically divided into three steps: data preparation, data analysis and model validation.
Data preparation starts by selection of the data set to be used; this may simply be the extraction of data from a database or may need additional experimental studies.There are two steps to complete data preparation: geometry optimization and descriptors calculation.Geometry optimization or minimization is finding the coordinates that represents the potential energy minimum for the molecular structure in its 3D form.Theoretical molecular descriptor is a value that describes the molecular structure numerically.These descriptors can be simple such as molecular weight or complex such as geometrical descriptors.
In data analysis, the first step is to decide which techniques for statistical analysis and correlation to be used.If our correlation models to be built are linear then we use multilinear regression (MLR) or non linear then we use artificial neural network (ANN).
Model validation is the final part of the model development process, the predictive power of the model is tested on an independent set of compounds, generally predictive power is the most important characteristics of the model and model predictivity is the ability of the model to predict accurately the target activity of a compound that was not used for model development.
In model validation step, most of validation processes implement the leave one out (LOO) and leave many out (LMO) cross-validation procedures.The most common outcome parameters resulted from cross-validation procedures are crossvalidated determination coefficient q 2 (R 2 cv) and root mean squares error (RMSE).High R 2 cv and low RMSE values is a result of good and more predictive model and that lead to better description of the observed data.
Multilinear regression (MLR) is multivariate statistical technique to examine the linear relationship between the single dependent variable (activity) and two or more independent variables (molecular descriptors).Collinearity, which often exists between independent variables, generates a severe problem in certain types of mathematical handling such as matrix inversion [8].As it was recently reviewed by Schneider and Wrede [9], the flexibility of ANN for finding out relationships that are more complex allows this method to be widely applied in QSAR studies.Both linear and nonlinear mapping functions can be modeled by configuring the network properly.To obtain powerful and accurate ANN models, one should train a subset of descriptors instead of all generated descriptors [10][11][12][13][14][15].
Billones et al. [16] performed QSAR study of COX-2 inhibitors belonging to nine chemical classes using semi-empirical (AM1) computed quantum mechanical descriptors and electrotopological sate (E-state) indices.Another study was performed by Gupta et al, [17] related to 3D-QSAR of some tetrasubstituted pyrazoles as COX-II inhibitors, a six point pharmacophore with 3 hydrogen bond acceptors, one hydrophobic group and two aromatic rings as pharmacophoric feature was developed.
This study aims to predict the inhibitory activity pIC50 of the data set in reference [18][19][20][21][22][23][24][25] as one group without splitting them into categorizes.This is achieved by applying ANN to develop new statistically validated QSAR models utilizing different types of descriptors.The strength and the predictive performance of the proposed models were verified using cross validation, chance correlation and external test set.Therefore, the motivation of this work is to provide QSAR models that will be used to predict inhibitory activity of unknown compounds and also these models may be used to design new drugs.

Software
Geometry optimizations were performed using HyperChem (Version 7.5; Hypercube, Inc, USA, http://www.hyper.com)at the AM1 level of theory.An AM1 optimization was chosen because it was developed and parameterized for common organic structures.Descriptors were calculated using HyperChem and DRAGON (Milano Chemometrics and QSAR Group, USA, evaluation version 5.0, http://www.disat.unimib.it/vhml)software.SPSS software (version 13.0, SPSS, Inc.) was used for the simple MLR analysis while ANN analysis was performed using MATLAB (Version 7.0.1 (R14), http://www.mathworks.com).

Chemical data and descriptors
A data set of 73non-tricyclic COX-2 inhibitors and their inhibitory activity (pIC50) were obtained from reference [18][19][20][21][22][23][24][25] and used in this study.These inhibitors and their inhibitory activities are included in Table S1 in the supporting information.
The structures of the compounds are drawn by hyperchem software.The resultant structures are 2D then we convert them to 3D.HyperChem software was used to optimize the different compound structures using AM1 semi-empirical level.The optimization was preceded by the Polak-Rebiere algorithm.To be sure that we reached global minima, geometry optimization was run multiple times with different starting points for each molecule.
In this study, a pool of 1481 descriptors classified into 18 different groups was calculated using Dragon software.The constant or nearly constant descriptors for all the 73 compounds were discarded from further analysis.Furthermore, chemical descriptors such as HOMO, LUMO and polarizability were calculated using HyperChem software.Depending on the HOMO and LUMO values, electrophylicity, electronegativity, hardness, and softness descriptors were calculated.Other descriptors such as surface area approximate, surface area grid, volume, mass, polarizability, hydration energy, octanol-water partition coefficient (logP), and refractivity were calculated.Discarding highly inter-correlated (r>0.95) descriptors and following the procedure described in the next section, this number of descriptors was declined to 14 descriptors in the "final" MLR regression model (model 14 in Table 1).

Multiple linear regression (MLR) analysis
Multiple linear regression analysis with stepwise selection and elimination of variables was employed to model the inhibitory activity (pIC50) relationships with each group of descriptors separately.Log1/IC50 is the dependent variable and the set of descriptors as independent variables.Then, the -optimal‖ descriptors for each group were selected and gathered in one group to perform new MLR analysis.

Principal components analysis (PCA)
Collinear descriptors add redundancy to the input data matrix and consequently the performances of the models obtained by using these descriptors would be degraded.PCA and more specifically factor analysis, groups together variables that are collinear to form a composite indicator capable of capturing as much of common information of those indicators as possible.Each factor reveals the set of variables with the highest relationship.The idea under this approach is to explain the highest possible variation in the indicators set using the smallest possible number of factors.Consequently, the index no longer depends upon the dimensionality of the data set but it is rather based on the 'statistical' dimensions of the data.Application of PCA on a descriptor data matrix results in a loading matrix containing factors or PCs, which are orthogonal and therefore have no correlation with each other.The PC's were calculated by singular value decomposition (SVD) method in MATLAB environment (MathWork Inc.Version 7.0.1 (R14)).Due to the quality of data, a previous treatment of the data is essential before applying the multivariate analysis methods.Scaling and centering is one of the pre-processing methods needed before performing the regression methods joint with feature extraction.Projection methods results depend on the normalization of the data.Descriptors with small absolute values have a small contribution to overall variances leading to biased PC's caused by the presence of other descriptors with higher values.In order to have the focus on the important variables in the model, equal weights are assigned to each descriptor, with appropriate scaling.Furthermore, descriptors were standardized to unit variance and zero mean (autoscaling) to give all variables the same importance.Then, the data matrix containing the entire set of descriptors and activity were simultaneously subjected to PCA.

Principal component-artificial neural network (PC-ANN) analysis
ANNs are computer-based models in which a number of nodes, also called neurons are interconnected by links forming netlike structure ‗‗layers.''A variable value is assigned to every neuron.
There are three kinds of neurons: (a) the input neurons which receive their values from independent variables and constitute the input layer, (b) the hidden neurons which collect values from other neurons, giving a result that is passed to a successor neuron, (c) the output neurons which take values from other units and correspond to different dependent variables, forming the output layer.In this sense, network architecture is commonly represented as I-H-O, where I, H, and O are the number of neurons in the input, hidden, and output layers, respectively.
The weights are links between units that condition the values assigned to the neurons.The weights are adjusted through a training process in order to minimize network error.For this, a non-linear transfer function relates the input parameters with the outputs.Commonly neural networks are adjusted, or trained, so that a particular input leads to a specific target output.
In PC-ANN analysis, as a preliminary treatment, the input data (i.e., molecular descriptors) were normalized to have zero mean and unity variance, and then were subjected to PCA before being introduced into the neural network.It should be illustrated that for each MLR resulted model, separate ANN models were developed so that the input's descriptors were the subsets selected by the stepwise MLR methods.In the case of each MLR model, a feed-forward neural network with back-propagation of error algorithm was constructed to model the activity-structure relationships between the descriptors on one hand and inhibitory activity on the other hand.The model development in ANN and the network architecture is fully described by us [13] and others [14]

MLR analysis
In continuation to recent QSAR studies [26][27][28][29] done using similar methods, we developed an ANN-QSAR model that describes the inhibitory activity of a series of compounds using large number of different descriptors.MLR were performed on each one of the groups of descriptors individually (individual approach described in Ref. [30] by Deeb) where pIC50 is the dependent variable.Stepwise method is used to develop multilinear equation by correlating dependent variable (activity) and the best independent variables.
Next, a new or -final‖ MLR analysis was performed by correlating the dependent variable (activity) and the optimal descriptors selected from the individual MLR models.Table 1 shows the regression models suggested from the -final‖ MLR analysis.The number of descriptors in these models is varied between 1 and 14.Model 14 and 15 are close to each other with one descriptor less.The highest coefficient of determination (R 2 ) obtained, is 0.720 for a regression model with 14 descriptors (model 14).Table 2 shows a key for the different descriptors used in the final MLR model.S2 shows also those models 14 and 15, have the highest R 2 and R 2 cv values as well as the lowest root mean square error (RMSE) values.Also, PRESS/SST is less than 0.4 for these models.Thus, models 14 and 15 were chosen for further analysis with ANN.

PCA
The inputs of the ANN were the subset of the descriptors used in different MLR models (Table 1).First, PCA was performed to classify the molecules into training (60%), validation (20%) and test (20%) sets.Figure 1 shows the first and second PC's for the factor spaces of the descriptors and COX-2 inhibitory activity data.According to the pattern of the distribution of the data in factor spaces (Figure 1), the training, validation and test sets molecules were selected homogenously so that molecules in different zones of Figure 1 belong to the three subsets.As we can also see from figure 1 that compounds number 37 and 41 are outliers, which mean that those two compounds behave in a different way in comparison to the rest of other compounds with respect to activity and descriptors.Therefore, the total number of compounds now is 71 compounds.The molecules subjected to the preliminary treatment mentioned previously, the classified data were used as an input for the ANN.

ANN
In this study, a three-layered feed-forward ANN model with back propagation learning algorithm [32] was employed.At first, non-linear relationship between the subset of descriptors selected by stepwise selection-based MLR and COX-2 inhibitory activity was preceded by ANN models with similar structure.The number of hidden layer's nodes was set to 7 for all models, and the number of nodes in the input layer was the number of descriptors.
The correlation coefficients and cross-validation parameters of ANN analysis for ANN model numbers 14 and 15 are given in Table S3 in the supporting information.This table shows that the results of the two models are close to each other.However, model 14 seems to be better than model 15 since it has higher correlation coefficient for the test set as well as lower relative standard error of prediction.
To optimize the performance of the ANN models 14 and 15, these models were trained using different number of hidden nodes up to 20.Choosing the best model was based on cross-validation parameters and determination of minimum prediction error [33].For the evaluation of the predictive ability of a multivariate calibration model, RMSEP is an important statistical parameter to find the best number of hidden nodes.Moreover, because large numbers of hidden nodes often draw attention to the risk of overfitting [34], considering models with low prediction error is avoided if a large number of hidden nodes are used in their network training.
The results of optimizing the number of hidden nodes for models 14 and 15 are summarized in table S4 and table S5 in the supporting information respectively.
Figure 2 shows the PRESS values against the number of hidden nodes as well as the regression factor against number of hidden nodes for model 14 and 15.This figure shows that the lowest PRESS (4.727) is obtained when using 7 hidden nodes for model 15 with regression coefficient for the test set of 0.823.For model 14, the lowest PRESS (5.928) is obtained when using 12 hidden nodes with regression coefficient for the test set of 0.757.

Figure 2. PRESS against number of hidden nodes as well as regression factor against number of hidden nodes for model 14 and 15 respectively.
Randomization test is performed to investigate the probability of chance correlation for the optimal models (models 14 and 15 with 12 and 7 hidden nodes in the network, respectively).Chance correlation was done using the same configuration parameters and the same activation functions of all our ANN models.The results of chance correlation for models 14 (using 12 hidden nodes) and 15 (using 7 hidden nodes) are summarized in Tables S6 and S7 in the supporting information, respectively.These tables show that the coefficients of determination obtained by chance are low in general while the RMSE values are high.This indicates that the models obtained from ANN are better than those obtained by chance.As we can see, our models were validated by calculating different statistical parameters, using external test set and finally performing randomization test.
Figure 3 shows plot of the predicted activity against observed ones for the training and test sets compounds as well as their residuals for models 14 and 15 To check the presence of more outliers in a model, for the training and test sets, the standard deviation of the observed activity data was calculated.The residue which is equal to the difference between the predicted and observed one were calculated also.Finally, if the value of the residue is larger than two times the standard deviation of the observed activity, then this point is considered as an outlier.We found that there was no outlier in our data.

COMPARISON WITH OTHER QSAR STUDIES
Few QSAR studies [35][36][37][38][39][40] have been performed on COX-2 inhibitors in the literature.But the data set of these studies was smaller than the data set used in our study.In these studies they used one core while in our study we used different cores and perform QSAR on the whole data without splitting them into cores or families.The previous studies used 3D-QSAR and others used topological indicies while in our study we performed QSAR by applying PC-ANN method and using pool of descriptors.Finally, our results indicate that the proposed models have better predictivity than the models proposed by other studies.Our models may be used to design new COX-2 inhibitors.
Recently [41], we carried out a QSAR study of 48 tricyclic COX-2 inhinitors using MLR and PC-ANN.The results obtained by PC-ANN give advanced regression models with good prediction ability.

CONCLUSIONS
MLR as well as ANN modeling method combined with the individual [30] factor selection approach is applied to predict the COX-2 inhibitory activity of a set of 73 non-tricyclic compounds.The results obtained by MLR shows that the best two models are close to each other with regression coefficient of 0.85.These optimal models were further analyzed by PC-ANN and the best model obtained was with regression coefficient of 0.823 for the test set.The lowest prediction sum of squares (PRESS) value obtained for the prediction set is 4.727 which accounts for predictability of the model.. ANN provides improved models for heterogeneous data sets without splitting them into families and gives good regression models with good prediction ability.
Generally, the models obtained from MLR analysis are better than those obtained by ANN analysis which may account for linear relation between the inhibitory activity of these 73 non-tricyclic inhibitors and descriptors.Both the external and cross-validation methods are used to validate the performances of the resulting models.Employed randomization test indicates that the models obtained from ANN are better than those obtained by chance.

Supplementary material
. The data set was divided into three subsets: training, validation and external test sets.The training and the validation sets are the norm in all model training processes.The test set is used to test the trend of the prediction precision of the model trained at some point of the training evolution.The extracted PC's for each MLR model were classified homogenously, based on the factors space of the descriptors, into training set (60%), validation set (20%) and external test set (20%) according to the PCA and the first two PC's were plotted against each other (see Figure1).Afterward, the training set was used to optimize the network performance.The regression between the network output and the observed activity was calculated for each set individually.The training function 'trainscg' was used to train the network.To find models with lower errors, the ANN algorithm was run many times, with different geometry and initial weights each time.

Figure 1 .
Figure 1.First and second principal components for the factor spaces of the descriptors and non-tricyclic COX-2 inibitory activity data.

Figure 3 .
Figure 3. Predicted inhibitory activities against observed ones and their residuals for model 14 and 15 respectively.The correlation between calculated and observed pIC50 for the training set of model 14 is given by: Calculated pIC50 = 0.534 Observed pIC50 + 2.783 (1) And for the test set of this model is given by: Calculated pIC50 = 0.4252 Observed pIC50 + 3.486 (2) While the Correlation between calculated and observed pIC50 for the training set of model 15 is given by: Calculated pIC50 = 0.432 Observed pIC50 + 3.393(3)And for the test set of this model is given by: Calculated pIC50 = 0.503 Observed pIC50 + 3.80(4)

25 C
Compounds classified in the training or calibration set, P compounds classified in the external test set (prediction set), V compounds classified in the validation set.Out compounds classified as outliers.

Table 1 . Final MLR model summary.
According to the above equation, the most important descriptor in this equation is G3m which is related to the 3rd component symmetry directional WHIM index / weighted by mass; it is directly proportional to the activity of the compounds.The second important descriptor is G2u which is related to 2nd component symmetry directional WHIM index / unweighted; it is directly proportional to the activity of the compounds.

Table 2 . Key for the different descriptors used in the final MLR model. Descriptor symbol Description
[31], leave one out (LOO) cross validation was performed on models 10-15 since these models have coefficients of determination larger than 0.6[31].The results of LOO cross validation are summarized in TableS2in the supporting information.This table shows that the cross-validation coefficient of determination (R 2 CV) has positive values starting from model 10 to model 15.Table

Table S1 . Molecular structures and observed inhibitory activities of the 73 non-tricyclic COX-2 inhibitors expressed as pIC50.
*Ref 19 D e c e m b e r 24, 2 0 1 4 *Ref 21 D e c e m b e r 24, 2 0 1 4