### Problem formulation

This work presents a regression-based approach for predicting binding affinity scores. The input data consists of: *D* = {*d*_{1}, *d*_{2}, …, *d*_{n}}, where *D* represents the drug space and *n* is the number of drugs; *T* = {*t*_{1}, *t*_{2}, …, *t*_{m}}, where *T* represents the target protein space and *m* is the number of target proteins; and *Y* = {*y*_{ij}}, the labels (i.e., continuous binding affinity values), where *y*_{ij} is the binding score between the drug_{i}-target_{j} pair and *N* is the number of observed affinities. For each drug-target pair, we extracted different features using several techniques, explained later in the feature extraction section. The feature vectors (FVs) are represented by *X* ⊆ {*x*_{1}, *x*_{2}, …, *x*_{n×m}} and their labels (i.e., continuous values) by *Y* ⊆ {*y*_{1}, *y*_{2}, …, *y*_{n×m}}, where n×m is the number of all possible (drug, target) pairs. It is necessary to mention that, in the Davis dataset, all drug-target pairs have affinity values (i.e., labels), while in the KIBA and PDBBind Refined datasets, many pairs have ‘NaN’ labels, which we excluded from *X* and *Y*. Most previous works generated each drug-target pair’s features using different techniques applied to the SMILES of drugs and the sequences of proteins, respectively. However, to the best of our knowledge, no published work treats the binding affinity prediction problem as a network-based problem, which is what we address in our study.
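As a small illustration of this setup (using a hypothetical toy affinity matrix, not the actual datasets), the pairs with ‘NaN’ labels can be excluded from *X* and *Y* as follows:

```python
import numpy as np

# Toy n x m affinity matrix (hypothetical values); NaN marks an unobserved
# (drug, target) pair, as occurs in the KIBA and PDBBind Refined datasets.
Y_matrix = np.array([[5.0, np.nan, 7.3],
                     [np.nan, 6.1, 5.5]])

mask = ~np.isnan(Y_matrix)     # observed pairs only
pairs = np.argwhere(mask)      # (i, j) indices into the drug/target spaces
Y = Y_matrix[mask]             # label vector fed to the regressor
print(len(pairs), Y.tolist())  # 4 [5.0, 7.3, 6.1, 5.5]
```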

### Overview of the Affinity2Vec model

Figure 1 illustrates the pipeline for Affinity2Vec, which consists of four main parts:

- 1.
Input representations, data preprocessing, and constructing heterogeneous network for the DTBA.

- 2.
Applying embedding techniques on the drugs’ SMILES and targets’ sequences to generate feature representations for individual drugs and target proteins. Also, we integrated the binding affinity, drug similarity, and target similarity graphs to form the complete DTBA network.

- 3.
Applying three feature extraction techniques to generate three different feature types: the embedding features, the meta-path score features, and the hybrid features that combine the previous two types.

- 4.
Applying ML regressors to establish the regression models.

We explain each of these steps in more detail in the corresponding sections.

### Preprocessing and constructing DTBA network

In our study, we constructed a weighted heterogeneous graph *G(V, E)* from the DTBA network integrated with the target-target similarity and drug-drug similarity graphs, where *V* is the set of vertices (i.e., drug and target nodes) and *E* is the set of edges, each of one of three types: (1) *D-T* edges, which reflect the binding strength between drugs and targets, and (2) *D-D* edges and (3) *T-T* edges, which connect two drugs or two target proteins, respectively, whose similarity score is above a certain threshold. It is essential to mention that we used two similarity matrices for drugs and targets, (*DDsim1, TTsim1*) and (*DDsim2, TTsim2*), from different sources to construct heterogeneous graphs *G1* and *G2*, respectively, which are used later in graph-based feature extraction. We explain the formulation of the graph edges in more detail in the subsections that follow.

#### Filtering similarity edges

First, we normalized each similarity graph separately so that the similarity scores range within [0,1]. Before integrating the target-target similarity and drug-drug similarity graphs, we applied a preprocessing step by filtering each graph separately. If we included all similarity scores as edges connecting similar target proteins or similar drugs, the graph would become very complex, particularly when the number of targets or drugs is large. Therefore, we removed all weak similarity edges from the target similarity graph by filtering them against a specific threshold, and applied the same process to the drug similarity graph. We performed an empirical analysis on the target-target similarity and drug-drug similarity graphs separately, as explained and illustrated in the Supplementary Material, Fig. S1, to specify a threshold value for each graph. For the Davis dataset, for example, the threshold for the drugs’ 2D chemical structure similarity is set to 0.3, while the threshold for the targets’ normalized Smith-Waterman alignment similarity is set to 0.04. Each similarity edge below the threshold is removed, resulting in target-target similarity and drug-drug similarity subgraphs. The rationale for this process is to reduce the noisy information introduced when all similarity scores are used, since very weak similarity scores carry no informative meaning and can degrade performance. Moreover, decreasing the number of edges in each similarity graph reduces the model’s running time.
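The filtering step above amounts to thresholding a normalized similarity matrix; a minimal sketch (the matrix values are made up, and 0.3 is the Davis drug threshold mentioned above):

```python
import numpy as np

def filter_similarity_edges(sim, threshold):
    """Zero out (i.e., remove) all similarity edges below the threshold,
    keeping only the strong edges as a subgraph."""
    sim = np.asarray(sim, dtype=float)
    filtered = np.where(sim >= threshold, sim, 0.0)
    np.fill_diagonal(filtered, 0.0)  # no self-loops in the subgraph
    return filtered

# Hypothetical normalized drug-drug similarity matrix in [0, 1].
DDsim = np.array([[1.0, 0.45, 0.10],
                  [0.45, 1.0, 0.33],
                  [0.10, 0.33, 1.0]])

sub = filter_similarity_edges(DDsim, threshold=0.3)
print(sub)  # the 0.10 edge is removed; 0.45 and 0.33 survive
```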

#### Formulating binding affinity edges

In the Davis, KIBA, and PDBBind datasets, lower affinity values indicate stronger binding between the drug and the target. Thus, the binding affinity values carry a meaning similar to distance (the smaller the distance between two entities, the stronger the relation between them), which is the opposite of the similarity edges used in the target-target similarity and drug-drug similarity graphs. As a result, we need a consistent meaning for all edge types (the higher the edge value, the stronger the relationship) and a consistent value range (i.e., [0,1]) in this heterogeneous network. For this purpose, applying an exponential function is a good candidate, as it inverts the affinity values (an element with a high value is converted into a low value) and outputs them in the range [0,1]. Another function that fits our goal is the SoftMax function, a generalization of the logistic function (i.e., a normalized exponential function), which maps an input vector to a probability distribution that sums to one. We applied both functions to the binding affinity values of the three datasets separately and adopted the approach that gave the best result. The exponential and SoftMax \(\sigma\) functions^{50} are described in Eqs. (2) and (3), respectively.

$$f{(z)}_{i}= {e}^{-\alpha {z}_{i}}$$

(2)

$$\sigma {(z)}_{i} =\frac{{e}^{{z}_{i}}}{{\sum }_{j=1}^{K}{e}^{{z}_{j}}},\quad \text{for } i=1,\dots ,K \text{ and } z=({z}_{1}, \dots , {z}_{K}) \in {\mathbb{R}}^{K}$$

(3)

where *z* is the vector of affinity values, and the parameter \(\alpha\) in Eq. (2) can be tuned or estimated from the data. After preprocessing the affinity values, we augmented the DTBA graph with the new affinity values and the target-target similarity and drug-drug similarity subgraphs to construct the extensive DTBA heterogeneous network, which is used later in the graph-based feature extraction step.
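The two affinity-to-edge-weight transforms of Eqs. (2) and (3) can be sketched as follows (the choice of \(\alpha = 1\) here is arbitrary; as noted above, it is tuned in practice):

```python
import numpy as np

def exp_transform(z, alpha=1.0):
    """Eq. (2): large affinity-distance values map to small edge weights
    in (0, 1], inverting the 'distance' meaning into a 'strength' meaning."""
    z = np.asarray(z, dtype=float)
    return np.exp(-alpha * z)

def softmax_transform(z):
    """Eq. (3): normalized exponential; the outputs sum to one."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

w = exp_transform([0.0, 1.0, 2.0])   # monotonically decreasing weights
p = softmax_transform([1.0, 2.0, 3.0])
print(w, p.sum())
```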

### Learning representation for drugs and proteins

In this step, the inputs are the SMILES of the drugs and the amino-acid sequences for the target proteins, and then different techniques are applied to generate embeddings for each drug and each target protein.

### Drug encoder

To capture the critical properties of each drug and generate its features, we applied the sequence-to-sequence learning model (seq2seq, aka the encoder-decoder model)^{51}. Specifically, we adopted the seq2seq fingerprint model introduced in^{52}, an unsupervised, data-driven DL molecular embedding method that applies several modifications to the original seq2seq learning model to fit drug discovery tasks such as ours. Our objective is to learn the vital features of the drugs. We adjusted this version of the seq2seq model to encode drugs for our work for the following reasons: (1) the method is data-driven, requiring no expert knowledge; (2) the fingerprints (i.e., feature representations) it generates can reconstruct the original drug representations, which ensures that the information encoded in the fingerprint vector is sufficient; (3) the method is trained in an unsupervised manner on a substantial unlabeled dataset, exploiting the power of deep neural networks (DNN). The seq2seq model that we applied consists of a multi-layered Gated Recurrent Unit (GRU) network that maps the input into a fixed-dimension FV, and a deep GRU network that decodes the FV back to its original form. It is worth mentioning that this seq2seq model uses GRUs instead of the long short-term memory (LSTM) units usually used in encoder-decoder models; GRUs achieve similar performance while accelerating the training process. A dropout layer is attached to avoid overfitting during training. Finally, an extra fingerprint-extraction layer is attached to the perceiver (encoder) network only, from which we extract the resulting fingerprint embeddings for each drug.

Figure 2 shows an overview of the seq2seq model that we utilized. The input and the output of the seq2seq model are SMILES strings, which serve as the text representation. That is, we fed the drugs’ SMILES to the model so that the perceiver converts the SMILES strings into a fixed-sized FV, and the interpreter then translates it back to the original SMILES strings. The intermediate fixed-sized FV is extracted as the seq2seq fingerprint; it encodes all the essential information needed to recover the original drug representation. Hence, we expect the seq2seq fingerprint to capture precisely the information we can use for downstream ML tasks. In our work, we trained the model using around 85,900 SMILES, which we split into the tokens that can appear in a SMILES string (i.e., the alphabet for the seq2seq model). We supplied these tokens, together with the maximum SMILES length and other parameters, to the seq2seq model. Furthermore, we tuned several parameters of our seq2seq model using different sets of tested values (see Table 2), and we applied the Adam optimizer in the DL training process. Finally, we extracted the embeddings for each drug in the Davis, KIBA, and PDBBind Refined datasets by predicting each drug’s SMILES features.
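A minimal sketch of how the SMILES alphabet (the token set fed to the seq2seq model) and the maximum length can be derived from a training corpus; the two example SMILES and the character-level tokenization are simplifications (multi-character tokens such as "Cl" and "Br" can also be used):

```python
# Build a character-level token alphabet from SMILES strings, plus the
# maximum SMILES length, as inputs for a seq2seq model.
smiles_corpus = ["CC(=O)Oc1ccccc1C(=O)O",            # aspirin (example)
                 "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"]     # caffeine (example)

alphabet = sorted(set("".join(smiles_corpus)))
max_len = max(len(s) for s in smiles_corpus)

# Integer-encode each SMILES, padding to max_len with index 0.
token_to_idx = {tok: i + 1 for i, tok in enumerate(alphabet)}  # 0 = padding
encoded = [[token_to_idx[c] for c in s] + [0] * (max_len - len(s))
           for s in smiles_corpus]
print(len(alphabet), max_len)
```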

### Protein encoder

To generate meaningful feature representations for the target proteins, we applied ProtVec^{53}, a feature extraction method for protein amino-acid sequences. ProtVec is an unsupervised, data-driven, continuous distributed representation of biological sequences that captures a diverse range of informative biophysical and biochemical characteristics. As in the training process of the Skip-gram word embedding model in natural language processing (NLP), a large corpus is needed to train the distributed representation of biological sequences. Thus, for ProtVec, 546,790 amino-acid sequences of different proteins were downloaded from the SwissProt UniProt server^{54}. Those sequences were then broken down into subsequences (i.e., biological words) using the n-gram model. However, instead of using an overlapping window of 3 to 6 residues, the authors generated three lists of shifted non-overlapping words (*a 3-gram is a “biological” word consisting of 3 amino acids*), as shown in Fig. 3. Thus, ProtVec is trained on 1,640,370 (546,790 × 3) sequences of 3-grams through a Skip-gram neural network model that maximizes the probability of observed word sequences. We used the pre-trained model to generate the embeddings for each target protein in our datasets. We then represent each protein sequence by summing up its overlapping 3-gram feature representations.
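The overlapping 3-gram decomposition and the summed representation can be sketched as follows (the embedding lookup here is a random stand-in for the pre-trained ProtVec vectors, and the sequence is a toy example):

```python
import numpy as np

def to_3grams(seq):
    """All overlapping 3-grams ('biological words') of a protein sequence."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

def protein_embedding(seq, protvec, dim=100):
    """Represent a sequence as the sum of its 3-gram embeddings."""
    vec = np.zeros(dim)
    for g in to_3grams(seq):
        if g in protvec:       # skip 3-grams absent from the model
            vec += protvec[g]
    return vec

seq = "MKTAYIAK"               # toy amino-acid sequence
grams = to_3grams(seq)         # ['MKT', 'KTA', 'TAY', 'AYI', 'YIA', 'IAK']
rng = np.random.default_rng(0)
protvec = {g: rng.standard_normal(100) for g in grams}  # stand-in embeddings
emb = protein_embedding(seq, protvec)
print(len(grams), emb.shape)   # 6 (100,)
```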

### Feature extraction model

We developed three ways to extract the features for each drug-target pair:

- (1)
We applied graph mining techniques to compute the drug-target meta-path score features.

- (2)
We used the embedding FV for each drug and target described in the previous section.

- (3)
We produced a hybrid model that combines the features from the first two ways.

These features capture the characteristics of the target proteins, drugs, and DTBA network.

### Meta-path-based features

After learning a feature representation for each target and drug, we calculated the cosine similarity (CosSim) between the vector representations of each pair of drugs (i.e., the embeddings of *drug*_{i} and *drug*_{j}, corresponding to *v*_{i} and *v*_{j}, respectively) as follows:

$$CosSim({v}_{i},{v}_{j})=\frac{{v}_{i}\cdot {v}_{j}}{\Vert {v}_{i}\Vert \, \Vert {v}_{j}\Vert }$$

(4)
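Applied over all drug embeddings, Eq. (4) reduces to a normalized matrix product; a minimal sketch with random stand-in embeddings, including the subsequent min–max normalization:

```python
import numpy as np

def cosine_similarity_matrix(emb):
    """Pairwise cosine similarity (Eq. 4) between rows of an embedding matrix."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

def minmax_normalize(sim):
    """Rescale all similarity scores into [0, 1]."""
    return (sim - sim.min()) / (sim.max() - sim.min())

# Hypothetical drug embeddings (n = 4 drugs, d = 8 features).
emb = np.random.default_rng(1).standard_normal((4, 8))
DDsim2 = minmax_normalize(cosine_similarity_matrix(emb))
print(DDsim2.shape)  # (4, 4)
```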

Likewise, we applied the same process to the target proteins. We thus constructed two new similarity matrices: *TTsim2*, a target-target similarity matrix of size *m × m*, where m is the number of targets, and *DDsim2*, a drug-drug similarity matrix of size *n × n*, where n is the number of drugs. Then, we applied min–max normalization to each similarity graph to bring all similarity scores between 0 and 1. At this stage, we had created two extensive weighted heterogeneous DTBA networks with the same binding affinity graph but different drug-drug and target-target similarity graphs: *G1* (DTBA, *TTsim1* subgraph, *DDsim1* subgraph) and *G2* (DTBA, *TTsim2* graph, *DDsim2* graph). It is necessary to state that we excluded all test data from our model’s training process, including the binding affinity edges that belong to the test set. Then, we applied graph mining techniques similar to^{55} to extract features from the two graphs *G1* and *G2*, and either used the features from each graph individually or combined them. For each weighted heterogeneous graph *G1* and *G2*, we computed the drug-target path scores as described in Eq. (5):

$$score\left({d}_{i},{t}_{j}\right)= \sum_{q=1}^{l}\prod_{{w}_{x}\in {P}_{q}}{w}_{x}$$

(5)

*P* = {*p*_{1}*, p*_{2}*,…, p*_{q}*, …, p*_{l}} is the set of paths from drug_{i} to target_{j}. In our work, we limited the path length to three or less because of computational complexity. Thus, we obtained six path structures. Each path structure starts with a drug node, ends with a target node, and contains no cycle (each node in the path appears only once). These path structures are represented by C_{h}, where *h* = {*1, 2, 3, 4, 5, 6*}: C_{1}: (D-D-T), C_{2}: (D-T-T), C_{3}: (D-D-D-T), C_{4}: (D-T-T-T), C_{5}: (D-D-T-T), and C_{6}: (D-T-D-T). The meta-path score for each path *P*_{q} is the product of all weights *w*_{x} of the edges from *d*_{i} (drug node) to *t*_{j} (target node) in each path belonging to structure C_{h}, as follows:

$$score\left(h,{d}_{i}, {t}_{j}, q\right)=\prod_{{\forall w}_{x}\in {P}_{q}, { P}_{q}\in {R}_{ijh}}\left({w}_{x}\right)$$

(6)

The weight of a similarity edge between two nodes of the same type indicates the transition probability between those nodes, and the weight of a preprocessed binding affinity edge indicates the probable strength with which a drug binds to a target protein (a larger value indicates a greater degree of binding affinity). Therefore, multiplying edge weights ranging within [0,1] along a path penalizes longer paths, which fits our objective; summing edge weights along longer paths, by contrast, would conflict with our goal. Following our previously published work^{55}, we obtained 12 meta-path-based scores by calculating the sum and max scores (i.e., features) under each path structure, as shown in Eqs. (7) and (8), respectively.

$$SumScore\left(h,{d}_{i}, {t}_{j}\right)= {\sum }_{\forall q: { P}_{q}\in {R}_{ijh}}score\left(h,{d}_{i}, {t}_{j},q\right)$$

(7)

$$MaxScore\left(h,{d}_{i}, {t}_{j}\right)= {MAX}_{\forall q: { P}_{q}\in {R}_{ijh}}\left(score\left(h,{d}_{i}, {t}_{j},q\right)\right)$$

(8)

where *R*_{ijh} is the set of all paths between drug_{i} and target_{j} under path structure C_{h}. We applied the same procedure to both *G1* and *G2*, so our FV dimension is either 12 (per graph) or 24 when we combine the *G1* and *G2* FVs. It is worth mentioning that the meta-path score features were encoded in commuting matrices, as introduced in^{56}, which we calculated by multiplying two or three adjacency matrices (DTBA, target-target similarity, and drug-drug similarity); the length of the meta-path equals the number of multiplied adjacency matrices. Using matrix multiplication substantially accelerates the computation of the meta-path scores.
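For the SumScore of Eq. (7), the commuting-matrix trick reduces a path structure to a single matrix product; a minimal sketch for structure C_{1} (D-D-T) with made-up edge weights:

```python
import numpy as np

# Hypothetical edge-weight matrices of the heterogeneous graph (all in [0, 1]).
DD = np.array([[0.0, 0.4],       # drug-drug similarity subgraph (n x n)
               [0.4, 0.0]])
DT = np.array([[0.9, 0.0, 0.2],  # preprocessed binding affinity edges (n x m)
               [0.0, 0.7, 0.5]])

# SumScore for path structure C1 (D-D-T): for each (d_i, t_j), the sum over
# intermediate drugs d_k of DD[i, k] * DT[k, j] -- exactly (DD @ DT)[i, j].
C1_sum = DD @ DT
print(C1_sum)
```

The MaxScore of Eq. (8) replaces the sum over intermediate nodes with a maximum, which needs an element-wise formulation rather than a plain matrix product.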

### Embedding-based features

After automatically generating the feature representations for each drug and each target using different DL techniques (the seq2seq model for drugs and the ProtVec model for proteins), we obtained two embedding matrices, *f* for drugs and *Pv* for targets. The drug embedding matrix *f* has size *n × d* (*n* is the number of drugs and *d* is the feature dimension), and the target embedding matrix *Pv* has size *m × k* (*m* is the number of targets and *k* is the feature dimension, which is 100). Then, because our goal is to predict drugs and targets potentially linked with their binding values, we created an embedding FV for each drug-target pair by applying the concatenation operation, as shown in Fig. 1c. Thus, for any possible drug-target pair, we concatenated the drug FV with the target FV, resulting in a longer FV of dimension *d* + *k* (i.e., the number of features); the number of samples is *n × m* (i.e., the number of drugs multiplied by the number of targets).
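The concatenation step can be sketched as follows (the sizes are illustrative toy values, except k = 100, which matches the ProtVec dimension above):

```python
import numpy as np

n, m = 3, 2                 # number of drugs and targets (toy sizes)
d, k = 6, 100               # drug and target embedding dimensions
f = np.random.rand(n, d)    # drug embedding matrix (seq2seq fingerprints)
Pv = np.random.rand(m, k)   # target embedding matrix (ProtVec)

# One FV per (drug, target) pair: the drug FV concatenated with the target FV.
X = np.array([np.concatenate([f[i], Pv[j]])
              for i in range(n) for j in range(m)])
print(X.shape)  # (6, 106) -> n*m samples, d + k features
```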

### Hybrid model-based features

In this step, we concatenated the meta-path score features with the drug and target embedding features for all drug-target pairs.

### Regression model

After extracting all the drug-target pairs’ features, we normalized the features of the training and testing sets separately using min–max normalization^{57}. Then, the normalized FVs were fed with their labels Y (i.e., binding affinity values) into a supervised ML regression model to predict the binding affinity values. Any model’s prediction performance relies on identifying the most critical features of the studied dataset. Thus, we developed three models using the different sets of features explained above and tuned each model’s parameters using its feature set. We named these models *Affinity2Vec_Pscore* (graph-based features), *Affinity2Vec_Embed* (embedding-based features), and *Affinity2Vec_Hybrid* (both sets of features combined).

Unlike the well-known parametric linear regression model, which assumes the target variable can be expressed as a linear combination of the independent variables, gradient-boosted trees are nonparametric models that can approximate arbitrary functions from the data. They usually achieve better prediction performance than linear regressors. Hence, we utilized gradient boosting regression, particularly the Extreme Gradient Boosting Regressor (XGBoost)^{58}. The XGBoost regressor is implemented in an optimized distributed gradient boosting library, also named XGBoost^{59}, that is highly efficient, flexible, and portable. The objective of XGBoost in the prediction process is to minimize the loss function, which, in our case, is the mean squared error (MSE).
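A minimal sketch of the regression step on synthetic data; scikit-learn’s GradientBoostingRegressor stands in here for the XGBoost library’s `XGBRegressor`, which exposes a compatible `fit`/`predict` interface, and all data and hyperparameter values are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 24))                   # e.g., 24-dim meta-path FVs
y = X[:, 0] * 2 + rng.normal(0, 0.1, 200)   # synthetic affinity labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Hyperparameters mirror those tuned in the paper: number of trees,
# tree depth, learning rate, and row/feature subsampling.
model = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                  learning_rate=0.1, subsample=0.8,
                                  max_features=1.0, random_state=0)
model.fit(X_tr, y_tr)
mse = mean_squared_error(y_te, model.predict(X_te))  # the loss being minimized
print(round(mse, 4))
```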

We selected XGBoost because:

- 1.
It is much faster than other ensemble regressors.

- 2.
Its core algorithm is parallelizable, meaning that it exploits the power of the multi-core machine and works on GPUs.

- 3.
It can be optimized during the validation process to enhance the performance because it has a wide variety of tuning parameters.

In this study, all the Affinity2Vec models utilized the ML XGBoost regressor and a DL regressor to predict the affinity values. The DL model that we utilized is a feedforward artificial neural network (ANN) consisting of three fully connected (FC) layers with the rectified linear activation function (ReLU). XGBoost performed better than DL in our experiments, possibly because DL models work better with larger numbers of features; thus, we only reported the results of XGBoost. The most critical parameters that we optimized include, but are not limited to: (1) the number of trees in the ensemble, usually tuned by increasing the trees until no further improvement is seen; (2) the maximum depth of each tree; (3) the learning rate (often set to small values such as 0.1 or 0.01); (4) the subsample ratio, i.e., the fraction of data samples used in each tree; and (5) colsample_bytree, the fraction of features used in each tree, set to 1 when using all features.

### Evaluation metrics

To acquire a more realistic view of Affinity2Vec’s prediction performance, we implemented the same evaluation metrics widely used to evaluate the regression-based state-of-the-art methods. These metrics are: the mean squared error (MSE)^{60}, the concordance index (CI)^{61}, the regression toward the mean index (rm2)^{62}, and the area under the precision-recall curve (AUPR)^{63}.

#### Mean square error (MSE)

MSE^{60} is the most popular metric for evaluating regression-based models; it measures how close the fitted line is to the actual data points. Thus, MSE is also used as the loss function that the training model tries to minimize. MSE is defined as follows:

$$MSE = \frac{1}{n}{\sum }_{i=1}^{n}{({p}_{i}-{y}_{i})}^{2}$$

(9)

where *p* is the vector of predicted values, *y* is the vector of actual values, and *n* is the number of samples; MSE is thus the average of the squared differences between the predicted and actual values. The squaring ensures that negative differences do not cancel positive ones. The smaller the value of MSE, the better the regression model’s performance. The root mean squared error (RMSE), the square root of MSE, is used as a substitute for MSE in several studies.

#### Concordance index (CI)

CI has been widely used to evaluate the ranking performance of regression models^{61}. The CI of a set of paired data is the probability that two random pairs with different label values are predicted in the correct order, defined as follows:

$$CI = \frac{1}{Z} \sum_{{d}_{x}>{d}_{y}} h({b}_{x} - {b}_{y})$$

(10)

where *b*_{x} and *b*_{y} are the prediction values for the larger affinity *d*_{x} and the smaller affinity *d*_{y}, respectively, *Z* is a normalization constant, and *h(m)* is the Heaviside step function^{64}, defined as follows:

$$h\left(m\right)=\left\{\begin{array}{c}1, \quad if\, m>0\\ 0.5,\quad if\, m=0\\ 0,\quad if\, m<0\end{array}\right.$$

(11)
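A direct O(n²) sketch of Eqs. (10)-(11); optimized implementations exist, but this follows the definition literally:

```python
def concordance_index(y_true, y_pred):
    """CI: probability that two pairs with different true affinities
    are ranked in the correct order by the predictions (Eqs. 10-11)."""
    num, Z = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(n):
            if y_true[i] > y_true[j]:          # only pairs with d_x > d_y
                Z += 1
                diff = y_pred[i] - y_pred[j]   # b_x - b_y
                # Heaviside step function of Eq. (11)
                num += 1.0 if diff > 0 else (0.5 if diff == 0 else 0.0)
    return num / Z

print(concordance_index([1, 2, 3], [1, 2, 3]))  # 1.0 (perfect ranking)
```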

#### Regression toward the Mean (rm2 index)

The *rm*2 index^{65} has been extensively used to validate regression-based quantitative structure–activity relationship (QSAR) models. *rm*2 is a modified version of the squared correlation coefficient, also known as the coefficient of determination (*r*^{2}), implemented to evaluate a binding affinity model’s external predictive potential. An *rm*2 greater than 0.5 on the test data indicates acceptable performance. *rm*2 is defined using two other evaluation metrics: *r*^{2}^{66}, the proportion of variation in the outcome explained by the predictor variables, which corresponds to the squared correlation between the actual and predicted values in multiple regression models, and \({r}_{0}^{2}\), the squared correlation coefficient with zero intercept:

$${r}_{m}^{2} = {r}^{2} \times \left(1 - \sqrt{{r}^{2} - {r}_{0}^{2}}\right)$$

(12)

More details of the formulation are given in^{65,66}. The higher the *r*^{2} and *rm*2, the better the performance of the regressor. Note that we also used *r*^{2} in a nonparametric statistical approach called *Y*-Randomization (also known as *Y*-Scrambling)^{67}, which served as the final validation step.
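Eq. (12) can be sketched from the observed and predicted values as below; the zero-intercept formulation of \({r}_{0}^{2}\) used here is one common variant from the QSAR literature, so refs. 65, 66 should be consulted for the exact definition:

```python
import numpy as np

def rm2_index(y, p):
    """rm2 = r^2 * (1 - sqrt(r^2 - r0^2)), Eq. (12).
    r^2: squared Pearson correlation between actual and predicted values.
    r0^2: squared correlation with zero intercept (one common formulation)."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    r2 = np.corrcoef(y, p)[0, 1] ** 2
    k = np.sum(y * p) / np.sum(p * p)   # slope of the zero-intercept fit
    r0_2 = 1 - np.sum((y - k * p) ** 2) / np.sum((y - y.mean()) ** 2)
    # abs() guards against tiny negative values from floating-point error
    return r2 * (1 - np.sqrt(np.abs(r2 - r0_2)))

print(rm2_index([1, 2, 3, 4], [1, 2, 3, 4]))  # ~1.0 for perfect predictions
```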

#### Area under precision-recall curve (AUPR)

The AUPR evaluation metric^{63} is frequently used in binary classification problems and is known to perform well with imbalanced data, since it differentiates the predicted scores of the positive data samples from those of the negative data samples. We utilized AUPR to evaluate Affinity2Vec’s performance in binary classification prediction; the closer the value of AUPR is to 1, the better the performance. We can easily convert our datasets into binary form by setting a specific binding affinity threshold for each dataset separately: if the binding affinity value is above the threshold, it is transformed to 1 (known interaction); otherwise, to 0. Our study follows the thresholds suggested in prior work^{23,39,68}, setting them to 7 and 12.1 for the Davis and KIBA datasets, respectively. For the PDBBind Refined dataset, we used 6, 7, and 10 as thresholds, following previous work^{44}; we calculated the AUPR at each threshold and then reported the average AUPR.
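The binarization and AUPR computation can be sketched with scikit-learn, whose `average_precision_score` is a common surrogate for the area under the precision-recall curve (the affinity values below are made up; 7 is the Davis threshold mentioned above):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy true affinities and predicted affinities (hypothetical values).
y_true = np.array([5.2, 7.4, 8.1, 6.0, 7.9])
y_pred = np.array([5.5, 7.1, 7.8, 6.3, 7.5])

threshold = 7.0                               # Davis binarization threshold
y_binary = (y_true > threshold).astype(int)   # 1 = known interaction

# AUPR: the predicted affinities serve as ranking scores for the positive class.
aupr = average_precision_score(y_binary, y_pred)
print(round(aupr, 3))  # 1.0 on this toy data (the ranking is perfect)
```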

### Experimental settings

We applied two settings to evaluate our models. For setting-1, like the state-of-the-art methods, we utilized the nested cross-validation (CV) approach to evaluate Affinity2Vec. The advantage of nested CV over regular CV is that it attempts to overcome bias and overfitting to the training data by tuning the regressors’ hyperparameters in the training stage. The nested CV procedure that we applied is as follows: we randomly partitioned the dataset into six equal sets, one used as a hold-out set for testing and the other five for training with fivefold CV. Hence, all four evaluation measurements reported are the averages of the five models on the hold-out test data. In addition, as mentioned before, we eliminated all binding affinity edges that belong to the test data from the graph that we formulated. To make the performance comparison as unbiased and fair as possible, we used the same benchmark datasets (Davis and KIBA) and nested CV setting, applied the same splits of the training and testing data as the current state-of-the-art methods, and utilized the same four evaluation metrics.
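The split scheme of setting-1 can be sketched with scikit-learn on random toy data: one sixth of the samples is held out for testing, and fivefold CV runs on the remaining five sixths:

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.random.rand(120, 24)   # toy drug-target pair FVs
y = np.random.rand(120)       # toy affinity labels

# Hold out one sixth of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 6, random_state=0)

# Fivefold CV on the remaining five sixths; each fold trains one model whose
# hold-out-test metrics are later averaged over the five models.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for tr_idx, val_idx in kf.split(X_train):
    pass  # train on X_train[tr_idx], validate on X_train[val_idx]

print(len(X_train), len(X_test))  # 100 20
```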

For setting-2, we followed the same process as prior work^{42}, using a time-based splitting protocol, the same dataset (PDBBind Refined v.2015), and the same evaluation metrics. Thus, we split the dataset into training, validation, and test sets based on the publication years provided with the dataset: for training, we used all protein–ligand complexes from 2011 and before; for validation, the complexes from 2012; and for testing, all complexes from 2013 and after. We thereby obtained 2188, 312, and 547 data samples (ligand–protein complexes) for the training, validation, and testing sets, respectively.

We executed all our experiments on a Linux Ubuntu 18.04.5 LTS workstation (64-bit OS) with an Intel Xeon Platinum 8176 (112 processors) and 2 GPUs (Quadro and Titan) with CUDA version 11.0. For the implementation, we used Python 3.8 and required libraries such as Keras^{69} for DL, DeepChem for the seq2seq fingerprint model^{51}, ProtVec^{53} for protein embeddings, and XGBoost^{59} for regression, among others.