1.Extract features from smiles using RDKit

In this section, I demonstrate a piece of code that uses RDKit ¹ toolkit to extract molecular descriptors from known active molecules (ligands) targeting a specific protein. These descriptors are then used to train a machine learning model (e.g., support vector machine) to predict the binding activity of a ligand.

The descriptors used in this code are taken from the published paper: ² Quantum machine learning framework for virtual screening in drug discovery: a prospective quantum advantage and the dataset used here is LIT-PCBA dataset³, which is specifically designed for virtual screening and machine learning.

Descriptors introduction¶

The chosen molecular descriptors(features) are generally classified into 5 classes:

Atomic species: number and kind of atoms composing a molecule.
Structural properties: number of single and double bonds, presence of aromatic rings, etc.
Physical-chemical properties: molecular weight and lipophilicity
Basic electronic information: general reactivity, polarity and rough electron density.
Molecular complexity: the complexity of every atom environment in a molecule which depends on the molecular structure.

The specific descriptors are presented as follow (from paper² in appendix B):

Code Demo¶

In the dataset, both active and inactive molecules are given in SMILES format. Here we only take the active ones (active.smi) as example:

# import packages
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Mol
from rdkit.Chem import Descriptors
from rdkit.Chem.rdchem import BondType, BondStereo
from rdkit.ML.Descriptors import MoleculeDescriptors

# load 7167 molecules from SMILES file
suppl = Chem.SmilesMolSupplier('actives.smi')
# molecule counts
print(len(suppl))

For practical purposes, the descriptors are not categorized into five classes but are instead grouped for ease of reference.

# declare the descriptors keywords
ATOMIC_SPECIES = ['C','N','O','P','S','F','Cl','Br','I']
BOND_TYPES = [BondType.SINGLE, BondType.DOUBLE, BondStereo.STEREOE]
AROMATIC_PROP = ['num_aroma_atoms', 'AP']
OTHER_PROP = ['NumHAcceptors', 'NumHDonors','NumHeteroatoms','NumRotatableBonds','NHOHCount','NOCount','MolWt', 'MolLogP']
ELEC_INFO = ['FpDensityMorgan1','FpDensityMorgan2','FpDensityMorgan3','MaxAbsPartialCharge','MinAbsPartialCharge','NumValenceElectrons']
MOL_COMP = ['BalabanJ','BertzCT','Chi0','Chi0n','Chi0v','Chi1','Chi1n','Chi1v','Chi2n','Chi2v','Chi3n','Chi3v','Chi4n','Chi4v','HallKierAlpha','Ipc','Kappa1','Kappa2','Kappa3']

# Pre-initialize descriptor calculator 
other_calc = MoleculeDescriptors.MolecularDescriptorCalculator(OTHER_PROP)
elec_calc = MoleculeDescriptors.MolecularDescriptorCalculator(ELEC_INFO)
mc_calc = MoleculeDescriptors.MolecularDescriptorCalculator(MOL_COMP)

dataset=[]
for mol in suppl:
    # atom features: 'C','N','O','P','S','F','Cl','Br','I'
    mol_atom = [atoms.GetSymbol() for atoms in mol.GetAtoms()]
    atom_count = [mol_atom.count(atom) for atom in ATOMIC_SPECIES]

    # bond features: single, double, stereo
    mol_bond = [bonds.GetBondType() for bonds in mol.GetBonds()]
    mol_stereo = [bond.GetStereo() for bond in mol.GetBonds()]
    mol_total = mol_bond + mol_stereo
    bond_count = [mol_total.count(bond) for bond in BOND_TYPES]

    # aromaticity feature: No. of aromatic atoms + aromatic proportion
    num_aroma_atoms = len(list(mol.GetAromaticAtoms()))
    AP = num_aroma_atoms / mol.GetNumHeavyAtoms()
    aromatic = [num_aroma_atoms, AP]

    # other properties
    other_property = list(other_calc.CalcDescriptors(mol))
    el_property = list(elec_calc.CalcDescriptors(mol))
    mc_property = list(mc_calc.CalcDescriptors(mol))

    # combine all the features
    features = (atom_count + bond_count + aromatic + other_property + el_property + mc_property)
    dataset.append(features)  # save the data in a 2d array

# create dataframe
column_name = (ATOMIC_SPECIES + ['single', 'double', 'stereoE'] + AROMATIC_PROP + OTHER_PROP + ELEC_INFO + MOL_COMP)
df_actives = pd.DataFrame(data=dataset,columns = column_name)
df_actives

	C	N	O	P	S	F	Cl	Br	I	single	...	HallKierAlpha	Ipc	Kappa1	Kappa2	Kappa3
0	17	5	2	0	1	0	0	0	0	16	...	-1.93	1.175242e+06	16.533706	7.530765	3.546026
1	21	3	3	0	1	0	0	0	0	7	...	-3.32	4.534871e+06	18.062438	8.141321	3.841228
2	24	4	3	0	1	0	0	0	0	13	...	-3.39	1.794330e+07	21.827384	8.625014	4.750895
3	25	6	1	0	0	0	0	0	0	14	...	-3.32	4.101819e+07	20.575350	9.042240	4.338801
4	16	2	3	0	0	0	1	0	0	9	...	-2.79	1.331666e+05	14.160087	5.537123	2.313694
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7162	11	5	2	0	1	0	0	1	0	8	...	-1.63	3.969946e+04	14.772349	6.213661	3.273688
7163	21	0	8	0	0	0	0	0	0	18	...	-3.65	2.084500e+06	20.093761	7.402287	3.572501
7164	14	0	3	0	0	0	0	0	0	11	...	-1.64	4.165319e+03	15.360000	9.576366	9.298416
7165	8	2	5	0	0	0	0	0	0	7	...	-2.02	1.862336e+03	11.057042	4.467681	2.989007
7166	22	1	5	0	0	0	0	0	0	14	...	-3.41	2.506479e+06	19.354312	8.083712	3.174880

7167 rows × 47 columns

# add labels for actives and save files
df_actives['label']= 1  # pandas will broadcast the value to all lines
df_actives.to_csv('actives.csv', index=False) # save index or not, True by default

There are 7167 molecules and 47 descriptors(features) in total. Before input all data to train and test a Support Vector Machine, we need to first refine these features. Specifically, we will employ feature reduction and feature selection methods to reduce the dimension of features while retaining the most significant data patterns. The resulting dimensionality reduction decreases computational complexity, removes noise and redundancy, ultimately improving accuracy and generalization of ML model. In the next step, I will employ Principle Component Analysis(PCA) for feature reduction and Analysis of Variance(ANOVA) for feature selection using scikit learn library.

Reference¶

RDKit: Open-source cheminformatics. https://www.rdkit.org. ↩
Mensa, S., Sahin, E., Tacchino, F., Kl Barkoutsos, P., & Tavernelli, I. (2023). Quantum machine learning framework for virtual screening in drug discovery: a prospective quantum advantage. Machine Learning: Science and Technology, 4(1), 015023. ↩↩
Tran-Nguyen, V. K., Jacquemard, C., & Rognan, D. (2020). LIT-PCBA: an unbiased data set for machine learning and virtual screening. Journal of chemical information and modeling, 60(9), 4263-4273. ↩