VIRTUAL SCREENING STRATEGIES IN DRUG DISCOVERY – A BRIEF OVERVIEW

Computer-aided drug design has become an indispensable tool in the drug discovery and development process. It uses computational approaches to discover, develop, and analyze drugs in order to identify compounds with the expected biological activities. The first part of this review introduces the virtual screening technique and recent advances in both structure-based and ligand-based virtual screening strategies. The second part discusses compound databases available worldwide and the drug-likeness parameters that support the virtual screening process. This information provides a basis for assessing progress in applying these techniques to the identification and optimization of new drug leads.


INTRODUCTION
Viet Nam has a long history of traditional medicine. For generations, people have used the plants around them as inexpensive but effective remedies [1]. Over time, experience with these medicinal plants has grown, so that traditional medicine is now used not only for common colds but also to treat, or support the treatment of, serious illnesses such as cancer and cardiovascular disease. The mechanisms of action behind these remedies are still not fully explained by modern science.
In recent decades, millions of compounds have been isolated from plants, marine organisms and microorganisms worldwide [2,3]. Many of them have the potential to be developed into drugs. Studying the chemical composition of these organisms has helped to explain the therapeutic effects of traditional remedies [4] and has contributed to the discovery of the main bioactive compounds that treat disease while avoiding side effects. However, because of financial and technical constraints, not all isolated compounds have been tested for therapeutic activity, and those that were tested have often not been evaluated thoroughly.
With the rapid development of information technology, complex chemical processes can now be simulated with relatively high accuracy. Because computational testing of large compound databases against a growing number of biological targets saves time and money, a number of virtual screening methods and in silico biopharmaceutical assays have been developed [5,6].
In this review, the concept of virtual screening and the strategies used to apply it in drug discovery are presented. This is followed by a brief introduction to small-molecule databases and the drug-likeness parameters that support virtual screening procedures.

OVERVIEW OF COMPUTER-AIDED DRUG DESIGN
The application of information technology to research in chemistry, biology and medicine has been developing worldwide since the late 1950s. In the 1960s, simple computer programs became available to simulate NMR spectra [7], and multiple computers were connected to solve the complex regression equations used in the Hansch model for analyzing structure-activity relationships [8]. However, real molecules were still too complex for problems of spatial structure to be solved at that time.
In the 1970s, with improvements in processing speed and more user-friendly interfaces, information technology began to make a more significant contribution [8]. The main difficulty during this period was the lack of computer programs able to describe molecules and their properties accurately from theoretical results. This barrier was later overcome by graphics programs powerful enough to represent the HOMO, LUMO, molecular electrostatic potential (MEP) and dipole moment vectors of molecules [7,8]. In the early 1990s, multi-core computers (clusters) were developed with enough power to perform computations on chemical processes in a short time [9]. These advances increased the interest of scientists in using information technology in chemical research.
In traditional natural products chemistry, compounds were mainly isolated at random through experiments, and their biological activities were then identified using simple assays such as antibiotic, anti-inflammatory and cytotoxicity assays. In recent decades, new generations of drugs have been discovered and developed in developed countries using powerful genetic and biochemical screening tools [10]. These methods allow rapid and precise detection of compounds with the desired activity in a wide variety of extracts. More importantly, they also provide preliminary information about the mechanism of action of the bioactive compounds, which is important for guiding drug design at later stages [5,8].
To conduct these screening methods, it is necessary to determine the crystal structure of the target protein or enzyme (receptor) whose function is responsible for the development of the disease. In addition to supporting accurate prediction and understanding of a drug's mechanism of action, these methods provide important knowledge for developing new drugs once a disease has become drug-resistant [11,12]. When a drug is used incorrectly, or because of environmental conditions and chemical agents, resistance can arise through mutations in the DNA that encodes the pathogen's target protein. The traditional research pathway cannot detect these changes. With computational methods in chemistry and biology, however, the problem can be addressed by studying changes in the DNA and in the interaction between the receptor and the bioactive molecule (ligand), which suggests how the structure of the current drug might be modified to restore its effectiveness. Such studies require close collaboration among researchers in three fields, biology, chemistry and medicine, in which:
- Molecular biologists lead the research on and discovery of the crystal structures of the proteins/enzymes that are responsible for the emergence and development of disease.
- Chemists screen large databases of molecules for their potential to inhibit these biological targets and then synthesize or semi-synthesize them.
- Biological experiments and the subsequent preclinical and clinical tests are carried out jointly by chemists, biologists, doctors, and pharmacists.
Among modern models for screening bioactive compounds, in silico (virtual) screening has emerged recently and quickly taken on an important role in the drug discovery process. This method uses advances in computer science to screen, describe and predict new compound structures that are expected to have biological activity [13,14]. Its main advantage is that it minimizes the cost and time involved in drug discovery and development. It is often described as a multi-step sequential procedure in which successive screening criteria gradually narrow the selection to compounds with the potential to be developed into drugs with the desired biological activities. The compounds studied do not have to be physically available, and because their bioactivities are predicted computationally, expenses and materials are saved [13,15]. In principle, any compound can therefore be assessed through virtual screening. Depending on the scale of the study, the compound database can reach tens of millions of compounds, all of which can be analyzed in a single screening campaign.

Table 1 (excerpt). Virtual screening studies on COVID-19: main protease, 1 billion compounds [23]; spike protein, 10 million compounds [35].
Typically, each new drug on the market costs about 800 million euros and requires 10-15 years of research and development [16-19]. In contrast, with modern networked computer systems (e.g., grid computing), millions of structures can be virtually screened in a matter of weeks. For example, WISDOM (World-wide In Silico Docking On Malaria) is a successful project that used grid computing on networked machines around the world to screen and develop anti-malarial drugs. During the three years of the project (2007-2010), hundreds of millions of compounds were screened, and dozens of promising compounds were tested in vitro and then in vivo and are now in preclinical and clinical testing [20-22]. Another example is the COVID-19 pandemic. Since the first case appeared in December 2019, no effective drug has yet been discovered, and under pressure to find one quickly, many research groups around the world have used virtual screening to screen billions of compounds with the aim of repurposing existing drugs or finding new therapeutic compounds (Table 1) [23-29].
In silico screening methods usually use receptor-ligand interactions to find the compounds (ligands) whose structures are predicted to bind best to the receptor (the target protein or enzyme), that is, with the lowest ΔG value (Figure 1) [36]. The three-dimensional (3D) structure of the receptor is determined for each study case, and the ligands are built from the structures of chemical compounds, especially those with well-known skeletons and clearly documented sources. Research using virtual screening was first recognized and published in 1997 [37]. Since then, the approach has become increasingly popular and is now a research trend in the pharmaceutical industry, and the number of published studies in this field has increased dramatically (Table 2, Figure 2).
In Viet Nam, research on the chemistry and biological activity of natural compounds plays an important role in finding sources of medicines, contributing to agricultural development and environmental protection, and producing functional foods. The application of information technology in chemistry and other life sciences has been initiated and developed during recent decades. In chemistry, most research to date has focused on the isolation, structure elucidation, design and synthesis of compounds, and on the relationship between the structure of series of compounds and their bioactivities. Only a few published studies have used information technology for in silico screening of new drugs. However, this is gradually becoming a new trend that attracts the attention of many research groups in Viet Nam.
For the structure-based virtual screening approach (SBVS), the input data are the X-ray crystal structure of the target protein or enzyme (receptor) and a database of compounds (ligands). These compounds are screened by docking them into the active sites of the receptor using different computational algorithms.
In molecular modeling, docking is a method that predicts the preferred orientation of one molecule relative to a second when the two bind to form a stable complex. The docking score is then used to rank the predicted binding affinity between the ligand and the receptor. This is usually a multi-step process in which compounds are ranked and selected based on the interaction score and a number of other criteria [38]. Typically, only a handful of the top-ranked compounds are carried forward to in vitro and in vivo experiments.
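The ranking-and-selection step described above can be illustrated with a minimal Python sketch. It assumes that each docking score is an estimated binding energy in kcal/mol, so that a more negative value is better; the compound names, scores, cutoff and function name `select_hits` are all hypothetical and are not taken from any particular docking program.

```python
from typing import NamedTuple


class DockingResult(NamedTuple):
    compound_id: str
    score: float  # assumed: estimated binding energy (kcal/mol), lower is better


def select_hits(results, top_n=2, max_score=-7.0):
    """Keep compounds at or below the score cutoff, then return the top_n best."""
    passing = [r for r in results if r.score <= max_score]
    ranked = sorted(passing, key=lambda r: r.score)  # most negative first
    return ranked[:top_n]


# Hypothetical docking output for four compounds.
results = [
    DockingResult("cpd_A", -9.1),
    DockingResult("cpd_B", -6.2),
    DockingResult("cpd_C", -8.4),
    DockingResult("cpd_D", -7.5),
]
hits = select_hits(results)
print([h.compound_id for h in hits])  # ['cpd_A', 'cpd_C']
```

In practice the cutoff and the number of compounds carried forward depend on the target and on the experimental capacity available, so both are left as parameters here.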

Structure-based virtual screening approach
In the early years of virtual screening, the docking software used for research was UCSF DOCK. Since then, many other programs have been developed, for example GOLD, DOCK, Glide, FlexX and AutoDock (Table 3) [39-63].
One of the critical steps in the SBVS model is the scoring of ligands [59,60]. Although the binding conformation between ligand and receptor can now be predicted with various software packages, scoring and ranking compounds remain challenging, in part because some molecular interactions are difficult to parameterize. A scoring function serves two purposes: a) to evaluate the binding poses of a compound generated by different algorithms and select the energetically most favorable pose; b) to rank the studied compounds and thereby identify the most promising candidates. Scoring methods have been developed continuously over the years [61,62] and can be grouped into three main categories: force field-based, knowledge-based and empirical [63,64]. Some scoring models combine force field-based and empirical approaches.
The force field scoring function [65,66] assumes that the binding free energy is the sum of molecular mechanics force field potentials: Coulomb, van der Waals and hydrogen bond terms. Solvation [67,68] and entropy [69] contributions can also be included. The empirical scoring function [52,70] treats the binding free energy as a sum of interaction terms, such as hydrogen bonds and hydrophobic interactions, whose weights are obtained by fitting the calculated score to experimental binding affinity data for a training set of ligand-protein complexes [71]. The knowledge-based scoring function [72,73] is based on statistical analysis of atom-pair frequencies in ligand-protein complexes with known three-dimensional structures.
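As an illustration of the force field idea, the following sketch sums a Lennard-Jones (van der Waals) term and a Coulomb term over all ligand-receptor atom pairs. It is a simplified assumption rather than the scoring function of any specific program: a single pair of Lennard-Jones parameters is used for every atom pair, the dielectric constant is fixed, and hydrogen bond, solvation and entropy terms are omitted. The coordinates and charges are invented for the example.

```python
import math

COULOMB_CONSTANT = 332.06  # kcal*Angstrom/(mol*e^2), converts q_i*q_j/r to kcal/mol


def pair_energy(r, q_i, q_j, epsilon, sigma, dielectric=4.0):
    """Lennard-Jones 12-6 plus Coulomb energy (kcal/mol) for one atom pair."""
    lj = 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
    coulomb = COULOMB_CONSTANT * q_i * q_j / (dielectric * r)
    return lj + coulomb


def interaction_energy(ligand_atoms, receptor_atoms, epsilon=0.1, sigma=3.4):
    """Sum pair energies over all ligand-receptor atom pairs.

    Each atom is a tuple ((x, y, z), partial_charge) with coordinates in Angstrom.
    """
    total = 0.0
    for pos_l, q_l in ligand_atoms:
        for pos_r, q_r in receptor_atoms:
            r = math.dist(pos_l, pos_r)
            total += pair_energy(r, q_l, q_r, epsilon, sigma)
    return total


# Hypothetical two-atom ligand near a two-atom binding site.
ligand = [((0.0, 0.0, 0.0), 0.30), ((1.5, 0.0, 0.0), -0.10)]
pocket = [((4.0, 0.0, 0.0), -0.40), ((0.0, 4.0, 0.0), 0.20)]
print(round(interaction_energy(ligand, pocket), 3))
```

An empirical scoring function would have a similar additive shape, but its term weights would be fitted to experimental affinities instead of being taken from a force field.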
Over the past two decades, considerable effort has been made to refine scoring functions so that they predict binding free energies accurately. As a result, they can be used for ranking compounds, although not for quantitative prediction of biological activity. Because of the complexity of the ligand-protein binding process and the approximations made when calculating desolvation and entropy, the docking score has yet to prove accurate in predicting binding affinity [59,74,75]. Methods proposed to improve scoring include adding terms for solvation and entropy effects [68], more precise algorithms based on high-level quantum calculations [76], target-specific scoring functions [77] and consensus scoring, in which multiple scoring models are combined [78,79]. Alternatively, it is more efficient to use the docking score as a guide to the suitability of the interaction, in combination with other parameters that reflect the essence of the binding event, such as the tightness of fit of a specific molecule. These parameters can be obtained by examining hydrogen bonds, which are very important in docking, the geometry of π-π stacking interactions, and/or the occupancy of the hydrophobic region that pre-positions the ligand in the binding site.
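One simple way to combine multiple scoring models is rank averaging, sketched below. This is only one of several possible consensus schemes and is chosen here as an assumption for illustration; the three score tables are hypothetical, and lower scores are assumed to be better in each of them.

```python
def rank_positions(scores):
    """Map each compound to its rank (1 = best), assuming lower score = better."""
    ordered = sorted(scores, key=scores.get)
    return {cpd: i + 1 for i, cpd in enumerate(ordered)}


def consensus_rank(score_tables):
    """Order compounds by their mean rank across several scoring functions."""
    ranks = [rank_positions(table) for table in score_tables]
    compounds = list(ranks[0])
    mean_rank = {c: sum(r[c] for r in ranks) / len(ranks) for c in compounds}
    return sorted(compounds, key=mean_rank.get)


# Hypothetical scores from three scoring functions on different scales.
force_field = {"cpd_A": -9.0, "cpd_B": -7.0, "cpd_C": -8.0}
empirical = {"cpd_A": -6.5, "cpd_B": -6.0, "cpd_C": -8.0}
knowledge = {"cpd_A": -50.0, "cpd_B": -20.0, "cpd_C": -45.0}

print(consensus_rank([force_field, empirical, knowledge]))
# ['cpd_A', 'cpd_C', 'cpd_B']
```

Working with ranks rather than raw scores avoids having to put scoring functions with different units and ranges on a common scale.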
Another underexploited aspect of the SBVS model is the flexibility of the target receptor [80], which consumes more computing resources and is more complex to handle. In recent years, treating the target receptor as flexible has been one of the biggest challenges facing docking algorithms. "Soft docking" (available in docking software) allows small overlaps between the ligand and the receptor without large steric penalties [81]. However, this can increase the number of false results because a more diverse set of structures is allowed to bind. It also does not accommodate large conformational changes, such as side-chain rotations or protein backbone motions. Some programs, such as AutoDock4 [46], DOCK [41], GOLD [48], EADock [49], IFREDA [51], FlexE [82] and Glide Induced Fit [83] (Table 4), allow sampling of the torsional degrees of freedom of selected side chains, using methods similar to those used to explore the conformations of flexible ligands. Many other theoretical methods are under continuous development, and their applications have great potential for virtual screening in the future. One of these is the Relaxed Complex Scheme (RCS). RCS docks the compound database against a set of low-energy receptor structures extracted from a molecular dynamics (MD) simulation [84,85]. It combines the advantages of the docking algorithm with the dynamic structural information obtained from MD simulation, giving a detailed account of the dynamics of both the receptor and the docked compounds. Longer MD simulations sample more of the receptor's conformational space before docking. This scheme has been developed in combination with various MD software packages, including AMBER [86], NAMD [87] and GROMACS [88], with AutoDock used for ligand docking [46].
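The bookkeeping behind docking against several receptor snapshots can be sketched as follows. The sketch assumes that docking has already been run against each MD frame and that the scores (lower is better) are available as a dictionary; the frame names, compound names and the choice to keep the single best score per compound are illustrative assumptions, not a description of the published RCS protocol.

```python
def best_over_snapshots(scores_by_snapshot):
    """For each compound, keep the best (lowest) score and the snapshot it came from."""
    best = {}
    for snapshot, scores in scores_by_snapshot.items():
        for compound, score in scores.items():
            if compound not in best or score < best[compound][0]:
                best[compound] = (score, snapshot)
    return best


# Hypothetical docking scores against two receptor conformations from an MD run.
scores_by_snapshot = {
    "md_frame_0": {"cpd_A": -6.0, "cpd_B": -7.5},
    "md_frame_1": {"cpd_A": -8.2, "cpd_B": -7.0},
}

for compound, (score, frame) in sorted(best_over_snapshots(scores_by_snapshot).items()):
    print(compound, score, frame)
```

Here cpd_A would have been ranked poorly against the first frame alone, which is the kind of case that motivates using an ensemble of receptor structures.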

Ligand-based virtual screening approach
In the ligand-based virtual screening approach (LBVS), known bioactivity data are used to distinguish biologically active from inactive compounds and then to search for further promising compounds based on structural similarity, pharmacological properties and other criteria.
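A structural similarity search can be sketched with the Tanimoto coefficient on binary fingerprints, a widely used similarity measure that is introduced here as an assumption, since the text does not name a specific metric. Each fingerprint is represented as the set of its "on" bit positions; the bit sets, compound names and threshold are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similarity_search(query_fp, library, threshold=0.5):
    """Return (name, similarity) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [(name, sim) for name, sim in scored if sim >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints: a known active (query) and a small library.
query = {1, 2, 3}
library = {"cpd_1": {1, 2, 3, 4}, "cpd_2": {2, 3, 9}, "cpd_3": {7, 8}}

print(similarity_search(query, library))
# [('cpd_1', 0.75), ('cpd_2', 0.5)]
```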
One of the most popular models in LBVS studies is the quantitative structure-activity relationship (QSAR). The objective of QSAR is to determine the correlation between the structural and physical properties of known bioactive compounds and their biological activity [89,90]. Information on compound activity, such as binding affinity (KD) or inhibitory concentration (IC50), is essential for QSAR. The structure of a compound is usually described by a set of structural and physical descriptors that are considered relevant to its binding ability. The quality of a QSAR model is influenced by its suitability for each case, the structure-activity input data, the compound descriptors, the effect of outlying data, the suitability of the derived correlations, the 3D configuration, and the choice of solution method [91].
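The core of a QSAR model can be illustrated with the simplest possible case: a least-squares line relating one descriptor to activity. The sketch below assumes that activity is expressed as pIC50 (the negative base-10 logarithm of IC50 in mol/L) and uses LogP as the single descriptor; the data are invented for illustration, and real QSAR models normally use many descriptors and proper validation.

```python
import math


def pic50(ic50_molar):
    """Convert an IC50 in mol/L to pIC50 = -log10(IC50)."""
    return -math.log10(ic50_molar)


def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept for one descriptor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training set: LogP descriptor and measured IC50 values (mol/L).
log_p = [1.0, 2.0, 3.0, 4.0]
ic50 = [1.0e-5, 3.2e-6, 1.0e-6, 3.2e-7]
activity = [pic50(value) for value in ic50]

slope, intercept = fit_line(log_p, activity)
predicted = slope * 2.5 + intercept  # predicted pIC50 for a new compound with LogP 2.5
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

The fitted coefficients are only meaningful within the descriptor range of the training set, which is one reason the quality of the input data matters so much.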
Machine learning is increasingly used in LBVS algorithms to establish structure-activity correlations quickly and accurately. Various techniques have been developed, each with its own advantages and disadvantages. Among them, regression and classification models such as Multiple Linear Regression, Nearest Neighbors, Naïve Bayesian Classification, Support Vector Machines, Neural Networks and Decision Trees have been applied successfully. These algorithms rely on properties that differ between active and inactive compounds to filter out potential candidates [92].
The efficiency of machine learning depends on many factors, such as the diversity of the data, the ability to handle imbalance in the data sets (the number of inactive compounds is often far greater than the number of bioactive compounds) and the bioactivity parameters of the compounds.
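The effect of class imbalance can be made concrete with a small sketch. It shows a common inverse-frequency heuristic for weighting classes and the balanced accuracy metric (the mean of per-class recall); both are assumptions chosen for illustration and are not taken from the cited studies. The label vector is hypothetical, with 1 for active and 0 for inactive compounds.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    return {cls: n_samples / (len(counts) * count) for cls, count in counts.items()}


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which is not inflated by the majority class."""
    recalls = []
    for cls in set(y_true):
        indices = [i for i, label in enumerate(y_true) if label == cls]
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


# Hypothetical screening set: 2 actives and 8 inactives.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_inactive = [0] * len(labels)  # a trivial model that never predicts "active"

print(balanced_class_weights(labels))               # {1: 2.5, 0: 0.625}
print(balanced_accuracy(labels, always_inactive))   # 0.5
```

The trivial model is correct for 8 of 10 compounds, yet its balanced accuracy is only 0.5 because it finds none of the actives.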

DATABASE OF SMALL MOLECULES
One of the prerequisites in traditional drug development is the identification of a specific biological target, that is, a target for which interaction with a compound has been shown to cure disease or relieve symptoms. This first step involves identifying potential biological targets and then validating them. Finding potential targets requires exploring the "Biological Space" (Figure 3) through human genome sequencing, which relies on high-speed sequencing technology and computer algorithms to process the large amounts of output. Once a biological target has been found and validated, the next step is to identify an entity that can interact selectively with that target in a way that induces a healing effect. In drug research, this entity is a small-molecule chemical compound. Finding a compound that binds selectively to the active site of the receptor is not an easy task. To increase the chance of success, it is necessary to search the "Chemical Space". In theory, the total number of compounds in the Chemical Space can be estimated at up to 10 million compounds [93-95]. This is a very large number and is currently beyond the capabilities of scientists.
Although there have been many attempts to establish such very large databases, covering the Chemical Space with real compounds is not possible at present. In addition, only a few pharmaceutical corporations are known to possess databases of more than 2 million compounds. Moreover, only a small fraction of the compounds in those databases are stable, water-soluble, carry functional groups suitable for binding to biological targets such as proteins or nucleic acids, and have sufficient structural complexity [96] to be placed in the "Medicinal Chemistry Space" region. It has been argued that the region of Chemical Space covered by traditional screening collections is insufficient to address unvalidated biological targets, so more extensive exploration outside this region is needed. A feasible source could be built from derivatives of natural compounds obtained from bacteria, plants, animals, and marine organisms through emerging technologies. These compounds form natural product-like combinatorial libraries [3,97]. The concept of the drug-like compound was devised to define the properties a compound needs in order to be developed successfully into a drug. Over time, more stringent criteria and procedures based on drug-oriented properties have been applied to compounds during database screening. Table 5 shows some criteria defined by Hann and Oprea [98].
Table 5 (excerpt). Solubility in water (logS): -5/0.5. Others: no toxic and reactive fragments.
Many in silico tools are available today for building compound databases with drug-like properties. These properties are defined by empirical rules. A typical example is Lipinski's Rule of Five [99], which states that a compound is considered non-drug-like if it has more than 5 hydrogen bond donors, more than 10 hydrogen bond acceptors, a molecular mass greater than 500 Da and a lipophilicity index (LogP) greater than 5. This rule was recently revised using pharmacokinetic data in rats [100]. Many related rules have also been modified, and the newer "Rule of Three" [101] defines fragment properties as a molecular mass ≤ 300 Da, ClogP ≤ 3, number of hydrogen bond donors ≤ 3 and number of hydrogen bond acceptors ≤ 3. More recently, the Pfizer "Rule of 3/75" has described compounds with ClogP ≤ 3 and topological polar surface area (TPSA) > 75 as less likely to show toxicity in in vivo tests [102]. Table 6 provides information on databases of compounds with drug-like properties that comply with the "Rule of Three" and the "Rule of Five".
Table 6. Examples of databases containing compounds with drug-like properties that comply with the "Rule of Three" and the "Rule of Five".
The screening process may be affected by high-energy or unrealistic conformations of the compounds. Some conformation-building methods do not use the lowest energy level when generating and ranking conformations, which leads to conformations with high energy. If these conformations are not removed, they will produce erroneous docking results.
Compound databases are often distributed free of charge by commercial companies or research institutes. They include drugs, carbohydrates, synthetic compounds, natural compounds, etc. (Table 7) [109-116]. ZINC [109] is a free online database that holds up to 13 million compounds in its current version, with information on their properties (molecular weight, ClogP and number of rotatable bonds). Other subsets, such as drug-like compounds, potency and fragments, have also been introduced. Table 8 presents some of the commercial databases provided by other distributors [92].
Table 8. Databases of small molecules provided by commercial distributors [92].

Lipinski's Rule of Five
Lipinski's Rule of Five helps to distinguish molecules that have potential as oral drugs from those that do not [99,100]. It predicts the drug-likeness of compounds based on whether they meet the following criteria: a) molecular weight no greater than 500 Da; b) lipophilicity, expressed as LogP, no greater than 5; c) no more than 5 hydrogen bond donors; d) no more than 10 hydrogen bond acceptors; e) molar refractivity between 40 and 130. Here, LogP is the logarithm of the partition coefficient between octanol and water, that is, of the ratio at equilibrium of the concentrations of a compound in the two phases, an oil phase and an aqueous phase. The LogP value plays an important role in assessing the absorption, transport and distribution of substances and the interaction of drugs with receptors [117]. It is one of the basic parameters used to evaluate whether a compound has the potential to be developed into a drug. Molar refractivity is a measure of the total polarizability of a mole of a substance [118].
In general, compounds that violate two or more criteria are predicted to be less likely to be developed as oral medications. Based on literature studies, several guidelines for drug development should be noted: the higher the LogP value, the more easily the compound crosses the cell membrane and the better it dissolves in a lipid medium; drugs taken orally and absorbed in the intestine should have 1.35 ≤ LogP ≤ 1.8; drugs targeting the central nervous system should have LogP ~ 2; most metal complexes with good permeability have LogP ≤ 6, no more than 10 hydrogen bond acceptors and no more than 5 hydrogen bond donors; and drugs used sublingually should have LogP ≥ 5 [98-100, 119, 120].
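The Rule of Five can be applied as a simple filter that counts violations, as sketched below. The sketch covers only the four classical criteria (molecular weight, LogP, hydrogen bond donors and acceptors) and assumes that the descriptors have already been calculated by some cheminformatics tool; the class name `Descriptors`, the function names and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mol_weight: float   # Da
    log_p: float        # octanol-water partition coefficient (log scale)
    h_donors: int       # number of hydrogen bond donors
    h_acceptors: int    # number of hydrogen bond acceptors


def rule_of_five_violations(d):
    """Count how many of the four classical Rule of Five limits are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ]
    return sum(checks)


def likely_oral(d, max_violations=1):
    """Flag a compound as orally promising if it has fewer than two violations."""
    return rule_of_five_violations(d) <= max_violations


# Hypothetical descriptor values for a small molecule and a large one.
small = Descriptors(mol_weight=180.2, log_p=1.2, h_donors=1, h_acceptors=4)
large = Descriptors(mol_weight=812.0, log_p=6.5, h_donors=6, h_acceptors=12)

print(rule_of_five_violations(small), likely_oral(small))  # 0 True
print(rule_of_five_violations(large), likely_oral(large))  # 4 False
```

The default of at most one violation follows the statement above that compounds violating two or more criteria are less likely to become oral drugs; a stricter or looser setting can be passed through `max_violations`.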

Introduction to ADME
Depending on the nature of the drug and the goals of treatment, drugs may be delivered into the body in different ways. Whatever the route, the drug eventually enters the bloodstream, to a varying degree, and is carried to the site where it takes effect. ADME (Absorption, Distribution, Metabolism, Excretion) describes the interaction of a drug with the body at the level of molecular biology [121-123]. Determining these parameters is complicated because the body is equipped with a myriad of mechanisms for removing foreign entities that enter it, through metabolism or excretion. To absorb, metabolize and eliminate drugs, the body uses a set of metabolic enzymes (the most important being the family of hemoprotein cytochromes P450, which are present in the liver), transporters, excretory organs, etc.

Absorption
Absorption is the entry of the drug into the general circulation of the body. The choice of route for introducing a drug into the body should be based on the purpose of treatment, the properties of the drug, the dosage form, the pathological state of the patient, etc. The route of delivery greatly affects the absorption and effects of the drug. Drugs can be introduced through the gastrointestinal tract, by injection, through the respiratory tract and through the skin [124].
Biological barriers are the body's defense against the penetration of toxins and other exogenous substances. Drugs are recognized as exogenous substances, so biological barriers significantly hinder their passage to the desired destination. Many drugs are effective in laboratory studies (in vitro) but fail in animal or human trials, mostly because they cannot penetrate the body's biological barriers to reach the target [125]. At the level of organs, the biological barrier is the outermost layer of epithelial cells of the organ and the endothelial barrier (the compartment between the capillaries and endothelial cells). At the level of the cell, the biological barrier is the cell membrane, which separates the intracellular and extracellular environments [125,126].
Cell membranes (biological membranes) consist of a lipid bilayer, which is regarded as a soft structure with the character of a dense liquid. Embedded in the lipid layer are membrane transport proteins and lipoprotein particles, and both faces of the membrane carry polar groups. The membrane is characterized by rapid changes in structure: protein molecules float within it, and its spatial structure can change so that channels form through which small molecules, water-soluble substances, and ions pass into the cell. Besides acting as a barrier, the membrane also provides a framework to which receptor molecules or enzymes attach, on its surface or in its interior.

Distribution
Once absorbed, the drug enters the bloodstream and is transported to its target of action. In the blood, drugs can exist in two forms: a free form and a form bound to plasma proteins. Some drugs may be partially decomposed in the bloodstream [121,126].

Metabolism
Metabolism is the process by which drugs are transformed in the body under the action of enzymes. Through metabolism, most drugs have their effect or toxicity reduced or lost [127]. Some drugs retain the same pharmacological effect, and some work only after being metabolized. Metabolism is therefore the body's process for detoxifying drugs. The liver is the most important organ in drug metabolism, although metabolism can also occur in other tissues such as the kidneys, intestines, lungs and blood. Oral medications must undergo initial (first-pass) metabolism in the liver before entering the circulatory system for distribution in the body. This initial metabolism is often so extensive that the drug loses its effectiveness, and it is sometimes necessary to administer the drug intravenously to ensure its activity.
Most drug metabolism reactions in the body, especially in the liver, involve many different enzymes. Among them, cytochrome P450 (CYP) plays a major role in drug metabolism [129]. Cytochrome P450 metabolizes drugs in three ways: oxidation, hydrolysis and hydroxylation (step 1). The enzyme glucuronosyltransferase (UDP-GT) then attaches glucuronic acid to the drug (step 2). The glucuronic acid group contains additional OH and COOH groups, so the conjugate is easily filtered and eliminated by the kidneys (Figure 4).
For example, aspirin, after being hydrolyzed by CYP in the liver, is converted to salicylic acid, and a glucuronic acid group is subsequently added by UDP-GT before the next step (Figure 5). The salicylic acid produced by this metabolic step is in practice the active compound, so aspirin is also regarded as a prodrug (precursor). A prodrug is a pharmacologically inactive drug or compound that, after administration, is converted in the body into a drug with pharmacological activity. Instead of administering the drug directly, a corresponding prodrug can be used to improve the way the drug is absorbed, distributed, metabolized, and excreted (ADME). Prodrugs are often designed to improve bioavailability when the drug itself is poorly absorbed from the gastrointestinal tract. A prodrug can also be used to improve selectivity, so that the drug interacts less with cells or processes that are not its intended targets. This helps to reduce side effects, which is especially important in treatments such as chemotherapy, where unwanted effects can be serious.

Excretion
Drug elimination is the process that leads to a decrease in drug concentration in the body. Drugs are excreted from the body mainly through the kidneys. They can also be eliminated through other routes such as the gastrointestinal tract, respiratory tract, skin, sweat, breast milk or tears [130].
Some drugs are eliminated by several routes at the same time, but each drug normally has a main elimination pathway that depends on its nature and chemical structure, its dosage form, its route of administration, etc.

CONCLUDING REMARKS
In this review, we have briefly introduced the concept of computer-aided drug design, which has become a research trend worldwide in recent years. Virtual screening has proved itself to be an effective and important method in the drug discovery process through two main strategies, the SBVS and LBVS approaches. In addition, an overview of current small-molecule databases and of drug-likeness parameters was presented. In conclusion, we suggest that virtual screening methods play a pivotal role in drug discovery research and that there are clear opportunities to make use of this computational screening technology in the future.
Declaration of competing interest. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
