Extracting physicochemical features from a DNA sequence

Today, much attention in computational biology is focused on gene finding, i.e., the prediction of gene location and gene products from experimentally uncharacterized DNA sequences. To make such predictions, we must first know the characteristics of the gene sequences.

Moreover, several machine learning methods cannot be applied directly to a raw DNA sequence. We first need to describe the sequence as data, i.e., generate data by extracting features from the sequence. This generated data is then fed to the machine learning algorithm for prediction.

You can extract several physicochemical properties from the sequence, such as aliphatic composition, aromatic composition, non-polar composition, polar composition, charged composition, positive composition, negative composition, theoretical isoelectric point, acidic composition, basic composition, Boman (potential protein interaction) index, theoretical net charge, hydrophobic moment, hydrophobicity index, instability index, Kidera factors, amino acid length, and molecular weight.

In this article, we discuss how to extract several physicochemical features from a DNA sequence. The steps are as follows.

  1. First, convert the DNA sequence to its corresponding amino acid sequence. Please see the tables given below:

We can convert a DNA sequence to an amino acid sequence in R with the help of the seqinr package. See the following script:
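The translation step itself can also be sketched without any package. The following is a minimal plain-Python illustration, not the seqinr implementation: it assumes the standard genetic code, and the names `translate_dna` and `frame` are hypothetical, introduced only for this example. Incomplete trailing codons are dropped and codons containing unknown bases are reported as `X`.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1). Codons are enumerated
# with the bases in the order T, C, A, G at each position; '*' is a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_dna(dna, frame=0):
    """Translate a DNA string into one-letter amino acid codes.

    frame (0, 1 or 2) is the offset of the first codon.
    """
    seq = dna.upper().replace("U", "T")[frame:]
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


print(translate_dna("ATGGCCAAGTAA"))  # ATG GCC AAG TAA -> MAK*
```

The resulting amino acid string is the input to every feature described in the next step.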

  2. Once you have converted the DNA sequence to an amino acid sequence, you can extract several physicochemical properties from it. The Peptides package provides functions for this, starting with the aacomp function. See the script below:

The output of aacomp is a matrix with the number and percentage of amino acids in each of the following classes:

  • Tiny (A + C + G + S + T)
  • Small (A + B + C + D + G + N + P + S + T + V)
  • Aliphatic (A + I + L + V)
  • Aromatic (F + H + W + Y)
  • Non-polar (A + C + F + G + I + L + M + P + V + W + Y)
  • Polar (D + E + H + K + N + Q + R + S + T + Z)
  • Charged (B + D + E + H + K + R + Z)
  • Basic (H + K + R)
  • Acidic (B + D + E + Z)
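To make the class definitions above concrete, here is a minimal plain-Python sketch that counts residues per class and reports the percentage. It is an illustration of the idea, not the package's code; the function name `aa_composition` and the dictionary layout are assumptions made for this example.

```python
# Residue classes exactly as listed above (one-letter amino acid codes).
CLASSES = {
    "Tiny": "ACGST",
    "Small": "ABCDGNPSTV",
    "Aliphatic": "AILV",
    "Aromatic": "FHWY",
    "NonPolar": "ACFGILMPVWY",
    "Polar": "DEHKNQRSTZ",
    "Charged": "BDEHKRZ",
    "Basic": "HKR",
    "Acidic": "BDEZ",
}


def aa_composition(protein):
    """Return {class name: (count, percentage)} for an amino acid string."""
    seq = protein.upper()
    n = len(seq)
    result = {}
    for name, members in CLASSES.items():
        count = sum(1 for aa in seq if aa in members)
        percent = 100.0 * count / n if n else 0.0
        result[name] = (count, round(percent, 3))
    return result


for name, (count, percent) in aa_composition("MAKDF").items():
    print(f"{name:10s} {count:3d} {percent:7.3f}")
```

Each class contributes two numeric columns (count and percentage), which can be used directly as features.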


aindex : The Aliphatic Index is defined as the relative volume occupied by aliphatic side chains (Alanine, Valine, Isoleucine, and Leucine). It may be regarded as a positive factor for the increase of thermostability of globular proteins.
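As a rough illustration of how this index is computed, the sketch below uses the commonly cited formula AI = X(Ala) + 2.9 · X(Val) + 3.9 · (X(Ile) + X(Leu)), where X is the mole percentage of each residue. The function name `aliphatic_index` is hypothetical, and the package's own implementation may differ in detail.

```python
def aliphatic_index(protein):
    """Aliphatic index from mole percentages of Ala, Val, Ile and Leu."""
    seq = protein.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")

    def mole_percent(residue):
        return 100.0 * seq.count(residue) / n

    # Coefficients: relative side-chain volume of Val (2.9) and of
    # Ile/Leu (3.9) compared with Ala.
    a, b = 2.9, 3.9
    return (
        mole_percent("A")
        + a * mole_percent("V")
        + b * (mole_percent("I") + mole_percent("L"))
    )


print(aliphatic_index("MAKVLLIG"))
```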

boman : This index is equal to the sum of the solubility values for all residues in a sequence, divided by the number of residues to normalize it. It gives an overall estimate of the potential of a peptide to bind to membranes or to other proteins as receptors. A protein has high binding potential if the index value is higher than 2.48.

charge : It computes the theoretical net charge of a protein sequence.

hmoment : It computes the hydrophobic moment of a protein sequence. The hydrophobic moment is a quantitative measure of the amphiphilicity perpendicular to the axis of any periodic peptide structure, such as the α-helix or β-sheet. It can be calculated for an amino acid sequence of N residues and their associated hydrophobicities H_n.
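The calculation can be sketched as a vector sum: each residue contributes its hydrophobicity H_n at an angle n · δ around the axis, and the moment is the length of the resulting vector. The plain-Python sketch below assumes the hydrophobicities are already looked up from some scale, and assumes δ = 100° for an α-helix (a value around 160° is typically used for a β-sheet). The function name and signature are hypothetical, and windowing or normalization done by any particular package is left out.

```python
import math


def hydrophobic_moment(hydrophobicities, angle_deg=100.0):
    """Length of the summed hydrophobicity vectors for a periodic structure.

    hydrophobicities: per-residue values H_n, in sequence order.
    angle_deg: rotation per residue around the axis (delta).
    """
    delta = math.radians(angle_deg)
    sin_sum = sum(h * math.sin(n * delta) for n, h in enumerate(hydrophobicities))
    cos_sum = sum(h * math.cos(n * delta) for n, h in enumerate(hydrophobicities))
    return math.hypot(sin_sum, cos_sum)


# Alternating hydrophobic and hydrophilic residues at 180 degrees per
# residue place all hydrophobic residues on one face.
print(hydrophobic_moment([1.0, -1.0, 1.0, -1.0], angle_deg=180.0))
```

A large value means hydrophobic residues cluster on one side of the structure; a value near zero means they are spread evenly around the axis.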

hydrophobicity : This function calculates the GRAVY hydrophobicity index of an amino acid sequence using one of the 38 scales from different sources.
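GRAVY (grand average of hydropathy) is simply the mean of the per-residue scale values. As a minimal sketch, the example below assumes the Kyte-Doolittle scale, which is only one of the available scales; the names `KYTE_DOOLITTLE` and `gravy` are introduced here for illustration.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein, scale=KYTE_DOOLITTLE):
    """Mean hydropathy of the residues that appear in the scale."""
    values = [scale[aa] for aa in protein.upper() if aa in scale]
    if not values:
        raise ValueError("no scorable residues in sequence")
    return sum(values) / len(values)


print(gravy("MAKVLLIG"))
```

Positive values indicate a predominantly hydrophobic sequence, negative values a hydrophilic one. Residues missing from the scale (for example a stop symbol) are skipped in this sketch.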

instaIndex : It computes the instability index of a protein sequence. This index predicts the stability of a protein based on its amino acid composition. A protein whose instability index is smaller than 40 is predicted to be stable; a value above 40 predicts that the protein may be unstable.

kidera : This function calculates the average value over the sequence of one of the ten Kidera factors.

mw : This function calculates the molecular weight of a protein sequence. It is calculated as the sum of the mass of each amino acid, using the scale available in the Compute pI/Mw tool.
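A minimal sketch of this sum is shown below. The average residue masses are commonly tabulated values entered here as an assumption; they may differ slightly from the exact scale the function uses. One water molecule is added because the residue masses describe amino acids inside a chain, without the terminal H and OH. The names `RESIDUE_MASS` and `molecular_weight` are hypothetical.

```python
# Average residue masses in daltons (assumed, commonly tabulated values).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01524


def molecular_weight(protein):
    """Sum of residue masses plus one water for the free chain termini."""
    seq = protein.upper()
    unknown = set(seq) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"no mass for residues: {sorted(unknown)}")
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


print(round(molecular_weight("MAKVLLIG"), 2))
```

In this sketch, stop symbols or ambiguous codes must be stripped from the translated sequence first, since they have no mass entry.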

pI : It computes the isoelectric point (pI) of a protein sequence. The isoelectric point is the pH at which a particular molecule or surface carries no net electrical charge.
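The net charge and the isoelectric point are closely linked: the charge at a given pH follows from the Henderson-Hasselbalch equation applied to every ionizable group, and the pI is the pH where that charge crosses zero. The sketch below assumes one illustrative set of textbook-style pKa values; several pKa scales exist and give somewhat different results. All names are hypothetical, and the bisection search is a simple choice for this example, not necessarily what the package does.

```python
# Illustrative pKa values (an assumption; published scales differ).
PKA_POSITIVE = {"K": 10.5, "R": 12.4, "H": 6.0}               # protonated form is +1
PKA_NEGATIVE = {"D": 3.86, "E": 4.25, "C": 8.33, "Y": 10.0}   # deprotonated form is -1
PKA_N_TERM = 9.69
PKA_C_TERM = 2.34


def net_charge(protein, ph=7.0):
    """Theoretical net charge at a given pH (Henderson-Hasselbalch)."""
    positive = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))   # free N-terminus
    negative = 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))   # free C-terminus
    for aa in protein.upper():
        if aa in PKA_POSITIVE:
            positive += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            negative += 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return positive - negative


def isoelectric_point(protein, tol=1e-4):
    """pH at which the net charge is zero, found by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        # Net charge decreases monotonically as pH rises.
        if net_charge(protein, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


seq = "MAKVLLIG"
print(net_charge(seq, ph=7.0), isoelectric_point(seq))
```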

These are some of the main features that can be extracted from the sequence and hence can be fed to machine learning algorithms for prediction purposes.
