Extracting physicochemical features from the DNA sequence

Today, much attention in computational biology is focused on gene finding, i.e., the prediction of gene location and gene products from experimentally uncharacterized DNA sequences. To make such predictions, we must first know the characteristics of the gene sequences.

Moreover, most machine learning methods cannot be applied directly to a raw DNA sequence. We first need to describe the sequence as data, i.e., generate data by extracting features from the sequence. This generated data is then fed to the machine learning algorithm for prediction.

You can extract several physicochemical properties from the sequence, such as aliphatic composition, aromatic composition, non-polar composition, polar composition, charged composition, positive composition, negative composition, theoretical isoelectric point, acidic composition, basic composition, Boman (potential protein interaction) index, theoretical net charge, hydrophobic moment, hydrophobicity index, instability index, Kidera factors, amino acid length, molecular weight, etc. Note that most of these are properties of the protein sequence encoded by the DNA.

In this article, we will discuss how to extract several physicochemical features from a DNA sequence. The steps are as follows.

  1. First, convert the DNA sequence to its corresponding amino acid sequence. Please see the tables given below:

We can convert a DNA sequence to an amino acid sequence in R with the help of the seqinr package. See the following code script:
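To make the translation step concrete, here is a minimal sketch in plain Python. It is not the seqinr script: the function name `translate` and the `stop_at_stop` parameter are assumptions made for this illustration, and it only handles the standard genetic code in the first reading frame.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, stop_at_stop=True):
    """Translate a DNA string codon by codon, starting at the first base."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons containing unknown bases (e.g. N) become X.
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*" and stop_at_stop:
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"))  # MAIVMGR
```

Translation stops at the first stop codon by default; pass `stop_at_stop=False` to keep going and mark stop codons with `*`.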

  2. Once you have converted the DNA sequence to an amino acid sequence, you can extract several physicochemical properties from it. We can extract the following properties with the aacomp function in the Peptides package. See the script below:

The output of aacomp is a matrix with the number and percentage of amino acids in each of the following classes:

  • Tiny (A + C + G + S + T)
  • Small (A + B + C + D + G + N + P + S + T + V)
  • Aliphatic (A + I + L + V)
  • Aromatic (F + H + W + Y)
  • Non-polar (A + C + F + G + I + L + M + P + V + W + Y)
  • Polar (D + E + H + K + N + Q + R + S + T + Z)
  • Charged (B + D + E + H + K + R + Z)
  • Basic (H + K + R)
  • Acidic (B + D + E + Z)
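The class counts listed above are easy to reproduce by hand. The sketch below is a plain Python illustration, not the Peptides implementation; the names `CLASSES` and `aa_composition` are assumptions, and the class memberships are copied from the list above.

```python
# Amino acid classes, copied from the list above.
CLASSES = {
    "Tiny": "ACGST",
    "Small": "ABCDGNPSTV",
    "Aliphatic": "AILV",
    "Aromatic": "FHWY",
    "NonPolar": "ACFGILMPVWY",
    "Polar": "DEHKNQRSTZ",
    "Charged": "BDEHKRZ",
    "Basic": "HKR",
    "Acidic": "BDEZ",
}

def aa_composition(seq):
    """Return {class name: (count, percentage)} for a protein sequence."""
    seq = seq.upper()
    n = len(seq)
    result = {}
    for name, members in CLASSES.items():
        count = sum(1 for aa in seq if aa in members)
        percent = 100.0 * count / n if n else 0.0
        result[name] = (count, percent)
    return result

for name, (count, percent) in aa_composition("MAIVMGR").items():
    print(f"{name:10s} {count:3d} {percent:6.2f}")
```

Each row of the printed table is one feature pair (count and percentage) that can go into the feature vector.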


aindex : The Aliphatic Index is defined as the relative volume occupied by aliphatic side chains (Alanine, Valine, Isoleucine, and Leucine). It may be regarded as a positive factor for the increase of thermostability of globular proteins.
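As an illustration, the aliphatic index is commonly computed with Ikai's formula, AI = X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu)), where X is the mole percentage of each residue. The sketch below assumes that formula; the function name `aliphatic_index` is made up for this example and the result may differ slightly from the package output.

```python
def aliphatic_index(seq):
    """Aliphatic index using Ikai's coefficients (2.9 for Val, 3.9 for Ile/Leu)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")

    def mole_percent(aa):
        return 100.0 * seq.count(aa) / n

    return (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )

print(aliphatic_index("MAIVMGR"))
```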

boman : This index is equal to the sum of the solubility values of all residues in a sequence, divided by the number of residues to normalize it. It gives an overall estimate of the potential of a peptide to bind to membranes or other proteins as receptors. A protein has high binding potential if the index value is higher than 2.48.

charge : It computes the theoretical net charge of a protein sequence.

hmoment : It computes the hydrophobic moment of a protein sequence. The hydrophobic moment is a quantitative measure of the amphiphilicity perpendicular to the axis of any periodic peptide structure, such as the α-helix or β-sheet. It can be calculated for an amino acid sequence of N residues and their associated hydrophobicities Hn.
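For reference, Eisenberg's definition of the hydrophobic moment is the length of the vector sum of the hydrophobicities Hn, with residue n rotated by n times a fixed angle (100 degrees is the usual choice for an α-helix). The sketch below assumes that definition and takes the Hn values directly as input; the function name `hydrophobic_moment` is hypothetical, and the package may additionally use sliding windows and normalization that are not shown here.

```python
import math

def hydrophobic_moment(hydrophobicities, angle_deg=100.0):
    """Magnitude of the vector sum of H_n rotated by n * angle_deg."""
    delta = math.radians(angle_deg)
    sin_sum = sum(h * math.sin(n * delta) for n, h in enumerate(hydrophobicities))
    cos_sum = sum(h * math.cos(n * delta) for n, h in enumerate(hydrophobicities))
    return math.hypot(sin_sum, cos_sum)

# Hypothetical hydrophobicity values for a short peptide.
print(hydrophobic_moment([1.8, -4.5, 4.5, 4.2, -3.5]))
```

A large value means hydrophobic residues cluster on one face of the periodic structure; a value near zero means they are spread evenly around the axis.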

hydrophobicity : This function calculates the GRAVY hydrophobicity index of an amino acid sequence using one of the 38 scales from different sources.
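GRAVY is simply the average hydropathy of the residues. The sketch below assumes the Kyte-Doolittle scale, which is only one of the available scales; the names `KYTE_DOOLITTLE` and `gravy` are chosen for this example.

```python
# Kyte-Doolittle hydropathy values (one of many possible scales).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(seq, scale=KYTE_DOOLITTLE):
    """Mean hydropathy over the residues that appear in the scale."""
    values = [scale[aa] for aa in seq.upper() if aa in scale]
    if not values:
        raise ValueError("no scorable residues in sequence")
    return sum(values) / len(values)

print(round(gravy("MAIVMGR"), 3))
```

Positive values indicate an overall hydrophobic sequence, negative values an overall hydrophilic one.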

instaIndex : It computes the instability index of a protein sequence. This index predicts the stability of a protein based on its amino acid composition. A protein whose instability index is smaller than 40 is predicted to be stable, while a value above 40 predicts that the protein may be unstable.

kidera : This function calculates the average of a factor using one of the ten Kidera factors.

mw : This function calculates the molecular weight of a protein sequence. It is calculated as the sum of the mass of each amino acid using the scale available on the Compute pI/Mw tool.
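The calculation can be sketched as the sum of average residue masses plus one water molecule for the free termini. The masses below are the commonly tabulated average values and are an assumption of this example; check them against the scale you actually use. The names `RESIDUE_MASS`, `WATER_MASS` and `molecular_weight` are hypothetical.

```python
# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # added once for the free N- and C-termini

def molecular_weight(seq):
    """Approximate average molecular weight of an unmodified peptide."""
    seq = seq.upper()
    unknown = set(seq) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residues: {sorted(unknown)}")
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS

print(round(molecular_weight("MAIVMGR"), 2))
```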

pI : It computes the isoelectric point (pI) of a protein sequence. The isoelectric point is the pH at which a particular molecule or surface carries no net electrical charge.
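Net charge and pI are closely related: the charge at a given pH follows from the Henderson-Hasselbalch equation, and the pI is the pH where that charge crosses zero. The sketch below assumes one commonly used set of pKa values; published scales differ, so its output will not necessarily match the package. The names `net_charge` and `isoelectric_point` are chosen for this example.

```python
# Assumed pKa values; other published scales give slightly different results.
PKA_POSITIVE = {"N_TERM": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"C_TERM": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}

def net_charge(seq, ph=7.0):
    """Theoretical net charge at a given pH (Henderson-Hasselbalch)."""
    seq = seq.upper()
    pos_counts = {"N_TERM": 1, **{aa: seq.count(aa) for aa in "KRH"}}
    neg_counts = {"C_TERM": 1, **{aa: seq.count(aa) for aa in "DECY"}}
    positive = sum(
        n / (1 + 10 ** (ph - PKA_POSITIVE[g])) for g, n in pos_counts.items()
    )
    negative = sum(
        n / (1 + 10 ** (PKA_NEGATIVE[g] - ph)) for g, n in neg_counts.items()
    )
    return positive - negative

def isoelectric_point(seq, tol=1e-6):
    """Find the pH of zero net charge by bisection on the interval [0, 14]."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # The net charge decreases as pH rises, so keep the half with the sign change.
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(net_charge("MAIVMGR"), 3), round(isoelectric_point("MAIVMGR"), 2))
```

Bisection works here because the net charge falls steadily as the pH increases, so there is exactly one zero crossing.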

These are some of the main features that can be extracted from the sequence and then fed to machine learning algorithms for prediction.

Sitecore Analytics Basics

Analytics is one of the basic pillars of Sitecore, and any website that runs on Sitecore without implementing Analytics is simply overspending.

I remember that when I started working on Sitecore, I used to wonder why anyone in their right mind would pay so much for Sitecore just to host a website. I knew Sitecore provided the content management features that all CMS systems provide, but it was about a year into Sitecore that I was really struck by all the features that Sitecore Analytics provides and all the doors that it opens.

As a beginner, I just knew that websites can track data using cookies and similar techniques, but that was all on the client side. I was not aware of everything that can be done on the server side. My initial impression of Sitecore Analytics was that it was a dashboard for checking how many users visited a certain webpage on your site.

In one of my side projects, I had to work on Personalization in Sitecore. This is when I understood the importance and potential of Analytics in Sitecore. Not only did I learn how personalization works in Sitecore, I also came to understand how user interactions are handled and stored in Sitecore. In this blog post, I will briefly discuss how Analytics in Sitecore works.

Note: This is based on my experience with Sitecore 8.2.

Whenever a user requests a webpage in Sitecore, the following steps take place:

  1. The user's request is sent to the browser.
  2. The browser passes the request on to the Sitecore Experience Platform (Sitecore XP).
  3. Based on the data saved in cookies, Sitecore XP checks the analytics database (MongoDB in this case) to see whether the user has visited the website before.
  4. If the user has visited the website before, Sitecore XP moves on to the personalization engine to check whether any rules apply to the current user. These rules can vary from basic ones, such as a webpage having been visited, to complicated ones, such as goals being triggered or the engagement value exceeding a threshold.
  5. If a rule applies, Sitecore XP applies the conditions related to the rule and displays the personalized content to the user.
  6. All of this data is also sent to the SQL database at specific intervals so that it is available for dashboards and reports.