Bioinformatics

The cell is the basic structural, functional, and biological unit of all known living organisms.

The cell nucleus contains the cell's genetic material, DNA, which is packaged into chromosomes. Humans have 23 pairs of chromosomes, and the genome is distributed along them.

DNA stands for deoxyribonucleic acid. A DNA molecule has two strands, each built from four types of nucleotides:

  • Adenine (A)
  • Guanine (G)
  • Cytosine (C)
  • Thymine (T)

The two strands are complementary to each other: A pairs with T, and C pairs with G.

Genome Data

This is what genome data looks like
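In code, genome data is usually handled as a long string over the alphabet A, C, G, T. The snippet below is a minimal sketch of that idea. It reuses the short example sequence from the frequent-words section of these notes (a toy string, not a real genome), and the variable names are chosen only for illustration.

```python
from collections import Counter

# Toy sequence taken from the frequent-words example in these notes (not a real genome)
genome = "ACGTTGCATGTCGCATGATGCATGAGAGCT"

# Count how many times each nucleotide appears
counts = Counter(genome)

print(len(genome))          # total number of nucleotides
for base in "ACGT":
    print(base, counts[base])
```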

Genome replication

Genome replication means making a copy of the genome. It is like copying a text, except that it happens not once but millions of times. Genome replication is a complex process that biologists still do not fully understand.

DNA Replication

Finding the origin of replication is key to many applications, such as gene therapy and genetically modified foods like frost-resistant tomatoes and pesticide-resistant corn.

The idea of gene therapy is to infect a patient who lacks a specific gene with an artificial gene that encodes a therapeutic protein. Once inside the cell, the gene is replicated and the protein it encodes treats the patient.
First human gene therapy experiment link

So where does engineering come in? At first glance this looks like a purely biological problem: a biologist can locate the origin of replication experimentally, for example by cutting out pieces of the cell's DNA and finding which missing piece stops the cell from replicating. That piece must contain the origin of replication.

Stated this way, it is not a well-defined computational problem.

Hidden message in Replication origin

Edgar Allan Poe's short story The Gold-Bug tells of a hidden message that leads to pirate treasure. The message must be decoded to be understood.

It was noticed that the sequence ;48 is repeated frequently in the message. Given that the message was encoded from English, and that the most frequent word in English is THE, we can replace ; with T, 4 with H, and 8 with E.

We now have a better chance of decoding the message.

So frequent words are the clue to decoding a hidden message.
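The frequency argument above can be sketched in Python. The cipher fragment below is made up for illustration (it is not the text from the story), and guessing that the most frequent triple stands for THE is the assumption being demonstrated, not a general decoding method.

```python
from collections import Counter

# Hypothetical cipher fragment, invented for this example
cipher = ";48(;48)5;48"

# Count every window of 3 consecutive characters
k = 3
triples = Counter(cipher[i:i + k] for i in range(len(cipher) - k + 1))
most_common_triple, occurrences = triples.most_common(1)[0]

# Assume the most frequent triple encodes "THE" and substitute character by character
table = str.maketrans(most_common_triple, "THE")
partially_decoded = cipher.translate(table)

print(most_common_triple, occurrences)
print(partially_decoded)
```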

Counting words

So the first problem we need to solve is counting the number of times a pattern occurs in a text.

PatternCount(Text, Pattern)
        count ← 0
        for i ← 0 to |Text| − |Pattern|
            if Text(i, |Pattern|) = Pattern
                count ← count + 1
        return count

PatternCount("GCGCG", "GCG")
> Output
2

Let's try to implement it.
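One possible Python translation of the PatternCount pseudocode is sketched below. The snake_case name `pattern_count` is chosen here for illustration; the logic follows the pseudocode line by line.

```python
def pattern_count(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    # The last valid starting index is len(text) - len(pattern)
    for i in range(len(text) - len(pattern) + 1):
        # Compare the window of len(pattern) characters starting at i
        if text[i:i + len(pattern)] == pattern:
            count += 1
    return count


print(pattern_count("GCGCG", "GCG"))  # prints 2
```

Python's built-in `str.count` is not used here because it counts only non-overlapping matches: `"GCGCG".count("GCG")` returns 1.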

Getting Started with Python

Python is an interpreted, high-level, general-purpose programming language.

You have already installed Anaconda. We could use the Spyder IDE to run our code, but we will use Jupyter Notebook.

Variables

Python is a dynamically typed language: you do not declare a variable's type, it is determined by the value assigned to it.

x = 5
y = 'Hello SBME'
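A short sketch of what this means in practice: the same name can be rebound to values of different types, and `type()` reports the type of whatever value the name currently holds.

```python
x = 5
first_type = type(x)
print(first_type)    # <class 'int'>

# Rebinding x to a string is allowed; no declaration is needed
x = 'Hello SBME'
second_type = type(x)
print(second_type)   # <class 'str'>
```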

Lists

# Initialize the list
myList = [2, 55, 565]

# add an element at the end of the list
myList.append(8)

# insert an element at a specific index (here, index 1)
myList.insert(1, 7)

print( myList ) # ??
print( myList[0] ) # ??
print( myList[3] ) # ??
print( myList[4] ) # ??

Arithmetic Operations


x = 19
y = 18 

z = x / y
z = x * y
z = x + y
z = x - y

Comparison Operators

Operator   Meaning
==         Equal to
!=         Not equal to
<          Less than
>          Greater than
<=         Less than or equal to
>=         Greater than or equal to

Logical Operations

x = True
y = False 

x or y 
x and y 
not x 

If, elif, else


x = 25 
y = 20 

if x < y:
    print("y is greater than x")
elif x == y:
    print("x and y are equal")
else:
    print("x is greater than y")

Loops


for i in range(10):
  print( i )

i = 0
while i < 10 :
  print( i )
  i += 1

Functions

# The names values and total avoid shadowing the built-ins list and sum
def mean( values ):
  total = 0
  for element in values:
    total += element
  return total / len( values )

m = mean([1,12,42,1,23,12])
print( m )

Importing Libraries

import numpy as np

Frequent words

Given a text, find the most frequent words (k-mers) of a specific length k.

        FrequentWords(Text, k)
            FrequentPatterns ← an empty set
            for i ← 0 to |Text| − k
                Pattern ← the k-mer Text(i, k)
                Count(i) ← PatternCount(Text, Pattern)
            maxCount ← maximum value in array Count
            for i ← 0 to |Text| − k
                if Count(i) = maxCount
                    add Text(i, k) to FrequentPatterns
            remove duplicates from FrequentPatterns
            return FrequentPatterns
FrequentWords("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4)
> Output
CATG GCAT
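The FrequentWords pseudocode can be translated to Python along the following lines. This is a sketch: the names `pattern_count` and `frequent_words` are chosen here for illustration, a Python set takes care of the "remove duplicates" step, and the code assumes 1 ≤ k ≤ len(text).

```python
def pattern_count(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            count += 1
    return count


def frequent_words(text, k):
    """Return the set of most frequent k-mers in text (assumes 1 <= k <= len(text))."""
    # counts[i] holds the number of occurrences of the k-mer starting at index i
    counts = [pattern_count(text, text[i:i + k]) for i in range(len(text) - k + 1)]
    max_count = max(counts)
    # A set stores each k-mer once, so duplicates are removed automatically
    return {text[i:i + k] for i, c in enumerate(counts) if c == max_count}


print(sorted(frequent_words("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4)))  # ['CATG', 'GCAT']
```

This direct translation calls `pattern_count` once per position, so its running time grows roughly with the square of the text length. That is fine for short strings like the one above.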

Reverse complement

Given a DNA string, find its reverse complement: reverse the string and replace each nucleotide with its complement.

    def reverseComplement(Text):
        # Map each nucleotide to its complementary base
        complementDict = {"A": "T", "C": "G", "G": "C", "T": "A"}
        result = "" # an empty string
        # Walk the strand backwards, appending the complement of each base
        for base in reversed(Text):
            result = result + complementDict[base]
        return result

reverseComplement("AAAACCCGGT")
> Output
ACCGGGTTTT        

Section Demo

You can download the section demo from here

Finding Hidden Messages in DNA (Bioinformatics I)

Exercises

Choose the correct answer

  1. In DNA, Adenine (A) is the complement of
    • Guanine (G)
    • Cytosine (C)
    • Thymine (T)
    • None of the above
  2. In DNA, Cytosine (C) is the complement of
    • Adenine (A)
    • Guanine (G)
    • Thymine (T)
    • None of the above
  3. DNA replication produces …… of original DNA
    • different copies
    • identical copies
    • complemented copies
    • None of the above
  4. Gene therapy infects the patient with an artificial gene that encodes
    • Harmful protein
    • Therapeutic protein
    • Effectiveness protein
    • None of the above
  5. Finding frequent words is the key algorithm to locate ….. of the DNA
    • Replication origin
    • Complement
    • Sequence
    • None of the above
  6. Bioinformatics replaces the lab experiments with
    • Real experiments
    • Computational analysis
    • Advanced experiments
    • None of the above
  7. The most frequent 3-mer (3-letter word) in this DNA sequence (ATTGCATGC) is
    • ATT
    • ATG
    • TGC
    • None of the above
  8. The reverse complement of the DNA sequence (AAAACCCGGT) is
    • CCCCAAATTG
    • ACCGGGTTTT
    • TTTTGGGCCA
    • None of the above
  9. The output of the following code snippet is
      total = 0
      for i in range(5):
          total += i
      print(total)
    • 10
    • 20
    • 30
    • None of the above
  10. The output of this case is
      total = 0
      for i in range(5):
          total += i
          print(total)
    • Same as the previous question
    • Different
    • No output

If it is different, what is the output?