motif finding algorithms sudarsan padhy iiit...

Motif Finding Algorithms

Sudarsan Padhy IIIT Bhubaneswar

Outline

• Gene Regulation• Regulatory Motifs• The Motif Finding Problem• Brute Force Motif Finding• Consensus and Pattern Branching: Greedy Motif Search• PMS: Exhaustive Motif Search• Comparison of Motif Finding Algorithms• Presentation of Our Work on Motif Finding

Central Dogma: DNA -> RNA -> Protein

Protein

RNA

DNA

transcription

translation

CCTGAGCCAACTATTGATGAA

PEPTIDE

CCUGAGCCAACUAUUGAUGAA

Splice Sites

Almost all donor splice sites exhibit GU, Almost all acceptor splice site exhibit AG. Not all GUs and AGs are used as splice site, exploit that exons have higher GC content or that certain motifs are locatednearby

Transcription Start Sites (TSS)

Combinatorial Gene Regulation

• A microarray experiment showed that when gene X is knocked out, 20 other genes are not expressed

– How can one gene have such drastic effects?

Regulatory Proteins• Gene X encodes regulatory protein, a.k.a. a

transcription factor (TF)

• The 20 unexpressed genes rely on gene X’s TF to induce transcription

• A single TF may regulate multiple genes

Regulatory Regions• Every gene contains a regulatory region (RR)

typically stretching 100-1000 bp upstream of the transcriptional start site

• Located within the RR are the Transcription Factor Binding Sites (TFBS), also known as motifs, specific for a given transcription factor

• TFs influence gene expression by binding to a specific location in the respective gene’s regulatory region - TFBS

Transcription Factor Binding Sites

• A TFBS can be located anywhere within the Regulatory Region.

• TFBS may vary slightly across different regulatory regions since non-essential bases could mutate

MOTIF• A nucleotide / amino acid sequence pattern

having biological significance• Occurs repeatedly,either in same molecule or in

many molecules. Found in upstream intergenic region

• Functionally important regions of genome( gene) can be recognized by searching such patterns

• Used for locating binding sites, regulatory signals, to control gene expression

• Identifying potential drug target sites

Motifs and Transcriptional Start Sites

geneATCCCG

geneTTCCGG

geneATCCCG

geneATGCCG

geneATGCCC

Identifying Motifs

• Genes are turned on or off by regulatory proteins

• These proteins bind to upstream regulatory regions of genes to either attract or block an RNA polymerase

• Regulatory protein (TF) binds to a short DNA sequence called a motif (TFBS)

• So finding the same motif in multiple genes’regulatory regions suggests a regulatory relationship amongst those genes

Identifying Motifs: Complications• We do not know the motif sequence

• We do not know where it is located relative to the genes start

• Motifs can differ slightly from one gene to the next

• How to discern it from “random” motifs?

The Motif Finding Problem• Given a random sample of DNA sequences:

cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca

Ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc

• Find the pattern that is implanted in each of the individual sequences, namely, the motif

The Motif Finding Problem (cont’d)

• Additional information:

– The hidden sequence is of length 8

– The pattern is not exactly the same in each array because random point mutations may occur in the sequences


• The patterns revealed with no mutations:cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaact

agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc

acgtacgtConsensus String


• The patterns with 2 point mutations:cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaacc

at

agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc


• The patterns with 2 point mutations:cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaacc

at





Can we still find the motif, now that we have 2 mutations?

Defining Motifs

• To define a motif, lets say we know where the motif starts in the sequence

• The motif start positions in their sequences can be represented as s = (s1,s2,s3,…,st)

Motifs: Profiles and Consensusa G g t a c T

tC c A t a c g

tAlignment a c g t T A g

ta c g t C c A

tC c g t a c g

G

_________________

A 3 0 1 0 3 1 10

Profile C 2 4 0 0 1 4 0 0

G 0 1 4 0 0 0 31

T 0 0 0 5 1 0 14

_________________

Consensus A C G T A C G T

• Line up the patterns by their start indexes

s = (s1, s2, …, st)

• Construct matrix profile with frequencies of each nucleotide in columns

• Consensus nucleotide in each position has the highest score in column

Consensus

• Think of consensus as an “ancestor”motif, from which mutated motifs emerged

• The distance between a real motif and the consensus sequence is generally less than that for two real motifs

Consensus (cont’d)

Evaluating Motifs

• We have a guess about the consensus sequence, but how “good” is this consensus?

• Need to introduce a scoring function to compare different guesses and choose the “best” one.

Defining Some Terms• t - number of sample DNA sequences• n - length of each DNA sequence• DNA - sample of DNA sequences (t x n

array)

• l - length of the motif (l-mer)• si - starting position of an l-mer in

sequence i• s=(s1, s2,… st) - array of motif’s starting

positions

Parameters

cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat





l = 8

t=5

s1 = 26 s2 = 21 s3= 3 s4 = 56 s5 = 60s

DNA

n = 69

Scoring Motifs

• Given s = (s1, … st) and DNA:

Score(s,DNA) =

a G g t a c T tC c A t a c g ta c g t T A g ta c g t C c A tC c g t a c g G_________________

A 3 0 1 0 3 1 1 0C 2 4 0 0 1 4 0 0G 0 1 4 0 0 0 3 1T 0 0 0 5 1 0 1 4

_________________

Consensus a c g t a c g t

Score3+4+4+5+3+4+3+4=30

l

t

∑= ∈

l

i GCTAkikcount

1 },,,{),(max

The Motif Finding Problem

• If starting positions s=(s1, s2,… st) are given, finding consensus is easy even with mutations in the sequences because we can simply construct the profile to find the motif (consensus)

• But… the starting positions s are usually not given. How can we find the “best”profile matrix?

The Motif Finding Problem: Formulation

• Goal: Given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the consensus score

• Input: A t x n matrix of DNA, and l, the length of the pattern to find

• Output: An array of t starting positions s = (s1, s2, … st) maximizing Score(s,DNA)

Motif finding

• The motif finding problem manifests itself in two forms

Known motif and unknown position: given a motif of length lwith d mismatches, find its position in a genomic/protein sequence.

Unknown motif and unknown position: given a set of sequences each of length m find a motif of length l with d mismatches.

• (l,d)-motif problem Introduced by Pavel Pevzner in 2001

The problem is to determine the motif M of length l , given nnucleotide sequences each of length m, and each containing a planted variant of M. More precisely, each such planted variant is a substring M that is with up to d point mutations. (d << l)

Expected number of motifs

• Assume that the sequences are i.i.d – Probability that a given l-mer (l length motif) C occurs with up-to d

mutations at a given position of a random sequence is

• Then the expected number of length l motifs that occur with up to d mutations at least once in each of the t random length n sequences is:

• Given t sequences of length n each, l the length of the motif, using a Brute force approach the complexity is l (n – l + 1)t = O(l nt).

• For t = 10, n = 1000, l = 10 { Fastest Computer-Jaguar-1759 tera Flops}– 1030 computations – 1018 /( 3600x24x365 x1759)>20 Million Years

The Motif Finding Problem: Brute Force Solution

– Compute the scores for each possible combination of starting positions s

– The best score will determine the best profile and the consensus pattern in DNA

– The goal is to maximize Score(s,DNA) by varying the starting positions si, where:

si = [1, …, n-l+1]i = [1, …, t]

BruteForceMotifSearch

1. BruteForceMotifSearch(DNA, t, n, l)2. bestScore 03. for each s=(s1,s2 , . . ., st) from (1,1 . . . 1)

to (n-l+1, . . ., n-l+1)4. if (Score(s,DNA) > bestScore)5. bestScore score(s, DNA)6. bestMotif (s1,s2 , . . . , st) 7. return bestMotif

Running Time of BruteForceMotifSearch

• Varying (n - l + 1) positions in each of tsequences, we’re looking at (n - l + 1)t sets of starting positions

• For each set of starting positions, the scoring function makes l operations, so complexity is l (n – l + 1)t = O(l nt)

• That means that for t = 8, n = 1000, l = 10 we must perform approximately 1020

computations – it will take billions years

CONSENSUS: Greedy Motif Search

• Find two closest l-mers in sequences 1 and 2 and forms 2 x l alignment matrix with Score(s,2,DNA)

• At each of the following t-2 iterations CONSENSUS finds a “best” l-mer in sequence i from the perspective of the already constructed (i-1) x l alignment matrix for the first (i-1) sequences

• In other words, it finds an l-mer in sequence i maximizing

Score(s,i,DNA)

under the assumption that the first (i-1) l-mers have been already chosen

• CONSENSUS sacrifices optimal solution for speed: in fact the bulk of the time is actually spent locating the first

Some Motif Finding Programs

• CONSENSUSHertz, Stromo (1989)

• GibbsDNALawrence et al (1993)

• MEMEBailey, Elkan (1995)

• RandomProjectionsBuhler, Tompa (2002)

• MULTIPROFILER Keich, Pevzner (2002)

• MITRAEskin, Pevzner (2002)

• Pattern BranchingPrice, Pevzner (2003)

Planted Motif Challenge

• Input:– n sequences of length m each.

• Output: – Motif M, of length l– Variants of interest have a hamming distance

of d from M

How to proceed?• Exhaustive search?

• Run time is high

Best NeighborBranch from the seed strings

Find best neighbor -highest score

Don’t consider branches where the upper bound is not as good as best score so far

Scoring• PatternBranching use total distance score:• For each sequence Si in the sample S = {S1, . . . , Sn}, let

d(A, Si) = min{d(A, P) | P ∈ Si}.• Then the total distance of A from the sample is

d(A, S) = ∑ Si ∈ S d(A, Si).

• For a pattern A, let D=Neighbor(A) be the set of patterns which differ from A in exactly 1 position.

• We define BestNeighbor(A) as the pattern B ∈D=Neighbor(A) with lowest total distance d(B, S).

PatternBranching Algorithm

PatternBranching Performance

• PatternBranching is faster than other pattern-based algorithms

• Motif Challenge Problem: – sample of n = 20 sequences– N = 600 nucleotides long– implanted pattern of length l = 15 – k = 4 mutations

PMS (Planted Motif Search)

• Generate all possible l-mers from out of the input sequence Si. Let Ci be the collection of these l-mers.

• Example:AAGTCAGGAGTCi = 3-mers:AAG AGT GTC TCA CAG AGG GGA GAG

AGT

All patterns at Hamming distance d = 1

AAG AGT GTC TCA CAG AGG GGA GAG AGTCAG CGT ATC ACA AAG CGG AGA AAG CGTGAG GGT CTC CCA GAG TGG CGA CAG GGTTAG TGT TTC GCA TAG GGG TGA TAG TGTACG ACT GAC TAA CCG ACG GAA GCG ACTAGG ATT GCC TGA CGG ATG GCA GGG ATTATG AAT GGC TTA CTG AAG GTA GTG AATAAC AGA GTA TCC CAA AGA GGC GAA AGAAAA AGC GTG TCG CAC AGT GGG GAC AGCAAT AGG GTT TCT CAT AGC GGT GAT AGG

Sort the lists

AAG AGT GTC TCA CAG AGG GGA GAG AGTAAA AAT ATC ACA AAG AAG AGA AAG AATAAC ACT CTC CCA CAA ACG CGA CAG ACTAAT AGA GAC GCA CAC AGA GAA GAA AGAACG AGC GCC TAA CAT AGC GCA GAC AGCAGG AGG GGC TCC CCG AGT GGC GAT AGGATG ATT GTA TCG CGG ATG GGG GCG ATTCAG CGT GTG TCT CTG CGG GGT GGG CGTGAG GGT GTT TGA GAG GGG GTA GTG GGTTAG TGT TTC TTA TAG TGG TGA TAG TGT

Eliminate duplicatesAAG AGT GTC TCA CAG AGG GGA GAG AGTAAA AAT ATC ACA AAG AAG AGA AAG AATAAC ACT CTC CCA CAA ACG CGA CAG ACTAAT AGA GAC GCA CAC AGA GAA GAA AGAACG AGC GCC TAA CAT AGC GCA GAC AGCAGG AGG GGC TCC CCG AGT GGC GAT AGGATG ATT GTA TCG CGG ATG GGG GCG ATTCAG CGT GTG TCT CTG CGG GGT GGG CGTGAG GGT GTT TGA GAG GGG GTA GTG GGTTAG TGT TTC TTA TAG TGG TGA TAG TGT

Find motif common to all lists

• Follow this procedure for all sequences• Find the motif common all Li (once

duplicates have been eliminated)• This is the planted motif

Motif finding

• The motif finding problem manifests itself in two forms

Known motif and unknown position: given a motif of length lwith d mismatches, find its position in a genomic/protein sequence.

Unknown motif and unknown position: given a set of sequences each of length m find a motif of length l with d mismatches.

• (l,d)-motif problem Introduced by Pavel Pevzner in 2001

The problem is to determine the motif M of length l , given nnucleotide sequences each of length m, and each containing a planted variant of M. More precisely, each such planted variant is a substring M that is with up to d point mutations. (d << l)

PMS Running Time

• It takes time to– Generate variants – Sort lists– Find and eliminate duplicates

• Running time of this algorithm:

w is the word length of the computer

REFERENCES

1. N.C.Jones,P.A.Pevzner-An Introduction to Bioinformatics Algorithms- MIT press-2004

2. P.A.Pevzner-Computational Molecular Biology-An Algoritmic Approach-MIT press-2000

3. D.Gusfield- Algorithms on Strings, Trees and Sequences- Computer Science and Computational Biology-Cambrigde Univ. press-1997

Other Methods

• Existing approaches with limitations– The motif finding problem has been studied extensively for

many years in the literature using different approaches.

• Statistical– [Workman and Stormo(2000); Thijs et al.(2001); Wang and

Stormo(2003); Xing et al.(2004); Regnier and Denise(2004); Sinha et al. [2004]; Siddharthan et al. (2005); Shida (2006 ); Leung and Chin (2006); Carmack et al.(2007)]

– The limitations of existing statistical approach is that it ignores evolutionary relationship which can give some information about where a motif is most likely to be found between very similar genomes.

Other Methods(continued)• Heuristic approach

– [Liu et al.(2002); Buhler and Tompa(2002); Blanchette and Tompa (2002); Liu et al.(2004); Hu et al.(2006); Mendes et al.(2006); Wei and Jensen (2006); Peng et al. (2006)]

– has polynomial complexity but does not guarantee optimality, where as Combinatorial (exhaustive) approach guarantees optimality at the cost of combinatorial complexity.

• Exhaustive– [Marsan and Sagot (2000); Pevzner and Sze (2000); Sinha and

Tompa(2000); Apostolico et al.(2000); Eskin and Pevzner (2001); Liu et al.(2002 ); Sinha (2003); Wang et al.(2005); Hon and Jain (2006); ]

– Approach guarantees optimality at the cost of combinatorial complexity.

Despite many studies this problem is far from being satisfactorily solved.

OUR METHODS(A. mohapatra,P.M.misra,S.Padhy)

• Motif finding using Information-Entropy with Kullback-Leibler-Divergence

– International Journal of Computer Science and Network Security, Vol. 9 No. 1 pp. 147-154, 2009(ISSN: 1738-7906)

• Motif Search in DNA Sequences Using Generalized Suffix Tree– IEEE Computer Society ,( ISBN: 0-7695-3068-0, DOI: 10.1109/ICIT.2007.18)

• Motif Prediction using Weighted Degree String Kernels with Shift and Mismatch

– ACM (ICAC3-09), Pages 56-61, ACM ( ISBN: 978-1-60558-351-8 DOI: http://doi.acm.org/10.1145/1523103.1523116)

. Prediction of Motifs with Unknown Length and Allowable Mismatches using Kernel Based Approach

Motif finding using Information-Entropy with

Kullback-Leibler-Divergence

International Journal of Computer Science and Network Security, Vol. 9 No. 1 pp. 147-154, 2009(ISSN: 1738-7906)

Related Motif finding Approaches

• Information entropy, first introduced by Claude E.Shannon in his paper “A Mathematical theory of Communications” in 1948 describes information content of a signal or event.

• Models based on Positional Weight Matrix [Bucher,1990]– [CONSENSUS, Gibbs sampling, EM etc.] PWM can be created by

counting the number of times a base is observed at each position– Limitations

Existing algorithms based on PWM consider the most popular element in each column of the alignment matrix, and completely ignore the rest which are comparatively weak (probability of occurrence is less).

• Models based on Information Entropy [Schneider, 1990][LOGOS,2004]From information entropy point of view all sequences (strong or weak) can contain different amount of information about the biological signal in various forms and thus can be a potential candidate for replacing the PWM model

Contd.• Let Σ be an alphabet of size N, for DNA Σ = {A, C, G, T}.

A symbol is received from an alphabet of size N, Letters are sent with equal probability

• Thus the information entropy contained in a symbol from an alphabet of size N is log 2 N , under uniform probability distribution.

• Close parallels between the mathematical expressions for the thermodynamic entropy S, and the Shannon information theoretic entropy H. The measure of information (Shannon entropy) is mathematically, similar to that of disorder (Boltzmann entropy)

∑=

−=

==

N

iipip

nnN

InppH

12

21

log

log),...,(

Motifs: Profiles and Consensus

a G g t a c T tC c A t a c g t

Alignment a c g t T A g ta c g t C c A tC c g t a c g G

_________________

A 3 0 1 0 3 1 1 0Profile C 2 4 0 0 1 4 0 0

G 0 1 4 0 0 0 3 1T 0 0 0 5 1 0 1 4

_________________

Consensus A C G T A C G T

• Line up the patterns by their start indexes

s = (s1, s2, …, st)

• Construct matrix profile with frequencies of each nucleotide in columns

• Consensus nucleotide in each position has the highest score in column

PWMTTGTGGCTTTTGATAAGTGTCATTTGCACTGTGAGATGCAAAGTGTTAAATTTGAA

TTGTGATATTTATTACGTGATATGTGAGTTGTGAGCTGTAACCTGTGAATTGTGAC

GCCTGACTTGTGATTTGTGATGTGTGAACTGTGACATGAGACTTGTGAG

SEQUENCES

PWM

Models based on Information Entropy

• Given a set of genomic sequences, it can be used to rank each member sequence according to its information content.

• This helps in evaluating each input sequence and takes care of weak signals which were completely ignored by earlier PWM model.

• Limitations– The model based on information entropy measures the

information quantity. It says nothing about the quality of information.

– Given two substrings in a sequence with equal quantity of information, only may contain biologically relevant information (e.g. motif).

Proposed Methodology : Shannon Entropy along with Kullback-Leibler Divergence

• Motivations– Biological observation

• The biological motif, more stable structure and is expected to have lower entropy

– Physical observation• Chaotic systems have a higher entropy value and the

stable systems have lower entropy value. Motifs are short conserved regions in DNA, RNA, or Protein sequences and are expected to have lower entropies.

Proposed Methodology : Shannon Entropy Along With Kullback-Leibler Divergence

• From Kullback–Leibler divergence– In probability theory the Kullback–Leibler

divergence is a non-commutative measure of the difference between two probability distributions .

– It is non-negative and is 0 if both distributions are equivalent (p=q).

– The smaller the divergence, the more similar is the distribution of the two variables .

– This motivates that the measure based on KL divergence can be used to model the important aspects of interesting signals or significant patterns (e.g. motifs) in DNA sequences

AlgorithmStep1. Given a set of sequences and length of an unknown signal,

divide each sequence to get substrings of length equal to the given length of the signal.

Step2. Compute the information entropy of all substrings obtained in Step 1.

Step3. Choose those substrings that fall within a selected range of lower information entropies computed in step2.

Step4. Find a consensus sequence of the set of substrings obtained in step3.

Step5. Calculate the Kullback Leibler divergence for each substrings of step3 with respect to the consensus string obtained in step4.

Step6. Keep those substrings that fall within a selected range of lowerKullback-Leibler divergence computed in step5.

Step7. The consensus of the set of strings described in Step6 is the required signal.

Implementation

• We tested our method on various synthetic datasets.– Randomly implanted motif

• As the results on the synthetic data set are promising, we applied our method to real data sets:– transgenic data regions of E.Coli – ECRDB70 database.

• Implemented in MatLab 7.1, Windows environment

Performance Evaluation

• The performance of our model on the real data set was computed using the following standard parameters

– Sensitivity: It is the percent of true motifs correctly predicted as true motifs. It measures how good our model in hitting the real motifs is.

– Specificity: It is the percent of non-motifs correctly predicted as non-motifs. It measures how good our model in hitting the non motifs.

– Accuracy: It is percentage of correctly predicted motifs. It measures how much accurate is the model in correctly predicting the motifs.

Result

Result

Information Entropy

Synthetic Data with randomly implanted motif

Motif - TAGCTTCATCGTTGA

Predicted - AAAAAAGAAAAAGAA

Result

Kullback Leibler Divergence


Motif – AGAGAGAGAGAGAGA

Predicted - AGAGAGAGAGAGAGA

Result




Predicted - AATTTTCTTCATTTA

Result




Predicted - TAGTTTCATCATTGA

Result


E.Coli ECRDB Database

Motif - AAATTTCGA

Predicted - AACTCTCGA

Result• The KL-Divergence model generated the

consensus with an improved accuracy of 74% as compared to 57% in the model based only on Information entropy.

• The sensitivity and specificity of the KL-Divergence model were 75% and 72% respectively as compared to 33% and 59% in the model based on Information entropy.

Conclusion• The PWM based model ignores the weak signals since

the model completely discards the bases with lower probability of occurrence.

• The model based on information entropy overcomes the deficiencies associated with the PWM model but fails to distinguish between signal and noise when the information contents are same.

• The proposed model based on information entropy along with Kullback-Leibler divergence successfully addresses the problem of distinguishing between the relevant and irrelevant information.

conclusion

• We observe from the above test runs of the proposed algorithm that besides predicting the motifs which are correctly predicted by other methods it also predicts more effectively the motifs which are otherwise not correctly predicted by other models.

Motif Search in DNA Sequences Using Generalized Suffix Tree

IEEE Computer Society ,( ISBN: 0-7695-3068-0, DOI: 10.1109/ ICIT. 2007.18)

Generalized Suffix Tree• Generalized Suffix tree (GST)

– When a set of strings are required to be represented by suffix tree, a generalized suffix tree (GST) is used.

– A GST for strings S1, S2, S3…..SN can be built in O(|S1|+|S2|+|S3|+……|SN|) time using Ukkonen algorithm.

Generalized suffix tree (Example)

Let s1=abab and s2=aab here is a generalized suffix tree for s1 and s2

{$ #b$ b#ab$ ab#bab$ aab#abab$

} 1

2

a

b

ab

$

ab$

b

3

$

4

$

5$

1

b#

a

2

#

3

#4

#

Related Work• Leaf count information

– has been proven to be useful in reducing the search space for a given set of sequences.

– When the nodes are ranked on the basis of leaf count in a generalized suffix tree (GST) and if there is a partial match in the label of the first edge, the chances of full match in the subtree are highest. If there is no partial match, the biggest subtree from the search space can be eliminated from further processing.

Limitations– This may not always be true in biological applications.– The occurrences of the motif in the given genomic sequences

may not be exact.– Hence it is important to find them even in the presence of small

errors.

Contd…• Appropriate branching decision

– Reduce the search space. – The branching decision is based on very limited information about the

likelihood that a particular branch will lead to a solution.Limitation

– This decision is usually taken at a very early stage of the tree, yet it can have dominating influence on the overall efficiency of the search process.

– It may so happen that even the “reasonable” choices made at an early stage can lead to extremely poor results.

• Dynamic branching– has been used as another strategy for further reduction of search space. It

is capable of identifying un-influential branches on the basis of more reliable information available from later stages.

LimitationIt does not allow mismatches.

Additional Parameters

– Edge label lengthThe number of characters on the edge from the root

to a node, is an important parameter in measuring the biological information it carries.

– Frequency of visits of an edgeMore visits of an edge in the GST points to a more

probable conserved region and signify higher probability of occurrences of that pattern in the motif.

– Partial match with interspersed mismatchDuring the process of exact matching if there is a

partial match in an edge then the chances of a full match in the later stage is maximized.

Proposed Scoring Scheme– Let x be the number of leaves in the subtree

rooted at v.

– Edge length of node v is the number of characters from the root to v.

– Let t be edge length of node v, y be the frequency of visit of that edge. Let z be the number of matches of the edge label with the given motif with match weight w.

– Then the total score of node v = x + y*t + z*w.

Algorithm1. Read all the input DNA sequences2. Construct a generalized suffix tree (GST) for the

input sequences. While constructing the GST, at each internal node at tree depth i, store the edge label length, number of leaves and frequency of visit of an edge.

3. For each edge label at tree depth i compute the partial match score comparing with the given motif.

4. Compute the scoring function for each of the nodes at tree depth i. Rank the nodes in non-increasing order of the scoring function. If total score of two sub trees are equal we assign higher rank to the one that has more leaf count.

Algorithm ….5. Search the sub tree under that node that has got

highest score Update l as l - number of characters in the path traversed so far .

Update d as d - the number of mismatches in the path traversed so far. If along an edge number of mismatches >d then visit the next ranked node.

6. Otherwise7. Repeat step 3 and 4 till a motif is reported. (l=0)

Result

Conclusion• We have discussed the motif finding problem as a type

of tree based search problem.

• Presented an algorithm that preprocess the generalized suffix tree using parameters like leaf counts, edge label length with frequency of its visit and partial match with allowable interspersed mismatches.

• A composite scoring scheme is used to rank the nodes and is used to discard the unproductive search space by selecting the highest ranking search space.

conclusion• Allows mismatches

• Reduces search space

• Explained with synthetic and implanted data

Motif Prediction using Weighted Degree String Kernels with Shift and Mismatch.

ACM (ICAC3-09), Pages 56-61, ACM ( ISBN: 978-1-60558-351-8 DOI: http://doi.acm.org/10.1145/1523103.1523116)

• Problem: Find a given motif of length l with up to m number of mismatches in a given set of DNA sequences.

• We use a class of string kernel, called ‘weighted degree kernel with shift’ and extend it to allow mismatches.

• we use the suffix tree based mismatch tree data structure to train the SVM using a scoring scheme as a speedup measure during implementation

Support Vector Machines (SVM)

• Support Vector Machines (SVMs) are supervised learning algorithms first introduced by Vapnik in 1995.

• SVM use Kernel functions as the core for classification and regression analysis.

• The kernel function corresponds to an inner product in some expanded feature space.

• The computation in high dimensional feature space is avoided through the use of kernels.

• Higher dimensional feature space allows obtaining a linear separation for data which are otherwise not linearly separable in lower dimension.


SVMs and related kernel methods are extremely good at solving prediction problems in computational biology.

– Can easily handle non-vector inputs such as variable length sequences or graphs.

– Ability to deal with high-dimensional large data sets.

– Flexibility in modeling even with small training samples.

– Due to Mercer’s conditions on the kernels the corresponding optimization problems are convex and hence have no local minima.

These facts mark a clear advantage of SVM over other pattern recognition algorithms, such as neural networks.

Related Work• Weighted Degree Kernel [Ratsch, G., & Sonnenburg, S. (2004)]

– The WD kernel of order K compares two sequences x and x’ of equal length L by summing all contributions of k- mer matches of length k ε {1, 2, ….K’ )weighted by coefficient βk

where uk,i(x) = string of length k starting at position i of the sequence x. I(.) evaluates to 1 when the argument is true, zero otherwise.

• Limitations– Recognition of matching blocks using WD kernel strongly depends on the

position of the subsequence and does not tolerate any positional variation, e.g. if there is a single shift in one of the sequence by just only one position, the WD kernel fails to discover similar blocks and returns a lower similarity score

))()((),( ',

1

1,

1

''

xuxuIxxK ik

kL

iik

K

kk =∑∑=

+−

==β

)1()1(2

''

'

++−

=KK

kKkβ

Related Work

Weighted Degree Kernel with shift (WDS) [Ratsch, G., Sonnenburg, S., & Scholkopf, B. (2005)]

• WD kernel with shift tolerates small positional variations in the sequences to be compared.

• Where

• Limitations• Do not allow mutations (mismatches)

∑∑ ∑==≤+

=

+−

=

s

sLis

k,i,s,x,xs

K

k

kL

ik

''μδβ)K(x,x

0

'

1

1

1

[ ]))(xu(x)I(u))(xu(x)I(uμ 'sk,ik,i

'k,isk,ik,i,s,x,x' ++ =+==

)(kK)k(Kβk 1'

1'2++−

=)(s/δs 121 +=

Proposed Kernel

• To allow mutations in the candidate motifs, weighted degree kernel with shift and mismatches is proposed.

Where

u ≠ u’ evaluated to true if and only if there are exactly m mismatches between u and u’. Choose βk,m = total weight / no. of possible m-mismatching k-mer.

∑∑ ∑∑=≤+

=

+−

===

s

sLis

xxsiks

K

k

kL

imk

M

mxxk

0,,,,

1

1

1,

0

''),( μδβ

⎥⎦⎤

⎢⎣⎡

+≠+≠+= )'

(xsk,ium(x)k,iI(u)'

(xk,ium(x)sk,iI(u'k,i,s,x,xμ

Proposed Kernel

• To allow mutations in the candidate motifs, weighted degree kernel with shift and mismatches is proposed.

Where

u ≠ u’ evaluated to true if and only if there are exactly m mismatches between u and u’. Choose βk,m = total weight / no. of possible m-mismatching k-mer.

∑∑ ∑∑==≤+

=

+−

==

s

sLis

xxsiks

K

k

kL

imk

M

mxxK

0 ,,,,

'

1

1

1,

0

''),( μδβ

⎥⎦⎤

⎢⎣⎡

+≠+≠+= )'

(xsk,ium(x)k,iI(u)'

(xk,ium(x)sk,iI(u'k,i,s,x,xμ

AlgorithmStep1: Input the m training sequences

and n test sequences from files in FASTA format

Step 2: Create two arrays of strings, one for the training DNA sequences and other for the test and validation DNA sequences. The sizes of these arrays are m x m and n x n respectively.

Step 3: Initialize the parameters such as the length of the motif, allowable mismatches, maximum shift, and length of sequences to a desired value. Also initialize the constant C>0,that sets the relative importance of maximizing the margin and minimizing the amount of slack.

Step 4: Compute the kernel function for each pair of training sequences using the following steps:Step (4a): Compare all pairs of sequences by summing all contribution of the k-mer matches of length 1 to k.

Step (4b): Compute the indicator function which evaluates to 1 when the contributions are same, zero otherwise.

Step (4c): Reward each matching substring by a weighting coefficient :

Step 5: Set the number of iterations, initial margin of separation and precision for the SVM computation, and initial values of the Lagrange multipliers

Step 6: Compute the optimum values for all the training sequences by solving the quadratic optimization problem using step 1 of the algorithm for general SVM.

m

kk,m

)Σ)(mk

(

ββ1−

=mkβ ,

iα

Step 7: Compute the bias b using step 2 of the algorithm for general SVM.

Step 8: For each of the test and validation sequences compute the SVM decision function based on the computed parameters using step3 of the algorithm for general SVM. The value indicates whether the sequence contains the motif or not.

Step 9: Compute the performance measures.

Performance EvaluationReceiver operating Characteristic Curve (ROC Curve)• The ROC is represented by plotting the true positive rate (TPR) vs. false

positive rate (FPR).• Area under the ROC curves (auROC):• The area under Receiver Operating Curve is calculated by adding

trapezoids. auROC quantifies the quality of the classifier and value >0.5 indicates better performance.

• Area under the PRC (auPRC):• PRC point = true positives / all true labels, true positives / (true positives +

false positives)• Cross Correlation coefficient (CC)• CC = (true positives * true negatives - false positives * false negatives) / sqrt

( (true positives + false positives) * (true positives + false negatives) * (true negatives + false positives) *(true negatives + false negatives) )

• F-Measure and Accuracy• F-measure = 2 / (1 / precision + 1 / recall)• Accuracy = (true positives + true negatives) / all labels

Implementation• The proposed kernel was first tested on various synthetic data sets.

As the results of these tests are promising, the kernel was applied to real data sets comprising of the C. Elegans and Drosophila Melanogaster acceptor sites.

• The SHOGUN tool box was used for implementation of the proposed kernel. The kernel was developed using c++. A suffix tree based data structure with a scoring scheme is used to speedup the SVM training and classification. – Synthetic Data.

• randomly generated strings with signals of different length implanted at random positions

– Real Dataset.• C.elegans acceptor sites, 262421 sequences containing 15507

true splice sites. (5.9%)• Drosophila acceptor sites, 98367 sequences containing 1583

true splice sites. (1.6%)

Result

Synthetic Data Synthetic Data (5 Mut)

Synthetic Data (5Mut)

Result

Drosophila C.Elegans

Result

conclusion• The WDSM kernel allows mismatches to be used along with shift

while computing the kernel from the given DNA sequences.

• The mismatch and shift values together resulted in better representation of the motifs as well as the positions within thesequence.

• This has resulted in higher accuracy in classifying the DNA sequences for predicting the motifs.

• The method has ability to detect motifs even with a small size of training samples.

• A suffix tree based data structure with a scoring scheme is used to speedup the SVM training and classification.

Prediction of Motifs with Unknown Length and Allowable Mismatches Using Kernel

Based Approach

Related Works

• Spectrum Kernel [Leslie, C., Kuang, R., & Eskin, E. (2003)]

• Especially tuned to small alphabets (like the 2-bit DNA alphabet).– Defined as the inner product Where, Øk maps a sequence x that consists of

characters in ∑ to a feature space of size |∑|k– Let #u(x) is the number of occurrence of u in x,

k be the length of pattern u, ∑ be an alphabetThen Øk(x) can be defined as

)'().()',( xxxxK kk ΦΦ=

∑= kε u

))((#)( xuxkφ

Related Works

• Weighted Degree kernel without shifts (WDKWS) introduced by Sonnenburg et al., 2006, uses a weighted sum of spectrum kernels for all degrees at every position of the sequence.

• The difference between the WDKWS and the spectrum kernel are the position dependence and the consideration of K-mers vs. 1‚…‚K-mers, i.e. i.e. also considering subsequences of the K-mers during computation.

Related Works

• Weighted Degree kernel with shift and mismatch (WDKSM)– As already presented, aims at improving the

representations of the weighted degree kernel which allows mismatches.

– Has resulted in higher accuracy in classifying the DNA sequences for predicting the motifs.

– Ability to detect motifs even with small sized training samples.

Limitations• Spectrum Kernels

– Cannot extract any positional information from the sequence which goes beyond the k-mer length.

– Well suited for describing the content but not for analyzing signals where motifs appear in a certain order.

• Weighted Degree kernel without shifts– When using long K-mers, the memory demand becomes

intractable and the sparse weight vectors need to be stored and operated with more efficiently. Hence it is difficult to implement this kernel for practical purpose.

Limitations• Weighted Degree kernel with shift and mismatch

(WDKSM)– This works fine for classification and finding motifs when

the length of the motif and the number of allowable mismatches are known.

• Most motif discovery algorithms for DNA sequences require the motif's length as input.– Styczynski et al. introduced the Extended (l,d)-Motif

Problem (EMP) where the motif's length is not an input parameter.

– Their algorithm takes an unacceptably long time to run.

Motivation for new Kernel• Feature Space

– The size of the feature map Ø for string kernels for any l length pattern over alphabet A is of dimension l |A|.

– Feature space size increases exponentially with the increase in the pattern length. So difficult for higher length patterns.

– Most of the dimensions of the feature space are sparse.– The feature space can be reduced to the total number of sub sequences

occurring in the training set.

• Behavior of Spectrum Kernels– the output no longer depends on the characteristics of the input such as

from which organism the sequence has been obtained, the length of the sequence, the distribution of nucleotides etc.

– e.g. the kernel computed from “AAAA”, “CCCC” or “AAAA”, “TTTT” or “GGGG”,”CCCC” are same.

Motivation for new Kernel– Local and global requirements of motif prediction

• The string kernels rely on the indicator functions and pair wise string comparisons for computation of the kernel value.

• Values should be more significant for related or similar sequences as compared to unrelated or dissimilar ones.

• Not observed in the existing string kernels as the sequence comparisons are made starting from single nucleotides to a certain length, allowing shifts and mismatches

• We have proposed a kernel which provides meaningful scores depending only on the motifs it contain

– The kernel abstraction can be utilized for designing a generalized kernel for detection of unknown motifs without depending on the training information of the organisms or other domain specific information.

Motivation for new Kernel

• Design a new spectrum kernel to possibly cater to the above observations by reducing search space, generating meaningful values which can be generalized.

Proposed Weighted Spectrum Kernel

– Let S be a set of n sequences each of length l from alphabet ∑ for which we intend to determine the motifs with unknown length l and known mutations d.

– Let x and x’ be pair of sequences from S, ui be any k-mer of length k occurring in x at position i and ui (x,x’) be the number of occurrences of ui in x and x’.

– Let ui(S-{x,x’}) be the number of occurrences of ui in the sequences of the training set S-{x,x’} , Bi be the weight attached to each position i.

– The weight coefficient Bi = 2(T- ui +1)/T(T+1), where T is the total number of all possible length k k-mers occurring in the training set S-{x,x’}.

Proposed Weighted Spectrum Kernel

– Then the proposed weighted degree spectrum kernel is defined as:

– This kernel during computation of the spectrum values besides considering the number of times the pattern occurring in the pair of sequences being compared, also considers the number of times the k-mer occurs in the other sequences of the training set.

})',{(),()',( '1

1xxSuxxuiBxxK ii

kl

i−∑=

+−

=

Implementation• The proposed kernel was first tested on various synthetic data sets. As the

results of these tests are promising, the kernel was applied to real data sets comprising of the C. Elegans and Drosophila Melanogaster acceptor sites and also the transcription start sites (TSS) of human genome hg16.

• The kernel was developed using c++. The SHOGUN tool box was used for implementation of the proposed kernel for the quadratic optimization.

• Synthetic Data.– randomly generated strings with signals of different length

implanted at random positions.• Real Dataset.

– C.elegans acceptor sites, 262421 sequences containing 15507 truesplice sites. (5.9%)

– Drosophila acceptor sites, 98367 sequences containing 1583 true splice sites. (1.6%)

– Hg16 TSS of human genome with 14,18,392 sequences containing 14897 true sites (1.05%)

Result

Result

Synthetic data set 1

Kernel Accuracy AuROC

Fixed Degree 65.00% 0.833

Spectrum Kernel 38.00% 0.204

Weighted String Kernel 53.00% 0.688

Weighted Spectrum Kernel

70.00% 0.913

Synthetic data set 2






90.95% 0.803

Result

C.Elegans Acceptor data

Kernel Accuracy AuROCFixed Degree 93.10% 0.85

Spectrum Kernel 94.05% 0.57Weighted String Kernel 92.10% 0.63


94.10% 0.91

Result

Drosophila Acceptor data






98.39% 0.881

Result

Hg16 TSS data






98.94% 0.75

conclusions• The performance of the new kernel was found to be better

both in terms of the classification accuracy and auROC. • The kernel also outperformed the existing string kernels by

allowing for classification with motif length of more than 8.• The new kernel yielded better performance from the other

kernels with a considerable improvement of the auROC from 0.449 to 0.81.

• improved performance of the SVM classifier using the new kernel was also observed when the data sets containing acceptor sites of Drosophila, C.Elegans were used.

• the ability of the SVM to classify sequences with mutated motifs using the same training data.

• The accuracy of classification did not depend on the training motif, the mutations and to some extent on the length.

conclusion

• Cross classification of sequences for prediction of unknown motif.

• The SVM acceptor sites of drosophila was used to predict the C.Elegans acceptor sites and showed an accuracy of 94% similar to the result obtained when trained with its own samples.

ReferencesLeslie, C., Kuang, R.,and Eskin, E., “Inexact matching string kernels for protein classification”, Kernel Methods in Computational Biology,pp 95-112, MIT Press, 2003.

Ratsch, G., Sonnenburg, S., & Scholkopf, B. (2005). Rase: Recognition of alternatively spliced exons in c. elegans. ISMB 2005.

Ratsch, G., & Sonnenburg, S. (2004). Accurate splice site prediction for caenorhabditis elegans, 277-298. MIT Press series on Computational Molecular Biology. MIT Press.

Sonnenburg, S., Raetsch, G., Schaefer, C.,and Schoelkopf, B., “Large Scale Multiple Kernel Learning”, Journal of Machine Learning Research,7, pp1531-1565, 2006.

Styczynski, M.P., Jensen, K.L., Rigoutsos, I.,and Stephanopoulos, G.N., “An extension and novel solution to the (l,d)-motif challenge problem”, Genome Informatics, Vol. 15, pp.63–71, 2004.

THANK YOU

motif finding algorithms sudarsan padhy iiit...

Documents