Impact factor for posts is a measurement of importance.

Impact factor for users reflect their authority, reputation and contribution on a particular topic.

Rating reflects the quality of posts.

Rating on Voofie is not a simple average of all ratings, but a weighted average of rating, weighted by the impact factor of users who rated.

Explore exciting communities of

Literature review on Motif Finding

22 Jan 10

Gene expression is an important topic in molecular biology. Gene expression is usually regulated by the use of different transcription factors. These transcription factors will recognize the specific signals on the DNA sequence and bind on these as to assist the binding of RNA polymerase onto the DNA molecule. Hence, these signals actually control the timing of expression of genes. These transcription factor binding sites have a special pattern, and we call these patterns motifs. Finding these motifs can help us understand the mechanism and control of gene expression. Yet, there exist many different kinds of transcription factors and they differ very much with each others. Hence their patterns are difficult to find and many computational methods have been developed, and their ability to discover a motif various.

Bookmark and Share

Content

Motif as a Gene expression regulator

Motif is an important gene expression regulator as expression of most genes are regulated by rate of transcription, and rate of transcription can be regulated by binding of RNA polymerase. The mechanism of transcription factors recognition and characteristics of transcription factor binding sites will be examined in this section.
Transcription factors contain domains that allow them to bind with high affinity and selectivity to short regions of DNA that have a particular base sequence. One Example is the homeodomain present in many organisms.

Fig 1 Cylinders represent the three α- helices of the homeodomain, ribbons represent the sugar-phosphate backbone of the DNA and bars symbolize the base pairs. The recognition of helix(3) is shown in red.

Fig 1 Cylinders represent the three α- helices of the homeodomain, ribbons represent the sugar-phosphate backbone of the DNA and bars symbolize the base pairs. The recognition of helix(3) is shown in red.[1]

The helix III fits into the major groove of DNA, with one face highly hydrophobic to pack against helices I and II. The other side of helix III is hydrophilic, with multiple positive charges.[1] Several amino acids in the hydrophilic face decide its recognition sequence. For example, exchanging the recognition helix in the bicoid(Bcd) homeodomain for that of Antennapedia(Antp) resulted in a protein with the DNA-binding specificity of ANTP and not that of Bicoid. Most interestingly, a Bcd protein with the DNA-binding specificity of Antp could also be obtained by exchanging only the ninth amino acid in the recognition helix, replacing the lysine residue in Bcd with the glutamine residue found in the Antp protein.[2]  So we can see different amino acids have different degree of importance in the DNA recognition process.

Analysis of motif

Besides the amino acid sequence, the DNA sequence present in promoter region also affects the strength of binding of transcription factor.

Core

Many homedomain proteins recognize DNA sites with a very similar sequence, often involving ATTA. For example, in the homedomains of the Drosophila proteins Fushi-tarazu, the protein can recognize both CCATTA and CAATTA.[1] We can see that in transcription factor binding sites, there are some highly conserved nucleotide positions and some other nucleotide positions that are less conserved. We call the region of highly conserved nucleotides the core of the transcription factor binding site. In an analysis of transcription factor binding site database TRANSFAC, it showed that most of the transcription factors stored in TRANSFAC contain a core length of 4 nucleotides.[3]

Length

The motifs in TRANSFAC have a mean length of 14.3 +/- 4.7nt with a minimum of 6nt and maximum of 32nt. [3]. In fact, a transcription factor may recognize several DNA sequence of different length. For example, D1 motif from fly contains binding site of different length, GGGTTTTTCCA and GGTTTTCCA, this make the finding of binding site by computer program much more difficult.

Representation of motif

There are two common methods to represent a motif, namely string representation and matrix representation.

String representation uses a length-l string of symbols(or nucleotides) A,C,G,T to describe a motif. In this model, the number of different length-l motifs is limited to 4l.[4] This representation is relatively simple and hence easy for us to develop motif finding algorithms base on this. However, some information about the motif may lose since the probability for a nucleotide to occur at a particular position may not be 100%. For example, for a particular position, 50% of occurrence is A, 30% of C, 20% of G but no T. Simply representing this position as A will be misleading. So a wildcard system is developed. For example, K denotes G or T.

Motif representation uses a 4*l matrix of real numbers to represent the motif where the j-th column of 4 numbers gives us the probability, respectively, that symbol A,C,G or T occupies the j-th position of the motif. Although the matrix is more expressive than the string, there are an infinite number of 4*l matrices of real numbers that form the solution space and no algorithm have been able to guarantee that the optimal motif can be found.

Computational models of motif and algorithms that search motifs

There are various algorithms that search motifs, based on string or motif representation.

Algorithms base on string representation

Voting algorithm [5] is one of such algorithms. This algorithm first includes the sequences containing the transcription factor binding sites as the input. After several improvements, it also includes sequences not containing binding sites as negative input set to further improve the accuracy.

In voting algorithm, a more formal definition of motif finding problem is developed, namely Planted (l,d)-motif problem.

Planted(l,d)-motif problem: Suppose there is a fixed but unknown nucleotide sequence M(the motif) of length l, Given t length-n nucleotide sequences, and each sequence contains a planted variant of M, we want to determine M without knowing the positions of the planted variants. A variant is a substring derivable from M with at most d point substitutions.

They define a length-l sequence (substring) s’ to be a d-variant of another length-l sequence s if the Hamming distance between s’ and s is at most d. Let N(s,d) be the set that contains all d-variants of a length-l sequence s. For each length-l substring σ in the input sequences gives one vote to all length-l sequences s in N(σ,d). If each length-l sequence s can get at most one vote from each input sequence, the motif M will get exactly t votes because of the assumption that each input sequence has exactly one planted variant of M.

The algorithm run in O(nt(3l)d) time, which is faster than the brute-force approach which takes O(nt4l) time. However, the time needed grows exponentially with d, thus, it cannot handle long motifs with large Hamming distance d because the space requirement increases exponentially with d. In order to improve time and space complexity, method of Random projection and heuristic improvements have been used.

The above algorithm assumes each sequence contains only one transcription factor binding site. It is because each length-l sequence gets at most one vote from each sequence. And motif is defined by the sequence that getting exactly t votes. The algorithm exVote[6] remove this restriction by allowing each length-l sequence getting more than one vote from each input sequence. Motif is defined by those length-l sequences getting more than k votes. The algorithm will output those motifs with expected number of occurrence, based on l, d and k, is higher than a threshold.

The above two algorithms only make use of sequence in the positive set, i.e. known to contain transcription factor binding site. In fact sequences known to containing no transcription factor binding site contain useful information also. A formal definition of such problem is given by Generalized planted(l,d) motif problem.[7]

Generalized planted(l,d)-motif problem: Suppose there is a fixed but unknown string M(the motif) of length l. Given t length-n sequence, each of which contains a planted d-variant of M, and f length-n sequences which contains no d-variant of M, we want to determine M without a priori knowledge of the positions of the planted d-variants.

The extra information provided by the negative set might make some of the previously unsolvable problems based on information in T, e.g. (9,3), (11,4), (15,6) and (20,8) motif problems solvable. An analysis of the expected number of (l,d) string occur in positive set but not negative set show that the expected number peak at a particular d for each l value.

In order to find the motif, each length-l substring in T and F gives one vote to its d-variant. The motif M would receive at least one vote from each sequence in T and no vote from any length-l substring in F. The algorithm will first identify a set of candidate motifs by voting from T when d is small and from F when d is large. The algorithm will then filter out the false candidate motifs by F or T accordingly.

Algorithm base on matrix representation

Many motif finding algorithms using matrix representation make use of a technique called expectation maximization. The problem of finding motif is equal to finding matrices that have a score higher than a threshold. We refine a seed bit by bit to improve its score until it cannot be further improve again or after a predefined number of iteration.

ALSE [8] is an algorithm based on expectation maximization, which also make use of positive and negative set of input. The algorithm translate the motif finding problem into: Given a set of length-n sequences T, a set of length-n sequences F and a 4*w probability maxtrix M, we can determine a threshold αM such that Nm=N(t(M, αM), f(M, αM),|T|,|F|) is minimum. Where t(M, αM) is the number of substring in the true input set which score is higher than αM. And score is defined by:

\f[Score(\sigma ,M) = \log \prod{M(\sigma [i],i)}\f]

Which multiply probability of occurring a particular nucleotide for all the positions in the matrix. The motif finding problem is to find a probability matrix M so as to minimize Nm.

The algorithm first find a set of probability matrices with low value of N(t,f,|T|,|F|) as seeds. The voting algorithm described before can be used to discover a set of motifs in string representation. Each length-w string representation motif σ is converted to a 4 * w probability matrix M such that for each column I, the value of M(σ[i],i) is 0.5 and the rest are 0.5/3. A fixed number of matrices, say 100 matrices with the lowest values of the Nm will be chosen as seeds.

Secondly, the EM-like iterative step to refine these seed matrices to search for the correct motif. Algorithm ALSE first calculates the probability of each substring in the true set being a binding site. Based on these probabilities, matrix M is refined to M’ such that those binding sites according to M and its corresponding threshold α will yield a higher score with respect to M’. Consequently, ALSE can also adjust threshold α’ so that t(M’, α’) increases at a faster rate or decreases at a slower rate than f(M’, α’), in other words, to obtain a lower value of N(t,f,|T|,|F|)

In all the above algorithms, the models assume all the binding sites have the same transcription factor binding strength. Hence they have the same expression level. Yet it is not the case in real biological systems. Energy-based Motif finding algorithm (EBMF) [9] was developed to make use of such information in motif finding. The problem of motif finding is now translate into: Given the length of binding sites l, an energy threshold α, a distance threshold dmin, t sequences S={si}in which each sequence si has a corresponding binding energy ei, we want to find a 4 * l energy matrix M to minimize the prediction error:

\f[\sum{E_{total}(s_{i},M}-e_{i})^{2}\f]

Where Etotal(si,M) is the expected binding energy between the transcription factor and sequence si given that at least one binding site in si is bound by the transcription factor.

EBMF first identify a set of candidate matrices based on the strings that occur frequently in the input sequences of strong signal. Like ALSE, voting algorithm is used to generate seed. After that, EM-like iteration which is based on solving a set of equations relating the expected binding energy total to the given binding energy. This gives a new energy matrix M’ that minimize the predication error.

Conclusion

First, motif finding programs do not give satisfy results. The discover rate of transcription factor binding sites is only around 20-30%, even after hours of running of the finding program. Even worse, common motif finding programs return 50 to 100 possible motif results, common users find it difficult to analyze which one is the actual one.
Second, motif finding programs require users to input specific motif length. This length is usually not known by users. Users may have to search for optimal length by trial and error. Furthermore, a transcription factor may bind to DNA sequence of different length. To my best knowledge, all motif finding problems consider transcription factor binding site of the same length and do not allow gap.

Finally, transcription factors may have some special pattern not considered by the algorithms. It is known that some motif is a palindrome, or having a particular nucleotide every 2 other nucleotides. If we include this information in to the motif finding algorithm and treat TFBS-like region with special care, we may be able to find more motifs. Effort should be spent in identifying pattern of binding sites of each type of transcription factors (leucine zipper, zinc fingers, etc).
 

Footnotes and Citations

  1. Robert J. White, (2001), Gene Transcription-Mechanism and Control, Blackwell Science. pp. 26.
  2. David S. Latchman, (1998), Eukaryotic Transcription Factors, 3rd ed, Academic Press. pp. 236.
  3. Fogel at al, (2005), "A Statistical analysis of the TRANSFAC database",Biosystems(81):137-154.
  4. Leung, H., Chin, F., Chan, B., (2005), "Finding Exact Optimal Motif in Matrix Representation by Partitioning.",Bioinformatics 21
  5. Chin, F and Leung, H, (2005), "Voting Algorithms for Disovering Long Motifs.",APBC:261-271.
  6. Leung, H and Chin, F, (October 2005), "An efficient Algorithm for the Extended (l,d)- Motif Problem With Unknown Number of Binding Sites",Proceedings of the IEEE 5th Symposium on Bioinformatics and Bioengineering:11-18.
  7. Leung, H and Chin, F, (October 2005), "Generalized Planted (l,d)-Motif Problem with Negative Set",Proceedings of the 5th Workshop on Algorithms in Bioinformatics:264-275.
  8. Leung, H and Chin, F, (2005), "Finding Motifs from All Sequence With and Without Binding Sites.",Bioinformatics
  9. Leung H, Chin F, Yiu SM, Rosenfeld R, Tsang WW, (2005), "Finding motifs with insufficient number of strong binding sites.",J Comput Biol.12(6):686-701.
0 Comments

Please login to post comment.

What is Voofie?

Voofie organizes knowledge, discovers useful resources and recognizes knowledgable users.

Bookmark your blog in Voofie to get more traffic as well as building a reputation in your field!

Explore more about it. Become a member—our FREE Registration takes just seconds.

Page Info
24Impacts
0/0 rates
1202
Your Rating:
Version: 1
Submitted: 22 Jan 10
Permalink