Detection of Transcription Factor Binding Sites by Means of Multivariate Signal Processing Techniques

dc.contributor
Universitat de Barcelona. Departament d'Electrònica
dc.contributor.author
Pairó Castiñeira, Erola
dc.date.accessioned
2016-01-20T07:53:27Z
dc.date.available
2016-01-20T07:53:27Z
dc.date.issued
2015-07-21
dc.identifier.uri
http://hdl.handle.net/10803/336663
dc.description.abstract
Gene expression is a complex and highly regulated process. Most of the regulation is controlled by short DNA sequences that are bound by proteins called transcription factors (TFs). By binding to these sites, transcription factors can start the transcription of mRNA, stop it, or control the amount of mRNA produced. The DNA binding sites of these transcription factors have some specific characteristics: (1) they are short sequences, (2) they can be located anywhere in the genome, and (3) they are degenerate, which means that some mutations in the binding site sequence do not alter its functionality. These characteristics make it impossible to look for a specific sequence in a specific region and create the need to model the binding sites in order to detect them. Due to the importance of gene expression in the study of cell differentiation and its implication in some genetic diseases, many computational models and experimental processes have appeared to model binding site motifs and then find them in a genome. The computational models can be divided into two main groups: motif discovery methods, which try to find binding sites within a set of co-regulated sequences without previous knowledge, and motif search methods, which use previously known sites to create a model and then try to locate binding sequences that fit this model. Most of the algorithms for binding site detection (both discovery and search) are based on position weight matrices (PWMs), which are matrices of the frequencies of each nucleotide at each position, and assume that positions are independent. Some others take interdependences into account, but they need many training sequences and long computation times. The focus of this thesis is to use the conversion from symbolic to numerical DNA and previous knowledge of binding site sequences in order to construct models for DNA motifs.
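As a rough illustration of the PWM approach described above, the following minimal sketch builds a frequency matrix from a handful of aligned sites and scores a candidate sequence with a log-odds sum. The example sites, the pseudocount, the uniform background, and the function names `build_pwm` and `log_odds_score` are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

ALPHABET = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Return a 4 x L matrix of nucleotide frequencies (rows: A, C, G, T)."""
    length = len(sites[0])
    # Start every cell at the pseudocount so no frequency is exactly zero.
    counts = np.full((4, length), pseudocount)
    for site in sites:
        for pos, base in enumerate(site):
            counts[ALPHABET.index(base), pos] += 1
    return counts / counts.sum(axis=0)

def log_odds_score(pwm, seq, background=0.25):
    """Sum per-position log-odds; each position is treated as independent."""
    return sum(
        np.log2(pwm[ALPHABET.index(base), pos] / background)
        for pos, base in enumerate(seq)
    )

# Made-up aligned binding sites, for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
pwm = build_pwm(sites)
```

A sequence resembling the training sites, such as `log_odds_score(pwm, "TATAAT")`, should score higher than an unrelated one such as `"GCGCGC"`. Because the score is a plain sum over positions, it cannot capture the interdependences between positions that the abstract goes on to discuss.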
In this context, known multivariate signal processing techniques can be ideal tools to construct models that take interdependences into account without needing a large number of sequences or a long computation time. To characterize the transcription factors, the TF-protein relationships were studied, showing that most transcription factors regulate the expression of 5-10 genes and, at the same time, most proteins are regulated by more than one TF. The study of interdependences between positions showed that more than 90% of the binding sites have significant interdependences, but that the percentage of interdependences is not enough to classify TFs according to structure. The conversion of DNA motif matrices into numerical matrices allows the use of Principal Component Analysis (PCA) to model the binding sites, which captures the information about the interdependences in the covariance, a second-order statistic. Under the hypothesis that binding sites will fit the PCA model better than genomic sequences, the Q-residuals can be used to detect binding sites within the genome. When compared to PWM, the Q-residuals detector performs at least as well, and the improvement in detection is significantly correlated with the percentage of positions with interdependences. The disadvantage of these PCA models is that they are difficult to interpret. Converting the symbolic DNA matrix into a numerical DNA cube allows the calculation of PARAFAC models, which are easier to interpret. Since PARAFAC models have unique solutions, their scores can be combined with the PARAFAC Q-residuals in order to construct a quadratic detector that also performs better than PSSM models. When the numerical detectors are compared to detectors that take interdependences into account, they perform better when there are not many sequences available, but they are more sensitive to the number of positions.
eng
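The PCA Q-residuals idea in the abstract can be sketched as follows, under stated assumptions: sequences are converted to numbers with a one-hot encoding (the abstract does not say which symbolic-to-numerical conversion the thesis uses), the PCA model is fitted by a singular value decomposition of the mean-centred training matrix, and the Q-residual is taken as the squared norm of what the retained components fail to reconstruct. The example sites and the names `one_hot`, `fit_pca`, and `q_residual` are hypothetical.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a flat 4*L vector (assumed encoding)."""
    vec = np.zeros((len(seq), 4))
    for pos, base in enumerate(seq):
        vec[pos, ALPHABET.index(base)] = 1.0
    return vec.ravel()

def fit_pca(sites, n_components):
    """Fit PCA on known sites; return the mean vector and the loadings."""
    X = np.array([one_hot(s) for s in sites])
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centred data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T

def q_residual(seq, mean, loadings):
    """Squared norm of the part of the sequence the PCA model cannot explain."""
    centered = one_hot(seq) - mean
    residual = centered - loadings @ (loadings.T @ centered)
    return float(residual @ residual)

# Made-up aligned binding sites, for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
mean, loadings = fit_pca(sites, n_components=2)
```

Under the hypothesis stated in the abstract, a sequence similar to the training sites should give a smaller Q-residual than an unrelated genomic sequence, so `q_residual("TATAAT", mean, loadings)` is expected to be lower than `q_residual("GCGCGC", mean, loadings)`. A detector would then flag positions whose Q-residual falls below a chosen threshold; how that threshold is set is not specified here.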
dc.format.extent
179 p.
cat
dc.format.mimetype
application/pdf
dc.language.iso
eng
cat
dc.publisher
Universitat de Barcelona
dc.rights.license
L'accés als continguts d'aquesta tesi queda condicionat a l'acceptació de les condicions d'ús establertes per la següent llicència Creative Commons: http://creativecommons.org/licenses/by/3.0/es/
dc.rights.uri
http://creativecommons.org/licenses/by/3.0/es/
*
dc.source
TDX (Tesis Doctorals en Xarxa)
dc.subject
Electrònica
cat
dc.subject
Electrónica
cat
dc.subject
Electronics
cat
dc.subject
Factors de transcripció
cat
dc.subject
Factores de transcripción
cat
dc.subject
Transcription factors
cat
dc.subject
Processament de senyals
cat
dc.subject
Proceso de señales
cat
dc.subject
Signal processing
cat
dc.subject.other
Ciències Experimentals i Matemàtiques
cat
dc.title
Detection of Transcription Factor Binding Sites by Means of Multivariate Signal Processing Techniques
cat
dc.type
info:eu-repo/semantics/doctoralThesis
dc.type
info:eu-repo/semantics/publishedVersion
dc.subject.udc
53
cat
dc.contributor.director
Marco Colás, Santiago
dc.contributor.director
Perera Lluna, Alexandre
dc.embargo.terms
cap
cat
dc.rights.accessLevel
info:eu-repo/semantics/openAccess


Documents

EPC_PhD_THESIS.pdf

2.576Mb PDF
