R/parallel Parallel Computing for R in non‐dedicated environments

dc.contributor
Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius
dc.contributor.author
Vera Rodríguez, Gonzalo
dc.date.accessioned
2013-09-18T16:31:54Z
dc.date.available
2013-09-18T16:31:54Z
dc.date.issued
2010-07-21
dc.identifier.isbn
9788449039065
dc.identifier.uri
http://hdl.handle.net/10803/121248
dc.description.abstract
Traditionally, parallel computing has been associated with special-purpose applications designed to run on complex computing clusters, specifically set up with a software stack of dedicated libraries together with advanced administration tools to manage complex IT infrastructures. These High Performance Computing (HPC) solutions, although the most efficient in terms of performance and scalability, impose technical and practical barriers on most scientists who, with limited IT knowledge, time and resources, cannot adopt classical HPC solutions without considerable effort. Moreover, two important technological advances are increasing the need for parallel computing. First, in bioinformatics, and similarly in other experimental science disciplines, new high-throughput screening devices generate huge amounts of data in a very short time, and these data must be analyzed in equally short periods to avoid delaying experimental analysis. Second, the design of new processor chips has changed: the current strategy for increasing raw performance is to increase the number of processing units per chip, so parallel applications are required to make use of the new processing capacity. In both cases, users may need to update their current sequential applications and computing resources to achieve the processing capacity their particular needs require. Since parallel computing is becoming a natural option for obtaining increased performance, and is required by new computer systems, solutions adapted to mainstream users should be developed to allow seamless adoption.
To enable the adoption of parallel computing, new methods and technologies are required to remove or mitigate the barriers and obstacles that currently prevent many users from evolving their sequential running environments. A particular scenario that especially suffers from these problems, and that is taken as the practical case in this work, is that of bioinformaticians analyzing molecular data with methods written in the R language. In many cases, with large datasets, they have to wait days or weeks for their data to be processed, or perform the cumbersome task of manually splitting their data, looking for available computers to run the subsets, and collecting the scattered results. Most of these applications written in R are based on parallel loops. A loop is called a parallel loop if there is no data dependency among its iterations; its iterations can therefore be processed in any order, or even simultaneously, which makes the loop a candidate for parallelization. Parallel loops are found in a large number of scientific applications. Previous contributions deal with partial aspects of the problems these users face, such as providing access to additional computing resources or enabling the coding of parallel problems, but none provides a complete solution that does not assume advanced users with access to traditional HPC platforms. Our contribution consists of the design and evaluation of methods that enable the easy parallelization of applications based on parallel loops written in R, using non-dedicated environments as the computing platform and targeting users without experience in parallel computing or system management. As a proof of concept, and to evaluate the feasibility of our proposal, an extension of R called R/parallel has been developed to test our ideas in real environments with real bioinformatics problems.
The results show that, even with little information about the running environment and a high degree of uncertainty about the quantity and quality of the available resources, it is possible to provide a software layer that enables users without previous knowledge or skills to adapt their applications with minimal effort and perform concurrent computations on the available computers. In addition to proving the feasibility of our proposal, a new self-scheduling scheme suitable for parallel loops in dynamic environments has been contributed; its results show that it is possible to obtain improved performance compared to previous contributions in best-effort environments. The main conclusion is that, even with limited information about the environment and the technologies involved, it is possible to provide mechanisms that allow users with limited knowledge and time to conveniently take advantage of parallel computing technologies, thereby closing the gap between classical HPC solutions and mainstream users of common applications, in our case applications based on parallel loops in R.
eng
dc.format.extent
136 p.
dc.format.mimetype
application/pdf
dc.language.iso
eng
dc.publisher
Universitat Autònoma de Barcelona
dc.rights.license
NOTICE. Access to the contents of this doctoral thesis and its use must respect the rights of the author. It may be used for reference or private study, as well as in research and teaching activities or materials, under the terms established in Art. 32 of the Consolidated Text of the Intellectual Property Law (RDL 1/1996). Other uses require the prior and express authorization of the author. In any case, when using its contents, the full name of the author and the title of the doctoral thesis must be clearly indicated. Its reproduction or other forms of exploitation for profit are not authorized, nor is its public communication from a site other than the TDX service. Presenting its content in a window or frame external to TDX (framing) is also not authorized. This reservation of rights applies both to the contents of the thesis and to its abstracts and indexes.
dc.source
TDX (Tesis Doctorals en Xarxa)
dc.subject
Parallel loops
dc.subject
Opportunistic computing
dc.subject
Bioinformatics
dc.subject.other
Technologies
dc.title
R/parallel Parallel Computing for R in non‐dedicated environments
dc.type
info:eu-repo/semantics/doctoralThesis
dc.type
info:eu-repo/semantics/publishedVersion
dc.subject.udc
004
cat
dc.contributor.authoremail
gonzalo.vera.rodriguez@gmail.com
dc.contributor.director
Suppi Boldrito, Remo
dc.embargo.terms
none
dc.rights.accessLevel
info:eu-repo/semantics/openAccess
dc.identifier.dl
B-23719-2013


Documents

gvr1de1.pdf

1.995Mb PDF
