Tolerancia a fallos en la capa de sistema basada en la arquitectura RADIC

Author

Castro León, Marcela

Director

Rexachs del Rosario, Dolores Isabel

Date of defense

2013-05-30

ISBN

9788449038402

Legal Deposit

B-22961-2013

Pages

118 p.



Department/Institute

Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius

Abstract

La demanda de major rendiment de les aplicacions científiques es satisfà incrementant la quantitat de components. No obstant això, un major nombre de components implica una major probabilitat de fallada. L'abrupta caiguda dels temps mitjans entre fallades en els sistemes actuals impulsa la investigació de mecanismes de tolerància a fallades per garantir l'execució d'una aplicació a un cost raonable. Message-Passing Interface (MPI), l'estàndard de programació més utilitzat per les aplicacions científiques, té un comportament fail-stop, realitzant una parada segura de tots els processos en cas de detectar una fallada en qualsevol dels nodes del clúster. Com a conseqüència, es perd l'execució que s'hagués fet en tots els nodes de processament. Els sistemes de còmput d'altes prestacions han anat implementant mecanismes per a garantir el servei, normalment basats en tècniques de rollback-recovery mitjançant l'ús de Checkpoint/Restart. Aquestes solucions s'han implementat a nivell d'aplicació, la qual cosa no és transparent, o bé a nivell de llibreria, la qual cosa no és generalitzable a altres llibreries i deixa fora del camp de solució un nombre divers d'aplicacions. Es proposa un sistema de tolerància a fallades transparent i automàtic per a l'aplicació paral·lela, de manera que pugui utilitzar-se sense modificar l'aplicació i amb la llibreria de pas de missatges que prefereixi l'usuari. Es basa en detectar els errors en les comunicacions de sockets causats per les fallades de nodes i reconfigurar-los de forma automàtica per a comunicar-se amb la nova adreça on es migra el procés. Funciona en conjunt amb un sistema que protegeix l'estat de còmput dels processos i, en cas de fallades, els recupera en un altre node de còmput mitjançant tècniques de rollback-recovery.
S'ha realitzat una validació experimental utilitzant aplicacions Master/Worker i Single Program Multiple Data (SPMD) amb comunicacions basades en sockets i en pas de missatges Message Passing Interface (MPI). Les execucions es van realitzar en un clúster multicore, obtenint els nivells desitjats de funcionalitat i prestacions.


La demanda de mayor rendimiento de las aplicaciones científicas se satisface incrementando la cantidad de componentes. Sin embargo, un mayor número de componentes implica una mayor probabilidad de fallo. La abrupta caída de los tiempos medios entre fallos en los sistemas actuales de altas prestaciones impulsa la investigación de mecanismos de tolerancia a fallos para garantizar la ejecución de una aplicación a un coste razonable. Message-Passing Interface (MPI), el estándar de programación más utilizado por las aplicaciones científicas, tiene un comportamiento fail-stop, realizando una parada segura de todos los procesos si se detecta un fallo en un nodo del clúster. Como consecuencia, se pierde la ejecución que se hubiera hecho en todos los nodos de procesamiento. Los sistemas de cómputo de altas prestaciones han implementado mecanismos para garantizar el servicio, normalmente basados en técnicas de rollback-recovery mediante el uso de Checkpoint/Restart. Estas soluciones se han implementado a nivel de aplicación, lo cual no es transparente, o bien a nivel de librería, lo cual no es generalizable a otras librerías y deja fuera del campo de solución a un número diverso de aplicaciones. Se propone un sistema de tolerancia a fallos transparente y automático, de modo que pueda utilizarse sin modificar la aplicación y con la librería de paso de mensajes que prefiera el usuario. Se basa en detectar los errores en las comunicaciones de socket causados por fallos de nodos y reconfigurarlos de forma automática para comunicarse con la nueva dirección a donde se migra el proceso. Funciona en conjunto con un sistema que protege el estado de cómputo de los procesos y, en caso de fallos, los recupera en otro nodo de cómputo por medio de técnicas de rollback-recovery.
Se ha realizado una validación experimental utilizando aplicaciones Master/Worker y Single Program Multiple Data (SPMD), con comunicaciones basadas en sockets y en paso de mensajes Message Passing Interface (MPI). Las ejecuciones se realizaron en un clúster multicore, obteniendo los niveles deseados de funcionalidad y de prestaciones.


The demand for higher performance in scientific applications is met by increasing the number of components. However, a larger number of components also implies a higher probability of failure. The sharp decrease in mean time between failures in current High Performance Computing systems encourages research into fault tolerance mechanisms, suitable for new architectures, that guarantee the execution of an application at a reasonable cost. Message Passing Interface (MPI), the programming standard most widely used by scientific applications, has fail-stop behavior: it performs a safe stop of all processes when a failure is detected in any node of the cluster. As a consequence, the execution completed on all processing nodes up to that moment is lost. High Performance Computing systems have implemented mechanisms to guarantee service, usually based on rollback-recovery techniques using Checkpoint/Restart. These solutions have been implemented either at the application level, which is not transparent, or at the library level, which cannot be generalized to other libraries and leaves out a number of applications. This thesis proposes a transparent and automatic fault tolerance system that can be used without modifying the application and with the message-passing library preferred by the user. It is based on detecting errors in socket communications caused by node failures and automatically reconfiguring the sockets to communicate with the new address to which the process is migrated. This method works together with a system that protects the computation state of the processes and, in case of failure, recovers them on another compute node using rollback-recovery techniques.
An experimental validation was carried out using Master/Worker and Single Program Multiple Data (SPMD) applications, with communications based on sockets and on Message Passing Interface (MPI). The executions were run on a multicore cluster and achieved the desired levels of functionality and performance.
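
The abstract does not give implementation details, and the thesis works at the system layer, transparently to the application. As a rough illustration only, the following Python sketch shows the general idea of treating a socket error as a node failure and redirecting communication to the address where the peer was recovered. It is an application-level simplification, not the author's mechanism; the class `ReconnectingClient` and the `locate_peer` callback are hypothetical names introduced here.

```python
import socket
from typing import Callable, Optional, Tuple

Address = Tuple[str, int]


class ReconnectingClient:
    """Resend a request to a peer's new address when the connection fails.

    `locate_peer` is a hypothetical callback standing in for whatever
    service knows where a failed process has been restarted.
    """

    def __init__(self, address: Address, locate_peer: Callable[[], Address]):
        self.address = address
        self.locate_peer = locate_peer
        self.sock: Optional[socket.socket] = None

    def request(self, payload: bytes, retries: int = 3) -> bytes:
        """Send payload and return the reply, reconnecting on socket errors."""
        for _ in range(retries):
            try:
                if self.sock is None:
                    self.sock = socket.create_connection(self.address, timeout=2.0)
                self.sock.sendall(payload)
                reply = self.sock.recv(4096)
                if reply:
                    return reply
                # An empty read means the peer closed the connection.
                raise ConnectionError("peer closed the connection")
            except OSError:
                # Assumption: any socket error is treated as a node failure.
                # Drop the old socket and retarget the recovered process.
                if self.sock is not None:
                    self.sock.close()
                    self.sock = None
                self.address = self.locate_peer()
        raise ConnectionError("peer unreachable after retries")
```

In this sketch the caller sees a single `request` call succeed even if the original node is gone, which mirrors the transparency goal stated in the abstract. Restoring the state of the peer process from a checkpoint is outside what the sketch covers.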

Keywords

Radic; Tolerancia; Fallos

Subjects

68 - Industries, crafts and trades for finished or assembled articles

Knowledge Area

Technologies

Documents

mcl1de1.pdf

1.300Mb

 

Rights

WARNING. Access to the contents of this doctoral thesis and its use must respect the rights of the author. It may be used for consultation or personal study, as well as in research and teaching activities or materials, under the terms established in Art. 32 of the Consolidated Text of the Intellectual Property Law (RDL 1/1996). Any other use requires the prior and express authorization of the author. In any case, when using its contents, the full name of the author and the title of the doctoral thesis must be clearly indicated. Reproduction or other forms of exploitation for profit are not authorized, nor is its public communication from a site other than the TDX service. Presenting its content in a window or frame external to TDX (framing) is also not authorized. This reservation of rights applies to the contents of the thesis as well as to its abstracts and indexes.
