Toward a universal privacy and information-preserving framework for individual data exchange

Author

Ruiz, Nicolas

Director

Domingo-Ferrer, Josep

Muralidhar, Krishnamurthy

Date of defense

2019-09-25

Pages

140 p.



Department/Institute

Universitat Rovira i Virgili. Departament d'Enginyeria Informàtica i Matemàtiques

Abstract

Data on individual subjects, which are increasingly gathered and exchanged, provide a rich amount of information that can inform statistical and policy analysis in a meaningful way. However, due to the legal obligations surrounding such data, this wealth of information is often not fully exploited in order to protect the confidentiality of respondents. The issue is thus the following: how to ensure a sufficient level of data protection to meet releasers’ concerns in terms of legal and ethical requirements, while still offering users a reasonable level of information. This question has raised a range of concerns about the privacy/information trade-off and has driven a quest for best practices that are both useful to users and respectful of individuals’ privacy. Statistical disclosure control research has historically provided the analytical apparatus through which the privacy/information trade-off can be assessed and implemented. In recent years, the literature has burgeoned in many directions. In particular, techniques applicable to micro data offer a wide variety of tools to protect the confidentiality of respondents while maximizing the information content of the data released, for the benefit of society at large. Such diversity is undoubtedly useful but has several major drawbacks. In fact, there is currently a clear lack of agreement and clarity as to the appropriate choice of tools in a given context, and as a consequence, there is no comprehensive view (or at best an incomplete one) of the relative performance of the techniques available. The practical scope of current micro data protection methods is not fully exploited precisely because there is no overarching framework: each method generally carries its own analytical environment, underlying approach and definitions of privacy and information.
Moreover, the evaluation of utility and privacy for each method is metric- and data-dependent, meaning that comparing different methods and data sets is a daunting task. Against this backdrop, this thesis focuses on establishing common ground for individual data anonymization by developing a new, universal approach. Recent contributions to the literature indicate that permutation is the essential principle upon which individual data anonymization can be based. In this thesis, we demonstrate that this principle makes it possible to propose a universal analytical environment for data anonymization. The first contribution of this thesis takes an ex-post approach by proposing universal measures of disclosure risk and information loss that can be computed in a simple fashion and used to evaluate any anonymization method, independently of the context in which it operates. In particular, they exhibit distributional independence. These measures establish a common language for comparing different mechanisms, all with potentially varying parametrizations, applied to the same data set or to different data sets. The second contribution of this thesis takes an ex-ante view, developing a new approach to data anonymization. Bringing data anonymization closer to cryptography, it formulates a general cipher based on permutation keys which appears to be equivalent to a general form of rank swapping. Beyond all the existing methods that this cipher can universally reproduce, it also offers a new way to practice data anonymization based on the ex-ante exploration of different permutation structures. The subsequent study of the cipher’s properties additionally reveals new insights into the nature of anonymization at a general level. The final two contributions of this thesis apply the above results to two specific areas. The first area is longitudinal data anonymization.
Although the statistical disclosure control (SDC) literature offers a wide variety of tools suited to different contexts and data types, there have been very few attempts to deal with the challenges posed by longitudinal data. This thesis thus develops a general framework and associated metrics of disclosure risk and information loss, tailored to the specific challenges posed by longitudinal data anonymization. Notably, it builds on a permutation approach where the effect of time on time-variant attributes can be seen as an anonymization method that can be captured by temporal permutations. The second area considered is synthetic data. By challenging the information and privacy guarantees of synthetic data, it is shown that any synthetic data set can always be expressed as a permutation of the original data, in a way similar to non-synthetic SDC techniques. In fact, releasing synthetic data sets with the same privacy properties but with an improved level of information appears to be invariably possible, as the marginal distributions can always be preserved without increasing risk. On the privacy front, it follows that the distinction drawn in the literature between non-synthetic and synthetic data is not so clear-cut. Indeed, it is shown that the practice of releasing several synthetic data sets for a single original data set entails privacy issues that do not arise in non-synthetic anonymization.
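
As an illustration of the permutation principle described in the abstract, the following minimal Python sketch (not taken from the thesis) re-expresses a masked attribute as a permutation of the original values by matching ranks. The function name `reverse_map`, the sample values and the use of NumPy are assumptions made for illustration only; the thesis may define its procedure and measures differently.

```python
import numpy as np


def reverse_map(original, anonymized):
    """Return the permutation of `original` that has the same rank order as `anonymized`.

    Hypothetical helper for illustration; ties are broken by position.
    """
    original = np.asarray(original, dtype=float)
    anonymized = np.asarray(anonymized, dtype=float)
    # Rank of each anonymized value (0 = smallest).
    ranks = np.argsort(np.argsort(anonymized, kind="stable"), kind="stable")
    # Place the original value of matching rank at each position.
    return np.sort(original)[ranks]


# Hypothetical single attribute and a masked release of it.
original = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
anonymized = np.array([23.0, 12.0, 31.0, 55.0, 38.0])

permuted = reverse_map(original, anonymized)

# How far each record's rank moved: one simple, assumed way to summarize
# the permutation that the masking induced.
original_ranks = np.argsort(np.argsort(original, kind="stable"), kind="stable")
permuted_ranks = np.argsort(np.argsort(permuted, kind="stable"), kind="stable")
rank_shift = np.abs(permuted_ranks - original_ranks)
```

Here `permuted` contains exactly the original values, reordered so that their ranks match those of the masked release; the masked data then differ from `permuted` only by a residual that leaves ranks unchanged.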

Keywords

dades individuals; privadesa; informació; datos individuales; privacidad; información; individual data; privacy; information

Subjects

004 - Computer science and technology. Computing. Data processing; 51 - Mathematics; 512 - Algebra; 517 - Analysis

Knowledge Area

Sciences

Documents

TESI.pdf

3.393Mb

 

Rights

WARNING. Access to the contents of this doctoral thesis and its use must respect the rights of the author. It may be used for reference or private study, as well as in research and teaching activities or materials, under the terms established in Art. 32 of the Revised Text of the Intellectual Property Law (RDL 1/1996). Any other use requires the prior and express authorization of the author. In any case, when using its contents, the full name of the author and the title of the doctoral thesis must be clearly indicated. Its reproduction or other forms of exploitation for profit are not authorized, nor is its public communication from a site other than the TDX service. Presenting its content in a window or frame external to TDX (framing) is also not authorized. This reservation of rights applies both to the contents of the thesis and to its abstracts and indexes.
