Understanding Privacy

Project Group A: Understanding Privacy

Projects in this group focus on different sources of personal data, and address different technology areas required as building blocks to comprehensively understand user privacy.

We will develop methods for identifying privacy-relevant information in digital user habitats, in particular from natural language utterances in online forums, and from visual data.

We will develop methods for predicting how information may be disseminated, through (unintended or malicious) software leakage, and through information spreading in online social networks.

We will analyze how natural language utterances and visual data may enable an adversary to link a user's online accounts across sites, and we will build on these methods to predict how an adversary may compromise a user's privacy by combining all of these sources of information.

We will begin investigating suitable user interaction paradigms that present the outcomes of particular threat analyses in an understandable way, giving users better control over their privacy.

Naturally, the projects will concentrate on adapting and advancing state-of-the-art methods to meet the challenges imposed by the highly heterogeneous, unstructured, and dynamic nature of the domain.

We need to extract information from large-scale heterogeneous data sources; to advance image and software analysis methods; to pioneer the linkage of unstructured user records consisting mostly of natural language text; and to advance methods enabling convenient privacy specification by lay users.

Many of our methods will devise computer-processable models – of user data, of a user’s current exposure to the extent that can be determined, of inferences an adversary could make, of hypothetical near-term future events – and advance associated analysis and simulation techniques for assessing and predicting privacy threats.

A1: Personal Information for Privacy Awareness

Jens Dittrich & Gerhard Weikum

Privacy is often at stake because of the accumulation of sensitive personal information that a user conveys over an extended time period on a variety of platforms (social networks, discussion forums, review sites, etc.). An additional risk is that information conveyed only to friends may transitively be made visible to a broader community. Finally, data that a user provides to commercial or public services (for shopping, registering for events, etc.) sometimes becomes widely visible although the user assumed it would be kept confidential. Possible reasons include software bugs, careless system administration, or the service going out of business or being acquired, so that policies are neglected or changed.
Even if each individual leakage of this kind is relatively harmless, the major risk is that someone could compile all this information and draw conclusions about the user from its entirety.
For example, a job recruiter or an insurance company could decline someone's application based on the user's digital traces collected over years, much of which the user has long forgotten. It is impossible to completely prevent such situations, as the criticality builds up over an extended time period. What users need, however, are tools and guidance to determine what information is visible beyond its originally intended scope and to assess the potential privacy risk.
The goal of this project was to develop models, methods, and scalable tools for this very purpose: improving the user's awareness of long-term traces in the digital world and supporting her in understanding the potential criticality of her disclosed personal information. This entails a number of sub-goals: (1) Find and retrieve all personal information that the user has disclosed on the Internet over an extended time period. (2) Determine which piece of information was visible to whom. (3) Analyze the provenance of how each piece became visible (e.g., by other users copying, citing, or forwarding some data). (4) Continuously monitor the user's actions that leave digital traces (such as posts, but also clicking on "like" buttons, rating other users' posts or products, connecting with new friends, and so on) and match those actions in real time against a database of past actions.
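Sub-goal (4) can be illustrated with a minimal sketch: a store of past actions indexed by topic, and a check whether a new action pushes the publicly visible trace on that topic past a warning threshold. The topic-based exposure counter and the threshold are illustrative assumptions, not the project's actual system.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class ActionLog:
    """Hypothetical store of a user's past actions, indexed by topic."""
    by_topic: dict = field(default_factory=lambda: defaultdict(list))

    def record(self, topic, action, visibility):
        self.by_topic[topic].append((action, visibility))

    def exposure(self, topic):
        # Count how many publicly visible past actions already touch this topic.
        return sum(1 for _, vis in self.by_topic[topic] if vis == "public")

def check_new_action(log, topic, action, visibility, threshold=3):
    """Record a new action and warn if the accumulated public trace on
    this topic reaches the (assumed) criticality threshold."""
    log.record(topic, action, visibility)
    return log.exposure(topic) >= threshold
```

In a real system the matching would run against a large database and a far richer model of criticality; the sketch only shows the real-time accumulate-and-check pattern.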

A2: Privacy Implications of Visual Data Dissemination

Mario Fritz & Bernt Schiele

Many people share and disseminate massive amounts of visual data (images, videos), be it on webpages, in social networks, or through personal communication. Even though it is obvious that visual data contains privacy-relevant information, it is unclear which privacy implications the dissemination of visual data has for the individuals sharing such information and for others who can be associated with it.
This project investigated methods that extract such privacy-relevant information from visual data in order to better understand the implications of releasing it. The investigations were structured into four parts. The first focus was on what type of information can be extracted from such data sources in terms of activities, interactions, and social roles. Second, the linkability of persons across different data recordings was investigated in order to understand how the aggregation of large sources of visual data affects privacy. Third, connections to social networks were established in order to understand the differing quality and complementarity of information that can be extracted from visual sources.
Fourth, we modelled the implications of releasing additional visual data under different attack scenarios and countermeasures in order to evaluate the options users might consider. Throughout the project, we were tightly interlinked with other CRC projects, providing the results of our visual analysis in order to arrive at a more holistic picture of the data emitted by users in the context of social networks, and evaluating different threat scenarios for the user. We approached the associated challenges by researching computer vision and probabilistic inference methods to extract and infer privacy-relevant information from disseminated visual data. Possible countermeasures were also explored, such as blurring the faces of individuals, along with how much (or how little) effect this has on the information that can be inferred.
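The blurring countermeasure can be sketched in its simplest form: averaging pixels inside a rectangular region, standing in for a detected face. Real pipelines would operate on face boxes produced by a detector and typically use Gaussian rather than box blur; the grayscale image representation and parameters below are illustrative assumptions.

```python
def box_blur_region(img, box, k=1):
    """Apply a simple box blur to a rectangular region of a grayscale image.

    img: 2D list of pixel intensities (rows of ints 0-255)
    box: (x0, y0, x1, y1) region to anonymize, e.g. a detected face
    k:   blur radius; each pixel becomes the mean of its (2k+1)^2 neighborhood
    """
    x0, y0, x1, y1 = box
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]  # leave the input image untouched
    for y in range(y0, y1):
        for x in range(x0, x1):
            vals = [img[j][i]
                    for j in range(max(0, y - k), min(h, y + k + 1))
                    for i in range(max(0, x - k), min(w, x + k + 1))]
            out[y][x] = sum(vals) // len(vals)
    return out
```

As the project's findings suggest, such blurring destroys fine detail but leaves coarse cues (silhouette, context) intact, which is why its protective effect must be evaluated rather than assumed.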

A3: Analysis of Software Privacy Leakage

Lionel Briand & Andreas Zeller

As Web and mobile applications gain access to more and more sensitive data, such as photos, location, contacts, or sites visited, there is a need to analyze and understand how these applications treat such data: whether they access sensitive data, how they process it, and how and where they propagate it. In this project, we developed a suite of novel software engineering techniques to analyze privacy leakage from existing software, resulting in privacy patterns that abstract and summarize how applications access, process, and propagate sensitive data. Our patterns express abstractions over sources and sinks of sensitive data, that is, where information comes from and where it goes; this information is automatically extracted from existing software. We considered two domains: mobile applications, focusing on local, in-device privacy leakage, and Web applications, focusing on distributed, cross-device privacy leakage.
Our techniques rely on a multi-disciplinary approach, which combines program analysis, constraint solving, and meta-heuristics to tackle the challenges posed by industrial code bases.
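The source/sink abstraction at the core of such analyses can be sketched as a small taint-propagation fixed point over a dataflow graph. The edge list, source names, and sink names below are invented for illustration; the project's actual analyses operate on real program representations.

```python
def find_leaks(flows, sources, sinks):
    """Propagate taint from privacy-sensitive sources through a dataflow
    graph and report which sinks receive tainted data.

    flows:   list of (src_var, dst_var) edges, e.g. assignments or calls
    sources: variables holding sensitive data (e.g. location, contacts)
    sinks:   variables representing outputs (e.g. network, log)
    """
    tainted = set(sources)
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for a, b in flows:
            if a in tainted and b not in tainted:
                tainted.add(b)
                changed = True
    return sorted(tainted & set(sinks))
```

A reported pair of reachable source and sink corresponds to one instance of the "access, process, propagate" pattern described above.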

A4: Privacy Threats in Social Networks

Krishna Gummadi & Manuel Gomez Rodriguez

Any information posted by a user on social networking sites like Twitter and Facebook can spread or be exploited in multiple ways. First, friends of the user who see the information can spread it to others, who in turn can spread it to more users. Such spreading can easily bring information to recipients it was not originally intended for, and can even set off large-scale social contagions or cascades of information both within and across sites. Second, many sites offer mechanisms that can tremendously increase the exposure of information, such as proactively recommending it to other users or allowing third-party advertisers to target users based on their personal data.
Unfortunately, users sharing information on social networking sites today lack a good understanding of how the different ways in which information spreads increase the exposure of their information. Not knowing which other users or advertisers will get to see an individual's personal information, or when and how, leaves the individual powerless to control the exposure of her own information as well as her exposure to third-party ads, resulting in serious privacy loss. This problem manifests itself across a broad range of scenarios, from small-scale undesired spreading within a user's friend circles (e.g., the wrong picture reaching the classroom bully) all the way to worst-case violations, recently highlighted by popular media, where ads related to job or housing opportunities are predominantly shown to users belonging to certain demographic groups, resulting in a discriminatory use of personal information.
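Friend-to-friend spreading of this kind is commonly formalized with the independent cascade model, in which each newly informed user gets one chance to pass the item to each neighbor with some probability. The sketch below is a generic illustration of that model, not the project's prediction method; the graph and probability are assumptions.

```python
import random

def independent_cascade(graph, seeds, p=0.2, rng=None):
    """Simulate one run of the independent cascade model.

    graph: adjacency dict, user -> list of friends who see her posts
    seeds: users who initially share the item
    p:     probability that an informed user forwards to a given neighbor
    Returns the set of users the item reached."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    reached = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in reached and rng.random() < p:
                    reached.add(v)
                    nxt.append(v)
        frontier = nxt
    return reached
```

Averaging the reached set over many runs gives an estimate of how widely, and to whom, a posted item is likely to spread.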
The central question we originally emphasized was: To what extent can we predict and control how widely, how quickly, and to whom a piece of private information will spread? Since the proposal, however, the threat posed by third-party advertisers exploiting users' personal data on social networks, whether to discriminatorily deny certain social groups exposure to opportunities or to spread misinformation and stoke societal divisions, has grown to such prominence that we decided to investigate our original objectives in the context of online targeted advertising.

A5: Semantic and Statistical Linkability

Michael Backes & Gerhard Weikum

Even cautious users of online platforms, those who try to limit the information they reveal on any single platform, often underestimate how much can be learned about them by combining information from multiple platforms. In the worst case, their different accounts, or even virtual identities, can be linked. This problem, linkability, is one of the biggest threats in modern digital habitats. Attackers (in the information security sense) range from NSA-style intelligence agencies to the advertising industry, which has a monetary interest in combining various data.
In this project, we developed a general model of an attacker's ability to identify and link a user's profiles across different platforms using methods such as statistical language models. We focused in particular on user-generated content, which is mostly unstructured text, and used techniques from natural language processing to automatically infer relationships. While the original proposal planned an analysis of network traffic, we chose to extend the domain of interest instead.
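A much simplified form of such a linkage attack compares unigram term-frequency profiles of accounts by cosine similarity and takes the best-scoring candidate as the linkage hypothesis. The posts and account names below are invented; real attacks use far richer language models.

```python
import math
from collections import Counter

def profile_vector(posts):
    """Unigram term-frequency profile of a user's posts."""
    return Counter(w.lower() for text in posts for w in text.split())

def cosine(p, q):
    """Cosine similarity between two term-frequency profiles."""
    dot = sum(p[w] * q[w] for w in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def link_account(unknown_posts, candidates):
    """Rank candidate profiles from another platform by similarity to the
    unknown account; the top match is the linkage hypothesis."""
    u = profile_vector(unknown_posts)
    return max(candidates, key=lambda name: cosine(u, profile_vector(candidates[name])))
```

The example shows why even innocuous stylistic regularities, not just explicit identifiers, make accounts linkable.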
We deviated from the proposal in that we also considered two other domains in the context of social networks that have gained popularity in recent years: hashtags and location information, e.g., when sharing participation in a public event. Assessing the likelihood of linkability alone is, of course, not sufficient, since users depend on the Internet in their day-to-day lives. We therefore evaluated possible protection mechanisms against linkability (and other privacy incursions). Generally speaking, these mechanisms perturb user profiles or user-generated content in order to strike a balance between hiding the unique properties of a user's profile and preserving the effect the user intended by sharing the information. For the different domains of user-generated content, choice of hashtags, and provision of location data, we developed automated systems that sanitize user-generated content with the goal of minimizing linkability risks, and compared the approaches with respect to the trade-offs between unlinkability, performance, and utility for the user.
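One of the simplest conceivable sanitization mechanisms replaces rare, and therefore potentially identifying, words with a placeholder, trading utility for unlinkability. The frequency table, threshold, and placeholder token below are illustrative assumptions, not the project's deployed systems.

```python
def sanitize(posts, public_freq, min_freq=5, placeholder="[REDACTED]"):
    """Replace rare, potentially identifying words with a placeholder.

    posts:       list of text strings to sanitize
    public_freq: hypothetical corpus-wide word frequencies; words seen
                 fewer than min_freq times are treated as identifying
    """
    out = []
    for text in posts:
        out.append(" ".join(
            w if public_freq.get(w.lower(), 0) >= min_freq else placeholder
            for w in text.split()))
    return out
```

Raising `min_freq` increases unlinkability but degrades utility, which is exactly the trade-off the project's evaluation quantifies.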

A6: Understandable Privacy Specification

Antonio Krüger

It is well known that, in current digital user habitats, and especially in online social networks, users experience severe difficulties in understanding and specifying their privacy settings. Two key reasons for this are: (1) the possible settings are typically technically motivated and do not accurately reflect users' actual priorities; (2) it is typically very complex for users to understand the consequences of their settings, and thus to configure them to suit their needs. As a result, users often stick to the default settings even though these do not necessarily match their personal preferences. Within the project, we developed a combination of techniques suitable to address these difficulties for some exemplary domains. First, (1) was addressed through advancing user-oriented models, and through a novel combination of two different kinds of such models: we modelled the user's personality and privacy attitudes, capturing the user's needs and preferences in a generic manner that can be adapted to all of the aforementioned domains and enabling the system to explicitly refer to and reason about the user's individual view. Furthermore, we developed a new user interface that visualizes the privacy rules and allows users to (a) easily get an overview of the current privacy state, (b) detect possible privacy leaks or misconfigurations, and (c) review, adapt, and fix the privacy settings. These components approach privacy from different perspectives (user vs. system) and mutually benefit from each other through a user feedback cycle, including the possibility for users to conveniently give in-situ feedback while witnessing unwanted consequences of information sharing, based on wearable computing technology that makes them aware of the final audience while posting.
The overall goal was to put users in a position to effortlessly understand their exposure and to help them assess, in quantitative and qualitative terms, the privacy consequences of their actions. This was ensured through a user-centered design process, including extensive lab user studies. While our basic methodology and ideas apply in principle to arbitrary privacy-relevant information, different types of such information naturally differ widely in the specifics, i.e., in the required model attributes, meaningful privacy visualisations, suitable forms of in-situ feedback, etc. Addressing all of these aspects is far beyond the scope of any single research project and would be more distracting than useful. We therefore focused on four domains where user privacy plays an important role, namely location sharing, social media posts, mobile app permission settings, and sensitive data captured in an intelligent retail store, which tracks actions such as customers' movements through the store as well as viewed or bought products. These are highly relevant use cases in their own right. Moreover, most of the concepts, and many of the technologies, we developed are general-purpose in principle, and it can be expected that our work will yield useful insights for other types of privacy-relevant information as well. The enforcement of privacy policies, i.e., allowing or disallowing information sharing according to the user's rules, was implemented prototypically on the social network platform developed within this project.