Project A5

Semantic and Statistical Linkability

Principal Investigators

Project Summary

Even cautious users of online platforms, who try to limit the information they reveal on any single platform, often underestimate how much can be learned about them by combining information from multiple platforms. In the worst case, a user’s different accounts, possibly even separate virtual identities, can be linked. This problem, linkability, is one of the biggest threats in modern digital habitats. Attackers (in the information security sense) can be NSA-style intelligence agencies, but also the advertising industry, which has a monetary interest in combining data from various sources.
In this project, we developed a general model of the attacker’s ability to identify and link a user’s profiles across different platforms using methods such as statistical language models. We focused in particular on user-generated content, which is mostly unstructured text. To infer relationships automatically, we used techniques from natural language processing. Although the original proposal planned an analysis of network traffic, we chose to extend the domain of interest.
We deviated from the proposal in that we also considered two other domains in the context of social networks that have gained popularity in recent years: hashtags and location information, e.g., when sharing one’s participation in a public event.
Assessing the likelihood of linkability alone is, of course, not sufficient, because users depend on the Internet in their day-to-day lives. We therefore evaluated possible protection mechanisms against linkability (and other privacy incursions). Generally speaking, these mechanisms perturb user profiles or user-generated content in order to strike a balance between hiding unique properties of a user’s profile and preserving the effect the user intended by sharing this information. In different domains, such as user-generated content, choice of hashtags, and provision of location data, we developed automated systems that sanitize user-generated content with the goal of minimizing linkability risks. To this end, we compared different approaches w.r.t. the trade-offs between unlinkability, performance, and utility for the user.
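As a rough illustration of the kind of language-model-based linking summarized above, the following sketch scores an unattributed text against per-profile unigram models and picks the best match. It is not the model developed in the project: the unigram model, the add-one smoothing, the whitespace tokenizer, and all function and profile names (`train_unigram`, `link`, "alice", "bob") are assumptions made purely for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately naive tokenizer (assumption): lowercase, split on whitespace.
    return text.lower().split()


def train_unigram(texts):
    # Count word occurrences over all texts attributed to one profile.
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def avg_log_likelihood(model, vocab_size, text):
    # Average per-token log-probability under an add-one smoothed unigram model.
    tokens = tokenize(text)
    total = sum(model.values())
    log_lik = sum(
        math.log((model[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    return log_lik / max(len(tokens), 1)


def link(unknown_text, profiles):
    # profiles: hypothetical mapping from profile name to its known texts.
    vocab = set(tokenize(unknown_text))
    for texts in profiles.values():
        for text in texts:
            vocab.update(tokenize(text))
    scores = {
        name: avg_log_likelihood(train_unigram(texts), len(vocab), unknown_text)
        for name, texts in profiles.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    profiles = {
        "alice": ["i love hiking in the alps every summer",
                  "hiking boots and alps maps"],
        "bob": ["football match tonight with friends",
                "great football goal tonight"],
    }
    best, scores = link("planning a hiking trip to the alps", profiles)
    print(best, scores)
```

A sanitizing mechanism of the kind described above would aim to lower the gap between the best and the remaining scores, for example by replacing words that are unusually characteristic of one profile, while keeping the text useful to its readers.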

Role Within the Collaborative Research Center