Limitations of Information Flows

In part 1 of my summary of the Private AI series, we covered lesson 2. That lesson was all about information flows and how fundamental they are to our society and to human collaboration. We also learned how information flows are often broken today because of the privacy-transparency trade-off.

Tip: To make the matter less abstract, you can replace "information flows" with your favorite example. Take democracy, scientific research, or communities working together to help the environment.
To improve information flows, we need to understand what exactly is not working today - not only in the general terms of privacy and transparency, but in detail. In lesson 3, we learn about three key technical problems that form the foundation of privacy and transparency issues:
  1. The copy problem
  2. The bundling problem
  3. The recursive enforcement problem

The Copy Problem

Note: The copy problem describes that you lose control over how someone uses your data once you share a copy.
Suppose I have a piece of information (e.g., a document or an MP3 file). If I make a copy of this and give it to you, I lose technical control over this copy. I have to trust that you don't use it against me, that you follow the laws governing my data, and that you don't share it with somebody else without my knowledge.

There are laws attempting to prevent people from misusing information, like HIPAA, the GDPR, or the CCPA. But they are really difficult to enforce.

That's why the copy problem is so important as a technical issue: no matter what the law says, it determines what people actually can do with a piece of information.

You might be tempted to say: uncontrolled copying of all information sounds terrible, let's stop this! But be careful. While the copy problem might hurt you sometimes, it also protects some of your most treasured freedoms. While anyone who stores your information can make copies of it, you can also copy anyone's data that you store. Any attempt to limit this ability could have a big impact on your life.

Example: Digital piracy - the sharing of copyrighted songs, movies, software - is a classic example of the copy problem. As soon as a digital copy of a file is sold to the first customer, this customer could share it with all other potential customers. There is no way for the copyright holder to control this.

In reaction to this, the entertainment industry developed DRM software. You can read about DRM in this comprehensive article.

Note: DRM stands for digital rights management. It’s a set of technologies to control the use, modification, and distribution of copyrighted works.
DRM software prevents your computer from playing files you didn't buy. It is controversial because it is a great potential threat to privacy and to your agency over your own devices: it lets central authorities control what you can and cannot do with your personal devices.

But the funding of art is also at stake: artists deserve compensation for the value they create!

An ideal solution would enforce copy limitations very selectively. Unfortunately, this is practically impossible: computers are machines that operate by making copies. Even a stream is a download - just without a save button - and you can still make copies of the content. To prevent data from being copied, you need incredibly invasive software.
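To see why, here is a minimal Python sketch (the URL is a placeholder, not a real endpoint) that saves a "stream" to disk. The bytes arrive on your machine either way; only the player's interface hides the save button.

```python
import requests  # third-party package: pip install requests

# Placeholder URL - stands in for any streaming media endpoint.
URL = "https://example.com/stream/song.mp3"

response = requests.get(URL, stream=True)
response.raise_for_status()

# The "stream" arrives as ordinary bytes, so nothing stops us from
# writing them to disk; the player simply chooses not to offer a save button.
with open("local_copy.mp3", "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)
```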

Example: Dropbox prevents you from sharing copyrighted material. They scan every file you upload to a shared folder to check whether it contains copyrighted material.
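Dropbox has not published exactly how this works; one common technique is to compare a fingerprint of each uploaded file against a blocklist supplied by rights holders. A rough sketch of that idea (the hash value and blocklist here are made up):

```python
import hashlib

# Hypothetical blocklist of fingerprints of known copyrighted files.
# (The real list, its format, and the matching method are not public.)
BLOCKED_HASHES = {
    "3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b",
}

def is_blocked(path: str) -> bool:
    """Fingerprint an uploaded file and check it against the blocklist."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return digest in BLOCKED_HASHES
```

Note how invasive even this simple scheme is: every single file you share has to be inspected in order to catch the few that infringe.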

The copy problem causes a privacy-transparency trade-off. Sometimes you might want to share data, but you have to weigh the benefits of sharing against the risks of misuse. A solution would radically change many industries by offering the best of both worlds: the benefits of sharing without the risks of misuse.

The Bundling Problem

Note: The bundling problem describes that it is often hard to share one intended piece of information without also revealing additional information that is needed to verify the intended piece.
Example: A bartender checks your ID to verify your age. But he does not only see your date of birth; he also sees your home address, your full name, where you were born, et cetera. In fact, it wouldn't even be necessary for him to see your full birth date. It does not matter whether you are 19 or 49, only whether you are over 18. But if you just carried around a card that said "Greater than 18" or "Yes", then how would the bartender verify that it's true?
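One direction real systems take is for a trusted issuer - say, the authority that issues the ID - to sign just the minimal claim, so the bartender can check a signature instead of reading your whole card. The toy Python sketch below (using the third-party cryptography package; all names are invented) ignores important details such as binding the claim to its holder and preventing replay; real designs use anonymous credentials or zero-knowledge proofs instead.

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# The ID authority (issuer) holds a signing key; bars know its public key.
issuer_key = Ed25519PrivateKey.generate()
verify_key = issuer_key.public_key()

# The issuer signs only the minimal claim - no name, address, or birth date.
claim = b"over_18=true"
signature = issuer_key.sign(claim)

# The bartender checks the signature without learning anything else.
verify_key.verify(signature, claim)  # raises InvalidSignature if forged
print("claim verified: holder is over 18")
```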

This problem is everywhere. More examples:

  • You share an image to prove something, but there are other things in that image, too
  • A news organization reports about protests. It shows videos of individual protesters, which could later be used against them
  • Researchers share sensitive medical data when all they needed were the patterns within this data

The Problem of Surveillance

Another example is home security systems. If you set up a video camera outside your front door, does it only record information about intruders? Of course not! It records every person that walks by, every car, every dog - absolutely everything, 24/7, 365 days a year. Your ability to watch the 0.01 percent of the footage that actually matters comes bundled with the need to also record the other 99.99 percent. And we can only hope that the 99.99 percent is not misused.

Almost all forms of surveillance suffer from this bundling problem. Rare events justify the collection of massive amounts of information, most of which is never supposed to be used for anything. Most people don't know how to build a surveillance system that only records the rare events it is intended to identify. But by the end of this course, you will learn how to do this.
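As a small taste of the direction - and emphatically not the solution the course will teach - here is a naive Python sketch that at least stores less: it keeps a frame only when something changes. The camera still sees everything, so the bundling problem itself is untouched; the threshold and the toy frames are invented.

```python
import numpy as np

def frame_changed(prev: np.ndarray, curr: np.ndarray, threshold: float = 10.0) -> bool:
    """Keep a frame only if it differs noticeably from the previous one."""
    return float(np.abs(curr.astype(np.int16) - prev.astype(np.int16)).mean()) > threshold

# Toy "camera feed": two identical dark frames, then a bright event frame.
frames = [np.zeros((480, 640), dtype=np.uint8),
          np.zeros((480, 640), dtype=np.uint8),
          np.full((480, 640), 200, dtype=np.uint8)]

stored = []
prev = frames[0]
for curr in frames[1:]:
    if frame_changed(prev, curr):
        stored.append(curr)  # only "interesting" frames are kept
    prev = curr

print(f"stored {len(stored)} of {len(frames) - 1} candidate frames")
```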

AI Governance

The bundling problem is also a topic in AI governance.

Note: AI governance is about evaluating and monitoring algorithms for effectiveness, risk, bias and ROI (Return On Investment) (Source: forbes.com)
Example: Courts use AI to support them in parole or sentencing decisions. Machine learning models might predict how likely it is that someone will violate parole. Often it is hard to audit these algorithms. Do they behave as advertised? Were they developed in a responsible manner? The companies that build these algorithms have two credible reasons not to disclose any details:
  1. How exactly the algorithm works might be valuable intellectual property
  2. If the details of the algorithm were public, it might be easy to fool

Artificial Bundling Problems

Sometimes information that could be unbundled isn't unbundled, because someone in a powerful position does not want it to be unbundled.

Examples:

  • You have to provide your email address to read an article
  • You want to use a free trial, but you have to enter your full details and credit card
  • You want to text with your friends, but you have to agree that a service scans all images and links you send

So, there are different forms of the bundling problem:

  • Artificial bundling problems that are forced upon you
  • Natural bundling problems

The boundary between them is increasingly grey. But in this course, you will learn how to tell the difference. And in many cases, how to avoid both.

The Recursive Enforcement Problem

Couldn't third-party oversight institutions solve a lot of the issues caused by the copy and bundling problems? Why not make undesirable uses of data illegal? While this sounds good in theory, enforcing such rules is much harder in practice.

Note: Recursive enforcement: when enforcing privacy regulations, we end up in a recursive loop. Each authority that supervises other entities must itself be supervised by an authority.
Example: Imagine a PhD student who uses medical records for his research. We worry that he might misuse the data. We could use a third-party authority to make sure nothing bad is happening. The PhD student's supervisor seems like a good fit. But how would the supervisor actually detect whether the student misused the data? As soon as the data is on the student's computer, he could do anything with it, for example share it. The supervisor is unlikely to find out.

The solution seems to be: the data must stay on the supervisor's machine, not on the student's computer. This might be a bit of an inconvenience, but now the supervisor can watch everything the student does with the data. But what about the supervisor? Now he is the one who can misuse the data! Who controls the supervisor? The university? And so on. We call this the recursive enforcement problem. It is also called the recursive oversight problem.
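To make this concrete, here is a toy Python sketch of such a gatekeeper: the raw records stay in one place and only aggregate answers leave. Everything in it is invented, and it deliberately shows why this only shifts the problem.

```python
# Invented example records - the raw data never leaves this process.
MEDICAL_RECORDS = [
    {"age": 34, "blood_pressure": 128},
    {"age": 51, "blood_pressure": 141},
    {"age": 47, "blood_pressure": 119},
]

def remote_mean(field: str) -> float:
    """Answer an aggregate question instead of releasing raw records."""
    values = [record[field] for record in MEDICAL_RECORDS]
    return sum(values) / len(values)

# The student only ever receives a single number...
print(remote_mean("blood_pressure"))
# ...but whoever runs this gatekeeper can still read MEDICAL_RECORDS in full,
# which is exactly where the recursion starts: who watches the gatekeeper?
```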

It's one of the most important problems we face. This is the core technical problem of data governance. If you have to put data onto someone's computer, then who makes sure that that someone doesn't misuse it?

Note: Data governance is the process of managing the availability, usability, integrity and security of the data in enterprise systems [...] Effective data governance ensures that data is consistent and trustworthy and doesn’t get misused. (Source: techtarget.com)
The problem of authorities needing their own authorities is also known in political science. It has been tackled through systems of decentralized governance: democracy, representative government, checks and balances.

This is much harder to do with data. How can multiple people have ownership over a data point that still has to live on a single machine?

There is a new class of technologies that allows this, and we will learn about it in the next part.

Conclusion

This lesson explored the three major technical problems that underlie the privacy-transparency trade-off: the copy problem, the bundling problem, and the recursive enforcement problem.

In the last two lessons, we learned a lot about the problems of today's information flows. In Part 3, we will begin to learn about solutions!

If you found a paragraph that needs improvement, please let me know in the comment section or on Twitter, I'm @daflowjoe. I'm also happy to hear from you if you found this summary helpful! :)