- The Impact of Structured Transparency
- Academic Research
- Industry, Research & Development
- News Media
- Machine Learning Startups
- Consumers Need Structured Transparency Institutions
- Intelligence Agencies, Statistical Services
- The Next Steps for Structured Transparency
Part 1 of my summary of the Private AI series was all about information flows and how they are fundamental to our society. We also learned about how information flows are often broken today because of the privacy-transparency trade-off.
In Part 2, we discussed which technical problems exactly are underlying the privacy-transparency trade-off.
In Part 3, we learned about solutions. Structured transparency lets us analyze privacy issues in a structured way. We learned about input and output privacy, two of the five guarantees of structured transparency.
In Part 4, we continued with structured transparency, covering input and output verification as well as flow governance.
In Part 5, we learn about the groups who will be affected by these new techniques. We have to understand their motivations and pain points to predict how they will respond. This will prepare you to take advantage of the opportunities that are arising.
The primary motivation of researchers is answering important, impactful questions. To answer these questions, researchers need access to quality information. This need for data makes collaboration between researchers important. But once a researcher shares a copy of their data, it becomes difficult to control what others do with it: the copy problem. That's why collaboration today is often very slow.
How does each component of structured transparency apply to researchers?
- Input Privacy: SMPC allows collaboration without sharing plain-text data
- Output Privacy: data owners could prevent reverse-engineering of the outputs of a computation, using Differential Privacy. Larger datasets can make it easier to ensure this.
- Input and output privacy will be enough for most research collaborations since there is trust between the actors.
- Input Verification: If trust was lacking, input verification techniques could be used. They would allow a data owner to prove specific attributes of the dataset to the researcher.
- Output Verification: might be required if there is competition between institutions. Could be used to prove that a statistical result was actually computed by the data owner using the computations requested by the researcher.
- Flow Governance: allows data owners to make extra sensitive data available for appropriate research. Distributing control across third parties like funding bodies or consortiums.
- More access to data will lead to better research results. This can increase impact in comparison with a result only from self-collected data.
- Grants tend to favor projects that use cutting-edge methods. Privacy-preserving tools will be cutting edge for quite a while.
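To make the input-privacy point above more concrete, here is a minimal sketch of additive secret sharing, one of the building blocks behind SMPC. It assumes three research groups that each hold a private number and want only the total. The modulus `PRIME` and the function names are illustrative choices, not part of the course material, and a real protocol would also need secure channels between the parties.

```python
import secrets
from typing import List

# Assumption: all arithmetic is done modulo a public prime that is
# larger than any sum we expect to compute.
PRIME = 2**61 - 1


def share(value: int, n_parties: int) -> List[int]:
    """Split `value` into n additive shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares: List[int]) -> int:
    """Recombine all shares; fewer than all of them reveal nothing."""
    return sum(shares) % PRIME


def secure_sum(private_values: List[int]) -> int:
    """Each party shares its value; each party adds the shares it receives."""
    n = len(private_values)
    # all_shares[i][j] is the share of party i's value that goes to party j.
    all_shares = [share(v, n) for v in private_values]
    # Party j only sees column j, never another party's plain-text value.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]
    return reconstruct(partial_sums)


if __name__ == "__main__":
    # Hypothetical patient counts held by three research groups.
    print(secure_sum([120, 75, 310]))  # 505
```

Only the partial sums are ever combined, so the total becomes known without any group handing over its own number.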
Researchers have already started adopting these tools and making them available for other researchers. One example for medical imaging applications: PriMIA.
Why is access to more data important? Because we want to create AI systems that people can trust. For this, it is important that we have representative and unbiased data available to train on. It should represent the entire spectrum of what you find in real patients. The algorithms should also be universally useful. They should be applicable to people of all genders and ethnic backgrounds.
Why can't we collect all the data in one big database?
Research and Development (R&D) refers to the research departments of companies. Their motivations overlap with those of academic researchers: they want to answer important questions. But there are two additional motivations:
- Intellectual Property. The relevant data or the ML model itself might be proprietary.
- Validation. An ML model in a healthcare setting needs certification to be approved.
It's very expensive to certify healthcare AI applications. You need to verify that your application works for all kinds of patients in different settings. Specifically: different genders, age groups, ethnicities. You also need to make sure that the different devices that generate diagnostic data all work well with your algorithm. For this purpose you need access to private datasets. This is very difficult. It takes about 16 to 30 months to validate an algorithm, with costs from 1.5 to 2.5 million dollars. A huge burden for anyone developing an algorithm. AI will continue to grow in healthcare, but these kinds of costs and risks are a real obstacle.
What we need: an environment where algorithms can be validated on private datasets without ever moving or sharing the data. The algorithm IP has to be protected as well. If the weights of a deep learning model leaked, it would pose a great risk for the company developing the algorithm. An example of such a tool: BeeKeeper AI.
There are many reasons why industry already adopts these techniques for structured transparency. In addition to all the benefits for academic researchers, using these tools can give industry researchers a competitive advantage. They can speed up regulatory processes and reduce costs.
Consumers want better privacy. That's true, but far from the only point that can be improved by structured transparency. The goal: create information flows that maximize good in the world.
One ancient information flow that could be improved: community wisdom. Folk tales, religion and community norms are forms of community wisdom. We hold on to these ancient forms of community wisdom. But because this is such a deep human desire, new ways of exchange appear with new technologies. One example is anonymous advice on the internet. Reddit, Quora, Health advice. There are thousands of communities on the internet for this kind of information flow.
Word of mouth doesn't scale well for several reasons:
Reason 1: There are millions of products. Maybe you know no one that uses this product.
Reason 2: You might not feel comfortable asking people you know. If you wonder about a certain medication or if a particular doctor has a high reputation.
Reason 3: You can't always trust the source. Example: Online product reviews. Without verification, it's difficult to trust a review of a product. Or look at anonymous forums: which of these strangers can you trust?
Reason 4: Wisdom sometimes is not with one single person. Wisdom can mean learning from the aggregate of our experiences. That's exactly what machine learning was built for!
Focus on the goal: help people get the trusted information they need to make everyday decisions.
Remember the Polis system from lesson 2: it's not about creating the most efficient, most scalable solutions. But about replicating information flows for community wisdom that have worked organically.
News Media could also profit from structured transparency solutions.
Journalists have to protect their sources, but want to get out their message. This is an excellent example of real-world, manual structured transparency.
- Input and output privacy: hiding the source and modifying the contents to prevent identification of the source.
- Input verification: by testifying that a source is credible.
- Output verification: journalists are subject to review by the editor. This provides governance over the information at hand.
Which parts of this information flow could be improved?
Example: Whistleblowing. Journalists spend their careers building a trusted network. It's a very manual process, with meetings and phone calls. A lot is at stake for a whistleblower, and they have to trust a journalist. It's a classic structured transparency problem: how do you reveal information about the wrongdoing of an organization without endangering yourself?
Today, apps like Signal or WhatsApp end-to-end encrypt your messages. But something is missing: you can't message someone anonymously. These apps also offer no way to prove that you are a member of a group without revealing who you are. If you have explosive information about an organization, it makes a difference for your credibility if you can prove that you work there.
There is also no anonymous search and discover feature. Most of this work must be done through phone calls, emails, network building. It's possible, but with a lot of friction. If there was less friction, there could be more stories and more accountability for people in positions of power.
Example: Disinformation. It's an input verification and output verification problem. With the rise of deepfakes, we cannot trust video anymore. We need better input verification. This is possible with embedded cryptography in cameras. The camera adds a cryptographic signature to the image. If you saw an image that no actual camera signed, you should be suspicious.
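The camera-signature idea can be sketched with nothing but a hash function. The example below uses a Lamport one-time signature, chosen here only because it fits in a few lines of standard-library Python. It is an assumption for illustration, not the scheme real camera vendors use, and each key pair may sign exactly one image.

```python
import hashlib
import secrets
from typing import List, Tuple

KeyPairs = List[Tuple[bytes, bytes]]


def _digest_bits(data: bytes) -> List[int]:
    """Return the 256 bits of SHA-256(data), most significant bit first."""
    digest = hashlib.sha256(data).digest()
    return [(byte >> (7 - i)) & 1 for byte in digest for i in range(8)]


def generate_keys() -> Tuple[KeyPairs, KeyPairs]:
    """Create a one-time key pair: 256 pairs of secrets and their hashes."""
    private = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    public = [(hashlib.sha256(a).digest(), hashlib.sha256(b).digest()) for a, b in private]
    return private, public


def sign(image: bytes, private: KeyPairs) -> List[bytes]:
    """Reveal one secret per digest bit. A key must never sign twice."""
    return [private[i][bit] for i, bit in enumerate(_digest_bits(image))]


def verify(image: bytes, signature: List[bytes], public: KeyPairs) -> bool:
    """Check that every revealed secret hashes to the published value."""
    bits = _digest_bits(image)
    if len(signature) != len(bits):
        return False
    return all(
        hashlib.sha256(sig).digest() == public[i][bit]
        for i, (sig, bit) in enumerate(zip(signature, bits))
    )


if __name__ == "__main__":
    # Hypothetical camera: the private key stays in the device,
    # the public key is published by the manufacturer.
    camera_private, camera_public = generate_keys()
    photo = b"raw sensor bytes of one photo"
    tag = sign(photo, camera_private)
    print(verify(photo, tag, camera_public))               # True
    print(verify(photo + b" edited", tag, camera_public))  # False
```

Anyone with the public key can check an image, but only the device holding the private key could have produced the signature.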
The essential challenge for any machine learning startup: Do you have the data? On day one you have no customers. You need a strategy to get a dataset that is big enough. Big enough to create an ML model that can solve someone's problem. This is why nearly all ML startups start with consulting. Consulting can give you a relationship with a big corporation with millions of users and their data.
Data markets today are one of two very different kinds of markets. Either it's a race to the bottom where data is sold as fast as possible. Or it's a secretive industry where data is almost never sold.
Example: Financial data
- On one hand, there is information about stock trades that is available to almost everyone. The only way to gain an advantage is to receive this information a bit earlier. That's why companies pay millions to get physically closer to the NASDAQ servers.
- On the other hand, there are hedge funds who discover new data that helps predict the market. They never tell anyone about it because others would compete.
This is true for other industries. Some data is traded very fast, some of it almost never. The marketplace for data is fundamentally broken. It's the copy problem: If you sell data, every customer becomes your competitor. Both for use and sale of this data.
What if the copy problem was solved? A creator could sell data in a way that they are always able to charge money for any derivative use.
- That would raise the prices for data that was previously sold as fast as possible.
- The use of data that was previously locked away now could be sold. No customer would also become a competitor.
Structured transparency solves this through a combination of all five guarantees. The result is called a Federated Data Network. Instead of selling data, you sell the opportunity to run statistics on your data. If you're a hospital, you don't sell your MRI scans, but allow researchers to study the data while it's in your hospital. Only the high-level insights of the researchers leave the building. Training models this way, with the data staying where it is, is called Federated Learning. Data markets would explode in value. Everyone can participate and prices stay high.
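A toy sketch of that pattern, assuming two hospitals and a researcher who only wants an average: each hospital computes a local summary, and only the summaries leave the building. The function names and numbers are hypothetical. Real federated learning exchanges model updates rather than sums, but the flow of information is the same.

```python
from typing import List, Tuple


def local_summary(values: List[float]) -> Tuple[float, int]:
    """Runs inside the data owner's building: returns only (sum, count)."""
    return sum(values), len(values)


def federated_mean(summaries: List[Tuple[float, int]]) -> float:
    """Runs at the researcher: combines aggregates, never the raw rows."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    if count == 0:
        raise ValueError("no data points were reported")
    return total / count


if __name__ == "__main__":
    # Hypothetical measurements that never leave their hospital.
    hospital_a = [4.0, 6.0]
    hospital_b = [5.0, 7.0, 8.0]
    summaries = [local_summary(hospital_a), local_summary(hospital_b)]
    print(federated_mean(summaries))  # 6.0
```

On its own this gives no output privacy: a summary over very few rows can still reveal a lot, which is where Differential Privacy comes in.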
Privacy protection can benefit users and companies: There are people who are unwilling to share data because of privacy concerns. With Differential Privacy, there is a relationship between ε and the accuracy of your analysis. With perfect privacy (setting ε = 0) you cannot learn anything about the data. But offer no privacy at all and you might not be able to collect the data. Even if you, as a data analyst, didn't care about personal privacy at all, only about the accuracy of your analysis, you would have to choose something in between to get the optimum accuracy.
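The noise side of this trade-off is easy to see with the Laplace mechanism, a standard way to make a counting query differentially private. The sketch below is a simulation under stated assumptions: the true count of 1000, the number of trials and the helper names are all made up for illustration. It does not model the other side of the trade-off, people refusing to share data when privacy is weak.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """The difference of two exponential samples is Laplace(0, scale)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Laplace mechanism for a counting query (sensitivity 1)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive; epsilon = 0 would need infinite noise")
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(epsilon: float, trials: int = 2000, seed: int = 0) -> float:
    """Average distance between the noisy answer and the true answer."""
    rng = random.Random(seed)
    true_count = 1000  # hypothetical true answer
    errors = [abs(private_count(true_count, epsilon, rng) - true_count) for _ in range(trials)]
    return sum(errors) / trials


if __name__ == "__main__":
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: mean absolute error ~ {mean_abs_error(eps):.2f}")
```

The expected error is 1/ε, so a ten times smaller ε means roughly ten times more noise in the answer.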
What does a healthy data market look like? It's privacy-preserving, far more profitable and much less risky for everyone involved. Startups have access to all the data in the world from day one. They would of course have to pay for it, but it's all available. Who owns the data? Whoever collected it first. Today, this is mostly large enterprises. But tomorrow, it could be individual people who own their data.
The gist: There is a missing institution within the marketplace for private data. It should protect consumers from harm, and also enable a healthy data market place.
A quick recap of differential privacy and privacy budgets: a privacy budget is a measure of leakage that a data owner gives to a scientist studying their data. It is measured using a metric called ε. It can limit the probability that harm could come to a participant in a data study.
Privacy budgets should be person-centric. A person's ε should be tracked across different institutions. No individual data holder can ensure that people don't get hurt. Because they don't know where else that person's data might be hosted.
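As a rough illustration of what person-centric tracking could look like, here is a hypothetical ledger that adds up the ε spent on one person across several institutions and refuses a query once a lifetime budget would be exceeded. It relies on basic sequential composition (the ε values of separate releases add up). The class, its methods and the budget of 1.0 are assumptions, not an existing tool.

```python
from collections import defaultdict
from typing import Dict


class PersonCentricLedger:
    """Tracks epsilon spent per person across all institutions (hypothetical design)."""

    def __init__(self, lifetime_budget: float) -> None:
        self.lifetime_budget = lifetime_budget
        # _spent[person][institution] -> epsilon consumed there so far
        self._spent: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))

    def total_spent(self, person: str) -> float:
        """Sequential composition: epsilons from all institutions add up."""
        return sum(self._spent[person].values())

    def remaining(self, person: str) -> float:
        return self.lifetime_budget - self.total_spent(person)

    def charge(self, person: str, institution: str, epsilon: float) -> bool:
        """Record a query's cost, or refuse it if the global budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.total_spent(person) + epsilon > self.lifetime_budget:
            return False
        self._spent[person][institution] += epsilon
        return True


if __name__ == "__main__":
    ledger = PersonCentricLedger(lifetime_budget=1.0)
    print(ledger.charge("alice", "hospital", 0.5))  # True
    print(ledger.charge("alice", "bank", 0.4))      # True
    print(ledger.charge("alice", "fitness_app", 0.2))  # False: would exceed 1.0
```

The fitness app is refused even though it never queried Alice's data before, which is exactly what a single data holder could not decide on its own.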
We look at this problem from two perspectives, economics and public safety.
Brilliant paper: Selling Privacy at Auction. You should read at least the first few pages. The idea: If you sell data in a privacy-preserving way, you're not selling data itself but insights. But if you sell too many insights, somebody can use all these insights to recreate the original data. The privacy budget is all about preventing that from happening. By keeping track of the probability somebody could reconstruct your information.
But if you can only release so many insights before somebody can use these insights to reconstruct your data, then ε itself is the scarce resource. At some point, no further data science can be applied to your information. As there is less and less ε going around, its price should increase.
- In federated data networks, data owners maintain control over the only copy of their data. We measure data leakage using ε.
- ε becomes an important factor in the price of a data point.
- It is a measure of scarcity, how much is left of a dataset insight to sell.
- People's interest to protect their privacy starts to align with the commercial incentive. Because businesses want to maintain control over the only copy of their information to keep prices high. Economic forces that used to work against privacy could soften or even work in favor of privacy.
ε is an effective market mechanism to create healthy data markets that protect privacy.
The economic approach described above relies on an important piece of privacy infrastructure. It's called automatic privacy budgeting for remote data science. However, it is not yet mature.
At the moment, the tooling is limited. There are individual algorithms for Differential Privacy available, but no end-to-end infrastructure to keep track of privacy budgets. One of the essential lessons: privacy doesn't come for free. We must think of it as a limited resource, which makes it necessary to consistently keep track of this privacy budget.
All this would only be useful to track individual privacy budgets. However, this isn't enough. Privacy budgets have to be person-centric to provide true protection! These systems can never be truly robust, unless someone, somewhere is keeping track of a person's ε across all institutions that have data about them.
These are examples of the recursive enforcement problem.
There are a lot of areas in government and statistical services that need better information flows.
Example: database queries. A government intelligence agency asks a private company to reveal information about a user. For example, the location history of a user. But they don't want to reveal which user they are looking for. (Because this could leak to the press and ruin the investigation.) The only way to do this without tools for structured transparency: the company has to reveal the data of all their users. This is a serious breach of privacy for all uninvolved users. They would have to trust that their information was not misused, if they even knew that their privacy was violated at all.
With tools for structured transparency it would be possible to do a homomorphically encrypted database query and fetch just the one record that you needed.
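One way such a query can work is private information retrieval on top of an additively homomorphic scheme. The toy sketch below uses Paillier encryption: the agency sends an encrypted one-hot selector, and the company combines it with every record without learning which one was picked. Everything here is an assumption for illustration. The primes are far too small for real security, the records are plain integers, and nothing proves to the company that the selector picks only a single record.

```python
import math
import secrets
from typing import List, Tuple

# Toy parameters: two well-known Mersenne primes. NOT secure key sizes.
P = 2**31 - 1
Q = 2**61 - 1


def keygen() -> Tuple[int, Tuple[int, int]]:
    """Return the public modulus n and the private key (lambda, mu)."""
    n = P * Q
    lam = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid for the generator g = n + 1 (Python 3.8+)
    return n, (lam, mu)


def encrypt(n: int, message: int) -> int:
    """Paillier encryption with g = n + 1 and fresh randomness r."""
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(n: int, private_key: Tuple[int, int], ciphertext: int) -> int:
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n * mu) % n


def server_answer(n: int, records: List[int], encrypted_selector: List[int]) -> int:
    """Company side: computes Enc(sum_i selector_i * record_i) blindly."""
    n_sq = n * n
    result = 1  # a trivial encryption of 0
    for record, c in zip(records, encrypted_selector):
        result = (result * pow(c, record, n_sq)) % n_sq
    return result


def private_lookup(records: List[int], index: int) -> int:
    """Agency side: fetch records[index] without revealing index."""
    n, private_key = keygen()
    selector = [encrypt(n, 1 if i == index else 0) for i in range(len(records))]
    return decrypt(n, private_key, server_answer(n, records, selector))


if __name__ == "__main__":
    # Hypothetical location codes, one per user.
    print(private_lookup([101, 202, 303, 404], 2))  # 303
```

The company still has to touch every record, so the cost grows with the size of the database. In exchange, no uninvolved user's data is ever revealed.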
Writing regulation is hard. Privacy requirements precede the development of mature technology. But there is active research around privacy and regulators should look at the science.
Regulations are sticky. Some current regulations (for example, HIPAA) encode privacy protections that we view as misguided today. They often focus on anonymization, although better solutions are out there. The US Census Bureau actually adopted differential privacy techniques for the 2020 census.
Current privacy laws are not written from this global perspective. Most laws focus on reducing the individual harm of any particular interaction, any particular data release. What we need is a holistic regulation that limits the amount of "radiation" that any consumer will ever experience.
We looked at how structured transparency can solve the problems of many different groups of people. But there is one question remaining: When? Timing is important.
The current techniques are the result of decades of work. Some of those techniques are working on your phone right now. For example, federated learning to predict your next words. Millions of users switched to Signal to get better information flows. Consumer demand has already started.
What are the next steps, how can we solve the information flows from lesson 2? Remember, solving privacy would have great benefits:
- Researchers can access data to solve important problems
- Consumers can reclaim agency of their data and fight monopolies
- Citizen-led data projects can affect policy and protect their environment
- Cross-border data governance does not lead to new colonialism
- Healthy flow for democratic feedback
- More accountability for powerful actors
First step: Awareness. A shared vision. The execution of those information flows will require multi-disciplinary action. We need policy changes and activism.
Next step: Making it easy to build. Wrap up complex algorithms in easy-to-use tools. So that researchers and developers can adopt the techniques covered in this course. This is a great development opportunity and a great role for open source communities. Free open source software will ensure that any developer can have an accessible toolkit.
Another way: Extend current popular tools to have privacy-preserving abilities. SQL is not by default a privacy-preserving language. Multiple research teams are working on enhancing SQL to support DP queries. Combining a familiar language like SQL with sophisticated privacy protection. The result looks and feels like SQL, but the answers have DP baked into them by default.
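To give a feel for the idea, here is a small sketch that wraps an ordinary SQL `COUNT(*)` in a function that adds Laplace noise before returning the answer. It uses Python's built-in `sqlite3` module. The `visits` table, the `dp_count` helper and the choice of ε are hypothetical. The research systems mentioned above do much more, such as bounding each person's contribution and tracking the budget across queries.

```python
import random
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (patient_id INTEGER, diagnosis TEXT)")
    conn.executemany(
        "INSERT INTO visits VALUES (?, ?)",
        [(1, "flu"), (2, "flu"), (3, "asthma"), (4, "flu"), (5, "diabetes")],
    )
    return conn


def dp_count(
    conn: sqlite3.Connection,
    count_sql: str,
    params: tuple,
    epsilon: float,
    rng: random.Random,
) -> float:
    """Run a COUNT(*) query and add Laplace noise with scale 1/epsilon.

    Assumes each person contributes at most one row, so the count has sensitivity 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    (true_count,) = conn.execute(count_sql, params).fetchone()
    # Difference of two exponentials with rate epsilon is Laplace(0, 1/epsilon).
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


if __name__ == "__main__":
    conn = make_demo_db()
    query = "SELECT COUNT(*) FROM visits WHERE diagnosis = ?"
    print(dp_count(conn, query, ("flu",), epsilon=0.5, rng=random.Random()))
```

The analyst still writes plain SQL, and only the noisy answer ever leaves the function.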
Practical challenges: Algorithm speed, network connectivity, balancing the trade-offs of each technique. But these are exciting R&D opportunities.
Once the complex algorithms are wrapped in easy-to-use tools, those tools can be wrapped into easy-to-use consumer products. So that you as a consumer don't have to think about it, like HTTPS.
The long-term defensible product strategy is to provide information within the forthcoming networks of data. It's about building information flows that help people achieve their goals in the best possible way.
- If we want an existing technology to become mainstream: it's an opportunity for entrepreneurs, investors, activists and lawmakers.
- If we want a novel technique to become mainstream: it's a research opportunity. And an opportunity for government investment.
- If we want technologies that don't exist yet: that's an opportunity for science fiction authors. And for all of us to daydream.