Every decade, the U.S. Census Bureau counts the people in the United States, trying to strike a balance between gathering accurate information and protecting the privacy of the people described in that data. But current technology can reveal a person’s transgender identity by linking seemingly anonymized information, such as their neighborhood and age, to discover that their sex was reported differently in successive censuses. The ability to deanonymize gender and other data could spell disaster for trans people and families living in states that seek to criminalize them.
In places like Texas, where families seeking medical care for trans children can be accused of child abuse, the state would need to know which teenagers are trans to carry out its investigations. We worried that census data could be used to make this kind of investigation and punishment easier. Might a weakness in how publicly released data sets are anonymized be exploited to find trans kids, and to punish them and their families? This is similar to the concern that underscored the public outcry in 2018 over the census asking people to reveal their citizenship: that the data would be used to find and punish people living in the U.S. illegally.
Using our expertise in data science and data ethics, we took simulated data designed to mimic the data sets that the Census Bureau releases publicly and tried to reidentify trans teenagers, or at least narrow down where they might live. Unfortunately, we succeeded. With the data-anonymization approach the Census Bureau used in 2010, we were able to identify 605 trans kids. Thankfully, the Census Bureau is adopting a new differential-privacy approach that will improve privacy overall, but it is still a work in progress. When we reviewed the most recent data release, we found the bureau’s new approach cuts the identification rate by 70 percent: a lot better, but still with room for improvement.
Even as researchers who use census data in our own work to answer questions about life in the U.S., we believe strongly that privacy matters. The bureau is currently holding a public comment period on the design of the 2030 census. Submissions could shape how the census is conducted and how the bureau goes about anonymizing its data. Here is why that matters.
The federal government gathers census data to make decisions about things like the size and shape of congressional districts, or how to disburse funding. Yet government agencies aren’t the only ones who use the data. Researchers in a variety of fields, such as economics and public health, use the publicly released information to study the state of the nation and make policy recommendations.
But the risks of deanonymizing data are real, and not just for trans children. In a world where private data collection and access to powerful computing systems are increasingly ubiquitous, it might be possible to unwind the privacy protections that the Census Bureau builds into the data. Perhaps most famously, computer scientist Latanya Sweeney showed that almost 90 percent of the U.S. population could be reidentified from just their ZIP code, date of birth and assigned sex.
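To see why those three attributes are so identifying, here is a minimal sketch in Python (using pandas) that counts how many records in a hypothetical person-level table are unique on the combination of ZIP code, date of birth and sex. The table, column names and values are our own illustration, not census data or Sweeney’s actual method.

```python
import pandas as pd

# Hypothetical person-level records; the columns and values are illustrative only.
people = pd.DataFrame({
    "zip": ["02139", "02139", "73301", "73301", "73301"],
    "dob": ["1990-04-01", "1985-11-23", "2004-07-15", "2004-07-15", "1962-02-02"],
    "sex": ["F", "M", "F", "F", "M"],
})

# A record is re-identifiable in this sense when no one else shares its
# (ZIP, date of birth, sex) combination.
unique_mask = ~people.duplicated(subset=["zip", "dob", "sex"], keep=False)
print(f"{unique_mask.mean():.0%} of records are unique on ZIP + DOB + sex")
```

On real population data, that share turns out to be strikingly high, which is what makes these seemingly innocuous attributes act like a fingerprint.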
In August of 2021, the Census Bureau responded. The organization used the cryptographer-preferred approach of differential privacy to protect its redistricting data. Mathematicians and computer scientists have been drawn to the mathematical elegance of this approach, which involves intentionally introducing a controlled amount of error into key census counts and then cleaning up the results to ensure they remain internally consistent. For example, if the census counted precisely 16,147 people who identified as Native American in a specific county, it might report a number that is close but different, like 16,171. This sounds simple, but counties are made up of census tracts, which are made up of census blocks. That means, in order to get a number that is close to the original count, the census must also tweak the number of Native Americans in each census block and tract; the art of the Census Bureau’s approach is to make all of these close-but-different numbers add up to another close-but-different number.
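Here is a minimal sketch in Python of those two steps: adding noise to counts and then post-processing block-level counts so they remain non-negative integers that add up to the noisy county total. The Laplace noise and the simple proportional adjustment are our own simplifying stand-ins; the bureau’s actual TopDown algorithm uses a different noise mechanism and a constrained optimization across the full geographic hierarchy. The block counts are made up, apart from summing to the 16,147 county total from the example above.

```python
import numpy as np

rng = np.random.default_rng(2020)

def noisy(count, scale=2.0):
    """Add Laplace noise to a true count (an illustrative stand-in for the
    bureau's calibrated noise, not its actual mechanism)."""
    return count + rng.laplace(scale=scale)

# Made-up true block-level counts for one county (sum to 16,147).
true_blocks = np.array([5_812, 4_203, 3_677, 2_455])
true_county = true_blocks.sum()

noisy_county = noisy(true_county)
noisy_blocks = np.array([noisy(b) for b in true_blocks])

# Post-processing: force the block counts to be non-negative integers
# that sum exactly to the (rounded) noisy county total.
target = max(0, round(noisy_county))
blocks = np.maximum(noisy_blocks, 0)
blocks = np.floor(blocks / blocks.sum() * target).astype(int)
blocks[0] += target - blocks.sum()  # absorb any rounding leftover

print(target, blocks, blocks.sum())  # county total and internally consistent blocks
```

The hard part, which this toy version glosses over, is doing that reconciliation simultaneously for every block, tract, county and state so that all the close-but-different numbers stay consistent with one another.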
One might think that protecting people’s privacy is a no-brainer. But some researchers, primarily those whose work depends on the existing data privacy approach, feel differently. These changes, they argue, will make it harder for researchers to do their jobs in practice—while the privacy risks the Census Bureau is protecting against are largely theoretical.
Remember: we’ve shown that the risk is not theoretical. Here’s a bit on how we did it.
We reconstructed a complete list of people under the age of 18 in each census block so that we could learn what their age, sex, race and ethnicity were in 2010. Then we matched this list against the analogous list from 2020 to find people who were now 10 years older and whose reported sex had changed. This method, called a reconstruction-abetted linkage attack, requires only publicly released data sets. When we had the work reviewed and presented it formally to the Census Bureau, it was robust and worrying enough that researchers from Boston University and Harvard University reached out to us for more details.
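For readers who want to picture the linkage step, here is a schematic sketch in Python. The records below are made-up stand-ins, not anything we actually derived from census tables, and the matching rule (same block, same race and ethnicity, age ten years older, different reported sex) is a simplification of the approach described above.

```python
import pandas as pd

# Made-up stand-ins for person-level records reconstructed from the 2010 and
# 2020 block-level tables; illustrative only, not real census data.
rec_2010 = pd.DataFrame({
    "block":    ["B1", "B1", "B2"],
    "age_2010": [7, 12, 9],
    "sex_2010": ["M", "F", "F"],
    "race":     ["white", "asian", "black"],
})
rec_2020 = pd.DataFrame({
    "block":    ["B1", "B1", "B2"],
    "age_2020": [17, 22, 19],
    "sex_2020": ["F", "F", "F"],
    "race":     ["white", "asian", "black"],
})

# Link records from the same block with the same race/ethnicity that are
# ten years older in 2020, then keep the pairs whose reported sex differs.
rec_2010["expected_age_2020"] = rec_2010["age_2010"] + 10
linked = rec_2010.merge(
    rec_2020,
    left_on=["block", "race", "expected_age_2020"],
    right_on=["block", "race", "age_2020"],
)
candidates = linked[linked["sex_2010"] != linked["sex_2020"]]
print(candidates[["block", "age_2010", "sex_2010", "sex_2020", "race"]])
```

In practice, the reconstruction step that produces person-level records from published tables is the harder part; this sketch only illustrates the linkage that follows it.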
We simulated what a bad actor could do, so how do we make sure that attacks like this don’t happen? The Census Bureau is taking this aspect of privacy seriously, and researchers who use these data must not stand in its way.
The census has been collected at great labor and great cost, and we will all benefit from the data produced by this effort. But these data can also do harm, and the Census Bureau’s work to protect privacy has come a long way in mitigating this risk. We must encourage the bureau to continue.
This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.