The leak of a massive store of data containing information on about one billion Chinese residents could be one of the biggest breaches of personal information in history.

Portions of the leaked data appeared last week on a known cybercrime forum, posted by someone selling the cache for 10 bitcoins, or about $200,000. The data was allegedly siphoned from a Shanghai police database stored in Alibaba’s cloud.

Although details of the breach remain scarce, portions of the data have been verified as authentic. The origins of the data and how it came to be in the hands of an underground seller, whose motives aren’t known, are still unclear.

News of the alleged breach has gone largely unreported in mainland China, where speech and expression are tightly controlled and internet access is censored and strictly restricted.

The breach, if authentic, raises questions about the vast scale of China’s surveillance state, the largest and most expansive in the world, and Beijing’s ability to keep that data secure.

Here’s what we’ve learned so far.

How did the data leak?

In a since-deleted post on the cybercrime forum, the seller claimed to have downloaded the data from a cloud storage server hosted by Alibaba Cloud, the cloud computing arm of the Chinese e-commerce behemoth. When reached by TechCrunch on Monday, Alibaba said it was looking into the claims.

Exactly how the data leaked is murky, but experts say that the database may have been misconfigured and left exposed through human error from April 2021 until it was discovered. This would seem to rule out a claim that the database’s credentials were inadvertently published as part of a technical blog post on a Chinese developer site in 2020 and later used to siphon the billion records from the police database, since no passwords were needed to access it.

Bob Diachenko, a Ukrainian security researcher, told TechCrunch that his own monitoring records show the database was also exposed in late April through a Kibana dashboard, web-based software used to visualize and search huge Elasticsearch databases. If the database didn’t require a password as believed, anyone could have accessed the data if they knew its web address.
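For administrators who want to confirm that a dashboard or database they operate is not reachable without a password, the check can be sketched in a few lines of Python. This is a minimal illustration, not the method used by Diachenko or any other researcher: the function names and verdict strings are hypothetical, and it should only be pointed at servers you are authorized to test.

```python
from urllib import error, request


def classify_status(status: int) -> str:
    """Map an HTTP status code to a rough exposure verdict (labels are illustrative)."""
    if status in (401, 403):
        return "auth required"
    if 200 <= status < 300:
        return "open without credentials"
    return "inconclusive"


def check_endpoint(url: str, timeout: float = 5.0) -> str:
    """Send one unauthenticated GET to a server you own and classify the reply."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return classify_status(resp.status)
    except error.HTTPError as exc:
        # urllib raises HTTPError for 4xx/5xx responses; the code is still informative.
        return classify_status(exc.code)
    except (error.URLError, OSError):
        return "unreachable"
```

A verdict of "open without credentials" means the server answered an anonymous request with data, which is the condition described above. A 401 or 403 response suggests some form of access control is in place, though it says nothing about how strong it is.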

Security researchers frequently scan the internet for inadvertently exposed databases or other sensitive data, often to collect bounties offered by the companies that they help to secure. But threat actors also run the same scans, often with the goal of copying data from an exposed database, deleting it, and offering the data’s return for a ransom payment — an increasingly common tactic used by criminal dumpster-divers in recent years. Diachenko said that’s what happened on this occasion; a malicious actor found, raided and deleted the exposed database, and left behind a ransom note demanding 10 bitcoins for its return.

“My hypothesis here is that the ransom note did not work and the threat actor decided to get money somewhere else. Or, another malicious actor came across the data and decided to put it up for sale,” said Diachenko.

Little is known about the seller or why the data was dumped online. It’s not uncommon to see large quantities of personal data for sale on cybercrime forums and on the dark web, but seldom data this sensitive or in such quantity.

What does the data look like?

TechCrunch reviewed a larger sample of the data uploaded by the seller, made up of three files totaling about 500 megabytes, each containing 250,000 individual records.

The data itself is formatted in JSON, a standard file format for Elasticsearch databases, making it easy to read and analyze. The format of the database suggests it was meticulously maintained and downloaded, rather than created purely by aggregating information from multiple data sources, a common technique used by information sellers and data brokers. However, some data may have been derived from external sources, such as food delivery orders.

The sheer size of the data and its level of detail, which would be difficult though not impossible to fake, also make it likely to be genuine.

TechCrunch translated the police records, which were written in Chinese, and redacted personally identifiable information.
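Because the records are JSON, masking identifying fields before sharing excerpts is straightforward to automate. The Python sketch below is illustrative only and is not the process TechCrunch used: the field names in `SENSITIVE_FIELDS` and the sample record are hypothetical, since the real schema is not published here, and it assumes one JSON record per line.

```python
import json

# Hypothetical field names; the real schema of the leaked records is not given here.
SENSITIVE_FIELDS = {"name", "address", "phone", "id_number"}


def redact(value):
    """Return a copy of a parsed JSON value with sensitive fields masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key in SENSITIVE_FIELDS else redact(val)
            for key, val in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def redact_lines(lines):
    """Parse newline-delimited JSON and yield one redacted record per line."""
    for line in lines:
        line = line.strip()
        if line:
            yield redact(json.loads(line))


# Made-up example record, nested the way Elasticsearch exports commonly are.
sample = ['{"_source": {"name": "A", "phone": "123", "case_type": "fraud"}}']
redacted = list(redact_lines(sample))
```

The function walks nested objects and lists, so a sensitive field is masked wherever it appears in a record, while non-identifying fields such as the hypothetical `case_type` are left intact.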

The files appear to contain detailed police reports dating from 1995 through 2019, including names, addresses, phone numbers, identity numbers, and sex, as well as the reason the police were called out. The records seen by TechCrunch include granular coordinates of where incidents occurred or police reports were made, which match the precise addresses also listed in each record, as well as the names of informants who made the reports and the individuals’ race and ethnicity. (The Chinese government has incarcerated over a million of its own citizens, mostly from Muslim minority ethnic groups, including Uyghurs and Kazakhs, which the Biden administration has declared a “genocide.”)

The records contain complaints and criminal allegations ranging from serious crimes involving violence to the relatively banal, such as reports of credit card fraud, internet scams, and gambling, which is illegal in China. Several records seen by TechCrunch show police cracking down on the use of VPNs, or virtual private networks, which are used to access sites blocked by China’s censorship system and as such are outlawed in China. One record showed a Shanghai resident was accused of using a VPN to post critical remarks about the government on Twitter, which is banned in China. It’s not known what subsequently happened to the individual.

The data also contained full web addresses to photos stored on the same server, none of which were accessible at the time of writing, but the associated data often indicates what was uploaded, such as a person’s residency documentation or their passport when leaving the country. These web addresses are formatted in a way that is consistent with how Alibaba’s cloud service stores files.

Many of the records we examined appeared to contain information on children, based on their dates of birth and ages listed in the data.

Without the (unlikely) confirmation from the Chinese government, it’s difficult to know for sure if the seller’s claims are genuine and the data was obtained from Shanghai’s police department. The Wall Street Journal, The New York Times, and CNN have verified portions of the data by calling individuals whose information was found in the database, lending weight to its authenticity.

What is the impact?

This alleged breach, if proved legitimate, could be highly damaging for Beijing, and raises questions about the government’s cybersecurity measures and the impact the breach will have on individuals.

It comes at a time when China is stepping up protection for personal data. Last September, China passed the Personal Information Protection Law, its first comprehensive privacy and data protection legislation, seen widely as China’s equivalent of Europe’s GDPR privacy rules. The law restricts how businesses can collect personal data and is expected to have a sweeping effect on the ad businesses of the country’s biggest tech giants, but allows broad exceptions for government agencies and departments that make up China’s vast surveillance capabilities.

Beijing is already reportedly censoring news of the alleged breach, and Chinese messaging apps WeChat and Weibo are blocking messages and mentions of phrases like “data leak” and “database breach.” The Chinese government has not yet commented on the breach.

It’s not the first security lapse involving a massive set of Chinese residents’ data that was left exposed to the wider internet without a password. In 2019, TechCrunch reported that a smart city installation in China was spilling the contents of a facial recognition database of nearby residents.


You can contact this reporter on Signal and WhatsApp at +1 646-755-8849 or zack.whittaker@techcrunch.com by email.