Freeze! How Immobilizing Medical Records Immunizes Patients Against Data Breaches

Analyzing the data in the millions of medical records kept by family doctors can reveal priceless, potentially life-saving findings, but it is not an easy task to undertake safely. The reason is that copying such data and transmitting it to external researchers for computerized analysis is fraught with the risk of data breaches, with frightening ramifications for patient privacy.

Yet all, it seems, is not lost. A team of data scientists, software engineers, and epidemiologists from the Bennett Institute for Applied Data Science at the U.K.’s University of Oxford has innovated around the privacy issues and developed an ingenious secure analytics platform that allows tens of millions of medical records to be interrogated without ever putting patients at heightened risk of a data spill.

Their secret? Rather than sending medical data out to researchers, with all the attendant risks of a breach, they leave the medical records safely in situ, in the secure datacenters of the health record providers approved by the U.K. National Health Service (NHS). What moves instead is the researcher’s analytical software code, which is run securely over the records inside those datacenters.
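
In rough terms, the pattern looks something like the following minimal sketch, in which researcher-written analysis code is executed next to the records and only an aggregate result crosses the datacenter boundary. The names here are hypothetical illustrations of the concept, not OpenSAFELY’s actual API.

```python
# Minimal sketch of the "move code, not data" pattern; all names here are
# hypothetical illustrations, not OpenSAFELY's actual API. The analysis
# function travels to the datacenter; only its aggregate result travels back.
from typing import Callable

# Stand-in for the secure store inside a provider's datacenter.
# In the real platform, these records never leave that datacenter.
_SECURE_RECORDS = [
    {"age_band": "70+", "covid_admission": True},
    {"age_band": "18-39", "covid_admission": False},
]

def run_inside_datacenter(analysis: Callable[[list[dict]], dict]) -> dict:
    """Execute researcher-supplied code next to the records themselves."""
    return analysis(_SECURE_RECORDS)  # only the return value crosses the boundary

def admissions_by_age_band(records: list[dict]) -> dict:
    """A researcher's analysis: aggregate counts, never row-level data."""
    counts: dict[str, int] = {}
    for record in records:
        if record["covid_admission"]:
            counts[record["age_band"]] = counts.get(record["age_band"], 0) + 1
    return counts

print(run_inside_datacenter(admissions_by_age_band))  # {'70+': 1}
```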

It might sound like a simple conceptual switch, making software move rather than data, but ensuring the analytics code actually does the research job it is supposed to do, transparently, securely, and in a way that lets other data science teams check or replicate results, has been a multi-year effort for the team led by the Bennett Institute’s director, epidemiologist Ben Goldacre, and its chief technical officer, computer scientist Seb Bacon.

Goldacre and Bacon’s multidisciplinary team began developing the open-source platform, called OpenSAFELY, in 2020, in the white heat of the COVID-19 pandemic. Back then, the NHS needed to find fast but secure ways to query the “population-scale” medical records held by family doctors, amounting to some 58 million records in all.

The Bennett Institute’s overarching aims were to help researchers understand which medical interventions were working and which were not, as COVID-19 led to lockdowns in nations around the world. The kinds of questions they needed answered included which age groups and ethnicities were at greatest risk from the rapidly spreading SARS-CoV-2 virus; which of the emerging antivirals were most effective against it; which vaccines worked best and which vaccines’ protective effects waned soonest; and whether adults in households with school-age children, for instance, were more likely to suffer COVID-19 infection. (They were not.)

Simply handing out doctors’ data to research teams was not a great idea: such data in the U.K. is far too detailed to risk losing in a breach, whether because cyberattackers intercepted it in transit or, more simply, because a researcher in receipt of the datasets left a laptop containing them on a bus. “NHS GP (general practitioner) records are an extraordinary resource, cradle-to-grave records of great breadth and depth. They’ve got lots of information about every citizen in the country, and that makes them incredibly powerful for research and also potentially very powerful for innovation,” said Goldacre.

However, continued Goldacre, “Those same records also present enormous challenges. Firstly, they present challenges around privacy. Taking the names and addresses off detailed patient records isn’t enough to protect patients’ privacy because they’re so detailed, especially when you’ve got coverage at whole population scale. So we had to do better than just taking the names and addresses off.”

The level of detail in GP records means an ‘anonymized’ patient might be identifiable to an attacker through the unique combination of medical conditions diagnosed over a lifetime. The Bennett Institute team had to devise a whole new way of working with patient data if they were to access it and extract critical insights at speed as the pandemic unfolded.

“We didn’t try and ignore the privacy problem. We invented new ways of working to address it, so the data is locked down, but everything else about the platform is completely open,” said Goldacre.

So, just how does OpenSAFELY allow sensitive, confidential records to be opened safely, as its name suggests, to a researcher’s analytics code?

The trick at the heart of the platform is that it uses automated tools developed by the Bennett team to give would-be researchers randomly generated dummy records in the same format as the medical datasets they are interested in, so they can design, write, and perfect their analytic techniques without ever touching real, spillable, confidential patient data.
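
Conceptually, the dummy-data step can be pictured with the hedged sketch below, in which a hypothetical generator (not the Bennett team’s actual tooling) produces random records matching the real dataset’s columns and value ranges.

```python
# Hypothetical sketch of schema-matched dummy data; not the Bennett team's
# actual generator. Given the shape of the real dataset, it produces random
# records in the same format, so analysis code can be written and tested
# without touching confidential data.
import random

# Assumed, simplified schema: column names mapped to their possible values.
SCHEMA = {
    "age_band": ["0-17", "18-39", "40-69", "70+"],
    "sex": ["F", "M"],
    "vaccinated": [True, False],
    "covid_admission": [True, False],
}

def make_dummy_records(n: int, seed: int = 42) -> list[dict]:
    """Random records matching the real dataset's columns and value ranges."""
    rng = random.Random(seed)
    return [{column: rng.choice(values) for column, values in SCHEMA.items()}
            for _ in range(n)]

dummy = make_dummy_records(1000)  # spilling this would expose no real patient
```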

“The users work with that dummy data to write all of their code, to prepare the data, to turn it into graphs, statistical tests, tables, and analyses. And when their code is written and tested, when they know it can run to completion using that dummy data, they then press a button and their code gets sent off to run at arm’s length in automated tools against the real records. And then they get their results back,” said Goldacre.
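
That “press a button” workflow might be sketched, again with purely illustrative names rather than the real OpenSAFELY tooling, as a gate that requires the analysis to run end-to-end on dummy data before the code, and only the code, is shipped off for execution.

```python
# Purely illustrative sketch of the two-phase workflow Goldacre describes;
# the names are hypothetical and do not belong to the real OpenSAFELY tooling.
from typing import Callable

Analysis = Callable[[list[dict]], dict]

def runs_to_completion(analysis: Analysis, dummy: list[dict]) -> bool:
    """Gate: code must succeed end-to-end on dummy data before submission."""
    try:
        analysis(dummy)
        return True
    except Exception:
        return False

def press_the_button(analysis: Analysis, dummy: list[dict]) -> None:
    """Ship the code, never the data, across the wall for an arm's-length run."""
    if runs_to_completion(analysis, dummy):
        # In the real platform, automated tools inside the provider's
        # datacenter run the job and later return results to the researcher.
        print("Submitted for execution against the real records.")
    else:
        print("Rejected: the code must first run to completion on dummy data.")

press_the_button(lambda recs: {"n": len(recs)},
                 [{"age_band": "70+", "covid_admission": True}])
```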

This way, if external researchers were to spill that randomized dummy data, the breach would not matter; it is nonsense data about no one. Once the researchers are happy with their executable code, said Bacon, two things happen: first, they file it in a GitHub repository for transparency’s sake, so other research teams can check it, reuse all or part of the code, or replicate studies. Second, the OpenSAFELY software tools execute their analytics on actual patient records, which always remain in situ.

“The records don’t move. We don’t extract data and send it out to other people or put it on one central machine. Instead, we have built tools which are installed in the datacenters run by the GP system suppliers [TPP and EMIS], where the data already resides and the data is queried in those, at rest,” said Goldacre.

From a software engineering standpoint, Bacon said the two fundamental improvements OpenSAFELY brings to medical analytics computing are privacy and reproducibility. “No researcher ever has direct access to the raw data. It’s always at arm’s length. They can never look at a table of patient information,” he said of the privacy aspect.
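
Keeping researchers at arm’s length also extends to the results themselves, which are screened before release so that tiny counts cannot single out individuals. Below is a hedged sketch of one widely used statistical disclosure control, small-number suppression; the threshold is an arbitrary assumption for illustration, not OpenSAFELY’s actual release rule.

```python
# Illustrative sketch of small-number suppression, a widely used statistical
# disclosure control; the threshold is an arbitrary assumption, not
# OpenSAFELY's actual release rule. Aggregate results are redacted before
# leaving the datacenter so tiny counts cannot single out patients.

SUPPRESSION_THRESHOLD = 5  # assumption chosen for illustration only

def suppress_small_counts(table: dict[str, int]) -> dict[str, object]:
    """Withhold any count small enough to risk re-identifying someone."""
    return {group: (count if count >= SUPPRESSION_THRESHOLD else "[REDACTED]")
            for group, count in table.items()}

print(suppress_small_counts({"40-69": 1280, "70+": 3}))
# {'40-69': 1280, '70+': '[REDACTED]'}
```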

On reproducibility, Bacon said that until now, the “sharing of methods in this kind of science has been very poor,” and that their use of GitHub repositories for researchers’ analytics solves that, fueling accountability and code sharing, and aiding peer review by transparently recording all the code OpenSAFELY collaborators have run against real patient data.

The secure analytics platform is certainly pulling in collaborators: so far, 31 research universities and organizations have run 181 research projects using the Bennett Institute’s system, with more than 100 of them completed.

Privacy campaigners have also conceded, in a note to the U.K. government, that OpenSAFELY has nailed down its security well, with leading health pressure group medConfidential saying that OpenSAFELY “allows reliable and repeatable analysis of sensitive, GDPR-category personal data without needing to create a new copy of the data.” medConfidential said it is “the step of copying the data” that is often the source of problems. Said Goldacre, “If you don’t move data around, then you don’t create new attack surfaces for bad actors.”

And the threat of bad actors is indeed on the minds of patients, said Michael Chapman, director of data access at NHS England, after Synnovis, operator of the pathology labs serving three major London hospitals, had its computers bricked in a debilitating ransomware attack in June 2024. Thought to have been mounted by a Russian hacking gang called Qilin, the attack caused the cancellation of 20,800 pathology sample tests (as samples degraded), as well as 10,152 outpatient appointments and 1,710 elective surgeries in the ensuing months. On top of that, patient data seized in the attack was posted on the dark web.

“Are people sensitized to the risk following things like that cyberattack? Yes, very much,” Chapman said, adding that is why they have a “very busy, very active cybersecurity team” that is “making sure OpenSAFELY doesn’t increase that risk. We’re very conscious about that,” he added.

Underscoring the novel level of security OpenSAFELY appears to offer, the platform is now set to move into a raft of other privacy-sensitive applications beyond GP data. Health research charity Wellcome has awarded the Bennett Institute £7 million (U.S.$9.1 million) to anonymously assess mental health therapy outcomes, considering, among other things, the relative values of different anxiety and depression counseling methods and treatments. In addition, Wellcome has awarded Bennett £10 million (U.S.$12.9 million) to further develop the technologies underpinning secure health analytics.

In education, meanwhile, the U.K.’s National Institute of Teaching is harnessing OpenSAFELY to assess, in part, how interventions in factors such as teacher training affect student outcomes in a number of high schools, with no risk of the teachers or students being identified. Intriguingly, Bacon said some retail chains are now showing an interest in the secure analytics protocol, too.

“Supermarkets have got huge amounts of data they gather through their loyalty card schemes. They could link that with health data to look at issues related to nutrition, or socioeconomic factors to do with food choices. In principle, some supermarkets might be up for this as part of their corporate social responsibility activity,” he said.

Paul Marks is a technology, aviation, and spaceflight journalist, writer, and editor based in London, U.K.