No data, no honey: the grim reality of machine learning in medicine (and other domains)

The increasing amount of data collected by the numerous internet services people use every day has opened doors to a plethora of breaches and abuses. The latest Facebook scandal may only be the start.

When it comes to private services, the situation doesn’t look any better. By private I mean the services usually provided to individuals by governments, financial institutions, insurers, healthcare institutions, and any combination of them. Such services handle data that provide a unique picture of an individual, one that is far clearer than the picture drawn from the data held by social media platforms or online retailers. Not to mention how sensitive such data is, and the damage it could cause in the wrong hands.

Humans seem to have reached a crossroads, where they are asked to choose between functionality and privacy. But not both. Not both at all.
No data, no service. That’s what companies building personal finance services say. The same applies to marketing companies, social media companies, search engine companies, and the list goes on.

Not long ago, the gap between healthcare, medicine and machine learning was bridged by encouraging results from technologies like deep learning, creating better radiologists than radiologists, faster cancer pathologists than pathologists, and more accurate clinical doctors than doctors [1,2,3,4,5,12].

Despite the usual skepticism that affects every technology in its infancy, medicine, healthcare and machine learning have started a (not so) new field of research called precision medicine.

Precision medicine proposes to customize healthcare using machine learning models and aggregated data coming from different domains. Such data describe different biological processes of the same organism. They are called heterogeneous data, to indicate the diversity of the sources and of the signal carried by each of them.

And so, by combining data describing metabolites (metabolomics), proteins (proteomics), genes (genomics), genetic mutations (SNPs, single nucleotide polymorphisms), and of course demographics, family history, environmental factors and traditional medical lab tests, machine learning models have tried to tackle the challenging tasks once solved by human medical doctors.
For instance, predicting the risk of certain diseases, detecting the genes responsible for protein breakdown, or identifying specific genetic pathways as the cause of certain phenotypes are only a few of the challenges taken up by the community of bioinformaticians and data scientists.

The biggest obstacle this community has faced since the early days of precision medicine is something mathematicians discovered a hundred years earlier: a monster called underdetermined systems, that is to say, problems with fewer equations than unknowns. Such systems have either no solution or infinitely many.
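
To make the point concrete, here is a minimal sketch in Python with NumPy. The 2×3 matrix, the right-hand side and all variable names are made up for illustration: two equations in three unknowns, and two different vectors that both satisfy them exactly.

```python
import numpy as np

# Hypothetical toy system: 2 equations (observations), 3 unknowns (variables).
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
b = np.array([6.0, 15.0])

# lstsq returns the minimum-norm solution among the infinitely many that fit.
x_min, *_ = np.linalg.lstsq(A, b, rcond=None)

# Any vector in the null space of A can be added without changing A @ x.
# A has rank 2 and 3 columns, so the last right-singular vector spans that null space.
_, _, vt = np.linalg.svd(A)
null_vec = vt[-1]
x_other = x_min + 5.0 * null_vec

print(np.allclose(A @ x_min, b))     # first candidate satisfies the equations
print(np.allclose(A @ x_other, b))   # so does the second one
print(np.allclose(x_min, x_other))   # yet the two vectors differ
```

Nothing in the data says which of the two candidates is the right one, and the same holds for every other multiple of the null-space vector.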

There is no truly clever way around underdetermined systems.
Simply put, if the number of equations is not sufficient, scientists can only choose between:

  1. reducing the number of unknowns
  2. increasing the number of equations

When in trouble, ignore some data

The first solution consists of ignoring some of the signals provided by the many diverse datasets collected thus far.
In precision medicine, however, the number of variables (unknowns) can easily be a few orders of magnitude larger than the number of observations (individuals).
This makes the gap too deep to close by discarding variables alone. Practically speaking, cohorts with a complete profile across all the data sources mentioned above may number in the thousands of individuals, while the number of unknowns (that is, all the independent variables) easily explodes to billions.

When in trouble, collect more data

The second solution is the one that has proved easier and more beneficial in machine learning. Increasing the number of observations by collecting more data almost always helps, no matter how fancy or simple the model.
As a consequence, many consortia have been created in an attempt to reduce the gap between the number of samples and the number of independent variables. This has clearly brought new challenges and issues, affecting (and many times compromising) the privacy and security of individuals.
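
The effect of adding observations can be sketched under idealized assumptions: a linear model, noise-free measurements, and random Gaussian features. None of this comes from a real cohort, and `p`, `w_true` and `recovery_error` are illustrative names. With fewer observations than unknowns the fitted coefficients stay far from the true ones; once the observations outnumber the unknowns the error collapses.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50                          # number of unknowns (hypothetical)
w_true = rng.normal(size=p)     # coefficients we would like to recover

def recovery_error(n_obs):
    """Fit least squares on n_obs noise-free observations and
    return the distance between the estimate and w_true."""
    X = rng.normal(size=(n_obs, p))
    y = X @ w_true
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.linalg.norm(w_hat - w_true))

for n_obs in (10, 25, 100):
    print(n_obs, recovery_error(n_obs))
```

Real data are noisy and rarely linear, so the transition is less sharp in practice, but the direction is the same: more rows, fewer admissible answers.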

One naive way to implement such a strategy consists of pooling data in a centralized location with regulated access.
While this strategy has made it possible to build super-profiles of individuals, with their demographic information, financial status, insurance details, genetic profile, medication history over the last ten years, and so on, it has also concentrated enormous power and resources in the hands of a few administrators.
Similar data collection plans have been adopted across domains, especially for consumer services, where it became very appealing to collect data that go beyond the purpose of the provided service [6,7,8].

Privacy and precision medicine

When it comes to genetic and biological data, identifying an individual with high accuracy becomes a trivial task. After all, DNA is unique to each individual. Even when working with summary statistics, a participant in a study can be identified from other factors such as age, gender or geographic location.
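
A small sketch shows how quickly such factors single people out. The records below are invented for illustration and contain no names and no DNA, only (age, gender, postcode); `unique_records` is a hypothetical helper that lists the rows whose combination of attributes appears exactly once.

```python
from collections import Counter

# Hypothetical, made-up records: (age, gender, postcode).
records = [
    (34, "F", "1000"),
    (34, "F", "1000"),
    (51, "M", "1000"),
    (51, "M", "1050"),
    (29, "F", "1050"),
    (29, "F", "1050"),
    (67, "M", "1200"),
]

def unique_records(rows):
    """Return the rows whose combination of attributes appears exactly once."""
    counts = Counter(rows)
    return [row for row in rows if counts[row] == 1]

singled_out = unique_records(records)
print(f"{len(singled_out)} of {len(records)} records are unique:", singled_out)
```

Anyone who knows a participant's age, gender and postcode can pick out those unique rows, and everything else stored alongside them.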

Moreover, the type of data handled by machine learning algorithms in healthcare and medicine is different by nature: it is not possible to opt out of our own DNA (at least not at the time of writing) the way one could opt out of a social network (assuming its administrators truly delete the data upon users’ request).

While centralizing data might solve the problem for machine learning algorithms, it has in fact created many more issues for patients. Let’s imagine an insurance company that can link a genetic profile carrying a mutation that increases the risk of breast cancer by 80% to a certain Alice.
Would such an insurer ignore this information and proceed with the subscription? Would a mortgage provider do the same?

Anonymizing data

When researchers realized how risky it was to connect genetic data to personal data, they turned to anonymization as a viable way to mitigate such risks.

The idea of separating personal data from the genetic profile of an individual seemed to work. Until it didn’t [9].

As a matter of fact, genealogies allow one to identify individuals who participated in family-based studies. Rare records for certain diseases also isolate the profile of an individual so well that identification becomes a trivial task.
Moreover, machine learning models trained on a combination of genetic and personal data would not perform as well on a stripped version of the data.

Anonymizing the genetic data itself, by contrast, is clearly an oxymoron. DNA paternity and forensic tests are so reliable precisely because DNA identifies an individual uniquely: comparing just a few markers of the query DNA against a database of DNA profiles is enough (forensic tests typically compare short tandem repeat markers rather than aligning full sequences).

Trying to obfuscate markers in the genetic profile of an individual would substantially alter the signal carried by the genetic material, making it unusable for any further analysis.

Encrypting data

One particular type of encryption that allows one to perform calculations on encrypted data is called homomorphic encryption [10].

Such schemes are known to cryptographers and computer scientists to be very demanding in terms of computation, especially for non-linear operations such as multiplications and divisions between encrypted numbers (schemes that support arbitrary operations of this kind are called fully homomorphic).

If the function to perform on encrypted data is as simple as counting or summing, simpler encryption schemes have been shown to be feasible. But in order to perform arbitrary computations, with multiplications and other non-linear algebraic operations, more complicated schemes are required.
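
As an illustration of the simple case, here is a toy sketch of summing under encryption. The article does not name a specific scheme; the sketch assumes Paillier, one well-known additively homomorphic scheme, with tiny hard-coded primes that are far too small for real use. The function names are illustrative and do not come from any library.

```python
import math
import random

def keygen(p=293, q=433):
    """Toy Paillier key pair. Real deployments use primes of 1024 bits or more."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)   # modular inverse (Python 3.8+), valid for generator g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt an integer 0 <= m < n under public key n (generator g = n + 1)."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, private_key, c):
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying two ciphertexts adds the underlying plaintexts (mod n)."""
    return (c1 * c2) % (n * n)

n, private_key = keygen()
c_sum = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
print(decrypt(n, private_key, c_sum))   # 42, computed without decrypting the inputs
```

The party doing the addition only ever sees ciphertexts. Multiplying two encrypted values together is not possible in this scheme, which is exactly where the heavier fully homomorphic schemes come in.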

It turns out that the computational cost quickly becomes prohibitive, especially for large machine learning problems.
Other approaches, such as multi-party computation (MPC), are less demanding than fully homomorphic encryption schemes, but they are still orders of magnitude slower than the equivalent operations performed on non-encrypted data.
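
One of the simplest building blocks of MPC is additive secret sharing, sketched below for a joint sum. This is an assumption chosen for illustration, not a protocol described in this article; the three "hospitals", their counts and the modulus are all made up.

```python
import random

PRIME = 2_147_483_647   # modulus 2**31 - 1, an arbitrary choice for the sketch

def share(secret, n_parties):
    """Split secret into n_parties random shares that sum to it modulo PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Recombine shares into the value they encode."""
    return sum(shares) % PRIME

# Three hypothetical hospitals each hold a private patient count.
private_counts = [120, 75, 240]
n_parties = len(private_counts)

# Each hospital splits its count and sends one share to every party.
all_shares = [share(count, n_parties) for count in private_counts]

# Each party adds up the shares it received; a single share looks like random noise.
partial_sums = [sum(s[i] for s in all_shares) % PRIME for i in range(n_parties)]

total = reconstruct(partial_sums)
print(total)   # 435, the joint sum, with no hospital revealing its own count
```

Sums come almost for free in this setting. Multiplications require extra rounds of communication between the parties, which is one source of the slowdown mentioned above.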

Obfuscating data

One idea behind data obfuscation goes under the name of differential privacy. It consists of obfuscating data by adding noise [11].
When a database is obfuscated with differential privacy, the results of the queries performed by clients are as accurate as possible without disclosing the identity of the individuals behind the returned records.

This approach returns consistent results on large numbers of records. For instance, if p is the true proportion of people with a specific attribute, it is possible to estimate p without identifying the single individuals who have that attribute.
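
A minimal sketch of that proportion estimate, assuming the Laplace mechanism (one standard way of adding noise, not necessarily the one any given system uses). The population is synthetic, the value of epsilon is arbitrary, and the function names are illustrative.

```python
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_proportion(flags, epsilon):
    """Release the proportion of True flags with Laplace noise.

    Changing one person's flag moves the proportion by at most 1/n,
    so the noise scale is (1/n) / epsilon.
    """
    n = len(flags)
    true_p = sum(flags) / n
    return true_p + laplace_noise((1 / n) / epsilon)

random.seed(0)
# Hypothetical population in which roughly 30% carry the attribute.
population = [random.random() < 0.3 for _ in range(100_000)]
true_p = sum(population) / len(population)
noisy_p = private_proportion(population, epsilon=0.5)
print(round(true_p, 4), round(noisy_p, 4))
```

With 100,000 records the noise is negligible compared to the proportion itself. On a cohort of a few dozen people the same mechanism would blur the answer noticeably, which is why the approach works best on large numbers of records.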

It is tempting to conclude that aggregating personal and genetic data in a database obfuscated with differential privacy would work. It definitely would for summary statistics, though with a few limitations.
The queries clients are allowed to perform on such a database are quite limited, and each should be checked before being processed. In fact, specific queries could disclose information that is more difficult to obfuscate.

Probably the biggest limitation comes from centralization itself.

Whenever data are centralized (even in encrypted or obfuscated form), their owners lose control.

This, in turn, prevents data owners from being incentivised every time their data are used in a study.

Despite the numerous attempts to mitigate the risk of compromising the privacy of individuals while providing data-driven services, there is no truly efficient solution yet.

An efficient solution must do three things:

  1. protect the identity of individuals participating in a study
  2. support data-driven decisions (return a disease diagnosis, identify the genetic factors responsible for a disease, etc.) and, more importantly
  3. incentivise individuals who share their data

With the advent of blockchain technology and better hardware, the problem described here seems more approachable. Still, despite the improvements in encryption schemes and MPC protocols, there remains an enormous slowdown with respect to computing on unencrypted data.

The fitchain team is using a combination of technologies such as homomorphic encryption, multi-party computation, crypto-economics and blockchain to provide private machine learning services.

To learn more, join their cause!


[1] Deep learning: from chemoinformatics to precision medicine Kim, IW. & Oh, J.M. Journal of Pharmaceutical Investigation (2017) 47: 317.

[2] Dermatologist-level classification of skin cancer with deep neural networks. Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., and Thrun, S. Nature 542 (2017), published online 2017/01/25.

[3] A machine learning approach to integrate big data for precision medicine in acute myeloid leukemia Su-In Lee, Safiye Celik, et al. Nature Communications Vol 9, Article number: 42 (2018)

[4] Deep learning in radiology: an overview of the concepts and a survey of the state of the art Maciej A. Mazurowski, Mateusz Buda, Ashirbani Saha, Mustafa R. Bashir

[5] Towards automatic pulmonary nodule management in lung cancer screening with deep learning Francesco Ciompi et al.

[6] Duportail, Judith. “I Asked Tinder for My Data. It Sent Me 800 Pages of My Deepest, Darkest Secrets.” The Guardian. September 26, 2017

[7] Davies, Chris. “Nest Google Privacy Row Resumes as Thermostat Hacked.” SlashGear. June 24, 2014

[8] Kohler, Carson. “We Heard Social Media Can Affect Your Credit Score. Here.” The Penny Hoarder. August 30, 2017

[9] Erlich Y. Major flaws in ’Identification of individuals by trait prediction using whole-genome’. bioRxiv. 2017;p. 185330.

[10] Lifang Zhang, Yan Zheng, and Raimo Kantola. 2016. A Review of Homomorphic Encryption and its Applications. ICST, Brussels, Belgium, 97–106.

[11] Nissim K, Steinke T, Wood A, Altman M, Bembenek A, Bun M, et al. Differential privacy: a primer for a non-technical audience; 2017

[12] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases; 2017
