The Good, the Bad, and the Messy: Why Healthcare AI Needs Clean Data


Let’s face it: medical records are a mess! Healthcare data comes in so many different formats that it almost looks as if Jackson Pollock splattered EMR and EHR across a canvas. Images, lab tests, doctors’ scribbled notes, diagnostic and treatment codes, drugs, dosages, and text summaries are just some of the ingredients that make machine learning on medical records a complicated process. Healthcare AI depends upon clean, organized, and well-categorized data sets.[1]

Why Clean EHR Data Matters in Healthcare AI

When data scientists talk about “cleaning” EHR, the public often misunderstands what they mean: analyzing the quality of medical records and properly preparing them for use by healthcare algorithms may be “viewed as a suspect activity, bordering on data manipulation.”[2] Nothing could be further from the truth.

The adage “garbage in, garbage out” is critical to understanding why accurate, error-free medical records are necessary to train and use healthcare AI. If lab results, drug dosages, and other EHR data are incoherent or erroneous, the output of any healthcare algorithm that ingests them may be not only suspect, but potentially fatal.[3]

EHR Can Be Messy

Medical data is complicated. It is often sparse or incomplete. Frequently, it is presented in different and inconsistent units; in some cases, computer errors and lab tests even generate false results. To deal with this, data scientists can develop code to fix errors and normalize samples. They can take into account statistical models, which cover both healthy and ailing patients, and identify out-of-range cases, known as “outliers,” to further adjust the models. Through advanced automation, machine learning-based healthcare technology can accurately identify which of the many data outliers present are real, and which are not. Even detecting EHR outliers can be improved by using machine learning.
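To make the idea of normalizing units and flagging out-of-range results concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method any particular product uses: the function names `to_mg_dl` and `flag_outliers` are hypothetical, the sample readings are invented, and a simple median-based rule stands in for the statistical and machine learning models described above. The sketch only flags suspicious values; deciding whether a flagged value is a real clinical event or an error is a separate step.

```python
from statistics import median

# Glucose conversion: 1 mmol/L is about 18.016 mg/dL (molar mass ~180.16 g/mol).
MG_DL_PER_MMOL_L_GLUCOSE = 18.016


def to_mg_dl(value, unit):
    """Convert a glucose reading to mg/dL (hypothetical helper)."""
    if unit == "mg/dL":
        return value
    if unit == "mmol/L":
        return value * MG_DL_PER_MMOL_L_GLUCOSE
    raise ValueError(f"unsupported unit: {unit}")


def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on the median absolute
    deviation) exceeds the threshold. The 3.5 cutoff is a common rule of
    thumb, assumed here for illustration."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread to compare against, so nothing is flagged.
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


# Invented example: mixed units plus one implausible entry.
readings = [(95, "mg/dL"), (5.5, "mmol/L"), (102, "mg/dL"),
            (98, "mg/dL"), (9500, "mg/dL")]
normalized = [to_mg_dl(value, unit) for value, unit in readings]
flags = flag_outliers(normalized)
```

In this example only the last reading is flagged, because it sits far outside the spread of the others once everything is expressed in the same unit.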

The process by which raw data is translated into a coherent “data painting” involves answering thousands of questions, such as:

  • How can you predict future values of parameters?
  • How do you predict past values of tests?
  • What was the max/min/average value in different time windows in the past: up to 1, 2, 5 or 10 years ago?
  • What was the rate of change in a test in the last 1, 2, and 5 years?
  • What would be a particular predicted value (e.g., LDL, triglycerides, BG) for a blood test result one (1) year ago, given that the patient didn’t do such a test then?
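A few of the questions above can be sketched in a handful of lines of Python. This is only an illustration under simplifying assumptions: a patient’s test history is represented as a list of (date, value) pairs, the function names are hypothetical, the rate of change is a plain endpoint slope, and the value for a missing past test is estimated by linear interpolation between the nearest results, where a real system would likely use richer statistical models. The LDL values are invented.

```python
from datetime import date, timedelta


def window_stats(history, as_of, years):
    """Min, max and mean of results in the `years` before `as_of`."""
    start = as_of - timedelta(days=int(365.25 * years))
    values = [v for d, v in history if start <= d <= as_of]
    if not values:
        return None
    return {"min": min(values), "max": max(values),
            "mean": sum(values) / len(values)}


def rate_of_change(history, as_of, years):
    """Change per year between the first and last result in the window."""
    start = as_of - timedelta(days=int(365.25 * years))
    points = sorted((d, v) for d, v in history if start <= d <= as_of)
    if len(points) < 2 or points[0][0] == points[-1][0]:
        return None
    (d0, v0), (d1, v1) = points[0], points[-1]
    return (v1 - v0) / ((d1 - d0).days / 365.25)


def estimate_value(history, target):
    """Linear interpolation for a date on which no test was done."""
    ordered = sorted(history)
    before = [(d, v) for d, v in ordered if d <= target]
    after = [(d, v) for d, v in ordered if d >= target]
    if not before or not after:
        return None  # no results on one side, so nothing to interpolate
    (d0, v0), (d1, v1) = before[-1], after[0]
    if d0 == d1:
        return v0
    fraction = (target - d0).days / (d1 - d0).days
    return v0 + fraction * (v1 - v0)


# Invented LDL history (mg/dL) with no test in 2016.
ldl = [(date(2015, 1, 1), 100.0), (date(2017, 1, 1), 120.0),
       (date(2018, 1, 1), 130.0)]
today = date(2018, 1, 1)
stats_2y = window_stats(ldl, today, 2)
slope_5y = rate_of_change(ldl, today, 5)
ldl_2016 = estimate_value(ldl, date(2016, 1, 1))
```

Here `stats_2y` covers only the 2017 and 2018 results, `slope_5y` is roughly 10 mg/dL per year, and `ldl_2016` comes out at about 110 mg/dL, a value the patient never actually had measured.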

The U.S. National Institutes of Health recently highlighted “the need for extensive data ‘cleaning’” in its new NIH Strategic Plan for Data Science. The agency shared its concern about “the ability of researchers to find and use biomedical research data generated by others.”

Healthcare Data Can Be Missing, Inconsistent, or Unreliable

When patients in the U.S. change jobs, they often change healthcare insurance plans. This can also mean changing doctors and medical centers, as well as the medications and medical devices covered under their new insurance company’s drug formulary list. Each change has the potential to disrupt the type and flow of a patient’s healthcare data, and access to it. How do you analyze the trajectory of patients’ risk if there are missing data points in their EHR?

This is one of the key issues that cleaning EHR data must address. Omissions and erroneous results in medical data can create gaps in EHR histories. On a practical level, they should be expected.
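Because gaps should be expected, one simple first step is to find them before any modeling begins. The sketch below is a hypothetical example, assuming a patient’s record can be reduced to a list of encounter dates; the function name `find_gaps`, the one-year threshold, and the sample dates are all illustrative choices, not a clinical standard.

```python
from datetime import date


def find_gaps(encounter_dates, max_gap_days=365):
    """Return (start, end) pairs where consecutive encounters are more
    than `max_gap_days` apart."""
    ordered = sorted(set(encounter_dates))
    return [(a, b) for a, b in zip(ordered, ordered[1:])
            if (b - a).days > max_gap_days]


# Invented history: a long silence between late 2014 and early 2017,
# as might happen after a change of insurer or provider network.
visits = [date(2014, 3, 1), date(2014, 9, 15),
          date(2017, 2, 1), date(2017, 8, 1)]
gaps = find_gaps(visits)
```

A downstream model can then treat the flagged interval as “unknown” rather than silently assuming that nothing happened during it.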

The majority of patients in America do not request digital copies of their EHR from previous doctors and health care networks when they move to another state, or begin seeing new physicians in a new provider network. Doctors they see for the first time may therefore have an incomplete picture of their health. Medications, diagnoses, tests, and medical procedures may be missing. Depending upon how their older medical records are formatted, and differences among provider IT systems, there may also be data interoperability issues.

Unlike some countries, the U.S. does not yet have a national healthcare data repository[4] capable of giving physicians quick access to a patient’s lifetime of medical data. The resulting likelihood of gaps, omissions, and errors in medical data compounds its sparseness.

The EHR Choice: Clean It or Don’t Use It

Every AI healthcare solution needs to include medical data cleaning in both its algorithm development protocol and its real-world usage. Without it, physicians and care teams will not be able to rely upon algorithms to assess patient risk, and EHR dependability will be questionable. Lives are at stake. “Garbage in, garbage out” is not an approach any medical provider should accept when it comes to using medical data.


[1] JASON report: Artificial Intelligence for Health and Health Care, JSR-17-Task-002, Dec 2017.

[2] Van den Broeck J, Argeseanu Cunningham S, Eeckels R, Herbst K (2005) Data Cleaning: Detecting, Diagnosing, and Editing Data Abnormalities. PLoS Med 2(10): e267.

[3] Maria Korolov, AI’s biggest risk factor: Data gone wrong, CIO, Feb. 13, 2018.

[4] Connecting Health and Care for the Nation: A Shared Nationwide Interoperability Roadmap, U.S. Office of the National Coordinator for Health Information Technology, Oct. 6, 2015.

