Finding the rare disease needle in the healthcare data haystack

Vera Mucaj
10 min read · Feb 28


The problem with rare disease data fragmentation, and how real world data and advanced analytics can help improve the lives of those affected by rare and ultra-rare diseases

The rare disease data problem, as imagined by Dall-E

February 28th is Rare Disease Day, a moment to educate ourselves on rare and ultra rare diseases, and reflect on the daily burdens faced by rare disease patients and their caretakers. It is also an opportunity to pause and brainstorm, as a collective healthcare ecosystem, on how we can improve outcomes for these patient populations. Here, I summarize some ideas on how healthcare data analytics can help accelerate innovation in the rare disease space.

The problem

What constitutes a “rare disease” is defined differently in different parts of the world. In the US, any disease that affects fewer than 200,000 Americans is considered rare (1). In Europe, it’s defined as a disease that affects fewer than 1 in 2,000 individuals (2). Cumulatively, ~7,000 rare diseases affect more than 300 million individuals globally, and recent research suggests the number of rare diseases could be above 10,000, making the overall burden of rare diseases very high (3).

For an individual affected by rare disease, the challenges are many. In what’s often referred to as the “Diagnostic Odyssey”, these patients are often undiagnosed or mis-diagnosed, and their journey to diagnosis is long. A longitudinal NORD survey suggests that this problem is getting worse: In 2019, 28% of survey respondents said that it took seven or more years to get to a diagnosis, compared to only 15% reporting the same in 1989. What’s troublesome is 38% of those surveyed received at least one mis-diagnosis in their journey (4). In a Rare Disease Impact Report commissioned by Shire in 2013, it is estimated that, on average, a rare disease patient requires ~7 years and ~8 different physicians to arrive at a final diagnosis (5).

Delayed diagnosis and mis-diagnosis hurt patients, especially since many of these ailments can get progressively worse with time. To compound the issue, receiving a diagnosis is often just the beginning of a long journey with no resolution in sight, as there are currently no treatments for most rare diseases (6,7). Finally, the financial burden of rare disease is high. A recent NIH NCATS study suggests that, in the US alone, the cumulative healthcare costs for ~30 million rare disease patients could be $400 billion a year, with the individual costs of patients with rare disease being 3–5 times higher than those who do not suffer from a rare disease (8).

While the causes leading to the high burden of rare disease are multifactorial, we recognize two common themes. First, the fragmentation of health data is far more pronounced in rare disease patients. When a patient has to see 8+ physicians before receiving a diagnosis, their information can easily remain disconnected and siloed for a long period of time. Second, while the cumulative number of rare diseases is high, the pool for each rare disease is small, which makes it more difficult to connect the dots. This is especially true if symptoms are either systemic and generalizable to many diseases (fatigue, dizziness, etc.) or disjointed in a way that doesn’t agree with any established differential diagnosis. Taken together, these turn the “needle in a haystack” problem of rare diseases into one of many haystacks, each with fewer, very well-hidden needles.

Dall-E did a great job hiding the needles but highlighting the silos here 👏

The opportunity

Despite the setbacks and challenges, there is hope and opportunity in advancing rare disease research and treatment. From the perspective of real-world data, here are two things the industry can and should continue to pursue:

1. Create a “better haystack”

Connecting large, representative databases can be the first step in answering serious and complex questions around disease diagnosis, progression, and treatment. Longitudinal datasets consisting of electronic health record (EHR) data, medical claims, laboratory data, and more, can help identify the set of individuals who might have similar clinical features, but for whom any given physician may have seen only a few patients in their entire career. Challenges to data linkage and sharing can be addressed through robust governance and use permissioning, consultation with ethics experts (e.g., IRBs), and ensuring that data is shared either with patient consent or in a de-identified way through privacy-enhancing technologies (PET) like privacy-preserving record linkage (PPRL). There are many PPRL/PET-powered analytic initiatives for rare diseases in the healthcare data ecosystem, and I hope to do justice to a summary of these in a future post. I am highlighting one from Real Chemistry here, wherein they established a model to help reduce time to diagnosis for an ultra-rare disease with a hereditary basis (9).
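
To make the idea behind PPRL a little more concrete, here is a minimal Python sketch, and only a sketch. It assumes two data holders share a secret key and each turn normalized identifiers into a keyed hash (a "token"), so that records can be joined on the token without exchanging names or birth dates. The function names, the choice of fields, and the use of HMAC-SHA256 are my own illustrative assumptions; this is not how any particular vendor implements PPRL, and production systems handle typos, missing fields, and key management far more carefully.

```python
import hashlib
import hmac
import unicodedata
from collections import defaultdict


def normalize(value: str) -> str:
    """Lowercase, strip accents and punctuation so formatting quirks do not change the token."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


def make_token(secret_key: bytes, first: str, last: str, dob: str) -> str:
    """Keyed hash of normalized identifiers; not reversible without the key (hypothetical scheme)."""
    message = "|".join(normalize(part) for part in (first, last, dob)).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def link_records(site_a, site_b):
    """Join two lists of (token, payload) pairs, keeping tokens seen at both sites."""
    a, b = defaultdict(list), defaultdict(list)
    for token, payload in site_a:
        a[token].append(payload)
    for token, payload in site_b:
        b[token].append(payload)
    return {token: a[token] + b[token] for token in a.keys() & b.keys()}


if __name__ == "__main__":
    key = b"shared-secret-for-illustration-only"
    ehr = [(make_token(key, "Ana María", "O'Neil", "1990-01-02"), "EHR visit")]
    claims = [(make_token(key, "ana maria", "ONEIL", "1990/01/02"), "claim line")]
    print(link_records(ehr, claims))
```

The point of the exercise: neither side ever sees the other's raw identifiers, yet the same person's records line up because the same inputs and the same key produce the same token.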

PPRL techniques are useful for creating longitudinal datasets at scale without compromising patient privacy, but by virtue of de-identification, personal identifying information (PII) cannot be restored. Thus, these data and resulting analyses cannot be easily returned to patients, or applied to initiatives like recruiting a patient for a clinical study. Patients themselves, on the other hand, have the power to directly share their medical history data for the creation of rare disease registries and other research databases. The process of locating and retrieving medical records belonging to rare disease patients mirrors the overall challenges of healthcare data fragmentation, and is compounded by other inefficiencies in the medical record retrieval process. Recently, Datavant launched our Switchboard for Record Retrieval technology offering to improve this process. A number of academic institutions, rare disease patient advocacy groups, and public-private consortia have initiated patient registries and research databases that aim to accelerate rare disease research. My colleagues have written about the opportunity for privacy-preserving tech to help create patient registries that combine the power of “patient-powered” and “site/physician-led” registries.

Making use of both PPRL and consented registry approaches gives multiple pathways for rare disease research, and both should be explored. In either case, a patient’s privacy protection should be of utmost priority. Generally, rare disease patients are amenable to data sharing, if the insights gleaned from those data are used in the service of scientific and medical research that improves their care and that of other rare disease patients (10). Regardless of a consented or de-identified approach, we should ensure information used by researchers is protected from both a privacy and security perspective, and that even in the case of consented data, select patient identifiers are scrubbed in line with limited dataset approaches.

For data that is shared in a de-identified way (in the US, following the guidelines of the HIPAA Privacy Rule), special attention must be paid to ensuring a high bar for “low risk of re-identification”. For rare disease patient populations, this could seem like an insurmountable challenge, given the patient pools could be relatively small. At Datavant, we are especially concerned with this issue, and have put together a few privacy considerations in this blog post.
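
One simple way to see why small patient pools raise re-identification concerns is to count how many records share each combination of indirect identifiers. The Python sketch below does only that; the field names and the threshold `k=11` are illustrative assumptions on my part, and a check like this is not by itself a HIPAA de-identification determination.

```python
from collections import Counter


def small_cells(records, quasi_identifiers, k=11):
    """Return combinations of quasi-identifier values shared by fewer than k records.

    records: list of dicts; quasi_identifiers: keys to group on.
    The threshold k is an illustrative assumption, not a regulatory value.
    """
    counts = Counter(tuple(record[q] for q in quasi_identifiers) for record in records)
    return {combo: n for combo, n in counts.items() if n < k}


if __name__ == "__main__":
    # Hypothetical, made-up records for illustration only.
    cohort = (
        [{"state": "VT", "age_band": "0-9"}] * 2
        + [{"state": "CA", "age_band": "0-9"}] * 12
    )
    print(small_cells(cohort, ["state", "age_band"]))
```

Groups that fall below the threshold are candidates for generalizing (wider age bands, larger regions) or suppressing before the data is shared.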

2. Find the needle(s), and augment their data

So, let’s imagine we have THE ultimate dataset: It’s comprehensive data from a variety of sources, and it’s longitudinal enough to create meaningful natural histories and not lose patient journeys due to fragmentation. That merely gets us to the starting line of meaningful rare disease research. While we have a better haystack, the needles continue to be few, and searching for them without any sort of “metal detector” can be hard. For example, most rare diseases do not have a unique ICD-10 code classification, which makes epidemiological studies challenging (11). With data at scale, analytical approaches can help identify sub-cohorts of patients for whom a direct ICD code is not available. In a recent example, Komodo Health identified a sub-cohort of 370 likely atypical hemolytic uremic syndrome (aHUS) patients, from a larger pool of 3,266 hemolytic uremic syndrome (HUS) patients, which now allows for better understanding of this patient population (12).
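
As a rough illustration of what sub-cohorting without a dedicated code can look like, the Python sketch below starts from patients who carry a broad diagnosis code and keeps those who also show a minimum number of supporting signals in their history. The codes, signal names, and threshold are hypothetical placeholders, not clinical criteria, and this is not the method used in the Komodo Health example.

```python
def likely_subcohort(patients, parent_code, supporting_signals, min_signals=2):
    """Select patients with the broad code plus at least min_signals supporting events.

    patients: dict mapping patient ID -> set of event labels (codes, labs, drugs).
    All labels and the threshold are placeholders for illustration.
    """
    selected = []
    for patient_id, events in patients.items():
        if parent_code not in events:
            continue  # not in the broad cohort at all
        if len(supporting_signals & events) >= min_signals:
            selected.append(patient_id)
    return sorted(selected)


if __name__ == "__main__":
    # Made-up de-identified patient histories.
    histories = {
        "p1": {"DX_BROAD", "SIG_A", "SIG_B"},
        "p2": {"DX_BROAD", "SIG_A"},
        "p3": {"SIG_A", "SIG_B"},
    }
    print(likely_subcohort(histories, "DX_BROAD", {"SIG_A", "SIG_B", "SIG_C"}))
```

In practice the supporting signals would come from clinical experts, and a rule like this is usually a starting point for review rather than a final cohort definition.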

There might be instances where sub-cohorting is not possible, because two unrelated events or diseases could have similar etiology, or we don’t know that one may be causative of the other. I am very hopeful for the potential of rapidly evolving Artificial Intelligence tools (such as many machine learning approaches) in helping us connect dots that would be too difficult to connect through standard scientific approaches (13). To be most effective, these AI tools need to be trained on comprehensive datasets, and I look forward to the continued improvements that are happening at the intersection of healthcare data and AI.

Even with the best dataset, not every rare disease patient has the same level of data available about them. For example, a disease that is caused by a genetic mutation could require confirmatory sequencing of that mutation. Imaging and histopathology data can help subset patient cohorts and sub-cohorts that map to different forms of a rare disease. Tissue sample biobanking can help generate these data points, and allow for the possibility of generating entirely new data points at scale in the future (e.g., data around protein expression and function, metabolic pathway activation, methylomics, and more). One example of a registry that allows for both the collection of historical data and the augmentation of future data points is the ACCELERATE registry at the University of Pennsylvania, which seeks to improve our understanding of Castleman Disease (14).

Once data are transformed into insights, the therapeutic development journey begins. Whether it’s through drug repurposing approaches, or by employing novel interventions like durable gene therapies, having the right insights to get to the right treatment is critical. One recent success story is the treatment of a UK toddler with gene therapy for metachromatic leukodystrophy (MLD) (15). The 19-month-old was able to receive this treatment because MLD is fairly well understood, and a gene therapy for MLD is already on the market (16). The convergence of disease understanding and biotechnology advancements in gene therapies made it possible for this patient to get life-saving treatment. Sadly, her older sibling’s disease has progressed to the point where treatment is not possible, leading to a heartbreaking situation for these siblings and their parents (17). This example and many others show us that accelerating rare disease understanding can have a profound impact on those affected.

A vision for the future

The best rare disease datasets are “living organisms” that continue to evolve and augment, to allow for better longitudinal data capture, and for the introduction of new data and analytic approaches, all while preserving patient privacy. With the availability of more data, better data, and the analytics tools to gather insights from those data, researchers from around the world can better understand disease etiology, mechanisms of action, and bring novel treatment modalities to market faster. Care teams can better connect the dots of disparate symptoms, and guide patients to the right expert, and hopefully, treatment, faster. Technologists can bring the best of AI and other advanced analytics techniques to turn the data into insight in ways we could not have imagined even a few years ago. I am both excited and hopeful for this future.

To learn more about Rare Disease Day, please visit

Big thank you to Karin Eisinger and Samantha Robicheau for helping improve this article.

Note: I am employed by Datavant; however, thoughts here are my own and do not necessarily represent the views of my employer.


  1. US Orphan Drug Act of 1983:
  2. European Medicines Agency orphan disease designation overview:
  3. Haendel M., Vasilevsky N., Unni D., Bologa C., Harris N., Rehm H., Hamosh A., Baynam G., Groza T., McMurry J., et al. How many rare diseases are there? Nat. Rev. Drug Discov. 2020;19:77–78. doi: 10.1038/d41573-019-00180-y.
  6. Kaufmann, P., Pariser, A.R. & Austin, C. From scientific discovery to treatments for rare diseases — the view from the National Center for Advancing Translational Sciences — Office of Rare Diseases Research. Orphanet J Rare Dis 13, 196 (2018).
  7. Tambuyzer, E. et al. Therapies for rare diseases: therapeutic modalities, progress and challenges ahead. Nat. Rev. Drug Discov. 19, 93–111 (2020).
  8. Tisdale, A., Cutillo, C.M., Nathan, R. et al. The IDeaS initiative: pilot study to assess the impact of rare diseases on patients and healthcare systems. Orphanet J Rare Dis 16, 429 (2021).
  9. Real Chemistry/IPM.AI: Shortening the Rare Disease Diagnostic Odyssey
  10. Courbier, S., Dimond, R. & Bros-Facer, V. Share and protect our health data: an evidence based approach to rare disease patients’ perspectives on data sharing and data protection — quantitative survey and recommendations. Orphanet J Rare Dis 14, 175 (2019).
  11. Institute of Medicine (US) Committee on Accelerating Rare Diseases Research and Orphan Product Development; Field MJ, Boat TF, editors. Rare Diseases and Orphan Products: Accelerating Research and Development. Washington (DC): National Academies Press (US); 2010. 2, Profile of Rare Diseases. Available from:
  12. Komodo Health: Insights Without a Code: How New Approaches Can Raise Visibility Into Ultra-Rare Diseases
  13. Decherchi S, Pedrini E, Mordenti M, Cavalli A, Sangiorgi L. Opportunities and Challenges for Machine Learning in Rare Diseases. Front Med (Lausanne). 2021 Oct 5;8:747612. doi: 10.3389/fmed.2021.747612. PMID: 34676229; PMCID: PMC8523988.
  14. Pierson SK, Khor JS, Ziglar J, Liu A, Floess K, NaPier E, Gorzewski AM, Tamakloe MA, Powers V, Akhter F, Haljasmaa E, Jayanthan R, Rubenstein A, Repasky M, Elenitoba-Johnson K, Ruth J, Jacobs B, Streetly M, Angenendt L, Patier JL, Ferrero S, Zinzani PL, Terriou L, Casper C, Jaffe E, Hoffmann C, Oksenhendler E, Fosså A, Srkalovic G, Chadburn A, Uldrick TS, Lim M, van Rhee F, Fajgenbaum DC. ACCELERATE: A Patient-Powered Natural History Study Design Enabling Clinical and Therapeutic Discoveries in a Rare Disorder. Cell Rep Med. 2020 Dec 22;1(9):100158. doi: 10.1016/j.xcrm.2020.100158. PMID: 33377129; PMCID: PMC7762771.
  15. Mahase E. Toddler becomes first child to receive gene therapy for fatal disorder on the NHS. BMJ. 2023 Feb 15;380:370. doi: 10.1136/bmj.p370. PMID: 36792125.
  16. NORD overview of MLD:
  17. BBC: UK’s most expensive drug Libmeldy saved Teddi Shaw, but is too late for her sister


