
Challenges — Data Deserves Better

The challenges we've faced have shaped our understanding of what data needs to be truly useful. By sharing lessons learned, we hope to build a collaborative foundation for better standards and stronger research.


Graphic by Delesign Graphics

  Data Harmonization  

Authored by: Rhiannon Cameron (2025-10-03 ISO 8601) Contributors: Dr. Emma Griffiths, Charlotte Barclay, Damion Dooley, Dr. William Hsiao

Data harmonization is the process of reconciling differences in how data is collected, categorized, and formatted across sources to ensure consistency and interoperability. It involves standardizing fields, terms, and structures so that datasets from different jurisdictions or systems can be accurately integrated and analyzed. Below are challenges we have experienced and aim to overcome with data harmonization best practices.

Non-Harmonized Vocabulary

A broad data harmonization challenge in which the fields, terms, formats, and data structures used in data collection differ in meaning, structure, or usage across datasets.

  •  Lesson Learned:  Using non-harmonized vocabulary across jurisdictions or systems creates barriers to data integration and slows down public health response. Standardizing terminology and formats from the outset is essential for efficient, accurate, and scalable data analysis.

  • Example: The term “fever” may be recorded as: a checkbox (Yes/No), a numeric value (e.g., 38.5°C), or a free-text entry (e.g., “high fever”). These variations make it difficult to merge and analyze data consistently across regions, even though the underlying concept is the same.

Cameron, R., et al. (2025). https://doi.org/10.1186/s13690-025-01604-5

Cameron, R., et al. (TBD). 10 Simple Rules for Improving Your Standardized Fields and Terms (in preparation).

Semantic Ambiguity / Semantic Noise

"Semantic Ambiguity" occurs when the same term is used across different data resources but has multiple possible meanings and it is unclear which one is intended.

  •  Lesson Learned:  Lack of clear labels and definitions leads to "Semantic Noise": confusion or miscommunication caused by different words meaning the same thing, or the same words carrying different meanings, when context is missing or inconsistent.

  • Example: In SARS-CoV-2 case collection forms, terms like “Isolation” are used inconsistently; “Isolation” could mean: self-isolation at home, hospital isolation, or isolation under negative pressure conditions.

Cameron, R., et al. (2025). https://doi.org/10.1186/s13690-025-01604-5 

Word Bombs

Occur when a single concept, often a proper noun, is modified by numerous descriptors, resulting in excessively long or inconsistent picklists of seemingly endless combinations.

  •  Lesson Learned:  Overly detailed or combinatorial vocabulary can overwhelm data systems and curators, reducing clarity and interoperability. To avoid word bombs, use controlled vocabularies and structured fields that separate core concepts from modifiers.

  • Example: When collecting agricultural data, a field lists options for "farm", "fish farm", "turkey farm", "chicken farm", "chicken and turkey farm", etc., when the data could instead be structured into separate fields for "facility" (e.g., farm) and "organism" (e.g., chicken, turkey, fish).
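The split described in this example can be sketched in code. This is a minimal, hypothetical illustration: the organism vocabulary and field names below are illustrative, not standardized terms.

```python
# Hypothetical sketch: decomposing "word bomb" picklist values into
# separate "facility" and "organism" fields. KNOWN_ORGANISMS is an
# illustrative stand-in for a controlled vocabulary.

KNOWN_ORGANISMS = {"fish", "turkey", "chicken"}

def decompose(picklist_value: str) -> dict:
    """Split a combinatorial label like 'chicken and turkey farm'
    into a core facility concept plus organism modifiers."""
    words = picklist_value.lower().replace(" and ", " ").split()
    organisms = [w for w in words if w in KNOWN_ORGANISMS]
    facility = " ".join(w for w in words if w not in KNOWN_ORGANISMS)
    return {"facility": facility, "organisms": organisms}

print(decompose("chicken and turkey farm"))
# -> {'facility': 'farm', 'organisms': ['chicken', 'turkey']}
```

Keeping the controlled vocabulary separate from the parsing logic means new organisms extend the set rather than multiplying picklist combinations.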

Cameron, R., et al. (TBD). 10 Simple Rules for Improving Your Standardized Fields and Terms (in preparation).

Concept Bombs

When a single field or term is overloaded with multiple modifiers or ideas, resulting in overly specific terminology that is difficult to standardize, reuse, or map across datasets.

  •  Lesson Learned:  Packing multiple concepts into one field leads to data standard bloat and reduces flexibility / limits interoperability. Separating distinct ideas into individual fields improves clarity, reusability, and consistency across datasets.

  • Example: Instead of using a single field like “Previous SARS-CoV-2 infection in the last 6 months with treatment”, break it into structured fields (note: example fields are not standardized):

      • Infection History: Yes

      • Infection Date: 2025-09-15

      • Time Since Infection: 6 months <automatically calculated from current date - infection date>

      • Treatment Received: Yes
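One way to express the structured fields above, with the time-based value derived rather than stored, is sketched below. The class and field names are illustrative assumptions, not standardized terms.

```python
# Minimal sketch of breaking a compound field into structured fields.
# "Time since infection" is derived from the stored infection date at
# the point of use, so it never goes stale.
from dataclasses import dataclass
from datetime import date

@dataclass
class InfectionRecord:
    infection_history: bool
    infection_date: date
    treatment_received: bool

    def months_since_infection(self, as_of: date) -> int:
        """Derive elapsed whole months relative to a reference date."""
        return (as_of.year - self.infection_date.year) * 12 + (
            as_of.month - self.infection_date.month
        )

rec = InfectionRecord(True, date(2025, 9, 15), True)
print(rec.months_since_infection(date(2026, 3, 15)))  # -> 6
```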

Cameron, R., et al. (2025). https://doi.org/10.1186/s13690-025-01604-5

Cameron, R., et al. (TBD). 10 Simple Rules for Improving Your Standardized Fields and Terms (in preparation).

Timeline Terms

Timeline terms are data descriptors that refer to events or conditions relative to the current time (e.g., “most recent test date”). 

  •  Lesson Learned:  These terms can become outdated or misleading as new data is collected, making them unreliable for long-term analysis unless regularly updated or clearly timestamped. Collect data in a manner where changes in time can be derived relative to other data.

  • Example: Instead of using "most recent PCR test date" use (note: example fields are not standardized):

      • Test Date: 2025-09-15

      • Test Method: PCR

      • Test Result: Negative
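The same idea can be shown in code: store timestamped records and compute "most recent" at query time instead of persisting a relative term. The record shape below is an illustrative assumption.

```python
# Sketch: rather than storing a mutable "most recent PCR test date",
# keep timestamped test records and derive recency when needed.
from datetime import date

tests = [
    {"test_date": date(2025, 6, 1), "test_method": "PCR", "test_result": "Positive"},
    {"test_date": date(2025, 9, 15), "test_method": "PCR", "test_result": "Negative"},
    {"test_date": date(2025, 8, 2), "test_method": "Antigen", "test_result": "Negative"},
]

def most_recent(records, method):
    """Return the latest record for a given test method, or None."""
    matching = [r for r in records if r["test_method"] == method]
    return max(matching, key=lambda r: r["test_date"]) if matching else None

print(most_recent(tests, "PCR")["test_date"])  # -> 2025-09-15
```

Because recency is computed on demand, adding a newer test record never invalidates previously stored values.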

Cameron, R., et al. (2025). https://doi.org/10.1186/s13690-025-01604-5

Cameron, R., et al. (TBD). 10 Simple Rules for Improving Your Standardized Fields and Terms (in preparation).

Data Categorization

Refers to inconsistencies in how data fields are grouped or labeled under broader categories/modules, which affects how data is interpreted and compared. 

  •  Lesson Learned:  Differences in categorization can lead to misinterpretation and hinder data integration. Clear, standardized categories improves accuracy and interoperability across datasets.

  • Example: SARS-CoV-2 forms differ in how they categorize data like “risk factors” and “pre-existing conditions,” which can overlap, and terms can be categorized inconsistently, such as grouping “hypotension” under either symptoms or pre-existing conditions.

Cameron, R., et al. (2025). https://doi.org/10.1186/s13690-025-01604-5 

Data Granularity

Describes the level of detail captured in a data field.

  •  Lesson Learned:  Differences in descriptors can result in inappropriate mappings, and differences in granularity can lead to loss of important information when data is consolidated to the broadest interpretation.

  • Example: “Cough” might be recorded as “Cough”, “Dry Cough”, “Productive Cough”, or “New onset/exacerbation of chronic cough”.

Cameron, R., et al. (2025). https://doi.org/10.1186/s13690-025-01604-5 

Data Type

Refers to the kind of data expected in a field.

  •  Lesson Learned:  Inconsistent data types for a common field can hinder automated processing and analysis.

  • Example: The same field (e.g., “Fever”) may be recorded as: Boolean (Yes/No), Numeric (e.g., 38.5°C), or Free text.
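Harmonizing such mixed data types typically means normalizing them to one agreed representation. The sketch below maps the three recordings of "Fever" to a single Boolean; the 38.0 °C cut-off and keyword list are illustrative assumptions, not clinical guidance.

```python
# Sketch: normalizing a "Fever" field recorded under mixed data types
# (Boolean, numeric temperature, free text) into a single Boolean.
# The threshold and keywords are assumptions for illustration only.

FEVER_THRESHOLD_C = 38.0

def normalize_fever(value):
    if isinstance(value, bool):          # Boolean (Yes/No) form
        return value
    if isinstance(value, (int, float)):  # numeric temperature form
        return value >= FEVER_THRESHOLD_C
    if isinstance(value, str):           # free-text form
        text = value.strip().lower()
        if text in {"yes", "y", "true"}:
            return True
        if text in {"no", "n", "false"}:
            return False
        if "fever" in text:              # e.g., "high fever"
            return True
    return None  # unmappable; flag for manual review

print([normalize_fever(v) for v in [True, 38.5, "Yes", "high fever", "unknown"]])
# -> [True, True, True, True, None]
```

Returning None for unmappable values keeps manual review in the loop rather than silently coercing ambiguous data.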

Cameron, R., et al. (2025). https://doi.org/10.1186/s13690-025-01604-5 

Disparate Questions

Occurs when different forms ask different questions or omit certain fields entirely.

  •  Lesson Learned:  Inconsistent or missing questions across forms limit the ability to merge datasets and conduct inclusive, large-scale analyses.

  • Example: Only some SARS-CoV-2 case collection forms asked about: Indigenous identity, pregnancy complications, or specific occupational exposures.

Cameron, R., et al. (2025). https://doi.org/10.1186/s13690-025-01604-5 

Inconsistent Indigenous Identification Fields

Refers to the lack of standardized fields for collecting Indigenous demographic data, which affects both public health equity and data governance. 

  •  Lesson Learned:  Lack of standardization limits the ability to analyze health disparities and uphold Indigenous data governance.

  • Example: Some Canadian SARS-CoV-2 case collection forms ask about: First Nations status, Indigenous heritage (First Nations, Métis, Inuit), and/or community or reserve residence, while others omit this entirely.

Cameron, R., et al. (2025). https://doi.org/10.1186/s13690-025-01604-5 

Ontologies vs. Implementations

Ontologies are structured, precise vocabularies designed to represent concepts unambiguously, grouping related concepts into polyhierarchies and relationships, making it easier for humans and machines to link information. Implementations, on the other hand, refer to how these ontologies are applied in real-world systems, where users may prefer familiar or simplified terms.

  •  Lesson Learned:  While ontologies offer clarity and precision, their adoption can be hindered if users are required to abandon familiar terminology. Bridging this gap requires thoughtful mapping between user-friendly terms and ontology labels, supported by tooling, to ensure usability without sacrificing semantic rigor.

  • Example: A user may prefer to use "LFT", whereas searching for "LFT" in the ontology returns both "Lung Function Test" and "Liver Function Test" (each listing "LFT" as an abbreviation). If users don't want to type out "Lung Function Test" for each data entry, a picklist entry of "Lung Function Test (LFT)" surfaces the selection as soon as they type "LFT"; this can be applied as an alternative/user-interface label via the schema or ontology.
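The picklist behaviour described above can be sketched as a lookup over user-interface labels. The term IDs and label structure below are illustrative assumptions, not identifiers from a real ontology.

```python
# Sketch: resolving a user-facing abbreviation to ontology terms via
# explicit alternative/UI labels. IDs and labels are illustrative.

terms = {
    "ONT:0000001": {"label": "Lung Function Test",
                    "ui_label": "Lung Function Test (LFT)"},
    "ONT:0000002": {"label": "Liver Function Test",
                    "ui_label": "Liver Function Test (LFT)"},
}

def picklist_matches(query: str):
    """Return UI labels containing the query (case-insensitive), so
    typing 'LFT' surfaces both candidates for the user to disambiguate."""
    q = query.lower()
    return sorted(t["ui_label"] for t in terms.values() if q in t["ui_label"].lower())

print(picklist_matches("LFT"))
# -> ['Liver Function Test (LFT)', 'Lung Function Test (LFT)']
```

Because the UI label carries the abbreviation while the underlying term ID stays canonical, data is stored against the unambiguous ontology term.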

Cameron, R., et al. (TBD). 10 Simple Rules for Improving Your Standardized Fields and Terms (in preparation).

Ontology Term Branch Patterns

Ontology term branch patterns refer to the hierarchical structure used in ontologies where all concepts are organized under broad categories (e.g. Core Ontology for Biology and Biomedicine (COB): material entity, process, characteristic, etc.). Because domain-specific terms are distributed across multiple thematic ontologies and branches, locating the right term for reuse can be challenging for new data standards developers.

  •  Lesson Learned:  Developers should be prepared to navigate and cross reference multiple ontologies, using tools and guidance to identify and map branches and terms effectively.

  • Example: A developer creating a standard for genomic surveillance in food production may need terms for: food production (from FoodOn), environmental context (from ENVO), host anatomy (from UBERON), genomic data descriptors (from GenEpiO or OBI), etc.

Cameron, R., et al. (TBD). 10 Simple Rules for Improving Your Standardized Fields and Terms (in preparation).

Entity vs. Entity Usage

This concept highlights the difference between what something is designed or classified to be (its entity) and how it is actually used in practice (its usage).

  •  Lesson Learned:  In data standard development, failing to distinguish between an entity’s intended role and its real-world application can lead to inaccurate or overly narrow vocabulary that limits reuse and interoperability. Overlooking this distinction can result in rigid or misleading terms that don’t reflect the full range of real-world scenarios, reducing the utility and adaptability of the standard.

  • Example: A curator labeled a field as “wastewater purpose of sequencing”, even though the standard term “purpose of sequencing” already sufficiently described the data. The added specificity was unnecessary and created a term that would not match identical data labeled more generally in other datasets.

Cameron, R., et al. (TBD). 10 Simple Rules for Improving Your Standardized Fields and Terms (in preparation).

Entities vs. Information “About” Entities

In ontology-based data standards, it's important to distinguish between entities—the actual things or processes being described (e.g., a host organism, a protein)—and information content entities, which are data or descriptors about those entities (e.g., host age, host health state, or detected mutations).

  •  Lesson Learned:  Conflating entities with information about them can lead to misclassification, semantic errors, and broken relationships in data systems that rely on logical relationships.

  • Example: "DNA" extracted from an organism is a material entity, in that it is physically genomic material contained within a physical sample, but "sequence data" is information about that piece of "DNA".

Cameron, R., et al. (TBD). 10 Simple Rules for Improving Your Standardized Fields and Terms (in preparation).

  Quality Management  

Authored by: Rhiannon Cameron (2025-10-03 ISO 8601) Contributors: Dr. Emma Griffiths, Charlotte Barclay

Quality management ensures that the harmonized data meets predefined standards and is fit for its intended use.

Managing the quality of pathogen genomic data is a complex challenge that requires more than just accurate sequencing—it demands robust specifications, well-structured ontologies, and harmonization tools that work together to ensure consistency and interoperability.

 Lessons Learned:  

    • Applying quality management to data harmonization ensures that integrated data from diverse sources is accurate, consistent, complete, and reliable. This improves decision-making, enhances system interoperability, reduces errors, and builds trust in data-driven processes.

    • Specifications must be designed to address real-world data harmonization challenges, not simply tailored to fit the limitations of a single tool. When tools dictate the structure of the specification, it can lead to rigid systems that fail to support broader data integration needs. Instead, tools should be built to support interoperable schemas, specification maintenance, as well as easy data entry and curation.

Example Considerations:

  • Accuracy: Is the data correct and free from errors?

  • Completeness: Are all required data fields present?

  • Consistency: Is the data uniform across sources?

  • Timeliness: Is the data up-to-date?

  • Validity: Does the data conform to defined formats and rules?

  • Uniqueness: Are there duplicate records/identifiers?

Example Activities:

  • Needs assessment / gap analysis of current data usage and quality and what needs to be done to reach downstream goals.

  • Standardizing data by converting it into a common format and structure.

  • Validation rules to ensure data integrity by supporting standard implementation as well as error detection and correction.

  • Versioning to manage and track changes to schemas, tools, and other resources over time.
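The validation activity above can be sketched as a small rule set covering completeness, validity, and uniqueness from the example considerations. The required fields and date rule are illustrative assumptions.

```python
# Sketch: simple validation rules for harmonized records, covering
# completeness, validity, and uniqueness. Field names and the ISO 8601
# date rule are illustrative, not a published specification.
import re

REQUIRED = ["sample_id", "collection_date"]
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # ISO 8601 calendar date

def validate(records):
    errors, seen_ids = [], set()
    for i, rec in enumerate(records):
        for field in REQUIRED:  # completeness
            if not rec.get(field):
                errors.append(f"row {i}: missing {field}")
        if rec.get("collection_date") and not DATE_RE.match(rec["collection_date"]):
            errors.append(f"row {i}: invalid date format")  # validity
        if rec.get("sample_id") in seen_ids:
            errors.append(f"row {i}: duplicate sample_id")  # uniqueness
        seen_ids.add(rec.get("sample_id"))
    return errors

print(validate([
    {"sample_id": "S1", "collection_date": "2025-09-15"},
    {"sample_id": "S1", "collection_date": "15/09/2025"},
]))
# -> ['row 1: invalid date format', 'row 1: duplicate sample_id']
```

Rules like these live alongside the specification, so error detection supports the standard rather than the other way around.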

  Implicit vs Explicit  

Authored by: Dr. Emma Griffiths (2025-10-07 ISO 8601) Contributors: Rhiannon Cameron

The challenge of automating the development of logical axioms for data interoperability

Different labs with similar projects and surveillance programs often use what appear on face value to be the same variables. However, those variables often have different definitions and/or implicit interpretation criteria, which may arise from different objectives and methods. Understanding these differences and making them explicit and comparable in a machine-readable format is critical for integrating data and the results of analyses across organizations and sectors. However, formalizing concepts and variables, known as axiomatization, is currently a highly manual process relying on the expertise of individual ontologists. Axiomatizing large lexicons and corpora of data to enable more complex querying and analysis can be a Herculean effort. Tools to help automate these processes would greatly reduce the burden of this manual workload and increase adoption of ontology-based standards and data management systems, the benefits of which are described elsewhere.

Examples

  1. Allergy skin test results

Skin prick and scratch tests involve applying small amounts of allergens to the skin and looking for a local reaction such as redness and swelling. To determine whether a test is positive, usually the swelling must be above a threshold diameter. Laboratories often set their own thresholds - for instance, Lab A may define a positive result as >= 2mm while Lab B uses >= 5mm. Results, however, are often recorded simply as “positive” without explicitly providing the interpretation criteria. Data shared across labs should therefore not be directly integrated without recalibrating results according to the same interpretation criteria. If both labs report a result as simply “positive”, without specifying the threshold, a 3mm reaction would be considered positive by Lab A but negative by Lab B, leading to misleading comparisons and potentially flawed conclusions when aggregating or analyzing data across labs.

 Lessons Learned:  By sharing interpretation criteria, labs can recalibrate data for accurate integration. 
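The recalibration idea can be sketched in code: if labs share the raw wheal diameter and their thresholds, results can be reinterpreted under any lab's criteria. The lab names and thresholds mirror the hypothetical example above.

```python
# Sketch: recalibrating skin test results across labs by sharing the
# raw measurement and each lab's threshold, rather than only the
# interpreted "positive"/"negative" label. Values are hypothetical.

THRESHOLDS_MM = {"Lab A": 2.0, "Lab B": 5.0}

def interpret(diameter_mm: float, lab: str) -> str:
    """Interpret a raw wheal diameter under a given lab's criteria."""
    return "positive" if diameter_mm >= THRESHOLDS_MM[lab] else "negative"

# The same 3 mm reaction reads differently under each lab's criteria:
print(interpret(3.0, "Lab A"))  # -> positive
print(interpret(3.0, "Lab B"))  # -> negative
```

Storing the measurement plus the criteria makes the implicit interpretation explicit and machine-recomputable.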


  2. Food surveillance

"Neapolitan pizza", a protected Specialità Tradizionale Garantita (STG; Traditional Speciality Guaranteed) product from Naples, Italy, is defined by its soft, thin dough with a raised, airy edge (cornicione), slow fermentation, and toppings of authentic Italian ingredients like San Marzano tomatoes, fresh mozzarella, basil, and extra virgin olive oil, cooked for about 90 seconds in a very hot wood-fired oven. This traditional style is recognized by UNESCO as an intangible cultural heritage. This is meaningfully different from how the “Neapolitan pizza” label is used in North America, where it can refer to frozen pizzas with thick, seasoned crusts and non-fresh ingredients. In summary, the same label refers to a rigorously defined traditional product in Italy but is used loosely in North America to describe products with very different characteristics.

 Lessons Learned:  Terminology used in food surveillance must be clearly defined, standardized, and accessible across regions. Without clarifying such conflicting definitions, data comparisons and regulatory decisions risk being inaccurate or misleading. 


  3. Ready-to-eat

Like many label claims, “Ready-to-eat” means different things in different parts of the world.

  • In Italy, "Ready-to-eat" food refers to meals or ingredients that can be consumed immediately without further cooking or processing, though some items may require minimal heating or washing. The definition focuses on convenience; safety is implied by the requirement for “minimal heating or washing” but is not explicit and could be misinterpreted.

  • According to the U.S. Food and Drug Administration (FDA), “Ready-to-eat” food is safe to eat without additional steps to achieve food safety, such as cooking or significant reprocessing. The emphasis is on safety, not just convenience, and the definition doesn’t accommodate foods that may require some additional processing.

  • The Canadian Food Inspection Agency (CFIA) and Health Canada define “Ready-to-eat (RTE)” foods as those that can be consumed without further preparation beyond simple washing, rinsing, thawing, or warming, such as reheating to a safe temperature, and are intended for consumption in their purchased condition. The definition focuses on convenience, but also includes vocabulary that explicitly conveys safety expectations.

 Lessons Learned:  Without harmonizing or explicitly documenting these differences, data integration and regulatory alignment across borders can lead to misinterpretation in food safety assessments. Comparing differences in vocabularies can also help highlight where improvements can be made to make the implicit explicit.

  Metadata Extraction  

Authored by: Rhiannon Cameron (2025-10-03 ISO 8601) Contributors: Tian Rabbani, Dr. Emma Griffiths, Charlotte Barclay, Dr. William Hsiao

Metadata extraction is the process of identifying, retrieving, and structuring contextual metadata—descriptive information about data—from various sources.

In practice, it helps with data objectives, context, and traceability. This information can then be used to enable data integration, harmonization, governance, compliance, and discovery. It can be performed manually or automatically to extract relevant information from files and databases.

When extracting pathogen-genomic metadata across diverse fields (T. Rabbani, et al.), manual review was essential. Since our goal was exploratory, identifying how data is described across varied research questions, we lacked a comprehensive predefined vocabulary, making human judgment critical to ensure accuracy and avoid misinterpretation or fabricated matches. Automated approaches, including those using large language models (LLMs), were deemed insufficient because they couldn't reliably interpret context, match non-standard vocabulary, extract from non-machine readable formats, or detect nuanced inconsistencies. 

Rabbani, T, et al. (TBD). Context Matters: A Scoping Review of Metadata Reported in Genomic Epidemiology Studies of Infectious Diseases (in preparation)

 Lessons Learned:  

  • Using an LLM would still require manual review to check findings; so, this labour cost must be considered when defining the scope of the project.

  • It’s essential to establish a shared glossary for tracking concepts and terminology, create a space for communicating and resolving misunderstandings, and implement a quality management process to integrate clarifications into the protocol.

  • Engaging a diverse group of data extractors fosters valuable discussions around what qualifies as a data descriptor and how to define its boundaries. Assumptions that might otherwise go unnoticed or seem self-evident are surfaced, articulated, and critically examined by experienced extractors, leading to more robust and transparent metadata practices. For example, are "isolate database accession ID" and "sample database accession ID" synonymous? While there was much evidence in manuscripts/databases of people treating them as the same, discussion concluded "no", as the differentiation has been important for structuring data specifications to appropriately relate isolate- to sample-level data.

  • There are many data items during data collection that could be easily inferred from other values. For example, when provided the sample “collection device” as “nasopharyngeal swab”, one could infer the “collection site host” was the “nasopharynx”. We recommend deciding the degree of inferences that are permitted during data extraction before extraction begins.

  • Avoid unintentional data transformations by carefully verifying contextual details and only asserting information you can confirm from the source. For instance, assuming a sample labeled as collected in “Nashville” refers to Tennessee could be misleading, as another Nashville exists in Indiana; the data collector should not declare a broader location unless they can confirm this information with the data source. Similarly, when extracting location data from a hospital in a non-English-speaking country, a curator unfamiliar with the native language might rely on incomplete or less accurate regional information, leading to potential misclassification.
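The permitted-inference recommendation above can be made concrete as a pre-agreed lookup table that extractors apply consistently. The device-to-site entries below are illustrative assumptions, not a standardized mapping.

```python
# Sketch: a pre-agreed inference table mapping "collection device" to
# an inferable "collection site (host)". Which inferences are permitted
# should be decided before extraction begins; entries are illustrative.

DEVICE_TO_SITE = {
    "nasopharyngeal swab": "nasopharynx",
    "oropharyngeal swab": "oropharynx",
}

def infer_collection_site(device: str):
    """Return the inferred anatomical site, or None when no agreed
    inference rule exists (i.e., do not guess)."""
    return DEVICE_TO_SITE.get(device.strip().lower())

print(infer_collection_site("Nasopharyngeal swab"))  # -> nasopharynx
```

Returning None for unlisted devices enforces the rule that extractors assert only what the agreed protocol permits.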

Cameron, R., et al. (2024). http://dx.doi.org/10.17504/protocols.io.eq2lywrzevx9/v1 

  Vocabulary Matching (LLM)  

Authored by: Rhiannon Cameron (YYYY-MM-DD ISO 8601) Contributors: Ivan Gill, Dr. Emma Griffiths, Charlotte Barclay, Dr. William Hsiao

UNDER CONSTRUCTION

Template by Kirksville Web Design 2024    |    Artwork by Rhiannon Cameron unless otherwise declared.
