generate_validation_schema

This document provides the specifications for programmatically generating a validation schema for PHES-ODM data. The program is exposed as a function called generate_validation_schema within the PHES-ODM-Validaion Python package.

Context

The PHES-ODM data structure is complex. Trying to manually create the validation schema for the validate_data function can be an error prone and laborious process, especially if the user tries to encode all the validation rules. However, programmatic generation of a validation schema is possible due to the efforts of the PHES-ODM team in creating a machine-readable data dictionary for the data.

In brief, the data dictionary is a CSV/Excel file that encodes all the pieces of the data, how they’re related to each other, as well as validation metadata. Using this data dictionary file, the generate_validation_schema function can automatically generate a validation schema for different ODM data versions.

Features

At its core, the generate_validation_schema function creates a Python dictionary that is actionable by the generate_validate_data function. The details regarding the dictionary creation and the fields that go into it can be seen in the validation-rules folder. This document will go over the function features that are agnostic to each validation rule.

Warnings

During the process of generating a validation schema there may be non-fatal issues that come up when parsing a row. A non-fatal issue is one which should not stop the function execution. Instead, whenever such an issue is encountered the row should be skipped and a warning presented to the user. The goal of these warnings is to not only inform the user of issue but to also provide them with enough information to fix the issue. The function return will consist of the list of warnings encountered, with each warning encoded as an object. The remaining sections in this group will go over each warning, providing examples that generate it.

missing_version1_fields

This warning is reported when generating the schema for a version 1 dataset. The data dictionary contains columns that backports version 2 parts to their version 1 equivalent, this process is described elsewhere. When this backport fails during the generation process, for example due to an invalid version1Location value, a warning should be reported.

For example, the following parts snippet should generate this warning for the aDate row since its missing a value for the version1Table column.

                                                Invalid Parts Table                                                
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ partID      partType      version1Location      version1Table     version1Variable     version1Category    ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ measures   │ table        │ tables               │ WWMeasure        │ NA                  │ NA                  │
│ aDate      │ attribute    │ variables            │                  │ analysisDate        │ NA                  │
└────────────┴──────────────┴──────────────────────┴──────────────────┴─────────────────────┴─────────────────────┘

This warning should be generated in the following cases:

  1. When parsing a version 1 table, it should be generated when:

    1. The version1Location value is not tables, for example,

    ┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
    ┃ partID       partType     version1Location      version1Table     version1Variable     version1Category    ┃
    ┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
    │ measures    │ table       │                      │ WWMeasure        │ NA                  │ NA                  │
    └─────────────┴─────────────┴──────────────────────┴──────────────────┴─────────────────────┴─────────────────────┘
    
    1. The version1Table value is missing, for example,

    ┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
    ┃ partID       partType     version1Location      version1Table     version1Variable     version1Category    ┃
    ┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
    │ measures    │ table       │ tables               │                  │ NA                  │ NA                  │
    └─────────────┴─────────────┴──────────────────────┴──────────────────┴─────────────────────┴─────────────────────┘
    
  2. When parsing a version 1 column, it should be generated when:

    1. The version1Location value is not variables, for example,

    ┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
    ┃ partID     partType    measures   version1Location    version1Table   version1Variable   version1Category ┃
    ┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
    │ measures  │ table      │ NA        │ tables             │ WWMeasure      │ NA                │ NA               │
    │ aDate     │ attribute  │ header    │                    │ WWMeasure      │ analysisDate      │ NA               │
    └───────────┴────────────┴───────────┴────────────────────┴────────────────┴───────────────────┴──────────────────┘
    
    1. The version1Table value is missing, for example,

    ┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
    ┃ partID     partType    measures   version1Location    version1Table   version1Variable   version1Category ┃
    ┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
    │ measures  │ table      │ NA        │ tables             │ WWMeasure      │ NA                │ NA               │
    │ aDate     │ attribute  │ header    │ variables          │                │ analysisDate      │ NA               │
    └───────────┴────────────┴───────────┴────────────────────┴────────────────┴───────────────────┴──────────────────┘
    
    1. The version1Variable value is missing, for example,

    ┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
    ┃ partID     partType    measures   version1Location    version1Table   version1Variable   version1Category ┃
    ┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
    │ measures  │ table      │ NA        │ tables             │ WWMeasure      │ NA                │ NA               │
    │ aDate     │ attribute  │ header    │ variables          │ WWMeasure      │                   │ NA               │
    └───────────┴────────────┴───────────┴────────────────────┴────────────────┴───────────────────┴──────────────────┘
    
  3. When parsing a version 1 category, it should be generated when:

    1. The version1Location value is not categories, for example,

    ┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
    ┃ partID    partType   samples  dataType     catSetID    version1Lo…  version1T…  version1Va…  version1C… ┃
    ┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
    │ samples  │ table     │ NA      │ NA          │ NA         │ tables      │ WWSample   │ NA          │ NA         │
    │ collType │ attribute │ header  │ categorical │ collectCa… │ variables   │ WWSample   │ collection  │ NA         │
    │ comp3    │ category  │ NA      │ varchar     │ collectCa… │             │ WWSample   │ collection  │ grbCp3     │
    └──────────┴───────────┴─────────┴─────────────┴────────────┴─────────────┴────────────┴─────────────┴────────────┘
    
    1. The version1Table value is missing, for example,

    ┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
    ┃ partID    partType   samples  dataType     catSetID    version1Lo…  version1T…  version1Va…  version1C… ┃
    ┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
    │ samples  │ table     │ NA      │ NA          │ NA         │ tables      │ WWSample   │ NA          │ NA         │
    │ collType │ attribute │ header  │ categorical │ collectCa… │ variables   │ WWSample   │ collection  │ NA         │
    │ comp3    │ category  │ NA      │ varchar     │ collectCa… │ categories  │            │ collection  │ grbCp3     │
    └──────────┴───────────┴─────────┴─────────────┴────────────┴─────────────┴────────────┴─────────────┴────────────┘
    
    1. The version1Variable value is missing, for example,

    ┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━┓
    ┃ partID    partType   samples  dataType    catSetID    version1L…  version1T…  version1V…  version1C…   ┃
    ┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━┩
    │ samples  │ table     │ NA      │ NA         │ NA         │ tables     │ WWSample   │ NA         │ NA         │  │
    │ collType │ attribute │ header  │ categoric… │ collectCa… │ variables  │ WWSample   │ collection │ NA         │  │
    │ comp3    │ category  │ NA      │ varchar    │ collectCa… │ categories │ WWSample   │            │ grbCp3     │  │
    └──────────┴───────────┴─────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴──┘
    
    1. The version1Category value is missing, for example,

    ┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
    ┃ partID    partType   samples  dataType     catSetID    version1Lo…  version1T…  version1Va…  version1C… ┃
    ┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
    │ samples  │ table     │ NA      │ NA          │ NA         │ tables      │ WWSample   │ NA          │ NA         │
    │ collType │ attribute │ header  │ categorical │ collectCa… │ variables   │ WWSample   │ collection  │ NA         │
    │ comp3    │ category  │ NA      │ varchar     │ collectCa… │ categories  │ WWSample   │ collection  │            │
    └──────────┴───────────┴─────────┴─────────────┴────────────┴─────────────┴────────────┴─────────────┴────────────┘
    

The returned warning object should consist of the following fields:

  • warningType: The warning type. Should be set to missing_version1_fields.

  • row: An object containing the row that generated the warning. The row object should only include the partID, version1Location, version1Table, version1Variable, and version1Category columns.

  • rowNumber: The row number in the dictionary that generated the warning

  • message: A string with a human readable version of the warning. The message value depends on the column that generated the warning.

    • If the warning was generated due to the version1Location column then the message should be, ‘Skipping row <row_index> when generating a version 1 schema. Invalid value for version 1 column version1Location. Allowed values are “tables”, “variables”, or “categories”’

    • If the warning was generated due to the version1Table, version1Variable, or version1Category columns, then the message should be, “Skipping row <row_index> when generating a version 1 schema. Version 1 value not found for column <invalid_column_name>”

For example, the warning object for each of the parts snippets shown above are:

{
'warnings': [
│   │   {
│   │   │   'warningType': 'missing_version1_fields',
│   │   │   'row': {
│   │   │   │   'partID': 'aDate',
│   │   │   │   'partType': 'attribute',
│   │   │   │   'version1Location': 'variables',
│   │   │   │   'version1Table': '',
│   │   │   │   'version1Variable': 'analysisDate',
│   │   │   │   'version1Category': 'NA'
│   │   │   },
│   │   │   'rowNumber': 2,
│   │   │   'message': 'Skipping row 2 when generating a version 1 schema. Version 1 value not found for column version1Table'
│   │   }
]
}
{
'warnings': [
│   │   {
│   │   │   'warningType': 'missing_version1_fields',
│   │   │   'row': {
│   │   │   │   'partID': 'measures',
│   │   │   │   'version1Location': '',
│   │   │   │   'version1Table': 'WWMeasure',
│   │   │   │   'version1Variable': 'NA',
│   │   │   │   'version1Category': 'NA'
│   │   │   },
│   │   │   'rowNumber': 1,
│   │   │   'message': 'Skipping row 1 when generating a version 1 schema. Invalid value for version 1 column version1Location. Allowed values are "tables", "variables", or "categories"'
│   │   }
]
}
{
'warnings': [
│   │   {
│   │   │   'warningType': 'missing_version1_fields',
│   │   │   'row': {
│   │   │   │   'partID': 'measures',
│   │   │   │   'version1Location': 'tables',
│   │   │   │   'version1Table': '',
│   │   │   │   'version1Variable': 'NA',
│   │   │   │   'version1Category': 'NA'
│   │   │   },
│   │   │   'rowNumber': 1,
│   │   │   'message': 'Skipping row 1 when generating a version 1 schema. Version 1 value not found for column version1Table'
│   │   }
]
}
{
'warnings': [
│   │   {
│   │   │   'warningType': 'missing_version1_fields',
│   │   │   'row': {
│   │   │   │   'partID': 'aDate',
│   │   │   │   'version1Location': '',
│   │   │   │   'version1Table': 'WWMeasure',
│   │   │   │   'version1Variable': 'analysisDate',
│   │   │   │   'version1Category': 'NA'
│   │   │   },
│   │   │   'rowNumber': 2,
│   │   │   'message': 'Skipping row 2 when generating a version 1 schema. Invalid value for version 1 column version1Location. Allowed values are "tables", "variables", or "categories"'
│   │   }
]
}
{
'warnings': [
│   │   {
│   │   │   'warningType': 'missing_version1_fields',
│   │   │   'row': {
│   │   │   │   'partID': 'aDate',
│   │   │   │   'version1Location': 'variables',
│   │   │   │   'version1Table': '',
│   │   │   │   'version1Variable': 'analysisDate',
│   │   │   │   'version1Category': 'NA'
│   │   │   },
│   │   │   'rowNumber': 2,
│   │   │   'message': 'Skipping row 2 when generating a version 1 schema. Version 1 value not found for column version1Table'
│   │   }
]
}
{
'warnings': [
│   │   {
│   │   │   'warningType': 'missing_version1_fields',
│   │   │   'row': {
│   │   │   │   'partID': 'aDate',
│   │   │   │   'version1Location': 'variables',
│   │   │   │   'version1Table': 'WWMeasure',
│   │   │   │   'version1Variable': '',
│   │   │   │   'version1Category': 'NA'
│   │   │   },
│   │   │   'rowNumber': 2,
│   │   │   'message': 'Skipping row 2 when generating a version 1 schema. Version 1 value not found for column version1Variable'
│   │   }
]
}
{
'warnings': [
│   │   {
│   │   │   'warningType': 'missing_version1_fields',
│   │   │   'row': {
│   │   │   │   'partID': 'comp3',
│   │   │   │   'version1Location': '',
│   │   │   │   'version1Table': 'WWSample',
│   │   │   │   'version1Variable': 'collection',
│   │   │   │   'version1Category': 'grbCp3'
│   │   │   },
│   │   │   'rowNumber': 3,
│   │   │   'message': 'Skipping row 3 when generating a version 1 schema. Invalid value for version 1 column version1Location. Allowed values are "tables", "variables", or "categories"'
│   │   }
]
}
{
'warnings': [
│   │   {
│   │   │   'warningType': 'missing_version1_fields',
│   │   │   'row': {
│   │   │   │   'partID': 'comp3',
│   │   │   │   'version1Location': 'categories',
│   │   │   │   'version1Table': '',
│   │   │   │   'version1Variable': 'collection',
│   │   │   │   'version1Category': 'grbCp3'
│   │   │   },
│   │   │   'rowNumber': 3,
│   │   │   'message': 'Skipping row 3 when generating a version 1 schema. Version 1 value not found for column version1Table'
│   │   }
]
}
{
'warnings': [
│   │   {
│   │   │   'warningType': 'missing_version1_fields',
│   │   │   'row': {
│   │   │   │   'partID': 'comp3',
│   │   │   │   'version1Location': 'categories',
│   │   │   │   'version1Table': 'WWSample',
│   │   │   │   'version1Variable': '',
│   │   │   │   'version1Category': 'grbCp3'
│   │   │   },
│   │   │   'rowNumber': 3,
│   │   │   'message': 'Skipping row 3 when generating a version 1 schema. Version 1 value not found for column version1Variable'
│   │   }
]
}
{
'warnings': [
│   │   {
│   │   │   'warningType': 'missing_version1_fields',
│   │   │   'row': {
│   │   │   │   'partID': 'comp3',
│   │   │   │   'version1Location': 'categories',
│   │   │   │   'version1Table': 'WWSample',
│   │   │   │   'version1Variable': 'WWSample',
│   │   │   │   'version1Category': ''
│   │   │   },
│   │   │   'rowNumber': 3,
│   │   │   'message': 'Skipping row 3 when generating a version 1 schema. Version 1 value not found for column version1Category'
│   │   }
]
}

Finally, the following parts snippet should not have any warnings when generating a version 1 schema.

┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ partID    partType   samples  dataType     catSetID    version1Lo…  version1T…  version1Va…  version1C… ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ samples  │ table     │ NA      │ NA          │ NA         │ tables      │ WWSample   │ NA          │ NA         │
│ collType │ attribute │ header  │ categorical │ collectCa… │ variables   │ WWSample   │ collection  │ NA         │
│ comp3    │ category  │ NA      │ varchar     │ collectCa… │ categories  │ WWSample   │ collection  │ gbCp3      │
└──────────┴───────────┴─────────┴─────────────┴────────────┴─────────────┴────────────┴─────────────┴────────────┘