generate_validation_schema¶
This document provides the specifications for programmatically
generating a validation schema for PHES-ODM data. The program is exposed
as a function called generate_validation_schema within the
PHES-ODM-Validation Python package.
Context¶
The PHES-ODM data structure is complex. Manually creating the validation schema for the validate_data function can be an error-prone and laborious process, especially if the user tries to encode all the validation rules. However, programmatic generation of a validation schema is possible thanks to the efforts of the PHES-ODM team in creating a machine-readable data dictionary for the data.
In brief, the data dictionary is a CSV/Excel file that encodes all the
pieces of the data, how they’re related to each other, as well as
validation metadata. Using this data dictionary file, the
generate_validation_schema function can automatically generate a
validation schema for different ODM data versions.
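To make the idea of a machine-readable dictionary concrete, the following is a minimal sketch of reading a parts sheet exported as CSV into one Python dict per row, using only the standard library. The CSV excerpt and the read_parts helper are hypothetical and are not part of the package.

```python
import csv
import io

# Hypothetical excerpt of a parts sheet exported as CSV.
PARTS_CSV = """\
partID,partType,version1Location,version1Table,version1Variable,version1Category
measures,table,tables,WWMeasure,NA,NA
aDate,attribute,variables,,analysisDate,NA
"""


def read_parts(csv_text: str) -> list[dict]:
    """Parse CSV text into one dict per row, keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


parts = read_parts(PARTS_CSV)
# An empty cell is read as an empty string.
print(parts[1]["partID"], repr(parts[1]["version1Table"]))  # aDate ''
```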
Features¶
At its core, the generate_validation_schema function creates a Python
dictionary that is actionable by the validate_data function.
The details regarding the dictionary creation and the fields that go
into it can be seen in the validation-rules folder. This document will
go over the function features that are agnostic to each validation rule.
Warnings¶
During the process of generating a validation schema, non-fatal issues may come up when parsing a row. A non-fatal issue is one that should not stop the function's execution. Instead, whenever such an issue is encountered, the row should be skipped and a warning presented to the user. The goal of these warnings is not only to inform the user of the issue but also to provide them with enough information to fix it. The function's return value will contain the list of warnings encountered, with each warning encoded as an object. The remaining sections in this group go over each warning, providing examples that generate it.
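The skip-and-warn behaviour described here can be sketched as a simple loop. This is only an illustration under stated assumptions, not the package's implementation: collect_warnings and the check_row callback are hypothetical names, and row numbers are taken to be 1-based, as in the examples later in this document.

```python
from typing import Callable, Optional


def collect_warnings(
    rows: list[dict],
    check_row: Callable[[dict, int], Optional[dict]],
) -> tuple[list[dict], list[dict]]:
    """Split rows into usable rows and warnings (hypothetical helper).

    check_row returns a warning object for a row with a non-fatal
    issue, or None when the row is fine.
    """
    kept, warnings = [], []
    # Assumption: row numbers are 1-based.
    for row_number, row in enumerate(rows, start=1):
        warning = check_row(row, row_number)
        if warning is not None:
            warnings.append(warning)  # report the issue
            continue  # skip the row without stopping execution
        kept.append(row)
    return kept, warnings
```

Returning the warnings alongside the usable rows keeps the function from raising on bad rows, which matches the non-fatal intent described above.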
missing_version1_fields¶
This warning is reported when generating the schema for a version 1
dataset. The data dictionary contains columns that backport version 2
parts to their version 1 equivalents; this process is described
elsewhere.
When this backport fails during the generation process, for example due
to an invalid version1Location value, a warning should be reported.
For example, the following parts snippet should generate this warning
for the aDate row since it's missing a value for the version1Table
column.
Invalid Parts Table
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ partID   ┃ partType  ┃ version1Location ┃ version1Table ┃ version1Variable ┃ version1Category ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ measures │ table     │ tables           │ WWMeasure     │ NA               │ NA               │
│ aDate    │ attribute │ variables        │               │ analysisDate     │ NA               │
└──────────┴───────────┴──────────────────┴───────────────┴──────────────────┴──────────────────┘
This warning should be generated in the following cases:
When parsing a version 1 table, it should be generated when:
The version1Location value is not tables, for example,
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ partID   ┃ partType  ┃ version1Location ┃ version1Table ┃ version1Variable ┃ version1Category ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ measures │ table     │                  │ WWMeasure     │ NA               │ NA               │
└──────────┴───────────┴──────────────────┴───────────────┴──────────────────┴──────────────────┘
The version1Table value is missing, for example,
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ partID   ┃ partType  ┃ version1Location ┃ version1Table ┃ version1Variable ┃ version1Category ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ measures │ table     │ tables           │               │ NA               │ NA               │
└──────────┴───────────┴──────────────────┴───────────────┴──────────────────┴──────────────────┘
When parsing a version 1 column, it should be generated when:
The version1Location value is not variables, for example,
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ partID   ┃ partType  ┃ measures ┃ version1Location ┃ version1Table ┃ version1Variable ┃ version1Category ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ measures │ table     │ NA       │ tables           │ WWMeasure     │ NA               │ NA               │
│ aDate    │ attribute │ header   │                  │ WWMeasure     │ analysisDate     │ NA               │
└──────────┴───────────┴──────────┴──────────────────┴───────────────┴──────────────────┴──────────────────┘
The version1Table value is missing, for example,
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ partID   ┃ partType  ┃ measures ┃ version1Location ┃ version1Table ┃ version1Variable ┃ version1Category ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ measures │ table     │ NA       │ tables           │ WWMeasure     │ NA               │ NA               │
│ aDate    │ attribute │ header   │ variables        │               │ analysisDate     │ NA               │
└──────────┴───────────┴──────────┴──────────────────┴───────────────┴──────────────────┴──────────────────┘
The version1Variable value is missing, for example,
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ partID   ┃ partType  ┃ measures ┃ version1Location ┃ version1Table ┃ version1Variable ┃ version1Category ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ measures │ table     │ NA       │ tables           │ WWMeasure     │ NA               │ NA               │
│ aDate    │ attribute │ header   │ variables        │ WWMeasure     │                  │ NA               │
└──────────┴───────────┴──────────┴──────────────────┴───────────────┴──────────────────┴──────────────────┘
When parsing a version 1 category, it should be generated when:
The version1Location value is not categories, for example,
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ partID   ┃ partType  ┃ samples ┃ dataType    ┃ catSetID   ┃ version1Lo… ┃ version1T… ┃ version1Va… ┃ version1C… ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ samples  │ table     │ NA      │ NA          │ NA         │ tables      │ WWSample   │ NA          │ NA         │
│ collType │ attribute │ header  │ categorical │ collectCa… │ variables   │ WWSample   │ collection  │ NA         │
│ comp3    │ category  │ NA      │ varchar     │ collectCa… │             │ WWSample   │ collection  │ grbCp3     │
└──────────┴───────────┴─────────┴─────────────┴────────────┴─────────────┴────────────┴─────────────┴────────────┘
The version1Table value is missing, for example,
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ partID   ┃ partType  ┃ samples ┃ dataType    ┃ catSetID   ┃ version1Lo… ┃ version1T… ┃ version1Va… ┃ version1C… ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ samples  │ table     │ NA      │ NA          │ NA         │ tables      │ WWSample   │ NA          │ NA         │
│ collType │ attribute │ header  │ categorical │ collectCa… │ variables   │ WWSample   │ collection  │ NA         │
│ comp3    │ category  │ NA      │ varchar     │ collectCa… │ categories  │            │ collection  │ grbCp3     │
└──────────┴───────────┴─────────┴─────────────┴────────────┴─────────────┴────────────┴─────────────┴────────────┘
The version1Variable value is missing, for example,
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ partID   ┃ partType  ┃ samples ┃ dataType   ┃ catSetID   ┃ version1L… ┃ version1T… ┃ version1V… ┃ version1C… ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ samples  │ table     │ NA      │ NA         │ NA         │ tables     │ WWSample   │ NA         │ NA         │
│ collType │ attribute │ header  │ categoric… │ collectCa… │ variables  │ WWSample   │ collection │ NA         │
│ comp3    │ category  │ NA      │ varchar    │ collectCa… │ categories │ WWSample   │            │ grbCp3     │
└──────────┴───────────┴─────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
The version1Category value is missing, for example,
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ partID   ┃ partType  ┃ samples ┃ dataType    ┃ catSetID   ┃ version1Lo… ┃ version1T… ┃ version1Va… ┃ version1C… ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ samples  │ table     │ NA      │ NA          │ NA         │ tables      │ WWSample   │ NA          │ NA         │
│ collType │ attribute │ header  │ categorical │ collectCa… │ variables   │ WWSample   │ collection  │ NA         │
│ comp3    │ category  │ NA      │ varchar     │ collectCa… │ categories  │ WWSample   │ collection  │            │
└──────────┴───────────┴─────────┴─────────────┴────────────┴─────────────┴────────────┴─────────────┴────────────┘
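The cases listed above can be summarised as a small lookup, sketched below. This is an illustration under assumptions rather than the package's code: the function name find_invalid_version1_column is hypothetical, the mapping from partType to the expected version1Location value is inferred from the example tables (with attribute rows treated as columns), and a value is considered missing only when it is empty or whitespace.

```python
from typing import Optional

# Assumed mapping from partType to the expected version1Location value and
# the version 1 columns that must be filled in, inferred from the cases above.
REQUIRED_V1_FIELDS = {
    "table": ("tables", ["version1Table"]),
    "attribute": ("variables", ["version1Table", "version1Variable"]),
    "category": (
        "categories",
        ["version1Table", "version1Variable", "version1Category"],
    ),
}


def find_invalid_version1_column(row: dict) -> Optional[str]:
    """Return the first offending column name, or None if the row is fine."""
    rule = REQUIRED_V1_FIELDS.get(row.get("partType"))
    if rule is None:
        return None  # part types outside this sketch are not checked
    expected_location, required_columns = rule
    if row.get("version1Location") != expected_location:
        return "version1Location"
    for column in required_columns:
        # Assumption: a missing value is an empty or whitespace-only string.
        if not str(row.get(column) or "").strip():
            return column
    return None
```

Checking version1Location first mirrors the order of the cases above, so a row with several problems is reported against version1Location.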
The returned warning object should consist of the following fields:
warningType: The warning type. Should be set to missing_version1_fields.
row: An object containing the row that generated the warning. The row object should only include the partID, version1Location, version1Table, version1Variable, and version1Category columns.
rowNumber: The row number in the dictionary that generated the warning.
message: A string with a human-readable version of the warning. The message value depends on the column that generated the warning.
If the warning was generated due to the version1Location column, then the message should be, ‘Skipping row <row_index> when generating a version 1 schema. Invalid value for version 1 column version1Location. Allowed values are “tables”, “variables”, or “categories”’
If the warning was generated due to the version1Table, version1Variable, or version1Category columns, then the message should be, “Skipping row <row_index> when generating a version 1 schema. Version 1 value not found for column <invalid_column_name>”
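A minimal sketch of assembling a warning object from these fields is shown below. The function name build_missing_version1_warning and its parameters are hypothetical; the message templates follow the two formats given above.

```python
# Columns the row object is restricted to, as listed above.
V1_ROW_COLUMNS = [
    "partID",
    "version1Location",
    "version1Table",
    "version1Variable",
    "version1Category",
]


def build_missing_version1_warning(
    row: dict, row_number: int, invalid_column: str
) -> dict:
    """Build a missing_version1_fields warning object (hypothetical helper)."""
    prefix = f"Skipping row {row_number} when generating a version 1 schema. "
    if invalid_column == "version1Location":
        message = prefix + (
            "Invalid value for version 1 column version1Location. "
            'Allowed values are "tables", "variables", or "categories"'
        )
    else:
        message = prefix + (
            f"Version 1 value not found for column {invalid_column}"
        )
    return {
        "warningType": "missing_version1_fields",
        # Keep only the permitted columns; absent ones default to ''.
        "row": {column: row.get(column, "") for column in V1_ROW_COLUMNS},
        "rowNumber": row_number,
        "message": message,
    }
```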
For example, the warning objects for each of the parts snippets shown above are:
{
    'warnings': [
        {
            'warningType': 'missing_version1_fields',
            'row': {
                'partID': 'aDate',
                'version1Location': 'variables',
                'version1Table': '',
                'version1Variable': 'analysisDate',
                'version1Category': 'NA'
            },
            'rowNumber': 2,
            'message': 'Skipping row 2 when generating a version 1 schema. Version 1 value not found for column version1Table'
        }
    ]
}
{
    'warnings': [
        {
            'warningType': 'missing_version1_fields',
            'row': {
                'partID': 'measures',
                'version1Location': '',
                'version1Table': 'WWMeasure',
                'version1Variable': 'NA',
                'version1Category': 'NA'
            },
            'rowNumber': 1,
            'message': 'Skipping row 1 when generating a version 1 schema. Invalid value for version 1 column version1Location. Allowed values are "tables", "variables", or "categories"'
        }
    ]
}
{
    'warnings': [
        {
            'warningType': 'missing_version1_fields',
            'row': {
                'partID': 'measures',
                'version1Location': 'tables',
                'version1Table': '',
                'version1Variable': 'NA',
                'version1Category': 'NA'
            },
            'rowNumber': 1,
            'message': 'Skipping row 1 when generating a version 1 schema. Version 1 value not found for column version1Table'
        }
    ]
}
{
    'warnings': [
        {
            'warningType': 'missing_version1_fields',
            'row': {
                'partID': 'aDate',
                'version1Location': '',
                'version1Table': 'WWMeasure',
                'version1Variable': 'analysisDate',
                'version1Category': 'NA'
            },
            'rowNumber': 2,
            'message': 'Skipping row 2 when generating a version 1 schema. Invalid value for version 1 column version1Location. Allowed values are "tables", "variables", or "categories"'
        }
    ]
}
{
    'warnings': [
        {
            'warningType': 'missing_version1_fields',
            'row': {
                'partID': 'aDate',
                'version1Location': 'variables',
                'version1Table': '',
                'version1Variable': 'analysisDate',
                'version1Category': 'NA'
            },
            'rowNumber': 2,
            'message': 'Skipping row 2 when generating a version 1 schema. Version 1 value not found for column version1Table'
        }
    ]
}
{
    'warnings': [
        {
            'warningType': 'missing_version1_fields',
            'row': {
                'partID': 'aDate',
                'version1Location': 'variables',
                'version1Table': 'WWMeasure',
                'version1Variable': '',
                'version1Category': 'NA'
            },
            'rowNumber': 2,
            'message': 'Skipping row 2 when generating a version 1 schema. Version 1 value not found for column version1Variable'
        }
    ]
}
{
    'warnings': [
        {
            'warningType': 'missing_version1_fields',
            'row': {
                'partID': 'comp3',
                'version1Location': '',
                'version1Table': 'WWSample',
                'version1Variable': 'collection',
                'version1Category': 'grbCp3'
            },
            'rowNumber': 3,
            'message': 'Skipping row 3 when generating a version 1 schema. Invalid value for version 1 column version1Location. Allowed values are "tables", "variables", or "categories"'
        }
    ]
}
{
    'warnings': [
        {
            'warningType': 'missing_version1_fields',
            'row': {
                'partID': 'comp3',
                'version1Location': 'categories',
                'version1Table': '',
                'version1Variable': 'collection',
                'version1Category': 'grbCp3'
            },
            'rowNumber': 3,
            'message': 'Skipping row 3 when generating a version 1 schema. Version 1 value not found for column version1Table'
        }
    ]
}
{
    'warnings': [
        {
            'warningType': 'missing_version1_fields',
            'row': {
                'partID': 'comp3',
                'version1Location': 'categories',
                'version1Table': 'WWSample',
                'version1Variable': '',
                'version1Category': 'grbCp3'
            },
            'rowNumber': 3,
            'message': 'Skipping row 3 when generating a version 1 schema. Version 1 value not found for column version1Variable'
        }
    ]
}
{
    'warnings': [
        {
            'warningType': 'missing_version1_fields',
            'row': {
                'partID': 'comp3',
                'version1Location': 'categories',
                'version1Table': 'WWSample',
                'version1Variable': 'collection',
                'version1Category': ''
            },
            'rowNumber': 3,
            'message': 'Skipping row 3 when generating a version 1 schema. Version 1 value not found for column version1Category'
        }
    ]
}
Finally, the following parts snippet should not generate any warnings when generating a version 1 schema.
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ partID   ┃ partType  ┃ samples ┃ dataType    ┃ catSetID   ┃ version1Lo… ┃ version1T… ┃ version1Va… ┃ version1C… ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ samples  │ table     │ NA      │ NA          │ NA         │ tables      │ WWSample   │ NA          │ NA         │
│ collType │ attribute │ header  │ categorical │ collectCa… │ variables   │ WWSample   │ collection  │ NA         │
│ comp3    │ category  │ NA      │ varchar     │ collectCa… │ categories  │ WWSample   │ collection  │ gbCp3      │
└──────────┴───────────┴─────────┴─────────────┴────────────┴─────────────┴────────────┴─────────────┴────────────┘