Customizing a Validation Schema¶
Context¶
You can extend the ODM validation
schemas for your specific needs.
In this tutorial, we show you how to use the schema_additions
parameter in the
generate_validation_schema
function to generate new validation rules.
Extending a validation schema¶
Setting the scenario¶
As an example, consider the healthReg
part in the dictionary which is
described as The health planning authority where the site or insititute
is located. healthReg
is a varchar
column in the sites
table. The
core ODM validation schema does not have validation requirements for
healthReg
, other than requireing a varchar
data type. However, you
can add more rules, such as a list of valid health care regions in your
testing program.
For example, public health agencies within Ontario may only want to allow the following health regions:
Ontario Health Regions ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ healthReg ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ Ottawa │ │ London │ │ Hamilton │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
We will extend the following validation schema:
{ │ 'schemaVersion': '2.0.0', │ 'schema': { │ │ 'sites': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': { │ │ │ │ │ 'siteID': { │ │ │ │ │ │ 'required': True │ │ │ │ │ }, │ │ │ │ │ 'healthReg': { │ │ │ │ │ │ 'type': 'string' │ │ │ │ │ } │ │ │ │ } │ │ │ } │ │ } │ } }
For demonstration, the schema only contains two columns from the sites
table:
siteID
: which is the primary key for the table and is mandatory; andhealthReg
: which is the column whose validation rule we want to modify
As we can see, the only constraint for the healthReg
column is that it
must be a string. If we validate the dataset shown below,
Invalid Sites Table ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ healthReg ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ Calgary │ │ Edmonton │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Using the validate_data
function with our current validation schema,
we get the errors shown below:
ValidationReport( │ data_version='2.0.0', │ schema_version='2.0.0', │ package_version='0.5.0', │ table_info={'sites': {'columns': 1, 'rows': 2}}, │ errors=[ │ │ { │ │ │ 'errorType': 'missing_mandatory_column', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'siteID', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID', │ │ │ 'rowNumber': 1, │ │ │ 'row': {'healthReg': 'Calgary'} │ │ }, │ │ { │ │ │ 'errorType': 'missing_mandatory_column', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'siteID', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID', │ │ │ 'rowNumber': 2, │ │ │ 'row': {'healthReg': 'Edmonton'} │ │ } │ ], │ warnings=[] )
As we can see, we’re not getting any errors for the healthReg
column.
The only errors are due to the missing siteID
column in the data.
Using the schema_additions
parameter¶
Let’s see how we can extend the original schema to restrict the values
in the healthReg
column. As was said earlier, the schema_additions
parameter in the generate_validation_schema
function can help with
this. The parameter is a Python dictionary that has the following shape:
The top level fields consists of the names of the tables whose schemas need to be extended
These table names themselves should contain a Python dictionary. The fields here contain the names of the table columns that need to be extended.
Finally, each column name should contain a Python dictionary with the names of the validation rules to be updated along with their values. Only certain validation rules can currently be extended. For the list of supported rules refer to the
generate_validation_schema
documentation.
For our current use case, we need to update the healthReg
column in
the sites
table with the list of allowable values from the CSV file we
printed above. The code for building the schema_additions
parameter
can be seen below:
# schema_additions should be a dictionary
schema_additions = {
# Whose top level fields contain the names of the table to modify
"sites": {
# Within which we specify the names of the columns to modify that
# belong to that table. This is also a Python dictionary.
"healthReg": {
# Within we specify the names of the rules to specify and their
# values. Here we add the list of allowed of health regions from
# the CSV file as a string.
"allowed": list(map(lambda row: row["healthReg"], allowed_regions))
}
}
}
With its result printed below:
{ │ 'sites': { │ │ 'healthReg': { │ │ │ 'allowed': [ │ │ │ │ 'Ottawa', │ │ │ │ 'London', │ │ │ │ 'Hamilton' │ │ │ ] │ │ } │ } }
We can use it to extend our schema using the code below:
parts = import_csv_file(ASSET_DIR + "/parts.csv")
updated_validation_schema = generate_validation_schema(
parts,
schema_additions = schema_additions
)
With our new validation schema printed below:
{ │ 'schemaVersion': '2.0.0', │ 'schema': { │ │ 'sites': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': {'healthReg': {'allowed': ['Ottawa', 'London', 'Hamilton']}}, │ │ │ │ 'meta': [] │ │ │ } │ │ } │ } }
As we can see, the healthReg
column now has a new rule called
allowed
which has our Ontario specific values.
Finally, lets validate our original dataset with our new validation schema to make sure its working correctly,
ValidationReport( │ data_version='2.0.0', │ schema_version='2.0.0', │ package_version='0.5.0', │ table_info={'sites': {'columns': 1, 'rows': 2}}, │ errors=[ │ │ { │ │ │ 'errorType': 'invalid_category', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'healthReg', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'invalid_category rule violated in table sites, column healthReg, row(s) 1: Invalid category "Calgary"', │ │ │ 'rowNumber': 1, │ │ │ 'row': {'healthReg': 'Calgary'}, │ │ │ 'invalidValue': 'Calgary' │ │ }, │ │ { │ │ │ 'errorType': 'invalid_category', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'healthReg', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'invalid_category rule violated in table sites, column healthReg, row(s) 2: Invalid category "Edmonton"', │ │ │ 'rowNumber': 2, │ │ │ 'row': {'healthReg': 'Edmonton'}, │ │ │ 'invalidValue': 'Edmonton' │ │ } │ ], │ warnings=[] )
We now see errors for our healthReg
column, specifically the
invalid_category
entries in the printed validation repo.
Conclusion¶
In this tutorial we learned how to extend the validation schemas
provided with the package to meet the needs of the different validation
contexts within the wastewater ecosystem. Make sure to check that the
rule you’re trying to modify or add is supported by the package by
looking at the schema_additions
parameter in the
generate_validation_schema
documentation.