# Customizing a Validation Schema ## Context You can extend the [ODM validation schemas](../../../assets/validation-schemas/) for your specific needs. In this tutorial, we show you how to use the `schema_additions` parameter in the [generate_validation_schema](../../specs/generate-validation-schema-function.md) function to generate new validation rules. ## Extending a validation schema ### Setting the scenario As an example, consider the `healthReg` part in the dictionary which is described as *The health planning authority where the site or insititute is located*. `healthReg` is a `varchar` column in the `sites` table. The core ODM validation schema does not have validation requirements for `healthReg`, other than requireing a `varchar` data type. However, you can add more rules, such as a list of valid health care regions in your testing program. For example, public health agencies within Ontario may only want to allow the following health regions:
Ontario Health Regions ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ healthReg ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ Ottawa │ │ London │ │ Hamilton │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘We will extend the following validation schema:
{ │ 'schemaVersion': '2.0.0', │ 'schema': { │ │ 'sites': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': { │ │ │ │ │ 'siteID': { │ │ │ │ │ │ 'required': True │ │ │ │ │ }, │ │ │ │ │ 'healthReg': { │ │ │ │ │ │ 'type': 'string' │ │ │ │ │ } │ │ │ │ } │ │ │ } │ │ } │ } }For demonstration, the schema only contains two columns from the `sites` table: 1. `siteID`: which is the primary key for the table and is mandatory; and 2. `healthReg`: which is the column whose validation rule we want to modify As we can see, the only constraint for the `healthReg` column is that it must be a string. If we validate the dataset shown below,
Invalid Sites Table ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ healthReg ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ Calgary │ │ Edmonton │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘Using the `validate_data` function with our current validation schema, we get the errors shown below:
ValidationReport( │ data_version='2.2.3', │ schema_version='2.0.0', │ package_version='1.0.0b1', │ table_info={'sites': {'columns': 1, 'rows': 2}}, │ errors=[ │ │ { │ │ │ 'errorType': 'missing_mandatory_column', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'siteID', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID', │ │ │ 'rowNumber': 1, │ │ │ 'row': {'healthReg': 'Calgary'} │ │ }, │ │ { │ │ │ 'errorType': 'missing_mandatory_column', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'siteID', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID', │ │ │ 'rowNumber': 2, │ │ │ 'row': {'healthReg': 'Edmonton'} │ │ } │ ], │ warnings=[] )As we can see, we’re not getting any errors for the `healthReg` column. The only errors are due to the missing `siteID` column in the data. ### Using the `schema_additions` parameter Let’s see how we can extend the original schema to restrict the values in the `healthReg` column. As was said earlier, the `schema_additions` parameter in the `generate_validation_schema` function can help with this. The parameter is a Python dictionary that has the following shape: 1. The top level fields consists of the names of the tables whose schemas need to be extended 2. These table names themselves should contain a Python dictionary. The fields here contain the names of the table columns that need to be extended. 3. Finally, each column name should contain a Python dictionary with the names of the validation rules to be updated along with their values. Only certain validation rules can currently be extended. For the list of supported rules refer to the `generate_validation_schema` documentation. For our current use case, we need to update the `healthReg` column in the `sites` table with the list of allowable values from the CSV file we printed above. The code for building the `schema_additions` parameter can be seen below: ``` python # schema_additions should be a dictionary schema_additions = { # Whose top level fields contain the names of the table to modify "sites": { # Within which we specify the names of the columns to modify that # belong to that table. This is also a Python dictionary. "healthReg": { # Within we specify the names of the rules to specify and their # values. Here we add the list of allowed of health regions from # the CSV file as a string. "allowed": list(map(lambda row: row["healthReg"], allowed_regions)) } } } ``` With its result printed below:
{ │ 'sites': { │ │ 'healthReg': { │ │ │ 'allowed': [ │ │ │ │ 'Ottawa', │ │ │ │ 'London', │ │ │ │ 'Hamilton' │ │ │ ] │ │ } │ } }We can use it to extend our schema using the code below: ``` python parts = import_csv_file(ASSET_DIR + "/parts.csv") updated_validation_schema = generate_validation_schema( parts, schema_additions = schema_additions ) ``` With our new validation schema printed below:
{ │ 'schemaVersion': '2.2.3', │ 'schema': { │ │ 'sites': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': {'healthReg': {'allowed': ['Ottawa', 'London', 'Hamilton']}}, │ │ │ │ 'meta': [] │ │ │ } │ │ } │ } }As we can see, the `healthReg` column now has a new rule called `allowed` which has our Ontario specific values. Finally, lets validate our original dataset with our new validation schema to make sure its working correctly,
ValidationReport( │ data_version='2.2.3', │ schema_version='2.2.3', │ package_version='1.0.0b1', │ table_info={'sites': {'columns': 1, 'rows': 2}}, │ errors=[ │ │ { │ │ │ 'errorType': 'invalid_category', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'healthReg', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'invalid_category rule violated in table sites, column healthReg, row(s) 1: Invalid category "Calgary"', │ │ │ 'rowNumber': 1, │ │ │ 'row': {'healthReg': 'Calgary'}, │ │ │ 'invalidValue': 'Calgary' │ │ }, │ │ { │ │ │ 'errorType': 'invalid_category', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'healthReg', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'invalid_category rule violated in table sites, column healthReg, row(s) 2: Invalid category "Edmonton"', │ │ │ 'rowNumber': 2, │ │ │ 'row': {'healthReg': 'Edmonton'}, │ │ │ 'invalidValue': 'Edmonton' │ │ } │ ], │ warnings=[] )We now see errors for our `healthReg` column, specifically the `invalid_category` entries in the printed validation repo. ## Conclusion In this tutorial we learned how to extend the validation schemas provided with the package to meet the needs of the different validation contexts within the wastewater ecosystem. Make sure to check that the rule you’re trying to modify or add is supported by the package by looking at the `schema_additions` parameter in the `generate_validation_schema` documentation.