Customizing a Validation Schema

Context

You can extend the ODM validation schemas for your specific needs. In this tutorial, we show you how to use the schema_additions parameter in the generate_validation_schema function to generate new validation rules.

Extending a validation schema

Setting the scenario

As an example, consider the healthReg part in the dictionary which is described as The health planning authority where the site or insititute is located. healthReg is a varchar column in the sites table. The core ODM validation schema does not have validation requirements for healthReg, other than requireing a varchar data type. However, you can add more rules, such as a list of valid health care regions in your testing program.

For example, public health agencies within Ontario may only want to allow the following health regions:

                                              Ontario Health Regions                                               
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ healthReg                                                                                                       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Ottawa                                                                                                          │
│ London                                                                                                          │
│ Hamilton                                                                                                        │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

We will extend the following validation schema:

{
'schemaVersion': '2.0.0',
'schema': {
│   │   'sites': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'siteID': {
│   │   │   │   │   │   'required': True
│   │   │   │   │   },
│   │   │   │   │   'healthReg': {
│   │   │   │   │   │   'type': 'string'
│   │   │   │   │   }
│   │   │   │   }
│   │   │   }
│   │   }
}
}

For demonstration, the schema only contains two columns from the sites table:

  1. siteID: which is the primary key for the table and is mandatory; and

  2. healthReg: which is the column whose validation rule we want to modify

As we can see, the only constraint for the healthReg column is that it must be a string. If we validate the dataset shown below,

                                                Invalid Sites Table                                                
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ healthReg                                                                                                       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Calgary                                                                                                         │
│ Edmonton                                                                                                        │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Using the validate_data function with our current validation schema, we get the errors shown below:

ValidationReport(
data_version='2.0.0',
schema_version='2.0.0',
package_version='0.5.0',
table_info={'sites': {'columns': 1, 'rows': 2}},
errors=[
│   │   {
│   │   │   'errorType': 'missing_mandatory_column',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'siteID',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID',
│   │   │   'rowNumber': 1,
│   │   │   'row': {'healthReg': 'Calgary'}
│   │   },
│   │   {
│   │   │   'errorType': 'missing_mandatory_column',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'siteID',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID',
│   │   │   'rowNumber': 2,
│   │   │   'row': {'healthReg': 'Edmonton'}
│   │   }
],
warnings=[]
)

As we can see, we’re not getting any errors for the healthReg column. The only errors are due to the missing siteID column in the data.

Using the schema_additions parameter

Let’s see how we can extend the original schema to restrict the values in the healthReg column. As was said earlier, the schema_additions parameter in the generate_validation_schema function can help with this. The parameter is a Python dictionary that has the following shape:

  1. The top level fields consists of the names of the tables whose schemas need to be extended

  2. These table names themselves should contain a Python dictionary. The fields here contain the names of the table columns that need to be extended.

  3. Finally, each column name should contain a Python dictionary with the names of the validation rules to be updated along with their values. Only certain validation rules can currently be extended. For the list of supported rules refer to the generate_validation_schema documentation.

For our current use case, we need to update the healthReg column in the sites table with the list of allowable values from the CSV file we printed above. The code for building the schema_additions parameter can be seen below:

# schema_additions should be a dictionary
schema_additions = {
    # Whose top level fields contain the names of the table to modify
    "sites": {
        # Within which we specify the names of the columns to modify that
        # belong to that table. This is also a Python dictionary.
        "healthReg": {
            # Within we specify the names of the rules to specify and their
            # values. Here we add the list of allowed of health regions from
            # the CSV file as a string.
            "allowed": list(map(lambda row: row["healthReg"], allowed_regions))
        }
    }
}

With its result printed below:

{
'sites': {
│   │   'healthReg': {
│   │   │   'allowed': [
│   │   │   │   'Ottawa',
│   │   │   │   'London',
│   │   │   │   'Hamilton'
│   │   │   ]
│   │   }
}
}

We can use it to extend our schema using the code below:

parts = import_csv_file(ASSET_DIR + "/parts.csv")
updated_validation_schema = generate_validation_schema(
    parts,
    schema_additions = schema_additions
)

With our new validation schema printed below:

{
'schemaVersion': '2.0.0',
'schema': {
│   │   'sites': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {'healthReg': {'allowed': ['Ottawa', 'London', 'Hamilton']}},
│   │   │   │   'meta': []
│   │   │   }
│   │   }
}
}

As we can see, the healthReg column now has a new rule called allowed which has our Ontario specific values.

Finally, lets validate our original dataset with our new validation schema to make sure its working correctly,

ValidationReport(
data_version='2.0.0',
schema_version='2.0.0',
package_version='0.5.0',
table_info={'sites': {'columns': 1, 'rows': 2}},
errors=[
│   │   {
│   │   │   'errorType': 'invalid_category',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'healthReg',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'invalid_category rule violated in table sites, column healthReg, row(s) 1: Invalid category "Calgary"',
│   │   │   'rowNumber': 1,
│   │   │   'row': {'healthReg': 'Calgary'},
│   │   │   'invalidValue': 'Calgary'
│   │   },
│   │   {
│   │   │   'errorType': 'invalid_category',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'healthReg',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'invalid_category rule violated in table sites, column healthReg, row(s) 2: Invalid category "Edmonton"',
│   │   │   'rowNumber': 2,
│   │   │   'row': {'healthReg': 'Edmonton'},
│   │   │   'invalidValue': 'Edmonton'
│   │   }
],
warnings=[]
)

We now see errors for our healthReg column, specifically the invalid_category entries in the printed validation repo.

Conclusion

In this tutorial we learned how to extend the validation schemas provided with the package to meet the needs of the different validation contexts within the wastewater ecosystem. Make sure to check that the rule you’re trying to modify or add is supported by the package by looking at the schema_additions parameter in the generate_validation_schema documentation.