Getting to know the validation library

This document describes and demonstrates the main functions of the validation library. It is a Quarto notebook that you can execute yourself.

The library has two main functions:

  1. validate_data: to validate an ODM dataset using a validation schema.

  2. generate_validation_schema: to generate a validation schema from the ODM dictionary.

Setup

We’ll walk through how to use these two functions, but first we will install the library dependencies by running the following command in the terminal. Make sure you are in the root of the library directory.

pip install -r requirements.txt

Next, the notebook uses the rich library to print and display tables. rich can be installed by running the following command in the terminal.

python -m pip install rich

Next, import the two main library functions.

from odm_validation.validation import validate_data, generate_validation_schema
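The notebook also relies on three small helper functions (import_schema, import_dataset, and pprintDictList) whose definitions are not shown in this excerpt. The following is a minimal sketch of what they might look like, assuming PyYAML for the schema files and rich for table display; the notebook's actual implementations may differ.

```python
import csv

def import_schema(path):
    """Read a YAML validation schema file into a Python dict."""
    # PyYAML imported lazily so the other helpers work without it installed
    import yaml
    with open(path) as f:
        return yaml.safe_load(f)

def import_dataset(path):
    """Read a CSV file into a list of row dictionaries."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def pprintDictList(rows, title):
    """Display a list of row dictionaries as a rich table."""
    from rich.console import Console
    from rich.table import Table
    table = Table(title=title)
    for column in rows[0]:
        table.add_column(column)
    for row in rows:
        table.add_row(*(str(value) for value in row.values()))
    Console().print(table)
```

The pprint calls used throughout the notebook are presumably rich.pretty.pprint.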

Example: validating a sites table with missing data

Let’s start by validating an ODM dataset. We will use just the sites table that “contains information about a site; the location where an environmental sample was taken.” The table has a number of columns, but we will validate just the geoLat and geoLong columns. These columns are mandatory in the sites table.

Validating an ODM dataset requires a validation schema, which contains the rules to validate against. For this demo we created a YAML file with the validation rules for the sites table. The next code chunk imports the validation schema.

validation_schema = import_schema("./validation-schemas/sites-schema.yml")

pprint(validation_schema, expand_all=True)
{
'schemaVersion': '2.0.0',
'schema': {
│   │   'sites': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'geoLat': {
│   │   │   │   │   │   'required': True
│   │   │   │   │   },
│   │   │   │   │   'geoLong': {
│   │   │   │   │   │   'required': True
│   │   │   │   │   }
│   │   │   │   }
│   │   │   }
│   │   }
}
}

The validation schema has two fields:

  • schemaVersion: the ODM version that the validation schema targets; and,

  • schema: which has the validation rules

The structure of the validation rules follows a Python library called cerberus, which does all the validation heavy lifting. The PHES-ODM validation package integrates the ODM schema into cerberus and then uses cerberus methods to validate ODM data.

The dataset in the next code chunk is a very simple sites table imported from a CSV file. The dataset is invalid because it is missing the mandatory geoLat column. You can see in the schema above that geoLat is mandatory from the entry 'geoLat': {'required': True}.

invalid_odm_dataset = {
    "sites": import_dataset("./datasets/invalid-sites-dataset.csv")
}
pprintDictList(invalid_odm_dataset["sites"], "Invalid ODM Dataset")
                                                Invalid ODM Dataset                                                
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ geoLong                                                                                                         ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                                                                                               │
│ 2                                                                                                               │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Consequently, we get errors when we try to validate the dataset with our constructed validation schema.

validate_data function

The validate_data function is used to validate ODM data. The function requires two pieces of information: a validation schema and an ODM dataset to validate.

The function returns a validation report for the data. The following code chunk validates our invalid ODM dataset and prints the report.

validation_result = validate_data(
    validation_schema,
    invalid_odm_dataset
)

pprint(validation_result)
ValidationReport(
data_version='2.0.0',
schema_version='2.0.0',
package_version='0.5.0',
table_info={'sites': {'columns': 1, 'rows': 2}},
errors=[
│   │   {
│   │   │   'errorType': 'missing_mandatory_column',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'missing_mandatory_column rule violated in table sites, column geoLat: Missing mandatory column geoLat',
│   │   │   'rowNumber': 1,
│   │   │   'row': {'geoLong': '1'}
│   │   },
│   │   {
│   │   │   'errorType': 'missing_mandatory_column',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'missing_mandatory_column rule violated in table sites, column geoLat: Missing mandatory column geoLat',
│   │   │   'rowNumber': 2,
│   │   │   'row': {'geoLong': '2'}
│   │   }
],
warnings=[]
)

Understanding error messages

Take a look at the errors field in the report: it lists every error found in the dataset. Here we have two errors, one per row, each saying that the mandatory geoLat column is missing.

The error report provides metadata to trace which data row had the invalid data. The message field is a human-readable description of the error. Most of the fields are self-explanatory, but if further clarification is needed, the errorType field can be used to dig deeper by locating the specification file for the validation rule. For example, the missing_mandatory_column specification can be found in the repo.
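Because each error is a plain dictionary, the report is easy to post-process. As a small sketch, here is how the errors from the report above could be tallied by errorType:

```python
from collections import Counter

# Error dictionaries as they appear in the report's `errors` field above
errors = [
    {"errorType": "missing_mandatory_column", "tableName": "sites",
     "columnName": "geoLat", "rowNumber": 1},
    {"errorType": "missing_mandatory_column", "tableName": "sites",
     "columnName": "geoLat", "rowNumber": 2},
]

counts = Counter(error["errorType"] for error in errors)
print(counts)  # Counter({'missing_mandatory_column': 2})
```

The same pattern works for grouping by tableName or columnName when triaging a large report.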

The validation report also includes versioning metadata to help debug any errors in the ODM schema: the data_version, schema_version, and package_version fields.

Example: validating a valid data table

For completeness let’s validate a valid dataset in the next two code chunks.

# Import a valid dataset
valid_odm_data = {
    "sites": import_dataset("./datasets/valid-sites-dataset.csv")
}

pprintDictList(valid_odm_data["sites"], "Valid Sites Table")
                                                 Valid Sites Table                                                 
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ geoLat                                               geoLong                                                   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                                   │ 1                                                         │
│ 2                                                   │ 2                                                         │
└─────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘
# Validate the valid dataset
pprint(
    validate_data(validation_schema, valid_odm_data)
)
ValidationReport(
data_version='2.0.0',
schema_version='2.0.0',
package_version='0.5.0',
table_info={'sites': {'columns': 2, 'rows': 2}},
errors=[],
warnings=[]
)

As we can see, the error list is empty.
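Since the report exposes its errors as a list (as in the ValidationReport printed above), a small guard helper is easy to write. The following helper is illustrative only, not part of the library:

```python
def assert_valid(report):
    """Raise if a validation report contains any errors.

    `report` is expected to have an `errors` attribute holding a list of
    error dictionaries, like the ValidationReport printed above. This
    helper is a sketch, not part of odm_validation.
    """
    if report.errors:
        messages = "\n".join(error["message"] for error in report.errors)
        raise ValueError(f"{len(report.errors)} validation error(s):\n{messages}")
```

For example, assert_valid(validate_data(validation_schema, valid_odm_data)) would pass silently for the valid dataset but raise for the invalid one.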

generate_validation_schema function

Generally, you don’t need to create a validation schema because there are default ODM validation schemas for all ODM versions. However, you can create your own schema with the generate_validation_schema function. This function generates a validation schema from the ODM dictionary. The next code chunk imports an excerpt of the parts sheet from the dictionary.

parts_sheet = import_dataset("./dictionary/parts-v2.csv")

pprintDictList(parts_sheet, 'Version 2 Parts Sheet')
                                               Version 2 Parts Sheet                                               
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ partID       partType       sites      sitesRequired       firstReleased       lastUpdated       status   ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ sites       │ table         │ NA        │ NA                 │ 1.0.0              │ 2.0.0            │ active   │
│ geoLat      │ attribute     │ header    │ mandatory          │ 1.0.0              │ 2.0.0            │ active   │
│ geoLong     │ attribute     │ header    │ mandatory          │ 1.0.0              │ 2.0.0            │ active   │
└─────────────┴───────────────┴───────────┴────────────────────┴────────────────────┴──────────────────┴──────────┘

The above parts sheet describes:

  • A table called sites; and,

  • two mandatory columns in the sites table called geoLat and geoLong.

We can use the generate_validation_schema function to generate a validation schema from the above parts sheet. The function takes two arguments:

  • the parts sheet as a list of Python dictionaries;

  • the ODM dictionary version we want for our generated schema.
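The first argument is just a list of row dictionaries, so the parts sheet excerpt shown above could equally be written inline, as in this sketch (columns limited to those shown in the table):

```python
# Inline version of the parts sheet rows displayed above
parts_sheet = [
    {"partID": "sites", "partType": "table", "sites": "NA",
     "sitesRequired": "NA", "firstReleased": "1.0.0",
     "lastUpdated": "2.0.0", "status": "active"},
    {"partID": "geoLat", "partType": "attribute", "sites": "header",
     "sitesRequired": "mandatory", "firstReleased": "1.0.0",
     "lastUpdated": "2.0.0", "status": "active"},
    {"partID": "geoLong", "partType": "attribute", "sites": "header",
     "sitesRequired": "mandatory", "firstReleased": "1.0.0",
     "lastUpdated": "2.0.0", "status": "active"},
]
```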

The following code chunk generates a validation schema from the above parts sheet for version 2.0.0 datasets.

validation_schema_2 = generate_validation_schema(parts_sheet,
                                                 schema_version="2.0.0")

pprint(validation_schema_2)
{'schemaVersion': '2.0.0', 'schema': {}}

As we can see, the generated validation schema is identical to the one we created manually, with one main difference: it includes a meta field used to trace each validation rule back to the row(s) in the parts sheet that generated it.

Validating our invalid dataset using the new validation schema returns the same error report as above, except that each error now has a populated validationRuleFields field, which is simply a copy of the meta field. We can see this by running the next code chunk.

pprint(validate_data(validation_schema_2, invalid_odm_dataset))
ValidationReport(
data_version='2.0.0',
schema_version='2.0.0',
package_version='0.5.0',
table_info={'sites': {'columns': 1, 'rows': 2}},
errors=[],
warnings=[]
)

Validating different ODM versions

We can also generate validation schemas for version 1.0.0 of the ODM dictionary. The next two code chunks import a demo version 1 parts sheet and create a validation schema.

parts_sheet_v1 = import_dataset("./dictionary/parts-v1.csv")

pprintDictList(parts_sheet_v1, "Version 1 Parts Sheet")
                                               Version 1 Parts Sheet                                               
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┓
┃ partID   partType   sites   sitesReq…  version1…  version1…  version1V…  firstRel…  lastUpdat…  status ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━┩
│ sites   │ table     │ NA     │ NA        │ tables    │ Site      │ NA         │ 1.0.0     │ 2.0.0      │ active │
│ geoLat  │ attribute │ header │ mandatory │ variables │ Site      │ Latitude   │ 1.0.0     │ 2.0.0      │ active │
│ geoLong │ attribute │ header │ mandatory │ variables │ Site      │ Longitude  │ 1.0.0     │ 2.0.0      │ active │
└─────────┴───────────┴────────┴───────────┴───────────┴───────────┴────────────┴───────────┴────────────┴────────┘
validation_schema_2_v1 = generate_validation_schema(parts_sheet_v1,
                                                    schema_version="1.0.0")

pprint(validation_schema_2_v1)
{'schemaVersion': '1.0.0', 'schema': {}}

The printed validation schema is once again identical to the previous one, except that the table and column names have been replaced with their version 1 equivalents.

Finally, the next two code chunks import an invalid version 1 ODM dataset and validate it using our version 1 validation schema.

version_one_invalid_odm_data = {
    "Site": import_dataset("./datasets/invalid-sites-dataset-v1.csv")
}

pprintDictList(version_one_invalid_odm_data["Site"], "Version 1 Sites Table")
                                               Version 1 Sites Table                                               
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Longitude                                                                                                       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                                                                                               │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
pprint(validate_data(validation_schema_2_v1, version_one_invalid_odm_data))
ValidationReport(
data_version='2.0.0',
schema_version='1.0.0',
package_version='0.5.0',
table_info={'Site': {'columns': 1, 'rows': 1}},
errors=[],
warnings=[]
)

Final points

One final point: the validation-rules folder contains specifications for all the currently supported validation rules.

That’s the end! May all your data validation reports contain only empty fields!