Getting to know the validation library¶
This document describes and demonstrates the main functions of the validation library. The document is a Quarto notebook, which you can execute yourself.
The library has two main functions:
validate_data: to validate an ODM dataset using a validation schema.
generate_validation_schema: to generate a validation schema from the ODM dictionary.
Setup¶
We’ll walk through how to use these two functions, but first we will install the library dependencies by running the following command in the terminal. Make sure you are in the root of the library directory.
pip install -r requirements.txt
Next, the notebook uses the rich library to print and display tables. rich can be installed by running the following command in the terminal.
python -m pip install rich
Next, import the two main library functions.
from odm_validation.validation import validate_data, generate_validation_schema
Example: validating a sites table with missing data¶
Let’s start by validating an ODM dataset. We will use just the sites table, which “contains information about a site; the location where an environmental sample was taken.” The table has a number of columns, but we will validate just the geoLat and geoLong columns. These columns are mandatory in the sites table.
Validating an ODM dataset requires a validation schema, which contains the rules to validate against. For the demo we created a YAML file with the validation rules for the sites table. The next code chunk imports the validation schema.
validation_schema = import_schema("./validation-schemas/sites-schema.yml")
pprint(validation_schema, expand_all=True)
{
    'schemaVersion': '2.0.0',
    'schema': {
        'sites': {
            'type': 'list',
            'schema': {
                'type': 'dict',
                'schema': {
                    'geoLat': {
                        'required': True
                    },
                    'geoLong': {
                        'required': True
                    }
                }
            }
        }
    }
}
The validation schema has two fields:
schemaVersion: the version of the ODM that the validation schema is for; and,
schema: which has the validation rules.
The structure of the validation rules follows the conventions of cerberus, a Python library that does all the validation heavy lifting. The PHES-ODM validation package integrates the ODM schema into cerberus and then uses cerberus methods to validate ODM data.
The dataset in the next code chunk is a very simple sites table stored as a CSV file. The dataset is invalid because it is missing the mandatory geoLat column. You can see in the above schema that geoLat is mandatory from the rule 'geoLat': {'required': True}.
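To make the 'required' rule concrete, here is a minimal hand-rolled sketch of what such a check amounts to. It is an illustration only, not how the validation library or cerberus is implemented; the function name find_missing_required is made up for this example, and the dataset is a stand-in with two rows that only have a geoLong value.

```python
# Illustrative sketch only: a hand-rolled check for 'required' rules.
# find_missing_required is a hypothetical name, not part of the library.

def find_missing_required(schema, dataset):
    """Return (table, row_number, column) for every missing required column."""
    missing = []
    for table_name, table_rule in schema.items():
        # Column rules sit two levels down: a list of rows, each row a dict.
        column_rules = table_rule["schema"]["schema"]
        for row_number, row in enumerate(dataset.get(table_name, []), start=1):
            for column, rule in column_rules.items():
                if rule.get("required") and column not in row:
                    missing.append((table_name, row_number, column))
    return missing


# The 'schema' part of the validation schema printed above.
sites_rules = {
    "sites": {
        "type": "list",
        "schema": {
            "type": "dict",
            "schema": {
                "geoLat": {"required": True},
                "geoLong": {"required": True},
            },
        },
    }
}

# Stand-in dataset: two rows, no geoLat column.
dataset = {"sites": [{"geoLong": "1"}, {"geoLong": "2"}]}

missing = find_missing_required(sites_rules, dataset)
print(missing)
```

Each row without a geoLat key produces one entry, which is why a missing column shows up once per row rather than once per table.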
invalid_odm_dataset = {
"sites": import_dataset("./datasets/invalid-sites-dataset.csv")
}
pprintDictList(invalid_odm_dataset["sites"], "Invalid ODM Dataset")
Invalid ODM Dataset
┏━━━━━━━━━┓
┃ geoLong ┃
┡━━━━━━━━━┩
│ 1       │
│ 2       │
└─────────┘
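The import_dataset helper used above is not defined in this document. A minimal sketch of what it could look like, assuming it simply reads a CSV file into a list of dictionaries with string values (which is consistent with the tables and reports shown in this document), is the following. The temporary file is a stand-in so the example does not depend on the demo files.

```python
import csv
import os
import tempfile


def import_dataset(path):
    """Read a CSV file into a list of dicts, one per row (values stay strings)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Write a stand-in CSV with the same content as the table above.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "invalid-sites-dataset.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        f.write("geoLong\n1\n2\n")
    rows = import_dataset(csv_path)

print(rows)
```

Because csv.DictReader does no type conversion, numeric-looking cells such as 1 and 2 arrive as the strings '1' and '2'.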
Consequently, we get errors when we try to validate the dataset with our constructed validation schema.
validate_data function¶
The validate_data function is used to validate ODM data. The function requires two pieces of information:
1. a validation schema;
2. an ODM dataset to validate.
The function returns a validation report for the data. The following code chunk validates our invalid ODM dataset and prints the report.
validation_result = validate_data(
validation_schema,
invalid_odm_dataset
)
pprint(validation_result)
ValidationReport(
    data_version='2.0.0',
    schema_version='2.0.0',
    package_version='0.5.0',
    table_info={'sites': {'columns': 1, 'rows': 2}},
    errors=[
        {
            'errorType': 'missing_mandatory_column',
            'tableName': 'sites',
            'columnName': 'geoLat',
            'validationRuleFields': [],
            'message': 'missing_mandatory_column rule violated in table sites, column geoLat: Missing mandatory column geoLat',
            'rowNumber': 1,
            'row': {'geoLong': '1'}
        },
        {
            'errorType': 'missing_mandatory_column',
            'tableName': 'sites',
            'columnName': 'geoLat',
            'validationRuleFields': [],
            'message': 'missing_mandatory_column rule violated in table sites, column geoLat: Missing mandatory column geoLat',
            'rowNumber': 2,
            'row': {'geoLong': '2'}
        }
    ],
    warnings=[]
)
Understanding error messages¶
Take a look at the errors field in the report. It contains the list of errors in our dataset. Here we have two errors, one per row, and each says that we are missing the geoLat column.
The error report provides metadata to trace which data row had invalid data. The message field is a human-readable description of the error.
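When a report holds many errors, it can help to summarize them before reading individual messages. The following is a small sketch that groups error entries by rule, table, and column. The two entries are copied from the report above and trimmed to the fields used here; the grouping itself is just one possible approach and is not something the library provides.

```python
from collections import Counter

# Two error entries copied from the report above (trimmed to the fields used).
errors = [
    {"errorType": "missing_mandatory_column", "tableName": "sites",
     "columnName": "geoLat", "rowNumber": 1, "row": {"geoLong": "1"}},
    {"errorType": "missing_mandatory_column", "tableName": "sites",
     "columnName": "geoLat", "rowNumber": 2, "row": {"geoLong": "2"}},
]

# Count how often each (rule, table, column) combination occurs.
counts = Counter((e["errorType"], e["tableName"], e["columnName"]) for e in errors)

# Collect the affected row numbers for each table/column pair.
rows_by_column = {}
for e in errors:
    key = (e["tableName"], e["columnName"])
    rows_by_column.setdefault(key, []).append(e["rowNumber"])

for (error_type, table, column), n in counts.items():
    print(f"{table}.{column}: {error_type} x{n}, rows {rows_by_column[(table, column)]}")
```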
Although most of the fields are self-explanatory, if further clarification is needed, the errorType field can be used to dig deeper by finding the specification file for the validation rule. For example, the missing_mandatory_column specification can be found in the repo.
The validation report also contains versioning metadata fields to help debug any errors in the ODM schema. These are the data_version, schema_version, and package_version fields.
Example: validating a valid data table¶
For completeness let’s validate a valid dataset in the next two code chunks.
# Import a valid dataset
valid_odm_data = {
"sites": import_dataset("./datasets/valid-sites-dataset.csv")
}
pprintDictList(valid_odm_data["sites"], "Valid Sites Table")
Valid Sites Table
┏━━━━━━━━┳━━━━━━━━━┓
┃ geoLat ┃ geoLong ┃
┡━━━━━━━━╇━━━━━━━━━┩
│ 1      │ 1       │
│ 2      │ 2       │
└────────┴─────────┘
# Validate the valid dataset
pprint(
validate_data(validation_schema, valid_odm_data)
)
ValidationReport(
    data_version='2.0.0',
    schema_version='2.0.0',
    package_version='0.5.0',
    table_info={'sites': {'columns': 2, 'rows': 2}},
    errors=[],
    warnings=[]
)
As we can see, the error list is empty.
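In a script you would usually check this programmatically rather than by eye. The sketch below assumes the report exposes errors and warnings as attributes, as the printed ValidationReport suggests. The SimpleNamespace objects are stand-ins for real reports, and is_clean is a made-up helper name.

```python
from types import SimpleNamespace


def is_clean(report):
    """True when the report lists no errors and no warnings."""
    return not report.errors and not report.warnings


# Stand-ins that mirror the shape of the two reports printed in this document.
valid_report = SimpleNamespace(errors=[], warnings=[])
invalid_report = SimpleNamespace(
    errors=[{"errorType": "missing_mandatory_column", "columnName": "geoLat"}],
    warnings=[],
)

print(is_clean(valid_report), is_clean(invalid_report))
```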
generate_validation_schema function¶
Generally, you don’t need to create a validation schema because there are default ODM validation schemas for all ODM versions. However, you can create your own schema with the generate_validation_schema function. This function generates a validation schema from the ODM dictionary. The next code chunk shows an excerpt from the parts sheet in the dictionary.
parts_sheet = import_dataset("./dictionary/parts-v2.csv")
pprintDictList(parts_sheet, 'Version 2 Parts Sheet')
Version 2 Parts Sheet
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┓
┃ partID  ┃ partType  ┃ sites  ┃ sitesRequired ┃ firstReleased ┃ lastUpdated ┃ status ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━┩
│ sites   │ table     │ NA     │ NA            │ 1.0.0         │ 2.0.0       │ active │
│ geoLat  │ attribute │ header │ mandatory     │ 1.0.0         │ 2.0.0       │ active │
│ geoLong │ attribute │ header │ mandatory     │ 1.0.0         │ 2.0.0       │ active │
└─────────┴───────────┴────────┴───────────────┴───────────────┴─────────────┴────────┘
The above parts sheet describes:
a table called sites; and,
two mandatory columns in the sites table called geoLat and geoLong.
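To show how that reading falls out of the rows, here is a small sketch that extracts the tables and their mandatory columns from the parts sheet. It reflects our reading of the sheet (a part belongs to a table when the table's column says header, and is mandatory when the matching Required column says mandatory). It is not the algorithm used by generate_validation_schema.

```python
# Rows copied from the parts sheet above (only the columns used here).
parts = [
    {"partID": "sites", "partType": "table",
     "sites": "NA", "sitesRequired": "NA"},
    {"partID": "geoLat", "partType": "attribute",
     "sites": "header", "sitesRequired": "mandatory"},
    {"partID": "geoLong", "partType": "attribute",
     "sites": "header", "sitesRequired": "mandatory"},
]

# Every part of type 'table' names a table.
tables = [p["partID"] for p in parts if p["partType"] == "table"]

# For each table, keep the attributes marked as a mandatory header.
mandatory = {
    table: [
        p["partID"]
        for p in parts
        if p["partType"] == "attribute"
        and p.get(table) == "header"
        and p.get(f"{table}Required") == "mandatory"
    ]
    for table in tables
}

print(mandatory)
```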
We can use the generate_validation_schema function to generate a validation schema from the above parts sheet. The function takes two arguments:
the parts sheet as a list of Python dictionaries;
the ODM dictionary version we want for our generated schema.
The following code chunk generates a validation schema from the above parts sheet for version 2.0.0 datasets.
validation_schema_2 = generate_validation_schema(parts_sheet,
schema_version="2.0.0")
pprint(validation_schema_2)
{'schemaVersion': '2.0.0', 'schema': {}}
As we can see, we generated a validation schema identical to the one we created manually, with one main difference: it includes a meta field used to trace back to the row(s) in the parts sheet used to generate the validation rule.
Validating our invalid dataset using the new validation schema returns the same error report as above, except for the validationRuleFields field, which is now just a copy of the meta field. We can see that by running the next code chunk.
pprint(validate_data(validation_schema_2, invalid_odm_dataset))
ValidationReport(
    data_version='2.0.0',
    schema_version='2.0.0',
    package_version='0.5.0',
    table_info={'sites': {'columns': 1, 'rows': 2}},
    errors=[],
    warnings=[]
)
Validating different ODM versions¶
We can also generate validation schemas for version 1.0.0 of the ODM dictionary. The next two code chunks import a demo version 1 parts sheet and create a validation schema.
parts_sheet_v1 = import_dataset("./dictionary/parts-v1.csv")
pprintDictList(parts_sheet_v1, "Version 1 Parts Sheet")
Version 1 Parts Sheet
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┓
┃ partID  ┃ partType  ┃ sites  ┃ sitesReq… ┃ version1… ┃ version1… ┃ version1V… ┃ firstRel… ┃ lastUpdat… ┃ status ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━┩
│ sites   │ table     │ NA     │ NA        │ tables    │ Site      │ NA         │ 1.0.0     │ 2.0.0      │ active │
│ geoLat  │ attribute │ header │ mandatory │ variables │ Site      │ Latitude   │ 1.0.0     │ 2.0.0      │ active │
│ geoLong │ attribute │ header │ mandatory │ variables │ Site      │ Longitude  │ 1.0.0     │ 2.0.0      │ active │
└─────────┴───────────┴────────┴───────────┴───────────┴───────────┴────────────┴───────────┴────────────┴────────┘
validation_schema_2_v1 = generate_validation_schema(parts_sheet_v1,
schema_version="1.0.0")
pprint(validation_schema_2_v1)
{'schemaVersion': '1.0.0', 'schema': {}}
The printed validation schema is once again identical to the previous one, except that the table and column names have been replaced with their version 1 equivalents.
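The renaming can be pictured with a short sketch. The name pairs below are read off the values in the version 1 parts sheet above (sites becomes Site, geoLat becomes Latitude, geoLong becomes Longitude). The flat {table: {column: rule}} shape and the to_version_one helper are simplifications made up for this example, not the library's internals.

```python
# Version 2 -> version 1 names, read off the version 1 parts sheet above.
TABLE_NAMES = {"sites": "Site"}
COLUMN_NAMES = {"geoLat": "Latitude", "geoLong": "Longitude"}


def to_version_one(rules):
    """Rename tables and columns in a {table: {column: rule}} mapping."""
    return {
        TABLE_NAMES.get(table, table): {
            COLUMN_NAMES.get(column, column): rule
            for column, rule in columns.items()
        }
        for table, columns in rules.items()
    }


# Simplified version 2 rules for the sites table.
v2_rules = {"sites": {"geoLat": {"required": True}, "geoLong": {"required": True}}}

v1_rules = to_version_one(v2_rules)
print(v1_rules)
```

Names without a version 1 equivalent in the mapping are passed through unchanged, which keeps the sketch safe to run on partial mappings.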
Finally, the next two code chunks import an invalid version 1 ODM dataset and validate it using our version 1 validation schema.
version_one_invalid_odm_data = {
"Site": import_dataset("./datasets/invalid-sites-dataset-v1.csv")
}
pprintDictList(version_one_invalid_odm_data["Site"], "Version 1 Sites Table")
Version 1 Sites Table
┏━━━━━━━━━━━┓
┃ Longitude ┃
┡━━━━━━━━━━━┩
│ 1         │
└───────────┘
pprint(validate_data(validation_schema_2_v1, version_one_invalid_odm_data))
ValidationReport(
    data_version='2.0.0',
    schema_version='1.0.0',
    package_version='0.5.0',
    table_info={'Site': {'columns': 1, 'rows': 1}},
    errors=[],
    warnings=[]
)
Final points¶
One final point: the validation-rules folder contains specifications for all the currently supported validation rules.
That’s the end! May all your data validation reports contain only empty fields!