# Getting to know the validation library

This document describes and demonstrates the main functions of the validation library. The document is a [quarto](https://quarto.org/) notebook which you can execute yourself.

The library has two main functions:

1. `validate_data`: to validate an ODM dataset using a validation schema.
2. `generate_validation_schema`: to generate a validation schema from the ODM dictionary.

## Setup

We’ll walk through how to use these two functions, but first we will install the library dependencies by running the following command in the terminal. Make sure you are in the root of the library directory.

`pip install -r requirements.txt`

Next, the notebook uses the [rich](https://github.com/Textualize/rich) library to print and display tables. `rich` can be installed by running the following command in the terminal.

`python -m pip install rich`

Next, import the two main library functions.

``` python
from odm_validation.validation import validate_data, generate_validation_schema
```

## Example: validating a `sites` table with missing data

Let’s start by validating an ODM dataset. We will use just the `sites` table that “contains information about a site; the location where an environmental sample was taken.” The table has a number of columns, but we will validate just the `geoLat` and `geoLong` columns. These columns are mandatory in the `sites` table.

Validating an ODM dataset requires a validation schema, which contains the rules to validate against. For the demo we created a [YAML file](./validation-schemas/sites-schema.yml) which has the validation rules for the `sites` table. The next code chunk imports the validation schema.

``` python
validation_schema = import_schema("./validation-schemas/sites-schema.yml")
pprint(validation_schema, expand_all=True)
```
{
'schemaVersion': '2.0.0',
'schema': {
│   │   'sites': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'geoLat': {
│   │   │   │   │   │   'required': True
│   │   │   │   │   },
│   │   │   │   │   'geoLong': {
│   │   │   │   │   │   'required': True
│   │   │   │   │   }
│   │   │   │   }
│   │   │   }
│   │   }
}
}

The validation schema has two fields:

- `schemaVersion`: the version of the ODM that the validation schema applies to; and,
- `schema`: which has the validation rules.

The structure of the validation rules follows a Python library called [cerberus](https://docs.python-cerberus.org/en/stable/schemas.html) that does all the validation heavy lifting. The PHES-ODM validation package integrates the ODM schema into cerberus and then uses cerberus methods to validate ODM data.

The dataset in the next code chunk is a very simple `sites` table as a [CSV file](./datasets/invalid-sites-dataset.csv). The dataset is invalid because it is missing the mandatory `geoLat` column. You can see in the above schema that `geoLat` is mandatory from the rule `'geoLat': {'required': True}`.

``` python
invalid_odm_dataset = {
    "sites": import_dataset("./datasets/invalid-sites-dataset.csv")
}
pprintDictList(invalid_odm_dataset["sites"], "Invalid ODM Dataset")
```
                                                Invalid ODM Dataset                                                
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ geoLong                                                                                                         ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                                                                                               │
│ 2                                                                                                               │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
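
To make the `required` rule concrete, here is a minimal pure-Python sketch of what it checks. It is not how cerberus or the library implements validation. The names `read_csv_rows` and `missing_required_columns` are hypothetical, and the CSV text is an inline copy of the table above. The sketch also assumes that a dataset is a list of dictionaries with string values, which matches the `row` values shown in the validation reports.

```python
import csv
import io


def read_csv_rows(csv_text):
    """Parse CSV text into a list of dicts, one per row (all values are strings)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def missing_required_columns(schema, dataset):
    """Return (table, column, row_number) for every row lacking a required column."""
    problems = []
    for table_name, table_rule in schema["schema"].items():
        # Column rules sit two levels down: list of rows -> dict per row -> columns.
        column_rules = table_rule["schema"]["schema"]
        for row_number, row in enumerate(dataset.get(table_name, []), start=1):
            for column, rule in column_rules.items():
                if rule.get("required") and column not in row:
                    problems.append((table_name, column, row_number))
    return problems


# Same structure as the validation schema printed earlier.
schema = {
    "schemaVersion": "2.0.0",
    "schema": {
        "sites": {
            "type": "list",
            "schema": {
                "type": "dict",
                "schema": {
                    "geoLat": {"required": True},
                    "geoLong": {"required": True},
                },
            },
        }
    },
}

# Inline copy of the invalid sites table: a single geoLong column.
dataset = {"sites": read_csv_rows("geoLong\n1\n2\n")}

print(missing_required_columns(schema, dataset))
# [('sites', 'geoLat', 1), ('sites', 'geoLat', 2)]
```

For the dataset above, the sketch flags `geoLat` as missing from both rows of `sites`.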

Because the mandatory `geoLat` column is missing, we get an error when we try to validate the dataset with our validation schema.

### `validate_data` function

The `validate_data` function is used to validate ODM data. The function requires two pieces of information:

1. a validation schema;
2. an ODM dataset to validate.

The function returns a validation report for the data. The following code chunk validates our invalid ODM dataset and prints the report.

``` python
validation_result = validate_data(
    validation_schema,
    invalid_odm_dataset
)
pprint(validation_result)
```
ValidationReport(
data_version='2.2.3',
schema_version='2.0.0',
package_version='1.0.0b1',
table_info={'sites': {'columns': 1, 'rows': 2}},
errors=[
│   │   {
│   │   │   'errorType': 'missing_mandatory_column',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'missing_mandatory_column rule violated in table sites, column geoLat: Missing mandatory column geoLat',
│   │   │   'rowNumber': 1,
│   │   │   'row': {'geoLong': '1'}
│   │   },
│   │   {
│   │   │   'errorType': 'missing_mandatory_column',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'missing_mandatory_column rule violated in table sites, column geoLat: Missing mandatory column geoLat',
│   │   │   'rowNumber': 2,
│   │   │   'row': {'geoLong': '2'}
│   │   }
],
warnings=[]
)
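
The entries in `errors` are plain dictionaries, so a long report can be summarized with standard Python. The sketch below is illustrative only: `summarize_errors` and `rows_with_errors` are hypothetical helpers, not part of the library, and the error list is a trimmed copy of the report above. With a real report you would presumably pass `validation_result.errors`, assuming the report exposes its fields as attributes as the printed output suggests.

```python
from collections import Counter

# Trimmed copy of the two error entries from the report above.
errors = [
    {
        "errorType": "missing_mandatory_column",
        "tableName": "sites",
        "columnName": "geoLat",
        "rowNumber": 1,
        "row": {"geoLong": "1"},
    },
    {
        "errorType": "missing_mandatory_column",
        "tableName": "sites",
        "columnName": "geoLat",
        "rowNumber": 2,
        "row": {"geoLong": "2"},
    },
]


def summarize_errors(errors):
    """Count errors per (table, column, error type)."""
    return Counter(
        (e["tableName"], e["columnName"], e["errorType"]) for e in errors
    )


def rows_with_errors(errors):
    """Return the sorted, de-duplicated row numbers that have an error."""
    return sorted({e["rowNumber"] for e in errors})


for (table, column, error_type), count in summarize_errors(errors).items():
    print(f"{table}.{column}: {error_type} x {count}")
# sites.geoLat: missing_mandatory_column x 2

print(rows_with_errors(errors))
# [1, 2]
```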

### Understanding error messages

Take a look at the `errors` field in the report. It contains the list of errors in our dataset. Here we have two errors, one per row, and each says that the `geoLat` column is missing. Each error provides metadata to trace which data row had invalid data. The `message` field is a human-readable description of the error. Although most of the fields are self-explanatory, if further clarification is needed, the `errorType` field can be used to dig deeper by finding the specification file for the validation rule. For example, the `missing_mandatory_column` specification can be found in the [repo](../../validation-rules/missing_mandatory_column.md).

The validation report also contains versioning metadata fields to help debug any errors in the ODM schema. These are the `data_version`, `schema_version`, and `package_version` fields.

### Example: validating a valid data table

For completeness, let’s validate a [valid dataset](./datasets/valid-sites-dataset.csv) in the next two code chunks.

``` python
# Import a valid dataset
valid_odm_data = {
    "sites": import_dataset("./datasets/valid-sites-dataset.csv")
}
pprintDictList(valid_odm_data["sites"], "Valid Sites Table")
```
                                                 Valid Sites Table                                                 
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ geoLat                                               geoLong                                                   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                                   │ 1                                                         │
│ 2                                                   │ 2                                                         │
└─────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘

``` python
# Validate the valid dataset
pprint(
    validate_data(validation_schema, valid_odm_data)
)
```
ValidationReport(
data_version='2.2.3',
schema_version='2.0.0',
package_version='1.0.0b1',
table_info={'sites': {'columns': 2, 'rows': 2}},
errors=[],
warnings=[]
)

As we can see, the error list is empty.

### `generate_validation_schema` function

Generally, you don’t need to create a validation schema because there are default ODM validation schemas for all ODM versions. However, you can create your own schema with the `generate_validation_schema` function.

This function generates a validation schema from the ODM dictionary. The next code chunk shows an excerpt from the [parts sheet](./dictionary/parts-v2.csv) in the dictionary.

``` python
parts_sheet = import_dataset("./dictionary/parts-v2.csv")
pprintDictList(parts_sheet, 'Version 2 Parts Sheet')
```
                                               Version 2 Parts Sheet                                               
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ partID       partType        sites      sitesRequired       firstReleased       lastUpdated      status   ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ sites       │ tables         │ NA        │ NA                 │ 1.0.0              │ 2.0.0           │ active   │
│ geoLat      │ attributes     │ header    │ mandatory          │ 1.0.0              │ 2.0.0           │ active   │
│ geoLong     │ attributes     │ header    │ mandatory          │ 1.0.0              │ 2.0.0           │ active   │
└─────────────┴────────────────┴───────────┴────────────────────┴────────────────────┴─────────────────┴──────────┘

The above parts sheet describes:

- a table called `sites`; and,
- two mandatory columns in the `sites` table called `geoLat` and `geoLong`.

We can use the `generate_validation_schema` function to generate a validation schema from the above parts sheet. The function takes two arguments:

- the parts sheet as a list of Python dictionaries;
- the ODM dictionary version we want for our generated schema.

The following code chunk generates a validation schema from the above parts sheet for version 2.0.0 datasets.

``` python
validation_schema_2 = generate_validation_schema(parts_sheet, schema_version="2.0.0")
pprint(validation_schema_2)
```
{
'schemaVersion': '2.0.0',
'schema': {
│   │   'sites': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'geoLat': {
│   │   │   │   │   │   'required': True,
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'missing_mandatory_column',
│   │   │   │   │   │   │   │   'meta': [{'partID': 'geoLat', 'sites': 'header', 'sitesRequired': 'mandatory'}]
│   │   │   │   │   │   │   },
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'missing_values_found',
│   │   │   │   │   │   │   │   'meta': [{'partID': 'geoLat', 'sites': 'header', 'sitesRequired': 'mandatory'}]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ],
│   │   │   │   │   │   'emptyTrimmed': False,
│   │   │   │   │   │   'forbidden': []
│   │   │   │   │   },
│   │   │   │   │   'geoLong': {
│   │   │   │   │   │   'required': True,
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'missing_mandatory_column',
│   │   │   │   │   │   │   │   'meta': [{'partID': 'geoLong', 'sites': 'header', 'sitesRequired': 'mandatory'}]
│   │   │   │   │   │   │   },
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'missing_values_found',
│   │   │   │   │   │   │   │   'meta': [{'partID': 'geoLong', 'sites': 'header', 'sitesRequired': 'mandatory'}]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ],
│   │   │   │   │   │   'emptyTrimmed': False,
│   │   │   │   │   │   'forbidden': []
│   │   │   │   │   }
│   │   │   │   },
│   │   │   │   'meta': [{'partID': 'sites', 'partType': 'tables'}]
│   │   │   }
│   │   }
}
}

As we can see, the generated validation schema contains the same `required` rules as the one we created manually, with one main difference: it includes a `meta` field used to trace back to the row(s) in the parts sheet used to generate each validation rule.

Validating our invalid dataset using the new validation schema returns the same error report as above, except that `validationRuleFields` is no longer empty. It now holds a copy of the `meta` field. We can see that by running the next code chunk.

``` python
pprint(validate_data(validation_schema_2, invalid_odm_dataset))
```
ValidationReport(
data_version='2.2.3',
schema_version='2.0.0',
package_version='1.0.0b1',
table_info={'sites': {'columns': 1, 'rows': 2}},
errors=[
│   │   {
│   │   │   'errorType': 'missing_mandatory_column',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [
│   │   │   │   {'partID': 'geoLat', 'sites': 'header', 'sitesRequired': 'mandatory'},
│   │   │   │   {'partID': 'geoLat', 'sites': 'header', 'sitesRequired': 'mandatory'}
│   │   │   ],
│   │   │   'message': 'missing_mandatory_column rule violated in table sites, column geoLat: Missing mandatory column geoLat',
│   │   │   'rowNumber': 1,
│   │   │   'row': {'geoLong': '1'}
│   │   },
│   │   {
│   │   │   'errorType': 'missing_mandatory_column',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [
│   │   │   │   {'partID': 'geoLat', 'sites': 'header', 'sitesRequired': 'mandatory'},
│   │   │   │   {'partID': 'geoLat', 'sites': 'header', 'sitesRequired': 'mandatory'}
│   │   │   ],
│   │   │   'message': 'missing_mandatory_column rule violated in table sites, column geoLat: Missing mandatory column geoLat',
│   │   │   'rowNumber': 2,
│   │   │   'row': {'geoLong': '2'}
│   │   }
],
warnings=[]
)
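
To make the mapping from parts sheet to validation schema more concrete, here is a simplified sketch of the idea. It is an illustration under stated assumptions, not the library's implementation. `sketch_schema_from_parts` is a hypothetical name, and the sketch only handles tables and mandatory columns. It omits the `meta`, `emptyTrimmed`, and `forbidden` entries that the real function emits. It also assumes that a part belongs to a table when the column named after that table is not `NA`.

```python
def sketch_schema_from_parts(parts, schema_version):
    """Build a minimal cerberus-style schema from parts-sheet rows (illustration only)."""
    tables = [p["partID"] for p in parts if p["partType"] == "tables"]
    attributes = [p for p in parts if p["partType"] == "attributes"]

    schema = {}
    for table in tables:
        columns = {}
        for part in attributes:
            # Assumption: "NA" in the table's column means the part is not in that table.
            if part.get(table, "NA") == "NA":
                continue
            # The "<table>Required" column says whether the column is mandatory.
            if part.get(f"{table}Required") == "mandatory":
                columns[part["partID"]] = {"required": True}
        schema[table] = {
            "type": "list",
            "schema": {"type": "dict", "schema": columns},
        }
    return {"schemaVersion": schema_version, "schema": schema}


# Subset of the columns from the version 2 parts sheet shown earlier.
parts = [
    {"partID": "sites", "partType": "tables", "sites": "NA", "sitesRequired": "NA"},
    {"partID": "geoLat", "partType": "attributes", "sites": "header", "sitesRequired": "mandatory"},
    {"partID": "geoLong", "partType": "attributes", "sites": "header", "sitesRequired": "mandatory"},
]

sketch = sketch_schema_from_parts(parts, "2.0.0")
print(sketch["schema"]["sites"]["schema"]["schema"])
# {'geoLat': {'required': True}, 'geoLong': {'required': True}}
```

The result has the same shape as the manually written schema from the first example.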

### Validating different ODM versions

We can also generate validation schemas for version 1.0.0 of the ODM dictionary. The next two code chunks import a [demo version 1 parts sheet](./dictionary/parts-v1.csv) and create a validation schema from it.

``` python
parts_sheet_v1 = import_dataset("./dictionary/parts-v1.csv")
pprintDictList(parts_sheet_v1, "Version 1 Parts Sheet")
```
                                               Version 1 Parts Sheet                                               
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┓
┃ partID   partType   sites   sitesReq…  version1…  version1T…  version1…  firstRele…  lastUpda…  status ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━┩
│ sites   │ tables    │ NA     │ NA        │ tables    │ Site       │ NA        │ 1.0.0      │ 2.0.0     │ active │
│ geoLat  │ attribut… │ header │ mandatory │ variables │ Site       │ Latitude  │ 1.0.0      │ 2.0.0     │ active │
│ geoLong │ attribut… │ header │ mandatory │ variables │ Site       │ Longitude │ 1.0.0      │ 2.0.0     │ active │
└─────────┴───────────┴────────┴───────────┴───────────┴────────────┴───────────┴────────────┴───────────┴────────┘

``` python
validation_schema_2_v1 = generate_validation_schema(parts_sheet_v1, schema_version="1.0.0")
pprint(validation_schema_2_v1)
```
{
'schemaVersion': '1.0.0',
'schema': {
│   │   'Site': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'Latitude': {
│   │   │   │   │   │   'required': True,
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'missing_mandatory_column',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'geoLat',
│   │   │   │   │   │   │   │   │   │   'sites': 'header',
│   │   │   │   │   │   │   │   │   │   'sitesRequired': 'mandatory',
│   │   │   │   │   │   │   │   │   │   'version1Location': 'variables',
│   │   │   │   │   │   │   │   │   │   'version1Table': 'Site',
│   │   │   │   │   │   │   │   │   │   'version1Variable': 'Latitude'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   },
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'missing_values_found',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'geoLat',
│   │   │   │   │   │   │   │   │   │   'sites': 'header',
│   │   │   │   │   │   │   │   │   │   'sitesRequired': 'mandatory',
│   │   │   │   │   │   │   │   │   │   'version1Location': 'variables',
│   │   │   │   │   │   │   │   │   │   'version1Table': 'Site',
│   │   │   │   │   │   │   │   │   │   'version1Variable': 'Latitude'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ],
│   │   │   │   │   │   'emptyTrimmed': False,
│   │   │   │   │   │   'forbidden': []
│   │   │   │   │   },
│   │   │   │   │   'Longitude': {
│   │   │   │   │   │   'required': True,
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'missing_mandatory_column',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'geoLong',
│   │   │   │   │   │   │   │   │   │   'sites': 'header',
│   │   │   │   │   │   │   │   │   │   'sitesRequired': 'mandatory',
│   │   │   │   │   │   │   │   │   │   'version1Location': 'variables',
│   │   │   │   │   │   │   │   │   │   'version1Table': 'Site',
│   │   │   │   │   │   │   │   │   │   'version1Variable': 'Longitude'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   },
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'missing_values_found',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'geoLong',
│   │   │   │   │   │   │   │   │   │   'sites': 'header',
│   │   │   │   │   │   │   │   │   │   'sitesRequired': 'mandatory',
│   │   │   │   │   │   │   │   │   │   'version1Location': 'variables',
│   │   │   │   │   │   │   │   │   │   'version1Table': 'Site',
│   │   │   │   │   │   │   │   │   │   'version1Variable': 'Longitude'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ],
│   │   │   │   │   │   'emptyTrimmed': False,
│   │   │   │   │   │   'forbidden': []
│   │   │   │   │   }
│   │   │   │   },
│   │   │   │   'meta': [
│   │   │   │   │   {
│   │   │   │   │   │   'partID': 'sites',
│   │   │   │   │   │   'partType': 'tables',
│   │   │   │   │   │   'version1Location': 'tables',
│   │   │   │   │   │   'version1Table': 'Site'
│   │   │   │   │   }
│   │   │   │   ]
│   │   │   }
│   │   }
}
}

The printed validation schema matches the previous one, except that the table and column names have been replaced with their version 1 equivalents and the `meta` entries carry the version 1 fields from the parts sheet.

Finally, the next two code chunks import an invalid [version 1 ODM dataset](./datasets/invalid-sites-dataset-v1.csv) and validate it using our version 1 validation schema.

``` python
version_one_invalid_odm_data = {
    "Site": import_dataset("./datasets/invalid-sites-dataset-v1.csv")
}
pprintDictList(version_one_invalid_odm_data["Site"], "Version 1 Sites Table")
```
                                               Version 1 Sites Table                                               
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Longitude                                                                                                       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                                                                                               │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

``` python
pprint(validate_data(validation_schema_2_v1, version_one_invalid_odm_data))
```
ValidationReport(
data_version='2.2.3',
schema_version='1.0.0',
package_version='1.0.0b1',
table_info={'Site': {'columns': 1, 'rows': 1}},
errors=[
│   │   {
│   │   │   'errorType': 'missing_mandatory_column',
│   │   │   'tableName': 'Site',
│   │   │   'columnName': 'Latitude',
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'geoLat',
│   │   │   │   │   'sites': 'header',
│   │   │   │   │   'sitesRequired': 'mandatory',
│   │   │   │   │   'version1Location': 'variables',
│   │   │   │   │   'version1Table': 'Site',
│   │   │   │   │   'version1Variable': 'Latitude'
│   │   │   │   },
│   │   │   │   {
│   │   │   │   │   'partID': 'geoLat',
│   │   │   │   │   'sites': 'header',
│   │   │   │   │   'sitesRequired': 'mandatory',
│   │   │   │   │   'version1Location': 'variables',
│   │   │   │   │   'version1Table': 'Site',
│   │   │   │   │   'version1Variable': 'Latitude'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': 'missing_mandatory_column rule violated in table Site, column Latitude: Missing mandatory column Latitude',
│   │   │   'rowNumber': 1,
│   │   │   'row': {'Longitude': '1'}
│   │   }
],
warnings=[]
)
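
The renaming from version 2 to version 1 names can be read directly from the parts sheet. The sketch below illustrates the lookup using the `version1Location`, `version1Table`, and `version1Variable` fields that appear in the `meta` entries above. `version1_name` is a hypothetical helper, and the rule it applies is an assumption drawn from this demo sheet rather than the library's actual logic.

```python
# Subset of the columns from the version 1 parts sheet shown above.
parts_v1 = [
    {"partID": "sites", "partType": "tables",
     "version1Location": "tables", "version1Table": "Site", "version1Variable": "NA"},
    {"partID": "geoLat", "partType": "attributes",
     "version1Location": "variables", "version1Table": "Site", "version1Variable": "Latitude"},
    {"partID": "geoLong", "partType": "attributes",
     "version1Location": "variables", "version1Table": "Site", "version1Variable": "Longitude"},
]


def version1_name(part):
    """Return the version 1 name of a part, falling back to its partID."""
    if part["version1Location"] == "tables":
        return part["version1Table"]
    if part["version1Location"] == "variables":
        return part["version1Variable"]
    return part["partID"]


name_map = {part["partID"]: version1_name(part) for part in parts_v1}
print(name_map)
# {'sites': 'Site', 'geoLat': 'Latitude', 'geoLong': 'Longitude'}
```

This matches the names in the version 1 schema and report: the `sites` table becomes `Site`, and `geoLat` and `geoLong` become `Latitude` and `Longitude`.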

## Final points

As a final point, the [validation-rules](../../validation-rules/) folder contains specifications for all the currently supported validation rules.

That’s the end! May all your data validation reports contain only empty `errors` fields!