Module functions¶
validate_data¶
Validates an ODM dataset.
Arguments¶
- schema: The rules to validate the data against. This is a dictionary that contains a cerberus schema object, as well as the ODM version it is based on. The cerberus schema object should ideally be generated using the generate_validation_schema function.
  - type: A Python dictionary with the following fields
    - schemaVersion: string containing a semver version
    - schema: Cerberus schema
  - Example
    {
        "schemaVersion": "1.2.3",
        "schema": {
            "addresses": {
                "type": "list",
                "schema": {
                    "type": "dict",
                    "schema": {
                        "addressID": {
                            "required": True,
                            "meta": {
                                "partID": "addressID",
                                "addresses": "PK",
                                "addressesRequired": "mandatory",
                            }
                        },
                        "addL2": {
                            "meta": {
                                "partID": "contactID",
                                "contacts": "PK",
                                "contactsRequired": "NA"
                            }
                        }
                    }
                }
            }
        }
    }
 
- data: The ODM data to be validated.
  - type: A Python dictionary whose keys are the names of the tables as contained in the ODM data dictionary and whose values are lists containing the table rows.
  - Example
    {
        "addresses": [
            {
                "addressID": "WastewaterSiteOttawa",
                "addL1": "123 Laurier Avenue",
                "addL2": "",
                "city": "Ottawa",
                "country": "Canada",
                "datasetID": "",
                "stateProvReg": "Ontario",
                "zipCode": "KE2 TYU"
            }
        ],
        "contacts": [
            {
                "contactID": "OttawaWWContact",
                "organizationID": "WWOttawa",
                "email": "ww@ottawa.ca",
                "phone": "6137458999",
                "firstName": "John",
                "lastName": "Doe",
                "role": "Technician",
                "notes": ""
            }
        ]
    }
 
- data_version: The ODM version of the data.
  - type: string.
 
- rule_blacklist: A list of rule ids to explicitly disable.
  - type: A Python list of strings.
  - Example: [rules.invalid_category.__name__, rules.invalid_type.__name__] or simply ['invalid_category', 'invalid_type']
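As an illustration of the shape expected by the data argument, the following minimal sketch builds it from CSV text using Python's standard csv module. The helper name rows_from_csv and the sample CSV content are assumptions made for this example and are not part of the package.

```python
import csv
import io


def rows_from_csv(text):
    """Parse CSV text into a list of row dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical CSV content for the addresses table.
addresses_csv = (
    "addressID,addL1,city\n"
    "WastewaterSiteOttawa,123 Laurier Avenue,Ottawa\n"
)

# Table names are the keys; each value is the list of rows for that table.
data = {"addresses": rows_from_csv(addresses_csv)}
```

Every value read this way is a string, so any type conversion would have to happen separately.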
 
Return¶
Returns a dictionary with the found errors and warnings.
All errors and warnings for each validation rule are documented in the specification for each validation rule.
- type: A Python dictionary consisting of the following fields
  - data_version: string consisting of the version of the ODM data
  - schema_version: string consisting of the version of the validation schema used
  - package_version: string consisting of the version of the validation package used
  - table_info: A Python dictionary mapping all validated table ids to their column/row counts.
  - errors: A list of Python dictionaries describing each error. For more information refer to the files in the validation-rules folder.
  - warnings: A list of Python dictionaries describing each warning.
 
summarize_report¶
Summarizes the validation report.
Arguments¶
- report: A validation report returned from validate_data.
  - type: A ValidationReport object.
 
- by: Specifies what to summarize by. Defaults to table.
  - type: A Python enumeration with the values (table, column, row).
 
Return¶
Returns a summarized version of report.
- type: A SummarizedReport object with the following fields and methods:
  - data_version: string consisting of the version of the ODM data
  - schema_version: string consisting of the version of the validation schema used
  - package_version: string consisting of the version of the validation package used
  - overview: A Python dictionary with a general overview.
  - errors: A Python dictionary with the error summaries.
  - warnings: A Python dictionary with the warning summaries.
 
Example¶
from pprint import pprint

report = validate_data(...)
summary = summarize_report(report)
pprint(summary.overview)
See the summarize-report-function spec for more details.
generate_validation_schema¶
Generates the cerberus schema containing the validation rules to be used
with the validate_data function.
Arguments¶
- parts: The ODM data dictionary excel sheet ‘parts’.
  - type: A dictionary whose keys are the sheet names and whose values are lists containing the sheet rows. Currently the parts and sets sheets are required.
  - Example
    {
        "parts": [
            {
                "partID": "addresses",
                "label": "Address table",
                "partType": "table",
                "addresses": "NA",
                "addressesRequired": "NA"
            },
            {
                "partID": "addressID",
                "label": "Address ID",
                "partType": "attribute",
                "addresses": "pK",
                "addressesRequired": "mandatory"
            }
        ]
    }
 
- sets: The ODM data dictionary excel sheet ‘sets’.
  - type: A dictionary whose keys are the sheet names and whose values are lists containing the sheet rows. Currently the parts and sets sheets are required.
  - Example
    {
        "sets": [
            {
                "setID": "collectCat",
                "partID": "flowPr"
            },
            {
                "setID": "collectCat",
                "partID": "comp8h"
            }
        ]
    }
 
- schema_version: Optional version of the ODM dictionary the cerberus schema is for.
  - type: A string representing the version of the ODM to use.
 
- schema_additions: Optional argument which allows the user to update the cerberus schema with additional validations.
  - type: A dictionary containing the updates. The shape is shown below,
    {
        # The name of the table whose validation rules to update
        "<table_name>": {
            # The name of the column whose validation rules to update
            "<column_name>": {
                # Adds or updates the allowed rule for this column
                "allowed": string[]
            }
        }
    }
Return¶
Returns a dictionary that contains: (1) a valid cerberus schema object, (2) the ODM dataset version the schema is for, and (3) a list of warnings generated during the generation process.
Values from the ODM data dictionary will also be added to the meta
field for debugging purposes.
Example
{
    "schemaVersion": "1.2.3",
    "schema": {
        "addresses": {
            "type": "list",
            "schema": {
                "type": "dict",
                "schema": {
                    "addressID": {
                        "required": True,
                        "meta": [
                            {
                                "ruleID": "missing_mandatory_column",
                                "meta": [
                                    {
                                        "partID": "addressID",
                                        "addresses": "PK",
                                        "addressesRequired": "mandatory",
                                    }
                                ]
                            }
                        ]
                    },
                },
                "meta": {
                    "partID": "addresses",
                    "partType": "table"
                }
            }
        }
    },
    "warnings": []
}
Logic¶
Working with the version parameter¶
The cerberus schema for a part should be added only if it is active for
the provided version argument. To see whether a part is valid for a
version, the status, firstReleased, and lastUpdated fields are
used.
- The status field can take one of two values, active or depreciated. active says that a part is currently in use, while depreciated says the opposite.
- The firstReleased and lastUpdated fields hold the version of the ODM when the part was first added and last changed respectively.
- The changes field is used to describe the changes made from one version to another.
The validation metadata for a part should only be used if it was active
for the version in the version argument.
An example of an odm_data_dictionary parameter is shown below,
{
    "parts": [
        {
            "partID": "addresses",
            "label": "Address table",
            "partType": "tables",
            "addresses": "NA",
            "addressesRequired": "NA",
            "status": "active",
            "changes": "added in version 2",
            "firstReleased": "2",
            "lastUpdated": "2"
        },
        {
            "partID": "comp3",
            "label": "Composite grab sample of 3",
            "partType": "categories",
            "addresses": "NA",
            "addressesRequired": "NA",
            "status": "depreciated",
            "changes": "Use grab with collection number (collectNum) = 3",
            "firstReleased": "1",
            "lastUpdated": "2"
        }
    ]
}
- The addresses part is currently active (status = ‘active’) and should only be included in version 2 since it was first released (firstReleased = ‘2’) then. 
- The comp3 part was depreciated in version 2 (status = ‘depreciated’ and lastUpdated = ‘2’) and should only be included in version 1 (firstReleased = ‘1’).
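The two cases above can be sketched as a small predicate. This is only an illustration under stated assumptions, not the package's implementation: the function names major and is_part_active are hypothetical, only the major version number is compared, and a depreciated part is assumed to have been depreciated in the version recorded in lastUpdated.

```python
def major(version):
    """Return the major component of a version string such as '2' or '2.0.0'."""
    return int(str(version).split(".")[0])


def is_part_active(part, version):
    """Decide whether a part should be included for the given ODM version."""
    target = major(version)
    # A part cannot be used before it was first released.
    if target < major(part["firstReleased"]):
        return False
    # Assumption: a depreciated part stops being valid in its lastUpdated version.
    if part["status"] == "depreciated":
        return target < major(part["lastUpdated"])
    return True


addresses = {"partID": "addresses", "status": "active",
             "firstReleased": "2", "lastUpdated": "2"}
comp3 = {"partID": "comp3", "status": "depreciated",
         "firstReleased": "1", "lastUpdated": "2"}

included_in_v1 = [p["partID"] for p in (addresses, comp3) if is_part_active(p, "1")]
included_in_v2 = [p["partID"] for p in (addresses, comp3) if is_part_active(p, "2.0.0")]
```

With the example parts, only comp3 is included for version 1 and only addresses for version 2, matching the description above.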
For the sets sheet, the versioning fields (firstReleased and
lastUpdated) define the dictionary version when a part was added to a
set. These values can be different from the versioning fields in the
parts sheet. For example, consider the following ODM snippet,
{
    "parts": [
        {
            "partID": "samples",
            "partType": "tables",
            "samples": "NA",
            "firstReleased": "1",
            "lastUpdated": "2"
        },
        {
            "partID": "collType",
            "partType": "attributes",
            "samples": "header",
            "mmaSet": "collectSet",
            "firstReleased": "1",
            "lastUpdated": "2"
        },
        {
            "partID": "comp8r",
            "partType": "measures",
            "samples": "header",
            "mmaSet": "NA",
            "firstReleased": "1",
            "lastUpdated": "2"
        }
    ],
    "sets": [
        {
            "setID": "collectSet",
            "partID": "comp8r",
            "firstReleased": "2",
            "lastUpdated": "2"
        }
    ]
}
It defines,
- A table called samples
- A column in the samples table called collType
- A part called comp8r which is a measure and which is a category in the collType column
However, notice that comp8r was only added as a category for the
collType column in version 2. This can be seen from its entry in the
sets sheet, where its firstReleased value is version 2. In version 1,
although a part was defined for it, it was not a category for collType;
it became one only from version 2 onwards.
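A minimal sketch of this lookup, assuming only the major version is compared and that a set entry applies from its firstReleased version onwards. The function name categories_for_set is hypothetical and not part of the package.

```python
def categories_for_set(sets, set_id, version):
    """Return the partIDs that belong to set_id in the given ODM version."""
    target = int(str(version).split(".")[0])
    return [
        row["partID"]
        for row in sets
        # Use the versioning fields of the sets sheet, not of the parts sheet.
        if row["setID"] == set_id
        and int(str(row["firstReleased"]).split(".")[0]) <= target
    ]


sets = [
    {"setID": "collectSet", "partID": "comp8r",
     "firstReleased": "2", "lastUpdated": "2"},
]

v1_categories = categories_for_set(sets, "collectSet", "1")
v2_categories = categories_for_set(sets, "collectSet", "2")
```

Here v1_categories is empty while v2_categories contains comp8r, even though the comp8r part itself was first released in version 1.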
Version 2 of the dictionary renamed certain parts; for example,
the WWMeasures table was renamed to measures. To be backward
compatible with version 1, columns were added to the parts list to
document their version 1 equivalents. These columns are documented where
necessary in the spec.
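As a rough sketch of how these columns could be used, the function below picks the name of a part for a given version. It assumes the version1Location, version1Table, and version1Variable columns that appear in the examples later in this section, and it assumes that a version1Location of 'tables' points at version1Table and 'variables' points at version1Variable. That interpretation is inferred from those examples, and the function name part_name_for_version is hypothetical.

```python
def part_name_for_version(part, version):
    """Return the name a part goes by in the given ODM version, or None."""
    target = int(str(version).split(".")[0])
    if target >= 2:
        return part["partID"]
    # Assumption: version1Location says which version 1 column holds the name.
    location = part.get("version1Location")
    if location == "tables":
        return part.get("version1Table")
    if location == "variables":
        return part.get("version1Variable")
    return None


sites = {"partID": "sites", "partType": "tables",
         "version1Location": "tables", "version1Table": "Site"}
site_id = {"partID": "siteID", "partType": "attributes",
           "version1Location": "variables", "version1Table": "Site",
           "version1Variable": "SiteID"}

v1_names = [part_name_for_version(p, "1.0.0") for p in (sites, site_id)]
v2_names = [part_name_for_version(p, "2.0.0") for p in (sites, site_id)]
```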
Working with the schema_additions¶
Currently, the function only supports updating the following cerberus validation rules,
- allowed
For example, for the following set of arguments to the function,
parts =
[
    {
        'partID': 'sites',
        'partType': 'tables',
        'sites': 'NA',
        'sitesRequired': 'NA',
        'version1Location': 'tables',
        'version1Table': 'Site',
        'status': 'active'
    },
    {
        'partID': 'siteID',
        'partType': 'attributes',
        'sites': 'pK',
        'sitesRequired': 'mandatory',
        'version1Location': 'variables',
        'version1Table': 'Site',
        'version1Variable': 'SiteID',
        'status': 'active'
    }
]
version = "1.0.0"
schema_additions =
{
    'Site': {
        'SiteID': {
            'anyof': [
                {
                    'allowed': [
                        'Ottawa Site',
                        'Montreal Site'
                    ],
                    'empty': True
                }
            ]
        }
    }
}
The corresponding validation schema should be,
{
    'schemaVersion': '1.0.0',
    'schema': {
        'Site': {
            'type': 'list',
            'schema': {
                'type': 'dict',
                'schema': {
                    'SiteID': {
                        'required': True,
                        'anyof': [
                            {
                                'allowed': [
                                    'Ottawa Site',
                                    'Montreal Site'
                                ],
                                'empty': True
                            }
                        ],
                        'meta': [
                            {
                                'ruleID': 'missing_mandatory_column',
                                'meta': [
                                    {
                                        'partID': 'siteID',
                                        'sitesRequired': 'mandatory',
                                        'sites': 'pK',
                                        'version1Location': 'variables',
                                        'version1Table': 'Site',
                                        'version1Variable': 'SiteID'
                                    }
                                ]
                            }
                        ]
                    }
                },
                'meta': [
                    {
                        'partID': 'sites',
                        'partType': 'tables',
                        'version1Location': 'tables',
                        'version1Table': 'Site'
                    }
                ]
            }
        }
    }
}
Care should be taken to perform an update and not an overwrite of the
allowed field if the schema already contains an allowed field for
that column. For example, for the arguments below,
parts =
[
    {
        'partID': 'samples',
        'partType': 'tables',
        'status': 'active'
    },
    {
        'partID': 'collection',
        'partType': 'attributes',
        'samples': 'header',
        'dataType': 'categorical',
        'mmaSet': 'collectCat',
        'status': 'active'
    },
    {
        'partID': 'comp3h',
        'partType': 'categories',
        'samples': 'input',
        'dataType': 'varchar',
        'status': 'active'
    },
    {
        'partID': 'comp8h',
        'partType': 'categories',
        'samples': 'input',
        'dataType': 'varchar',
        'status': 'active'
    },
    {
        'partID': 'flowPr',
        'partType': 'categories',
        'samples': 'input',
        'dataType': 'varchar',
        'status': 'active'
    }
]
version = "2.0.0"
schema_additions =
{
    'samples': {
        'collection': {
            'anyof': [
                {
                    'allowed': [
                        'comp3',
                        'comp3dep'
                    ]
                }
            ]
        }
    }
}
The corresponding validation schema would be,
{
    'schemaVersion': '2.0.0',
    'schema': {
        'samples': {
            'type': 'list',
            'schema': {
                'type': 'dict',
                'schema': {
                    'collection': {
                        'anyof': [
                            {
                                'allowed': [
                                    'comp3h',
                                    'comp8h',
                                    'flowPr',
                                    'comp3',
                                    'comp3dep'
                                ],
                                'empty': True
                            }
                        ],
                        'meta': [
                            {
                                'ruleID': 'invalid_category',
                                'meta': [
                                    {
                                        'partID': 'collection',
                                        'samples': 'header',
                                        'dataType': 'categorical',
                                        'mmaSet': 'collectCat'
                                    },
                                    {
                                        'partID': 'comp3h',
                                        'setID': 'collectCat'
                                    },
                                    {
                                        'partID': 'comp8h',
                                        'setID': 'collectCat'
                                    },
                                    {
                                        'partID': 'flowPr',
                                        'setID': 'collectCat'
                                    }
                                ]
                            }
                        ]
                    }
                },
                'meta': [
                    {
                        'partID': 'samples',
                        'partType': 'tables'
                    }
                ]
            }
        }
    }
}
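The update-not-overwrite behaviour described above can be sketched as follows. This is a minimal illustration, assuming the allowed list sits in the first entry of anyof as in the examples in this section. The function name merge_allowed is hypothetical, and the sketch only merges the allowed values; it does not handle other keys in the addition.

```python
import copy


def merge_allowed(column_schema, column_addition):
    """Return a copy of column_schema with the added allowed values appended."""
    merged = copy.deepcopy(column_schema)
    added = column_addition["anyof"][0]["allowed"]
    # Create the anyof/allowed structure if the column has none yet.
    rules = merged.setdefault("anyof", [{}])
    existing = rules[0].setdefault("allowed", [])
    for value in added:
        # Append rather than replace, and skip values already present.
        if value not in existing:
            existing.append(value)
    return merged


generated = {"anyof": [{"allowed": ["comp3h", "comp8h", "flowPr"], "empty": True}]}
addition = {"anyof": [{"allowed": ["comp3", "comp3dep"]}]}

merged = merge_allowed(generated, addition)
```

Working on a deep copy leaves the generated column schema untouched, and the existing values keep their order with the additions placed after them, as in the schema shown above.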
