Module functions

validate_data

Validates an ODM dataset.

Arguments

  1. schema: The rules to validate the data against. This is a dictionary that contains a cerberus schema object, as well as the odm-version it’s based on. The cerberus schema object should ideally be generated using the generate_validation_schema function.

    type: A Python dictionary with the following fields

    • schemaVersion: string that has a semver version

    • schema: Cerberus schema

      Example

      {
          "schemaVersion": "1.2.3",
          "schema": {
              "addresses": {
                  "type": "list",
                  "schema": {
                      "type": "dict",
                      "schema": {
                          "addressID": {
                              "required" True,
                              "meta": {
                                  "partID": "addressID",
                                  "addresses": "PK",
                                  "addressesRequired": "mandatory",
                              }
                          },
                          "addL2": {
                              "meta": {
                                  "partID": "contactID",
                                  "contacts": "PK",
                                  "contactsRequired": "NA"
                              }
                          }
                      }
                  }
              }
          }
      }
      
  2. data: The ODM data to be validated.

    • type: A Python dictionary whose keys are the names of the tables as contained in the ODM data dictionary and values is a list containing the table rows.

      Example

      {
          "addresses": [
              {
                  "addressID": "WastewaterSiteOttawa",
                  "addL1": "123 Laurier Avenue",
                  "addL2": "",
                  "city": "Ottawa",
                  "country": "Canada",
                  "datasetID": "",
                  "stateProvReg": "Ontario",
                  "zipCode": "KE2 TYU"
              }
          ],
          "contacts": [
              {
                  "contactID": "OttawaWWContact",
                  "organizationID": "WWOttawa",
                  "email": "ww@ottawa.ca",
                  "phone": "6137458999",
                  "firstName": "John",
                  "lastName": "Doe",
                  "role": "Technician",
                  "notes": ""
              }
          ]
      }
      
  3. data_version: The ODM version of the data.

    • type: string.

  4. rule_blacklist: A list of rule ids to explicitly disable.

    • type: A Python list of strings.

      Example:

      [rules.invalid_category.__name__, rules.invalid_type.__name__]

      or simply

      ['invalid_category', 'invalid_type']

Return

Returns a dictionary with the found errors and warnings.

All errors and warnings for each validation rule are documented in the specification for each validation rule.

  • type: A Python dictionary consisting of the following fields

    • data_version: string consisting of the version of the ODM data

    • schema_version: string consisting of the version of the validation schema used

    • package_version: string consisting of the version of the validation package used

    • table_info: A Python dictionary mapping all validated table(-ids) to their column/row counts.

    • errors: A list of Python dictionaries describing each error. For more information refer to the files in the validation-rules folder

    • warnings: A list of Python dictionaries describing each warning.

summarize_report

Summarizes the validation report.

Arguments

  1. report: A validation report returned from validate_data.

    • type: A ValidationReport object.

  2. by: Specifies what to summarize by. Defaults to table.

    • type: A Python enumeration with the values (table, column, row).

Return

Returns a summarized version of report.

  • type: A SummarizedReport object with the following fields and methods:

    • data_version: string consisting of the version of the ODM data

    • schema_version: string consisting of the version of the validation schema used

    • package_version: string consisting of the version of the validation package used

    • overview: A Python dictionary with a general overview.

    • errors: A Python dictionary with the error summaries.

    • warnings: A python dictionary with the warning summaries.

Example

report = validate_data(...)
summary = summarize_report(report)
pprint(summary.overview)

See the summarize-report-function spec for more details.

generate_validation_schema

Generates the cerberus schema containing the validation rules to be used with the validate_data function.

Arguments

  1. parts: The ODM data dictionary excel sheet ‘parts’.

    • type: A dictionary whose keys are the sheet names and values is a list containing the sheet rows. Currently the parts and sets sheet are required.

      Example

      {
          "parts": [
              {
                  "partID": "addresses",
                  "label": "Address table",
                  "partType": "table",
                  "addresses": "NA",
                  "addressesRequired": "NA"
              },
              {
                  "partID": "addressID",
                  "label": "Address ID",
                  "partType": "attribute",
                  "addresses": "pK",
                  "addressesRequired": "mandatory"
              }
          ]
      }
      
  2. sets: The ODM data dictionary excel sheet ‘sets’.

    • type: A dictionary whose keys are the sheet names and values is a list containing the sheet rows. Currently the parts and sets sheet are required.

      Example

      {
          "sets": [
              {
                  "setID": "collectCat",
                  "partID": "flowPr"
              },
              {
      
                  "setID": "collectCat",
                  "partID": "comp8h"
              }
          ]
      }
      
  3. schema_version: Optional version of the ODM dictionary the cerberus schema is for.

    • type: A string representing the version of the ODM to use.

  4. schema_additions: Optional argument which allows the user to update the cerberus schema with additional validations

    • type: A dictionary containing the updates. The shape is shown below,

    {
        # The name of the table whose validation rules to update
        "<table_name>": {
            # The name of the column whose validation rules to update
            "<column_name>": {
                # Adds or updates the allowed rule for this column
                "allowed": string[]
            }
        }
    }
    

Return

Return a dictionary that contains: 1. A valid cerberus schema object 2. The ODM dataset version the schema is for and 3. A list of warnings generated during the generation process.

Values from the ODM data dictionary will also be added to the meta field for debugging purposes.

Example

{
    "schemaVersion": "1.2.3",
    "schema": {
        "addresses": {
            "type": "list",
            "schema": {
                "type": "dict",
                "schema": {
                    "addressID": {
                        "required" True,
                        "meta": [
                            {
                                "ruleId": "missing_mandatory_column",
                                "meta": [
                                    {
                                        "partID": "addressID",
                                        "addresses": "PK",
                                        "addressesRequired": "mandatory",
                                    }
                                ]
                            }
                        ]
                    },
                },
                "meta": {
                    "partID": "addresses",
                    "partType": "table"
                }
            }
        }
    },
    "warnings": []
}

Logic

Working with the version parameter

The cerberus schema for a part should be added only if it is active for the provided version argument. To see whether a part is valid for a version, the status, firstReleased, and lastUpdated fields are used.

  • The status field can take one of two values, active or depreciated. active says that a part is currently being actively used while depreciated says the opposite.

  • The firstReleased and lastUpdated fields has the version of the ODM when the part was first added and last changed respectively.

  • The changes field is used to describe the changes made from one version to another.

The validation metadata for a part should only be used if it was active for the version in the version argument.

Example of an odm_data_dictionary parameter is shown below,

{
    "parts": [
        {
            "partID": "addresses",
            "label": "Address table",
            "partType": "tables",
            "addresses": "NA",
            "addressesRequired": "NA",
            "status": "active",
            "changes": "added in version 2",
            "firstReleased": "2",
            "lastUpdated": "2"
        },
        {
            "partID": "comp3",
            "label": "Composite grab sample of 3",
            "partType": "categories",
            "addresses": "NA",
            "addressesRequired": "NA",
            "status": "depreciated",
            "changes": "Use grab with collection number (collectNum) = 3",
            "firstReleased": "1",
            "lastUpdated": "2"
        }
    ]
}
  • The addresses part is currently active (status = ‘active’) and should only be included in version 2 since it was first released (firstReleased = ‘2’) then.

  • The comp3 part was depreciated in version 2 (status = ‘depreciated’ and lastUpdated = ‘2’) and should only be included in version 1 (firstReleased = ‘1’)

For the sets sheet, the versioning fields (firstReleased and lastUpdated) define the dictionary version when a part was added to a set. These values can be different from the versioning fields in the parts sheet. For example, consider the following ODM snippet,

{
    "parts": [
        {
            "partID": "samples",
            "partType": "tables",
            "samples": "NA",
            "firstReleased": "1",
            "lastUpdated": "2"
        }
        {
            "partID": "collType",
            "partType": "attributes",
            "samples": "header",
            "mmaSet": "collectSet",
            "firstReleased": "1",
            "lastUpdated": "2"
        },
        {
            "partID": "comp8r",
            "partType": "measures",
            "samples": "header",
            "mmaSet": "NA",
            "firstReleased": "1",
            "lastUpdated": "2"
        }
    ],
    "sets": [
        {
            "setID": "collectSet",
            "partID": "comp8r",
            "firstReleased": "2",
            "lastUpdated": "2"
        }
    ]
}

It defines,

  • A table called samples

  • A column in the samples table called collType

  • A part called comp8r which is a measure and which is a category in the collType column

However, notice that comp8r was only added as a category for the collType column in version 2. This can be seen from its entry in the sets sheet where it was firstReleased in version 2. For version 1, although a part was defined for it, it was not a category for collType only from version 2 onwards.

Version 2 of the dictionary renamed certain part pieces, for example, the WWMeasures table was renamed to measures in version 2. To be backcompatible with version 1, columns were added to the parts list to document their version 1 equivalents. These columns are documented where necessary in the spec.

Working with the schema_additions

Currently, the function only supports updating the following cerberus validation rules,

For example, for the following set of arguments to the function,

parts =

[
{
│   │   'partID': 'sites',
│   │   'partType': 'tables',
│   │   'sites': 'NA',
│   │   'sitesRequired': 'NA',
│   │   'version1Location': 'tables',
│   │   'version1Table': 'Site',
│   │   'status': 'active'
},
{
│   │   'partID': 'siteID',
│   │   'partType': 'attributes',
│   │   'sites': 'pK',
│   │   'sitesRequired': 'mandatory',
│   │   'version1Location': 'variables',
│   │   'version1Table': 'Site',
│   │   'version1Variable': 'SiteID',
│   │   'status': 'active'
}
]

version = “2.0.0”

schema_additions =

{
'Site': {
│   │   'SiteID': {
│   │   │   'allowed': [
│   │   │   │   'Ottawa Site',
│   │   │   │   'Montreal Site'
│   │   │   ]
│   │   }
}
}

The corresponding validation schema should be,

{
'schemaVersion': '1.0.0',
'schema': {
│   │   'Site': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'SiteID': {
│   │   │   │   │   │   'required': True,
│   │   │   │   │   │   'allowed': [
│   │   │   │   │   │   │   'Ottawa Site',
│   │   │   │   │   │   │   'Montreal Site'
│   │   │   │   │   │   ],
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'missing_mandatory_column',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'siteID',
│   │   │   │   │   │   │   │   │   │   'sitesRequired': 'mandatory',
│   │   │   │   │   │   │   │   │   │   'sites': 'pK',
│   │   │   │   │   │   │   │   │   │   'version1Location': 'variables',
│   │   │   │   │   │   │   │   │   │   'version1Table': 'Site',
│   │   │   │   │   │   │   │   │   │   'version1Variable': 'SiteID'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ]
│   │   │   │   │   }
│   │   │   │   },
│   │   │   │   'meta': [
│   │   │   │   │   {
│   │   │   │   │   │   'partID': 'sites',
│   │   │   │   │   │   'partType': 'tables',
│   │   │   │   │   │   'version1Location': 'tables',
│   │   │   │   │   │   'version1Table': 'Site'
│   │   │   │   │   }
│   │   │   │   ]
│   │   │   }
│   │   }
}
}

Care should be taken to perform an update and not an overwrite of the allowed field if the schema already contains an allowed field for that column. For example for the arguments below,

parts =

[
{
│   │   'partID': 'samples',
│   │   'partType': 'tables',
│   │   'status': 'active'
},
{
│   │   'partID': 'collection',
│   │   'partType': 'attributes',
│   │   'samples': 'header',
│   │   'dataType': 'categorical',
│   │   'mmaSet': 'collectCat',
│   │   'status': 'active'
},
{
│   │   'partID': 'comp3h',
│   │   'partType': 'categories',
│   │   'samples': 'input',
│   │   'dataType': 'varchar',
│   │   'status': 'active'
},
{
│   │   'partID': 'comp8h',
│   │   'partType': 'categories',
│   │   'samples': 'input',
│   │   'dataType': 'varchar',
│   │   'status': 'active'
},
{
│   │   'partID': 'flowPr',
│   │   'partType': 'categories',
│   │   'samples': 'input',
│   │   'dataType': 'varchar',
│   │   'status': 'active'
}
]

version = “2.0.0”

schema_additions =

{
'samples': {
│   │   'collection': {
│   │   │   'allowed': [
│   │   │   │   'comp3',
│   │   │   │   'comp3dep'
│   │   │   ]
│   │   }
}
}

The corresponding validation schema would be,

{
'schemaVersion': '2.0.0',
'schema': {
│   │   'samples': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'collection': {
│   │   │   │   │   │   'allowed': [
│   │   │   │   │   │   │   'comp3h',
│   │   │   │   │   │   │   'comp8h',
│   │   │   │   │   │   │   'flowPr',
│   │   │   │   │   │   │   'comp3',
│   │   │   │   │   │   │   'comp3dep'
│   │   │   │   │   │   ],
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'invalid_category',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'collection',
│   │   │   │   │   │   │   │   │   │   'samples': 'header',
│   │   │   │   │   │   │   │   │   │   'dataType': 'categorical',
│   │   │   │   │   │   │   │   │   │   'mmaSet': 'collectCat'
│   │   │   │   │   │   │   │   │   },
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'comp3h',
│   │   │   │   │   │   │   │   │   │   'setID': 'collectCat'
│   │   │   │   │   │   │   │   │   },
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'comp8h',
│   │   │   │   │   │   │   │   │   │   'setID': 'collectCat'
│   │   │   │   │   │   │   │   │   },
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'flowPr',
│   │   │   │   │   │   │   │   │   │   'setID': 'collectCat'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ]
│   │   │   │   │   }
│   │   │   │   },
│   │   │   │   'meta': [
│   │   │   │   │   {
│   │   │   │   │   │   'partID': 'samples',
│   │   │   │   │   │   'partType': 'tables'
│   │   │   │   │   }
│   │   │   │   ]
│   │   │   }
│   │   }
}
}