Module functions¶
validate_data¶
Validates an ODM dataset.
Arguments¶
schema
: The rules to validate the data against. This is a dictionary that contains a cerberus schema object, as well as the ODM version it is based on. The cerberus schema object should ideally be generated using the generate_validation_schema function.
    type
    : A Python dictionary with the following fields
        schemaVersion
        : string that has a semver version
        schema
        : Cerberus schema
    Example
    {
        "schemaVersion": "1.2.3",
        "schema": {
            "addresses": {
                "type": "list",
                "schema": {
                    "type": "dict",
                    "schema": {
                        "addressID": {
                            "required": True,
                            "meta": {
                                "partID": "addressID",
                                "addresses": "PK",
                                "addressesRequired": "mandatory",
                            }
                        },
                        "addL2": {
                            "meta": {
                                "partID": "contactID",
                                "contacts": "PK",
                                "contactsRequired": "NA"
                            }
                        }
                    }
                }
            }
        }
    }
data
: The ODM data to be validated.
    type
    : A Python dictionary whose keys are the names of the tables as contained in the ODM data dictionary and whose values are lists containing the table rows.
    Example
    {
        "addresses": [
            {
                "addressID": "WastewaterSiteOttawa",
                "addL1": "123 Laurier Avenue",
                "addL2": "",
                "city": "Ottawa",
                "country": "Canada",
                "datasetID": "",
                "stateProvReg": "Ontario",
                "zipCode": "KE2 TYU"
            }
        ],
        "contacts": [
            {
                "contactID": "OttawaWWContact",
                "organizationID": "WWOttawa",
                "email": "ww@ottawa.ca",
                "phone": "6137458999",
                "firstName": "John",
                "lastName": "Doe",
                "role": "Technician",
                "notes": ""
            }
        ]
    }
data_version
: The ODM version of the data.
    type
    : string.
rule_blacklist
: A list of rule ids to explicitly disable.
    type
    : A Python list of strings.
    Example:
[rules.invalid_category.__name__, rules.invalid_type.__name__]
or simply
['invalid_category', 'invalid_type']
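The two forms above are equivalent because a function's `__name__` attribute is simply its name as a string. A minimal sketch, using stand-in functions defined here for illustration (they are placeholders, not the package's actual rule implementations):

```python
# Stand-ins for the rule functions; only their names matter in this sketch.
def invalid_category():
    pass


def invalid_type():
    pass


# Building the blacklist from function references...
by_reference = [invalid_category.__name__, invalid_type.__name__]
# ...gives the same list as spelling the rule ids out by hand.
by_name = ["invalid_category", "invalid_type"]

print(by_reference == by_name)
```

Referencing the functions has the advantage that a misspelled rule id fails immediately with a NameError instead of silently disabling nothing.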
Return¶
Returns a dictionary with the found errors and warnings.
All errors and warnings for each validation rule are documented in the specification for each validation rule.
type: A Python dictionary consisting of the following fields
data_version
: string consisting of the version of the ODM data
schema_version
: string consisting of the version of the validation schema used
package_version
: string consisting of the version of the validation package used
table_info
: A Python dictionary mapping all validated table(-ids) to their column/row counts.
errors
: A list of Python dictionaries describing each error. For more information refer to the files in the validation-rules folder.
warnings
: A list of Python dictionaries describing each warning.
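As a rough illustration of how a caller might consume this return value, the sketch below counts the entries in the errors and warnings lists. The sample report is hypothetical: the version strings, the `columns`/`rows` keys inside table_info, and the contents of the error entry are placeholders chosen for this example, not the package's actual output.

```python
def issue_counts(report: dict) -> dict:
    """Count the errors and warnings in a report shaped like the one described above."""
    return {
        "errors": len(report.get("errors", [])),
        "warnings": len(report.get("warnings", [])),
    }


# Hypothetical report; every value here is a placeholder for illustration.
sample_report = {
    "data_version": "2.0.0",
    "schema_version": "2.0.0",
    "package_version": "0.0.0",
    "table_info": {"addresses": {"columns": 8, "rows": 1}},
    "errors": [{"placeholder": "error details"}],
    "warnings": [],
}

counts = issue_counts(sample_report)
print(counts)
```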
summarize_report¶
Summarizes the validation report.
Arguments¶
report
: A validation report returned from validate_data.
    type: A ValidationReport object.
by
: Specifies what to summarize by. Defaults to table.
    type: A Python enumeration with the values (table, column, row).
Return¶
Returns a summarized version of report.
type: A SummarizedReport object with the following fields and methods:
data_version
: string consisting of the version of the ODM data
schema_version
: string consisting of the version of the validation schema used
package_version
: string consisting of the version of the validation package used
overview
: A Python dictionary with a general overview.
errors
: A Python dictionary with the error summaries.
warnings
: A Python dictionary with the warning summaries.
Example¶
from pprint import pprint

report = validate_data(...)
summary = summarize_report(report)
pprint(summary.overview)
See the summarize-report-function spec for more details.
generate_validation_schema¶
Generates the cerberus schema containing the validation rules to be used with the validate_data function.
Arguments¶
parts
: The ODM data dictionary excel sheet ‘parts’.
    type
    : A dictionary whose keys are the sheet names and whose values are lists containing the sheet rows. Currently the parts and sets sheets are required.
    Example
    {
        "parts": [
            {
                "partID": "addresses",
                "label": "Address table",
                "partType": "table",
                "addresses": "NA",
                "addressesRequired": "NA"
            },
            {
                "partID": "addressID",
                "label": "Address ID",
                "partType": "attribute",
                "addresses": "pK",
                "addressesRequired": "mandatory"
            }
        ]
    }
sets
: The ODM data dictionary excel sheet ‘sets’.
    type
    : A dictionary whose keys are the sheet names and whose values are lists containing the sheet rows. Currently the parts and sets sheets are required.
    Example
    {
        "sets": [
            {
                "setID": "collectCat",
                "partID": "flowPr"
            },
            {
                "setID": "collectCat",
                "partID": "comp8h"
            }
        ]
    }
schema_version
: Optional version of the ODM dictionary the cerberus schema is for.
    type
    : A string representing the version of the ODM to use.
schema_additions
: Optional argument which allows the user to update the cerberus schema with additional validations.
    type
    : A dictionary containing the updates. The shape is shown below,
    {
        # The name of the table whose validation rules to update
        "<table_name>": {
            # The name of the column whose validation rules to update
            "<column_name>": {
                # Adds or updates the allowed rule for this column
                "allowed": string[]
            }
        }
    }
Return¶
Returns a dictionary that contains:
1. A valid cerberus schema object
2. The ODM dataset version the schema is for
3. A list of warnings generated during the generation process
Values from the ODM data dictionary will also be added to the meta field for debugging purposes.
Example
{
    "schemaVersion": "1.2.3",
    "schema": {
        "addresses": {
            "type": "list",
            "schema": {
                "type": "dict",
                "schema": {
                    "addressID": {
                        "required": True,
                        "meta": [
                            {
                                "ruleID": "missing_mandatory_column",
                                "meta": [
                                    {
                                        "partID": "addressID",
                                        "addresses": "PK",
                                        "addressesRequired": "mandatory",
                                    }
                                ]
                            }
                        ]
                    },
                },
                "meta": {
                    "partID": "addresses",
                    "partType": "table"
                }
            }
        }
    },
    "warnings": []
}
Logic¶
Working with the version parameter¶
The cerberus schema for a part should be added only if it is active for the provided version argument. To see whether a part is valid for a version, the status, firstReleased, and lastUpdated fields are used.
The status field can take one of two values, active or depreciated. active says that a part is currently being actively used while depreciated says the opposite.
The firstReleased and lastUpdated fields hold the version of the ODM when the part was first added and last changed, respectively.
The changes field is used to describe the changes made from one version to another.
The validation metadata for a part should only be used if it was active for the version in the version argument.
An example of an odm_data_dictionary parameter is shown below,
{
    "parts": [
        {
            "partID": "addresses",
            "label": "Address table",
            "partType": "tables",
            "addresses": "NA",
            "addressesRequired": "NA",
            "status": "active",
            "changes": "added in version 2",
            "firstReleased": "2",
            "lastUpdated": "2"
        },
        {
            "partID": "comp3",
            "label": "Composite grab sample of 3",
            "partType": "categories",
            "addresses": "NA",
            "addressesRequired": "NA",
            "status": "depreciated",
            "changes": "Use grab with collection number (collectNum) = 3",
            "firstReleased": "1",
            "lastUpdated": "2"
        }
    ]
}
The addresses part is currently active (status = ‘active’) and should only be included in version 2, since that is when it was first released (firstReleased = ‘2’).
The comp3 part was deprecated in version 2 (status = ‘depreciated’ and lastUpdated = ‘2’) and should only be included in version 1 (firstReleased = ‘1’).
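The rule described above can be sketched as a small predicate. This is an illustration under stated assumptions, not the package's implementation: the names `major` and `is_part_active` are hypothetical, versions are compared by their major component only, an active part is assumed to stay included in every version from firstReleased onwards, and a depreciated part is assumed to drop out starting with its lastUpdated version.

```python
def major(version: str) -> int:
    """Return the major component of a version string such as '2' or '2.0.0'."""
    return int(version.split(".")[0])


def is_part_active(part: dict, version: str) -> bool:
    """Decide whether a part should be included for the requested version."""
    requested = major(version)
    if requested < major(part["firstReleased"]):
        # The part did not exist yet in the requested version.
        return False
    if part["status"] == "depreciated":
        # Assumption: lastUpdated records the version that depreciated the part.
        return requested < major(part["lastUpdated"])
    return True


addresses = {"partID": "addresses", "status": "active",
             "firstReleased": "2", "lastUpdated": "2"}
comp3 = {"partID": "comp3", "status": "depreciated",
         "firstReleased": "1", "lastUpdated": "2"}

for part in (addresses, comp3):
    included_in = [v for v in ("1", "2") if is_part_active(part, v)]
    print(part["partID"], included_in)
```

With the two parts from the example, addresses is included only for version 2 and comp3 only for version 1, which matches the description above.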
For the sets sheet, the versioning fields (firstReleased and lastUpdated) define the dictionary version when a part was added to a set. These values can be different from the versioning fields in the parts sheet. For example, consider the following ODM snippet,
{
    "parts": [
        {
            "partID": "samples",
            "partType": "tables",
            "samples": "NA",
            "firstReleased": "1",
            "lastUpdated": "2"
        },
        {
            "partID": "collType",
            "partType": "attributes",
            "samples": "header",
            "mmaSet": "collectSet",
            "firstReleased": "1",
            "lastUpdated": "2"
        },
        {
            "partID": "comp8r",
            "partType": "measures",
            "samples": "header",
            "mmaSet": "NA",
            "firstReleased": "1",
            "lastUpdated": "2"
        }
    ],
    "sets": [
        {
            "setID": "collectSet",
            "partID": "comp8r",
            "firstReleased": "2",
            "lastUpdated": "2"
        }
    ]
}
It defines:
A table called samples
A column in the samples table called collType
A part called comp8r which is a measure and which is a category in the collType column
However, notice that comp8r was only added as a category for the collType column in version 2. This can be seen from its entry in the sets sheet, where it was firstReleased in version 2. In version 1, although a part was defined for it, it was not a category for collType; it became one only from version 2 onwards.
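To make the distinction concrete, the sketch below looks up the categories of a column for a given version using the firstReleased field of the sets rows rather than that of the parts rows. The helper names are hypothetical, versions are compared by major component only, and depreciation of set members is deliberately ignored to keep the example short.

```python
def major(version: str) -> int:
    """Return the major component of a version string such as '2' or '2.0.0'."""
    return int(version.split(".")[0])


def categories_for(column: dict, sets: list, version: str) -> list:
    """List the partIDs that belong to the column's set in the requested version."""
    set_id = column.get("mmaSet")
    requested = major(version)
    return [
        row["partID"]
        for row in sets
        # Membership is dated by the sets sheet, not by the parts sheet.
        if row["setID"] == set_id and major(row["firstReleased"]) <= requested
    ]


coll_type = {"partID": "collType", "mmaSet": "collectSet",
             "firstReleased": "1", "lastUpdated": "2"}
sets = [{"setID": "collectSet", "partID": "comp8r",
         "firstReleased": "2", "lastUpdated": "2"}]

print(categories_for(coll_type, sets, "1"))
print(categories_for(coll_type, sets, "2"))
```

For the snippet above, the lookup yields no categories for version 1 and comp8r for version 2, even though the comp8r part itself was first released in version 1.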
Version 2 of the dictionary renamed certain parts. For example, the WWMeasures table was renamed to measures in version 2. To remain backward compatible with version 1, columns were added to the parts list to document their version 1 equivalents. These columns are documented where necessary in the spec.
Working with the schema_additions¶
Currently, the function only supports updating the following cerberus validation rules:
allowed
For example, for the following set of arguments to the function,
parts =
[
    {
        'partID': 'sites',
        'partType': 'tables',
        'sites': 'NA',
        'sitesRequired': 'NA',
        'version1Location': 'tables',
        'version1Table': 'Site',
        'status': 'active'
    },
    {
        'partID': 'siteID',
        'partType': 'attributes',
        'sites': 'pK',
        'sitesRequired': 'mandatory',
        'version1Location': 'variables',
        'version1Table': 'Site',
        'version1Variable': 'SiteID',
        'status': 'active'
    }
]
version = "1.0.0"
schema_additions =
{
    'Site': {
        'SiteID': {
            'allowed': [
                'Ottawa Site',
                'Montreal Site'
            ]
        }
    }
}
The corresponding validation schema should be,
{
    'schemaVersion': '1.0.0',
    'schema': {
        'Site': {
            'type': 'list',
            'schema': {
                'type': 'dict',
                'schema': {
                    'SiteID': {
                        'required': True,
                        'allowed': [
                            'Ottawa Site',
                            'Montreal Site'
                        ],
                        'meta': [
                            {
                                'ruleID': 'missing_mandatory_column',
                                'meta': [
                                    {
                                        'partID': 'siteID',
                                        'sitesRequired': 'mandatory',
                                        'sites': 'pK',
                                        'version1Location': 'variables',
                                        'version1Table': 'Site',
                                        'version1Variable': 'SiteID'
                                    }
                                ]
                            }
                        ]
                    }
                },
                'meta': [
                    {
                        'partID': 'sites',
                        'partType': 'tables',
                        'version1Location': 'tables',
                        'version1Table': 'Site'
                    }
                ]
            }
        }
    }
}
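In the schema above, the table and column appear under their version 1 names (Site and SiteID) rather than their partIDs. One way to read the version1Location, version1Table, and version1Variable columns is sketched below. The function name `version1_name` and the lookup rule are assumptions inferred from this single example, not a documented part of the package.

```python
from typing import Optional


def version1_name(part: dict) -> Optional[str]:
    """Return the version 1 name of a part, or None if it has no recorded equivalent."""
    location = part.get("version1Location")
    if location == "tables":
        # Tables keep their version 1 name in version1Table.
        return part.get("version1Table")
    if location == "variables":
        # Columns keep their version 1 name in version1Variable.
        return part.get("version1Variable")
    return None


parts = [
    {"partID": "sites", "partType": "tables",
     "version1Location": "tables", "version1Table": "Site"},
    {"partID": "siteID", "partType": "attributes",
     "version1Location": "variables", "version1Table": "Site",
     "version1Variable": "SiteID"},
]

names = {part["partID"]: version1_name(part) for part in parts}
print(names)
```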
Care should be taken to perform an update and not an overwrite of the allowed field if the schema already contains an allowed field for that column. For example, for the arguments below,
parts =
[
    {
        'partID': 'samples',
        'partType': 'tables',
        'status': 'active'
    },
    {
        'partID': 'collection',
        'partType': 'attributes',
        'samples': 'header',
        'dataType': 'categorical',
        'mmaSet': 'collectCat',
        'status': 'active'
    },
    {
        'partID': 'comp3h',
        'partType': 'categories',
        'samples': 'input',
        'dataType': 'varchar',
        'status': 'active'
    },
    {
        'partID': 'comp8h',
        'partType': 'categories',
        'samples': 'input',
        'dataType': 'varchar',
        'status': 'active'
    },
    {
        'partID': 'flowPr',
        'partType': 'categories',
        'samples': 'input',
        'dataType': 'varchar',
        'status': 'active'
    }
]
version = "2.0.0"
schema_additions =
{
    'samples': {
        'collection': {
            'allowed': [
                'comp3',
                'comp3dep'
            ]
        }
    }
}
The corresponding validation schema would be,
{
    'schemaVersion': '2.0.0',
    'schema': {
        'samples': {
            'type': 'list',
            'schema': {
                'type': 'dict',
                'schema': {
                    'collection': {
                        'allowed': [
                            'comp3h',
                            'comp8h',
                            'flowPr',
                            'comp3',
                            'comp3dep'
                        ],
                        'meta': [
                            {
                                'ruleID': 'invalid_category',
                                'meta': [
                                    {
                                        'partID': 'collection',
                                        'samples': 'header',
                                        'dataType': 'categorical',
                                        'mmaSet': 'collectCat'
                                    },
                                    {
                                        'partID': 'comp3h',
                                        'setID': 'collectCat'
                                    },
                                    {
                                        'partID': 'comp8h',
                                        'setID': 'collectCat'
                                    },
                                    {
                                        'partID': 'flowPr',
                                        'setID': 'collectCat'
                                    }
                                ]
                            }
                        ]
                    }
                },
                'meta': [
                    {
                        'partID': 'samples',
                        'partType': 'tables'
                    }
                ]
            }
        }
    }
}
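The update-not-overwrite behaviour shown above can be sketched as follows. The function name `apply_schema_additions` is hypothetical, the input is assumed to be the inner cerberus schema (the value of the 'schema' key), and every table and column named in the additions is assumed to already exist in that schema.

```python
import copy


def apply_schema_additions(schema: dict, additions: dict) -> dict:
    """Return a copy of the schema with extra 'allowed' values appended per column."""
    result = copy.deepcopy(schema)  # leave the caller's schema untouched
    for table, columns in additions.items():
        column_rules = result[table]["schema"]["schema"]
        for column, rules in columns.items():
            # Start from the values already allowed, then append only new ones.
            allowed = list(column_rules[column].get("allowed", []))
            for value in rules.get("allowed", []):
                if value not in allowed:
                    allowed.append(value)
            column_rules[column]["allowed"] = allowed
    return result


schema = {
    "samples": {
        "type": "list",
        "schema": {
            "type": "dict",
            "schema": {"collection": {"allowed": ["comp3h", "comp8h", "flowPr"]}},
        },
    }
}
additions = {"samples": {"collection": {"allowed": ["comp3", "comp3dep"]}}}

merged = apply_schema_additions(schema, additions)
print(merged["samples"]["schema"]["schema"]["collection"]["allowed"])
```

Appending to the existing list keeps the categories derived from the dictionary in place, whereas a plain assignment would leave only the two user-supplied values.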