# Module functions ## validate_data Validates an ODM dataset. ### Arguments 1. `schema`: The rules to validate the data against. This is a dictionary that contains a [cerberus](https://docs.python-cerberus.org/en/stable/) schema object, as well as the odm-version it’s based on. The cerberus schema object should ideally be generated using the `generate_validation_schema` function. `type`: A Python dictionary with the following fields - `schemaVersion`: string that has a semver version - `schema`: [Cerberus schema](https://docs.python-cerberus.org/en/stable/schemas.html) Example ``` python { "schemaVersion": "1.2.3", "schema": { "addresses": { "type": "list", "schema": { "type": "dict", "schema": { "addressID": { "required" True, "meta": { "partID": "addressID", "addresses": "PK", "addressesRequired": "mandatory", } }, "addL2": { "meta": { "partID": "contactID", "contacts": "PK", "contactsRequired": "NA" } } } } } } } ``` 2. `data`: The ODM data to be validated. - `type`: A Python dictionary whose keys are the names of the tables as contained in the ODM data dictionary and values is a list containing the table rows. Example ``` python { "addresses": [ { "addressID": "WastewaterSiteOttawa", "addL1": "123 Laurier Avenue", "addL2": "", "city": "Ottawa", "country": "Canada", "datasetID": "", "stateProvReg": "Ontario", "zipCode": "KE2 TYU" } ], "contacts": [ { "contactID": "OttawaWWContact", "organizationID": "WWOttawa", "email": "ww@ottawa.ca", "phone": "6137458999", "firstName": "John", "lastName": "Doe", "role": "Technician", "notes": "" } ] } ``` 3. `data_version`: The ODM version of the `data`. - `type`: string. 4. `rule_blacklist`: A list of rule ids to explicitly disable. - `type`: A Python list of strings. Example: `[rules.invalid_category.__name__, rules.invalid_type.__name__]` or simply `['invalid_category', 'invalid_type']` ### Return Returns a dictionary with the found errors and warnings. All errors and warnings for each validation rule are documented in the [specification for each validation rule](../validation-rules/). - type: A Python dictionary consisting of the following fields - `data_version`: string consisting of the version of the ODM data - `schema_version`: string consisting of the version of the validation schema used - `package_version`: string consisting of the version of the validation package used - `table_info`: A Python dictionary mapping all validated table(-ids) to their column/row counts. - `errors`: A list of Python dictionaries describing each error. For more information refer to the files in the [validation-rules](../validation-rules/) folder - `warnings`: A list of Python dictionaries describing each warning. ## summarize_report Summarizes the validation report. ### Arguments 1. `report`: A validation report returned from `validate_data`. - type: A `ValidationReport` object. 2. `by`: Specifies what to summarize by. Defaults to `table`. - type: A Python enumeration with the values `(table, column, row)`. ### Return Returns a summarized version of `report`. - type: A `SummarizedReport` object with the following fields and methods: - `data_version`: string consisting of the version of the ODM data - `schema_version`: string consisting of the version of the validation schema used - `package_version`: string consisting of the version of the validation package used - `overview`: A Python dictionary with a general overview. - `errors`: A Python dictionary with the error summaries. - `warnings`: A python dictionary with the warning summaries. ### Example ``` python report = validate_data(...) summary = summarize_report(report) pprint(summary.overview) ``` See the [summarize-report-function](summarize-report-function.md) spec for more details. ## generate_validation_schema Generates the cerberus schema containing the validation rules to be used with the `validate_data` function. ### Arguments 1. `parts`: The ODM data dictionary [excel sheet](https://github.com/Big-Life-Lab/PHES-ODM/tree/V2.0.0-rc.3/dictionary-tables) ‘parts’. - `type`: A dictionary whose keys are the sheet names and values is a list containing the sheet rows. Currently the parts and sets sheet are required. Example ``` python { "parts": [ { "partID": "addresses", "label": "Address table", "partType": "table", "addresses": "NA", "addressesRequired": "NA" }, { "partID": "addressID", "label": "Address ID", "partType": "attribute", "addresses": "pK", "addressesRequired": "mandatory" } ] } ``` 2. `sets`: The ODM data dictionary [excel sheet](https://github.com/Big-Life-Lab/PHES-ODM/tree/V2.0.0-rc.3/dictionary-tables) ‘sets’. - `type`: A dictionary whose keys are the sheet names and values is a list containing the sheet rows. Currently the parts and sets sheet are required. Example ``` python { "sets": [ { "setID": "collectCat", "partID": "flowPr" }, { "setID": "collectCat", "partID": "comp8h" } ] } ``` 3. `schema_version`: Optional version of the ODM dictionary the cerberus schema is for. - `type`: A string representing the version of the ODM to use. 4. `schema_additions`: Optional argument which allows the user to update the cerberus schema with additional validations - `type`: A dictionary containing the updates. The shape is shown below, ``` python { # The name of the table whose validation rules to update "": { # The name of the column whose validation rules to update "": { # Adds or updates the allowed rule for this column "allowed": string[] } } } ``` ### Return Return a dictionary that contains: 1. A valid cerberus schema object 2. The ODM dataset version the schema is for and 3. A list of warnings generated during the generation process. Values from the ODM data dictionary will also be added to the `meta` field for debugging purposes. Example ``` python { "schemaVersion": "1.2.3", "schema": { "addresses": { "type": "list", "schema": { "type": "dict", "schema": { "addressID": { "required" True, "meta": [ { "ruleId": "missing_mandatory_column", "meta": [ { "partID": "addressID", "addresses": "PK", "addressesRequired": "mandatory", } ] } ] }, }, "meta": { "partID": "addresses", "partType": "table" } } } }, "warnings": [] } ``` ### Logic #### Working with the `version` parameter The cerberus schema for a part should be added only if it is active for the provided `version` argument. To see whether a part is valid for a version, the `status`, `firstReleased`, and `lastUpdated` fields are used. - The `status` field can take one of two values, `active` or `depreciated`. `active` says that a part is currently being actively used while `depreciated` says the opposite. - The `firstReleased` and `lastUpdated` fields has the version of the ODM when the part was first added and last changed respectively. - The `changes` field is used to describe the changes made from one version to another. The validation metadata for a part should only be used if it was active for the version in the `version` argument. Example of an `odm_data_dictionary` parameter is shown below, ``` python { "parts": [ { "partID": "addresses", "label": "Address table", "partType": "tables", "addresses": "NA", "addressesRequired": "NA", "status": "active", "changes": "added in version 2", "firstReleased": "2", "lastUpdated": "2" }, { "partID": "comp3", "label": "Composite grab sample of 3", "partType": "categories", "addresses": "NA", "addressesRequired": "NA", "status": "depreciated", "changes": "Use grab with collection number (collectNum) = 3", "firstReleased": "1", "lastUpdated": "2" } ] } ``` - The addresses part is currently active (status = ‘active’) and should only be included in version 2 since it was first released (firstReleased = ‘2’) then. - The `comp3` part was depreciated in version 2 (status = ‘depreciated’ and lastUpdated = ‘2’) and should only be included in version 1 (firstReleased = ‘1’) For the sets sheet, the versioning fields (`firstReleased` and `lastUpdated`) define the dictionary version when a part was added to a set. These values can be different from the versioning fields in the parts sheet. For example, consider the following ODM snippet, ``` python { "parts": [ { "partID": "samples", "partType": "tables", "samples": "NA", "firstReleased": "1", "lastUpdated": "2" } { "partID": "collType", "partType": "attributes", "samples": "header", "mmaSet": "collectSet", "firstReleased": "1", "lastUpdated": "2" }, { "partID": "comp8r", "partType": "measures", "samples": "header", "mmaSet": "NA", "firstReleased": "1", "lastUpdated": "2" } ], "sets": [ { "setID": "collectSet", "partID": "comp8r", "firstReleased": "2", "lastUpdated": "2" } ] } ``` It defines, - A table called `samples` - A column in the `samples` table called `collType` - A part called `comp8r` which is a measure and which is a category in the `collType` column However, notice that `comp8r` was only added as a category for the `collType` column in version 2. This can be seen from its entry in the sets sheet where it was `firstReleased` in version 2. For version 1, although a part was defined for it, it was not a category for `collType` only from version 2 onwards. Version 2 of the dictionary renamed certain part pieces, for example, the `WWMeasures` table was renamed to `measures` in version 2. To be backcompatible with version 1, columns were added to the parts list to document their version 1 equivalents. These columns are documented where necessary in the spec. #### Working with the schema_additions Currently, the function only supports updating the following cerberus validation rules, - [allowed](https://docs.python-cerberus.org/en/stable/validation-rules.html#allowed) For example, for the following set of arguments to the function, parts =
[
{
│   │   'partID': 'sites',
│   │   'partType': 'tables',
│   │   'sites': 'NA',
│   │   'sitesRequired': 'NA',
│   │   'version1Location': 'tables',
│   │   'version1Table': 'Site',
│   │   'status': 'active'
},
{
│   │   'partID': 'siteID',
│   │   'partType': 'attributes',
│   │   'sites': 'pK',
│   │   'sitesRequired': 'mandatory',
│   │   'version1Location': 'variables',
│   │   'version1Table': 'Site',
│   │   'version1Variable': 'SiteID',
│   │   'status': 'active'
}
]
version = “2.0.0” schema_additions =
{
'Site': {
│   │   'SiteID': {
│   │   │   'anyof': [
│   │   │   │   {
│   │   │   │   │   'allowed': [
│   │   │   │   │   │   'Ottawa Site',
│   │   │   │   │   │   'Montreal Site'
│   │   │   │   │   ],
│   │   │   │   │   'empty': True
│   │   │   │   }
│   │   │   ]
│   │   }
}
}
The corresponding validation schema should be,
{
'schemaVersion': '1.0.0',
'schema': {
│   │   'Site': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'SiteID': {
│   │   │   │   │   │   'required': True,
│   │   │   │   │   │   'anyof': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'allowed': [
│   │   │   │   │   │   │   │   │   'Ottawa Site',
│   │   │   │   │   │   │   │   │   'Montreal Site'
│   │   │   │   │   │   │   │   ],
│   │   │   │   │   │   │   │   'empty': True
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ],
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'missing_mandatory_column',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'siteID',
│   │   │   │   │   │   │   │   │   │   'sitesRequired': 'mandatory',
│   │   │   │   │   │   │   │   │   │   'sites': 'pK',
│   │   │   │   │   │   │   │   │   │   'version1Location': 'variables',
│   │   │   │   │   │   │   │   │   │   'version1Table': 'Site',
│   │   │   │   │   │   │   │   │   │   'version1Variable': 'SiteID'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ]
│   │   │   │   │   }
│   │   │   │   },
│   │   │   │   'meta': [
│   │   │   │   │   {
│   │   │   │   │   │   'partID': 'sites',
│   │   │   │   │   │   'partType': 'tables',
│   │   │   │   │   │   'version1Location': 'tables',
│   │   │   │   │   │   'version1Table': 'Site'
│   │   │   │   │   }
│   │   │   │   ]
│   │   │   }
│   │   }
}
}
Care should be taken to perform an update and not an overwrite of the `allowed` field if the schema already contains an `allowed` field for that column. For example for the arguments below, parts =
[
{
│   │   'partID': 'samples',
│   │   'partType': 'tables',
│   │   'status': 'active'
},
{
│   │   'partID': 'collection',
│   │   'partType': 'attributes',
│   │   'samples': 'header',
│   │   'dataType': 'categorical',
│   │   'mmaSet': 'collectCat',
│   │   'status': 'active'
},
{
│   │   'partID': 'comp3h',
│   │   'partType': 'categories',
│   │   'samples': 'input',
│   │   'dataType': 'varchar',
│   │   'status': 'active'
},
{
│   │   'partID': 'comp8h',
│   │   'partType': 'categories',
│   │   'samples': 'input',
│   │   'dataType': 'varchar',
│   │   'status': 'active'
},
{
│   │   'partID': 'flowPr',
│   │   'partType': 'categories',
│   │   'samples': 'input',
│   │   'dataType': 'varchar',
│   │   'status': 'active'
}
]
version = “2.0.0” schema_additions =
{
'samples': {
│   │   'collection': {
│   │   │   'anyof': [
│   │   │   │   {
│   │   │   │   │   'allowed': [
│   │   │   │   │   │   'comp3',
│   │   │   │   │   │   'comp3dep'
│   │   │   │   │   ]
│   │   │   │   }
│   │   │   ]
│   │   }
}
}
The corresponding validation schema would be,
{
'schemaVersion': '2.0.0',
'schema': {
│   │   'samples': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'collection': {
│   │   │   │   │   │   'anyof': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'allowed': [
│   │   │   │   │   │   │   │   │   'comp3h',
│   │   │   │   │   │   │   │   │   'comp8h',
│   │   │   │   │   │   │   │   │   'flowPr',
│   │   │   │   │   │   │   │   │   'comp3',
│   │   │   │   │   │   │   │   │   'comp3dep'
│   │   │   │   │   │   │   │   ],
│   │   │   │   │   │   │   │   'empty': True
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ],
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'invalid_category',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'collection',
│   │   │   │   │   │   │   │   │   │   'samples': 'header',
│   │   │   │   │   │   │   │   │   │   'dataType': 'categorical',
│   │   │   │   │   │   │   │   │   │   'mmaSet': 'collectCat'
│   │   │   │   │   │   │   │   │   },
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'comp3h',
│   │   │   │   │   │   │   │   │   │   'setID': 'collectCat'
│   │   │   │   │   │   │   │   │   },
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'comp8h',
│   │   │   │   │   │   │   │   │   │   'setID': 'collectCat'
│   │   │   │   │   │   │   │   │   },
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'flowPr',
│   │   │   │   │   │   │   │   │   │   'setID': 'collectCat'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ]
│   │   │   │   │   }
│   │   │   │   },
│   │   │   │   'meta': [
│   │   │   │   │   {
│   │   │   │   │   │   'partID': 'samples',
│   │   │   │   │   │   'partType': 'tables'
│   │   │   │   │   }
│   │   │   │   ]
│   │   │   }
│   │   }
}
}