# Getting started The ODM-validation library provides an easy way of checking that your data is ODM-compliant by validating it against a rule-based schema. ## Installation The first step is to install the package. You can find instructions [here](../index-content.md#installation). ## Code setup ``` python # stdlib from os.path import join # for constructing file paths # odm-validation import odm_validation.odm as odm import odm_validation.schemas as schemas import odm_validation.utils as utils from odm_validation.validation import generate_validation_schema, validate_data # our asset directory ASSET_DIR = "../../assets/" ``` ## Validation To run the validation you will need your own dataset as a CSV file, and a validation schema specific to the version of the ODM you’re using. In this example we’ll validate a table of samples with the default schema. The `validate_data` can take multiple tables at once, so we include our sample table in a Python dictionary with the table name as the key. We are validating the `samples` table in this example. ``` python schema_name = f"schema-v{odm.VERSION_STR}.yml" schema_path = join(ASSET_DIR, join("validation-schemas", schema_name)) schema = schemas.import_schema(schema_path) samples = utils.import_dataset("samples.csv") data = {"samples": samples} report = validate_data(schema, data) print("number of errors: ", 0 if report.valid() else len(report.errors)) ``` number of errors: 4 Our sample dataset has been validated successfully with no errors. Now let’s see what happens when validating invalid data. We’ll use the same schema, but replace the previous dataset with an invalid one. ``` python samples = utils.import_dataset("samples-invalid.csv") data = {"samples": samples} report = validate_data(schema, data) ``` The validation report contains a list of errors from all instances of broken rules. Each error is a dictionary with information of what went wrong and where. The message-field has a human-readable summary. ``` python assert not report.valid() for e in report.errors: print("- " + e["message"]) ``` - missing_mandatory_column rule violated in table samples, column collDT: Missing mandatory column collDT - invalid_type rule violated in table samples, column collPer, row(s) 1: Value "asd" has type string which is incompatible with float - missing_mandatory_column rule violated in table samples, column collType: Missing mandatory column collType - missing_mandatory_column rule violated in table samples, column sampleID: Missing mandatory column sampleID The error message starts with the name of the rule that has been broken, followed by the table name, field name, and row number. The actual format of the message varies from rule to rule, but the general idea is the same. To find out the requirements for each rule, you can read the specifications in the [validation-rules](https://github.com/Big-Life-Lab/PHES-ODM-Validation/tree/6-jupyter-example/validation-rules) directory. ## Versioning ODM version information is an important part of the validation report. Consecutive validations by different people may give different results depending on their validation-package version, schema version, etc. This may be avoided by looking at the versions included in the report. If two reports from the same data have differing versions, then the errors reported may differ too. The package version is set automatically, but the schema and data versions should be specified. The latest version is used by default when not specified. ``` python print(report.package_version) print(report.schema_version) print(report.data_version) ``` 1.0.0b1 2.2.3 2.2.3 ## Schema generation You can generate your own validation schema from a CSV file of the parts table, using the `generate_validation_schema` function. The version parameter determines which version of the ODM the schema will be compatible with. ``` python version = odm.VERSION_STR parts_file = join(ASSET_DIR, f"dictionary/v{version}/parts.csv") parts = utils.import_dataset(parts_file) schema = generate_validation_schema(parts, schema_version=version) ``` The resulting schema can also be exported to a YAML file for future use. ``` python schemas.export_schema(schema, "my_schema.yml") ```