Standardized Validation Rule API

This document will go over the design for a standardized API to define validation rules in the library. It will also provide reasoning for some of the design decisions made.

Audience

The primary audience for this document are the developers of the validation library.

Context

Adding a new validation rule in the library cannot currently be done in an isolated manner. Programming the different pieces that goes into a new rule more often than not requires modifying multiple files in the library. In addition, the code in these files are not directly related to the rule itself.

For example, the code for the duplicate_entries_found rule is in multiple spots in the library. The schema generation is in one file, and the validation logic/error generation is in another. Worse, its not intuitive which parts of these files are immediately related to the rule.

There is a growing need to enable the users of the library to create their own rules. Users of the ODM come from a variety of backgrounds, disciplines, and professions. Trying to accomodate all the validation needs of such a diverse group of individuals is not possible, hence the need for a better way to implement rules.

Keeping the above in mind, the goal of of this document is to specify a standardized, intuitive, and isolated API for the creation of validation rules. This API should work not only for the core set of rules included with the library, but for any future rules that need to be implemented.

Design Approach

The design for this API was developed by:

  1. Going through all the rules currently implemented in the library as well as certain new rules that users have asked for;

  2. Identifying the common elements between the rules; and

  3. Developing an API to expose those elements to a user, in a way that works for all of the rules

The common elements or processes involved in each rule are explained below.

Rule processes

At a high level, implementing a new rule requires the following two processes:

  1. Generating a schema; and

  2. Validating data using the generated schema

There are other details that need to be thought off, for example, generating errors for the validation report, but they are all within the context of the above two processes.

Generating a schema

Generating a schema means returning the information needed by the rule to perform its validation during the validation process.

The rule needs to generate a schema for every table-column pair in the ODM dictionary. Certain table-column pairs may not be applicable to a rule, which the design needs to keep in mind.

The metadata needed to generate a schema will most probably come from an outside source. Currently, all this metadata comes from the ODM dictionary excel sheet.

For example, consider the greater_than_max_length rule. It needs to know what the max length is for a particular value in order to implement its validation logic. In addition, the rule only applies to string columns with a defined max length. Finally, all the information needed to generate this rule comes from the maxLength column of a part, locatedin the parts sheet in the ODM dictionary.

Take a look at the parts sheet below,

                                                    Parts Sheet                                                    
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ partID           partType              sites          dataType          maxValue          maxLength        ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ sites           │ tables               │ NA            │ NA               │ NA               │ NA               │
│ siteID          │ attributes           │ pK            │ varchar          │ NA               │ 6                │
│ geoLat          │ attributes           │ header        │ integer          │ 90               │ NA               │
│ geoLong         │ attributes           │ header        │ integer          │ 90               │ NA               │
└─────────────────┴──────────────────────┴───────────────┴──────────────────┴──────────────────┴──────────────────┘

It defines 4 parts:

  1. A sites table;

  2. A siteID string column, part of the sites table;

  3. A geoLat integer column, part of the sites table; and

  4. A geoLong integer column, part of the sites table

The greater_than_max_length rule would generate a schema only for the siteID column in the sites table. The schema would include its maxLength value of 6.

Validating data

Validating data means asking the rule if a value in a dataset is valid, according to the information returned in the schema generation process.

For example, for the dataset below,

                                                    Sites Table                                                    
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ siteID                                 geoLat                            geoLong                              ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1234567                               │ 91                               │ 89                                   │
│ 2                                     │ 89                               │ 91                                   │
└───────────────────────────────────────┴──────────────────────────────────┴──────────────────────────────────────┘

the validation logic for the greater_than_max_length rule would be run for all values in the siteID column.

This step is optional if the desired validation can be done by cerberus

Specifications

These next few sections will go over the different elements of the API.

Keep in mind that the library uses cerberus as its validation engine. Some of its idiosyncrancies may leak into the design.

The ValidationRule class

This is an abstract class that every validation rule that can be used with the library will need to extend.

By using an abstract class it becomes clear what methods need to be implemented when programming a new rule. Run-time checks that show errors if all methods are not implemented can be shown to the user.

The ValidationRule class has the following methods:

  1. get_schema_keys

  2. gen_schema_values

  3. validate_data

  4. gen_report

  5. transformed

The first two methods are used in the schema generation process while the other three methods are used in the validation process.

In the following sections we will go over the listed methods in detail.

Every method by default takes the Python self parameter as its first argument. Its presence is assumed and will not be mentioned unless warranted.

get_schema_keys and gen_schema_values

These two functions are used in the schema generation process and are together used to construct the schema for a rule. The schema for a rule is an object that contains key-value pairs, where each key is the unique name for a piece of validation metadata, and the value contains the metadata.

The get_schema_keys method does not take any parameters and should return a list of strings that represents the keys.

The gen_schema_values method takes the following 4 parameters:

  1. The name of the table to generate the schema values for;

  2. The name of the column in the above table to generate the schema values for;

  3. An object that represents the ODM dictionary the schema is for. This can be used to generate the schema values for the provided table-column pair.

  4. An object which contains metadata about the current generation process. It should contain the following metadata:

    1. The version of the dictionary the schema is for

The method should return one of the following values:

  • None if the table-column pair is not applicable for the rule; or

  • A list of schema values, each of which matches up with the keys returned in the get_schema_keys method

For example, for the greater_than_max_length rule:

  1. The gen_schema_keys method would return the list ['maxlength']; and

  2. The gen_schema_values method would return the list [6] for the siteID column and None otherwise

Similiarly, if we were to implement the two methods for the greater_than_max_value rule with the parts sheet displayed above:

  1. The get_schema_keys method would return the list ['type', 'coerce', 'max']

  2. The gen_schema_values method would return the list ['integer', 'integer', 90]
    for the geoLat and geoLong columns. It would return None for the other columns.

Due to the seperation of the schema generation and validation processes, the schema generation process cannot be combined into one method. This is explained below.

validate_data

This method is optional. A rule can decide to implement it if nothing in cerberus meets its needs. It can also be used to override a cerberus rule.

If this method is implemented, then the get_schema_keys method should only return one key. This is for ease of implementation since there are currently no uses cases for custom validation with multiple schema keys.

The method takes the following 3 parameters:

  1. The value to be validated.

  2. The schema information to validate this value with. This is the schema value that the rule returned in their gen_schema_values method for the table-column pair that the value belongs to.

  3. An object that contains metadata about the value. It should include the following:

    1. The row that the value belongs to

    2. The row index

    3. The name of the column

    4. The name of the table

The method should return a boolean, True if the value passed validation, and False otherwise.

For example, with the greater_than_max_length rule and the sites table printed above, the validate_data method would be called twice for each row of the sites table with the parameters below. Each item in the list corresponds to each call of the validate_data method.

[
{
│   │   'value': '1234567',
│   │   'schema': 6,
│   │   'data_context': {
│   │   │   'row': {'siteID': '1234567', 'geoLat': '91', 'geoLong': '89'},
│   │   │   'row_num': 0,
│   │   │   'column': 'siteID',
│   │   │   'table': 'sites'
│   │   }
},
{
│   │   'value': '1',
│   │   'schema': 6,
│   │   'data_context': {
│   │   │   'row': {'siteID': '1', 'geoLat': '89', 'geoLong': '91'},
│   │   │   'row_num': 1,
│   │   │   'column': 'siteID',
│   │   │   'table': 'sites'
│   │   }
}
]

gen_report

This method is used by the rule to return a machine actionable object for every failed validation. Each of these objects will be included in the final validation report.

It takes the following parameters:

  1. The value that failed validation

  2. An object that contains metadata about the failed validation. The metadata includes:

    1. The name of the table the failed value is from

    2. The index of the row the failed value is from

    3. The name of the column the failed value is from

    4. The row object where the failed value is from

    5. The name of the schema key that failed validation

    6. The value of the schema key

The method can return None or a report object.

A rule can return None, for example, to avoid duplicate errors in the report. For example, with the greater_than_max_value rule, the gen_report method could return an object only for the max schema key.

For example, for the greater_than_max_value rule, this method would be called twice, once for the geoLat column value in row 1 and once for the geoLong value in row 2.

It would be called with the following parameters below. Each item in the list has the parameters for each call of the gen_report method.

[
{
│   │   'value': 91,
│   │   'error_context': {
│   │   │   'table': 'sites',
│   │   │   'row_num': 0,
│   │   │   'column': 'geoLat',
│   │   │   'row': {'siteID': '1234567', 'geoLat': '91', 'geoLong': '89'},
│   │   │   'schema_key': 'min',
│   │   │   'schema_value': 90
│   │   }
},
{
│   │   'value': 91,
│   │   'error_context': {
│   │   │   'table': 'sites',
│   │   │   'row_num': 1,
│   │   │   'column': 'geoLong',
│   │   │   'row': {'siteID': '1', 'geoLat': '89', 'geoLong': '91'},
│   │   │   'schema_key': 'min',
│   │   │   'schema_value': 90
│   │   }
}
]

transformed

This is an optional method used to inform the rule when a value has been transformed. Its useful only if the cerberus coerce rule is used.

It takes three parameters:

  1. The transformed value

  2. The original value

  3. An object that includes metadata about the transformed value. The metadata includes:

    1. The name of the table the value is from

    2. The index of the row the value is from

    3. The name of the column the value is from

    4. An object containing the row the value is from

The method can return None or a report object.

For example, for the same sites table and the greater_than_max_value rule, this method would be called four times, for the two values in the geoLat and geoLong column.

The parameters for each call are printed in the list below. Each list item corresponds to each call.

[
{
│   │   'trans_value': 91,
│   │   'orig_value': '91',
│   │   'trans_ctx': {
│   │   │   'table': 'sites',
│   │   │   'row_num': 0,
│   │   │   'column': 'geoLat',
│   │   │   'row': {'siteID': '1234567', 'geoLat': '91', 'geoLong': '89'}
│   │   }
},
{
│   │   'trans_value': 89,
│   │   'orig_value': '89',
│   │   'trans_ctx': {
│   │   │   'table': 'sites',
│   │   │   'row_num': 0,
│   │   │   'column': 'geoLong',
│   │   │   'row': {'siteID': '1234567', 'geoLat': '91', 'geoLong': '89'}
│   │   }
},
{
│   │   'trans_value': 89,
│   │   'orig_value': '89',
│   │   'trans_ctx': {
│   │   │   'table': 'sites',
│   │   │   'row_num': 1,
│   │   │   'column': 'geoLat',
│   │   │   'row': {'siteID': '1', 'geoLat': '89', 'geoLong': '91'}
│   │   }
},
{
│   │   'trans_value': 91,
│   │   'orig_value': '91',
│   │   'trans_ctx': {
│   │   │   'table': 'sites',
│   │   │   'row_num': 1,
│   │   │   'column': 'geoLong',
│   │   │   'row': {'siteID': '1', 'geoLat': '89', 'geoLong': '91'}
│   │   }
}
]

This method was added due to the less_than_min_value and greater_than_max_value rules, where we needed to show warnings to the user if a successful coercion took place.

Two methods for schema generation?

The functionality of get_schema_keys could be rolled into gen_schema_values, so that the latter method now returns the entire schema object, with keys and values.

This is not possible because of the way the validate_data and gen_report methods work. For both these methods the library needs to know the schema keys that map to each rule:

  1. If a rule implements its own validate_data method, the library needs to attach that method to cerberus, with the schema key associated with it.

  2. For the gen_report method, the library needs to know how to map each cerberus error object to the rule it belongs to. It uses the schema key for this.

If there was only one schema generation method, the library could not rely on it to always return a schema object, since most rules are not applicable to all table-column pairs.

With get_schema_keys, the library can always know what schema keys belong to a rule.

Working with self

Certain rules need to keep track of information between method calls.

For example, the duplicate_entries_found rule needs to keep track of all primaryKey-lastUpdated pairs it encounters during the validation process. It uses this information to check whether a new value is unique within a table, or if it has been encountered before.

The self parameter allows a rule to keep track of whatever state information it needs between calls of the different methods.

For example, for the duplicate_entries_found rule, it can keep track of each primaryKey-lastUpdated pair in self whenever validate_data is called. It can then refer to that field when validating a new pair to check if it is unique or not.

The Report object

Within the gen_report and transformed functions the validation rule can decide to return an object that represents an error/warning report.

Depending on whether the report is an error or a warning the errorType or warningType field should be set, with errorType being for the former and warningType being for the latter.

If the report is an error then it should be added to the errors list in the report and if it is warning it should be to the warnings list.

Finally, the validation rule is allowed to add any other metadata it deems necessary to the report object, for example, the offending value(s).