Standardized Validation Rule API¶
This document will go over the design for a standardized API to define validation rules in the library. It will also provide reasoning for some of the design decisions made.
Audience¶
The primary audience for this document is the developers of the validation library.
Context¶
Adding a new validation rule to the library cannot currently be done in an isolated manner. Programming the different pieces that go into a new rule more often than not requires modifying multiple files in the library. In addition, much of the code in these files is not directly related to the rule itself.
For example, the code for the duplicate_entries_found rule is in
multiple spots in the library. The schema generation is in one file, and
the validation logic/error generation is in another. Worse, it's not
intuitive which parts of these files are immediately related to the
rule.
There is a growing need to enable the users of the library to create their own rules. Users of the ODM come from a variety of backgrounds, disciplines, and professions. Trying to accommodate all the validation needs of such a diverse group of individuals is not possible, hence the need for a better way to implement rules.
Keeping the above in mind, the goal of this document is to specify a standardized, intuitive, and isolated API for the creation of validation rules. This API should work not only for the core set of rules included with the library, but also for any future rules that need to be implemented.
Design Approach¶
The design for this API was developed by:
Going through all the rules currently implemented in the library as well as certain new rules that users have asked for;
Identifying the common elements between the rules; and
Developing an API to expose those elements to a user, in a way that works for all of the rules
The common elements or processes involved in each rule are explained below.
Rule processes¶
At a high level, implementing a new rule requires the following two processes:
Generating a schema; and
Validating data using the generated schema
There are other details that need to be thought of, for example, generating errors for the validation report, but they are all within the context of the above two processes.
Generating a schema¶
Generating a schema means returning the information needed by the rule to perform its validation during the validation process.
The rule needs to generate a schema for every table-column pair in the ODM dictionary. Certain table-column pairs may not be applicable to a rule, which the design needs to keep in mind.
The metadata needed to generate a schema will most probably come from an outside source. Currently, all this metadata comes from the ODM dictionary Excel sheet.
For example, consider the greater_than_max_length rule. It needs to
know what the max length is for a particular value in order to implement
its validation logic. In addition, the rule only applies to string
columns with a defined max length. Finally, all the information needed
to generate this rule comes from the maxLength column of a part,
located in the parts sheet in the ODM dictionary.
Take a look at the parts sheet below,
Parts Sheet
┏━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓
┃ partID  ┃ partType   ┃ sites  ┃ dataType ┃ maxValue ┃ maxLength ┃
┡━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩
│ sites   │ tables     │ NA     │ NA       │ NA       │ NA        │
│ siteID  │ attributes │ pK     │ varchar  │ NA       │ 6         │
│ geoLat  │ attributes │ header │ integer  │ 90       │ NA        │
│ geoLong │ attributes │ header │ integer  │ 90       │ NA        │
└─────────┴────────────┴────────┴──────────┴──────────┴───────────┘
It defines 4 parts:
A sites table;
A siteID string column, part of the sites table;
A geoLat integer column, part of the sites table; and
A geoLong integer column, part of the sites table
The greater_than_max_length rule would generate a schema only for the
siteID column in the sites table. The schema would include its
maxLength value of 6.
Validating data¶
Validating data means asking the rule if a value in a dataset is valid, according to the information returned in the schema generation process.
For example, for the dataset below,
Sites Table
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ siteID  ┃ geoLat ┃ geoLong ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ 1234567 │ 91     │ 89      │
│ 2       │ 89     │ 91      │
└─────────┴────────┴─────────┘
the validation logic for the greater_than_max_length rule would be run
for all values in the siteID column.
This step is optional if the desired validation can be done by cerberus.
Specifications¶
These next few sections will go over the different elements of the API.
Keep in mind that the library uses cerberus as its validation engine. Some of its idiosyncrasies may leak into the design.
The ValidationRule class¶
This is an abstract class that every validation rule that can be used with the library will need to extend.
By using an abstract class, it becomes clear which methods need to be implemented when programming a new rule. Run-time checks can also show the user an error if a required method has not been implemented.
The ValidationRule class has the following methods:
get_schema_keys
gen_schema_values
validate_data
gen_report
transformed
The first two methods are used in the schema generation process while the other three methods are used in the validation process.
In the following sections we will go over the listed methods in detail.
Every method by default takes the Python self parameter as its first
argument. Its presence is assumed and will not be mentioned unless
warranted.
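As a rough illustration, the abstract class could be sketched with Python's standard abc module. Only the method names come from this document; the parameter names (odm_dict, gen_ctx, error_ctx, trans_ctx) and the choice to treat gen_report as required are assumptions made for this sketch.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Optional


class ValidationRule(ABC):
    """Illustrative base class; only the method names come from this document."""

    @abstractmethod
    def get_schema_keys(self) -> List[str]:
        """Return the schema keys that belong to this rule."""

    @abstractmethod
    def gen_schema_values(self, table: str, column: str,
                          odm_dict: Any, gen_ctx: Any) -> Optional[list]:
        """Return the schema values for a table-column pair, or None."""

    @abstractmethod
    def gen_report(self, value: Any, error_ctx: dict) -> Optional[dict]:
        """Return a report object for a failed validation, or None."""

    def transformed(self, trans_value: Any, orig_value: Any,
                    trans_ctx: dict) -> Optional[dict]:
        """Optional hook; by default a transformation produces no report."""
        return None

    # validate_data is optional, so this sketch leaves it undefined here.
    # A rule that needs custom validation simply adds the method.


class IncompleteRule(ValidationRule):
    """Forgets to implement gen_schema_values and gen_report."""

    def get_schema_keys(self) -> List[str]:
        return ["maxlength"]


try:
    IncompleteRule()
    error_message = ""
except TypeError as err:
    # Python refuses to instantiate a class with missing abstract methods.
    error_message = str(err)
```

Instantiating IncompleteRule raises a TypeError that names the missing methods, which is the kind of run-time check an abstract class provides without extra library code.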
get_schema_keys and gen_schema_values¶
These two functions are used in the schema generation process and are together used to construct the schema for a rule. The schema for a rule is an object that contains key-value pairs, where each key is the unique name for a piece of validation metadata, and the value contains the metadata.
The get_schema_keys method does not take any parameters and should
return a list of strings that represents the keys.
The gen_schema_values method takes the following 4 parameters:
The name of the table to generate the schema values for;
The name of the column in the above table to generate the schema values for;
An object that represents the ODM dictionary the schema is for. This can be used to generate the schema values for the provided table-column pair.
An object which contains metadata about the current generation process. It should contain the following metadata:
The version of the dictionary the schema is for
The method should return one of the following values:
None if the table-column pair is not applicable for the rule; or
A list of schema values, each of which matches up with the keys returned in the get_schema_keys method
For example, for the greater_than_max_length rule:
The get_schema_keys method would return the list ['maxlength']; and
The gen_schema_values method would return the list [6] for the siteID column and None otherwise
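The two methods for the greater_than_max_length rule might look like the sketch below. The shape of the dictionary object (a dict with a "parts" list mirroring the parts sheet, with cell values stored as strings), the rule that a part belongs to a table when its cell in that table's column is not "NA", and the contents of gen_ctx are all assumptions made for illustration.

```python
# Hypothetical in-memory form of the parts sheet shown earlier.
ODM_DICT = {
    "parts": [
        {"partID": "sites", "partType": "tables", "sites": "NA",
         "dataType": "NA", "maxValue": "NA", "maxLength": "NA"},
        {"partID": "siteID", "partType": "attributes", "sites": "pK",
         "dataType": "varchar", "maxValue": "NA", "maxLength": "6"},
        {"partID": "geoLat", "partType": "attributes", "sites": "header",
         "dataType": "integer", "maxValue": "90", "maxLength": "NA"},
        {"partID": "geoLong", "partType": "attributes", "sites": "header",
         "dataType": "integer", "maxValue": "90", "maxLength": "NA"},
    ]
}


class GreaterThanMaxLength:
    def get_schema_keys(self):
        return ["maxlength"]

    def gen_schema_values(self, table, column, odm_dict, gen_ctx):
        for part in odm_dict["parts"]:
            if part["partID"] != column:
                continue
            # Assumption: a part belongs to a table when its cell in the
            # table's column is anything other than "NA".
            in_table = part.get(table, "NA") != "NA"
            is_string = part["dataType"] == "varchar"
            has_max_length = part["maxLength"] != "NA"
            if in_table and is_string and has_max_length:
                return [int(part["maxLength"])]
        # The rule does not apply to this table-column pair.
        return None


rule = GreaterThanMaxLength()
gen_ctx = {"version": "hypothetical-version"}  # placeholder metadata
schema_values = {
    column: rule.gen_schema_values("sites", column, ODM_DICT, gen_ctx)
    for column in ("siteID", "geoLat", "geoLong")
}
```

Here schema_values holds [6] for siteID and None for the two integer columns, matching the behaviour described above.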
Similarly, if we were to implement the two methods for the
greater_than_max_value rule with the parts sheet displayed above:
The get_schema_keys method would return the list ['type', 'coerce', 'max']
The gen_schema_values method would return the list ['integer', 'integer', 90] for the geoLat and geoLong columns. It would return None for the other columns.
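One way the library could pair the keys with the values for a single table-column pair is sketched below. The helper name build_column_schema is hypothetical and is not part of the proposed API.

```python
def build_column_schema(keys, values):
    """Pair a rule's schema keys with the values for one table-column pair."""
    if values is None:
        # The rule does not apply to this table-column pair.
        return None
    if len(keys) != len(values):
        raise ValueError("expected one schema value per schema key")
    return dict(zip(keys, values))


keys = ["type", "coerce", "max"]
geo_lat_schema = build_column_schema(keys, ["integer", "integer", 90])
site_id_schema = build_column_schema(keys, None)
```

The positional match between the two lists is what lets the values list stay a plain list: the keys are known up front, and each value is read against the key at the same index.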
Due to the separation of the schema generation and validation processes, the two schema generation methods cannot be combined into one. This is explained in the "Two methods for schema generation?" section below.
validate_data¶
This method is optional. A rule can decide to implement it if nothing in cerberus meets its needs. It can also be used to override a cerberus rule.
If this method is implemented, then the get_schema_keys method
should only return one key. This is for ease of implementation since
there are currently no use cases for custom validation with multiple
schema keys.
The method takes the following 3 parameters:
The value to be validated.
The schema information to validate this value with. This is the schema value that the rule returned from its gen_schema_values method for the table-column pair that the value belongs to.
An object that contains metadata about the value. It should include the following:
The row that the value belongs to
The row index
The name of the column
The name of the table
The method should return a boolean, True if the value passed
validation, and False otherwise.
For example, with the greater_than_max_length rule and the sites
table printed above, the validate_data method would be called twice,
once for each row of the sites table, with the parameters below. Each
item in the list corresponds to one call of the validate_data method.
[
    {
        'value': '1234567',
        'schema': 6,
        'data_context': {
            'row': {'siteID': '1234567', 'geoLat': '91', 'geoLong': '89'},
            'row_num': 0,
            'column': 'siteID',
            'table': 'sites'
        }
    },
    {
        'value': '1',
        'schema': 6,
        'data_context': {
            'row': {'siteID': '1', 'geoLat': '89', 'geoLong': '91'},
            'row_num': 1,
            'column': 'siteID',
            'table': 'sites'
        }
    }
]
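A minimal validate_data sketch that accepts these calls is shown below. The parameter names are taken from the keys in the list above. Since cerberus already ships a maxlength rule, a real implementation of this rule may not need the method at all; the sketch only illustrates the call shape and return value.

```python
class GreaterThanMaxLength:
    def get_schema_keys(self):
        # A rule with custom validation returns exactly one key.
        return ["maxlength"]

    def validate_data(self, value, schema, data_context):
        # `schema` is the single value returned by gen_schema_values.
        return len(str(value)) <= schema


calls = [
    {
        "value": "1234567",
        "schema": 6,
        "data_context": {
            "row": {"siteID": "1234567", "geoLat": "91", "geoLong": "89"},
            "row_num": 0,
            "column": "siteID",
            "table": "sites",
        },
    },
    {
        "value": "1",
        "schema": 6,
        "data_context": {
            "row": {"siteID": "1", "geoLat": "89", "geoLong": "91"},
            "row_num": 1,
            "column": "siteID",
            "table": "sites",
        },
    },
]

rule = GreaterThanMaxLength()
results = [rule.validate_data(**call) for call in calls]
```

The first value has seven characters and fails validation, while the second passes.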
gen_report¶
This method is used by the rule to return a machine-actionable object for every failed validation. Each of these objects will be included in the final validation report.
It takes the following parameters:
The value that failed validation
An object that contains metadata about the failed validation. The metadata includes:
The name of the table the failed value is from
The index of the row the failed value is from
The name of the column the failed value is from
The row object where the failed value is from
The name of the schema key that failed validation
The value of the schema key
The method can return None or a report object.
A rule can return None to avoid duplicate errors in the report. For
example, with the greater_than_max_value rule, the gen_report method
could return an object only for the max schema key and None for the
type and coerce keys.
For example, for the greater_than_max_value rule, this method would be
called twice, once for the geoLat column value in row 1 and once for
the geoLong value in row 2.
It would be called with the parameters below. Each item in the
list has the parameters for one call of the gen_report method.
[
    {
        'value': 91,
        'error_context': {
            'table': 'sites',
            'row_num': 0,
            'column': 'geoLat',
            'row': {'siteID': '1234567', 'geoLat': '91', 'geoLong': '89'},
            'schema_key': 'max',
            'schema_value': 90
        }
    },
    {
        'value': 91,
        'error_context': {
            'table': 'sites',
            'row_num': 1,
            'column': 'geoLong',
            'row': {'siteID': '1', 'geoLat': '89', 'geoLong': '91'},
            'schema_key': 'max',
            'schema_value': 90
        }
    }
]
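A gen_report sketch for the greater_than_max_value rule could report only on the max schema key, as suggested above. Apart from errorType, the field names in the returned object (tableName, columnName, rowNumber, invalidValue, maxValue) are hypothetical.

```python
class GreaterThanMaxValue:
    def get_schema_keys(self):
        return ["type", "coerce", "max"]

    def gen_report(self, value, error_context):
        # Only the max key produces a report; failures on the type and
        # coerce keys are ignored here to avoid duplicate errors.
        if error_context["schema_key"] != "max":
            return None
        return {
            "errorType": "greater_than_max_value",
            # The fields below are hypothetical extras.
            "tableName": error_context["table"],
            "columnName": error_context["column"],
            "rowNumber": error_context["row_num"],
            "invalidValue": value,
            "maxValue": error_context["schema_value"],
        }


rule = GreaterThanMaxValue()
row = {"siteID": "1234567", "geoLat": "91", "geoLong": "89"}
base_ctx = {"table": "sites", "row_num": 0, "column": "geoLat", "row": row}

max_report = rule.gen_report(
    91, {**base_ctx, "schema_key": "max", "schema_value": 90}
)
type_report = rule.gen_report(
    "abc", {**base_ctx, "schema_key": "type", "schema_value": "integer"}
)
```

In this sketch max_report is a report object and type_report is None, so a single bad value produces at most one entry from this rule.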
transformed¶
This is an optional method used to inform the rule when a value has been
transformed. It's useful only if the cerberus coerce rule is used.
It takes three parameters:
The transformed value
The original value
An object that includes metadata about the transformed value. The metadata includes:
The name of the table the value is from
The index of the row the value is from
The name of the column the value is from
An object containing the row the value is from
The method can return None or a report object.
For example, for the same sites table and the greater_than_max_value
rule, this method would be called four times, once for each of the two
values in the geoLat and geoLong columns.
The parameters for each call are printed in the list below. Each list item corresponds to each call.
[
    {
        'trans_value': 91,
        'orig_value': '91',
        'trans_ctx': {
            'table': 'sites',
            'row_num': 0,
            'column': 'geoLat',
            'row': {'siteID': '1234567', 'geoLat': '91', 'geoLong': '89'}
        }
    },
    {
        'trans_value': 89,
        'orig_value': '89',
        'trans_ctx': {
            'table': 'sites',
            'row_num': 0,
            'column': 'geoLong',
            'row': {'siteID': '1234567', 'geoLat': '91', 'geoLong': '89'}
        }
    },
    {
        'trans_value': 89,
        'orig_value': '89',
        'trans_ctx': {
            'table': 'sites',
            'row_num': 1,
            'column': 'geoLat',
            'row': {'siteID': '1', 'geoLat': '89', 'geoLong': '91'}
        }
    },
    {
        'trans_value': 91,
        'orig_value': '91',
        'trans_ctx': {
            'table': 'sites',
            'row_num': 1,
            'column': 'geoLong',
            'row': {'siteID': '1', 'geoLat': '89', 'geoLong': '91'}
        }
    }
]
This method was added due to the less_than_min_value and
greater_than_max_value rules, where we needed to show warnings to the
user if a successful coercion took place.
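A transformed implementation that emits such a warning might look like the sketch below. The warning name value_coerced and the extra fields are hypothetical, as is the choice to warn only when the type of the value changed.

```python
class GreaterThanMaxValue:
    def transformed(self, trans_value, orig_value, trans_ctx):
        # Assumption: a warning is only worth reporting when coercion
        # actually changed the type of the value.
        if type(trans_value) is type(orig_value):
            return None
        return {
            "warningType": "value_coerced",  # hypothetical warning name
            "tableName": trans_ctx["table"],
            "columnName": trans_ctx["column"],
            "rowNumber": trans_ctx["row_num"],
            "originalValue": orig_value,
            "coercedValue": trans_value,
        }


rule = GreaterThanMaxValue()
trans_ctx = {
    "table": "sites",
    "row_num": 0,
    "column": "geoLat",
    "row": {"siteID": "1234567", "geoLat": "91", "geoLong": "89"},
}
warning = rule.transformed(91, "91", trans_ctx)
no_warning = rule.transformed(91, 91, trans_ctx)
```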
Two methods for schema generation?¶
It may seem that the functionality of get_schema_keys could be rolled
into gen_schema_values, so that the latter method returns the entire
schema object, with keys and values.
This is not possible because of the way the validate_data and
gen_report methods work. For both these methods the library needs to
know the schema keys that map to each rule:
If a rule implements its own validate_data method, the library needs to attach that method to cerberus, with the schema key associated with it.
For the gen_report method, the library needs to know how to map each cerberus error object to the rule it belongs to. It uses the schema key for this.
If there were only one schema generation method, the library could not rely on it to always return a schema object, since most rules are not applicable to all table-column pairs.
With get_schema_keys, the library can always know what schema keys
belong to a rule.
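The lookup this enables can be sketched as a small index built before any data is seen. The function name index_rules_by_schema_key and the two stub rule classes are hypothetical. Each key maps to a list because this sketch assumes that keys such as type and coerce may be shared by several rules.

```python
class MaxLengthRule:
    def get_schema_keys(self):
        return ["maxlength"]


class MaxValueRule:
    def get_schema_keys(self):
        return ["type", "coerce", "max"]


def index_rules_by_schema_key(rules):
    """Map every schema key to the rules that declared it."""
    index = {}
    for rule in rules:
        for key in rule.get_schema_keys():
            index.setdefault(key, []).append(rule)
    return index


max_length_rule = MaxLengthRule()
max_value_rule = MaxValueRule()
index = index_rules_by_schema_key([max_length_rule, max_value_rule])
```

With such an index, an error raised for the max key can be routed to the gen_report method of max_value_rule even if that rule never returned a schema for the tables being validated.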
Working with self¶
Certain rules need to keep track of information between method calls.
For example, the duplicate_entries_found rule needs to keep track of
all primaryKey-lastUpdated pairs it encounters during the validation
process. It uses this information to check whether a new value is unique
within a table, or if it has been encountered before.
The self parameter allows a rule to keep track of whatever state
information it needs between calls of the different methods.
For example, for the duplicate_entries_found rule, it can keep track
of each primaryKey-lastUpdated pair in self whenever validate_data
is called. It can then refer to that field when validating a new pair to
check if it is unique or not.
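A sketch of this idea is shown below. It assumes that the primary key is the value being validated, that the row carries a lastUpdated field, and that uniqueness is tracked per table. The sample rows are made up for illustration.

```python
class DuplicateEntriesFound:
    def __init__(self):
        # State kept on self between validate_data calls.
        self._seen_pairs = set()

    def validate_data(self, value, schema, data_context):
        row = data_context["row"]
        pair = (data_context["table"], value, row.get("lastUpdated"))
        if pair in self._seen_pairs:
            return False  # this pair was already encountered
        self._seen_pairs.add(pair)
        return True


# Made-up rows: the second repeats the first, the third has a new date.
rows = [
    {"siteID": "1", "lastUpdated": "2023-01-01"},
    {"siteID": "1", "lastUpdated": "2023-01-01"},
    {"siteID": "1", "lastUpdated": "2023-02-01"},
]

rule = DuplicateEntriesFound()
results = [
    rule.validate_data(
        row["siteID"],
        True,  # placeholder schema value
        {"row": row, "row_num": i, "column": "siteID", "table": "sites"},
    )
    for i, row in enumerate(rows)
]
```

Because the set lives on the instance, the same rule object has to be reused for every call within a validation run, and a fresh instance is needed for the next run.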
The Report object¶
Within the gen_report and transformed functions the validation rule
can decide to return an object that represents an error/warning report.
Depending on whether the report is an error or a warning the errorType
or warningType field should be set, with errorType being for the
former and warningType being for the latter.
If the report is an error, it should be added to the errors list in the validation report; if it is a warning, it should be added to the warnings list.
Finally, the validation rule is allowed to add any other metadata it deems necessary to the report object, for example, the offending value(s).
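The routing described above can be sketched as follows. The helper name add_to_report, the dictionary shape of the validation report, and the example field names other than errorType and warningType are assumptions.

```python
def add_to_report(report, entry):
    """Route a report object to the errors or warnings list."""
    if entry is None:
        # The rule chose not to report anything.
        return report
    if "errorType" in entry:
        report["errors"].append(entry)
    elif "warningType" in entry:
        report["warnings"].append(entry)
    else:
        raise ValueError("a report object needs an errorType or a warningType")
    return report


report = {"errors": [], "warnings": []}
add_to_report(report, {"errorType": "greater_than_max_value", "invalidValue": 91})
add_to_report(report, {"warningType": "value_coerced", "originalValue": "91"})
add_to_report(report, None)
```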