Standardized Validation Rule API¶
This document will go over the design for a standardized API to define validation rules in the library. It will also provide reasoning for some of the design decisions made.
Audience¶
The primary audience for this document is the developers of the validation library.
Context¶
Adding a new validation rule to the library cannot currently be done in an isolated manner. Programming the different pieces that go into a new rule more often than not requires modifying multiple files in the library. In addition, the code in these files is not directly related to the rule itself.
For example, the code for the duplicate_entries_found rule is in multiple spots in the library. The schema generation is in one file, and the validation logic/error generation is in another. Worse, it's not intuitive which parts of these files are immediately related to the rule.
There is a growing need to enable the users of the library to create their own rules. Users of the ODM come from a variety of backgrounds, disciplines, and professions. Trying to accommodate all the validation needs of such a diverse group of individuals is not possible, hence the need for a better way to implement rules.
Keeping the above in mind, the goal of this document is to specify a standardized, intuitive, and isolated API for the creation of validation rules. This API should work not only for the core set of rules included with the library, but for any future rules that need to be implemented.
Design Approach¶
The design for this API was developed by:
Going through all the rules currently implemented in the library as well as certain new rules that users have asked for;
Identifying the common elements between the rules; and
Developing an API to expose those elements to a user, in a way that works for all of the rules
The common elements or processes involved in each rule are explained below.
Rule processes¶
At a high level, implementing a new rule requires the following two processes:
Generating a schema; and
Validating data using the generated schema
There are other details that need to be thought of, for example, generating errors for the validation report, but they are all within the context of the above two processes.
Generating a schema¶
Generating a schema means returning the information needed by the rule to perform its validation during the validation process.
The rule needs to generate a schema for every table-column pair in the ODM dictionary. Certain table-column pairs may not be applicable to a rule, which the design needs to keep in mind.
The metadata needed to generate a schema will most probably come from an outside source. Currently, all this metadata comes from the ODM dictionary Excel sheet.
For example, consider the greater_than_max_length rule. It needs to know what the max length is for a particular value in order to implement its validation logic. In addition, the rule only applies to string columns with a defined max length. Finally, all the information needed to generate this rule comes from the maxLength column of a part, located in the parts sheet in the ODM dictionary.
Take a look at the parts sheet below,
Parts Sheet
┏━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓
┃ partID  ┃ partType   ┃ sites  ┃ dataType ┃ maxValue ┃ maxLength ┃
┡━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩
│ sites   │ tables     │ NA     │ NA       │ NA       │ NA        │
│ siteID  │ attributes │ pK     │ varchar  │ NA       │ 6         │
│ geoLat  │ attributes │ header │ integer  │ 90       │ NA        │
│ geoLong │ attributes │ header │ integer  │ 90       │ NA        │
└─────────┴────────────┴────────┴──────────┴──────────┴───────────┘
It defines 4 parts:
A sites table;
A siteID string column, part of the sites table;
A geoLat integer column, part of the sites table; and
A geoLong integer column, part of the sites table
The greater_than_max_length rule would generate a schema only for the siteID column in the sites table. The schema would include its maxLength value of 6.
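As a rough illustration of the example above, the sketch below derives the greater_than_max_length schema from an in-memory copy of the parts sheet. The PARTS list, the max_length_schema function, and the assumption that varchar is the only string data type are hypothetical and are not part of the library.

```python
# Hypothetical in-memory copy of the parts sheet; "NA" marks a missing value.
PARTS = [
    {"partID": "sites", "partType": "tables", "sites": "NA",
     "dataType": "NA", "maxValue": "NA", "maxLength": "NA"},
    {"partID": "siteID", "partType": "attributes", "sites": "pK",
     "dataType": "varchar", "maxValue": "NA", "maxLength": "6"},
    {"partID": "geoLat", "partType": "attributes", "sites": "header",
     "dataType": "integer", "maxValue": "90", "maxLength": "NA"},
    {"partID": "geoLong", "partType": "attributes", "sites": "header",
     "dataType": "integer", "maxValue": "90", "maxLength": "NA"},
]


def max_length_schema(parts):
    """Map each applicable (table, column) pair to its max length."""
    tables = [p["partID"] for p in parts if p["partType"] == "tables"]
    schema = {}
    for part in parts:
        # Assumption: only varchar columns with a maxLength are applicable.
        if part["dataType"] != "varchar" or part["maxLength"] == "NA":
            continue
        for table in tables:
            # A non-NA entry under the table's name places the column in it.
            if part.get(table, "NA") != "NA":
                schema[(table, part["partID"])] = int(part["maxLength"])
    return schema


print(max_length_schema(PARTS))  # {('sites', 'siteID'): 6}
```

Only the siteID column survives the filter, which matches the outcome described above.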
Validating data¶
Validating data means asking the rule if a value in a dataset is valid, according to the information returned in the schema generation process.
For example, for the dataset below,
Sites Table
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ siteID  ┃ geoLat ┃ geoLong ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ 1234567 │ 91     │ 89      │
│ 1       │ 89     │ 91      │
└─────────┴────────┴─────────┘
the validation logic for the greater_than_max_length rule would be run for all values in the siteID column.
This step is optional if the desired validation can be done by cerberus.
Specifications¶
These next few sections will go over the different elements of the API.
Keep in mind that the library uses cerberus as its validation engine. Some of its idiosyncrasies may leak into the design.
The ValidationRule class¶
This is an abstract class that every validation rule that can be used with the library will need to extend.
By using an abstract class it becomes clear what methods need to be implemented when programming a new rule. Run-time checks can also show the user an error if a required method has not been implemented.
The ValidationRule class has the following methods:
get_schema_keys
gen_schema_values
validate_data
gen_report
transformed
The first two methods are used in the schema generation process while the other three methods are used in the validation process.
In the following sections we will go over the listed methods in detail.
Every method by default takes the Python self parameter as its first argument. Its presence is assumed and will not be mentioned unless warranted.
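A minimal sketch of such an abstract class, using Python's abc module, might look like the following. The parameter names for gen_schema_values, the choice of which methods are abstract, and the default bodies of the optional methods are assumptions made for illustration. They are not the library's confirmed signatures.

```python
from abc import ABC, abstractmethod


class ValidationRule(ABC):
    """Sketch of the base class; the signatures are assumptions."""

    @abstractmethod
    def get_schema_keys(self):
        """Return the list of schema keys the rule uses."""

    @abstractmethod
    def gen_schema_values(self, table, column, dictionary, gen_context):
        """Return a list of schema values, or None if not applicable."""

    @abstractmethod
    def gen_report(self, value, error_context):
        """Return a report object, or None to skip the failure."""

    # Optional hooks get a default, so rules override them only when needed.
    def validate_data(self, value, schema, data_context):
        return True

    def transformed(self, trans_value, orig_value, trans_ctx):
        return None


class IncompleteRule(ValidationRule):
    # gen_schema_values and gen_report are deliberately left out.
    def get_schema_keys(self):
        return ["maxlength"]


try:
    IncompleteRule()
except TypeError as err:
    # Python refuses to instantiate a class with missing abstract methods.
    print(f"Cannot create rule: {err}")
```

With this approach the run-time check comes for free: instantiating a rule that lacks a required method raises a TypeError naming the missing methods.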
get_schema_keys and gen_schema_values¶
These two methods are used together in the schema generation process to construct the schema for a rule. The schema for a rule is an object that contains key-value pairs, where each key is the unique name for a piece of validation metadata, and the value contains the metadata.
The get_schema_keys method does not take any parameters and should return a list of strings that represents the keys.
The gen_schema_values method takes the following 4 parameters:
The name of the table to generate the schema values for;
The name of the column in the above table to generate the schema values for;
An object that represents the ODM dictionary the schema is for. This can be used to generate the schema values for the provided table-column pair.
An object which contains metadata about the current generation process. It should contain the following metadata:
The version of the dictionary the schema is for
The method should return one of the following values:
None if the table-column pair is not applicable for the rule; or
A list of schema values, each of which matches up with the keys returned by the get_schema_keys method
For example, for the greater_than_max_length rule:
The get_schema_keys method would return the list ['maxlength']; and
The gen_schema_values method would return the list [6] for the siteID column and None otherwise
Similarly, if we were to implement the two methods for the greater_than_max_value rule with the parts sheet displayed above:
The get_schema_keys method would return the list ['type', 'coerce', 'max']
The gen_schema_values method would return the list ['integer', 'integer', 90] for the geoLat and geoLong columns. It would return None for the other columns.
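The sketch below shows one way the two methods could fit together for the greater_than_max_value rule, and how a caller might pair the keys with the values. The PARTS mapping stands in for the dictionary object, and build_column_schema and the contents of gen_context are hypothetical helpers for this illustration, not part of the library.

```python
# Hypothetical dictionary object: (table, column) -> part metadata.
PARTS = {
    ("sites", "siteID"): {"dataType": "varchar", "maxValue": None},
    ("sites", "geoLat"): {"dataType": "integer", "maxValue": 90},
    ("sites", "geoLong"): {"dataType": "integer", "maxValue": 90},
}


class GreaterThanMaxValue:
    def get_schema_keys(self):
        return ["type", "coerce", "max"]

    def gen_schema_values(self, table, column, dictionary, gen_context):
        part = dictionary.get((table, column))
        # The rule does not apply to columns without a max value.
        if part is None or part["maxValue"] is None:
            return None
        return [part["dataType"], part["dataType"], part["maxValue"]]


def build_column_schema(rule, table, column, dictionary, gen_context):
    """Pair the rule's keys with its values for one table-column pair."""
    values = rule.gen_schema_values(table, column, dictionary, gen_context)
    if values is None:
        return None
    return dict(zip(rule.get_schema_keys(), values))


rule = GreaterThanMaxValue()
gen_context = {"version": "1.0.0"}  # hypothetical dictionary version
print(build_column_schema(rule, "sites", "geoLat", PARTS, gen_context))
# {'type': 'integer', 'coerce': 'integer', 'max': 90}
print(build_column_schema(rule, "sites", "siteID", PARTS, gen_context))
# None
```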
Due to the separation of the schema generation and validation processes, the two schema generation methods cannot be combined into one method. This is explained below.
validate_data¶
This method is optional. A rule can decide to implement it if nothing in cerberus meets its needs. It can also be used to override a cerberus rule.
If this method is implemented, then the get_schema_keys method should only return one key. This is for ease of implementation since there are currently no use cases for custom validation with multiple schema keys.
The method takes the following 3 parameters:
The value to be validated.
The schema information to validate this value with. This is the schema value that the rule returned from its gen_schema_values method for the table-column pair that the value belongs to.
An object that contains metadata about the value. It should include the following:
The row that the value belongs to
The row index
The name of the column
The name of the table
The method should return a boolean, True if the value passed validation, and False otherwise.
For example, with the greater_than_max_length rule and the sites table printed above, the validate_data method would be called twice, once for each row of the sites table, with the parameters below. Each item in the list corresponds to one call of the validate_data method.
[
    {
        'value': '1234567',
        'schema': 6,
        'data_context': {
            'row': {'siteID': '1234567', 'geoLat': '91', 'geoLong': '89'},
            'row_num': 0,
            'column': 'siteID',
            'table': 'sites'
        }
    },
    {
        'value': '1',
        'schema': 6,
        'data_context': {
            'row': {'siteID': '1', 'geoLat': '89', 'geoLong': '91'},
            'row_num': 1,
            'column': 'siteID',
            'table': 'sites'
        }
    }
]
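To make the calls above concrete, here is a minimal sketch of a validate_data implementation for the greater_than_max_length rule, run against the same two parameter sets. The class body is an assumption about how such a rule could be written. In practice the built-in cerberus maxlength rule may already cover this case.

```python
class GreaterThanMaxLength:
    def get_schema_keys(self):
        # A single key, as required when validate_data is implemented.
        return ["maxlength"]

    def validate_data(self, value, schema, data_context):
        # The value is valid when it is no longer than the max length.
        return len(str(value)) <= schema


calls = [
    {
        "value": "1234567",
        "schema": 6,
        "data_context": {
            "row": {"siteID": "1234567", "geoLat": "91", "geoLong": "89"},
            "row_num": 0,
            "column": "siteID",
            "table": "sites",
        },
    },
    {
        "value": "1",
        "schema": 6,
        "data_context": {
            "row": {"siteID": "1", "geoLat": "89", "geoLong": "91"},
            "row_num": 1,
            "column": "siteID",
            "table": "sites",
        },
    },
]

rule = GreaterThanMaxLength()
results = [rule.validate_data(**call) for call in calls]
print(results)  # [False, True]
```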
gen_report¶
This method is used by the rule to return a machine-actionable object for every failed validation. Each of these objects will be included in the final validation report.
It takes the following parameters:
The value that failed validation
An object that contains metadata about the failed validation. The metadata includes:
The name of the table the failed value is from
The index of the row the failed value is from
The name of the column the failed value is from
The row object where the failed value is from
The name of the schema key that failed validation
The value of the schema key
The method can return None or a report object.
A rule can return None, for example, to avoid duplicate errors in the report. With the greater_than_max_value rule, the gen_report method could return an object only for the max schema key.
For example, for the greater_than_max_value rule and the sites table above, this method would be called twice, once for the geoLat value in the first row and once for the geoLong value in the second row.
It would be called with the parameters below. Each item in the list has the parameters for one call of the gen_report method.
[
    {
        'value': 91,
        'error_context': {
            'table': 'sites',
            'row_num': 0,
            'column': 'geoLat',
            'row': {'siteID': '1234567', 'geoLat': '91', 'geoLong': '89'},
            'schema_key': 'max',
            'schema_value': 90
        }
    },
    {
        'value': 91,
        'error_context': {
            'table': 'sites',
            'row_num': 1,
            'column': 'geoLong',
            'row': {'siteID': '1', 'geoLat': '89', 'geoLong': '91'},
            'schema_key': 'max',
            'schema_value': 90
        }
    }
]
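A gen_report implementation for the greater_than_max_value rule could be sketched as follows. Only the errorType field comes from this document. The other field names in the returned object are hypothetical, chosen to show how the error context might be copied into a report.

```python
class GreaterThanMaxValue:
    def gen_report(self, value, error_context):
        # Report only the failed "max" check; failures of the rule's other
        # schema keys (for example "type") are skipped to avoid duplicates.
        if error_context["schema_key"] != "max":
            return None
        return {
            "errorType": "greater_than_max_value",
            # The field names below are hypothetical.
            "tableName": error_context["table"],
            "columnName": error_context["column"],
            "rowNumber": error_context["row_num"],
            "invalidValue": value,
            "maxValue": error_context["schema_value"],
        }


rule = GreaterThanMaxValue()
error_context = {
    "table": "sites",
    "row_num": 0,
    "column": "geoLat",
    "row": {"siteID": "1234567", "geoLat": "91", "geoLong": "89"},
    "schema_key": "max",
    "schema_value": 90,
}
report = rule.gen_report(91, error_context)
print(report["errorType"], report["columnName"], report["invalidValue"])

# The same failure reported under another schema key is dropped.
skipped = rule.gen_report(91, {**error_context, "schema_key": "type"})
print(skipped)  # None
```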
transformed¶
This is an optional method used to inform the rule when a value has been transformed. It's useful only if the cerberus coerce rule is used.
It takes three parameters:
The transformed value
The original value
An object that includes metadata about the transformed value. The metadata includes:
The name of the table the value is from
The index of the row the value is from
The name of the column the value is from
An object containing the row the value is from
The method can return None or a report object.
For example, for the same sites table and the greater_than_max_value rule, this method would be called four times, once for each value in the geoLat and geoLong columns.
The parameters for each call are printed in the list below. Each list item corresponds to each call.
[
    {
        'trans_value': 91,
        'orig_value': '91',
        'trans_ctx': {
            'table': 'sites',
            'row_num': 0,
            'column': 'geoLat',
            'row': {'siteID': '1234567', 'geoLat': '91', 'geoLong': '89'}
        }
    },
    {
        'trans_value': 89,
        'orig_value': '89',
        'trans_ctx': {
            'table': 'sites',
            'row_num': 0,
            'column': 'geoLong',
            'row': {'siteID': '1234567', 'geoLat': '91', 'geoLong': '89'}
        }
    },
    {
        'trans_value': 89,
        'orig_value': '89',
        'trans_ctx': {
            'table': 'sites',
            'row_num': 1,
            'column': 'geoLat',
            'row': {'siteID': '1', 'geoLat': '89', 'geoLong': '91'}
        }
    },
    {
        'trans_value': 91,
        'orig_value': '91',
        'trans_ctx': {
            'table': 'sites',
            'row_num': 1,
            'column': 'geoLong',
            'row': {'siteID': '1', 'geoLat': '89', 'geoLong': '91'}
        }
    }
]
This method was added due to the less_than_min_value and greater_than_max_value rules, where we needed to show warnings to the user if a successful coercion took place.
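One possible shape for such a warning is sketched below. The warningType field is described later in this document. Its value, the remaining field names, and the type comparison used to detect a coercion are all assumptions made for this example.

```python
class GreaterThanMaxValue:
    def transformed(self, trans_value, orig_value, trans_ctx):
        # No warning is needed when the value already had the target type.
        if type(trans_value) is type(orig_value):
            return None
        return {
            "warningType": "value_coerced",  # hypothetical warning name
            # The field names below are hypothetical.
            "tableName": trans_ctx["table"],
            "columnName": trans_ctx["column"],
            "rowNumber": trans_ctx["row_num"],
            "originalValue": orig_value,
            "coercedValue": trans_value,
        }


rule = GreaterThanMaxValue()
trans_ctx = {
    "table": "sites",
    "row_num": 0,
    "column": "geoLat",
    "row": {"siteID": "1234567", "geoLat": "91", "geoLong": "89"},
}
warning = rule.transformed(91, "91", trans_ctx)
print(warning["warningType"], repr(warning["originalValue"]))

# A value that was already an integer produces no warning.
print(rule.transformed(91, 91, trans_ctx))  # None
```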
Two methods for schema generation?¶
It may seem that the functionality of get_schema_keys could be rolled into gen_schema_values, so that the latter method returns the entire schema object, with keys and values.
This is not possible because of the way the validate_data and gen_report methods work. For both of these methods the library needs to know the schema keys that map to each rule:
If a rule implements its own validate_data method, the library needs to attach that method to cerberus, with the schema key associated with it.
For the gen_report method, the library needs to know how to map each cerberus error object to the rule it belongs to. It uses the schema key for this.
If there were only one schema generation method, the library could not rely on it to always return a schema object, since most rules are not applicable to all table-column pairs.
With get_schema_keys, the library can always know what schema keys belong to a rule.
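The lookup this enables can be sketched without cerberus. The Rule stand-in class and the index_rules_by_key helper below are hypothetical. They only show that the keys are available up front, before any table-column pair has been processed.

```python
class Rule:
    """Hypothetical stand-in for a ValidationRule subclass."""

    def __init__(self, name, keys):
        self.name = name
        self._keys = keys

    def get_schema_keys(self):
        return self._keys


def index_rules_by_key(rules):
    """Map every schema key to the names of the rules that declare it."""
    index = {}
    for rule in rules:
        for key in rule.get_schema_keys():
            index.setdefault(key, []).append(rule.name)
    return index


rules = [
    Rule("greater_than_max_length", ["maxlength"]),
    Rule("greater_than_max_value", ["type", "coerce", "max"]),
]
index = index_rules_by_key(rules)

# An error raised for the "max" schema key maps back to its rule.
print(index["max"])  # ['greater_than_max_value']
print(index["maxlength"])  # ['greater_than_max_length']
```

The index is built from get_schema_keys alone, so it does not depend on whether gen_schema_values returned None for any particular column.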
Working with self¶
Certain rules need to keep track of information between method calls. For example, the duplicate_entries_found rule needs to keep track of all primaryKey-lastUpdated pairs it encounters during the validation process. It uses this information to check whether a new value is unique within a table, or if it has been encountered before.
The self parameter allows a rule to keep track of whatever state information it needs between calls of the different methods.
For example, the duplicate_entries_found rule can keep track of each primaryKey-lastUpdated pair in self whenever validate_data is called. It can then refer to that field when validating a new pair to check if it is unique or not.
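A minimal sketch of this pattern follows. It assumes that validate_data receives the primary key as the value and that the row carries a lastUpdated field. The _seen attribute and the make_context helper are hypothetical names used only here.

```python
class DuplicateEntriesFound:
    def __init__(self):
        # State kept on self between validate_data calls.
        self._seen = set()

    def validate_data(self, value, schema, data_context):
        row = data_context["row"]
        # Assumption: value is the primary key and the row has lastUpdated.
        pair = (data_context["table"], value, row.get("lastUpdated"))
        if pair in self._seen:
            return False
        self._seen.add(pair)
        return True


def make_context(site_id, last_updated, row_num):
    """Build a data context like the ones shown for validate_data."""
    return {
        "row": {"siteID": site_id, "lastUpdated": last_updated},
        "row_num": row_num,
        "column": "siteID",
        "table": "sites",
    }


rule = DuplicateEntriesFound()
results = [
    rule.validate_data("1", True, make_context("1", "2022-01-01", 0)),
    rule.validate_data("1", True, make_context("1", "2022-01-01", 1)),
    rule.validate_data("1", True, make_context("1", "2022-02-01", 2)),
]
print(results)  # [True, False, True]
```

The second call fails because the same pair was recorded during the first call. The third passes because its lastUpdated value differs.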
The Report object¶
Within the gen_report and transformed methods the validation rule can decide to return an object that represents an error/warning report.
Depending on whether the report is an error or a warning, the errorType or warningType field should be set, with errorType being for the former and warningType being for the latter.
If the report is an error then it should be added to the errors list in the report, and if it is a warning it should be added to the warnings list.
Finally, the validation rule is allowed to add any other metadata it deems necessary to the report object, for example, the offending value(s).
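The routing described above could be sketched like this. The add_to_report helper and the extra metadata fields are hypothetical. Only the errorType and warningType fields and the errors and warnings lists come from this document.

```python
def add_to_report(report, entry):
    """Append a report object to the list that matches its type field."""
    if "errorType" in entry:
        report["errors"].append(entry)
    elif "warningType" in entry:
        report["warnings"].append(entry)
    else:
        raise ValueError("report object needs errorType or warningType")


report = {"errors": [], "warnings": []}

# Extra metadata such as the offending value is allowed on either kind.
add_to_report(report, {"errorType": "greater_than_max_value",
                       "invalidValue": 91})
add_to_report(report, {"warningType": "value_coerced",
                       "originalValue": "91"})

print(len(report["errors"]), len(report["warnings"]))  # 1 1
```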