less_than_min_length¶
This rule ensures that the number of characters in a string type column
is not less than a defined minimum value. As an example, consider the
phone
column in the contacts
table whose minimum length value is 10.
The following dataset snippet would fail validation,
Invalid Dataset ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ contactID ┃ phone ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ 1 │ 123-456-1 │ └────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────┘
The phone
column in the first row has only 9 characters which is less
than the minimum length value of 10, whereas the following dataset
snippet would pass validation,
Valid Dataset ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ contactID ┃ phone ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ 1 │ 123-456-23 │ │ 2 │ 123-456-231 │ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────┘
The phone
columns in both rows have a number of characters equal to 10
and 11 respectively which is greater than or equal to the minimum length
value of 10.
Error report¶
The error report should have the following fields,
errorType: less_than_min_length
tableName: The name of the table with the invalid value
columnName: The name of the column with the invalid value
rowNumber: The index of the row with the invalid value
row: The dictionary containing the invalid row
invalidValue: The invalid value
validationRuleFields: The ODM data dictionary rows used to generate this rule
message: Value <invalid_value> in row <row_index> in column <column_name> in table <table_name> has length <invalid_length> which is less than the min length of <min_length>
The error report object for the example invalid row above is shown below,
{ │ 'errors': [ │ │ { │ │ │ 'errorType': 'less_than_min_length', │ │ │ 'tableName': 'contacts', │ │ │ 'columnName': 'phone', │ │ │ 'rowNumber': 1, │ │ │ 'row': { │ │ │ │ 'contactID': '1', │ │ │ │ 'phone': '123-456-1' │ │ │ }, │ │ │ 'invalidValue': '123-456-1', │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'phone', │ │ │ │ │ 'minLength': '10', │ │ │ │ │ 'contacts': 'header' │ │ │ │ } │ │ │ ], │ │ │ 'message': 'less_than_min_length rule violated in table contacts, column phone, row(s) 1: Value "123-456-1" (of length 9) is less than the min length of "10"' │ │ } │ ], │ 'warnings': [] }
Rule metadata¶
All the metadata for this rule is contained in the parts sheet in the ODM dictionary. The steps involved are:
Filter the rows to only include those whose data type is a string. Currently, these are
varchar
columns.Check whether the rows has a minimum length value. If it does then add the validation rule for that column.
The
minLength
column contains this metadataThe possible values are a number or
NA
If the value is
NA
then this column does not have a minimum length value
For example in the following ODM dictionary snippet,
Parts v2 ┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓ ┃ partID ┃ partType ┃ addresses ┃ contacts ┃ dataType ┃ minLength ┃ status ┃ ┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩ │ contacts │ tables │ NA │ NA │ NA │ NA │ active │ │ phone │ attributes │ NA │ header │ varchar │ 10 │ active │ │ addresses │ tables │ NA │ NA │ NA │ NA │ active │ │ addL2 │ attributes │ header │ NA │ varchar │ NA │ active │ │ zero │ attributes │ header │ NA │ varchar │ 0 │ active │ │ negative │ attributes │ header │ NA │ varchar │ -1 │ active │ └─────────────────┴──────────────────┴────────────────┴───────────────┴───────────────┴────────────────┴──────────┘
The phone
in the contacts
table would have this validation rule but
the addL2
column in the addresses
table would not.
Cerberus Schema¶
We can use the minlength validation rule in the cerberus library to implement this.
The generated cerberus object for the example above is shown below,
{ │ 'schemaVersion': '2.0.0', │ 'schema': { │ │ 'contacts': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': { │ │ │ │ │ 'phone': { │ │ │ │ │ │ 'minlength': 10, │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ 'ruleID': 'less_than_min_length', │ │ │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ │ │ 'partID': 'phone', │ │ │ │ │ │ │ │ │ │ 'minLength': '10', │ │ │ │ │ │ │ │ │ │ 'contacts': 'header' │ │ │ │ │ │ │ │ │ } │ │ │ │ │ │ │ │ ] │ │ │ │ │ │ │ } │ │ │ │ │ │ ] │ │ │ │ │ } │ │ │ │ }, │ │ │ │ 'meta': [ │ │ │ │ │ { │ │ │ │ │ │ 'partID': 'contacts', │ │ │ │ │ │ 'partType': 'tables' │ │ │ │ │ } │ │ │ │ ] │ │ │ } │ │ } │ } }
Parts with minLength
less or equal to zero will be ignored.
The metadata for this rule should have the row from the ODM parts sheet
for the column with its partID
and minLength
column values.
ODM Version 1¶
When generating the schema for version 1, we check whether the column has an equivalent part in version 1. If it does, then we add it to the cerberus schema. For example,
Parts v1 ┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┓ ┃ partID ┃ partType ┃ addresses ┃ contacts ┃ dataType ┃ minLength ┃ version1… ┃ version1… ┃ version… ┃ status ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩ │ contacts │ tables │ NA │ NA │ NA │ NA │ tables │ Contact │ NA │ active │ │ phone │ attribut… │ NA │ header │ varchar │ 10 │ variables │ Contact │ phoneNu… │ active │ │ addresses │ tables │ NA │ NA │ NA │ NA │ tables │ Address │ NA │ active │ │ addL2 │ attribut… │ header │ NA │ varchar │ NA │ variables │ Address │ Address… │ active │ │ zero │ attribut… │ header │ NA │ varchar │ 0 │ variables │ Address │ Zero │ active │ │ negative │ attribut… │ header │ NA │ varchar │ -1 │ variables │ Address │ Neg │ active │ └───────────┴───────────┴───────────┴──────────┴──────────┴───────────┴───────────┴───────────┴──────────┴────────┘
The corresponding cerberus schema would be,
{ │ 'schemaVersion': '1.0.0', │ 'schema': { │ │ 'Contact': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': { │ │ │ │ │ 'phoneNumber': { │ │ │ │ │ │ 'minlength': 10, │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ 'ruleID': 'less_than_min_length', │ │ │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ │ │ 'partID': 'phone', │ │ │ │ │ │ │ │ │ │ 'minLength': '10', │ │ │ │ │ │ │ │ │ │ 'contacts': 'header', │ │ │ │ │ │ │ │ │ │ 'version1Location': 'variables', │ │ │ │ │ │ │ │ │ │ 'version1Table': 'Contact', │ │ │ │ │ │ │ │ │ │ 'version1Variable': 'phoneNumber' │ │ │ │ │ │ │ │ │ } │ │ │ │ │ │ │ │ ] │ │ │ │ │ │ │ } │ │ │ │ │ │ ] │ │ │ │ │ } │ │ │ │ }, │ │ │ │ 'meta': [ │ │ │ │ │ { │ │ │ │ │ │ 'partID': 'contacts', │ │ │ │ │ │ 'partType': 'tables', │ │ │ │ │ │ 'version1Location': 'tables', │ │ │ │ │ │ 'version1Table': 'Contact' │ │ │ │ │ } │ │ │ │ ] │ │ │ } │ │ } │ } }
The metadata should include the following columns,
The
partID
column valueThe
minLength
column valueThe
version1Location
column valueThe
version1Table
column valueThe
version1Variable
column value