greater_than_max_length¶
This rule ensures that the number of characters in a string type column
is not greater than a defined maximum value. As an example, consider the
phone
column in the contacts
table whose maximum length value is 10.
The following dataset snippet would fail validation,
Invalid Dataset ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ contactID ┃ phone ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ 1 │ 123-456-111 │ └────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────┘
The phone
column in the first row has 11 characters which is greater
than the maximum length value of 10, whereas the following dataset
snippet would pass validation,
Valid Dataset ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ contactID ┃ phone ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ 1 │ 123-456-23 │ │ 2 │ 123-456-2 │ └──────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────┘
The phone
columns in both rows have a number of characters equal to 10
and 9 respectively which is less than or equal to the maximum length
value of 10.
Error report¶
The error report should have the following fields,
errorType: greater_than_max_length
tableName: The name of the table with the invalid value
columnName: The name of the column with the invalid value
rowNumber: The index of the row with the invalid value
row: The dictionary containing the invalid row
invalidValue: The invalid value
validationRuleFields: The ODM data dictionary rows used to generate this rule
message: Value <invalid_value> in row <row_index> in column <column_name> in table <table_name> has length <invalid_length> which is greater than the max length of <max_length>
The error report object for the example invalid row above is shown below,
{ │ 'errors': [ │ │ { │ │ │ 'errorType': 'greater_than_max_length', │ │ │ 'tableName': 'contacts', │ │ │ 'columnName': 'phone', │ │ │ 'rowNumber': 1, │ │ │ 'row': { │ │ │ │ 'contactID': '1', │ │ │ │ 'phone': '123-456-111' │ │ │ }, │ │ │ 'invalidValue': '123-456-111', │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'phone', │ │ │ │ │ 'maxLength': '10', │ │ │ │ │ 'contacts': 'header' │ │ │ │ } │ │ │ ], │ │ │ 'message': 'greater_than_max_length rule violated in table contacts, column phone, row(s) 1: Value "123-456-111" (of length 11) exceeds the max length of "10"' │ │ } │ ], │ 'warnings': [] }
Rule metadata¶
All the metadata for this rule is contained in the parts sheet in the ODM dictionary. The steps involved are:
Filter the rows to only include those whose data type is a string. Currently, these are
varchar
columns.Check whether the rows has a maximum length value. If it does then add the validation rule for that column.
The
maxLength
column contains this metadataThe possible values are a number or
NA
If the value is
NA
then this column does not have a maximum length value
For example in the following ODM dictionary snippet,
Parts v2 ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓ ┃ partID ┃ partType ┃ addresses ┃ contacts ┃ dataType ┃ maxLength ┃ status ┃ firstRelease ┃ lastUpdated ┃ ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩ │ addresses │ tables │ NA │ NA │ NA │ NA │ active │ 1.0.0 │ 2.0.0 │ │ addL2 │ attributes │ header │ NA │ varchar │ NA │ active │ 1.0.0 │ 2.0.0 │ │ contacts │ tables │ NA │ NA │ NA │ NA │ active │ 1.0.0 │ 2.0.0 │ │ phone │ attributes │ NA │ header │ varchar │ 10 │ active │ 1.0.0 │ 2.0.0 │ └────────────┴─────────────┴────────────┴───────────┴───────────┴───────────┴────────┴──────────────┴─────────────┘
The phone
column in the contacts
table would have this validation
rule but the addL2
column in the addresses
table would not.
Cerberus Schema¶
We can use the maxlength validation rule in the cerberus library to implement this.
The generated cerberus object for the example above is shown below,
{ │ 'schemaVersion': '2.0.0', │ 'schema': { │ │ 'contacts': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': { │ │ │ │ │ 'phone': { │ │ │ │ │ │ 'maxlength': 10, │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ 'ruleID': 'greater_than_max_length', │ │ │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ │ │ 'partID': 'phone', │ │ │ │ │ │ │ │ │ │ 'maxLength': '10', │ │ │ │ │ │ │ │ │ │ 'contacts': 'header' │ │ │ │ │ │ │ │ │ } │ │ │ │ │ │ │ │ ] │ │ │ │ │ │ │ } │ │ │ │ │ │ ] │ │ │ │ │ } │ │ │ │ }, │ │ │ │ 'meta': [ │ │ │ │ │ { │ │ │ │ │ │ 'partID': 'contacts', │ │ │ │ │ │ 'partType': 'tables' │ │ │ │ │ } │ │ │ │ ] │ │ │ } │ │ } │ } }
The metadata for this rule should have the row from the ODM parts sheet
for the column with its partID
and maxLength
column values.
ODM Version 1¶
When generating the schema for version 1, we check whether the column has an equivalent part in version 1. If it does, then we add it to the cerberus schema. For example,
Parts v1 ┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓ ┃ partID ┃ partT… ┃ addres… ┃ conta… ┃ dataTy… ┃ maxLe… ┃ versio… ┃ versi… ┃ versio… ┃ status ┃ first… ┃ lastUp… ┃ ┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩ │ addres… │ tables │ NA │ NA │ NA │ NA │ tables │ Addre… │ NA │ active │ 1.0.0 │ 2.0.0 │ │ addL2 │ attri… │ header │ NA │ varchar │ NA │ variab… │ Addre… │ AddLin… │ active │ 1.0.0 │ 2.0.0 │ │ contac… │ tables │ NA │ NA │ NA │ NA │ tables │ Conta… │ NA │ active │ 1.0.0 │ 2.0.0 │ │ phone │ attri… │ NA │ header │ varchar │ 10 │ variab… │ Conta… │ phoneN… │ active │ 1.0.0 │ 2.0.0 │ └─────────┴────────┴─────────┴────────┴─────────┴────────┴─────────┴────────┴─────────┴────────┴────────┴─────────┘
The corresponding cerberus schema would be,
{ │ 'schemaVersion': '1.0.0', │ 'schema': { │ │ 'Contact': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': { │ │ │ │ │ 'phoneNumber': { │ │ │ │ │ │ 'maxlength': 10, │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ 'ruleID': 'greater_than_max_length', │ │ │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ │ │ 'partID': 'phone', │ │ │ │ │ │ │ │ │ │ 'contacts': 'header', │ │ │ │ │ │ │ │ │ │ 'maxLength': '10', │ │ │ │ │ │ │ │ │ │ 'version1Location': 'variables', │ │ │ │ │ │ │ │ │ │ 'version1Table': 'Contact', │ │ │ │ │ │ │ │ │ │ 'version1Variable': 'phoneNumber' │ │ │ │ │ │ │ │ │ } │ │ │ │ │ │ │ │ ] │ │ │ │ │ │ │ } │ │ │ │ │ │ ] │ │ │ │ │ } │ │ │ │ }, │ │ │ │ 'meta': [ │ │ │ │ │ { │ │ │ │ │ │ 'partID': 'contacts', │ │ │ │ │ │ 'partType': 'tables', │ │ │ │ │ │ 'version1Location': 'tables', │ │ │ │ │ │ 'version1Table': 'Contact' │ │ │ │ │ } │ │ │ │ ] │ │ │ } │ │ } │ } }
The metadata should include the following columns,
The
partID
column valueThe
maxLength
column valueThe
version1Location
column valueThe
version1Table
column valueThe
version1Variable
column value