greater_than_max_length

This rule ensures that the number of characters in a string type column is not greater than a defined maximum value. As an example, consider the phone column in the contacts table whose maximum length value is 10. The following dataset snippet would fail validation,

                                                  Invalid Dataset                                                  
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ contactID                                           phone                                                      ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                                  │ 123-456-111                                                │
└────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────┘

The phone column in the first row has 11 characters which is greater than the maximum length value of 10, whereas the following dataset snippet would pass validation,

                                                   Valid Dataset                                                   
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ contactID                                             phone                                                    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                                    │ 123-456-23                                               │
│ 2                                                    │ 123-456-2                                                │
└──────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────┘

The phone columns in both rows have a number of characters equal to 10 and 9 respectively which is less than or equal to the maximum length value of 10.

Error report

The error report should have the following fields,

  • errorType: greater_than_max_length

  • tableName: The name of the table with the invalid value

  • columnName: The name of the column with the invalid value

  • rowNumber: The index of the row with the invalid value

  • row: The dictionary containing the invalid row

  • invalidValue: The invalid value

  • validationRuleFields: The ODM data dictionary rows used to generate this rule

  • message: Value <invalid_value> in row <row_index> in column <column_name> in table <table_name> has length <invalid_length> which is greater than the max length of <max_length>

The error report object for the example invalid row above is shown below,

{
'errors': [
│   │   {
│   │   │   'errorType': 'greater_than_max_length',
│   │   │   'tableName': 'contacts',
│   │   │   'columnName': 'phone',
│   │   │   'rowNumber': 1,
│   │   │   'row': {
│   │   │   │   'contactID': '1',
│   │   │   │   'phone': '123-456-111'
│   │   │   },
│   │   │   'invalidValue': '123-456-111',
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'phone',
│   │   │   │   │   'maxLength': '10',
│   │   │   │   │   'contacts': 'header'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': 'greater_than_max_length rule violated in table contacts, column phone, row(s) 1: Value "123-456-111" (of length 11) exceeds the max length of "10"'
│   │   }
],
'warnings': []
}

Rule metadata

All the metadata for this rule is contained in the parts sheet in the ODM dictionary. The steps involved are:

  1. Get the table names in the dictionary

  2. Get all the columns for each table

  3. Filter the rows to only include those whose data type is a string. Currently, these are varchar columns.

  4. Check whether the rows has a maximum length value. If it does then add the validation rule for that column.

    • The maxLength column contains this metadata

    • The possible values are a number or NA

      • If the value is NA then this column does not have a maximum length value

For example in the following ODM dictionary snippet,

                                                     Parts v2                                                      
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ partID      partType     addresses   contacts   dataType   maxLength  status  firstRelease  lastUpdated ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ addresses  │ tables      │ NA         │ NA        │ NA        │ NA        │ active │ 1.0.0        │ 2.0.0       │
│ addL2      │ attributes  │ header     │ NA        │ varchar   │ NA        │ active │ 1.0.0        │ 2.0.0       │
│ contacts   │ tables      │ NA         │ NA        │ NA        │ NA        │ active │ 1.0.0        │ 2.0.0       │
│ phone      │ attributes  │ NA         │ header    │ varchar   │ 10        │ active │ 1.0.0        │ 2.0.0       │
└────────────┴─────────────┴────────────┴───────────┴───────────┴───────────┴────────┴──────────────┴─────────────┘

The phone column in the contacts table would have this validation rule but the addL2 column in the addresses table would not.

Cerberus Schema

We can use the maxlength validation rule in the cerberus library to implement this.

The generated cerberus object for the example above is shown below,

{
'schemaVersion': '2.0.0',
'schema': {
│   │   'contacts': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'phone': {
│   │   │   │   │   │   'maxlength': 10,
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'greater_than_max_length',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'phone',
│   │   │   │   │   │   │   │   │   │   'maxLength': '10',
│   │   │   │   │   │   │   │   │   │   'contacts': 'header'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ]
│   │   │   │   │   }
│   │   │   │   },
│   │   │   │   'meta': [
│   │   │   │   │   {
│   │   │   │   │   │   'partID': 'contacts',
│   │   │   │   │   │   'partType': 'tables'
│   │   │   │   │   }
│   │   │   │   ]
│   │   │   }
│   │   }
}
}

The metadata for this rule should have the row from the ODM parts sheet for the column with its partID and maxLength column values.

ODM Version 1

When generating the schema for version 1, we check whether the column has an equivalent part in version 1. If it does, then we add it to the cerberus schema. For example,

                                                     Parts v1                                                      
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ partID   partT…  addres…  conta…  dataTy…  maxLe…  versio…  versi…  versio…  status  first…  lastUp… ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ addres… │ tables │ NA      │ NA     │ NA      │ NA     │ tables  │ Addre… │ NA      │ active │ 1.0.0  │ 2.0.0   │
│ addL2   │ attri… │ header  │ NA     │ varchar │ NA     │ variab… │ Addre… │ AddLin… │ active │ 1.0.0  │ 2.0.0   │
│ contac… │ tables │ NA      │ NA     │ NA      │ NA     │ tables  │ Conta… │ NA      │ active │ 1.0.0  │ 2.0.0   │
│ phone   │ attri… │ NA      │ header │ varchar │ 10     │ variab… │ Conta… │ phoneN… │ active │ 1.0.0  │ 2.0.0   │
└─────────┴────────┴─────────┴────────┴─────────┴────────┴─────────┴────────┴─────────┴────────┴────────┴─────────┘

The corresponding cerberus schema would be,

{
'schemaVersion': '1.0.0',
'schema': {
│   │   'Contact': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'phoneNumber': {
│   │   │   │   │   │   'maxlength': 10,
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'greater_than_max_length',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'phone',
│   │   │   │   │   │   │   │   │   │   'contacts': 'header',
│   │   │   │   │   │   │   │   │   │   'maxLength': '10',
│   │   │   │   │   │   │   │   │   │   'version1Location': 'variables',
│   │   │   │   │   │   │   │   │   │   'version1Table': 'Contact',
│   │   │   │   │   │   │   │   │   │   'version1Variable': 'phoneNumber'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ]
│   │   │   │   │   }
│   │   │   │   },
│   │   │   │   'meta': [
│   │   │   │   │   {
│   │   │   │   │   │   'partID': 'contacts',
│   │   │   │   │   │   'partType': 'tables',
│   │   │   │   │   │   'version1Location': 'tables',
│   │   │   │   │   │   'version1Table': 'Contact'
│   │   │   │   │   }
│   │   │   │   ]
│   │   │   }
│   │   }
}
}

The metadata should include the following columns,

  • The partID column value

  • The maxLength column value

  • The version1Location column value

  • The version1Table column value

  • The version1Variable column value