less_than_min_length

This rule ensures that the number of characters in a string type column is not less than a defined minimum value. As an example, consider the phone column in the contacts table whose minimum length value is 10. The following dataset snippet would fail validation,

                                                  Invalid Dataset                                                  
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ contactID                                               phone                                                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                                      │ 123-456-1                                              │
└────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────┘

The phone column in the first row has only 9 characters which is less than the minimum length value of 10, whereas the following dataset snippet would pass validation,

                                                   Valid Dataset                                                   
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ contactID                                           phone                                                      ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                                  │ 123-456-23                                                 │
│ 2                                                  │ 123-456-231                                                │
└────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────┘

The phone columns in both rows have a number of characters equal to 10 and 11 respectively which is greater than or equal to the minimum length value of 10.

Error report

The error report should have the following fields,

  • errorType: less_than_min_length

  • tableName: The name of the table with the invalid value

  • columnName: The name of the column with the invalid value

  • rowNumber: The index of the row with the invalid value

  • row: The dictionary containing the invalid row

  • invalidValue: The invalid value

  • validationRuleFields: The ODM data dictionary rows used to generate this rule

  • message: Value <invalid_value> in row <row_index> in column <column_name> in table <table_name> has length <invalid_length> which is less than the min length of <min_length>

The error report object for the example invalid row above is shown below,

{
'errors': [
│   │   {
│   │   │   'errorType': 'less_than_min_length',
│   │   │   'tableName': 'contacts',
│   │   │   'columnName': 'phone',
│   │   │   'rowNumber': 1,
│   │   │   'row': {
│   │   │   │   'contactID': '1',
│   │   │   │   'phone': '123-456-1'
│   │   │   },
│   │   │   'invalidValue': '123-456-1',
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'phone',
│   │   │   │   │   'minLength': '10',
│   │   │   │   │   'contacts': 'header'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': 'less_than_min_length rule violated in table contacts, column phone, row(s) 1: Value "123-456-1" (of length 9) is less than the min length of "10"'
│   │   }
],
'warnings': []
}

Rule metadata

All the metadata for this rule is contained in the parts sheet in the ODM dictionary. The steps involved are:

  1. Get the table names in the dictionary

  2. Get all the columns for each table

  3. Filter the rows to only include those whose data type is a string. Currently, these are varchar columns.

  4. Check whether the rows has a minimum length value. If it does then add the validation rule for that column.

    • The minLength column contains this metadata

    • The possible values are a number or NA

      • If the value is NA then this column does not have a minimum length value

For example in the following ODM dictionary snippet,

                                                     Parts v2                                                      
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ partID           partType          addresses       contacts       dataType       minLength       status   ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ contacts        │ tables           │ NA             │ NA            │ NA            │ NA             │ active   │
│ phone           │ attributes       │ NA             │ header        │ varchar       │ 10             │ active   │
│ addresses       │ tables           │ NA             │ NA            │ NA            │ NA             │ active   │
│ addL2           │ attributes       │ header         │ NA            │ varchar       │ NA             │ active   │
│ zero            │ attributes       │ header         │ NA            │ varchar       │ 0              │ active   │
│ negative        │ attributes       │ header         │ NA            │ varchar       │ -1             │ active   │
└─────────────────┴──────────────────┴────────────────┴───────────────┴───────────────┴────────────────┴──────────┘

The phone in the contacts table would have this validation rule but the addL2 column in the addresses table would not.

Cerberus Schema

We can use the minlength validation rule in the cerberus library to implement this.

The generated cerberus object for the example above is shown below,

{
'schemaVersion': '2.0.0',
'schema': {
│   │   'contacts': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'phone': {
│   │   │   │   │   │   'minlength': 10,
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'less_than_min_length',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'phone',
│   │   │   │   │   │   │   │   │   │   'minLength': '10',
│   │   │   │   │   │   │   │   │   │   'contacts': 'header'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ]
│   │   │   │   │   }
│   │   │   │   },
│   │   │   │   'meta': [
│   │   │   │   │   {
│   │   │   │   │   │   'partID': 'contacts',
│   │   │   │   │   │   'partType': 'tables'
│   │   │   │   │   }
│   │   │   │   ]
│   │   │   }
│   │   }
}
}

Parts with minLength less or equal to zero will be ignored.

The metadata for this rule should have the row from the ODM parts sheet for the column with its partID and minLength column values.

ODM Version 1

When generating the schema for version 1, we check whether the column has an equivalent part in version 1. If it does, then we add it to the cerberus schema. For example,

                                                     Parts v1                                                      
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┓
┃ partID     partType   addresses  contacts  dataType  minLength  version1…  version1…  version…  status ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
│ contacts  │ tables    │ NA        │ NA       │ NA       │ NA        │ tables    │ Contact   │ NA       │ active │
│ phone     │ attribut… │ NA        │ header   │ varchar  │ 10        │ variables │ Contact   │ phoneNu… │ active │
│ addresses │ tables    │ NA        │ NA       │ NA       │ NA        │ tables    │ Address   │ NA       │ active │
│ addL2     │ attribut… │ header    │ NA       │ varchar  │ NA        │ variables │ Address   │ Address… │ active │
│ zero      │ attribut… │ header    │ NA       │ varchar  │ 0         │ variables │ Address   │ Zero     │ active │
│ negative  │ attribut… │ header    │ NA       │ varchar  │ -1        │ variables │ Address   │ Neg      │ active │
└───────────┴───────────┴───────────┴──────────┴──────────┴───────────┴───────────┴───────────┴──────────┴────────┘

The corresponding cerberus schema would be,

{
'schemaVersion': '1.0.0',
'schema': {
│   │   'Contact': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'phoneNumber': {
│   │   │   │   │   │   'minlength': 10,
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'less_than_min_length',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'phone',
│   │   │   │   │   │   │   │   │   │   'minLength': '10',
│   │   │   │   │   │   │   │   │   │   'contacts': 'header',
│   │   │   │   │   │   │   │   │   │   'version1Location': 'variables',
│   │   │   │   │   │   │   │   │   │   'version1Table': 'Contact',
│   │   │   │   │   │   │   │   │   │   'version1Variable': 'phoneNumber'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ]
│   │   │   │   │   }
│   │   │   │   },
│   │   │   │   'meta': [
│   │   │   │   │   {
│   │   │   │   │   │   'partID': 'contacts',
│   │   │   │   │   │   'partType': 'tables',
│   │   │   │   │   │   'version1Location': 'tables',
│   │   │   │   │   │   'version1Table': 'Contact'
│   │   │   │   │   }
│   │   │   │   ]
│   │   │   }
│   │   }
}
}

The metadata should include the following columns,

  • The partID column value

  • The minLength column value

  • The version1Location column value

  • The version1Table column value

  • The version1Variable column value