greater_than_max_value

This rule ensure that columns values are not greater than a defined maximum value. Currently, this rule only supports columns whose data type is float or integer i.e. numbers. As an example, consider the geoLat column and geoLong column in the sites table. For the purpose of explaining this rule we’ll assume the geoLat column is an integer with a maximum value of 90 and the geoLong column is a float with a maximum value of 90.15. The following dataset snippet would fail validation,

# Using a yaml file here since in a CSV file every value is by default a string
pprint_yaml_file(asset("invalid-dataset-1.yml"))
[
{
│   │   'siteID': '1',
│   │   'geoLat': 91,
│   │   'geoLong': 90.2
}
]

whereas the following dataset snippet would pass validation,

# Using a yaml file here since in a CSV file every value is by default a string
pprint_yaml_file(asset("valid-dataset-1.yml"))
[
{
│   │   'siteID': '1',
│   │   'geoLat': 89,
│   │   'geoLong': 90.1
}
]

If the value to be validated is not a number but can be coerced into one, then a warning is given. Whether the coerced value is valid or not, a warning should reported informing the user of the mismatched type. For example the following dataset would fail validation with an error and a warning,

                                                  Invalid Dataset                                                  
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ siteID                              geoLat                              geoLong                               ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                  │ 91                                 │ 90.2                                  │
└────────────────────────────────────┴────────────────────────────────────┴───────────────────────────────────────┘

and the following dataset should pass validation with a warning (since by default all CSV values are parsed as strings in Python),

                                                   Valid Dataset                                                   
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ siteID                              geoLat                              geoLong                               ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                  │ 89                                 │ 90.1                                  │
└────────────────────────────────────┴────────────────────────────────────┴───────────────────────────────────────┘

If the value cannot be coerced into a number, then an error should be logged as in the following dataset where “a” and “b” cannot be coerced into a number

                                                  Invalid Dataset                                                  
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ siteID                              geoLat                              geoLong                               ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                  │ a                                  │ b                                     │
└────────────────────────────────────┴────────────────────────────────────┴───────────────────────────────────────┘

The ODM dictionary is currently written so that the defined maximum values are inclusive i.e. only values greater than the maximum values are invalid. All other values should pass this rule. For example, for the geoLat column and geoLong column, a value of 90 and 90.15 respectively would be considered valid as in the dataset below,

# Using a yaml file here since in a CSV file every value is by default a string
pprint_yaml_file(asset("valid-dataset-3.yml"))
[
{
│   │   'siteID': '1',
│   │   'geoLat': 90,
│   │   'geoLong': 90.15
}
]

Error report

This rule can generate an error and/or a warning.

An error is generated in the following cases:

  1. The value is a number and is greater than the max value.

  2. The value is not a number, can be coerced into one, and is greater than the max value.

  3. The value is not a number and cannot be coerced into one.

The error report for the first two cases is shown below,

  • errorType: greater_than_max_value

  • tableName The name of the table with the invalid value

  • columnName: The name of the column with the invalid value

  • rowNumber: The index of the row with the invalid value

  • row: The dictionary containing the invalid row

  • invalidValue: The invalid value

  • validationRuleFields: The ODM data dictionary rows used to generate this rule

  • message: Value <invalid_value> in row <row_index> in column <column_name> in table <table_name> is greater than the allowable maximum value of <max_value>

The error report for the third case is the identical except for the message field which is shown below,

  • message: Value <invalid_value> in row <row_index> in column <column_name> in table <table_name> cannot be coerced into a number

A warning is generated when the value is a not a number but is coerced into one before validation. The warning report will have the following fields,

  • warningType: _coercion

  • coercionRules: The list of validation rules that required this coercion

  • tableName: The name of the table with the invalid value

  • columnName: The name of the column with the invalid value

  • rowNumber: The index of the row with the invalid value

  • row: The dictionary containing the invalid row

  • invalidValue: The invalid value

  • validationRuleFields: The ODM data dictionary rows used to generate this rule

  • message: Value <invalid_value> in row <row_index> in column <column_name> in table <table_name> is a <invalid_type> and was coerced into a number

The error reports for each of the invalid datasets above are shown above,

Invalid dataset 1

{
'errors': [
│   │   {
│   │   │   'errorType': 'greater_than_max_value',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'rowNumber': 1,
│   │   │   'row': {
│   │   │   │   'siteID': '1',
│   │   │   │   'geoLat': 91,
│   │   │   │   'geoLong': 90.2
│   │   │   },
│   │   │   'invalidValue': 91,
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'geoLat',
│   │   │   │   │   'maxValue': '90',
│   │   │   │   │   'sites': 'header'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': 'greater_than_max_value rule violated in table sites, column geoLat, row(s) 1: Value "91" is greater than the max value of "90"'
│   │   },
│   │   {
│   │   │   'errorType': 'greater_than_max_value',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLong',
│   │   │   'rowNumber': 1,
│   │   │   'row': {
│   │   │   │   'siteID': '1',
│   │   │   │   'geoLat': 91,
│   │   │   │   'geoLong': 90.2
│   │   │   },
│   │   │   'invalidValue': 90.2,
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'geoLong',
│   │   │   │   │   'maxValue': '90.15',
│   │   │   │   │   'sites': 'header'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': 'greater_than_max_value rule violated in table sites, column geoLong, row(s) 1: Value "90.2" is greater than the max value of "90.15"'
│   │   }
],
'warnings': []
}

Invalid dataset 2

{
'errors': [
│   │   {
│   │   │   'errorType': 'greater_than_max_value',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'rowNumber': 1,
│   │   │   'row': {
│   │   │   │   'siteID': '1',
│   │   │   │   'geoLat': 91,
│   │   │   │   'geoLong': 90.2
│   │   │   },
│   │   │   'invalidValue': 91,
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'geoLat',
│   │   │   │   │   'maxValue': '90',
│   │   │   │   │   'sites': 'header'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': 'greater_than_max_value rule violated in table sites, column geoLat, row(s) 1: Value "91" is greater than the max value of "90"'
│   │   },
│   │   {
│   │   │   'errorType': 'greater_than_max_value',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLong',
│   │   │   'rowNumber': 1,
│   │   │   'row': {
│   │   │   │   'siteID': '1',
│   │   │   │   'geoLat': 91,
│   │   │   │   'geoLong': 90.2
│   │   │   },
│   │   │   'invalidValue': 90.2,
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'geoLong',
│   │   │   │   │   'maxValue': '90.15',
│   │   │   │   │   'sites': 'header'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': 'greater_than_max_value rule violated in table sites, column geoLong, row(s) 1: Value "90.2" is greater than the max value of "90.15"'
│   │   }
],
'warnings': [
│   │   {
│   │   │   'warningType': '_coercion',
│   │   │   'coercionRules': [
│   │   │   │   'greater_than_max_value'
│   │   │   ],
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'rowNumber': 1,
│   │   │   'row': {
│   │   │   │   'siteID': '1',
│   │   │   │   'geoLat': '91',
│   │   │   │   'geoLong': '90.2'
│   │   │   },
│   │   │   'invalidValue': '91',
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'geoLat',
│   │   │   │   │   'maxValue': '90',
│   │   │   │   │   'sites': 'header'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': '_coercion rule triggered in table sites, column geoLat, row(s) 1: Value "91" is a string and was coerced into a number'
│   │   },
│   │   {
│   │   │   'warningType': '_coercion',
│   │   │   'coercionRules': [
│   │   │   │   'greater_than_max_value'
│   │   │   ],
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLong',
│   │   │   'rowNumber': 1,
│   │   │   'row': {
│   │   │   │   'siteID': '1',
│   │   │   │   'geoLat': '91',
│   │   │   │   'geoLong': '90.2'
│   │   │   },
│   │   │   'invalidValue': '90.2',
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'geoLong',
│   │   │   │   │   'maxValue': '90.15',
│   │   │   │   │   'sites': 'header'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': '_coercion rule triggered in table sites, column geoLong, row(s) 1: Value "90.2" is a string and was coerced into a number'
│   │   }
]
}

Invalid dataset 3

{
'errors': [
│   │   {
│   │   │   'errorType': '_coercion',
│   │   │   'coercionRules': [
│   │   │   │   'greater_than_max_value'
│   │   │   ],
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'rowNumber': 1,
│   │   │   'row': {
│   │   │   │   'siteID': '1',
│   │   │   │   'geoLat': 'a',
│   │   │   │   'geoLong': 'b'
│   │   │   },
│   │   │   'invalidValue': 'a',
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'geoLat',
│   │   │   │   │   'maxValue': '90',
│   │   │   │   │   'sites': 'header'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': '_coercion rule violated in table sites, column geoLat, row(s) 1: Value "a" cannot be coerced into a number'
│   │   },
│   │   {
│   │   │   'errorType': '_coercion',
│   │   │   'coercionRules': [
│   │   │   │   'greater_than_max_value'
│   │   │   ],
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLong',
│   │   │   'rowNumber': 1,
│   │   │   'row': {
│   │   │   │   'siteID': '1',
│   │   │   │   'geoLat': 'a',
│   │   │   │   'geoLong': 'b'
│   │   │   },
│   │   │   'invalidValue': 'b',
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'geoLong',
│   │   │   │   │   'maxValue': '90.15',
│   │   │   │   │   'sites': 'header'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': '_coercion rule violated in table sites, column geoLong, row(s) 1: Value "b" cannot be coerced into a number'
│   │   }
],
'warnings': []
}

Finally, for the valid coercable dataset (valid dataset 2) above, the following warning should be generated

{
'errors': [
│   │   {
│   │   │   'errorType': 'greater_than_max_value',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'lastEdited',
│   │   │   'rowNumber': 1,
│   │   │   'row': {
│   │   │   │   'siteID': 1,
│   │   │   │   'lastEdited': '2024-12-30 00:00:00'
│   │   │   },
│   │   │   'invalidValue': '2024-12-30 00:00:00',
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'lastEdited',
│   │   │   │   │   'maxValue': '2024-01-01',
│   │   │   │   │   'sites': 'header'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': 'greater_than_max_value rule violated in table sites, column lastEdited, row(s) 1: Value "2024-12-30 00:00:00" is greater than the max value of "2024-01-01 00:00:00"'
│   │   }
],
'warnings': [
│   │   {
│   │   │   'warningType': '_coercion',
│   │   │   'coercionRules': [
│   │   │   │   'greater_than_max_value'
│   │   │   ],
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'lastEdited',
│   │   │   'rowNumber': 1,
│   │   │   'row': {
│   │   │   │   'lastEdited': '2024-12-30',
│   │   │   │   'siteID': 1
│   │   │   },
│   │   │   'invalidValue': '2024-12-30',
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'lastEdited',
│   │   │   │   │   'maxValue': '2024-01-01',
│   │   │   │   │   'sites': 'header'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': '_coercion rule triggered in table sites, column lastEdited, row(s) 1: Value "2024-12-30" is a string and was coerced into a datetime'
│   │   }
]
}

Rule metadata

All the metadata for this rule is contained in the parts sheet in the data dictionary. The steps involved are:

  1. Get the table names in the dictionary

  2. Get all the columns for each table

  3. Filter the rows to include only those columns whose data type is float or integer

    • The dataType column in the parts sheet contains this metadata

    • It has a number of possible values but we’re looking for values of float or integer

  4. Check whether the filtered rows has a maximum value. If it does then add the validation rule for that column.

    • The maxValue column contains this metadata

    • The possible values are a number or NA.

      • If the value is NA then this column does not have a maximum value.

For example in the following ODM dictionary snippet,

                                                     Parts v2                                                      
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ partID      partType       sites     dataType     maxValue     status    firstReleased     lastUpdated   ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ sites      │ tables        │ NA       │ NA          │ NA          │ active   │ 1.0.0            │ 2.0.0         │
│ geoLat     │ attributes    │ header   │ integer     │ 90          │ active   │ 1.0.0            │ 2.0.0         │
│ geoLong    │ attributes    │ header   │ float       │ 90.15       │ active   │ 1.0.0            │ 2.0.0         │
│ geoEPSG    │ attributes    │ header   │ integer     │ NA          │ active   │ 1.0.0            │ 2.0.0         │
└────────────┴───────────────┴──────────┴─────────────┴─────────────┴──────────┴──────────────────┴───────────────┘

The geoLat part which is a column in the sites table has a max value of 90 and would have this rule set for it. On the other hand, the geoEPSG part which is also a column in the sites table would not have this rule due its max value being NA.

Cerberus Schema

We can use the maximum value validation rule in cerberus to implement this rule. Cerberus also needs to know the data type when performing this validation, so we need to set type and coerce as well.

The generated cerberus object for the example above is shown below,

{
'schemaVersion': '2.0.0',
'schema': {
│   │   'sites': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'geoLat': {
│   │   │   │   │   │   'type': 'integer',
│   │   │   │   │   │   'coerce': 'integer',
│   │   │   │   │   │   'max': 90,
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'greater_than_max_value',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'geoLat',
│   │   │   │   │   │   │   │   │   │   'maxValue': '90',
│   │   │   │   │   │   │   │   │   │   'sites': 'header'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ]
│   │   │   │   │   },
│   │   │   │   │   'geoLong': {
│   │   │   │   │   │   'type': 'float',
│   │   │   │   │   │   'coerce': 'float',
│   │   │   │   │   │   'max': 90.15,
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'greater_than_max_value',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'geoLong',
│   │   │   │   │   │   │   │   │   │   'maxValue': '90.15',
│   │   │   │   │   │   │   │   │   │   'sites': 'header'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ]
│   │   │   │   │   }
│   │   │   │   },
│   │   │   │   'meta': [
│   │   │   │   │   {
│   │   │   │   │   │   'partID': 'sites',
│   │   │   │   │   │   'partType': 'tables'
│   │   │   │   │   }
│   │   │   │   ]
│   │   │   }
│   │   }
}
}

The metadata for this rule should include the row from the ODM for the part with its partID and maxValue column values.

ODM Version 1

When generating the schema for version 1, we check whether the column has an equivalent part in version 1. If it does, then we add it to the cerberus schema. For example,

                                                     Parts v1                                                      
┏━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┓
┃ partID   partType  sites   dataType  maxValue  version…  version…  versio…  status  firstRe…  lastUp… ┃
┡━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━┩
│ sites   │ tables   │ NA     │ NA       │ NA       │ tables   │ Site     │ NA      │ active │ 1.0.0    │ 2.0.0   │
│ geoLat  │ attribu… │ header │ integer  │ 90       │ variabl… │ Site     │ Latitu… │ active │ 1.0.0    │ 2.0.0   │
│ geoLong │ attribu… │ header │ float    │ 90.15    │ variabl… │ Site     │ Longit… │ active │ 1.0.0    │ 2.0.0   │
│ geoEPSG │ attribu… │ header │ integer  │ NA       │ variabl… │ Site     │ Latitu… │ active │ 1.0.0    │ 2.0.0   │
└─────────┴──────────┴────────┴──────────┴──────────┴──────────┴──────────┴─────────┴────────┴──────────┴─────────┘

The corresponding cerberus schema would be,

{
'schemaVersion': '1.0.0',
'schema': {
│   │   'Site': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'Latitude': {
│   │   │   │   │   │   'type': 'integer',
│   │   │   │   │   │   'coerce': 'integer',
│   │   │   │   │   │   'max': 90,
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'greater_than_max_value',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'geoLat',
│   │   │   │   │   │   │   │   │   │   'sites': 'header',
│   │   │   │   │   │   │   │   │   │   'version1Location': 'variables',
│   │   │   │   │   │   │   │   │   │   'version1Table': 'Site',
│   │   │   │   │   │   │   │   │   │   'version1Variable': 'Latitude',
│   │   │   │   │   │   │   │   │   │   'maxValue': '90'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ]
│   │   │   │   │   },
│   │   │   │   │   'Longitude': {
│   │   │   │   │   │   'type': 'float',
│   │   │   │   │   │   'coerce': 'float',
│   │   │   │   │   │   'max': 90.15,
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'greater_than_max_value',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'geoLong',
│   │   │   │   │   │   │   │   │   │   'sites': 'header',
│   │   │   │   │   │   │   │   │   │   'version1Location': 'variables',
│   │   │   │   │   │   │   │   │   │   'version1Table': 'Site',
│   │   │   │   │   │   │   │   │   │   'version1Variable': 'Longitude',
│   │   │   │   │   │   │   │   │   │   'maxValue': '90.15'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ]
│   │   │   │   │   }
│   │   │   │   },
│   │   │   │   'meta': [
│   │   │   │   │   {
│   │   │   │   │   │   'partID': 'sites',
│   │   │   │   │   │   'partType': 'tables',
│   │   │   │   │   │   'version1Location': 'tables',
│   │   │   │   │   │   'version1Table': 'Site'
│   │   │   │   │   }
│   │   │   │   ]
│   │   │   }
│   │   }
}
}

The metadata should include the following columns,

  • The partID column value

  • The version1Location column value

  • The version1Table column value

  • The version1Variable column value

  • The maxValue column value