less_than_min_value¶
This rule ensure that columns values are not less than a defined minimum
value. Currently, this rule only supports columns whose data type is
float or integer i.e. numbers. As an example, consider the geoLat
column and geoLong
column in the sites
table. For the purpose of
explaining this rule we’ll assume the geoLat
column is an integer
with a minimum value of -90 and the geoLong
column is a float
with a
minimum value of -90.15
. The following dataset snippet would fail
validation,
# Using a yaml file here since in a CSV file every value is by default a string
pprint_yaml_file(asset("invalid-dataset-1.yml"))
[ │ { │ │ 'siteID': 1, │ │ 'geoLat': -91, │ │ 'geoLong': -90.2 │ } ]
whereas the following dataset snippet would pass validation,
# Using a yaml file here since in a CSV file every value is by default a string
pprint_yaml_file(asset("valid-dataset-1.yml"))
[ │ { │ │ 'siteID': 1, │ │ 'geoLat': -89, │ │ 'geoLong': -90 │ } ]
If the value to be validated is not a number but can be coerced into one, then a warning is given. Whether the coerced value is valid or not, a warning should reported informing the user of the mismatched type. For example the following dataset would fail validation with an error and a warning,
Invalid Dataset ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ siteID ┃ geoLat ┃ geoLong ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ 1 │ -91 │ -90.2 │ └────────────────────────────────────┴────────────────────────────────────┴───────────────────────────────────────┘
and the following dataset should pass validation with a warning (since by default all CSV values are parsed as strings in Python),
Valid Dataset ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ siteID ┃ geoLat ┃ geoLong ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ 1 │ -89 │ -90 │ └────────────────────────────────────┴────────────────────────────────────┴───────────────────────────────────────┘
If the value cannot be coerced into a number, then an error should be logged as in the following dataset where “a” and “b” cannot be coerced into a number
Invalid Dataset ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ siteID ┃ geoLat ┃ geoLong ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ 1 │ a │ b │ └────────────────────────────────────┴────────────────────────────────────┴───────────────────────────────────────┘
The ODM dictionary is currently written so that the defined minimum
values are inclusive i.e. only values less than the minimum values are
invalid. All other values should pass this rule. For example, for the
geoLat
column and geoLong
column, a value of -90 and -90.15
respectively would be considered valid as in the dataset below,
# Using a yaml file here since in a CSV file every value is by default a string
pprint_yaml_file(asset("valid-dataset-3.yml"))
[ │ { │ │ 'siteID': 1, │ │ 'geoLat': -90, │ │ 'geoLong': -90.15 │ } ]
Error report¶
This rule can generate an error and/or a warning.
An error is generated in the following cases:
The value is a number and is less than the min value.
The value is not a number, can be coerced into one, and is less than the min value.
The value is not a number and cannot be coerced into one.
The error report for the first two cases is shown below,
errorType: less_than_min_value
tableName The name of the table with the invalid value
columnName: The name of the column with the invalid value
rowNumber: The index of the row with the invalid value
row: The dictionary containing the invalid row
invalidValue: The invalid value
validationRuleFields: The ODM data dictionary rows used to generate this rule
message: Value <invalid_value> in row <row_index> in column <column_name> in table <table_name> is less than the allowable minimum value of <min_value>
The error report for the third case is the identical except for the message field which is shown below,
message: Value <invalid_value> in row <row_index> in column <column_name> in table <table_name> cannot be coerced into a number
A warning is generated when the value is a not a number but is coerced into one before validation. The warning report will have the following fields,
warningType: _coercion
coercionRules: The list of validation rules that required this coercion
tableName: The name of the table with the invalid value
columnName: The name of the column with the invalid value
rowNumber: The index of the row with the invalid value
row: The dictionary containing the invalid row
invalidValue: The invalid value
validationRuleFields: The ODM data dictionary rows used to generate this rule
message: Value <invalid_value> in row <row_index> in column <column_name> in table <table_name> is a <invalid_type> and was coerced into a number
The error reports for each of the invalid datasets above are shown above,
Invalid dataset 1
{ │ 'errors': [ │ │ { │ │ │ 'errorType': 'less_than_min_value', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLat', │ │ │ 'rowNumber': 1, │ │ │ 'row': { │ │ │ │ 'siteID': 1, │ │ │ │ 'geoLat': -91, │ │ │ │ 'geoLong': -90.2 │ │ │ }, │ │ │ 'invalidValue': -91, │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'geoLat', │ │ │ │ │ 'minValue': '-90', │ │ │ │ │ 'sites': 'header' │ │ │ │ } │ │ │ ], │ │ │ 'message': 'less_than_min_value rule violated in table sites, column geoLat, row(s) 1: Value "-91" is less than the min value of "-90"' │ │ }, │ │ { │ │ │ 'errorType': 'less_than_min_value', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLong', │ │ │ 'rowNumber': 1, │ │ │ 'row': { │ │ │ │ 'siteID': 1, │ │ │ │ 'geoLat': -91, │ │ │ │ 'geoLong': -90.2 │ │ │ }, │ │ │ 'invalidValue': -90.2, │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'geoLong', │ │ │ │ │ 'minValue': '-90.15', │ │ │ │ │ 'sites': 'header' │ │ │ │ } │ │ │ ], │ │ │ 'message': 'less_than_min_value rule violated in table sites, column geoLong, row(s) 1: Value "-90.2" is less than the min value of "-90.15"' │ │ } │ ], │ 'warnings': [] }
Invalid dataset 2
{ │ 'errors': [ │ │ { │ │ │ 'errorType': 'less_than_min_value', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLat', │ │ │ 'rowNumber': 1, │ │ │ 'row': { │ │ │ │ 'siteID': '1', │ │ │ │ 'geoLat': -91, │ │ │ │ 'geoLong': -90.2 │ │ │ }, │ │ │ 'invalidValue': -91, │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'geoLat', │ │ │ │ │ 'minValue': '-90', │ │ │ │ │ 'sites': 'header' │ │ │ │ } │ │ │ ], │ │ │ 'message': 'less_than_min_value rule violated in table sites, column geoLat, row(s) 1: Value "-91" is less than the min value of "-90"' │ │ }, │ │ { │ │ │ 'errorType': 'less_than_min_value', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLong', │ │ │ 'rowNumber': 1, │ │ │ 'row': { │ │ │ │ 'siteID': '1', │ │ │ │ 'geoLat': -91, │ │ │ │ 'geoLong': -90.2 │ │ │ }, │ │ │ 'invalidValue': -90.2, │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'geoLong', │ │ │ │ │ 'minValue': '-90.15', │ │ │ │ │ 'sites': 'header' │ │ │ │ } │ │ │ ], │ │ │ 'message': 'less_than_min_value rule violated in table sites, column geoLong, row(s) 1: Value "-90.2" is less than the min value of "-90.15"' │ │ } │ ], │ 'warnings': [ │ │ { │ │ │ 'warningType': '_coercion', │ │ │ 'coercionRules': [ │ │ │ │ 'less_than_min_value' │ │ │ ], │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLat', │ │ │ 'rowNumber': 1, │ │ │ 'row': { │ │ │ │ 'siteID': '1', │ │ │ │ 'geoLat': '-91', │ │ │ │ 'geoLong': '-90.2' │ │ │ }, │ │ │ 'invalidValue': '-91', │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'geoLat', │ │ │ │ │ 'minValue': '-90', │ │ │ │ │ 'sites': 'header' │ │ │ │ } │ │ │ ], │ │ │ 'message': '_coercion rule triggered in table sites, column geoLat, row(s) 1: Value "-91" is a string and was coerced into a number' │ │ }, │ │ { │ │ │ 'warningType': '_coercion', │ │ │ 'coercionRules': [ │ │ │ │ 'less_than_min_value' │ │ │ ], │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLong', │ │ │ 'rowNumber': 1, │ │ │ 'row': { │ │ │ │ 'siteID': '1', │ │ │ │ 'geoLat': '-91', │ │ │ │ 'geoLong': '-90.2' │ │ │ }, │ │ │ 'invalidValue': '-90.2', │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'geoLong', │ │ │ │ │ 'minValue': '-90.15', │ │ │ │ │ 'sites': 'header' │ │ │ │ } │ │ │ ], │ │ │ 'message': '_coercion rule triggered in table sites, column geoLong, row(s) 1: Value "-90.2" is a string and was coerced into a number' │ │ } │ ] }
Invalid dataset 3
{ │ 'errors': [ │ │ { │ │ │ 'errorType': '_coercion', │ │ │ 'coercionRules': [ │ │ │ │ 'less_than_min_value' │ │ │ ], │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLat', │ │ │ 'rowNumber': 1, │ │ │ 'row': { │ │ │ │ 'siteID': '1', │ │ │ │ 'geoLat': 'a', │ │ │ │ 'geoLong': 'b' │ │ │ }, │ │ │ 'invalidValue': 'a', │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'geoLat', │ │ │ │ │ 'minValue': '-90', │ │ │ │ │ 'sites': 'header' │ │ │ │ } │ │ │ ], │ │ │ 'message': '_coercion rule violated in table sites, column geoLat, row(s) 1: Value "a" cannot be coerced into a number' │ │ }, │ │ { │ │ │ 'errorType': '_coercion', │ │ │ 'coercionRules': [ │ │ │ │ 'less_than_min_value' │ │ │ ], │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLong', │ │ │ 'rowNumber': 1, │ │ │ 'row': { │ │ │ │ 'siteID': '1', │ │ │ │ 'geoLat': 'a', │ │ │ │ 'geoLong': 'b' │ │ │ }, │ │ │ 'invalidValue': 'b', │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'geoLong', │ │ │ │ │ 'minValue': '-90.15', │ │ │ │ │ 'sites': 'header' │ │ │ │ } │ │ │ ], │ │ │ 'message': '_coercion rule violated in table sites, column geoLong, row(s) 1: Value "b" cannot be coerced into a number' │ │ } │ ], │ 'warnings': [] }
Finally, for the valid coercable dataset (valid dataset 2) above, the following warning should be generated
{ │ 'errors': [], │ 'warnings': [ │ │ { │ │ │ 'warningType': '_coercion', │ │ │ 'coercionRules': [ │ │ │ │ 'less_than_min_value' │ │ │ ], │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLat', │ │ │ 'rowNumber': 1, │ │ │ 'row': { │ │ │ │ 'siteID': '1', │ │ │ │ 'geoLat': '-89', │ │ │ │ 'geoLong': '-90' │ │ │ }, │ │ │ 'invalidValue': '-89', │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'geoLat', │ │ │ │ │ 'minValue': '-90', │ │ │ │ │ 'sites': 'header' │ │ │ │ } │ │ │ ], │ │ │ 'message': '_coercion rule triggered in table sites, column geoLat, row(s) 1: Value "-89" is a string and was coerced into a number' │ │ }, │ │ { │ │ │ 'warningType': '_coercion', │ │ │ 'coercionRules': [ │ │ │ │ 'less_than_min_value' │ │ │ ], │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLong', │ │ │ 'rowNumber': 1, │ │ │ 'row': { │ │ │ │ 'siteID': '1', │ │ │ │ 'geoLat': '-89', │ │ │ │ 'geoLong': '-90' │ │ │ }, │ │ │ 'invalidValue': '-90', │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'geoLong', │ │ │ │ │ 'minValue': '-90.15', │ │ │ │ │ 'sites': 'header' │ │ │ │ } │ │ │ ], │ │ │ 'message': '_coercion rule triggered in table sites, column geoLong, row(s) 1: Value "-90" is a string and was coerced into a number' │ │ } │ ] }
Rule metadata¶
All the metadata for this rule is contained in the parts sheet in the data dictionary. The steps involved are:
Filter the rows to include only those columns whose data type is float or integer
The
dataType
column in the parts sheet contains this metadataIt has a number of possible values but we’re looking for values of
float
orinteger
Check whether the filtered rows has a maximum value. If it does then add the validation rule for that column.
The
minValue
column contains this metadataThe possible values are a number or
NA
.If the value is
NA
then this column does not have a maximum value.
For example in the following ODM dictionary snippet,
Parts v2 ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ partID ┃ partType ┃ sites ┃ dataType ┃ minValue ┃ status ┃ firstReleased ┃ lastUpdated ┃ ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ sites │ tables │ NA │ NA │ NA │ active │ 1.0.0 │ 2.0.0 │ │ geoLat │ attributes │ header │ integer │ -90 │ active │ 1.0.0 │ 2.0.0 │ │ geoLong │ attributes │ header │ float │ -90.15 │ active │ 1.0.0 │ 2.0.0 │ │ geoEPSG │ attributes │ header │ integer │ NA │ active │ 1.0.0 │ 2.0.0 │ └────────────┴───────────────┴──────────┴─────────────┴─────────────┴──────────┴──────────────────┴───────────────┘
The geoLat
part which is a column in the sites
table has a
minValue
value of -91 and would have this rule set for it. On the
other hand, the geoEPSG
part which is also a column in the sites
table would not have this rule due its minValue
value being NA
.
Cerberus Schema¶
We can use the minimum value validation rule from cerberus to implement this rule. Cerberus also needs to know the data type when performing this validation, so we need to set type and coerce as well.
The generated cerberus object for the example above is shown below,
{ │ 'schemaVersion': '2.0.0', │ 'schema': { │ │ 'sites': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': { │ │ │ │ │ 'geoLat': { │ │ │ │ │ │ 'type': 'integer', │ │ │ │ │ │ 'coerce': 'integer', │ │ │ │ │ │ 'min': -90, │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ 'ruleID': 'less_than_min_value', │ │ │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ │ │ 'partID': 'geoLat', │ │ │ │ │ │ │ │ │ │ 'minValue': '-90', │ │ │ │ │ │ │ │ │ │ 'sites': 'header' │ │ │ │ │ │ │ │ │ } │ │ │ │ │ │ │ │ ] │ │ │ │ │ │ │ } │ │ │ │ │ │ ] │ │ │ │ │ }, │ │ │ │ │ 'geoLong': { │ │ │ │ │ │ 'type': 'float', │ │ │ │ │ │ 'coerce': 'float', │ │ │ │ │ │ 'min': -90.15, │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ 'ruleID': 'less_than_min_value', │ │ │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ │ │ 'partID': 'geoLong', │ │ │ │ │ │ │ │ │ │ 'minValue': '-90.15', │ │ │ │ │ │ │ │ │ │ 'sites': 'header' │ │ │ │ │ │ │ │ │ } │ │ │ │ │ │ │ │ ] │ │ │ │ │ │ │ } │ │ │ │ │ │ ] │ │ │ │ │ } │ │ │ │ }, │ │ │ │ 'meta': [ │ │ │ │ │ { │ │ │ │ │ │ 'partID': 'sites', │ │ │ │ │ │ 'partType': 'tables' │ │ │ │ │ } │ │ │ │ ] │ │ │ } │ │ } │ } }
The metadata for this rule should include the row from the ODM for the
part with its partID
and minValue
column values.
ODM Version 1¶
When generating the schema for version 1, we check whether the column has an equivalent part in version 1. If it does, then we add it to the cerberus schema. For example,
Parts v1 ┏━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┓ ┃ partID ┃ partType ┃ sites ┃ dataType ┃ minValue ┃ version… ┃ version… ┃ versio… ┃ status ┃ firstRe… ┃ lastUp… ┃ ┡━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━┩ │ sites │ tables │ NA │ NA │ NA │ tables │ Site │ NA │ active │ 1.0.0 │ 2.0.0 │ │ geoLat │ attribu… │ header │ integer │ -90 │ variabl… │ Site │ Latitu… │ active │ 1.0.0 │ 2.0.0 │ │ geoLong │ attribu… │ header │ float │ -90.15 │ variabl… │ Site │ Longit… │ active │ 1.0.0 │ 2.0.0 │ │ geoEPSG │ attribu… │ header │ integer │ NA │ variabl… │ Site │ Latitu… │ active │ 1.0.0 │ 2.0.0 │ └─────────┴──────────┴────────┴──────────┴──────────┴──────────┴──────────┴─────────┴────────┴──────────┴─────────┘
The corresponding cerberus schema would be,
{ │ 'schemaVersion': '1.0.0', │ 'schema': { │ │ 'Site': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': { │ │ │ │ │ 'Latitude': { │ │ │ │ │ │ 'type': 'integer', │ │ │ │ │ │ 'coerce': 'integer', │ │ │ │ │ │ 'min': -90, │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ 'ruleID': 'less_than_min_value', │ │ │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ │ │ 'partID': 'geoLat', │ │ │ │ │ │ │ │ │ │ 'sites': 'header', │ │ │ │ │ │ │ │ │ │ 'version1Location': 'variables', │ │ │ │ │ │ │ │ │ │ 'version1Table': 'Site', │ │ │ │ │ │ │ │ │ │ 'version1Variable': 'Latitude', │ │ │ │ │ │ │ │ │ │ 'minValue': '-90' │ │ │ │ │ │ │ │ │ } │ │ │ │ │ │ │ │ ] │ │ │ │ │ │ │ } │ │ │ │ │ │ ] │ │ │ │ │ }, │ │ │ │ │ 'Longitude': { │ │ │ │ │ │ 'type': 'float', │ │ │ │ │ │ 'coerce': 'float', │ │ │ │ │ │ 'min': -90.15, │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ 'ruleID': 'less_than_min_value', │ │ │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ │ │ 'partID': 'geoLong', │ │ │ │ │ │ │ │ │ │ 'sites': 'header', │ │ │ │ │ │ │ │ │ │ 'version1Location': 'variables', │ │ │ │ │ │ │ │ │ │ 'version1Table': 'Site', │ │ │ │ │ │ │ │ │ │ 'version1Variable': 'Longitude', │ │ │ │ │ │ │ │ │ │ 'minValue': '-90.15' │ │ │ │ │ │ │ │ │ } │ │ │ │ │ │ │ │ ] │ │ │ │ │ │ │ } │ │ │ │ │ │ ] │ │ │ │ │ } │ │ │ │ }, │ │ │ │ 'meta': [ │ │ │ │ │ { │ │ │ │ │ │ 'partID': 'sites', │ │ │ │ │ │ 'partType': 'tables', │ │ │ │ │ │ 'version1Location': 'tables', │ │ │ │ │ │ 'version1Table': 'Site' │ │ │ │ │ } │ │ │ │ ] │ │ │ } │ │ } │ } }
The metadata should include the following columns,
The
partID
column valueThe
version1Location
column valueThe
version1Table
column valueThe
version1Variable
column valueThe
minValue
column value