duplicate_entries_found¶
This rule identifies when there are two identical entries with the same
primary key. The rational has been described
here.
In brief, the ODM can provide a papertrail of updates made to an entry
by allowing users to update entries by: 1. Not deleting the old entry 2.
Adding a new row for the updated entry and 3. Updating the lastUpdated
field for the updated entry.
For validation, this rule would be violated if two rows have the same
primary key values but non-unique lastUpdated
values. For example, the
following ODM data snippet would fail validation.
Invalid Addresses Table ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ addId ┃ lastUpdated ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ 1 │ │ │ 1 │ │ │ 3 │ │ └────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘
There are two rows (rows 1 and 2) with the same primary key value of 1
but also the same lastUpdated
value.
The following dataset would also fail validation,
Invalid Addresses Table ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ addId ┃ lastUpdated ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ 1 │ │ │ 1 │ 01/02/2023 │ │ 1 │ 01/02/2023 │ │ 3 │ │ │ 2 │ │ │ 2 │ │ └────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘
The following dataset would pass validation,
Valid Addresses Table ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ addId ┃ lastUpdated ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ 1 │ │ │ 1 │ 01/01/2023 │ │ 1 │ 01/02/2023 │ │ 3 │ │ └────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘
Error report¶
The error report will have the following fields
errorType: duplicate_entries_found
tableName: The name of the table with duplicate entries
columnName The name of the primary key column
invalidValue: The value of the primary key column
rowNumbers: The indexes of the duplicate entries
rows The entries in the table that failed this validation rule
validationRuleFields: The ODM data dictionary rule fields violated by this row
message: Duplicate entries found in rows <row_indexes> with primary key column <column_name> and primary key value <primary_key_value> in table <table_name>
The error report for the first invalid dataset is shown below,
{ │ 'errors': [ │ │ { │ │ │ 'errorType': 'duplicate_entries_found', │ │ │ 'tableName': 'addresses', │ │ │ 'columnName': 'addId', │ │ │ 'invalidValue': '1', │ │ │ 'rowNumbers': [ │ │ │ │ 1, │ │ │ │ 2 │ │ │ ], │ │ │ 'rows': [ │ │ │ │ { │ │ │ │ │ 'addId': '1', │ │ │ │ │ 'lastUpdated': '' │ │ │ │ }, │ │ │ │ { │ │ │ │ │ 'addId': '1', │ │ │ │ │ 'lastUpdated': '' │ │ │ │ } │ │ │ ], │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'addId', │ │ │ │ │ 'addresses': 'pK' │ │ │ │ } │ │ │ ], │ │ │ 'message': 'duplicate_entries_found rule violated in table addresses, column addId, row(s) 1,2: Duplicate entries found with primary key "1"' │ │ } │ ], │ 'warnings': [] }
In addition, a seperate error report object should be generated for each set of duplicate values found. For example the error report for the second invalid dataset above is shown below,
{ │ 'errors': [ │ │ { │ │ │ 'errorType': 'duplicate_entries_found', │ │ │ 'tableName': 'addresses', │ │ │ 'columnName': 'addId', │ │ │ 'invalidValue': '1', │ │ │ 'rowNumbers': [ │ │ │ │ 2, │ │ │ │ 3 │ │ │ ], │ │ │ 'rows': [ │ │ │ │ { │ │ │ │ │ 'addId': '1', │ │ │ │ │ 'lastUpdated': '01/02/2023' │ │ │ │ }, │ │ │ │ { │ │ │ │ │ 'addId': '1', │ │ │ │ │ 'lastUpdated': '01/02/2023' │ │ │ │ } │ │ │ ], │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'addId', │ │ │ │ │ 'addresses': 'pK' │ │ │ │ } │ │ │ ], │ │ │ 'message': 'duplicate_entries_found rule violated in table addresses, column addId, row(s) 2,3: Duplicate entries found with primary key "1"' │ │ }, │ │ { │ │ │ 'errorType': 'duplicate_entries_found', │ │ │ 'tableName': 'addresses', │ │ │ 'columnName': 'addId', │ │ │ 'invalidValue': '2', │ │ │ 'rowNumbers': [ │ │ │ │ 5, │ │ │ │ 6 │ │ │ ], │ │ │ 'rows': [ │ │ │ │ { │ │ │ │ │ 'addId': '2', │ │ │ │ │ 'lastUpdated': '' │ │ │ │ }, │ │ │ │ { │ │ │ │ │ 'addId': '2', │ │ │ │ │ 'lastUpdated': '' │ │ │ │ } │ │ │ ], │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'addId', │ │ │ │ │ 'addresses': 'pK' │ │ │ │ } │ │ │ ], │ │ │ 'message': 'duplicate_entries_found rule violated in table addresses, column addId, row(s) 5,6: Duplicate entries found with primary key "2"' │ │ } │ ], │ 'warnings': [] }
Two error report objects should be generated, one for rows 2 and 3 and, one for rows 5 and 6.
Rule metadata¶
All the metadata for this rule is contained in the parts sheet in the data dictionary. This rule should be added to the primary key column for all tables which can identified using these instructions.
For example,
Parts v2 ┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ ┃ partID ┃ partType ┃ addresses ┃ status ┃ firstReleased ┃ lastUpdated ┃ ┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ │ addresses │ tables │ NA │ active │ 1.0.0 │ 2.0.0 │ │ addId │ attributes │ pK │ active │ 1.0.0 │ 2.0.0 │ │ addL1 │ attributes │ header │ active │ 1.0.0 │ 2.0.0 │ └─────────────────┴───────────────────┴─────────────────┴─────────────┴───────────────────────┴───────────────────┘
Here the addId
part can be identified as being the primary key for the
addresses table and hence should have this rule implemented.
Cerberus Schema¶
Cerberus currently does not have support for this rule. We will need to extend the cerberus validator to accomplish this.
The generated cerberus object for the example above is shown below,
{ │ 'schemaVersion': '2.0.0', │ 'schema': { │ │ 'addresses': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': { │ │ │ │ │ 'addId': { │ │ │ │ │ │ 'unique': True, │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ 'ruleID': 'duplicate_entries_found', │ │ │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ │ │ 'partID': 'addId', │ │ │ │ │ │ │ │ │ │ 'addresses': 'pK' │ │ │ │ │ │ │ │ │ } │ │ │ │ │ │ │ │ ] │ │ │ │ │ │ │ } │ │ │ │ │ │ ] │ │ │ │ │ } │ │ │ │ }, │ │ │ │ 'meta': [ │ │ │ │ │ { │ │ │ │ │ │ 'partID': 'addresses', │ │ │ │ │ │ 'partType': 'tables' │ │ │ │ │ } │ │ │ │ ] │ │ │ } │ │ } │ } }
The metadata for this rule should include the row from the ODM that defines the primary key column for the table.
ODM Version 1¶
When generating the schema for version 1, we should add this rule to any version 2 primary key columns which have a version 1 equivalent. For example, the parts snippet above with its version 1 columns is shown below
Parts v1 ┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓ ┃ partID ┃ partType ┃ addresses ┃ version1Lo… ┃ version1Ta… ┃ version1V… ┃ status ┃ firstRelea… ┃ lastUpdat… ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩ │ addresses │ tables │ NA │ tables │ Address │ NA │ active │ 1.0.0 │ 2.0.0 │ │ addId │ attributes │ pK │ variables │ Address │ AddressId │ active │ 1.0.0 │ 2.0.0 │ │ addL1 │ attributes │ header │ variables │ Address │ AddL1 │ active │ 1.0.0 │ 2.0.0 │ └───────────┴────────────┴───────────┴─────────────┴─────────────┴────────────┴────────┴─────────────┴────────────┘
The corresponding cerberus schema would be,
{ │ 'schemaVersion': '1.0.0', │ 'schema': { │ │ 'Address': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': { │ │ │ │ │ 'AddressId': { │ │ │ │ │ │ 'unique': True, │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ 'ruleID': 'duplicate_entries_found', │ │ │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ │ │ 'partID': 'addId', │ │ │ │ │ │ │ │ │ │ 'addresses': 'pK', │ │ │ │ │ │ │ │ │ │ 'version1Location': 'variables', │ │ │ │ │ │ │ │ │ │ 'version1Table': 'Address', │ │ │ │ │ │ │ │ │ │ 'version1Variable': 'AddressId' │ │ │ │ │ │ │ │ │ } │ │ │ │ │ │ │ │ ] │ │ │ │ │ │ │ } │ │ │ │ │ │ ] │ │ │ │ │ } │ │ │ │ }, │ │ │ │ 'meta': [ │ │ │ │ │ { │ │ │ │ │ │ 'partID': 'addresses', │ │ │ │ │ │ 'partType': 'tables', │ │ │ │ │ │ 'version1Location': 'tables', │ │ │ │ │ │ 'version1Table': 'Address' │ │ │ │ │ } │ │ │ │ ] │ │ │ } │ │ } │ } }
The metadata should include the following columns,
The
partID
column valueThe
<table_name>
column valueThe
version1Location
column valueThe
version1Table
column valueThe
version1Variable
column value