duplicate_entries_found

This rule identifies when there are two identical entries with the same primary key. The rational has been described here. In brief, the ODM can provide a papertrail of updates made to an entry by allowing users to update entries by: 1. Not deleting the old entry 2. Adding a new row for the updated entry and 3. Updating the lastUpdated field for the updated entry.

For validation, this rule would be violated if two rows have the same primary key values but non-unique lastUpdated values. For example, the following ODM data snippet would fail validation.

                                              Invalid Addresses Table                                              
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ addId                                   lastUpdated                                                            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                      │                                                                        │
│ 1                                      │                                                                        │
│ 3                                      │                                                                        │
└────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘

There are two rows (rows 1 and 2) with the same primary key value of 1 but also the same lastUpdated value.

The following dataset would also fail validation,

                                              Invalid Addresses Table                                              
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ addId                                   lastUpdated                                                            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                      │                                                                        │
│ 1                                      │ 01/02/2023                                                             │
│ 1                                      │ 01/02/2023                                                             │
│ 3                                      │                                                                        │
│ 2                                      │                                                                        │
│ 2                                      │                                                                        │
└────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘

The following dataset would pass validation,

                                               Valid Addresses Table                                               
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ addId                                   lastUpdated                                                            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                      │                                                                        │
│ 1                                      │ 01/01/2023                                                             │
│ 1                                      │ 01/02/2023                                                             │
│ 3                                      │                                                                        │
└────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘

Error report

The error report will have the following fields

  • errorType: duplicate_entries_found

  • tableName: The name of the table with duplicate entries

  • columnName The name of the primary key column

  • invalidValue: The value of the primary key column

  • rowNumbers: The indexes of the duplicate entries

  • rows The entries in the table that failed this validation rule

  • validationRuleFields: The ODM data dictionary rule fields violated by this row

  • message: Duplicate entries found in rows <row_indexes> with primary key column <column_name> and primary key value <primary_key_value> in table <table_name>

The error report for the first invalid dataset is shown below,

{
'errors': [
│   │   {
│   │   │   'errorType': 'duplicate_entries_found',
│   │   │   'tableName': 'addresses',
│   │   │   'columnName': 'addId',
│   │   │   'invalidValue': '1',
│   │   │   'rowNumbers': [
│   │   │   │   1,
│   │   │   │   2
│   │   │   ],
│   │   │   'rows': [
│   │   │   │   {
│   │   │   │   │   'addId': '1',
│   │   │   │   │   'lastUpdated': ''
│   │   │   │   },
│   │   │   │   {
│   │   │   │   │   'addId': '1',
│   │   │   │   │   'lastUpdated': ''
│   │   │   │   }
│   │   │   ],
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'addId',
│   │   │   │   │   'addresses': 'pK'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': 'duplicate_entries_found rule violated in table addresses, column addId, row(s) 1,2: Duplicate entries found with primary key "1"'
│   │   }
],
'warnings': []
}

In addition, a seperate error report object should be generated for each set of duplicate values found. For example the error report for the second invalid dataset above is shown below,

{
'errors': [
│   │   {
│   │   │   'errorType': 'duplicate_entries_found',
│   │   │   'tableName': 'addresses',
│   │   │   'columnName': 'addId',
│   │   │   'invalidValue': '1',
│   │   │   'rowNumbers': [
│   │   │   │   2,
│   │   │   │   3
│   │   │   ],
│   │   │   'rows': [
│   │   │   │   {
│   │   │   │   │   'addId': '1',
│   │   │   │   │   'lastUpdated': '01/02/2023'
│   │   │   │   },
│   │   │   │   {
│   │   │   │   │   'addId': '1',
│   │   │   │   │   'lastUpdated': '01/02/2023'
│   │   │   │   }
│   │   │   ],
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'addId',
│   │   │   │   │   'addresses': 'pK'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': 'duplicate_entries_found rule violated in table addresses, column addId, row(s) 2,3: Duplicate entries found with primary key "1"'
│   │   },
│   │   {
│   │   │   'errorType': 'duplicate_entries_found',
│   │   │   'tableName': 'addresses',
│   │   │   'columnName': 'addId',
│   │   │   'invalidValue': '2',
│   │   │   'rowNumbers': [
│   │   │   │   5,
│   │   │   │   6
│   │   │   ],
│   │   │   'rows': [
│   │   │   │   {
│   │   │   │   │   'addId': '2',
│   │   │   │   │   'lastUpdated': ''
│   │   │   │   },
│   │   │   │   {
│   │   │   │   │   'addId': '2',
│   │   │   │   │   'lastUpdated': ''
│   │   │   │   }
│   │   │   ],
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'addId',
│   │   │   │   │   'addresses': 'pK'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': 'duplicate_entries_found rule violated in table addresses, column addId, row(s) 5,6: Duplicate entries found with primary key "2"'
│   │   }
],
'warnings': []
}

Two error report objects should be generated, one for rows 2 and 3 and, one for rows 5 and 6.

Rule metadata

All the metadata for this rule is contained in the parts sheet in the data dictionary. This rule should be added to the primary key column for all tables which can identified using these instructions.

For example,

                                                     Parts v2                                                      
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ partID           partType           addresses        status       firstReleased          lastUpdated       ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ addresses       │ tables            │ NA              │ active      │ 1.0.0                 │ 2.0.0             │
│ addId           │ attributes        │ pK              │ active      │ 1.0.0                 │ 2.0.0             │
│ addL1           │ attributes        │ header          │ active      │ 1.0.0                 │ 2.0.0             │
└─────────────────┴───────────────────┴─────────────────┴─────────────┴───────────────────────┴───────────────────┘

Here the addId part can be identified as being the primary key for the addresses table and hence should have this rule implemented.

Cerberus Schema

Cerberus currently does not have support for this rule. We will need to extend the cerberus validator to accomplish this.

The generated cerberus object for the example above is shown below,

{
'schemaVersion': '2.0.0',
'schema': {
│   │   'addresses': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'addId': {
│   │   │   │   │   │   'unique': True,
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'duplicate_entries_found',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'addId',
│   │   │   │   │   │   │   │   │   │   'addresses': 'pK'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ]
│   │   │   │   │   }
│   │   │   │   },
│   │   │   │   'meta': [
│   │   │   │   │   {
│   │   │   │   │   │   'partID': 'addresses',
│   │   │   │   │   │   'partType': 'tables'
│   │   │   │   │   }
│   │   │   │   ]
│   │   │   }
│   │   }
}
}

The metadata for this rule should include the row from the ODM that defines the primary key column for the table.

ODM Version 1

When generating the schema for version 1, we should add this rule to any version 2 primary key columns which have a version 1 equivalent. For example, the parts snippet above with its version 1 columns is shown below

                                                     Parts v1                                                      
┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ partID     partType    addresses  version1Lo…  version1Ta…  version1V…  status  firstRelea…  lastUpdat… ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ addresses │ tables     │ NA        │ tables      │ Address     │ NA         │ active │ 1.0.0       │ 2.0.0      │
│ addId     │ attributes │ pK        │ variables   │ Address     │ AddressId  │ active │ 1.0.0       │ 2.0.0      │
│ addL1     │ attributes │ header    │ variables   │ Address     │ AddL1      │ active │ 1.0.0       │ 2.0.0      │
└───────────┴────────────┴───────────┴─────────────┴─────────────┴────────────┴────────┴─────────────┴────────────┘

The corresponding cerberus schema would be,

{
'schemaVersion': '1.0.0',
'schema': {
│   │   'Address': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'AddressId': {
│   │   │   │   │   │   'unique': True,
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'duplicate_entries_found',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'addId',
│   │   │   │   │   │   │   │   │   │   'addresses': 'pK',
│   │   │   │   │   │   │   │   │   │   'version1Location': 'variables',
│   │   │   │   │   │   │   │   │   │   'version1Table': 'Address',
│   │   │   │   │   │   │   │   │   │   'version1Variable': 'AddressId'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ]
│   │   │   │   │   }
│   │   │   │   },
│   │   │   │   'meta': [
│   │   │   │   │   {
│   │   │   │   │   │   'partID': 'addresses',
│   │   │   │   │   │   'partType': 'tables',
│   │   │   │   │   │   'version1Location': 'tables',
│   │   │   │   │   │   'version1Table': 'Address'
│   │   │   │   │   }
│   │   │   │   ]
│   │   │   }
│   │   }
}
}

The metadata should include the following columns,

  • The partID column value

  • The <table_name> column value

  • The version1Location column value

  • The version1Table column value

  • The version1Variable column value