duplicate_entries_found¶

This rule identifies when there are two identical entries with the same primary key. The rational has been described here. In brief, the ODM can provide a papertrail of updates made to an entry by allowing users to update entries by: 1. Not deleting the old entry 2. Adding a new row for the updated entry and 3. Updating the lastUpdated field for the updated entry.

For validation, this rule would be violated if two rows have the same primary key values but non-unique lastUpdated values. For example, the following ODM data snippet would fail validation.

                                              Invalid Addresses Table                                              
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ addId                                  ┃ lastUpdated                                                            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                      │                                                                        │
│ 1                                      │                                                                        │
│ 3                                      │                                                                        │
└────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘

There are two rows (rows 1 and 2) with the same primary key value of 1 but also the same lastUpdated value.

The following dataset would also fail validation,

                                              Invalid Addresses Table                                              
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ addId                                  ┃ lastUpdated                                                            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                      │                                                                        │
│ 1                                      │ 01/02/2023                                                             │
│ 1                                      │ 01/02/2023                                                             │
│ 3                                      │                                                                        │
│ 2                                      │                                                                        │
│ 2                                      │                                                                        │
└────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘

The following dataset would pass validation,

                                               Valid Addresses Table                                               
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ addId                                  ┃ lastUpdated                                                            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1                                      │                                                                        │
│ 1                                      │ 01/01/2023                                                             │
│ 1                                      │ 01/02/2023                                                             │
│ 3                                      │                                                                        │
└────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘

Error report¶

The error report will have the following fields

errorType: duplicate_entries_found
tableName: The name of the table with duplicate entries
columnName The name of the primary key column
invalidValue: The value of the primary key column
rowNumbers: The indexes of the duplicate entries
rows The entries in the table that failed this validation rule
validationRuleFields: The ODM data dictionary rule fields violated by this row
message: Duplicate entries found in rows <row_indexes> with primary key column <column_name> and primary key value <primary_key_value> in table <table_name>

The error report for the first invalid dataset is shown below,

{
│   'errors': [
│   │   {
│   │   │   'errorType': 'duplicate_entries_found',
│   │   │   'tableName': 'addresses',
│   │   │   'columnName': 'addId',
│   │   │   'invalidValue': '1',
│   │   │   'rowNumbers': [
│   │   │   │   1,
│   │   │   │   2
│   │   │   ],
│   │   │   'rows': [
│   │   │   │   {
│   │   │   │   │   'addId': '1',
│   │   │   │   │   'lastUpdated': ''
│   │   │   │   },
│   │   │   │   {
│   │   │   │   │   'addId': '1',
│   │   │   │   │   'lastUpdated': ''
│   │   │   │   }
│   │   │   ],
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'addId',
│   │   │   │   │   'addresses': 'pK'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': 'duplicate_entries_found rule violated in table addresses, column addId, row(s) 1,2: Duplicate entries found with primary key "1"'
│   │   }
│   ],
│   'warnings': []
}

In addition, a seperate error report object should be generated for each set of duplicate values found. For example the error report for the second invalid dataset above is shown below,

{
│   'errors': [
│   │   {
│   │   │   'errorType': 'duplicate_entries_found',
│   │   │   'tableName': 'addresses',
│   │   │   'columnName': 'addId',
│   │   │   'invalidValue': '1',
│   │   │   'rowNumbers': [
│   │   │   │   2,
│   │   │   │   3
│   │   │   ],
│   │   │   'rows': [
│   │   │   │   {
│   │   │   │   │   'addId': '1',
│   │   │   │   │   'lastUpdated': '01/02/2023'
│   │   │   │   },
│   │   │   │   {
│   │   │   │   │   'addId': '1',
│   │   │   │   │   'lastUpdated': '01/02/2023'
│   │   │   │   }
│   │   │   ],
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'addId',
│   │   │   │   │   'addresses': 'pK'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': 'duplicate_entries_found rule violated in table addresses, column addId, row(s) 2,3: Duplicate entries found with primary key "1"'
│   │   },
│   │   {
│   │   │   'errorType': 'duplicate_entries_found',
│   │   │   'tableName': 'addresses',
│   │   │   'columnName': 'addId',
│   │   │   'invalidValue': '2',
│   │   │   'rowNumbers': [
│   │   │   │   5,
│   │   │   │   6
│   │   │   ],
│   │   │   'rows': [
│   │   │   │   {
│   │   │   │   │   'addId': '2',
│   │   │   │   │   'lastUpdated': ''
│   │   │   │   },
│   │   │   │   {
│   │   │   │   │   'addId': '2',
│   │   │   │   │   'lastUpdated': ''
│   │   │   │   }
│   │   │   ],
│   │   │   'validationRuleFields': [
│   │   │   │   {
│   │   │   │   │   'partID': 'addId',
│   │   │   │   │   'addresses': 'pK'
│   │   │   │   }
│   │   │   ],
│   │   │   'message': 'duplicate_entries_found rule violated in table addresses, column addId, row(s) 5,6: Duplicate entries found with primary key "2"'
│   │   }
│   ],
│   'warnings': []
}

Two error report objects should be generated, one for rows 2 and 3 and, one for rows 5 and 6.

Rule metadata¶

All the metadata for this rule is contained in the parts sheet in the data dictionary. This rule should be added to the primary key column for all tables which can identified using these instructions.

For example,

                                                     Parts v2                                                      
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ partID          ┃ partType          ┃ addresses       ┃ status      ┃ firstReleased         ┃ lastUpdated       ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ addresses       │ tables            │ NA              │ active      │ 1.0.0                 │ 2.0.0             │
│ addId           │ attributes        │ pK              │ active      │ 1.0.0                 │ 2.0.0             │
│ addL1           │ attributes        │ header          │ active      │ 1.0.0                 │ 2.0.0             │
└─────────────────┴───────────────────┴─────────────────┴─────────────┴───────────────────────┴───────────────────┘

Here the addId part can be identified as being the primary key for the addresses table and hence should have this rule implemented.

Cerberus Schema¶

Cerberus currently does not have support for this rule. We will need to extend the cerberus validator to accomplish this.

The generated cerberus object for the example above is shown below,

{
│   'schemaVersion': '2.0.0',
│   'schema': {
│   │   'addresses': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'addId': {
│   │   │   │   │   │   'unique': True,
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'duplicate_entries_found',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'addId',
│   │   │   │   │   │   │   │   │   │   'addresses': 'pK'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ]
│   │   │   │   │   }
│   │   │   │   },
│   │   │   │   'meta': [
│   │   │   │   │   {
│   │   │   │   │   │   'partID': 'addresses',
│   │   │   │   │   │   'partType': 'tables'
│   │   │   │   │   }
│   │   │   │   ]
│   │   │   }
│   │   }
│   }
}

The metadata for this rule should include the row from the ODM that defines the primary key column for the table.

ODM Version 1¶

When generating the schema for version 1, we should add this rule to any version 2 primary key columns which have a version 1 equivalent. For example, the parts snippet above with its version 1 columns is shown below

                                                     Parts v1                                                      
┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ partID    ┃ partType   ┃ addresses ┃ version1Lo… ┃ version1Ta… ┃ version1V… ┃ status ┃ firstRelea… ┃ lastUpdat… ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ addresses │ tables     │ NA        │ tables      │ Address     │ NA         │ active │ 1.0.0       │ 2.0.0      │
│ addId     │ attributes │ pK        │ variables   │ Address     │ AddressId  │ active │ 1.0.0       │ 2.0.0      │
│ addL1     │ attributes │ header    │ variables   │ Address     │ AddL1      │ active │ 1.0.0       │ 2.0.0      │
└───────────┴────────────┴───────────┴─────────────┴─────────────┴────────────┴────────┴─────────────┴────────────┘

The corresponding cerberus schema would be,

{
│   'schemaVersion': '1.0.0',
│   'schema': {
│   │   'Address': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'AddressId': {
│   │   │   │   │   │   'unique': True,
│   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   'ruleID': 'duplicate_entries_found',
│   │   │   │   │   │   │   │   'meta': [
│   │   │   │   │   │   │   │   │   {
│   │   │   │   │   │   │   │   │   │   'partID': 'addId',
│   │   │   │   │   │   │   │   │   │   'addresses': 'pK',
│   │   │   │   │   │   │   │   │   │   'version1Location': 'variables',
│   │   │   │   │   │   │   │   │   │   'version1Table': 'Address',
│   │   │   │   │   │   │   │   │   │   'version1Variable': 'AddressId'
│   │   │   │   │   │   │   │   │   }
│   │   │   │   │   │   │   │   ]
│   │   │   │   │   │   │   }
│   │   │   │   │   │   ]
│   │   │   │   │   }
│   │   │   │   },
│   │   │   │   'meta': [
│   │   │   │   │   {
│   │   │   │   │   │   'partID': 'addresses',
│   │   │   │   │   │   'partType': 'tables',
│   │   │   │   │   │   'version1Location': 'tables',
│   │   │   │   │   │   'version1Table': 'Address'
│   │   │   │   │   }
│   │   │   │   ]
│   │   │   }
│   │   }
│   }
}

The metadata should include the following columns,

The partID column value
The <table_name> column value
The version1Location column value
The version1Table column value
The version1Variable column value

duplicate_entries_found¶

Error report¶

Rule metadata¶

Cerberus Schema¶

ODM Version 1¶

PHES-ODM Validation

Navigation

Related Topics