# duplicate_entries_found This rule identifies when there are two identical entries with the same primary key. The rational has been described [here](https://odm.discourse.group/t/duplicate-entries-and-lastedited-field/55). In brief, the ODM can provide a papertrail of updates made to an entry by allowing users to update entries by: 1. Not deleting the old entry 2. Adding a new row for the updated entry and 3. Updating the `lastUpdated` field for the updated entry. For validation, this rule would be violated if two rows have the same primary key values but non-unique `lastUpdated` values. For example, the following ODM data snippet would fail validation.
Invalid Addresses Table ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ addId ┃ lastUpdated ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ 1 │ │ │ 1 │ │ │ 3 │ │ └────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘There are two rows (rows 1 and 2) with the same primary key value of `1` but also the same `lastUpdated` value. The following dataset would also fail validation,
Invalid Addresses Table ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ addId ┃ lastUpdated ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ 1 │ │ │ 1 │ 01/02/2023 │ │ 1 │ 01/02/2023 │ │ 3 │ │ │ 2 │ │ │ 2 │ │ └────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘The following dataset would pass validation,
Valid Addresses Table ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ addId ┃ lastUpdated ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ 1 │ │ │ 1 │ 01/01/2023 │ │ 1 │ 01/02/2023 │ │ 3 │ │ └────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────┘## Error report The error report will have the following fields - **errorType**: duplicate_entries_found - **tableName**: The name of the table with duplicate entries - **columnName** The name of the primary key column - **invalidValue**: The value of the primary key column - **rowNumbers**: The indexes of the duplicate entries - **rows** The entries in the table that failed this validation rule - **validationRuleFields**: The ODM data dictionary rule fields violated by this row - **message**: Duplicate entries found in rows \
{ │ 'errors': [ │ │ { │ │ │ 'errorType': 'duplicate_entries_found', │ │ │ 'tableName': 'addresses', │ │ │ 'columnName': 'addId', │ │ │ 'invalidValue': '1', │ │ │ 'rowNumbers': [ │ │ │ │ 1, │ │ │ │ 2 │ │ │ ], │ │ │ 'rows': [ │ │ │ │ { │ │ │ │ │ 'addId': '1', │ │ │ │ │ 'lastUpdated': '' │ │ │ │ }, │ │ │ │ { │ │ │ │ │ 'addId': '1', │ │ │ │ │ 'lastUpdated': '' │ │ │ │ } │ │ │ ], │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'addId', │ │ │ │ │ 'addresses': 'pK' │ │ │ │ } │ │ │ ], │ │ │ 'message': 'duplicate_entries_found rule violated in table addresses, column addId, row(s) 1,2: Duplicate entries found with primary key "1"' │ │ } │ ], │ 'warnings': [] }In addition, a seperate error report object should be generated for each set of duplicate values found. For example the error report for the second invalid dataset above is shown below,
{ │ 'errors': [ │ │ { │ │ │ 'errorType': 'duplicate_entries_found', │ │ │ 'tableName': 'addresses', │ │ │ 'columnName': 'addId', │ │ │ 'invalidValue': '1', │ │ │ 'rowNumbers': [ │ │ │ │ 2, │ │ │ │ 3 │ │ │ ], │ │ │ 'rows': [ │ │ │ │ { │ │ │ │ │ 'addId': '1', │ │ │ │ │ 'lastUpdated': '01/02/2023' │ │ │ │ }, │ │ │ │ { │ │ │ │ │ 'addId': '1', │ │ │ │ │ 'lastUpdated': '01/02/2023' │ │ │ │ } │ │ │ ], │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'addId', │ │ │ │ │ 'addresses': 'pK' │ │ │ │ } │ │ │ ], │ │ │ 'message': 'duplicate_entries_found rule violated in table addresses, column addId, row(s) 2,3: Duplicate entries found with primary key "1"' │ │ }, │ │ { │ │ │ 'errorType': 'duplicate_entries_found', │ │ │ 'tableName': 'addresses', │ │ │ 'columnName': 'addId', │ │ │ 'invalidValue': '2', │ │ │ 'rowNumbers': [ │ │ │ │ 5, │ │ │ │ 6 │ │ │ ], │ │ │ 'rows': [ │ │ │ │ { │ │ │ │ │ 'addId': '2', │ │ │ │ │ 'lastUpdated': '' │ │ │ │ }, │ │ │ │ { │ │ │ │ │ 'addId': '2', │ │ │ │ │ 'lastUpdated': '' │ │ │ │ } │ │ │ ], │ │ │ 'validationRuleFields': [ │ │ │ │ { │ │ │ │ │ 'partID': 'addId', │ │ │ │ │ 'addresses': 'pK' │ │ │ │ } │ │ │ ], │ │ │ 'message': 'duplicate_entries_found rule violated in table addresses, column addId, row(s) 5,6: Duplicate entries found with primary key "2"' │ │ } │ ], │ 'warnings': [] }Two error report objects should be generated, one for rows 2 and 3 and, one for rows 5 and 6. ## Rule metadata All the metadata for this rule is contained in the parts sheet in the data dictionary. This rule should be added to the primary key column for all tables which can identified using these [instructions](../specs/odm-how-tos.md#how-to-get-the-columns-names-for-a-table). For example,
Parts v2 ┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ ┃ partID ┃ partType ┃ addresses ┃ status ┃ firstReleased ┃ lastUpdated ┃ ┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ │ addresses │ tables │ NA │ active │ 1.0.0 │ 2.0.0 │ │ addId │ attributes │ pK │ active │ 1.0.0 │ 2.0.0 │ │ addL1 │ attributes │ header │ active │ 1.0.0 │ 2.0.0 │ └─────────────────┴───────────────────┴─────────────────┴─────────────┴───────────────────────┴───────────────────┘Here the `addId` part can be identified as being the primary key for the addresses table and hence should have this rule implemented. ## Cerberus Schema Cerberus currently does not have support for this rule. We will need to [extend the cerberus validator](https://docs.python-cerberus.org/en/stable/customize.html) to accomplish this. The generated cerberus object for the example above is shown below,
{ │ 'schemaVersion': '2.0.0', │ 'schema': { │ │ 'addresses': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': { │ │ │ │ │ 'addId': { │ │ │ │ │ │ 'unique': True, │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ 'ruleID': 'duplicate_entries_found', │ │ │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ │ │ 'partID': 'addId', │ │ │ │ │ │ │ │ │ │ 'addresses': 'pK' │ │ │ │ │ │ │ │ │ } │ │ │ │ │ │ │ │ ] │ │ │ │ │ │ │ } │ │ │ │ │ │ ] │ │ │ │ │ } │ │ │ │ }, │ │ │ │ 'meta': [ │ │ │ │ │ { │ │ │ │ │ │ 'partID': 'addresses', │ │ │ │ │ │ 'partType': 'tables' │ │ │ │ │ } │ │ │ │ ] │ │ │ } │ │ } │ } }The metadata for this rule should include the row from the ODM that defines the primary key column for the table. ## ODM Version 1 When generating the schema for version 1, we should add this rule to any version 2 primary key columns which have a version 1 equivalent. For example, the parts snippet above with its version 1 columns is shown below
Parts v1 ┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┓ ┃ partID ┃ partType ┃ addresses ┃ version1Lo… ┃ version1Ta… ┃ version1V… ┃ status ┃ firstRelea… ┃ lastUpdat… ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━┩ │ addresses │ tables │ NA │ tables │ Address │ NA │ active │ 1.0.0 │ 2.0.0 │ │ addId │ attributes │ pK │ variables │ Address │ AddressId │ active │ 1.0.0 │ 2.0.0 │ │ addL1 │ attributes │ header │ variables │ Address │ AddL1 │ active │ 1.0.0 │ 2.0.0 │ └───────────┴────────────┴───────────┴─────────────┴─────────────┴────────────┴────────┴─────────────┴────────────┘The corresponding cerberus schema would be,
{ │ 'schemaVersion': '1.0.0', │ 'schema': { │ │ 'Address': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': { │ │ │ │ │ 'AddressId': { │ │ │ │ │ │ 'unique': True, │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ 'ruleID': 'duplicate_entries_found', │ │ │ │ │ │ │ │ 'meta': [ │ │ │ │ │ │ │ │ │ { │ │ │ │ │ │ │ │ │ │ 'partID': 'addId', │ │ │ │ │ │ │ │ │ │ 'addresses': 'pK', │ │ │ │ │ │ │ │ │ │ 'version1Location': 'variables', │ │ │ │ │ │ │ │ │ │ 'version1Table': 'Address', │ │ │ │ │ │ │ │ │ │ 'version1Variable': 'AddressId' │ │ │ │ │ │ │ │ │ } │ │ │ │ │ │ │ │ ] │ │ │ │ │ │ │ } │ │ │ │ │ │ ] │ │ │ │ │ } │ │ │ │ }, │ │ │ │ 'meta': [ │ │ │ │ │ { │ │ │ │ │ │ 'partID': 'addresses', │ │ │ │ │ │ 'partType': 'tables', │ │ │ │ │ │ 'version1Location': 'tables', │ │ │ │ │ │ 'version1Table': 'Address' │ │ │ │ │ } │ │ │ │ ] │ │ │ } │ │ } │ } }The metadata should include the following columns, - The `partID` column value - The `