Summary Report

This document provides the specifications for programmatically summarizing a validation report generated with the validation.validate_data function.

Context

Summarizing the validation report is desireable when the dataset is large and there are a lot of validation errors/warnings. A summary report enables a user to have a bird’s eye view of the validation errors in a dataset, making it easy to identify where in the errors are localized, if they are.

The next sections go over the assets we will be working with in this specification.

The validation schema

The validation schema we will be working with is displayed below,

{
'schemaVersion': '2.0.0',
'schema': {
│   │   'addresses': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'addID': {
│   │   │   │   │   │   'type': 'string',
│   │   │   │   │   │   'required': True
│   │   │   │   │   }
│   │   │   │   }
│   │   │   }
│   │   },
│   │   'sites': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'siteID': {
│   │   │   │   │   │   'type': 'integer',
│   │   │   │   │   │   'required': True
│   │   │   │   │   },
│   │   │   │   │   'geoLat': {
│   │   │   │   │   │   'type': 'integer',
│   │   │   │   │   │   'coerce': 'integer',
│   │   │   │   │   │   'min': -90,
│   │   │   │   │   │   'max': 90
│   │   │   │   │   },
│   │   │   │   │   'geoLong': {
│   │   │   │   │   │   'type': 'integer',
│   │   │   │   │   │   'coerce': 'integer',
│   │   │   │   │   │   'min': -90,
│   │   │   │   │   │   'max': 90
│   │   │   │   │   }
│   │   │   │   }
│   │   │   }
│   │   }
}
}

It defines the following entities:

  • An addresses table with a string column called addID which is mandatory; and

  • A sites table with:

    • A string column called siteID which is mandatory

    • Two integer columns called geoLat and geoLong, both with a min and max value of -90 and 90 respectively.

The dataset

The addresses and sites tables we will be validating are shown below,

                                                  Addresses Table                                                  
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ addID                                                                                                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Calgary                                                                                                         │
│ Ottawa                                                                                                          │
│ Vancouver                                                                                                       │
│ Hamilton                                                                                                        │
│ Toronto                                                                                                         │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
                                                    Sites Table                                                    
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ geoLat                                               geoLong                                                   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ -91                                                 │ 92                                                        │
│ -92                                                 │ 93                                                        │
│ -93                                                 │ 94                                                        │
└─────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘

The validation report

Finally, the full validation report for our dataset is printed below,

ValidationReport(
data_version='2.0.0',
schema_version='2.0.0',
package_version='0.5.0',
table_info={'addresses': {'columns': 1, 'rows': 5}, 'sites': {'columns': 2, 'rows': 3}},
errors=[
│   │   {
│   │   │   'errorType': 'less_than_min_value',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'less_than_min_value rule violated in table sites, column geoLat, row(s) 1: Value "-91" is less than the min value of "-90"',
│   │   │   'rowNumber': 1,
│   │   │   'row': {'geoLat': -91, 'geoLong': 92},
│   │   │   'invalidValue': -91
│   │   },
│   │   {
│   │   │   'errorType': 'less_than_min_value',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'less_than_min_value rule violated in table sites, column geoLat, row(s) 2: Value "-92" is less than the min value of "-90"',
│   │   │   'rowNumber': 2,
│   │   │   'row': {'geoLat': -92, 'geoLong': 93},
│   │   │   'invalidValue': -92
│   │   },
│   │   {
│   │   │   'errorType': 'less_than_min_value',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'less_than_min_value rule violated in table sites, column geoLat, row(s) 3: Value "-93" is less than the min value of "-90"',
│   │   │   'rowNumber': 3,
│   │   │   'row': {'geoLat': -93, 'geoLong': 94},
│   │   │   'invalidValue': -93
│   │   },
│   │   {
│   │   │   'errorType': 'greater_than_max_value',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLong',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'greater_than_max_value rule violated in table sites, column geoLong, row(s) 1: Value "92" is greater than the max value of "90"',
│   │   │   'rowNumber': 1,
│   │   │   'row': {'geoLat': -91, 'geoLong': 92},
│   │   │   'invalidValue': 92
│   │   },
│   │   {
│   │   │   'errorType': 'greater_than_max_value',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLong',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'greater_than_max_value rule violated in table sites, column geoLong, row(s) 2: Value "93" is greater than the max value of "90"',
│   │   │   'rowNumber': 2,
│   │   │   'row': {'geoLat': -92, 'geoLong': 93},
│   │   │   'invalidValue': 93
│   │   },
│   │   {
│   │   │   'errorType': 'greater_than_max_value',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLong',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'greater_than_max_value rule violated in table sites, column geoLong, row(s) 3: Value "94" is greater than the max value of "90"',
│   │   │   'rowNumber': 3,
│   │   │   'row': {'geoLat': -93, 'geoLong': 94},
│   │   │   'invalidValue': 94
│   │   },
│   │   {
│   │   │   'errorType': 'missing_mandatory_column',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'siteID',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID',
│   │   │   'rowNumber': 1,
│   │   │   'row': {'geoLat': -91, 'geoLong': 92}
│   │   },
│   │   {
│   │   │   'errorType': 'missing_mandatory_column',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'siteID',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID',
│   │   │   'rowNumber': 2,
│   │   │   'row': {'geoLat': -92, 'geoLong': 93}
│   │   },
│   │   {
│   │   │   'errorType': 'missing_mandatory_column',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'siteID',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID',
│   │   │   'rowNumber': 3,
│   │   │   'row': {'geoLat': -93, 'geoLong': 94}
│   │   }
],
warnings=[
│   │   {
│   │   │   'warningType': '_coercion',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [],
│   │   │   'message': '_coercion rule triggered in table sites, column geoLat, row(s) 1: Value "-91" is a string and was coerced into a number',
│   │   │   'rowNumber': 1,
│   │   │   'row': {'geoLat': '-91', 'geoLong': '92'},
│   │   │   'invalidValue': '-91',
│   │   │   'coercionRules': []
│   │   },
│   │   {
│   │   │   'warningType': '_coercion',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLong',
│   │   │   'validationRuleFields': [],
│   │   │   'message': '_coercion rule triggered in table sites, column geoLong, row(s) 1: Value "92" is a string and was coerced into a number',
│   │   │   'rowNumber': 1,
│   │   │   'row': {'geoLat': '-91', 'geoLong': '92'},
│   │   │   'invalidValue': '92',
│   │   │   'coercionRules': []
│   │   },
│   │   {
│   │   │   'warningType': '_coercion',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [],
│   │   │   'message': '_coercion rule triggered in table sites, column geoLat, row(s) 2: Value "-92" is a string and was coerced into a number',
│   │   │   'rowNumber': 2,
│   │   │   'row': {'geoLat': '-92', 'geoLong': '93'},
│   │   │   'invalidValue': '-92',
│   │   │   'coercionRules': []
│   │   },
│   │   {
│   │   │   'warningType': '_coercion',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLong',
│   │   │   'validationRuleFields': [],
│   │   │   'message': '_coercion rule triggered in table sites, column geoLong, row(s) 2: Value "93" is a string and was coerced into a number',
│   │   │   'rowNumber': 2,
│   │   │   'row': {'geoLat': '-92', 'geoLong': '93'},
│   │   │   'invalidValue': '93',
│   │   │   'coercionRules': []
│   │   },
│   │   {
│   │   │   'warningType': '_coercion',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [],
│   │   │   'message': '_coercion rule triggered in table sites, column geoLat, row(s) 3: Value "-93" is a string and was coerced into a number',
│   │   │   'rowNumber': 3,
│   │   │   'row': {'geoLat': '-93', 'geoLong': '94'},
│   │   │   'invalidValue': '-93',
│   │   │   'coercionRules': []
│   │   },
│   │   {
│   │   │   'warningType': '_coercion',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLong',
│   │   │   'validationRuleFields': [],
│   │   │   'message': '_coercion rule triggered in table sites, column geoLong, row(s) 3: Value "94" is a string and was coerced into a number',
│   │   │   'rowNumber': 3,
│   │   │   'row': {'geoLat': '-93', 'geoLong': '94'},
│   │   │   'invalidValue': '94',
│   │   │   'coercionRules': []
│   │   }
]
)

As you can see, the report is overwhelming, even for this relatively small dataset. Its hard to see whether the errors are localized to one table and or column or if they’re spread out. The summary report should help with this.

Specifications

Summarizing by table

The user should be able to generate a summary for each table.

The summary should include an overall table summary that includes:

  • The number rows in the table and the number of rows with errors

  • The number of columns in the table and the number of columns with errors

For example, for the dataset above,

  • 5 rows found in the addresses table

    • 0 rows with errors

    • 0 rows with warnings

  • 1 column found in the addresses table

    • 0 columns with errors

    • 0 columns with warnings

  • 3 rows found in the sites table

    • 3 rows with errors

    • 3 rows with warnings

  • 2 columns found in the sites table

    • 2 column with errors

    • 2 columns with warnings

    • 1 missing mandatory column

The summary should provide the following information about the errors and warnings for each table:

  • The total number of errors in each table

  • The total number of errors of each type in each table

  • The total number of warnings in each table

  • The total number of warnings of each type in each table

For example, for the dataset above, the summary report should inform the user that there are:

  • 0 errors in the addresses table

  • 9 errors in the sites table

    • 3 less_than_min_value errors

    • 3 greater_than_max_value errors

    • 3 missing_mandatory_column errors

  • 6 warnings in the sites table

    • 6 _coercion warnings

Summarizing by row

The user should be able to generate a summary for each row in the table. This summary is always generated within the context of a table. For example, for the dataset above, the summary report should inform the user that there are:

  • 0 errors in the addresses table

  • 9 errors in the sites table

    • 3 errors in row 1

      • 1 less_than_min_value error

      • 1 greater_than_max_value error

      • 1 missing_mandatory_column error

    • 3 errors in row 2

      • 1 less_than_min_value error

      • 1 greater_than_max_value error

      • 1 missing_mandatory_column error

    • 3 errors in row 3

      • 1 less_than_min_value error

      • 1 greater_than_max_value error

      • 1 missing_mandatory_column error

  • 6 warnings in the sites table

    • 2 warnings in row 1

      • 2 _coercion warnings

    • 2 warnings in row 2

      • 2 _coercion warnings

    • 2 warnings in row 3

      • 2 _coercion warnings

Summarizing by column

The user should be able to generate a summary for each column in the table. This summary is always generated within a context of a table. For example, for the dataset above, the summary report should inform the user that there are:

  • 0 errors in the addresses table

  • 9 errors in the sites table

    • 3 errors in the geoLat column

      • 3 less_than_min_value errors

    • 3 errors in the geoLong column

      • 3 greater_than_max_value errors

    • 3 errors in the column siteID

      • 3 missing_mandatory_column errors

  • 6 warnings in the sites table

    • 3 warnings in the geoLat column

      • 3 _coercion warnings

    • 3 warnings in the geoLong column

      • 3 _coercion warnings

Default summarization

By default, the summary report should be generated by table.

Printing a summary

The user should be able to print the report in a markdown format. The summary should include both the errors and warnings. Depending on how the user decided to create the summary report, there are 3 versions that can be printed:

  1. Summary by table;

  2. Summary by table and row; and

  3. Summary by table and column