Summary Report¶
This document provides the specifications for programmatically
summarizing a validation report generated with the
validation.validate_data function.
Context¶
Summarizing the validation report is desireable when the dataset is large and there are a lot of validation errors/warnings. A summary report enables a user to have a bird’s eye view of the validation errors in a dataset, making it easy to identify where in the errors are localized, if they are.
The next sections go over the assets we will be working with in this specification.
The validation schema¶
The validation schema we will be working with is displayed below,
{ │ 'schemaVersion': '2.0.0', │ 'schema': { │ │ 'addresses': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': { │ │ │ │ │ 'addID': { │ │ │ │ │ │ 'type': 'string', │ │ │ │ │ │ 'required': True │ │ │ │ │ } │ │ │ │ } │ │ │ } │ │ }, │ │ 'sites': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': { │ │ │ │ │ 'siteID': { │ │ │ │ │ │ 'type': 'integer', │ │ │ │ │ │ 'required': True │ │ │ │ │ }, │ │ │ │ │ 'geoLat': { │ │ │ │ │ │ 'type': 'integer', │ │ │ │ │ │ 'coerce': 'integer', │ │ │ │ │ │ 'min': -90, │ │ │ │ │ │ 'max': 90 │ │ │ │ │ }, │ │ │ │ │ 'geoLong': { │ │ │ │ │ │ 'type': 'integer', │ │ │ │ │ │ 'coerce': 'integer', │ │ │ │ │ │ 'min': -90, │ │ │ │ │ │ 'max': 90 │ │ │ │ │ } │ │ │ │ } │ │ │ } │ │ } │ } }
It defines the following entities:
An
addressestable with a string column calledaddIDwhich is mandatory; andA
sitestable with:A string column called
siteIDwhich is mandatoryTwo integer columns called
geoLatandgeoLong, both with a min and max value of -90 and 90 respectively.
The dataset¶
The addresses and sites tables we will be validating are shown
below,
Addresses Table ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ addID ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ Calgary │ │ Ottawa │ │ Vancouver │ │ Hamilton │ │ Toronto │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Sites Table ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ geoLat ┃ geoLong ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ -91 │ 92 │ │ -92 │ 93 │ │ -93 │ 94 │ └─────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘
The validation report¶
Finally, the full validation report for our dataset is printed below,
ValidationReport( │ data_version='2.2.3', │ schema_version='2.0.0', │ package_version='1.0.0b1', │ table_info={'addresses': {'columns': 1, 'rows': 5}, 'sites': {'columns': 2, 'rows': 3}}, │ errors=[ │ │ { │ │ │ 'errorType': 'less_than_min_value', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLat', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'less_than_min_value rule violated in table sites, column geoLat, row(s) 1: Value "-91" is less than the min value of "-90"', │ │ │ 'rowNumber': 1, │ │ │ 'row': {'geoLat': -91, 'geoLong': 92}, │ │ │ 'invalidValue': -91 │ │ }, │ │ { │ │ │ 'errorType': 'less_than_min_value', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLat', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'less_than_min_value rule violated in table sites, column geoLat, row(s) 2: Value "-92" is less than the min value of "-90"', │ │ │ 'rowNumber': 2, │ │ │ 'row': {'geoLat': -92, 'geoLong': 93}, │ │ │ 'invalidValue': -92 │ │ }, │ │ { │ │ │ 'errorType': 'less_than_min_value', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLat', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'less_than_min_value rule violated in table sites, column geoLat, row(s) 3: Value "-93" is less than the min value of "-90"', │ │ │ 'rowNumber': 3, │ │ │ 'row': {'geoLat': -93, 'geoLong': 94}, │ │ │ 'invalidValue': -93 │ │ }, │ │ { │ │ │ 'errorType': 'greater_than_max_value', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLong', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'greater_than_max_value rule violated in table sites, column geoLong, row(s) 1: Value "92" is greater than the max value of "90"', │ │ │ 'rowNumber': 1, │ │ │ 'row': {'geoLat': -91, 'geoLong': 92}, │ │ │ 'invalidValue': 92 │ │ }, │ │ { │ │ │ 'errorType': 'greater_than_max_value', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLong', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'greater_than_max_value rule violated in table sites, column geoLong, row(s) 2: Value "93" is greater than the max value of "90"', │ │ │ 'rowNumber': 2, │ │ │ 'row': {'geoLat': -92, 'geoLong': 93}, │ │ │ 'invalidValue': 93 │ │ }, │ │ { │ │ │ 'errorType': 'greater_than_max_value', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLong', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'greater_than_max_value rule violated in table sites, column geoLong, row(s) 3: Value "94" is greater than the max value of "90"', │ │ │ 'rowNumber': 3, │ │ │ 'row': {'geoLat': -93, 'geoLong': 94}, │ │ │ 'invalidValue': 94 │ │ }, │ │ { │ │ │ 'errorType': 'missing_mandatory_column', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'siteID', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID', │ │ │ 'rowNumber': 1, │ │ │ 'row': {'geoLat': -91, 'geoLong': 92} │ │ }, │ │ { │ │ │ 'errorType': 'missing_mandatory_column', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'siteID', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID', │ │ │ 'rowNumber': 2, │ │ │ 'row': {'geoLat': -92, 'geoLong': 93} │ │ }, │ │ { │ │ │ 'errorType': 'missing_mandatory_column', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'siteID', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID', │ │ │ 'rowNumber': 3, │ │ │ 'row': {'geoLat': -93, 'geoLong': 94} │ │ } │ ], │ warnings=[ │ │ { │ │ │ 'warningType': '_coercion', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLat', │ │ │ 'validationRuleFields': [], │ │ │ 'message': '_coercion rule triggered in table sites, column geoLat, row(s) 1: Value "-91" is a string and was coerced into a number', │ │ │ 'rowNumber': 1, │ │ │ 'row': {'geoLat': '-91', 'geoLong': '92'}, │ │ │ 'invalidValue': '-91', │ │ │ 'coercionRules': [] │ │ }, │ │ { │ │ │ 'warningType': '_coercion', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLong', │ │ │ 'validationRuleFields': [], │ │ │ 'message': '_coercion rule triggered in table sites, column geoLong, row(s) 1: Value "92" is a string and was coerced into a number', │ │ │ 'rowNumber': 1, │ │ │ 'row': {'geoLat': '-91', 'geoLong': '92'}, │ │ │ 'invalidValue': '92', │ │ │ 'coercionRules': [] │ │ }, │ │ { │ │ │ 'warningType': '_coercion', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLat', │ │ │ 'validationRuleFields': [], │ │ │ 'message': '_coercion rule triggered in table sites, column geoLat, row(s) 2: Value "-92" is a string and was coerced into a number', │ │ │ 'rowNumber': 2, │ │ │ 'row': {'geoLat': '-92', 'geoLong': '93'}, │ │ │ 'invalidValue': '-92', │ │ │ 'coercionRules': [] │ │ }, │ │ { │ │ │ 'warningType': '_coercion', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLong', │ │ │ 'validationRuleFields': [], │ │ │ 'message': '_coercion rule triggered in table sites, column geoLong, row(s) 2: Value "93" is a string and was coerced into a number', │ │ │ 'rowNumber': 2, │ │ │ 'row': {'geoLat': '-92', 'geoLong': '93'}, │ │ │ 'invalidValue': '93', │ │ │ 'coercionRules': [] │ │ }, │ │ { │ │ │ 'warningType': '_coercion', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLat', │ │ │ 'validationRuleFields': [], │ │ │ 'message': '_coercion rule triggered in table sites, column geoLat, row(s) 3: Value "-93" is a string and was coerced into a number', │ │ │ 'rowNumber': 3, │ │ │ 'row': {'geoLat': '-93', 'geoLong': '94'}, │ │ │ 'invalidValue': '-93', │ │ │ 'coercionRules': [] │ │ }, │ │ { │ │ │ 'warningType': '_coercion', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLong', │ │ │ 'validationRuleFields': [], │ │ │ 'message': '_coercion rule triggered in table sites, column geoLong, row(s) 3: Value "94" is a string and was coerced into a number', │ │ │ 'rowNumber': 3, │ │ │ 'row': {'geoLat': '-93', 'geoLong': '94'}, │ │ │ 'invalidValue': '94', │ │ │ 'coercionRules': [] │ │ } │ ] )
As you can see, the report is overwhelming, even for this relatively small dataset. Its hard to see whether the errors are localized to one table and or column or if they’re spread out. The summary report should help with this.
Specifications¶
Summarizing by table¶
The user should be able to generate a summary for each table.
The summary should include an overall table summary that includes:
The number rows in the table and the number of rows with errors
The number of columns in the table and the number of columns with errors
For example, for the dataset above,
5 rows found in the
addressestable0 rows with errors
0 rows with warnings
1 column found in the
addressestable0 columns with errors
0 columns with warnings
3 rows found in the
sitestable3 rows with errors
3 rows with warnings
2 columns found in the
sitestable2 column with errors
2 columns with warnings
1 missing mandatory column
The summary should provide the following information about the errors and warnings for each table:
The total number of errors in each table
The total number of errors of each type in each table
The total number of warnings in each table
The total number of warnings of each type in each table
For example, for the dataset above, the summary report should inform the user that there are:
0 errors in the
addressestable9 errors in the
sitestable3
less_than_min_valueerrors3
greater_than_max_valueerrors3
missing_mandatory_columnerrors
6 warnings in the
sitestable6
_coercionwarnings
Summarizing by row¶
The user should be able to generate a summary for each row in the table. This summary is always generated within the context of a table. For example, for the dataset above, the summary report should inform the user that there are:
0 errors in the
addressestable9 errors in the
sitestable3 errors in row 1
1
less_than_min_valueerror1
greater_than_max_valueerror1
missing_mandatory_columnerror
3 errors in row 2
1
less_than_min_valueerror1
greater_than_max_valueerror1
missing_mandatory_columnerror
3 errors in row 3
1
less_than_min_valueerror1
greater_than_max_valueerror1
missing_mandatory_columnerror
6 warnings in the
sitestable2 warnings in row 1
2
_coercionwarnings
2 warnings in row 2
2
_coercionwarnings
2 warnings in row 3
2
_coercionwarnings
Summarizing by column¶
The user should be able to generate a summary for each column in the table. This summary is always generated within a context of a table. For example, for the dataset above, the summary report should inform the user that there are:
0 errors in the
addressestable9 errors in the
sitestable3 errors in the
geoLatcolumn3
less_than_min_valueerrors
3 errors in the
geoLongcolumn3
greater_than_max_valueerrors
3 errors in the column
siteID3
missing_mandatory_columnerrors
6 warnings in the
sitestable3 warnings in the
geoLatcolumn3
_coercionwarnings
3 warnings in the
geoLongcolumn3
_coercionwarnings
Default summarization¶
By default, the summary report should be generated by table.
Printing a summary¶
The user should be able to print the report in a markdown format. The summary should include both the errors and warnings. Depending on how the user decided to create the summary report, there are 3 versions that can be printed:
Summary by table;
Summary by table and row; and
Summary by table and column