Summary Report¶
This document provides the specifications for programmatically
summarizing a validation report generated with the
validation.validate_data
function.
Context¶
Summarizing the validation report is desireable when the dataset is large and there are a lot of validation errors/warnings. A summary report enables a user to have a bird’s eye view of the validation errors in a dataset, making it easy to identify where in the errors are localized, if they are.
The next sections go over the assets we will be working with in this specification.
The validation schema¶
The validation schema we will be working with is displayed below,
{ │ 'schemaVersion': '2.0.0', │ 'schema': { │ │ 'addresses': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': { │ │ │ │ │ 'addID': { │ │ │ │ │ │ 'type': 'string', │ │ │ │ │ │ 'required': True │ │ │ │ │ } │ │ │ │ } │ │ │ } │ │ }, │ │ 'sites': { │ │ │ 'type': 'list', │ │ │ 'schema': { │ │ │ │ 'type': 'dict', │ │ │ │ 'schema': { │ │ │ │ │ 'siteID': { │ │ │ │ │ │ 'type': 'integer', │ │ │ │ │ │ 'required': True │ │ │ │ │ }, │ │ │ │ │ 'geoLat': { │ │ │ │ │ │ 'type': 'integer', │ │ │ │ │ │ 'coerce': 'integer', │ │ │ │ │ │ 'min': -90, │ │ │ │ │ │ 'max': 90 │ │ │ │ │ }, │ │ │ │ │ 'geoLong': { │ │ │ │ │ │ 'type': 'integer', │ │ │ │ │ │ 'coerce': 'integer', │ │ │ │ │ │ 'min': -90, │ │ │ │ │ │ 'max': 90 │ │ │ │ │ } │ │ │ │ } │ │ │ } │ │ } │ } }
It defines the following entities:
An
addresses
table with a string column calledaddID
which is mandatory; andA
sites
table with:A string column called
siteID
which is mandatoryTwo integer columns called
geoLat
andgeoLong
, both with a min and max value of -90 and 90 respectively.
The dataset¶
The addresses
and sites
tables we will be validating are shown
below,
Addresses Table ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ addID ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ Calgary │ │ Ottawa │ │ Vancouver │ │ Hamilton │ │ Toronto │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Sites Table ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ geoLat ┃ geoLong ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ -91 │ 92 │ │ -92 │ 93 │ │ -93 │ 94 │ └─────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘
The validation report¶
Finally, the full validation report for our dataset is printed below,
ValidationReport( │ data_version='2.0.0', │ schema_version='2.0.0', │ package_version='0.5.0', │ table_info={'addresses': {'columns': 1, 'rows': 5}, 'sites': {'columns': 2, 'rows': 3}}, │ errors=[ │ │ { │ │ │ 'errorType': 'less_than_min_value', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLat', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'less_than_min_value rule violated in table sites, column geoLat, row(s) 1: Value "-91" is less than the min value of "-90"', │ │ │ 'rowNumber': 1, │ │ │ 'row': {'geoLat': -91, 'geoLong': 92}, │ │ │ 'invalidValue': -91 │ │ }, │ │ { │ │ │ 'errorType': 'less_than_min_value', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLat', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'less_than_min_value rule violated in table sites, column geoLat, row(s) 2: Value "-92" is less than the min value of "-90"', │ │ │ 'rowNumber': 2, │ │ │ 'row': {'geoLat': -92, 'geoLong': 93}, │ │ │ 'invalidValue': -92 │ │ }, │ │ { │ │ │ 'errorType': 'less_than_min_value', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLat', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'less_than_min_value rule violated in table sites, column geoLat, row(s) 3: Value "-93" is less than the min value of "-90"', │ │ │ 'rowNumber': 3, │ │ │ 'row': {'geoLat': -93, 'geoLong': 94}, │ │ │ 'invalidValue': -93 │ │ }, │ │ { │ │ │ 'errorType': 'greater_than_max_value', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLong', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'greater_than_max_value rule violated in table sites, column geoLong, row(s) 1: Value "92" is greater than the max value of "90"', │ │ │ 'rowNumber': 1, │ │ │ 'row': {'geoLat': -91, 'geoLong': 92}, │ │ │ 'invalidValue': 92 │ │ }, │ │ { │ │ │ 'errorType': 'greater_than_max_value', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLong', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'greater_than_max_value rule violated in table sites, column geoLong, row(s) 2: Value "93" is greater than the max value of "90"', │ │ │ 'rowNumber': 2, │ │ │ 'row': {'geoLat': -92, 'geoLong': 93}, │ │ │ 'invalidValue': 93 │ │ }, │ │ { │ │ │ 'errorType': 'greater_than_max_value', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLong', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'greater_than_max_value rule violated in table sites, column geoLong, row(s) 3: Value "94" is greater than the max value of "90"', │ │ │ 'rowNumber': 3, │ │ │ 'row': {'geoLat': -93, 'geoLong': 94}, │ │ │ 'invalidValue': 94 │ │ }, │ │ { │ │ │ 'errorType': 'missing_mandatory_column', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'siteID', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID', │ │ │ 'rowNumber': 1, │ │ │ 'row': {'geoLat': -91, 'geoLong': 92} │ │ }, │ │ { │ │ │ 'errorType': 'missing_mandatory_column', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'siteID', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID', │ │ │ 'rowNumber': 2, │ │ │ 'row': {'geoLat': -92, 'geoLong': 93} │ │ }, │ │ { │ │ │ 'errorType': 'missing_mandatory_column', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'siteID', │ │ │ 'validationRuleFields': [], │ │ │ 'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID', │ │ │ 'rowNumber': 3, │ │ │ 'row': {'geoLat': -93, 'geoLong': 94} │ │ } │ ], │ warnings=[ │ │ { │ │ │ 'warningType': '_coercion', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLat', │ │ │ 'validationRuleFields': [], │ │ │ 'message': '_coercion rule triggered in table sites, column geoLat, row(s) 1: Value "-91" is a string and was coerced into a number', │ │ │ 'rowNumber': 1, │ │ │ 'row': {'geoLat': '-91', 'geoLong': '92'}, │ │ │ 'invalidValue': '-91', │ │ │ 'coercionRules': [] │ │ }, │ │ { │ │ │ 'warningType': '_coercion', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLong', │ │ │ 'validationRuleFields': [], │ │ │ 'message': '_coercion rule triggered in table sites, column geoLong, row(s) 1: Value "92" is a string and was coerced into a number', │ │ │ 'rowNumber': 1, │ │ │ 'row': {'geoLat': '-91', 'geoLong': '92'}, │ │ │ 'invalidValue': '92', │ │ │ 'coercionRules': [] │ │ }, │ │ { │ │ │ 'warningType': '_coercion', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLat', │ │ │ 'validationRuleFields': [], │ │ │ 'message': '_coercion rule triggered in table sites, column geoLat, row(s) 2: Value "-92" is a string and was coerced into a number', │ │ │ 'rowNumber': 2, │ │ │ 'row': {'geoLat': '-92', 'geoLong': '93'}, │ │ │ 'invalidValue': '-92', │ │ │ 'coercionRules': [] │ │ }, │ │ { │ │ │ 'warningType': '_coercion', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLong', │ │ │ 'validationRuleFields': [], │ │ │ 'message': '_coercion rule triggered in table sites, column geoLong, row(s) 2: Value "93" is a string and was coerced into a number', │ │ │ 'rowNumber': 2, │ │ │ 'row': {'geoLat': '-92', 'geoLong': '93'}, │ │ │ 'invalidValue': '93', │ │ │ 'coercionRules': [] │ │ }, │ │ { │ │ │ 'warningType': '_coercion', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLat', │ │ │ 'validationRuleFields': [], │ │ │ 'message': '_coercion rule triggered in table sites, column geoLat, row(s) 3: Value "-93" is a string and was coerced into a number', │ │ │ 'rowNumber': 3, │ │ │ 'row': {'geoLat': '-93', 'geoLong': '94'}, │ │ │ 'invalidValue': '-93', │ │ │ 'coercionRules': [] │ │ }, │ │ { │ │ │ 'warningType': '_coercion', │ │ │ 'tableName': 'sites', │ │ │ 'columnName': 'geoLong', │ │ │ 'validationRuleFields': [], │ │ │ 'message': '_coercion rule triggered in table sites, column geoLong, row(s) 3: Value "94" is a string and was coerced into a number', │ │ │ 'rowNumber': 3, │ │ │ 'row': {'geoLat': '-93', 'geoLong': '94'}, │ │ │ 'invalidValue': '94', │ │ │ 'coercionRules': [] │ │ } │ ] )
As you can see, the report is overwhelming, even for this relatively small dataset. Its hard to see whether the errors are localized to one table and or column or if they’re spread out. The summary report should help with this.
Specifications¶
Summarizing by table¶
The user should be able to generate a summary for each table.
The summary should include an overall table summary that includes:
The number rows in the table and the number of rows with errors
The number of columns in the table and the number of columns with errors
For example, for the dataset above,
5 rows found in the
addresses
table0 rows with errors
0 rows with warnings
1 column found in the
addresses
table0 columns with errors
0 columns with warnings
3 rows found in the
sites
table3 rows with errors
3 rows with warnings
2 columns found in the
sites
table2 column with errors
2 columns with warnings
1 missing mandatory column
The summary should provide the following information about the errors and warnings for each table:
The total number of errors in each table
The total number of errors of each type in each table
The total number of warnings in each table
The total number of warnings of each type in each table
For example, for the dataset above, the summary report should inform the user that there are:
0 errors in the
addresses
table9 errors in the
sites
table3
less_than_min_value
errors3
greater_than_max_value
errors3
missing_mandatory_column
errors
6 warnings in the
sites
table6
_coercion
warnings
Summarizing by row¶
The user should be able to generate a summary for each row in the table. This summary is always generated within the context of a table. For example, for the dataset above, the summary report should inform the user that there are:
0 errors in the
addresses
table9 errors in the
sites
table3 errors in row 1
1
less_than_min_value
error1
greater_than_max_value
error1
missing_mandatory_column
error
3 errors in row 2
1
less_than_min_value
error1
greater_than_max_value
error1
missing_mandatory_column
error
3 errors in row 3
1
less_than_min_value
error1
greater_than_max_value
error1
missing_mandatory_column
error
6 warnings in the
sites
table2 warnings in row 1
2
_coercion
warnings
2 warnings in row 2
2
_coercion
warnings
2 warnings in row 3
2
_coercion
warnings
Summarizing by column¶
The user should be able to generate a summary for each column in the table. This summary is always generated within a context of a table. For example, for the dataset above, the summary report should inform the user that there are:
0 errors in the
addresses
table9 errors in the
sites
table3 errors in the
geoLat
column3
less_than_min_value
errors
3 errors in the
geoLong
column3
greater_than_max_value
errors
3 errors in the column
siteID
3
missing_mandatory_column
errors
6 warnings in the
sites
table3 warnings in the
geoLat
column3
_coercion
warnings
3 warnings in the
geoLong
column3
_coercion
warnings
Default summarization¶
By default, the summary report should be generated by table.
Printing a summary¶
The user should be able to print the report in a markdown format. The summary should include both the errors and warnings. Depending on how the user decided to create the summary report, there are 3 versions that can be printed:
Summary by table;
Summary by table and row; and
Summary by table and column