# Summary Report This document provides the specifications for programmatically summarizing a validation report generated with the `validation.validate_data` function. ## Context Summarizing the validation report is desireable when the dataset is large and there are a lot of validation errors/warnings. A summary report enables a user to have a bird’s eye view of the validation errors in a dataset, making it easy to identify where in the errors are localized, if they are. The next sections go over the assets we will be working with in this specification. ### The validation schema The validation schema we will be working with is displayed below,
{
'schemaVersion': '2.0.0',
'schema': {
│   │   'addresses': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'addID': {
│   │   │   │   │   │   'type': 'string',
│   │   │   │   │   │   'required': True
│   │   │   │   │   }
│   │   │   │   }
│   │   │   }
│   │   },
│   │   'sites': {
│   │   │   'type': 'list',
│   │   │   'schema': {
│   │   │   │   'type': 'dict',
│   │   │   │   'schema': {
│   │   │   │   │   'siteID': {
│   │   │   │   │   │   'type': 'integer',
│   │   │   │   │   │   'required': True
│   │   │   │   │   },
│   │   │   │   │   'geoLat': {
│   │   │   │   │   │   'type': 'integer',
│   │   │   │   │   │   'coerce': 'integer',
│   │   │   │   │   │   'min': -90,
│   │   │   │   │   │   'max': 90
│   │   │   │   │   },
│   │   │   │   │   'geoLong': {
│   │   │   │   │   │   'type': 'integer',
│   │   │   │   │   │   'coerce': 'integer',
│   │   │   │   │   │   'min': -90,
│   │   │   │   │   │   'max': 90
│   │   │   │   │   }
│   │   │   │   }
│   │   │   }
│   │   }
}
}
It defines the following entities: - An `addresses` table with a **string** column called `addID` which is **mandatory**; and - A `sites` table with: - A **string** column called `siteID` which is mandatory - Two **integer** columns called `geoLat` and `geoLong`, both with a **min** and **max** value of **-90** and **90** respectively. ### The dataset The `addresses` and `sites` tables we will be validating are shown below,
                                                  Addresses Table                                                  
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ addID                                                                                                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Calgary                                                                                                         │
│ Ottawa                                                                                                          │
│ Vancouver                                                                                                       │
│ Hamilton                                                                                                        │
│ Toronto                                                                                                         │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
                                                    Sites Table                                                    
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ geoLat                                               geoLong                                                   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ -91                                                 │ 92                                                        │
│ -92                                                 │ 93                                                        │
│ -93                                                 │ 94                                                        │
└─────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘
### The validation report Finally, the full validation report for our dataset is printed below,
ValidationReport(
data_version='2.2.3',
schema_version='2.0.0',
package_version='1.0.0b1',
table_info={'addresses': {'columns': 1, 'rows': 5}, 'sites': {'columns': 2, 'rows': 3}},
errors=[
│   │   {
│   │   │   'errorType': 'less_than_min_value',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'less_than_min_value rule violated in table sites, column geoLat, row(s) 1: Value "-91" is less than the min value of "-90"',
│   │   │   'rowNumber': 1,
│   │   │   'row': {'geoLat': -91, 'geoLong': 92},
│   │   │   'invalidValue': -91
│   │   },
│   │   {
│   │   │   'errorType': 'less_than_min_value',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'less_than_min_value rule violated in table sites, column geoLat, row(s) 2: Value "-92" is less than the min value of "-90"',
│   │   │   'rowNumber': 2,
│   │   │   'row': {'geoLat': -92, 'geoLong': 93},
│   │   │   'invalidValue': -92
│   │   },
│   │   {
│   │   │   'errorType': 'less_than_min_value',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'less_than_min_value rule violated in table sites, column geoLat, row(s) 3: Value "-93" is less than the min value of "-90"',
│   │   │   'rowNumber': 3,
│   │   │   'row': {'geoLat': -93, 'geoLong': 94},
│   │   │   'invalidValue': -93
│   │   },
│   │   {
│   │   │   'errorType': 'greater_than_max_value',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLong',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'greater_than_max_value rule violated in table sites, column geoLong, row(s) 1: Value "92" is greater than the max value of "90"',
│   │   │   'rowNumber': 1,
│   │   │   'row': {'geoLat': -91, 'geoLong': 92},
│   │   │   'invalidValue': 92
│   │   },
│   │   {
│   │   │   'errorType': 'greater_than_max_value',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLong',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'greater_than_max_value rule violated in table sites, column geoLong, row(s) 2: Value "93" is greater than the max value of "90"',
│   │   │   'rowNumber': 2,
│   │   │   'row': {'geoLat': -92, 'geoLong': 93},
│   │   │   'invalidValue': 93
│   │   },
│   │   {
│   │   │   'errorType': 'greater_than_max_value',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLong',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'greater_than_max_value rule violated in table sites, column geoLong, row(s) 3: Value "94" is greater than the max value of "90"',
│   │   │   'rowNumber': 3,
│   │   │   'row': {'geoLat': -93, 'geoLong': 94},
│   │   │   'invalidValue': 94
│   │   },
│   │   {
│   │   │   'errorType': 'missing_mandatory_column',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'siteID',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID',
│   │   │   'rowNumber': 1,
│   │   │   'row': {'geoLat': -91, 'geoLong': 92}
│   │   },
│   │   {
│   │   │   'errorType': 'missing_mandatory_column',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'siteID',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID',
│   │   │   'rowNumber': 2,
│   │   │   'row': {'geoLat': -92, 'geoLong': 93}
│   │   },
│   │   {
│   │   │   'errorType': 'missing_mandatory_column',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'siteID',
│   │   │   'validationRuleFields': [],
│   │   │   'message': 'missing_mandatory_column rule violated in table sites, column siteID: Missing mandatory column siteID',
│   │   │   'rowNumber': 3,
│   │   │   'row': {'geoLat': -93, 'geoLong': 94}
│   │   }
],
warnings=[
│   │   {
│   │   │   'warningType': '_coercion',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [],
│   │   │   'message': '_coercion rule triggered in table sites, column geoLat, row(s) 1: Value "-91" is a string and was coerced into a number',
│   │   │   'rowNumber': 1,
│   │   │   'row': {'geoLat': '-91', 'geoLong': '92'},
│   │   │   'invalidValue': '-91',
│   │   │   'coercionRules': []
│   │   },
│   │   {
│   │   │   'warningType': '_coercion',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLong',
│   │   │   'validationRuleFields': [],
│   │   │   'message': '_coercion rule triggered in table sites, column geoLong, row(s) 1: Value "92" is a string and was coerced into a number',
│   │   │   'rowNumber': 1,
│   │   │   'row': {'geoLat': '-91', 'geoLong': '92'},
│   │   │   'invalidValue': '92',
│   │   │   'coercionRules': []
│   │   },
│   │   {
│   │   │   'warningType': '_coercion',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [],
│   │   │   'message': '_coercion rule triggered in table sites, column geoLat, row(s) 2: Value "-92" is a string and was coerced into a number',
│   │   │   'rowNumber': 2,
│   │   │   'row': {'geoLat': '-92', 'geoLong': '93'},
│   │   │   'invalidValue': '-92',
│   │   │   'coercionRules': []
│   │   },
│   │   {
│   │   │   'warningType': '_coercion',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLong',
│   │   │   'validationRuleFields': [],
│   │   │   'message': '_coercion rule triggered in table sites, column geoLong, row(s) 2: Value "93" is a string and was coerced into a number',
│   │   │   'rowNumber': 2,
│   │   │   'row': {'geoLat': '-92', 'geoLong': '93'},
│   │   │   'invalidValue': '93',
│   │   │   'coercionRules': []
│   │   },
│   │   {
│   │   │   'warningType': '_coercion',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLat',
│   │   │   'validationRuleFields': [],
│   │   │   'message': '_coercion rule triggered in table sites, column geoLat, row(s) 3: Value "-93" is a string and was coerced into a number',
│   │   │   'rowNumber': 3,
│   │   │   'row': {'geoLat': '-93', 'geoLong': '94'},
│   │   │   'invalidValue': '-93',
│   │   │   'coercionRules': []
│   │   },
│   │   {
│   │   │   'warningType': '_coercion',
│   │   │   'tableName': 'sites',
│   │   │   'columnName': 'geoLong',
│   │   │   'validationRuleFields': [],
│   │   │   'message': '_coercion rule triggered in table sites, column geoLong, row(s) 3: Value "94" is a string and was coerced into a number',
│   │   │   'rowNumber': 3,
│   │   │   'row': {'geoLat': '-93', 'geoLong': '94'},
│   │   │   'invalidValue': '94',
│   │   │   'coercionRules': []
│   │   }
]
)
As you can see, the report is overwhelming, even for this relatively small dataset. Its hard to see whether the errors are localized to one table and or column or if they’re spread out. The summary report should help with this. ## Specifications ### Summarizing by table The user should be able to generate a summary for each table. The summary should include an overall table summary that includes: - The number rows in the table and the number of rows with errors - The number of columns in the table and the number of columns with errors For example, for the dataset above, - **5** rows found in the `addresses` table - **0** rows with errors - **0** rows with warnings - **1** column found in the `addresses` table - **0** columns with errors - **0** columns with warnings - **3** rows found in the `sites` table - **3** rows with errors - **3** rows with warnings - **2** columns found in the `sites` table - **2** column with errors - **2** columns with warnings - **1** missing mandatory column The summary should provide the following information about the errors and warnings for each table: - The total number of errors in each table - The total number of errors of each type in each table - The total number of warnings in each table - The total number of warnings of each type in each table For example, for the dataset above, the summary report should inform the user that there are: - **0** errors in the `addresses` table - **9** errors in the `sites` table - **3** `less_than_min_value` errors - **3** `greater_than_max_value` errors - **3** `missing_mandatory_column` errors - **6** warnings in the `sites` table - **6** `_coercion` warnings ### Summarizing by row The user should be able to generate a summary for each row in the table. This summary is always generated within the context of a table. For example, for the dataset above, the summary report should inform the user that there are: - **0** errors in the `addresses` table - **9** errors in the `sites` table - **3** errors in row 1 - **1** `less_than_min_value` error - **1** `greater_than_max_value` error - **1** `missing_mandatory_column` error - **3** errors in row 2 - **1** `less_than_min_value` error - **1** `greater_than_max_value` error - **1** `missing_mandatory_column` error - **3** errors in row 3 - **1** `less_than_min_value` error - **1** `greater_than_max_value` error - **1** `missing_mandatory_column` error - **6** warnings in the `sites` table - **2** warnings in row 1 - **2** `_coercion` warnings - **2** warnings in row 2 - **2** `_coercion` warnings - **2** warnings in row 3 - **2** `_coercion` warnings ## Summarizing by column The user should be able to generate a summary for each column in the table. This summary is always generated within a context of a table. For example, for the dataset above, the summary report should inform the user that there are: - **0** errors in the `addresses` table - **9** errors in the `sites` table - **3** errors in the `geoLat` column - **3** `less_than_min_value` errors - **3** errors in the `geoLong` column - **3** `greater_than_max_value` errors - **3** errors in the column `siteID` - **3** `missing_mandatory_column` errors - **6** warnings in the `sites` table - **3** warnings in the `geoLat` column - **3** `_coercion` warnings - **3** warnings in the `geoLong` column - **3** `_coercion` warnings ## Default summarization By default, the summary report should be generated by table. ## Printing a summary The user should be able to print the report in a markdown format. The summary should include both the errors and warnings. Depending on how the user decided to create the summary report, there are 3 versions that can be printed: 1. Summary by table; 2. Summary by table and row; and 3. Summary by table and column