summarize_report¶
This document specifies the implementation of the summarize_report function, based on the summary report specification.
The summarize_report function transforms the report produced by validate_data into a summarized report. It can summarize by multiple categories at once.
Function signature¶
def summarize_report(report: ValidationReport,
                     by: Set[SummaryKey] = {SummaryKey.table}
                     ) -> SummarizedReport
Arguments¶
report
: a validation report returned from validate_data
by
: a set of keys to summarize by. The full set consists of the keys (table, column, row). An error/warning summarization will be performed for each group/key specified. Defaults to table.
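As a rough illustration of what "a summarization for each key" could look like, the sketch below counts hypothetical error records per rule for every key in `by`. The `ErrorRecord` shape, the `count_by_keys` helper, and the string-valued `SummaryKey` members are assumptions made for this example only; the actual validation report may store errors differently.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Set, Tuple


class SummaryKey(str, Enum):
    # Assumed members, mirroring the keys named in this document.
    table = "table"
    column = "column"
    row = "row"


@dataclass(frozen=True)
class ErrorRecord:
    # Hypothetical error shape; the real validation report may differ.
    rule_id: str
    table: str
    column: str
    row: str


def count_by_keys(
    errors: Iterable[ErrorRecord], by: Set[SummaryKey]
) -> Dict[Tuple[str, str, str], int]:
    """Count errors per (key, value, rule id) for every key in `by`."""
    counts: Counter = Counter()
    for err in errors:
        for key in by:
            # Id of the entity this error belongs to within group `key`.
            value = getattr(err, key.value)
            counts[(key.value, value, err.rule_id)] += 1
            # '_all' aggregates every rule for that entity.
            counts[(key.value, value, "_all")] += 1
    return dict(counts)


errors = [
    ErrorRecord("x", "addresses", "addId", "1"),
    ErrorRecord("x", "addresses", "addL1", "1"),
    ErrorRecord("y", "addresses", "addId", "2"),
]
counts = count_by_keys(errors, by={SummaryKey.table, SummaryKey.column})
```

Keys that are not passed in `by` (here `row`) produce no counts at all, which matches the idea that only the requested groupings are summarized.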
SummarizedReport object¶
This is the summarized report object returned from the function.
class SummarizedReport:
    data_version: str
    schema_version: str
    package_version: str
    overview: dict
    errors: dict
    warnings: dict
Fields¶
The following fields are shared with ValidationReport (and documented here):
data_version
schema_version
package_version
The remaining fields are unique to the summarized report:
Overview data-structure¶
The overview holds the most basic information about the validation:
number of columns and rows per table
number of errors/warnings per rule
It should be separated from the error-summary to simplify implementation and because the summary report should have this info regardless of which keys are summarized by.
The table information is taken from the validation report’s table_info field.
overview = {
    'tables': {
        'addresses': {
            'columns': 2,
            'rows': 6,
            'rules': {
                'x': 4,
                'y': 2,
            }
        }
    },
    'errors': {
        'x': 4,
        'y': 2,
    },
    'warnings': {
    }
}
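One way the overview could be assembled is sketched below. The input shapes are assumptions for illustration: `table_info` is taken to map each table id to its column and row counts, and errors and warnings are taken to be simple (table id, rule id) pairs. The `build_overview` helper is hypothetical, and treating the per-table `rules` counts as errors plus warnings is also an assumption.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_overview(
    table_info: Dict[str, Dict[str, int]],
    errors: List[Tuple[str, str]],
    warnings: List[Tuple[str, str]],
) -> dict:
    """Build an overview dict from assumed inputs.

    `table_info` is assumed to map table id -> {'columns': n, 'rows': n};
    `errors` and `warnings` are assumed lists of (table id, rule id) pairs.
    """
    tables = {
        table_id: {"columns": info["columns"], "rows": info["rows"], "rules": {}}
        for table_id, info in table_info.items()
    }
    # Per-table rule counts (assumed to cover both errors and warnings).
    for table_id, rule_id in errors + warnings:
        rules = tables[table_id]["rules"]
        rules[rule_id] = rules.get(rule_id, 0) + 1
    return {
        "tables": tables,
        # Report-wide counts per rule.
        "errors": dict(Counter(rule for _, rule in errors)),
        "warnings": dict(Counter(rule for _, rule in warnings)),
    }


overview = build_overview(
    table_info={"addresses": {"columns": 2, "rows": 6}},
    errors=[("addresses", "x")] * 4 + [("addresses", "y")] * 2,
    warnings=[],
)
```

With these inputs the result has the same shape and counts as the overview example above.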
ErrorSummary data-structure¶
The error summary uses a flat-table approach, where each table has a list of ‘summary objects’ with the appropriate fields for the selected group key:
class SummaryEntry:
    ruleId: RuleId
    count: int
    key: SummaryKey
    """The group/summary-key this entry is summarized by."""
    value: str
    """Id of the entity in group `key` that this entry is derived from. This
    corresponds to the table-id when grouping by `table`, the column-id when
    grouping by `column`, etc."""
E = SummaryEntry
ErrorSummary = Dict[TableId, List[E]]
errors: ErrorSummary = {
    'addresses': [
        E(ruleId='x', count=4, key='table', value='addresses'),
        E(ruleId='y', count=2, key='table', value='addresses'),
        E(ruleId='x', count=1, key='column', value='addId'),
        E(ruleId='y', count=2, key='column', value='addId'),
        E(ruleId='x', count=3, key='column', value='addL1'),
        E(ruleId='x', count=2, key='row', value='1'),
        E(ruleId='y', count=1, key='row', value='1'),
        E(ruleId='x', count=2, key='row', value='2'),
        E(ruleId='y', count=1, key='row', value='2'),
        E(ruleId='_all', count=6, key='table', value='addresses'),
        E(ruleId='_all', count=3, key='column', value='addId'),
        E(ruleId='_all', count=3, key='column', value='addL1'),
        E(ruleId='_all', count=3, key='row', value='1'),
        E(ruleId='_all', count=3, key='row', value='2'),
    ]
}
Because the data has a flat-table form, users can apply their own select/filter functions to it.
The main reasons for choosing this structure are:
Users can easily transform the data to fit their needs, as it’s a flat table.
Multiple groupings can be included in the same dataset (table, column, etc.).
It’s less complex than its corresponding tree-representation.
See Appendix 1 for an alternative tree-based implementation.
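To illustrate the select/filter point, the sketch below filters a flat list of entries with plain list comprehensions. It redefines `SummaryEntry` as a dataclass with string fields so the example is self-contained; that simplification, and the reduced data set, are assumptions for illustration rather than the specified types.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SummaryEntry:
    # Mirrors the SummaryEntry fields, using plain strings for brevity.
    ruleId: str
    count: int
    key: str
    value: str


E = SummaryEntry
errors: Dict[str, List[E]] = {
    "addresses": [
        E(ruleId="x", count=4, key="table", value="addresses"),
        E(ruleId="x", count=1, key="column", value="addId"),
        E(ruleId="y", count=2, key="column", value="addId"),
        E(ruleId="x", count=3, key="column", value="addL1"),
        E(ruleId="_all", count=3, key="column", value="addId"),
    ]
}

# Select the per-rule column entries, skipping the '_all' aggregates.
column_entries = [
    e for e in errors["addresses"]
    if e.key == "column" and e.ruleId != "_all"
]

# Total number of rule 'x' errors across all columns.
x_total = sum(e.count for e in column_entries if e.ruleId == "x")
```

The same entries could equally be loaded into a dataframe or written to CSV, since every entry has the same four fields.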
Example usage¶
report = validate_data(...)
summary = summarize_report(report, by={SummaryKey.column})
See summarize-tool for more examples of how the summary generation may be used in practice.
Appendix 1¶
This is an alternative data-structure for representing the error summary.
Count = int
ErrorSummary = Dict[TableId, Dict[SummaryKey, Dict[str, Count]]]
errors: ErrorSummary = {
    'addresses': {
        table: {'x': 4, 'y': 2, '_all': 6},
        column: {
            'addId': {'x': 1, 'y': 2, '_all': 3},
            'addL1': {'x': 3, '_all': 3},
        },
        row: {
            '1': {'x': 2, 'y': 1, '_all': 3},
            '2': {'x': 2, 'y': 1, '_all': 3},
        }
    }
}
The main advantage of this data-structure is that it can be iterated and used directly without performing additional transforms; however, the inner structure of the table-key is not consistent with the other keys.
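If a tree is needed, it can be derived from the flat form. The sketch below shows one possible conversion; the `to_tree` helper and the `NamedTuple` stand-in for `SummaryEntry` are hypothetical. Note that it nests every key, including `table`, as key -> value -> rule id, so it differs slightly from the structure above and avoids the inconsistency in the table key.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class SummaryEntry(NamedTuple):
    # Stand-in for the SummaryEntry class, with plain string fields.
    ruleId: str
    count: int
    key: str
    value: str


def to_tree(entries: List[SummaryEntry]) -> Dict[str, Dict[str, Dict[str, int]]]:
    """Nest flat entries as key -> value -> rule id -> count."""
    tree: Dict[str, Dict[str, Dict[str, int]]] = defaultdict(
        lambda: defaultdict(dict)
    )
    for e in entries:
        tree[e.key][e.value][e.ruleId] = e.count
    # Convert the defaultdicts to plain dicts for a predictable result.
    return {key: dict(values) for key, values in tree.items()}


E = SummaryEntry
tree = to_tree([
    E("x", 4, "table", "addresses"),
    E("y", 2, "table", "addresses"),
    E("_all", 6, "table", "addresses"),
    E("x", 1, "column", "addId"),
    E("y", 2, "column", "addId"),
    E("_all", 3, "column", "addId"),
])
```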