summarize_report¶

This document specifies the implementation of the summarize_report function, based on the summary report specification.

The summarize_report function transforms the report produced by validate_data into a summarized report. It can summarize by several categories in a single call.

Function signature¶

def summarize_report(report: ValidationReport,
                     by: Set[SummaryKey] = {SummaryKey.table}
                     ) -> SummarizedReport

Arguments¶

report: a validation report returned from validate_data
by: a set of keys to summarize by. The available keys are table, column and row. An error/warning summary is produced for each key specified. Defaults to the set containing only table.
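
The specification does not define SummaryKey itself. A minimal sketch, assuming it is a plain Enum whose members match the three keys listed above (the string values and the two constant names are assumptions made for illustration):

```python
from enum import Enum
from typing import FrozenSet


class SummaryKey(Enum):
    """Hypothetical definition; the member values are assumed, not specified."""
    table = "table"
    column = "column"
    row = "row"


# The default grouping and the full set of groupings described above.
# Frozen sets are used here so the defaults cannot be mutated by accident.
DEFAULT_KEYS: FrozenSet[SummaryKey] = frozenset({SummaryKey.table})
ALL_KEYS: FrozenSet[SummaryKey] = frozenset(SummaryKey)
```

With a definition along these lines, a caller could pass by=ALL_KEYS to request every grouping.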

SummarizedReport object¶

This is the summarized report object returned from the function.

class SummarizedReport:
    data_version: str
    schema_version: str
    package_version: str
    overview: dict
    errors: dict
    warnings: dict

Fields¶

The following fields are shared with ValidationReport (and are documented there):

data_version
schema_version
package_version

The remaining fields are unique to the summarized report:

overview contains a high level summary of the summarized data. It is specified below.
errors and warnings contain summarizations grouped by the keys specified when calling the function. Their structure is specified below.

Overview data-structure¶

The overview holds the most basic information about the validation:

number of columns and rows per table
number of errors/warnings per rule

It is kept separate from the error summary to simplify the implementation, and because the summary report should contain this information regardless of which keys are summarized by.

The table information is taken from the validation report’s table_info field.

overview = {
    'tables': {
        'addresses': {
            'columns': 2,
            'rows': 6,
            'rules': {
                'x': 4,
                'y': 2,
            }
        }
    },
    'errors': {
        'x': 4,
        'y': 2,
    },
    'warnings': {
    }
}
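
One way the overview could be assembled is sketched below. The input shapes are assumptions, not part of this specification: table_info is taken to map each table id to its column and row counts, and each raw error/warning is taken to be a dict with 'ruleId' and 'tableName' keys. The helper name build_overview is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def build_overview(table_info: Dict[str, Dict[str, int]],
                   errors: List[dict],
                   warnings: List[dict]) -> dict:
    """Build the overview dict from table info and raw error/warning entries.

    Assumed input shapes (hypothetical): every entry is a dict with
    'ruleId' and 'tableName' keys.
    """
    tables = {}
    for table_id, info in table_info.items():
        # Per-table rule counts are taken from errors only. This is an
        # assumption that matches the example above, which has no warnings.
        rules = Counter(e["ruleId"] for e in errors
                        if e["tableName"] == table_id)
        tables[table_id] = {
            "columns": info["columns"],
            "rows": info["rows"],
            "rules": dict(rules),
        }
    return {
        "tables": tables,
        "errors": dict(Counter(e["ruleId"] for e in errors)),
        "warnings": dict(Counter(w["ruleId"] for w in warnings)),
    }


# Reproduce the example above: 4 errors for rule 'x', 2 for rule 'y'.
table_info = {"addresses": {"columns": 2, "rows": 6}}
errors = ([{"ruleId": "x", "tableName": "addresses"}] * 4
          + [{"ruleId": "y", "tableName": "addresses"}] * 2)
overview = build_overview(table_info, errors, warnings=[])
```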

ErrorSummary data-structure¶

The error summary uses a flat-table approach, where each table has a list of 'summary objects' with the appropriate fields for the selected group key:

class SummaryEntry:
    ruleId: RuleId
    count: int

    key: SummaryKey
    """The group/summary-key this entry is summarized by."""

    value: str
    """Id of the entity in group `key` that this entry is derived from. This
    corresponds to the table-id when grouping by `table`, the column-id when
    grouping by `column`, etc."""

E = SummaryEntry

ErrorSummary = Dict[TableId, List[E]]

errors: ErrorSummary = {
    'addresses': [
        E(ruleId='x', count=4, key='table',  value='addresses'),
        E(ruleId='y', count=2, key='table',  value='addresses'),
        E(ruleId='x', count=1, key='column', value='addId'),
        E(ruleId='y', count=2, key='column', value='addId'),
        E(ruleId='x', count=3, key='column', value='addL1'),
        E(ruleId='x', count=2, key='row',    value='1'),
        E(ruleId='y', count=1, key='row',    value='1'),
        E(ruleId='x', count=2, key='row',    value='2'),
        E(ruleId='y', count=1, key='row',    value='2'),

        E(ruleId='_all', count=6, key='table',  value='addresses'),
        E(ruleId='_all', count=3, key='column', value='addId'),
        E(ruleId='_all', count=3, key='column', value='addL1'),
        E(ruleId='_all', count=3, key='row',    value='1'),
        E(ruleId='_all', count=3, key='row',    value='2'),
    ]
}

This enables the user to use their own select/filter functions on the data as it has a flat table form.
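
As an illustration of such a select/filter function, the sketch below picks out entries by group key and rule id. It assumes SummaryEntry is a dataclass and that key and ruleId are plain strings, as in the example data; the helper name select is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SummaryEntry:
    # Simplified stand-in for the class above, with plain string fields.
    ruleId: str
    count: int
    key: str
    value: str


E = SummaryEntry

# A subset of the example data shown above.
errors: Dict[str, List[E]] = {
    "addresses": [
        E(ruleId="x", count=1, key="column", value="addId"),
        E(ruleId="y", count=2, key="column", value="addId"),
        E(ruleId="x", count=3, key="column", value="addL1"),
        E(ruleId="_all", count=6, key="table", value="addresses"),
        E(ruleId="_all", count=3, key="column", value="addId"),
        E(ruleId="_all", count=3, key="column", value="addL1"),
    ]
}


def select(entries: List[E], key: str, rule_id: str) -> List[E]:
    """Return the entries for one group key and one rule id."""
    return [e for e in entries if e.key == key and e.ruleId == rule_id]


# Total error count per column, using the '_all' pseudo-rule.
per_column = {e.value: e.count
              for e in select(errors["addresses"], "column", "_all")}
```

Because every entry has the same four fields, the same list could also be loaded into a dataframe or written out as CSV without reshaping.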

The main reasons for choosing this structure are:

Users can easily transform the data to fit their needs, as it's a flat table.
Multiple groupings can be included in the same dataset (table, column, etc.).
It’s less complex than its corresponding tree-representation.

See Appendix 1 for an alternative tree-based implementation.

Example usage¶

report = validate_data(...)
summary = summarize_report(report, by={SummaryKey.column})

See summarize-tool for more examples of how the summary generation may be used in practice.

Appendix 1¶

This is an alternative data-structure for representing the error summary.

Count = int
ErrorSummary = Dict[TableId, Dict[SummaryKey, Dict[str, Count]]]

errors: ErrorSummary = {
    'addresses': {
        SummaryKey.table: {'x': 4, 'y': 2, '_all': 6},
        SummaryKey.column: {
            'addId': {'x': 1, 'y': 2, '_all': 3},
            'addL1': {'x': 3, '_all': 3},
        },
        SummaryKey.row: {
            '1': {'x': 2, 'y': 1, '_all': 3},
            '2': {'x': 2, 'y': 1, '_all': 3},
        }
    }
}

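The main advantage of this data-structure is that it can be iterated and used directly without performing additional transforms. Its drawback is that the inner structure of the table key is not consistent with the other keys: table maps directly to rule counts, while column and row first map an entity id to its rule counts.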
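
Users who prefer the tree form can derive it from the flat form. A minimal sketch of that conversion for a single table follows; it assumes entries carry the same four fields as SummaryEntry and uses plain strings for the group keys. Entry and to_tree are hypothetical names.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    # Stand-in for SummaryEntry with plain string fields.
    ruleId: str
    count: int
    key: str
    value: str


def to_tree(entries: List[Entry]) -> dict:
    """Convert one table's flat entries to the tree form shown above."""
    tree: dict = {}
    for e in entries:
        if e.key == "table":
            # Table-level counts sit directly under the 'table' key.
            tree.setdefault("table", {})[e.ruleId] = e.count
        else:
            # Column/row counts are nested one level deeper, by entity id.
            by_entity = tree.setdefault(e.key, {})
            by_entity.setdefault(e.value, {})[e.ruleId] = e.count
    return tree


flat = [
    Entry("x", 4, "table", "addresses"),
    Entry("_all", 6, "table", "addresses"),
    Entry("x", 3, "column", "addL1"),
    Entry("_all", 3, "column", "addL1"),
]
tree = to_tree(flat)
```

The special case for the table key in to_tree mirrors the inconsistency described above.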