# summarize_report

This document specifies the implementation of the `summarize_report`
function, based on the [summary report](summary-report.md)
specification.

The `summarize_report` function transforms the report produced by
`validate_data` into a summarized report. It can summarize by multiple
categories.

## Function signature

    def summarize_report(report: ValidationReport,
                         by: Set[SummaryKey] = {SummaryKey.table}
                         ) -> SummarizedReport

### Arguments

- `report`: a validation report returned from `validate_data`
- `by`: a set of keys to summarize by. The full set consists of the keys
  `table`, `column` and `row`. An error/warning summarization is performed
  for each key specified. Defaults to `{SummaryKey.table}`.
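
The specification does not spell out `SummaryKey`. The following is a
minimal sketch, assuming it is a plain `Enum` with one member per key.
`resolve_keys` is a hypothetical helper, not part of the specified API. It
shows one way to apply the `table` default without sharing a mutable
default argument between calls.

```python
from enum import Enum
from typing import Optional, Set


class SummaryKey(Enum):
    """Assumed shape: one member per key named in the argument list."""
    table = "table"
    column = "column"
    row = "row"


def resolve_keys(by: Optional[Set[SummaryKey]] = None) -> Set[SummaryKey]:
    """Hypothetical helper: return the keys to summarize by.

    `None` is used as the sentinel so the default set is created fresh
    on every call instead of being shared.
    """
    return {SummaryKey.table} if by is None else set(by)
```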

## SummarizedReport object

This is the summarized report object returned from the function.

    class SummarizedReport:
        data_version: str
        schema_version: str
        package_version: str
        overview: dict
        errors: dict
        warnings: dict

### Fields

The following fields are shared with `ValidationReport` (and documented
[here](module-functions.md#validate-data)):

- `data_version`
- `schema_version`
- `package_version`

The remaining fields are unique to the summarized report:

- `overview` contains a high level summary of the summarized data. It is
  specified [below](#overview-data-structure).

- `errors` and `warnings` contain summarizations grouped by the keys
  specified when calling the function. Their structure is specified
  [below](#errorsummary-data-structure).

## Overview data-structure

The overview holds the most basic information about the validation:

- number of columns and rows per table
- number of errors/warnings per rule

The overview is kept separate from the error summary, both to simplify
implementation and because the summary report should contain this
information regardless of which keys are summarized by.

The table information is taken from the validation report’s `table_info`
field.

    overview = {
        'tables': {
            'addresses': {
                'columns': 2,
                'rows': 6,
                'rules': {
                    'x': 4,
                    'y': 2,
                }
            }
        },
        'errors': {
            'x': 4,
            'y': 2,
        },
        'warnings': {
        }
    }
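
As an illustration, the overview above could be assembled as follows. This
is a sketch under stated assumptions: `build_overview` is a hypothetical
helper, the errors and warnings are assumed to be available as
`(table_id, rule_id)` pairs, and the per-table `rules` counts are assumed
to cover errors and warnings together. None of these details are fixed by
this specification.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # assumed input shape: (table_id, rule_id)


def build_overview(table_info: Dict[str, Dict[str, int]],
                   errors: Iterable[Pair],
                   warnings: Iterable[Pair] = ()) -> dict:
    """Hypothetical helper: assemble the overview dict."""
    # Materialize the inputs so they can be iterated more than once.
    errors, warnings = list(errors), list(warnings)

    # Column and row counts come straight from `table_info`.
    tables = {
        table_id: {"columns": info["columns"], "rows": info["rows"], "rules": {}}
        for table_id, info in table_info.items()
    }

    # Per-table rule counts (assumption: errors and warnings combined).
    for table_id, rule_id in errors + warnings:
        rules = tables[table_id]["rules"]
        rules[rule_id] = rules.get(rule_id, 0) + 1

    return {
        "tables": tables,
        "errors": dict(Counter(rule for _, rule in errors)),
        "warnings": dict(Counter(rule for _, rule in warnings)),
    }


overview = build_overview(
    {"addresses": {"columns": 2, "rows": 6}},
    errors=[("addresses", "x")] * 4 + [("addresses", "y")] * 2,
)
```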

## ErrorSummary data-structure

The error summary uses a flat table approach, where each table has a list
of ‘summary objects’ with the appropriate fields for the selected group
key:

    class SummaryEntry:
        ruleId: RuleId
        count: int

        key: SummaryKey
        """The group/summary-key this entry is summarized by."""

        value: str
        """Id of the entity in group `key` that this entry is derived from. This
        corresponds to the table-id when grouping by `table`, the column-id when
        grouping by `column`, etc."""

    E = SummaryEntry

    ErrorSummary = Dict[TableId, List[E]]

    errors: ErrorSummary = {
        'addresses': [
            E(ruleId='x', count=4, key='table',  value='addresses'),
            E(ruleId='y', count=2, key='table',  value='addresses'),
            E(ruleId='x', count=1, key='column', value='addId'),
            E(ruleId='y', count=2, key='column', value='addId'),
            E(ruleId='x', count=3, key='column', value='addL1'),
            E(ruleId='x', count=2, key='row',    value='1'),
            E(ruleId='y', count=1, key='row',    value='1'),
            E(ruleId='x', count=2, key='row',    value='2'),
            E(ruleId='y', count=1, key='row',    value='2'),

            E(ruleId='_all', count=6, key='table',  value='addresses'),
            E(ruleId='_all', count=3, key='column', value='addId'),
            E(ruleId='_all', count=3, key='column', value='addL1'),
            E(ruleId='_all', count=3, key='row',    value='1'),
            E(ruleId='_all', count=3, key='row',    value='2'),
        ]
    }

This enables the user to use their own select/filter functions on the
data as it has a flat table form.
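
For instance, a user-side filter might look like the sketch below. The
`select` helper is hypothetical and not part of the specified API, and the
`SummaryEntry` stand-in uses plain strings for `key` to keep the example
self-contained.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class SummaryEntry:
    """Minimal stand-in for the SummaryEntry class specified above."""
    ruleId: str
    count: int
    key: str    # plain string here; the specification uses SummaryKey
    value: str


def select(entries: Iterable[SummaryEntry],
           key: Optional[str] = None,
           rule_id: Optional[str] = None) -> List[SummaryEntry]:
    """Hypothetical helper: keep entries matching the given key and/or rule."""
    return [
        e for e in entries
        if (key is None or e.key == key)
        and (rule_id is None or e.ruleId == rule_id)
    ]


E = SummaryEntry
errors: Dict[str, List[SummaryEntry]] = {
    "addresses": [
        E(ruleId="x", count=1, key="column", value="addId"),
        E(ruleId="y", count=2, key="column", value="addId"),
        E(ruleId="x", count=3, key="column", value="addL1"),
        E(ruleId="_all", count=3, key="column", value="addId"),
        E(ruleId="_all", count=3, key="column", value="addL1"),
    ]
}

# Total error count per column, read from the `_all` entries.
per_column = {
    e.value: e.count
    for e in select(errors["addresses"], key="column", rule_id="_all")
}
```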

The main reasons for choosing this structure are:

- Users can easily transform the data to fit their needs, as it’s a flat
  table.
- Multiple groupings can be included in the same dataset (table, column,
  etc.).
- It’s less complex than its corresponding tree-representation.

See [Appendix 1](#appendix-1) for an alternative tree-based
implementation.

## Example usage

    report = validate_data(...)
    summary = summarize_report(report, by={SummaryKey.column})

See [summarize-tool](summarize-tool.md) for more examples of how the
summary generation may be used in practice.

## Appendix 1

This is an alternative data-structure for representing the error
summary.

    Count = int
    ErrorSummary = Dict[TableId, Dict[SummaryKey, Dict[str, Count]]]

    errors: ErrorSummary = {
        'addresses': {
            table: {'x': 4, 'y': 2, '_all': 6},
            column: {
                'addId': {'x': 1, 'y': 2, '_all': 3},
                'addL1': {'x': 3, '_all': 3},
            },
            row: {
                '1': {'x': 2, 'y': 1, '_all': 3},
                '2': {'x': 2, 'y': 1, '_all': 3},
            }
        }
    }

The main advantage of this data-structure is that it can be iterated and
used directly without performing additional transforms; however, the
inner structure of the table-key is not consistent with the other keys.
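
A tree of this kind can also be derived from the flat list of summary
entries when needed. The sketch below assumes entries with the fields
`ruleId`, `count`, `key` and `value`; `to_tree` is a hypothetical helper.
Unlike the example above, it nests the `table` key the same way as the
other keys (`table -> table-id -> counts`), which sidesteps the
inconsistency at the cost of one extra level.

```python
from collections import namedtuple
from typing import Dict, List

# Stand-in for a summary entry, with the fields assumed in the lead-in.
E = namedtuple("E", ["ruleId", "count", "key", "value"])

Tree = Dict[str, Dict[str, Dict[str, Dict[str, int]]]]


def to_tree(flat: Dict[str, List[E]]) -> Tree:
    """Hypothetical helper: nest entries as table-id -> key -> value -> rule -> count."""
    tree: Tree = {}
    for table_id, entries in flat.items():
        for e in entries:
            by_value = tree.setdefault(table_id, {}).setdefault(e.key, {})
            by_value.setdefault(e.value, {})[e.ruleId] = e.count
    return tree


flat = {
    "addresses": [
        E("x", 4, "table", "addresses"),
        E("x", 1, "column", "addId"),
        E("y", 2, "column", "addId"),
        E("_all", 3, "column", "addId"),
    ]
}
tree = to_tree(flat)
```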