ckanext-summarystats


This CKAN extension allows plugins to create summary statistics for a dataset that meets certain criteria. These summary stats are then uploaded to the dataset.

Requirements

This plugin is compatible with CKAN 2.9 or later.

Installation

pip install -e "git+https://github.com/RTIInternational/ckanext-summarystats.git#egg=ckanext-summarystats"

Usage

This extension is not standalone; it is meant to be extended by your own CKAN plugin using the two provided interfaces.

Example summarystats usage in a plugin

import ckan.plugins as plugins
import pandas as pd

from ckanext.summarystats.interfaces import ISummaryStats


class MyPlugin(plugins.SingletonPlugin):
    plugins.implements(ISummaryStats)

    def is_eligible_for_summarystats(self, dataset):
        """
        Returns a boolean to determine if summary stats should be
        calculated for the given dataset
        """
        # Some criteria; the comparison always yields True or False
        return dataset.get("data_type") == "math"

    def calculate_summarystats(self, dataset):
        """
        Calculates summary statistics for a given dataset and
        returns a pandas data frame
        """
        # Get resource to generate stats dataframe
        stats_df = None
        for resource in dataset.get("resources", []):
            if resource.get("resource_type") == "math":
                # get_resource_file_path is a placeholder for your own helper
                # that returns the local path of the resource's uploaded file
                resource_filepath = get_resource_file_path(resource)
                resource_dataframe = pd.read_csv(resource_filepath)
                # Do some transform to create a new dataframe
                stats_df = resource_dataframe

        return stats_df

When a dataset’s resource is created or updated, summarystats calls is_eligible_for_summarystats on your plugin and, if it returns True, calls calculate_summarystats to generate the stats.

Handling Errors and Schema

If the user’s data is not correctly formatted for calculating summary statistics, raise SumstatsCalcError(error_message). Any error encountered when generating summary stats will be saved to the dataset on the summarystats_error field. While processing, the summarystats_processing field is set to True. These fields must be added to your dataset schema if you want them available on the dataset.
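As a rough illustration of the error path described above, the sketch below validates a data frame before any stats are calculated and raises an error with a user-facing message. The SumstatsCalcError class here is a local stand-in, since the extension's import path is not shown in this README; the validate_for_summarystats helper and the column names are hypothetical.

```python
import pandas as pd


class SumstatsCalcError(Exception):
    """Local stand-in for the extension's SumstatsCalcError (assumption)."""


# Hypothetical column names that the stats calculation depends on
REQUIRED_COLUMNS = {"person", "favorite_food"}


def validate_for_summarystats(df):
    """Raise SumstatsCalcError with a readable message if df is unusable."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise SumstatsCalcError(
            "Missing required column(s): " + ", ".join(sorted(missing))
        )
    if df.empty:
        raise SumstatsCalcError("The resource contains no rows.")
    return df


# A resource that lacks the favorite_food column
bad_df = pd.DataFrame({"person": ["Ada", "Grace"]})
try:
    validate_for_summarystats(bad_df)
except SumstatsCalcError as err:
    # This message is the kind of text that would end up in summarystats_error
    error_message = str(err)
```

Calling a check like this at the top of calculate_summarystats keeps the message specific to what the user needs to fix, rather than surfacing a raw pandas exception.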

What sort of summary stats might be calculated?

A simple example could be a dataset containing tabular data resources where each row is a person’s favorite food. Using this plugin, you could implement an is_eligible_for_summarystats function that checks whether the dataset does indeed contain such data, then implement a calculate_summarystats function that summarizes the data to find the top 10 favorite foods in the dataset.
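The favorite-food summary could be sketched with pandas roughly as follows. This is a minimal example of the kind of transform calculate_summarystats might perform, not code from the extension; the top_favorite_foods function and the favorite_food column name are assumptions made for illustration.

```python
import pandas as pd


def top_favorite_foods(df, column="favorite_food", n=10):
    """Return the n most common foods and how many people chose each."""
    counts = (
        df[column]
        .str.strip()       # ignore stray whitespace around entries
        .str.lower()       # treat "Pizza" and "pizza" as the same food
        .value_counts()    # count occurrences, most frequent first
        .head(n)
    )
    # Turn the counts into a two-column data frame: food name and count
    return counts.rename_axis(column).reset_index(name="count")


people = pd.DataFrame(
    {"favorite_food": ["Pizza", "pizza ", "Sushi", "Tacos", "sushi", "pizza"]}
)
top_two = top_favorite_foods(people, n=2)
```

Here top_two holds pizza with a count of 3 followed by sushi with a count of 2. Normalizing case and whitespace before counting is a design choice that suits free-text input; drop it if your data is already controlled.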