ckanext-summarystats

GitHub

This CKAN extension allows plugins to create summary statistics for a dataset that meets certain criteria. These summary stats are then uploaded to the dataset.

Requirements

This plugin is compatible with CKAN 2.9 or later.

Installation

pip install -e "git+https://github.com/RTIInternational/ckanext-summarystats.git#egg=ckanext-summarystats"

Usage

This extension is not standalone but meant to be extended by your own CKAN plugin using the two provided interfaces.

Example summarystats usage in a plugin

from ckanext.summarystats.interfaces import ISummaryStats

class MyPlugin(plugins.SingletonPlugin):
    plugins.implements(ISummaryStats)

    def is_eligible_for_summarystats(self, dataset):
        """
        Returns a boolean to determine if summary stats should be
        calculated for the given dataset
        """
        # Some criteria
        if dataset.get("data_type") == "math":
            return True

    def calculate_summarystats(self, dataset):
        """
        Calculates summary statistics for a given dataset and
        returns a pandas data frame
        """
        # Get resource to generate stats dataframe
        stats_df = None
        resource_filepath = None
        for resource in dataset.get("resources"):
            if resource.get("resource_type") == "math":
                resource_filepath = get_resource_file_path()
                resource_dataframe = pd.read_csv(resource_filepath)
                # Do some transform to create a new dataframe
                stats_df = resource_dataframe

        return stats_df

When a dataset’s resource is created or updated, summarystats will call is_eligible_for_summarystats to see if it should calculate_summarystats.

What sort of summary stats might be calculated?

A simple example could be a dataset containing tabular data resources where each row is a person’s favorite food. Using this plugin, you could implement a is_eligible_for_summarystats function that checks if the dataset does indeed contain such data, then implement a calculate_summarystats function to summarize the data to determine the top 10 favorite foods in the dataset.

Handling errors

If the user’s data is not correctly formatted for calculating summary statistics, raise SumstatsCalcError(error_message) and the error message will be saved to the dataset on the summary_stats_error field.