PrePAN

Statistics::Descriptive::LogScale Approximate descriptive statistics class, saving memory at the cost of a fixed relative error.

Synopsis

    use strict;
    use warnings;

    use Statistics::Descriptive::LogScale;
    my $stat = Statistics::Descriptive::LogScale->new();

    # Read numbers from stdin, then print a brief summary.
    while (<>) {
        chomp;
        $stat->add_data($_);
    }

    # This can also be done in O(1) memory, precisely
    printf "Mean: %f +- %f\n", $stat->mean, $stat->standard_deviation;

    # This requires either storing actual data, or approximating
    printf "25%%/50%%/75%%: %f, %f, %f\n", $stat->percentile(25), $stat->median, $stat->percentile(75);
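The "O(1) memory, precisely" remark above can be illustrated with running sums: count, sum, and sum of squares are three scalars, yet they suffice for the mean and standard deviation. A minimal sketch (not the module's actual internals):

```perl
use strict;
use warnings;

# Keep only three scalars regardless of how many samples arrive.
my ($n, $sum, $sumsq) = (0, 0, 0);
for my $x (1, 2, 3, 4, 5) {
    $n++;
    $sum   += $x;
    $sumsq += $x * $x;
}
my $mean = $sum / $n;
my $var  = ($sumsq - $sum * $sum / $n) / ($n - 1);  # sample variance
my $sd   = sqrt($var);
printf "Mean: %f +- %f\n", $mean, $sd;
```

Percentiles have no such sufficient statistic, which is why they require either the full sample or an approximation such as binning.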

Description

This module provides a Statistics::Descriptive::Full-compatible interface; however, it doesn't keep all of the data. Instead, the data is divided into logarithmic bins, and only the bin counts are stored. Rough properties of the statistical distribution can thus be calculated without using much memory.
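To make the binning idea concrete, here is a hypothetical sketch (not the module's actual code): each bin covers [base**i, base**(i+1)), so any value in a bin differs from the bin's representative value by a bounded relative error. The growth factor below is an assumption for illustration; the module's default may differ.

```perl
use strict;
use warnings;
use POSIX qw(floor);

my $base = 1.01;    # assumed bin growth factor (illustrative)

# Map a positive value to its logarithmic bin index.
sub bin_index  { my ($x) = @_; floor( log($x) / log($base) ) }

# Representative value for a bin: the geometric midpoint.
sub bin_center { my ($i) = @_; $base ** ($i + 0.5) }

# Store only bin counts, not the samples themselves.
my %count;
$count{ bin_index($_) }++ for (0.095, 0.100, 0.101, 0.250);
```

With base = 1.01, reconstructing a sample from its bin's center is accurate to well under 1% relative error, while the memory used depends only on the number of distinct bins.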

It was initially targeted at performance analysis. Knowing that 99% of requests finished in 0.1s +- 0.01s is more useful than just having an average request time of 0.1s +- 1s (standard deviation), which is what I observed more than once when trying to "analyze" our web applications' performance.

However, broader uses may exist; e.g. a long-running application may want to keep track of various useful numbers without leaking memory.

Ideally, this module could also become a short way of saying "I'm not sure why I need statistics, but it's nice to have, and it's simple." For those who know why they need statistics, there's R.

Comments

I have written similar code to compute the mean and standard deviation, so I know how that part works, although I'm less familiar with the percentile calculation. Would it be possible to compute percentiles without approximation and without storing a lot of data if you passed the desired percentiles to the constructor? E.g.:

Statistics::Descriptive::LogScale->new(25, 50, 75);

Thanks for your comment.

While searching for prior art, I discovered a few papers on memory-efficient algorithms for percentiles/medians that require only a small fraction of the sample size to run. I admit I didn't manage to read through any of them, let alone try to implement them. At second glance, they seem to use approximation, too.

However, when precise values are needed, it would probably be more feasible to move the sample to a separate machine and process it there with R.
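For what it's worth, approximate percentiles fall out of the bin counts quite naturally: walk the bins in ascending order of their representative values until the requested share of the total count is covered. A hedged sketch (the function name and data layout are hypothetical, not this module's API):

```perl
use strict;
use warnings;

sub percentile_from_bins {
    my ($counts, $p) = @_;          # $counts: { bin_center => count }
    my $total = 0;
    $total += $_ for values %$counts;
    my $need = $total * $p / 100;   # how many samples the percentile covers
    my $seen = 0;
    for my $center (sort { $a <=> $b } keys %$counts) {
        $seen += $counts->{$center};
        return $center if $seen >= $need;
    }
    return undef;
}

# 98% of requests near 0.1s, plus two slow outliers:
my %counts = ( 0.1 => 98, 1.0 => 1, 10 => 1 );
```

The answer is exact up to the bin width, i.e. up to the fixed relative error mentioned in the module's description.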
