PrePAN

Sign in to PrePAN

Set::SegmentTree Immutable segment trees in perl

Good

Synopsis

  use Set::SegmentTree;
  my $builder = Set::SegmentTree::Builder->new(@segment_list);
  $builder->insert([ segment_name, start, end ], [ ... ]);
  my $newtree = $builder->build();
  my @segments = $newtree->find($value);
  $newtree->serialize( $filename );

  my $savedtree = Set::SegmentTree->from_file( $filename );
  my @segments = $savedtree->find($value);

Description

wat? Segment Tree

A Segment tree is an immutable tree structure used to efficiently resolve a value to the set of segments which encompass it.

Why?

You have a large set of value intervals (like time segments!) and need to match them against a single value (like a time) efficiently.

This solution is suitable for problems where the set of intervals is known in advance of the queries, and the tree needs to be loaded and queried efficiently many orders of magnitude more often than the set of intervals is updated.

Data structure:

A segment is like this: [ Segment Label, Start Value , End Value ]

Start Value and End Values Must be numeric.

Start Value Must be less than End Value

Segment Label Must occur exactly once

The speed of Set::SegmentTree depends on not being concerned with additional segment relevant data, so it is expected one would use the label as an index into whatever persistence retains additional information about the segment.

Use walkthrough

my @segments = (['A',1,5],['B',2,3],['C',3,8],['D',10,15]);

This defines four intervals which both do and don't overlap. - A - 1 to 5 - B - 2 to 3 - C - 3 to 8 - D - 10 to 15

Doing a find within the resulting tree.

my $tree = Set::SegmentTree::Builder->new(@segments)->build

Would make these tests pass

is_deeply [$tree->find(0)], [];
is_deeply [$tree->find(1)], [qw/A/];
is_deeply [$tree->find(2)], [qw/A B/];
is_deeply [$tree->find(3)], [qw/A B C/];
is_deeply [$tree->find(4)], [qw/A C/];
is_deeply [$tree->find(6)], [qw/C/];
is_deeply [$tree->find(9)], [];
is_deeply [$tree->find(12)], [qw/D/];

And although this structure is relatively expensive to build, it can be saved efficiently,

my $builder = Set::SegmentTree::Builder->new(@segments);
$builder->to_file('filename');

and then loaded and queried extremely quickly, making this. pass in only milliseconds.

my $tree = Set::SegmentTree->from_file('filename');
is_deeply [$tree->find(3)], [qw/A B C/];

This structure is useful in the use case where...

1) value segment intersection is important 1) performance of loading and lookup is critical, but building is not

The Segment Tree data structure allows you to resolve any single value to the list of segments which encompass it in O(log(n)+nk).

Comments

How does this differ from the existing Set::IntervalTree?
SEGMENT INTERVAL
STORAGE O(n log n) O(n)
CREATION O(n log n) O(n log n)
QUERY O(log n + k) O(log n + m)
MODIFY immutable O(log n)
SERIALIZE FlatBuffers unknown

I think that about summarizes the algorithmic/capability factors.

The biggest difference operationally would be that I designed Set::SegmentTree to be quick-loaded from disk (memory mapped) and queried with no parsing/building activities required. That is, the tree build cost is one-time, and subsequent read-and-query operations are more or less O(log n + k).

perhaps a little explanation of 'k' and 'm' in the Query times above would be in order, as it occurs to me to be rather obtuse.

k refers to 'number of intervals that are represented in this segment' - so O(log n) to find the node and +k to provide the list of segments that are represented there.

m refers to 'number of intervals that are leftward of the found interval' - as I understand it the query operation uses O(log n) time to find a start point then walks left m times, filtering all encountered interval nodes by whether or not their high value is lower than the query value. So any interval that started in the time before the query point must be compared to see if that interval is matched. (right/left swappable to your standard of course)
"label" sounds more like an "id".
I suppose you are right in that it cannot repeat. Do you think that makes it more clear?

Please sign up to post a review.