PrePAN

Sign in to PrePAN

Data::Enumerable::Lazy A lazy enumerator/generator optimised for non-flat collections

Good

Synopsis

# The library implements a lazy enumerator/generator pattern and is optimised
# for non-flat collections (resolves and flattens sub-, subsub-,... collections automatically).

# A basic lazy generator producing even numbers in a given range:

my ($from, $to) = (0, 10);
my $current = $from;
my $tream = Data::Enumerable::Lazy->new({
    on_has_next => sub { $current <= $to          },
    on_next     => sub { shift->yield($current++) },
})->grep(sub { shift % 2 == 0 });

$tream->to_list(); # generates: [0, 2, 4, 6, 8, 10]

Description

This library is yet another implementation of lazy generator + enumerable pattern for Perl5. First of all, it is not a parallel calculation framework. For the parallelism problems, Generator::Object would be the way to go.

This library provides building blocks for lazy data manipulation. One of the key features of this library is that it abstracts the end users away from a multi-level nested sub-collection enumeration whereas a classical enumerator normally operates within a flat collection context. Think of it this way: this library provides a built-in functionality to resolve enumerable steps in micro-batches, and these micro-batches might be enumerables as well and so on and so forth. But it definitely does not force the micro-batched way.

A quick example: let's say there is a task: read multiple plain text files word-by-word. This task contains several nesting enumerable loops:

-> file
  -> line
    -> word (we assume there is no word wrap, but it is not a problem for our model)

Implemented in imperative style, the program would look like:

foreach my $file (@files) {
  foreach my $line ($file->read_lines) {
    foreach my $word ($line->split_words) {
      # do something
    }
  }
}

Let's say there is one more level of complexity: multi-partition setup:

foreach my $partition (@partitions) {
  foreach my $file ($partition->ls_files) {
    ...
  }
}

We have to implement the loops over and over again, exposing the internal knowledge about the partitions, files, lines etc. But we can do it better. Let's examine a lazy enumerable approach:

my $word_enum = Data::Enumerable::Lazy::from_list(@partitions)
  -> continue({ on_has_next => sub { my ($self, $partition) = @_; $self->yield($partition->ls_files) } })
  -> continue({ on_has_next => sub { my ($self, $file)      = @_; $self->yield($file->read_lines   ) } })
  -> continue({ on_has_next => sub { my ($self, $line)      = @_; $self->yield($line->split_words  ) } });

while ($word_enum->has_next) {
  my $word = $word_enum->next;
  # do something
}

The benefit is that the end user might have zero knowledge about partitions, files, lines, etc. Adding a new nesting loop for multi-computer fetch? Easy.

The end user is focused on consuming the words, not intermediate stages. A wrapper library might hide the implementation details and return the enumerable back to the end user.

Enumerables are single-pass calculation units. What it means: an enumerable is stateful, once it reached the end of the sequence, it will not rewind to the beginning.

Enumerables have internal buffer: another enumerable which preserves the pre-fetched collection. The fig. above illustrates the buffering algorithm.

[enumerable.has_next] -> [_buffer.has_next] -> yes -> return true
                                              -> no -> result = [enumerable.on_has_next] -> return result

[enumerable.next] -> [_buffer.has_next] -> yes -> return [_buffer.next]
                                          -> no -> result = [enumerable.next] -> [enumerable.set_buffer(result)] -> return _buffer.next

EXAMPLES

A basic range

This example implements a range generator from $from until $to. In order to generate this range we define 2 callbacks: on_has_next() and on_next(). The first one is used as point of truth whether the sequence has any more non-visited elements, and the 2nd one is to return the next element in the sequence and the one that changes the state of the internal sequence iterator.

sub basic_range {
  my ($from, $to) = @_;
  my $current = $from;
  Data::Enumerable::Lazy->new({
    on_has_next => sub {
      return $current  sub {
      my ($self) = @_;
      return $self->yield($current++);
    },
  });
}

on_has_next() makes sure the current value does not exceed $to value, and on_next() yields the next value of the sequence. Note the yield method. An enumerable developer is expected to use this method in order to return the next step value. This method does some internal bookkeeping and smart caching.

Usage:

# We initialize a new range generator from 0 to 10 including.
my $range = basic_range(0, 10);
# We check if the sequence has elements in it's tail.
while ($range->has_next) {
  # In this very line the state of $range is being changed
  say $range->next;
}
is $range->has_next, 0, '$range has been iterated completely'
is $range->next, undef, 'A fully iterated sequence returns undef on next()'

Prime numbers

Prime numbers is an infinite sequence of natural numbers. This example implements a very basic prime number generator.

my $prime_num_stream = Data::Enumerable::Lazy->new({
  # This is an infinite sequence
  on_has_next => sub { 1 },
  on_next => sub {
    my $self = shift;
    # We save the result of the previous step
    my $next = $self->{_prev_} // 1;
    LOOKUP: while (1) {
      $next++;
      # Check all numbers from 2 to sqrt(N)
      foreach (2..floor(sqrt($next))) {
        ($next % $_ == 0) and next LOOKUP;
      }
      last LOOKUP;
    }
    # Save the result in order to use it in the next step
    $self->{_prev_} = $next;
    # Return the result
    $self->yield($next);
  },
});

What's remarkable regarding this specific example is that one can not simply call to_list() in order to get all elements of the sequence. The enumerable will throw an exception claiming it's an infinitive sequence. Therefore, we should use next() in order to get elements one by one or use another handy method take() which returns first N results.

Nested enumerables

In this example we will output a numbers of a multiplication table 10x10. What's interesting in this example is that there are 2 sequences: primary and secondary. Primary on_next() returns secondary sequence, which generates the result of multiplication of 2 numbers.

# A new stream based on a range from 1 to 10
my $mult_table = Data::Enumerable::Lazy->from_list(1..10)->continue({
  on_has_next => sub {
    my ($self, $i) = @_;
    # The primary stream returns another sequence, based on range
    $self->yield(Data::Enumerable::Lazy->from_list(1..10)->continue({
      on_next => sub {
        # $_[0] is a substream self
        # $_[1] is a next substream sequence element
        $_[0]->yield( $_[1] * $i )
      },
    }));
  },
});

Another feature which is demonstrated here is the batched result generation. Let's iterate the sequence step by step and see what happens inside.

$mult_table->has_next;         # returns true based on the primary range, _buffer is
                               # empty
$mult_table->next;             # returns 1, the secondary sequence is now stored as
                               # the primary enumerable buffer and 1 is being served
                               # from this buffer
$mult_table->has_next;         # returns true, resolved by the state of the buffer
$mult_table->next;             # returns 2, moves buffer iterator forward, the
                               # primary sequence on_next() is _not_ being called
                               # this time
$mult_table->next for (3..10); # The last iteration completes the buffer
                               # iteration cycle
$mult_table->has_next;         # returns true, but now it calls the primary
                               # on_has_next()
$mult_table->next;             # returns 2 as the first element in the next
                               # secondary sequence (which is 1 again) multiplied by
                               # the 2nd element of the primary sequence (which is 2)
$mult_table->to_list;          # Generates the tail of the sesquence:
                               # [4, 6, ..., 80, 90, 100]
$mult_table->has_next;         # returns false as the buffer is empty now and the
                               # primary sequence on_has_next() says there is nothing
                               # more to iterate over.

OPTIONS

on_next($self, $element) :: CodeRef -> Data::Enumerable::Lazy | Any

on_next is a code ref, a callback which is being called every time the generator is in demand for a new bit of data. Enumerable buffers up the result of the previous calculation and if there are no more elements left in the buffer, on_next() would be called.

$element is defined when the current collection is a contuniation of another enumerable. I.e.:

my $enum = Data::Enumerable::Lazy->from_list(1, 2, 3);
my $enum2 = $enum->continue({
    on_next => sub {
        my ($self, $i) = @_;
        $self->yield($i * $i)
    }
});
$enum2->to_list; # generates 1, 4, 9

In this case $i would be defined and it comes from the original enumerable.

The function is supposed to return an enumerable, in this case it would be kept as the buffer object. If this function method returns any other value, it would be wrapped in a Data::Enumerable::Lazy->singular(). There is a way to prevent an enumerable from wrapping your return value in an enum and keeping it in a raw state by providing _no_wrap=1.

on_has_next($self) :: CodeRef -> Bool

on_has_next is a code ref, a callback to be called whenever the enumerable is about to resolve has_next() method call. Similar to on_next() call, this one is also triggered whenever an enumerable runs out of buffered elements. The function should return boolean.

A method that returns 1 all the time is the way to initialise an infinite enumerable (see infinity()). If it returns 0 no matter what, it would be an empty enumerable (see empty()). Normally you stay somewhere in the middle and implement some state check logic in there.

on_reset($self) :: CodeRef -> void

This is a callback to be called in order to reset the state of the enumerable. This callback should be defined in the same scope as the enumerable itself. The library provides nothing magical but a callback and a handle to call it, so the state cleanup is completely on the developer's side.

is_finite :: Bool

A boolean flag indicating whether an enumerable is finite or not. By default, enumerables are treated as infinite, which means some functions will throw an exception, like: to_list() or resolve().

Make sure to not mark an enumerable as finite and to call finite-size defined methods, in this case it will create an infinite loop on the resolution.

INSTANCE METHODS

next()

Function next() is the primary interface for accessing elements of an enumerable. It will do some internal checks and if there is no elements to be served from an intermediate buffer, it will resolve the next step by calling on_next() callback. Enumerables are composable: one enumerable might be based on another enumeration. E.g.: a sequence of natural number squares is based on the sequence of natural numbers themselves. In other words, a sequence is defined as a tuple of another sequence and a function which would be lazily applied to every element of this sequence.

next() accepts 0 or more arguments, which would be passed to on_next() callback.

next() is expected to do the heavy-lifting job in opposite to has_next(), which is supposed to be cheap and fast. This statement flips upside down whenever grep() is applied to a stream. See grep() for more details.

has_next()

has_next() is the primary entry point to get an information about the state of an enumerable. If the method returned false, there are no more elements to be consumed. I.e. the sequence has been iterated completely. Normally it means the end of an iteration cycle.

Enumerables use internal buffers in order to support batched on_next() resolutions. If there are some elements left in the buffer, on_next() won't call on_has_next() callback immediately. If the buffer has been iterated completely, on_has_next() would be called.

on_next() should be fast on resolving the state of an enumerable as it's going to be used for a condition state check.

reset()

This method is a generic entry point for a enum reset. In fact, it is basically a wrapper around user-defined on_reset().

to_list()

This function transforms a lazy enumerable to a list. Only finite enumerables can be transformed to a list, so the method checks if an enumerable is created with is_finite=1 flag. An exception would be thrown otherwise.

map($callback)

Creates a new enumerable by applying a user-defined function to the original enumerable. Works the same way as perl map {} function but it's lazy.

reduce($acc, $callback)

Resolves the enumerable and returns the resulting state of the accumulator $acc provided as the 1st argument. $callback should always return the new state of $acc.

reduce() is defined for finite enumerables only.

grep($callback, $max_lookahead)

grep() is a function which returns a new enumerable by applying a user-defined filter function.

grep() might be applied to both finite and infinite enumerables. In case of an infinitive enumerable there is an additional argument specifying max number of lookahead steps. If an element satisfying the condition could not be found in max_lookahead steps, an enumerable is considered to be completely iterated and has_next() will return false.

grep() returns a new enumerable with quite special properties: has_next() will perform a look ahead and call the original enumerable next() method in order to find an element for which the user-defined function will return true. next(), on the other side, returns the value that was pre-fetched by has_next().

resolve()

Resolves an enumerable completely. Applicable for finite enumerables only. The method returns nothing.

take($N_elements)

Resolves first $N_elements and returns the resulting list. If there are fewer than N elements in the enumerable, the entire enumerable would be returned as a list.

take_while($callback)

This function takes elements until it meets the first one that does not satisfy the conditional callback. The callback receives only 1 argument: an element. It should return true if the element should be taken. Once it returned false, the stream is over.

continue($ext = %{ on_next => sub {}, ... })

Creates a new enumerable by extending the existing one. on_next is the only mandatory argument. on_has_next might be overridden if some custom logic comes into play.

is_finite is inherited from the parent enumerable by default. All additional attributes would be transparently passed to the constructor.

yield($result)

This method is supposed to be called from on_next callback only. This is the only valid way for an Enumerable to return the next result. Effectively, it ensures the returned result conforms to the required interface and is wrapped in a lazy wrapper if needed.

CLASS METHODS

empty()

Returns an empty enumerable. Effectively it means an equivalent of an empty array. has_next() will return false and next() will return undef. Useful whenever a on_next() step wants to return an empty result set.

singular($val)

Returns an enumerable with a single element $val. Actively used as an internal data container.

from_list(@list)

Returns a new enumerable instantiated from a list. The easiest way to initialise an enumerable. In fact, all elements are already resolved so this method sets is_finite=1 by default.

cycle()

Creates an infinitive enumerable by cycling the original list. E.g. if the original list is [1, 2, 3], cycle() will generate an infinitive sequence like: 1, 2, 3, 1, 2, 3, 1, ...

infinity()

Returns a new infinite enumerable. has_next() always returns true whereas next() returns undef all the time. Useful as an extension basis for infinite sequences.

merge($tream1 [, $tream2 [, $tream3 [, ...]]])

This function merges one or more streams together by fan-outing next() method call among the non-empty streams. Returns a new enumerable instance, which: * Has next elements as far as at least one of the streams does. * Returns next element py picking it one-by-one from the streams. * Is finite if and only if all the streams are finite. If one of the streams is over, it would be taken into account and next() will continue choosing from non-empty ones.

AUTHOR

Oleg S

SEE ALSO

Comments

Please sign up to post a review.