PrePAN

PrePAN provides a place
to discuss your modules.

Requests for Reviews Feed

Bifcode - Bifcode serialization format

STATUS

This module and the related encoding format are still under development. Do not use it anywhere near production. Input is welcome.

DESCRIPTION

Bifcode implements the bifcode serialisation format, a mixed binary/text encoding with support for the following data types:

  • Primitive:
    • Undefined (null)
    • Booleans (true/false)
    • Integer numbers
    • Floating point numbers
    • UTF8 strings
    • Binary strings
  • Structured:
    • Arrays (lists)
    • Hashes (dictionaries)

The encoding is simple to construct and relatively easy to parse. There is no need to escape special characters in strings. It is not considered human readable, but as it is mostly text it can usually be visually debugged.

Bifcode can only be constructed canonically; i.e. there is only one possible encoding per data structure. This property makes it suitable for comparing structures (using cryptographic hashes) across networks.
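
For instance (a minimal sketch; Digest::SHA is used here only for illustration):

  use Bifcode qw(encode_bifcode);
  use Digest::SHA qw(sha256_hex);

  # Because every structure has exactly one encoding, equal structures hash
  # identically regardless of hash key insertion order.
  my $digest1 = sha256_hex( encode_bifcode( { x => 1, y => 'two' } ) );
  my $digest2 = sha256_hex( encode_bifcode( { y => 'two', x => 1 } ) );
  # $digest1 eq $digest2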

In terms of size the encoding is similar to minified JSON. In terms of speed this module compares well with other pure Perl encoding modules with the same features.

MOTIVATION & GOALS

Bifcode was created for a project because none of the currently available serialization formats (Bencode, JSON, MsgPack, Sereal, YAML, etc.) met all of the following requirements:

  • Support for undef
  • Support for UTF8 strings
  • Support for binary data
  • Trivial to construct on the fly from within SQLite triggers
  • Universally-recognized canonical form for hashing

There are no lofty goals or intentions to promote this outside of my specific use case. Use it or not, as you please, based on your own requirements. Constructive discussion is welcome.

SPECIFICATION

The encoding is defined as follows:

BIFCODE_UNDEF

A null or undefined value corresponds to '~'.

BIFCODE_TRUE and BIFCODE_FALSE

Boolean values are represented by '1' and '0'.

BIFCODE_UTF8

A UTF8 string is 'U' followed by the octet length of the UTF-8 encoded string as a base ten number, followed by a colon and the encoded string itself. For example "\x{df}" corresponds to "U2:\x{c3}\x{9f}".

BIFCODE_BYTES

Opaque data is 'B' followed by the octet length of the data as a base ten number followed by a colon and then the data itself. For example a three-byte blob 'xyz' corresponds to 'B3:xyz'.

BIFCODE_INTEGER

Integers are represented by an 'I' followed by the number in base 10 followed by a ','. For example 'I3,' corresponds to 3 and 'I-3,' corresponds to -3. Integers have no size limitation. 'I-0,' is invalid. All encodings with a leading zero, such as 'I03,', are invalid, other than 'I0,', which of course corresponds to 0.

BIFCODE_FLOAT

Floats are represented by an 'F' followed by a decimal number in base 10, followed by an 'e', followed by an exponent, followed by a ','. For example 'F3.0e-1,' corresponds to 0.3 and 'F-0.1e0,' corresponds to -0.1. Floats have no size limitation. 'F-0.0,' is invalid. All encodings with an extraneous leading zero, such as 'F03.0e0,', are invalid.

BIFCODE_LIST

Lists are encoded as a '[' followed by their elements (also bifcode encoded) followed by a ']'. For example '[U4:spamU4:eggs]' corresponds to ['spam', 'eggs'].

BIFCODE_DICT

Dictionaries are encoded as a '{' followed by a list of alternating keys and their corresponding values followed by a '}'. For example, '{U3:cowU3:mooU4:spamU4:eggs}' corresponds to {'cow': 'moo', 'spam': 'eggs'} and '{U4:spam[U1:aU1:b]}' corresponds to {'spam': ['a', 'b']}. Keys must be BIFCODE_UTF8 or BIFCODE_BYTES and appear in sorted order (sorted as raw strings, not alphanumerics).
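
Putting the above rules together, a small structure can be encoded by hand (the structure itself is just an example):

  # Perl structure: { cow => 'moo', n => 3, spam => [ 'a', 'b' ] }
  # Keys sorted as raw strings: 'cow', 'n', 'spam'
  my $bifcode = '{'
      . 'U3:cow'  . 'U3:moo'        # UTF8 key and value: 'U' . octet length . ':' . octets
      . 'U1:n'    . 'I3,'           # integer value
      . 'U4:spam' . '[U1:aU1:b]'    # list value, itself bifcode encoded
      . '}';
  # $bifcode is '{U3:cowU3:mooU1:nI3,U4:spam[U1:aU1:b]}'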

INTERFACE

encode_bifcode( $datastructure )

Takes a single argument which may be a scalar, or may be a reference to either a scalar, an array or a hash. Arrays and hashes may in turn contain values of these same types. Returns a byte string.

The mapping from Perl to bifcode is as follows:

  • 'undef' maps directly to BIFCODE_UNDEF.
  • The global package variables $Bifcode::TRUE and $Bifcode::FALSE encode to BIFCODE_TRUE and BIFCODE_FALSE.
  • Plain scalars that look like canonically represented integers will be serialised as BIFCODE_INTEGER. Otherwise they are treated as BIFCODE_UTF8.
  • SCALAR references become BIFCODE_BYTES.
  • ARRAY references become BIFCODE_LIST.
  • HASH references become BIFCODE_DICT.

You can force scalars to be encoded a particular way by passing a reference to them blessed as Bifcode::BYTES, Bifcode::INTEGER or Bifcode::UTF8. The force_bifcode function below can help with creating such references.

This subroutine croaks on unhandled data types.
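
A usage sketch of the above mapping (assuming the functions are imported by name as listed in this document):

  use Bifcode qw(encode_bifcode force_bifcode);

  my $bytes = encode_bifcode( {
      maybe   => undef,                         # BIFCODE_UNDEF
      flag    => $Bifcode::TRUE,                # BIFCODE_TRUE
      count   => 42,                            # canonical integer -> BIFCODE_INTEGER
      name    => 'spam',                        # BIFCODE_UTF8
      raw     => \"\x00\xff",                   # SCALAR ref -> BIFCODE_BYTES
      version => force_bifcode( '3', 'utf8' ),  # keep '3' as a string, not an integer
      items   => [ 'a', 'b' ],                  # ARRAY ref -> BIFCODE_LIST
  } );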

decode_bifcode( $string [, $max_depth ] )

Takes a byte string and returns the corresponding deserialised data structure.

If you pass an integer as the second argument, decoding will croak when attempting to parse dictionaries nested deeper than this level, to prevent DoS attacks using maliciously crafted input.

bifcode types are mapped back to Perl in the reverse way to the encode_bifcode function, with the exception that any scalars which were "forced" to a particular type (using blessed references) will decode as unblessed scalars.

Croaks on malformed data.
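
A decoding sketch, including a nesting limit and trapping the croak (the deeply nested input is contrived for the example):

  use Bifcode qw(decode_bifcode);

  my $data = decode_bifcode('{U3:cowU3:mooU4:spam[U1:aU1:b]}');
  # $data is { cow => 'moo', spam => [ 'a', 'b' ] }

  # Guard against maliciously deep input with $max_depth and trap the croak.
  my $deep = ( '[' x 1000 ) . ( ']' x 1000 );
  my $safe = eval { decode_bifcode( $deep, 10 ) };
  warn "decode failed: $@" unless defined $safe;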

force_bifcode( $scalar, $type )

Returns a reference to $scalar blessed as Bifcode::$TYPE. The value of $type is not checked, but the encode_bifcode function will only accept the resulting reference where $type is one of 'bytes', 'integer', or 'utf8'.
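
For example (a sketch; the expected output follows from the BIFCODE_BYTES rule above):

  use Bifcode qw(encode_bifcode force_bifcode);

  # Force a scalar that would otherwise look like an integer to encode as bytes.
  my $forced = force_bifcode( '42', 'bytes' );   # blessed scalar reference
  my $enc    = encode_bifcode( $forced );        # 'B2:42' rather than 'I42,'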

DIAGNOSTICS

  • trailing garbage at %s

    Your data does not end after the first encode_bifcode-serialised item.

    You may also get this error if a malformed item follows.

  • garbage at %s

    Your data is malformed.

  • unexpected end of data at %s

    Your data is truncated.

  • unexpected end of string data starting at %s

    Your data includes a string declared to be longer than the available data.

  • malformed string length at %s

    Your data contained a string with negative length or a length with leading zeroes.

  • malformed integer data at %s

    Your data contained something that was supposed to be an integer but didn't make sense.

  • dict key not in sort order at %s

    Your data violates the encode_bifcode format constraint that dict keys must appear in lexical sort order.

  • duplicate dict key at %s

    Your data violates the encode_bifcode format constraint that all dict keys must be unique.

  • dict key is not a string at %s

    Your data violates the encode_bifcode format constraint that all dict keys be strings.

  • dict key is missing value at %s

    Your data contains a dictionary with an odd number of elements.

  • nesting depth exceeded at %s

    Your data contains dicts or lists that are nested deeper than the $max_depth passed to decode_bifcode().

  • unhandled data type

    You are trying to serialise a data structure that consists of data types other than

    • scalars
    • references to arrays
    • references to hashes
    • references to scalars

    The format does not support this.

BUGS AND LIMITATIONS

Strings and numbers are practically indistinguishable in Perl, so encode_bifcode() has to resort to a heuristic to decide how to serialise a scalar. This cannot be fixed.

AUTHOR

Mark Lawrence, heavily based on Bencode by Aristotle Pagaltzis.

COPYRIGHT AND LICENSE

This software is copyright (c):

  • 2015 by Aristotle Pagaltzis
  • 2017 by Mark Lawrence.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.

mlawren@github 3 comments

WebService::Dataworld - PROPOSAL: A wrapper around the data.world APIs

data.world has JSON APIs for both query and maintenance of large public and non-public datasets. There are some fiddly bits to the API; the first thing I noticed is that querying and maintenance happen at different URLs. Additionally, some of the returns are not yet consistent: sometimes you'll get a message and other data, sometimes just a message, and there's no "truthy on success" code that would be useful in "try" blocks, etc. I propose a simple wrapper around these complexities: an API object that just takes care of those details for the user, validates inputs, and deals with returns in a way that makes them more consistent.
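
Something along these lines, purely as a hypothetical sketch of the proposed interface (none of these method names exist yet):

  my $dw = WebService::Dataworld->new( token => $ENV{DATAWORLD_TOKEN} );

  my $res = $dw->query(
      dataset => 'owner/dataset',
      sql     => 'SELECT * FROM sales LIMIT 10',
  );

  # Every call would return an object with a uniform success flag and message,
  # regardless of which underlying endpoint was hit.
  die $res->message unless $res->is_success;
  my $rows = $res->data;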

Thoughts?

GeekRuthie@twitter 5 comments

CrawlerCommons::RobotRulesParser - Perl implementation of Google crawler-commons RobotRulesParser

This module is a fairly close reproduction of the Crawler-Commons SimpleRobotRulesParser http://crawler-commons.github.io/crawlercommons/0.7/crawlercommons/robots/SimpleRobotRulesParser.html

From BaseRobotsParser javadoc:

Parse the robots.txt file in content, and return rules appropriate
for processing paths by userAgent. Note that multiple agent names
may be provided as comma-separated values; the order of these shouldn't
matter, as the file is parsed in order, and each agent name found in the
file will be compared to every agent name found in robotNames.
Also note that names are lower-cased before comparison, and that any
robot name you pass shouldn't contain commas or spaces; if the name has
spaces, it will be split into multiple names, each of which will be
compared against agent names in the robots.txt file. An agent name is
considered a match if it's a prefix match on the provided robot name. For
example, if you pass in "Mozilla Crawlerbot-super 1.0", this would match
"crawlerbot" as the agent name, because of splitting on spaces,
lower-casing, and the prefix match rule.

The method failedFetch is not implemented.
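
A hedged usage sketch; the constructor and method names below follow the Java original and may not match the final Perl API:

  use CrawlerCommons::RobotRulesParser;

  my $parser  = CrawlerCommons::RobotRulesParser->new;
  my $content = "User-agent: *\nDisallow: /private/\n";

  # Agent names are lower-cased, split and prefix-matched as described in
  # the javadoc above.
  my $rules = $parser->parse_content(
      'http://example.com/robots.txt',
      $content,
      'text/plain',
      'Mozilla Crawlerbot-super 1.0',
  );

  print "allowed\n" if $rules->is_allowed('http://example.com/index.html');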

akrobinson74@github 1 comment

Unoconv - Use LibreOffice to convert file formats

Note: using my name in the module name was a stop-gap while I find a better name. Suggestions???

Curley::Unoconv is a Perl extension that allows you to use LibreOffice (or OpenOffice.org) to convert from any spreadsheet format LibreOffice will accept to any format LibreOffice will write. You can then do further processing with, e.g., Text::CSV_XS. The function only does the conversion if necessary, which is determined by comparing the two files' mtimes.

It uses Dag Wieers' unoconv, available on most Linux distributions. http://dag.wieers.com/home-made/unoconv/

Note that the conversion can fail (and your program will die). A few notes:

  • The version number unoconv expects must agree with the version of LibreOffice installed. Install unoconv from the same repository you got LibreOffice from. E.g. if you install LibreOffice from debian backports, then install unoconv from debian backports.

  • Unoconv may not like it if the running instance of LibreOffice (if any) has a copy of the source file loaded. Save and close it.

  • If LibreOffice does not support the given format, the whole kazoo will die messily.

EXPORT

unoconv()
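
A call sketch (the source-then-target argument order is an assumption based on the description above):

  use Curley::Unoconv qw(unoconv);
  use Text::CSV_XS;

  # Convert only if report.csv is older than report.ods (mtime comparison),
  # then post-process the CSV with Text::CSV_XS.
  unoconv( 'report.ods', 'report.csv' );

  my $csv = Text::CSV_XS->new( { binary => 1 } );
  open my $fh, '<', 'report.csv' or die "report.csv: $!";
  while ( my $row = $csv->getline($fh) ) {
      print join( "\t", @$row ), "\n";
  }
  close $fh;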

SEE ALSO

http://dag.wieers.com/home-made/unoconv/

man 1 unoconv

DEFICIENCIES

  • There is not a lot of error checking. For example, we do not check to see if we can write to the target file.

  • There is no provision for passing other parameters to the unoconv program. Maybe make that an optional parameter?

  • How can we gracefully tell if LibreOffice supports the file format the caller wants?

  • We have no control over how well our back end converts. Other programs and Perl modules may do a better job for given pairs of source and destination formats.

  • OpenOffice::UNO might provide a more direct and flexible interface, and eliminate the need for python.

charlescurley@github 1 comment

Set::SegmentTree - Immutable segment trees in perl

wat? Segment Tree

A Segment tree is an immutable tree structure used to efficiently resolve a value to the set of segments which encompass it.

Why?

You have a large set of value intervals (like time segments!) and need to match them against a single value (like a time) efficiently.

This solution is suitable for problems where the set of intervals is known in advance of the queries, and the tree needs to be loaded and queried efficiently many orders of magnitude more often than the set of intervals is updated.

Data structure:

A segment is like this: [ Segment Label, Start Value, End Value ]

Start Value and End Value must be numeric.

Start Value must be less than End Value.

Segment Label must occur exactly once.

The speed of Set::SegmentTree comes from not carrying any additional segment-related data, so it is expected that one would use the label as an index into whatever persistent store retains further information about the segment (see the sketch after the walkthrough below).

Use walkthrough

my @segments = (['A',1,5],['B',2,3],['C',3,8],['D',10,15]);

This defines four intervals, some of which overlap and some of which do not:

  • A - 1 to 5
  • B - 2 to 3
  • C - 3 to 8
  • D - 10 to 15

Doing a find within the resulting tree.

my $tree = Set::SegmentTree::Builder->new(@segments)->build;

Would make these tests pass

is_deeply [$tree->find(0)], [];
is_deeply [$tree->find(1)], [qw/A/];
is_deeply [$tree->find(2)], [qw/A B/];
is_deeply [$tree->find(3)], [qw/A B C/];
is_deeply [$tree->find(4)], [qw/A C/];
is_deeply [$tree->find(6)], [qw/C/];
is_deeply [$tree->find(9)], [];
is_deeply [$tree->find(12)], [qw/D/];

And although this structure is relatively expensive to build, it can be saved efficiently,

my $builder = Set::SegmentTree::Builder->new(@segments);
$builder->to_file('filename');

and then loaded and queried extremely quickly, making this pass in only milliseconds.

my $tree = Set::SegmentTree->from_file('filename');
is_deeply [$tree->find(3)], [qw/A B C/];
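
Continuing the walkthrough, the label-as-index pattern mentioned earlier might look like this (the %meta contents are made up):

  # Keep per-segment metadata outside the tree, keyed by label.
  my %meta = (
      A => { owner => 'alice' },
      B => { owner => 'bob'   },
      C => { owner => 'carol' },
      D => { owner => 'dave'  },
  );

  my @records = map { $meta{$_} } $tree->find(3);   # metadata for A, B and C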

This structure is useful in the use case where...

1) value segment intersection is important
2) performance of loading and lookup is critical, but building is not

The Segment Tree data structure allows you to resolve any single value to the list of segments which encompass it in O(log(n)+nk).

DavidIAm@github 5 comments

Net::ZooIt - High level recipes for Apache ZooKeeper

DESCRIPTION

Net::ZooIt provides high level recipes for working with ZooKeeper in Perl, like locks, leader election or queues.

Net::ZooKeeper Handles

Net::ZooIt methods always take a Net::ZooKeeper handle object as a parameter and delegate their creation to the user. Rationale: enterprises often have customised ways to create those handles; Net::ZooIt aims to be instantly usable without such customisation.

Automatic Cleanup

Net::ZooIt constructors return a Net::ZooIt object, which automatically cleans up its znodes when it goes out of scope at the end of the enclosing block. If you want to clean up earlier, call

  $zooit_obj->DESTROY;

Implication: if you call Net::ZooIt constructors in void context, the created object goes out of scope immediately, and your znodes are deleted. Net::ZooIt logs a ZOOIT_ERR message in this case.
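
A sketch of that scoping behaviour (the lock constructor name and its arguments below are hypothetical; the block-scoped cleanup is the point):

  use Net::ZooKeeper;
  use Net::ZooIt;

  my $zk = Net::ZooKeeper->new('localhost:2181');

  {
      # Hypothetical constructor name, for illustration only.
      my $lock = Net::ZooIt->new_lock( zk => $zk, path => '/myapp/lock' )
          or die 'could not create lock';
      # ... critical section ...
  }   # $lock goes out of scope here; its znode is cleaned up automatically

  # Calling the same constructor in void context would destroy the object
  # immediately and log a ZOOIT_ERR message.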

Error Handling

Net::ZooIt constructors return nothing in case of errors during creation.

Once you hold a lock or other resource, you're not notified of connection loss errors. If you need to take special action, check your Net::ZooKeeper handle.

If you give up Net::ZooIt resources during connection loss, your znodes cannot be cleaned up immediately; they enter a garbage collection queue and Net::ZooIt cleans them up once the connection is resumed.

Logging

Net::ZooIt logs to STDERR. Log messages are prefixed with Zulu military time, PID and the level of the current message: ZOOIT_DIE ZOOIT_ERR ZOOIT_WARN ZOOIT_INFO ZOOIT_DEBUG.

If Net::ZooIt throws an exception, it prints a ZOOIT_DIE level message before dying. This allows seeing the original error message even if an eval {} block swallows it.

To capture Net::ZooIt log messages to a file instead of STDERR, redirect STDERR to a new file handle in the normal Perl manner:

open(OLDERR, '>&', fileno(STDERR)) or die("unable to dup STDERR: $!");
open(STDERR, '>', $log_file) or die("unable to redirect STDERR: $!");

subogero@github 3 comments

List::Flat - Another module to flatten an arrayref or list that may contain arrayrefs

I was unhappy with the several modules on CPAN I could find that do the relatively simple task of flattening a deep structure of array references into a single flat list, so I wrote another one.

Or rather, another two: one that handles circular references (flat) and one that doesn't (flatx). I suspect flatx is a terrible name but I couldn't think of one that wasn't something like "flat_unsafe" and that didn't encourage its use indiscriminately.
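
A usage sketch based on that description (the export interface is an assumption):

  use List::Flat qw(flat flatx);

  my @all  = flat( 1, [ 2, [ 3, [4] ] ], 5 );    # (1, 2, 3, 4, 5)

  # flatx skips the circular-reference bookkeeping, so it is only safe on
  # structures known not to contain cycles.
  my @fast = flatx( [ ['a'], 'b' ], 'c' );       # ('a', 'b', 'c')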

Thoughts welcome. Thanks.

aaronpriven@github 5 comments

English::Control - Like English.pm, but with ${^OFS} instead of $OFS

So I had this idea that English.pm would be better if, instead of storing its variables in each package (potentially clobbering other variables if somebody has forgotten that, for example, $LIST_SEPARATOR or $WARNING or $NR are special), it stored its variables as control-character variables, like ${^LIST_SEPARATOR} or ${^NR}. These are normally reserved to perl, and are forced to be in package "main", so they only need to be set up once (not imported for each module).

Anyway, so I wrote this module to make that happen. (I say "wrote", but mostly what I did, other than typing {^ and } a lot, was delete a bunch of stuff.)

Thoughts?

aaronpriven@github 4 comments

Carton::Include - A module to automatically include the nearest local/ dir

When I want to execute a script that uses a cpanfile & Carton, I have to type "carton exec ./script.pl" in order to execute it (in order to include the nearest local/ directory in @INC). By creating Carton::Include and using this module inside my scripts, I can have them automatically search up the tree for the nearest local/ directory and "use lib" it, so I can then execute my script by typing only ./script.pl.
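
Roughly, the lookup would do something like this under the hood (a sketch, not the module's actual code; local/lib/perl5 is where carton installs dependencies):

  use Cwd qw(abs_path);
  use File::Basename qw(dirname);
  use File::Spec;
  use lib ();

  # Walk up from the script's directory until a local/lib/perl5 directory is
  # found, then add it to @INC -- roughly what `carton exec` arranges via PERL5LIB.
  my $dir = dirname( abs_path($0) );
  while (1) {
      my $local = File::Spec->catdir( $dir, 'local', 'lib', 'perl5' );
      if ( -d $local ) {
          lib->import($local);
          last;
      }
      my $parent = dirname($dir);
      last if $parent eq $dir;    # reached the filesystem root
      $dir = $parent;
  }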

akarelas@github 1 comment

Catalyst::Authentication::Credential::JWT - authentication to a Catalyst app via JSON Web Token

This authentication credential checker tries to read a JSON Web Token (JWT) from the current request, verifies its signature and looks up the user in the configured authentication store.

It provides support for authentication/authorization via JWT to Catalyst.
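
A configuration sketch for a Catalyst application (the realm/store layout is standard Plugin::Authentication configuration; the credential-specific option names here are assumptions):

  # In MyApp.pm -- the key/algorithm option names are assumptions for illustration.
  __PACKAGE__->config(
      'Plugin::Authentication' => {
          default => {
              credential => {
                  class     => 'JWT',
                  key       => 'sekrit',    # shared secret used to verify signatures
                  algorithm => 'HS256',
              },
              store => {
                  class => 'Minimal',
                  users => {
                      alice => { roles => ['admin'] },
                  },
              },
          },
      },
  );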

gerhardj@github 0 comments