PrePAN

Sign in to PrePAN

Bifcode Bifcode serialization format

Good

Synopsis

use Bifcode qw( encode_bifcode decode_bifcode );

my $bifcode = encode_bifcode {
    bools   => [ $Bifcode::FALSE, $Bifcode::TRUE, ],
    bytes   => \pack( 's<',       255 ),
    integer => 25,
    float   => 1.0 / 300000000.0,
    undef   => undef,
    utf8    => "\x{df}",
};

my $decoded = decode_bifcode $bifcode;

Description

STATUS

This module and related encoding format are still under development. Do not use it anywhere near production. Input is welcome.

DESCRIPTION

Bifcode implements the bifcode serialisation format, a mixed binary/text encoding with support for the following data types:

  • Primitive:
    • Undefined(null)
    • Booleans(true/false)
    • Integer numbers
    • Floating point numbers
    • UTF8 strings
    • Binary strings
  • Structured:
    • Arrays(lists)
    • Hashes(dictionaries)

The encoding is simple to construct and relatively easy to parse. There is no need to escape special characters in strings. It is not considered human readable, but as it is mostly text it can usually be visually debugged.

Bifcode can only be constructed canonically; i.e. there is only one possible encoding per data structure. This property makes it suitable for comparing structures (using cryptographic hashes) across networks.

In terms of size the encoding is similar to minified JSON. In terms of speed this module compares well with other pure Perl encoding modules with the same features.

MOTIVATION & GOALS

Bifcode was created for a project because none of currently available serialization formats (Bencode, JSON, MsgPack, Sereal, YAML, etc) met the requirements of:

  • Support for undef
  • Support for UTF8 strings
  • Support for binary data
  • Trivial to construct on the fly from within SQLite triggers
  • Universally-recognized canonical form for hashing

There no lofty goals or intentions to promote this outside of my specific case. Use it or not, as you please, based on your own requirements. Constructive discussion is welcome.

SPECIFICATION

The encoding is defined as follows:

BIFCODE_UNDEF

A null or undefined value correspond to '~'.

BIFCODE_TRUE and BIFCODE_FALSE

Boolean values are represented by '1' and '0'.

BIFCODE_UTF8

A UTF8 string is 'U' followed by the octet length of the decoded string as a base ten number followed by a colon and the decoded string. For example "\x{df}" corresponds to "U2:\x{c3}\x{9f}".

BIFCODE_BYTES

Opaque data is 'B' followed by the octet length of the data as a base ten number followed by a colon and then the data itself. For example a three-byte blob 'xyz' corresponds to 'B3:xyz'.

BIFCODE_INTEGER

Integers are represented by an 'I' followed by the number in base 10 followed by a ','. For example 'I3,' corresponds to 3 and 'I-3,' corresponds to -3. Integers have no size limitation. 'I-0,' is invalid. All encodings with a leading zero, such as 'I03,', are invalid, other than 'I0,', which of course corresponds to 0.

BIFCODE_FLOAT

Floats are represented by an 'F' followed by a decimal number in base 10 followed by a 'e' followed by an exponent followed by a ','. For example 'F3.0e-1,' corresponds to 0.3 and 'F-0.1e0,' corresponds to -0.1. Floats have no size limitation. 'F-0.0,' is invalid. All encodings with an extraneous leading zero, such as 'F03.0e0,', are invalid.

BIFCODE_LIST

Lists are encoded as a '[' followed by their elements (also bifcode encoded) followed by a ']'. For example '[U4:spamU4:eggs]' corresponds to ['spam', 'eggs'].

BIFCODE_DICT

Dictionaries are encoded as a '{' followed by a list of alternating keys and their corresponding values followed by a '}'. For example, '{U3:cowU3:mooU4:spamU4:eggs}' corresponds to {'cow': 'moo', 'spam': 'eggs'} and '{U4:spam[U1:aU1:b]}' corresponds to {'spam': ['a', 'b']}. Keys must be BIFCODE_UTF8 or BIFCODE_BYTES and appear in sorted order (sorted as raw strings, not alphanumerics).

INTERFACE

encode_bifcode( $datastructure )

Takes a single argument which may be a scalar, or may be a reference to either a scalar, an array or a hash. Arrays and hashes may in turn contain values of these same types. Returns a byte string.

The mapping from Perl to bifcode is as follows:

  • 'undef' maps directly to BIFCODE_UNDEF.
  • The global package variables $Bifcode::TRUE and $Bifcode::FALSE encode to BIFCODE_TRUE and BIFCODE_FALSE.
  • Plain scalars that look like canonically represented integers will be serialised as BIFCODE_INTEGER. Otherwise they are treated as BIFCODE_UTF8.
  • SCALAR references become BIFCODE_BYTES.
  • ARRAY references become BIFCODE_LIST.
  • HASH references become BIFCODE_DICT.

You can force scalars to be encoded a particular way by passing a reference to them blessed as Bifcode::BYTES, Bifcode::INTEGER or Bifcode::UTF8. The force_bifcode function below can help with creating such references.

This subroutine croaks on unhandled data types.

decode_bifcode( $string [, $max_depth ] )

Takes a byte string and returns the corresponding deserialised data structure.

If you pass an integer for the second option, it will croak when attempting to parse dictionaries nested deeper than this level, to prevent DoS attacks using maliciously crafted input.

bifcode types are mapped back to Perl in the reverse way to the encode_bifcode function, with the exception that any scalars which were "forced" to a particular type (using blessed references) will decode as unblessed scalars.

Croaks on malformed data.

force_bifcode( $scalar, $type )

Returns a reference to $scalar blessed as Bifcode::$TYPE. The value of $type is not checked, but the encode_bifcode function will only accept the resulting reference where $type is one of 'bytes', 'integer', or 'utf8'.

DIAGNOSTICS

  • trailing garbage at %s

    Your data does not end after the first encode_bifcode-serialised item.

    You may also get this error if a malformed item follows.

  • garbage at %s

    Your data is malformed.

  • unexpected end of data at %s

    Your data is truncated.

  • unexpected end of string data starting at %s

    Your data includes a string declared to be longer than the available data.

  • malformed string length at %s

    Your data contained a string with negative length or a length with leading zeroes.

  • malformed integer data at %s

    Your data contained something that was supposed to be an integer but didn't make sense.

  • dict key not in sort order at %s

    Your data violates the encode_bifcode format constaint that dict keys must appear in lexical sort order.

  • duplicate dict key at %s

    Your data violates the encode_bifcode format constaint that all dict keys must be unique.

  • dict key is not a string at %s

    Your data violates the encode_bifcode format constaint that all dict keys be strings.

  • dict key is missing value at %s

    Your data contains a dictionary with an odd number of elements.

  • nesting depth exceeded at %s

    Your data contains dicts or lists that are nested deeper than the $max_depth passed to decode_bifcode().

  • unhandled data type

    You are trying to serialise a data structure that consists of data types other than

    • scalars
    • references to arrays
    • references to hashes
    • references to scalars

    The format does not support this.

BUGS AND LIMITATIONS

Strings and numbers are practically indistinguishable in Perl, so encode_bifcode() has to resort to a heuristic to decide how to serialise a scalar. This cannot be fixed.

AUTHOR

Mark Lawrence , heavily based on Bencode by Aristotle Pagaltzis

COPYRIGHT AND LICENSE

This software is copyright (c):

  • 2015 by Aristotle Pagaltzis
  • 2017 by Mark Lawrence.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.

Comments

I like it!
I don’t want to keep you you from uploading it if that’s what you want to do… but… is there a point? The reason I wrote the Bencode module was that I had files in the Bencode format, created by other software, and I wanted to read and create them myself. This new format might be better than Bencode, but it doesn’t support anything that JSON doesn’t, and no other software in the world will read or write it. What’s the use?
I have a specific project, with a set of requirements that cannot be satisfied by all of the encoding formats I've looked at, JSON included. I have updated the prepan entry with newer documentation (and a hopefully better name) to explain that.

Please sign up to post a review.