PrePAN

Requests for Reviews Feed

USB::TMC Perl interface to the USB Test & Measurement (USBTMC) backend

Based on USB::LibUSB.

Does not yet support the additional usb488_subclass.
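
One possible shape for the interface, given the USB::LibUSB base it is built on; the constructor arguments and method names here are assumptions, not settled API:

    use USB::TMC;

    # Find the instrument by vid/pid (serial only needed if ambiguous).
    my $usbtmc = USB::TMC->new(
        vid    => 0x0957,
        pid    => 0x0607,
        serial => 'MY47000419',
    );

    $usbtmc->write(data => "VOLT:NPLC 10\n");
    print $usbtmc->query(data => "*IDN?\n", length => 100);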

amba@github 0 comments

Parallel::Regex::PCRE Apply regexes to a buffer via a pthread pool

The proposed module would provide a convenient interface to a pool of pthreads for parallelized regular expression matching.

Using the class:

After initialization, every time the pool was given an input string via "match", each worker in the pool would apply its own disjoint subset of the regular expressions to the string. The caller would block until all of the regular expressions had been applied, and would receive the total match count as the return value.

Any matches could then be accessed through an iterator method "next_match", which would return a numeric regex id (corresponding to the position of the regex in "regex_list") and any text captured by it.
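
A hypothetical synopsis; the constructor arguments ("workers", "regex_list") and return conventions are guesses at the proposed interface, using only the method names described above:

    use Parallel::Regex::PCRE;

    my $pool = Parallel::Regex::PCRE->new(
        workers    => 4,
        regex_list => [ 'act now', 'winning ticket', 'cheap \s* meds' ],
    );

    my $email_text = "act now and claim your winning ticket";
    my $total      = $pool->match($email_text);   # blocks until all workers finish

    # next_match() returns an empty list when the matches are exhausted.
    while (my ($regex_id, @captured) = $pool->next_match) {
        print "regex $regex_id matched: @captured\n";
    }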

Use cases

The suitability of this module to a given application depends entirely on whether the overhead of setting up the pool and traversing matches exceeds the performance benefit of applying regular expressions to the input in parallel.

The first expected use-case for this module is an email spam filter, which might apply thousands or tens of thousands of regular expressions to each email document, few of which are expected to match at all.

This use-case, where the number of match() calls is huge, each input is large, the number of regular expressions is large, and the number of matches is small, is at the optimal end of the cost/benefit spectrum. Small numbers of regular expressions applied to small strings would be at the other end, and I expect those would be better served by using ordinary Perl regular expressions serially. I look forward to testing these expectations against real data.

Thoughts on implementation:

The guts of the module would exist mostly as XS. The initial implementation would use PCRE simply to keep C development time short. Other implementations with the same interface might be provided as Parallel::Regex::* if PCRE proves unsatisfactory.

new() would initialize:

  • A pool of pthread worker threads,

  • A set of compiled PCRE regular expressions,

  • A mapping of those workers to those compiled regexes,

  • A pipe,

  • Work mutexes on which worker threads are blocked/unblocked,

  • An input pointer,

  • An array of output pointers, one per worker,

  • An active worker counter,

  • A shared state mutex protecting the worker counter, total hit count, and output pointers.

The match method would simply:

  • Reset the iterator state,

  • Set the active worker counter to the pool size,

  • Update the input pointer to point at the buffer,

  • Unlock the worker mutexes,

  • Read a match count from the pipe (blocking until available),

  • Return the match count to the caller.

Each worker thread would remain blocked on its work mutex until woken by the match method, then:

  • Apply each of its regexes to the input buffer (accessed via the input pointer),

  • Write match data to its own local memory buffer (resizing the buffer and copying forward as needed) as matches are found,

  • When done with all regexes, acquire the shared state mutex,

  • Update its output pointer to point at its local memory buffer,

  • Update the total hit count,

  • Decrement the active worker counter,

  • Release the shared state mutex,

  • If the active worker counter is zero (it is the last thread to finish), write the total hit count to the pipe,

  • Lock its own worker mutex to put itself back to sleep until the next call to the match method.
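
For illustration only, here is that coordination pattern sketched in pure Perl, with threads::shared condition variables standing in for the work mutexes and pipe described above (the real guts would be XS with pthreads):

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my $input          :shared = '';
    my $generation     :shared = 0;   # stands in for the per-worker work mutexes
    my $active_workers :shared = 0;
    my $total_hits     :shared = 0;   # stands in for the pipe write/read

    # Disjoint regex subsets, one per worker (example patterns assumed).
    my @subsets = ( [qw(foo ba[rz])], [qw(qux+ corge)] );

    my @workers = map {
        my $set = $_;
        threads->create(sub {
            my $seen = 0;
            while (1) {
                {   # Sleep until match() bumps the generation counter.
                    lock($generation);
                    cond_wait($generation) until $generation != $seen;
                    $seen = $generation;
                }
                my $hits = 0;
                $hits += () = $input =~ /$_/g for @$set;
                lock($active_workers);
                $total_hits += $hits;
                # Last worker to finish signals the blocked caller.
                cond_signal($active_workers) if --$active_workers == 0;
            }
        });
    } @subsets;

    sub match {
        my ($buffer) = @_;
        {   lock($active_workers);
            $total_hits     = 0;
            $active_workers = scalar @workers;
        }
        {   lock($generation);
            $input = $buffer;
            $generation++;
            cond_broadcast($generation);   # wake all workers
        }
        lock($active_workers);
        cond_wait($active_workers) while $active_workers > 0;
        return $total_hits;
    }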

The iterator is straightforward; it needs to maintain only two integers of state:

  • An index into the output pointer array,

  • An index to the next record in the corresponding worker's local output buffer.

Each worker's local output buffer would contain a series of variable-length records. A record starts with the regex id (or 0xFFFF for the sentry record) and a count of matches; for each match there follows a count of groups (which can be zero) and then the length-prefixed group text strings.
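
As a concrete (hypothetical) encoding of that record layout; the 16-bit little-endian fields ('v') for ids, counts, and length prefixes are an open design choice:

    use strict;
    use warnings;

    sub pack_record {
        my ($regex_id, @matches) = @_;   # each match: arrayref of captured texts
        my $rec = pack 'v v', $regex_id, scalar @matches;
        for my $groups (@matches) {
            $rec .= pack 'v', scalar @$groups;       # group count may be zero
            $rec .= pack 'v/a*', $_ for @$groups;    # length-prefixed group text
        }
        return $rec;
    }

    # Two matches for regex id 3 (one with a capture, one without),
    # followed by the 0xFFFF sentry record.
    my $buffer = pack_record(3, ['user@example.com'], [])
               . pack_record(0xFFFF);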

Anticipated follow-up work:

I'm not going to spend any time on premature optimization until I must, but I expect some regexes to take more time to run than others, and unless this is reflected in the distribution of regexes to worker threads, there will be one worker thread (the slowest) limiting the performance of the entire system. Since the caller does not unblock until after all of the threads are done, the slowest thread determines time blocked.

A couple of solutions occur to me:

  • I could push the burden of figuring it out onto the user, and have them provide weights with their regular expressions. That would be simplest for me, but more complex for the user.

  • Alternatively I could provide a way to switch the pool between "training" and "working" modes, having them measure the time each regex takes to complete during "training" mode and rewriting the regex -> worker mapping appropriately on the transition to "working" mode. Thus the user could train the pool with a few rounds of input data (at the cost of some performance), let it reshuffle the mapping, and then process the rest of the data at full speed. That would be simplest for the user, but more complex for me.

The distinction might not matter that much, as I expect to be the only user, at least for a while.


Also, I expect to write a Parallel::Regex::Reference module which is actually a serial implementation written in pure Perl. Its purpose would be twofold:

  • To compare against the performance of Parallel::Regex::PCRE, so that use of the module can be justified (or not!),

  • To provide a fallback solution, so that if anyone encounters problems with Parallel::Regex::PCRE they can simply switch to the other module without otherwise changing their code.

ttkciar@github 1 comment

Web::HackerNews Scrape the HTML of Hacker News

Given a Hacker News page, scrape the HTML to extract the contents. For example, get the title and the "hide" URL, etc., so that one can automatically match the titles against a regular expression then "hide" stories about Elon Musk, James Damore, react.js, Google memos, or other tedious things and people.
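
A purely hypothetical usage sketch; the scraper does not exist yet, and front_page(), title(), and hide_url() are invented names for illustration:

    use Web::HackerNews;

    my @stories = Web::HackerNews->new->front_page;
    for my $story (@stories) {
        next unless $story->title =~ /Elon Musk|James Damore|react\.js/i;
        print "would hide via ", $story->hide_url, "\n";
    }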

This is an HTML scraper and is not related to WebService::HackerNews by Neil Bowers. Note that Hacker News uses tables and "center" tags for layout, with no particular logical subdivision.

benkasminbullock@github 2 comments

XS::Check Check XS for errors, something like Perl::Critic for XS

Something like Perl::Critic, but for XS.

benkasminbullock@github 3 comments

USB::LibUSB Perl interface to the libusb-1.0 API

This module provides a Perl interface to the libusb-1.0 API. It provides access to most of the basic libusb functionality, including reading device descriptors and synchronous device I/O. The objective is to provide the full portability of libusb-1.0 (Linux, Windows, BSDs, OSX, ...).

The module has a two-tier design:

  • USB::LibUSB::XS

A raw XS interface that stays as close as possible to the libusb API. Not intended to be used directly.

  • USB::LibUSB

Based on USB::LibUSB::XS; adds convenient error handling and additional high-level functionality (e.g. device discovery by vid, pid, and serial number). This makes it easy to build more functionality without knowing about XS.
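
A brief usage sketch of the high-level tier; the method names follow the libusb-1.0 functions they wrap, but the exact signatures shown here are assumptions:

    use USB::LibUSB;

    my $ctx    = USB::LibUSB->init();
    my $handle = $ctx->open_device_with_vid_pid(0x0957, 0x0607);

    $handle->claim_interface(0);
    $handle->bulk_transfer_write(0x02, "*IDN?\n", 5000);      # endpoint, data, timeout
    my $id = $handle->bulk_transfer_read(0x86, 256, 5000);    # endpoint, length, timeout
    $handle->release_interface(0);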

Update: Now on CPAN

amba@github 4 comments

Crypt::File::Valet Convenient encrypted I/O

I am the author of File::Valet (https://metacpan.org/pod/File::Valet) and have need for a module with a similarly convenient way to perform I/O on encrypted files, so I'm writing one. The synopsis shows what I'd like it to look like, but it's not set in stone. The method and function names are chosen to be similar to those of File::Valet, but with "x" used instead of "f" to denote "encrypted" vs "file".

The module would use a caller-provided digest instance, hashed password string, and salt string to encrypt/decrypt the contents of files via a CTR cipher, with random padding before and after the file content, and a convenient way to harden predictable plaintext (via mix method).
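
A hypothetical sketch of what that could look like, inferring rd_x/wr_x from File::Valet's rd_f/wr_f naming; the constructor parameters are assumptions:

    use Crypt::File::Valet;
    use Digest::SHA qw(sha256);

    my $valet = Crypt::File::Valet->new(
        digest   => Digest::SHA->new(256),   # caller-provided digest instance
        password => sha256('s3kr1t'),        # hashed password string
        salt     => 'NaCl',                  # salt string
    );

    my $plaintext = $valet->rd_x('notes.enc');      # decrypt and read
    $valet->wr_x('notes.enc', $updated_plaintext);  # encrypt and write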

When I came to PrePAN, the question I had in mind was "should this be named File::Crypt::Valet or Crypt::File::Valet?" but general comments and suggestions about the proposed module would be welcome as well.

ttkciar@github 2 comments

Bifcode Bifcode serialization format

STATUS

This module and related encoding format are still under development. Do not use it anywhere near production. Input is welcome.

DESCRIPTION

Bifcode implements the bifcode serialisation format, a mixed binary/text encoding with support for the following data types:

  • Primitive:
    • Undefined (null)
    • Booleans (true/false)
    • Integer numbers
    • Floating point numbers
    • UTF8 strings
    • Binary strings
  • Structured:
    • Arrays (lists)
    • Hashes (dictionaries)

The encoding is simple to construct and relatively easy to parse. There is no need to escape special characters in strings. It is not considered human readable, but as it is mostly text it can usually be visually debugged.

Bifcode can only be constructed canonically; i.e. there is only one possible encoding per data structure. This property makes it suitable for comparing structures (using cryptographic hashes) across networks.

In terms of size the encoding is similar to minified JSON. In terms of speed this module compares well with other pure Perl encoding modules with the same features.

MOTIVATION & GOALS

Bifcode was created for a project because none of the currently available serialization formats (Bencode, JSON, MsgPack, Sereal, YAML, etc.) met the requirements of:

  • Support for undef
  • Support for UTF8 strings
  • Support for binary data
  • Trivial to construct on the fly from within SQLite triggers
  • Universally-recognized canonical form for hashing

There are no lofty goals or intentions to promote this outside of my specific case. Use it or not, as you please, based on your own requirements. Constructive discussion is welcome.

SPECIFICATION

The encoding is defined as follows:

BIFCODE_UNDEF

A null or undefined value corresponds to '~'.

BIFCODE_TRUE and BIFCODE_FALSE

Boolean values are represented by '1' and '0'.

BIFCODE_UTF8

A UTF8 string is 'U' followed by the octet length of the encoded string as a base ten number, followed by a colon and the encoded string. For example "\x{df}" corresponds to "U2:\x{c3}\x{9f}".

BIFCODE_BYTES

Opaque data is 'B' followed by the octet length of the data as a base ten number followed by a colon and then the data itself. For example a three-byte blob 'xyz' corresponds to 'B3:xyz'.

BIFCODE_INTEGER

Integers are represented by an 'I' followed by the number in base 10 followed by a ','. For example 'I3,' corresponds to 3 and 'I-3,' corresponds to -3. Integers have no size limitation. 'I-0,' is invalid. All encodings with a leading zero, such as 'I03,', are invalid, other than 'I0,', which of course corresponds to 0.

BIFCODE_FLOAT

Floats are represented by an 'F' followed by a decimal number in base 10, followed by an 'e', followed by an exponent, followed by a ','. For example 'F3.0e-1,' corresponds to 0.3 and 'F-0.1e0,' corresponds to -0.1. Floats have no size limitation. 'F-0.0,' is invalid. All encodings with an extraneous leading zero, such as 'F03.0e0,', are invalid.

BIFCODE_LIST

Lists are encoded as a '[' followed by their elements (also bifcode encoded) followed by a ']'. For example '[U4:spamU4:eggs]' corresponds to ['spam', 'eggs'].

BIFCODE_DICT

Dictionaries are encoded as a '{' followed by a list of alternating keys and their corresponding values followed by a '}'. For example, '{U3:cowU3:mooU4:spamU4:eggs}' corresponds to {'cow': 'moo', 'spam': 'eggs'} and '{U4:spam[U1:aU1:b]}' corresponds to {'spam': ['a', 'b']}. Keys must be BIFCODE_UTF8 or BIFCODE_BYTES and appear in sorted order (sorted as raw strings, not alphanumerics).

INTERFACE

encode_bifcode( $datastructure )

Takes a single argument which may be a scalar, or may be a reference to either a scalar, an array or a hash. Arrays and hashes may in turn contain values of these same types. Returns a byte string.

The mapping from Perl to bifcode is as follows:

  • 'undef' maps directly to BIFCODE_UNDEF.
  • The global package variables $Bifcode::TRUE and $Bifcode::FALSE encode to BIFCODE_TRUE and BIFCODE_FALSE.
  • Plain scalars that look like canonically represented integers will be serialised as BIFCODE_INTEGER. Otherwise they are treated as BIFCODE_UTF8.
  • SCALAR references become BIFCODE_BYTES.
  • ARRAY references become BIFCODE_LIST.
  • HASH references become BIFCODE_DICT.

You can force scalars to be encoded a particular way by passing a reference to them blessed as Bifcode::BYTES, Bifcode::INTEGER or Bifcode::UTF8. The force_bifcode function below can help with creating such references.

This subroutine croaks on unhandled data types.

decode_bifcode( $string [, $max_depth ] )

Takes a byte string and returns the corresponding deserialised data structure.

If you pass an integer for the second, optional argument, the function will croak when attempting to parse dictionaries nested deeper than that level, to prevent DoS attacks via maliciously crafted input.

Bifcode types are mapped back to Perl in the reverse of the encode_bifcode mapping, with the exception that any scalars which were "forced" to a particular type (using blessed references) will decode as unblessed scalars.

Croaks on malformed data.

force_bifcode( $scalar, $type )

Returns a reference to $scalar blessed as Bifcode::$TYPE. The value of $type is not checked, but the encode_bifcode function will only accept the resulting reference where $type is one of 'bytes', 'integer', or 'utf8'.
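
Tying the three functions together (assuming they are importable as shown; per the STATUS note, details may still change), with encodings following the specification above:

    use Bifcode qw(encode_bifcode decode_bifcode force_bifcode);

    my $bc = encode_bifcode( { cow => 'moo', n => 3 } );
    # '{U3:cowU3:mooU1:nI3,}' -- keys in sorted order, 3 as BIFCODE_INTEGER

    my $data = decode_bifcode($bc);    # { cow => 'moo', n => 3 }

    # Force a scalar that would otherwise look like an integer to be bytes:
    my $bytes = encode_bifcode( force_bifcode(3, 'bytes') );    # 'B1:3'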

DIAGNOSTICS

  • trailing garbage at %s

    Your data does not end after the first encode_bifcode-serialised item.

    You may also get this error if a malformed item follows.

  • garbage at %s

    Your data is malformed.

  • unexpected end of data at %s

    Your data is truncated.

  • unexpected end of string data starting at %s

    Your data includes a string declared to be longer than the available data.

  • malformed string length at %s

    Your data contained a string with negative length or a length with leading zeroes.

  • malformed integer data at %s

    Your data contained something that was supposed to be an integer but didn't make sense.

  • dict key not in sort order at %s

    Your data violates the encode_bifcode format constraint that dict keys must appear in lexical sort order.

  • duplicate dict key at %s

    Your data violates the encode_bifcode format constraint that all dict keys must be unique.

  • dict key is not a string at %s

    Your data violates the encode_bifcode format constraint that all dict keys be strings.

  • dict key is missing value at %s

    Your data contains a dictionary with an odd number of elements.

  • nesting depth exceeded at %s

    Your data contains dicts or lists that are nested deeper than the $max_depth passed to decode_bifcode().

  • unhandled data type

    You are trying to serialise a data structure that consists of data types other than

    • scalars
    • references to arrays
    • references to hashes
    • references to scalars

    The format does not support this.

BUGS AND LIMITATIONS

Strings and numbers are practically indistinguishable in Perl, so encode_bifcode() has to resort to a heuristic to decide how to serialise a scalar. This cannot be fixed.

AUTHOR

Mark Lawrence, heavily based on Bencode by Aristotle Pagaltzis

COPYRIGHT AND LICENSE

This software is copyright (c):

  • 2015 by Aristotle Pagaltzis
  • 2017 by Mark Lawrence.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.

mlawren@github 3 comments

WebService::Dataworld PROPOSAL: A wrapper around the data.world APIs

data.world has JSON APIs for both query and maintenance of large public and non-public datasets. There are some fiddly bits to the API; the first thing I noticed is that querying and maintenance happen at different URLs. Additionally, some of the returns are not yet consistent: sometimes you'll get a message and other data, sometimes just a message; there's no "truthy on success" code that would be useful in "try" blocks, etc. I propose a simple wrapper around these complexities: an API object that takes care of those details for the user, validates inputs, and handles returns in a way that makes them more consistent.
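
A hypothetical interface sketch; nothing here is implemented and every name below is provisional:

    use WebService::Dataworld;

    # One object hides the separate query/maintenance endpoints.
    my $dw = WebService::Dataworld->new(token => $auth_token);

    my $rows = $dw->query(
        dataset => 'owner/dataset-name',
        sql     => 'SELECT * FROM some_table',
    );

    # Maintenance calls return a consistent result object, so success can
    # be tested uniformly instead of inspecting ad-hoc messages.
    my $res = $dw->update_dataset(
        dataset     => 'owner/dataset-name',
        description => 'nightly refresh',
    );
    die $res->message unless $res->success;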

Thoughts?

GeekRuthie@twitter 5 comments

CrawlerCommons::RobotRulesParser Perl implementation of Google crawler-commons RobotRulesParser

This module is a fairly close reproduction of the Crawler-Commons SimpleRobotRulesParser http://crawler-commons.github.io/crawlercommons/0.7/crawlercommons/robots/SimpleRobotRulesParser.html

From BaseRobotsParser javadoc:

Parse the robots.txt file in content, and return rules appropriate
for processing paths by userAgent. Note that multiple agent names
may be provided as comma-separated values; the order of these shouldn't
matter, as the file is parsed in order, and each agent name found in the
file will be compared to every agent name found in robotNames.
Also note that names are lower-cased before comparison, and that any
robot name you pass shouldn't contain commas or spaces; if the name has
spaces, it will be split into multiple names, each of which will be
compared against agent names in the robots.txt file. An agent name is
considered a match if it's a prefix match on the provided robot name. For
example, if you pass in "Mozilla Crawlerbot-super 1.0", this would match
"crawlerbot" as the agent name, because of splitting on spaces,
lower-casing, and the prefix match rule.

The method failedFetch is not implemented.
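
A usage sketch mirroring the Java API it reproduces; the Perl method names below (parse_content, is_allowed) are assumptions:

    use CrawlerCommons::RobotRulesParser;

    my $parser = CrawlerCommons::RobotRulesParser->new;

    # Mirrors SimpleRobotRulesParser.parseContent(url, content, contentType, robotNames).
    my $rules = $parser->parse_content(
        'http://www.example.com/robots.txt',
        $robots_txt_bytes,
        'text/plain',
        'Mozilla Crawlerbot-super 1.0',   # split, lower-cased, prefix-matched as above
    );

    print $rules->is_allowed('http://www.example.com/private/1.html')
        ? "fetch it\n" : "skip it\n";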

akrobinson74@github 1 comment

Unoconv Use LibreOffice to convert file formats

Note: using my name in the module name is a stop-gap until I find a better name. Suggestions???

Curley::Unoconv is a Perl extension that allows you to use LibreOffice (or OpenOffice.org) to convert from any spreadsheet format LibreOffice will accept to any format LibreOffice can write. You can then do further processing with, e.g., Text::CSV_XS. The function only does the conversion if necessary, as determined by comparing the two files' mtimes.

It uses Dag Wieers' unoconv, available on most Linux distributions. http://dag.wieers.com/home-made/unoconv/

Note that the conversion can fail (and your program will die). A few notes:

  • The version number unoconv expects must agree with the version of LibreOffice installed. Install unoconv from the same repository you got LibreOffice from, e.g. if you installed LibreOffice from Debian backports, then install unoconv from Debian backports.

  • Unoconv may not like it if the running instance of LibreOffice (if any) has a copy of the source file loaded. Save and close it.

  • If LibreOffice does not support the given format, the whole kazoo will die messily.

EXPORT

unoconv()
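
A sketch of the intended call; the exact argument list (source file, target format) and return value are assumptions:

    use Curley::Unoconv;

    # Hypothetical signature: convert the source to the given format and
    # return the output file name; the conversion is skipped when the
    # target's mtime is newer than the source's. Dies on failure.
    my $csv_file = unoconv('budget.ods', 'csv');

    # Further processing, e.g. with Text::CSV_XS, can then read $csv_file.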

SEE ALSO

http://dag.wieers.com/home-made/unoconv/

man 1 unoconv

DEFICIENCIES

  • There is not a lot of error checking. For example, we do not check to see if we can write to the target file.

  • There is no provision for passing other parameters to the unoconv program. Maybe make that an optional parameter?

  • How can we gracefully tell if LibreOffice supports the file format the caller wants?

  • We have no control over how well our back end converts. Other programs and Perl modules may do a better job for given pairs of source and destination formats.

  • OpenOffice::UNO might provide a more direct and flexible interface, and eliminate the need for Python.

charlescurley@github 1 comment