Apache Arrow (https://en.wikipedia.org/wiki/Draft:Apache_Arrow) already has C bindings via GLib (https://github.com/apache/arrow/tree/master/c_glib), and Perl already has GLib bindings via the Glib module (https://metacpan.org/pod/Glib).
Given this, does it make sense to implement Arrow bindings for Perl as a module in its own right? Or are the existing GLib bindings sufficient for any practical task?
If I were to implement Apache::Arrow[::*] it would likely be little more than a convenience wrapper around the GLib interface.
The proposed module would provide a convenient interface to a pool of pthreads for parallelized regular expression matching.
After initialization, each time the pool was given an input string via the "match" method, each worker in the pool would apply its disjoint subset of the regular expressions to the string. The caller would block until all of the regular expressions had been applied, and would receive a count of total matches as the return value.
Any matches could then be accessed through an iterator method, "next_match", which would return a numeric regex id (corresponding to the regex's position in "regex_list") and any text it captured.
The suitability of this module to a given application depends entirely on whether the overhead of setting up the pool and traversing matches exceeds the performance benefit of applying regular expressions to the input in parallel.
The first expected use-case for this module is an email spam filter, which might apply thousands or tens of thousands of regular expressions to each email document, few of which are expected to match at all.
This use-case, where the number of match() calls is huge, each input is large, the number of regular expressions is large, and the number of matches is small, is on the optimal end of the cost/benefit spectrum. Small numbers of regular expressions applied to small strings would be on the other end, and I expect those would be better served by ordinary Perl regular expressions applied serially. I look forward to testing these expectations against real data.
The guts of the module would exist mostly as XS. The initial implementation would use PCRE simply to keep C development time short. Other implementations with the same interface might be provided as Parallel::Regex::* if PCRE proves unsatisfactory.
new() would initialize:
A pool of pthread worker threads,
A set of compiled PCRE regular expressions,
A mapping of those workers to those compiled regexes,
Work mutexes on which worker threads are blocked/unblocked,
An input pointer,
An array of output pointers, one per worker,
An active worker counter,
A shared state mutex for protecting the worker counter and output counters,
A pipe over which the match count is returned to the blocked caller.
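For concreteness, the state set up by new() might look something like this in C. This is only a sketch under my current assumptions: all names are hypothetical, POSIX regex_t stands in for compiled PCRE patterns, and a semaphore per worker stands in for the work mutexes (POSIX does not permit unlocking a mutex from a thread other than the one that locked it, so a semaphore is the safer primitive for this wake-up pattern):

```c
#include <pthread.h>
#include <semaphore.h>
#include <regex.h>
#include <stddef.h>
#include <unistd.h>

/* One record per worker: its thread, wake semaphore, regex subset,
 * and its local output buffer. */
typedef struct {
    pthread_t thread;
    sem_t     wake;        /* "work mutex": blocks the worker between match() calls */
    size_t   *regex_ids;   /* indices into the shared regex array */
    size_t    n_regexes;
    char     *output;      /* local output buffer, published via pool->outputs */
} worker_t;

typedef struct {
    worker_t       *workers;
    size_t          n_workers;
    regex_t        *regexes;        /* compiled patterns (PCRE in the real module) */
    size_t          n_regexes;
    const char     *input;          /* input pointer, updated by match() */
    char          **outputs;        /* one output pointer per worker */
    size_t          active_workers; /* counts down as workers finish */
    size_t          total_hits;
    pthread_mutex_t state_lock;     /* protects the two counters above */
    int             done_pipe[2];   /* match() blocks reading the total from here */
} pool_t;

/* Initialize the bookkeeping fields; thread and regex setup is omitted here. */
int pool_init(pool_t *p, size_t n_workers) {
    p->n_workers      = n_workers;
    p->active_workers = 0;
    p->total_hits     = 0;
    if (pthread_mutex_init(&p->state_lock, NULL) != 0) return -1;
    if (pipe(p->done_pipe) != 0) return -1;
    return 0;
}
```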
The match method would simply:
Reset the iterator state,
Set the active worker counter to the pool size,
Update the input pointer to point at the buffer,
Unlock the worker mutexes,
Read a match count from the pipe (blocking until available),
Return the match count to the caller.
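The last two steps, blocking on a pipe read until the final worker reports, can be sketched in isolation. This is a minimal demonstration with hypothetical names and a single stand-in "worker" thread instead of a pool; the point is that read() on the pipe gives the caller its blocking rendezvous for free:

```c
#include <pthread.h>
#include <unistd.h>

/* Stand-in worker: pretends to finish matching and reports a hit count. */
static void *report_hits(void *arg) {
    int *fds = arg;
    int total_hits = 7;                             /* pretend result */
    ssize_t n = write(fds[1], &total_hits, sizeof total_hits);
    (void)n;
    return NULL;
}

/* The tail of match(): block on the pipe until the count arrives. */
int read_match_count(void) {
    int fds[2];
    pthread_t t;
    int count = -1;
    if (pipe(fds) != 0) return -1;
    pthread_create(&t, NULL, report_hits, fds);
    ssize_t n = read(fds[0], &count, sizeof count); /* blocks until written */
    (void)n;
    pthread_join(t, NULL);
    close(fds[0]);
    close(fds[1]);
    return count;
}
```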
The worker threads would remain blocked on their work mutex until woken by the match method, then:
Apply each of their regexes to the input buffer (accessed via input pointer),
Write match data to their own local memory buffer (resizing the buffer and copying forward as needed) as matches are found,
When done with all regexes, acquire the shared state mutex,
Update its output pointer to point at its local memory buffer,
Update the total hit count,
Decrement the active worker counter,
Release the shared state mutex,
If the active worker counter is zero (i.e., it is the last thread to finish), write the total hit count to the pipe,
Lock its own worker mutex to put itself to sleep until the next time match method is called.
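The whole wake/match/report cycle above can be sketched end-to-end. Again this is a hedged sketch, not the real module: POSIX regex stands in for PCRE, a semaphore per worker stands in for the work mutexes (POSIX forbids unlocking a mutex from another thread, so the semaphore expresses the same handoff legally), match data is reduced to a hit count, and all names are mine:

```c
#include <pthread.h>
#include <semaphore.h>
#include <regex.h>
#include <unistd.h>

#define N_WORKERS 2
#define N_REGEXES 4

static struct {
    regex_t         regexes[N_REGEXES];
    const char     *input;              /* input pointer, set by match */
    sem_t           wake[N_WORKERS];    /* stand-ins for the work mutexes */
    int             active;             /* active worker counter */
    int             total_hits;
    pthread_mutex_t state_lock;         /* shared state mutex */
    int             done_pipe[2];
    int             running;
} pool;

static void *worker(void *arg) {
    long id = (long)arg;
    for (;;) {
        sem_wait(&pool.wake[id]);       /* sleep until match() wakes us */
        if (!pool.running) return NULL;
        int hits = 0;
        /* This worker's disjoint subset: regexes id, id+N_WORKERS, ... */
        for (int i = (int)id; i < N_REGEXES; i += N_WORKERS)
            if (regexec(&pool.regexes[i], pool.input, 0, NULL, 0) == 0)
                hits++;
        pthread_mutex_lock(&pool.state_lock);
        pool.total_hits += hits;
        int last = (--pool.active == 0);
        pthread_mutex_unlock(&pool.state_lock);
        if (last) {                     /* last finisher reports the total */
            ssize_t n = write(pool.done_pipe[1], &pool.total_hits,
                              sizeof pool.total_hits);
            (void)n;
        }
    }
}

static int match_input(const char *input) {
    int count = -1;
    pool.input      = input;
    pool.total_hits = 0;
    pool.active     = N_WORKERS;
    for (int i = 0; i < N_WORKERS; i++) sem_post(&pool.wake[i]);
    ssize_t n = read(pool.done_pipe[0], &count, sizeof count); /* blocks */
    (void)n;
    return count;
}

/* Build a tiny pool, run one match, shut down; returns the match count. */
int pool_demo(void) {
    static const char *pats[N_REGEXES] = { "foo", "bar[0-9]+", "^qux", "spam" };
    pthread_t threads[N_WORKERS];
    pool.running = 1;
    if (pipe(pool.done_pipe) != 0) return -1;
    pthread_mutex_init(&pool.state_lock, NULL);
    for (int i = 0; i < N_REGEXES; i++)
        regcomp(&pool.regexes[i], pats[i], REG_EXTENDED);
    for (long i = 0; i < N_WORKERS; i++) {
        sem_init(&pool.wake[i], 0, 0);
        pthread_create(&threads[i], NULL, worker, (void *)i);
    }
    int hits = match_input("foo bar42 spam"); /* "foo", "bar42", "spam" match */
    pool.running = 0;
    for (int i = 0; i < N_WORKERS; i++) sem_post(&pool.wake[i]);
    for (int i = 0; i < N_WORKERS; i++) pthread_join(threads[i], NULL);
    return hits;
}
```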
The iterator is straightforward; it needs to maintain only two integers of state:
An index into the output pointer array,
An index into the next record in the corresponding worker's local output buffer.
The workers' local output buffers would contain a series of variable-length records, each starting with the regex id (or 0xFFFF for the sentinel record) and a count of matches, followed, for each match, by a count of capture groups (which can be zero) and then length-prefixed group text strings.
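The record layout described above might be encoded and walked like this. A sketch with hypothetical helper names; I assume 16-bit fields throughout (which the 0xFFFF sentinel already implies for the regex id), and the walker below just sums match counts rather than yielding captures:

```c
#include <stdint.h>
#include <string.h>

/* Append a 16-bit field to the output buffer; returns the new offset. */
size_t put_u16(char *buf, size_t off, uint16_t v) {
    memcpy(buf + off, &v, sizeof v);
    return off + sizeof v;
}

/* Append a length-prefixed group string; returns the new offset. */
size_t put_str(char *buf, size_t off, const char *s) {
    uint16_t len = (uint16_t)strlen(s);
    off = put_u16(buf, off, len);
    memcpy(buf + off, s, len);
    return off + len;
}

/* Walk records (id, n_matches, then per match: n_groups + length-prefixed
 * group strings) and sum the match counts until the 0xFFFF sentinel. */
int count_matches(const char *buf) {
    size_t off = 0;
    int total = 0;
    for (;;) {
        uint16_t id, n_matches;
        memcpy(&id, buf + off, sizeof id); off += sizeof id;
        if (id == 0xFFFF) return total;              /* sentinel record */
        memcpy(&n_matches, buf + off, sizeof n_matches); off += sizeof n_matches;
        total += n_matches;
        for (uint16_t m = 0; m < n_matches; m++) {
            uint16_t n_groups;
            memcpy(&n_groups, buf + off, sizeof n_groups); off += sizeof n_groups;
            for (uint16_t g = 0; g < n_groups; g++) {
                uint16_t len;
                memcpy(&len, buf + off, sizeof len);
                off += sizeof len + len;             /* skip the group text */
            }
        }
    }
}
```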
I'm not going to spend time on optimization until I must, but I expect some regexes to take longer to run than others, and unless this is reflected in the distribution of regexes across worker threads, one worker thread (the slowest) will limit the performance of the entire system. Since the caller does not unblock until all of the threads are done, the slowest thread determines the time spent blocked.
A couple of solutions occur to me:
I could push the burden of figuring it out onto the user, and have them provide weights with their regular expressions. That would be simplest for me, but more complex for the user.
Alternatively, I could provide a way to switch the pool between "training" and "working" modes: in "training" mode the pool would measure the time each regex takes to complete, and on the transition to "working" mode it would rewrite the regex -> worker mapping accordingly. The user could thus train the pool on a few rounds of input data (at the cost of some performance), let it reshuffle the mapping, and then process the rest of the data at full speed. That would be simplest for the user, but more complex for me.
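The rebalancing on the training-to-working transition could be the classic greedy longest-processing-time heuristic: take regexes in descending order of measured cost and hand each to the currently least-loaded worker. A sketch with a hypothetical balance() function, where cost[] holds the per-regex timings gathered in training mode:

```c
#include <stddef.h>
#include <stdlib.h>

/* Greedy LPT assignment: fills assign[r] with a worker index for each
 * regex and returns the heaviest worker's total load, which is the
 * predicted per-match blocking time. */
double balance(const double *cost, size_t n_regexes,
               double *load, size_t n_workers, int *assign) {
    size_t *idx = malloc(n_regexes * sizeof *idx);
    for (size_t i = 0; i < n_regexes; i++) idx[i] = i;
    /* Insertion-sort indices by descending cost. */
    for (size_t i = 1; i < n_regexes; i++) {
        size_t v = idx[i], j = i;
        while (j > 0 && cost[idx[j - 1]] < cost[v]) { idx[j] = idx[j - 1]; j--; }
        idx[j] = v;
    }
    for (size_t w = 0; w < n_workers; w++) load[w] = 0.0;
    for (size_t i = 0; i < n_regexes; i++) {
        size_t r = idx[i], best = 0;
        for (size_t w = 1; w < n_workers; w++)  /* least-loaded worker */
            if (load[w] < load[best]) best = w;
        assign[r] = (int)best;
        load[best] += cost[r];
    }
    free(idx);
    double max = 0.0;
    for (size_t w = 0; w < n_workers; w++)
        if (load[w] > max) max = load[w];
    return max;
}
```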
The distinction might not matter that much, as I expect to be the only user, at least for a while.
Also, I expect to write a Parallel::Regex::Reference module, a serial implementation written in pure Perl. Its purpose would be twofold:
To compare against the performance of Parallel::Regex::PCRE, so that use of the module can be justified (or not!),
To provide a fallback solution, so that if anyone encounters problems with Parallel::Regex::PCRE they can simply switch to the other module without otherwise changing their code.
I am the author of File::Valet (https://metacpan.org/pod/File::Valet) and have need for a module with a similarly convenient way to perform I/O on encrypted files, so I'm writing one. The synopsis shows what I'd like it to look like, but it's not set in stone. The method and function names are chosen to be similar to those of File::Valet, but with "x" used instead of "f" to denote "encrypted" vs "file".
The module would use a caller-provided digest instance, hashed password string, and salt string to encrypt/decrypt the contents of files via a CTR-mode cipher, with random padding before and after the file content and a convenient way to harden predictable plaintext (via a "mix" method).
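For readers unfamiliar with CTR mode: it turns a block primitive into a keystream by encrypting successive (nonce, counter) blocks and XORing the result with the data, which makes encryption and decryption the same operation. A toy illustration only; the keystream function here is a throwaway integer mixer, NOT a real cipher, and the module would of course use a proper block cipher underneath:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy keystream block mixing key, nonce and block counter. A real
 * implementation would encrypt (nonce || counter) with a block cipher. */
static uint64_t keystream_block(uint64_t key, uint64_t nonce, uint64_t counter) {
    uint64_t x = key ^ (nonce + 0x9E3779B97F4A7C15ULL * (counter + 1));
    x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ULL;
    x ^= x >> 27; x *= 0x94D049BB133111EBULL;
    x ^= x >> 31;
    return x;
}

/* XOR data with the keystream in place; calling it twice with the same
 * key and nonce round-trips, which is the defining property of CTR. */
void ctr_xor(uint8_t *data, size_t len, uint64_t key, uint64_t nonce) {
    for (size_t i = 0; i < len; i++) {
        uint64_t block = keystream_block(key, nonce, i / 8);
        data[i] ^= (uint8_t)(block >> (8 * (i % 8)));
    }
}
```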
When I came to PrePAN, the question I had in mind was "should this be named File::Crypt::Valet or Crypt::File::Valet?" but general comments and suggestions about the proposed module would be welcome as well.