
Parallel::Regex::PCRE - apply regexes to a buffer via a pthread pool

Author: ttkciar@github
Status: In Review

Synopsis

use Parallel::Regex::PCRE;

# A spamfilter might have thousands of regular expressions, but this example has three:
my @regex_list = ('^subject:.+lose weight', '^from:.+weight loss', '(authou?ri[sz]e payment)');

# Initialize the pool.  This need only be done once for the life of the process:
my $re_pool = Parallel::Regex::PCRE->new(
    threads => 3,
    regex_list => \@regex_list,
    modifiers => $PCRE_MULTILINE | $PCRE_CASELESS | $PCRE_UNGREEDY
);

# Let the worker threads find matches in a string (also resets the iterator):
my $n_matches = $re_pool->match($buffer);

# Iterate through hits, which may or may not have capture groups:
while (my $match = $re_pool->next_match) {
    print "got a hit from ", $match->{regex_id}, "\n";
    print "capture groups: ", join(", ", @{$match->{captures}}), "\n";
}

Description

The proposed module would provide a convenient interface to a pool of pthreads for parallelized regular expression matching.

Using the class:

After initialization, every time the pool was given an input string via "match", each worker in the pool would apply its own disjoint subset of the regular expressions to the string. The caller would block until all of the regular expressions had been applied, and would receive the total match count as the return value.

Any matches could then be accessed through an iterator method "next_match", which would return a numeric regex id (corresponding to the position of the regex in "regex_list") and any text captured by it.

Use cases

The suitability of this module to a given application depends entirely on whether the overhead of setting up the pool and traversing matches exceeds the performance benefit of applying regular expressions to the input in parallel.

The first expected use-case for this module is an email spam filter, which might apply thousands or tens of thousands of regular expressions to each email document, few of which are expected to match at all.

This use-case, with a huge number of match() calls, large inputs, a large number of regular expressions, and few matches, is at the optimal end of the cost/benefit spectrum. Applying a small number of regular expressions to small strings would be at the other end, and I expect that would be better served by just using ordinary Perl regular expressions serially. I look forward to testing these expectations against real data.

Thoughts on implementation:

The guts of the module would exist mostly as XS. The initial implementation would use PCRE simply to keep C development time short. Other implementations with the same interface might be provided as Parallel::Regex::* if PCRE proves unsatisfactory.
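
To keep the PCRE part concrete, here is a rough sketch of how the XS layer might compile the regex list up front; all names are illustrative, and real XS code would report compile failures back to Perl via croak() rather than exiting:

#include <stdio.h>
#include <stdlib.h>
#include <pcre.h>

/* Compile each pattern once, up front, with the caller-supplied modifiers. */
static pcre **compile_regex_list(const char **patterns, int n, int options)
{
    pcre **compiled = calloc(n, sizeof(pcre *));
    if (compiled == NULL) return NULL;

    for (int i = 0; i < n; i++) {
        const char *error;
        int erroffset;
        compiled[i] = pcre_compile(patterns[i], options, &error, &erroffset, NULL);
        if (compiled[i] == NULL) {
            fprintf(stderr, "regex %d failed at offset %d: %s\n", i, erroffset, error);
            exit(EXIT_FAILURE);   /* the XS version would croak() instead */
        }
    }
    return compiled;
}

int main(void)
{
    const char *patterns[] = {
        "^subject:.+lose weight",
        "^from:.+weight loss",
        "(authou?ri[sz]e payment)",
    };
    pcre **compiled = compile_regex_list(patterns, 3,
                                         PCRE_MULTILINE | PCRE_CASELESS | PCRE_UNGREEDY);
    printf("compiled %s\n", compiled ? "ok" : "failed");
    return 0;
}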

new() would initialize the following (a rough C sketch of this state follows the list):

  • A pool of pthread worker threads,

  • A set of compiled PCRE regular expressions,

  • A mapping of those workers to those compiled regexes,

  • A pipe,

  • Work mutexes on which worker threads are blocked/unblocked,

  • An input pointer,

  • An array of output pointers, one per worker,

  • An active worker counter,

  • A shared state mutex protecting the active worker counter, the total hit count, and the output pointers.

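Concretely, the C-side state might look something like this; the struct and field names are illustrative, not a committed interface:

#include <pthread.h>
#include <pcre.h>
#include <stddef.h>

typedef struct {
    int              n_workers;      /* pool size passed to new()                  */
    pthread_t       *workers;        /* the pool of pthread worker threads         */

    int              n_regexes;
    pcre           **regexes;        /* compiled PCRE regular expressions          */
    int             *first_regex;    /* per worker: first regex index it owns...   */
    int             *n_assigned;     /* ...and how many (the worker -> regex map)  */

    int              pipe_fd[2];     /* the pipe the final hit count comes back on */
    pthread_mutex_t *work_mutex;     /* one per worker, to block/unblock it
                                        (the later sketches use semaphores)        */

    const char      *input;          /* input pointer, set by match()              */
    size_t           input_len;

    char           **output;         /* one output pointer per worker              */
    int              total_hits;
    int              active_workers; /* counts down as workers finish a round      */
    pthread_mutex_t  state_mutex;    /* protects the three fields above            */
} regex_pool_t;

Representing the regex-to-worker mapping as a (first index, count) pair per worker assumes each worker gets a contiguous slice of regex_list; an arbitrary mapping would just replace those two arrays.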

The match method would simply do the following (a sketch follows the list):

  • Reset the iterator state,

  • Set the active worker counter to the pool size,

  • Update the input pointer to point at the buffer,

  • Unlock the worker mutexes,

  • Read a match count from the pipe (blocking until available),

  • Return the match count to the caller.

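In C that amounts to very little code. One caveat about the sketch below: it wakes workers with POSIX semaphores rather than the per-worker work mutexes listed above, because unlocking an ordinary pthread mutex from a thread that did not lock it is undefined behaviour; the block-until-woken semantics are the same. All names and field widths are illustrative:

#include <semaphore.h>
#include <unistd.h>
#include <stddef.h>

/* Trimmed-down pool state; see the fuller sketch under new() above. */
typedef struct {
    int         n_workers;
    sem_t      *go;            /* one per worker; posted to start a round    */
    int         pipe_fd[2];    /* the last worker writes the total hit count */
    const char *input;
    size_t      input_len;
    int         total_hits;
    int         active_workers;
    int         iter_worker;   /* iterator state, reset on every match()     */
    size_t      iter_offset;
} regex_pool_t;

/* match(): hand the buffer to the workers, block until all of them finish. */
int pool_match(regex_pool_t *pool, const char *buffer, size_t len)
{
    int total = 0;

    /* Reset the iterator state. */
    pool->iter_worker = 0;
    pool->iter_offset = 0;

    /* Set the active worker counter to the pool size and publish the input. */
    pool->total_hits     = 0;
    pool->active_workers = pool->n_workers;
    pool->input          = buffer;
    pool->input_len      = len;

    /* Wake every worker (the unlock-the-work-mutex step in the list above). */
    for (int i = 0; i < pool->n_workers; i++)
        sem_post(&pool->go[i]);

    /* Block until the last worker to finish writes the total match count. */
    if (read(pool->pipe_fd[0], &total, sizeof total) != (ssize_t)sizeof total)
        return -1;

    return total;
}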

Each worker thread would remain blocked on its work mutex until woken by the match method, then do the following (again, a sketch follows the list):

  • Apply each of its regexes to the input buffer (accessed via the input pointer),

  • Write match data to its own local memory buffer (resizing the buffer and copying forward as needed) as matches are found,

  • When done with all regexes, acquire the shared state mutex,

  • Update its output pointer to point at its local memory buffer,

  • Update the total hit count,

  • Decrement the active worker counter,

  • Release the shared state mutex,

  • If the active worker counter is zero (i.e. it is the last thread to finish), write the total hit count to the pipe,

  • Lock its own worker mutex to put itself to sleep until the next time the match method is called.

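The corresponding worker loop, again with semaphores standing in for the work mutexes, and with the record-writing left as a comment (the record layout is described below); all names are illustrative:

#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>
#include <pcre.h>
#include <stddef.h>

#define OVECCOUNT (3 * 16)                /* room for 15 capture groups */

typedef struct {                          /* trimmed-down pool state, as before */
    pcre           **regexes;
    int             *first_regex, *n_assigned;
    sem_t           *go;
    int              pipe_fd[2];
    const char      *input;
    size_t           input_len;
    char           **output;
    int              total_hits;
    int              active_workers;
    pthread_mutex_t  state_mutex;
} regex_pool_t;

typedef struct { regex_pool_t *pool; int id; } worker_arg_t;

void *worker_main(void *arg)
{
    worker_arg_t *w    = arg;
    regex_pool_t *pool = w->pool;

    for (;;) {
        sem_wait(&pool->go[w->id]);       /* sleep until match() wakes us */

        char *local = NULL;               /* this worker's local output buffer */
        int   hits  = 0;
        int   ovector[OVECCOUNT];

        /* Apply each of this worker's regexes to the shared input buffer. */
        for (int i = 0; i < pool->n_assigned[w->id]; i++) {
            int id = pool->first_regex[w->id] + i;
            int rc = pcre_exec(pool->regexes[id], NULL, pool->input,
                               (int)pool->input_len, 0, 0, ovector, OVECCOUNT);
            if (rc > 0) {
                /* Here a real worker would append a record for regex `id`
                 * (with rc - 1 capture groups taken from ovector) to `local`,
                 * growing the buffer as needed, and finish the round with the
                 * 0xFFFF sentinel record described below. */
                hits++;
            }
        }

        /* Publish results under the shared state mutex. */
        pthread_mutex_lock(&pool->state_mutex);
        pool->output[w->id]  = local;
        pool->total_hits    += hits;
        int remaining        = --pool->active_workers;
        int total            = pool->total_hits;
        pthread_mutex_unlock(&pool->state_mutex);

        /* The last worker to finish reports the total back through the pipe. */
        if (remaining == 0) {
            ssize_t n = write(pool->pipe_fd[1], &total, sizeof total);
            (void)n;
        }
    }
    return NULL;
}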

The iterator is straightforward; it needs to maintain only two integers of state:

  • An index into the output pointer array,

  • An index into the next record in the corresponding worker's local output buffer.


Each worker's local output buffer would contain a series of variable-length records, each starting with the regex id (or 0xFFFF for the sentinel record) and a count of matches, followed for each match by a count of capture groups (which can be zero) and length-prefixed group text strings.
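
Assuming 16-bit ids and counts and 32-bit capture lengths (field widths I have not settled on), decoding one such record might look like the sketch below; next_match would call something like this, advancing to the next worker's buffer whenever it hits the sentinel:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define SENTINEL_ID 0xFFFFu

/* Record layout assumed here:
 *   uint16 regex_id   (0xFFFF = sentinel, end of this worker's buffer)
 *   uint16 n_matches
 *   per match: uint16 n_groups, then per group: uint32 length + that many bytes */

/* Walk one record, printing what next_match would hand back to the caller.
 * Returns a pointer just past the record, or NULL at the sentinel. */
const unsigned char *read_record(const unsigned char *p)
{
    uint16_t regex_id, n_matches;

    memcpy(&regex_id, p, sizeof regex_id);   p += sizeof regex_id;
    if (regex_id == SENTINEL_ID)
        return NULL;
    memcpy(&n_matches, p, sizeof n_matches); p += sizeof n_matches;

    for (uint16_t m = 0; m < n_matches; m++) {
        uint16_t n_groups;
        memcpy(&n_groups, p, sizeof n_groups); p += sizeof n_groups;
        printf("regex %u matched, %u capture group(s)\n",
               (unsigned)regex_id, (unsigned)n_groups);

        for (uint16_t g = 0; g < n_groups; g++) {
            uint32_t len;
            memcpy(&len, p, sizeof len); p += sizeof len;
            printf("  capture: %.*s\n", (int)len, (const char *)p);
            p += len;
        }
    }
    return p;
}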

Anticipated follow-up work:

I'm not going to spend time on optimization until I must, but I expect some regexes to take more time to run than others, and unless this is reflected in the distribution of regexes to worker threads, one worker thread (the slowest) will limit the performance of the entire system. Since the caller does not unblock until all of the threads are done, the slowest thread determines how long the caller stays blocked.

A couple of solutions occur to me:

  • I could push the burden of figuring it out onto the user, and have them provide weights with their regular expressions. That would be simplest for me, but more complex for the user.

  • Alternatively, I could provide a way to switch the pool between "training" and "working" modes: in "training" mode the workers would measure the time each regex takes to complete, and on the transition to "working" mode the pool would rewrite the regex -> worker mapping accordingly. The user could then train the pool on a few rounds of input data (at the cost of some performance), let it reshuffle the mapping, and process the rest of the data at full speed. That would be simplest for the user, but more complex for me.

The distinction might not matter that much, as I expect to be the only user, at least for a while.
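
Either way, once a cost per regex exists (whether supplied as weights or measured in "training" mode), the reshuffle itself is just greedy bin-packing; a minimal sketch, with purely illustrative names:

#include <stdlib.h>

typedef struct { int regex_id; double cost; } timed_regex_t;

/* Sort most expensive first. */
static int by_cost_desc(const void *a, const void *b)
{
    double ca = ((const timed_regex_t *)a)->cost;
    double cb = ((const timed_regex_t *)b)->cost;
    return (ca < cb) - (ca > cb);
}

/* Greedy "longest job first": give each regex, most expensive first, to the
 * currently least-loaded worker.  assignment[regex_id] receives a worker id. */
void rebalance(timed_regex_t *timings, int n_regexes, int n_workers, int *assignment)
{
    double *load = calloc(n_workers, sizeof(double));
    if (load == NULL)
        return;

    qsort(timings, n_regexes, sizeof(timed_regex_t), by_cost_desc);

    for (int i = 0; i < n_regexes; i++) {
        int lightest = 0;
        for (int w = 1; w < n_workers; w++)
            if (load[w] < load[lightest])
                lightest = w;
        assignment[timings[i].regex_id] = lightest;
        load[lightest] += timings[i].cost;
    }
    free(load);
}

The linear scan for the least-loaded worker is negligible at realistic pool sizes; a heap would tighten it if that ever mattered.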


Also, I expect to write a Parallel::Regex::Reference module, which would actually be a serial implementation written in pure Perl. Its purpose would be twofold:

  • To compare against the performance of Parallel::Regex::PCRE, so that use of the module can be justified (or not!),

  • To provide a fallback solution, so that if anyone encounters problems with Parallel::Regex::PCRE they can simply switch to the other module without otherwise changing their code.

Comments

Noting feedback from freenode-#perl:

* PCRE2 offers better Perl compatibility than PCRE; I might use that instead.

* "support for pcre's named captures is probably a must for sanity" -- simcop2387

After this is working, it could be improved by extending the input pointer into an array of input pointers, so that data could be queued while workers are still processing. That would require refactoring the communication back to the caller, though.
