GitHub: okharch PAUSE ID: OKHARCH

User's Modules

DBIx::Brev framework for laconic DBI, powered by DBIx::Connector, Config::General, SQL::SplitStatement

DBIx::Brev provides a framework for using DBI in a more convenient and laconic way:

1) Establishes connections using db aliases, set up in a config file read by Config::General (Apache style).

2) Keeps and re-establishes the connection automatically using DBIx::Connector facilities.

3) Uses a default-connection paradigm to make DBI code laconic.

4) Switches easily and instantly between databases using cached connections.

5) You can switch context with db_use($db_alias), or you can execute any of the sql_xxx/inserts subroutines with an explicit db alias as the first parameter, without switching the default database connection.

6) sql_exec executes multiple statements in a single transaction, using SQL::SplitStatement::split_with_placeholders to extract the statements.

To make it work, put database aliases into ~/dbi.conf or /etc/dbi.conf and switch easily between databases. It's well suited for one-liners, where less code is the best approach, and it will also be good for everyone who likes laconic code.
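As a sketch of how this might look in practice: the alias-block layout shown in the comment is an assumption (Config::General accepts Apache-style blocks, but the exact keys DBIx::Brev expects should be checked against its documentation); db_use and sql_exec are the subroutines named above.

```perl
# Hypothetical ~/dbi.conf (Apache-style syntax, parsed by Config::General);
# the exact block and key names here are an assumption, not verified:
#
#   <database reports>
#       dsn  dbi:SQLite:dbname=/tmp/reports.db
#   </database>

use strict;
use warnings;
use DBIx::Brev;        # from CPAN

db_use('reports');     # make the 'reports' alias the default connection

# sql_exec splits the string into statements (via SQL::SplitStatement)
# and runs them all in a single transaction:
sql_exec(q{
    create table if not exists t (x integer);
    insert into t values (1);
});
```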

okharch@github 2 comments

Parallel::parallel_map Do map in parallel

You know from school that nothing is simpler than 2 x 2 = 4.

You know from university that it is the simplest operation for a computer as well (a left shift).

This module tries to outperform and speed up even the simplest calculations, using all the power of your server's CPU cores and the familiar map concept.

time perl -Ilib -MParallel::parallel_map -e'parallel_map {$_*2} 1..20000_000'

real 0m6.379s user 0m9.592s sys 0m2.026s

time perl -Ilib -MParallel::parallel_map -e'map {$_*2} 1..20000_000'

real 0m3.120s user 0m2.901s sys 0m0.217s

Given more cores (I have only 4) and more memory, a test like time perl -Ilib -MParallel::parallel_map -e'parallel_map {$_*2} 1..100000_000'

would definitely outperform the plain map, especially on the latest Perls (5.16 etc.). :)

Here is how it works:

1) finds out how many CPU cores you have,

2) splits the map work by the number of cores,

3) does the work in parallel, and

4) after each child finishes its job, merges the results into an array if it was called in list context. Otherwise it only calculates the values and does not collect the results, so it can be used as a parallel for loop.

Sorry, slightly more than 1-2-3.

Interprocess communication is done using plain old temporary files, so it should work anywhere fork is implemented.
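The steps above can be sketched with core modules only. This is an illustrative toy (naive_parallel_map is not the module's API): it forks one child per chunk, each child serializes its slice of results to a temp file with Storable, and the parent merges the files in order. The real module adds core-count detection, smarter chunking, and optional Sereal serialization.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Storable   qw(store retrieve);
use POSIX      qw(ceil);

sub naive_parallel_map {
    my ($code, $workers, @input) = @_;
    my $chunk = ceil(@input / $workers);
    my (@pids, @files);
    for my $w (0 .. $workers - 1) {
        my $start = $w * $chunk;
        last if $start > $#input;                  # fewer items than workers
        my $end = $start + $chunk - 1;
        $end = $#input if $end > $#input;
        my ($fh, $file) = tempfile(UNLINK => 0);
        close $fh;
        push @files, $file;
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # child: map its slice and serialize the results via a temp file
            my @out = map { $code->($_) } @input[$start .. $end];
            store \@out, $file;
            exit 0;
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;                        # wait for all workers
    my @result = map { @{ retrieve($_) } } @files;  # merge chunks in order
    unlink @files;
    return @result;
}

my @doubled = naive_parallel_map(sub { $_[0] * 2 }, 4, 1 .. 10);
print "@doubled\n";   # 2 4 6 8 10 12 14 16 18 20
```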

That said, I have a benchmark that drives Perl crazy when it does garbage collection under Win32, even though some small tests work perfectly. So there is still no heaven on Windows ;) .

Although they are not required, please install Sereal and File::Slurp; with them, IPC is done much faster than with Storable.

okharch@github 0 comments

Parallel::DataPipe makes it simple to parallelize code that processes data

If you have a long-running script that processes data item by item (taking some data on input and producing processed data on output, e.g. aggregation, web crawling, etc.), here is good news for you:

You can speed it up 4-20 times with minimal effort. Modern computers (even modern smartphones ;) ) have multiple CPU cores: 2, 4, 8, even 24! And a huge amount of memory, because memory is cheap now. So they are ready for parallel data processing, and with this module there is an easy and flexible way to use that power.

Well, it is not the first approach to parallelism in Perl. You could write an efficient crawler on a single core using a framework like Coro::LWP or AnyEvent::HTTP::LWP. You can also elegantly use all your CPU cores for parallel processing with Parallel::Loops. So what are the benefits of this module?

1) Because it uses an input_iterator, it does not have to know all the input data before starting parallel processing.

2) Because it uses merge_data, processed data is ready for use in the main thread immediately.

Together, 1) and 2) remove the memory requirement for storing data items before and after the parallel work, and allow the collecting, processing, and use of the processed data to be parallelized.

If you don't want to overload your database with multiple simultaneous queries, make queries only within input_iterator, then process the data with process_data, then flush it with merge_data. On the other hand, you usually win if you make queries in process_data and run a lot of data processors: this guarantees a full load on your CPU. It's no surprise that DB servers usually serve N simultaneous queries faster than N queries one by one. Run tests and you will know.

To (re)write your script to use all the processing power of your server, you have to figure out:

1) How to obtain the source/input data. I call it the input iterator. It can be either an array with some identifiers/URLs, or a reference to a subroutine which returns the next portion of data (or undef when there is no more data to process).

2) How to process the data, i.e. the method which receives an input item and produces an output item. I call it the process_data subroutine. The good news is that the item which is processed and then returned can be any scalar value in Perl, including references to arrays and hashes: anything that Storable can freeze and then thaw.

3) How to use the processed data. I call it merge_data. In the example above it just prints an item, but you could do buffered inserts into a database, send email, etc.

Take into account that 1) and 3) are executed in the main script process, while all the work of 2) is done in parallel forked processes. So in 1) and 3) it's better not to do things that block execution and leave the hungry dogs of 2) without meat to eat. This approach pays off when you know that the bottleneck in your script is the CPU in the processing step. Of course, that's not the case for some web-crawling tasks, unless you do some heavy calculations.
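Putting the three pieces together, a usage sketch might look like the following. The parameter names (input_iterator, process_data, merge_data) are taken from the description above; the exact call style should be checked against the module's current synopsis, so treat this as an assumption rather than verified API.

```perl
use strict;
use warnings;
use Parallel::DataPipe;   # from CPAN

Parallel::DataPipe::run {
    # 1) runs in the main process: an array, or a sub returning
    #    the next item (undef when the data is exhausted)
    input_iterator => [1 .. 100],

    # 2) runs in parallel forked processes; the return value can be
    #    any scalar that Storable can freeze and thaw
    process_data   => sub { $_ * 2 },

    # 3) runs in the main process as results arrive
    merge_data     => sub { print "$_\n" },
};
```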

P.S. Please help me improve the documentation for this module. I understand my English is far from perfect, and so, probably, is the clarity of the explanations. Thank you!

okharch@github 2 comments