PrePAN

Lingua::StarDict::Writer: a module for creating StarDict dictionaries

Synopsis

    use Lingua::StarDict::Writer;

    my $stardict_writer = Lingua::StarDict::Writer->new(name => 'My Cool Dictionary', date => '2020-12-31');

    $stardict_writer->add_entry_part('42', {type=> "t", body => "ˈfɔɹti tuː"} );
    $stardict_writer->add_entry_part('42', {type=> "m", body => "Answer to the Ultimate Question of Life, the Universe, and Everything"});

    $stardict_writer->add_entry_part('Perl', {type=> "t", body => "pɛʁl"} );
    $stardict_writer->add_entry_part('Perl', {type=> "h", body => "The <b>best</b> programming language ever"});

    $stardict_writer->write;

Description

StarDict is a popular dictionary format, supported by many dictionary and book reading programs.

A StarDict entry may consist of several parts of various text or media types.

This module lets you create a new StarDict dictionary with entries consisting of parts of arbitrary types.

Comments

xq (IRC)
I see bareword filehandles, indirect new and use of utf8::
xq (IRC)
I also see some very strange str* reimplementation code which I don't understand the use for
Nataraj: can you comment on why functions sub g_ascii_strcasecmp and sub strcmp must exist?
xq (IRC)
Nataraj: the fact that you "exactly" reimplement an algorithm does not help figuring out if it makes sense or not
"exactly" is quoted because the original is in C, and perl strings are not C strings

Nataraj: because it looks to me like you are sorting decoded unicode characters (in other words, perl code points) using some sort of arbitrary rules

xq: Are you here?

xq: I have refreshed my knowledge of that Lingua::StarDict::Writer sorting issue

xq: The story is the following: when you create a StarDict dictionary, you have to write an index file with a list of entry titles.

xq: StarDict software will use this index for entry lookup, so it must be sorted with the same comparison algorithm that StarDict uses for lookup, or it will not work right.

xq: The comparison algorithm StarDict uses is the following: it treats a string as a sequence of bytes (ignoring the fact that it can be UTF-8 inside), converts any uppercase Latin letters to lowercase, then compares the strings byte by byte.

xq: Do not ask me why they do it that way. I do not know. But I have to sort the entries of the index in exactly the same way.

xq: i.e. if you have an entry list like this

$VAR1 = [
'aaa01',
'aaa02',
'ZZZ00',
'aaa03',
'aaa04',
'aaa05',
'aaa06',
'aaa07'
];

xq: they need to be sorted this way

$VAR1 = [
'aaa01',
'aaa02',
'aaa03',
'aaa04',
'aaa05',
'aaa06',
'aaa07',
'ZZZ00'
];

xq: So if you know a better way to do it, please tell me about it, and please explain why it is better than mine

xq: I'd be happy to know more about perl encoding issues ;-)
xq (IRC)
Nataraj: does this output match what you have described? https://paste.debian.net/hidden/4d1290bb/

#!/usr/bin/env perl

use 5.026;
use strict;
use warnings;

use utf8;

use Encode::Simple qw[encode_utf8];

my @inputs = qw[
aaa01
잔디
aaa02
aaa03

aaa04
ZZZ00
aaa05
aaa06
aaa07

];

my @processed_inputs = map { encode_utf8 lc } @inputs;

my @sorted_processed_inputs = sort @processed_inputs;

say for @sorted_processed_inputs;

#########

OUTPUT:

aaa01
aaa02
aaa03
aaa04
aaa05
aaa06
aaa07
zzz00


잔디
Nataraj
xq: I do not know. If a Unicode multibyte character can have bytes 0x41..0x5A inside, then I guess this function will give a wrong result
xq: no, it can't... But won't this lc spoil the results by lowercasing where it is not expected?

Nataraj
xq: the solution you've suggested proved to be wrong. It lowercases non-Latin characters too (for example Cyrillic), but it should not. I've changed your code to
{
use utf8;

use Encode::Simple qw[encode_utf8];

my @res = sort {encode_utf8('lc', $a) cmp encode_utf8('lc', $b)} @l;
print Dumper \@res;
}

Nataraj
And it gives the following result
$VAR1 = [
'aaa01',
'aaa02',
'aaa03',
'aaa04',
'ZZZ00',
'aaa05',
'aaa06',
'aaa07',
'аааа',
'АААА',
'ЯЯЯЯ',
'яяяя'
];

Nataraj
Though the expected result is
$VAR1 = [
'aaa01',
'aaa02',
'aaa03',
'aaa04',
'aaa05',
'aaa06',
'aaa07',
'ZZZ00',
'АААА',
'ЯЯЯЯ',
'аааа',
'яяяя'
];
xq (IRC)
Nataraj: make sure to always clearly understand if you are working with bytes or characters
Nataraj: if your module is going to expose an API that takes strings, make sure to clearly document if you expect bytes or characters as input (it is characters in your case I think)
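To make the bytes-versus-characters distinction concrete, here is a minimal sketch using the core Encode module. The transcription is the one from the synopsis; the variable names are only illustrative and are not taken from the module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes of the IPA transcription "pɛʁl" from the synopsis:
# this is what would be written to, or read from, a file.
my $bytes = "p\xC9\x9B\xCA\x81l";

# Decoding turns the bytes into Perl characters (code points).
my $chars = decode('UTF-8', $bytes);

# The same text has different lengths depending on which form you hold.
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Encoding the characters gives back the original bytes.
my $round_trip = encode('UTF-8', $chars);
print "round trip ok\n" if $round_trip eq $bytes;
```

An API that documents "takes characters" would expect the decoded form and do the encoding itself just before writing.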

Grinnz (IRC)
xq: that means you will be working with the internally stored string, which may or may not be the actual value of the string
so it's a stupid idea
xq (IRC)
I don't like it, but it's literally what perldoc suggests
Grinnz (IRC)
it's not a suggestion, it's saying what happens *if* it's there
xq (IRC)
the alternative would be to manually do something like tr/A-Z/a-z/r

xq (IRC)
Grinnz: what would you suggest?
Grinnz (IRC)
it's tricky because you're trying to apply a character operation to a byte string
the tr/// is probably the cleanest

xq (IRC)
Nataraj: see above, it is probably beneficial to use tr/// instead of lc() + use bytes
Nataraj
xq: yep, I am following the discussion
Grinnz (IRC)
alternatively, if you wanted to use bytes for this, you'd need to apply it only to the two lc() calls, and not the cmp
that's why it's broken
for lc itself it happens to not break because then it only operates on nonvariant ascii codepoints
but every other string operation will break
(also i wouldn't be surprised to see "use bytes" get deprecated at some point)
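A minimal sketch of the tr/// approach discussed above, assuming titles arrive as decoded characters: encode each title to UTF-8 bytes, lowercase only ASCII A-Z, then compare the bytes. The helper names index_sort_key and compare_titles are hypothetical. The tie-break on the raw bytes is an assumption based on the g_ascii_strcasecmp and strcmp pair mentioned earlier in the discussion, not something confirmed here against the StarDict source.

```perl
use strict;
use warnings;
use utf8;                       # the literals below are characters
use Encode qw(encode);

# Hypothetical helper: UTF-8 bytes of a title with only ASCII A-Z lowercased.
# tr/A-Z/a-z/ never touches the bytes of multibyte characters, because those
# bytes are all >= 0x80.
sub index_sort_key {
    my ($title) = @_;
    my $bytes = encode('UTF-8', $title);
    return $bytes =~ tr/A-Z/a-z/r;    # /r needs perl 5.14 or later
}

# Hypothetical comparator: ASCII-case-insensitive byte comparison first,
# then (assumption) plain byte comparison to break ties.
sub compare_titles {
    my ($x, $y) = @_;
    return index_sort_key($x) cmp index_sort_key($y)
        || encode('UTF-8', $x) cmp encode('UTF-8', $y);
}

my @titles = qw(aaa01 aaa02 ZZZ00 aaa03 яяяя АААА аааа ЯЯЯЯ);
my @sorted = sort { compare_titles($a, $b) } @titles;

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @sorted;
```

With this input the Latin titles come first with ZZZ00 last among them, followed by АААА, ЯЯЯЯ, аааа, яяяя, which is the order Nataraj gave as expected.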
xq

additionally, there are still a couple of ways to simplify your code and/or avoid reinventing wheels of varied complexity
instead of writing your own constructor, use Moo
instead of using various methods of path, directory tree and filehandle manipulations, depend on Path::Tiny
for time formatting you can use Time::Piece (core module)
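As an illustration of the Time::Piece suggestion only (Moo and Path::Tiny are not core, so they are left out of this sketch), the following formats a date in the YYYY-MM-DD shape used in the synopsis. The helper name format_dict_date is hypothetical, and whether the module wants UTC or local time is an assumption.

```perl
use strict;
use warnings;
use Time::Piece;    # core module; overrides gmtime/localtime in this package

# Hypothetical helper: format an epoch timestamp as YYYY-MM-DD in UTC.
sub format_dict_date {
    my ($epoch) = @_;
    return gmtime($epoch)->strftime('%Y-%m-%d');
}

# Today's date, in the same shape as date => "2020-12-31" in the synopsis.
print format_dict_date(time), "\n";
```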
sweval: use Moo; has a_hashref => (is => "rw", default => sub {+{}}); my $obj = main->new; $obj->a_hashref->{key} = 'value'; [keys $obj->a_hashref->%*]
perlbot (IRC)
xq: ["key"]
xq (IRC)
anything particularly wrong here?
I don't see what you gain by handles_via compared to this, can you explain it?
here you should use path::tiny's helpful open functions instead https://gitlab.com/dhyannataraj/lingua-stardict-writer-perl/-/blob/master/lib/Lingua/StarDict/Writer.pm#L197
sweval: use Moo; has arrayref => (is => "rw", default => sub { [] }); my $obj = main->new; push $obj->arrayref->@*, 'stuff'; say $obj->arrayref->[0]
xq (IRC)
Nataraj: you have some unnecessary includes, for example https://gitlab.com/dhyannataraj/lingua-stardict-writer-perl/-/blob/151cc873b919f6a4f3b181cbc213e8d25940421c/lib/Lingua/StarDict/Writer.pm#L10 and https://gitlab.com/dhyannataraj/lingua-stardict-writer-perl/-/blob/151cc873b919f6a4f3b181cbc213e8d25940421c/lib/Lingua/StarDict/Writer/Entry.pm#L7 also in the Makefile
Nataraj: you should probably be using 'exists' to check instead of this https://gitlab.com/dhyannataraj/lingua-stardict-writer-perl/-/blob/151cc873b919f6a4f3b181cbc213e8d25940421c/lib/Lingua/StarDict/Writer.pm#L173
Nataraj: this block looks like it has no use, you can probably do just my @ordered_keys = sort ... here https://gitlab.com/dhyannataraj/lingua-stardict-writer-perl/-/blob/151cc873b919f6a4f3b181cbc213e8d25940421c/lib/Lingua/StarDict/Writer.pm#L197
Nataraj: if you want to refer to me, 'xq from freenode #perl' is fine
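On the 'exists' remark above: the linked line is not reproduced here, so this sketch assumes it is a hash key check. It shows how a truthiness test and exists differ when a stored value is false. The hash %part_count and the helper has_entry are hypothetical names, not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical data: entry titles mapped to a part count.
my %part_count = ('42' => 2, 'empty' => 0);

# Hypothetical helper: 1 when the title is a key of the hash, else 0.
sub has_entry {
    my ($entries, $title) = @_;
    return exists $entries->{$title} ? 1 : 0;
}

for my $title ('42', 'empty', 'missing') {
    my $truthy = $part_count{$title} ? 1 : 0;    # false for 0, '' and undef
    my $exists = has_entry(\%part_count, $title);
    print "$title: truthy=$truthy exists=$exists\n";
}
```

The 'empty' key is present but fails the truthiness test, which is the case exists handles correctly.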

xq (IRC)
Nataraj: another thing is, you are targeting perl 5.6 but your dependency Unicode::UTF8 requires at least 5.8 https://metacpan.org/source/CHANSEN/Unicode-UTF8-0.62/META.yml#L28
