Synopsis
use Lingua::StarDict::Writer;
my $stardict_writer = Lingua::StarDict::Writer->new(name=>'My Cool Dictionary', date=>"2020-12-31");
$stardict_writer->add_entry_part('42', {type=> "t", body => "ˈfɔɹti tuː"} );
$stardict_writer->add_entry_part('42', {type=> "m", body => "Answer to the Ultimate Question of Life, the Universe, and Everything"});
$stardict_writer->add_entry_part('Perl', {type=> "t", body => "pɛʁl"} );
$stardict_writer->add_entry_part('Perl', {type=> "h", body => "The <b>best</b> programming language ever"});
$stardict_writer->write;
Description
StarDict is a popular dictionary format, supported by many dictionary and book reading programs.
A StarDict entry may consist of several parts of various text or media types.
This module lets you create a new StarDict dictionary with entries consisting of parts of arbitrary types.
Comments
I see bareword filehandles, indirect new and use of utf8::
I also see some very strange str* reimplementation code whose purpose I don't understand
Nataraj: can you comment on why functions sub g_ascii_strcasecmp and sub strcmp must exist?
Nataraj: the fact that you "exactly" reimplement an algorithm does not help in figuring out whether it makes sense or not
"exactly" is quoted because the original is in C, and perl strings are not C strings
Nataraj: because it looks to me like you are sorting decoded unicode characters (in other words, perl code points) using some sort of arbitrary rules
xq: Are you here?
xq: I have refreshed my knowledge of that Lingua::StarDict::Writer sorting issue
xq: The story is the following: when you create a StarDict dictionary, you have to write an index file with a list of entry titles.
xq: StarDict software will use this index for entry lookup, so it should be sorted with the same comparison algorithm StarDict will use for lookup, or it will not work right.
xq: The comparison algorithm StarDict uses is the following: it treats a string as a sequence of bytes (ignoring the fact that it may be UTF-8 inside); uppercase Latin letters, if any, are converted to lowercase; then the strings are compared byte by byte.
xq: Do not ask me why they do it that way. I do not know. But I should sort entries of the index in exactly the same way.
xq: i.e. if you have an entry list like this
$VAR1 = [
'aaa01',
'aaa02',
'ZZZ00',
'aaa03',
'aaa04',
'aaa05',
'aaa06',
'aaa07'
];
xq: they need to be sorted this way
$VAR1 = [
'aaa01',
'aaa02',
'aaa03',
'aaa04',
'aaa05',
'aaa06',
'aaa07',
'ZZZ00'
];
xq: So if you know a better way to do it, please tell me about it, and please explain why it is better than mine
xq: I'd be happy to know more about perl encoding issues ;-)
Nataraj: does this output match what you have described? https://paste.debian.net/hidden/4d1290bb/
#!/usr/bin/env perl
use 5.026;
use strict;
use warnings;
use utf8;
use Encode::Simple qw[encode_utf8];
my @inputs = qw[
aaa01
잔디
aaa02
aaa03
草
aaa04
ZZZ00
aaa05
aaa06
aaa07
茶
];
my @processed_inputs = map { encode_utf8 lc } @inputs;
my @sorted_processed_inputs = sort @processed_inputs;
say for @sorted_processed_inputs;
#########
OUTPUT:
aaa01
aaa02
aaa03
aaa04
aaa05
aaa06
aaa07
zzz00
茶
草
잔디
xq: I do not know. If a Unicode multibyte char can have bytes 0x41..0x5A inside, then I guess this function will give a wrong result
xq: no, they can't... But won't this lc spoil the results by lowercasing where it is not expected...
Nataraj
xq: the solution you've suggested proved to be wrong. It lowercases non-Latin characters too (for example Cyrillic), but it should not. I've changed your code to
{
use utf8;
use Encode::Simple qw[encode_utf8];
my @res = sort {encode_utf8('lc', $a) cmp encode_utf8('lc', $b)} @l;
print Dumper \@res;
}
Nataraj
And it gives the following result
$VAR1 = [
'aaa01',
'aaa02',
'aaa03',
'aaa04',
'ZZZ00',
'aaa05',
'aaa06',
'aaa07',
'аааа',
'АААА',
'ЯЯЯЯ',
'яяяя'
];
Nataraj
Though, expected result is
$VAR1 = [
'aaa01',
'aaa02',
'aaa03',
'aaa04',
'aaa05',
'aaa06',
'aaa07',
'ZZZ00',
'АААА',
'ЯЯЯЯ',
'аааа',
'яяяя'
];
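A minimal sketch of one way to get the expected order shown above, using only core modules: encode each title to UTF-8 bytes, fold only ASCII A-Z with tr///, and compare the resulting byte strings. The helper name `stardict_key` is hypothetical, and the tie-break on the unfolded bytes (for titles that differ only in ASCII case) is an assumption, not something confirmed in this thread.

```perl
use strict;
use warnings;
use utf8;                 # the string literals below are character strings
use Encode qw(encode);

# Hypothetical helper: returns (ASCII-folded bytes, raw bytes) for a title.
sub stardict_key {
    my ($title) = @_;
    my $bytes = encode('UTF-8', $title);    # characters -> UTF-8 bytes
    (my $folded = $bytes) =~ tr/A-Z/a-z/;   # lowercase ASCII letters only
    return ($folded, $bytes);
}

my @titles = (
    'aaa01', 'aaa02', 'aaa03', 'aaa04', 'ZZZ00', 'aaa05',
    'aaa06', 'aaa07', 'аааа', 'АААА', 'ЯЯЯЯ', 'яяяя',
);

# Compute each key once instead of re-encoding inside the comparator.
my %key = map { $_ => [ stardict_key($_) ] } @titles;

my @sorted = sort {
    $key{$a}[0] cmp $key{$b}[0]      # byte-wise compare of folded bytes
        or
    $key{$a}[1] cmp $key{$b}[1]      # assumed tie-break on the raw bytes
} @titles;

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @sorted;
```

Because the case folding happens after encoding and touches only the bytes 0x41..0x5A, Cyrillic and other non-Latin titles keep their raw UTF-8 byte order.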
Nataraj: make sure to always clearly understand if you are working with bytes or characters
Nataraj: if your module is going to expose an API that takes strings, make sure to clearly document if you expect bytes or characters as input (it is characters in your case I think)
Grinnz (IRC)
xq: that means you will be working with the internally stored string, which may or may not be the actual value of the string
so it's a stupid idea
xq (IRC)
I don't like it, but it's literally what perldoc suggests
Grinnz (IRC)
it's not a suggestion, it's saying what happens *if* it's there
xq (IRC)
the alternative would be to manually do something like tr/A-Z/a-z/r
xq (IRC)
Grinnz: what would you suggest?
Grinnz (IRC)
it's tricky because you're trying to apply a character operation to a byte string
the tr/// is probably the cleanest
xq (IRC)
Nataraj: see above, it is probably beneficial to use tr/// instead of lc() + use bytes
Nataraj
xq: yep, I am following the discussion
Grinnz (IRC)
alternatively, if you wanted to use bytes for this, you'd need to apply it only to the two lc() calls, and not the cmp
that's why it's broken
for lc itself it happens to not break because then it only operates on nonvariant ascii codepoints
but every other string operation will break
(also i wouldn't be surprised to see "use bytes" get deprecated at some point)
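A small sketch of why tr/// looks like the safer choice for ASCII-only folding of byte strings. It assumes a script where neither `use locale` nor the `unicode_strings` feature is in effect (note that `use v5.12` or later turns that feature on). Under that assumption, lc() on the same byte values gives different results depending on how Perl happens to store the string internally, while tr/A-Z/a-z/ does not.

```perl
use strict;
use warnings;
use Encode qw(encode);

# UTF-8 bytes of CYRILLIC CAPITAL LETTER YA (U+042F): 0xD0 0xAF
my $bytes = encode('UTF-8', "\x{42F}");

# The same two values, but stored in Perl's upgraded internal form.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# lc() follows ASCII rules for the plain string, but Unicode rules for the
# upgraded one, where 0xD0 is treated as U+00D0 and becomes U+00F0.
my $lc_plain    = lc $bytes;
my $lc_upgraded = lc $upgraded;

# tr/// only maps the listed ASCII range, whatever the internal storage.
(my $tr_plain    = $bytes)    =~ tr/A-Z/a-z/;
(my $tr_upgraded = $upgraded) =~ tr/A-Z/a-z/;

# %vX prints the ordinal of each element in hex, joined by dots.
printf "lc, plain:    %vX\n", $lc_plain;
printf "lc, upgraded: %vX\n", $lc_upgraded;
printf "tr, plain:    %vX\n", $tr_plain;
printf "tr, upgraded: %vX\n", $tr_upgraded;
```

The two strings compare equal with `eq` before the lc() call, so the difference comes purely from the internal representation, which calling code normally cannot control.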
additionally, there are still a couple of ways to simplify your code and/or avoid reinventing wheels of varied complexity
instead of writing your own constructor, use Moo
instead of using various methods of path, directory tree and filehandle manipulations, depend on Path::Tiny
for time formatting you can use Time::Piece (core module)
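As an illustration of the Time::Piece suggestion, a minimal sketch that formats a date in the same shape as the synopsis uses. The '%Y-%m-%d' pattern is an assumption taken from the synopsis, not a statement about what the module writes to disk, and the fixed epoch is only there to make the output reproducible; real code would more likely start from localtime.

```perl
use strict;
use warnings;
use Time::Piece;   # core module; overrides gmtime/localtime to return objects

# 1609372800 is 2020-12-31 00:00:00 UTC.
my $stamp = gmtime(1609372800)->strftime('%Y-%m-%d');

print "$stamp\n";
```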
perlbot (IRC)
xq: ["key"]
xq (IRC)
anything particularly wrong here?
I don't see what you gain by handles_via compared to this, can you explain it?
Nataraj: you have some unnecessary includes, for example https://gitlab.com/dhyannataraj/lingua-stardict-writer-perl/-/blob/151cc873b919f6a4f3b181cbc213e8d25940421c/lib/Lingua/StarDict/Writer.pm#L10 and https://gitlab.com/dhyannataraj/lingua-stardict-writer-perl/-/blob/151cc873b919f6a4f3b181cbc213e8d25940421c/lib/Lingua/StarDict/Writer/Entry.pm#L7 also in the Makefile
Nataraj: you should probably be using 'exists' to check instead of this https://gitlab.com/dhyannataraj/lingua-stardict-writer-perl/-/blob/151cc873b919f6a4f3b181cbc213e8d25940421c/lib/Lingua/StarDict/Writer.pm#L173
Nataraj: this block looks like it has no use, you can probably do just my @ordered_keys = sort ... here https://gitlab.com/dhyannataraj/lingua-stardict-writer-perl/-/blob/151cc873b919f6a4f3b181cbc213e8d25940421c/lib/Lingua/StarDict/Writer.pm#L197
Nataraj: if you want to refer to me, 'xq from freenode #perl' is fine
xq (IRC)
Nataraj: another thing is, you are targeting perl 5.6 but your dependency Unicode::UTF8 requires at least 5.8 https://metacpan.org/source/CHANSEN/Unicode-UTF8-0.62/META.yml#L28