PrePAN

Geo::Postcodes::JP Parse and search functions for Japanese postcodes

Synopsis

# Get the files from the post office website
use Geo::Postcodes::JP::Update qw/update_files/;

# Parse the files
use Geo::Postcodes::JP::Process qw/read_ken_all_csv read_jigyosyo_csv/;

# Search the postcode data
use Geo::Postcodes::JP qw/postcode_to_address address_to_postcode/;

Description

Geo::Postcodes::JP::Update

Check the postcode download site for updates to the files.

Download the updated postcodes.

Geo::Postcodes::JP::Process

Parse the KEN_ALL.CSV and jigyosyo.csv files and eliminate the problems in them: rewrite the "sonota" ("other") and "ika ni nai baai" ("when not listed below") entries in some more useful form; turn the multiline entries into single entries; handle the duplicate cases; and handle business postcodes which don't correspond to entries in the main file (they have a different address which is not in the address file). If possible, convert the inconsistent "(kabu)" and "(kabushikigaisha)" entries to a single form. Convert the half-width katakana entries into standard kana (use MeCab module?).
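
For the half-width katakana step, one possible alternative to MeCab is Unicode compatibility normalization. The sketch below uses NFKC from the core Unicode::Normalize module. The helper name normalize_kana is hypothetical and is not part of the proposed module. Note that NFKC also changes other compatibility characters (full-width digits become ASCII digits, for example), which may or may not be what is wanted here.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw/NFKC/;

# Hypothetical helper: fold half-width katakana (and other compatibility
# characters) into their standard forms using NFKC normalization.
sub normalize_kana {
    my ($text) = @_;
    return NFKC($text);
}

binmode STDOUT, ':encoding(UTF-8)';
# The half-width input has a separate voicing mark after the "to" kana;
# NFKC combines the two into a single full-width character.
print normalize_kana('ﾎｯｶｲﾄﾞｳ'), "\n";
```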

Put the postcode data into a SQLite database or other handy format.

Geo::Postcodes::JP

Offer searches of the postcode data.


As is well known, the Japanese postal code data files supplied by the post office have some problems. A major goal of the module is to take the "bad" file from the post office and turn it into a "clean" file of some kind.

The output file might be CSV or SQLite, or maybe both kinds of file.
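
One of the clean-up steps is turning multiline entries into single entries. A minimal sketch of how that might look, assuming that continuation rows repeat the postcode of the row they continue and that an entry is incomplete while its town field has an unclosed full-width parenthesis. The [postcode, town] record layout and the function names are hypothetical, and real CSV parsing is left out.

```perl
use strict;
use warnings;
use utf8;

# Count occurrences of a character in a string.
sub count_char {
    my ($char, $text) = @_;
    my $count = () = $text =~ /\Q$char\E/g;
    return $count;
}

# Merge continuation rows into single entries.
# Each row is a hypothetical simplified record: [postcode, town].
sub merge_multiline {
    my @rows = @_;
    my @merged;
    my $open;    # entry whose parenthesis has not been closed yet
    for my $row (@rows) {
        my ($postcode, $town) = @$row;
        if ($open && $open->[0] eq $postcode) {
            # Continuation of the previous row: append the town text.
            $open->[1] .= $town;
        }
        else {
            $open = [$postcode, $town];
            push @merged, $open;
        }
        # Once the parentheses balance, the entry is complete.
        if (count_char('（', $open->[1]) <= count_char('）', $open->[1])) {
            undef $open;
        }
    }
    return @merged;
}

# Made-up sample rows, not real post office data.
my @rows = (
    ['0000001', 'Ａ町（１、２、'],
    ['0000001', '３）'],
    ['0000002', 'Ｂ町'],
);
binmode STDOUT, ':encoding(UTF-8)';
print join(',', @$_), "\n" for merge_multiline(@rows);
```

Rows that share a postcode but have balanced parentheses are kept as separate entries, so the duplicate cases would still need their own handling.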

Comments

FYI, I've used binary search on CSV (via persistent indexes) in a similar module for Brazilian postcodes: https://metacpan.org/source/SYP/Geo-CEP-0.4/lib/Geo/CEP.pm
@creaktive: Thanks for this link to the source code. The Japanese postal code data files supplied by the post office have some big inherent problems, so one of the main goals of the module is to take the bad file from the post office and turn it into a clean file of some kind. I could turn it into CSV, but I thought SQLite would be preferable. Maybe there should be an option for both kinds. Also, it looks like the Brazilian file is much smaller than the Japanese one? There are about 200,000 lines in the KEN_ALL.CSV file, and the business office file has about 20,000 lines.

As far as namespacing goes, I think Geo::Japan:: is a good namespace.
Does this type of postcode module already exist on CPAN for other countries?

I had a quick look and have already found Geo::Postcode for the UK, Geo::Postcodes::NO for Norway and Geo::Postcodes::DK for Denmark etc. Perhaps following the API (or even better inheriting from Geo::Postcodes as at least one of those already does) would be a good starting point.

Do the postcodes change very often? I would possibly recommend separating out the downloading and parsing of the source from the query interface. You could perhaps provide some extra methods for the download option, or perhaps even suggest to the Geo::Postcodes author a new abstract method for db_retrieval() which you could implement in a child class of Geo::Postcodes::JP?
In fact, I see that http://search.cpan.org/~arne/Geo-Postcodes-0.32/lib/Geo/Postcodes/Update.pm is already something along those lines.
@mlawren: Thank you for your comment. I already knew about Geo::Postcode but hadn't noticed Geo::Postcodes. I'm glad to find out about it. I had a look at Geo::Postcodes. If there is an existing namespace I will use it, so maybe "Geo::Postcodes::JA" for the module. I want to use the same sort of API if possible. Subclassing from it might not lead to a perfect result, though. For example, there are about 200,000 postcodes for Japan, so Geo::Postcodes methods like "get_postcodes" might cause some problems, and Japanese geography (prefectures etc.) doesn't map onto that module. Your notion of breaking the module into a parsing submodule and a user module is definitely a good idea. The postcode file is updated quite a lot, not so much because the postcodes change but because the administrative regions change.
To continue the existing namespace conversation....

Your choice of "JA" for abbreviating Japan is something new. The ISO 3166-1 standard 2-letter abbreviation for the nation of Japan is "JP" (see http://en.wikipedia.org/wiki/ISO_3166-1), and no doubt you already know of the .jp top-level domain in the DNS. Worth staying consistent I would think.

You might also be interested in ISO codes for the names of the principal subdivisions: http://en.wikipedia.org/wiki/ISO_3166-2:JP. Your point about the large number of postcodes is a good one. Perhaps you could come up with a per-region interface which could be retrofitted onto Geo::Postcodes...
Actually JA is not something new; it is used on CPAN in things like the Lingua::JA:: namespace, and many others. However, I imagine that the people who chose it didn't really know about the country codes. I think it's better to use JP. I'm not that interested in compatibility with Geo::Postcodes beyond the namespace, since there are only Norway and Denmark at the moment, and the module has not been updated since 2006. I doubt whether a lot of people are really using that format.
I've just found another module:

http://search.cpan.org/perldoc?Number::ZipCode::JP

This seems to serve as a validator using the same file of data.
@benkasminbullock I'm not sure how the Japanese file is organized, but the Brazilian one is range-based, that is, each city has its own block of postcodes. Suburbs do not have that level of consistency here (sometimes no consistency at all) *AND* the precise data is not public :(
@creaktive: The postcodes are a modification of an older three-digit format to a seven-digit format. The big problem is parsing the data file, which is not in a 100% machine-readable format (for example, one entry can be spread over several lines of the file, so it's necessary to unify the lines into one entry).
Nobody has commented recently so I have put a skeleton on the web as demonstrated above.
I found another repo for a Perl module with a similar sort of purpose:

https://github.com/yappo/p5-Geography-AddressExtract-Japan

Seems like this was developed into a commercial application:

http://yapcasia.org/2012/talk/show/767bd582-abb3-11e1-85bd-57a46aeab6a4
anonymouse
Anonymous
Hi, does anybody know how the search function used at http://www.geopostcodes.com/ is coded? Where can I find an example of the source code?
