Given a Hacker News page, scrape the HTML to extract the contents. For example, get the title and the "hide" URL, etc., so that one can automatically match the titles against a regular expression then "hide" stories about Elon Musk, James Damore, react.js, Google memos, or other tedious things and people.
This is an HTML scraper and not related to WebService::HackerNews by Neil Bower. Note that Hacker News uses tables and "center" tags for layout, with no particular logical subdivision.
C compilers typically don't warn about daft mistakes like putting function arguments in the wrong order, or using the wrong size of a variable, or things like not checking the return value of malloc, etc. This module would check for typical errors in C programs like switch fallthroughs, use of equals instead of == in an if statement, insist on using braces with if statements, bad if statement indentation, like
if (x) printf ("reached"); printf ("reached even if x is not true, despite this indentation");
At the moment I have a script which does something like the above, wondering whether it would be worth working up into a module.
I've been developing a C program which indexes files by making trigrams of the contents of files. It's working reasonably well now and I'm thinking of extending it to a Perl version which could be used to index files, database entries and other things.
Get world-wide web pages (make HTTP requests) with caching to local file system, correct if-modified and correctly handling compression of requests.
There are a lot of web scrapers on CPAN including Scrappy, Web::Scraper, and some others. However, these don't do what I want, which is to get a web page only if necessary, use a local cache if possible, always handle gzip requests. This would be built on top of LWP::UserAgent and friends, with the option to use another user agent module if necessary.
Unlike other modules, this would not handle HTML parsing but just get the page from the web.
I'm hoping I don't have to write this module but will suddenly find that a solution already exists on CPAN which I've somehow overlooked.