new module: HTML::TableExtract
I'd like to register a new module name, please:

  Name              DSLI  Description                        Info
  ----              ----  -----------                        ----
  HTML::
  TableExtract      adpf  Flexible HTML table extraction     MSISK

This is a subclass of HTML::Parser, and does just what it says. Perhaps the most powerful feature is that you can specify tables of interest using a list of headers you expect to see in the table. Using this method, the module will return vertical slices of the table, ordered in the same order as you specified with the headers, even though in the actual table the columns might be in a different order. In this way you can extract information based on what the document is communicating rather than some particular HTML layout. You can also extract tables based on depth and count information, or just extract all tables.

I've included the documentation below. If you would like to experiment with the module, you can find it in one of the following locations:

  http://www.cpan.org/authors/id/M/MS/MSISK/
  http://www.mojotoad.com/sisk/projects/HTML-TableExtract/

Thanks,
Matt Sisk
[EMAIL PROTECTED]

NAME
    HTML::TableExtract - Perl extension for extracting the text contained
    in tables within an HTML document.

SYNOPSIS
      # Using column header information. Assume an HTML document
      # with a table which has "Date", "Price", and "Cost"
      # somewhere in a row. The columns beneath those headings are
      # what you are interested in.

      use HTML::TableExtract;
      $te = new HTML::TableExtract( headers => [qw(Date Price Cost)] );
      $te->parse($html_string);

      # rows() assumes the first table found in the document if no
      # table is provided. Since automap is enabled by default,
      # each row is returned in the same column order as we
      # specified for our headers. Otherwise, we would have to rely
      # on $te->column_order to figure out the column in which each
      # header was found.
      foreach $row ($te->rows) {
         print join(',', @$row), "\n";
      }

      # Using depth and count information. In this example, our
      # tables must be within two other tables, plus be the third
      # table at that depth within those tables. In other words,
      # wherever there exists a table within a table that contains
      # a cell with at least three tables in sequence, we grab
      # the third table. Depth and count both begin with 0.
      $te = new HTML::TableExtract( depth => 2, count => 2 );
      $te->parse($html_string);
      foreach ($te->tables) {
         print "Table found at ", join(',', $te->table_coords($_)), ":\n";
         foreach ($te->rows($_)) {
            print "   ", join(',', @$_), "\n";
         }
      }

DESCRIPTION
    HTML::TableExtract is a subclass of HTML::Parser that serves to extract
    the textual information from tables of interest contained within an
    HTML document. The textual information for each table is stored in an
    array of arrays that represent the rows and cells of that table.

    There are three ways to specify which tables you would like to extract
    from a document: *Headers*, *Depth*, and *Count*.

    *Headers*, the most flexible and adaptive of the techniques, involves
    specifying text in an array that you expect to appear above the data in
    the tables of interest. Once all headers have been located in a row of
    that table, all further cells beneath the columns that matched your
    headers are extracted. All other columns are ignored: think of it as
    vertical slices through a table. In addition, HTML::TableExtract
    automatically rearranges each row in the same order as the headers you
    provided. If you would like to disable this, set *automap* to 0 during
    object creation, and instead rely on the column_map() method to find
    out the order in which the headers were found.
    *Depth* and *Count* are more specific ways to specify tables, and they
    depend more on the HTML document layout. *Depth* represents how deeply
    a table resides in other tables. The depth of a top-level table in the
    document is 0. A table within a top-level table has a depth of 1, and
    so on. *Count* represents which table at a particular depth you are
    interested in, starting with 0.

    Each of the *Headers*, *Depth*, and *Count* specifications is
    cumulative in its effect on the overall extraction. For instance, if
    you specify only a *Depth*, then you get all tables at that depth (note
    that these could very well reside in separate higher-level tables
    throughout the document). If you specify only a *Count*, then the
    tables at that *Count* from all depths are returned. If you only
    specify *Headers*, then you get all tables in the document matching
    those header characteristics. If you have specified multiple
    characteristics, then each characteristic has veto power over whether a
    particular table is extracted.
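[As a minimal illustrative sketch of how these criteria combine -- not part of
the module's documentation, and using the same $html_string as the synopsis --
the following would extract only tables one level deep whose header row
contains both "Date" and "Price":]

  use HTML::TableExtract;

  # Both criteria must match: tables at depth 1 whose headers
  # include "Date" and "Price".
  $te = new HTML::TableExtract( headers => [qw(Date Price)],
                                depth   => 1 );
  $te->parse($html_string);

  foreach $table ($te->tables) {
     print "Table found at ", join(',', $te->table_coords($table)), ":\n";
     foreach $row ($te->rows($table)) {
        print "   ", join(',', @$row), "\n";
     }
  }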
new bundle: Finance-QuoteHist
I'd like to register *another* set of modules as well, please. These modules, based on LWP::UserAgent, allow you to fetch historical stock quotes from the web. I understand that some of the derived module names end up going into the realm of "three names", but in my estimation this will be better as more site-specific instances get added. Please let me know if this is a mondo faux pas or just plain ugly. They all live beneath the Finance namespace currently:

  Name              DSLI  Description                                    Info
  ----              ----  -----------                                    ----
  Finance::
  HistQuote         bdpO  Historical stock quotes from multiple sites   MSISK

  Finance::
  HistQuote::
  Generic           bdpO  Historical stock quote base class             MSISK

  Finance::
  HistQuote::
  MotleyFool        bdpO  Historical stock quotes from the Motley Fool  MSISK

  Finance::
  HistQuote::
  FinancialWeb      bdpO  Historical stock quotes from FinancialWeb     MSISK

The idea here is that site-specific instances all derive from the Generic base class. One of the properties of these classes is that you can specify a "lineup" of other site-specific classes to try in the event the first class fails in its attempt to retrieve quotes. Finance::HistQuote, the top-level class, is merely an aggregator that defaults to a particular lineup automatically, but otherwise behaves as though it were an instance of the first site-specific class in the lineup.

I really don't know what to do about the long module names in the site-specific cases, unless it involves moving them out of the Finance category, which would be a shame since that is where they seem to belong, along with Finance::FoolQuote and others of that ilk.

At any rate, if you would like to see more details, the documentation and distribution are available here:

  http://www.mojotoad.com/sisk/projects/Finance-QuoteHist/

Thanks for your time,
Matt Sisk
[EMAIL PROTECTED]
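[A rough sketch of how the lineup described above might be used from an
application. The constructor parameter name is hypothetical and chosen only to
illustrate the failover idea; consult the linked distribution for the real
interface.]

  use Finance::HistQuote;

  # Ask the aggregator for quotes; if the first site-specific class
  # in the lineup fails to retrieve them, the next one is tried, and
  # so on down the line.
  my $q = Finance::HistQuote->new(
      lineup => [ qw(Finance::HistQuote::MotleyFool
                     Finance::HistQuote::FinancialWeb) ],
  );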
Re: new bundle: Finance-QuoteHist
Matt Sisk writes:

> I'd like to register *another* set of modules as well, please. These
> modules, based on LWP::UserAgent, allow you to fetch historical stock
> quotes from the web. I understand that some of the derived module names
> end up going into the realm of "three names", but in my estimation this
> will be better as more site-specific instances get added. Please let me
> know if this is a mondo faux pas or just plain ugly. They all live
> beneath the Finance namespace currently:
>
>   Name              DSLI  Description                                    Info
>   ----              ----  -----------                                    ----
>   Finance::
>   HistQuote         bdpO  Historical stock quotes from multiple sites   MSISK
>
>   Finance::
>   HistQuote::
>   Generic           bdpO  Historical stock quote base class             MSISK
>
>   Finance::
>   HistQuote::
>   MotleyFool        bdpO  Historical stock quotes from the Motley Fool  MSISK
>
>   Finance::
>   HistQuote::
>   FinancialWeb      bdpO  Historical stock quotes from FinancialWeb     MSISK

I don't much like the idea of encoding the names of communication
endpoints to class names. Net::FTP::GatekeeperDecCom, HTTP::SlashdotOrg,
what next? The knowledge of how to contact/parse the "session" of
a service is *data* (however complex, it's still data), not code.
And a class name is (usually) much closer to data, in my mind.

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
Re: new bundle: Finance-QuoteHist
Hello again, Jarkko!

Jarkko Hietaniemi wrote:
> I don't much like the idea of encoding the names of communication
> endpoints to class names. Net::FTP::GatekeeperDecCom, HTTP::SlashdotOrg,
> what next? The knowledge of how to contact/parse the "session" of
> a service is *data* (however complex, it's still data), not code.
> And a class name is (usually) much closer to data, in my mind.

Well, I can't say I'm entirely comfortable with it myself, and I'm open to suggestions. To me, the three issues are these:

1) Each site represents its data in a particular way -- yes, this is still just data, as you point out, but there needs to be a practical way to represent the "bag of tricks" necessary for that particular data source in a consistent, expandable way (by other people besides just me).

2) The "lineup" of the data sources (sites) that have been implemented should be arbitrary and configurable on a per-application basis. By "lineup" I refer to the failover redundancy list of these specific classes that each class supports. If the first class fails, it tries the next data source, and so on down the line.

3) It would be nice if user-contributed site-specific expansions were available to the rest of us, even without my having to include them in the main bundle.

If I were distributing an application, rather than modules, my first instinct would be to distribute some sort of configuration data file that described the characteristics of each site, although this does not adequately address point #3. At least in this way, however, users could add sites without having to add classes.

I suppose I took my cue from Finance::YahooQuote, which provides stock quotes specifically from Yahoo.

Does anyone have any thoughts on how to best encapsulate the three points I mention above without using source-specific classes? I'm all for it. I might add, the primary function of the site-specific classes is to override the urls() method of the generic class -- so whatever solutions are proposed should be able to adequately (and safely) produce code for that method.

Thanks,
Matt Sisk
[EMAIL PROTECTED]
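[A rough, hypothetical sketch of what such a site-specific override might look
like. Only the urls() method name and the class names come from the posts
above; the method signature and the example URL are purely illustrative.]

  package Finance::HistQuote::MotleyFool;

  use strict;
  use base 'Finance::HistQuote::Generic';

  # The Generic base class does the fetching (via LWP::UserAgent) and
  # parsing; the subclass's main job is to say where to look.
  sub urls {
      my ($self, $symbol, $start_date, $end_date) = @_;
      # Hypothetical query URL -- the real format is site-specific.
      return ("http://example.com/history?sym=$symbol&start=$start_date&end=$end_date");
  }

  1;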
File Replication
I have written a module I use for maintaining replicas of files and file structures in different file systems. I'd like to submit this to CPAN. Under the module naming convention it would look like this, I think:

  Name              DSLI  Description                                    Info
  ----              ----  -----------                                    ----
  File::Repl        cDpO  File and file structure replication utility   DROBERTS

The utility has been written for Win32 - but has no dependencies on that architecture. I'd appreciate any guidance you can give.

This memo was sent a few days ago - it is being resent because of a spamming incident (when some outgoing and incoming mails were lost). If you replied to my first memo, please will you re-send.

Thanks,
Dave
Module to contribute - Text::split_csv
Name: Doug Munsinger
E-mail: [EMAIL PROTECTED]
Preferred ID: MUNSINGER

Description of what I plan to contribute:

A module that splits report .csv-format files vertically into smaller files. It takes one required and two optional arguments. Required: a filename. Optional: the number of columns to place in each smaller file (default is 100), and the number of first cells of each row to retain in every file (default is 1). It offers a single subroutine that can process the arguments and call three additional subs, or these can be accessed directly.

This was worked out to handle large installation-system reporting files which were no longer workable at full size in Excel or HTML table format.

Module description:

  Name              DSLI  Description                                        Info
  ----              ----  -----------                                        ----
  Text::
  split_csv         Rdpf  divides .csv format reports in vertical sections   MUNSINGER

Doug Munsinger
Operating Systems Engineer
Fidelity Investments Systems Company
400 Puritan Way, Mailcode M2F
Marlborough, MA 01752
[EMAIL PROTECTED]
508-787-7389
pager 800-759- #1135069
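[A rough, hypothetical sketch of the kind of vertical split described above.
This is not the module's actual code; the file names and layout are
illustrative only. It splits a wide .csv report into files of at most 100 data
columns each, repeating the first cell of every row in each output file.]

  use strict;

  my ($file, $cols_per_file, $keep_cells) = ('report.csv', 100, 1);

  # Slurp the report into an array of row arrays. A plain split on
  # commas is used here; quoted fields containing commas would need a
  # real CSV parser.
  open(IN, "< $file") or die "can't open $file: $!";
  my @rows = map { chomp; [ split /,/, $_, -1 ] } <IN>;
  close IN;

  my $total = @{ $rows[0] };
  my $part  = 0;
  for (my $start = $keep_cells; $start < $total; $start += $cols_per_file) {
      my $end = $start + $cols_per_file - 1;
      $end = $total - 1 if $end > $total - 1;
      my $out = sprintf("report_part%02d.csv", ++$part);
      open(OUT, "> $out") or die "can't write $out: $!";
      for my $row (@rows) {
          # Leading cells first, then this file's slice of columns.
          print OUT join(',', @{$row}[0 .. $keep_cells - 1],
                               @{$row}[$start .. $end]), "\n";
      }
      close OUT;
  }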
User update for PEASE
(This Mail was generated by the server
http://p11.speedlink.de/pause/authenquery;ACTION=edit_cred
automatically)

Record update in the PAUSE users database:

userid:          [PEASE]
fullname:        [Mark Pease]
email:           [[EMAIL PROTECTED]] was [[EMAIL PROTECTED]]
homepage:        []
cpan_mail_alias: [secr] was [publ]

Data were entered by PEASE (Mark Pease).
Please check if they are correct.

Thanks,
The Pause
Why doesn't my module get included?
Hi!

I wrote the following at the beginning of December, got successfully registered as PFEIFFER, but nothing else happened, despite an unanswered mail asking about the progress. What's up?

Now that it's reached a good level with version 0.5, I think it's time to have iPerl be listed in CPAN. You need not keep the source, since I have my own download area, which I'll be keeping up to date.

Inverse Perl means instead of having long strings in short Perl code, rather having a big string (the document) and dispersing the Perl code in it. What I submit are essentially three files: a module that does almost everything, Text::iPerl; a command-line interface, iperl (just a getopts front end); and a CGI front end, web-iperl.

I do not see this in the File:: or IO:: trees, because it is hardly concerned with where the processed document comes from. Nor in the Filter:: tree, though in the wide Unix sense of the word it is a filter. Nor in the Parse:: tree, because more than only a parser it defines a Perl-based language that resides in arbitrary text documents.

name - Daniel Pfeiffer
email address - [EMAIL PROTECTED]
homepage - http://beam.to/iPerl/
your preferred user-ID on CPAN - PFEIFFER
description in module list format -

  Text::iPerl       adpf  Bring text-docs to life via embedded Perl

--
Bring text-docs to life!  Erwecke Textdokumente zum Leben!
http://beam.to/iPerl/     Vivigu tekstodokumentojn!
Where for art thou, Andreas Koenig?
I haven't seen any activity from Andreas on this list since November 3, 1999. Has anyone checked his house to make sure he's okay? ;-)

Any chance of him emerging from busy mode any time soon?

--
Matt Sisk
[EMAIL PROTECTED]
User update for MRJC
(This Mail was generated by the server
http://p11.speedlink.de/pause/authenquery;ACTION=edit_cred
automatically)

Record update in the PAUSE users database:

userid:          [MRJC]
fullname:        [Martin R.J. Cleaver] was [Martin RJ Cleaver]
email:           [[EMAIL PROTECTED]]
homepage:        [http://www.mrjc.com/] was [http://www.hkstar.com/~mrjc]
cpan_mail_alias: [publ]

Data were entered by MRJC (Martin RJ Cleaver).
Please check if they are correct.

Thanks,
The Pause
User update for CHAMAS
(This Mail was generated by the server
https://pause.kbx.de/pause/authenquery;ACTION=edit_cred
automatically)

Record update in the PAUSE users database:

userid:          [CHAMAS]
fullname:        [Joshua Chamas]
email:           [[EMAIL PROTECTED]]
homepage:        [http://www.chamas.com] was []
cpan_mail_alias: [publ]

Data were entered by CHAMAS (Joshua Chamas).
Please check if they are correct.

Thanks,
The Pause
User update for NPESKETT
(This Mail was generated by the server
https://pause.kbx.de/pause/authenquery;ACTION=edit_cred
automatically)

Record update in the PAUSE users database:

userid:          [NPESKETT]
fullname:        [Nick Peskett]
email:           [[EMAIL PROTECTED]] was [[EMAIL PROTECTED]]
homepage:        []
cpan_mail_alias: [publ]

Data were entered by NPESKETT (Nick Peskett).
Please check if they are correct.

Thanks,
The Pause
User update for RFOLEY
(This Mail was generated by the server
https://pause.kbx.de/pause/authenquery;ACTION=edit_cred
automatically)

Record update in the PAUSE users database:

userid:          [RFOLEY]
fullname:        [Richard Foley]
email:           [[EMAIL PROTECTED]] was [[EMAIL PROTECTED]]
homepage:        []
cpan_mail_alias: [publ]

Data were entered by ANDK (Andreas J. König).
Please check if they are correct.

Thanks,
The Pause