new module: HTML::TableExtract

2000-02-03 Thread Matt Sisk

I'd like to register a new module name, please:

Name                DSLI  Description                      Info
----                ----  -----------                      ----
HTML::TableExtract  adpf  Flexible HTML table extraction   MSISK


This is a subclass of HTML::Parser, and does just what it says. Perhaps
the most powerful feature is that you can specify tables of interest
using a list of headers you expect to see in the table. Using this
method, the module returns vertical slices of the table, with the
columns in the same order as you specified the headers, even though in
the actual table the columns might be in a different order. In this way
you can extract information based on what the document is communicating
rather than on some particular HTML layout.

You can also extract tables based on depth and count information, or
just extract all tables.

I've included the documentation below. If you would like to experiment
with the module, you can find it in one of the following locations:

   http://www.cpan.org/authors/id/M/MS/MSISK/
   http://www.mojotoad.com/sisk/projects/HTML-TableExtract/

Thanks,
Matt Sisk
[EMAIL PROTECTED]




NAME
HTML::TableExtract - Perl extension for extracting the text
contained in tables within an HTML document.

SYNOPSIS
 # Using column header information.  Assume an HTML document
 # with a table which has "Date", "Price", and "Cost"
 # somewhere in a row. The columns beneath those headings are
 # what you are interested in.

 use HTML::TableExtract;
 $te = new HTML::TableExtract( headers => [qw(Date Price Cost)] );
 $te->parse($html_string);

 # rows() assumes the first table found in the document if no
 # table is provided. Since automap is enabled by default,
 # each row is returned in the same column order as we
 # specified for our headers. Otherwise, we would have to rely
 # on $te->column_map to figure out the column in which each
 # header was found.

 foreach $row ($te->rows) {
    print join(',', @$row), "\n";
 }

 # Using depth and count information.  In this example, our
 # tables must be within two other tables, plus be the third
 # table at that depth within those tables.  In other words,
 # wherever there exists a table within a table that contains
 # a cell with at least three tables in sequence, we grab
 # the third table. Depth and count both begin with 0.

 $te = new HTML::TableExtract( depth => 2, count => 2 );
 $te->parse($html_string);
 foreach ($te->tables) {
    print "Table found at ", join(',', $te->table_coords($_)), ":\n";
    foreach ($te->rows($_)) {
       print "   ", join(',', @$_), "\n";
    }
 }

DESCRIPTION
HTML::TableExtract is a subclass of HTML::Parser that serves to
extract the textual information from tables of interest
contained within an HTML document. The textual information for
each table is stored in an array of arrays that represent the
rows and cells of that table.

There are three ways to specify which tables you would like to
extract from a document: *Headers*, *Depth*, and *Count*.

*Headers*, the most flexible and adaptive of the techniques,
involves specifying text in an array that you expect to appear
above the data in the tables of interest. Once all headers have
been located in a row of that table, all further cells beneath
the columns that matched your headers are extracted. All other
columns are ignored: think of it as vertical slices through a
table. In addition, HTML::TableExtract automatically rearranges
each row into the same order as the headers you provided. If you
would like to disable this, set *automap* to 0 during object
creation, and instead rely on the column_map() method to find
out the order in which the headers were found.

*Depth* and *Count* are more specific ways to specify tables, and
they depend more heavily on the HTML document layout. *Depth*
represents how deeply a table resides in other tables. The depth
of a top-level table in the document is 0. A table within a top-
level table has a depth of 1, and so on. *Count* represents
which table at a particular depth you are interested in,
starting with 0.

Each of the *Headers*, *Depth*, and *Count* specifications is
cumulative in its effect on the overall extraction. For
instance, if you specify only a *Depth*, then you get all tables
at that depth (note that these could very well reside in
separate higher-level tables throughout the document). If you
specify only a *Count*, then the tables at that *Count* from all
depths are returned. If you only specify *Headers*, then you get
all tables in the document matching those header
characteristics. If you have specified multiple characteristics,
then each characteristic has veto power over whether a
particular table is extracted.

new bundle: Finance-QuoteHist

2000-02-03 Thread Matt Sisk

I'd like to register *another* set of modules as well, please. These
modules, based on LWP::UserAgent, allow you to fetch historical stock
quotes from the web. I understand that some of the derived module names
end up going into the realm of "three names", but in my estimation this
will be better as more site-specific instances get added. Please let me
know if this is a mondo faux pas or just plain ugly. They all live
beneath the Finance namespace currently:

Name                              DSLI  Description                                   Info
----                              ----  -----------                                   ----
Finance::HistQuote                bdpO  Historical stock quotes from multiple sites   MSISK
Finance::HistQuote::Generic       bdpO  Historical stock quote base class             MSISK
Finance::HistQuote::MotleyFool    bdpO  Historical stock quotes from the Motley Fool  MSISK
Finance::HistQuote::FinancialWeb  bdpO  Historical stock quotes from FinancialWeb     MSISK


The idea here is that site-specific instances all derive from the
Generic base class. One of the properties of these classes is that you
can specify a "lineup" of other site-specific classes to try in the
event the first class fails in its attempt to retrieve quotes.

Finance::HistQuote, the top-level class, is merely an aggregator that
defaults to a particular lineup automatically, but otherwise behaves as
though it were an instance of the first site-specific class in the
lineup.
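
For illustration, using such a lineup might look something like the
sketch below. The constructor arguments and the quotes() method are
assumptions for the sake of the example, not the modules' confirmed
interface:

   # A minimal usage sketch, assuming a hash-style constructor and a
   # quotes() method returning rows; all names here are illustrative.
   use Finance::HistQuote;

   my $q = Finance::HistQuote->new(
       symbols    => ['IBM', 'MSFT'],
       start_date => '01/01/1999',
       end_date   => '12/31/1999',
       # Failover lineup: if the Motley Fool fails, try FinancialWeb.
       lineup     => ['Finance::HistQuote::MotleyFool',
                      'Finance::HistQuote::FinancialWeb'],
   );

   foreach my $row ($q->quotes) {
       print join(',', @$row), "\n";
   }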

I really don't know what to do about the long module names in the
site-specific cases, unless it involves moving them out of the Finance
category, which would be a shame since that is where they seem to
belong, along with Finance::FoolQuote and others of that ilk.

At any rate, if you would like to see more details, the
documentation and distribution are available here:

   http://www.mojotoad.com/sisk/projects/Finance-QuoteHist/


Thanks for your time,
Matt Sisk
[EMAIL PROTECTED]



Re: new bundle: Finance-QuoteHist

2000-02-03 Thread Jarkko Hietaniemi


Matt Sisk writes:
 > I'd like to register *another* set of modules as well, please. These
 > modules, based on LWP::UserAgent, allow you to fetch historical stock
 > quotes from the web. I understand that some of the derived module names
 > end up going into the realm of "three names", but in my estimation this
 > will be better as more site-specific instances get added. Please let me
 > know if this is a mondo faux pas or just plain ugly. They all live
 > beneath the Finance namespace currently:
 > 
 > Name                              DSLI  Description                                   Info
 > ----                              ----  -----------                                   ----
 > Finance::HistQuote                bdpO  Historical stock quotes from multiple sites   MSISK
 > Finance::HistQuote::Generic       bdpO  Historical stock quote base class             MSISK
 > Finance::HistQuote::MotleyFool    bdpO  Historical stock quotes from the Motley Fool  MSISK
 > Finance::HistQuote::FinancialWeb  bdpO  Historical stock quotes from FinancialWeb     MSISK

I don't much like the idea of encoding the names of communication
endpoints into class names.  Net::FTP::GatekeeperDecCom, HTTP::SlashdotOrg,
what next?  The knowledge of how to contact/parse the "session" of
a service is *data* (however complex, it's still data), not code.
And a class name is (usually) much closer to data, in my mind.
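
To make the distinction concrete, the data-driven alternative might
look something like this sketch; the %sites structure, the generic
constructor, and the URL patterns are hypothetical, purely for
illustration:

   # Per-site knowledge expressed as data rather than as classes.
   # Every name and URL pattern here is made up for illustration.
   my %sites = (
       motleyfool   => {
           url_template => 'http://quote.fool.com/historical?sym=%s',
           date_format  => 'mm/dd/yy',
       },
       financialweb => {
           url_template => 'http://www.financialweb.com/quotes?s=%s',
           date_format  => 'yyyy-mm-dd',
       },
   );

   # One generic class consumes the site description.
   my $q = Finance::HistQuote::Generic->new(site => $sites{motleyfool});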

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: new bundle: Finance-QuoteHist

2000-02-03 Thread Matt Sisk

Hello again, Jarkko!

Jarkko Hietaniemi wrote:
> I don't much like the idea of encoding the names of communication
> endpoints to class names.  Net::FTP::GatekeeperDecCom, HTTP::SlashdotOrg,
> what next?  The knowledge of how to contact/parse the "session" of
> a service is *data* (however complex, it's still data), not code.
> And a class name is (usually) much closer to data, in my mind.

Well, I can't say I'm entirely comfortable with it myself, and I'm open
to suggestions.

To me, the three issues are these:

  1) Each site represents its data in a particular way -- yes, this is
still just data, as you point out, but there needs to be a practical way
to represent the "bag of tricks" necessary for that particular data
source in a consistent, expandable way (by other people besides just
me).

  2) The "lineup" of the data sources (sites) that have been implemented
should be arbitrary and configurable on a per-application basis.  By
"lineup" I refer to the failover redundancy list of these specific
classes that each class supports. If the first class fails, it tries the
next data source, and so on down the line.

  3) It would be nice if user-contributed site-specific expansions were
available to the rest of us, even without my having to include them in
the main bundle.

If I were distributing an application, rather than modules, my first
instinct would be to distribute some sort of configuration data file
that described the characteristics of each site, although this does not
adequately address point #3. At least in this way, however, users could
add sites without having to add classes.

I suppose I took my cue from Finance::YahooQuote, which provides stock
quotes specifically from Yahoo.

Does anyone have any thoughts on how best to encapsulate the three
points I mention above without using source-specific classes? I'm all
for it. I might add that the primary function of the site-specific classes
is to override the urls() method of the generic class -- so whatever
solutions are proposed should be able to adequately (and safely) produce
code for that method.
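
As an illustration of the current arrangement, a site-specific subclass
might look roughly like this; apart from the urls() method mentioned
above, the names and the URL pattern are assumptions:

   # A minimal sketch of a site-specific subclass. Only urls() is
   # known from the discussion; everything else is illustrative.
   package Finance::HistQuote::MotleyFool;
   use strict;
   use vars qw(@ISA);
   use Finance::HistQuote::Generic;
   @ISA = qw(Finance::HistQuote::Generic);

   # Override urls() to produce this site's query URLs.
   sub urls {
       my ($self, $symbol, $start, $end) = @_;
       # Hypothetical URL pattern, for illustration only.
       return ("http://quote.fool.com/historical.csv?sym=$symbol" .
               "&start=$start&end=$end");
   }

   1;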

Thanks,
Matt Sisk
[EMAIL PROTECTED]



File Replication

2000-02-03 Thread Dave A Roberts



I have written a module I use for maintaining replicas of files and file
structures on different file systems. I'd like to submit this to CPAN.  Under
the module naming convention it would look like this, I think:


Name        DSLI  Description                                  Info
----        ----  -----------                                  ----
File::Repl  cDpO  File and file structure replication utility  DROBERTS


The utility has been written for Win32, but has no dependencies on that
architecture.  I'd appreciate any guidance you can give.
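
Purely as a hypothetical usage sketch (the message does not show
File::Repl's actual interface, so every name below is assumed):

   # Hypothetical sketch: keep a replica directory in sync with a
   # source tree. Constructor and method names are illustrative only.
   use File::Repl;

   my $repl = File::Repl->new('C:/data', '//server/backup/data');
   $repl->update;   # copy new and changed files to the replica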

This memo was sent a few days ago; it is being resent because of a spamming
incident (in which some outgoing and incoming mail was lost).  If you replied
to my first memo, please re-send.

Thanks


Dave






Module to contribute - Text::split_csv

2000-02-03 Thread Munsinger, Doug

Name: Doug Munsinger
E-mail: [EMAIL PROTECTED]
Preferred ID: MUNSINGER
Description of what I plan to contribute:
A module that splits report .csv-format files vertically into
smaller files.  Takes one required and two optional arguments.  Required:
the filename.  Optional: the number of columns to place in each smaller
file (default is 100), and the number of leading cells of each row to
retain in each file (default is 1).  Offers a single subroutine that can
process the arguments and call three additional subs, or these can be
accessed directly.  Written to handle large installation-system reporting
files which were no longer workable at full size in Excel or HTML table
format.
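
A rough sketch of the vertical split described above (the sub name and
interface are assumptions, not the module's actual API):

   # Split a .csv report vertically: each output file gets the first
   # $keep_cells columns of every row plus the next slice of columns.
   # Naive comma split -- quoted fields containing commas are not
   # handled in this sketch.
   sub split_csv {
       my ($file, $cols_per_file, $keep_cells) = @_;
       $cols_per_file = 100 unless defined $cols_per_file;
       $keep_cells    = 1   unless defined $keep_cells;

       open(IN, $file) or die "can't read $file: $!";
       my @rows = map { chomp; [ split /,/, $_, -1 ] } <IN>;
       close IN;

       my $width = @{$rows[0]};   # assume a rectangular report
       my $part  = 0;
       for (my $i = $keep_cells; $i < $width; $i += $cols_per_file) {
           my $end = $i + $cols_per_file - 1;
           $end = $width - 1 if $end > $width - 1;
           open(OUT, ">$file.part" . $part++)
               or die "can't write output: $!";
           foreach my $row (@rows) {
               print OUT join(',', @$row[0 .. $keep_cells - 1],
                                   @$row[$i .. $end]), "\n";
           }
           close OUT;
       }
   }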
Module description:

Name             DSLI  Description                                          Info
----             ----  -----------                                          ----
Text::split_csv  Rdpf  Divides .csv-format reports into vertical sections   MUNSINGER


Doug Munsinger
Operating Systems Engineer
Fidelity Investments Systems Company
400 Puritan Way, Mailcode M2F
Marlborough, MA 01752
[EMAIL PROTECTED]
508-787-7389
pager 800-759- #1135069




User update for PEASE

2000-02-03 Thread Perl Authors Upload Server

(This Mail was generated by the server
  http://p11.speedlink.de/pause/authenquery;ACTION=edit_cred
automatically)

Record update in the PAUSE users database:

 userid: [PEASE]
   fullname: [Mark Pease]
  email: [[EMAIL PROTECTED]] was [[EMAIL PROTECTED]]
   homepage: []
cpan_mail_alias: [secr] was [publ]


Data were entered by PEASE (Mark Pease).
Please check if they are correct.

Thanks,
The Pause



Why doesn't my module get included?

2000-02-03 Thread Daniel Pfeiffer
Hi!

I wrote the following at the beginning of December and got successfully registered as PFEIFFER, but nothing else happened, despite an unanswered mail asking about the progress.  What's up?

Now that it has reached a good level with version 0.5, I think it's time to have iPerl listed on CPAN.  You need not keep the source, since I have my own download area, which I'll be keeping up to date.

Inverse Perl means that instead of having long strings in short Perl code, you have one big string (the document) and disperse the Perl code throughout it.
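
A contrived illustration of the inversion (the <: ... :> delimiters are
made up for this example, not iPerl's actual syntax). Ordinary Perl
embeds the document in the code:

   printf "Dear %s, your balance is %.2f.\n", $name, $balance;

whereas the inverse style embeds the code in the document:

   Dear <: print $name :>, your balance is <: printf "%.2f", $balance :>.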

What I submit are essentially three files: a module that does almost everything, Text::iPerl; a command-line interface, iperl (just a getopts frontend); and a CGI frontend, web-iperl.

I do not see this in the File:: or IO:: trees, because it is hardly concerned with where the processed document comes from.  Nor in the Filter:: tree, though in the wide Unix sense of the word it is a filter.  Nor in the Parse:: tree, because, more than just a parser, it defines a Perl-based language that resides in arbitrary text documents.


name
- Daniel Pfeiffer

email address
- [EMAIL PROTECTED]

homepage
- http://beam.to/iPerl/

your preferred user-ID on CPAN
- PFEIFFER

description in module list format
- Text::iPerl	adpf	Bring text-docs to life via embedded Perl

-- 
Bring text-docs to life!  Erwecke Textdokumente zum Leben!
   http://beam.to/iPerl/
Vivigu tekstodokumentojn!


Where for art thou, Andreas Koenig?

2000-02-03 Thread Matt Sisk

I haven't seen any activity from Andreas on this list since November 3,
1999.

Has anyone checked his house to make sure he's okay?

;-)

Any chance of him emerging from busy mode any time soon?

-- 
Matt Sisk
[EMAIL PROTECTED]



User update for MRJC

2000-02-03 Thread Perl Authors Upload Server

(This Mail was generated by the server
  http://p11.speedlink.de/pause/authenquery;ACTION=edit_cred
automatically)

Record update in the PAUSE users database:

 userid: [MRJC]
   fullname: [Martin R.J. Cleaver] was [Martin RJ Cleaver]
  email: [[EMAIL PROTECTED]]
   homepage: [http://www.mrjc.com/] was [http://www.hkstar.com/~mrjc]
cpan_mail_alias: [publ]


Data were entered by MRJC (Martin RJ Cleaver).
Please check if they are correct.

Thanks,
The Pause



User update for CHAMAS

2000-02-03 Thread Perl Authors Upload Server

(This Mail was generated by the server
  https://pause.kbx.de/pause/authenquery;ACTION=edit_cred
automatically)

Record update in the PAUSE users database:

 userid: [CHAMAS]
   fullname: [Joshua Chamas]
  email: [[EMAIL PROTECTED]]
   homepage: [http://www.chamas.com] was []
cpan_mail_alias: [publ]


Data were entered by CHAMAS (Joshua Chamas).
Please check if they are correct.

Thanks,
The Pause



User update for NPESKETT

2000-02-03 Thread Perl Authors Upload Server

(This Mail was generated by the server
  https://pause.kbx.de/pause/authenquery;ACTION=edit_cred
automatically)

Record update in the PAUSE users database:

 userid: [NPESKETT]
   fullname: [Nick Peskett]
  email: [[EMAIL PROTECTED]] was [[EMAIL PROTECTED]]
   homepage: []
cpan_mail_alias: [publ]


Data were entered by NPESKETT (Nick Peskett).
Please check if they are correct.

Thanks,
The Pause



User update for RFOLEY

2000-02-03 Thread Perl Authors Upload Server

(This Mail was generated by the server
  https://pause.kbx.de/pause/authenquery;ACTION=edit_cred
automatically)

Record update in the PAUSE users database:

 userid: [RFOLEY]
   fullname: [Richard Foley]
  email: [[EMAIL PROTECTED]] was [[EMAIL PROTECTED]]
   homepage: []
cpan_mail_alias: [publ]


Data were entered by ANDK (Andreas J. König).
Please check if they are correct.

Thanks,
The Pause