Hello, we have written three modules in a distribution that we have called Text-Document, from the name of the main package (the are two are called Text::DocumentCollection and Text::Bloom).
Thay have been subjected to review in comp.lang.perl.modules (with little response, so we assume nobody is hurt...). The aim of the distribution is dealing with text documents from the perspective of information retrieval, so we think they belong to the Text:: namespace. One of us is already a PAUSE author (ASPINELLI), but up to now he just maintained an already-existing package, so he took the perhaps misguided step of uploading the distribution (Text-Document.1.04.tar.gz) to PAUSE without registering the name. Now we ask for the registration of Text::Document, and of course we are ready to delete the registration from PAUSE and upload it with a revised name, if necessary. Below you find the README and the pods for the modules. Thanks in advance Andrea Spinelli, [EMAIL PROTECTED] Walter Vannini, [EMAIL PROTECTED] ----------README Text::Document is a collection of modules which allow to operate on text documents from the perspective of Information Retrieval. Text::Document scans documents, extracts terms, compares pairs of documents using the Jaccard and Cosine similarity measures. Text::Bloom allows to compute Bloom filters which compactly store information about term presence in documents, thereby allowing for efficient storage of document 'signatures'. Text::DocumentCollection is a collection of documents, allowing for persistency and for such calculations as the Inverse Document Frequency (IDF). Version 1.01 of the package Text::Document is Copyright (C) 2001 Andrea Spinelli and Walter Vannini All documents in this package can be used with the same limitations as Perl itself. Anyway, we are eager to know about your experiences with this thing, at [EMAIL PROTECTED] and/or [EMAIL PROTECTED] ------Document.pod =head1 NAME Text::Document - a text document subject to statistical analysis =head1 SYNOPSIS my $t = Text::Document->new(); $t->AddContent( 'foo bar baz' ); $t->AddContent( 'foo barbaz; ' ); my @freqList = $t->KeywordFrequency(); my $u = Text::Document->new(); ... my $sj = $t->JaccardSimilarity( $u ); my $sc = $t->CosineSimilarity( $u ); =head1 DESCRIPTION C<Text::Document> allows to perform simple Information-Retrieval-oriented statistics on pure-text documents. Text can be added in chunks, so that the document may be incrementally built, for instance by a class like C<HTML::Parser>. A simple algorithm splits the text into terms; the algorithm may be redefined by subclassing and redefining C<ScanV>. The C<KeywordFrequency> function computes term frequency over the whole document. =head1 FORESEEN REUSE The package may be {re}used either by simple instantiation, or by subclassing (defining a descendant package). In the latter case the methods which are foreseen to be redefined are those ending with a C<V> suffix. Redefining other methods will require greater attention. =head1 CLASS METHODS =head2 new The creator method. No arguments. my $d = Text::Document->new(); =head2 NewFromString Take a string written by C<WriteToString> (see below) and create a new C<Text::Document> with the same contents; call C<die> whenever the restore is impossible or ill-advised, for instance when the current version of the package is different from the original one, or the compression library in unavailable. my $b = Text::Document::NewFromString( $str ); The return value is a blessed reference; put in another way, this is an alternative contructor. The string should have been written by C<WriteToString>; you may of course tweak the string contents, but at this point you're entirely on you own. =head1 INSTANCE METHODS =head2 AddContent Used as $d->AddContent( 'foo bar baz foo9' ); $d->AddContent( 'mary had a little lamb' ); Successive calls accumulate content; there is currently no way of resetting the content to zero. =head2 Terms Returns a list of all distinct terms in the document, in no particular order. =head2 Occurrences Returns the number of occurrences of a given term. $d->AddContent( 'foo baz bar foo foo'); my $n = $d->Occurrences( 'foo' ); # now $n is 3 =head2 ScanV Scan a string and return a list of terms. Called internally as: my @terms = $self->ScanV( $text ); =head2 KeywordFrequency Returns a reference list of pairs I<[term,frequency]>, sorted by ascending frequency. my $listRef = $d->KeywordFrequency(); foreach my $pair (@{$listRef}){ my ($term,$frequency) = @{$pair}; ... } Terms in the document are sampled and their frequencies of occurrency are sorted in ascending order; finally, the list is returned to the user. =head2 WriteToString Convert the document (actually, some parameters and the term counters) into a string which can be saved and later restored with C<NewFromString>. my $str = $d->WriteToString(); The string begins with a header which encodes the originating package, its version, the parameters of the current instance. Whenever possible, C<Compress::Zlib> is used in order to compress the bit vector in the most efficient way. On systems without C<Compress::Zlib>, the bit string is saved uncompressed. =head2 JaccardSimilarity Compute the Jaccard measure of document similarity, which is defined as follows: given two documents I<D> and I<E>, let I<Ds> and I<Es> be the set of terms occurring in I<D> and I<E>, respectively. Define I<S> as the intersection of I<Ds> and I<Es>, and I<T> as their union. Then the Jaccerd similarity is the the number of elements of I<S> divided by the number of elements of I<T>. It is called as follows: my $sim = $d->JaccardSimilarity( $e ); If neither document has any terms the result is undef (a rare evenience). Otherwise the similarity is a real number between 0.0 (no terms in common) and 1.0 (all terms in common). =head2 CosineSimilarity Compute the cosine similarity between two documents I<D> and I<E>. Let I<Ds> and I<Es> be the set of terms occurring in I<D> and I<E>, respectively. Define I<T> as the union of I<Ds> and I<Es>, and let I<ti> be the I<i>-th element of I<T>. Then the term vectors of I<D> and I<E> are Dv = (nD(t1), nD(t2), ..., nD(tN)) Ev = (nE(t1), nE(t2), ..., nE(tN)) where nD(ti) is the number of occurrences of term ti in I<D>, and nE(ti) the same for I<E>. Now we are at last ready to define the cosine similarity I<CS>: CS = (Dv,Ev) / (Norm(Dv)*Norm(Ev)) Here (... , ...) is the scalar product and Norm is the Euclidean norm (square root of the sum of squares). C<CosineSImilarity> is called as $sim = $d->CosineSimilarity( $e ); It is C<undef> if either I<D> or I<E> have no occurrence of any term. Otherwise, it is a number between 0.0 and 1.0. Since term occurrences are always non-negative, the cosine is obviously always non-negative. =head1 AUTHORS [EMAIL PROTECTED] (Andrea Spinelli) [EMAIL PROTECTED] (Walter Vannini) =head1 HISTORY 2001-11-02 - initial revision =head DISCARDED CHOICES We did not use C<Storable>, because we wanted to fine-tune compression and version compatibility. However, this choice may be easily reversed redefining WriteToString and NewFromString. ---------Bloom.pod =head1 NAME Text::Bloom - Evaluate Bloom signature of a set of terms =head1 SYNOPSIS my $b = Text::Bloom->new(); $b->Compute( qw( foo bar baz ) ); my $sig = $b->WriteToString(); $b->WriteToFile( 'afile.sig' ); my $b2 = Text::Bloom::NewFromFile( 'afile.sig' ); my $b3 = Text::Bloom->new(); $b3->Compute( qw( foo bar barbaz ) ); my $sim = $b->Similarity( $b2 ); my $b4 = Text::Bloom::NewFromString( $sig ); =head1 DESCRIPTION C<Text::Bloom> applies the Bloom filtering technique to the statistical analysis of documents. The terms in the document are quantized using a base-36 radix representation; each term thus corresponds to an integer in the range 0..I<p-1>, where I<p> is a prime, currently set to the greatest prime less than 2^32. Each quantized value is mapped to I<d> integers in the range 0..I<size-1>, where I<size> is an integer less than I<p>, currently 2^17, using a family of hash functions, computed by the C<HashV> function. Each hashed value is used as the index in a large bit vector. Bits corresponding to terms present in the document are set to 1; all other bits are set to 0. Of course, collisions may cause the same bit to be set twice, by different terms. It follows that, if the document contains I<n> distinct terms, in the resulting bit vector at most I<n * d> bits are set to 1. The resulting bit string is a very compact representation of the presence/absence of terms in the document, and is therefore characterised as a I<signature>. Moreover, it does not depend on a pre-set dictionary of terms. The signature may be used for: =over 4 =item * testing whether a given set of terms is present in the document, =item * computing which fraction of terms are common to two documents. =back The bit representation may be written to and read from a file. C<Text::Bloom> prepends a header to the bit stream proper; moreover, whenever the package C<Compress::Zlib> is available, the bit vector is compressed, so that disk space requirements are drastically reduced, especially for small documents. The hash function is obviously a crucial component of the filter; the reference implementation uses a radix representation of strings. Each term must therefore match the regular expression C</[0-9a-z]+/>. There are quite a few viable alternatives, which can be pursued by subclassing and redefining the method C<QuantizeV>. =head1 FORESEEN REUSE The package may be {re}used either by simple instantiation, or by subclassing (defining a descendant package). In the latter case the methods which are foreseen to be redefined are those ending with a C<V> suffix. Redefining other methods will require greater attention. =head1 CLASS METHODS =head2 new The constructor. No arguments are required. $b = Text::Bloom->new(); =head2 NewFromString Take a string written by C<WriteToString> (see below) and create a new C<Text::Bloom> with the same contents; call C<die> whenever the restore is impossible or ill-advised, for instance when the current version of the package is different from the original one, or the compression library in unavailable. my $b = Text::Bloom::NewFromString( $str ); The return value is a blessed reference; put in another way, this is an alternative contructor. The string should have been written by C<WriteToString>; you may of course tweak the string contents, but at this point you're entirely on you own. =head2 NewFromFile Utility function that reads a binary file and performs a C<NewFromString> on its content; see its counterpart, C<WriteToFile>. my $b2 = Text::Document::NewFromFile( 'foo.sig' ); =head1 INSTANCE METHODS =head2 Size Set and get the size of the filter, in bits. The default size is currently 128K. print 'size is ' . $b->Size() . "\n"; $b->Size( 65536 ); The C<Size> method must be called before the C<Compute> method in order to have effect. =head2 Compute Compute the Bloom signature from the given set of words and store it internally. $b->Compute( qw( foo bar baz foobar bazbaz ) ); Makes use of the C<QuantizeV> method. =head2 QuantizeV Convert a term into an integer; must return an integer in the range 0 .. C<$Text::Bloom::p-1>. It is called as my $hash = $b->QuantizeV( $term ); The current version is designed for strings matching C</[a-z0-9]+/>. Other characters do not cause errors, but degrade the hash function performance. This function is a likely candidate for redefinition. =head2 HashV Convert an integer to a (smaller) integer, according to one of a class of similar functions. It is internally called as: my $index = $b->HashV( $order, $value ); The C<$value> must belong to the interval 0..C<$Text::Bloom::p-1>, while the index must lie in 0..I<size-1>. C<$order> is a small integer from 0 to I<d-1>. The default implementation is index = m[order] * value + q[order] (mod size) the values of I<m> and I<q> are taken from the array C<@Text::Bloom::hashParam>; the form of the function is taken from [2]. =head2 WriteToString Convert the Bloom signature into a string which can be saved and later restored with C<NewFromString>. C<Compute> must have been called previously. my $str = $b->WriteToString(); The string begins with a header which encodes the originating package, its version, the parameters of the current instance. Whenever possible, C<Compress::Zlib> is used in order to compress the bit vector in the most efficient way. On systems without C<Compress::Zlib>, the bit string is saved uncompressed. =head2 WriteToFile These convenience functions just call their String counterparts and read/write the file specified in the argument. $b->WriteToFile( 'foo.sig' ); =head1 AUTHORS [EMAIL PROTECTED] (Andrea Spinelli) [EMAIL PROTECTED] (Walter Vannini) =head1 BIBLIOGRAPHY =over 4 =item [1] Burton H. Bloom, "Space/time trade-offs in hash coding with allowable errors", I<Communications of the ACM>, B<13>, 7, July 1970, pages 422-426. (available electronically from ACM Digital Library). =item [2] M. V. Ramakrishna, "Practical Performance of Bloom FIlters and Parallel Free-Text Searching", I<Communications of the ACM>, B<32>, 10, October 1989, pages 1237- 1239. (available electronically from ACM Digital Library). =back =head1 BUGS On Win32 we have experienced some instabilities when dealing with a large number of signatures; in this case Perl crashes without apparent explanation. The main suspect is Bit::Vector, but without any evidence. =head1 HISTORY 2001-11-02 - initial revision --------------DocumentCollection.pod =head1 NAME Text::DocumentCollection - a collection of documents =head1 SYNOPSIS =head1 DESCRIPTION =head1 CLASS METHODS =head2 new The constructor; arguments must be passed as maps from keys to values. The key C<file> is mandatory. my $c = Text::DocumentCollection->new( file => 'coll.db' ); Documents from the collection are saved as in the specified file, which is currently handled by a C<DB_File> hash. =head1 INSTANCE METHODS =head2 Add Add a document to the collection, tagging it with a unique key. $c->Add( $key, $doc ); C<Add> C<die>s if the key is already present. To change an existing key, use C<Delete> and then C<Add>. =head2 Delete Discard a document from the collection. =head2 NewFromDB Loads the collection from the given DB file: my $c = Text::DocumentCollection->NewFromDB( file => 'coll.db' ); The file must be either empty or created by a former invocation of C<new> or C<NewFromDB>, followed by any number of C<Add> and/or C<Delete>. Currently, all documents in the collection are revived (by calling C<NewFromString>). This poses performance problems for huge collections; a caching mechanism would be an option in this case. =head2 IDF Inverse Term frequency of a given term. The definition we used is, given a term I<t>, a set of documents I<DOC> and the binary relationship I<has-term>: IDF(t) = log2( #DOC / #{ d in DOC | d has-term t } ) The logarithm is in base 2, since this is related to an information measurement, and # is the cardinality operator. =head2 EnumerateV Enumerates all the document in the collection. Called as: my @result = $c->EnumerateV( \&Callback, 'the rock' ); The function C<Callback> will be called on each element of the collection as: my @l = CallBack( $c, $key, $doc, $rock ); where C<$rock> is the second argument to C<Callback>. Since C<$c> is the first argument, the callback may be an instance method of C<Text::DocumentCollection>. The final result is obtained by concatenating all the partial results (C<@l> in the example above). If you do not want a result, simply return the empty list (). There is no particular order of enumeration, so there is no particular order in which results are concatenated. =head1 AUTHORS [EMAIL PROTECTED] [EMAIL PROTECTED] -- Andrea Spinelli - Software Architect e-mail: [EMAIL PROTECTED] phone: +39-035-636029 fax: +39-035-638129