Re: Use of Lucene to store data from RSS feeds

Erick Erickson Fri, 15 Oct 2010 11:16:57 -0700

How many documents Lucene creates, and when, is entirely up to you. Your
code calls IndexWriter.addDocument, after all.


You can add multiple values to a field in a document if you want: just
call Document.add() repeatedly with the same field name...

HTH
Erick

On Fri, Oct 15, 2010 at 10:35 AM, Martin O'Shea <app...@dsl.pipex.com> wrote:

> @Pulkit Singhal: Thanks for the reply. Just to clarify my post yesterday,
> I'm not sure if each row in the database table would form a document or not
> because I do not know if Lucene works in this manner. In my case, each row
> of the table represents a single polling of an RSS feed to retrieve any new
> postings over a given number of hours. If Lucene allows a document to have
> separate time-based entries, then I am happy to use it for indexing. But if
> a separate document is needed per row of the table, then I'm uncertain. I
> always do have the option of using Lucene for in-memory indexing of postings
> to calculate the keyword frequencies. This I know how to do.
>
> The individual columns of my table represent the only two elements of each
> RSS item that I'm interested in retrieving text from, i.e. the title and
> description.
>
> -----Original Message-----
> From: Pulkit Singhal [mailto:pulkitsing...@gmail.com]
> Sent: 15 Oct 2010 13:36
> To: java-user@lucene.apache.org
> Subject: Re: Use of Lucene to store data from RSS feeds
>
> When you ask:
> a) will each feed form a Lucene document, or
> b) will each database row form a Lucene document
> I'm inclined to say that really depends on what type of aggregation
> tool or logic you are using.
>
> I don't know if "Tika" does it but if there is a tool out there that
> can be pointed to a feed and tweaked to spit out documents with each
> field having the settings that you want then you can go with that
> approach. But if you are already parsing the feed and storing the raw
> data into a database table then there is no reason that you can't
> leverage that. From a database row perspective you have already done a
> good deal of work to collect the data and break it down into chunks
> that Lucene can happily index as separate fields in a document.
>
> By the way I think there are tools that read from the database
> directly too but I won't try to make things too complicated.
>
> The way I see it, if you were to use the row at this moment and index
> the 4 columns as fields ... plus you could set the feed body to be
> ANALYZED (why don't I see the feed body in your database table?) ...
> then Lucene range queries on the date/time field could possibly return
> some results. I am not sure how to get keyword frequencies, but if the
> analyzed tokens that Lucene keeps in its index represent the keywords
> that you are talking about, then I do know that Lucene keeps some sort
> of inverted index per token recording how many occurrences of it there
> are ... maybe someone else on the list can comment on how to extract
> that info in a query.
>
> Sounds doable.
>
> On Thu, Oct 14, 2010 at 10:17 AM,  <app...@dsl.pipex.com> wrote:
> > Hello
> >
> > I would like to store data retrieved hourly from RSS feeds in a database
> > or in Lucene so that the text can be easily indexed for word frequencies.
> >
> > I need to get the text from the title and description elements of RSS
> > items.
> >
> > Ideally, for each hourly retrieval from a given feed, I would add a row
> > to a table in a dataset made up of the following columns:
> >
> > feed_url, title_element_text, description_element_text, polling_date_time
> >
> > From this, I can look up any element in a feed and calculate keyword
> > frequencies based upon the length of time required.
> >
> > This can be done as a database table and hashmaps used to calculate word
> > frequencies. But can I do this in Lucene to this degree of granularity
> > at all? If so, would each feed form a Lucene document or would each 'row'
> > from the database table form one?
> >
> > Can anyone advise?
> >
> > Thanks
> >
> > Martin O'Shea.
> > --
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>

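The database-and-hashmaps approach from the original question can be
sketched in plain Java, without Lucene. This is only a minimal sketch under
stated assumptions: the class FeedKeywordFrequency, the FeedRow type, the
countKeywords method, the example.com URL and the naive tokenizer are all
hypothetical choices made for illustration and do not come from the thread.
Each FeedRow mirrors one row of the table described in the question
(feed_url, title_element_text, description_element_text, polling_date_time).

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: keyword frequencies over feed rows in a time window.
class FeedKeywordFrequency {

    // One polling of one feed; field names are assumptions based on the
    // column list in the original question.
    static final class FeedRow {
        final String feedUrl;
        final String title;
        final String description;
        final LocalDateTime polledAt;

        FeedRow(String feedUrl, String title, String description,
                LocalDateTime polledAt) {
            this.feedUrl = feedUrl;
            this.title = title;
            this.description = description;
            this.polledAt = polledAt;
        }
    }

    // Counts words in the title and description of every row that belongs
    // to feedUrl and was polled in the half-open window [from, to).
    static Map<String, Integer> countKeywords(List<FeedRow> rows, String feedUrl,
                                              LocalDateTime from, LocalDateTime to) {
        Map<String, Integer> counts = new HashMap<>();
        for (FeedRow row : rows) {
            boolean inWindow = !row.polledAt.isBefore(from)
                    && row.polledAt.isBefore(to);
            if (!row.feedUrl.equals(feedUrl) || !inWindow) {
                continue;
            }
            String text = row.title + " " + row.description;
            // Naive tokenizer (an assumption): lower-case the text and split
            // on runs of characters that are neither letters nor digits.
            for (String token : text.toLowerCase(Locale.ROOT)
                                    .split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<FeedRow> rows = new ArrayList<>();
        rows.add(new FeedRow("http://example.com/rss", "Lucene released",
                "Lucene adds new features",
                LocalDateTime.of(2010, 10, 15, 9, 0)));
        rows.add(new FeedRow("http://example.com/rss", "Old news",
                "Lucene mentioned last week",
                LocalDateTime.of(2010, 10, 8, 9, 0)));

        Map<String, Integer> counts = countKeywords(rows,
                "http://example.com/rss",
                LocalDateTime.of(2010, 10, 15, 0, 0),
                LocalDateTime.of(2010, 10, 16, 0, 0));
        // Only the first row falls inside the window.
        System.out.println(counts.get("lucene")); // prints 2
    }
}
```

If Lucene were used instead, each such row could simply become one document
passed to IndexWriter.addDocument, as the reply at the top of this thread
points out, with the analyzer taking over the tokenizing step.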