How many documents Lucene creates, and when, is entirely up to you. Your code calls IndexWriter.addDocument(), after all.
You can add multiple values to a field in a document if you want; just call Document.add() repeatedly...

HTH
Erick

On Fri, Oct 15, 2010 at 10:35 AM, Martin O'Shea <app...@dsl.pipex.com> wrote:

> @Pulkit Singhal: Thanks for the reply. Just to clarify my post yesterday,
> I'm not sure if each row in the database table would form a document or not
> because I do not know if Lucene works in this manner. In my case, each row
> of the table represents a single polling of an RSS feed to retrieve any new
> postings over a given number of hours. If Lucene allows a document to have
> separate time-based entries, then I am happy to use it for indexing. But if
> a separate document is needed per row of the table, then I'm uncertain. I
> always do have the option of using Lucene for in-memory indexing of postings
> to calculate the keyword frequencies. This I know how to do.
>
> The individual columns of my table represent the only two elements of each
> RSS item that I'm interested in retrieving text from, i.e. the title and
> description.
>
> -----Original Message-----
> From: Pulkit Singhal [mailto:pulkitsing...@gmail.com]
> Sent: 15 Oct 2010 13 36
> To: java-user@lucene.apache.org
> Subject: Re: Use of Lucene to store data from RSS feeds
>
> When you ask:
> a) will each feed would form a Lucene document, or
> b) will each database row would form a lucene document
> I'm inclined to say that really depends on what type of aggregation
> tool or logic you are using.
>
> I don't know if "Tika" does it but if there is a tool out there that
> can be pointed to a feed and tweaked to spit out documents with each
> field having the settings that you want then you can go with that
> approach. But if you are already parsing the feed and storing the raw
> data into a database table then there is no reason that you can't
> leverage that.
> From a database row perspective you have already done a
> good deal of work to collect the data and breaking it down into chunks
> that Lucene can happily index as separate fields in a document.
>
> By the way I think there are tools that read from the database
> directly too but I won't try to make things too complicated.
>
> The way I see it, if you were to use the row at this moment and index
> the 4 columns as fields ... plus you could set the feed body to be
> ANALYZED (why don't I see the feed body in your database table?) ...
> then lucene range queries on the date/time field could possibly return
> some results. I am not sure how to get keyword frequencies but if the
> analyzed tokens that lucene is keeping in its index sort of represent
> the keywords that you are talking about then i do know that lucene
> keeps some sort of inverted index per token in terms of how many
> occurrences of it are there ... may be someone else on the list can
> comment on how to extract that info in a query.
>
> Sounds doable.
>
> On Thu, Oct 14, 2010 at 10:17 AM, <app...@dsl.pipex.com> wrote:
> > Hello
> >
> > I would like to store data retrieved hourly from RSS feeds in a database
> > or in Lucene so that the text can be easily indexed for word frequencies.
> >
> > I need to get the text from the title and description elements of RSS
> > items.
> >
> > Ideally, for each hourly retrieval from a given feed, I would add a row
> > to a table in a dataset made up of the following columns:
> >
> > feed_url, title_element_text, description_element_text, polling_date_time
> >
> > From this, I can look up any element in a feed and calculate keyword
> > frequencies based upon the length of time required.
> >
> > This can be done as a database table and hashmaps used to calculate word
> > frequencies. But can I do this in Lucene to this degree of granularity at
> > all? If so, would each feed form a Lucene document or would each 'row'
> > from the database table form one?
> >
> > Can anyone advise?
> >
> > Thanks
> >
> > Martin O'Shea.
> > --
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
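
For comparison, the "database table and hashmaps" route from the original post might look roughly like the sketch below. It is plain Java with no Lucene involved. The names KeywordFrequencies, FeedRow and count are made up for illustration, the four fields simply mirror the columns listed in the post, and the tokenizer is a naive split on non-alphanumeric characters rather than anything a Lucene analyzer would do.

```java
import java.time.LocalDateTime;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: keyword counts over polled RSS rows using a HashMap. */
final class KeywordFrequencies {

    /** One table row, mirroring the four columns from the original post. */
    static final class FeedRow {
        final String feedUrl;
        final String title;
        final String description;
        final LocalDateTime polledAt;

        FeedRow(String feedUrl, String title, String description, LocalDateTime polledAt) {
            this.feedUrl = feedUrl;
            this.title = title;
            this.description = description;
            this.polledAt = polledAt;
        }
    }

    /** Counts words in title + description for one feed's rows polled in [from, to). */
    static Map<String, Integer> count(List<FeedRow> rows, String feedUrl,
                                      LocalDateTime from, LocalDateTime to) {
        Map<String, Integer> counts = new HashMap<>();
        for (FeedRow row : rows) {
            boolean inWindow = !row.polledAt.isBefore(from) && row.polledAt.isBefore(to);
            if (!row.feedUrl.equals(feedUrl) || !inWindow) {
                continue; // wrong feed, or polled outside the requested time range
            }
            String text = (row.title + " " + row.description).toLowerCase(Locale.ROOT);
            // Naive tokenizer (an assumption): split on anything that is not a letter or digit.
            for (String token : text.split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String feed = "http://example.com/rss"; // placeholder URL
        List<FeedRow> rows = List.of(
            new FeedRow(feed, "Lucene tips", "Indexing with Lucene",
                        LocalDateTime.of(2010, 10, 15, 10, 0)),
            new FeedRow(feed, "Lucene release", "New release notes",
                        LocalDateTime.of(2010, 10, 15, 11, 0)));
        // Only the 10:00 row falls inside [10:00, 11:00).
        Map<String, Integer> counts = count(rows, feed,
            LocalDateTime.of(2010, 10, 15, 10, 0),
            LocalDateTime.of(2010, 10, 15, 11, 0));
        System.out.println(counts.get("lucene")); // prints 2
    }
}
```

Each row here plays the role that a Lucene document with four fields would play if every row were indexed separately, with the time window standing in for a range query on the date/time field.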