SV: SV: Sort problematics

Marcus Falck Thu, 18 May 2006 15:23:49 -0700

Hi Gunther.
 
We had considered building an index of the search profiles and searching 
that index using the documents as queries, but we couldn't really figure out 
how to do it. We have an alert service up and running today using Verity's 
implementation of alerts. So we looked at the Verity documentation and realised 
that it doesn't handle alerts using an inverted index. So we implemented our 
new alert service the same way the Verity service works today. 
This seems to work nicely, but if you have a concrete solution for building 
an inverted index that stores fairly complex queries, you are more than 
welcome to share it.
 
-
 
What I want to accomplish is a central index for a lot of large backend systems 
containing a lot of articles, for example news polled from the web, newspapers 
delivered to us in electronic form, and third-party document databases.
So what we have done is implement a search engine using Lucene as the core. 
This engine is scalable both in terms of range and round-robin/range. Fetcher 
applications fetch documents from different storages, transform them into a 
common format, and then distribute them to all search machines matching that 
range.
The range clustering is built on date ranges. Since we are going to buy 
document databases from other companies, we can't guarantee that all data will 
be added in date order. 
The volume of data we are talking about is around 500 million news articles.
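
The date-range distribution described above could be sketched roughly as follows. This is only a minimal illustration in plain Java, not our actual engine: the class DateRangeRouter, the Shard type and the machine names are hypothetical, and it assumes every range is a half-open interval [from, to) served by one or more replicated search machines. Because routing looks only at the document's own date, it does not matter in which order the documents arrive.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: routes a document to every search machine whose
// date range covers the document's publication date.
class DateRangeRouter {

    // One date range [from, to) plus the machines that replicate it.
    static final class Shard {
        final LocalDate from;        // inclusive
        final LocalDate to;          // exclusive
        final List<String> machines; // replicas serving this range

        Shard(LocalDate from, LocalDate to, List<String> machines) {
            this.from = from;
            this.to = to;
            this.machines = machines;
        }

        boolean covers(LocalDate date) {
            return !date.isBefore(from) && date.isBefore(to);
        }
    }

    private final List<Shard> shards = new ArrayList<>();

    void addShard(LocalDate from, LocalDate to, List<String> machines) {
        shards.add(new Shard(from, to, machines));
    }

    // Returns all machines that should index a document with this date.
    // An empty list means no configured range covers the date.
    List<String> route(LocalDate publicationDate) {
        List<String> targets = new ArrayList<>();
        for (Shard shard : shards) {
            if (shard.covers(publicationDate)) {
                targets.addAll(shard.machines);
            }
        }
        return targets;
    }
}
```

A fetcher would call route() with the article date after transforming the document and send it to every machine returned.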
 
The end users, and a lot of our internal processes for value-adding services, 
then define search queries for the things they want to monitor. In the end 
user's case this is called an "agent". When a user logs in to the system and 
clicks on an agent, the matching articles are presented to him/her in DATE 
order (newest first). The date order is critical. Relevance is not 
important since we have value-added services such as quality control of the 
hits.
 
So the last thing to do in order to get a fully functional proof of concept up 
is to fix the date-ordered presentation. And since it's a lot of data and the 
IndexSearcher will be recreated pretty often, we will need to change the Lucene 
scoring/ranking. I can't understand why this should be so hard, but I don't 
have any clue what the best practices for doing so are.
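
To make the goal concrete, here is a minimal sketch in plain Java of the ordering we are after: every matching document is treated as equally relevant and the hits are ordered purely by date, newest first. It deliberately does not use the Lucene API (Lucene does ship Sort and SortField classes for sorting on a field, which may well be the better route). The names DateOrderedHits, Hit and newestFirst are hypothetical, and the sketch assumes the date is stored as a "yyyyMMdd" string so that string order equals date order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: order matches by date only, ignoring relevance.
class DateOrderedHits {

    static final class Hit {
        final int docId;
        final String date; // "yyyyMMdd", so lexicographic order is date order

        Hit(int docId, String date) {
            this.docId = docId;
            this.date = date;
        }
    }

    // Returns at most 'limit' hits, newest first; ties are broken by docId
    // so the presentation order stays stable between searches.
    static List<Hit> newestFirst(List<Hit> matches, int limit) {
        List<Hit> sorted = new ArrayList<>(matches);
        sorted.sort(Comparator.comparing((Hit h) -> h.date)
                .reversed()
                .thenComparingInt(h -> h.docId));
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }
}
```

Sorting the full match list like this is only reasonable for modest result sets; with hundreds of millions of articles the ordering would have to come from the index itself.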
 
/
Regards
Marcus

________________________________

From: Günther Starnberger [mailto:[EMAIL PROTECTED]
Sent: Thu 2006-05-18 23:22
To: java-user@lucene.apache.org
Subject: Re: SV: Sort problematics

On Thu, May 18, 2006 at 10:53:23PM +0200, Marcus Falck wrote:

Hello,

> The term scorer will give higher score on documents containing both
> terms. This is a problem (in our application) since in this case we want
> the same score on documents as long as they contain 1 of the terms
> (since we are dealing with newsletter observation for companies they
> want to get the hits ordered by date to get the complete overview).  I
> tested to rewrite the TermScorer to give me the same score with
> success. So my question is.

What exactly do you want to achieve with your application?

You speak of "immediate alerts". I understand this as: Your users
specify keywords or queries and when you receive a new document which
matches a query you alert the user.

Is this what you want to do? If so I don't think that Lucene is useful
for this kind of realtime queries. Instead of using an inverted index
it would make more sense to use a normal index which contains the
terms you search for. If you receive a new document make a lookup on
each term of the document using the index. It _might_ be possible to
do this with Lucene by storing the search-terms as documents and using
the documents which you receive as queries, but I guess it isn't
that trivial.
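
The term lookup suggested above might look roughly like the following. This is a minimal sketch in plain Java, not a Lucene-based solution: the names ProfileIndex, addProfile and match are hypothetical, and it assumes each search profile is just a set of keywords that matches when any one of them occurs in the document. Phrase queries, boolean operators and other complex queries are not covered.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: an index from search term to the profiles that use
// it, probed with every term of an incoming document.
class ProfileIndex {

    private final Map<String, Set<String>> profilesByTerm = new HashMap<>();

    // Registers a profile that matches when ANY of its terms occurs.
    void addProfile(String profileId, Collection<String> terms) {
        for (String term : terms) {
            profilesByTerm
                .computeIfAbsent(term.toLowerCase(Locale.ROOT), t -> new TreeSet<>())
                .add(profileId);
        }
    }

    // Looks up each token of the new document and collects the profiles
    // that should raise an alert.
    Set<String> match(String documentText) {
        Set<String> matched = new TreeSet<>();
        for (String token : documentText.toLowerCase(Locale.ROOT).split("\\W+")) {
            Set<String> profiles = profilesByTerm.get(token);
            if (profiles != null) {
                matched.addAll(profiles);
            }
        }
        return matched;
    }
}
```

The cost of match() grows with the length of the incoming document rather than with the number of stored profiles, which is the point of indexing the profiles instead of the documents.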

If you need a combination of traditional search and real-time alerts a
hybrid solution may make sense. But using Lucene for real-time search
isn't a good idea (at least IMO).

bye,
/gst

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
