On Saturday 13 January 2007 16:48, Melange wrote:
> Nicolas Lalevée-2 wrote:
> > On Saturday 13 January 2007 10:49, Melange wrote:
> >> Hello, I'd like to index a web forum (phpBB) with Lucene. I wonder how
> >> to best map the forum document model (topics and their messages) to the
> >> Lucene document model.
> >>
> >> Usually, some forum member creates a new topic with its first message
> >> text, then other members add reply messages to that topic. Messages are
> >> sometimes updated, but most of the time topics grow incrementally.
> >> There's no limit on the number of replies; thousands is nothing unusual.
> >>
> >> Currently, I see two options for my Lucene data model: a single document
> >> type or two document types (one for the topics and one for the
> >> messages). When using only a single document type, things are fairly
> >> clear but there would obviously be a lot of unnecessary index
> >> modifications (there would be one field with all messages concatenated).
> >> To reduce the amount of index updates, the separation of topics and
> >> messages seems to be the right thing to do.
> >>
> >> So I'd like to use two document types for my document model, but I do
> >> not understand how I could bring these two together when searching. I
> >> don't want to list all messages but I want the messages grouped by
> >> topic. How can I go about that?
> >>
> >> The topic documents could be boosted, but perhaps that's not even
> >> necessary because of their relatively short length (compared to message
> >> documents).
> >
> > Hi Melange,
> >
> > The two-document-type design is only useful if you want to search for
> > topics and also search for messages. Here you want to search for messages
> > grouped by topic, so you should have one kind of document: message
> > documents. In each message document you store the topic's id, so you will
> > be able to group by topic. To group search results by topic, you might be
> > interested in Solr's [1] faceted search [2].
> >
> > cheers,
> > Nicolas
> >
> > [1] http://incubator.apache.org/solr/
> > [2] http://wiki.apache.org/solr/SimpleFacetParameters
>
> Thank you Nicolas, good idea with the message documents, I'll do that
> instead.
>
> Sorry, I couldn't really find anything at the Solr links you provided
> regarding the grouping of search results (hits). Will I have to load all
> the hits into RAM in order to perform the grouping myself or is there a way
> to have Lucene do that for me? Or how is this to be done, roughly?

In fact I should have given more details...  ;)

Solr is a search server based on Lucene. Since you intend to use phpBB, you
could probably use Solr as well. You would then write some PHP, I think, to
query the Solr engine, and you would be able to use Solr's faceted search.

I don't know what your exact constraints are for integrating a search engine
into your forum, so I may be wrong. If you want to stay in Java, I recommend
that you look at how Solr does faceted search. I got a lot of good ideas from
there; I just don't remember where I found the facet code in Solr. The basic
idea was to do faceted search with a set of filters, each filter representing
a category. Then, when aggregating the results of a query in a HitCollector,
you just dispatch each doc to the queue associated with the matching filter.
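
A minimal sketch of that dispatch step, in plain Java with no Lucene
dependency. The class name FacetDispatcher and the use of one BitSet per
category are assumptions made for illustration; this is not Solr's actual
code. The collect(doc, score) method is meant to be called once per hit,
the way a hit collector is.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: routes each collected doc to the categories whose filter matches it. */
class FacetDispatcher {
    // One BitSet per category: bit i is set when document i belongs to that category.
    private final Map<String, BitSet> filters;
    // One queue of matching doc ids per category, in collection order.
    private final Map<String, List<Integer>> queues = new LinkedHashMap<>();

    FacetDispatcher(Map<String, BitSet> filters) {
        this.filters = filters;
        for (String category : filters.keySet()) {
            queues.put(category, new ArrayList<>());
        }
    }

    /** Called once per hit; the score is unused in this sketch. */
    void collect(int doc, float score) {
        for (Map.Entry<String, BitSet> entry : filters.entrySet()) {
            if (entry.getValue().get(doc)) {
                queues.get(entry.getKey()).add(doc);
            }
        }
    }

    Map<String, List<Integer>> queues() {
        return queues;
    }
}
```

Each collected doc costs one bit test per filter, so the work per hit grows
with the number of categories.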

But thinking about it again: on a forum, updates happen quite often, and in
the design I described you would need as many filters as there are threads.
That is a scalability issue.

But maybe you want a simpler "group by" feature. Your search returns a bunch of
documents (Hits, for instance) representing your messages. You select the 20
best-matching ones and only then group them by thread id.
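
Roughly, that simpler approach could look like the sketch below. It is plain
Java with no Lucene dependency: MessageHit, ThreadGrouping and groupTopHits
are hypothetical names, and the input list is assumed to be already sorted
best match first, as search results normally are.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ThreadGrouping {
    /** Hypothetical stand-in for a search hit: a message id plus the id of its thread. */
    static final class MessageHit {
        final String messageId;
        final String threadId;

        MessageHit(String messageId, String threadId) {
            this.messageId = messageId;
            this.threadId = threadId;
        }
    }

    /** Takes the first topN hits (assumed sorted best first) and groups them by thread id. */
    static Map<String, List<String>> groupTopHits(List<MessageHit> rankedHits, int topN) {
        // LinkedHashMap keeps threads in the order of their best-ranked message.
        Map<String, List<String>> byThread = new LinkedHashMap<>();
        int limit = Math.min(topN, rankedHits.size());
        for (int i = 0; i < limit; i++) {
            MessageHit hit = rankedHits.get(i);
            List<String> messages = byThread.get(hit.threadId);
            if (messages == null) {
                messages = new ArrayList<>();
                byThread.put(hit.threadId, messages);
            }
            messages.add(hit.messageId);
        }
        return byThread;
    }
}
```

Only the selected hits are held in memory, so the cost does not depend on the
number of threads in the forum. The trade-off is that a thread appears only
if one of its messages made it into the top N.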

Nicolas, trying not to give too bad advice... :p
