RE: Specialized XML handling in Lucene

Steven A Rowe Tue, 11 Mar 2008 10:11:55 -0700

Hi Eran,

On 03/11/2008 at 12:26 PM, Eran Sevi wrote:
> If I query this index structure and get results from several
> xml docs, is there a better way to group results by doc id,
> other than iterating over all results, getting the original document,
> and checking the value of the xml_doc_id field?


Perhaps a Sort on the xml_doc_id field would do the grouping you want:

<http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Sort.html>
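
Once hits come back sorted on xml_doc_id, hits from the same original XML file sit next to each other, so grouping takes a single pass. Below is a minimal sketch in plain Java, not Lucene API code: the Hit class and the groupSorted method are hypothetical stand-ins for whatever you read out of each hit's stored fields, and the input is assumed to be already sorted by xml_doc_id.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupSortedHits {

    /** Hypothetical stand-in for one search hit: its xml_doc_id and xpath field values. */
    static final class Hit {
        final String xmlDocId;
        final String xpath;

        Hit(String xmlDocId, String xpath) {
            this.xmlDocId = xmlDocId;
            this.xpath = xpath;
        }
    }

    /**
     * Groups hits by xml_doc_id in one pass.
     * Precondition (assumption): sortedHits is already ordered by xmlDocId,
     * so each id forms one contiguous run.
     */
    static Map<String, List<String>> groupSorted(List<Hit> sortedHits) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        String currentId = null;
        List<String> currentGroup = null;
        for (Hit hit : sortedHits) {
            // A change of id marks the start of the next group.
            if (!hit.xmlDocId.equals(currentId)) {
                currentId = hit.xmlDocId;
                currentGroup = new ArrayList<>();
                groups.put(currentId, currentGroup);
            }
            currentGroup.add(hit.xpath);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("1", "/Section1/Element1[1]"));
        hits.add(new Hit("1", "/Section1/Element1[2]"));
        hits.add(new Hit("2", "/Section2/Element2[1]"));

        Map<String, List<String>> groups = groupSorted(hits);
        // The number of groups is the number of original XML documents matched.
        System.out.println(groups.size() + " original XML documents matched");
        System.out.println(groups);
    }
}
```

The size of the returned map also answers the "how many original XML documents" question without a separate set.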

Check out Lucene's TestSort.java for usage hints:

<http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_3_1/src/test/org/apache/lucene/search/TestSort.java?view=markup>

Chapter 5 in Erik Hatcher's and Otis Gospodnetic's excellent book "Lucene in 
Action" covers sorting:

<http://www.manning.com/hatcher2/>

Steve

> On Tue, Mar 11, 2008 at 5:48 PM, Steven A Rowe <[EMAIL PROTECTED]> wrote:
> > Hi Eran, see my comments below inline:
> > 
> > On 03/11/2008 at 9:23 AM, Eran Sevi wrote:
> > > I would like to ask for suggestions of the best design for
> > > the following scenario:
> > > 
> > > I have a very large number of XML files (around 1M).
> > > Each file contains several sections. Each section contains
> > > many elements (about 1000-5000).
> > > Each element has a value and some attributes describing the
> > > value (like metadata), for example:
> > > 
> > > <Section1>
> > >     <Element1 id="0" type="A" meta1="val11" meta2="val21">value1</Element1>
> > >     <Element1 id="1" type="B" meta1="val12" meta2="val21">value2</Element1>
> > > ...
> > > </Section1>
> > > <Section2>
> > >     <Element2 id="0" type="D" meta1="val11" meta3="val31">value3</Element2>
> > >     <Element2 id="1" type="B" meta1="val13" meta3="val34">value1</Element2>
> > > ...
> > > </Section2>
> > > ...
> > > 
> > > As you can see, each attribute can have any value, and
> > > attribute names can be the same in different sections.
> > > 
> > > I would like to index the XML in such a way that I can perform
> > > queries like:
> > > 
> > > element1=value1 AND type=A AND meta2=val21
> > > 
> > > and also more complicated queries that include positions
> > > between elements, and even range queries on attribute values.
> > > 
> > > Indexing each element as a different document might not be
> > > possible because of the large number of documents it might
> > > create (more than 5 billion docs), and might also make it
> > > difficult to parse results - I still want to know how
> > > many original XML documents contain the searched terms.
> > 
> > 5 billion docs is within the range that Lucene can handle. I think you
> > should try doc = element and see how well it works.
> > 
> > In order to know which original documents your hits come from, add an
> > "xml_doc_id" field, and collect the hits' xml_doc_id values in a set,
> > then take the set's cardinality.
> > 
> > > Indexing each attribute as a different field is also
> > > difficult because I then need the positional information
> > > of the found terms and check that they were all found in
> > > the same place (and thus "belong" to the same element).
> > 
> > You could use an XPath(-ish, depending on requirements) field that
> > represents the element location, e.g.:
> > 
> > <Section1>
> >  <Element1 id="0" type="A" meta1="val11" meta2="val21">value1</Element1>
> >  <Element1 id="1" type="B" meta1="val12" meta2="val21">value2</Element1>
> >  ...
> > </Section1>
> > 
> > ==>
> > 
> > Lucene Document field-name:value
> > 
> >  doc #1
> >       xml_doc_id:1
> >            xpath:/Section1/Element1[1]
> >               id:0
> >             type:A
> >            meta1:val11
> >            meta2:val21
> >            value:value1
> > 
> >  doc #2
> >       xml_doc_id:1
> >            xpath:/Section1/Element1[2]
> >               id:1
> >             type:B
> >            meta1:val12
> >            meta2:val21
> >            value:value2
> > 
> > Hope it helps,
> > Steve
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> > 
> > 
>
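
The doc = element mapping quoted above (one record per element, carrying xml_doc_id, xpath, the attributes, and the value) could be sketched as follows. This is plain Java using the JDK's DOM parser and builds simple field maps rather than Lucene Documents. The ElementFlattener class, the flatten method, and the assumption that the sections sit under a single root element (left out of the xpath, as in the example) are illustrative, not from the original thread.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class ElementFlattener {

    /**
     * Turns each element under each section into one field map.
     * Assumption: the XML has a single root whose children are the sections.
     * Assumption: no attribute is named xml_doc_id, xpath, or value.
     */
    static List<Map<String, String>> flatten(String xmlDocId, String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }

        List<Map<String, String>> docs = new ArrayList<>();
        for (Element section : childElements(root)) {
            // Counts same-named siblings so the xpath index is 1-based per tag name.
            Map<String, Integer> seen = new HashMap<>();
            for (Element element : childElements(section)) {
                int position = seen.merge(element.getTagName(), 1, Integer::sum);
                Map<String, String> fields = new LinkedHashMap<>();
                fields.put("xml_doc_id", xmlDocId);
                fields.put("xpath", "/" + section.getTagName() + "/"
                        + element.getTagName() + "[" + position + "]");
                NamedNodeMap attributes = element.getAttributes();
                for (int i = 0; i < attributes.getLength(); i++) {
                    Node attribute = attributes.item(i);
                    fields.put(attribute.getNodeName(), attribute.getNodeValue());
                }
                fields.put("value", element.getTextContent().trim());
                docs.add(fields);
            }
        }
        return docs;
    }

    /** Returns the direct child elements of parent, skipping text and comment nodes. */
    static List<Element> childElements(Element parent) {
        List<Element> result = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element) {
                result.add((Element) children.item(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<root><Section1>"
                + "<Element1 id=\"0\" type=\"A\" meta1=\"val11\" meta2=\"val21\">value1</Element1>"
                + "<Element1 id=\"1\" type=\"B\" meta1=\"val12\" meta2=\"val21\">value2</Element1>"
                + "</Section1></root>";
        for (Map<String, String> fields : flatten("1", xml)) {
            System.out.println(fields);
        }
    }
}
```

Each returned map corresponds to one "doc #N" entry in the listing above; turning a map into an index document is then one field per entry.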

 


