Re: Specialized XML handling in Lucene

Eran Sevi Tue, 11 Mar 2008 09:27:22 -0700

Thanks Steve for the quick reply,

Another question regarding this solution:


If I query this index structure and get results from several XML docs, is
there a better way to group results by doc id, other than iterating over all
results, fetching each original document, and checking the value of its
xml_doc_id field?
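
To make the question concrete, here is a rough sketch of the iteration I
have in mind. It is plain Java rather than real Lucene calls: ElementHit and
HitGrouper are hypothetical names, and each ElementHit stands in for the
stored fields I would read back from one matching document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one search hit; in a real index these values
// would be read from the stored fields of each matching document.
final class ElementHit {
    final String xmlDocId;
    final String xpath;

    ElementHit(String xmlDocId, String xpath) {
        this.xmlDocId = xmlDocId;
        this.xpath = xpath;
    }
}

final class HitGrouper {
    // Buckets hits by xml_doc_id, keeping the XML docs in first-seen order.
    static Map<String, List<ElementHit>> groupByXmlDocId(List<ElementHit> hits) {
        Map<String, List<ElementHit>> groups = new LinkedHashMap<>();
        for (ElementHit hit : hits) {
            groups.computeIfAbsent(hit.xmlDocId, id -> new ArrayList<>()).add(hit);
        }
        return groups;
    }
}
```

The size of the returned map is the number of distinct XML docs that matched,
and each map value lists the element-level hits inside one XML doc.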

Thanks in advance.

On Tue, Mar 11, 2008 at 5:48 PM, Steven A Rowe <[EMAIL PROTECTED]> wrote:

> Hi Eran, see my comments below inline:
>
> On 03/11/2008 at 9:23 AM, Eran Sevi wrote:
> > I would like to ask for suggestions of the best design for
> > the following scenario:
> >
> > I have a very large number of XML files (around 1M).
> > Each file contains several sections. Each section contains
> > many elements (about 1000-5000).
> > Each element has a value and some attributes describing the
> > value (like metadata), for example:
> >
> > <Section1>
> >     <Element1  id="0"  type="A"  meta1="val11"
> >                meta2="val21">value1</Element1>
> >     <Element1  id="1"  type="B"  meta1="val12"
> >                meta2="val21">value2</Element1>
> > ...
> > </Section1>
> > <Section2>
> >     <Element2 id="0"  type="D"  meta1="val11"
> >               meta3="val31">value3</Element2>
> >     <Element2 id="1"  type="B"  meta1="val13"
> >               meta3="val34">value1</Element2>
> > ...
> > </Section2>
> > ...
> >
> > As you can see, each attribute can have any value, and
> > attribute names can be the same in different sections.
> >
> > I would like to index the XML in such a way that I can perform
> > queries like:
> >
> > element1=value1 AND type=A AND meta2=val21
> >
> > and also more complicated queries that include positions
> > between elements, and even range queries on attribute values.
> >
> > Indexing each element as a different document might not be
> > possible because of the large number of documents it might
> > create (more than 5 billion docs), and might also make it
> > difficult to parse results - I still want to know how
> > many original XML documents contain the searched terms.
>
> 5 billion docs is within the range that Lucene can handle.  I think you
> should try doc = element and see how well it works.
>
> In order to know which original documents your hits come from, add an
> "xml_doc_id" field, and collect the hits' xml_doc_id values in a set, then
> take the set's cardinality.
>
> > Indexing each attribute as a different field is also
> > difficult because I then need the positional information
> > of the found terms and check that they were all found in
> > the same place (and thus "belong" to the same element).
>
> You could use an XPath(-ish, depending on requirements) field that
> represents the element location, e.g.:
>
> <Section1>
>  <Element1 id="0" type="A" meta1="val11" meta2="val21">value1</Element1>
>  <Element1 id="1" type="B" meta1="val12" meta2="val21">value2</Element1>
>  ...
> </Section1>
>
> ==>
>
> Lucene Document field-name:value
>
>  doc #1
>       xml_doc_id:1
>            xpath:/Section1/Element1[1]
>               id:0
>             type:A
>            meta1:val11
>            meta2:val21
>            value:value1
>
>  doc #2
>       xml_doc_id:1
>            xpath:/Section1/Element1[2]
>               id:1
>             type:B
>            meta1:val12
>            meta2:val21
>            value:value2
>
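
(To check my understanding of this mapping, here is a rough sketch in plain
Java, using only the JDK's DOM parser and no Lucene calls. ElementFlattener
is a hypothetical name. I am assuming that each file has a single root
element wrapping the sections, that the root is left out of the xpath value
as in the example above, and that no attribute is itself named xml_doc_id,
xpath or value. Each returned map would become one Lucene document.)

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class ElementFlattener {
    // Turns every element inside every section into one flat field map.
    static List<Map<String, String>> flatten(String xmlDocId, String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        List<Map<String, String>> docs = new ArrayList<>();
        NodeList sections = root.getChildNodes();
        for (int s = 0; s < sections.getLength(); s++) {
            if (sections.item(s).getNodeType() != Node.ELEMENT_NODE) continue;
            Element section = (Element) sections.item(s);
            // Counts same-named siblings so the xpath index starts at 1.
            Map<String, Integer> seen = new HashMap<>();
            NodeList children = section.getChildNodes();
            for (int c = 0; c < children.getLength(); c++) {
                if (children.item(c).getNodeType() != Node.ELEMENT_NODE) continue;
                Element el = (Element) children.item(c);
                int position = seen.merge(el.getTagName(), 1, Integer::sum);
                Map<String, String> fields = new LinkedHashMap<>();
                fields.put("xml_doc_id", xmlDocId);
                fields.put("xpath", "/" + section.getTagName() + "/"
                        + el.getTagName() + "[" + position + "]");
                NamedNodeMap attrs = el.getAttributes();
                for (int a = 0; a < attrs.getLength(); a++) {
                    fields.put(attrs.item(a).getNodeName(), attrs.item(a).getNodeValue());
                }
                fields.put("value", el.getTextContent().trim());
                docs.add(fields);
            }
        }
        return docs;
    }
}
```

(If two sections in one file share a name, the xpath values above would
collide, so a section index would be needed as well.)
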
> Hope it helps,
> Steve
