Thanks, Steve, for the quick reply. Another question regarding this solution:
If I query this index structure and get results from several XML docs, is
there a better way to group results by doc id, other than iterating over all
results, fetching each original document, and checking the value of its
xml_doc_id field?

Thanks in advance.

On Tue, Mar 11, 2008 at 5:48 PM, Steven A Rowe <[EMAIL PROTECTED]> wrote:

> Hi Eran, see my comments below inline:
>
> On 03/11/2008 at 9:23 AM, Eran Sevi wrote:
> > I would like to ask for suggestions of the best design for
> > the following scenario:
> >
> > I have a very large number of XML files (around 1M).
> > Each file contains several sections. Each section contains
> > many elements (about 1000-5000).
> > Each element has a value and some attributes describing the
> > value (like metadata), for example:
> >
> > <Section1>
> >   <Element1 id="0" type="A" meta1="val11" meta2="val21">value1</Element1>
> >   <Element1 id="1" type="B" meta1="val12" meta2="val21">value2</Element1>
> >   ...
> > </Section1>
> > <Section2>
> >   <Element2 id="0" type="D" meta1="val11" meta3="val31">value3</Element2>
> >   <Element2 id="1" type="B" meta1="val13" meta3="val34">value1</Element2>
> >   ...
> > </Section2>
> > ...
> >
> > As you can see, each attribute can have any value, and
> > attribute names can be the same in different sections.
> >
> > I would like to index the XML in such a way that I can perform
> > queries like:
> >
> > element1=value1 AND type=A AND meta2=val21
> >
> > and also more complicated queries that include positions
> > between elements, and even range queries on attribute values.
> >
> > Indexing each element as a different document might not be
> > possible because of the large number of documents it might
> > create (more than 5 billion docs), and might also make it
> > difficult to parse results - I still want to know how
> > many original XML documents contain the searched terms.
>
> 5 billion docs is within the range that Lucene can handle. I think you
> should try doc = element and see how well it works.
>
> In order to know which original documents your hits come from, add an
> "xml_doc_id" field, and collect the hits' xml_doc_id values in a set, then
> take the set's cardinality.
>
> > Indexing each attribute as a different field is also
> > difficult because I then need the positional information
> > of the found terms and check that they were all found in
> > the same place (and thus "belong" to the same element).
>
> You could use an XPath(-ish, depending on requirements) field that
> represents the element location, e.g.:
>
> <Section1>
>   <Element1 id="0" type="A" meta1="val11" meta2="val21">value1</Element1>
>   <Element1 id="1" type="B" meta1="val12" meta2="val21">value2</Element1>
>   ...
> </Section1>
>
> ==>
>
> Lucene Document field-name:value
>
> doc #1
>   xml_doc_id:1
>   xpath:/Section1/Element1[1]
>   id:0
>   type:A
>   meta1:val11
>   meta2:val21
>   value:value1
>
> doc #2
>   xml_doc_id:1
>   xpath:/Section1/Element1[2]
>   id:1
>   type:B
>   meta1:val12
>   meta2:val21
>   value:value2
>
> Hope it helps,
> Steve
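
For reference, the set-based counting Steve describes, extended to group hits
per source document, might be sketched roughly as follows. This is plain
Python with no Lucene calls: the `hits` list of (xml_doc_id, xpath) pairs and
the `group_hits_by_doc` helper are hypothetical stand-ins for the stored field
values read from each hit, not part of any Lucene API.

```python
from collections import defaultdict


def group_hits_by_doc(hits):
    """Group (xml_doc_id, xpath) pairs by xml_doc_id, keeping hit order."""
    groups = defaultdict(list)
    for xml_doc_id, xpath in hits:
        groups[xml_doc_id].append(xpath)
    return dict(groups)


# Hypothetical hits: the xml_doc_id and xpath values of each matching
# element-level document, in the order the search returned them.
hits = [
    ("1", "/Section1/Element1[1]"),
    ("2", "/Section1/Element1[4]"),
    ("1", "/Section1/Element1[2]"),
]

groups = group_hits_by_doc(hits)

# Number of distinct original XML documents among the hits, i.e. the
# cardinality of the set of xml_doc_id values.
matching_docs = len(groups)
```

This is still the "iterate over all results" approach from my question; it
only shows that a single pass yields both the per-document grouping and the
distinct-document count.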