Hi Eran, see my comments below inline: On 03/11/2008 at 9:23 AM, Eran Sevi wrote: > I would like to ask for suggestions of the best design for > the following scenario: > > I have a very large number of XML files (around 1M). > Each file contains several sections. Each section contains > many elements (about 1000-5000). > Each element has a value and some attributes describing the > value (like > metadata), for example: > > <Section1> > <Element1 id="0" type="A" meta1="val11" > meta2="val21">value1</Element1> > <Element1 id="1" type="B" meta1="val12" > meta2="val21">value2</Element1> > ... > </Section1> > <Section2> > <Element2 id="0" type="D" meta1="val11" > meta3="val31">value3</Element2> > <Element2 id="1" type="B" meta1="val13" > meta3="val34">value1</Element2> > ... > <Section2> > ... > > As you can see, each attribute can have any value, and > attribute names can be the same in different sections. > > I would like to index the XML in such a way so I can perform > queries like: > > element1=value1 AND type=A AND meta2=val21 > > and also more complicated queries that include positions > between elements, and even range queries on attribute values. > > Indexing each element as a different document might not be > possible because of the large number of documents it might > create (more then 5 billion docs), and might also make it > difficult to parse results - I still want to know how > many original XML documents contains the searched terms.
5 billion docs is within the range that Lucene can handle. I think you should try doc = element and see how well it works. In order to know which original documents your hits come from, add an "xml_doc_id" field, and collect the hits' xml_doc_id values in a set, then take the set's cardinality. > Indexing each attribute as a different field is also > difficult because I then need the positional information > of the found terms and check that they were all found in > the same place (and thus "belong" to the same element). You could use an XPath(-ish, depending on requirements) field that represents the element location, e.g.: <Section1> <Element1 id="0" type="A" meta1="val11" meta2="val21">value1</Element1> <Element1 id="1" type="B" meta1="val12" meta2="val21">value2</Element1> ... </Section1> ==> Lucene Document field-name:value doc #1 xml_doc_id:1 xpath:/Section1/Element1[1] id:0 type:A meta1:val11 meta2:val21 value:value1 doc #2 xml_doc_id:1 xpath:/Section1/Element1[2] id:1 type:B meta1:val12 meta2:val21 value:value2 Hope it helps, Steve --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]