Hi Eran, On 03/11/2008 at 12:26 PM, Eran Sevi wrote: > If I query this index structure and get results from several > xml docs, is there a better way to group results by doc id, > other then iterating on all results, get original document > and check the value of xml_doc_id field?
Perhaps a Sort would do the grouping you want?: <http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Sort.html> Check out Lucene's TestSort.java for usage hints: <http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_3_1/src/test/org/apache/lucene/search/TestSort.java?view=markup> Chapter 5 in Erik Hatcher's and Otis Gospodnetic's excellent book "Lucene in Action" covers sorting: <http://www.manning.com/hatcher2/> Steve > On Tue, Mar 11, 2008 at 5:48 PM, Steven A Rowe <[EMAIL PROTECTED]> wrote: > > Hi Eran, see my comments below inline: > > > > On 03/11/2008 at 9:23 AM, Eran Sevi wrote: > > > I would like to ask for suggestions of the best design for > > > the following scenario: > > > > > > I have a very large number of XML files (around 1M). > > > Each file contains several sections. Each section contains > > > many elements (about 1000-5000). > > > Each element has a value and some attributes describing the > > > value (like > > > metadata), for example: > > > > > > <Section1> > > > <Element1 id="0" type="A" meta1="val11" > > > meta2="val21">value1</Element1> <Element1 id="1" > > > type="B" meta1="val12" meta2="val21">value2</Element1> > > > ... > > > </Section1> > > > <Section2> > > > <Element2 id="0" type="D" meta1="val11" > > > meta3="val31">value3</Element2> <Element2 id="1" > > > type="B" meta1="val13" meta3="val34">value1</Element2> > > > ... > > > <Section2> > > > ... > > > > > > As you can see, each attribute can have any value, and > > > attribute names can be the same in different sections. > > > > > > I would like to index the XML in such a way so I can perform > > > queries like: > > > > > > element1=value1 AND type=A AND meta2=val21 > > > > > > and also more complicated queries that include positions > > > between elements, and even range queries on attribute values. > > > > > > Indexing each element as a different document might not be > > > possible because of the large number of documents it might > > > create (more then 5 billion docs), and might also make it > > > difficult to parse results - I still want to know how > > > many original XML documents contains the searched terms. > > > > 5 billion docs is within the range that Lucene can handle. I think you > > should try doc = element and see how well it works. > > > > In order to know which original documents your hits come from, add an > > "xml_doc_id" field, and collect the hits' xml_doc_id values in a set, > > then take the set's cardinality. > > > > > Indexing each attribute as a different field is also > > > difficult because I then need the positional information > > > of the found terms and check that they were all found in > > > the same place (and thus "belong" to the same element). > > > > You could use an XPath(-ish, depending on requirements) field that > > represents the element location, e.g.: > > > > <Section1> > > <Element1 id="0" type="A" meta1="val11" > > meta2="val21">value1</Element1> <Element1 id="1" type="B" > > meta1="val12" meta2="val21">value2</Element1> ... > > </Section1> > > > > ==> > > > > Lucene Document field-name:value > > > > doc #1 > > xml_doc_id:1 > > xpath:/Section1/Element1[1] > > id:0 > > type:A > > meta1:val11 > > meta2:val21 > > value:value1 > > > > doc #2 > > xml_doc_id:1 > > xpath:/Section1/Element1[2] > > id:1 > > type:B > > meta1:val12 > > meta2:val21 > > value:value2 > > > > Hope it helps, > > Steve > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [EMAIL PROTECTED] For > > additional commands, e-mail: [EMAIL PROTECTED] > > > > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]