Re: Group by in Lucene ?

Marcus Herou Wed, 28 Jan 2009 03:45:35 -0800

Oh btw, faceting is easy; it's the distinct part that I think is hard.

Example Lucene Facet:
http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html
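
A side note on the distinct part: the collector quoted below adds
s.hashCode() to an int set, so it counts distinct hash codes rather than
distinct values, and two different uids whose hashes collide are counted
once. Here is a minimal plain-Java sketch of the difference. It assumes
java.util.HashSet in place of Trove's TIntHashSet, and the class and method
names are made up for illustration (they are not from Lucene or from the
code in this thread):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene or of the code in this thread.
class DistinctCountSketch
{
    // Counts distinct 32-bit hash codes, the way the quoted collector does.
    static int countByHash(List<String> values)
    {
        Set<Integer> hashes = new HashSet<Integer>();
        for (String s : values)
        {
            if (s != null)
            {
                hashes.add(s.hashCode());
            }
        }
        return hashes.size();
    }

    // Counts distinct values exactly, at the cost of keeping the strings.
    static int countExact(List<String> values)
    {
        Set<String> seen = new HashSet<String>();
        for (String s : values)
        {
            if (s != null)
            {
                seen.add(s);
            }
        }
        return seen.size();
    }
}
```

For example, "Aa" and "BB" both hash to 2112 in Java, so for a list that
contains both, countByHash reports one value fewer than countExact.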


On Wed, Jan 28, 2009 at 12:43 PM, Marcus Herou
<marcus.he...@tailsweep.com> wrote:

> Hi.
>
> I think this is way too slow; what you describe is something I have
> already tested. However, I might be using the HitCollector badly.
>
> Please prove me wrong. Here is the code I tested with.
> It stores a hash of each hit's group field value in a TIntHashSet and then
> just takes the size of that set.
> This takes approx. 3 sec on about 0.5M rows, which is way too slow.
>
>
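> main test class:
> import java.text.DateFormat;
> import java.text.SimpleDateFormat;
> import org.apache.commons.logging.Log;
> import org.apache.commons.logging.LogFactory;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.queryParser.MultiFieldQueryParser;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
>
> // IndexFactory and Utils are helper classes that are not shown here.
> public class GroupingTest
> {
>     protected static final Log log = LogFactory.getLog(GroupingTest.class.getName());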
>     static DateFormat df = new SimpleDateFormat("yyyy-MM-dd");
>     public static void main(String[] args)
>     {
>         Utils.initLogger();
>         String[] fields = {"uid","ip","date","siteId","visits","countryCode"};
>         try
>         {
>             IndexFactory fact = new IndexFactory();
>             String d = "/tmp/csvtest";
>             fact.initDir(d);
>             IndexReader reader = fact.getReader(d);
>             IndexSearcher searcher = fact.getSearcher(d, reader);
>             QueryParser parser = new MultiFieldQueryParser(fields, fact.getAnalyzer());
>             Query q = parser.parse("date:20090125");
>
>
>             GroupingHitCollector coll = new GroupingHitCollector();
>             coll.setDistinct(true);
>             coll.setGroupField("uid");
>             coll.setIndexReader(reader);
>             long start = System.currentTimeMillis();
>             searcher.search(q, coll);
>             long stop = System.currentTimeMillis();
>             System.out.println("Time: " + (stop-start) + ", distinct count(uid):"+coll.getDistinctCount() + ", count(uid): "+coll.getCount());
>         }
>         catch (Exception e)
>         {
>             log.error(e.toString(), e);
>         }
>     }
> }
>
>
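> import java.io.IOException;
> import gnu.trove.TIntHashSet;  // Trove 2.x package name
> import org.apache.lucene.document.Document;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.search.HitCollector;
>
> public class GroupingHitCollector extends HitCollector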
> {
>     protected IndexReader indexReader;
>     protected String groupField;
>     protected boolean distinct;
>     //protected TLongHashSet set;
>     protected TIntHashSet set;
>     protected int distinctSize;
>
>     int count = 0;
>     int sum = 0;
>
>     public GroupingHitCollector()
>     {
>         set = new TIntHashSet();
>     }
>
>     public String getGroupField()
>     {
>         return groupField;
>     }
>
>     public void setGroupField(String groupField)
>     {
>         this.groupField = groupField;
>     }
>
>     public IndexReader getIndexReader()
>     {
>         return indexReader;
>     }
>
>     public void setIndexReader(IndexReader indexReader)
>     {
>         this.indexReader = indexReader;
>     }
>
>     public boolean isDistinct()
>     {
>         return distinct;
>     }
>
>     public void setDistinct(boolean distinct)
>     {
>         this.distinct = distinct;
>     }
>
>     public void collect(int doc, float score)
>     {
>         if(distinct)
>         {
>             try
>             {
>                 Document document = this.indexReader.document(doc);
>                 if(document != null)
>                 {
>                     String s = document.get(groupField);
>                     if(s != null)
>                     {
>                         set.add(s.hashCode());
>                         //set.add(Crc64.generate(s));
>                     }
>                 }
>             }
>             catch (IOException e)
>             {
>                 e.printStackTrace();
>             }
>         }
>         count++;
>         sum += doc;  // use it to avoid any possibility of being optimized away
>     }
>
>     public int getCount() { return count; }
>     public int getSum() { return sum; }
>
>     public int getDistinctCount()
>     {
>         distinctSize = set.size();
>         return distinctSize;
>
>     }
> }
>
>
> On Wed, Jan 28, 2009 at 10:51 AM, ninaS <nina...@gmx.de> wrote:
>
>>
>> By the way: if you only need to count documents (count groups), HitCollector
>> is a good choice. If you only count, you don't need to sort anything.
>>
>>
>> ninaS wrote:
>> >
>> > Hello,
>> >
>> > yes, I tried HitCollector, but I am not satisfied with it because you
>> > cannot sort with a HitCollector unless you implement a way to use
>> > TopFieldDocCollector. I did not manage to do that in a performant way.
>> >
>> > It is easier to first do a normal search and "group by" afterwards:
>> >
>> > Iterate through the result documents and take one of each group. Each
>> > document has a groupingKey. I remember which groupingKey is already used
>> > and don't take another document of this group into the result list.
>> >
>> > Regards,
>> > Nina
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Group-by-in-Lucene---tp13581760p21702742.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.he...@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>
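
For reference, the post-search "group by" that Nina describes in the quoted
text above (walk the result list, keep the first document of each group,
skip documents whose groupingKey has already been used) could look roughly
like this in plain Java. GroupedHit and PostSearchGrouping are hypothetical
names for illustration, not Lucene classes; with Lucene the key would have
to be read from each hit in some way that this sketch does not cover:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a search hit; not a Lucene class.
class GroupedHit
{
    final int docId;
    final String groupingKey;

    GroupedHit(int docId, String groupingKey)
    {
        this.docId = docId;
        this.groupingKey = groupingKey;
    }
}

class PostSearchGrouping
{
    // Keeps the first hit of each group and preserves the incoming order,
    // so a result list that is already sorted stays sorted.
    static List<GroupedHit> firstPerGroup(List<GroupedHit> sortedHits)
    {
        Set<String> usedKeys = new HashSet<String>();
        List<GroupedHit> result = new ArrayList<GroupedHit>();
        for (GroupedHit hit : sortedHits)
        {
            // Set.add returns false when the key was already present.
            if (usedKeys.add(hit.groupingKey))
            {
                result.add(hit);
            }
        }
        return result;
    }
}
```

Because the sort happens in the normal search and the grouping only filters
afterwards, the grouped list keeps whatever order the search produced.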



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
