Hi.

This is way too slow I think since what you are explaining is something I
already tested. However I might be using the HitCollector badly.

Please prove me wrong. Supplying some code which I tested this with.
It stores a hash of the value of the term in a TIntHashSet and just
calculates the size of that set.
This one takes approx 3 sec on about 0.5M rows = way too slow.


main test class:
public class GroupingTest
{
    protected static final Log log =
LogFactory.getLog(GroupingTest.class.getName());
    static DateFormat df = new SimpleDateFormat("yyyy-MM-dd");
    public static void main(String[] args)
    {
        Utils.initLogger();
        String[] fields =
{"uid","ip","date","siteId","visits","countryCode"};
        try
        {
            IndexFactory fact = new IndexFactory();
            String d = "/tmp/csvtest";
            fact.initDir(d);
            IndexReader reader = fact.getReader(d);
            IndexSearcher searcher = fact.getSearcher(d, reader);
            QueryParser parser = new MultiFieldQueryParser(fields,
fact.getAnalyzer());
            Query q = parser.parse("date:20090125");


            GroupingHitCollector coll = new GroupingHitCollector();
            coll.setDistinct(true);
            coll.setGroupField("uid");
            coll.setIndexReader(reader);
            long start = System.currentTimeMillis();
            searcher.search(q, coll);
            long stop = System.currentTimeMillis();
            System.out.println("Time: " + (stop-start) + ", distinct
count(uid):"+coll.getDistinctCount() + ", count(uid): "+coll.getCount());
        }
        catch (Exception e)
        {
            log.error(e.toString(), e);
        }
    }
}


public class GroupingHitCollector  extends HitCollector
{
    protected IndexReader indexReader;
    protected String groupField;
    protected boolean distinct;
    //protected TLongHashSet set;
    protected TIntHashSet set;
    protected int distinctSize;

    int count = 0;
    int sum = 0;

    public GroupingHitCollector()
    {
        set = new TIntHashSet();
    }

    public String getGroupField()
    {
        return groupField;
    }

    public void setGroupField(String groupField)
    {
        this.groupField = groupField;
    }

    public IndexReader getIndexReader()
    {
        return indexReader;
    }

    public void setIndexReader(IndexReader indexReader)
    {
        this.indexReader = indexReader;
    }

    public boolean isDistinct()
    {
        return distinct;
    }

    public void setDistinct(boolean distinct)
    {
        this.distinct = distinct;
    }

    public void collect(int doc, float score)
    {
        if(distinct)
        {
            try
            {
                Document document = this.indexReader.document(doc);
                if(document != null)
                {
                    String s = document.get(groupField);
                    if(s != null)
                    {
                        set.add(s.hashCode());
                        //set.add(Crc64.generate(s));
                    }
                }
            }
            catch (IOException e)
            {
                e.printStackTrace();
            }
        }
        count++;
        sum += doc;  // use it to avoid any possibility of being optimized
away
    }

    public int getCount() { return count; }
    public int getSum() { return sum; }

    public int getDistinctCount()
    {
        distinctSize = set.size();
        return distinctSize;
    }
}


On Wed, Jan 28, 2009 at 10:51 AM, ninaS <nina...@gmx.de> wrote:

>
> By the way: if you only need to count documents (count groups) HitCollector
> is a good choice. If you only count you don't need to sort anything.
>
>
> ninaS wrote:
> >
> > Hello,
> >
> > yes I tried HitCollector but I am not satisfied with it because you can
> > not use sorting with HitCollector unless you implement a way to use
> > TopFieldTocCollector. I did not manage to do that in a performant way.
> >
> > It is easier to first do a normal search und "group by" afterwards:
> >
> > Iterate through the result documents and take one of each group. Each
> > document has a groupingKey. I remember which groupingKey is already used
> > and don't take another document of this group into the result list.
> >
> > Regards,
> > Nina
> >
>
> --
> View this message in context:
> http://www.nabble.com/Group-by-in-Lucene---tp13581760p21702742.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Reply via email to