Hi Michael,
Ah! I think we may have hit a regression bug here. We have identified the
problem, and the fix is rather simple. We were already in the process of
getting a performance enhancement out in a day or two. Would it be useful
to you if we pushed the bug fix in as part of that release? Alternatively,
we can provide a patch with the fix and you could take a snapshot (which is
generally a bad idea :)). A simple workaround is to avoid adding empty
docsets to the OrDocIdSet, which is clearly suboptimal but works.

Anmol

Michael Mastroianni wrote:
>
> Thanks for the response (and the library, of course :)). I figured out
> the order thing by looking at your tests (I should have done that
> first). It might be a good idea to have a ctor that takes a sorted array
> of ints, since it looks like in situations where you are, for instance,
> loading a docset from a hitcollector, you have to store things that way
> anyway.
>
> I have another question about the boolean docsets. If I have an
> AndDocIdSet with a bunch of OrDocIdSets inside it, and any of those
> contain an empty basic DocSet, the iterator on the AndDocIdSet will blow
> up on calls to next(). I'm not sure whether this is by design or a bug,
> but it might be a good thing to put in the javadoc. I can reproduce this
> behavior with the following junit test. Is this something you are aware
> of?
>
> regards,
> Michael
>
> public void testPartialEmptyAnd() throws IOException
> {
>     try
>     {
>         // First OR set: one empty docset (ds1) and one populated docset (ds2).
>         DocSet ds1 = new P4DDocIdSet();
>         DocSet ds2 = new P4DDocIdSet();
>         ds2.addDoc(42);
>         ds2.addDoc(43);
>         ds2.addDoc(44);
>         ArrayList<DocIdSet> docs = new ArrayList<DocIdSet>();
>         docs.add(ds1);
>         docs.add(ds2);
>         OrDocIdSet orlist1 = new OrDocIdSet(docs);
>
>         // Second OR set: same shape, empty ds3 plus populated ds4.
>         DocSet ds3 = new P4DDocIdSet();
>         DocSet ds4 = new P4DDocIdSet();
>         ds4.addDoc(42);
>         ds4.addDoc(43);
>         ds4.addDoc(44);
>         ArrayList<DocIdSet> docs2 = new ArrayList<DocIdSet>();
>         docs2.add(ds3);
>         docs2.add(ds4);
>         OrDocIdSet orlist2 = new OrDocIdSet(docs2);
>
>         // AND the two OR sets together and walk the result.
>         ArrayList<DocIdSet> docs3 = new ArrayList<DocIdSet>();
>         docs3.add(orlist1);
>         docs3.add(orlist2);
>         AndDocIdSet andlist = new AndDocIdSet(docs3);
>
>         DocIdSetIterator iter = andlist.iterator();
>         @SuppressWarnings("unused")
>         int docId = -1;
>         while (iter.next())
>         {
>             docId = iter.doc();
>         }
>     }
>     catch (Exception e)
>     {
>         // The exception is the behavior being reproduced.
>         System.out.println(e.getMessage());
>         return;
>     }
>     // Reached only if iteration finished without throwing.
>     assertTrue(false);
> }
>
> -----Original Message-----
> From: molz [mailto:anmol.bha...@gmail.com]
> Sent: Tuesday, April 28, 2009 9:00 PM
> To: java-user@lucene.apache.org
> Subject: RE: kamikaze
>
>
> Hi Michael,
>
> Thanks for trying out Kamikaze, for starters. So I guess there are a few
> issues here:
>
> 1. getDocSetInstance(int min, max, count, DocSetFactory.FOCUS) assumes
> that count < max. I guess that's an API check we should add anyway to
> improve usability. That is not to say that it will not work if
> count > max, but we have not done the due diligence on that one.
>
> 2. The way you are inserting the elements is not quite right. The addDoc
> method assumes you insert the elements in sorted order. Calling
> doc.addDoc(rand.nextInt(maxDoc)) does not ensure that you are loading
> the docSet in sorted order. This matters especially in the BitSet and
> P4D set cases, as P4D encodes only the delta values between consecutive
> integers.
>
> 3. I would recommend using FOCUS.OPTIMAL for the best performance/space
> tradeoff, although SPACE should work too. If you find any issues with
> that, let us know; we will be glad to fix it.
>
> 4. Finally, I believe you want to just get a plain vanilla docSet from
> one of the OR/AND sets. This would be cool to do; however, the idea with
> Boolean sets is that they are never really materialized: they are
> iterated over on the fly. I believe we could do an enhancement to
> construct the docSet on the fly while iterating the Boolean DocSet, but
> as of now there is no established way of doing that.
>
> Hope I covered all your concerns. I rewrote and ran your test case like
> this:
>
> public class KamikazeTest extends TestCase
> {
>     public void testGrowingP4()
>     {
>         DocSet doc = DocSetFactory.getDocSetInstance(0, 35000000, 200000,
>                 DocSetFactory.FOCUS.SPACE);
>         Random rand = new Random(System.currentTimeMillis());
>         // int maxDoc = 3500000;
>         // doc.addDoc(0);
>
>         int i = 0;
>         try
>         {
>             // Add ids in ascending order, advancing by a random step.
>             while (i < 500000)
>             {
>                 int nextDoc = i;
>                 doc.addDoc(nextDoc);
>                 i += rand.nextInt(50);
>             }
>         }
>         catch (Exception e)
>         {
>             e.printStackTrace();
>             return;
>         }
>         assertTrue(true);
>     }
> }
>
> Thanks,
> Anmol
>
> Software Engineer
> Anmol Bhasin
> www.linkedin.com
>
>
> Michael Mastroianni wrote:
>>
>> Hi--
>>
>> I just got kamikaze somewhat integrated into a project of mine. I'm
>> having problems growing the DocIdSets, though. Up to the point where
>> the first regrow happens, everything is fine. Once the regrow happens,
>> I get an ArrayIndexOutOfBoundsException. The following unit test will
>> exhibit this behavior. If I change the third param of getDocSetInstance
>> to be something lower, I get a p4Doc; if I leave it as is, I get an
>> OpenBitSet doc. In either case, I get the same crash. Do I need to
>> initialize the docs in some way other than just creating them?
>>
>> regards,
>> Michael
>>
>> import java.util.Random;
>>
>> import org.apache.lucene.search.DocIdSet;
>> import org.apache.lucene.util.OpenBitSet;
>>
>> import com.kamikaze.docidset.api.DocSet;
>> import com.kamikaze.docidset.impl.AndDocIdSet;
>> import com.kamikaze.docidset.impl.OrDocIdSet;
>> import com.kamikaze.docidset.utils.DocSetFactory;
>>
>> import junit.framework.TestCase;
>>
>>
>> public class KamikazeTest extends TestCase
>> {
>>     public void testGrowingP4()
>>     {
>>         DocSet doc = DocSetFactory.getDocSetInstance(0, 350000, 3000000,
>>                 DocSetFactory.FOCUS.SPACE);
>>         Random rand = new Random(System.currentTimeMillis());
>>         int maxDoc = 350000;
>>         doc.addDoc(rand.nextInt(maxDoc));
>>         int i = 0;
>>         try
>>         {
>>             // Add 256 more ids in random (unsorted) order.
>>             while (i < 256)
>>             {
>>                 int nextDoc = rand.nextInt(maxDoc);
>>                 doc.addDoc(nextDoc);
>>                 ++i;
>>             }
>>         }
>>         catch (Exception e)
>>         {
>>             // The exception is the crash being reported.
>>             return;
>>         }
>>         assertTrue(false);
>>     }
>> }
>>
>> -----Original Message-----
>> From: John Wang [mailto:john.w...@gmail.com]
>> Sent: Friday, April 24, 2009 7:50 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: kamikaze
>>
>> Hi Michael:
>> We are using it internally here at LinkedIn for both our search
>> engine as well as our social graph engine. And we have a team
>> developing actively on it. Let us know how we can help you.
>>
>> -John
>>
>> On Fri, Apr 24, 2009 at 1:56 PM, Michael Mastroianni <
>> mmastroia...@glgroup.com> wrote:
>>
>>> Hi--
>>>
>>> Has anyone here used kamikaze much? I'm interested in using it in
>>> situations where I'll have several docidsets of >2M, plus several in
>>> the 10s of thousands.
>>>
>>> On a prototype basis, I got something running nicely using OpenBitSet,
>>> but I can't use that much memory for my real application.
>>>
>>> regards,
>>>
>>> Michael Mastroianni
>>>
>>> This e-mail message, and any attachments, is intended only for the use
>>> of the individual or entity identified in the alias address of this
>>> message and may contain information that is confidential, privileged
>>> and subject to legal restrictions and penalties regarding its
>>> unauthorized disclosure and use. Any unauthorized review, copying,
>>> disclosure, use or distribution is strictly prohibited. If you have
>>> received this e-mail message in error, please notify the sender
>>> immediately by reply e-mail and delete this message, and any
>>> attachments, from your system. Thank you.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
> --
> View this message in context:
> http://www.nabble.com/kamikaze-tp23224760p23288825.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.

--
View this message in context: http://www.nabble.com/kamikaze-tp23224760p23302407.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
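
A note on point 2 of the quoted reply, which says addDoc expects ids in ascending order because P4D stores only the deltas between consecutive integers. The following is a minimal, self-contained sketch of that idea and of the "sort first, then load" pattern Michael describes. It is not Kamikaze code: DeltaList, sortedDistinct, and fromUnsorted are hypothetical names invented for this illustration, and the real P4D encoding compresses the deltas further rather than keeping them in a plain list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: a toy delta-encoded id list, not the Kamikaze implementation. */
public class SortedDeltaSketch {

    /** Stores the first id as-is and every later id as the gap to its predecessor. */
    static final class DeltaList {
        private final List<Integer> deltas = new ArrayList<>();
        private int last = -1;

        /** Appends an id; ids must arrive in strictly increasing order. */
        void add(int docId) {
            if (docId <= last) {
                throw new IllegalArgumentException(
                        "ids must be strictly increasing: got " + docId + " after " + last);
            }
            deltas.add(last < 0 ? docId : docId - last);
            last = docId;
        }

        /** Rebuilds the original ids by summing the stored gaps. */
        int[] decode() {
            int[] out = new int[deltas.size()];
            int current = 0;
            for (int i = 0; i < out.length; i++) {
                current += deltas.get(i);
                out[i] = current;
            }
            return out;
        }
    }

    /** Sorts ids and drops duplicates so they can be appended safely. */
    static int[] sortedDistinct(int[] ids) {
        return Arrays.stream(ids).sorted().distinct().toArray();
    }

    /** Loads ids collected in arbitrary order (for example from a hit collector). */
    static DeltaList fromUnsorted(int[] ids) {
        DeltaList list = new DeltaList();
        for (int id : sortedDistinct(ids)) {
            list.add(id);
        }
        return list;
    }

    public static void main(String[] args) {
        int[] collected = {44, 42, 43, 42};
        DeltaList list = fromUnsorted(collected);
        System.out.println(Arrays.toString(list.decode())); // prints [42, 43, 44]
    }
}
```

In this toy version an out-of-order add fails fast with an IllegalArgumentException. Whether Kamikaze validates the order, or simply misbehaves later, is not something this sketch can show.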