That works. Thank you very much!
On Wed, May 28, 2014 at 9:59 AM, Aric Coady <aric.co...@gmail.com> wrote:

> On May 28, 2014, at 12:03 AM, Michael O'Leary <mich...@moz.com> wrote:
> > Hi Andi,
> > Thanks for the help. I just tried to import TVTermsEnum so I could try
> > casting my iter, and I don't see how to do it since TVTermsEnum is a
> > private class with fully qualified name
> > org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.
> > I tried
>
> Cast the TermsEnum object with BytesRefIterator.cast_. Then it will have
> a next method, and be python-iterable.
>
> Here’s an example that outputs the term vectors as a generator. Look at
> the vector method just above:
>
> https://pythonhosted.org/lupyne/_modules/lupyne/engine/indexers.html#IndexReader.termvector
>
> > from org.apache.lucene.codecs.compressing import CompressingTermVectorsReader$TVTermsEnum
> > from org.apache.lucene.codecs.compressing import TVTermsEnum
> > and
> > import org.apache.lucene.codecs.compressing
> >
> > but none of them provided access to TVTermsEnum (the first two raised
> > exceptions). After running import org.apache.lucene.codecs.compressing, I
> > could do dir(org.apache.lucene.codecs.compressing) and see the contents of
> > that module. CompressingTermVectorsReader was listed, but TVTermsEnum
> > wasn't. TVTermsEnum also wasn't listed in the output of
> > dir(org.apache.lucene.codecs.compressing.CompressingTermVectorsReader). So
> > it looks like my first problem is how to get access to TVTermsEnum.
> > Mike
> >
> > On Tue, May 27, 2014 at 11:10 PM, Andi Vajda <va...@apache.org> wrote:
> >
> >>> On May 27, 2014, at 19:17, "Michael O'Leary" <mich...@moz.com> wrote:
> >>>
> >>> *tl;dnr*: a next() method is defined for the Java class TVTermsEnum in
> >>> Lucene 4.8.1, but it looks like there is no next() method available for an
> >>> object that looks like it is an instance of the Python class TVTermsEnum in
> >>> PyLucene 4.8.1.
> >>
> >> If there is a next() method, there is a good chance the object is even
> >> iterable (in the python sense). You may need to cast it first, though, as
> >> the api that returned it to you may not be defined to return TVTermsEnum:
> >> TVTermsEnum.cast_(obj)
> >>
> >> A good place for PyLucene code examples is its suite of unit tests. It
> >> also has a few samples - way less than in 3.x releases because the APIs
> >> changed too much.
> >> I'm pretty sure there is a test involving TermsEnum in the tests directory.
> >>
> >> Andi..
> >>
> >>> I have a set of documents that I would like to cluster. These documents
> >>> share a vocabulary of only about 3,000 unique terms, but there are about
> >>> 15,000,000 documents. One way I thought of doing this would be to index the
> >>> documents using PyLucene (Python is the preferred programming language at
> >>> work), obtain term vectors for the documents using PyLucene API functions,
> >>> and calculate cosine similarities between pairs of term vectors in order to
> >>> determine which documents are close to each other.
> >>>
> >>> I found some sample Java code on the web that various people have posted
> >>> showing ways to do this with older versions of Lucene. I downloaded
> >>> PyLucene 4.8.1 and compared its API functions with the ones used in the
> >>> code samples, and saw that this is an area of Lucene that has changed quite
> >>> a bit. I can send an email to the lucene-user mailing group to ask what
> >>> would be a good way of doing this using version 4.8.1, but the question I
> >>> have for this mailing group has to do with some Java API functions that it
> >>> looks like are not exposed in Python, unless I have to go about accessing
> >>> them in a different way.
> >>>
> >>> If I obtain the term vector for the field "cat_ids" in a document with id
> >>> doc_id_1
> >>>
> >>> doc_1_tfv = reader.getTermVector(doc_id_1, "cat_ids")
> >>>
> >>> then doc_1_tfv is displayed as this object:
> >>>
> >>> <Terms: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTerms@32c46396
> >>>
> >>> In some of the sample code I looked at, the terms in doc_1_tfv could be
> >>> obtained with doc_1_tfv.getTerms(), but it looks like getTerms is not a
> >>> member function of Terms or its subclasses any more. In another code
> >>> sample, an iterator for the term vector is obtained via tfv_iter =
> >>> doc_1_tfv.iterator(None) and then the terms are obtained one by one with
> >>> calls to tfv_iter.next(). This is where I get stuck. tfv_iter has this
> >>> value:
> >>>
> >>> <TermsEnum: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum@1cca2369
> >>>
> >>> and there is a next() function defined for the TVTermsEnum class, but this
> >>> object doesn't list next() as one of its member functions and an exception
> >>> is raised if it is called. It looks like the object only supports the
> >>> member functions defined for the TermsEnum class, and next() is not one of
> >>> them. Is this the case, or is there a way have it support all of the
> >>> TVTermsEnum member functions, including next()? TVTermsEnum is a private
> >>> class in CompressingTermVectorsReader.java.
> >>>
> >>> So I am wondering if there is a way to obtain term vectors in this way and
> >>> that I am just not treating doc_1_tfv and tfv_iter in the right way, or if
> >>> there is a different, better way to get term vectors for documents in a
> >>> PyLucene index, or if this isn't something that Lucene should be used for.
> >>> Thank you very much for any help you can provide.
> >>> Mike
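
For anyone following the thread: the original message describes comparing documents by the cosine similarity of their term vectors. Once the terms and frequencies of each document have been read out of the index into a plain Python dict, that step can be sketched without any Lucene calls. This is only an illustrative sketch and not code from the thread; the function name cosine_similarity, the dict-of-frequencies representation, and the sample category ids are assumptions made up for the example.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse term vectors (dicts of term -> frequency).

    Hypothetical helper for illustration; the dict layout is an assumption.
    """
    # Iterate over the smaller dict so the dot product touches fewer terms.
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a
    dot = sum(freq * vec_b.get(term, 0) for term, freq in vec_a.items())
    norm_a = math.sqrt(sum(freq * freq for freq in vec_a.values()))
    norm_b = math.sqrt(sum(freq * freq for freq in vec_b.values()))
    # A document with no terms has no direction; treat it as dissimilar.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up "cat_ids" term frequencies for two documents.
doc_1 = {"12": 2, "40": 1}
doc_2 = {"12": 1, "77": 3}
sim = cosine_similarity(doc_1, doc_2)
```

With about 15,000,000 documents, comparing every pair this way would be very expensive, so a sketch like this is better suited to spot-checking pairs than to clustering the whole collection.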