Re: Getting term vectors/computing cosine similarity

Michael O'Leary Wed, 28 May 2014 10:17:21 -0700

That works. Thank you very much!
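
For readers of the archive, the cast suggested in the quoted reply below can be sketched roughly as follows. This is a minimal sketch, assuming PyLucene 4.8.1 and an already opened IndexReader; the helper name term_frequencies is hypothetical and is not part of PyLucene or of the code discussed in this thread. It also assumes that, for a single-document term vector, totalTermFreq() gives the term's frequency in that document.

```python
def term_frequencies(reader, doc_id, field):
    """Return {term: frequency} for one stored term vector (PyLucene 4.x)."""
    # Imported lazily so this module can be loaded before lucene.initVM()
    # has been called; the calls below still need a running JVM.
    from org.apache.lucene.util import BytesRefIterator

    terms = reader.getTermVector(doc_id, field)
    if terms is None:  # no term vector stored for this document/field
        return {}

    # iterator(None) comes back wrapped as a plain TermsEnum, which does
    # not expose next() on the Python side.
    terms_enum = terms.iterator(None)

    freqs = {}
    # Casting to BytesRefIterator exposes next() and makes it iterable.
    for term in BytesRefIterator.cast_(terms_enum):
        # Assumption: within a single-document term vector, totalTermFreq()
        # is the term's frequency in that document. The BytesRef may be
        # reused, so it is converted to a string right away.
        freqs[term.utf8ToString()] = int(terms_enum.totalTermFreq())
    return freqs
```

With the names from the original question, this would be called as term_frequencies(reader, doc_id_1, "cat_ids") after lucene.initVM() has run.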


On Wed, May 28, 2014 at 9:59 AM, Aric Coady <aric.co...@gmail.com> wrote:

> On May 28, 2014, at 12:03 AM, Michael O'Leary <mich...@moz.com> wrote:
> > Hi Andi,
> > Thanks for the help. I just tried to import TVTermsEnum so I could try
> > casting my iter, and I don't see how to do it since TVTermsEnum is a
> > private class with fully qualified name
> > org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.
> > I tried
>
> Cast the TermsEnum object with BytesRefIterator.cast_.  Then it will have
> a next method, and be python-iterable.
>
> Here’s an example that outputs the term vectors as a generator.  Look at
> the vector method just above:
>
> https://pythonhosted.org/lupyne/_modules/lupyne/engine/indexers.html#IndexReader.termvector
>
> > from org.apache.lucene.codecs.compressing import CompressingTermVectorsReader$TVTermsEnum
> > from org.apache.lucene.codecs.compressing import TVTermsEnum
> > and
> > import org.apache.lucene.codecs.compressing
> >
> > but none of them provided access to TVTermsEnum (the first two raised
> > exceptions). After running import org.apache.lucene.codecs.compressing, I
> > could do dir(org.apache.lucene.codecs.compressing) and see the contents of
> > that module. CompressingTermVectorsReader was listed, but TVTermsEnum
> > wasn't. TVTermsEnum also wasn't listed in the output of
> > dir(org.apache.lucene.codecs.compressing.CompressingTermVectorsReader). So
> > it looks like my first problem is how to get access to TVTermsEnum.
> > Mike
> >
> >
> > On Tue, May 27, 2014 at 11:10 PM, Andi Vajda <va...@apache.org> wrote:
> >
> >>
> >>> On May 27, 2014, at 19:17, "Michael O'Leary" <mich...@moz.com> wrote:
> >>>
> >>> *tl;dr*: a next() method is defined for the Java class TVTermsEnum in
> >>> Lucene 4.8.1, but it looks like there is no next() method available for an
> >>> object that looks like it is an instance of the Python class TVTermsEnum in
> >>> PyLucene 4.8.1.
> >>
> >> If there is a next() method, there is a good chance the object is even
> >> iterable (in the python sense). You may need to cast it first, though, as
> >> the api that returned it to you may not be defined to return TVTermsEnum:
> >>  TVTermsEnum.cast_(obj)
> >>
> >> A good place for PyLucene code examples is its suite of unit tests. It
> >> also has a few samples - way less than in 3.x releases because the APIs
> >> changed too much.
> >> I'm pretty sure there is a test involving TermsEnum in the tests directory.
> >>
> >> Andi..
> >>
> >>> I have a set of documents that I would like to cluster. These documents
> >>> share a vocabulary of only about 3,000 unique terms, but there are about
> >>> 15,000,000 documents. One way I thought of doing this would be to index the
> >>> documents using PyLucene (Python is the preferred programming language at
> >>> work), obtain term vectors for the documents using PyLucene API functions,
> >>> and calculate cosine similarities between pairs of term vectors in order to
> >>> determine which documents are close to each other.
> >>>
> >>> I found some sample Java code on the web that various people have posted
> >>> showing ways to do this with older versions of Lucene. I downloaded
> >>> PyLucene 4.8.1 and compared its API functions with the ones used in the
> >>> code samples, and saw that this is an area of Lucene that has changed quite
> >>> a bit. I can send an email to the lucene-user mailing group to ask what
> >>> would be a good way of doing this using version 4.8.1, but the question I
> >>> have for this mailing group has to do with some Java API functions that it
> >>> looks like are not exposed in Python, unless I have to go about accessing
> >>> them in a different way.
> >>>
> >>> If I obtain the term vector for the field "cat_ids" in a document with id
> >>> doc_id_1
> >>>
> >>> doc_1_tfv = reader.getTermVector(doc_id_1, "cat_ids")
> >>>
> >>> then doc_1_tfv is displayed as this object:
> >>>
> >>> <Terms: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTerms@32c46396>
> >>>
> >>> In some of the sample code I looked at, the terms in doc_1_tfv could be
> >>> obtained with doc_1_tfv.getTerms(), but it looks like getTerms is not a
> >>> member function of Terms or its subclasses any more. In another code
> >>> sample, an iterator for the term vector is obtained via tfv_iter =
> >>> doc_1_tfv.iterator(None) and then the terms are obtained one by one with
> >>> calls to tfv_iter.next(). This is where I get stuck. tfv_iter has this
> >>> value:
> >>>
> >>> <TermsEnum: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum@1cca2369>
> >>>
> >>> and there is a next() function defined for the TVTermsEnum class, but this
> >>> object doesn't list next() as one of its member functions and an exception
> >>> is raised if it is called. It looks like the object only supports the
> >>> member functions defined for the TermsEnum class, and next() is not one of
> >>> them. Is this the case, or is there a way to have it support all of the
> >>> TVTermsEnum member functions, including next()? TVTermsEnum is a private
> >>> class in CompressingTermVectorsReader.java.
> >>>
> >>> So I am wondering if there is a way to obtain term vectors in this way and
> >>> that I am just not treating doc_1_tfv and tfv_iter in the right way, or if
> >>> there is a different, better way to get term vectors for documents in a
> >>> PyLucene index, or if this isn't something that Lucene should be used for.
> >>> Thank you very much for any help you can provide.
> >>> Mike
> >>
>
>
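
The per-pair cosine similarity step mentioned in the original question can be sketched in plain Python. This is a minimal sketch, assuming each term vector has already been converted to a dict mapping term to frequency; the function name cosine_similarity and the sample category ids are hypothetical, not something taken from the thread or from PyLucene.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse term-frequency dicts (term -> count)."""
    # Walk the smaller dict so the dot product touches as few terms as possible.
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a
    dot = sum(count * vec_b.get(term, 0) for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


# Hypothetical "cat_ids" term frequencies for two documents.
doc_1 = {"12": 2, "7": 1}
doc_2 = {"12": 1, "40": 3}
similarity = cosine_similarity(doc_1, doc_2)
```

This covers only the comparison of one pair. With about 15,000,000 documents there are on the order of 10^14 distinct pairs, so comparing every pair directly is unlikely to be practical.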
