Question about TermVector - Version 4.0.6

2014-09-24 Thread Weberth Fernandes
Dear, I need to retrieve all terms that are stored in various fields in the documents so that I can perform calculations of some metrics for each term t from my base document. I realized that by using the TermVector the index gets a large size, about 80% of the size of my collection of documents

Accessing TermVector content at the result page building stage

2013-10-04 Thread Igor Shalyminov
Hello! I need to access token position and payload info during the search result page building. I need to do this for 10 documents max, so retrieving TermVectors is totally OK for me. Say, I retrieve it for one document: Terms tv = _indexDirectoryReader.getTermVector(0, "wordform"); From the

RE: Problem with TermVector offsets and positions not being preserved

2012-08-24 Thread Mike O'Leary
- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Friday, August 24, 2012 9:52 AM To: java-user@lucene.apache.org Subject: Re: Problem with TermVector offsets and positions not being preserved Calling IR.document does not restore your 'original Document' completely. This is really an ag

Re: Problem with TermVector offsets and positions not being preserved

2012-08-24 Thread Robert Muir
ctorInfo is called, it displays offsets and > positions for the fields that have term vectors with offsets and positions. > The second time it is called, it doesn't display anything because none of the > term vectors satisfy termFreqVector instanceof TermPositionVector. Is it

RE: Problem with TermVector offsets and positions not being preserved

2012-08-22 Thread Mike O'Leary
term vectors in the affected fields? Is there a way to add a field to the documents in an index in which this doesn't occur? Thanks, Mike -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Friday, July 20, 2012 5:59 PM To: java-user@lucene.apache.org Subje

Re: Problem with TermVector offsets and positions not being preserved

2012-07-27 Thread Robert Muir
On Fri, Jul 27, 2012 at 9:10 AM, Andrzej Bialecki wrote: > > Catching up with this thread ... Luke 4.0-ALPHA makes a similar mistake. I > fixed this in svn (to be released in a week or so) so that: > > * Luke now actually checks whether a doc has term vectors for a particular > field and adjusts t

Re: Problem with TermVector offsets and positions not being preserved

2012-07-27 Thread Andrzej Bialecki
012 5:59 PM To: java-user@lucene.apache.org Subject: Re: Problem with TermVector offsets and positions not being preserved On Fri, Jul 20, 2012 at 8:24 PM, Mike O'Leary wrote: Hi Robert, I'm not trying to determine whether a document has term vectors, I'm trying to determine whet

RE: Problem with TermVector offsets and positions not being preserved

2012-07-26 Thread Mike O'Leary
Subject: Re: Problem with TermVector offsets and positions not being preserved On Fri, Jul 20, 2012 at 8:24 PM, Mike O'Leary wrote: > Hi Robert, > I'm not trying to determine whether a document has term vectors, I'm trying > to determine whether the term vectors that are in th

Re: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Robert Muir
On Fri, Jul 20, 2012 at 8:24 PM, Mike O'Leary wrote: > Hi Robert, > I'm not trying to determine whether a document has term vectors, I'm trying > to determine whether the term vectors that are in the index have offsets and > positions > stored. Right: what i'm trying to tell you is that offsets

RE: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Mike O'Leary
o: java-user@lucene.apache.org Subject: Re: Problem with TermVector offsets and positions not being preserved I think it's wrong for DumpIndex to look at term vector information from the Document that was retrieved from IndexReader.document, that's basically just a way of getting access to your store

Re: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Robert Muir
functions, and to add a loop that writes names of fields and their > TermVector, offset and position settings to the console. > > The other application is called DumpIndex, and got it from a web site > somewhere about 6 months ago. I changed a few lines to get rid of deprecated > fun

RE: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Mike O'Leary
-user@lucene.apache.org Subject: RE: Problem with TermVector offsets and positions not being preserved Hi Robert, I put together the following two small applications to try to separate the problem I am having from my own software and any bugs it contains. One of the applications is c

RE: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Mike O'Leary
from Manning Publications. I changed it a tiny bit to get rid of a special analyzer that is irrelevant to what I am looking at, to get rid of a few warnings about deprecated functions, and to add a loop that writes names of fields and their TermVector, offset and position settings to the console.

Re: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Robert Muir
Hi Mike: I wrote up some tests last night against 3.6 trying to find some way to reproduce what you are seeing, e.g. adding additional segments with the field specified without term vectors, without tv offsets, omitting TF, and merging them and checking everything out. I couldn't find any problems.

Problem with TermVector offsets and positions not being preserved

2012-07-19 Thread Mike O'Leary
I created an index using Lucene 3.6.0 in which I specified that a certain text field in each document should be indexed, stored, analyzed with no norms, with term vectors, offsets and positions. Later I looked at that index in Luke, and it said that term vectors were created for this field, but

MoreLikeThis and TermVector relationship

2011-10-24 Thread Saurabh Gokhale
, MoreLikeThis will generate terms from stored fields Now since I am using lucene and not Solr, I will ask question from Lucene point of view: 1. What is the difference between the below 2 index statements. As per my understanding first one does not store separate TermVector and second does. new

Re: Lucene TermVector

2011-02-22 Thread Simon Willnauer
cur in other document but are not in a particular one are omitted (they might even not be present when the vector is stored). Yet, what lucene offers you is one big vector with all unique terms for each field, its term dictionary. you can simply build your dense vector as needed at runtime. You can
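
A minimal sketch of the idea in this reply, building a dense vector at runtime from a sparse per-document term list and the field's full term dictionary. It uses plain Java collections rather than any Lucene API; the class name, method name, and sample terms are hypothetical, and the vocabulary list stands in for whatever term dictionary you read from the index.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DenseVector {
    /**
     * Expands a sparse term-frequency map into a dense array with one slot
     * per vocabulary term. Terms absent from the document get 0.
     * (Hypothetical helper; not part of Lucene.)
     */
    public static double[] toDense(List<String> vocabulary, Map<String, Integer> docTermFreqs) {
        double[] dense = new double[vocabulary.size()];
        for (int i = 0; i < dense.length; i++) {
            dense[i] = docTermFreqs.getOrDefault(vocabulary.get(i), 0);
        }
        return dense;
    }

    public static void main(String[] args) {
        // Assumed vocabulary for one field, in a fixed order.
        List<String> vocabulary = Arrays.asList("a", "b", "c", "d");
        // Sparse vector of one document: only the terms it contains.
        Map<String, Integer> doc = new HashMap<>();
        doc.put("a", 3);
        doc.put("c", 1);
        System.out.println(Arrays.toString(toDense(vocabulary, doc))); // [3.0, 0.0, 1.0, 0.0]
    }
}
```

Keeping the vocabulary in one fixed order is what makes vectors from different documents comparable slot by slot.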

Lucene TermVector

2011-02-21 Thread Ajay Anandan
Hi I am trying to implement an Expectation Maximization algorithm for document clustering. I am planning to use Lucene Term Vectors for finding similarity between 2 documents. There are 2 kinds of EM algos using naive Bayes: the multivariate model and the multinomial model. In simple terms, the

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2009-04-24 Thread Michael McCandless
I don't think there's an easy way to jump straight from term + freq per doc to a Lucene index. Mike On Tue, Apr 21, 2009 at 7:14 AM, Thomas Pönitz wrote: > Hi, > > I have the same problem as discussed here: > http://mail-archives.apache.org/mod_mbox/lucene-java-user/200511.mbox/%3c200511021310.1

Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2009-04-21 Thread Thomas Pönitz
Hi, I have the same problem as discussed here: http://mail-archives.apache.org/mod_mbox/lucene-java-user/200511.mbox/%3c200511021310.18686...@last.fm%3e I want to specify termvectors directly instead of constructing a dummy string like "a a a b b c" that will be transformed to a[3] b[2] c[1].
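
For reference, the dummy-string workaround mentioned in this message can be sketched as below. This is only an illustration in plain Java, not a Lucene API: the class and method names are hypothetical, and the resulting string would still have to be tokenized (for example on whitespace) when it is indexed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class DummyFieldText {
    /**
     * Repeats each term as many times as its frequency, separated by single
     * spaces, so that a whitespace tokenizer would recover the same counts.
     * (Hypothetical helper; not part of Lucene.)
     */
    public static String expand(Map<String, Integer> termFreqs) {
        StringJoiner out = new StringJoiner(" ");
        for (Map.Entry<String, Integer> entry : termFreqs.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                out.add(entry.getKey());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is deterministic.
        Map<String, Integer> freqs = new LinkedHashMap<>();
        freqs.put("a", 3);
        freqs.put("b", 2);
        freqs.put("c", 1);
        System.out.println(expand(freqs)); // a a a b b c
    }
}
```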

Re: TermVector

2008-01-29 Thread Grant Ingersoll
Have a look at the SpanQuery, specifically the SpanNearQuery. The getSpans() method will return a Spans object, which you can use to access the positions. -Grant On Jan 29, 2008, at 7:17 AM, <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> wrote: And how can I find the offsets of something like "f

RE: TermVector

2008-01-29 Thread spring
> > And how can I find the offsets of something like "foo bar"? > I think > > this > > will get tokenized into 2 terms and thus I have no chance to find > > it, right? > > I wouldn't say no chance... TermVectorMapper would be good > for this, > as you can watch the terms as they are being

Re: TermVector

2008-01-28 Thread Grant Ingersoll
On Jan 28, 2008, at 4:04 PM, <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> wrote: Also, search the archives for Term Vector, as you will find discussion of it there. Ah I see, I need to cast it to TermPositionVector. OK. yep You may also, eventually, be interested in the new TermVectorMapper

RE: TermVector

2008-01-28 Thread spring
> Also, search the archives for Term Vector, as you will find > discussion > of it there. Ah I see, I need to cast it to TermPositionVector. OK. > You may also, eventually, be interested in the new > TermVectorMapper capabilities in 2.3 which should help speed up the > processing of term

Re: TermVector

2008-01-28 Thread Grant Ingersoll
le index, not a single document. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Montag, 28. Januar 2008 15:28 To: java-user@lucene.apache.org Subject: TermVector Hi, how do I get the TermVector from a document which I have gotten from an IndexSearcher via Inde

RE: TermVector

2008-01-28 Thread spring
ndexReader#termPositions(Term t) - but this returns the positions for the whole index, not a single document. > -Original Message- > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] > Sent: Montag, 28. Januar 2008 15:28 > To: java-user@lucene.apache.org > Subject: TermVector >

TermVector

2008-01-28 Thread spring
Hi, how do I get the TermVector from a document which I have gotten from an IndexSearcher via IndexSearcher#search(Query q). Luke can do it, but I do not know how... Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For

Re: TermVector

2007-06-25 Thread Grant Ingersoll
That seems to be the correct usage. Can you provide a self contained unit test showing what you are doing or, at least, more supporting code? -Grant On Jun 24, 2007, at 5:14 PM, Lee Li Bin wrote: Hi, May I know how do I store TermVector? When I set the last parameter to true, isn't

RE: TermVector

2007-06-24 Thread Liu_Andy2
Suggest you use lucene 2.1 or above Andy -Original Message- From: Lee Li Bin [mailto:[EMAIL PROTECTED] Sent: Monday, June 25, 2007 5:14 AM To: java-user@lucene.apache.org Subject: TermVector Hi, May I know how do I store TermVector? When I set the last parameter to true, isn't

TermVector

2007-06-24 Thread Lee Li Bin
Hi, May I know how do I store TermVector? When I set the last parameter to true, isn't it setting storeTermVector to true? But I get null value in TermFreqVector. BTW, I'm using lucene 1.4.3. Not intended to upgrade to 2.0 docAll.add(Field.Text("contentText"

Re: Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
OK, final note. I wish I knew what kind of drugs I was on when I first thought that the sizes were so much smaller. Because they weren't. I got to thinking that "gee, it's kind of weird that if you don't specify anything for TermVector when creating a field, you get all this a

Re: Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
> > document > > and a LOT of OCR data. I'm indexing over 20,000 books and the index > > size is > > 8G. So I decided to play around with not storing some of the > > termvector > > information and I'm shocked at how much smaller the index is. By > >

Re: Omitting TermVector info and index size

2007-02-14 Thread Mark Miller
ormance benefit to storing them, at the cost of disk space, like you said. On Feb 14, 2007, at 9:03 AM, Erick Erickson wrote: > I'm indexing books, with a significant amount of overhead in each > document > and a LOT of OCR data. I'm indexing over 20,000 books and the index >

Re: Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
h > document > and a LOT of OCR data. I'm indexing over 20,000 books and the index > size is > 8G. So I decided to play around with not storing some of the > termvector > information and I'm shocked at how much smaller the index is. By > storing all > my fields

Re: Omitting TermVector info and index size

2007-02-14 Thread Mark Miller
As Erick said, Term positions are kept regardless of whether you store term vectors. The positional information is needed for phrase queries, span queries, etc. You certainly don't lose the ability to use phrase queries if you do not store term vectors. If you check out the Posting class in Doc

Re: Omitting TermVector info and index size

2007-02-14 Thread Grant Ingersoll
I'm indexing over 20,000 books and the index size is 8G. So I decided to play around with not storing some of the termvector information and I'm shocked at how much smaller the index is. By storing all my fields with Field.TermVector.WITH_POSITIONS, my index is reduced by OVER 75%. It

Re: Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
Erik Hatcher sez no. Erick On 2/14/07, karl wettin <[EMAIL PROTECTED]> wrote: 14 feb 2007 kl. 15.03 skrev Erick Erickson: > My reasoning was that I do need position information since I need > to do Span > queries, but character information (WITH_OFFSETS) isn't necessary > here/now. > So I t

Re: Omitting TermVector info and index size

2007-02-14 Thread karl wettin
14 feb 2007 kl. 15.03 skrev Erick Erickson: My reasoning was that I do need position information since I need to do Span queries, but character information (WITH_OFFSETS) isn't necessary here/now. So I thought I'd make a small test to see if this was worth pursuing. If omitting offsets ha

Re: Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
You've made me a happy man . Thanks again. [EMAIL PROTECTED] . On 2/14/07, Erik Hatcher <[EMAIL PROTECTED]> wrote: On Feb 14, 2007, at 9:03 AM, Erick Erickson wrote: > My reasoning was that I do need position information since I need > to do Span > queries, but character information (WITH_OF

Re: Omitting TermVector info and index size

2007-02-14 Thread Erik Hatcher
On Feb 14, 2007, at 9:03 AM, Erick Erickson wrote: My reasoning was that I do need position information since I need to do Span queries, but character information (WITH_OFFSETS) isn't necessary here/now. 1> Am I going off a cliff here? I suppose this is really answered by 2> what is the d

Omitting TermVector info and index size

2007-02-14 Thread Erick Erickson
I'm indexing books, with a significant amount of overhead in each document and a LOT of OCR data. I'm indexing over 20,000 books and the index size is 8G. So I decided to play around with not storing some of the termvector information and I'm shocked at how much smaller the index

FieldCache vs TermVector

2006-11-22 Thread Volodymyr Bychkoviak
: - Can TermVector be used instead of FieldCache to implement sorting (and other activities where FieldCache is used) ? - Would it be much slower? -- regards, Volodymyr Bychkoviak - To unsubscribe, e-mail: [EMAIL PROTECTED] For

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Richard Jones
> Ah, so the fact that "1" actually appears many times in the string you > give Lucene is important. Neat application! > > Sounds like the custom Analyzer (really a custom TokenStream) approach > suggested by others may be the way for you to go. If the information > you get from the MySQL profile

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Michael D. Curtin
Richard Jones wrote: If you're willing to continue subsetting / summarizing the data out into Lucene, how about subsetting it out into a dedicated MySQL instance for this purpose? 100 artists * 1M profiles * 2 ints * 4 bytes/int = roughly 1 GB of data, which would easily fit into RAM. Queries
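
The size estimate quoted in this reply can be checked with a few lines of Java. This is just the arithmetic from the message, with a hypothetical class name; the exact product is 800,000,000 bytes, which the message rounds to roughly 1 GB.

```java
public class ProfileTableSize {
    /** Multiplies the four factors from the estimate using long arithmetic to avoid int overflow. */
    public static long bytes(long artistsPerProfile, long profiles, long intsPerEntry, long bytesPerInt) {
        return artistsPerProfile * profiles * intsPerEntry * bytesPerInt;
    }

    public static void main(String[] args) {
        // 100 artists * 1M profiles * 2 ints * 4 bytes/int
        System.out.println(bytes(100, 1_000_000, 2, 4)); // 800000000
    }
}
```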

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Richard Jones
> If you're willing to continue subsetting / summarizing the data out into > Lucene, how about subsetting it out into a dedicated MySQL instance for > this purpose? 100 artists * 1M profiles * 2 ints * 4 bytes/int = > roughly 1 GB of data, which would easily fit into RAM. Queries should > be pret

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Michael D. Curtin
Richard Jones wrote: The data i'm dealing with is stored over a few mysql dbs on different machines, horizontally partitioned so each user is assigned to a single db. The queries i'm doing can be done in SQL in parallel over all machines then combined, which i've tested - it's unacceptably slo

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Grant Ingersoll
Not sure if this is feasible, but is there someway you could use a "fake" analyzer that you constructed using your hashtable/termvector and then have it output the tokens directly from the hashtable via the TokenStream? Maybe you would have to pass in an empty/dummy string to

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Richard Jones
using whitespace analyzer), although it may remove some of the work, i'd still have to add in the extra steps of building strings instead of handing over a termvector directly. I guess i need to delve into the lucene code see what's going on. Cheers, RJ > last.fm using Lucene, sweet!

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Richard Jones
> I can think of a few ways. If elegance is your goal, then a little > relational database theory might help. Specifically, instead of having > one record per listener, have one record per listener-artist > combination, with three fields: listenerid, artistid, and count. Your > example above wo

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Erik Hatcher
hen feeding this into lucene (i store the termvec of the field in lucene). Is there a way i could pass a termvector directly to lucene to cut out the ugly "turn it into a string and let lucene parse it" step? basically i want to provide the termvector for a field when insertin

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Michael D. Curtin
Richard Jones wrote: Hi, I'm using lucene (which rocks, btw ;) behind the scenes at www.last.fm for various things, and i've run into a situation that seems somewhat inelegant regarding populating fields which i already know the termvector for. I'm creating a document for eac

Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Richard Jones
Hi, I'm using lucene (which rocks, btw ;) behind the scenes at www.last.fm for various things, and i've run into a situation that seems somewhat inelegant regarding populating fields which i already know the termvector for. I'm creating a document for each user (last.fm tracks

Re: Clustering Carrot2 vs TermVector Analysis

2005-06-01 Thread Andrew Boyd
Responses inline prefixed with -Original Message- From: Dawid Weiss <[EMAIL PROTECTED]> Sent: Jun 1, 2005 3:24 AM To: java-user@lucene.apache.org Subject: Re: Clustering Carrot2 vs TermVector Analysis Hi Andrew, Coming up with an answer... sorry for the delay. > By

Re: Clustering Carrot2 vs TermVector Analysis

2005-06-01 Thread Dawid Weiss
wondering if there was a way to do something similar using term vector analysis and the built in TermVector / Similarity api. Yes, most clustering methods are based just on that (term-vector matrix). Carrot also uses this internally, but builds its own data structure from the provided data instead of

Clustering Carrot2 vs TermVector Analysis

2005-05-30 Thread Andrew Boyd
term vector analysis and the built in TermVector / Similarity api. Please bear with me as I'm just learning about term vector analysis mostly from: http://www.miislita.com/term-vector/term-vector-1.html Where it discusses w_i = tf_i * IDF_i I've ordered the book Information Retrieval:
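
A small sketch of the term weight mentioned in this message, the term frequency multiplied by the inverse document frequency. The snippet does not say how IDF is defined, so the code assumes the common form ln(N / df), where N is the number of documents and df the number of documents containing the term. The class and method names are hypothetical and this is not Lucene's Similarity implementation.

```java
public class TfIdfWeight {
    /**
     * Inverse document frequency, ASSUMED here to be ln(numDocs / docFreq).
     * Other definitions (different log base, smoothing) are also in use.
     */
    public static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    /** Term weight: term frequency times inverse document frequency. */
    public static double weight(int termFreq, int numDocs, int docFreq) {
        return termFreq * idf(numDocs, docFreq);
    }

    public static void main(String[] args) {
        // A term occurring 3 times in a document and found in 10 of 1000 documents.
        System.out.println(weight(3, 1000, 10));
        // A term found in every document carries no weight under this definition.
        System.out.println(weight(2, 100, 100)); // 0.0
    }
}
```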