Dear,
I need to retrieve all terms that are stored in various fields in the
documents so that I can perform calculations of some metrics for each term
t from my base document.
I realized that using TermVector makes the index large, about
80% of the size of my collection of documents.
Hello!
I need to access token position and payload info during the search result page
building. I need to do this for 10 documents max, so retrieving TermVectors is
totally OK for me.
Say, I retrieve it for one document:
Terms tv = _indexDirectoryReader.getTermVector(0, "wordform");
From the
-
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Friday, August 24, 2012 9:52 AM
To: java-user@lucene.apache.org
Subject: Re: Problem with TermVector offsets and positions not being preserved
Calling IR.document does not restore your 'original Document'
completely. This is really an ag
ctorInfo is called, it displays offsets and
> positions for the fields that have term vectors with offsets and positions.
> The second time it is called, it doesn't display anything because none of the
> term vectors satisfy termFreqVector instanceof TermPositionVector. Is it
term vectors
in the affected fields? Is there a way to add a field to the documents in an
index in which this doesn't occur?
Thanks,
Mike
-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Friday, July 20, 2012 5:59 PM
To: java-user@lucene.apache.org
Subje
On Fri, Jul 27, 2012 at 9:10 AM, Andrzej Bialecki wrote:
>
> Catching up with this thread ... Luke 4.0-ALPHA makes a similar mistake. I
> fixed this in svn (to be released in a week or so) so that:
>
> * Luke now actually checks whether a doc has term vectors for a particular
> field and adjusts t
012 5:59 PM
To: java-user@lucene.apache.org
Subject: Re: Problem with TermVector offsets and positions not being preserved
On Fri, Jul 20, 2012 at 8:24 PM, Mike O'Leary wrote:
Hi Robert,
I'm not trying to determine whether a document has term vectors, I'm trying to
determine whet
Subject: Re: Problem with TermVector offsets and positions not being preserved
On Fri, Jul 20, 2012 at 8:24 PM, Mike O'Leary wrote:
> Hi Robert,
> I'm not trying to determine whether a document has term vectors, I'm trying
> to determine whether the term vectors that are in th
On Fri, Jul 20, 2012 at 8:24 PM, Mike O'Leary wrote:
> Hi Robert,
> I'm not trying to determine whether a document has term vectors, I'm trying
> to determine whether the term vectors that are in the index have offsets and
> positions stored.
Right: what i'm trying to tell you is that offsets
o: java-user@lucene.apache.org
Subject: Re: Problem with TermVector offsets and positions not being preserved
I think it's wrong for DumpIndex to look at term vector information from the
Document that was retrieved from IndexReader.document, that's basically just a
way of getting access to your store
functions, and to add a loop that writes names of fields and their
> TermVector, offset and position settings to the console.
>
> The other application is called DumpIndex, and I got it from a web site
> somewhere about 6 months ago. I changed a few lines to get rid of deprecated
> fun
-user@lucene.apache.org
Subject: RE: Problem with TermVector offsets and positions not being preserved
Hi Robert,
I put together the following two small applications to try to separate the
problem I am having from my own software and any bugs it contains. One of the
applications is c
from Manning Publications. I changed
it a tiny bit to get rid of a special analyzer that is irrelevant to what I am
looking at, to get rid of a few warnings about deprecated functions, and to add
a loop that writes names of fields and their TermVector, offset and position
settings to the console.
Hi Mike:
I wrote up some tests last night against 3.6 trying to find some way
to reproduce what you are seeing, e.g. adding additional segments with
the field specified without term vectors, without tv offsets, omitting
TF, and merging them and checking everything out. I couldn't find any
problems.
I created an index using Lucene 3.6.0 in which I specified that a certain text
field in each document should be indexed, stored, analyzed with no norms, with
term vectors, offsets and positions. Later I looked at that index in Luke, and
it said that term vectors were created for this field, but
, MoreLikeThis will generate terms from stored
fields
Now since I am using Lucene and not Solr, I will ask my question from the Lucene
point of view:
1. What is the difference between the 2 index statements below? As per my
understanding, the first one does not store a separate TermVector and the second does.
new
cur in other document but are not in a
particular one are omitted (they might even not be present when the
vector is stored). Yet, what Lucene offers you is one big vector with
all unique terms for each field, its term dictionary. You can simply
build your dense vector as needed at runtime. You can
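A minimal sketch of building such a dense vector at runtime, in plain Java with no Lucene calls. The names DenseVectors, toDense, dictionary and sparse are made up for illustration. It assumes the dictionary is the list of all unique terms of the field in a fixed order, and that the sparse map holds the term frequencies of one document; terms missing from the document simply get 0.

```java
import java.util.List;
import java.util.Map;

final class DenseVectors {
    // dictionary: every unique term of the field, in a fixed order (assumed input).
    // sparse: term -> frequency for a single document (assumed input).
    static int[] toDense(List<String> dictionary, Map<String, Integer> sparse) {
        int[] dense = new int[dictionary.size()];
        for (int i = 0; i < dense.length; i++) {
            // Terms that do not occur in this document contribute 0.
            dense[i] = sparse.getOrDefault(dictionary.get(i), 0);
        }
        return dense;
    }
}
```

Two documents expanded against the same dictionary end up with vectors of equal length, so they can be compared position by position.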
Hi
I am trying to implement an Expectation Maximization algorithm for document
clustering. I am planning to use Lucene Term Vectors for finding similarity
between 2 documents. There are 2 kinds of EM algos using naive Bayes: the
multivariate model and the multinomial model. In simple terms, the
I don't think there's an easy way to jump straight from term + freq
per doc to a Lucene index.
Mike
On Tue, Apr 21, 2009 at 7:14 AM, Thomas Pönitz
wrote:
> Hi,
>
> I have the same problem as discussed here:
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200511.mbox/%3c200511021310.1
Hi,
I have the same problem as discussed here:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200511.mbox/%3c200511021310.18686...@last.fm%3e
I want to specify termvectors directly instead of constructing a dummy
string like "a a a b b c" that will be transformed to a[3] b[2] c[1].
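For reference, the dummy-string workaround described above can be sketched in plain Java as follows. This is only an illustration of the round trip, not a way to hand term vectors to Lucene directly; the class and method names (TermFreqString, expand, count) are hypothetical, and it assumes a whitespace-style analyzer and terms that contain no whitespace.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class TermFreqString {
    // Expand {a=3, b=2, c=1} into "a a a b b c" so an analyzer can re-count it.
    static String expand(Map<String, Integer> termFreqs) {
        StringJoiner out = new StringJoiner(" ");
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                out.add(e.getKey());
            }
        }
        return out.toString();
    }

    // Count whitespace-separated tokens, which recovers a[3] b[2] c[1].
    static Map<String, Integer> count(String text) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }
}
```

The cost of the workaround is visible here: the string grows with the sum of all frequencies, not with the number of distinct terms.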
Have a look at the SpanQuery, specifically the SpanNearQuery. The
getSpans() method will return a Spans object, which you can use to
access the positions.
-Grant
On Jan 29, 2008, at 7:17 AM, <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> wrote:
And how can I find the offsets of something like "f
> > And how can I find the offsets of something like "foo bar"?
> I think
> > this
> > will get tokenized into 2 terms and thus I have no chance to find
> > it, right?
>
> I wouldn't say no chance... TermVectorMapper would be good
> for this,
> as you can watch the terms as they are being
On Jan 28, 2008, at 4:04 PM, <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> wrote:
Also, search the archives for Term Vector, as you will find
discussion
of it there.
Ah I see, I need to cast it to TermPositionVector. OK.
yep
You may also, eventually, be interested in the new
TermVectorMapper
> Also, search the archives for Term Vector, as you will find
> discussion
> of it there.
Ah I see, I need to cast it to TermPositionVector. OK.
> You may also, eventually, be interested in the new
> TermVectorMapper capabilities in 2.3 which should help speed up the
> processing of term
le index, not a single document.
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Montag, 28. Januar 2008 15:28
To: java-user@lucene.apache.org
Subject: TermVector
Hi,
how do I get the TermVector from a document which I have
gotten from an
IndexSearcher via Inde
ndexReader#termPositions(Term t) - but this returns the
positions for the whole index, not a single document.
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Montag, 28. Januar 2008 15:28
> To: java-user@lucene.apache.org
> Subject: TermVector
>
Hi,
how do I get the TermVector from a document which I have gotten from an
IndexSearcher via IndexSearcher#search(Query q).
Luke can do it, but I do not know how...
Thank you.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For
That seems to be the correct usage. Can you provide a self contained
unit test showing what you are doing or, at least, more supporting code?
-Grant
On Jun 24, 2007, at 5:14 PM, Lee Li Bin wrote:
Hi,
May I know how do I store TermVector?
When I set the last parameter to true, isn
Suggest you use lucene 2.1 or above
Andy
-----Original Message-----
From: Lee Li Bin [mailto:[EMAIL PROTECTED]
Sent: Monday, June 25, 2007 5:14 AM
To: java-user@lucene.apache.org
Subject: TermVector
Hi,
May I know how do I store TermVector?
When I set the last parameter to true, isn
Hi,
May I know how do I store TermVector?
When I set the last parameter to true, isn't it setting storeTermVector to
true?
But I get null value in TermFreqVector.
BTW, I'm using lucene 1.4.3
Not intended to upgrade to 2.0
docAll.add(Field.Text("contentText"
OK, final note. I wish I knew what kind of drugs I was on when I first
thought that the sizes were so much smaller. Because they weren't. I got to
thinking that "gee, it's kind of weird that if you don't specify anything
for TermVector when creating a field, you get all this a
> > document
> > and a LOT of OCR data. I'm indexing over 20,000 books and the index
> > size is
> > 8G. So I decided to play around with not storing some of the
> > termvector
> > information and I'm shocked at how much smaller the index is. By
> >
ormance benefit to storing them,
at the cost of disk space, like you said.
On Feb 14, 2007, at 9:03 AM, Erick Erickson wrote:
> I'm indexing books, with a significant amount of overhead in each
> document
> and a LOT of OCR data. I'm indexing over 20,000 books and the index
>
h
> document
> and a LOT of OCR data. I'm indexing over 20,000 books and the index
> size is
> 8G. So I decided to play around with not storing some of the
> termvector
> information and I'm shocked at how much smaller the index is. By
> storing all
> my fields
As Erick said, Term positions are kept regardless of whether you store
term vectors. The positional information is needed for phrase queries,
span queries, etc. You certainly don't lose the ability to use phrase
queries if you do not store term vectors. If you check out the Posting
class in Doc
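To illustrate why positions in the postings are enough for phrase matching, here is a toy positional index in plain Java. It is a sketch only and does not reflect how Lucene stores or reads its postings; PositionalIndex, add and matchesPhrase are hypothetical names, tokenization is assumed to be lowercase plus whitespace splitting, and only two-term phrases are handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PositionalIndex {
    // term -> (docId -> positions of the term within that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    // Index a document: lowercase, split on whitespace, record token positions.
    void add(int docId, String text) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(tokens[pos], t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // True if 'first' is immediately followed by 'second' somewhere in the doc.
    // Both terms are expected in lowercase, matching how add() stores them.
    boolean matchesPhrase(int docId, String first, String second) {
        List<Integer> firstPositions = positions(first, docId);
        List<Integer> secondPositions = positions(second, docId);
        for (int p : firstPositions) {
            if (secondPositions.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    private List<Integer> positions(String term, int docId) {
        return postings.getOrDefault(term, Map.of()).getOrDefault(docId, List.of());
    }
}
```

Nothing here involves a per-document term vector: the phrase check only needs the positions recorded per term and document.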
I'm indexing over 20,000 books and the index
size is
8G. So I decided to play around with not storing some of the
termvector
information and I'm shocked at how much smaller the index is. By
storing all
my fields with Field.TermVector.WITH_POSITIONS, my index is reduced
by OVER
75%. It
Erik Hatcher sez no.
Erick
On 2/14/07, karl wettin <[EMAIL PROTECTED]> wrote:
14 feb 2007 kl. 15.03 skrev Erick Erickson:
> My reasoning was that I do need position information since I need
> to do Span
> queries, but character information (WITH_OFFSETS) isn't necessary
> here/now.
> So I t
14 feb 2007 kl. 15.03 skrev Erick Erickson:
My reasoning was that I do need position information since I need
to do Span
queries, but character information (WITH_OFFSETS) isn't necessary
here/now.
So I thought I'd make a small test to see if this was worth
pursuing. If
omitting offsets ha
You've made me a happy man.
Thanks again.
[EMAIL PROTECTED] .
On 2/14/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
On Feb 14, 2007, at 9:03 AM, Erick Erickson wrote:
> My reasoning was that I do need position information since I need
> to do Span
> queries, but character information (WITH_OF
On Feb 14, 2007, at 9:03 AM, Erick Erickson wrote:
My reasoning was that I do need position information since I need
to do Span
queries, but character information (WITH_OFFSETS) isn't necessary
here/now.
1> Am I going off a cliff here? I suppose this is really answered by
2> what is the d
I'm indexing books, with a significant amount of overhead in each document
and a LOT of OCR data. I'm indexing over 20,000 books and the index size is
8G. So I decided to play around with not storing some of the termvector
information and I'm shocked at how much smaller the index
:
- Can TermVector be used instead of FieldCache to implement sorting (and
other activities where FieldCache is used) ?
- Would it be much slower?
--
regards,
Volodymyr Bychkoviak
> Ah, so the fact that "1" actually appears many times in the string you
> give Lucene is important. Neat application!
>
> Sounds like the custom Analyzer (really a custom TokenStream) approach
> suggested by others may be the way for you to go. If the information
> you get from the MySQL profile
Richard Jones wrote:
If you're willing to continue subsetting / summarizing the data out into
Lucene, how about subsetting it out into a dedicated MySQL instance for
this purpose? 100 artists * 1M profiles * 2 ints * 4 bytes/int =
roughly 1 GB of data, which would easily fit into RAM. Queries
> If you're willing to continue subsetting / summarizing the data out into
> Lucene, how about subsetting it out into a dedicated MySQL instance for
> this purpose? 100 artists * 1M profiles * 2 ints * 4 bytes/int =
> roughly 1 GB of data, which would easily fit into RAM. Queries should
> be pret
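The estimate quoted above can be checked with a few lines of Java. This sketch assumes "100 artists" means 100 artist entries per profile and counts raw integer payload only, so row, index and storage-engine overhead are not included; FootprintEstimate and rawBytes are names made up for the example. The product is 800,000,000 bytes, about 0.8 GB, which is consistent with "roughly 1 GB".

```java
final class FootprintEstimate {
    // Raw payload size in bytes; ignores row, index and storage-engine overhead.
    static long rawBytes(long artistsPerProfile, long profiles,
                         long intsPerEntry, long bytesPerInt) {
        return artistsPerProfile * profiles * intsPerEntry * bytesPerInt;
    }

    public static void main(String[] args) {
        // Figures taken from the message: 100 artists, 1M profiles, 2 ints, 4 bytes/int.
        long bytes = rawBytes(100, 1_000_000, 2, 4);
        System.out.println(bytes + " bytes");
        System.out.println((bytes / 1e9) + " GB (10^9 bytes)");
        System.out.println((bytes / (double) (1L << 30)) + " GiB (2^30 bytes)");
    }
}
```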
Richard Jones wrote:
The data i'm dealing with is stored over a few mysql dbs on different
machines, horizontally partitioned so each user is assigned to a single db.
The queries i'm doing can be done in SQL in parallel over all machines then
combined, which i've tested - it's unacceptably slo
Not sure if this is feasible, but is there some way you could use a
"fake" analyzer that you constructed using your hashtable/termvector and
then have it output the tokens directly from the hashtable via the
TokenStream? Maybe you would have to pass in an empty/dummy string to
using
whitespace analyzer), although it may remove some of the work, i'd still have
to add in the extra steps of building strings instead of handing over a
termvector directly.
I guess i need to delve into the lucene code to see what's going on.
Cheers,
RJ
> last.fm using Lucene, sweet!
> I can think of a few ways. If elegance is your goal, then a little
> relational database theory might help. Specifically, instead of having
> one record per listener, have one record per listener-artist
> combination, with three fields: listenerid, artistid, and count. Your
> example above wo
hen feeding this into
lucene (i
store the termvec of the field in lucene). Is there a way i could
pass a
termvector directly to lucene to cut out the ugly "turn it into a
string and
let lucene parse it" step? basically i want to provide the
termvector for a
field when insertin
Richard Jones wrote:
Hi,
I'm using lucene (which rocks, btw ;) behind the scenes at www.last.fm for
various things, and i've run into a situation that seems somewhat inelegant
regarding populating fields which i already know the termvector for.
I'm creating a document for eac
Hi,
I'm using lucene (which rocks, btw ;) behind the scenes at www.last.fm for
various things, and i've run into a situation that seems somewhat inelegant
regarding populating fields which i already know the termvector for.
I'm creating a document for each user (last.fm tracks
Responses inline prefixed with
-----Original Message-----
From: Dawid Weiss <[EMAIL PROTECTED]>
Sent: Jun 1, 2005 3:24 AM
To: java-user@lucene.apache.org
Subject: Re: Clustering Carrot2 vs TermVector Analysis
Hi Andrew,
Coming up with an answer... sorry for the delay.
> By
wondering if there
was a way to do something similar using term vector analysis and the
built in TermVector / Similarity api.
Yes, most clustering methods are based just on that (term-vector
matrix). Carrot also uses this internally, but builds its own data
structure from the provided data instead of
term
vector analysis and the built in TermVector / Similarity api.
Please bear with me as I'm just learning about term vector analysis mostly from:
http://www.miislita.com/term-vector/term-vector-1.html
Where it discusses wi = tfi * IDFi
I've ordered the book Information Retrieval:
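The weight wi = tfi * IDFi mentioned above can be sketched in plain Java. The message does not say how IDFi is defined, so this example assumes the common form log10(D / dfi), where D is the number of documents and dfi is the number of documents containing term i; that definition, and the names TfIdf, idf and weight, are assumptions made for the illustration. This is not Lucene's Similarity scoring.

```java
final class TfIdf {
    // Assumed IDF form: log10(totalDocs / docFreq). Requires docFreq > 0.
    static double idf(int totalDocs, int docFreq) {
        return Math.log10((double) totalDocs / docFreq);
    }

    // w_i = tf_i * IDF_i for one term in one document.
    static double weight(int termFreq, int totalDocs, int docFreq) {
        return termFreq * idf(totalDocs, docFreq);
    }
}
```

With this definition a term that occurs in every document gets IDF 0 and therefore weight 0, however often it appears in a single document.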