Re: Can the BooleanQuery execution be optimized with same term queries?

2023-09-19 Thread Michael Sokolov
…I am encountering a question about the execution of BooleanQuery: although BooleanQuery#rewrite does some work to remove duplicate FILTER and SHOULD clauses, the same term query can still be executed several times. I copied the test code in TestBooleanQuery…

Can the BooleanQuery execution be optimized with same term queries?

2023-09-18 Thread YouPeng Yang
…SHOULD clauses; however, the same term query can still be executed several times. I copied the test code in TestBooleanQuery to confirm my assumption. Unit test code as follows: BooleanQuery.Builder qBuilder = new BooleanQuery.Builder(); …
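A minimal sketch of the kind of test being described (reconstructed, since the thread's code is truncated; the field and term names are made up):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Two identical SHOULD clauses on the same term. BooleanQuery#rewrite
// deduplicates clauses within a single BooleanQuery, but identical term
// queries nested in different sub-queries can still each get their own Scorer.
static Query duplicateShoulds(IndexSearcher searcher) throws java.io.IOException {
  BooleanQuery.Builder qBuilder = new BooleanQuery.Builder();
  Query term = new TermQuery(new Term("field", "a"));
  qBuilder.add(term, BooleanClause.Occur.SHOULD);
  qBuilder.add(term, BooleanClause.Occur.SHOULD);
  return searcher.rewrite(qBuilder.build()); // duplicates collapse here
}
```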

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-26 Thread Adrien Grand
…at least 5x compared to old code. Are there any thoughts on why term frequency calls on PostingsEnum are that slow? *Thanks and Regards,* *Vimal Jain* On Wed, Jun 21, 2023 at 1:43 PM Adrien Grand wrote: As far as your performance problem i…

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-21 Thread Vimal Jain
I did profiling of the new code and found that the API call below is the most time-consuming: org.apache.lucene.index.PostingsEnum#freq. If I comment out this call and instead use some random integer for testing purposes, then perf is at least 5x compared to the old code. Are there any thoughts on why term…

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-21 Thread Adrien Grand
…Vimal Jain wrote: Thanks Adrien for the quick response. Yes, I am replacing disjuncts across multiple fields with a single custom term query over a merged field. Can you please provide more details on what you mean by dynamic…

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-20 Thread Vimal Jain
…response. Yes, I am replacing disjuncts across multiple fields with a single custom term query over a merged field. Can you please provide more details on what you mean by dynamic pruning in the context of a custom term query? On Tue,…

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-20 Thread Adrien Grand
….co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand> a few years ago if you're curious. On Tue, Jun 20, 2023 at 6:58 PM Vimal Jain wrote: Thanks Adrien for the quick response. Yes, I am replacing disjuncts across multiple fields with a single custom te…

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-20 Thread Vimal Jain
Thanks Adrien for the quick response. Yes, I am replacing disjuncts across multiple fields with a single custom term query over a merged field. Can you please provide more details on what you mean by dynamic pruning in the context of a custom term query? On Tue, 20 Jun, 2023, 9:45 pm Adrien Grand wrote…

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-20 Thread Adrien Grand
Intuitively, replacing a disjunction across multiple fields with a single term query should always be faster. You're saying that you're storing the type of token as part of the term frequency. This doesn't sound like something that would play well with dynamic pruning, so I wonder…

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-20 Thread Vimal Jain
Ok, sorry, I realized that I need to provide more context. We used to create a Lucene query consisting of custom term queries for different fields and, based on the type of field, we assigned a boost to be used in scoring. Now we want to get rid of the different fields and…

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-20 Thread Adrien Grand
…Hi, I want to understand whether fetching the term frequency of a term during scoring is a relatively CPU-bound operation. Context: I am storing a custom term frequency during indexing and later using it for scoring during query execution time (in Scorer'…

Re: Relative cpu cost of fetching term frequency during scoring

2023-06-19 Thread Vimal Jain
Note: I am using Lucene 7.7.3. *Thanks and Regards,* *Vimal Jain* On Tue, Jun 20, 2023 at 12:26 PM Vimal Jain wrote: Hi, I want to understand whether fetching the term frequency of a term during scoring is a relatively CPU-bound operation. Context: I am storing a custom term freq…

Relative cpu cost of fetching term frequency during scoring

2023-06-19 Thread Vimal Jain
Hi, I want to understand whether fetching the term frequency of a term during scoring is a relatively CPU-bound operation. Context: I am storing a custom term frequency during indexing and later using it for scoring during query execution time (in Scorer's score() method). I noticed a performance…
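Since Lucene 7.0, custom term frequencies can be written at index time via TermFrequencyAttribute; a minimal sketch of a single-token stream doing that (the class and field names are hypothetical):

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute;

// Hypothetical single-token stream that records a caller-supplied frequency.
// The field must be indexed with DOCS_AND_FREQS for the value to be kept.
final class CustomFreqTokenStream extends TokenStream {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final TermFrequencyAttribute tfAtt = addAttribute(TermFrequencyAttribute.class);
  private final String term;
  private final int freq;
  private boolean done;

  CustomFreqTokenStream(String term, int freq) {
    this.term = term;
    this.freq = freq;
  }

  @Override
  public boolean incrementToken() {
    if (done) return false;
    done = true;
    clearAttributes();
    termAtt.append(term);
    tfAtt.setTermFrequency(freq); // becomes PostingsEnum#freq at search time
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    done = false;
  }
}
```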

Multi-term synonyms in SynonymQuery

2022-12-28 Thread Anh Dũng Bùi
Hi Lucene users, I recently came across SynonymQuery and found out that it only supports single-term synonyms (since it accepts a list of Term which will be considered as synonyms). We have some multi-term synonyms like "internet device" <-> "wifi router" or "dns…
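For multi-term synonyms, one option (an assumption about the use case, not what the thread concluded) is analysis-time SynonymGraphFilter over a SynonymMap; a sketch:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

static Analyzer synonymAnalyzer() throws IOException {
  SynonymMap.Builder b = new SynonymMap.Builder(true);
  // multi-word entries separate tokens with SynonymMap.WORD_SEPARATOR (\u0000)
  b.add(new CharsRef("internet\u0000device"), new CharsRef("wifi\u0000router"), true);
  final SynonymMap map = b.build();
  return new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer src = new WhitespaceTokenizer();
      TokenStream ts = new SynonymGraphFilter(src, map, true); // true = ignoreCase
      return new TokenStreamComponents(src, ts);
    }
  };
}
```

Graph-aware query parsers (the classic QueryParser is, since 6.5) turn the resulting token graph into the appropriate phrase/term combinations at query time.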

Re: Best practice - preparing search term for Lucene

2022-09-24 Thread Hrvoje Lončar
Oh yes, I also use Spring Cache, which works fine, and I don't have to store products in Lucene, making the index smaller and faster. On Fri, 23 Sept 2022, 19:26 Stephane Passignat wrote: Hi, I wouldn't store the original value. That's "just" an index. But store the value of your db identifi…

Re: Best practice - preparing search term for Lucene

2022-09-24 Thread Hrvoje Lončar
Well, my bad: I used the wrong word. I'm not storing, just giving keywords to the analyzer; that was my mistake in writing. So far I don't index exotic letters, just normalized ones. Additionally, I put in the index something like "Prod_3443", which is a product ID for the situation when a specific product is…

Re: Best practice - preparing search term for Lucene

2022-09-23 Thread Hrvoje Lončar
Good point! For now I'll leave it normalized. Every search term coming from the frontend is stored and its counter updated, which will help me after some time to see trends and decide whether to change the logic. P.S. Here is the funny part: in Croatian "pišanje" means peeing…

Re: Best practice - preparing search term for Lucene

2022-09-23 Thread Stephane Passignat
Hi, I wouldn't store the original value. That's "just" an index. But store the value of your db identifiers, because I think you'll want them at some point. (I made the same kind of feature on top of DataNucleus.) I used to have tech ids in my db, even more since I started to use JDO/JPA some 20…

Re: Best practice - preparing search term for Lucene

2022-09-23 Thread Michael Sokolov
I think it depends on how precise you want to make the search. If you want to enable diacritic-sensitive search in order to avoid confusion when users actually are able to enter the diacritics, you can index both ways (ASCII-folded and not folded) and not normalize the query terms. Or you can just fo…
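A sketch of the index-both-ways idea using ASCIIFoldingFilter's preserveOriginal flag (assuming a recent Lucene; the analyzer chain is illustrative):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

Analyzer indexAnalyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer src = new StandardTokenizer();
    TokenStream ts = new LowerCaseFilter(src);
    // preserveOriginal=true emits both "čaša" and the folded "casa" at the
    // same position, so diacritic and plain queries can both match
    ts = new ASCIIFoldingFilter(ts, true);
    return new TokenStreamComponents(src, ts);
  }
};
```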

Re: Best practice - preparing search term for Lucene

2022-09-23 Thread Hrvoje Lončar
Hi Stephane! Actually, I have exactly that kind of conversion, but I didn't mention it as my mail was long enough without it :) My main concern is whether I should let Lucene index the original keywords or not. Considering what you wrote, I guess your answer would be to store only converted values without exot…

Re: Best practice - preparing search term for Lucene

2022-09-22 Thread Stephane Passignat
Hello, the way I did it took me some time, and I'm almost sure it's applicable to all languages. I normalized the words, replacing letters or groups of letters by an approximate equivalent. In French, e é è ê ai ei sound a bit the same, and for someone who makes mistakes, having to use the right lett…

Best practice - preparing search term for Lucene

2022-09-22 Thread Hrvoje Lončar
Hi! I'm using Hibernate Search / Lucene to index my entities in a Spring Boot application. One thing I'm not sure about is how to handle Croatian-specific letters. The Croatian language has a few additional letters "*č* *Č* *ć* *Ć* *đ* *Đ* *š* *Š* *ž* *Ž*". The letters "*đ* *Đ*" are commonly replaced with "*dj* *DJ*…

Need some guidance for multi-term synonym matching

2021-12-07 Thread Trevor Nicholls
I am using Lucene 8.6.3 in an application which searches a library of technical documentation. I have implemented synonym matching, which works for single-word replacements but does not match when one of the synonyms has two or more words. My attempts to support multi-term synonyms are failing…

RE: Changing Term Vectors for Query

2021-06-07 Thread Marcel D.
Hi, at first I think I missed pointing out my problem exactly. What I want to do is run a normal query on my index. After that, I want to change the frequencies of some important terms to another number, and I know neither the new frequency nor the term whose frequency I update at index creation. As…

RE: Changing Term Vectors for Query

2021-06-07 Thread Uwe Schindler
…applicable. If you want to have "per document" scoring factors (not per term), you can also use additional DocValues fields with per-document factors, and you can use a function query (e.g., using the expressions module) to modify the score. Uwe - Uwe Schindler, Achterdiek 19,…
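A sketch of the DocValues-based variant of this suggestion, using FunctionScoreQuery rather than the expressions module (the field name "boost" is assumed):

```java
import org.apache.lucene.queries.function.FunctionScoreQuery;
import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.Query;

// At index time, store a per-document factor:
//   doc.add(new DoubleDocValuesField("boost", 2.0));

// At query time, multiply each hit's score by that per-document factor.
static Query withPerDocBoost(Query base) {
  return FunctionScoreQuery.boostByValue(base, DoubleValuesSource.fromDoubleField("boost"));
}
```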

Re: Changing Term Vectors for Query

2021-06-07 Thread Marcel D.
…@protonmail.com.invalid wrote: Hello, for some queries I need to calculate the score mostly like the normal score, but for some documents certain terms are assigned a frequency given by me, and the score should be calculated with these new term frequencies.…

Re: Changing Term Vectors for Query

2021-06-07 Thread Adrien Grand
…be calculated with these new term frequencies. After some research, it seems I have to implement a custom Query, custom Weight, and custom Scorer for this. I wanted to ask if I'm overlooking a simpler solution or if this is the way to go. Thanks, Marcel -- Adrien…

Changing Term Vectors for Query

2021-06-06 Thread Hannes Lohr
Hello, for some queries I need to calculate the score mostly like the normal score, but for some documents certain terms are assigned a frequency given by me, and the score should be calculated with these new term frequencies. After some research, it seems I have to implement a custom Query…

Re: Optimizing term-occurrence counting (code included)

2020-09-21 Thread Michael McCandless
…to each of the candidate docs. Simple example of the query: query terms ("aaa", "bbb"); indexed docs with terms: docId 0 has terms ("aaa", "bbb")…

Re: Optimizing term-occurrence counting (code included)

2020-09-20 Thread Alex K
…candidate docs. Simple example of the query: query terms ("aaa", "bbb"); indexed docs with terms: docId 0 has terms ("aaa", "bbb"), docId 1 has terms ("aaa",…

Re: Optimizing term-occurrence counting (code included)

2020-07-24 Thread Alex K
…candidates = 1; simple scoring function score(docId) = docId + 10. The query first builds a count array [2, 1], because docId 0 contains two matching terms and docId 1 contains one matching term. Then it picks docId 0 as the candidate subset. Then it…

Re: Optimizing term-occurrence counting (code included)

2020-07-23 Thread Ali Akhtar
…simple scoring function score(docId) = docId + 10. The query first builds a count array [2, 1], because docId 0 contains two matching terms and docId 1 contains one matching term. Then it picks docId 0 as the candidate subset. Then it applies the scoring function, returning a score of…

Optimizing term-occurrence counting (code included)

2020-07-23 Thread Alex K
…[2, 1], because docId 0 contains two matching terms and docId 1 contains one matching term. Then it picks docId 0 as the candidate subset. Then it applies the scoring function, returning a score of 10 for docId 0. The main bottleneck right now is doing the initial counting, i.e. the part that ret…
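A sketch of the counting step being described, per leaf reader (Lucene 8.x-style APIs; the method and field names are assumed):

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

// For each query term, walk its postings and bump a per-document counter.
static int[] countMatchingTerms(LeafReader reader, String field, BytesRef[] queryTerms)
    throws IOException {
  int[] counts = new int[reader.maxDoc()];
  Terms terms = reader.terms(field);
  if (terms == null) return counts;
  TermsEnum te = terms.iterator();
  PostingsEnum pe = null;
  for (BytesRef term : queryTerms) {
    if (!te.seekExact(term)) continue;
    pe = te.postings(pe, PostingsEnum.NONE); // reuse the enum; no freqs needed
    for (int doc = pe.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = pe.nextDoc()) {
      counts[doc]++;
    }
  }
  return counts;
}
```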

Re: Adding fields with same field type complains that they have different term vector settings

2020-06-30 Thread Michael McCandless
…e.g. string and numeric values from the original document, but not schema-level information like whether offsets/positions are indexed into postings and term vectors for each field, or not. That would be safe if you are trying to avoid the cost of retrieving the full values for all fields from your ba…

Re: Adding fields with same field type complains that they have different term vector settings

2020-06-29 Thread Albert MacSweeny
Regards, Albert. From: "Michael McCandless" To: "java-user", "albert macsweeny" Sent: Monday, 29 June, 2020 15:23:43 Subject: Re: Adding fields with same field type complains that they have different term vector settings. Hi A…

Re: Adding fields with same field type complains that they have different term vector settings

2020-06-29 Thread Michael McCandless
…field results in the following exception: java.lang.IllegalArgumentException: all instances of a given field name must have the same term vectors settings (storeTermVectorPositions changed for field="f1") at org.apache.lucene.index.TermVectorsC…

Adding fields with same field type complains that they have different term vector settings

2020-06-29 Thread Albert MacSweeny
…for the same field results in the following exception: java.lang.IllegalArgumentException: all instances of a given field name must have the same term vectors settings (storeTermVectorPositions changed for field="f1") at org.apache.lucene.index.TermVectorsConsume…

Re: Optimizing a boolean query for 100s of term clauses

2020-06-25 Thread Alex K
Hi Tommaso, thanks for the input and links! I'll add your paper to my literature review. So far I've seen very promising results from modifying the TermInSetQuery. It was pretty simple to keep a map of `doc id -> matched term count` and then only evaluate the exact similarity on the

Re: Optimizing a boolean query for 100s of term clauses

2020-06-25 Thread Toke Eskildsen
On Wed, 2020-06-24 at 13:46 -0400, Alex K wrote: My implementation isn't specific to any particular dataset or access pattern (i.e. infinite vs. subset). Without a clearly defined use case, I would say that the sequential scan approach is not the right one: as these things go, someone will…

Re: Optimizing a boolean query for 100s of term clauses

2020-06-25 Thread Tommaso Teofili
hi Alex, I had worked on a similar problem directly on Lucene (within Anserini toolkit) using LSH fingerprints of tokenized feature vector values. You can find code at [1] and some information on the Anserini documentation page [2] and in a short preprint [3]. As a side note my current thinking is

Re: Optimizing a boolean query for 100s of term clauses

2020-06-24 Thread Alex K
Hi Toke. Indeed a nice coincidence. It's an interesting and fun problem space! My implementation isn't specific to any particular dataset or access pattern (i.e. infinite vs. subset). So far the plugin supports exact L1, L2, Jaccard, Hamming, and Angular similarities with LSH variants for all but

Re: Optimizing a boolean query for 100s of term clauses

2020-06-24 Thread Alex K
…tricks for my use case. An example use case for the plugin is reverse image search. A user can store vectors representing images and run a nearest-neighbors query to retr…

Re: Optimizing a boolean query for 100s of term clauses

2020-06-24 Thread Toke Eskildsen
On Tue, 2020-06-23 at 09:50 -0400, Alex K wrote: I'm working on an Elasticsearch plugin (using Lucene internally) that allows users to index numerical vectors and run exact and approximate k-nearest-neighbors similarity queries. Quite a coincidence. I'm looking into the same thing :-) …

Re: Optimizing a boolean query for 100s of term clauses

2020-06-24 Thread Michael Sokolov
…elastiknn.klibisz.com/ The main method for indexing the vectors is based on Locality Sensitive Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>. The general pattern is:…

Re: Optimizing a boolean query for 100s of term clauses

2020-06-23 Thread Alex K
…The main method for indexing the vectors is based on Locality Sensitive Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>. The general pattern is: 1. When indexing a vector, apply a hash function to it, prod…

Re: Optimizing a boolean query for 100s of term clauses

2020-06-23 Thread Alex K
…producing a set of discrete hashes. Usually there are anywhere from 100 to 1000 hashes. Similar vectors are more likely to share hashes (i.e., similar vectors produce hash collisions). 2. Convert each hash to a byte array and store the byte arr…

Re: Optimizing a boolean query for 100s of term clauses

2020-06-23 Thread Michael Sokolov
…producing a set of discrete hashes. Usually there are anywhere from 100 to 1000 hashes. Similar vectors are more likely to share hashes (i.e., similar vectors produce hash collisions). 2. Convert each hash to a byte array and store the byte array as a Lucene…

Optimizing a boolean query for 100s of term clauses

2020-06-23 Thread Alex K
…discrete hashes. Usually there are anywhere from 100 to 1000 hashes. Similar vectors are more likely to share hashes (i.e., similar vectors produce hash collisions). 2. Convert each hash to a byte array and store the byte array as a Lucene Term at a specific field. 3. Store the comp…
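A sketch of step 2 (the field name and surrounding plumbing are assumed):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.util.BytesRef;

// Each hash becomes an exact-match term in a dedicated field; a query
// vector's hashes then become term queries against the same field.
static Document toDocument(byte[][] hashes) {
  Document doc = new Document();
  for (byte[] hash : hashes) {
    doc.add(new Field("lsh_hashes", new BytesRef(hash), StringField.TYPE_NOT_STORED));
  }
  return doc;
}
```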

Re: Lucene Query parser term Length

2020-01-20 Thread 陈志祥
…the contents of this communication to others. -- From: Akanksha Date: 2020-01-20 20:55:23 To: Subject: Lucene Query parser term Length. Hello everyone, I am working with Lucene 4.7.1. When parsing a query using the Lucene query parser, if the query l…

Lucene Query parser term Length

2020-01-20 Thread Akanksha
Hello everyone, I am working with Lucene 4.7.1. When parsing a query using the Lucene query parser, if the query length is greater than 255 bytes, it returns the query with a space appended after every 255 bytes, which is causing further issues in my project. Can you please let me know why the term (parsed query…
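No answer appears in the archive; one plausible culprit (an assumption, not confirmed by the thread) is the analyzer's maximum token length, which defaults to 255 in StandardAnalyzer and splits longer tokens:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;

// Assumption: the 255-byte splits come from StandardAnalyzer's default
// maxTokenLength (255), not from the query parser itself.
StandardAnalyzer analyzer = new StandardAnalyzer();
analyzer.setMaxTokenLength(4096); // permit longer single tokens
QueryParser parser = new QueryParser("field", analyzer);
```

On 4.7.1 the constructors additionally take a Version argument.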

Re: Multi-IDF for a single term possible?

2019-12-04 Thread Ravikumar Govindarajan
Thanks Ameer! I was thinking about a few ideas, something like tapping into a Codec extension to store multi-IDF values in two files, namely an IDF meta-file and an IDF data-file. The IDF meta-file holds a list of {UserId, Terms-Data-File-Offset} pairs for each Term, encoded via ForUtil. The IDF data-…

Re: Multi-IDF for a single term possible?

2019-12-03 Thread Ameer Albahem
…dceccarel...@bloomberg.net wrote: Hi Ravi, can you give more details on how you store an entity into Lucene? What is a doc type? What fields do you have? Cheers. From: java-user@lucene.apache.org At: 12/03/19 1…

Re: Multi-IDF for a single term possible?

2019-12-03 Thread Ravikumar Govindarajan
…give more details on how you store an entity into Lucene? What is a doc type? What fields do you have? Cheers. From: java-user@lucene.apache.org At: 12/03/19 12:50:40 To: java-user@lucene.apache.org Subject: Multi-IDF for a single term possible?…

Re: Multi-IDF for a single term possible?

2019-12-03 Thread Diego Ceccarelli (BLOOMBERG/ LONDON)
Hi Ravi, can you give more details on how you store an entity into Lucene? What is a doc type? What fields do you have? Cheers. From: java-user@lucene.apache.org At: 12/03/19 12:50:40 To: java-user@lucene.apache.org Subject: Multi-IDF for a single term possible? Hello, we are using TF-IDF…

Re: Multi-IDF for a single term possible?

2019-12-03 Thread Robert Muir
…it is enough to give each its own field. On Tue, Dec 3, 2019 at 7:57 AM Adrien Grand wrote: Is there any reason why you are not storing each DOC_TYPE in its own index? On Tue, Dec 3, 2019 at 1:50 PM Ravikumar Govindarajan wrote: Hello, we are using TF-IDF for scoring (yet to migrate to BM25). Different entities (DOC_TYPES) are crunched and stored together in a single index.…

Re: Multi-IDF for a single term possible?

2019-12-03 Thread Adrien Grand
Is there any reason why you are not storing each DOC_TYPE in its own index? On Tue, Dec 3, 2019 at 1:50 PM Ravikumar Govindarajan wrote: Hello, we are using TF-IDF for scoring (yet to migrate to BM25). Different entities (DOC_TYPES) are crunched and stored together in a single index.…

Multi-IDF for a single term possible?

2019-12-03 Thread Ravikumar Govindarajan
Hello, we are using TF-IDF for scoring (yet to migrate to BM25). Different entities (DOC_TYPES) are crunched and stored together in a single index. When it comes to IDF, I find that there is a single value computed across documents and stored as part of TermStats, whereas our documents are not homoge…

Re: fields contains equals term docs search

2019-04-22 Thread Michael Sokolov
Can you create a scoring scenario that counts the number of fields in which a term occurs and ranks by that (descending), with some kind of post-filtering? On Fri, Apr 19, 2019 at 11:24 AM Valentin Popov wrote: Hi, I am trying to find a way to search all docs that have equal terms on…

Re: fields contains equals term docs search

2019-04-21 Thread Valentin Popov
…static approaches? I would index an auxiliary field which has binary values (0/1 or "T"/"F") representing "has equal terms in different fields", so that you can filter out the docs (maybe by a constant-score query). Tomoko. Apr 20, 2019…

Re: fields contains equals term docs search

2019-04-19 Thread Tomoko Uchida
Hi, I'm not sure there are better ways to meet your requirement by querying, but how about considering static approaches? I would index an auxiliary field which has binary values (0/1 or "T"/"F") representing "has equal terms in different fields", so that you c…
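A sketch of that static approach (field names follow the example docs in the original question below; the flag values are illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

// Precompute the "foo equals bar" flag at index time, then filter with a
// simple TermQuery on the auxiliary field.
static Document build(String foo, String bar) {
  Document doc = new Document();
  doc.add(new TextField("foo", foo, Field.Store.YES));
  doc.add(new TextField("bar", bar, Field.Store.YES));
  doc.add(new StringField("foo_eq_bar", foo.equals(bar) ? "T" : "F", Field.Store.NO));
  return doc;
}
// query: new TermQuery(new Term("foo_eq_bar", "T"))
```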

fields contains equals term docs search

2019-04-19 Thread Valentin Popov
Hi, I am trying to find a way to search all docs that have equal terms in different fields, like doc1 {"foo":"master", "bar":"master"}, doc2 {"foo":"test", "bar":"master"}; the result should be doc1 only. Right now, I'm ge…

Re: Question: find term by position and document id

2018-04-05 Thread Adrien Grand
…to the following problem: I’m trying to implement suggestions for PhraseQuery. Let’s say we have a PhraseQuery in the form "com.example". I would like to find all the terms that are right after the `example` term. From the implementation of the PhraseQuery I was…

Question: find term by position and document id

2018-04-05 Thread Adam Hornacek
Hello, I’m having difficulty finding a solution to the following problem: I’m trying to implement suggestions for PhraseQuery. Let’s say we have a PhraseQuery in the form "com.example". I would like to find all the terms that are right after the `example` term. From the implementat…

Re: To get the term-freq

2017-11-20 Thread Michael McCandless
You could use the PostingsEnum API, advance to your document, then call freq()? I believe there is also a function query based on the term doc freq. Mike McCandless http://blog.mikemccandless.com On Fri, Nov 17, 2017 at 11:37 AM, Ahmet Arslan wrote: Hi, I am also interested…
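A sketch of that suggestion (Lucene 7.x-style MultiFields; on 8.x+, MultiTerms.getTerms is the equivalent):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Returns how often `word` occurs in the document `docId`, or 0.
static int termFreq(IndexReader reader, String field, String word, int docId)
    throws IOException {
  Terms terms = MultiFields.getTerms(reader, field);
  if (terms == null) return 0;
  TermsEnum te = terms.iterator();
  if (!te.seekExact(new BytesRef(word))) return 0; // term absent from the index
  PostingsEnum pe = te.postings(null, PostingsEnum.FREQS);
  int doc = pe.advance(docId);
  return (doc == docId) ? pe.freq() : 0; // term absent from this doc
}
```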

Re: To get the term-freq

2017-11-17 Thread Ahmet Arslan
Hi, I am also interested in the answer to this question. I wonder whether the term freq function query would work here. Ahmet. On Friday, November 17, 2017, 10:32:23 AM GMT+3, Dwaipayan Roy wrote: Hi, I want to get the term frequency of a given term t in a given document with Lucene…

To get the term-freq

2017-11-16 Thread Dwaipayan Roy
Hi, I want to get the term frequency of a given term t in a given document with Lucene docid, say d. Formally, I need a function, say f(), that takes two arguments: 1. lucene-docid d, 2. term t, and returns the number of times t occurs in d. I know of one solution, that is, traversing the whole…

PostingsEnum for documents that does not contain a term

2017-08-07 Thread Ahmet Arslan
Hi, I am traversing the posting list of a given term/word using the following code, accessing/processing term frequency and document length: Term term = new Term(field, word); PostingsEnum postingsEnum = MultiFields.getTermDocsEnum(reader, field, term.bytes()); if (postingsEnum == null) return…

Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-07-02 Thread David Smiley
…of 50m. I am looking to reduce search/sort time to 10ms. I have 4g of RAM for the Java process, which is more than sufficient. Any suggestions greatly appreciated. Thanks, sc. -- View this message in context: http://lucene.472066.…

Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-06-29 Thread sc
…radius of 50m. I am looking to reduce search/sort time to 10ms. I have 4g of RAM for the Java process, which is more than sufficient. Any suggestions greatly appreciated. Thanks, sc. -- View this message in context: http://lucene.472066.n3.nabble.com/Term-Dictionary-taking-up-lots-of-memory…

Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-06-14 Thread David Smiley
…can support about 5k qps @ p95 9ms, which is a great improvement over the RPT strategy we had been using. Once again, thanks for your help. Best, Tom Hirschfeld. On Thu, May 18, 2017 at 4:22 AM, Uwe Schindler wrote: Hi, are you sure that the ter…

Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-06-13 Thread Tom Hirschfeld
Once again, thanks for your help. Best, Tom Hirschfeld. On Thu, May 18, 2017 at 4:22 AM, Uwe Schindler wrote: Hi, are you sure that the term index is the problem? Even with huge indexes you never need 65 gigs of heap! That's impossible. Are you sure that your problem is not…

Re: Penalize fact the searched term is within a world

2017-06-09 Thread Jacek Grzebyta
…is only a match if the tokens are exactly the same bytes. Substring matches are never done, just simple comparison of bytes. To have fuzzier matches, you have to do text analysis right. This includes splitting of tokens (Tokenizer), but also term "normali…

RE: Penalize fact the searched term is within a world

2017-06-09 Thread Uwe Schindler
term "normalization" (TokenFilters). One example is lowercasing (to allow case insensitive matching), but also stemming might be done, or conversion to phonetic codes (to allow phonetic matches). The output of the tokens does not necessarily need to be "human readable" an

Re: Penalize fact the searched term is within a world

2017-06-09 Thread Jacek Grzebyta
…I need to use Lucene to create a mapping file based on text searching, and I found the following problem. Take a term 'abcd' which is mapped to node 'abcd-2' whereas node 'abcd' exists. I found that the issue is Lucene searching the ter…

Re: Penalize fact the searched term is within a world

2017-06-08 Thread Ahmet Arslan
…on that. I need to use Lucene to create a mapping file based on text searching, and I found the following problem. Take a term 'abcd' which is mapped to node 'abcd-2' whereas node 'abcd' exists. I found that the issue is Lucene searching the term and finding it…

Penalize fact the searched term is within a world

2017-06-08 Thread Jacek Grzebyta
…the following problem. Take a term 'abcd' which is mapped to node 'abcd-2' whereas node 'abcd' exists. I found that the issue is Lucene searching the term and finding it in both nodes 'abcd' and 'abcd-2', giving the same score. My question is: how to modify…

Re: Term Dictionary taking up lots of heap memory, looking for solutions, lucene 5.3.1

2017-06-06 Thread David Smiley
…into issues with memory on our leaf servers, as the term dictionary for the entire index is being loaded into heap space. If we allocate > 65g heap space, our queries return relatively quickly (10s-100s of ms), but if we drop below ~65g heap space on the leaf nodes, query time dr…

Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-05-18 Thread Uwe Schindler
Hi, are you sure that the term index is the problem? Even with huge indexes you never need 65 gigs of heap! That's impossible. Are you sure that your problem is not something else? - Too large a heap? Heaps greater than 31 gigs are bad by default. Lucene needs only a little heap, although you…

Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-05-18 Thread Michael McCandless
That sounds like a fun amount of terms! Note that Lucene does not load all terms into memory; only the "prefix trie", stored as an FST (http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html), mapping term prefixes to on-disk blocks of terms. FSTs are very co…

Re: Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-05-17 Thread Adrien Grand
Hirschfeld wrote: Hey! I am working on a Lucene-based service for reverse geocoding. We have a large index with lots of unique terms (550 million), and it appears that we're running into issues with memory on our leaf servers as the term dictionary for the entire ind…

Term Dictionary taking up lots of heap memory, looking for solutions, lucene 5.3.1

2017-05-17 Thread Tom Hirschfeld
Hey! I am working on a Lucene-based service for reverse geocoding. We have a large index with lots of unique terms (550 million), and it appears that we're running into issues with memory on our leaf servers as the term dictionary for the entire index is being loaded into heap space. If we all…

Term Dictionary taking up lots of memory, looking for solutions, lucene 5.3.1

2017-05-17 Thread Tom Hirschfeld
Hey! I am working on a Lucene-based service for reverse geocoding. We have a large index with lots of unique terms (550 million), and it appears that we're running into issues with memory on our leaf servers as the term dictionary for the entire index is being loaded into heap space. If we all…

RE: Term no longer matches if PositionLengthAttr is set to two

2017-05-04 Thread Markus Jelsma
…Sent: Monday 1st May 2017 12:33 To: java-user@lucene.apache.org; solr-user Subject: RE: Term no longer matches if PositionLengthAttr is set to two. Hello again, apologies for cross-posting and for having to get back to this unsolved problem. Initially I thought this…

RE: Term no longer matches if PositionLengthAttr is set to two

2017-05-01 Thread Markus Jelsma
…From: Markus Jelsma Sent: Tuesday 25th April 2017 13:40 To: java-user@lucene.apache.org Subject: Term no longer matches if PositionLengthAttr is set to two. Hello, we have a decompounder and recently implemented the PositionLengthAttribute in it, and…

Term no longer matches if PositionLengthAttr is set to two

2017-04-25 Thread Markus Jelsma
…query time seems to be a problem. Any thoughts on this issue? Is it a bug? Do I misunderstand PositionLengthAttribute? Why does it affect term/document matching at query time but not at index time? Many thanks, Markus

Re: Total of term frequencies

2017-04-18 Thread Michael McCandless
Ahh I see. Term vectors are actually an inverted index for a single document, and they also have the same postings API as the whole index (including TermsEnum.totalTermFreq), but that method likely always returns -1 for term vectors because it's not implemented? Maybe Lucene's def

Re: Total of term frequencies

2017-04-18 Thread Michael McCandless
I think you want to use the TermsEnum.totalTermFreq method? Mike McCandless http://blog.mikemccandless.com On Sun, Apr 16, 2017 at 11:36 AM, Manjula Wijewickrema wrote: Hi, is there any way to get the total count of terms in the Term Frequency Vector (tvf)? I need to c…

Total of term frequencies

2017-04-16 Thread Manjula Wijewickrema
Hi, is there any way to get the total count of terms in the Term Frequency Vector (tvf)? I need to calculate the normalized term frequency of each term in my tvf. I know how to obtain the length of the tvf, but it doesn't work since I need to count duplicate occurrences as well. H…
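A sketch of one way to get that total, summing the per-term freqs of a single document's term vector (assumes term vectors were stored for the field):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;

// Total token count of one document's field, duplicates included,
// suitable as the denominator for a normalized term frequency.
static long totalTokens(IndexReader reader, int docId, String field) throws IOException {
  Terms tv = reader.getTermVector(docId, field); // requires storeTermVectors=true
  if (tv == null) return 0;
  TermsEnum te = tv.iterator();
  long total = 0;
  PostingsEnum pe = null;
  while (te.next() != null) {
    pe = te.postings(pe, PostingsEnum.FREQS);
    pe.nextDoc(); // a term vector is a single-document index
    total += pe.freq();
  }
  return total;
}
```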

Re: Only term frequencies

2017-04-06 Thread Erick Erickson
See "termfreq" in the function query section of the reference guide. Best, Erick On Thu, Apr 6, 2017 at 1:02 AM, Manjula Wijewickrema wrote: > Hi, > > I have a document collection with hundreds of documents. I need to do know > the term frequency for a given query term in

Only term frequencies

2017-04-06 Thread Manjula Wijewickrema
Hi, I have a document collection with hundreds of documents. I need to know the term frequency for a given query term in each document. I know that 'hit.score' will give me the Lucene score for each document (and it includes term frequency as well). But I need to call only the term freq…

RE: calculate term co-occurrence matrix

2017-03-20 Thread Allison, Timothy B.
I have code as part of LUCENE-5318 that counts terms that cooccur within a window of where your query terms appear. This makes a really useful query term recommender, and the math is dirt simple. INPUT Doc1: quick brown fox jumps over the lazy dog Doc2: quick green fox leaps over the lazy dog

Re: codec: accessing term dictionary

2017-03-10 Thread Jürgen Jakobitsch
…bitsch-punkt | xmlns:tg="http://www.turnguard.com/turnguard#" | blockchain: https://onename.com/turnguard 2017-03-10 11:49 GMT+01:00 Dawid Weiss: Or you could encode those term/ngram frequencies in one FST and then reuse it. This would be memory-saving and fairly fast (~compar…

Re: codec: accessing term dictionary

2017-03-10 Thread Jürgen Jakobitsch
…er to make it faster to get ngram counts. Mike McCandless http://blog.mikemccandless.com On Thu, Mar 9, 2017 at 3:22 PM, Jürgen Jakobitsch <juergen.jakobit...@semantic-web.com> wrote: hi, I'd like to ask users for their experien…

Re: codec: accessing term dictionary

2017-03-10 Thread Dawid Weiss
Or you could encode those term/ngram frequencies in one FST and then reuse it. This would be memory-saving and fairly fast (~comparable to a hash table). Dawid. On Fri, Mar 10, 2017 at 11:41 AM, Michael McCandless wrote: Yes, this is a reasonable way to use Lucene (to see term statistics acr…
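A sketch of a term-to-frequency map encoded as an FST (the classic Builder API; in 8.6+ it was renamed FSTCompiler):

```java
import java.io.IOException;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

static FST<Long> buildFreqFst() throws IOException {
  PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
  Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
  IntsRefBuilder scratch = new IntsRefBuilder();
  // inputs must be added in sorted order
  builder.add(Util.toIntsRef(new BytesRef("fox"), scratch), 42L);
  builder.add(Util.toIntsRef(new BytesRef("quick"), scratch), 7L);
  return builder.finish();
}
// lookup: Long freq = Util.get(fst, new BytesRef("fox")); // -> 42
```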

Re: codec: accessing term dictionary

2017-03-10 Thread Michael McCandless
…the fastest way to access the term dictionary. What I want to do is implement some algorithms to find phrases (e.g. mutual rank ratio [1]) and other statistics on term distribution (generally: corpus-related stuff). The idea would be to do statistics on n…

codec: accessing term dictionary

2017-03-09 Thread Jürgen Jakobitsch
Hi, I'd like to ask users for their experiences with the fastest way to access the term dictionary. What I want to do is implement some algorithms to find phrases (e.g. mutual rank ratio [1]) and other statistics on term distribution (generally: corpus-related stuff). The idea would be…

RE: Lucene 6: Recommended way to store numeric values, given the need to form term vocabulary?

2017-03-02 Thread Maros Urbanec
…anec wrote: Lucene beginner here, please excuse me if I'm asking anything obvious. In Lucene 6, LongField and IntField were renamed to LegacyLongField and LegacyIntField, deprecated with a JavaDoc suggestion to use the LongPoint and IntPoint classes ins…

Re: Lucene 6: Recommended way to store numeric values, given the need to form term vocabulary?

2017-03-01 Thread Adrien Grand
…with a JavaDoc suggestion to use the LongPoint and IntPoint classes instead. However, it seems impossible to build a term vocabulary (= enumerate all distinct values) of these XPoint fields. As a third option, one can add a field of class NumericDocValuesField. I tried har…

Lucene 6: Recommended way to store numeric values, given the need to form term vocabulary?

2017-03-01 Thread Maros Urbanec
Lucene beginner here, please excuse me if I’m asking anything obvious. In Lucene 6, LongField and IntField were renamed to LegacyLongField and LegacyIntField, deprecated with a JavaDoc suggestion to use LongPoint and IntPoint classes instead. However, it seems impossible to build a term
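One common arrangement (a sketch of an assumed setup, not the thread's conclusion) is to index the same value several ways, adding a keyword term purely so the terms dictionary can enumerate the vocabulary, since point fields have no TermsEnum:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;

static Document numericDoc(long value) {
  Document doc = new Document();
  doc.add(new LongPoint("price", value));             // range queries
  doc.add(new NumericDocValuesField("price", value)); // sorting/faceting
  // distinct-value enumeration via the terms dictionary:
  doc.add(new StringField("price_term", Long.toString(value), Field.Store.NO));
  return doc;
}
```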

Re: unable to delete document via the IndexWriter.deleteDocuments(term) method

2017-02-17 Thread Armnotstrong
ring ID = "_id" and ... KEY = >> > "5836962b0293a47b09d345f1". Minimises the risk of typos. >> > >> > And use RAMDirectory. Means your program doesn't leave junk on my disk >> if >> > I run it, and also means it starts with an empty
