Re: Lucene same search result for words with and without spaces

2018-06-26 Thread Ahmet Arslan
Hi Egorlex, Shingle filter won't turn "similarissues" into "similar issues". But it can do the reverse. It is like a sliding window. Think about what the indexed tokens would be if you set the token separator to "". Ahmet On Wednesday, June 20, 2018, 12:42:22 PM GMT+3, egorlex wrote: Tha
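
A minimal sketch of what the reply describes (not shown in the thread itself): a CustomAnalyzer whose shingle filter joins neighbouring tokens with an empty separator, so text containing "similar issues" also gets the indexed token "similarissues". Parameter names follow ShingleFilterFactory.

  // Sketch: emit space-less two-word shingles next to the original tokens.
  Analyzer analyzer = CustomAnalyzer.builder()
      .withTokenizer("standard")
      .addTokenFilter("lowercase")
      .addTokenFilter("shingle",
          "minShingleSize", "2",
          "maxShingleSize", "2",
          "tokenSeparator", "",      // join adjacent words without a space
          "outputUnigrams", "true")  // keep the single-word tokens as well
      .build();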

Re: Lucene same search result for words with and without spaces

2018-06-19 Thread Ahmet Arslan
Hi Egorlex, ShingleFilter could be used to achieve your goal. Ahmet On Tuesday, June 19, 2018, 8:06:46 PM GMT+3, egorlex wrote: Hi, I need help with Lucene. How can I realize the same search result for words with and without spaces. For example request "similar issues" and "similari

Re: Case Insensitive Search for StringField

2018-05-25 Thread Ahmet Arslan
Hi, a string_ci type could be constructed from: keyword tokenizer + lowercase filter + maybe a trim filter. Ahmet On Friday, May 25, 2018, 1:50:19 PM GMT+3, Chellasamy G wrote: Hi Team, Kindly help me out with this problem. Thanks, Satyan On Wed, 23 May 2018 15:01:3
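
That recipe, sketched with Lucene's CustomAnalyzer (assuming a recent Lucene version; the thread itself gives no code):

  // Case-insensitive "string" analysis: the whole value stays one token, lowercased and trimmed.
  Analyzer stringCi = CustomAnalyzer.builder()
      .withTokenizer("keyword")      // KeywordTokenizerFactory: no splitting
      .addTokenFilter("lowercase")
      .addTokenFilter("trim")
      .build();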

Re: Custom Similarity

2018-02-08 Thread Ahmet Arslan
Hi Roy, In order to activate payloads during scoring, you need to do two separate things at the same time: * use a payload aware query type: org.apache.lucene.queries.payloads.* * use payload aware similarity Here is an old post that might inspire you :  https://lucidworks.com/2009/08/05/get

Re: To get the term-freq

2017-11-17 Thread Ahmet Arslan
Hi, I am also interested in the answer to this question. I wonder whether the term freq. function query would work here. Ahmet On Friday, November 17, 2017, 10:32:23 AM GMT+3, Dwaipayan Roy wrote: Hi, I want to get the term frequency of a given term t in a given document with lucene

Re: get begin/end of matched terms

2017-10-21 Thread Ahmet Arslan
Hi Nicolas, With SpanQuery family, it is possible to retrieve spans (index/position information) Also, you may find luwak relevant.  https://github.com/flaxsearch/luwak Ahmet On Sunday, October 22, 2017, 1:16:01 AM GMT+3, Nicolas Paris wrote: Hi I am looking for a way to get

Re: Accent insensitive search for greek characters

2017-09-27 Thread Ahmet Arslan
accent characters and it supports only Latin-like accent characters. Am I missing anything? Chitra On Wed, Sep 27, 2017 at 5:47 PM, Ahmet Arslan wrote: Hi, Yes, ICUFoldingFilter or ASCIIFoldingFilter could be used. ahmet On Wednesday, September 27, 2017, 1:54:43 PM GMT+3, Chitra

Re: Accent insensitive search for greek characters

2017-09-27 Thread Ahmet Arslan
Hi, Yes, ICUFoldingFilter or ASCIIFoldingFilter could be used. ahmet On Wednesday, September 27, 2017, 1:54:43 PM GMT+3, Chitra wrote: Hi, In Lucene, I want to search Greek characters (accent-insensitively) by removing or replacing accent marks with similar charact
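
A sketch of such a folding chain (not from the thread): ICUFoldingFilter, from the lucene-analyzers-icu module, folds Greek accents, whereas ASCIIFoldingFilter only folds Latin-range characters, which is why the follow-up above found it insufficient.

  // Accent-insensitive analysis using ICU folding (handles Greek tonos/dialytika etc.).
  Analyzer greekFolding = CustomAnalyzer.builder()
      .withTokenizer("standard")
      .addTokenFilter("icufolding")   // requires the lucene-analyzers-icu jar
      .build();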

Re: Re: What is the fastest way to loop over all documents in an index?

2017-09-05 Thread Ahmet Arslan
at 7:57 AM, Ahmet Arslan wrote: > Hi Jean, > > I am also interested in answers to this question. I need this feature too. > Currently I am using a hack. > I create an artificial field (with an artificial token) attached to every > document. > > I traverse all documents using t

Re: What is the fastest way to loop over all documents in an index?

2017-09-04 Thread Ahmet Arslan
Hi Jean, I am also interested in answers to this question. I need this feature too. Currently I am using a hack. I create an artificial field (with an artificial token) attached to every document. I traverse all documents using the code snippet given in my previous related question. (no one answ

Re: Occur.FILTER clarification

2017-08-11 Thread Ahmet Arslan
:58:25 PM GMT+3, Adrien Grand wrote: FILTER does the opposite of MUST_NOT. Regarding scoring, putting the query in a FILTER or MUST_NOT clause is good enough since such clauses do not need scores. You do not need to add an additional ConstantScoreQuery wrapper. Le mar. 8 août 2017 à 23:06, Ahmet

Occur.FILTER clarification

2017-08-08 Thread Ahmet Arslan
Hi all, I am trying to access document length statistics of the documents that do not contain a given term. I have written the following piece of code: BooleanQuery.Builder builder = new BooleanQuery.Builder(); builder.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST).add(new TermQuery(te
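
Read together with Adrien's reply above (FILTER is the non-scoring counterpart of MUST_NOT), the query being built is presumably along these lines; a sketch with a made-up field and term:

  // Match every document EXCEPT those containing the given term.
  // MUST_NOT (like FILTER) clauses are not scored, so no ConstantScoreQuery wrapper is needed.
  Term term = new Term("body", "lucene");   // hypothetical field/term
  BooleanQuery.Builder builder = new BooleanQuery.Builder();
  builder.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
  builder.add(new TermQuery(term), BooleanClause.Occur.MUST_NOT);
  Query query = builder.build();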

Re: How to fetch documents for which field is not defined

2017-08-07 Thread Ahmet Arslan
How about Solr's exists function query? How does it work? Function queries are now part of Lucene (org.apache.lucene.queries.function), right? Ahmet On Sunday, July 16, 2017, 11:19:40 AM GMT+3, Trejkaz wrote: On Sat, Jul 15, 2017 at 8:12 PM, Uwe Schindler wrote: > That is the "Solr" answer.

PostingsEnum for documents that do not contain a term

2017-08-07 Thread Ahmet Arslan
Hi, I am traversing the posting list of a given term/word using the following code. I am accessing/processing term frequency and document length. Term term = new Term(field, word); PostingsEnum postingsEnum = MultiFields.getTermDocsEnum(reader, field, term.bytes()); if (postingsEnum == null) return

Re: How to fetch documents for which field is not defined

2017-07-15 Thread Ahmet Arslan
Hi, As an alternative, function queries can also be used. The exists function may be more intuitive: q={!func}not(exists(field3)) On Saturday, July 15, 2017, 1:01:04 PM GMT+3, Rajnish kamboj wrote: Ok, I will check. On Sat, 15 Jul 2017 at 3:26 PM, Ahmet Arslan wrote: > Hi, > > Yes, h

Re: How to fetch documents for which field is not defined

2017-07-15 Thread Ahmet Arslan
Hi, Yes, here it is:  q=+*:* -field3:[* TO *] Ahmet On Saturday, July 15, 2017, 8:16:00 AM GMT+3, Rajnish kamboj wrote: Hi Does Lucene provide any API to fetch documents for which a field is not defined. Example Document1 : field1=value1, field2=value2,field3=value3 Document2 : field1=value4,

Re: Penalize the fact that the searched term is within a word

2017-06-08 Thread Ahmet Arslan
Hi, You can completely ban within-a-word search by simply using WhitespaceTokenizer, for example. By the way, it is all about how you tokenize/analyze your text. Once you decide, you can create two versions of a single field using different analysers. This allows you to assign different weights
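
One way to realize the "two versions of a single field" idea, sketched with hypothetical field names and a PerFieldAnalyzerWrapper; the whole-word field gets a much higher weight at query time:

  // Index the same text twice: "body_word" keeps words intact (whitespace-based analysis),
  // "body_part" uses a hypothetical analyzer that allows within-word matches.
  Map<String, Analyzer> perField = new HashMap<>();
  perField.put("body_word", new WhitespaceAnalyzer());
  perField.put("body_part", partialWordAnalyzer);   // hypothetical, e.g. n-gram based
  Analyzer indexAnalyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);

  Document doc = new Document();
  doc.add(new TextField("body_word", text, Field.Store.NO));
  doc.add(new TextField("body_part", text, Field.Store.NO));

  // Query time: whole-word hits outweigh within-word hits.
  Query q = new BooleanQuery.Builder()
      .add(new BoostQuery(new TermQuery(new Term("body_word", "pill")), 10f), BooleanClause.Occur.SHOULD)
      .add(new TermQuery(new Term("body_part", "pill")), BooleanClause.Occur.SHOULD)
      .build();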

Re: A question over TokenFilters

2017-04-21 Thread Ahmet Arslan
Hi, LimitTokenCountFilter is used to index only the first n tokens. Maybe it can inspire you. Ahmet On Friday, April 21, 2017, 6:20:11 PM GMT+3, Edoardo Causarano wrote: Hi all. I’m relatively new to Lucene, so I have a couple of questions about writing custom filters. The way I understand it, one woul

Re: How to get the index last modification date ?

2017-04-08 Thread Ahmet Arslan
Hi Jean, How about the LukeRequestHandler? Much of the information displayed on the admin screen comes from it. https://wiki.apache.org/solr/LukeRequestHandler Ahmet On Sunday, April 9, 2017, 2:21:38 AM GMT+3, Jean-Claude Dauphin wrote: Hello, I need to check the index last modification date to c

Re: How to customize the delimiters used by the WordDelimiterFilter in Lucene?

2017-03-18 Thread Ahmet Arslan
Hi, Maybe look at the factory class to see how the types argument is handled? Ahmet On Friday, March 17, 2017 11:05 PM, "pha...@mailbox.org" wrote: Hi, I am trying to index words like 'e-mail' as 'email', 'e mail' and 'e-mail' with Lucene 4.4.0. Lucene's WordDelimiterFilter should be ide

Re: search any field name having a specific value

2017-03-17 Thread Ahmet Arslan
Hi, You can retrieve the list of field names using LukeRequestHandler. Ahmet On Friday, March 17, 2017 9:53 PM, Cristian Lorenzetto wrote: It permits to search in a predefined lists of fields that you have to know in advance. In my case i dont know what is the fieldname. maybe WildcardQuer

Re: any analyzer will keep punctuation?

2017-03-08 Thread Ahmet Arslan
te. I don't understand how "a customised word delimiter filter factory" works in tokenizer. 2017-03-06 22:26 GMT+08:00 Ahmet Arslan : > Hi Zhao, > > WhiteSpace tokeniser followed by a customised word delimiter filter > factory would be solution. > Please see types att

Re: any analyzer will keep punctuation?

2017-03-06 Thread Ahmet Arslan
punctuation, but it only breaks words by spaces. I didn’t explain my requirement clearly. I want an analyzer like the standard analyzer but one that keeps some configured punctuation. 2017-03-06 18:03 GMT+08:00 Ahmet Arslan : > Hi, > > Whitespace analyser/tokenizer for example. > > Ahmet &g

Re: any analyzer will keep punctuation?

2017-03-06 Thread Ahmet Arslan
Hi, Whitespace analyser/tokenizer for example. Ahmet On Monday, March 6, 2017 10:21 AM, Yonghui Zhao wrote: Lucene standard anlyzer will remove almost all punctuation. In some cases, we want to keep some punctuation, for example in music search, some singer name and album name could be a punc

Re: term frequency in solr

2017-01-05 Thread Ahmet Arslan
ot;name"); System.out.print("size="+ terms.size()); } } /// I got this error: numFound: 32 Exception in thread "main" java.lang.NullPointerException at testPkg.App3.main(App3.java:30) On 5 January 2017 at 18:25, Ahm

Re: term frequency in solr

2017-01-05 Thread Ahmet Arslan
Hi, I think you are missing the main query parameter, q=*:*? By the way, you may get more responses on the solr-user mailing list. Ahmet On Wednesday, January 4, 2017 4:59 PM, huda barakat wrote: Please help me with this: I have this code which returns term frequency from techproducts example:

Re: Email id tokenizer (actual email id & multiple terms)

2016-12-20 Thread Ahmet Arslan
Hi, You can index the whole address in a separate field. Otherwise, how would you handle the positions of the split tokens? By the way, the speed of phrase search may be just fine, so consider trying that first. Ahmet On Tuesday, December 20, 2016 5:15 PM, suriya prakash wrote: Hi, I am using standard anal

Re: ComplexPhraseQueryParser with wildcards

2016-12-20 Thread Ahmet Arslan
Hi Otmar, A single term inside quotes is meaningless. A phrase query should have at least two terms in it, shouldn't it? What is your intention with such a "john*" query? Ahmet On Tuesday, December 20, 2016 4:56 PM, Otmar Caduff wrote: Hi, I have an index with a single document with a fi

Re: Best way to search by pages

2016-11-26 Thread Ahmet Arslan
How about keeping two indices: page index and document index. Issue the query to the document index and list n documents. For each document, list k pages fetched from page index. Ahmet On Saturday, November 26, 2016 12:16 PM, Joe MA wrote: Greetings, I am trying to use Lucene to search lar

Re: Multi-field IDF

2016-11-18 Thread Ahmet Arslan
discrimination power based on all the body text, not just the titles. Because otherwise terms that are really not that relevant end up being very high! On 17/11/16 at 18:25, Ahmet Arslan wrote: > Hi Nicholas, > > IDF, among others, is a measure of term specificity. If 'or'

Re: Multi-field IDF

2016-11-17 Thread Ahmet Arslan
Hi Nicholas, IDF, among others, is a measure of term specificity. If 'or' is not so usual in titles, then it has some discrimination power in that domain. I think it's OK for 'or' to get a high IDF value in this case. Ahmet On Thursday, November 17, 2016 9:09 PM, Nicolás Lichtmaier wrote: IDF

Re: How exclude empty fields?

2016-11-11 Thread Ahmet Arslan
Hi, Match-all-docs query minus Promotion.endDate:[* TO *], i.e.: +*:* -Promotion.endDate:[* TO *] Ahmet On Friday, November 11, 2016 5:59 PM, voidmind wrote: Hi, I have indexed content about Promotions with effectiveDate and endDate fields for when the promotions start and end. I want to query for

Re: Isn't fieldLength in BM25 supposed to be an integer?

2016-11-09 Thread Ahmet Arslan
Hi Mossaab, Probably due to the encodeNormValue/decodeNormValue transformation of the document length. Please see the aforementioned methods in BM25Similarity.java Ahmet On Wednesday, November 9, 2016 10:25 PM, Mossaab Bagdouri wrote: Hi, On Lucene 6.2.1, I have the following explain ou

Re: How to add ASCIIFoldingFilter in ClassicAnalyzer

2016-10-11 Thread Ahmet Arslan
Hi, I forgot to include : .addTokenFilter("asciifolding") Ahmet On Tuesday, October 11, 2016 5:37 PM, Ahmet Arslan wrote: Hi Kumaran, Writing a custom analyzer is easier than it seems. Please see how I added kstem to classic analyzer: return CustomAnalyzer.builder() .withTokenize
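
Putting the two messages together, the intended builder presumably reads (filter names are the SPI names used by CustomAnalyzer):

  return CustomAnalyzer.builder()
      .withTokenizer("classic")
      .addTokenFilter("classic")
      .addTokenFilter("lowercase")
      .addTokenFilter("asciifolding")   // the filter left out of the original snippet
      .addTokenFilter("kstem")
      .build();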

Re: How to add ASCIIFoldingFilter in ClassicAnalyzer

2016-10-11 Thread Ahmet Arslan
Hi Kumaran, Writing a custom analyzer is easier than it seems. Please see how I added kstem to classic analyzer: return CustomAnalyzer.builder() .withTokenizer("classic") .addTokenFilter("classic") .addTokenFilter("lowercase") .addTokenFilter("kstem") .build(); Ahmet On Tuesday, October 11,

Re: How can I list all the terms from a document?

2016-09-16 Thread Ahmet Arslan
Hi, I thought the link/url below has the example code, no? http://makble.com/what-is-term-vector-in-lucene If not, in the source tree, under the tests folder, there should be some test cases for termVectors, which can be used as example code. I guess internal lucene document id, which easy

Re: How can I list all the terms from a document?

2016-09-13 Thread Ahmet Arslan
Hi, First you need to enable term vectors at index time. Then you can access terms and their statistics in a document. http://makble.com/what-is-term-vector-in-lucene Ahmet On Tuesday, September 13, 2016 11:53 AM, szzoli wrote: Hi, how can I use TermVectors ? I have read the API, but it is
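
A sketch of both halves, for a hypothetical "content" field (the thread is on Lucene 6.2, where the calls below exist):

  // Index time: store term vectors for the field.
  FieldType type = new FieldType(TextField.TYPE_NOT_STORED);
  type.setStoreTermVectors(true);
  doc.add(new Field("content", text, type));

  // Search time: list the terms of one document together with their in-document frequency.
  Terms terms = reader.getTermVector(docId, "content");
  TermsEnum termsEnum = terms.iterator();
  BytesRef term;
  while ((term = termsEnum.next()) != null) {
    System.out.println(term.utf8ToString() + " -> " + termsEnum.totalTermFreq());
  }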

Re: Is it possible to search for a paragraph in Lucene?

2016-09-12 Thread Ahmet Arslan
Hi, If you have some tool/mechanism to detect paragraph boundaries, yes it is possible to search for a paragraph. But Lucene itself cannot detect sentence/paragraph boundaries for you. There are other libraries for this. Ahmet On Monday, September 12, 2016 1:06 PM, szzoli wrote: Hi All, Is it possibl

Re: How can I list all the terms from a document?

2016-09-07 Thread Ahmet Arslan
Hi, TermVectors perhaps? Ahmet On Tuesday, September 6, 2016 4:21 PM, szzoli wrote: Hi All, How can I list all the terms from a document? I also need the counts of each term per document. I use Lucene 6.2. I found some solutions for older versions. These didn't work with 6.2. Thank you in ad

Re: Doc length nomalization in Lucene LM

2016-07-22 Thread Ahmet Arslan
in byte format for less memory consumption. But while debugging, I found that the doc length, that is passed in score() is 2621.44 where the actual doc length is 2355. I am confused. Please help. On Fri, Jul 22, 2016 at 1:46 PM, Ahmet Arslan wrote: > Hi Roy, > > It is about storing

Re: Doc length nomalization in Lucene LM

2016-07-22 Thread Ahmet Arslan
Hi Roy, It is about storing the document length into a byte (to use less memory). Please edit the source code to avoid this encode/decode thing: /** * Encodes the document length in a lossless way */ @Override public long computeNorm(FieldInvertState state) { return state.getLength() - state.getN

Re: Help Relevance Feedback (Rocchio) with lucene

2016-06-28 Thread Ahmet Arslan
Hi Andres, While there can be other ways, in general term vectors are used to extract "important terms" from top-k documents returned by the initial query. Please see getTopTerms() method in http://www.cortecostituzionale.it/documenti/news/advancedluceneeu_69.pdf Ahmet On Tuesday, June 28, 20

Re: Favoring Terms Occurring in Close Proximity

2016-06-27 Thread Ahmet Arslan
e a custom query parser if they want reasonable results? - On Jun 24, 2016, at 12:25 PM, Ahmet Arslan wrote: > Hi Daniel, > You can add optional clauses to your query for boosting purposes. > for example, > temperate OR climates OR "temperate climates"~5^100 >

Re: Favoring Terms Occurring in Close Proximity

2016-06-24 Thread Ahmet Arslan
Hi Daniel, You can add optional clauses to your query for boosting purposes. For example: temperate OR climates OR "temperate climates"~5^100 ahmet On Friday, June 24, 2016 5:07 PM, Daniel Bigham wrote: Something significant that I've noticed about using the default Lucene query parser is
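
The same idea built programmatically (Lucene 6-era API; field name and slop are illustrative), in case a query parser is not being used:

  // temperate OR climates OR "temperate climates"~5^100
  BooleanQuery.Builder b = new BooleanQuery.Builder();
  b.add(new TermQuery(new Term("body", "temperate")), BooleanClause.Occur.SHOULD);
  b.add(new TermQuery(new Term("body", "climates")), BooleanClause.Occur.SHOULD);
  PhraseQuery near = new PhraseQuery(5, "body", "temperate", "climates");   // slop = 5
  b.add(new BoostQuery(near, 100f), BooleanClause.Occur.SHOULD);
  Query q = b.build();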

Re: Preprocess input text before tokenizing

2016-06-24 Thread Ahmet Arslan
other version of that analyzer. Whenever any of those analyzer is changed, I will need to manually apply the changes. Isn't there a better way to do this? El 23/06/2016 a las 20:28, Ahmet Arslan escribió: > Hi, > > Zero or more CharFilter(s) is the way to manipulate text before the t

Re: Preprocess input text before tokenizing

2016-06-23 Thread Ahmet Arslan
Hi, Zero or more CharFilter(s) are the way to manipulate text before the tokenizer. I think initReader is the method where you want to plug in char filters. https://github.com/apache/lucene-solr/blob/master/lucene/analysis/morfologik/src/java/org/apache/lucene/analysis/uk/UkrainianMorfologikAnalyzer.java
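
The hook looks roughly like this inside a custom Analyzer subclass; the mapping rule is made up for illustration:

  Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer source = new StandardTokenizer();
      return new TokenStreamComponents(source, new LowerCaseFilter(source));
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
      // Preprocess the raw text before it ever reaches the tokenizer.
      NormalizeCharMap.Builder map = new NormalizeCharMap.Builder();
      map.add("&", " and ");   // illustrative replacement rule
      return new MappingCharFilter(map.build(), reader);
    }
  };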

Re: How to prevent WordDelimiterFilter from tokenizing the string with underscore?

2016-06-15 Thread Ahmet Arslan
Hi, You can supply custom types. Please see WordDelimiterFilterFactory and wdfftypes.txt for an example. ahmet On Wednesday, June 15, 2016 10:32 PM, Xiaolong Zheng wrote: Hi, How can I prevent WordDelimiterFilter from tokenizing a string with an underscore, e.g. word_with_underscore. I am using Wo
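
For the underscore case, the types file only needs to reclassify '_' as a letter so the filter stops splitting on it; a sketch of a wdfftypes.txt entry (passed to the factory via its types attribute):

  # wdfftypes.txt -- treat '_' as an ordinary letter, so "word_with_underscore" stays one token
  _ => ALPHA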

Re: Cache Lucene based index.

2016-05-22 Thread Ahmet Arslan
Hi Singhal, Maybe MemoryIndex or RAMDirectory? Ahmet On Saturday, May 21, 2016 1:42 PM, Prateek Singhal wrote: You can consider that I want to store the lucene index in some sort of temporary memory or a HashMap so that I do not need to index the documents every time as it is a costly opera

Re: Query Grammar

2016-05-16 Thread Ahmet Arslan
Hi Taher, Please see the QueryParser.jj file in the source tree. There you can find all operators, such as && || AND OR !. Ahmet On Sunday, May 15, 2016 1:57 PM, Taher Galal wrote: Hi All, I was just checking the query grammar found in the java docs of the query parser : Query ::= ( Clause )

Re: Simple Similarity Implementation to Count the Number of Hits

2016-05-12 Thread Ahmet Arslan
Hi Luis, That's an interesting question. Can you share your similarity? I suspect you return 1 everywhere except in the Similarity#coord method. Not sure, but for phrase queries one may need to modify ExactPhraseScorer etc. ahmet On Thursday, May 12, 2016 5:41 AM, Luís Filipe Nassif wrote:

Re: Query Expansion for Synonyms

2016-04-28 Thread Ahmet Arslan
Hi Daniel, Since you are restricting inOrder=true and proximity=0 in the top-level query, there is no problem in your particular example. If you weren't restricting, injecting synonyms with plain OR can sometimes cause 'query drift': the injection/addition of one term changes the result list drastically.

Re: Evaluate if a document satisfies a query

2016-04-25 Thread Ahmet Arslan
Hi, MemoryIndex is used for that purpose. Please see : https://github.com/flaxsearch/luwak https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html http://lucene.apache.org/core/6_0_0/memory/index.html?org/apache/lucene/index/memory/MemoryIndex.html Ahmet On Mo
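
Minimal MemoryIndex usage for the "does this single document satisfy this query" case; field, text, analyzer and query are placeholders:

  // Index one document entirely in memory and test a query against it.
  MemoryIndex index = new MemoryIndex();
  index.addField("body", "quick brown fox jumps", new StandardAnalyzer());
  Query query = new TermQuery(new Term("body", "fox"));
  float score = index.search(query);   // 0.0f means the document does not match
  boolean matches = score > 0.0f;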

Re: BlendedTermQuery causing negative IDF?

2016-04-19 Thread Ahmet Arslan
s around BlendedTermQuery. Just to help isolate the issues. Here's Lucene's tests for BlendedTermQuery as a basis https://github.com/apache/lucene-solr/blob/5e5fd662575105de88d8514b426bccdcb4c76948/lucene/core/src/test/org/apache/lucene/search/TestBlendedTermQuery.java On Tue, Ap

Re: BlendedTermQuery causing negative IDF?

2016-04-19 Thread Ahmet Arslan
Hi Again, For those who are interested, I uploaded BM25's Term Frequency graph [0] for some common and content-bearing words. [0] http://2.1m.yt/PgUEcZ.png Ahmet On Tuesday, April 19, 2016 5:16 PM, Ahmet Arslan wrote: Hi Markus, It is a known property of BM25. It produces neg

Re: BlendedTermQuery causing negative IDF?

2016-04-19 Thread Ahmet Arslan
Hi Markus, It is a known property of BM25. It produces negative scores for common terms. Most of the term-weighting models are developed for indices in which stop words are eliminated. Therefore, most of the term-weighting models have problems scoring common terms. By the way, DFI model does a

Re: Custom indexing

2016-04-18 Thread Ahmet Arslan
itting on dot, > hyphen, and underscore, in addition to whitespace and other punctuation. > > Can you post some specific test cases you are concerned with? (You should > always run some test cases.) > > -- Jack Krupansky > > On Tue, Apr 12, 2016 at 10:35 AM, Ahmet Arslan > w

Re: Custom indexing

2016-04-12 Thread Ahmet Arslan
Hi Chamarty, Well, there are a lot of options here. 1) Use LetterTokenizer 2) Use WordDelimiterFilter combined with WhitespaceTokenizer 3) Use MappingCharFilter to replace those characters with spaces . . . Ahmet On Tuesday, April 12, 2016 3:58 PM, PrasannaKumar Chamarty wrote: Hi, What

Re: Regarding the Lucene Proximity Search

2016-04-04 Thread Ahmet Arslan
Hi, If you are writing your queries programmatically (without using a query parser), nested proximity is possible with the SpanQuery family. Actually, there exists a surround query parser for this. Please see o.a.lucene.queryparser.surround.parser.QueryParser. Proximity search uses position informati

Re: Subset Matching

2016-03-25 Thread Ahmet Arslan
Hi Otmar, For this requirement, you need to create an additional field containing the number of words/terms in the field. For example: field: "blue pill", length = 2; query: "if you take the blue pill", length = 6. Please see my previous responses on the same topic: http://search-lucene.com/m/e
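
In other words: index a token count next to the text and constrain it at search time. A sketch with illustrative names, using IntPoint for the numeric field (Lucene 6; earlier versions would use a numeric field type):

  // Index time: record how many tokens the phrase field has.
  Document doc = new Document();
  doc.add(new TextField("phrase", "blue pill", Field.Store.YES));
  doc.add(new IntPoint("phrase_len", 2));   // "blue pill" -> 2 tokens

  // Search time: the query "if you take the blue pill" has 6 tokens, so only
  // documents whose phrase is at most 6 tokens long can possibly be a subset of it.
  Query notLongerThanQuery = IntPoint.newRangeQuery("phrase_len", 1, 6);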

Re: Serializing Queries

2016-03-18 Thread Ahmet Arslan
Hi, I think the XML query parser examples [1] are the safest way to persist Lucene queries. [1] https://github.com/apache/lucene-solr/tree/master/lucene/queryparser/src/test/org/apache/lucene/queryparser/xml Ahmet On Friday, March 18, 2016 4:02 PM, "Bauer, Herbert S. (Scott)" wrote: Has anyone

Re: Problem with porter stemming

2016-03-14 Thread Ahmet Arslan
Hi Dwaipayan, Another way is to use KeywordMarkerFilter. Stemmer implementations respect this attribute. If you want to supply your own mappings, StemmerOverrideFilter could be used as well. ahmet On Monday, March 14, 2016 4:31 PM, Dwaipayan Roy wrote: I am using EnglishAnalyzer wi

Re: Top terms relevance from specific documents ?

2016-01-27 Thread Ahmet Arslan
Hi Yannick, More Like This (MLT) does this already. It extracts "interesting terms" from the top N documents. I don't remember, but this feature may require "term vectors" to be stored. Ahmet On Wednesday, January 27, 2016 10:41 AM, Yannick Martel wrote: On Tue, 15 Dec 2015 17:56:05 +0100, Ya

Re: How to escape URL at indexing time

2015-12-27 Thread Ahmet Arslan
Hi Daniel, The exception you have posted is a parse exception. It occurs during querying, not indexing. There are some special characters that are part of the query parsing syntax; you need to escape them. Ahmet On Sunday, December 27, 2015 10:53 PM, Daniel Valdivia wrote: Hi I'm tryi

Re: Jensen–Shannon divergence

2015-12-13 Thread Ahmet Arslan
Hi Shay, I suggest you extend o.a.l.search.similarities.SimilarityBase. All you need to do is implement a score() method. Behind all the fancy names (language models, etc.), a similarity is a function of seven salient statistics. It is actually six: avgFieldLength can be derived from the other two (numberOfFiel
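
The skeleton of such a similarity (Lucene 5-era signature); the formula below is a placeholder, not a real model, and the listed statistics are the ones SimilarityBase hands to score():

  public class MySimilarity extends SimilarityBase {
    @Override
    protected float score(BasicStats stats, float freq, float docLen) {
      // Available: stats.getNumberOfDocuments(), stats.getNumberOfFieldTokens(),
      // stats.getAvgFieldLength(), stats.getDocFreq(), stats.getTotalTermFreq(), freq, docLen.
      return freq / docLen;   // placeholder scoring formula
    }

    @Override
    public String toString() {
      return "MySimilarity";
    }
  }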

Re: Position and Range Information

2015-12-11 Thread Ahmet Arslan
Hi, Yes, TextField includes positions. Ahmet On Friday, December 11, 2015 5:40 PM, Douglas Kunzma wrote: All - I'm using a TextField and a BufferedReader to add text to a Lucene Document object. Can I still get all of the matches in a Document including the position information and start an

Re: lucene classpath

2015-12-03 Thread Ahmet Arslan
Hi, Maybe the Windows path separator is messing things up. Can you try copying the jars to the current working directory and re-trying: java -classpath lucene-demo-5.3.1.jar;lucene-core-5.3.1.jar Ahmet On Thursday, December 3, 2015 11:57 PM, jerrittpace wrote: I am trying to set the classpath for the lucene jars

Re: dynamic pruning (WAND) supported ??

2015-12-03 Thread Ahmet Arslan
Hi Zong, I don't think Lucene has this. People usually need all candidate documents to be scored. They sometimes sort by price, popularity, etc., sometimes combined with document relevancy scores. However, a time-limited collector could be the closest thing: https://issues.apache.org/jira/br

Re: Access query length inside similarity

2015-11-03 Thread Ahmet Arslan
w can I pass query length(maxOverlap/maxCoord) inside the Similarity.SimScorer#score method? Any help on this is really appreciated. Thanks, Ahmet On Tuesday, October 27, 2015 10:27 AM, Ahmet Arslan wrote: Hi, How can I access length of the query (number of words in the query) ins

Access query length inside similarity

2015-10-27 Thread Ahmet Arslan
Hi, How can I access length of the query (number of words in the query) inside a SimilarityBase implementation? P.S. I am implementing multi-aspect TF [1] for an experimental study. So it does not have to be fast/optimized as production code. [1] http://dl.acm.org/citation.cfm?doid=2484028.2484

Re: Dubious stuff spotted in LowerCaseFilter

2015-10-22 Thread Ahmet Arslan
Hi Uwe, What is the meaning of "the Unicode Policeman" ? Thanks, Ahmet On Thursday, October 22, 2015 2:59 PM, Uwe Schindler wrote: Hi, > >> Setting aside the fact that Character.toLowerCase is already dubious > >> in some locales (e.g. Turkish), > > > > This is not true. Character.toLower

Re: Learning to Rank algorithms in Lucene

2015-08-18 Thread Ahmet Arslan
Hi Ajinkya, I don't think there exists any production-ready LtR-Lucene/Solr setup. LtR simply re-ranks the top N (typically 1000) documents. Fetching the top N documents is what we do today with Lucene. There is an API for re-ranking in Lucene/Solr but no LtR support yet. https://cwiki.apache.org/confluenc

Re: Using lucene queries to search StringFields

2015-06-19 Thread Ahmet Arslan
Hi, Why don't you create your query with API? Term term = new Term("B", "1 2"); Query query = new TermQuery(term); Ahmet On Friday, June 19, 2015 9:31 AM, Gimantha Bandara wrote: Correction.. second time I used the following code to test. Then I got the above IllegalStateException issue. w

Re: Tf and Df in lucene

2015-06-15 Thread Ahmet Arslan
tates" (two terms) or "free speech zones" (three terms). Shay On Mon, Jun 15, 2015 at 4:55 PM Ahmet Arslan wrote: > Hi Hummel, > > regarding df, > > Term term = new Term(field, word); > TermStatistics termStatistics = searcher.termStatistics(term, > Te

Re: Tf and Df in lucene

2015-06-15 Thread Ahmet Arslan
Hi Hummel, regarding df, Term term = new Term(field, word); TermStatistics termStatistics = searcher.termStatistics(term, TermContext.build(reader.getContext(), term)); System.out.println(query + "\t totalTermFreq \t " + termStatistics.totalTermFreq()); System.out.println(query + "\t docFreq \t

Re: IllegalArgumentException: docID must be >= 0 and < maxDoc=48736112 (got docID=2147483647)

2015-05-30 Thread Ahmet Arslan
re if collectors could easily have the same performance without them. To me, such scores seem always undesirable and only bugs, and the current assertions are a good tradeoff. On Fri, May 29, 2015 at 8:18 AM, Ahmet Arslan wrote: > Hello List, > > When a similarity returns NEGATIVE_INFINIT

IllegalArgumentException: docID must be >= 0 and < maxDoc=48736112 (got docID=2147483647)

2015-05-29 Thread Ahmet Arslan
Hello List, When a similarity returns NEGATIVE_INFINITY, hits[i].doc becomes 2147483647. Thus, an exception is thrown in the following code: for (int i = 0; i < hits.length; i++) { int docId = hits[i].doc; Document doc = searcher.doc(docId); } I know it is awkward to return infinity (comes from

access query term in similarity calcuation

2015-05-23 Thread Ahmet Arslan
Hi, I have a number of similarity implementations that extend SimilarityBase. I need to learn which term I am scoring inside the method: abstract float score(BasicStats stats, float freq, float docLen); What is the easiest way to access the query term that I am scoring in the similarity class? Th

intersection of two posting lists

2015-05-08 Thread Ahmet Arslan
Hello All, I am traversing the posting list of a single term with the following code (not sure if there is a better way). Now I need to handle/aggregate multiple terms: traverse the intersection of multiple posting lists and obtain the summed freq() of multiple terms per document. What is the easiest way to obta

Re: Phrase query given a word

2015-04-23 Thread Ahmet Arslan
Hi, Maybe LUCENE-5317 is relevant? Ahmet On Thursday, April 23, 2015 8:33 PM, Shashidhar Rao wrote: Hi, I have a large text and from that I need to calculate the top frequencies of words, say 'Driving' occurs the most. Now, I need to find phrases containing 'Driving' in the given text and th

Re: Changing analyzer in an indexwriter

2015-04-19 Thread Ahmet Arslan
Hi Lisa, I think AnalyzerWrapper could help: https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/AnalyzerWrapper.html Ahmet On Sunday, April 19, 2015 1:37 PM, Lisa Ziri wrote: Hi, I'm upgrading to lucene 5.1.0 from lucene 4. In our index we have documents in different languages which are

Re: Text dependent analyzer

2015-04-17 Thread Ahmet Arslan
ed, Apr 15, 2015 at 3:50 AM Ahmet Arslan wrote: > Hi Hummel, > > You can perform sentence detection outside of the solr, using opennlp for > instance, and then feed them to solr. > > https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect &g

Re: Text dependent analyzer

2015-04-14 Thread Ahmet Arslan
Hi Hummel, You can perform sentence detection outside of Solr, using OpenNLP for instance, and then feed the sentences to Solr. https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect Ahmet On Tuesday, April 14, 2015 8:12 PM, Shay Hummel wrote: Hi I would l

Re: CachingTokenFilter tests fail when using MockTokenizer

2015-03-23 Thread Ahmet Arslan
Hi Spyros, Not 100% sure but I think you should override reset method. @Override public void reset() throws IOException { super.reset(); cachedInput = null; } Ahmet On Monday, March 23, 2015 1:29 PM, Spyros Kapnissis wrote: Hello, We have a couple of custom token filters that use CachingTo

Re: Would Like to contribute to Lucene

2015-03-19 Thread Ahmet Arslan
Hi Gimantha, Not sure about the Lucene internals, but here are some pointers: http://find.searchhub.org/document/a81b4c9af49c3d0f http://find.searchhub.org/?q=contribute#%2Fp%3Alucene%2Fs%3Aemail Ahmet On Thursday, March 19, 2015 3:58 PM, Gimantha Bandara wrote: Any clue on where to start

Re: understanding the norm encode and decode

2015-03-05 Thread Ahmet Arslan
s full float precision, but scoring being >>> fuzzy anyway this would multiply your memory needs for norms by 4 >>> while not really improving the quality of the scores of your >>> documents. This precision loss is the right trade-off for most >>> use-cases. &g

Re: understanding the norm encode and decode

2015-03-04 Thread Ahmet Arslan
Hi Adrien, I read somewhere that norms are stored using docValues. In my understanding, docValues can store lossless float values. So the question is, why do several decode/encode methods still exist in similarity implementations? Intuitively, switching to docValues for norms should prevent prec

Re: getting number of terms in a document/field

2015-02-08 Thread Ahmet Arslan
ll compute length of fields by myself. Thanks, Ahmet On Friday, February 6, 2015 5:31 PM, Michael McCandless wrote: On Fri, Feb 6, 2015 at 8:51 AM, Ahmet Arslan wrote: > Hi Michael, > > Thanks for the explanation. I am working with a TREC dataset, > since it is static, I

Re: getting number of terms in a document/field

2015-02-06 Thread Ahmet Arslan
approximately in the doc's norm value. Maybe you can use that? Alternatively, you can store this statistic yourself, e.g as a doc value. Mike McCandless http://blog.mikemccandless.com On Thu, Feb 5, 2015 at 7:24 PM, Ahmet Arslan wrote: > Hello Lucene Users, > > I am traversing all

getting number of terms in a document/field

2015-02-05 Thread Ahmet Arslan
Hello Lucene Users, I am traversing all documents that contain a given term with the following code: Term term = new Term(field, word); Bits bits = MultiFields.getLiveDocs(reader); DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, bits, field, term.bytes()); while (docsEnum.nextDoc() != Doc

Re: disabling all scoring?

2015-02-05 Thread Ahmet Arslan
Hi Rob, Maybe you can wrap your query in a ConstantScoreQuery? ahmet On Thursday, February 5, 2015 9:17 AM, Rob Audenaerde wrote: Hi all, I'm doing some analytics with a custom Collector on a fairly large number of search results (+-100.000, all the hits that return from a query). I need to retr

Re: Analyzer: Access to document?

2015-02-04 Thread Ahmet Arslan
Hi Ralf, Does following code fragment work for you? /** * Modified from : http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/analysis/package-summary.html */ public List getAnalyzedTokens(String text) throws IOException { final List list = new ArrayList<>(); try (TokenStream ts = analy

Re: AW: LowercaseFilter, preserveOriginal?

2015-01-27 Thread Ahmet Arslan
Hi Clemens, Please see : https://issues.apache.org/jira/browse/LUCENE-5620 Ahmet On Tuesday, January 27, 2015 10:56 AM, Clemens Wyss DEV wrote: > I very much preserveOriginal="true" when applying the >ASCIIFoldingFilter for (german)suggestions Must revise my statement, as I just noticed tha

Re: Looking for docs that have certain fields empty (an/or not set)

2015-01-07 Thread Ahmet Arslan
Hi Clemens, Since you are a lucene user, you might be interested in Uwe's response on a similar topic : http://find.searchhub.org/document/abb73b45a48cb89e Ahmet On Wednesday, January 7, 2015 6:30 PM, Erick Erickson wrote: Should be, but it's a bit confusing because the query syntax is not

Re: IndexSearcher.setSimilarity thread-safety

2015-01-05 Thread Ahmet Arslan
hetaphi.de > -Original Message- > From: Barry Coughlan [mailto:b.coughl...@gmail.com] > Sent: Monday, January 05, 2015 3:40 PM > To: java-user@lucene.apache.org; Ahmet Arslan > Subject: Re: IndexSearcher.setSimilarity thread-safety > > Hi Ahmet, > > The IndexSearcher is "t

Re: IndexSearcher.setSimilarity thread-safety

2015-01-05 Thread Ahmet Arslan
an use a single IndexReader for the IndexSearchers Barry On Mon, Jan 5, 2015 at 1:10 PM, Ahmet Arslan wrote: > > > anyone? > > > > On Thursday, December 25, 2014 4:42 PM, Ahmet Arslan > wrote: > Hi all, > > Javadocs says "IndexSearcher instances are completely th

Re: IndexSearcher.setSimilarity thread-safety

2015-01-05 Thread Ahmet Arslan
anyone? On Thursday, December 25, 2014 4:42 PM, Ahmet Arslan wrote: Hi all, Javadocs says "IndexSearcher instances are completely thread safe, meaning multiple threads can call any of its methods, concurrently" Is this true for setSimilarity() method? What happens when every t

IndexSearcher.setSimilarity thread-safety

2014-12-25 Thread Ahmet Arslan
Hi all, The Javadoc says "IndexSearcher instances are completely thread safe, meaning multiple threads can call any of its methods, concurrently". Is this true for the setSimilarity() method? What happens when every thread uses a different similarity implementation? Thanks, Ahmet -

Re: lucene query with additional clause field not null

2014-12-01 Thread Ahmet Arslan
Hi Sascha, Generally a range query is used for that, e.g. fieldName:[* TO *] Ahmet On Monday, December 1, 2014 9:44 PM, Sascha Janz wrote: Hi, is there a chance to add an additional clause to a query for a field that should not be null? greetings sascha -

Re: Document Term matrix

2014-11-11 Thread Ahmet Arslan
Hi, Mahout and Carrot2 can cluster documents from a Lucene index. ahmet On Tuesday, November 11, 2014 10:37 PM, Elshaimaa Ali wrote: Hi All, I have a Lucene index built with Lucene 4.9 for 584 text documents, I need to extract a Document-term matrix, and Document-Document similarity matri

Re: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-11 Thread Ahmet Arslan
o the LowerCaseFilter. This seems to work. -Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] Sent: 10 Nov 2014 15 19 To: java-user@lucene.apache.org Subject: Re: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2 Hi, Regarding Uwe's warnin
