Checkstyle has a OneTopLevelClass rule that would enforce this.
On October 17, 2017 3:45:01 AM EDT, Uwe Schindler wrote:
>Hi,
>
>this has nothing to do with the Java version. I generally ignore this
>Eclipse failure, as I only develop in Eclipse but run from the command
>line. The reason for this beha
Oh thanks Alan, that's a good suggestion, but I already wrote max and sum
DoubleValuesSource implementations since it was easy enough. If you think
that's a good approach I could post a patch.
On October 13, 2017 3:57:30 AM EDT, Alan Woodward wrote:
>Hi,
>
>Yes, moving stuff over to DoubleValuesSource is onl
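For context, here is a minimal sketch of what a "max" DoubleValuesSource over
two inputs might look like. This is not the patch mentioned above, and the
exact set of abstract methods you must override varies between Lucene
versions:

import java.io.IOException;
import java.util.Objects;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.DoubleValues;
import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.IndexSearcher;

/** Hypothetical per-document max of two DoubleValuesSources. */
public class MaxDoubleValuesSource extends DoubleValuesSource {
  private final DoubleValuesSource a, b;

  public MaxDoubleValuesSource(DoubleValuesSource a, DoubleValuesSource b) {
    this.a = a;
    this.b = b;
  }

  @Override
  public DoubleValues getValues(LeafReaderContext ctx, DoubleValues scores) throws IOException {
    DoubleValues va = a.getValues(ctx, scores);
    DoubleValues vb = b.getValues(ctx, scores);
    return new DoubleValues() {
      private boolean hasA, hasB;

      @Override
      public boolean advanceExact(int doc) throws IOException {
        hasA = va.advanceExact(doc);
        hasB = vb.advanceExact(doc);
        return hasA || hasB;
      }

      @Override
      public double doubleValue() throws IOException {
        if (hasA && hasB) return Math.max(va.doubleValue(), vb.doubleValue());
        return hasA ? va.doubleValue() : vb.doubleValue();
      }
    };
  }

  @Override
  public boolean needsScores() {
    return a.needsScores() || b.needsScores();
  }

  @Override
  public DoubleValuesSource rewrite(IndexSearcher searcher) throws IOException {
    return new MaxDoubleValuesSource(a.rewrite(searcher), b.rewrite(searcher));
  }

  @Override
  public boolean isCacheable(LeafReaderContext ctx) {
    return a.isCacheable(ctx) && b.isCacheable(ctx);
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof MaxDoubleValuesSource
        && a.equals(((MaxDoubleValuesSource) o).a)
        && b.equals(((MaxDoubleValuesSource) o).b);
  }

  @Override
  public int hashCode() { return Objects.hash(a, b); }

  @Override
  public String toString() { return "max(" + a + ", " + b + ")"; }
}

A "sum" variant would be identical apart from doubleValue().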
These are only used in classical Greek, I think, which probably explains why
they are not covered by the simpler filter.
On September 27, 2017 9:48:37 AM EDT, Ahmet Arslan
wrote:
>I may be wrong about ASCIIFoldingFilter. Please go with the
>ICUFoldingFilter.
>Ahmet
>On Wednesday, September 27, 2017,
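For anyone following along, wiring ICUFoldingFilter into an analyzer looks
roughly like this (a sketch; ICUFoldingFilter lives in the
lucene-analyzers-icu module, and the class name here is illustrative):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Unlike ASCIIFoldingFilter, ICUFoldingFilter applies the full Unicode
// folding tables, which should cover the classical-Greek characters too.
public class FoldingAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    TokenStream result = new ICUFoldingFilter(source);
    return new TokenStreamComponents(source, result);
  }
}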
There was some interesting work done on optimizing queries including
very common words (stop words) that I think overlaps with your problem.
See this blog post
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
from the Hathi Trust.
The upshot in a nutshel
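If memory serves, the technique described there is "common grams": at index
time, very common words are glued to their neighbors as bigrams, so phrase
queries containing stop words touch far fewer postings. A hedged sketch of the
Lucene side (the stop list is abbreviated and the class name is illustrative):

import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.commongrams.CommonGramsFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Emits tokens like "the_rain" alongside "the" and "rain", so a phrase
// query for "the rain" can run against the much rarer bigram.
public class CommonGramsAnalyzer extends Analyzer {
  private static final CharArraySet COMMON =
      new CharArraySet(Arrays.asList("the", "of", "and", "a", "in"), true);

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    TokenStream result = new CommonGramsFilter(source, COMMON);
    return new TokenStreamComponents(source, result);
  }
}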
Maybe high-frequency terms that are not evenly distributed throughout
the corpus would be a better definition. Discriminative terms. I'm
sure there is something in the machine learning literature about
unsupervised clustering that would help here. But I don't know what it
is :)
-Mike
On 0
l_text" field and only read _the_start_ of it?
Otherwise, I'm thinking I'll go with an extra 1st page field for the too-huge
documents.
-Paul
-----Original Message-----
From: Mike Sokolov [mailto:soko...@ifactory.com]
Sent: Saturday, June 23, 2012 7:16 PM
To: java-user@lucene.ap
e the decision about
whether to highlight.
-Mike Sokolov
On 6/23/2012 6:17 PM, Jack Krupansky wrote:
Simply have two fields, "full_body" and "limited_body". The former
would index but not store the full document text from Tika (the
"content" metadata). The latter would
'memory')
See:
http://wiki.apache.org/solr/FunctionQuery#tf
Lucene does have "FunctionQuery", "ValueSource", and
"TermFreqValueSource".
See:
http://lucene.apache.org/solr/api/org/apache/solr/search/function/FunctionQuery.html
-- Jack Krupansky
-Orig
I imagine this is a question that comes up from time to time, but I
haven't been able to find a definitive answer anywhere, so...
I'm wondering whether there is some type of Lucene query that filters by
term frequency. For example, suppose I want to find all documents that
have exactly 2 occ
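Besides the function-query route above, one direct way to answer "exactly 2
occurrences" with the current Lucene API is to walk the postings yourself;
PostingsEnum.freq() is the within-document term frequency. A sketch, not from
the thread, with placeholder field and term:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DocIdSetIterator;

public class ExactTfScan {
  static void collect(IndexReader reader) throws IOException {
    for (LeafReaderContext leaf : reader.leaves()) {
      PostingsEnum pe =
          leaf.reader().postings(new Term("contents", "foo"), PostingsEnum.FREQS);
      if (pe == null) continue; // term absent from this segment
      while (pe.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        if (pe.freq() == 2) {
          int globalDocId = leaf.docBase + pe.docID();
          // ... handle the match ...
        }
      }
    }
  }
}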
It sounds to me as if there could be a market for a new kind of query that
would implement:
A w/5 (B and C)
in the way that people understand it to mean - the same A near both B
and C, not just any A.
Maybe it's too hard to implement using rewrites into existing SpanQueries?
In terms of the Pos
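For illustration, here is the naive rewrite into existing queries, which shows
the trap: each proximity clause can be satisfied by a different occurrence of
A. (Sketch with placeholder field and terms.)

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class NaiveNearRewrite {
  // A w/5 B AND A w/5 C -- but nothing ties the two A's together.
  static Query build() {
    SpanQuery a = new SpanTermQuery(new Term("body", "a"));
    SpanQuery b = new SpanTermQuery(new Term("body", "b"));
    SpanQuery c = new SpanTermQuery(new Term("body", "c"));
    return new BooleanQuery.Builder()
        .add(new SpanNearQuery(new SpanQuery[] {a, b}, 5, false), Occur.MUST)
        .add(new SpanNearQuery(new SpanQuery[] {a, c}, 5, false), Occur.MUST)
        .build();
  }
}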
r,
but I don't know if it would be worth the trouble.
It turns out in my very specific case I have a term that appears in
every document in a particular field, so I am just using a search for
that at the moment.
-Mike
On 5/6/2012 8:04 PM, Mike Sokolov wrote:
I think what I have in min
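Generalizing that workaround: if no term naturally occurs in every document,
you can add one at index time, and any parser that can express a plain term
query then gets a match-all for free. (A sketch; the field and value names are
made up.)

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class MarkerTerm {
  // At index time, on every document:
  static void addMarker(Document doc) {
    doc.add(new StringField("all", "y", Field.Store.NO));
  }

  // At query time, behaves like MatchAllDocsQuery:
  static Query matchAll() {
    return new TermQuery(new Term("all", "y"));
  }
}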
itions for the whole document? Maybe it could be a "fake span" for
each document of 0 ... Integer.MAX_VALUE?
I think it would be nice to have as long as it's not going to be too
inefficient...
On Sun, May 6, 2012 at 5:26 PM, Mike Sokolov wrote:
does anybody know how to express a MatchAllDocsQuery
No, that doesn't work either - it works for the lucene query parser, but
not for the *surround* query parser, which I'm using because it has a
syntax for span queries.
On 5/6/2012 6:10 PM, Vladimir Gubarkov wrote:
Do you mean
*:*
?
On Mon, May 7, 2012 at 1:26 AM, Mike Sokolov wr
does anybody know how to express a MatchAllDocsQuery in surround query
parser language? I've tried
*
and()
but those don't parse. I looked at the grammar and I don't think there
is a way. Please let us all know if you know otherwise!
Thanks
I think you have hit on all the best solutions.
The Jira issues you mentioned do indeed hold out some promising
solutions here, but they are a ways away, requiring some significant
re-plumbing and I'm not sure there is a lot of attention being paid to
that at the moment. You should vote for t
My personal view, as a bystander with no more information than you, is
that one has to assume there will be further index format changes before
a 4.0 release. This is based on the number of changes in the last 9
months, and the amount of activity on the dev list.
For us the implication is we
oint me in the right direction?
Jeroen
-----Original Message-----
From: Mike Sokolov [mailto:soko...@ifactory.com]
Sent: Wednesday, 13 July 2011 15:23
To: java-user@lucene.apache.org
Cc: Jeroen Lauwers
Subject: Re: Advanced NearSpanQuery
Can you wrap a SpanNearQuery around a DisjunctionSumQuery with
Can you wrap a SpanNearQuery around a DisjunctionSumQuery with
minNrShouldMatch=8?
-Mike
On 07/13/2011 08:53 AM, Jeroen Lauwers wrote:
Hi,
I was wondering if anyone could help me on this:
I want to search for:
1. a set of words (e.g., 10)
2. only a couple of words may come in be
Our apps use highlighting, and I expect that highlighting is an
expensive operation since it requires processing the text of the
documents, but I ran a test and was surprised just how expensive it is.
I made a test index with three fields: path, modified, and contents. I
made the index using
Down to basics, Lucene searches work by locating terms and resolving
documents from them. For standard term queries, a term is located by a
process akin to binary search. That means that it uses log(n) seeks to
get the term. Let's say you have 10M terms in your corpus. If you stored
that in a si
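(Concretely: log2(10,000,000) is about 23, so locating a term in a single
sorted dictionary of 10M entries costs on the order of 23 probes in the worst
case, before any caching helps.)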
l the documents that contain foo, but I want them
sorted by frequency.
Then, I would have doc1, doc2.
Now, I want to search for all the documents that contain foo, but I want them
sorted by weight1.
Then, I would have doc2, doc1
Does that clarify?
On May 5, 2011, at 3:01 PM, Mike Sokolov
Are the tokens unique within a document? If so, why not store a document
for every doc/token pair with fields:
id (doc#/token#)
doc-id (doc#)
token
weight1
weight2
frequency
Then search for token, sort by weight1, weight2 or frequency.
If the token matches are unique within a document you will
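A sketch of that doc-per-token layout with the current API (field names taken
from the list above; the weights are assumed to be long-valued so they can be
sorted via doc values):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class TokenPairIndex {
  static Document tokenDoc(String docId, String token, long w1, long w2, long freq) {
    Document d = new Document();
    d.add(new StringField("id", docId + "/" + token, Field.Store.YES));
    d.add(new StringField("doc-id", docId, Field.Store.YES));
    d.add(new StringField("token", token, Field.Store.NO));
    d.add(new NumericDocValuesField("weight1", w1));
    d.add(new NumericDocValuesField("weight2", w2));
    d.add(new NumericDocValuesField("frequency", freq));
    return d;
  }

  // Search for a token, highest weight1 first; swap the field name to sort
  // by weight2 or frequency instead.
  static TopDocs byWeight1(IndexSearcher searcher, String token) throws IOException {
    Sort sort = new Sort(new SortField("weight1", SortField.Type.LONG, true));
    return searcher.search(new TermQuery(new Term("token", token)), 10, sort);
  }
}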
It's an idea - sorry I don't have an implementation I can share easily;
it's embedded in our application code and not easy to refactor. I'm not
sure where this would fit in the Solr architecture; maybe some subclass
of SearchHandler? I guess the query rewriter would need to be aware of
which
Background: I've been trying to enable hit highlighting of XML documents
in such a way that the highlighting preserves the well-formedness of the
XML.
I thought I could get this to work by implementing a CharFilter that
extracts text from XML (somewhat like HTMLStripCharFilter, except I am
us
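The hook for this is Analyzer.initReader(). The sketch below uses
HTMLStripCharFilter as a stand-in for the XML-extracting CharFilter described
above (the class name is illustrative); the important property is that
CharFilters maintain an offset map back into the original markup, which is
what keeps highlight offsets valid:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class StripMarkupAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    return new TokenStreamComponents(new StandardTokenizer());
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Offsets reported by the tokenizer are corrected back to positions in
    // the original (markup-included) input.
    return new HTMLStripCharFilter(reader);
  }
}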