Re: Custom indexing

2016-04-18 Thread Jack Krupansky
You failed to disclose up front that you are using such an old release of Lucene. Lucene is now on 6.0. I'll defer to others if they wish to provide support for such an old release. -- Jack Krupansky On Mon, Apr 18, 2016 at 8:01 AM, PK C wrote: > Hi, > >Thank you very much

Re: Custom indexing

2016-04-12 Thread Jack Krupansky
The standard analyzer/tokenizer should do a decent job of splitting on dot, hyphen, and underscore, in addition to whitespace and other punctuation. Can you post some specific test cases you are concerned with? (You should always run some test cases.) -- Jack Krupansky On Tue, Apr 12, 2016 at

Re: Subset Matching

2016-03-25 Thread Jack Krupansky
There is no simple, direct way to do this "Boolean Reverse Query" in Lucene, but I suggest filing a Jira to request this as a feature improvement/new feature. -- Jack Krupansky On Fri, Mar 25, 2016 at 11:43 AM, Ahmet Arslan wrote: > Hi Otmar, > > For this requirement, yo

Re: Query regarding Lucene

2016-03-10 Thread Jack Krupansky
Are you calling the IndexSearcher#explain method to get the details of the score calculation? How exactly are your results not what you expect? What Similarity are you using? Scores will be the product of the underlying calculated scores and your term boost values. -- Jack Krupansky On Thu, Mar

Re: Creating composite query in lucene

2016-03-08 Thread Jack Krupansky
BooleanQuery can be nested, so you do a top-level BQ that has two clauses, the first a TQ for a:x and the second another BQ that itself has two clauses, both SHOULD. -- Jack Krupansky On Tue, Mar 8, 2016 at 4:38 AM, sandeep das wrote: > Hi, > > I'm using lucene-5.2.0 and in que

Field name syntax for Lucene Expressions

2016-02-29 Thread Jack Krupansky
ARIABLE binding. -- Jack Krupansky

Re: Spaces in regular expressions

2016-02-15 Thread Jack Krupansky
source line. And then there is the issue of code sequences that span source lines. -- Jack Krupansky On Mon, Feb 15, 2016 at 8:30 AM, Kudrettin Güleryüz wrote: > Since documents are source code, I am considering matching on operators > too. > > Using whitespace analyzer, A=foo(){ would

Re: Spaces in regular expressions

2016-02-13 Thread Jack Krupansky
ant to search for two keywords with any operator sequence between them? Or... do you want to match on operators as well but simply want to ignore whitespace? Generally, the standard analyzer/tokenizer is better/easier - you can simply query "A foo" and it will match all three of you s

Re: Spaces in regular expressions

2016-02-13 Thread Jack Krupansky
separate string (not tokenized text) field and then you can do a complex regex that spans terms (and only do that if normal span queries don't do what you need.) What does your typical cross-term regex actually look like? -- Jack Krupansky On Sat, Feb 13, 2016 at 1:25 PM, Uwe Schindler wrot

Re: boolean query for multiple values on a specific field

2016-01-27 Thread Jack Krupansky
code that would tell the analyzer that "tag" is a defined field. Also, I see no value to having the single-clause BooleanQuery wrapped around the actual query. -- Jack Krupansky On Wed, Jan 27, 2016 at 12:52 PM, G.Long wrote: > Hi :) > > I would like to retrieve a document from

Re: Poor performances with Shingle and Phrase query

2016-01-21 Thread Jack Krupansky
Be sure to check and see if your app is compute or I/O bound during this process - whether too little of your index is cached in system memory and each query requires I/O, lots of it. -- Jack Krupansky On Thu, Jan 21, 2016 at 1:52 PM, Doug Turnbull < dturnb...@opensourceconnections.com>

Re: How to escape URL at indexing time

2015-12-27 Thread Jack Krupansky
It looks like you attempted to quote the URL in your query using apostrophes (sometimes referred to as single quotes), but you need to use quotes (sometimes referred to as double quotes). Change: id:'http://www.yahoo.com' to: id:"http://www.yahoo.com" -- Jack Krupansky On Su
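
A small sketch of building such a quoted clause in plain Java (not the Lucene API). The class and method names are hypothetical, and the escaping rule is an assumption: inside a double-quoted value, only backslash and the double quote itself are escaped.

```java
class QuotedTermSketch {
    // Build field:"value", escaping backslashes and double quotes inside the value.
    static String quotedTerm(String field, String value) {
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return field + ":\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        // The URL is the one from the message above.
        System.out.println(quotedTerm("id", "http://www.yahoo.com"));
    }
}
```

With double quotes around the value, the colon and slashes in the URL no longer need individual escaping in the query string.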

Re: Any lucene query sorts docs by Hamming distance?

2015-12-24 Thread Jack Krupansky
, but is deprecated and has been relegated to the sand box, so it is not really usable going forward: http://lucene.apache.org/core/5_4_0/sandbox/index.html?org/apache/lucene/sandbox/queries/SlowFuzzyQuery.html -- Jack Krupansky On Tue, Dec 22, 2015 at 4:02 AM, Yonghui Zhao wrote: > Hi, >

Re: Searching for "iso surface", and looking for "isosurface"

2015-12-17 Thread Jack Krupansky
/DictionaryCompoundWordTokenFilterFactory.html The doc is weak. I do have some examples in my old Solr 4.x Deep Dive e-book: http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html You might also be able to achieve a similar effect with synonyms, but again only

Re: Need change one field type from IntField to String including indexOptions to store positions & Norms

2015-12-17 Thread Jack Krupansky
You could certainly read your stored values from your current index and then write new documents to a new index and then use the new index. That's if all of the indexed field values are stored. -- Jack Krupansky On Thu, Dec 17, 2015 at 2:10 PM, Kumaran Ramasubramanian wrote: >

Re: Need change one field type from IntField to String including indexOptions to store positions & Norms

2015-12-17 Thread Jack Krupansky
Delete the full index and create from scratch with the correct field type, re-adding all documents. Any remnants of the old field must be removed. -- Jack Krupansky On Thu, Dec 17, 2015 at 11:48 AM, Kumaran R wrote: > While Reindexing only am facing this problem. > > Just to confir

Re: Need change one field type from IntField to String including indexOptions to store positions & Norms

2015-12-17 Thread Jack Krupansky
The standard answer is that you need to reindex all of your data. -- Jack Krupansky On Thu, Dec 17, 2015 at 6:10 AM, Kumaran Ramasubramanian wrote: > Dear All > > i am using lucene 4.10.4. Is there any more information i missed to > provide? Please let me know. > >

Re: Jensen–Shannon divergence

2015-12-14 Thread Jack Krupansky
earch/similarities/TFIDFSimilarity.html https://lucene.apache.org/core/5_3_0/core/org/apache/lucene/search/similarities/BM25Similarity.html -- Jack Krupansky On Sun, Dec 13, 2015 at 8:30 AM, Shay Hummel wrote: > Hi > > I need help to implement similarity between query model and document model.

Re: Wildcard Terms and total word or phrase count

2015-11-29 Thread Jack Krupansky
You didn't post your code that creates the index. Make sure you are using a tokenized TextField rather than a single-token StringField. -- Jack Krupansky On Fri, Nov 27, 2015 at 4:06 PM, Kunzman, Douglas * < douglas.kunz...@fda.hhs.gov> wrote: > Hi - > > This is my firs

Re: lucene query complexity

2015-11-20 Thread Jack Krupansky
nce and memory for a significant sample of realistic data and then you can empirically deduce what the big-O function is for your particular application data and data model. -- Jack Krupansky On Fri, Nov 20, 2015 at 4:38 AM, Adrien Grand wrote: > I don't think the big-O notation is approp

Re: need help in search

2015-10-05 Thread Jack Krupansky
, so if you need to keep that entire string as one term, use the whitespace tokenizer. That said, treating hyphen as a word break is usually not a problem as long as you enable auto phrase generation for the query parser. -- Jack Krupansky On Mon, Oct 5, 2015 at 4:06 AM, Bhaskar wrote: >

Re: Need help in alphanumeric search

2015-10-01 Thread Jack Krupansky
Phrase query for a tokenized text field should do it. -- Jack Krupansky On Thu, Oct 1, 2015 at 10:04 PM, Bhaskar wrote: > Hi Jack, > > my searching is working like this. > > if i give input as "SD RAM Bhaskar" then which ever strings are having > "SD", &

Re: Need help in alphanumeric search

2015-10-01 Thread Jack Krupansky
Technically, there is no such thing as a "sentence search" in Lucene. Please provide an example of how you wish to search, and then we can determine whether a phrase query or a span query might accomplish the task. -- Jack Krupansky On Thu, Oct 1, 2015 at 11:53 AM, Bhaskar wrote: >

Re: How to use case in-sentive search

2015-08-14 Thread Jack Krupansky
really how to get case-sensitive query, simply create your own analyzer without the lower case filter. -- Jack Krupansky On Fri, Aug 14, 2015 at 10:07 AM, Erick Erickson wrote: > Add LowercaseFilterFactory to your analysis chain for the fieldType > both at query and index time. You'll

Re: ignore score and weight in lucene search

2015-07-29 Thread Jack Krupansky
ConstantScoreQuery is the proper approach. What specific failure did you encounter? -- Jack Krupansky On Wed, Jul 29, 2015 at 7:09 AM, 丁儒 wrote: > Hi, all > Currently i'm using lucene. But i don't care the score and weight, i > just need the documents meets the query.

Re: Analyzer for supporting hyphenated words

2015-07-21 Thread Jack Krupansky
assic/QueryParserBase.html#setAutoGeneratePhraseQueries(boolean) -- Jack Krupansky On Fri, Jul 17, 2015 at 4:41 AM, Diego Socaceti wrote: > Hi all, > > i'm new to lucene and tried to write my own analyzer to support > hyphenated words like wi-fi, jean-pierre, etc. > For our customer it

Re: Using lucene queries to search StringFields

2015-06-21 Thread Jack Krupansky
://lucene.apache.org/core/5_2_0/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html You can also simply escape the spaces with a backslash rather than quote the entire term, but you still need to use the keyword analyzer. -- Jack Krupansky On Fri, Jun 19, 2015 at 2:31 AM, Gimantha

Re: Text dependent analyzer

2015-04-15 Thread Jack Krupansky
sentence boundaries are? Be specific, because that determines what your queries should look like, which determines what the indexed text should look like, which determines how the text should be analyzed. -- Jack Krupansky On Wed, Apr 15, 2015 at 8:12 AM, Shay Hummel wrote: > Hi Ahment, > Tha

Re: Calculate the score of an arbitrary string vs a query?

2015-04-10 Thread Jack Krupansky
/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query, int) -- Jack Krupansky On Fri, Apr 10, 2015 at 4:15 PM, Gregory Dearing wrote: > Hi Ali, > > The short answer to your question is... there's no good way to create a > score from your result stri

Re: Lucene and accumulo

2015-04-09 Thread Jack Krupansky
/browse/ACCUMULO-3698 The SQRRL commercial product has (or at least had before the company shifted its corporate strategy) Lucene indexing of Accumulo data, but that's a proprietary product: http://sqrrl.com/product/search/ -- Jack Krupansky On Thu, Apr 9, 2015 at 6:33 AM, madhvi wrote:

Re: Would Like to contribute to Lucene

2015-03-27 Thread Jack Krupansky
is always a great contribution. -- Jack Krupansky On Thu, Mar 26, 2015 at 8:15 PM, Erick Erickson wrote: > You really have to just pick a problem, dive into the code and learn > it bit by bit through exploration. The code base changes fast enough > that anything published will be out o

Re: how to reasonably estimate the disk size for Lucene 4.x

2015-03-24 Thread Jack Krupansky
"hey, everything runs great on commodity hardware!" Kool-Aid. IOW, running a 32GB index on a 16 GB box is probably not a great idea if you need low latency. -- Jack Krupansky On Tue, Mar 24, 2015 at 8:37 AM, Gaurav gupta wrote: > Erick, > When further testing the index sizes using Lucene APIs

Re: Tokenizer for Brown Corpus?

2015-02-24 Thread Jack Krupansky
This is the first mention that I have seen for that corpus on this list. There seem to be more than a few references when I google for ""brown corpus" lucene", such as: https://github.com/INL/BlackLab/wiki/Blacklab-query-tool -- Jack Krupansky On Tue, Feb 24, 2015 at 1:4

Re: Indexing Query

2015-02-18 Thread Jack Krupansky
You could store the length of the field (in terms) in a second field and then add a MUST term to the BooleanQuery which is a RangeQuery with an upper bound that is the maximum length that can match. -- Jack Krupansky On Wed, Feb 18, 2015 at 4:54 AM, Ian Lea wrote: > You mean you'

Re: Boolean Search Query is not workng

2015-01-24 Thread Jack Krupansky
me of your documents have different capitalization of Java/java. -- Jack Krupansky On Fri, Jan 23, 2015 at 9:54 AM, Rajendra Rao wrote: > Hello > Reply to the mail, sent by Nitin We tried and this is what we got : > > My query was dotNet^10.0 Resume:jdbc Resume:C# Resume:MVC > > Do

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-15 Thread Jack Krupansky
vated. -- Jack Krupansky On Thu, Jan 15, 2015 at 11:23 AM, danield wrote: > Hi Mike, > > Thank you for your reply. Yes, I had thought of this, but it is not a > solution to my problem, and this is because the Term Frequency and > therefore > the results will still be wrong, as pr

Re: Questions regarding Lucene 5

2015-01-10 Thread Jack Krupansky
/lucene/facet/FacetsCollector.java?revision=1634013&view=markup Any other particular features of Lucene 5 that you are particularly interested in? -- Jack Krupansky On Sat, Jan 10, 2015 at 3:01 PM, Elad Margalit wrote: > Hi, > > I would like to ask regarding Lucene 5, > &

Re: Looking for docs that have certain fields empty (an/or not set)

2015-01-07 Thread Jack Krupansky
Oops... I take that back! After I clicked Send I realized that this is the Lucene list - what I said is true for Solr queries, but that is because Solr added a "hack" to do things properly, but the Lucene query parser doesn't have that hack, so Erick is correct. -- Jack Krupansky

Re: Looking for docs that have certain fields empty (an/or not set)

2015-01-07 Thread Jack Krupansky
The pure negative query should work fine as a top level query - it's just when nested as a sub-query within parentheses that it misbehaves. -- Jack Krupansky On Wed, Jan 7, 2015 at 11:30 AM, Erick Erickson wrote: > Should be, but it's a bit confusing because the query syntax

Re: OutOfMemoryError indexing large documents

2014-11-26 Thread Jack Krupansky
the above strategy would be reasonable, or do you need to process large numbers of large documents. -- Jack Krupansky -Original Message- From: ryanb Sent: Tuesday, November 25, 2014 7:39 PM To: java-user@lucene.apache.org Subject: OutOfMemoryError indexing large documents Hello, We

Re: Exceptions during batch indexing

2014-11-08 Thread Jack Krupansky
Oops... you sent this to the wrong list - this is the Lucene user list, send it to the Solr user list. -- Jack Krupansky -Original Message- From: Peter Keegan Sent: Thursday, November 6, 2014 3:21 PM To: java-user Subject: Exceptions during batch indexing How are folks handling Solr

Re: Questions about the Lucene query language

2014-10-27 Thread Jack Krupansky
Pure negative queries are not supported, but all you need to do is include *:*, which translates into MatchAllDocsQuery. "hello dolly" is the same as "hello dolly"~0 -- Jack Krupansky -Original Message- From: Prad Nelluru Sent: Monday, October 27, 2014 8

Re: How to properly use Levenstein distance with ~ in Java

2014-10-18 Thread Jack Krupansky
Oops... for future reference, please post Solr questions to the *Solr* user list, not the *Lucene* ("java") user list! -- Jack Krupansky -Original Message----- From: Jack Krupansky Sent: Saturday, October 18, 2014 7:50 AM To: java-user@lucene.apache.org Subject: Re: How to pr

Re: How to properly use Levenstein distance with ~ in Java

2014-10-18 Thread Jack Krupansky
What is the value of the "qf" parameter? You don't have an explicit field name such as "title" in your query string, "q". -- Jack Krupansky -Original Message- From: Aleksander Sadecki Sent: Thursday, October 16, 2014 11:46 AM To: java-user@lucene.

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-09-30 Thread Jack Krupansky
Yes, most special characters are treated as term delimiters, except that underscores, dots, and commas have some special rules. See the details under Standard Tokenizer in my Solr e-book: http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product

Re: NOTICE: Seeking Moderators for java-user@lucene

2014-09-30 Thread Jack Krupansky
Yeah, I can be a moderator, for both Lucene and Solr. -- Jack Krupansky -Original Message- From: Chris Hostetter Sent: Tuesday, September 30, 2014 12:51 PM To: java-user@lucene.apache.org Cc: java-user-ow...@lucene.apache.org Subject: NOTICE: Seeking Moderators for java-user@lucene

Re: Term vectors

2014-09-30 Thread Jack Krupansky
/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html The free Solr Reference Guide has a short section on the Solr Term Vector component. You could check it out before buying my $10 e-book. See: https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component --

Re: Migrating lucene index to Elastic Search

2014-09-26 Thread Jack Krupansky
since ES has some special things they do so that a raw Lucene index is unlikely to be compatible with ES, and to simply "reindex" your source data directly into ES to take full advantage of ES. -- Jack Krupansky -Original Message- From: Aditya Sent: Friday, September 26, 2014

Re: How to properly correlate relevance in a search across multiple collections

2014-09-06 Thread Jack Krupansky
r users pure-tf scoring if it provides faster search results, and then the user could click on a "refine results" button to re-do the search with the more expensive cross-corpus df-based scoring. Thoughts? -- Jack Krupansky -Original Message- From: Baldwin, David Sent:

Re: Question regarding complex queries and long tail suggestions

2014-09-03 Thread Jack Krupansky
pache/lucene/queryparser/complexPhrase/TestComplexPhraseQuery.java?revision=1622067&view=markup -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Wednesday, September 3, 2014 7:14 PM To: java-user Subject: Re: Question regarding complex queries and long tail suggestions Ta

Re: indexing all suffixes to support leading wildcard?

2014-08-28 Thread Jack Krupansky
Use the ngram token filter, and then a query of 512 would match by itself: http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Thursday, August 28, 2014 11:52 PM To
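
A rough illustration of why n-grams help here, in plain Java rather than the Lucene API. The sample term "AB512X" and the gram size of 3 are assumptions for the example, not values from the thread. The idea is that every substring of the chosen sizes becomes its own indexed term, so an inner fragment can be found with an ordinary term query instead of a leading wildcard.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class NGramSketch {
    // Collect every substring of term whose length is between minGram and maxGram.
    static Set<String> ngrams(String term, int minGram, int maxGram) {
        Set<String> grams = new LinkedHashSet<>();
        for (int start = 0; start < term.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= term.length(); len++) {
                grams.add(term.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Hypothetical indexed value; "512" shows up as one of the generated grams.
        System.out.println(ngrams("AB512X", 3, 3)); // prints [AB5, B51, 512, 12X]
    }
}
```

The trade-off is index size: the number of grams grows with both term length and the range of gram sizes.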

Re: Why does this search fail?

2014-08-27 Thread Jack Krupansky
's documented. See: https://support.google.com/websearch/answer/136861?hl=en It also seems to support "**" in a quoted phrase to mean one or more arbitrary terms. This isn't documented, but seems to work. -- Jack Krupansky -Original Message- From: Milind Sent: Wednesday

Re: Why does this search fail?

2014-08-27 Thread Jack Krupansky
core/4_9_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html -- Jack Krupansky -Original Message- From: Michael Sokolov Sent: Wednesday, August 27, 2014 10:26 AM To: java-user@lucene.apache.org Subject: Re: Why does this search fail? Tokenization is tricky.

Re: Why does this search fail?

2014-08-26 Thread Jack Krupansky
alyzed. Some filters such as lower case are defined as "multi-term", so they will be performed, but the standard tokenizer is not being called, so the dot remains and this whole term is treated as one term, unlike the index analysis. -- Jack Krupansky -Original Message- From:

Re: WhiteSpaceTokenizer

2014-08-15 Thread Jack Krupansky
Sure, that should be a configurable option. Oh, and I neglected to mention a workaround: use the pattern tokenizer, which doesn't have a limit (yet.) But it might be slower. -- Jack Krupansky -Original Message- From: Sheng Sent: Friday, August 15, 2014 8:13 AM To: java

Re: WhiteSpaceTokenizer

2014-08-15 Thread Jack Krupansky
: https://issues.apache.org/jira/browse/LUCENE-5785 -- Jack Krupansky -Original Message- From: Sheng Sent: Thursday, August 14, 2014 11:38 PM To: java-user@lucene.apache.org Subject: WhiteSpaceTokenizer The length of token has to be shorter than 255, otherwise there will be unpredictable

Re: Searching with String that Represents a Signature

2014-08-14 Thread Jack Krupansky
The standard analyzer will discard most special characters as punctuation. What analyzer are you using? -- Jack Krupansky -Original Message- From: Scott Selvia Sent: Thursday, August 14, 2014 7:42 PM To: java-user@lucene.apache.org Subject: Searching with String that Represents a

Re: escaping characters

2014-08-12 Thread Jack Krupansky
The default changed to "false" in Lucene 3.1. Before that it was "true". -- Jack Krupansky -Original Message- From: Chris Salem Sent: Tuesday, August 12, 2014 8:34 AM To: java-user@lucene.apache.org Subject: RE: escaping characters Thanks! That worked. We recent

Re: Can't get case insensitive keyword analyzer to work

2014-08-12 Thread Jack Krupansky
And unfiltered. So even if you use the keyword tokenizer that only generates a single token, you still want token filtering, such as lower case. -- Jack Krupansky -Original Message- From: Christoph Kaser Sent: Tuesday, August 12, 2014 3:07 AM To: java-user@lucene.apache.org Subject

Re: escaping characters

2014-08-11 Thread Jack Krupansky
#setAutoGeneratePhraseQueries(boolean) -- Jack Krupansky -Original Message- From: Chris Salem Sent: Monday, August 11, 2014 1:03 PM To: java-user@lucene.apache.org Subject: RE: escaping characters I'm not using Solr. Here's my code: FSDirectory fsd = FSDirectory.open(n

Re: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term Frequency

2014-08-07 Thread Jack Krupansky
Also, usually query-time analysis is done by a "query parser", so if you aren't going through a query parser, you have to call the analyzer yourself. The stemming is very likely the culprit here. -- Jack Krupansky -Original Message- From: Uwe Schindler Sent: Thu

Re: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term Frequency

2014-08-07 Thread Jack Krupansky
need to manually filter your query terms. Sounds like maybe a term got stemmed. -- Jack Krupansky -Original Message- From: Bianca Pereira Sent: Thursday, August 7, 2014 7:28 AM To: java-user@lucene.apache.org Subject: EnglishAnalyzer vs WhiteSpaceAnalyzer in getting Term Frequency Hi

Re: Lucene Query Wrong Result for phrase.

2014-07-18 Thread Jack Krupansky
The standard tokenizer will strip off those escaped quotes at query time. Ditto for the hyphen at index time. Try constructing your own analyzer using the white space tokenizer instead of the standard tokenizer. -- Jack Krupansky -Original Message- From: itisismail Sent: Friday

Re: How to handle words that stem to stop words

2014-07-07 Thread Jack Krupansky
your stop words, or possibly a pattern that matches stop words plus a short suffix that might get stemmed. -- Jack Krupansky -Original Message- From: Arjen van der Meijden Sent: Sunday, July 6, 2014 2:47 PM To: java-user@lucene.apache.org Subject: How to handle words that stem to stop

Re: QueryParserUtil, big query with wildcards -> runs endlessly and produces heavy load

2014-06-26 Thread Jack Krupansky
I'll defer to the hard-core Lucene committers for the technical details, but I would suggest that a very large term with dozens of wildcards is a "known limitation" (albeit not well-documented.) IOW, to use wildcards in Lucene in a performant manner, they need to be "

Re: Lucene QueryParser/Analyzer inconsistency

2014-06-17 Thread Jack Krupansky
h introduces a regex query term. It is added by the escape method you call, but the escaping will be gone by the time your analyzer is called. So, just try a simple, unescaped slash in your char mapping table. -- Jack Krupansky -Original Message- From: Luis Pureza Sent: Tuesday, Jun

Re: searching with stemming

2014-06-09 Thread Jack Krupansky
Please do file a Jira. I'm sure the discussion will be interesting. -- Jack Krupansky -Original Message- From: Jamie Sent: Monday, June 9, 2014 9:33 AM To: java-user@lucene.apache.org Subject: Re: searching with stemming Jack Thanks. I figured as much. I'm modifying eac

Re: searching with stemming

2014-06-09 Thread Jack Krupansky
mprovement. -- Jack Krupansky -Original Message- From: Jamie Sent: Monday, June 9, 2014 6:56 AM To: java-user@lucene.apache.org Subject: Re: searching with stemming To me, it seems strange that these default analyzers, don't provide constructors that enable one to override stemming, e

Re: How to approach indexing source code?

2014-06-03 Thread Jack Krupansky
indexed. -- Jack Krupansky -Original Message- From: Johan Tibell Sent: Tuesday, June 3, 2014 9:32 PM To: java-user@lucene.apache.org Subject: How to approach indexing source code? Hi, I'd like to index (Haskell) source code. I've run the source code through a compiler (GHC) t

Re: search performance

2014-06-02 Thread Jack Krupansky
256GB machine? How frequent are your commits for updates while doing queries? -- Jack Krupansky -Original Message- From: Jamie Sent: Monday, June 2, 2014 2:51 AM To: java-user@lucene.apache.org Subject: search performance Greetings Despite following all the recommended optimizations (as

Re: Multi-thread indexing, should the commit be called from each thread?

2014-05-21 Thread Jack Krupansky
(Was this supposed to be a java-user/Lucene question or a Solr question?!) -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Wednesday, May 21, 2014 10:58 AM To: java-user Subject: Re: Multi-thread indexing, should the commit be called from each thread? I'll be

Re: Performance issue when using multiple PhraseQueries against a 1+ million entries index

2014-05-19 Thread Jack Krupansky
Does your index fit fully in system memory - the OS file cache? If not, there could be a lot of thrashing (I/O) as Lucene accesses the index. -- Jack Krupansky -Original Message- From: Liviu Matei Sent: Monday, May 19, 2014 4:21 PM To: java-user@lucene.apache.org Subject: Performance

Re: writer.updateDocument() not working (possible bug?)

2014-05-19 Thread Jack Krupansky
for a batch update model as opposed to a true real-time database (it's a search engine, not a database!), but... the original goals and requirements might give us some insight. Thanks. -- Jack Krupansky -Original Message- From: Michael McCandless Sent: Monday, May 19, 2014 6:10 AM

Re: A work around to get matching terms from document - Stemmed and Synonyms

2014-05-17 Thread Jack Krupansky
Oops... I just noticed that you sent this request to the "java-user" list, which is primarily for developers using the Lucene library directly. Try sending it to the solr-user list, which is for users and developers working with Solr. -- Jack Krupansky -Original Message-

Re: A work around to get matching terms from document - Stemmed and Synonyms

2014-05-17 Thread Jack Krupansky
The "explain" section of the debug response when you set the debugQuery=true parameter will give you the final terms that were matched for each document. -- Jack Krupansky -Original Message- From: venkatesham.gu...@igate.com Sent: Saturday, May 17, 2014 2:28 AM To:

Re: How to locate a Phrase inside text (like a Browser text searcher)

2014-05-16 Thread Jack Krupansky
that could be handled by having a tokenizer that simply ignored punctuation and whitespace and generated one big original token and then n-grammed it based on some maximal query phrase size. And... the original requirement spec didn't list that as a use case anyway. -- Jack Kr

Re: How to locate a Phrase inside text (like a Browser text searcher)

2014-05-11 Thread Jack Krupansky
's not very practical. In truth, Lucene/Solr doesn't have a good out of the box solution for this use case. -- Jack Krupansky -Original Message- From: teko Sent: Thursday, May 8, 2014 9:03 AM To: java-user@lucene.apache.org Subject: How to locate a Phrase inside text (lik

Re: is there a historical reason why default conjunction operator is "OR"?

2014-04-16 Thread Jack Krupansky
s. Using explicit operators gives you "precision", which power users will appreciate. Average users just get annoyed when the search engine is being so picky. -- Jack Krupansky -Original Message- From: Jose Carlos Canova Sent: Wednesday, April 16, 2014 12:53 PM

Re: Stored fields and OS file caching

2014-04-05 Thread Jack Krupansky
. -- Jack Krupansky -Original Message- From: Adrien Grand Sent: Friday, April 4, 2014 4:50 PM To: java-user@lucene.apache.org Subject: Re: Stored fields and OS file caching Hi Vitaly, Doc values are indeed well-suited for grouping and sorting. However stored fields remain better at returning

Re: Lucene Wildcard for zero or one character

2014-03-25 Thread Jack Krupansky
/houses?/ -- Jack Krupansky -Original Message- From: Uwe Schindler Sent: Tuesday, March 25, 2014 11:34 AM To: java-user@lucene.apache.org Subject: RE: Lucene Wildcard for zero or one character The default WildcardQuery only supports: '*' (star) is the wildcard in Wildcar
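
To see what the optional-character pattern above accepts, here is a quick check using java.util.regex as a stand-in. This is an assumption-laden sketch: Lucene's regexp syntax is not identical to Java's, but the `?` "zero or one" operator behaves the same way in both, and the pattern is applied to the whole term.

```java
import java.util.regex.Pattern;

class OptionalCharSketch {
    // Whole-term match: the pattern must cover the entire term, not just a substring.
    static boolean matchesTerm(String regex, String term) {
        return Pattern.matches(regex, term);
    }

    public static void main(String[] args) {
        System.out.println(matchesTerm("houses?", "house"));  // true
        System.out.println(matchesTerm("houses?", "houses")); // true
        System.out.println(matchesTerm("houses?", "housed")); // false
    }
}
```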

Re: maxDoc/numDocs int fields

2014-03-21 Thread Jack Krupansky
nch that literally does that switch now, but otherwise, that's the limit for now. -- Jack Krupansky -Original Message- From: Artem Gayardo-Matrosov Sent: Friday, March 21, 2014 12:41 PM To: java-user@lucene.apache.org Subject: Re: maxDoc/numDocs int fields Hi Oli, Thanks for y

Re: How to search for terms containing negation

2014-03-18 Thread Jack Krupansky
Of course - you need to use the same analyzer for both indexing and query. So, just reindex your data with this new analyzer. -- Jack Krupansky -Original Message- From: Natalia Connolly Sent: Tuesday, March 18, 2014 10:37 AM To: java-user@lucene.apache.org Subject: Re: How to search

Re: tf/idf similarity with modified document similarity

2014-03-07 Thread Jack Krupansky
that info is hanging around as part of the query matching process. Still, that is a reasonable feature to want and it has been requested before. Worth a Jira. -- Jack Krupansky -Original Message- From: Christian Reuschling Sent: Thursday, March 6, 2014 1:34 PM To: java-user

Re: encoding problem when retrieving document field value

2014-03-03 Thread Jack Krupansky
come about picking a PU Unicode character? -- Jack Krupansky -Original Message- From: G.Long Sent: Monday, March 3, 2014 12:09 PM To: java-user@lucene.apache.org Subject: encoding problem when retrieving document field value Hi :) My index (Lucene 3.5) contains a field called title. It

Re: query regarding Lucene Indexing and searching

2014-03-02 Thread Jack Krupansky
Please elaborate on what you expect will be in this payload. Is it information derived from the indexing process itself or is it external information to be added to the indexed terms? -- Jack Krupansky -Original Message- From: Mrugendra Sent: Sunday, March 2, 2014 5:15 AM To: java

Re: Fuzzy query on capital letters does not match documents

2014-02-27 Thread Jack Krupansky
Be careful with very short terms and fuzzy query. The rounding when converting from a fraction to an edit distance can make the match exact rather than fuzzy. What terms does your index have? XV, Xv, xV, xv? XV~0.7 may only match XV. -- Jack Krupansky -Original Message- From: G.Long
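
A minimal sketch of the rounding effect, in plain Java rather than the Lucene API. The conversion formula and the cap of 2 edits are assumptions, modeled on how a fractional similarity is understood to be mapped to an edit distance in Lucene 4.x; check FuzzyQuery in your release before relying on the exact numbers.

```java
class FuzzyEditsSketch {
    static final int MAX_EDITS = 2; // assumed upper bound on supported edit distance

    // Assumed conversion for 0 < minSimilarity < 1: truncate (1 - similarity) * termLength.
    static int toEdits(float minSimilarity, int termLength) {
        return Math.min((int) ((1d - minSimilarity) * termLength), MAX_EDITS);
    }

    public static void main(String[] args) {
        System.out.println(toEdits(0.7f, 2));  // 0 edits: a two-letter term matches only exactly
        System.out.println(toEdits(0.5f, 2));  // 1 edit
        System.out.println(toEdits(0.7f, 10)); // 2 edits, limited by the cap
    }
}
```

Under these assumptions a two-character term such as XV with a similarity of 0.7 ends up with zero allowed edits, which is why the query behaves like an exact match.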

Re: How to delete a token that comes exactly after a token

2014-02-26 Thread Jack Krupansky
If this is primarily an issue with the document input, as opposed to queries, you might be better off simply preprocessing the text before it is given to Lucene to be indexed. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Wednesday, February 26, 2014 1:37 PM To

Re: How to delete a token that comes exactly after a token

2014-02-26 Thread Jack Krupansky
Sounds like a custom filter. Or maybe an option for stop filter or a specialization of stop filter. Or maybe it could be even more generalized. What are some practical example token sequences? -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Wednesday, February 26

Re: codec mismatch

2014-02-17 Thread Jack Krupansky
native file system for greater performance. Solrandra stored the Lucene indexes in Cassandra, but the performance penalty was too high. -- Jack Krupansky -Original Message- From: Jason Wee Sent: Friday, February 14, 2014 3:13 AM To: java-user@lucene.apache.org Subject: codec mismatch

Re: char mapping in lucene-icu

2014-02-14 Thread Jack Krupansky
it could be as simple as whether the data file should have DOS or UNIX or Mac line endings (CRLF vs. NL vs. CR.) Be sure to use an editor that satisfies the requirements of ICU. To be clear, Lucene itself does not have a published API for modifying the mappings of ICU. -- Jack Krupansky -O

Re: Wildcard searches

2014-02-05 Thread Jack Krupansky
Take a look at the complex phrase query parser. See: http://lucene.apache.org/core/4_6_0/queryparser/org/apache/lucene/queryparser/complexPhrase/ComplexPhraseQueryParser.html See also: https://issues.apache.org/jira/browse/LUCENE-1486 -- Jack Krupansky -Original Message- From

Re: Why PhraseQuery translate stopwords to "?"

2013-12-10 Thread Jack Krupansky
In theory, the query with holes (position increments) for stop words should work... unless you originally indexed the data without the stop word filter. Any time you change the filters, you typically need to reindex the data. -- Jack Krupansky -Original Message- From: Jean-Claude

Re: Why PhraseQuery translate stopwords to "?"

2013-12-09 Thread Jack Krupansky
The analyzer is generating holes for the stop words - the position of the subsequent term is incremented an extra time for each stop word so that their positions are maintained. -- Jack Krupansky -Original Message- From: Jean-Claude Dauphin Sent: Monday, December 09, 2013 4:15 PM To
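A rough way to picture the holes, in plain Java with no Lucene dependency. The class name, the whitespace splitting, and the zero-based positions are all assumptions made for this sketch; a real analyzer chain does more than this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordHolesSketch {
    /**
     * Returns "term@position" entries. Stop words are dropped from the output,
     * but each one still advances the position counter, leaving a hole.
     */
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields one empty token; skip it
            }
            if (!stopWords.contains(token)) {
                out.add(token + "@" + position);
            }
            position++; // incremented for kept terms and stop words alike
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [man@0, hour@3]
        System.out.println(analyze("man of the hour", Set.of("of", "the")));
    }
}
```

The two missing positions between "man" and "hour" are the holes; a phrase query built from the same analysis has to preserve that gap, which is what the "?" placeholders in the printed query stand for.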

Re: tokenizer to strip a set of characters

2013-11-21 Thread Jack Krupansky
the start or end. -- Jack Krupansky -Original Message- From: Stephane Nicoll Sent: Thursday, November 21, 2013 9:42 AM To: java-user@lucene.apache.org Subject: tokenizer to strip a set of characters Hi, I am using Lucene 3.6 and I am looking for a tokenizer that would remove certain

Re: How to perform Wildcard search when using WhitespaceAnalyzer?

2013-11-18 Thread Jack Krupansky
As I indicated in my previous message, we need actual queries and the actual indexed data where matches are failing. Note that *NALYZE will not match ANALYZER. So, it might be that you have composed queries in which some of the terms match properly and only some fail. -- Jack Krupansky
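The *NALYZE case can be checked with a small stand-alone matcher. This is only a sketch of whole-term wildcard matching using java.util.regex, not how Lucene evaluates wildcard queries internally; the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

class WildcardSketch {
    /**
     * Returns true if the whole term matches the pattern, where '*' stands for
     * any run of characters and '?' for exactly one. No lowercasing is applied,
     * on the assumption that wildcard terms bypass the analyzer.
     */
    static boolean matches(String pattern, String term) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        // Pattern.matches requires the entire term to match, not just a substring.
        return Pattern.matches(regex.toString(), term);
    }

    public static void main(String[] args) {
        System.out.println(matches("*NALYZE", "ANALYZER"));  // false: nothing covers the trailing R
        System.out.println(matches("*NALYZE*", "ANALYZER")); // true
        System.out.println(matches("*nalyze*", "ANALYZER")); // false: case differs
    }
}
```

The pattern has to account for every character of the indexed term, so a trailing * is needed to cover the final R, and the case of the pattern has to agree with the case of the term as it was indexed.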

Re: How to perform Wildcard search when using WhitespaceAnalyzer?

2013-11-17 Thread Jack Krupansky
what does the indexed data look like? The simple answer to your question is that wildcards don't behave any differently between the two analyzers, simply because analyzers are not applied at all to wildcard terms. -- Jack Krupansky -Original Message- From: raghavendra.k@barclay

Re: Twitter analyser

2013-11-05 Thread Jack Krupansky
byte[] charTypeTable, int configurationFlags, CharArraySet protWords) See: http://lucene.apache.org/core/4_5_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html -- Jack Krupansky -Original Message- From: Stéphane Nicoll

Re: DateQuery with comparison operators

2013-10-29 Thread Jack Krupansky
TO *] x >= v -> x:[v TO *] Note the use of curly braces for exclusive end points. -- Jack Krupansky -Original Message- From: Umashanker, Srividhya Sent: Tuesday, October 29, 2013 3:57 AM To: java-user@lucene.apache.org Subject: DateQuery with comparison operators Hi - I am using Lu
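For illustration, here is a small helper that builds range query strings from comparison operators, following the mapping above. The helper and its names are hypothetical, and it assumes a query parser that accepts mixed bracket styles such as {v TO *] (believed to be the case for the classic parser from 4.0 on, but worth verifying against the release in use).

```java
class RangeSyntaxSketch {
    /**
     * Maps "field op value" onto range query syntax: square brackets mark an
     * inclusive end point, curly braces an exclusive one, and * an open end.
     */
    static String toRange(String field, String op, String value) {
        switch (op) {
            case ">":
                return field + ":{" + value + " TO *]";
            case ">=":
                return field + ":[" + value + " TO *]";
            case "<":
                return field + ":[* TO " + value + "}";
            case "<=":
                return field + ":[* TO " + value + "]";
            default:
                throw new IllegalArgumentException("Unsupported operator: " + op);
        }
    }

    public static void main(String[] args) {
        System.out.println(toRange("x", ">=", "v")); // x:[v TO *]
        System.out.println(toRange("x", ">", "v"));  // x:{v TO *]
    }
}
```

For date fields the value still has to be written in whatever form the field was indexed in; the helper only arranges the brackets.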

Re: Handling special characters in Lucene 4.0

2013-10-20 Thread Jack Krupansky
. -- Jack Krupansky -Original Message- From: saisantoshi Sent: Sunday, October 20, 2013 7:43 PM To: java-user@lucene.apache.org Subject: Re: Handling special characters in Lucene 4.0 what about other characters like '&,'( quote) characters. We have a requirement that a text

Re: Handling special characters in Lucene 4.0

2013-10-20 Thread Jack Krupansky
white space tokenizer and then also uses a filter to strip out any punctuation characters that you don't want to keep (e.g., period, comma, semicolon, parentheses, etc.) The query parser itself knows nothing about what your chosen analyzer does. But the query parser does specially interpret the
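The split-then-strip idea can be sketched without any Lucene classes. Everything here is an assumption for illustration: the class name, the particular set of unwanted characters, and the choice to drop tokens that end up empty.

```java
import java.util.ArrayList;
import java.util.List;

class PunctuationStripSketch {
    // Characters to strip from each token (an example set, adjust as needed).
    static final String UNWANTED = ".,;()";

    /** Splits on whitespace, then removes unwanted punctuation from each token. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            StringBuilder kept = new StringBuilder();
            for (char c : raw.toCharArray()) {
                if (UNWANTED.indexOf(c) < 0) {
                    kept.append(c); // keeps characters such as '&' and the apostrophe
                }
            }
            if (kept.length() > 0) {
                tokens.add(kept.toString());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [AT&T, formerly, O'Neil, end]
        System.out.println(analyze("AT&T (formerly), O'Neil; end."));
    }
}
```

Because splitting happens only on whitespace, characters such as & and the apostrophe survive inside tokens unless they are added to the unwanted set. This covers only the analysis side; characters that the query parser treats as syntax are a separate matter.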
