tokenizing text using language analyzer but preserving stopwords if possible

2011-12-06 Thread Ilya Zavorin
so that their translations have the same order in the output. Can I accomplish this using Lucene components? I presume I'd have to start by creating an analyzer for the foreign language, but then what? How do I (i) tokenize, (ii) access words in the correct order, (iii) also access non

highlighter: how can I get locations of fragments?

2011-12-13 Thread Ilya Zavorin
can I instead get pointers to these fragments in the original contents? In other words, I need to know where these fragments start and, if possible, end. Thanks, Ilya Zavorin

how to preserve whitespaces etc when tokenizing stream?

2012-01-13 Thread Ilya Zavorin
I am trying to perform a "translation" of sorts of a stream of text. More specifically, I need to tokenize the input stream, look up every term in a specialized dictionary and output the corresponding "translation" of the token. However, I also want to preserve all the original whitespaces, stop
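The token-by-token "translation" described in this message can be sketched in plain Java, without Lucene. This is only an illustration under stated assumptions: a "word" is taken to be a run of letters or digits (not what any Lucene analyzer would produce), dictionary keys are assumed to be lowercase, and the class and method names are hypothetical. Everything between words, including whitespace and punctuation, is copied through unchanged, and words missing from the dictionary are kept as they are.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
final class PreservingTranslator {
    // Assumption: a token is a maximal run of Unicode letters or digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Replaces each word found in the dictionary; all other text is copied verbatim.
    static String translate(String input, Map<String, String> dictionary) {
        StringBuilder out = new StringBuilder();
        Matcher m = WORD.matcher(input);
        int last = 0;
        while (m.find()) {
            // Copy the gap (whitespace, punctuation) before this token unchanged.
            out.append(input, last, m.start());
            String token = m.group();
            // Assumption: dictionary keys are stored in lowercase.
            String translated = dictionary.get(token.toLowerCase(Locale.ROOT));
            out.append(translated != null ? translated : token);
            last = m.end();
        }
        // Copy whatever follows the final token.
        out.append(input, last, input.length());
        return out.toString();
    }
}
```

With a dictionary mapping "hello" to "bonjour" and "world" to "monde", the input "Hello,  the   world!" comes back as "bonjour,  the   monde!", with the original spacing intact. A Lucene-based version would instead have to rely on the token offsets reported by the analyzer, which this sketch does not attempt.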

RE: how to preserve whitespaces etc when tokenizing stream?

2012-01-16 Thread Ilya Zavorin
o:torin...@gmail.com] Sent: Monday, January 16, 2012 5:50 AM To: java-user@lucene.apache.org Subject: Re: how to preserve whitespaces etc when tokenizing stream? Maybe you could simply use String.replace()? Or the text actually needs to be tokenized? On Fri, Jan 13, 2012 at 18:44, Ilya Zavorin w

can I make incremental index/search more efficient?

2012-02-21 Thread Ilya Zavorin
rch only the part of index that corresponds to doc X". Or can I? Is there any way to make this incremental index/search more efficient? For instance, is it at all possible to restrict where in the index a search for hits is performed? Or any other optimization? Thanks much Ilya Zavorin

Can I detect incorrect language selection after creating an index?

2012-02-27 Thread Ilya Zavorin
languages using different scripts, e.g. Latin vs Cyrillic vs Arabic vs Chinese etc. Thanks much Ilya Zavorin

is there an efficient way of finding locations of highlighted fragments in original text?

2012-03-19 Thread Ilya Zavorin
a way to do it faster using Lucene's core or Highlighter machinery? Thanks Ilya Zavorin

can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Ilya Zavorin
I am writing a Lucene based indexing-search app and testing it using some simple docs and queries. I have 3 simple docs that are shown at the bottom of this email between pairs of "==="s and about a dozen terms. One of them is "electricity". As you can see, it appears in al

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Ilya Zavorin
can tell you what's in your index: <http://code.google.com/p/luke/> Steve -Original Message- From: Ilya Zavorin [mailto:izavo...@caci.com] Sent: Monday, March 26, 2012 10:11 AM To: java-user@lucene.apache.org Subject: can't find common words -- using Lucene 3.4.0 I am w

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Ilya Zavorin
original text. Are you sure that these files were analyzed with StandardAnalyzer, and not some other language-specific analyzer, as a result of language misidentification? Steve -Original Message- From: Ilya Zavorin [mailto:izavo...@caci.com] Sent: Monday, March 26, 2012 11:21 AM To: j

RE: can't find common words -- using Lucene 3.4.0

2012-03-28 Thread Ilya Zavorin
)); IndexWriter writer = new IndexWriter(dir, iwc); Anything suspicious here? Thanks Ilya Zavorin -Original Message- From: Steven A Rowe [mailto:sar...@syr.edu] Sent: Monday, March 26, 2012 1:48 PM To: java-user@lucene.apache.org Subject: RE: can't find common

need to find locations of query hits in doc: works fine for regular text but not for phone numbers

2012-06-13 Thread Ilya Zavorin
Hello All, I am using 3.4. I need to find locations of query hits in a document. What I've implemented works fine for textual queries but does not work for phone numbers. Here's how I index my docs: String oc = "Joe dialed 800-555-1212 but got a busy signal"; doc.add(new Field("contents",

RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

2012-06-14 Thread Ilya Zavorin
numbers Try putting the phone number in quotes in the query: String qstr = "\"800-555-1212\""; And check query.toString to see how the query parser analyzed the term, both with and without quotes. And make sure you initialized the query parser with "contents" as the default

RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

2012-06-14 Thread Ilya Zavorin
ler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Ilya Zavorin [mailto:izavo...@caci.com] > Sent: Thursday, June 14, 2012 6:49 PM > To: java-user@lucene.apache.org > Subject: RE: need to find locations of quer

RE: need to find locations of query hits in doc: works fine for regular text but not for phone numbers

2012-06-14 Thread Ilya Zavorin
numbers Look at this code: QueryTermExtractor.getTerms(Query query) http://lucene.apache.org/core/3_6_0/api/contrib-highlighter/org/apache/lucene/search/highlight/QueryTermExtractor.html -- Jack Krupansky -Original Message- From: Ilya Zavorin Sent: Thursday, June 14, 2012 2:36 PM To: java-user

can't find queries when they are one per line in target file

2012-07-13 Thread Ilya Zavorin
Hi, I am using 3.4.0 and just discovered a weird issue. I have a set of simple English one-word queries and two target files that I want to search. One has all these queries in one line, i.e. something like this Query1 Query2 Query3 Query4 The other has them one per line, i.e. Query1 Query2 Q

RE: can't find queries when they are one per line in target file

2012-07-13 Thread Ilya Zavorin
But why then does it find all the queries in the 1st file? I use exactly the same code. IZ -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Friday, July 13, 2012 12:32 PM To: java-user@lucene.apache.org Subject: RE: can't find queries when they are one per line

RE: can't find queries when they are one per line in target file

2012-07-13 Thread Ilya Zavorin
ou are doing we cannot answer your request. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -----Original Message- > From: Ilya Zavorin [mailto:izavo...@caci.com] > Sent: Friday, July 13, 2012 6:39 PM > To: java-user@l

RE: can't find queries when they are one per line in target file

2012-07-13 Thread Ilya Zavorin
Ian, Turns out you were very close to the truth. The problem was in how I was ingesting the original file into memory before indexing. Thanks, Mr. Ilya Zavorin Applied Research and Consulting CACI Advanced Knowledge Solutions Division 4831 Walden Lane, Lanham, MD 20706 ph: 1-301-306-2859 fx

Lucene.NET based text triage

2012-08-21 Thread Ilya Zavorin
nce rather than tokenizing and looping over tokens? Thanks much, Ilya Zavorin

Efficient string lookup using Lucene

2012-08-24 Thread Ilya Zavorin
t. Essentially, what I am trying to do is implement substring matching more efficiently than using Java's standard substring matching methods. Thanks! Ilya Zavorin
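For reference, the baseline that this message wants to improve on, substring matching with Java's standard methods, might look like the sketch below. It is not a Lucene technique and not necessarily the code the poster used; the class and method names are hypothetical. It scans the text with String.indexOf and collects every start offset of the search string, including overlapping occurrences.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical baseline helper, not part of Lucene.
final class HotlistScanner {
    // Returns every start offset at which needle occurs in text (overlaps included).
    static List<Integer> findAll(String text, String needle) {
        List<Integer> offsets = new ArrayList<>();
        if (needle.isEmpty()) {
            return offsets; // an empty needle would match everywhere; treat it as no match
        }
        int from = 0;
        while (true) {
            int idx = text.indexOf(needle, from);
            if (idx < 0) {
                break;
            }
            offsets.add(idx);
            from = idx + 1; // advance by one so overlapping matches are found
        }
        return offsets;
    }
}
```

Calling HotlistScanner.findAll("running runner run", "run") returns the offsets 0, 8 and 15. Each call rescans the whole text, so the cost grows with the number of search strings, which is the inefficiency the question is about.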

RE: Efficient string lookup using Lucene

2012-08-25 Thread Ilya Zavorin
Does it mean that the resulting index will be very large? Thanks, Ilya -Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com] Sent: Friday, August 24, 2012 4:59 PM To: java-user@lucene.apache.org Subject: Re: Efficient string lookup using Lucene > search for a string "run", I

RE: Efficient string lookup using Lucene

2012-08-25 Thread Ilya Zavorin
Does Lucene support this type of structure, or do I need to somehow implement it outside Lucene? By the way, I need this to run on an Android phone so size of memory might be an issue... Thanks, Ilya Zavorin -Original Message- From: Dawid Weiss [mailto:dawid.we...@gmail.com] Sent

RE: Efficient string lookup using Lucene

2012-08-26 Thread Ilya Zavorin
The user uploads a set of text files, either all of them at once or one at a time, and then they will be searched locally on the phone against a set of "hotlist" words. This assumes no connection to any sort of server so everything must be done locally. I already have Lucene integrated so I mig

how to fully preprocess query before fuzzy search?

2012-09-17 Thread Ilya Zavorin
ust like the tilde is removed above. What is the complete set of such characters? Do I need to do any other preprocessing? Thanks, Ilya Zavorin

RE: how to fully preprocess query before fuzzy search?

2012-09-17 Thread Ilya Zavorin
dd the fuzzy query. Note: In 4.0 the fuzzy query is limited to an edit distance of 2. -- Jack Krupansky -Original Message- From: Ilya Zavorin Sent: Monday, September 17, 2012 10:41 AM To: java-user@lucene.apache.org Subject: how to fully preprocess query before fuzzy search? I am proces

Lucene on Android: indexing, searching and highlighting

2011-11-23 Thread Ilya Zavorin
e indexing/searching/highlighting steps? Can I use the lucene and highlighting jars (lucene-core-3.4.0.jar and lucene-highlighter-3.4.0.jar) "out of the box"? Also, is there any sample code that would show how Lucene components should be invoked on Android? Thank you, Ilya Zavorin

Design qs: search for multiple terms in document collection

2011-12-01 Thread Ilya Zavorin
text that was quite far from the original query. For instance, I was looking for a 3-word term and it highlighted a sequence of only 2 of these 3 words. How can I control how close highlighted fragments should be to the original query? Thanks much, Ilya Zavorin