Re: Is it possible to retrieve Terms from a Document?

2009-07-31 Thread Grant Ingersoll
See the Term Vector capability. http://www.lucidimagination.com/search/?q=term+vectors#/p:lucene By default the information is _not_ stored in the index. You will need to add Field.TermVector.YES to your indexing in order for this information to be available. -Grant On Jul 31, 2009, at
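
To make the idea concrete, here is a minimal plain-Java sketch of the kind of information a term vector holds for one field of one document: each term, how often it occurs, and at which token positions. It is only a toy model and not Lucene code; the class name TermVectorSketch and the whitespace/lowercase "analyzer" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy model of a per-document term vector (term -> token positions). Hypothetical, not Lucene API. */
final class TermVectorSketch {

    /** Splits on whitespace and lowercases; a stand-in for a real analyzer. */
    static Map<String, List<Integer>> termVector(String fieldText) {
        Map<String, List<Integer>> vector = new LinkedHashMap<>();
        String trimmed = fieldText.trim();
        if (trimmed.isEmpty()) {
            return vector;
        }
        String[] tokens = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        for (int position = 0; position < tokens.length; position++) {
            // Record every position at which the term occurs, in order.
            vector.computeIfAbsent(tokens[position], t -> new ArrayList<>()).add(position);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> tv = termVector("to be or not to be");
        // The frequency of a term in the field is the number of recorded positions.
        tv.forEach((term, positions) ->
                System.out.println(term + " freq=" + positions.size() + " positions=" + positions));
    }
}
```

Because positions are kept per term, both the set of terms in a field and their order can be reconstructed from such a structure, which is what the original question asks for.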

Tutorial on Lucene on S3

2009-07-31 Thread Allahbaksh Mohammedali Asadullah
Hi, Is there any tutorial on how to store a Lucene index in S3? How do we access the index from S3? Are there any wrappers for Amazon S3? The other question is how do I store and access an existing Lucene index on Google App Engine? Thanks in advance. Warm Regards, Allahbaksh

Re: ThreadedIndexWriter vs. IndexWriter

2009-07-31 Thread Jibo John
Hi Phil, It's 5 threads for IndexWriter. For ThreadedIndexWriter, I used: writer.num.threads=16 writer.max.thread.queue.size=80 Thanks, -Jibo On Jul 31, 2009, at 5:01 PM, Phil Whelan wrote: Hi Jibo, Your mergeFactor is different, and the resulting numFiles (segment files) is different. May

Is it possible to retrieve Terms from a Document?

2009-07-31 Thread Phil Whelan
Hi, I know you can use Field.Store.YES, but I want to inspect the terms / tokens and their order related to the field name at search time. Is this possible? Obviously this information is stored in the index, but I can not find any API to access it. I'm guessing the answer might be that Terms point

Re: ThreadedIndexWriter vs. IndexWriter

2009-07-31 Thread ohaya
Hi, I don't know the answer to your questions, but I'm guessing that the answer to #3 is probably because of the answers to #1 and #2. Did you try to look at the indexes using Luke? That shows the top 50 terms when it starts, so it might be obvious what the differences are, and that might give

Re: ThreadedIndexWriter vs. IndexWriter

2009-07-31 Thread Phil Whelan
Hi Jibo, Your mergeFactor is different, and the resulting numFiles (segment files) is different. Maybe each thread is responsible for a segment file. Just curious - do you have 3 threads? Phil - To unsubscribe, e-mail: java-user

Re: ThreadedIndexWriter vs. IndexWriter

2009-07-31 Thread Jibo John
Mike, Here you go: IndexWriter: $ java -classpath /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index NOTE: testing will be more thorough if y

Re: ThreadedIndexWriter vs. IndexWriter

2009-07-31 Thread Jibo John
Tried with a larger set of documents (2,000,000) this time. ThreadedIndexWriter --- Size - 1.4 G optimized - yes (as suggested by Phil) Number of documents - 1,999,924 (No idea where the 76 documents vanished...) Number of terms - 3,638,801 IndexWriter

Re: ThreadedIndexWriter vs. IndexWriter

2009-07-31 Thread Michael McCandless
Hmmm... can you run CheckIndex on both indexes and post the results? java org.apache.lucene.index.CheckIndex /path/to/index Mike On Fri, Jul 31, 2009 at 2:38 PM, Jibo John wrote: > Number of docs are the same in the index for both the cases (200,000). > I haven't altered the benchmark/ code, b

Re: ThreadedIndexWriter vs. IndexWriter

2009-07-31 Thread ohaya
Hi, Sorry to jump in, but I've been following this thread with interest :)... Am I misunderstanding your original observation, that ThreadedIndexWriter produced smaller index? Did the ThreadedIndexWriter also finish faster (I'm assuming that it should)? If the index is smaller, and everyt

Re: Seeking guidance for updating indexes

2009-07-31 Thread ohaya
Hi, Phil and Ian, Thanks for the responses and confirmations about this. Assuming that our requirements (as I described earlier) don't change, it looks like this updating/inserting thing should be pretty easy :)! Later, and have a great weekend! Jim Phil Whelan wrote: > Hi Jim, >

Re: Quick question about Lucene and UCS4

2009-07-31 Thread Robert Muir
Simon, no problem. I am looking at it now. I will just post my approach and let people tear it apart / get things moving :) On Fri, Jul 31, 2009 at 2:45 PM, Simon Willnauer wrote: > @Michael: add yourself as a Watcher for the issue. > @Robert: I can start working on this within the next weeks - ca

Re: ThreadedIndexWriter vs. IndexWriter

2009-07-31 Thread Phil Whelan
Hi Jibo, Have you tried optimizing indexes? I do not know anything about the implementation of ThreadedIndexWriter, but if they both optimize down to the same size, it could just mean that ThreadedIndexWriter is not as optimized. Thanks, Phil On Fri, Jul 31, 2009 at 11:38 AM, Jibo John wrote: >

Re: Quick question about Lucene and UCS4

2009-07-31 Thread Simon Willnauer
@Michael: add yourself as a Watcher for the issue. @Robert: I can start working on this within the next weeks - can you help too? simon On Fri, Jul 31, 2009 at 7:49 PM, Robert Muir wrote: > Michael, makes sense. most of the issues probably have some > workaround, so reply back if you need. > > Th

Re: ThreadedIndexWriter vs. IndexWriter

2009-07-31 Thread Jibo John
Number of docs are the same in the index for both the cases (200,000). I haven't altered the benchmark/ code, but, used a profiler to verify that Benchmark main thread is closed only after all other threads are closed. Thanks, -Jibo On Jul 31, 2009, at 2:34 AM, Michael McCandless wrote:

Re: Quick question about Lucene and UCS4

2009-07-31 Thread Robert Muir
Michael, makes sense. Most of the issues probably have some workaround, so reply back if you need. Thanks for your feedback though, it is helpful to know that it's important! On Fri, Jul 31, 2009 at 1:36 PM, Michael Thomsen wrote: > Not really. At this point, I just needed to know where the UCS4 >

Re: Quick question about Lucene and UCS4

2009-07-31 Thread Michael Thomsen
Not really. At this point, I just needed to know where the UCS4 support stands. I'm reasonably familiar with the various analyzers and what they can do. It's just the state of UCS4 support that might be an issue for us. Thanks, Mike On Fri, Jul 31, 2009 at 12:25 PM, Robert Muir wrote: > Michael

Re: Seeking guidance for updating indexes

2009-07-31 Thread Phil Whelan
Hi Jim, There should not be much difference from the Lucene end between a new index and an index you want to update (add more documents to). As stated in the Lucene docs IndexWriter will create the index "if it does not already exist". http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/in

Re: Seeking guidance for updating indexes

2009-07-31 Thread Ian Lea
You're pretty much spot on. Read the FAQ entry "Does Lucene allow searching and indexing simultaneously?" for one of your questions (the answer is yes btw). With only a single update app running there won't be any locking issues. When the updater code opens the index you'll need to ensure that i

Seeking guidance for updating indexes

2009-07-31 Thread ohaya
Hi, I still am new to Lucene, but I think I have an initial indexer app (based on the demo IndexFiles app) working, and also have a web app, based on the demo luceneweb web app working. I'm still busy tweaking both, but am starting to think ahead, about operational type issues, esp. updating

Re: Quick question about Lucene and UCS4

2009-07-31 Thread Robert Muir
Michael, just out of curiosity, did you have a particular Analyzer in mind you were planning on using, or rather certain features in Lucene you were concerned would work with these codepoints? On Fri, Jul 31, 2009 at 12:19 PM, Simon Willnauer wrote: > Hey Robert, good to see that you found the lin

Re: Quick question about Lucene and UCS4

2009-07-31 Thread Simon Willnauer
Hey Robert, good to see that you found the link :) On Fri, Jul 31, 2009 at 6:06 PM, Robert Muir wrote: > Michael, as Simon mentioned I created an issue describing where you > might run into trouble, at least in lucene core. > > The low-level lucene stuff, it treats these just fine (as surrogate pa

Re: Quick question about Lucene and UCS4

2009-07-31 Thread Robert Muir
Michael, as Simon mentioned I created an issue describing where you might run into trouble, at least in lucene core. The low-level lucene stuff, it treats these just fine (as surrogate pairs). But most analyzers run into some trouble. (things like WhitespaceAnalyzer are ok) Also wildcard queries

Re: Is there a list of "special" characters for standard analyzer?

2009-07-31 Thread Simon Willnauer
On Fri, Jul 31, 2009 at 5:00 PM, wrote: > Hi Ahmet, > > Thanks for the clarification and information!  That was exactly what I was > looking for. > > Jim > > > AHMET ARSLAN wrote: >> >> > I guess that the obvious question is "Which characters are >> > considered 'punctuation characters'?".

Re: Is there any difference between using QueryParser and MultiFieldQueryParser when have single default search field ?

2009-07-31 Thread Paul Taylor
Simon Willnauer wrote: This would not make much of a difference. I would guess that you have one additional "wrapping" boolean query if you use MultiFieldQueryParser. For query "foo AND bar" the MFQueryParser creates +(fname:foo) +(fname:bar) and QueryParser would create +fname:foo +fname:bar so

Re: Is there a list of "special" characters for standard analyzer?

2009-07-31 Thread ohaya
Hi Ahmet, Thanks for the clarification and information! That was exactly what I was looking for. Jim AHMET ARSLAN wrote: > > > I guess that the obvious question is "Which characters are > > considered 'punctuation characters'?". > > Punctuation = ("_"|"-"|"/"|"."|",") > > > In part

Re: Quick question about Lucene and UCS4

2009-07-31 Thread Michael Thomsen
Thanks for your quick response! Mike On Fri, Jul 31, 2009 at 10:25 AM, Simon Willnauer wrote: > If I understand you correctly you are asking if lucene can deal with > encodings that use more than 16 bit. Well yes and no but mainly no. > The support for unicode 4.0 was introduced in Java 1.5 and l

Re: indexing multiple email addresses in one field

2009-07-31 Thread Phil Whelan
Thanks Matt. Thanks Paul. I'm up early (PST) and ready for a major rewrite of my indexer. I think these changes are going to make a huge difference. Cheers, Phil On Fri, Jul 31, 2009 at 5:52 AM, Matthew Hall wrote: > And to address the stop word issue, you can override the stop word list that > i

Re: Quick question about Lucene and UCS4

2009-07-31 Thread Simon Willnauer
If I understand you correctly you are asking if lucene can deal with encodings that use more than 16 bit. Well yes and no but mainly no. The support for unicode 4.0 was introduced in Java 1.5 and lucene core has still back-compat requirements for java 1.4. Lucene's analyzers make use of char[] all
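
A small plain-Java sketch of why char[]-based code has trouble with characters outside the 16-bit range. It uses only java.lang calls available since Java 1.5 and no Lucene code; the class name SupplementaryCharSketch and the choice of U+10400 (DESERET CAPITAL LETTER LONG I) as the test character are assumptions made for illustration.

```java
/** Illustrates UTF-16 code units vs. Unicode code points. Hypothetical helper, not Lucene code. */
final class SupplementaryCharSketch {

    /** Number of UTF-16 code units, which is what char[] and String.length() count. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points; a surrogate pair counts once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Lowercases one char at a time; each half of a surrogate pair is left unchanged. */
    static String lowerCaseByChar(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
        return new String(buf);
    }

    /** Lowercases one code point at a time, stepping over surrogate pairs as a unit. */
    static String lowerCaseByCodePoint(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String upper = new String(Character.toChars(0x10400));
        System.out.println("code units:  " + codeUnits(upper));
        System.out.println("code points: " + codePoints(upper));
        System.out.println("char-wise lowercase changed it:       " + !lowerCaseByChar(upper).equals(upper));
        System.out.println("code-point-wise lowercase changed it: " + !lowerCaseByCodePoint(upper).equals(upper));
    }
}
```

The char-wise loop silently leaves the supplementary character untouched, which is the kind of problem a char[]-based filter can run into while storage of the surrogate pair itself works fine.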

Re: Is there any difference between using QueryParser and MultiFieldQueryParser when have single default search field ?

2009-07-31 Thread prashant ullegaddi
In MultiFieldQueryParser, you can specify the different fields of the document to be searched. E.g. if you index different fields of the document's contents, such as URL, BOLD, ITALIC, you can search over all of them. Additionally, there is a provision to boost one field over another as well.

Re: Is there any difference between using QueryParser and MultiFieldQueryParser when have single default search field ?

2009-07-31 Thread Simon Willnauer
This would not make much of a difference. I would guess that you have one additional "wrapping" boolean query if you use MultiFieldQueryParser. For query "foo AND bar" the MFQueryParser creates +(fname:foo) +(fname:bar) and QueryParser would create +fname:foo +fname:bar so in this case one level of

Quick question about Lucene and UCS4

2009-07-31 Thread Michael Thomsen
Is Lucene capable of handling UCS4 data natively? Thanks, Mike

Re: Is there any difference between using QueryParser and MultiFieldQueryParser when have single default search field ?

2009-07-31 Thread Ian Lea
I'd guess there wouldn't be any difference, but haven't tried it. Try it out and see what query.toString() says in each case. -- Ian. On Fri, Jul 31, 2009 at 1:37 PM, Paul Taylor wrote: > Is there any difference between using QueryParser and MultiFieldQueryParser > when have single default sea

Re: indexing multiple email addresses in one field

2009-07-31 Thread Matthew Hall
And to address the stop word issue, you can override the stop word list that it uses. Most analyzers that use stop words (Standard included) have an option to pass in an arbitrary list of stop words which will override the defaults. You could also just roll your own (which is what you are goin

Is there any difference between using QueryParser and MultiFieldQueryParser when have single default search field ?

2009-07-31 Thread Paul Taylor
Is there any difference between using QueryParser and MultiFieldQueryParser when there is a single default search field? Depending on how many default search fields I am searching in an index, I select which of the two QueryParsers to use, but does it matter if I just use MultiFieldQueryParser all the

Re: Boosting Search Results

2009-07-31 Thread prashant ullegaddi
It might be because there are hardly any documents containing both the words. Try exact search: "\"tall fat\"" On Fri, Jul 31, 2009 at 3:31 PM, bourne71 wrote: > > Hi, new here. > > I recently started using lucene and had encounter a problem.I crawl and > index a number of documents. > When i pe

Lucene for dynamic data retrieval

2009-07-31 Thread Findsatish
Hi All, I am new to Lucene and I am working on a search application. My application needs dynamic data retrieval from the database. That means, based on my previous step output, I need to retrieve entries from the DB for the next step. For example, if my search query contains "Name" field entry,

Re: Boosting Search Results

2009-07-31 Thread Ian Lea
Hi It's not quite that simple. Other things being equal, results that match all keywords are likely to come first but there are other factors such as term frequency and the length of the document. Searcher.explain() will give you the gory details. Luke will let you see what is in your index.

Re: Boosting Search Results

2009-07-31 Thread AHMET ARSLAN
> When i perform a search, lets say "tall fat", by right the > results that matches all the keyword should be on top and display first. The answer to your question lies at the end of this thread: http://www.nabble.com/Generating-Query-for-Multiple-Clauses-in-a-Single-Field-td24694748.html

Re: Term's frequency

2009-07-31 Thread prashant ullegaddi
Thanks Ahmet. This answers my question. On Fri, Jul 31, 2009 at 1:30 PM, AHMET ARSLAN wrote: > > > > Given a term say "apache", I want to look up the lucene index > > programmatically to find out its frequency in the corpus. > > I think you are asking collection frequency of a term. Term Frequen

Boosting Search Results

2009-07-31 Thread bourne71
Hi, new here. I recently started using Lucene and have encountered a problem. I crawl and index a number of documents. When I perform a search, let's say "tall fat", by right the results that match all the keywords should be on top and displayed first. But in my search results, some of the document

Re: ThreadedIndexWriter vs. IndexWriter

2009-07-31 Thread Michael McCandless
Hmm... this doesn't sound right. That example (ThreadedIndexWriter) is meant to be a drop-in replacement, wherever you use an IndexWriter, that keeps an under-the-hood thread pool (using java.util.concurrent.*) to add/update documents with multiple threads. It should not result in a smaller index
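
A minimal sketch of the thread-pool pattern described here, using only java.util.concurrent. It is not the actual ThreadedIndexWriter source: the class name ThreadedAdderSketch is hypothetical, a counter stands in for the real add/update call, and the bounded queue plus caller-runs rejection policy are assumptions about how such a wrapper could avoid dropping work. The two constructor arguments play the role of the writer.num.threads and writer.max.thread.queue.size settings mentioned elsewhere in the thread.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a writer that adds documents through a thread pool. */
final class ThreadedAdderSketch {
    private final ThreadPoolExecutor pool;
    private final AtomicInteger added = new AtomicInteger();

    ThreadedAdderSketch(int numThreads, int maxQueueSize) {
        // Bounded queue; when it is full the submitting thread runs the task itself,
        // so no submitted document is ever discarded.
        pool = new ThreadPoolExecutor(numThreads, numThreads, 0L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<Runnable>(maxQueueSize),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    /** Queues one document; the counter increment stands in for the real add. */
    void addDocument(String doc) {
        pool.execute(() -> added.incrementAndGet());
    }

    /** Waits for all queued work to finish, then reports how many documents were added. */
    int close() {
        pool.shutdown();
        try {
            if (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
                throw new IllegalStateException("pending adds did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for pending adds", e);
        }
        return added.get();
    }

    public static void main(String[] args) {
        ThreadedAdderSketch writer = new ThreadedAdderSketch(16, 80);
        for (int i = 0; i < 200000; i++) {
            writer.addDocument("doc" + i);
        }
        System.out.println("documents added: " + writer.close());
    }
}
```

The point of waiting in close() is that anything still queued when the underlying writer is closed would otherwise be lost, which is one thing to rule out when document counts differ between the two writers.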

Re: Term's frequency

2009-07-31 Thread AHMET ARSLAN
> Given a term say "apache", I want to look up the lucene index > programmatically to find out its frequency in the corpus. I think you are asking about the collection frequency of a term. Term frequency is defined between a document and a term, and it is printed in the loop in the following code. And at
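
The distinction between these counts can be sketched in plain Java over an already tokenized corpus. This is a toy illustration and not the code from the original message (which is cut off above); the class name TermStatsSketch and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Toy term statistics over an in-memory corpus. Hypothetical helper, not Lucene API. */
final class TermStatsSketch {

    /** Term frequency: occurrences of the term within one document. */
    static int termFrequency(List<String> docTokens, String term) {
        int count = 0;
        for (String token : docTokens) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    /** Document frequency: number of documents containing the term at least once. */
    static int documentFrequency(List<List<String>> corpus, String term) {
        int count = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    /** Collection frequency: total occurrences of the term across all documents. */
    static int collectionFrequency(List<List<String>> corpus, String term) {
        int total = 0;
        for (List<String> doc : corpus) {
            total += termFrequency(doc, term);
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("apache", "lucene", "apache"),
                Arrays.asList("lucene"),
                Arrays.asList("apache"));
        System.out.println("tf in doc 0: " + termFrequency(corpus.get(0), "apache"));
        System.out.println("df:          " + documentFrequency(corpus, "apache"));
        System.out.println("cf:          " + collectionFrequency(corpus, "apache"));
    }
}
```

Collection frequency is simply the per-document term frequencies summed over every document that contains the term.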

Re: Term's frequency

2009-07-31 Thread prashant ullegaddi
Given a term say "apache", I want to look up the lucene index programmatically to find out its frequency in the corpus. On Fri, Jul 31, 2009 at 12:23 AM, wrote: > > prashant ullegaddi wrote: > > How to get the number of times a term occurs in the Lucene index? > > > > Regards, > > Prashant

Re: Is there a list of "special" characters for standard analyzer?

2009-07-31 Thread AHMET ARSLAN
> I guess that the obvious question is "Which characters are > considered 'punctuation characters'?". Punctuation = ("_"|"-"|"/"|"."|",") > In particular, does the analyzer consider "=" (equal) and > ":" (colon) to be punctuation characters? ":" is a special character in QueryParser (if you are
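
As a quick check of the rule quoted above, here is a plain-Java sketch that tests membership in that five-character set. It only mirrors the Punctuation definition as quoted in this message and does not call the analyzer; the class name PunctuationSketch is hypothetical.

```java
/** Membership test for the Punctuation set quoted above. Hypothetical helper, not Lucene code. */
final class PunctuationSketch {
    // The five characters from the quoted rule: _ - / . ,
    private static final String PUNCTUATION = "_-/.,";

    static boolean isPunctuation(char c) {
        return PUNCTUATION.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        for (char c : new char[] {'_', '-', '/', '.', ',', '=', ':'}) {
            System.out.println("'" + c + "' punctuation: " + isPunctuation(c));
        }
    }
}
```

By this definition neither "=" nor ":" belongs to the set.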