See the Term Vector capability. http://www.lucidimagination.com/search/?q=term+vectors#/p:lucene
By default the information is _not_ stored in the index. You will
need to add Field.TermVector.YES to your indexing in order for this
information to be available.
-Grant
On Jul 31, 2009, at
Hi,
Is there any tutorial on how to store a Lucene index in S3? How do we access the
index from S3? Are there any wrappers for Amazon S3?
The other question is: how do I store and access an existing Lucene index on Google
App Engine?
Thanks in advance.
Warm Regards,
Allahbaksh
Hi Phil,
It's 5 threads for IndexWriter.
For ThreadedIndexWriter, I used:
writer.num.threads=16
writer.max.thread.queue.size=80
Thanks,
-Jibo
On Jul 31, 2009, at 5:01 PM, Phil Whelan wrote:
Hi Jibo,
Your mergeFactor is different, and the resulting numFiles (segment
files) is different. May
Hi,
I know you can use Field.Store.YES, but I want to inspect the terms /
tokens and their order related to the field name at search time. Is
this possible? Obviously this information is stored in the index, but
I can not find any API to access it. I'm guessing the answer might be
that Terms point
Hi,
I don't know the answer to your questions, but I'm guessing that the answer to
#3 is probably because of the answers to #1 and #2.
Did you try to look at the indexes using Luke? That shows the top 50 terms
when it starts, so it might be obvious what the differences are, and that might
give
Hi Jibo,
Your mergeFactor is different, and the resulting numFiles (segment
files) is different. Maybe each thread is responsible for a segment
file. Just curious - do you have 3 threads?
Phil
-
To unsubscribe, e-mail: java-user
Mike,
Here you go:
IndexWriter:
$ java -classpath /Users/jibo/Desktop/iwork/lucene/java/trunk/build/
lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex /Users/jibo/
Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index
NOTE: testing will be more thorough if y
Tried with a larger set of documents (2,000,000 ) this time.
ThreadedIndexWriter
---
Size - 1.4 G
optimized - yes (as suggested by Phil)
Number of documents - 1,999,924 (No idea where the 76 documents
vanished...)
Number of terms - 3,638,801
IndexWriter
Hmmm... can you run CheckIndex on both indexes and post the results?
java org.apache.lucene.index.CheckIndex /path/to/index
Mike
On Fri, Jul 31, 2009 at 2:38 PM, Jibo John wrote:
> Number of docs are the same in the index for both the cases (200,000).
> I haven't altered the benchmark/ code, b
Hi,
Sorry to jump in, but I've been following this thread with interest
:)...
Am I misunderstanding your original observation, that
ThreadedIndexWriter produced smaller index? Did the ThreadedIndexWriter
also finish faster (I'm assuming that it should)?
If the index is smaller, and everyt
Hi,
Phil and Ian,
Thanks for the responses and confirmations about this.
Assuming that our requirements (as I described earlier) don't change, it looks
like this updating/inserting thing should be pretty easy :)!
Later, and have a great weekend!
Jim
Phil Whelan wrote:
> Hi Jim,
>
Simon, no problem. I am looking at it now. I will just post my
approach and let people tear it apart / get things moving :)
On Fri, Jul 31, 2009 at 2:45 PM, Simon
Willnauer wrote:
> @Michael: add yourself as a Watcher for the issue.
> @Robert: I can start working on this within the next weeks - ca
Hi Jibo,
Have you tried optimizing indexes? I do not know anything about the
implementation of ThreadedIndexWriter, but if they both optimize down
to the same size, it could just mean that ThreadedIndexWriter is not
as optimized.
Thanks,
Phil
On Fri, Jul 31, 2009 at 11:38 AM, Jibo John wrote:
>
@Michael: add yourself as a Watcher for the issue.
@Robert: I can start working on this within the next weeks - can you help too?
simon
On Fri, Jul 31, 2009 at 7:49 PM, Robert Muir wrote:
> Michael, makes sense. most of the issues probably have some
> workaround, so reply back if you need.
>
> Th
The number of docs is the same in the index for both cases (200,000).
I haven't altered the benchmark/ code, but, used a profiler to verify
that Benchmark main thread is closed only after all other threads
are closed.
Thanks,
-Jibo
On Jul 31, 2009, at 2:34 AM, Michael McCandless wrote:
Michael, makes sense. most of the issues probably have some
workaround, so reply back if you need.
Thanks for your feedback though, it is helpful to know that it's important!
On Fri, Jul 31, 2009 at 1:36 PM, Michael Thomsen wrote:
> Not really. At this point, I just needed to know where the UCS4
>
Not really. At this point, I just needed to know where the UCS4
support stands. I'm reasonably familiar with the various analyzers and
what they can do. It's just the state of UCS4 support that might be an
issue for us.
Thanks,
Mike
On Fri, Jul 31, 2009 at 12:25 PM, Robert Muir wrote:
> Michael
Hi Jim,
There should not be much difference on the Lucene end between a new
index and an index you want to update (add more documents to). As stated
in the Lucene docs, IndexWriter will create the index "if it does not
already exist".
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/in
You're pretty much spot on. Read the FAQ entry "Does Lucene allow
searching and indexing simultaneously?" for one of your questions (the
answer is yes btw). With only a single update app running there won't
be any locking issues. When the updater code opens the index you'll
need to ensure that i
Hi,
I still am new to Lucene, but I think I have an initial indexer app (based on
the demo IndexFiles app) working, and also have a web app, based on the demo
luceneweb web app working.
I'm still busy tweaking both, but am starting to think ahead, about operational
type issues, esp. updating
Michael, just out of curiosity, did you have a particular Analyzer in
mind that you were planning on using, or were there certain features in
Lucene you were concerned might not work with these codepoints?
On Fri, Jul 31, 2009 at 12:19 PM, Simon
Willnauer wrote:
> Hey Robert, good to see that you found the lin
Hey Robert, good to see that you found the link :)
On Fri, Jul 31, 2009 at 6:06 PM, Robert Muir wrote:
> Michael, as Simon mentioned I created an issue describing where you
> might run into trouble, at least in lucene core.
>
> The low-level lucene stuff, it treats these just fine (as surrogate pa
Michael, as Simon mentioned I created an issue describing where you
might run into trouble, at least in lucene core.
The low-level Lucene code treats these just fine (as surrogate pairs).
But most analyzers run into some trouble (things like
WhitespaceAnalyzer are OK).
Also wildcard queries
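To make the surrogate-pair point concrete, here is a small sketch that uses only plain JDK calls (Java 5 or later) and no Lucene classes. The class name SupplementaryDemo and the sample code point U+10400 are made up for illustration. It shows why code that walks a char[] one char at a time sees two units for a single supplementary character, while code that walks by code point sees one.

```java
// Illustration only: JDK string handling, not Lucene code.
final class SupplementaryDemo {

    // Walks the string by code point instead of by char, so a surrogate
    // pair is returned as one value rather than two.
    static int[] codePoints(String text) {
        int[] out = new int[text.codePointCount(0, text.length())];
        int j = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out[j++] = cp;
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return out;
    }

    // Counts the chars that are one half of a surrogate pair.
    static int surrogateChars(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c) || Character.isLowSurrogate(c)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // U+10400 lies outside the 16-bit range (sample value, chosen arbitrarily).
        String text = "a" + new String(Character.toChars(0x10400)) + "b";
        System.out.println("chars:       " + text.length());
        System.out.println("code points: " + codePoints(text).length);
        System.out.println("surrogates:  " + surrogateChars(text));
    }
}
```

Any per-char logic (lowercasing, character-class checks) would handle the two halves separately, which is the kind of trouble described above for analyzers.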
On Fri, Jul 31, 2009 at 5:00 PM, wrote:
> Hi Ahmet,
>
> Thanks for the clarification and information! That was exactly what I was
> looking for.
>
> Jim
>
>
> AHMET ARSLAN wrote:
>>
>> > I guess that the obvious question is "Which characters are
>> > considered 'punctuation characters'?".
Simon Willnauer wrote:
This would not make much of a difference. I would guess that you have
one additional "wrapping" boolean query if you use
MultiFieldQueryParser. For query "foo AND bar" the MFQueryParser
creates +(fname:foo) +(fname:bar) and QueryParser would create
+fname:foo +fname:bar so
Hi Ahmet,
Thanks for the clarification and information! That was exactly what I was
looking for.
Jim
AHMET ARSLAN wrote:
>
> > I guess that the obvious question is "Which characters are
> > considered 'punctuation characters'?".
>
> Punctuation = ("_"|"-"|"/"|"."|",")
>
> > In part
Thanks for your quick response!
Mike
On Fri, Jul 31, 2009 at 10:25 AM, Simon
Willnauer wrote:
> If I understand you correctly you are asking if lucene can deal with
> encodings that use more than 16 bit. Well yes and no but mainly no.
> The support for unicode 4.0 was introduced in Java 1.5 and l
Thanks Matt. Thanks Paul. I'm up early (PST) and ready for a major
rewrite of my indexer. I think these changes are going to make a huge
difference.
Cheers,
Phil
On Fri, Jul 31, 2009 at 5:52 AM, Matthew Hall wrote:
> And to address the stop word issue, you can override the stop word list that
> i
If I understand you correctly, you are asking if Lucene can deal with
encodings that use more than 16 bits. Well, yes and no, but mainly no.
The support for Unicode 4.0 was introduced in Java 1.5 and Lucene core
still has back-compat requirements for Java 1.4. Lucene's analyzers
make use of char[] all
With MultiFieldQueryParser, you can specify several fields of the document
to be searched.
E.g., if you index different fields of a document's contents, such as URL,
BOLD, and ITALIC, you can search over all of them.
Additionally, there is a provision to boost one field over the others.
This would not make much of a difference. I would guess that you have
one additional "wrapping" boolean query if you use
MultiFieldQueryParser. For query "foo AND bar" the MFQueryParser
creates +(fname:foo) +(fname:bar) and QueryParser would create
+fname:foo +fname:bar so in this case one level of
Is Lucene capable of handling UCS4 data natively?
Thanks,
Mike
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
I'd guess there wouldn't be any difference, but haven't tried it. Try
it out and see what query.toString() says in each case.
--
Ian.
On Fri, Jul 31, 2009 at 1:37 PM, Paul Taylor wrote:
> Is there any difference between using QueryParser and MultiFieldQueryParser
> when have single default sea
And to address the stop word issue, you can override the stop word list
that it uses.
Most analyzers that use stop words (Standard included) have an option to
pass in an arbitrary list of stop words, which will override the defaults.
You could also just roll your own (which is what you are goin
Is there any difference between using QueryParser and
MultiFieldQueryParser when you have a single default search field?
Depending on how many default search fields I am searching in an index, I
select which of the two QueryParsers to use, but does it matter if I just
use MultiFieldQueryParser all the
It might be because there are hardly any documents containing both the
words.
Try exact search: "\"tall fat\""
On Fri, Jul 31, 2009 at 3:31 PM, bourne71 wrote:
>
> Hi, new here.
>
> I recently started using lucene and had encounter a problem.I crawl and
> index a number of documents.
> When i pe
Hi All,
I am new to Lucene and I am working on a search application.
My application needs dynamic data retrieval from the database. That means,
based on my previous step output, I need to retrieve entries from the DB for
the next step.
For example, if my search query contains "Name" field entry,
Hi
It's not quite that simple. Other things being equal, results that
match all keywords are likely to come first but there are other
factors such as term frequency and the length of the document.
Searcher.explain() will give you the gory details. Luke will let you
see what is in your index.
> When i perform a search, lets say "tall fat", by right the
> results that matches all the keyword should be on top and display first.
Answer of your question lies at the end of this thread:
http://www.nabble.com/Generating-Query-for-Multiple-Clauses-in-a-Single-Field-td24694748.html
Thanks Ahmet. This answers my question.
On Fri, Jul 31, 2009 at 1:30 PM, AHMET ARSLAN wrote:
>
>
> > Given a term say "apache", I want to look up the lucene index
> > programmatically to find out its frequency in the corpus.
>
> I think you are asking collection frequency of a term. Term Frequen
Hi, new here.
I recently started using Lucene and have encountered a problem. I crawl and
index a number of documents.
When I perform a search, let's say "tall fat", by rights the results that
match all the keywords should be on top and displayed first.
But in my search results, some of the document
Hmm... this doesn't sound right.
That example (ThreadedIndexWriter) is meant to be a drop-in
replacement, wherever you use an IndexWriter, that keeps an
under-the-hood thread pool (using java.util.concurrent.*) to
add/update documents with multiple threads.
It should not result in a smaller index
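For readers who have not seen the example, the general pattern described here (a wrapper that hands each add to a java.util.concurrent thread pool) can be sketched roughly as below. This is not the ThreadedIndexWriter source. PooledAdder and DocSink are hypothetical names, DocSink stands in for whatever actually receives the documents, and the values 16 and 80 are simply the writer.num.threads and writer.max.thread.queue.size settings quoted in this thread.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a pooled "add document" wrapper; all names are hypothetical.
final class PooledAdder {

    // Hypothetical stand-in for the real destination of the documents.
    interface DocSink {
        void add(String doc);
    }

    private final DocSink sink;
    private final ThreadPoolExecutor pool;

    PooledAdder(DocSink sink, int numThreads, int maxQueueSize) {
        this.sink = sink;
        // Fixed-size pool with a bounded queue; when the queue is full the
        // calling thread runs the task itself, so no task is rejected.
        this.pool = new ThreadPoolExecutor(
                numThreads, numThreads, 0L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<Runnable>(maxQueueSize),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    void add(String doc) {
        pool.execute(() -> sink.add(doc));
    }

    // Stops accepting work and waits for queued adds to finish.
    // Returns false if the timeout elapsed first.
    boolean close(long timeout, TimeUnit unit) throws InterruptedException {
        pool.shutdown();
        return pool.awaitTermination(timeout, unit);
    }

    // Adds `docs` dummy documents to a counting sink and returns how many arrived.
    static int addAll(int docs, int threads, int queueSize) {
        AtomicInteger added = new AtomicInteger();
        PooledAdder adder = new PooledAdder(doc -> added.incrementAndGet(), threads, queueSize);
        for (int i = 0; i < docs; i++) {
            adder.add("doc-" + i);
        }
        try {
            if (!adder.close(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("pool did not drain in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return added.get();
    }

    public static void main(String[] args) {
        System.out.println("documents added: " + addAll(2000, 16, 80));
    }
}
```

In this sketch close() waits for the pool to drain before the count is read, so every submitted document is counted. Whether the real example behaves the same way is not something this sketch can show.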
> Given a term say "apache", I want to look up the lucene index
> programmatically to find out its frequency in the corpus.
I think you are asking about the collection frequency of a term. Term frequency is
defined between a document and a term, and is printed in the loop in the
following code. And at
Given a term say "apache", I want to look up the lucene index
programmatically to find out its frequency in the corpus.
On Fri, Jul 31, 2009 at 12:23 AM, wrote:
>
> prashant ullegaddi wrote:
> > How to get the number of times a term occurs in the Lucene index?
> >
> > Regards,
> > Prashant
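For the question in this thread, the three counts that tend to get mixed up can be shown on a toy in-memory corpus. This is a sketch of the definitions only and does not use the Lucene API. The class TermStats and the sample documents are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration of term statistics; not Lucene code.
final class TermStats {

    // Term frequency: occurrences of the term within one document.
    static int termFrequency(List<String> docTokens, String term) {
        int n = 0;
        for (String token : docTokens) {
            if (token.equals(term)) {
                n++;
            }
        }
        return n;
    }

    // Document frequency: number of documents containing the term at least once.
    static int documentFrequency(List<List<String>> corpus, String term) {
        int n = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                n++;
            }
        }
        return n;
    }

    // Collection frequency: total occurrences of the term across all documents.
    static int collectionFrequency(List<List<String>> corpus, String term) {
        int n = 0;
        for (List<String> doc : corpus) {
            n += termFrequency(doc, term);
        }
        return n;
    }

    public static void main(String[] args) {
        // Made-up, already tokenised documents.
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("apache", "lucene", "apache"),
                Arrays.asList("apache", "solr"),
                Arrays.asList("tomcat"));
        System.out.println("tf in doc 0: " + termFrequency(corpus.get(0), "apache")); // 2
        System.out.println("df:          " + documentFrequency(corpus, "apache"));    // 2
        System.out.println("cf:          " + collectionFrequency(corpus, "apache"));  // 3
    }
}
```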
> I guess that the obvious question is "Which characters are
> considered 'punctuation characters'?".
Punctuation = ("_"|"-"|"/"|"."|",")
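Read literally, that definition is a set of five characters. The sketch below encodes only that quoted set so individual characters can be checked against it. It is not the analyzer's grammar or source, and the class name PunctuationSet is made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Encodes only the quoted definition: Punctuation = ("_"|"-"|"/"|"."|",")
final class PunctuationSet {

    private static final Set<Character> PUNCTUATION =
            new HashSet<Character>(Arrays.asList('_', '-', '/', '.', ','));

    // True if the character is one of the five listed above.
    static boolean isPunctuation(char c) {
        return PUNCTUATION.contains(c);
    }

    public static void main(String[] args) {
        for (char c : new char[] {'_', '-', '/', '.', ',', '=', ':'}) {
            System.out.println("'" + c + "' in set: " + isPunctuation(c));
        }
    }
}
```

By this set alone, "=" and ":" are not punctuation characters. How the analyzer treats characters outside the set is a separate matter that the set itself does not answer.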
> In particular, does the analyzer consider "=" (equal) and
> ":" (colon) to be punctuation characters?
":" is a special character in QueryParser (if you are