Hello Eric,
I agree, the number of unique terms might be smaller, but [ 4 * reader.maxDoc() *
number of different fields ] will increase the memory consumption. I have 100
million records spread across 10 DBs. 4 * 100M is itself 400 MB. If I try to use
2 fields for sorting then it would be 800 MB. The u
off the top of my head, if you have in hand all the doc IDs that were
returned so far, you can do this:
1) Build a Filter which will return any doc ID that is not in that list. For
example, pass it the list of doc IDs and every time next() or skipTo is
called, it will skip over the given doc IDs.
2
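(A rough sketch of step 1, in case it helps; this assumes the 2.4-era Filter
API where bits() is still supported, and ExcludeDocsFilter plus the excluded
set are made-up names:)

import java.io.IOException;
import java.util.BitSet;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;

// Lets through every doc ID except the ones collected so far.
public class ExcludeDocsFilter extends Filter {
  private final Set<Integer> excluded;

  public ExcludeDocsFilter(Set<Integer> excluded) {
    this.excluded = excluded;
  }

  // Deprecated in 2.4 but still honoured; every set bit is a candidate doc.
  public BitSet bits(IndexReader reader) throws IOException {
    BitSet result = new BitSet(reader.maxDoc());
    result.set(0, reader.maxDoc());        // allow everything ...
    for (Integer docId : excluded) {
      result.clear(docId.intValue());      // ... except docs already returned
    }
    return result;
  }
}

Passing it as searcher.search(query, new ExcludeDocsFilter(alreadySeen), n)
should then keep the earlier hits out of the new result set.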
I don't use the Lucene stemming Analyzers. My version, if asked to keep the
original tokens, sets the position of both stem and original to be the same,
and adds another character to the stem version.
During query, that Analyzer is usually instructed to not keep the original
tokens, just the stems
Hi,
I'm relatively new to Lucene. I have the following case: I have
indexed a bunch of documents. I then query the index using
IndexSearcher and retrieve the documents using Hits (I do know this is
deprecated -- I'm using v 2.4.1). So, I do this for a set of queries
and maintain which documents a
Hey everybody,
It looks like we might actually see Lucene 2.9.0 get released "soon" ...
there are fewer than 20 open issues remaining, and several of those are
just waiting on one blocker.
Now's the time when everybody who has been asking "when is 2.9 going to be
released?!?!?!?!?!" has an o
On Wed, Jul 22, 2009 at 12:28 PM, Matthew Hall wrote:
> Not sure if this helps you, but some of the issues you are facing seem
> similar to those in the "real time" search threads.
Hi Matthew,
Do you have a pointer of where to go to see the "real time" threads?
Thanks,
Phil
But as far as I know, it doesn't index the original term too (at the same
offset), which you have to do if you
want to distinguish between the two cases, I think.
But I confess I've been out of the guts of Lucene for some
time, so I could be way off.
But you'd sure want to use a different toke
I was assuming you were storing things as strings, in which case
it works something like this:
Let's say you broke it up into
YYYY
MM
DD
HH
MM
The number of unique terms that need to be kept in
memory to sort is just (let's say your documents
span 100 years)
100 + 12 + 31 + 24 + 60.
But that's a
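For what it's worth, the multi-field sort for that layout looks roughly like
this (yyyy/mm/dd/hh/min are just placeholder field names for however you
index the pieces):

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopFieldDocs;

public class DateSortExample {
  // Sort on the split-up date fields in order of significance.
  public static TopFieldDocs searchByDate(IndexSearcher searcher, Query query)
      throws IOException {
    Sort byDate = new Sort(new SortField[] {
        new SortField("yyyy"), new SortField("mm"), new SortField("dd"),
        new SortField("hh"), new SortField("min")
    });
    return searcher.search(query, null, 50, byDate);
  }
}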
Actually my stemming Analyzer adds a similar character to stems, to
distinguish between original tokens (like orig=test) and stems (testing -->
test$).
On Wed, Jul 22, 2009 at 11:02 PM, Erick Erickson wrote:
> A closely related approach to what Shai outlined is to index the
> *original* token
> wit
Can you re-index the documents? Because it's much simpler to just count the
number of volunteers *as you add fields to the
doc to index it* and then just add the count field after you're
done parsing the document. Your corpus is small, so this
shouldn't take very long.
Or I completely misunders
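Something along these lines, roughly (vol and numvols are just example field
names, and the list of names comes from your own parsing):

import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class VolunteerDocBuilder {
  // Add one vol field per volunteer, then store the count alongside them.
  public static Document build(List<String> volunteerNames) {
    Document doc = new Document();
    for (String name : volunteerNames) {
      doc.add(new Field("vol", name, Field.Store.NO, Field.Index.NOT_ANALYZED));
    }
    doc.add(new Field("numvols", Integer.toString(volunteerNames.size()),
                      Field.Store.NO, Field.Index.NOT_ANALYZED));
    return doc;
  }
}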
A closely related approach to what Shai outlined is to index the
*original* token
with a special ender (say $) with a 0 increment (see SynonymAnalyzer
in LIA). Then, whenever you determined you wanted to use the un-stemmed
version, just add your token to the terms (i.e. testing$ when you didn't
want
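A rough sketch of that, using the old 2.4-style Token API. KeepOriginalFilter
and stem() are made-up names; plug in whatever stemmer you actually use:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Emits the stem, plus the original token marked with '$' at the same position.
public class KeepOriginalFilter extends TokenFilter {
  private Token pending;   // original token waiting to be emitted

  public KeepOriginalFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    if (pending != null) {
      Token original = pending;
      pending = null;
      return original;
    }
    Token t = input.next();
    if (t == null) {
      return null;
    }
    String word = t.termText();
    String stem = stem(word);
    if (!stem.equals(word)) {
      // queue the original with the special ender and a 0 position increment
      pending = new Token(word + "$", t.startOffset(), t.endOffset());
      pending.setPositionIncrement(0);
    }
    return new Token(stem, t.startOffset(), t.endOffset());
  }

  private String stem(String word) {
    return word;   // placeholder: call Porter, KStem, etc. here
  }
}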
: I am firing a query having terms - each associated with a boost factor. Some
: of the terms are having negative boost also (for negative boost I am using
: values between 0 and 1).
except that a value between 0 and 1 isn't really a negative boost --
there's no such thing as a negative boost. wh
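for example (hypothetical field and term), this still adds to the score, just
less than the default boost of 1.0 would:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class BoostExample {
  // A boost between 0 and 1 only shrinks a term's positive contribution.
  public static Query downWeighted() {
    Query weak = new TermQuery(new Term("body", "foo"));
    weak.setBoost(0.1f);   // small positive weight, never a penalty
    return weak;
  }
}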
Not sure if this helps you, but some of the issues you are facing seem
similar to those in the "real time" search threads.
Basically their problem involves indexing twitter and the blogosphere,
and making lucene work for super large data sets like that.
Perhaps some of the discussion in those
> Out of curiosity, what is the size of your corpus? How much and how
> quickly do you expect it to grow?
in terms of lucene documents, we tend to have in the 10M-100M range.
Currently we use merging to make larger indices from smaller ones, so
a single index can have a lot of documents in it, bu
Out of curiosity, what is the size of your corpus? How much and how
quickly do you expect it to grow?
I'm just trying to make sure that we are all on the same page here ^^
I can see the benefits of doing what you are describing with a very
large corpus that is expected to grow at a quick rate,
> If you did this, wouldn't you be binding the processing of the results
> of all queries to that of the slowest performing one within the collection?
I would imagine it would, but I haven't seen too much variance between
lucene query speeds in our data.
> I'm guessing you are trying for some sor
Queries cannot be ordered "sequentially". Let's say that you run 3 Queries,
w/ one term each "a", "b" and "c". On disk, the posting lists of the terms
can look like this: post1(a), post1(c), post2(a), post1(b), post2(c),
post2(b) etc. They are not guaranteed to be consecutive. The code makes sure
t
Hi Ganesh,
I'm not sure whether this will work for you, but one way I got around
this was with multiple searches. I only needed the first 50 results,
but wanted to sort by date, hour, min, sec. This could result in 5
results or millions of results.
I added the date to the query, so I'd search for r
> It's not accurate to say that Lucene scans the index for each search.
> Rather, every Query reads a set of posting lists, each typically read
> from disk. If you pass Query[] which have nothing to do in common (for
> example no terms in common), then you won't gain anything, b/c each Query
>
If you did this, wouldn't you be binding the processing of the results
of all queries to that of the slowest performing one within the collection?
I'm guessing you are trying for some sort of performance benefit by
batch processing, but I question whether or not you will actually get
more perf
It's not accurate to say that Lucene scans the index for each search.
Rather, every Query reads a set of posting lists, each typically read
from disk. If you pass Query[] which have nothing to do in common (for
example no terms in common), then you won't gain anything, b/c each Query
will alrea
If I understand lucene correctly, when doing multiple simultaneous
searches on the same IndexSearcher, they will basically all do their
own index scans and collect results independently. If that's correct,
is there a way to batch searches together, so only one index scan is
done? What I'd like is
You may also be interested in Andrzej Bialecki's patch to Solr that provides
distributed indexing using Hadoop:
https://issues.apache.org/jira/browse/SOLR-1301
Steve
> -Original Message-
> From: Phil Whelan [mailto:phil...@gmail.com]
> Sent: Wednesday, July 22, 2009 12:46 PM
> To: ja
On Wed, Jul 22, 2009 at 5:46 AM, m.harig wrote:
> Is there any article or forum for using Hadoop with Lucene? Please can anyone
> help me
Hi M,
Katta is a project that is combining Lucene and Hadoop. Check it out here...
http://katta.sourceforge.net/
Thanks,
Phil
If there are only a few thousand documents, and the number of results is
quite small, is this a case where post-search filtering can be done?
I have not done anything like this myself with Lucene, so is this a
bad idea? If not, what would be the best way to do this?
org.apache.lucene.search.Filte
If the number of volunteers is small enough, you could exclude all others in
your query, e.g.:
All volunteers: a, b, c, d, e, f
Query to include documents containing only volunteers a and b:
+vol:a +vol:b -vol:c -vol:d -vol:e -vol:f
Steve
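Or programmatically, roughly (same field and values as the example above):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class OnlyTheseVolunteers {
  // Require a and b, and exclude every other known volunteer.
  public static BooleanQuery build() {
    BooleanQuery q = new BooleanQuery();
    q.add(new TermQuery(new Term("vol", "a")), BooleanClause.Occur.MUST);
    q.add(new TermQuery(new Term("vol", "b")), BooleanClause.Occur.MUST);
    for (String other : new String[] {"c", "d", "e", "f"}) {
      q.add(new TermQuery(new Term("vol", other)), BooleanClause.Occur.MUST_NOT);
    }
    return q;
  }
}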
On 7/22/2009 at 6:49 AM, ba3 wrote:
> Yes, the doc
http://arstechnica.com/hardware/news/2009/07/intels-new-34nm-ssds-cut-prices-by-60-percent-boost-speed.ars
For me the price on the 80GB is now within reason for a $1300
SuperMicro quad-core 12GB RAM type of server.
Shai,
Thanks for the tip. I'll start with it.
Alex
Shai Erera wrote:
Hi Alex,
You can start with this article:
http://www.manning.com/free/green_HotBackupsLucene.html (you'll need to
register w/ your email). It describes how one can write Hot Backups w/
Lucene, and capture just the "delta" si
This might be irrelevant, but have you considered using ZFS? This file
system is designed to do what you need. Assuming you can trigger
events at the time after you have updated the index, you would have to
trigger a new ZFS snapshot and place it elsewhere.
This might have some side effects though (
Hi Jamie,
I would appreciate it if you could provide details on the hardware/OS you are
running this system on and what kind of search response time you are getting,
as well as how you add email data to your index.
Thanks,
Dan
-Original Message-
From: Jamie [mailto:ja...@stimulussoft.com
Then I think what you're looking for is getting the ScoreDoc[] stored in the
TopDocs object. Then modify each doc's score (a ScoreDoc has a doc ID and
score [float]) and then sort the array. You can use Arrays.sort(ScoreDoc[],
Comparator) by passing a Comparator which will compare the docs by their
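Roughly something like this, where myBoost() stands in for whatever custom
weighting you have in mind:

import java.util.Arrays;
import java.util.Comparator;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class Reranker {
  // Adjust each hit's score with a custom function, then re-sort descending.
  public static ScoreDoc[] rerank(TopDocs topDocs) {
    ScoreDoc[] docs = topDocs.scoreDocs;
    for (int i = 0; i < docs.length; i++) {
      docs[i].score *= myBoost(docs[i].doc);
    }
    Arrays.sort(docs, new Comparator<ScoreDoc>() {
      public int compare(ScoreDoc a, ScoreDoc b) {
        return Float.compare(b.score, a.score);
      }
    });
    return docs;
  }

  private static float myBoost(int docId) {
    return 1.0f;   // placeholder for the custom ranking function
  }
}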
Hey, after the query returned the top N docs,
then rearrange them with my algorithm
--- On Wed, 7/22/09, Shai Erera wrote:
From: Shai Erera
Subject: Re: reranking Lucene TopDocs
To: java-user@lucene.apache.org
Date: Wednesday, July 22, 2009, 6:57 AM
You mean after the query has returned the top N d
You mean after the query has returned the top N docs? why?
If it's before, then given your use case, there are a number of approaches
you can use during indexing and/or search time, so that your custom ranking
function would be applied to documents.
Shai
On Wed, Jul 22, 2009 at 4:53 PM, henok sa
I'd like to write code that reassigns weights to documents so that they can be
reranked
--- On Wed, 7/22/09, Shai Erera wrote:
From: Shai Erera
Subject: Re: reranking Lucene TopDocs
To: java-user@lucene.apache.org
Date: Wednesday, July 22, 2009, 6:44 AM
Can you be more specific? What do you me
Hi Alex,
You can start with this article:
http://www.manning.com/free/green_HotBackupsLucene.html (you'll need to
register w/ your email). It describes how one can write Hot Backups w/
Lucene, and capture just the "delta" since the last backup.
I'm about to try it myself, so if you get to do it b
Can you be more specific? What do you mean by re-rank? Reverse the sort?
Give different weights?
Shai
On Wed, Jul 22, 2009 at 4:35 PM, henok sahilu wrote:
> hello there
> I'd like to re-rank the Lucene TopDocs result set.
> Where shall I start?
> thanks
Hello there,
I'd like to re-rank the Lucene TopDocs result set.
Where shall I start?
Thanks
Hi All,
We have a system with a Lucene index of 100 GB that is growing fast. I
wonder whether there is an efficient way to back it up taking into account
only the changes between the old and new versions of the index, since after
the optimization process the names of the main index files change.
Regards,
Ale
Hi Robert,
What you could do is use the Stemmer (as a TokenFilter I assume) and produce
two tokens always - the stem and the original. Index both of them in the
same position.
Then tell your users that if they search for [testing], it will find results
for 'testing', 'test' etc (the stems) and if
Hello,
I would like to use a stemming analyser similar to KStem or PorterStem to
provide access to a wider search scope for our users. However, at the same
time I also want to provide the ability for the users to throw out the stems
if they want to search more accurately. I have a number of ideas
Hi there,
We have lucene searching across several terabytes of email data and
there is no problem at all.
Regards,
Jamie
Shai Erera wrote:
There shouldn't be a problem searching such an index. It depends on the machine
you use. If it's a strong enough machine, I don't think you should have an
Is there any article or forum for using Hadoop with Lucene? Please can anyone
help me
However you do it, it seems to me that you're going to need loads of
memory if you want lucene to do this type of sorting on indexes with
100 million docs. Can't you just buy or allocate some more memory?
One alternative would be to do the sorting yourself once you've got a
list of hits. Trade so
The Travel Assistance Committee is taking in applications for those wanting
to attend ApacheCon US 2009 (Oakland) which takes place between the 2nd and
6th November 2009.
The Travel Assistance Committee is looking for people who would like to be
able to attend ApacheCon US 2009 who may nee
Yes, the documents were already indexed and the documents do not get updated.
Maintaining an alternate index is a nice solution. Will try it out.
Thanks for the pointer.
If there is a solution which can use the same index it would be great!
--Rgds
Ba3
Perhaps I misunderstood something, but ho
Is it that the boost of a Document is stored in 6 bits?
On Wed, Jul 22, 2009 at 8:26 AM, Grant Ingersoll wrote:
> I'd probably look at the function package in Lucene. While the document
> boost can be used, it may not give you the granularity you need, as you only
> have something like 6 bits of rep
Perhaps I misunderstood something, but how do you update a document?
I mean, if a document contains vol:a, vol:b and vol:c and then you want to
add vol:d to it, don't you remove the document and add it back?
If that's what you do, then you can also update the numvols field, right?
Or .. you mean
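In code that is just IndexWriter.updateDocument, roughly (the "id" field here
is an assumed unique key):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class DocUpdater {
  // Delete the old doc by its key and add the rebuilt one, including the
  // refreshed numvols field, in a single call.
  public static void update(IndexWriter writer, String id, Document rebuiltDoc)
      throws IOException {
    writer.updateDocument(new Term("id", id), rebuiltDoc);
  }
}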
There shouldn't be a problem searching such an index. It depends on the machine
you use. If it's a strong enough machine, I don't think you should have any
problems.
But like I said, you can always try it out on your machine before you make a
decision.
Also, Lucene has a Benchmark package which incl
Hello Eric,
Thanks for your reply.
Memory required for sorting: 4 * reader.maxDoc().
I am sorting datetime with minute resolution. If 100 records represent a
minute, then in a 1 million record database there will be around 2 unique
terms. The amount of memory consumed would be 4 * 100
>> Maybe add to each doc a field numVolunteers and then constrain the query to
>> vol:krish and vol:raj and numvol:2 (something like that)?
Thanks for the reply. But the number of documents runs into a few thousand;
hence editing them is not an option.
Are there any other ways to solve this scen
Yes, you can use Hadoop with Lucene. Borrow some code from Nutch. Look at
org.apache.nutch.indexer.IndexerMapReduce and
org.apache.nutch.indexer.Indexer.
Prashant.
On Wed, Jul 22, 2009 at 2:00 PM, m.harig wrote:
>
> Thanks Shai
>
> So there won't be problem when searching that kind of
Thanks Shai
So there won't be a problem when searching that kind of large index,
am I right?
Can anyone tell me, is it possible to use Hadoop with Lucene?
--
View this message in context:
http://www.nabble.com/indexing-100GB-of-data-tp24600563p24602064.html
Sent from the
From my experience, you shouldn't have any problems indexing that amount of
content even into one index. I've successfully indexed 450 GB of data w/
Lucene, and I believe it can scale much higher if rich text documents are
indexed. Though I haven't tried yet, I believe it can scale into the 1-5 TB
Maybe add to each doc a field numVolunteers and then constrain the query to
vol:krish and vol:raj and numvol:2 (something like that)?
On Wed, Jul 22, 2009 at 9:49 AM, ba3 wrote:
>
> Hi,
>
> In the documents which contain the volunteer information :
>
> Doc1 :
> volunteer krish
> volunteer john