Re: Alternative way to simulate sorting without doing actual sort

2009-07-22 Thread Ganesh
Hello Eric, I agree, the number of unique terms might be less, but [ 4 * reader.maxdoc() * different fields ] will increase the memory consumption. I am having 100 million records spread across 10 DB. 4 * 100M is itself 400 MB. If I try to use 2 fields for sorting then it would be 800 MB. The u

Re: Doc IDs via IndexReader?

2009-07-22 Thread Shai Erera
off the top of my head, if you have in hand all the doc IDs that were returned so far, you can do this: 1) Build a Filter which will return any doc ID that is not in that list. For example, pass it the list of doc IDs and every time next() or skipTo is called, it will skip over the given doc IDs. 2
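
A minimal sketch of the Filter approach described above, against the Lucene 2.4-era API (the class name and the way the seen doc IDs are collected are illustrative):

    import java.io.IOException;
    import java.util.Set;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.util.OpenBitSet;

    // Lets through every doc ID *except* those already returned by earlier searches.
    public class UnseenDocsFilter extends Filter {
        private final Set<Integer> seenDocIds;

        public UnseenDocsFilter(Set<Integer> seenDocIds) {
            this.seenDocIds = seenDocIds;
        }

        public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
            OpenBitSet bits = new OpenBitSet(reader.maxDoc());
            for (int docId = 0; docId < reader.maxDoc(); docId++) {
                if (!seenDocIds.contains(docId)) {
                    bits.fastSet(docId);  // allow only docs we haven't seen yet
                }
            }
            return bits;  // OpenBitSet is itself a DocIdSet
        }
    }

Passing an instance of this to IndexSearcher.search(Query, Filter, int) would keep already-seen documents out of subsequent searches.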

Re: Lucene - Search breadth approach

2009-07-22 Thread Shai Erera
I don't use the Lucene stemming Analyzers. My version, if asked to keep the original tokens, sets the position of both stem and original to be the same, and adds another character to the stem version. During query, that Analyzer is usually instructed to not keep the original tokens, just the stems

Doc IDs via IndexReader?

2009-07-22 Thread Anuj Bhatt
Hi, I'm relatively new to Lucene. I have the following case: I have indexed a bunch of documents. I then query the index using IndexSearcher and retrieve the documents using Hits (I do know this is deprecated -- I'm using v 2.4.1). So, I do this for a set of queries and maintain which documents a

Documentation improvements leading up to 2.9

2009-07-22 Thread Chris Hostetter
Hey everybody, It looks like we might actually see Lucene 2.9.0 get released "soon" ... there are less than 20 open issues remaining, and several of those are just waiting on one blocker. Now's the time when everybody who has been asking "when is 2.9 going to be released?!?!?!?!?!" has an o

Re: Batch searching

2009-07-22 Thread Phil Whelan
On Wed, Jul 22, 2009 at 12:28 PM, Matthew Hall wrote: > Not sure if this helps you, but some of the issue you are facing seem > similar to those in the "real time" search threads. Hi Matthew, Do you have a pointer of where to go to see the "real time" threads? Thanks, Phil -

Re: Lucene - Search breadth approach

2009-07-22 Thread Erick Erickson
But as far as I know, it doesn't index the original term too (at the same offset), which you have to do if you want to distinguish between the two cases, I think. But I confess I've been out of the guts of Lucene for some time, so I could be way off. But you'd sure want to use a different toke

Re: Alternative way to simulate sorting without doing actual sort

2009-07-22 Thread Erick Erickson
I was assuming you were storing things as strings, in which case it works something like this: Let's say you broke it up into YYYY MM DD HH MM. The number of unique terms that need to be kept in memory to sort is just (let's say your documents span 100 years) 100 + 12 + 31 + 24 + 60. But that's a
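
A sketch of the decomposition Erick describes, with illustrative field names (values zero-padded so lexicographic order matches chronological order):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;

    // At index time: one small field per date component.
    Document doc = new Document();
    doc.add(new Field("year",   "2009", Field.Store.NO, Field.Index.NOT_ANALYZED));
    doc.add(new Field("month",  "07",   Field.Store.NO, Field.Index.NOT_ANALYZED));
    doc.add(new Field("day",    "22",   Field.Store.NO, Field.Index.NOT_ANALYZED));
    doc.add(new Field("hour",   "13",   Field.Store.NO, Field.Index.NOT_ANALYZED));
    doc.add(new Field("minute", "05",   Field.Store.NO, Field.Index.NOT_ANALYZED));

    // At search time: sort on the components in order. The unique-term cost is
    // roughly 100 + 12 + 31 + 24 + 60 strings, instead of one string per
    // distinct timestamp.
    Sort sort = new Sort(new SortField[] {
        new SortField("year"), new SortField("month"), new SortField("day"),
        new SortField("hour"), new SortField("minute")
    });

Note this only shrinks the per-term String cost; as Ganesh points out above, the FieldCache still needs roughly 4 bytes * maxDoc() per sorted field.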

Re: Lucene - Search breadth approach

2009-07-22 Thread Shai Erera
Actually my stemming Analyzer adds a similar character to stems, to distinguish between original tokens (like orig=test) and stems (testing --> test$). On Wed, Jul 22, 2009 at 11:02 PM, Erick Erickson wrote: > A closely related approach to what Shai outlined is to index the > *original* token > wit

Re: Exclusion search

2009-07-22 Thread Erick Erickson
Can you re-index the documents? Because it's much simpler to just count the number of volunteers *as you add fields to the doc to index it* and then just add the count field after you're done parsing the document. Your corpus is small, so this shouldn't take very long. Or I completely misunders

Re: Lucene - Search breadth approach

2009-07-22 Thread Erick Erickson
A closely related approach to what Shai outlined is to index the *original* token with a special ender (say $) with a 0 increment (see SynonymAnalyzer in LIA). Then, whenever you determined you wanted to use the un-stemmed version, just add your token to the terms (i.e. testing$ when you didn't want
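
A sketch of such a TokenFilter, using the simple pre-2.9 Token API, with a hypothetical Stemmer interface and following Erick's convention of marking the *original* token with '$':

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class KeepOriginalStemFilter extends TokenFilter {
        public interface Stemmer { String stem(String term); }  // illustrative

        private final Stemmer stemmer;
        private Token pending;  // marked original, waiting to be emitted

        public KeepOriginalStemFilter(TokenStream in, Stemmer stemmer) {
            super(in);
            this.stemmer = stemmer;
        }

        public Token next() throws IOException {
            if (pending != null) {  // emit the buffered original token
                Token t = pending;
                pending = null;
                return t;
            }
            Token token = input.next();
            if (token == null) return null;
            String original = token.termText();
            String stem = stemmer.stem(original);
            if (!stem.equals(original)) {
                // queue "original$" at the same position as the stem
                pending = new Token(original + "$", token.startOffset(), token.endOffset());
                pending.setPositionIncrement(0);
            }
            return new Token(stem, token.startOffset(), token.endOffset());
        }
    }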

Re: Terms with negative boost should not contribute to coord()

2009-07-22 Thread Chris Hostetter
: I am firing a query having terms- each associated with a boost factor. Some : of the terms are having negative boost also (for negative boost I am using : values between 0 and 1). except that a value between 0 and 1 isn't really a negative boost -- there's no such thing as a negative boost. wh
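
A small illustration of the point (field and term names here are made up): a boost between 0 and 1 merely down-weights a clause, it never subtracts from the score:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    BooleanQuery query = new BooleanQuery();
    TermQuery strong = new TermQuery(new Term("body", "lucene"));
    TermQuery weak = new TermQuery(new Term("body", "legacy"));
    weak.setBoost(0.1f);  // a *low* boost, not a negative one: a match on
                          // "legacy" still adds a small positive amount to
                          // the score and still counts toward coord()
    query.add(strong, BooleanClause.Occur.SHOULD);
    query.add(weak, BooleanClause.Occur.SHOULD);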

Re: Batch searching

2009-07-22 Thread Matthew Hall
Not sure if this helps you, but some of the issues you are facing seem similar to those in the "real time" search threads. Basically their problem involves indexing twitter and the blogosphere, and making lucene work for super large data sets like that. Perhaps some of the discussion in those

Re: Batch searching

2009-07-22 Thread tsuraan
> Out of curiosity, what is the size of your corpus? How much and how > quickly do you expect it to grow? in terms of lucene documents, we tend to have in the 10M-100M range. Currently we use merging to make larger indices from smaller ones, so a single index can have a lot of documents in it, bu

Re: Batch searching

2009-07-22 Thread Matthew Hall
Out of curiosity, what is the size of your corpus? How much and how quickly do you expect it to grow? I'm just trying to make sure that we are all on the same page here ^^ I can see the benefits of doing what you are describing with a very large corpus that is expected to grow at a quick rate,

Re: Batch searching

2009-07-22 Thread tsuraan
> If you did this, wouldn't you be binding the processing of the results > of all queries to that of the slowest performing one within the collection? I would imagine it would, but I haven't seen too much variance between lucene query speeds in our data. > I'm guessing you are trying for some sor

Re: Batch searching

2009-07-22 Thread Shai Erera
Queries cannot be ordered "sequentially". Let's say that you run 3 Queries, w/ one term each "a", "b" and "c". On disk, the posting lists of the terms can look like this: post1(a), post1(c), post2(a), post1(b), post2(c), post2(b) etc. They are not guaranteed to be consecutive. The code makes sure t

Re: Alternative way to simulate sorting without doing actual sort

2009-07-22 Thread Phil Whelan
Hi Ganesh, I'm not sure whether this will work for you, but one way I got around this was with multiple searches. I only needed the first 50 results, but wanted to sort by date, hour, min, sec. This could result in 5 results or millions of results. I added the date to the query, so I'd search for r
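
A sketch of that multiple-search idea (userQuery, searcher, datesNewestFirst and the date field/format are all assumed): walk backwards one date term at a time until enough hits are collected, so no global sort is needed:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    List<ScoreDoc> collected = new ArrayList<ScoreDoc>();
    for (String date : datesNewestFirst) {  // e.g. "20090722", "20090721", ...
        BooleanQuery q = new BooleanQuery();
        q.add(userQuery, BooleanClause.Occur.MUST);
        q.add(new TermQuery(new Term("date", date)), BooleanClause.Occur.MUST);
        TopDocs td = searcher.search(q, null, 50 - collected.size());
        collected.addAll(Arrays.asList(td.scoreDocs));
        if (collected.size() >= 50) break;  // first 50, newest first
    }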

Re: Batch searching

2009-07-22 Thread tsuraan
> It's not accurate to say that Lucene scans the index for each search. > Rather, every Query reads a set of posting lists, each are typically read > from disk. If you pass Query[] which have nothing to do in common (for > example no terms in common), then you won't gain anything, b/c each Query >

Re: Batch searching

2009-07-22 Thread Matthew Hall
If you did this, wouldn't you be binding the processing of the results of all queries to that of the slowest performing one within the collection? I'm guessing you are trying for some sort of performance benefit by batch processing, but I question whether or not you will actually get more perf

Re: Batch searching

2009-07-22 Thread Shai Erera
It's not accurate to say that Lucene scans the index for each search. Rather, every Query reads a set of posting lists, each typically read from disk. If you pass Query[] which have nothing to do in common (for example no terms in common), then you won't gain anything, b/c each Query will alrea

Batch searching

2009-07-22 Thread tsuraan
If I understand lucene correctly, when doing multiple simultaneous searches on the same IndexSearcher, they will basically all do their own index scans and collect results independently. If that's correct, is there a way to batch searches together, so only one index scan is done? What I'd like is

RE: indexing 100GB of data

2009-07-22 Thread Steven A Rowe
You may also be interested in Andrzej Bialecki's patch to Solr that provides distributed indexing using Hadoop: https://issues.apache.org/jira/browse/SOLR-1301 Steve > -Original Message- > From: Phil Whelan [mailto:phil...@gmail.com] > Sent: Wednesday, July 22, 2009 12:46 PM > To: ja

Re: indexing 100GB of data

2009-07-22 Thread Phil Whelan
On Wed, Jul 22, 2009 at 5:46 AM, m.harig wrote: > Is there any article or forum for using Hadoop with lucene? Please any1 help > me Hi M, Katta is a project that is combining Lucene and Hadoop. Check it out here... http://katta.sourceforge.net/ Thanks, Phil

Re: Exclusion search

2009-07-22 Thread Phil Whelan
If there are only a few thousand documents, and the number of results is quite small, is this a case where post-search filtering can be done? I have not done anything like this myself with Lucene, so is this a bad idea? If not, what would be the best way to do this? org.apache.lucene.search.Filte

RE: Exclusion search

2009-07-22 Thread Steven A Rowe
If the number of volunteers is small enough, you could exclude all others in your query, e.g.: All volunteers: a, b, c, d, e, f Query to include documents containing only volunteers a and b: +vol:a +vol:b -vol:c -vol:d -vol:e -vol:f Steve On 7/22/2009 at 6:49 AM, ba3 wrote: > Yes, the doc
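
The same query built programmatically, in case that's easier than query-parser syntax (field name and volunteer list as in Steve's example):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    String[] allVolunteers = {"a", "b", "c", "d", "e", "f"};
    Set<String> wanted = new HashSet<String>(Arrays.asList("a", "b"));

    // Builds "+vol:a +vol:b -vol:c -vol:d -vol:e -vol:f"
    BooleanQuery query = new BooleanQuery();
    for (String vol : allVolunteers) {
        BooleanClause.Occur occur = wanted.contains(vol)
            ? BooleanClause.Occur.MUST       // +vol:x
            : BooleanClause.Occur.MUST_NOT;  // -vol:x
        query.add(new TermQuery(new Term("vol", vol)), occur);
    }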

New more affordable and performant Intel SSDs

2009-07-22 Thread Jason Rutherglen
http://arstechnica.com/hardware/news/2009/07/intels-new-34nm-ssds-cut-prices-by-60-percent-boost-speed.ars For me the price on the 80GB is now within reason for a $1300 SuperMicro quad-core 12GB RAM type of server.

Re: [?? Probable Spam] Re: Backing up large indexes

2009-07-22 Thread Alexandre Leopoldo Gonçalves
Shai, Thanks for the tip. I'll start with it. Alex Shai Erera wrote: Hi Alex, You can start with this article: http://www.manning.com/free/green_HotBackupsLucene.html (you'll need to register w/ your email). It describes how one can write Hot Backups w/ Lucene, and capture just the "delta" si

Re: Backing up large indexes

2009-07-22 Thread Mindaugas Žakšauskas
This might be irrelevant, but have you considered using ZFS? This file system is designed to do what you need. Assuming you can trigger events after you have updated the index, you would have to trigger a new ZFS snapshot and place it elsewhere. This might have some side effects though (

RE: indexing 100GB of data

2009-07-22 Thread Dan OConnor
Hi Jamie, I would appreciate if you could provide details on the hardware/OS you are running this system on and what kind of search response time you are getting. As well as how you add email data to your index. Thanks, Dan -Original Message- From: Jamie [mailto:ja...@stimulussoft.com

Re: reranking Lucene TopDocs

2009-07-22 Thread Shai Erera
Then I think what you're looking for is getting the ScoreDoc[] stored in the TopDocs object. Then modify each doc's score (a ScoreDoc has a doc ID and score [float]) and then sort the array. You can use Arrays.sort(ScoreDoc[], Comparator) by passing a Comparator which will compare the docs by their
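
A sketch of that rerank (searcher, query and the custom scoring function myCustomScore are assumed):

    import java.util.Arrays;
    import java.util.Comparator;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    TopDocs topDocs = searcher.search(query, null, 100);
    for (ScoreDoc sd : topDocs.scoreDocs) {
        sd.score = myCustomScore(sd.doc, sd.score);  // your reranking function
    }
    // Re-sort by the new scores, highest first.
    Arrays.sort(topDocs.scoreDocs, new Comparator<ScoreDoc>() {
        public int compare(ScoreDoc a, ScoreDoc b) {
            return Float.compare(b.score, a.score);
        }
    });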

Re: reranking Lucene TopDocs

2009-07-22 Thread henok sahilu
hey, after the query returned the top N docs, then rearrange them with my algorithm --- On Wed, 7/22/09, Shai Erera wrote: From: Shai Erera Subject: Re: reranking Lucene TopDocs To: java-user@lucene.apache.org Date: Wednesday, July 22, 2009, 6:57 AM You mean after the query has returned the top N d

Re: reranking Lucene TopDocs

2009-07-22 Thread Shai Erera
You mean after the query has returned the top N docs? why? If it's before, then given your use case, there are a number of approaches you can use during indexing and/or search time, so that your custom ranking function would be applied to documents. Shai On Wed, Jul 22, 2009 at 4:53 PM, henok sa

Re: reranking Lucene TopDocs

2009-07-22 Thread henok sahilu
i like to write code that reassigns weight to documents so that they can be reranked --- On Wed, 7/22/09, Shai Erera wrote: From: Shai Erera Subject: Re: reranking Lucene TopDocs To: java-user@lucene.apache.org Date: Wednesday, July 22, 2009, 6:44 AM Can you be more specific? What do you me

Re: Backing up large indexes

2009-07-22 Thread Shai Erera
Hi Alex, You can start with this article: http://www.manning.com/free/green_HotBackupsLucene.html (you'll need to register w/ your email). It describes how one can write Hot Backups w/ Lucene, and capture just the "delta" since the last backup. I'm about to try it myself, so if you get to do it b
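
For reference, a sketch of the snapshot half of that approach against the 2.4-era API (method names moved around a bit in later releases; the copy logic is only outlined):

    import java.util.Collection;
    import org.apache.lucene.index.IndexCommitPoint;
    import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
    import org.apache.lucene.index.SnapshotDeletionPolicy;

    // Install the policy when opening the IndexWriter, then at backup time:
    SnapshotDeletionPolicy policy =
        new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
    try {
        IndexCommitPoint commit = policy.snapshot();  // pins the current commit
        Collection files = commit.getFileNames();
        for (Object name : files) {
            // copy (String) name to the backup directory only if it isn't
            // already there; unchanged segment files keep their names, so
            // this naturally captures just the "delta"
        }
    } finally {
        policy.release();  // let the writer delete old files again
    }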

Re: reranking Lucene TopDocs

2009-07-22 Thread Shai Erera
Can you be more specific? What do you mean by re-rank? Reverse the sort? give different weights? Shai On Wed, Jul 22, 2009 at 4:35 PM, henok sahilu wrote: > hello there > i like to re-rank lucene TopDoc result set. > where shall i start > thanks > > > > >

reranking Lucene TopDocs

2009-07-22 Thread henok sahilu
hello there i like to re-rank lucene TopDoc result set. where shall i start thanks

Backing up large indexes

2009-07-22 Thread Alexandre Leopoldo Gonçalves
Hi All, We have a system with a Lucene index of 100GB that is growing fast. I wonder whether there is an efficient way to back it up, taking into account only the changes between the old and new versions of the index, since after the optimization process the names of the main index files change. Regards, Ale

Re: Lucene - Search breadth approach

2009-07-22 Thread Shai Erera
Hi Robert, What you could do is use the Stemmer (as a TokenFilter I assume) and produce two tokens always - the stem and the original. Index both of them in the same position. Then tell your users that if they search for [testing], it will find results for 'testing', 'test' etc (the stems) and if

Lucene - Search breadth approach

2009-07-22 Thread Robert Corbett
Hello, I would like to use a stemming analyser similar to KStem or PorterStem to provide access to a wider search scope for our users. However, at the same time I also want to provide the ability for the users to throw out the stems if they want to search more accurately. I have a number of ideas

Re: indexing 100GB of data

2009-07-22 Thread Jamie
Hi there, We have Lucene searching across several terabytes of email data and there is no problem at all. Regards, Jamie Shai Erera wrote: There shouldn't be a problem to search such index. It depends on the machine you use. If it's a strong enough machine, I don't think you should have an

Re: indexing 100GB of data

2009-07-22 Thread m.harig
Is there any article or forum for using Hadoop with Lucene? Please, anyone, help me

Re: Alternative way to simulate sorting without doing actual sort

2009-07-22 Thread Ian Lea
However you do it, it seems to me that you're going to need loads of memory if you want lucene to do this type of sorting on indexes with 100 million docs. Can't you just buy or allocate some more memory? One alternative would be to do the sorting yourself once you've got a list of hits. Trade so

[ApacheCon US] Travel Assistance

2009-07-22 Thread Grant Ingersoll
The Travel Assistance Committee is taking in applications for those wanting to attend ApacheCon US 2009 (Oakland) which takes place between the 2nd and 6th November 2009. The Travel Assistance Committee is looking for people who would like to be able to attend ApacheCon US 2009 who may nee

Re: Exclusion search

2009-07-22 Thread ba3
Yes, the documents were already indexed and the documents do not get updated. Maintaining an alternate index is a nice solution. Will try it out. Thanks for the pointer. If there is a solution which can use the same index it would be great! --Rgds Ba3 Perhaps I misunderstood something, but ho

Re: PageRanking with Lucene

2009-07-22 Thread prashant ullegaddi
Is it that the boost of a Document is stored in 6 bits? On Wed, Jul 22, 2009 at 8:26 AM, Grant Ingersoll wrote: > I'd probably look at the function package in Lucene. While the document > boost can be used, it may not give you the granularity you need, as you only > have something like 6 bits of rep
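
For reference, a rough illustration of the limited precision being discussed: norms (which fold in the document and field boosts) are packed into a single byte with a 3-bit mantissa and 5-bit exponent, so nearby boost values collapse to the same stored value:

    import org.apache.lucene.search.Similarity;

    byte encoded = Similarity.encodeNorm(0.9f);   // lossy one-byte encoding
    float decoded = Similarity.decodeNorm(encoded);
    System.out.println(decoded);                  // close to, but generally not exactly, 0.9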

Re: Exclusion search

2009-07-22 Thread Shai Erera
Perhaps I misunderstood something, but how do you update a document? I mean, if a document contains vol:a, vol:b and vol:c and then you want to add vol:d to it, don't you remove the document and add it back? If that's what you do, then you can also update the numvols field, right? Or .. you mean

Re: indexing 100GB of data

2009-07-22 Thread Shai Erera
There shouldn't be a problem to search such index. It depends on the machine you use. If it's a strong enough machine, I don't think you should have any problems. But like I said, you can always try it out on your machine before you make a decision. Also, Lucene has a Benchmark package which incl

Re: Alternative way to simulate sorting without doing actual sort

2009-07-22 Thread Ganesh
Hello Eric, Thanks for your reply. Memory reqd for sorting: 4 * reader.maxdoc(). I am sorting datetime with minute resolution. If 100 records represent a minute, then in a 1 million record database there will be around 2 unique terms. The amount of memory consumed would be 4 * 100

Re: Exclusion search

2009-07-22 Thread ba3
>> Maybe add to each doc a field numVolunteers and then constraint the query to >> vol:krish and vol:raj and numvol:2 (something like that)? Thanks for the reply. But the number of documents runs into a few thousand. Hence editing them is not an option. Are there any other ways to solve this scen

Re: indexing 100GB of data

2009-07-22 Thread prashant ullegaddi
Yes you can use Hadoop with Lucene. Borrow some code from Nutch. Look at org.apache.nutch.indexer.IndexerMapReduce and org.apache.nutch.indexer.Indexer. Prashant. On Wed, Jul 22, 2009 at 2:00 PM, m.harig wrote: > > Thanks Shai > > So there won't be problem when searching that kind of

Re: indexing 100GB of data

2009-07-22 Thread m.harig
Thanks Shai. So there won't be a problem when searching that kind of large index, am I right? Can anyone tell me, is it possible to use Hadoop with Lucene?

Re: indexing 100GB of data

2009-07-22 Thread Shai Erera
From my experience, you shouldn't have any problems indexing that amount of content even into one index. I've successfully indexed 450 GB of data w/ Lucene, and I believe it can scale much higher if rich text documents are indexed. Though I haven't tried yet, I believe it can scale into the 1-5 TB

Re: Exclusion search

2009-07-22 Thread Shai Erera
Maybe add to each doc a field numVolunteers and then constrain the query to vol:krish and vol:raj and numvol:2 (something like that)? On Wed, Jul 22, 2009 at 9:49 AM, ba3 wrote: > > Hi, > > In the documents which contain the volunteer information : > > Doc1 : > volunteer krish > volunteer john
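
A sketch of that numVolunteers idea (doc and the volunteerCount variable come from your indexing code; field names are illustrative):

    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    // Indexing: alongside the vol fields, record how many volunteers the doc has.
    doc.add(new Field("numvols", String.valueOf(volunteerCount),
                      Field.Store.NO, Field.Index.NOT_ANALYZED));

    // Searching: "docs whose volunteers are exactly krish and raj"
    BooleanQuery q = new BooleanQuery();
    q.add(new TermQuery(new Term("vol", "krish")), BooleanClause.Occur.MUST);
    q.add(new TermQuery(new Term("vol", "raj")), BooleanClause.Occur.MUST);
    q.add(new TermQuery(new Term("numvols", "2")), BooleanClause.Occur.MUST);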