Re: assign an id to document?

2010-10-20 Thread Anshum
Hi Nilesh, no, you can't do that, though you may store your own id as a separate field for whatever purpose you want. I don't see any reason why you'd want to override the Lucene document id with your own. Let me know in case there's something I didn't get. -- Anshum Gupta http://ai-caf
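A minimal sketch of what Anshum suggests: keep your own id in a stored, untokenized field and look documents up by it, rather than relying on Lucene's internal doc ids. This is not code from the thread; the field names and the use of a RAMDirectory are assumptions for illustration, using Lucene 3.0-era APIs.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class OwnIdExample {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory(); // in-memory index for the sketch
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        // Store the application's own id as an untokenized, stored field.
        doc.add(new Field("myId", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("body", "some text", Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir, true);
        TopDocs hits = searcher.search(new TermQuery(new Term("myId", "42")), 1);
        // hits.scoreDocs[0].doc is the Lucene-assigned internal id;
        // "myId" is the application's own id, retrievable from the stored field.
        System.out.println(searcher.doc(hits.scoreDocs[0].doc).get("myId"));
        searcher.close();
    }
}
```

The internal doc id can change on merges, so an application-level id field like this is the stable handle.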

assign an id to document?

2010-10-20 Thread Nilesh Vijaywargiay
I understand Lucene provides its own document id. Can we assign this id in any way?

Re: how to index large number of files?

2010-10-20 Thread Sahin Buyrukbilen
By the way, the file size is not big, mostly 1 kB. I am working on Wikipedia articles in txt format. On Wed, Oct 20, 2010 at 11:01 PM, Sahin Buyrukbilen < sahin.buyrukbi...@gmail.com> wrote: > Unfortunately both methods didn't go through. I am getting a memory error even > at reading the directory conte

Re: how to index large number of files?

2010-10-20 Thread Sahin Buyrukbilen
Unfortunately both methods didn't go through. I am getting a memory error even at reading the directory contents. Now I am thinking this: what if I split the 4.5 million files into directories of 100,000 files (or fewer, depending on the Java error), index each of them separately, and merge those indexes (if possib
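Merging separately built sub-indexes, as Sahin proposes, is directly supported by Lucene 3.0's IndexWriter. A hedged sketch (the directory names and the number of parts are assumptions, not from the thread):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import java.io.File;

public class MergeIndexes {
    public static void main(String[] args) throws Exception {
        Directory merged = FSDirectory.open(new File("merged-index")); // assumed path
        IndexWriter writer = new IndexWriter(merged,
                new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        // One sub-index per batch of files, built earlier in separate runs.
        Directory[] parts = new Directory[] {
                FSDirectory.open(new File("index-part-0")),
                FSDirectory.open(new File("index-part-1"))
        };
        writer.addIndexesNoOptimize(parts); // merge without optimizing the sources
        writer.optimize(); // optional: collapse the result to one segment
        writer.close();
    }
}
```

This is the same addIndexesNoOptimize call Mike McCandless vets as "fine" later in this digest, so the splitting approach fits Lucene's intended usage.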

Re: how to index large number of files?

2010-10-20 Thread Erick Erickson
My first guess is that you're accumulating too many documents in memory before the flush gets triggered. The quick-n-dirty way to test this is to do an IndexWriter.flush after every addDocument. This will slow down indexing, but it will also tell you whether this is the problem and you can look for more e
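Erick's diagnostic can be sketched as follows. The RAM-buffer setting, batch size, file layout, and field names are assumptions for illustration; the idea from the thread is simply to bound how much is buffered before a flush.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import java.io.File;

public class BoundedRamIndexing {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("index")), // assumed path
                new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        // Flush buffered documents to disk once they exceed ~32 MB of RAM,
        // instead of letting them accumulate until memory runs out.
        writer.setRAMBufferSizeMB(32.0);
        int count = 0;
        for (File f : new File("docs").listFiles()) { // assumed source dir
            Document doc = new Document();
            doc.add(new Field("path", f.getPath(),
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
            // Diagnostic variant of Erick's suggestion: commit every N docs
            // to see whether buffering is what exhausts the heap.
            if (++count % 10000 == 0) writer.commit();
        }
        writer.close();
    }
}
```

If the out-of-memory error persists even with a small RAM buffer and frequent commits, the problem is likely elsewhere, for example in reading the 4.5-million-entry directory listing itself.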

RE: Using a TermFreqVector to get counts of all words in a document

2010-10-20 Thread Martin O'Shea
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201010.mbox/%3c1287065863.4cb7110774...@netmail.pipex.net%3e will give you a better idea of what I'm moving towards. It's all a bit grey at the moment so further investigation is inevitable. I expect that a combination of MySQL database s

Re: how to index large number of files?

2010-10-20 Thread Qi Li
If I were you, I would write it like this. Not sure this helps. Let me know how it works: public static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException { Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30); Directory indexDir = FSDirectory.op

Re: Get number of hits for various combinations

2010-10-20 Thread Pradeep Singh
Use Solr and look at faceting. On Wed, Oct 20, 2010 at 9:12 AM, Bob Miller wrote: > Dear all, > > > I would like to use a tag cloud of keywords to narrow down the currently > displayed search results. In order to implement this, it must be possible > to > determine the number of hits for the cur

Re: Using a TermFreqVector to get counts of all words in a document

2010-10-20 Thread Grant Ingersoll
On Oct 20, 2010, at 2:53 PM, Martin O'Shea wrote: > Uwe > > Thanks - I figured that bit out. I'm a Lucene 'newbie'. > > What I would like to know though is if it is practical to search a single > document of one field simply by doing this: > > IndexReader trd = IndexReader.open(index); >

Re: problem during index merge

2010-10-20 Thread Cristian Vat
Unfortunately my target index is fully optimized so only one segment. So CheckIndex won't be useful. I'll reindex from scratch and also check for hardware issues. Hopefully it won't get corrupted again. Thanks for your help. - Cristian Vat On Wed, Oct 20, 2010 at 22:58, Michael McCandless wrot

Re: problem during index merge

2010-10-20 Thread Michael McCandless
Not recoverable automatically. But you can run CheckIndex -fix: it removes the segment(s) that have the corruption (losing all docs in those segments). Searches appear fine because we don't do this check ("docs out of order") during searching, but results are likely wrong when they hit that segmen
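CheckIndex can also be driven programmatically, which is handy if you want the check in a maintenance job rather than on the command line. A sketch under the assumption that the index lives in a local directory; the path is hypothetical, and as Mike warns, fixing drops every document in a corrupt segment, so back the index up first.

```java
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.io.File;

public class FixCorruptIndex {
    public static void main(String[] args) throws Exception {
        // Roughly equivalent to: java org.apache.lucene.index.CheckIndex <dir> -fix
        Directory dir = FSDirectory.open(new File("index")); // assumed path
        CheckIndex checker = new CheckIndex(dir);
        CheckIndex.Status status = checker.checkIndex(); // full integrity scan
        if (!status.clean) {
            // Removes any segment found corrupt; their documents are lost.
            checker.fixIndex(status);
        }
        dir.close();
    }
}
```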

Re: problem during index merge

2010-10-20 Thread Cristian Vat
Corruption in the sense that it isn't recoverable, or is there something I might be able to do? The index opens without problems in Luke, and there's an application using it for searching without problems (apparently). It's actually an application with a (possibly) hourly update, so finding the l

Re: problem during index merge

2010-10-20 Thread Michael McCandless
This looks like index corruption. But: are you able to reproduce this corruption, e.g. on different machines? I'd love to see how :) Your usage (using addIndexesNoOptimize to add indexes) looks fine. Mike On Wed, Oct 20, 2010 at 2:45 PM, Cristian Vat wrote: > Hello, > > I've been running into a

Lucene index update

2010-10-20 Thread Nilesh Vijaywargiay
I've written a blog post about a workaround for updating an index in Lucene using a parallel reader. It's explained with results and pictures. It would be great if you could have a look at it. The link: http://the10minutes.blogspot.com/2010/10/lucene-index-update.html

RE: Using a TermFreqVector to get counts of all words in a document

2010-10-20 Thread Martin O'Shea
Uwe Thanks - I figured that bit out. I'm a Lucene 'newbie'. What I would like to know though is if it is practical to search a single document of one field simply by doing this: IndexReader trd = IndexReader.open(index); TermFreqVector tfv = trd.getTermFreqVector(docId, "title");
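Martin's two lines can be fleshed out into a complete read path. This is a hedged sketch, not code from the thread: the index path and docId are assumptions, and per Uwe's point below, getTermFreqVector returns null unless term vectors were enabled for the field at index time.

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.io.File;

public class TermCounts {
    public static void main(String[] args) throws Exception {
        Directory index = FSDirectory.open(new File("index")); // assumed path
        IndexReader trd = IndexReader.open(index, true);
        int docId = 0; // a Lucene-internal doc id, e.g. taken from a ScoreDoc
        TermFreqVector tfv = trd.getTermFreqVector(docId, "title");
        if (tfv != null) { // null if the field has no stored term vector
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            // Parallel arrays: freqs[i] is the count of terms[i] in this doc.
            for (int i = 0; i < terms.length; i++) {
                System.out.println(terms[i] + ": " + freqs[i]);
            }
        }
        trd.close();
    }
}
```

So yes, reading counts for a single document and field really is this direct, provided the vectors exist.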

problem during index merge

2010-10-20 Thread Cristian Vat
Hello, I've been running into a problem during a merge. Would appreciate knowing what to look for since the exception doesn't seem too explanatory. I get: -- --- Nested Exception --- java.io.IOException: background merge hit exception: _2p:c695204 _2q:c93106 into _2r [optimize] [mergeDocStores]

Re: how to index large number of files?

2010-10-20 Thread Sahin Buyrukbilen
With the different parameters I still got the same error. My code is very simple; indeed, I am only concerned with creating the index, and then I will do some private information retrieval experiments on the inverted index file, which I created with the information extracted from the index. That is w

Re: how to index large number of files?

2010-10-20 Thread Qi Li
1. What is the difference when you used different VM parameters? 2. What merge policy and optimization strategy did you use? 3. How did you use commit or flush? Qi On Wed, Oct 20, 2010 at 2:05 PM, Sahin Buyrukbilen < sahin.buyrukbi...@gmail.com> wrote: > Thank you so much for this info. It

RE: Using a TermFreqVector to get counts of all words in a document

2010-10-20 Thread Uwe Schindler
TermVectors are only available when enabled for the field/document. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Martin O'Shea [mailto:app...@dsl.pipex.com] > Sent: Wednesday, October 20, 2010 8:23 PM
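Uwe's point is about the indexing side: term vectors are opt-in per field. A minimal sketch of what enabling them looks like (the field name, sample text, and RAMDirectory are assumptions for illustration):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class IndexWithTermVectors {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory(); // in-memory index for the sketch
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        // Field.TermVector.YES stores per-document term frequencies;
        // without it, getTermFreqVector later returns null for this field.
        doc.add(new Field("title", "the quick brown fox",
                Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        writer.addDocument(doc);
        writer.close();
    }
}
```

Field.TermVector also has WITH_POSITIONS and WITH_OFFSETS variants when positional data is needed, at the cost of a larger index.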

Using a TermFreqVector to get counts of all words in a document

2010-10-20 Thread Martin O'Shea
Hello, I am trying to use a TermFreqVector to get a count of all words in a Document as follows: // Search. int hitsPerPage = 10; IndexSearcher searcher = new IndexSearcher(index, true); TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, t

Re: how to index large number of files?

2010-10-20 Thread Sahin Buyrukbilen
Thank you so much for this info. It looks pretty complicated for me, but I will try. On Wed, Oct 20, 2010 at 1:18 AM, Johnbin Wang wrote: > You can start a fixedThreadPool to index all these files in a multithreaded > way. Every thread executes an index task which could index a part of all the > f

Get number of hits for various combinations

2010-10-20 Thread Bob Miller
Dear all, I would like to use a tag cloud of keywords to narrow down the currently displayed search results. In order to implement this, it must be possible to determine the number of hits for the current search query combined with each keyword in the tag cloud. What would be the most effici
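Pradeep's answer below points at Solr faceting, which is the efficient route. In plain Lucene, one straightforward (if slower, one-search-per-tag) approach is to AND the current query with each keyword and read the hit count. A hedged sketch; the "keyword" field name is an assumption:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class TagCloudCounts {
    // Hit count for the current query combined with one candidate tag.
    public static int hitCount(IndexSearcher searcher, Query current, String tag)
            throws Exception {
        BooleanQuery combined = new BooleanQuery();
        combined.add(current, BooleanClause.Occur.MUST);
        combined.add(new TermQuery(new Term("keyword", tag)),
                BooleanClause.Occur.MUST);
        // Only the count is needed, so request a single hit and read totalHits.
        return searcher.search(combined, 1).totalHits;
    }
}
```

Looping this over every keyword in the cloud runs one search per tag; Solr's faceting computes all the counts in a single pass, which is why it is the usual recommendation.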

Re: London open-source search social - 28th Nov - NEW VENUE

2010-10-20 Thread Richard Marr
Wow, apologies for utter stupidity. Both subject line and body should have read 28th OCT. On 20 October 2010 15:42, Richard Marr wrote: > Hi all, > > We've booked a London Search Social for Thursday the 28th Sept. Come > along if you fancy geeking out about search and related technology > over

London open-source search social - 28th Nov - NEW VENUE

2010-10-20 Thread Richard Marr
Hi all, We've booked a London Search Social for Thursday the 28th Sept. Come along if you fancy geeking out about search and related technology over a beer. Please note that we're not meeting in the same place as usual. Details on the meetup page. http://www.meetup.com/london-search-social/ Rich

Re: Best implementation for address searching

2010-10-20 Thread Jasper de Barbanson
Hi Anshum, 1. The unstructured addresses are sometimes separated by comma, but most of the time by a single space. 2. The parts can be in an increasing or decreasing order, but not always. Most common combinations are "A street, B housenumber, C city", "D zipcode, E housenumber", "C city", "A stre

Re: Best implementation for address searching

2010-10-20 Thread Anshum
Hi Jasper, just a suggestion to start with; it was almost an information overload for me, but here is what I could think of straight off. Let me know in case you try it or have already tried it. A few points I'd want to know: 1. I understand that the addresses would/could be unstructured/ill-structured