Searching with too many clauses + Out of Memory
Hi Everyone, I am using Compass 1.1 M2, which supports Lucene 2.2, to store and search a huge amount of company, executive and employment data. There are some use cases where I need to search for executives/employments on the result set of a company search. But when I try to create a Compass query to search for executives across over 100,000 (1 lakh) company IDs, it runs out of memory because the query is huge. Here is the exception stack trace:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.lucene.index.SegmentReader.termDocs(SegmentReader.java:342)
at org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:435)
at org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:428)
at org.apache.lucene.index.MultiTermDocs.read(MultiReader.java:393)
at org.apache.lucene.search.TermScorer.next(TermScorer.java:106)
at org.apache.lucene.util.ScorerDocQueue.topNextAndAdjustElsePop(ScorerDocQueue.java:116)
at org.apache.lucene.search.DisjunctionSumScorer.advanceAfterCurrent(DisjunctionSumScorer.java:175)
at org.apache.lucene.search.DisjunctionSumScorer.next(DisjunctionSumScorer.java:146)
at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:327)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:124)
at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:232)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:74)
at org.apache.lucene.search.Hits.<init>(Hits.java:61)
at org.apache.lucene.search.Searcher.search(Searcher.java:55)
at org.compass.core.lucene.engine.transaction.ReadCommittedTransaction.findByQuery(ReadCommittedTransaction.java:469)
at org.compass.core.lucene.engine.transaction.ReadCommittedTransaction.doFind(ReadCommittedTransaction.java:426)
at org.compass.core.lucene.engine.transaction.AbstractTransaction.find(AbstractTransaction.java:91)
at org.compass.core.lucene.engine.LuceneSearchEngine.find(LuceneSearchEngine.java:379)
at org.compass.core.lucene.engine.LuceneSearchEngineQuery.hits(LuceneSearchEngineQuery.java:151)
at org.compass.core.impl.DefaultCompassQuery.hits(DefaultCompassQuery.java:133)
at org.compass.core.support.search.CompassSearchHelper.performSearch(CompassSearchHelper.java:144)
at org.compass.core.support.search.CompassSearchHelper$1.doInCompass(CompassSearchHelper.java:89)
at org.compass.core.CompassTemplate.execute(CompassTemplate.java:137)
at org.compass.core.support.search.CompassSearchHelper.search(CompassSearchHelper.java:86)
It looks like this error is actually in the Lucene code. It would be great if anyone in this group who has an idea about this kind of use case could offer some suggestions. Thanks, Harini
Re: Problem Search using lucene
Chhabra, Kapil wrote: You just have to make sure that what you are searching for is indexed (and especially in the same format/case). Use Luke (http://www.getopt.org/luke/) to browse through your index. Does Luke also work with regard to Nutch? Thanks, Michael. This might give you an insight into what you have indexed and what you are searching for. Regards, kapilChhabra -Original Message- From: masz-wow [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 01, 2007 12:13 PM To: java-user@lucene.apache.org Subject: Re: Problem Search using lucene Thanks Joe, I'm using this function as my analyzer:

public static Analyzer getDefaultAnalyzer() {
    PerFieldAnalyzerWrapper perFieldAnalyzer = new PerFieldAnalyzerWrapper(new StopAnalyzer());
    perFieldAnalyzer.addAnalyzer("contents", new StopAnalyzer());
    perFieldAnalyzer.addAnalyzer("fileID", new WhitespaceAnalyzer());
    perFieldAnalyzer.addAnalyzer("path", new KeywordAnalyzer());
    return perFieldAnalyzer;
}

StopAnalyzer builds an analyzer which removes the words in ENGLISH_STOP_WORDS. That might be why I cannot search for words such as 'and' and 'to', BUT I'm still having problems when I search for a few words other than English words, such as names (e.g. Ghazat) or strings of numbers (e.g. 45600). -- Michael Wechner Wyona - Open Source Content Management - Yanel, Yulup http://www.wyona.com [EMAIL PROTECTED], [EMAIL PROTECTED] +41 44 272 91 61
RE: Searching with too many clauses + Out of Memory
What is the size of the heap you are allocating for your app? -Original Message- From: Harini Raghavan [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 01, 2007 2:29 PM To: java-user@lucene.apache.org Subject: Searching with too many clauses + Out of Memory Hi Everyone, I am using Compass 1.1 M2, which supports Lucene 2.2, to store and search a huge amount of company, executive and employment data. There are some use cases where I need to search for executives/employments on the result set of a company search. But when I try to create a Compass query to search for executives across over 100,000 (1 lakh) company IDs, it runs out of memory because the query is huge. It looks like this error is actually in the Lucene code. It would be great if anyone in this group who has an idea about this kind of use case could offer some suggestions. Thanks, Harini
Crawling in Nutch
Hi, Where (in which field) does Nutch store the content of a document while indexing? I am using this Nutch index to search with Lucene, so I want to know the field in which the content of the document is stored. Thank you
IndexReader deletes more than expected
Hi, I got unexpected behavior while testing Lucene. To state the problem briefly: using an IndexWriter I add docs with a field named ID in consecutive order (1, 2, 3, 4, etc.), then close that index. I get a new IndexReader and call IndexReader.deleteDocuments(Term), where the term is simply new Term("ID", "1"), and then call close on the IndexReader. Things work out fine. But if I add docs using an IndexWriter, close the writer, then create a new IndexReader to delete one of the docs already inserted, but without closing it: while the IndexReader that performed the deletion is still open, I add more docs and then commit the IndexWriter, so when I search I get all the docs added in both phases (before calling deleteDocuments() on the IndexReader and after), because I haven't closed the IndexReader, although I have closed the IndexWriter. Then I close the IndexReader and query the index, and it has deleted all the docs added between opening and closing it, in addition to the doc specified by the Term (in this test case: ID=1). I know I can avoid this by closing the IndexReader directly after deleting docs, but what about running it in a multi-threaded app like a web application? Here is the code:

IndexSearcher indexSearcher = new IndexSearcher(this.indexDirectory);
Hits hitsB4InsertAndClose = null;
hitsB4InsertAndClose = getAllAsHits(indexSearcher);
int beforeInsertAndClose = hitsB4InsertAndClose.length();

indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.close();

IndexSearcher indexSearcherDel = new IndexSearcher(this.indexDirectory);
indexSearcherDel.getIndexReader().deleteDocuments(new Term("ID", "1"));

indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.addDocument(getNewElement());
indexWriter.close();

Hits hitsAfterInsertAndClose = getAllAsHits(indexSearcher);
int AfterInsertAndClose = hitsAfterInsertAndClose.length(); // This is 14

indexWriter.addDocument(getNewElement());
indexWriter.close();

Hits hitsAfterInsertAndAfterCloseb4Delete = getAllAsHits(indexSearcher);
int hitsAfterInsertButAndAfterCountb4Delete = hitsAfterInsertAndAfterCloseb4Delete.length(); // This is 15

indexSearcherDel.close();

Hits hitsAfterInsertAndAfterClose = getAllAsHits(indexSearcher);
int hitsAfterInsertButAndAfterCount = hitsAfterInsertAndAfterClose.length(); // This is 2

The two methods I use:

private Hits getAllAsHits(IndexSearcher indexSearcher) {
    try {
        Analyzer analyzer = new StandardAnalyzer();
        String defaultSearchField = "all";
        QueryParser parser = new QueryParser(defaultSearchField, analyzer);
        indexSearcher = new IndexSearcher(this.indexDirectory);
        Hits hits = indexSearcher.search(parser.parse("+alias:mydoc"));
        indexSearcher.close();
        return hits;
    } catch (IOException ex) {
        throw new RuntimeException(ex);
    } catch (org.apache.lucene.queryParser.ParseException ex) {
        throw new RuntimeException(ex);
    }
}

private Document getNewElement() {
    Map map = new HashMap();
    map.put("ID", new Integer(insertCounter).toString());
    map.put("name", "name" + insertCounter);
    insertCounter++;
    Document document = new Document();
    for (Iterator iter = map.keySet().iterator(); iter.hasNext();) {
        String key = (String) iter.next();
        document.add(new Field(key, (String) map.get(key), Store.YES, Index.TOKENIZED));
    }
    document.add(new Field("alias", "mydoc", Store.YES, Index.UN_TOKENIZED));
    return document;
}

Any clue why it works this way? I expected it to delete only one doc.
More IP/MAC indexing questions
Hi again, everyone. First of all, I want to thank everyone for their extremely helpful replies so far. Also, I just started reading the book "Lucene in Action" last night. So far it's an awesome book, so a big thanks to the authors. Anyhow, on to my question. As I've mentioned in several of my previous messages, I am indexing different pieces of information about servers - in particular, my question is about indexing the IP address and MAC address. Using the StandardAnalyzer, an IP is kept as a single token ("192.168.1.100"), and a MAC is broken up into one token per octet ("00", "17", "fd", "14", "d3", "2a"). Many searches will be for partial IPs or MACs ("192.168", "00:17:fd", etc). Are either of these methods of indexing the addresses (single token vs per-octet token) more or less efficient than the other when indexing large numbers of these? -- Joe Attardi [EMAIL PROTECTED] http://thinksincode.blogspot.com/
RE: IndexReader deletes more than expected
If I'm reading this correctly, there's something a little wonky here. In your example code, you close the IndexWriter and then, without creating a new IndexWriter, you call addDocument again. This shouldn't be possible (what version of Lucene are you using?). Assuming for the time being that you are creating the IndexWriter again, the other issue here is that you shouldn't be able to have a reader and a writer changing an index at the same time. There should be a lock failure. This should occur either in the Index Might you be creating your IndexWriters (which you don't show) with the create flag always set to true? That will wipe your index each time, ignore the locks, and cause all sorts of weird results. -Original Message- From: Ridwan Habbal [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 01, 2007 8:48 AM To: java-user@lucene.apache.org Subject: IndexReader deletes more than expected
Re: IndexReader deletes more than expected
On 8/1/07, Ridwan Habbal <[EMAIL PROTECTED]> wrote: > but what about running it in a multi-threaded app like a web application? If you are targeting a multi-threaded webapp then I strongly suggest you look into using either Solr or the LuceneIndexAccessor code. You will want to use some form of reference counting to manage your Readers and Writers. Also, you can now use IndexWriter (Lucene 2.1 and greater, I think) to delete. This allows efficient mixing of deletes and adds by buffering the deletes and then opening an IndexReader to commit them later. This is much more efficient than IndexModifier. - Mark
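A minimal sketch of the buffered-delete approach Mark describes, assuming Lucene 2.1 or later where IndexWriter.deleteDocuments(Term) is available; indexDirectory and getNewElement() are borrowed from Ridwan's example and the analyzer choice is illustrative:

IndexWriter writer = new IndexWriter(indexDirectory, new StandardAnalyzer(), false);
writer.deleteDocuments(new Term("ID", "1"));  // delete is buffered by the writer
writer.addDocument(getNewElement());          // adds and deletes can be interleaved freely
writer.addDocument(getNewElement());
writer.close();                               // flushes the adds and applies the buffered deletes

With this pattern there is only ever one object modifying the index, so the reader/writer locking problems described earlier in the thread do not arise.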
Re: More IP/MAC indexing questions
First, consider using your own analyzer and/or breaking the IP addresses up by substituting ' ' for '.' upon input. Otherwise, you'll have endless issues as time passes.. But on to your question. Please post what you mean by "a large number". 10,000? 1,000,000,000? we have no clue from your posts so far... That said, efficiency is hugely overrated at this stage of your design. I'd personally use whatever is easiest and run some tests. Just index them as single (unbroken) tokens to start and search your partial address with PrefixQuery. Or index them as individual tokens and create a SpanFirstQuery. Or... And measure . Best Erick On 8/1/07, Joe Attardi <[EMAIL PROTECTED]> wrote: > > Hi again, everyone. First of all, I want to thank everyone for their > extremely helpful replies so far. > Also, I just started reading the book "Lucene in Action" last night. So > far > it's an awesome book, so a big thanks to the authors. > > Anyhow, on to my question. As I've mentioned in several of my previous > messages, I am indexing different pieces of information about servers - in > particular, my question is about indexing the IP address and MAC address. > > Using the StandardAnalyzer, an IP is kept as a single token (" > 192.168.1.100"), > and a MAC is broken up into one token per octet ("00", "17", "fd", "14", > "d3", "2a"). Many searches will be for partial IPs or MACs ("192.168", > "00:17:fd", etc). > > Are either of these methods of indexing the addresses (single token vs > per-octet token) more or less efficient than the other when indexing large > numbers of these? > > -- > Joe Attardi > [EMAIL PROTECTED] > http://thinksincode.blogspot.com/ >
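To make the two options Erick mentions concrete, here is a rough sketch; the field name "ip" and the prefix values are invented for illustration, and this assumes stock Lucene 2.x classes:

// Option 1: address indexed as one unbroken token; partial search by prefix
Query partial = new PrefixQuery(new Term("ip", "192.168."));

// Option 2: address indexed octet-by-octet; anchor the match at the start of the field
SpanQuery[] octets = {
    new SpanTermQuery(new Term("ip", "192")),
    new SpanTermQuery(new Term("ip", "168"))
};
Query anchored = new SpanFirstQuery(new SpanNearQuery(octets, 0, true), 2); // must match within the first 2 positions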
Re: More IP/MAC indexing questions
Hi Erick, > First, consider using your own analyzer and/or breaking the IP addresses up by substituting ' ' for '.' upon input. Do you mean breaking the IP up into one token for each segment, like ["192", "168", "1", "100"]? > But on to your question. Please post what you mean by "a large number". 10,000? 1,000,000,000? we have no clue from your posts so far... I apologize for the lack of details. A large part of the data will be wireless MAC addresses detected over the air, so it depends on the site. But I suppose, worst case, we're looking at thousands or tens of thousands. Comparatively speaking, then, I guess it's not such a large number compared to some of the other questions discussed on the list. > That said, efficiency is hugely overrated at this stage of your design. I'd personally use whatever is easiest and run some tests. > Just index them as single (unbroken) tokens to start and search your partial address with PrefixQuery. This is what I was thinking originally, too. Although there could be times where they are searching for a piece at the end of the address, which is why my original post had me building a WildcardQuery. The system will be searching log messages, too, and for that I'll use the more normal StandardAnalyzer/QueryParser approach. So what I am thinking of doing going forward is creating a custom query parser class that basically has special cases (IP addresses, MAC addresses) where the query must be more customized, and in the other cases falls through to the standard QueryParser class. Does this sound like a good idea? Thanks again for your continued help!
Re: More IP/MAC indexing questions
Think of a custom analyzer class rather than a custom query parser. The QueryParser uses your analyzer, so it all just "comes along". Here's the approach I'd try first, off the top of my head. Yes, break the IP and so on up into octets and index them tokenized. Use a SpanNearQuery with a slop of 0 and specify true for ordering. What that will do is require that the segments you specify must appear in order with no gaps. You have to construct this yourself since there's no support for SpanQueries in the QueryParser yet. This'll avoid having to deal with Wildcards, which have their own issues (try searching on a thread "I just don't understand wildcards at all" for an exposition from "the guys" on this). Best Erick On 8/1/07, Joe Attardi <[EMAIL PROTECTED]> wrote: > > Hi Erick, > > First, consider using your own analyzer and/or breaking the IP addresses > > up by substituting ' ' for '.' upon input. > > Do you mean breaking the IP up into one token for each segment, like > ["192", > "168", "1", "100"] ? > > > > > But on to your question. Please post what you mean by > > "a large number". 10,000? 1,000,000,000? we have no clue > > from your posts so far... > > I apologize for the lack of details. A large part of the data will be > wireless MAC addresses detected over the air, so it depends on the site. > But > I suppose, worst case, we're looking at thousands or tens of thousands. > Comparatively speaking, then, I guess it's not such a large number > compared > to some of the other questions discussed on the list. > > That said, efficiency is hugely overrated at this stage of your > > design. I'd personally use whatever is easiest and run some > > tests. > > > > Just index them as single (unbroken) tokens to start and search > > your partial address with PrefixQuery. > > This is what I was thinking originally, too. Although there could be times > where they are searching for a piece at the end of the address, which is > why > my original post had me building a WildcardQuery. > > The system will be searching log messages, too, and for that I'll use the > more normal StandardAnalyzer/QueryParser approach. > > So what I am thinking of doing going forward is creating a custom query > parser class, that basically has special cases (IP addresses, MAC > addresses) > where the query must be more customized, and in the other cases fall > through > to the standard QueryParser class. Does this sound like a good idea? > > Thanks again for your continued help! >
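A rough sketch of both pieces Erick suggests; the class and field names are made up, and this is only a starting point, not Erick's code. First, an analyzer that emits one token per octet/segment by treating '.' and ':' as separators:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class AddressAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // split on '.' , ':' and whitespace so "192.168.1.100" becomes four tokens
        return new CharTokenizer(reader) {
            protected boolean isTokenChar(char c) {
                return c != '.' && c != ':' && !Character.isWhitespace(c);
            }
        };
    }
}

Then the query is built by hand, since QueryParser has no span support:

SpanQuery[] parts = {
    new SpanTermQuery(new Term("ip", "192")),
    new SpanTermQuery(new Term("ip", "168")),
    new SpanTermQuery(new Term("ip", "10"))
};
Query q = new SpanNearQuery(parts, 0, true); // slop 0, in order: adjacent octets, no gaps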
Re: More IP/MAC indexing questions
On 8/1/07, Erick Erickson <[EMAIL PROTECTED]> wrote: > > Use a SpanNearQuery with a slop of 0 and specify true for ordering. > What that will do is require that the segments you specify must appear > in order with no gaps. You have to construct this yourself since there's > no support for SpanQueries in the QueryParser yet. This'll avoid having > to deal with Wildcards, which have their own issues (try searching on > a thread "I just don't understand wildcards at all" for an exposition from > "the guys" on this. Thanks Erick, I'll try this. My only other question here though, is what if they specify an incomplete octet of an address? For example, I want ' 192.168.10' to match 192.168.10.1 and 192.168.100.1. How can I do this without wildcards, is there a way to put a PrefixQuery into the Span Query? Sorry if I don't make any sense
Re: Size of field?
Hi Erick!! You're right, I just used setMaxFieldLength() and everything works fine. You saved my life, thanks! (y) On 7/30/07, Erick Erickson <[EMAIL PROTECTED]> wrote: > > See IndexWriter.setMaxFieldLength(). 87,300 is odd, since the default > max field length, last I knew, was 10,000. But this sounds like > it might relate to your issue. > > Best > Erick > > On 7/27/07, Eduardo Botelho <[EMAIL PROTECTED]> wrote: > > > > Hi guys, > > > > I would like to know if there is some limit on the size of the fields of a > > document. > > > > I have the following problem: > > when a term appears after a certain number of characters (approximately 87,300) > > in a field, the search does not find the occurrence. > > If I divide my field into pages, the terms are found normally. > > This problem occurs when I make an exact query (query between quotes). > > > > What can be happening? > > > > I'm using BrazilianAnalyzer and StandardAnalyzer (for tests only) for both > > search and indexing. > > > > thanks... > > > > Sorry for my poor English... > >
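For reference, a minimal illustration of the setting Erick pointed to; the path, analyzer, and the chosen limit are arbitrary examples, and the call must be made before the long documents are added:

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
writer.setMaxFieldLength(Integer.MAX_VALUE); // default is 10,000 terms per field; terms beyond the limit are silently dropped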
Re: More IP/MAC indexing questions
I suspect you're going to have to deal with wildcards if you really want this functionality. Erick On 8/1/07, Joe Attardi <[EMAIL PROTECTED]> wrote: > > On 8/1/07, Erick Erickson <[EMAIL PROTECTED]> wrote: > > > > Use a SpanNearQuery with a slop of 0 and specify true for ordering. > > What that will do is require that the segments you specify must appear > > in order with no gaps. You have to construct this yourself since there's > > no support for SpanQueries in the QueryParser yet. This'll avoid having > > to deal with Wildcards, which have their own issues (try searching on > > a thread "I just don't understand wildcards at all" for an exposition > from > > "the guys" on this. > > > Thanks Erick, I'll try this. My only other question here though, is what > if > they specify an incomplete octet of an address? For example, I want ' > 192.168.10' to match 192.168.10.1 and 192.168.100.1. How can I do this > without wildcards, is there a way to put a PrefixQuery into the Span > Query? > > Sorry if I don't make any sense >
Re: More IP/MAC indexing questions
On 1-Aug-07, at 11:34 AM, Joe Attardi wrote: On 8/1/07, Erick Erickson <[EMAIL PROTECTED]> wrote: Use a SpanNearQuery with a slop of 0 and specify true for ordering. What that will do is require that the segments you specify must appear in order with no gaps. You have to construct this yourself since there's no support for SpanQueries in the QueryParser yet. This'll avoid having to deal with Wildcards, which have their own issues (try searching on a thread "I just don't understand wildcards at all" for an exposition from "the guys" on this. Thanks Erick, I'll try this. My only other question here though, is what if they specify an incomplete octet of an address? For example, I want ' 192.168.10' to match 192.168.10.1 and 192.168.100.1. How can I do this without wildcards, is there a way to put a PrefixQuery into the Span Query? If 192 168 10 1 are separate tokens, then a phrase query on "192 168 10" will find it. If it is a single token, then a wildcard or regex query is necessary. -Mike
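A sketch of the phrase-query form Mike describes, assuming the address was indexed as separate octet tokens in a field called "ip" (the field name is illustrative):

PhraseQuery query = new PhraseQuery(); // default slop is 0, so the octets must be adjacent and in order
query.add(new Term("ip", "192"));
query.add(new Term("ip", "168"));
query.add(new Term("ip", "10"));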
Re: Size of field?
Glad it worked out for you. Did you ever have any insight into what was magical about 87,300? Although now that I re-read your mail, that was the number of characters, so I can imagine that your corpus averaged 8.73 characters/word. Best Erick On 8/1/07, Eduardo Botelho <[EMAIL PROTECTED]> wrote: > > Hi Erick!! > > You're right, I just used setMaxFieldLength() and everything works fine. > > You saved my life, thanks! (y) > > On 7/30/07, Erick Erickson <[EMAIL PROTECTED]> wrote: > > > > See IndexWriter.setMaxFieldLength(). 87,300 is odd, since the default > > max field length, last I knew, was 10,000. But this sounds like > > it might relate to your issue. > > > > Best > > Erick > > > > On 7/27/07, Eduardo Botelho <[EMAIL PROTECTED]> wrote: > > > > > > Hi guys, > > > > > > I would like to know if there is some limit on the size of the fields of a > > > document. > > > > > > I have the following problem: > > > when a term appears after a certain number of characters (approximately 87,300) > > > in a field, the search does not find the occurrence. > > > If I divide my field into pages, the terms are found normally. > > > This problem occurs when I make an exact query (query between quotes). > > > > > > What can be happening? > > > > > > I'm using BrazilianAnalyzer and StandardAnalyzer (for tests only) for both > > > search and indexing. > > > > > > thanks... > > > > > > Sorry for my poor English... > > >
Re: High CPU usage during index and search
It sounds like you have a fairly busy system; perhaps 100% load on the process is not that strange, at least not during short periods of time. A simpler solution would be to nice the process a little bit in order to give your background jobs some more time to think. Running a profiler is still the best advice I can think of. It should clearly show you what is going on when you run out of CPU. -- karl On 1 Aug 2007, at 04:29, Chew Yee Chuang wrote: Hi, Thanks for the links provided. Actually I had gone through those articles when developing the index and search functions for my application. I haven't tried a profiler yet, but I monitor the CPU usage and notice that whenever indexing or searching is performed, the CPU usage rises to 100%. Below I will try to elaborate on what my application is doing and how I index and search. There are many concurrent processes running. First, the application writes the records it receives into a text file, with fields separated by tabs. The application points to a new file every 10 minutes and starts writing to it, so every file contains only 10 minutes of records, approximately 600,000 records per file. Then the indexing process checks whether there is a text file to be indexed; if there is, the thread wakes up and starts indexing. The indexing process first adds documents to a RAMDirectory, and then later adds the RAMDirectory into the FSDirectory by calling addIndexesNoOptimize() once there are 100,000 documents (32 fields per doc) in the RAMDirectory. Only one IndexWriter (FSDir) is created, but a few IndexWriters (RAMDir) are created during the whole process. Below is the configuration for the IndexWriters I mentioned: IndexWriter (RAMDir) - SimpleAnalyzer - setMaxBufferedDocs(1) - Field.Store.YES - Field.Index.NO_NORMS IndexWriter (FSDir) - SimpleAnalyzer - setMergeFactor(20) - addIndexesNoOptimize() As for searching, there are many queries (20,000) run continuously to generate the aggregate table for reporting purposes. All these queries run in a nested loop, and only one Searcher is created. I tried a searcher and a filter as well; the filter gives me better results, but both also use a lot of CPU resources. Hope this info helps, and sorry for my bad English. Thanks eChuang, Chew -Original Message- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: Tuesday, July 31, 2007 5:54 PM To: java-user@lucene.apache.org Subject: Re: High CPU usage during index and search On 31 Jul 2007, at 05:25, Chew Yee Chuang wrote: But I just noticed that when Lucene performs a search or indexes, the CPU usage on my machine rises to 100%; because of this, some of my other backend processes eventually slow down. I just want to know whether anyone has faced this problem before, and whether there is any idea on how to overcome it? Did you run a profiler to see what it is that consumes all the resources? It is very hard to guess based on the information you supplied. Start here: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/ImproveIndexingSpeed http://wiki.apache.org/lucene-java/ImproveSearchingSpeed -- karl
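For readers unfamiliar with the buffering pattern Chew describes, a rough sketch follows; the directory path, analyzer choice, and batching are assumptions for illustration, not Chew's actual code, and assume a Lucene version with addIndexesNoOptimize() (2.1 or later):

RAMDirectory ramDir = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(ramDir, new SimpleAnalyzer(), true);
// ... add documents to ramWriter in a loop, e.g. up to 100,000 of them ...
ramWriter.close();

IndexWriter fsWriter = new IndexWriter(FSDirectory.getDirectory("/path/to/index"), new SimpleAnalyzer(), false);
fsWriter.setMergeFactor(20);
fsWriter.addIndexesNoOptimize(new Directory[] { ramDir }); // merge the in-memory segments into the on-disk index
fsWriter.close();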
Re: Can I do boosting based on term positions?
Thanks for the quick response =) On 8/1/07, Shailendra Sharma <[EMAIL PROTECTED]> wrote: > Yes, it is easily doable through the "Payload" facility. During the indexing process > (mainly tokenization), you need to push this extra information into each > token. And then you can use BoostingTermQuery to include the Payload value > in the score. You also need to implement a Similarity for this (mainly the > scorePayload method). If I store, say, a custom boost factor as a Payload, does that mean I will store one more byte per term per document in the index file? So the index file would be much larger? > > The other way can be to extend SpanTermQuery; this already calculates the > position of the match. You just need to do something to use this position value > in the score calculation. I see that SpanTermQuery takes a TermPositions from the IndexReader and I can get the term position from there. However I am not sure how to incorporate it into the score calculation. Would you mind giving a little more detail on this? > > One possible advantage of the SpanTermQuery approach is that you can play > around without re-creating indices every time. > > Thanks, > Shailendra Sharma, > CTO, Ver se' Innovation Pvt. Ltd. > Bangalore, India > > On 8/1/07, Cedric Ho <[EMAIL PROTECTED]> wrote: > > > > Hi all, > > > > I was wondering if it is possible to do boosting by search terms' > > position in the document. > > > > For example: > > search terms appearing in the first 100 words, or the first 10% of words, or in the > > first two paragraphs would be given a higher score. > > > > Is this achievable using the new Payload function in Lucene 2.2? > > Or are there any easier ways to achieve it? > > > > > > Regards, > > Cedric Thanks, Cedric
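For the payload route Shailendra outlines, the indexing side might look roughly like the filter below. The 100-token cutoff and the one-byte boost encoding are invented for illustration, the payload API is still marked experimental in Lucene 2.2, and the query-side scoring pieces (BoostingTermQuery, Similarity.scorePayload) are not shown:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;

public class PositionBoostFilter extends TokenFilter {
    private int position = 0;

    public PositionBoostFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token token = input.next();
        if (token != null) {
            // give terms among the first 100 tokens a larger stored boost value
            byte boost = (byte) (position < 100 ? 2 : 1);
            token.setPayload(new Payload(new byte[] { boost }));
            position++;
        }
        return token;
    }
}

Yes, storing one byte per term occurrence does grow the index accordingly, which is the trade-off Cedric asks about.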
Solr's NumberUtils doesn't work
Hi, I am using NumberUtils to encode and decode numbers while indexing and searching. When I decode a number retrieved from the index, it throws an exception for some fields. The exception message is: Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 1 at java.lang.String.charAt(Unknown Source) at org.apache.solr.util.NumberUtils.SortableStr2int(NumberUtils.java:125) at org.apache.solr.util.NumberUtils.SortableStr2int(NumberUtils.java:37) at com.payvand.lucene.util.ExtendedNumberUtils.decodeInteger(ExtendedNumberUtils.java:123) I don't know why this happens; I am wondering if it has something to do with character encoding. Have you had such a problem? Thanks -- Regards, Mohammad Norouzi -- see my blog: http://brainable.blogspot.com/ another in Persian: http://fekre-motefavet.blogspot.com/
je-analysis.jar
Dear All, Who has je-analysis.jar? If somebody has it, can you send it to me? I don't have access to download anything on my computer right now. Thank you very much! Yours truly, Daniel
Using Nutch APIs in Lucene
How can we use Nutch APIs in Lucene? For example, using FetchedSegments we can get ParseText, from which we can get the content of the document. So can we use these classes (FetchedSegments, ParseText) in Lucene? If so, how do we use them? Thank you