Re: Tomcat Threads are BLOCKED after some time
Hi, I think it might be a case of hitting the limit on allowed open files at the OS level. Try setting a higher ulimit and rerun the program. Also, what GC parameters have you set on the JVM? Regards Varun Dhussa Product Architect CE InfoSystems (P) Ltd http://www.mapmyindia.com

damu_verse wrote: Hi, Thanks for the reply. We have not tested this against the versions mentioned (both java-1.6.12 and lucene-2.4), and moreover we cannot move to those versions right away, so we need a solution for this particular version only. Thanks & regards, damu

damu_verse wrote: Hi All, We have used Lucene as our search engine, and all our applications are deployed onto Tomcat and run with a thread pool size of 200.

Java Version - 1.6.0-rc
Lucene Version - 2.3.2
Tomcat Version - 6.0.14
OS - Red Hat Enterprise Linux ES release 4 (Nahant Update 5)
kernel - 2.6.9-55.0.2.ELsmp
RAM - 4 GB
Tomcat Memory - 1.5 GB
Index Size - 2 GB

After 10-12 hrs of running, Tomcat becomes unresponsive. After taking a core dump of the Tomcat process we observed that all Tomcat threads are blocked (thread pool size 200); none of the Tomcat threads are in a runnable state. Each thread at the time of the core dump is in the BLOCKED state. The following are the stack traces of the blocked threads:

"MultiSearcher thread #3" daemon prio=10 tid=0x337ddc00 nid=0x4827 waiting for monitor entry [0x2f2f..0x2f2f0ea0]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:235)
	- waiting to lock <0x45d49d88> (a org.apache.lucene.store.FSDirectory$FSIndexInput)
	at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
	at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
	at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
	at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
	at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:123)
	at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:154)
	at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
	at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
	at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54)
	at org.apache.lucene.index.IndexReader.termDocs(IndexReader.java:668)
	at org.apache.lucene.search.ConstantScoreTermQuery$TermWeight.scorer(ConstantScoreTermQuery.java:63)
	at org.apache.lucene.search.VBooleanQuery$BooleanWeight.scorer(VBooleanQuery.java:276)
	at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:232)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:143)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:124)
	at org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:250)

"http-8080-194" daemon prio=10 tid=0x08927800 nid=0x128d waiting for monitor entry [0x2e188000..0x2e189e20]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:235)
	- waiting to lock <0x45d49d88> (a org.apache.lucene.store.FSDirectory$FSIndexInput)
	at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
	at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
	at org.apache.lucene.store.IndexInput.readVLong(IndexInput.java:96)
	at org.apache.lucene.index.MultiLevelSkipListReader.loadSkipLevels(MultiLevelSkipListReader.java:196)
	at org.apache.lucene.index.MultiLevelSkipListReader.skipTo(MultiLevelSkipListReader.java:97)
	at org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:164)
	at in.verse.search.query.spans.TermSpans.skipTo(TermSpans.java:85)
	at in.verse.search.query.spans.SpanScorer.skipTo(SpanScorer.java:70)
	at org.apache.lucene.search.VConjunctionScorer.doNext(VConjunctionScorer.java:78)
	at org.apache.lucene.search.VConjunctionScorer.next(VConjunctionScorer.java:71)
	at org.apache.lucene.search.VBooleanScorer2.next(VBooleanScorer2.java:456)
	at org.apache.lucene.search.VConjunctionScorer.init(VConjunctionScorer.java:136)
	at org.apache.lucene.search.VConjunctionScorer.next(VConjunctionScorer.java:65)
	at org.apache.lucene.search.VBooleanScorer2.score(VBooleanScorer2.java:412)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
	at org.apache.lucene.search.ParallelMultiSearcher.search(ParallelMultiSearcher.java:173)
	at org.apache.lucene.search.Searcher.search(Searcher.java:118)
	at org.apache.lucene.search.Searcher.search(Searcher.java:97)
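Both traces above are waiting on the same monitor (<0x45d49d88>, the FSIndexInput underneath the compound file), i.e. every search thread funnels its reads through one synchronized stream. This is not something the thread itself settled on, but a mitigation sometimes suggested for this generation of Lucene is to store the index in non-compound format so that each index file gets its own stream. A hedged sketch, with the index path a placeholder:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class NonCompoundRebuild {
    public static void main(String[] args) throws Exception {
        // "/path/to/index" is a placeholder; false = append to the existing index.
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
        // Write one file per segment component instead of a single .cfs,
        // so concurrent readers no longer all synchronize on one stream.
        writer.setUseCompoundFile(false);
        // optimize() rewrites the segments under the new setting, so an
        // index already in compound format gets converted.
        writer.optimize();
        writer.close();
    }
}

This reduces, rather than eliminates, contention on the underlying file, and it costs more open file descriptors, which ties back to the ulimit suggestion above.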
Re: crawler questions..
That's interesting. I've been working in python recently, not crawling though. But, as ever, the more you get into it the more curious you get. Did you come up with a solution to a node error? Are you really talking about a broken link, or are you just saying the bottom of the tree has been reached? Presumably the last one would be when every link on every page has been followed, which means you have to track what pages have been crawled and find a way of uniquely and correctly identifying them internally? I think the problem is that while a URL might be unique, there can be more than one URL pointing to the same content - for instance in struts where action a and action b are appended to a URL but produce the same result. I believe I am right about this. In the site that I am working on google have told us they are unable to crawl the whole site because some URLs result in a loop - another problem. It would be cool if you have solved these sorts of problems, or rather can identify where they are on a site in a quick and easy way. Best, Adam 2009/3/4 bruce > Hi... > > Sorry that this is a bit off track. Ok, maybe way off track! > > But I don't have anyone to bounce this off of.. > > I'm working on a crawling project, crawling a college website, to extract > course/class information. I've built a quick test app in python to crawl > the > site. I crawl at the top level, and work my way down to getting the > required > course/class schedule. The app works. I can consistently run it and extract > the information. > > My issue is now that I have a "basic" app that works, i need to figure out > how I guarantee that I'm correctly crawling the site. How do I know when > I've got an error at a given node/branch, so that the app knows that it's > not going to fetch the underlying branch/nodes of the tree.. > > How do I know when I have a complete "tree"! > > I'm looking for someone, or some group/prof that I can talk to about these > issues. My goal is to eventually look at using nutch/lucene if at all > applicable. > > Any pointers, or people, or papers, etc... would be helpful. > > Thanks >
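On the "more than one URL pointing to the same content" point, one common trick, sketched here in Java purely as an illustration (the normalization rules are deliberately crude), is to key the visited set on a normalized URL plus a hash of the fetched content, which also catches the crawl loops mentioned above:

import java.math.BigInteger;
import java.net.URI;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class CrawlDeduper {
    private final Set<String> seenUrls = new HashSet<String>();
    private final Set<String> seenContent = new HashSet<String>();

    // Lowercase the host, drop fragments, strip a trailing slash: crude,
    // but it catches the easy aliases. Real crawlers need per-site rules too.
    String normalize(String url) throws Exception {
        URI u = new URI(url).normalize();
        String path = u.getPath() == null ? "" : u.getPath();
        if (path.endsWith("/")) path = path.substring(0, path.length() - 1);
        return u.getScheme() + "://" + u.getHost().toLowerCase() + path
                + (u.getQuery() == null ? "" : "?" + u.getQuery());
    }

    // True if this page is new by URL *and* by content, so struts-style
    // aliases and loops where different URLs serve the same page collapse.
    boolean shouldIndex(String url, byte[] body) throws Exception {
        if (!seenUrls.add(normalize(url))) return false;
        MessageDigest md = MessageDigest.getInstance("MD5");
        return seenContent.add(new BigInteger(1, md.digest(body)).toString(16));
    }
}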
IndexSearcher
Hi, I would like to do a search that will return documents that contain a given word. For example, I created the following index:

IndexWriter writer = new IndexWriter("C:/TryIndex", new StandardAnalyzer());
Document doc = new Document();
doc.add(new Field(WordIndex.FIELD_WORLDS, "111 222 333", Field.Store.YES, Field.Index.UN_TOKENIZED));
writer.addDocument(doc);
doc = new Document();
doc.add(new Field(WordIndex.FIELD_WORLDS, "111", Field.Store.YES, Field.Index.UN_TOKENIZED));
writer.addDocument(doc);
doc = new Document();
doc.add(new Field(WordIndex.FIELD_WORLDS, "222 333", Field.Store.YES, Field.Index.UN_TOKENIZED));
writer.addDocument(doc);
writer.optimize();
writer.close();

Now I want to get all the documents that contain the word "222". I tried to run the following code but it doesn't return any doc:

IndexSearcher searcher = new IndexSearcher(indexPath);
// TermQuery mapQuery = new TermQuery(new Term(FIELD_WORLDS, worldNum)); - this one also didn't work
Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser(FIELD_WORLDS, analyzer);
Query query = parser.parse(worldNum);
Hits mapHits = searcher.search(query);

Thanks a lot, Liat
Re: IndexSearcher
I think your root problem is that you're indexing UN_TOKENIZED, which means that the tokens you're adding to your index are NOT run through the analyzer. So your terms are exactly "111", "222 333" and "111 222 333", none of which match "222". I expect you wanted your tokens to be "111", "222", and "333", each appearing twice in your index. Try indexing them tokenized. Although note that I don't remember what StandardAnalyzer does with numbers. WhitespaceAnalyzer does the more intuitive thing, but beware that it doesn't fold case. But it might be an easier place for you to start until you get more comfortable with what various analyzers do. Also, I *strongly* advise that you get a copy of Luke. It is a wonderful tool that allows you to examine your index, analyze queries, test queries, etc. But be aware that the site that maintains Luke was having problems yesterday; look over the user list messages from yesterday if you have problems. Best Erick

On Thu, Mar 5, 2009 at 8:40 AM, liat oren wrote:
> Hi, I would like to do a search that will return documents that contain a given word. For example, I created the following index:
>
> IndexWriter writer = new IndexWriter("C:/TryIndex", new StandardAnalyzer());
> Document doc = new Document();
> doc.add(new Field(WordIndex.FIELD_WORLDS, "111 222 333", Field.Store.YES, Field.Index.UN_TOKENIZED));
> writer.addDocument(doc);
> doc = new Document();
> doc.add(new Field(WordIndex.FIELD_WORLDS, "111", Field.Store.YES, Field.Index.UN_TOKENIZED));
> writer.addDocument(doc);
> doc = new Document();
> doc.add(new Field(WordIndex.FIELD_WORLDS, "222 333", Field.Store.YES, Field.Index.UN_TOKENIZED));
> writer.addDocument(doc);
> writer.optimize();
> writer.close();
>
> Now I want to get all the documents that contain the word "222". I tried to run the following code but it doesn't return any doc:
>
> IndexSearcher searcher = new IndexSearcher(indexPath);
> // TermQuery mapQuery = new TermQuery(new Term(FIELD_WORLDS, worldNum)); - this one also didn't work
> Analyzer analyzer = new StandardAnalyzer();
> QueryParser parser = new QueryParser(FIELD_WORLDS, analyzer);
> Query query = parser.parse(worldNum);
> Hits mapHits = searcher.search(query);
>
> Thanks a lot,
> Liat
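To make that concrete, a minimal sketch of the tokenized variant (WhitespaceAnalyzer chosen so the numeric tokens pass through unchanged; the directory name is borrowed from the original post):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class TokenizedExample {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("C:/TryIndex", new WhitespaceAnalyzer());
        Document doc = new Document();
        // TOKENIZED (instead of UN_TOKENIZED): "111 222 333" is split into
        // the three terms 111, 222 and 333.
        doc.add(new Field("worlds", "111 222 333", Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // A plain TermQuery now matches, because "222" exists as its own term.
        IndexSearcher searcher = new IndexSearcher("C:/TryIndex");
        Hits hits = searcher.search(new TermQuery(new Term("worlds", "222")));
        System.out.println(hits.length() + " hit(s)");  // expect 1
        searcher.close();
    }
}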
Learning Lucene
Dear all, I am really new to Lucene. Is there anyone who can guide me in learning Lucene? I have the old Lucene in Action book, but I have a hard time relating the syntax in the book to the new Lucene release (2.4). Can anyone give me a copy of the new Lucene in Action book or any other material that I can go through? Thanks a lot, Tuztuz
RE: Learning Lucene
Hi Tuztuz, Please visit the book's website and the forum. You will get most queries cleared. Sincerely, Sithu D Sudarsan -Original Message- From: Tuztuz T [mailto:tuztu...@yahoo.com] Sent: Thursday, March 05, 2009 9:24 AM To: java-user@lucene.apache.org Subject: Learning Lucene Dear all, I am really new to Lucene. Is there anyone who can guide me in learning Lucene? I have the old Lucene in Action book, but I have a hard time relating the syntax in the book to the new Lucene release (2.4). Can anyone give me a copy of the new Lucene in Action book or any other material that I can go through? Thanks a lot, Tuztuz
public apology for company spam
This morning, an apparently over-zealous marketing firm, on behalf of the company I work for, sent out a marketing email to a large number of subscribers of the Lucene email lists. This was done without my knowledge or approval, and I can assure you that I'll make all efforts to prevent it from happening again. Sincerest apologies, -Yonik
similarity function
For my work, I have read an article stating that "Answer types can be automatically constructed by indexing different questions and answer types. Later, when an unseen question appears, the answer type for this question will be found with the help of a 'similarity function' computation."

So I am clear with the argument above. My problem is:
1. How can I index individual questions and answer types as is (not tokenized)?
2. How can I calculate the similarity between indexed questions and unseen questions (questions of any type that can be asked later)?

To make things clear, the scenario is:
1. Who is the president of UN? Answer type:
2. When will the presidency of Meles Zenawi hold? Answer type:
These two will be indexed, and later an unseen question like "who is the president of Kenya" should match the first question and so will have the answer type of the first.

I appreciate any help. Seid M
Re: Learning Lucene
On Mar 5, 2009, at 9:24 AM, Tuztuz T wrote: Dear all, I am really new to Lucene. Is there anyone who can guide me in learning Lucene? I have the old Lucene in Action book, but I have a hard time relating the syntax in the book to the new Lucene release (2.4). Can anyone give me a copy of the new Lucene in Action book or any other material that I can go through? The second edition is available through Manning's MEAP program already. Still some writing left to do on it, and hopefully 2.9 will be out first, before it goes to print, but it has been updated to the latest API and contains lots of great new material, primarily thanks to Mike McCandless. http://www.manning.com/hatcher3/ Erik
Re: public apology for company spam
Yonik, Thank you for your email. I appreciate and accept your apology. Indeed the spam was annoying, but I think that you and your colleagues have significant social capital in the Lucene and Solr communities, so this minor but unfortunate incident should have minimal impact. That said, you and your colleagues do not have infinite social capital, and hopefully you will have no reason to be forced to spend this capital in such an unfortunate manner in the future. :-) sincerely, Glen Newton 2009/3/5 Yonik Seeley : > This morning, an apparently over-zealous marketing firm, on behalf of > the company I work for, sent out a marketing email to a large number > of subscribers of the Lucene email lists. This was done without my > knowledge or approval, and I can assure you that I'll make all efforts > to prevent it from happening again. > > Sincerest apologies, > -Yonik
indexing but not tokenizing
Hi all, I'm not able to see what's wrong in the following sample code. I'm indexing a document with 5 fields, using five different indexing strategies. I'm fine with the results for 4 of them, but field B is causing me some trouble in understanding what's going on.

The value of field B is X (uppercase). The analyzer is a SimpleAnalyzer, which I use on the QueryParser as well. But when I search for X (uppercase) on field B, the X is converted to lowercase. Now, I know that SimpleAnalyzer converts to lowercase, but I was expecting it not to do so on field B, which is NOT_ANALYZED.

How should I fix my code?

Thank you in advance! -John

--- code ---

package test;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocCollector;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.queryParser.QueryParser;

public class Test
{
    public static void main(String[] args)
    {
        try
        {
            RAMDirectory idx = new RAMDirectory();
            SimpleAnalyzer analyzer = new SimpleAnalyzer();

            IndexWriter writer = new IndexWriter(idx, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);

            Document doc = new Document();
            doc.add(new Field("A", "X", Field.Store.YES, Field.Index.NO));
            doc.add(new Field("B", "X", Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("C", "X", Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("D", "x", Field.Store.NO, Field.Index.NOT_ANALYZED));
            doc.add(new Field("E", "X", Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            IndexSearcher searcher = new IndexSearcher(idx);
            String field = "B";
            QueryParser parser = new QueryParser(field, analyzer);
            Query query = parser.parse("X");
            System.out.println("Query: " + query.toString());

            TopDocCollector collector = new TopDocCollector(1);
            searcher.search(query, collector);
            int numHits = collector.getTotalHits();
            System.out.println(numHits + " total matching documents");

            if (numHits > 0)
            {
                ScoreDoc[] hits = collector.topDocs().scoreDocs;
                doc = searcher.doc(hits[0].doc);
                System.out.println("A: " + doc.get("A"));
                System.out.println("B: " + doc.get("B"));
                System.out.println("C: " + doc.get("C"));
                System.out.println("D: " + doc.get("D"));
                System.out.println("E: " + doc.get("E"));
            }
        }
        catch (Exception e)
        {
            System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
        }
    }
}
Re: indexing but not tokenizing
Hi, I think that the SimpleAnalyzer you are passing to the query parser will be downcasing the X. You can fix it using an analyzer that doesn't convert to lower case, creating the query directly in code, or by using PerFieldAnalyzerWrapper, and no doubt other ways too. If you want a direct suggestion: use PerFieldAnalyzerWrapper, specifying a different analyzer for field B. -- Ian.

On Thu, Mar 5, 2009 at 3:17 PM, John Marks wrote:
> Hi all, I'm not able to see what's wrong in the following sample code. I'm indexing a document with 5 fields, using five different indexing strategies. I'm fine with the results for 4 of them, but field B is causing me some trouble in understanding what's going on.
>
> The value of field B is X (uppercase). The analyzer is a SimpleAnalyzer, which I use on the QueryParser as well. But when I search for X (uppercase) on field B, the X is converted to lowercase. Now, I know that SimpleAnalyzer converts to lowercase, but I was expecting it not to do so on field B, which is NOT_ANALYZED.
>
> How should I fix my code?
>
> Thank you in advance! -John
>
> --- code ---
>
> package test;
>
> import org.apache.lucene.analysis.SimpleAnalyzer;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.TopDocCollector;
> import org.apache.lucene.search.ScoreDoc;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.queryParser.QueryParser;
>
> public class Test
> {
>     public static void main(String[] args)
>     {
>         try
>         {
>             RAMDirectory idx = new RAMDirectory();
>             SimpleAnalyzer analyzer = new SimpleAnalyzer();
>
>             IndexWriter writer = new IndexWriter(idx, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
>
>             Document doc = new Document();
>             doc.add(new Field("A", "X", Field.Store.YES, Field.Index.NO));
>             doc.add(new Field("B", "X", Field.Store.YES, Field.Index.NOT_ANALYZED));
>             doc.add(new Field("C", "X", Field.Store.YES, Field.Index.ANALYZED));
>             doc.add(new Field("D", "x", Field.Store.NO, Field.Index.NOT_ANALYZED));
>             doc.add(new Field("E", "X", Field.Store.NO, Field.Index.ANALYZED));
>             writer.addDocument(doc);
>             writer.close();
>
>             IndexSearcher searcher = new IndexSearcher(idx);
>             String field = "B";
>             QueryParser parser = new QueryParser(field, analyzer);
>             Query query = parser.parse("X");
>             System.out.println("Query: " + query.toString());
>
>             TopDocCollector collector = new TopDocCollector(1);
>             searcher.search(query, collector);
>             int numHits = collector.getTotalHits();
>             System.out.println(numHits + " total matching documents");
>
>             if (numHits > 0)
>             {
>                 ScoreDoc[] hits = collector.topDocs().scoreDocs;
>                 doc = searcher.doc(hits[0].doc);
>                 System.out.println("A: " + doc.get("A"));
>                 System.out.println("B: " + doc.get("B"));
>                 System.out.println("C: " + doc.get("C"));
>                 System.out.println("D: " + doc.get("D"));
>                 System.out.println("E: " + doc.get("E"));
>             }
>         }
>         catch (Exception e)
>         {
>             System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
>         }
>     }
> }
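For completeness, a rough sketch of the PerFieldAnalyzerWrapper route (the choice of KeywordAnalyzer for field B is just one way to leave the term untouched):

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class PerFieldExample {
    public static void main(String[] args) throws Exception {
        // SimpleAnalyzer for everything else, but leave field "B" alone so
        // the query term keeps its case and matches the NOT_ANALYZED value.
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new SimpleAnalyzer());
        wrapper.addAnalyzer("B", new KeywordAnalyzer());

        QueryParser parser = new QueryParser("B", wrapper);
        Query query = parser.parse("X");
        System.out.println("Query: " + query);  // expect B:X, not B:x
    }
}

The same wrapper should be used at index time if any analyzed fields need per-field treatment too, so indexing and searching stay symmetric.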
Re: Re: Re: Lucene in large database contexts
On 8/10/07, Askar Zaidi wrote: Hey Guys, I am trying to do something similar: make the content searchable as soon as it is added to the website. The way it can work in my scenario is that I create the index for every new user account created. Then, whenever a new document is uploaded, its contents are added to the user's index using writer.addDocument(...). As for closing the writer, yes! I'll close the writer and optimize after it's added to the index. I really think this should work. Don't you? thanks AZ

On 8/10/07, Erick Erickson wrote: Well, closing/opening an index is MUCH less expensive than rebuilding the whole thing, so I don't understand part of your statements. It *may* (but I haven't tried it) be possible to flush the writer rather than close/open it. But you MUST close/reopen the reader you search with even if flush works like I think it does. But it's also possible to use a two-tiered approach. 1G isn't all that big. Could you read it into a RAMDir and use that for your searches? Then, when you add data, you add it to *both* indexes, but close/open the RAMDir for searching. It's also possible to keep the RAMDir as the delta between the FSDir and "current" states of your index. Add to both and search both, although deletes may be a problem here. You haven't specified how often you expect changes, though. 100/second? 1/minute? How real is "real time"? You could do something like warm up a new reader in the background whenever you decided you needed to be absolutely up to date, and swap your "live" reader for the newly warmed-up one whenever you deemed it wise. Or you could just close/open your reader after each modification, fire off a couple of warm-up queries at it, and let the users live with slow responses if they happen to search before your warm-up queries completed. The point is that there are many options, but to suggest the best one, we need some throughput numbers and a better definition of what "real time" means. Is a one minute delay acceptable? 10 seconds? A millisecond? The answer defines the scope of reasonable solutions. Best Erick

On 8/10/07, Antonello Provenzano wrote: Kai, The context I'm going to work with requires a continuous addition of documents to the indexes, since it's user-driven content, and this would require the content to be always up-to-date. This is the problem I'm facing, since I cannot rebuild a 1Gb (at least) index every time a user inserts a new entry into the database. I know Digg, for instance, is using Lucene as search engine: since the amount of data they're dealing with is much higher than mine, I would like to understand the way they used to implement this kind of solution. Thank you again. Antonello

On 8/10/07, Kai Hu wrote: Antonello, You are right; I think the Lucene IndexSearcher will search the old information if the IndexWriter was not closed (I think Lucene releases the lock here), so I only add a few documents every time from a buffer to implement "real time" indexing. kai

From: antonellop...@gmail.com [mailto:antonellop...@gmail.com] On behalf of Antonello Provenzano Sent: Friday, August 10, 2007 17:59 To: java-user@lucene.apache.org Subject: Re: Re: Lucene in large database contexts

Kai, Thanks. The problem I see is that although I can add a Document through IndexWriter or IndexModifier, it won't be searchable until the index is closed and, possibly, optimized, since the score of the document in the index context must be re-calculated on the basis of the whole context. Is this assumption true? Or am I completely wrong? Cheers. Antonello

On 8/10/07, Kai Hu wrote: Hi Antonello, You can use IndexWriter.addDocument(Document document) to add a single document; the same goes for update and delete operations. kai

-----Original Message----- From: Antonello Provenzano [mailto:antonellop...@gmail.com] Sent: Friday, August 10, 2007 17:09 To: java-user@lucene.apache.org Subject: Lucene in large database contexts

Hi There! I've been working for a while on the implementation of a website oriented to contents that would contain millions of entries, most of them indexable (such as descriptions, texts, names, etc.). The ideal solution to make them searchable would be to use Lucene as index and search engine. The reason I'm posting to the mailing list is the following: since all the entries will be stored in a database (most likely MySQL InnoDB or Oracle), what's the best technique to implement a system that indexes the content in "real time" (e.g. when an entry is inserted into the database) and makes it searchable? Based on my understanding of Lucene, such a thing is not possible, since the index must be re-created to be able to search the indexed contents. Is this true? Eventually, could anyone point me to a working example of how to implement such a context? Thank you for the support. Antonello --
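The close-the-writer-then-reopen-the-reader cycle this thread keeps circling around might look roughly like the following sketch (2.x-era API; the index path is a placeholder, and batching more than one add per close would almost certainly be needed at any real throughput):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

public class RefreshingSearch {
    private IndexSearcher searcher;
    private final String path = "/path/to/index";  // placeholder

    // Adds a document and makes it visible to the next search. Closing the
    // writer commits the segment; swapping in a new searcher picks it up.
    public synchronized void addAndRefresh(Document doc) throws Exception {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), false);
        writer.addDocument(doc);
        writer.close();  // expensive: batch several adds per close if possible

        IndexSearcher fresh = new IndexSearcher(path);
        IndexSearcher old = searcher;
        searcher = fresh;          // in-flight searches keep using the old one
        if (old != null) old.close();
    }

    public synchronized IndexSearcher getSearcher() {
        return searcher;
    }
}

Warm-up queries against the fresh searcher, as Erick suggests, would slot in just before the swap.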
Re: public apology for company spam
Let's see, you guys generously contributed your time and saved my butt way more than once. I *think* I can stand an inadvertent message or two ... Best Erick On Thu, Mar 5, 2009 at 10:12 AM, Glen Newton wrote: > Yonik, > > Thank you for your email. I appreciate and accept your apology. > > Indeed the spam was annoying, but I think that you and your colleagues > have significant social capital in the Lucene and Solr communities, so > this minor but unfortunate incident should have minimal impact. > > That said, you and your colleagues do not have infinite social > capital, and hopefully you will have no reason to be forced to spend > this capital in such an unfortunate manner in the future. :-) > > sincerely, > > Glen Newton > > 2009/3/5 Yonik Seeley : > > This morning, an apparently over-zealous marketing firm, on behalf of > > the company I work for, sent out a marketing email to a large number > > of subscribers of the Lucene email lists. This was done without my > > knowledge or approval, and I can assure you that I'll make all efforts > > to prevent it from happening again. > > > > Sincerest apologies, > > -Yonik
Instantiating a RAMDirectory from a mutating directory
Hello, I would like to be able to instantiate a RAMDirectory from a directory that an IndexWriter in another process might currently be modifying. Ideally, I would like to do this without any synchronizing or locking, kind of like the way in which an IndexReader can open an index in a directory even if it's currently being modified by an IndexWriter. However, simply calling:

RAMDirectory rd = new RAMDirectory("/path/to/index");

will not work. It will periodically fail with a FileNotFoundException. It's fairly obvious why this happens: Directory.copy() gets a list of the files it needs to copy, and then copies them into the RAMDirectory instance one-by-one. If, in the meantime, the IndexWriter deletes one of these files, a FileNotFoundException occurs.

One thought that I had was that I would take advantage of the fact that it's possible to open an IndexReader on the mutating directory, and then use the "addIndexes()" method, as follows:

// 1. create RAMDirectory.
RAMDirectory ramDirectory = new RAMDirectory();
// 2. create an index in the RAMDirectory.
IndexWriter writer = new IndexWriter(ramDirectory, null /*analyzer*/, true /*create*/);
// 3. open the (possibly mutating) source index.
IndexReader reader = IndexReader.open("/path/to/index");
// 4. copy the source index into the RAMDirectory index.
writer.addIndexes(new IndexReader[] {reader});

However ... there is a fairly unambiguous warning in IndexWriter.addIndexes()'s documentation:

>> NOTE: the index in each Directory must not be changed (opened by a writer) while this method is running. This method does not acquire a write lock in each input Directory, so it is up to the caller to enforce this.

I'm slightly confused by this warning though, as IndexReader's documentation implies that it is OK to open an IndexReader in this fashion. I'm wondering whether anyone knows the internals of IndexWriter.addIndexes() well enough to know whether my proposed solution will work reliably? Or, indeed, whether there might be another way of instantiating a RAMDirectory from a directory which might currently be being modified by an IndexWriter? Many thanks in advance, Kieran Topping
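For what it's worth, one low-tech workaround that matches the failure mode described above is simply to retry the copy until it sees a consistent snapshot; a sketch under the assumption that an occasional extra copy attempt is acceptable:

import java.io.FileNotFoundException;
import org.apache.lucene.store.RAMDirectory;

public class RamDirSnapshot {
    // Retries RAMDirectory's bulk copy when the source index changes mid-copy.
    public static RAMDirectory open(String path, int maxAttempts) throws Exception {
        for (int i = 1; ; i++) {
            try {
                return new RAMDirectory(path);
            } catch (FileNotFoundException e) {
                // A file listed at the start of the copy was deleted by the
                // writer before we reached it; the directory has moved on,
                // so start over with a fresh file list.
                if (i == maxAttempts) throw e;
            }
        }
    }
}

This does not address the addIndexes() warning itself; it only retries the plain constructor path that was observed to fail.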
Re: similarity function
Hi, Given that you are trying to answer factoid questions in the first place, it is better to use OpenNLP components to identify named entities (NER, Named Entity Recognition) in the document and use those tags as part of your indexing process. Regards Vasu On Thu, Mar 5, 2009 at 8:19 PM, Seid Mohammed wrote: > For my work, I have read an article stating that "Answer types can be > automatically constructed by indexing different questions and answer > types. Later, when an unseen question appears, the answer type for this > question will be found with the help of a 'similarity function' > computation." > > So I am clear with the argument above. My problem is: > 1. How can I index individual questions and answer types as is (not > tokenized)? > 2. How can I calculate the similarity between indexed questions and > unseen questions (questions of any type that can be asked later)? > > To make things clear, the scenario is: > 1. Who is the president of UN? Answer type: > 2. When will the presidency of Meles Zenawi hold? Answer type: > These two will be indexed, and later an unseen question like > "who is the president of Kenya" > should match the first question and so will have the answer > type of the first. > > I appreciate any help. > > Seid M
Re: public apology for company spam
Yes, it is good to learn that Yonik, Erik et al are also human beings. :-) Thanks for all your contributions to Lucene/Solr, this list and the OSS community in general. Best, Shashi On Thu, Mar 5, 2009 at 11:36 AM, Erick Erickson wrote: > Let's see, you guys generously contributed your time and saved > my butt way more than once. I *think* I can stand an inadvertent > message or two ... > > Best > Erick > > On Thu, Mar 5, 2009 at 10:12 AM, Glen Newton > wrote: > > Yonik, > > > > Thank you for your email. I appreciate and accept your apology. > > > > Indeed the spam was annoying, but I think that you and your colleagues > > have significant social capital in the Lucene and Solr communities, so > > this minor but unfortunate incident should have minimal impact. > > > > That said, you and your colleagues do not have infinite social > > capital, and hopefully you will have no reason to be forced to spend > > this capital in such an unfortunate manner in the future. :-) > > > > sincerely, > > > > Glen Newton > > > > 2009/3/5 Yonik Seeley : > > > This morning, an apparently over-zealous marketing firm, on behalf of > > > the company I work for, sent out a marketing email to a large number > > > of subscribers of the Lucene email lists. This was done without my > > > knowledge or approval, and I can assure you that I'll make all efforts > > > to prevent it from happening again. > > > > > > Sincerest apologies, > > > -Yonik
Re: similarity function
Hi Seid, Do you have a reference for the article? I've done some QA in my day, but don't recall reading that one. At any rate, I do think it is possible to do what you are after. See below. On Mar 5, 2009, at 9:49 AM, Seid Mohammed wrote: So I am clear with the argument above. My problem is: 1. How can I index individual questions and answer types as is (not tokenized)? I'm not sure you want this, but when constructing your Field, just use the NOT_ANALYZED option. 2. How can I calculate the similarity between indexed questions and unseen questions (questions of any type that can be asked later)? In line with #1, I think you might be better off to actually tokenize the question as one field, and the answer type as a second field. Then, you can let Lucene calculate similarity via its normal query mechanisms. In this case, I would like to try experimenting with things like exact match, phrase queries with slop, etc. That way, not only can you match "Who is the president of UN" but you might also match on things that are a bit fuzzier. To do this, you might need to have several fields per document with variations. I could also see using Lucene's payload mechanism as well. But, as Vasu said, you will likely need other parts too, like OpenNLP. HTH, Grant
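A small illustration of the two-field layout Grant describes (the field names, index path, and PERSON label are all invented for the example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;

public class QuestionIndex {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/qa-index", new StandardAnalyzer(), true);
        Document doc = new Document();
        // The question text is tokenized so partial matches work...
        doc.add(new Field("question", "Who is the president of UN",
                Field.Store.YES, Field.Index.ANALYZED));
        // ...while the answer type is one exact label we just carry along.
        doc.add(new Field("answerType", "PERSON",
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // The unseen question is itself the query; the best-scoring stored
        // document lends the new question its answer type.
        IndexSearcher searcher = new IndexSearcher("/tmp/qa-index");
        QueryParser parser = new QueryParser("question", new StandardAnalyzer());
        Hits hits = searcher.search(parser.parse("who is the president of Kenya"));
        if (hits.length() > 0)
            System.out.println("answer type: " + hits.doc(0).get("answerType"));
        searcher.close();
    }
}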
Re: similarity function
Sounds like your most difficult part will be the question parser using POS. This is kind of old school, but use something like the AliceBot AIML library (http://en.wikipedia.org/wiki/AIML), where the subjective terms can be extracted from the questions and indexed separately. Or, as Grant and others suggest, use OpenNLP (which rocks) or LingPipe (the LingPipe license is a little bit of a pain) for entity extraction. An interesting way to look at the data would be to construct 3 fields: Original_Question, Question_base, Subject.

Doc:
  Original_Question: Who is the president of the UN
  Question_base: Who is the president of
  Question_base: Who is
  Subject: the president of the UN
  Subject: the president
  Subject: the UN
/Doc

And similarity can be somewhat easier to calculate with similar question bases, subjects, etc. P

On Thu, Mar 5, 2009 at 3:05 PM, Grant Ingersoll wrote: > Hi Seid, > > Do you have a reference for the article? I've done some QA in my day, but > don't recall reading that one. > > At any rate, I do think it is possible to do what you are after. See > below. > > On Mar 5, 2009, at 9:49 AM, Seid Mohammed wrote: > > So I am clear with the argument above. My problem is: >> 1. How can I index individual questions and answer types as is (not >> tokenized)? >> > > I'm not sure you want this, but when constructing your Field, just use the > NOT_ANALYZED option. > > >> 2. How can I calculate the similarity between indexed questions and >> unseen questions (questions of any type that can be asked later)? >> > > In line with #1, I think you might be better off to actually tokenize the > question as one field, and the answer type as a second field. Then, you > can let Lucene calculate similarity via its normal query mechanisms. In > this case, I would like to try experimenting with things like exact match, > phrase queries with slop, etc. That way, not only can you match "Who is the > president of UN" but you might also match on things that are a bit fuzzier. > To do this, you might need to have several fields per document with > variations. I could also see using Lucene's payload mechanism as well. > > But, as Vasu said, you will likely need other parts too, like OpenNLP. > > HTH, > Grant
Re: Query against newly created index.. Do not work
: I can now create indexes with Nutch, and see them in Luke.. this is : fantastic news, well for me it is beyond fantastic.. : Now I would like to (need to) query them, and to that end I wrote the : following code segment.
:
: int maxHits = 1000;
: NutchBean nutchBean = new NutchBean(nutchConf);
: Query nutchQuery = Query.parse(nutchSearchTerm, nutchConf);
: Hits nutchHits = nutchBean.search(nutchQuery, maxHits);
: return nutchHits.getLength();

...even though your code is written in java, "java-u...@lucene" isn't the appropriate mailing list for this type of question; java-user is for users of the Lucene Java API that is the underpinnings of Nutch (it's slightly confusing that the sub-project name has java in it). If you ask your question on the nutch-u...@lucene mailing list, I'm guessing you'll get a lot of feedback from people who are familiar with the Nutch java code. (Most people on this list probably have no idea what a NutchBean is.) -Hoss
execute on server and read from file
Hello. I have data files on a web server that contain some values (I need to build a chart from them). I made an applet that reads the information from the file and builds the chart, but when I upload the applet to the server, it doesn't find the files. Can you please suggest how I can make a Java program that executes on the server and reads the files there? Thank you
RE: Confidence scores at search time
: > Hmm, bugzilla has moved to JIRA. I'm not sure where the mapping is : > anymore. There used to be a Bugzilla Id in JIRA, I think. Sorry. FYI... by default the jira homepage has a form for searching by legacy bugzilla ID... https://issues.apache.org/jira/ ...if you create a Jira account you can customize that page (which is why some people might not see it if they are logged in). Also: if you go to "Find Issues" and select a project that was migrated from Bugzilla, you can then click the link that appears to refresh the search menu to show you new options specific to that project ... a search-by-bugzilla-id box will appear at the bottom of the left nav. -Hoss
Re: Confidence scores at search time
: That being said, I could see maybe determining a delta value such that if the : distance between any two scores is more than the delta, you cut off the rest : of the docs. This takes into account the relative state of scores and is not : some arbitrary value (although, the delta is, of course)

I read an interesting paper a while back that suggested a similar strategy for a related problem... http://www.isi.edu/integration/people/michelso/paps/ijdar2007.pdf ...while the whole paper might be interesting to some, the relevant parts to this discussion are Section 2.1 and Table 1. The goal there is to identify which reference set(s) are relevant to an input set -- they compute a similarity score for each set, sort them, and then compute the percentage difference for each successive pair. They consider any set with a score above the average score for all sets *and* with a score percentage diff (relative to the next highest scoring set) greater than some arbitrary delta to be a match. (The theory being that an arbitrary percentage delta is better than an arbitrary score cutoff, and that you only want things scoring better than average, because as scores taper off on the lower end, they can taper off quickly and show very high percentage differences.)

I have no idea how well this approach would work for general search (with a large set of documents and a large number of matches).

To keep in mind just how diverse the approaches to this type of problem can be depending on the nitty gritty specifics of your use case, consider the "GuardianComponent" example from my BTB talk at ApacheCon last year (slides 32-25)... http://people.apache.org/~hossman/apachecon2008us/btb/apache-solr-beyond-the-box.pdf ...either of the approaches mentioned there tackles the "sacrifice recall to achieve greater precision" aspect of your problem in the specific domain of short documents where you want to eliminate matches that are significantly longer than the input (even if they score well using traditional tf/idf metrics). -Hoss
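One reading of that above-average-plus-percentage-gap rule, as a sketch (the 0.5 delta here is as arbitrary as the paper admits its own is):

public class ScoreCutoff {
    // scores must be sorted descending. Returns how many top hits to keep:
    // walk down the list, drop anything at or below the mean, and stop at
    // the first pair whose relative drop exceeds delta.
    public static int cutoff(float[] scores, float delta) {
        float sum = 0f;
        for (float s : scores) sum += s;
        float mean = sum / scores.length;

        int keep = 0;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] <= mean) break;           // below average: drop
            keep = i + 1;
            if (i + 1 < scores.length) {
                float drop = (scores[i] - scores[i + 1]) / scores[i];
                if (drop > delta) break;            // big relative gap: cut here
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        float[] scores = {0.91f, 0.88f, 0.41f, 0.07f, 0.05f};
        System.out.println(cutoff(scores, 0.5f));   // 2: the 0.88 -> 0.41 drop cuts
    }
}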
Re: execute on server and read from file
Uhhhm, this is the Lucene user's list, not a general Java programming list, so unless this has something to do with Lucene I doubt you'll get much help. I'd suggest one of the Java programming language lists rather than this one. Best Erick On Thu, Mar 5, 2009 at 6:32 PM, futurpc wrote: > > Hello. I have data files on a web server that contain some values (I need to build > a chart from them). I made an applet that reads the information from the file and builds the chart, > but when I upload the applet to the server, it doesn't find the files. > Can you please suggest how I can make a Java program that executes on > the server and reads the files there? > > Thank you
deletion of index-files fails
So, I have a (small) Lucene index, all fine; I use it a bit, and then (on app shutdown) want to delete its files and the containing directory (the index is intended as a temp object). At some earlier time this was working just fine, using java.io.File.delete(). Now however, some of the files get deleted (segments*) whereas others fail (no exception is thrown, java.io.File.delete() just returns false: _0.cfs, _0.cfx). I've tried closing the IndexReader (no IndexWriter exists at shutdown), but that makes no difference. Any ideas? thanks Paul
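In case it helps rule things out, a minimal close-then-delete sketch; the assumption being that something still has the index open (an IndexSearcher built on a reader does not close that reader for you, and on Windows the OS refuses to delete open files), which would keep the .cfs/.cfx pinned:

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class TempIndexCleanup {
    // Close every searcher/reader before deleting; closing a searcher that
    // was constructed from a reader leaves that reader open, so close both.
    public static void dispose(IndexSearcher searcher, IndexReader reader,
                               File indexDir) throws Exception {
        if (searcher != null) searcher.close();
        if (reader != null) reader.close();
        File[] files = indexDir.listFiles();
        if (files != null) {
            for (File f : files) {
                if (!f.delete()) {
                    // Something still holds this file open.
                    System.err.println("still cannot delete " + f);
                }
            }
        }
        indexDir.delete();  // only succeeds once the directory is empty
    }
}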
error in code
Hi all, I am getting an error when running this code. Can somebody please tell me what the problem is? The code is given below. The lines marked with asterisks give the error *cannot find symbol*.

import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
 * This class demonstrates the process of creating an index with Lucene
 * for text files in a directory.
 */
public class TextFileIndexer {
    public static void main(String[] args) throws Exception {
        //fileDir is the directory that contains the text files to be indexed
        File fileDir = new File("C:\\files_to_index");

        //indexDir is the directory that hosts Lucene's index files
        File indexDir = new File("C:\\luceneIndex");
        Analyzer luceneAnalyzer = new StandardAnalyzer();
        IndexWriter indexWriter = new IndexWriter(indexDir, luceneAnalyzer, true);
        File[] textFiles = fileDir.listFiles();
        long startTime = new Date().getTime();

        //Add documents to the index
        for (int i = 0; i < textFiles.length; i++) {
            if (textFiles[i].isFile() > textFiles[i].getName().endsWith(".txt")) {
                System.out.println("File " + textFiles[i].getCanonicalPath() + " is being indexed");
                Reader textReader = new FileReader(textFiles[i]);
                Document document = new Document();
                *document.add(Field.Text("content", textReader));
                document.add(Field.Text("path", textFiles[i].getPath()));*
                indexWriter.addDocument(document);
            }
        }

        indexWriter.optimize();
        indexWriter.close();
        long endTime = new Date().getTime();

        System.out.println("It took " + (endTime - startTime)
                + " milliseconds to create an index for the files in the directory "
                + fileDir.getPath());
    }
}

Regards, Nitin Gopi
Re: error in code
Hello gopi, My comments:

if(textFiles[i].isFile() > textFiles[i].getName().endsWith(".txt")){
&& should be used.

document.add(Field.Text("content",textReader));
should be
document.add(new Field("content", textReader));

document.add(Field.Text("path",textFiles[i].getPath()));
should be
document.add(new Field("path", textFiles[i].getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));

Regards Ganesh

- Original Message - From: "nitin gopi" To: Sent: Friday, March 06, 2009 8:24 AM Subject: error in code

Hi all, I am getting an error when running this code. Can somebody please tell me what the problem is? The code is given below. The lines marked with asterisks give the error *cannot find symbol*.

import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
 * This class demonstrates the process of creating an index with Lucene
 * for text files in a directory.
 */
public class TextFileIndexer {
    public static void main(String[] args) throws Exception {
        //fileDir is the directory that contains the text files to be indexed
        File fileDir = new File("C:\\files_to_index");

        //indexDir is the directory that hosts Lucene's index files
        File indexDir = new File("C:\\luceneIndex");
        Analyzer luceneAnalyzer = new StandardAnalyzer();
        IndexWriter indexWriter = new IndexWriter(indexDir, luceneAnalyzer, true);
        File[] textFiles = fileDir.listFiles();
        long startTime = new Date().getTime();

        //Add documents to the index
        for (int i = 0; i < textFiles.length; i++) {
            if (textFiles[i].isFile() > textFiles[i].getName().endsWith(".txt")) {
                System.out.println("File " + textFiles[i].getCanonicalPath() + " is being indexed");
                Reader textReader = new FileReader(textFiles[i]);
                Document document = new Document();
                *document.add(Field.Text("content", textReader));
                document.add(Field.Text("path", textFiles[i].getPath()));*
                indexWriter.addDocument(document);
            }
        }

        indexWriter.optimize();
        indexWriter.close();
        long endTime = new Date().getTime();

        System.out.println("It took " + (endTime - startTime)
                + " milliseconds to create an index for the files in the directory "
                + fileDir.getPath());
    }
}

Regards, Nitin Gopi
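Putting those corrections together, a version of the loop that should compile against the 2.x API (Field.Text comes from the old 1.x API, which is why the compiler reports *cannot find symbol*):

import java.io.File;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class TextFileIndexer2 {
    public static void main(String[] args) throws Exception {
        File fileDir = new File("C:\\files_to_index");
        File indexDir = new File("C:\\luceneIndex");
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        File[] textFiles = fileDir.listFiles();
        for (int i = 0; i < textFiles.length; i++) {
            // && (not >): index only plain files whose name ends in .txt
            if (textFiles[i].isFile() && textFiles[i].getName().endsWith(".txt")) {
                Document document = new Document();
                // Reader-valued field: tokenized and indexed, never stored
                document.add(new Field("content", new FileReader(textFiles[i])));
                // Path: stored for retrieval, kept as a single untokenized term
                document.add(new Field("path", textFiles[i].getPath(),
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                writer.addDocument(document);
            }
        }
        writer.optimize();
        writer.close();
    }
}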
Using Lucene for user query parsing
I am trying to evaluate whether Lucene is the right candidate for the problem at hand. Say I have 3 indexes: Index 1 has street names. Index 2 has business names. Index 3 has area names. All these names can be single words or a combination of words, like "woodward street" or "marks and spencers street", etc. Now the user enters a query saying "mc donalds woodward street kingston precinct". I have to parse this query and come up with the best match possible. The problem is, in the query I do not know which part is the business name, area name or street name. Also, the user may give the query in any order; for example he may give it as "kingston precinct mc donalds woodward street". There might be spelling mistakes in the query entered by the user. Also he might use road for street or lane for street and such things. I know that Lucene is the right candidate for the synonym and spelling mistakes part but am a bit hazy regarding the user query parsing part, as to in which index to search what. Any help is greatly appreciated. Thanks, Srini.
Re: indexing but not tokenizing
Thank you Ian, > If you want a direct suggestion: use PerFieldAnalyzerWrapper, > specifying a different analyzer for field B. > > > -- > Ian. this makes a lot of sense. -John
Questions about analyzer
Hello all,
1) Which is better to use: the Snowball analyzer or the Lucene contrib analyzers? Is there no built-in stop word list for the Snowball analyzer?
2) Are Analyzer and QueryParser thread-safe? Can they be created once and used in as many threads as needed?
3) I am using the Snowball analyzer for both indexing and searching. When I search for windows AND vista, QueryParser is adding AND as part of the search, but I am expecting something like +windows +vista.
Regards Ganesh
Re: Using Lucene for user query parsing
Hi Srinivas, Perhaps what you need here is query formation logic which assigns the right keywords to the right fields. Let me know in case I got it wrong. One way to do that could be by using index-time boosts for fields and then running a query (so that a particular field is preferred over the other). As far as I know, Lucene should be a better solution than anything else for such a thing, but there would be a few things that you would have to build yourself as well. -- Anshum Gupta Naukri Labs! http://ai-cafe.blogspot.com The facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw. On Fri, Mar 6, 2009 at 11:55 AM, Srinivas Bharghav wrote: > I am trying to evaluate whether Lucene is the right candidate for the > problem at hand. > > Say I have 3 indexes: > > Index 1 has street names. > Index 2 has business names. > Index 3 has area names. > > All these names can be single words or a combination of words, like "woodward > street" or "marks and spencers street", etc. > > Now the user enters a query saying "mc donalds woodward street kingston > precinct". > > I have to parse this query and come up with the best match possible. The > problem is, in the query I do not know which part is the business name, > area name or street name. Also, the user may give the query in any order; for > example he may give it as "kingston precinct mc donalds woodward street". > There might be spelling mistakes in the query entered by the user. Also he > might use road for street or lane for street and such things. I know that > Lucene is the right candidate for the synonym and spelling mistakes part > but am a bit hazy regarding the user query parsing part, as to in which index to > search what. Any help is greatly appreciated. > > Thanks, > Srini.
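As a starting point, and assuming the three indexes can be merged into one index with one field per name type (the field names here are hypothetical), something like MultiFieldQueryParser lets scoring decide which field each token belongs to, with fuzzy terms absorbing misspellings:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.Query;

public class GeoQuerySketch {
    public static void main(String[] args) throws Exception {
        String[] fields = {"street", "business", "area"};  // hypothetical field names
        MultiFieldQueryParser parser =
                new MultiFieldQueryParser(fields, new StandardAnalyzer());

        // Every token is tried against every field; scoring decides whether
        // "woodward" matched best as a street, a business, or an area name,
        // and word order in the user's query stops mattering.
        Query q = parser.parse("mc donalds woodward street kingston precinct");
        System.out.println(q);

        // Misspellings can be absorbed per-term with fuzzy query syntax:
        Query fuzzy = parser.parse("donalds~0.7 woodward~0.7");
        System.out.println(fuzzy);
    }
}

The road/street/lane variations would still need a synonym layer at index or query time, as the original post anticipates.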