Re: search problem
I guess that fixes the problem. Thanks.
Cannot save index to 'index' directory, please delete it first
I got an error like this: "Cannot save index to 'index' directory, please delete it first" when I ran the demo in Lucene 1.9.1. Please tell me why? I have set the classpath!

-- "Busy all day ★ accomplishing nothing" A little ant http://blog.csdn.net/qixiang_nj
java.io.IOException: Stale NFS file handle
Hey,

I'm running into this exception with my Lucene searching. We have a cluster of 2 servers that execute searches and one server in the back end that writes to the index. I thought that setting up the external boxes on NFS would be all right, since searching doesn't require locking. Can anyone tell me why this may be happening, and possibly suggest a fix? I've already tried setting -Dorg.apache.lucene.lockDir=/tmp in the JVM args, but it doesn't seem to do the job. I have also considered local filesystems on each cluster member, but the index is updated frequently and would need to be mirrored too often for it to be worthwhile. Any suggestions would be helpful.

Thank you,
Steve.

Here is the stack trace in case you need it.

2006-04-26 08:57:36,160 INFO [STDOUT] java.io.IOException: Stale NFS file handle
2006-04-26 08:57:36,163 INFO [STDOUT]   at java.io.RandomAccessFile.readBytes(Native Method)
2006-04-26 08:57:36,164 INFO [STDOUT]   at java.io.RandomAccessFile.read(RandomAccessFile.java:315)
2006-04-26 08:57:36,164 INFO [STDOUT]   at org.apache.lucene.store.FSIndexInput.readInternal(FSDirectory.java:449)
2006-04-26 08:57:36,165 INFO [STDOUT]   at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:45)
2006-04-26 08:57:36,166 INFO [STDOUT]   at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:219)
2006-04-26 08:57:36,166 INFO [STDOUT]   at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:64)
2006-04-26 08:57:36,167 INFO [STDOUT]   at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:33)
2006-04-26 08:57:36,167 INFO [STDOUT]   at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:56)
2006-04-26 08:57:36,168 INFO [STDOUT]   at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:62)
2006-04-26 08:57:36,169 INFO [STDOUT]   at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:117)
2006-04-26 08:57:36,170 INFO [STDOUT]   at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:148)
2006-04-26 08:57:36,170 INFO [STDOUT]   at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:157)
2006-04-26 08:57:36,171 INFO [STDOUT]   at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:151)
2006-04-26 08:57:36,172 INFO [STDOUT]   at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:300)
2006-04-26 08:57:36,173 INFO [STDOUT]   at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:78)
2006-04-26 08:57:36,173 INFO [STDOUT]   at org.apache.lucene.search.Similarity.idf(Similarity.java:255)
2006-04-26 08:57:36,174 INFO [STDOUT]   at org.apache.lucene.search.TermQuery$TermWeight.<init>(TermQuery.java:43)
2006-04-26 08:57:36,175 INFO [STDOUT]   at org.apache.lucene.search.TermQuery.createWeight(TermQuery.java:142)
2006-04-26 08:57:36,175 INFO [STDOUT]   at org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:203)
2006-04-26 08:57:36,176 INFO [STDOUT]   at org.apache.lucene.search.BooleanQuery$BooleanWeight2.<init>(BooleanQuery.java:330)
2006-04-26 08:57:36,177 INFO [STDOUT]   at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:372)
2006-04-26 08:57:36,177 INFO [STDOUT]   at org.apache.lucene.search.Query.weight(Query.java:93)
2006-04-26 08:57:36,178 INFO [STDOUT]   at org.apache.lucene.search.Hits.<init>(Hits.java:48)
2006-04-26 08:57:36,179 INFO [STDOUT]   at org.apache.lucene.search.Searcher.search(Searcher.java:53)
Highlight
Hi,

I wrote a program that turns a PDF document into a Lucene document. The fields are "contents", "sentence", ...: How do I display the sentence the query string is in, and how do I highlight the string?

cheers
anton feldmann

    package de.coli.seek.lucene;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import java.io.StringWriter;
    import java.util.Calendar;
    import java.util.StringTokenizer;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.Date;

    import org.apache.lucene.document.DateTools;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.pdmodel.PDDocumentInformation;
    import org.pdfbox.exceptions.CryptographyException;
    import org.pdfbox.exceptions.InvalidPasswordException;
    import org.pdfbox.util.PDFTextStripper;

    public final class Sentence2Document {

        private static final char FILE_SEPARATOR =
            System.getProperty( "file.separator" ).charAt( 0 );

        // given caveat of increased search times when using
        // MICROSECOND, only use SECOND by default
        private DateTools.Resolution dateTimeResolution = DateTools.Resolution.SECOND;

        /**
         * accessor
         * @return current date/time resolution
         */
        public DateTools.Resolution getDateTimeResolution() {
            return dateTimeResolution;
        }

        /**
         * mutator
         * @param resolution set new date/time resolution
         */
        public void setDateTimeResolution( DateTools.Resolution resolution ) {
            dateTimeResolution = resolution;
        }

        //
        // compatibility methods for lucene-1.9+
        //
        private String timeToString( long time ) {
            return DateTools.timeToString( time, dateTimeResolution );
        }

        private static void addKeywordField( Document document, String name, String value ) {
            if ( value != null ) {
                document.add( new Field( name, value, Field.Store.YES, Field.Index.UN_TOKENIZED ) );
            }
        }

        private static void addTextField( Document document, String name, Reader value ) {
            if ( value != null ) {
                document.add( new Field( name, value ) );
            }
        }

        private static void addTextField( Document document, String name, String value ) {
            if ( value != null ) {
                document.add( new Field( name, value, Field.Store.YES, Field.Index.TOKENIZED ) );
            }
        }

        private void addTextField( Document document, String name, Date value ) {
            if ( value != null ) {
                addTextField( document, name, DateTools.dateToString( value, dateTimeResolution ) );
            }
        }

        private void addTextField( Document document, String name, Calendar value ) {
            if ( value != null ) {
                addTextField( document, name, value.getTime() );
            }
        }

        private static void addUnindexedField( Document document, String name, String value ) {
            if ( value != null ) {
                document.add( new Field( name, value, Field.Store.YES, Field.Index.NO ) );
            }
        }

        private static void addUnstoredKeywordField( Document document, String name, String value ) {
            if ( value != null ) {
                document.add( new Field( name, value, Field.Store.NO, Field.Index.UN_TOKENIZED ) );
            }
        }

        /**
         * private constructor because there are only static methods.
         */
        private Sentence2Document() {
            // utility class should not be instantiated
        }

        /**
         * This will get a lucene document from a PDF file.
         *
         * @param is The stream to read the PDF from.
         * @return The lucene document.
         * @throws IOException If there is an error parsing or indexing the document.
         */
        public static Document getDocument( InputStream is ) throws IOException {
            Sentence2Document converter = new Sentence2Document();
            return converter.convertDocument( is );
        }

        /**
         * Convert the PDF stream to a lucene document.
         *
         * @param is The input stream.
         * @return The input stream converted to a lucene document.
         * @throws IOException If there is an error converting the PDF.
         */
        public Document convertDocument( InputStream is ) throws IOException {
            Document document = new Document();
            addContent( document, is, "" );
            return document;
        }

        /**
         * This will take a reference to a PDF document and create a lucene document.
         *
         * @param file A reference to a PDF document.
         * @return The converted lucene document.
         * @throws IOException If there is an exception while converting the document.
         */
        public Document convertDocument( File file ) throws IOException {
            Doc
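For reference, a minimal sketch of one way to highlight matches with the contrib Highlighter of that era (untested; assumes the highlighter jar is on the classpath and that the "sentence" field was stored). The stored text would come from something like hits.doc(i).get("sentence"):

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    public class HighlightUtil {
        /**
         * Returns the best fragment of a stored field value, with query
         * matches wrapped in <B>...</B> by the default formatter.
         */
        public static String highlight( Query query, Analyzer analyzer,
                                        String field, String storedText )
                throws IOException {
            Highlighter highlighter = new Highlighter( new QueryScorer( query ) );
            return highlighter.getBestFragment(
                analyzer.tokenStream( field, new StringReader( storedText ) ),
                storedText );
        }
    }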
RAM Directory / querying Performance issue
I've rewritten the RAMDirectory to support 64 bit (still haven't had time to add this to Lucene, hopefully in the coming months when I have a free second).

My question: I have a machine with 4 GB RAM and a 3 GB index file. I successfully load the 3 GB index into memory, and the first few queries run with normal response time, but very quickly response time becomes unbearably slow (load testing with 1 concurrent user). How are queries expanded in memory when run (how much memory do they use up)? Could this be an issue of the queries themselves taking up large chunks of RAM?
MatchAllDocsQuery, MultiSearcher and a custom HitCollector throwing exception
Hi,

I have encountered an issue with Lucene 1.9.1. It involves MatchAllDocsQuery, MultiSearcher and a custom HitCollector. The following code throws java.lang.UnsupportedOperationException.

If I remove the MatchAllDocsQuery condition (comment out the whole //1 block), or if I don't use the custom HitCollector (ms.search(mbq); instead of ms.search(mbq, allcoll);), the exception goes away. By stepping into the source I can see it seems due to MatchAllDocsQuery not implementing extractTerms(). I have never looked at Lucene internals before; any help as to what extractTerms() should do, or any other hint to overcome this?

thanks,

    Searcher searcher = new IndexSearcher("c:\\projects\\mig\\runtime\\index\\01Aug16\\");
    Searchable[] indexes = new IndexSearcher[1];
    indexes[0] = searcher;
    MultiSearcher ms = new MultiSearcher(indexes);

    AllCollector allcoll = new AllCollector(ms);

    BooleanQuery mbq = new BooleanQuery();
    mbq.add(new TermQuery(new Term("body", "value1")), BooleanClause.Occur.MUST_NOT);
    // 1
    MatchAllDocsQuery alld = new MatchAllDocsQuery();
    mbq.add(alld, BooleanClause.Occur.MUST);
    //

    System.out.println("Query: " + mbq.toString());

    // 2
    ms.search(mbq, allcoll);
    //ms.search(mbq);
Re: MatchAllDocsQuery, MultiSearcher and a custom HitCollector throwing exception
Hi Jim,

This went to the old mailing list... Could you email this to java-user@lucene.apache.org and maybe open a JIRA bug for it?

-Yonik

On 4/26/06, jm <[EMAIL PROTECTED]> wrote:
> Hi,
> I have encountered an issue with lucene1.9.1. It involves
> MatchAllDocsQuery, MultiSearcher and a custom HitCollector. [...]

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
Re: MatchAllDocsQuery, MultiSearcher and a custom HitCollector throwing exception
OK, thanks for letting me know. I entered a bug, 556.

javi

On 4/26/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> Hi Jim,
> This went to the old mailing list... Could you email this to
> java-user@lucene.apache.org and maybe open a JIRA bug for it?
> -Yonik [...]
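For readers who hit the same exception before a fix lands: extractTerms(Set) is expected to add a query's terms to the given set (MultiSearcher uses them to compute global docFreqs), and a match-all query simply has none to add. An untested workaround sketch along those lines (class name hypothetical):

    import java.util.Set;

    import org.apache.lucene.search.MatchAllDocsQuery;

    public class PatchedMatchAllDocsQuery extends MatchAllDocsQuery {
        // In Lucene 1.9.1 Query.extractTerms(Set) defaults to throwing
        // UnsupportedOperationException; a match-all query matches every
        // document without reference to any term, so an empty
        // implementation is a plausible stopgap.
        public void extractTerms(Set terms) {
            // intentionally empty: no terms to contribute
        }
    }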
Partial token matches
Hi All,

Just wanted to throw out something I'm working on. It is working well for me, but I wanted to see if anyone can suggest any other alternatives that might perform better than what I'm doing now.

I have a field in my index that contains keywords (back-of-the-book index terms) and a UI feature that allows the user to find documents that contain a partial keyword supplied by the user. So a particular document in my index might have the token "informat" in the keywords field, and the user may supply "form" in the UI and I should get a match. My old implementation does not use Lucene and just uses String.matches with a regular expression that looks like ".*form.*". I reimplemented using Lucene and just tokenize the field so I get the tokens

    informat
    nformat
    format
    ormat
    rmat
    mat
    at
    t

Then I use a prefix query to find hits. Both implementations ignore case in the search, and the hit order is controlled by another field that I'm sorting on, so relevance ranking is not important in this use case. Search-time performance is crucial; time to create the index and index size are not really important. The index is created statically at application startup or possibly delivered to the application, and is not updated while the application is using it.

Thanks for any suggestions,
Eric
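A sketch of that suffix tokenization as a TokenFilter, assuming the Lucene 1.9-era TokenStream API (class name hypothetical, untested):

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Emits each token followed by all of its proper suffixes, so a
    // PrefixQuery on "form" matches a keyword "informat".
    public class SuffixFilter extends TokenFilter {
        private Token current;   // token whose suffixes are being emitted
        private String pending;  // remaining text still to be shortened

        public SuffixFilter(TokenStream in) {
            super(in);
        }

        public Token next() throws IOException {
            if (pending == null || pending.length() <= 1) {
                current = input.next();          // advance to the next real token
                if (current == null) return null;
                pending = current.termText();
                return current;                  // the full token first
            }
            pending = pending.substring(1);      // drop one leading character
            return new Token(pending, current.startOffset(), current.endOffset());
        }
    }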
Re: java.io.IOException: Stale NFS file handle
Steve,

There are some locks involved in search, like the one that gets written to the FS before the reader reads all the segment/index files listed in the segments file. Once they are all read, the lock is released. Setting the lock dir to the local /tmp doesn't sound good, as locks have to be in a common location in order for them to have the desired locking effect.

As for a suggestion for large, frequently updated indices - have you considered NAS?

Otis

- Original Message -
From: "Schwenker, Stephen" <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, April 26, 2006 9:19:55 AM
Subject: java.io.IOException: Stale NFS file handle
[...]
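To make the point concrete, a minimal sketch of pointing every JVM that touches the index at one shared lock directory (the mount point is hypothetical):

    // All writer and searcher JVMs must agree on this location for the
    // locks to have any effect; the equivalent JVM arg is
    // -Dorg.apache.lucene.lockDir=/mnt/shared/lucene-locks
    System.setProperty( "org.apache.lucene.lockDir", "/mnt/shared/lucene-locks" );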
Lucene search benchmark/stress test tool
Hi,

I'm about to write a little command-line Lucene search benchmark tool. I'm interested in benchmarking search performance, with the ability to specify the concurrency level (# of parallel search threads) and response timing, so I can calculate min, max, mean, and median times. Something like the 'ab' (Apache Benchmark) tool, but for Lucene.

Has anyone already written something like this?

Thanks,
Otis
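A bare-bones sketch of such a tool (untested; class name, field name and argument layout hypothetical): N threads run the same query against one shared IndexSearcher and print raw per-search latencies, which can be reduced to min/max/mean offline:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class SearchBench {
        // args: <indexDir> <queryString> <numThreads> <repsPerThread>
        public static void main(String[] args) throws Exception {
            final IndexSearcher searcher = new IndexSearcher(args[0]);
            final Query query =
                new QueryParser("contents", new StandardAnalyzer()).parse(args[1]);
            int threads = Integer.parseInt(args[2]);
            final int reps = Integer.parseInt(args[3]);

            Thread[] workers = new Thread[threads];
            for (int i = 0; i < threads; i++) {
                workers[i] = new Thread() {
                    public void run() {
                        try {
                            for (int r = 0; r < reps; r++) {
                                long t0 = System.currentTimeMillis();
                                searcher.search(query);   // Hits fetches the first batch
                                System.out.println(System.currentTimeMillis() - t0);
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                };
                workers[i].start();
            }
            for (int i = 0; i < threads; i++) workers[i].join();
        }
    }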
Re: Partial token matches
I'm sure the guys will chime in, but I think you're in significant danger of getting a "too many clauses" exception thrown. Try searching on, say, "an". Under the covers, Lucene expands your query to have a clause for *every* term in your index that starts with "an", so there's a clause for "an", "ana", "anb", "anaa", "anab", ... The shorter your term, the more there'll be, and if there are more than 1024, you'll get the exception above. You can set the number of clauses to a bigger number, but that may not scale well.

Consider writing a filter (see Lucene in Action). The filter will return a bitset with a bit turned on for each potential match, and avoid this issue. RegexTermEnum helps a lot here.

Try searching the archive for a thread started by me, titled "I just don't get wildcards at all", for an exposition by the guys on this sort of thing. That thread centers on wildcard queries, but I'm pretty sure PrefixQuery suffers from the same issue. Chris, Erik, Yonik... do I have this right?

Erick
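For reference, the clause ceiling Erick mentions is a static setting; raising it is a one-liner (the value shown is arbitrary):

    // Default is 1024; raising it trades memory and CPU for fewer
    // TooManyClauses errors on broad prefix/wildcard expansions.
    BooleanQuery.setMaxClauseCount( 4096 );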
DateTools question
Hello,

Why does DateTools.dateToString() return a String representation of my Date, but in a different TimeZone? Does it use its own Calendar/TimeZone settings? For instance,

    DateFormat format = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss.SSS");
    System.out.println(DateTools.dateToString(
        format.parse("2006-04-26 07:29:52.581"), DateTools.Resolution.MINUTE));

will print out 200604261129. Why the 4-hour difference?

Thanks!
--Bill
Re: DateTools question
: Why does DateTools.dateToString() return a String representation of my Date,
: but in a different TimeZone. Does it use its own Calendar/TimeZone settings?

Yes, DateTools is hardcoded to use GMT for its string representations.

It wouldn't be safe for DateTools to use your current TimeZone/Locale, because once you've indexed the value, your index might be used by another application (or another instance of your application) running in a different locale.

The important thing is not what string DateTools.dateToString returns; it's whether you get an equivalent date back (based on the resolution you specified) when you do something like this...

    Date a = ...;
    DateTools.Resolution r = ...;
    Date b = DateTools.stringToDate(DateTools.dateToString(a, r));
    System.out.println("Is '"+a+"' the same as '"+b+"' with "+r+" resolution?");

-Hoss
Re: Partial token matches
: I'm sure the guys will chime in, but I think you're in significant danger of
: getting a "too many clauses" exception thrown. Try searching on, say, "an".

When using any of the queries that expand into a BooleanQuery, there is almost always the possibility of hitting TooManyClauses -- but this approach of using PrefixQuery is definitely safer/faster than a straight use of WildcardQuery -- at the expense of a bigger index. The idea mentioned in this thread is basically the same as an idea Erik Hatcher has suggested in the past, which I've taken to referring to as "wildcard term rotating"...

http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12261.html

: Consider writing a filter (see Lucene In Action). The filter will return a
: bitset with a bit turned on for each potential match, and avoid this issue.

Very true -- but at the expense of scoring information (i.e., how many times does the term appear in the document?) ... it's all a question of priorities.

-Hoss
Filter operation
Greetings,

If I write a filter, does this run over the documents in the index *before* a search is made (i.e., every document in the index is touched) or on the result set after the search? If it is run over all of the documents, doesn't this become a performance bottleneck with any non-trivial filter?

-- Tom Emerson [EMAIL PROTECTED] http://www.dreamersrealm.net/~tree
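For context, a minimal sketch of a 1.9-era Filter (class name hypothetical): it builds one BitSet per IndexReader by walking the term index rather than loading stored documents, and the resulting BitSet can be cached and reused across searches:

    import java.io.IOException;
    import java.util.BitSet;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.Filter;

    public class SingleTermFilter extends Filter {
        private final Term term;

        public SingleTermFilter(Term term) {
            this.term = term;
        }

        // Sets a bit for each document containing the term; no stored
        // fields are read, only the inverted index is traversed.
        public BitSet bits(IndexReader reader) throws IOException {
            BitSet result = new BitSet(reader.maxDoc());
            TermDocs td = reader.termDocs(term);
            try {
                while (td.next()) result.set(td.doc());
            } finally {
                td.close();
            }
            return result;
        }
    }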
Dealing with acronyms
Hi All,

I would like to enable users to do an acronym search on my index. My idea is the following:

1.) Extract acronyms (ABS, ESP, VCG etc.) from the given document (which is going to be indexed)
2.) Store the extracted acronyms in a field, for example called "case"
3.) On search, ask the user to use case:"ABS" to search for acronyms

Any experience with this kind of pattern? Other ideas or best practices?

Thank you in advance and best regards
Hannes
How to display a field value
Hi,

How do I display the whole field value of a document in which the query string is found?

cheers
anton
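A minimal sketch of reading a stored field back from each hit (the field name is hypothetical, and the field must have been added with Field.Store.YES):

    // Hits gives access to the stored Document of each match, and
    // Document.get returns the full stored value of a field.
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        System.out.println(doc.get("contents")); // whole stored field value
    }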
Re: Dealing with acronyms
This makes perfect sense to me. Of course the hard part will be how to extract the acronyms.

-- Stefan

Hannes Carl Meyer wrote: [...]
Re: RAM Directory / querying Performance issue
Is this markedly faster than using an MMapDirectory? Copying all this data into the Java heap (as RAMDirectory does) puts a tremendous burden on the garbage collector. MMapDirectory should be nearly as fast, but keeps the index out of the Java heap.

Doug

z shalev wrote: [...]
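A sketch of trying Doug's suggestion, assuming the 1.9-era mechanism of choosing the FSDirectory implementation via a system property (index path hypothetical; classes are in org.apache.lucene.store):

    // Ask FSDirectory to hand back MMapDirectory instances, then open
    // the searcher on the memory-mapped directory. A 64-bit JVM is
    // needed to map a 3 GB index.
    System.setProperty( "org.apache.lucene.FSDirectory.class",
                        "org.apache.lucene.store.MMapDirectory" );
    Directory dir = FSDirectory.getDirectory( "/path/to/index", false );
    IndexSearcher searcher = new IndexSearcher( dir );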
Re: Dealing with acronyms
On 4/26/06, Hannes Carl Meyer <[EMAIL PROTECTED]> wrote:
> Hi All,
> I would like to enable users to do an acronym search on my index.
> My idea is the following:
> 1.) Extract acronyms (ABS, ESP, VCG etc.) from the given document (which
> is going to be indexed)

In case you haven't already seen it, you might find this useful:
http://www.cs.waikato.ac.nz/~nzdl/publications/1999/Yeates-Auto-Extract.pdf

> 2.) Store the extracted acronyms in a field, for example called "case"
> 3.) On search, ask the user to use case:"ABS" to search for acronyms

I would rather store them in the same field with the others, so that you can do phrase queries. Store the acronyms just like you would store synonyms. More information on how to store synonyms is in the "Lucene in Action" book. This would facilitate queries like "USA President". If you store "USA" in a separate field, you wouldn't be able to match this query.

> Any experience with this kind of pattern? Other ideas or best practices?

I would also look at HMMs/CRFs to extract acronyms. You need to come up with a list of features to identify a potential acronym. For example:
- All caps
- The acronym appears repeatedly in the rest of the text
- Found in the acronym dictionary, etc.

Hope this helps,

--Rajesh Munavalli
Blog: http://munavalli.blogspot.com
Re: Dealing with acronyms
Rajesh Munavalli wrote:
> I would rather store them in the same field with the others, so that you
> can do phrase queries. Store the acronyms just like you would store
> synonyms. [...]

Hi,

Thank you, that's good advice - I don't have the Lucene in Action book, but I think it's worth taking a look at it.

So I guess it's done by writing or extending an analyzer?

H.
Re: Dealing with acronyms
> So I guess it's done by writing or extending an analyzer?

Yes... that's correct.

--Rajesh Munavalli
Blog: http://munavalli.blogspot.com
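An untested sketch of such an analyzer extension in the synonym style Rajesh describes, assuming the 1.9-era TokenStream API (class name and lookup map hypothetical):

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Injects a known expansion for an acronym at the same position,
    // synonym-style, so phrase queries across both forms still work.
    public class AcronymFilter extends TokenFilter {
        private final Map expansions; // e.g. "abs" -> "antilock braking system"
        private Token pending;

        public AcronymFilter(TokenStream in, Map expansions) {
            super(in);
            this.expansions = expansions;
        }

        public Token next() throws IOException {
            if (pending != null) {
                Token t = pending;
                pending = null;
                return t;
            }
            Token token = input.next();
            if (token == null) return null;
            String expansion = (String) expansions.get(token.termText());
            if (expansion != null) {
                pending = new Token(expansion, token.startOffset(), token.endOffset());
                pending.setPositionIncrement(0); // same position as the acronym
            }
            return token;
        }
    }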
Re: performance differences between 1.4.3 and 1.9.1
On Wednesday 26 April 2006 01:22, RONALD MANTAY wrote:
> However, when searching multiple indexes with MultiSearcher and with a
> FuzzyQuery with a prefixLength of 1, the search against 3.7m documents
> spread over 23 indexes (due to the natural grouping of the data) the
> time changed from 800ms to 4500ms.

MultiSearcher in Lucene 1.4 had a broken ranking implementation. This has been fixed in Lucene 1.9, but this might have bad effects on performance. 23 indexes is quite a lot; maybe you can speed things up greatly by using a smaller number of indexes.

Regards
Daniel

--
http://www.danielnaber.de
Re: How to search in a sentence and display the whole sentence
Are the names of a field in a document unique, or can I create a field with the name "sentence" for each sentence in a text document?

Grant Ingersoll wrote:
> Anton,
> I think there are at least a couple of ways of doing this. I assume you
> have a program that does sentence detection already, as Lucene does not
> provide this. If not, I am sure a search of the web will find one that has
> high accuracy. You can:
>
> 1. Index each sentence as a separate Document. You will need a field on
> the Document relating it back to the overall file so you can reconstruct it.
> 2. As you index, insert sentence/paragraph boundary markers into your
> index and then use the SpanQuery functionality. Search this mail archive
> for sentence boundary detection and SpanQuery (try the dev list too). I
> think there was a discussion between me, Doug and Hoss on how to do this.
> 3. Do the search as you do now and then post-process to figure out what
> sentence it came from. This will be inefficient, but I don't know what
> your requirements are, so it may work for you.
>
> There are probably other ways too.
>
> anton feldmann wrote:
> > I intend to make a search that finds a word or a word pair in a sentence
> > or a paragraph, but then the sentence should be displayed as a whole.
> > The question is how I need to extend Lucene so that this is possible.
> > Where do I start? I have no idea how I would have to change the index
> > file, or whether that would be acceptable to the Lucene team.
> > cheers
> > anton feldmann
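On the field-name question: a Document may hold several fields with the same name. A small sketch of both variants (field names and variables hypothetical):

    // Either repeat the "sentence" field within one Document...
    Document doc = new Document();
    doc.add(new Field("filename", fileName, Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("sentence", firstSentence, Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("sentence", secondSentence, Field.Store.YES, Field.Index.TOKENIZED));
    // ...or, per Grant's option 1, make one Document per sentence with a
    // "filename" field pointing back to the source file.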
Re: performance differences between 1.4.3 and 1.9.1
For my application we have several hundred indexes, different subsets of which are searched depending on the situation. Aside from not upgrading to Lucene 1.9, or making a big index for every possible subset, do you have any ideas for how we can maintain fast performance?

- andy g

On 4/26/06, Daniel Naber <[EMAIL PROTECTED]> wrote:
> MultiSearcher in Lucene 1.4 had a broken ranking implementation. This has
> been fixed in Lucene 1.9, but this might have bad effects on performance.
> 23 indexes is quite a lot; maybe you can speed things up greatly by using
> a smaller number of indexes.
>
> Regards
> Daniel
>
> --
> http://www.danielnaber.de
Re: performance differences between 1.4.3 and 1.9.1
On 27 Apr 2006, at 02:18, Andy Goodell wrote:
> For my application we have several hundred indexes, different subsets
> of which are searched depending on the situation. Aside from not
> upgrading to Lucene 1.9, or making a big index for every possible
> subset, do you have any ideas for how we can maintain fast performance?

You probably need to explain the reason for splitting them up in order to get a good answer to that. And how big are they? Without knowing anything about your application, I'd say: merge them all into one index and add a field that you restrict on with a boolean clause. But with a few hundred indices it sounds like you have a design plan that doesn't work with the above.
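A sketch of that suggestion (field name and values hypothetical): the merged index carries a "source" keyword field, and a subset search wraps the user's query with a required clause over the wanted sources:

    // Restrict the user's query to two hypothetical sources within a
    // single merged index, instead of using a MultiSearcher.
    BooleanQuery sources = new BooleanQuery();
    sources.add(new TermQuery(new Term("source", "catalogA")), BooleanClause.Occur.SHOULD);
    sources.add(new TermQuery(new Term("source", "catalogB")), BooleanClause.Occur.SHOULD);

    BooleanQuery restricted = new BooleanQuery();
    restricted.add(userQuery, BooleanClause.Occur.MUST);
    restricted.add(sources, BooleanClause.Occur.MUST);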
Re: DateTools question
Makes sense. Thanks for the response!

--Bill

On 4/26/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> Yes, DateTools is hardcoded to use GMT for its string representations. [...]
Re: Lucene search benchmark/stress test tool
Hi,

I have added some code to the Lucene 1.9 source code for a RemoteParallelMultiSearcher performance benchmark. I recorded the time to execute the searchables[i].docFreq(term) call (in MultiSearcher.java) on both client and server, and the searchable.search call (in ParallelMultiSearcher.java) as well. I also recorded the total time taken to get the Hits object. I tested different complex boolean queries and took the average time for each query. While doing this, I got stuck with some doubts; please find them listed below.

What I have understood of the RemoteParallelMultiSearcher search procedure is that it first computes the weight of the Query for each index sequentially (one by one, e.g. it calculates the query weight of index 1 first and then index 2), and then searches each index and merges the results. Is there any possibility or method to merge the weight calculation of an index and its search into a single RPC, instead of doing both in separate steps?

Another question I have to clear up: in RemoteParallelMultiSearcher the method docFreq(Term term) is not parallelized. Why is it not parallelized? Is there a reason for that?

Regards
Sunil

On 4/26/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Hi,
> I'm about to write a little command-line Lucene search benchmark
> tool. [...]