Re: How best to handle a reasonable amount of data (25TB+)
It also depends on your queries. For example, if you only query data in 1-month intervals and you partition by date, you can calculate which shard your data will be found in and query just that shard. If you can find a partition key that is always present in the query, you can create a gazillion small shards, redirect each query to just the specific shard it needs, and keep search latency low.

On Wed, Feb 8, 2012 at 09:39, Li Li wrote:
> it's up to your machines. in our application, we index about
> 30,000,000 (30M) docs/shard, and the response time is about 150ms. our
> machine has about 48GB memory and about 25GB is allocated to solr and the rest
> is used for disk cache in Linux.
> if calculated by our application, indexing 1.25T docs will use 40+ machines.
>
> On Mon, Feb 6, 2012 at 10:50 AM, Peter Miller <
> peter.mil...@objectconsulting.com.au> wrote:
>
>> Hi,
>>
>> I have a little bit of an unusual set of requirements, and I am looking
>> for advice. I have researched the archives, and seen some relevant posts,
>> but they are fairly old and not specifically a match, so I thought I would
>> give this a try.
>>
>> We will eventually have about 50TB of raw, non-searchable data and 25TB of
>> search attributes to handle in Lucene, across about 1.25 trillion
>> documents. The app is write once, read many. There are many document types
>> involved that have to be able to be searched separately or together, with
>> some common attributes, but also unique ones per type. I plan on using a
>> JCP implementation that uses Lucene under the covers. The data itself is
>> not searchable, only the attributes. I plan to hook the JCP repo
>> (ModeShape) up to OpenStack Object Storage on commodity hardware,
>> eventually with 5 machines, each with 24 x 2TB drives. This should allow
>> for redundancy (3 copies), although I would suppose we would add bigger
>> drives as we go on.
>>
>> Since there is such a lot of data to index (not outrageous amounts for
>> these days, but a bit chunky), I was sort of assuming that the Lucene
>> indexes would go on the object storage solution too, to handle availability
>> and other infrastructure issues. Most of the searches would be
>> date-constrained, so I thought that the indexes could be sharded by date.
>>
>> There would be a local disk index being built near real time on the JCP
>> hardware that could be regularly merged in with the main indexes on the
>> object storage, I suppose.
>>
>> Does that make sense, and would it work? Sorry, but this is just
>> theoretical at the moment and I'm not experienced in Lucene, as you can no
>> doubt tell.
>>
>> I came across a piece that was talking about Hadoop and distributed Solr,
>> http://blog.mgm-tp.com/2010/09/hadoop-log-management-part4/, and I'm now
>> wondering if that would be a superior approach? Or any other suggestions?
>>
>> Many Thanks,
>> The Captn

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
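To make the shard-routing idea above concrete, here is a rough sketch (not code from this thread) of computing which monthly shards a date-constrained query needs to touch. The "docs-yyyy-MM" shard naming scheme is made up for illustration:

import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.List;

// Sketch only: map a [from, to] date range onto monthly shard names such as
// "docs-2012-01"; the query is then sent only to the returned shards.
public class ShardRouter {

    public List<String> shardsFor(Date from, Date to) {
        SimpleDateFormat monthFormat = new SimpleDateFormat("yyyy-MM");
        List<String> shards = new ArrayList<String>();
        Calendar month = Calendar.getInstance();
        month.setTime(from);
        month.set(Calendar.DAY_OF_MONTH, 1);
        while (!month.getTime().after(to)) {
            shards.add("docs-" + monthFormat.format(month.getTime()));
            month.add(Calendar.MONTH, 1);
        }
        return shards; // e.g. open a MultiReader/searcher over just these shard directories
    }
}

Searching only the returned shards is the same trick the next message calls partition pruning.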
Re: How best to handle a reasonable amount of data (25TB+)
On Feb 8, 2012, at 10:14 AM, Danil ŢORIN wrote:

> For example if you only query data for 1 month intervals, and you
> partition by date, you can calculate in which shard your data can be
> found, and query just that shard.

This is what one calls "partition pruning" in database terms.

http://en.wikipedia.org/wiki/Partition_(database)
http://www.orafaq.com/tuningguide/partition%20prune.html

Rather handy way to scale to "infinity and beyond" as Buzz Lightyear would have it.

Perhaps of interest:

Scaling to Infinity – Partitioning in Oracle Data Warehouses
http://www.evdbt.com/TGorman%20TD2009%20DWScale.doc
http://www.evdbt.com/OOW09%20DWScaling%20TGorman%2020091013.ppt

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Why read past EOF
Hmm, there's a problem with the logic here (sorry: this is my fault -- my prior suggestion is flat out wrong!).

The problem is... say you commit once, creating commit point 1. Two hours later, you commit again, creating commit point 2. The bug is that immediately on committing commit point 2, this deletion policy will go and remove commit point 1. Instead, it's supposed to wait 10 minutes to do so.

So... I think you should go back to using System.currentTimeMillis() as "the present". And then, only when the newest commit is more than 10 minutes old, are you allowed to delete the commits before it. That should work.

However: you should leave a margin of error, because say the reader takes 10 seconds to reopen + warm/cut over all search threads... then, if the timing is unlucky, you can still remove a commit point being used by a reader. I would leave a comfortable margin, e.g. if you reopen readers every 10 minutes, then delete commits older than 15 or 20 minutes. If commits are rare, then leaving a fat margin here will cost nothing in practice... and if there is some clock change (System.currentTimeMillis() suddenly jumps, maybe from daylight savings time, maybe from aggressive clock syncing, whatever), you have some margin.

Really, a better overall design would be a hard handshake with all outstanding readers, so that only once every single reader using a given commit has closed do you delete the commit. Then you are immune to clock unreliability, but this would require remote communication in your app to track reader states.

Also, you should remove that dangerous auto-generated catch block. It may suppress a real exception some day... and onCommit is allowed to throw IOException.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Feb 7, 2012 at 9:15 PM, superruiye wrote:
> public class PostponeCommitDeletionPolicy implements IndexDeletionPolicy {
>     private final static long deletionPostPone = 60;
>
>     public void onInit(List commits) {
>         // Note that commits.size() should normally be 1:
>         onCommit(commits);
>     }
>
>     /**
>      * delete commits after deletePostPone ms.
>      */
>     public void onCommit(List commits) {
>         // Note that commits.size() should normally be 2 (if not
>         // called by onInit above):
>         int size = commits.size();
>         try {
>             long lastCommitTimestamp = commits.get(commits.size() - 1).getTimestamp();
>             for (int i = 0; i < size - 1; i++) {
>                 if (lastCommitTimestamp - commits.get(i).getTimestamp() > deletionPostPone) {
>                     commits.get(i).delete();
>                 }
>             }
>         } catch (IOException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>     }
> }
> --
> indexWriterConfig.setIndexDeletionPolicy(new PostponeCommitDeletionPolicy());
> --
> and I use a time task (10 minutes) to reopen indexsearcher, but still read
> past EOF... the trace:
> java.io.IOException: read past EOF
>     at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:207)
>     at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
>     at org.apache.lucene.store.DataInput.readInt(DataInput.java:84)
>     at org.apache.lucene.store.BufferedIndexInput.readInt(BufferedIndexInput.java:153)
>     at org.apache.lucene.index.TermVectorsReader.checkValidFormat(TermVectorsReader.java:197)
>     at org.apache.lucene.index.TermVectorsReader.<init>(TermVectorsReader.java:86)
>     at org.apache.lucene.index.SegmentCoreReaders.openDocStores(SegmentCoreReaders.java:221)
>     at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:117)
>     at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:93)
>     at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:113)
>     at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:29)
>     at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:81)
>     at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:754)
>     at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:75)
>     at org.apache.lucene.index.IndexReader.open(IndexReader.java:421)
>     at org.apache.lucene.index.IndexReader.open(IndexReader.java:281)
>     at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:89)
>     at com.ableskysearch.migration.timertask.ReopenIndexSearcherTask.runAsPeriod(ReopenIndexSearcherTask.java:40)
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Why-read-past-EOF-
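For reference, a rough sketch of the policy Mike describes: wall-clock based, never touching the newest commit, and only deleting a commit once a newer commit has existed for a comfortable margin (so every reader has had time to reopen past it). The class name and the 15-minute margin are made up; the margin should comfortably exceed the reader reopen interval:

import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

public class ExpireOldCommitsDeletionPolicy implements IndexDeletionPolicy {

    // Comfortable margin; the original post reopens readers every 10 minutes.
    private static final long MARGIN_MS = 15L * 60 * 1000;

    public void onInit(List<? extends IndexCommit> commits) throws IOException {
        onCommit(commits);
    }

    public void onCommit(List<? extends IndexCommit> commits) throws IOException {
        long now = System.currentTimeMillis();
        // commits is ordered oldest..newest; never delete the newest commit.
        for (int i = 0; i < commits.size() - 1; i++) {
            // commits.get(i) can only still be in use if no newer commit has
            // been around for a full reopen interval yet, so test the age of
            // the next-newer commit rather than the commit's own age.
            if (now - commits.get(i + 1).getTimestamp() > MARGIN_MS) {
                commits.get(i).delete();
            }
        }
    }
}

With a 10-minute reopen schedule, a reader that is still on an older commit has at most ~10 minutes plus warm-up time before it moves forward, so the 15-minute margin covers it.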
Re: NRTManager and AlreadyClosedException
Are you closing the SearcherManager? Calling release() multiple times?

From the exception message the first sounds most likely.

--
Ian.

On Wed, Feb 8, 2012 at 5:20 AM, Cheng wrote:
> Hi,
>
> I am using NRTManager and NRTManagerReopenThread. Though I don't close
> either writer or the reopen thread, I receive AlreadyClosedException as
> follow.
>
> My initiating NRTManager and NRTManagerReopenThread are:
>
> FSDirectory indexDir = new NIOFSDirectory(new File(indexFolder));
>
> IndexWriterConfig iwConfig = new IndexWriterConfig(version,
>         new LimitTokenCountAnalyzer(StandardAnalyzer, maxTokenNum));
>
> iw = new IndexWriter(indexDir, iwConfig);
>
> nrtm = new NRTManager(iw, null);
>
> ropt = new NRTManagerReopenThread(nrtm, targetMaxStaleSec, targetMinStaleSec);
>
> ropt.setName("Reopen Thread");
> ropt.setPriority(Math.min(Thread.currentThread().getPriority() + 2, Thread.MAX_PRIORITY));
> ropt.setDaemon(true);
> ropt.start();
>
> Where may the searchermanager fall out?
>
> org.apache.lucene.store.AlreadyClosedException: this SearcherManager is closed77
>     at org.apache.lucene.search.SearcherManager.acquire(SearcherManager.java:235)
>     at com.yyt.core.er.lucene.YYTLuceneImpl.codeIndexed(YYTLuceneImpl.java:138)
>     at com.yyt.core.er.main.copy.SingleCodeER.run(SingleCodeER.java:50)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: NRTManager and AlreadyClosedException
You are right. There is a method by which I do searching. At the end of the method, I release the index searcher (not the searchermanager). Since this method is called by multiple threads. So I think the index searcher will be released multiple times. First, I wonder if releasing searcher is same as releasing the searcher manager. Second, as said in Mike's blog, the searcher should be released, which has seemingly caused the problem. What are my alternatives here to avoid it? Thanks On Wed, Feb 8, 2012 at 7:51 PM, Ian Lea wrote: > Are you closing the SearcherManager? Calling release() multiple times? > > From the exception message the first sounds most likely. > > > -- > Ian. > > > On Wed, Feb 8, 2012 at 5:20 AM, Cheng wrote: > > Hi, > > > > I am using NRTManager and NRTManagerReopenThread. Though I don't close > > either writer or the reopen thread, I receive AlreadyClosedException as > > follow. > > > > My initiating NRTManager and NRTManagerReopenThread are: > > > > FSDirectory indexDir = new NIOFSDirectory(new File( > > indexFolder)); > > > > IndexWriterConfig iwConfig = new IndexWriterConfig( > > version, new LimitTokenCountAnalyzer( > > StandardAnalyzer, maxTokenNum)); > > > > iw = new IndexWriter(indexDir, iwConfig); > > > > nrtm = new NRTManager(iw, null); > > > > ropt = new NRTManagerReopenThread(nrtm, > > targetMaxStaleSec, > > targetMinStaleSec); > > > > ropt.setName("Reopen Thread"); > > ropt.setPriority(Math.min(Thread.currentThread().getPriority() + 2, > > Thread.MAX_PRIORITY)); > > ropt.setDaemon(true); > > ropt.start(); > > > > > > Where may the searchermanager fall out? > > > > > > > > org.apache.lucene.store.AlreadyClosedException: this SearcherManager is > > closed77 > > at > > > org.apache.lucene.search.SearcherManager.acquire(SearcherManager.java:235) > > at > com.yyt.core.er.lucene.YYTLuceneImpl.codeIndexed(YYTLuceneImpl.java:138) > > at com.yyt.core.er.main.copy.SingleCodeER.run(SingleCodeER.java:50) > > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
Re: NRTManager and AlreadyClosedException
Releasing a searcher is not the same as closing the searcher manager, if that is what you mean. The searcher should indeed be released, but once only for each acquire(). Your searching threads should have code like that shown in the SearcherManager javadocs. IndexSearcher s = manager.acquire(); try { // Do searching, doc retrieval, etc. with s } finally { manager.release(s); } // Do not use s after this! s = null; -- Ian. On Wed, Feb 8, 2012 at 12:09 PM, Cheng wrote: > You are right. There is a method by which I do searching. At the end of the > method, I release the index searcher (not the searchermanager). > > Since this method is called by multiple threads. So I think the index > searcher will be released multiple times. > > First, I wonder if releasing searcher is same as releasing the searcher > manager. > > Second, as said in Mike's blog, the searcher should be released, which has > seemingly caused the problem. What are my alternatives here to avoid it? > > Thanks > > > > On Wed, Feb 8, 2012 at 7:51 PM, Ian Lea wrote: > >> Are you closing the SearcherManager? Calling release() multiple times? >> >> From the exception message the first sounds most likely. >> >> >> -- >> Ian. >> >> >> On Wed, Feb 8, 2012 at 5:20 AM, Cheng wrote: >> > Hi, >> > >> > I am using NRTManager and NRTManagerReopenThread. Though I don't close >> > either writer or the reopen thread, I receive AlreadyClosedException as >> > follow. >> > >> > My initiating NRTManager and NRTManagerReopenThread are: >> > >> > FSDirectory indexDir = new NIOFSDirectory(new File( >> > indexFolder)); >> > >> > IndexWriterConfig iwConfig = new IndexWriterConfig( >> > version, new LimitTokenCountAnalyzer( >> > StandardAnalyzer, maxTokenNum)); >> > >> > iw = new IndexWriter(indexDir, iwConfig); >> > >> > nrtm = new NRTManager(iw, null); >> > >> > ropt = new NRTManagerReopenThread(nrtm, >> > targetMaxStaleSec, >> > targetMinStaleSec); >> > >> > ropt.setName("Reopen Thread"); >> > ropt.setPriority(Math.min(Thread.currentThread().getPriority() + 2, >> > Thread.MAX_PRIORITY)); >> > ropt.setDaemon(true); >> > ropt.start(); >> > >> > >> > Where may the searchermanager fall out? >> > >> > >> > >> > org.apache.lucene.store.AlreadyClosedException: this SearcherManager is >> > closed77 >> > at >> > >> org.apache.lucene.search.SearcherManager.acquire(SearcherManager.java:235) >> > at >> com.yyt.core.er.lucene.YYTLuceneImpl.codeIndexed(YYTLuceneImpl.java:138) >> > at com.yyt.core.er.main.copy.SingleCodeER.run(SingleCodeER.java:50) >> > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) >> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) >> >> - >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: NRTManager and AlreadyClosedException
I use it exactly the same way. So there must be other reason causing the problem. On Wed, Feb 8, 2012 at 8:21 PM, Ian Lea wrote: > Releasing a searcher is not the same as closing the searcher manager, > if that is what you mean. > > The searcher should indeed be released, but once only for each > acquire(). Your searching threads should have code like that shown in > the SearcherManager javadocs. > > IndexSearcher s = manager.acquire(); > try { > // Do searching, doc retrieval, etc. with s > } finally { > manager.release(s); > } > // Do not use s after this! > s = null; > > -- > Ian. > > > On Wed, Feb 8, 2012 at 12:09 PM, Cheng wrote: > > You are right. There is a method by which I do searching. At the end of > the > > method, I release the index searcher (not the searchermanager). > > > > Since this method is called by multiple threads. So I think the index > > searcher will be released multiple times. > > > > First, I wonder if releasing searcher is same as releasing the searcher > > manager. > > > > Second, as said in Mike's blog, the searcher should be released, which > has > > seemingly caused the problem. What are my alternatives here to avoid it? > > > > Thanks > > > > > > > > On Wed, Feb 8, 2012 at 7:51 PM, Ian Lea wrote: > > > >> Are you closing the SearcherManager? Calling release() multiple times? > >> > >> From the exception message the first sounds most likely. > >> > >> > >> -- > >> Ian. > >> > >> > >> On Wed, Feb 8, 2012 at 5:20 AM, Cheng wrote: > >> > Hi, > >> > > >> > I am using NRTManager and NRTManagerReopenThread. Though I don't close > >> > either writer or the reopen thread, I receive AlreadyClosedException > as > >> > follow. > >> > > >> > My initiating NRTManager and NRTManagerReopenThread are: > >> > > >> > FSDirectory indexDir = new NIOFSDirectory(new File( > >> > indexFolder)); > >> > > >> > IndexWriterConfig iwConfig = new IndexWriterConfig( > >> > version, new LimitTokenCountAnalyzer( > >> > StandardAnalyzer, maxTokenNum)); > >> > > >> > iw = new IndexWriter(indexDir, iwConfig); > >> > > >> > nrtm = new NRTManager(iw, null); > >> > > >> > ropt = new NRTManagerReopenThread(nrtm, > >> > targetMaxStaleSec, > >> > targetMinStaleSec); > >> > > >> > ropt.setName("Reopen Thread"); > >> > ropt.setPriority(Math.min(Thread.currentThread().getPriority() + 2, > >> > Thread.MAX_PRIORITY)); > >> > ropt.setDaemon(true); > >> > ropt.start(); > >> > > >> > > >> > Where may the searchermanager fall out? > >> > > >> > > >> > > >> > org.apache.lucene.store.AlreadyClosedException: this SearcherManager > is > >> > closed77 > >> > at > >> > > >> > org.apache.lucene.search.SearcherManager.acquire(SearcherManager.java:235) > >> > at > >> com.yyt.core.er.lucene.YYTLuceneImpl.codeIndexed(YYTLuceneImpl.java:138) > >> > at com.yyt.core.er.main.copy.SingleCodeER.run(SingleCodeER.java:50) > >> > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) > >> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) > >> > >> - > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > >> For additional commands, e-mail: java-user-h...@lucene.apache.org > >> > >> > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
Re: NRTManager and AlreadyClosedException
are you closing the NRTManager while other threads still accessing the SearcherManager? simon On Wed, Feb 8, 2012 at 1:48 PM, Cheng wrote: > I use it exactly the same way. So there must be other reason causing the > problem. > > On Wed, Feb 8, 2012 at 8:21 PM, Ian Lea wrote: > >> Releasing a searcher is not the same as closing the searcher manager, >> if that is what you mean. >> >> The searcher should indeed be released, but once only for each >> acquire(). Your searching threads should have code like that shown in >> the SearcherManager javadocs. >> >> IndexSearcher s = manager.acquire(); >> try { >> // Do searching, doc retrieval, etc. with s >> } finally { >> manager.release(s); >> } >> // Do not use s after this! >> s = null; >> >> -- >> Ian. >> >> >> On Wed, Feb 8, 2012 at 12:09 PM, Cheng wrote: >> > You are right. There is a method by which I do searching. At the end of >> the >> > method, I release the index searcher (not the searchermanager). >> > >> > Since this method is called by multiple threads. So I think the index >> > searcher will be released multiple times. >> > >> > First, I wonder if releasing searcher is same as releasing the searcher >> > manager. >> > >> > Second, as said in Mike's blog, the searcher should be released, which >> has >> > seemingly caused the problem. What are my alternatives here to avoid it? >> > >> > Thanks >> > >> > >> > >> > On Wed, Feb 8, 2012 at 7:51 PM, Ian Lea wrote: >> > >> >> Are you closing the SearcherManager? Calling release() multiple times? >> >> >> >> From the exception message the first sounds most likely. >> >> >> >> >> >> -- >> >> Ian. >> >> >> >> >> >> On Wed, Feb 8, 2012 at 5:20 AM, Cheng wrote: >> >> > Hi, >> >> > >> >> > I am using NRTManager and NRTManagerReopenThread. Though I don't close >> >> > either writer or the reopen thread, I receive AlreadyClosedException >> as >> >> > follow. >> >> > >> >> > My initiating NRTManager and NRTManagerReopenThread are: >> >> > >> >> > FSDirectory indexDir = new NIOFSDirectory(new File( >> >> > indexFolder)); >> >> > >> >> > IndexWriterConfig iwConfig = new IndexWriterConfig( >> >> > version, new LimitTokenCountAnalyzer( >> >> > StandardAnalyzer, maxTokenNum)); >> >> > >> >> > iw = new IndexWriter(indexDir, iwConfig); >> >> > >> >> > nrtm = new NRTManager(iw, null); >> >> > >> >> > ropt = new NRTManagerReopenThread(nrtm, >> >> > targetMaxStaleSec, >> >> > targetMinStaleSec); >> >> > >> >> > ropt.setName("Reopen Thread"); >> >> > ropt.setPriority(Math.min(Thread.currentThread().getPriority() + 2, >> >> > Thread.MAX_PRIORITY)); >> >> > ropt.setDaemon(true); >> >> > ropt.start(); >> >> > >> >> > >> >> > Where may the searchermanager fall out? 
>> >> > >> >> > >> >> > >> >> > org.apache.lucene.store.AlreadyClosedException: this SearcherManager >> is >> >> > closed77 >> >> > at >> >> > >> >> >> org.apache.lucene.search.SearcherManager.acquire(SearcherManager.java:235) >> >> > at >> >> com.yyt.core.er.lucene.YYTLuceneImpl.codeIndexed(YYTLuceneImpl.java:138) >> >> > at com.yyt.core.er.main.copy.SingleCodeER.run(SingleCodeER.java:50) >> >> > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) >> >> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) >> >> >> >> - >> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> >> >> >> >> - >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
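The close-order question Simon raises matters: if anything closes the NRTManager (or the SearcherManager obtained from it) while search threads can still call acquire(), you get exactly this AlreadyClosedException. Below is a rough sketch of an orderly shutdown, reusing the field names from the original post (iw, nrtm, ropt) plus an assumed ExecutorService that runs the search threads; it is not code from the thread:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.NRTManager;
import org.apache.lucene.search.NRTManagerReopenThread;

public final class NrtShutdown {

    // No search thread may touch the SearcherManager after the executor drains.
    public static void shutdown(ExecutorService searchExecutor,
                                NRTManagerReopenThread ropt,
                                NRTManager nrtm,
                                IndexWriter iw) throws Exception {
        searchExecutor.shutdown();
        searchExecutor.awaitTermination(1, TimeUnit.MINUTES); // all searches finished
        ropt.close();   // stop the reopen thread first
        nrtm.close();   // then the manager the searches were acquiring from
        iw.close();     // finally the writer
    }
}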
Re: how to create directory on a remote server protected by password
Don't. Likely to cause more problems than it's worth. See recent thread on "Why read past EOF". But if you really feel you must, either write your own implementation of FSDirectory or mount the remote folder locally at the OS level using SMB or NFS or whatever. I know which one I'd go for, except that I wouldn't do it at all. -- Ian. On Wed, Feb 8, 2012 at 12:12 PM, Cheng wrote: > Hi, > > I want to create a writer on a folder ("fsdir") in a remote server > ("10.161.1.23"), which has user id "xyz" and password "pwd". How can I do > so? > > Thanks - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
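If you do go the mount-it-locally route Ian mentions (and again, he advises against it), the Lucene side is just a normal FSDirectory over the mounted path. The mount command, share name, local path, and credentials below are illustrative assumptions only:

import java.io.File;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class RemoteMountedDirectory {

    public static Directory open() throws Exception {
        // Assumes the OS has already mounted the share, e.g. (illustrative):
        //   mount -t cifs //10.161.1.23/fsdir /mnt/fsdir -o username=xyz,password=pwd
        return FSDirectory.open(new File("/mnt/fsdir"));
    }
}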
Re: slow speed of searching
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

(the 3rd item is "Use a local filesystem"!)

--
Ian.

On Wed, Feb 8, 2012 at 12:44 PM, Cheng wrote:
> Hi,
>
> I have about 6.5 million documents, which lead to a 1.5G index. Searching
> for a couple of terms, like "dvd" and "price", takes about 0.1 second.
>
> I am afraid that our data will grow rapidly. Apart from dividing documents
> into multiple indexes, what solutions can I try to improve
> searching speed?
>
> Thanks

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: slow speed of searching
thanks a lot On Wed, Feb 8, 2012 at 9:48 PM, Ian Lea wrote: > http://wiki.apache.org/lucene-java/ImproveSearchingSpeed > > (the 3rd item is Use a local filesystem!) > > -- > Ian. > > > On Wed, Feb 8, 2012 at 12:44 PM, Cheng wrote: > > Hi, > > > > I have about 6.5 million documents which lead to 1.5G index. The speed of > > search a couple terms, like "dvd" and "price", causes about 0.1 second. > > > > I am afraid that our data will grow rapidly. Except for dividing > documents > > into multiple indexes, what are the solutions I can try to improve > > searching spead? > > > > Thanks > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
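One of the cheapest wins on that wiki page is reusing a single read-only searcher across queries and threads instead of opening a new reader per search. A small sketch, with a made-up index path:

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public final class SharedSearcher {

    private static final IndexSearcher SEARCHER;

    static {
        try {
            // Read-only reader on a local filesystem, opened once and shared.
            SEARCHER = new IndexSearcher(
                IndexReader.open(FSDirectory.open(new File("/path/to/index")), true));
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static IndexSearcher get() {
        return SEARCHER;
    }
}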
Working with MemoryIndex results
Hello,

I'm using a MemoryIndex in order to search a block of in-memory text using a Lucene query. I'm able to search the text, produce a result, and excerpt a highlight using the highlighter. Right now I'm doing this:

MemoryIndex index = new MemoryIndex();
index.addField("content", fullText, LuceneAnalyzer);

if (index.search(query) > 0.0f) {
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    highlighter.setTextFragmenter(new SimpleFragmenter(150));
    List excerpts = Arrays.asList(highlighter.getBestFragments(LuceneAnalyzer, "content", fullText, 5));
    for (String excerpt : excerpts) {
        System.out.println(query.toString() + ": " + excerpt);
    }
}

I'd really like to be able to get the raw TextFragments from the Highlighter, but I need a TokenStream in order to be able to call highlighter.getBestTextFragments. What's the best way to get a TokenStream from a block of text?

Thanks Much!
-Dave

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
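One way to get that TokenStream is to build it directly from the analyzer and the original text. A sketch reusing the variables from the snippet above (it assumes the same analyzer used for addField is suitable for highlighting):

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.TextFragment;

Highlighter highlighter = new Highlighter(new QueryScorer(query));
highlighter.setTextFragmenter(new SimpleFragmenter(150));

// Re-analyze the raw text to get a TokenStream for the highlighter.
TokenStream tokens = LuceneAnalyzer.tokenStream("content", new StringReader(fullText));
TextFragment[] fragments = highlighter.getBestTextFragments(tokens, fullText, false, 5);
for (TextFragment fragment : fragments) {
    if (fragment != null && fragment.getScore() > 0) {
        System.out.println(query.toString() + ": " + fragment.toString());
    }
}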
Please explain DisjunctionMaxQuery JavaDoc.
What the heck is the JavaDoc for DisjunctionMaxQuery saying:

"A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries. This is useful when searching for a word in multiple fields with different boost factors (so that the fields cannot be combined equivalently into a single search field). We want the primary score to be the one associated with the highest boost, not the sum of the field scores (as BooleanQuery would give). If the query is "albino elephant" this ensures that "albino" matching one field and "elephant" matching another gets a higher score than "albino" matching both fields. To get this result, use both BooleanQuery and DisjunctionMaxQuery: for each term a DisjunctionMaxQuery searches for it in each field, while the set of these DisjunctionMaxQuery's is combined into a BooleanQuery. The tie breaker capability allows results that include the same term in multiple fields to be judged better than results that include this term in only the best of those multiple fields, without confusing this with the better case of two different terms in the multiple fields."

"Maximum ... as produced by any subquery", OK that makes sense. We pick the score that is the highest. If you have DMQ (Q1, Q2, Q3) and the subquery scores are (0.1, 0.2, 0.1), then Q2 wins and the overall score is 0.2, right?

But then what is the meaning of "any additional matching subqueries"? Is the description then:

(1) Running with the idea that something has to tie to involve a tie-breaker, I might say "If two subqueries are both the maximum of all the subqueries, the score will be the maximum score increased by the tie breaker increment."

Example: a DMQ with an increment of 0.15 and three subqueries (Q1, Q2, Q3) which score (0.1, 0.2, 0.2): because there are two 0.2 scores, the score for this query will be 0.2 + 0.15, or 0.35. If the scores are (0.1, 0.1, 0.2), the overall score is 0.2, because we had only one maximum.

OR, alternately, forgetting the idea that anything is tied within the set of subqueries:

(2) "If, in addition to the maximum subquery score, there are any other subqueries with nonzero scores, the overall score is increased by the tie-breaker increment."

Example: using the same increment of 0.15, if the scores are (0.0, 0.0, 0.2) the result is a score of 0.2, but (0.0, 0.1, 0.2) scores 0.35.

I'm leaning toward interpretation #2, but "tie breaking for ... additional matching..." does not say that to me, because I don't see any tie.

Once I understand that, I'll ask about how to "use both BooleanQuery and DisjunctionMaxQuery".

-Paul
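For the "use both BooleanQuery and DisjunctionMaxQuery" part, the pattern the JavaDoc describes looks roughly like this; the field names (title, body) and the tie-breaker value are made up for illustration:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class AlbinoElephant {

    // One DisjunctionMaxQuery per term (max across fields),
    // all terms combined in a BooleanQuery (sum across terms).
    public static Query build(float tieBreakerMultiplier) {
        DisjunctionMaxQuery albino = new DisjunctionMaxQuery(tieBreakerMultiplier);
        albino.add(new TermQuery(new Term("title", "albino")));
        albino.add(new TermQuery(new Term("body", "albino")));

        DisjunctionMaxQuery elephant = new DisjunctionMaxQuery(tieBreakerMultiplier);
        elephant.add(new TermQuery(new Term("title", "elephant")));
        elephant.add(new TermQuery(new Term("body", "elephant")));

        BooleanQuery both = new BooleanQuery();
        both.add(albino, Occur.SHOULD);   // sum across terms...
        both.add(elephant, Occur.SHOULD); // ...max (plus tie breaker) across fields
        return both;
    }
}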
RE: Please explain DisjunctionMaxQuery JavaDoc.
> -----Original Message-----
> From: Paul Allan Hill [mailto:p...@metajure.com]
> Sent: Wednesday, February 08, 2012 2:42 PM
> To: java-user@lucene.apache.org
> Subject: Please explain DisjunctionMaxQuery JavaDoc.
>
> What the heck is the JavaDoc for DisjunctionMaxQuery saying:
>
> [...] plus a tie breaking increment

Oh my, the first problem is that the class description discusses a "tie breaking increment", but the API says tie breaking multiplier. Then, wandering around in the code, I find DisjunctionMaxScorer.score():

...
return scoreMax + (scoreSum - scoreMax) * tieBreakerMultiplier;
...

which, upon examination, is "the score of each non-maximum disjunct for a document is multiplied by this weight and added into the final score", as described in the c'tor of DisjunctionMaxQuery. But what this has to do with any idea of a "tie" anywhere in this query, I don't know.

-Paul

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
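Plugging numbers into that line confirms interpretation #2 from the earlier message: every non-maximum disjunct contributes a fraction of its score, tie or no tie. A small worked example (values made up):

// DisjunctionMaxScorer.score() with tieBreakerMultiplier = 0.15
// and subquery scores (0.0, 0.1, 0.2):
float tieBreakerMultiplier = 0.15f;
float scoreMax = 0.2f;
float scoreSum = 0.0f + 0.1f + 0.2f;                                  // 0.3
float score = scoreMax + (scoreSum - scoreMax) * tieBreakerMultiplier; // 0.2 + 0.1 * 0.15 = 0.215

// With a genuine "tie", scores (0.1, 0.2, 0.2): 0.2 + 0.3 * 0.15 = 0.245,
// so a tie is handled the same way as any other non-maximum match.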
Index writing performance of 3.5
Hello,

I am currently evaluating Lucene 3.5.0 for upgrading from 3.0.3, and in the context of my usage, the most important parameter is index writing throughput. To that end, I have been running various tests, but I am seeing some contradictory results from different setups, which hopefully someone with a better knowledge of Lucene's internals could explain...

First, let me describe my usage of Lucene, which is common across all of these cases.

1. Terms: non-analyzed strings or integral types, mostly. No free-form text values on fields.
2. All indexed fields are stored.
3. Multiple threads per index writer, in the overall application currently capped at 4.
4. Document deletes are performed with each index update, using a simple string term to identify the document.
5. Default IndexWriter config settings are used, i.e. directory type, merge policy, RAM buffer size, etc.
6. Typical data size for an index is anywhere from a few hundred K docs up to a few hundred M.
7. Hardware config:
   - kernel 2.6.16-60 SMP (SuSE Enterprise Server 10)
   - 16x CPU
   - 16G RAM
   - ReiserFS partition for index data (more on this below)

Here is where things diverge, though. The first use case is a standalone performance test, which writes 1M documents containing 4 fields (2 string, 2 numeric) to a single index using 10 worker threads. In this case, I do not see any writing performance degradation when going from 3.0.3 to 3.5.

The second setup is a distributed, multi-threaded client-server application, where Lucene is used on the server to implement the search functionality. Clients have the ability to submit searchable data for indexing, as well as to run queries against the data. I realize this is a very generic description, and if needed I could provide more specifics later. For now, let's say the second test runs on one such client, and submits 3 million records for the server to process (and also index via Lucene). Total time taken is then reported.

But when running the test above, I can definitely observe a consistent increase in test times when the only thing changing is Lucene going from 3.0.3 to 3.5.0, on the order of 15-35%. How can I reconcile this discrepancy?

My theory at this point is that the combination of the kernel above and ReiserFS (the default FS for the distro) is somehow making index writing in 3.5.0 slower, possibly due to the BKL issue, but only when used in a heavily multi-threaded system. Unfortunately, I currently have no ext3 partitions, or the ability to upgrade the kernel on the system, to prove or disprove this.

Has anyone experienced issues like this in a similar setup, or maybe benchmarked Lucene across different file system types and release versions?

Thanks,
-V

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
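For reference, a minimal harness along the lines of the standalone test described above might look like the sketch below. It is not the poster's actual code; the field names, document count, and thread count are assumptions, and the idea is simply that the same harness can be pointed at a ReiserFS and an ext3 partition (or built against 3.0.3 and 3.5.0) and timed:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Throughput probe: several threads doing updateDocument() (delete + add, as
// in the real application) against an index with default IndexWriterConfig
// settings, reporting docs/sec at the end.
public class IndexThroughputProbe {

    static final int DOCS = 1000000;
    static final int THREADS = 4;

    public static void main(String[] args) throws Exception {
        final IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File(args[0])),
            new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));

        final AtomicInteger next = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        long start = System.currentTimeMillis();

        for (int t = 0; t < THREADS; t++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        for (int i = next.getAndIncrement(); i < DOCS; i = next.getAndIncrement()) {
                            String id = "doc-" + i;
                            Document doc = new Document();
                            doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
                            doc.add(new Field("type", "record", Field.Store.YES, Field.Index.NOT_ANALYZED));
                            doc.add(new NumericField("seq", Field.Store.YES, true).setIntValue(i));
                            doc.add(new NumericField("ts", Field.Store.YES, true).setLongValue(System.currentTimeMillis()));
                            writer.updateDocument(new Term("id", id), doc);
                        }
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        writer.close();
        System.out.println("docs/sec: " + DOCS * 1000L / (System.currentTimeMillis() - start));
    }
}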