Re: How best to handle a reasonable amount of data (25TB+)

2012-02-08 Thread Danil ŢORIN
It also depends on your queries.

For example if you only query data for 1 month intervals, and you
partition by date, you can calculate in which shard your data can be
found, and query just that shard.

If you can find a partition key that is always present in the query,
you can create a gazillion small shards, yet redirect each query to
just the relevant shard and keep search latency low.
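
To make that concrete, here's a tiny sketch of date-based routing (the
shard naming scheme and helper class are hypothetical, purely for
illustration):

import java.text.SimpleDateFormat;
import java.util.Date;

public class ShardRouter {
    // One shard per month; the "index-yyyy-MM" naming is made up.
    // SimpleDateFormat is not thread-safe, so build one per call here.
    public static String shardFor(Date queryDate) {
        return "index-" + new SimpleDateFormat("yyyy-MM").format(queryDate);
    }
}

A one-month query then hits exactly one shard, e.g. shardFor(date)
returning "index-2012-02", no matter how many shards exist in total.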


On Wed, Feb 8, 2012 at 09:39, Li Li  wrote:
> It's up to your machines. In our application, we index about
> 30,000,000 (30M) docs/shard, and the response time is about 150ms. Our
> machines have about 48GB of memory, of which about 25GB is allocated to Solr
> and the rest is used for disk cache in Linux.
> Going by our application's numbers, indexing 1.25T docs would take
> 40,000+ machines (1.25T / 30M is roughly 41,700 shards).
>
> On Mon, Feb 6, 2012 at 10:50 AM, Peter Miller <
> peter.mil...@objectconsulting.com.au> wrote:
>
>> Hi,
>>
>> I have a little bit of an unusual set of requirements, and I am looking
>> for advice. I have researched the archives, and seen some relevant posts,
>> but they are fairly old and not specifically a match, so I thought I would
>> give this a try.
>>
>> We will eventually have about 50TB raw, non-searchable data and 25TB of
>> search attributes to handle in Lucene, across about 1.25 trillion
>> documents. The app is write once, read many. There are many document types
>> involved that have to be able to be searched separately or together, with
>> some common attributes, but also unique ones per type. I plan on using a
>> JCR implementation that uses Lucene under the covers. The data itself is
>> not searchable, only the attributes. I plan to hook the JCR repo
>> (ModeShape) up to the OpenStack Object Storage on commodity hardware
>> eventually with 5 machines, each with 24 x 2TB drives. This should allow
>> for redundancy (3 copies), although I would suppose we would add bigger
>> drives as we go on.
>>
>> Since there is such a lot of data to index (not outrageous amounts for
>> these days, but a bit chunky), I was sort of assuming that the Lucene
>> indexes would go on the object storage solution too, to handle availability
>> and other infrastructure issues. Most of the searches would be
>> date-constrained, so I thought that the indexes could be sharded by date.
>>
>> There would be a local disk index being built near real time on the JCR
>> hardware that could be regularly merged in with the main indexes on the
>> object storage, I suppose.
>>
>> Does that make sense, and would it work? Sorry, but this is just
>> theoretical at the moment and I'm not experienced in Lucene, as you can no
>> doubt tell.
>>
>> I came across a piece that talks about Hadoop and distributed Solr,
>> http://blog.mgm-tp.com/2010/09/hadoop-log-management-part4/, and I'm now
>> wondering if that would be a superior approach? Or any other suggestions?
>>
>> Many Thanks,
>> The Captn
>>




Re: How best to handle a reasonable amount of data (25TB+)

2012-02-08 Thread Petite Abeille

On Feb 8, 2012, at 10:14 AM, Danil ŢORIN wrote:

> For example if you only query data for 1 month intervals, and you
> partition by date, you can calculate in which shard your data can be
> found, and query just that shard.

This is what one calls "partition pruning" in database terms.

http://en.wikipedia.org/wiki/Partition_(database)
http://www.orafaq.com/tuningguide/partition%20prune.html

Rather handy way to scale to "infinity and beyond" as Buzz Lightyear would have 
it.

Perhaps of interest:

Scaling to Infinity – Partitioning in Oracle Data Warehouses
http://www.evdbt.com/TGorman%20TD2009%20DWScale.doc
http://www.evdbt.com/OOW09%20DWScaling%20TGorman%2020091013.ppt





Re: Why read past EOF

2012-02-08 Thread Michael McCandless
Hmm, there's a problem with the logic here (sorry: this is my fault --
my prior suggestion is flat out wrong!).

The problem is... say you commit once, creating commit point 1.  Two
hours later, you commit again, creating commit point 2.  The bug is,
at this point, immediately on committing commit point 2, this deletion
policy will go and remove commit point 1.  Instead, it's supposed to
wait 10 minutes to do so.

So... I think you should go back to using System.currentTimeMillis()
as "the present".  And then, only when the newest commit is more than
10 minutes old, are you allowed to delete the commits before it.  That
should work?

However: you should leave a margin of error, because say the reader
takes 10 seconds to reopen + warm/cutover all search threads... then,
if timing is unlucky, you can still remove a commit point being used
by a reader.  I would leave a comfortable margin, eg if you reopen
readers every 10 minutes, then delete commits older than 15 or 20
minutes.  If commits are rare, then leaving a fat margin here will cost
nothing in practice... and if there is some clock change
(System.currentTimeMillis() suddenly jumps, maybe from daylight
savings time, maybe from aggressive clock syncing, whatever), you have
some margin.
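
To make that concrete, here is a minimal sketch of one reading of the
above (the names and the 20-minute margin are illustrative, not a
blessed implementation): a commit may be deleted only once a *newer*
commit has existed for longer than the margin, so readers have had
time to reopen past it.

import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

public class AgedCommitDeletionPolicy implements IndexDeletionPolicy {
    private static final long MARGIN_MS = 20 * 60 * 1000; // fat margin

    public void onInit(List<? extends IndexCommit> commits) throws IOException {
        onCommit(commits);
    }

    public void onCommit(List<? extends IndexCommit> commits) throws IOException {
        long now = System.currentTimeMillis(); // "the present"
        // Never delete the newest commit (the last element). Checking the
        // *next newer* commit's age, rather than the just-created newest
        // one, keeps this working even though onCommit runs immediately
        // after each fresh commit.
        for (int i = 0; i < commits.size() - 1; i++) {
            if (now - commits.get(i + 1).getTimestamp() > MARGIN_MS) {
                commits.get(i).delete();
            }
        }
    }
}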

Really, a better overall design would be a hard handshake with all
outstanding readers, so that only once every single reader using a
given commit has closed do you delete the commit.  Then you are
immune to clock unreliability... but this'd require remote communication
in your app to track reader states.

Also, you should remove that dangerous auto-generated catch block; it
may suppress a real exception some day... and onCommit is allowed to
throw IOE.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Feb 7, 2012 at 9:15 PM, superruiye  wrote:
> import java.io.IOException;
> import java.util.List;
> import org.apache.lucene.index.IndexCommit;
> import org.apache.lucene.index.IndexDeletionPolicy;
>
> public class PostponeCommitDeletionPolicy implements IndexDeletionPolicy {
>     private final static long deletionPostPone = 60;
>
>     public void onInit(List<? extends IndexCommit> commits) {
>         // Note that commits.size() should normally be 1:
>         onCommit(commits);
>     }
>
>     /**
>      * delete commits after deletionPostPone ms.
>      */
>     public void onCommit(List<? extends IndexCommit> commits) {
>         // Note that commits.size() should normally be 2 (if not
>         // called by onInit above):
>         int size = commits.size();
>         try {
>             long lastCommitTimestamp = commits.get(size - 1).getTimestamp();
>             for (int i = 0; i < size - 1; i++) {
>                 if (lastCommitTimestamp - commits.get(i).getTimestamp() > deletionPostPone) {
>                     commits.get(i).delete();
>                 }
>             }
>         } catch (IOException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>     }
> }
> --
> indexWriterConfig.setIndexDeletionPolicy(new
> PostponeCommitDeletionPolicy());
> --
> and I use a timer task (10 minutes) to reopen the IndexSearcher, but I still get read
> past EOF... the trace:
> java.io.IOException: read past EOF
>        at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:207)
>        at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
>        at org.apache.lucene.store.DataInput.readInt(DataInput.java:84)
>        at org.apache.lucene.store.BufferedIndexInput.readInt(BufferedIndexInput.java:153)
>        at org.apache.lucene.index.TermVectorsReader.checkValidFormat(TermVectorsReader.java:197)
>        at org.apache.lucene.index.TermVectorsReader.<init>(TermVectorsReader.java:86)
>        at org.apache.lucene.index.SegmentCoreReaders.openDocStores(SegmentCoreReaders.java:221)
>        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:117)
>        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:93)
>        at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:113)
>        at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:29)
>        at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:81)
>        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:754)
>        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:75)
>        at org.apache.lucene.index.IndexReader.open(IndexReader.java:421)
>        at org.apache.lucene.index.IndexReader.open(IndexReader.java:281)
>        at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:89)
>        at com.ableskysearch.migration.timertask.ReopenIndexSearcherTask.runAsPeriod(ReopenIndexSearcherTask.java:40)

Re: NRTManager and AlreadyClosedException

2012-02-08 Thread Ian Lea
Are you closing the SearcherManager?  Calling release() multiple times?

From the exception message the first sounds most likely.


--
Ian.


On Wed, Feb 8, 2012 at 5:20 AM, Cheng  wrote:
> Hi,
>
> I am using NRTManager and NRTManagerReopenThread. Though I don't close
> either writer or the reopen thread, I receive AlreadyClosedException as
> follow.
>
> My initiating NRTManager and NRTManagerReopenThread are:
>
> FSDirectory indexDir = new NIOFSDirectory(new File(
> indexFolder));
>
> IndexWriterConfig iwConfig = new IndexWriterConfig(
> version, new LimitTokenCountAnalyzer(
> StandardAnalyzer, maxTokenNum));
>
> iw = new IndexWriter(indexDir, iwConfig);
>
> nrtm = new NRTManager(iw, null);
>
> ropt = new NRTManagerReopenThread(nrtm,
> targetMaxStaleSec,
> targetMinStaleSec);
>
> ropt.setName("Reopen Thread");
> ropt.setPriority(Math.min(Thread.currentThread().getPriority() + 2,
> Thread.MAX_PRIORITY));
> ropt.setDaemon(true);
> ropt.start();
>
>
> Where may the searchermanager fall out?
>
>
>
> org.apache.lucene.store.AlreadyClosedException: this SearcherManager is
> closed77
> at
> org.apache.lucene.search.SearcherManager.acquire(SearcherManager.java:235)
> at com.yyt.core.er.lucene.YYTLuceneImpl.codeIndexed(YYTLuceneImpl.java:138)
> at com.yyt.core.er.main.copy.SingleCodeER.run(SingleCodeER.java:50)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)




Re: NRTManager and AlreadyClosedException

2012-02-08 Thread Cheng
You are right. There is a method in which I do the searching. At the end of
the method, I release the index searcher (not the SearcherManager).

Since this method is called by multiple threads, I think the index
searcher will be released multiple times.

First, I wonder if releasing the searcher is the same as releasing the
searcher manager.

Second, as said in Mike's blog, the searcher should be released, which has
seemingly caused the problem. What are my alternatives here to avoid it?

Thanks



On Wed, Feb 8, 2012 at 7:51 PM, Ian Lea  wrote:

> Are you closing the SearcherManager?  Calling release() multiple times?
>
> From the exception message the first sounds most likely.
>
>
> --
> Ian.
>
>
> On Wed, Feb 8, 2012 at 5:20 AM, Cheng  wrote:
> > Hi,
> >
> > I am using NRTManager and NRTManagerReopenThread. Though I don't close
> > either writer or the reopen thread, I receive AlreadyClosedException as
> > follow.
> >
> > My initiating NRTManager and NRTManagerReopenThread are:
> >
> > FSDirectory indexDir = new NIOFSDirectory(new File(
> > indexFolder));
> >
> > IndexWriterConfig iwConfig = new IndexWriterConfig(
> > version, new LimitTokenCountAnalyzer(
> > StandardAnalyzer, maxTokenNum));
> >
> > iw = new IndexWriter(indexDir, iwConfig);
> >
> > nrtm = new NRTManager(iw, null);
> >
> > ropt = new NRTManagerReopenThread(nrtm,
> > targetMaxStaleSec,
> > targetMinStaleSec);
> >
> > ropt.setName("Reopen Thread");
> > ropt.setPriority(Math.min(Thread.currentThread().getPriority() + 2,
> > Thread.MAX_PRIORITY));
> > ropt.setDaemon(true);
> > ropt.start();
> >
> >
> > Where may the searchermanager fall out?
> >
> >
> >
> > org.apache.lucene.store.AlreadyClosedException: this SearcherManager is
> > closed77
> > at
> >
> org.apache.lucene.search.SearcherManager.acquire(SearcherManager.java:235)
> > at
> com.yyt.core.er.lucene.YYTLuceneImpl.codeIndexed(YYTLuceneImpl.java:138)
> > at com.yyt.core.er.main.copy.SingleCodeER.run(SingleCodeER.java:50)
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: NRTManager and AlreadyClosedException

2012-02-08 Thread Ian Lea
Releasing a searcher is not the same as closing the searcher manager,
if that is what you mean.

The searcher should indeed be released, but once only for each
acquire().  Your searching threads should have code like that shown in
the SearcherManager javadocs.

IndexSearcher s = manager.acquire();
try {
  // Do searching, doc retrieval, etc. with s
} finally {
  manager.release(s);
}
// Do not use s after this!
s = null;

--
Ian.


On Wed, Feb 8, 2012 at 12:09 PM, Cheng  wrote:
> You are right. There is a method by which I do searching. At the end of the
> method, I release the index searcher (not the searchermanager).
>
> Since this method is called by multiple threads. So I think the index
> searcher will be released multiple times.
>
> First, I wonder if releasing searcher is same as releasing the searcher
> manager.
>
> Second, as said in Mike's blog, the searcher should be released, which has
> seemingly caused the problem. What are my alternatives here to avoid it?
>
> Thanks
>
>
>
> On Wed, Feb 8, 2012 at 7:51 PM, Ian Lea  wrote:
>
>> Are you closing the SearcherManager?  Calling release() multiple times?
>>
>> From the exception message the first sounds most likely.
>>
>>
>> --
>> Ian.
>>
>>
>> On Wed, Feb 8, 2012 at 5:20 AM, Cheng  wrote:
>> > Hi,
>> >
>> > I am using NRTManager and NRTManagerReopenThread. Though I don't close
>> > either writer or the reopen thread, I receive AlreadyClosedException as
>> > follow.
>> >
>> > My initiating NRTManager and NRTManagerReopenThread are:
>> >
>> > FSDirectory indexDir = new NIOFSDirectory(new File(
>> > indexFolder));
>> >
>> > IndexWriterConfig iwConfig = new IndexWriterConfig(
>> > version, new LimitTokenCountAnalyzer(
>> > StandardAnalyzer, maxTokenNum));
>> >
>> > iw = new IndexWriter(indexDir, iwConfig);
>> >
>> > nrtm = new NRTManager(iw, null);
>> >
>> > ropt = new NRTManagerReopenThread(nrtm,
>> > targetMaxStaleSec,
>> > targetMinStaleSec);
>> >
>> > ropt.setName("Reopen Thread");
>> > ropt.setPriority(Math.min(Thread.currentThread().getPriority() + 2,
>> > Thread.MAX_PRIORITY));
>> > ropt.setDaemon(true);
>> > ropt.start();
>> >
>> >
>> > Where may the searchermanager fall out?
>> >
>> >
>> >
>> > org.apache.lucene.store.AlreadyClosedException: this SearcherManager is
>> > closed77
>> > at
>> >
>> org.apache.lucene.search.SearcherManager.acquire(SearcherManager.java:235)
>> > at
>> com.yyt.core.er.lucene.YYTLuceneImpl.codeIndexed(YYTLuceneImpl.java:138)
>> > at com.yyt.core.er.main.copy.SingleCodeER.run(SingleCodeER.java:50)
>> > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>




Re: NRTManager and AlreadyClosedException

2012-02-08 Thread Cheng
I use it exactly the same way, so there must be some other reason causing
the problem.

On Wed, Feb 8, 2012 at 8:21 PM, Ian Lea  wrote:

> Releasing a searcher is not the same as closing the searcher manager,
> if that is what you mean.
>
> The searcher should indeed be released, but once only for each
> acquire().  Your searching threads should have code like that shown in
> the SearcherManager javadocs.
>
> IndexSearcher s = manager.acquire();
>  try {
>   // Do searching, doc retrieval, etc. with s
>  } finally {
>   manager.release(s);
>  }
>  // Do not use s after this!
>  s = null;
>
> --
> Ian.
>
>
> On Wed, Feb 8, 2012 at 12:09 PM, Cheng  wrote:
> > You are right. There is a method by which I do searching. At the end of
> the
> > method, I release the index searcher (not the searchermanager).
> >
> > Since this method is called by multiple threads. So I think the index
> > searcher will be released multiple times.
> >
> > First, I wonder if releasing searcher is same as releasing the searcher
> > manager.
> >
> > Second, as said in Mike's blog, the searcher should be released, which
> has
> > seemingly caused the problem. What are my alternatives here to avoid it?
> >
> > Thanks
> >
> >
> >
> > On Wed, Feb 8, 2012 at 7:51 PM, Ian Lea  wrote:
> >
> >> Are you closing the SearcherManager?  Calling release() multiple times?
> >>
> >> From the exception message the first sounds most likely.
> >>
> >>
> >> --
> >> Ian.
> >>
> >>
> >> On Wed, Feb 8, 2012 at 5:20 AM, Cheng  wrote:
> >> > Hi,
> >> >
> >> > I am using NRTManager and NRTManagerReopenThread. Though I don't close
> >> > either writer or the reopen thread, I receive AlreadyClosedException
> as
> >> > follow.
> >> >
> >> > My initiating NRTManager and NRTManagerReopenThread are:
> >> >
> >> > FSDirectory indexDir = new NIOFSDirectory(new File(
> >> > indexFolder));
> >> >
> >> > IndexWriterConfig iwConfig = new IndexWriterConfig(
> >> > version, new LimitTokenCountAnalyzer(
> >> > StandardAnalyzer, maxTokenNum));
> >> >
> >> > iw = new IndexWriter(indexDir, iwConfig);
> >> >
> >> > nrtm = new NRTManager(iw, null);
> >> >
> >> > ropt = new NRTManagerReopenThread(nrtm,
> >> > targetMaxStaleSec,
> >> > targetMinStaleSec);
> >> >
> >> > ropt.setName("Reopen Thread");
> >> > ropt.setPriority(Math.min(Thread.currentThread().getPriority() + 2,
> >> > Thread.MAX_PRIORITY));
> >> > ropt.setDaemon(true);
> >> > ropt.start();
> >> >
> >> >
> >> > Where may the searchermanager fall out?
> >> >
> >> >
> >> >
> >> > org.apache.lucene.store.AlreadyClosedException: this SearcherManager
> is
> >> > closed77
> >> > at
> >> >
> >>
> org.apache.lucene.search.SearcherManager.acquire(SearcherManager.java:235)
> >> > at
> >> com.yyt.core.er.lucene.YYTLuceneImpl.codeIndexed(YYTLuceneImpl.java:138)
> >> > at com.yyt.core.er.main.copy.SingleCodeER.run(SingleCodeER.java:50)
> >> > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> >> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: NRTManager and AlreadyClosedException

2012-02-08 Thread Simon Willnauer
Are you closing the NRTManager while other threads are still accessing the
SearcherManager?

simon

On Wed, Feb 8, 2012 at 1:48 PM, Cheng  wrote:
> I use it exactly the same way. So there must be other reason causing the
> problem.
>
> On Wed, Feb 8, 2012 at 8:21 PM, Ian Lea  wrote:
>
>> Releasing a searcher is not the same as closing the searcher manager,
>> if that is what you mean.
>>
>> The searcher should indeed be released, but once only for each
>> acquire().  Your searching threads should have code like that shown in
>> the SearcherManager javadocs.
>>
>> IndexSearcher s = manager.acquire();
>>  try {
>>   // Do searching, doc retrieval, etc. with s
>>  } finally {
>>   manager.release(s);
>>  }
>>  // Do not use s after this!
>>  s = null;
>>
>> --
>> Ian.
>>
>>
>> On Wed, Feb 8, 2012 at 12:09 PM, Cheng  wrote:
>> > You are right. There is a method by which I do searching. At the end of
>> the
>> > method, I release the index searcher (not the searchermanager).
>> >
>> > Since this method is called by multiple threads. So I think the index
>> > searcher will be released multiple times.
>> >
>> > First, I wonder if releasing searcher is same as releasing the searcher
>> > manager.
>> >
>> > Second, as said in Mike's blog, the searcher should be released, which
>> has
>> > seemingly caused the problem. What are my alternatives here to avoid it?
>> >
>> > Thanks
>> >
>> >
>> >
>> > On Wed, Feb 8, 2012 at 7:51 PM, Ian Lea  wrote:
>> >
>> >> Are you closing the SearcherManager?  Calling release() multiple times?
>> >>
>> >> From the exception message the first sounds most likely.
>> >>
>> >>
>> >> --
>> >> Ian.
>> >>
>> >>
>> >> On Wed, Feb 8, 2012 at 5:20 AM, Cheng  wrote:
>> >> > Hi,
>> >> >
>> >> > I am using NRTManager and NRTManagerReopenThread. Though I don't close
>> >> > either writer or the reopen thread, I receive AlreadyClosedException
>> as
>> >> > follow.
>> >> >
>> >> > My initiating NRTManager and NRTManagerReopenThread are:
>> >> >
>> >> > FSDirectory indexDir = new NIOFSDirectory(new File(
>> >> > indexFolder));
>> >> >
>> >> > IndexWriterConfig iwConfig = new IndexWriterConfig(
>> >> > version, new LimitTokenCountAnalyzer(
>> >> > StandardAnalyzer, maxTokenNum));
>> >> >
>> >> > iw = new IndexWriter(indexDir, iwConfig);
>> >> >
>> >> > nrtm = new NRTManager(iw, null);
>> >> >
>> >> > ropt = new NRTManagerReopenThread(nrtm,
>> >> > targetMaxStaleSec,
>> >> > targetMinStaleSec);
>> >> >
>> >> > ropt.setName("Reopen Thread");
>> >> > ropt.setPriority(Math.min(Thread.currentThread().getPriority() + 2,
>> >> > Thread.MAX_PRIORITY));
>> >> > ropt.setDaemon(true);
>> >> > ropt.start();
>> >> >
>> >> >
>> >> > Where may the searchermanager fall out?
>> >> >
>> >> >
>> >> >
>> >> > org.apache.lucene.store.AlreadyClosedException: this SearcherManager
>> is
>> >> > closed77
>> >> > at
>> >> >
>> >>
>> org.apache.lucene.search.SearcherManager.acquire(SearcherManager.java:235)
>> >> > at
>> >> com.yyt.core.er.lucene.YYTLuceneImpl.codeIndexed(YYTLuceneImpl.java:138)
>> >> > at com.yyt.core.er.main.copy.SingleCodeER.run(SingleCodeER.java:50)
>> >> > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>> >> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> >>
>> >> -
>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >>
>> >>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>




Re: how to create directory on a remote server protected by password

2012-02-08 Thread Ian Lea
Don't.  Likely to cause more problems than it's worth.  See recent
thread on "Why read past EOF".

But if you really feel you must, either write your own implementation
of FSDirectory or mount the remote folder locally at the OS level
using SMB or NFS or whatever.  I know which one I'd go for, except
that I wouldn't do it at all.
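
If you do go the mount route, the Lucene side is then just a local
FSDirectory over the mount point. A minimal sketch, assuming the share
has already been mounted at a hypothetical /mnt/remote-index:

import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class RemoteIndexExample {
    public static Directory openMountedIndex() throws IOException {
        // Lucene only ever sees a local path; the user id/password for
        // 10.161.1.23 are handled by the SMB/NFS mount, not by Lucene.
        return FSDirectory.open(new File("/mnt/remote-index"));
    }
}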


--
Ian.


On Wed, Feb 8, 2012 at 12:12 PM, Cheng  wrote:
> Hi,
>
> I want to create a writer on a folder ("fsdir") in a remote server
> ("10.161.1.23"), which has user id "xyz" and password "pwd". How can I do
> so?
>
> Thanks




Re: slow speed of searching

2012-02-08 Thread Ian Lea
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

(the 3rd item is Use a local filesystem!)

--
Ian.


On Wed, Feb 8, 2012 at 12:44 PM, Cheng  wrote:
> Hi,
>
> I have about 6.5 million documents, which lead to a 1.5G index. Searching
> for a couple of terms, like "dvd" and "price", takes about 0.1 seconds.
>
> I am afraid that our data will grow rapidly. Except for dividing documents
> into multiple indexes, what solutions can I try to improve
> searching speed?
>
> Thanks




Re: slow speed of searching

2012-02-08 Thread Cheng
thanks a lot

On Wed, Feb 8, 2012 at 9:48 PM, Ian Lea  wrote:

> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
>
> (the 3rd item is Use a local filesystem!)
>
> --
> Ian.
>
>
> On Wed, Feb 8, 2012 at 12:44 PM, Cheng  wrote:
> > Hi,
> >
> > I have about 6.5 million documents, which lead to a 1.5G index. Searching
> > for a couple of terms, like "dvd" and "price", takes about 0.1 seconds.
> >
> > I am afraid that our data will grow rapidly. Except for dividing
> > documents into multiple indexes, what solutions can I try to improve
> > searching speed?
> >
> > Thanks
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Working with MemoryIndex results

2012-02-08 Thread Dave Seltzer
Hello,

I'm using a MemoryIndex in order to search a block of in-memory text with
a Lucene query.
I'm able to search the text, produce a result, and excerpt a highlight
using the highlighter.

Right now I'm doing this:

MemoryIndex index = new MemoryIndex();
index.addField("content", fullText, LuceneAnalyzer);
if (index.search(query) > 0.0f)
{
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    highlighter.setTextFragmenter(new SimpleFragmenter(150));
    List<String> excerpts = Arrays.asList(highlighter.getBestFragments(
        LuceneAnalyzer, "content", fullText, 5));
    for (String excerpt : excerpts) {
        System.out.println(query.toString() + ": " + excerpt);
    }
}

I'd really like to be able to get the raw TextFragments from the
Highlighter, but I need a TokenStream in order to be able to call
highlighter.getBestTextFragments.
What's the best way to get a TokenStream from a block of text?
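
The closest I've gotten is the sketch below (unverified; I'm assuming
the analyzer used for addField is the right one to re-tokenize with,
and that I have the getBestTextFragments signature right):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.highlight.TextFragment;

// Re-tokenize the raw text with the same analyzer and field name,
// then hand the stream to the highlighter in place of getBestFragments:
TokenStream tokenStream =
    LuceneAnalyzer.tokenStream("content", new StringReader(fullText));
TextFragment[] fragments =
    highlighter.getBestTextFragments(tokenStream, fullText, false, 5);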

Thanks Much!

-Dave




Please explain DisjunctionMaxQuery JavaDoc.

2012-02-08 Thread Paul Allan Hill
What the heck is the JavaDoc for DisjunctionMaxQuery saying?

"A query that generates the union of documents produced by its subqueries, and 
that scores each document with the maximum score for that document as produced 
by any subquery, plus a tie breaking increment for any additional matching 
subqueries. This is useful when searching for a word in multiple fields with 
different boost factors (so that the fields cannot be combined equivalently 
into a single search field). We want the primary score to be the one associated 
with the highest boost, not the sum of the field scores (as BooleanQuery would 
give). If the query is "albino elephant" this ensures that "albino" matching 
one field and "elephant" matching another gets a higher score than "albino" 
matching both fields. To get this result, use both BooleanQuery and 
DisjunctionMaxQuery: for each term a DisjunctionMaxQuery searches for it in 
each field, while the set of these DisjunctionMaxQuery's is combined into a 
BooleanQuery. The tie breaker capability allows results that include the same 
term in multiple fields to be judged better than results that include this term 
in only the best of those multiple fields, without confusing this with the 
better case of two different terms in the multiple fields."

"Maximum ...  as produced by any subquery", OK that makes sense.  We pick the 
score that is the highest.
If you have
DMQ ( Q1, Q2, Q3 )
and the subquery scores are (0.1, 0.2, 0.1), then Q2 wins and the overall score
is 0.2, right?
But then what is the meaning of "any additional matching subqueries"?
Is the description then:

(1) Running with the idea that something has to tie to involve a
tie-breaker, I might say: "If two subqueries are both the maximum of all the
subqueries, the score will be the maximum score increased by the tie breaker
increment."
Example: a DMQ with an increment of 0.15 and three subqueries ( Q1, Q2, Q3 )
which score (0.1, 0.2, 0.2): because there are two 0.2 scores, the score for
this query will be 0.2 + 0.15, or 0.35.  If the scores are (0.1, 0.1, 0.2), the
overall score is 0.2, because we had only one maximum.

Or, alternately, forgetting the idea that anything is tied within the set of
subqueries:


(2) "If, in addition to the maximum subquery score, there are any other
subqueries with nonzero scores, the overall score is increased by the
tiebreaker increment."

Example: Using the same increment of 0.15, if the scores are (0.0, 0.0, 0.2), the
result is 0.2, but (0.0, 0.1, 0.2) scores 0.35.

I'm leaning toward interpretation #2, but "tie breaking for ... additional
matching..." does not say that to me, because I don't see any tie.
Once I understand that, I'll ask about how to "use both BooleanQuery and
DisjunctionMaxQuery".

-Paul


RE: Please explain DisjunctionMaxQuery JavaDoc.

2012-02-08 Thread Paul Allan Hill


> -Original Message-
> From: Paul Allan Hill [mailto:p...@metajure.com]
> Sent: Wednesday, February 08, 2012 2:42 PM
> To: java-user@lucene.apache.org
> Subject: Please explain DisjunctionMaxQuery JavaDoc.
> 
> What the heck does is the JavaDoc for DisjunctionMaxQuery saying:
> 
>[...] plus a tie
> breaking increment 

Oh my, the 1st problem is that the class description discusses a "tie breaking
increment", but the API says tie breaking multiplier.
Then, wandering around in the code, I find
DisjunctionMaxScorer.score()
...
return scoreMax + (scoreSum - scoreMax) * tieBreakerMultiplier;
...
which upon examination IS "the score of each non-maximum disjunct for a
document is multiplied by this weight and added into the final score", as
described in the c'tor of DisjunctionMaxQuery.
But what this has to do with any idea of a "tie" anywhere in this
query, I don't know.
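
Working my earlier examples through that formula (a quick sanity
check; the numbers are made up):

float tieBreakerMultiplier = 0.15f;
// scores (0.1, 0.2, 0.2): scoreMax = 0.2, scoreSum = 0.5
float score1 = 0.2f + (0.5f - 0.2f) * tieBreakerMultiplier; // = 0.245, not 0.35
// scores (0.0, 0.1, 0.2): scoreMax = 0.2, scoreSum = 0.3
float score2 = 0.2f + (0.3f - 0.2f) * tieBreakerMultiplier; // = 0.215

So neither of my readings is exact: every non-maximum disjunct adds its
own score times the multiplier, tie or no tie.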

-Paul




Index writing performance of 3.5

2012-02-08 Thread Vitaly Funstein
Hello,

I am currently evaluating Lucene 3.5.0 for upgrading from 3.0.3, and
in the context of my usage, the most important parameter is index
writing throughput. To that end, I have been running various tests,
but seeing some contradictory results from different setups, which
hopefully someone with a better knowledge of Lucene's internals could
explain...

First, let me describe my usage of Lucene, which is common across all
of these cases.

1. Terms: non-analyzed strings or integral types, mostly. No free form
text values on fields.
2. All indexed fields are stored.
3. Multiple threads per index writer, in the overall application
currently capped at 4.
4. Document deletes are performed with each index update, using a
simple string term to identify the document.
5. Default IndexWriter config settings are used, i.e. directory type,
merge policy, RAM buffer size, etc. (see the sketch after this list).
6. Typical data size for an index is anywhere from a few hundred K
docs up to a few hundred M.
7. Hardware config:
- kernel 2.6.16-60 SMP (SuSE Enterprise Server 10)
- 16x CPU
- 16G RAM
- ReiserFS partition for index data (more on this below)
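
For reference, a minimal sketch of what items 4-5 amount to (the
analyzer and the "id" field are stand-ins; the real ones aren't
specified here):

import java.io.IOException;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public class WriterSetup {
    public static IndexWriter open(Directory dir) throws IOException {
        // The 3.0.3 equivalent was roughly:
        //   new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED)
        IndexWriterConfig conf =
            new IndexWriterConfig(Version.LUCENE_35, new KeywordAnalyzer());
        return new IndexWriter(dir, conf); // defaults throughout, per item 5
    }

    // Item 4: each update deletes the old copy by a unique string term.
    public static void update(IndexWriter w, String id, Document doc) throws IOException {
        w.updateDocument(new Term("id", id), doc);
    }
}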

Here is where things diverge though. The first use case is a
standalone performance test, which writes 1M documents containing 4
fields (2 string, 2 numeric) to a single index using 10 worker
threads. In this case, I do not see any writing performance
degradation when going from 3.0.3 to 3.5.

The second setup is a distributed multi-threaded client server
application, where Lucene is used on the server to implement the
search functionality. Clients have the ability to submit searchable
data for indexing, as well as to run queries against the data. I
realize this is a very generic description, and if needed could
provide more specifics later. For now, let's say the second test runs
on one such client, and submits 3 million records for the server to
process (and also index via Lucene). Total time taken is then
reported.

But when running the test above, I can definitely observe a consistent
increase in test times when the only thing changing is Lucene going
from 3.0.3 to 3.5.0, on the order of 15-35%.

How can I reconcile this discrepancy? My theory at this point is
that the combination of the kernel above and ReiserFS (the default FS for
the distro) is somehow making index writing in 3.5.0 slower, possibly due
to the BKL issue, but only when used in a heavily multi-threaded
system. Unfortunately, I currently have no ext3 partitions, nor the ability
to upgrade the kernel on the system to prove or disprove this.

Has anyone experienced issues like this in a similar setup, or maybe
benchmarked Lucene across different file system types and release
versions?

Thanks,
-V
