IndexSearcher on multi-core CPU machine

2007-02-18 Thread dmitri

We have a search-only (no updates) web app on a machine with two dual-core
CPUs (2x Dual Core AMD Opteron(tm) Processor 280) and 8G of RAM. Lucene 2.0
is used.
My index is optimized and non-compound, 9G holding 6.5 M documents.
Searches include term queries, range filters and sorts.

When I use a single IndexSearcher and search with multiple threads, the CPUs
are partially idle.

To reach 100% CPU utilization I have to create several IndexSearchers.

With org.apache.lucene.store.MMapDirectory throughput is better, but I still
have to create multiple IndexSearcher instances to reach 100% CPU
utilization.

With multiple IndexSearchers, search times are better under multithreaded
load.
The following are average search times (in ms) for different numbers of
parallel threads and IndexSearchers:

concurrent  1 searcher  5 searchers  10 searchers
threads
1                  180          177           167
2                  201          184           174
4                  241          197           188
5                  339          236           220
10                 663          454           420
20                1172          917           880
50                2599         2143          1912
100               4887         4056          3775

With fewer searchers, maximum search times differ by a factor of 2-3.

Search is CPU-bound (no IO wait is observed).
Is there any way to utilize the server better other than creating several
IndexSearchers?
I need to squeeze as much performance as possible out of the machine, as we
have strict performance requirements.
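
For illustration, the several-searchers workaround described above can be
sketched as a small round-robin pool that hands each request the next
instance. This is only a sketch under assumptions: RoundRobinPool is a
hypothetical helper, not part of Lucene, and the strings in main stand in
for separately opened IndexSearcher instances.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper (not a Lucene class): cycles through a fixed set of
// instances so that concurrent requests are spread across all of them.
class RoundRobinPool<T> {
    private final List<T> items;
    private final AtomicInteger counter = new AtomicInteger(0);

    public RoundRobinPool(List<T> items) {
        if (items.isEmpty()) {
            throw new IllegalArgumentException("pool must not be empty");
        }
        this.items = new ArrayList<>(items);
    }

    // Thread-safe: each call atomically claims the next slot.
    // floorMod keeps the index valid even after the counter overflows.
    public T next() {
        int slot = Math.floorMod(counter.getAndIncrement(), items.size());
        return items.get(slot);
    }

    public static void main(String[] args) {
        // The strings are placeholders for independently opened searchers.
        RoundRobinPool<String> pool = new RoundRobinPool<>(
                Arrays.asList("searcher-0", "searcher-1", "searcher-2"));
        for (int i = 0; i < 5; i++) {
            System.out.println(pool.next());
        }
    }
}
```

Each search thread would call next() once per request; the pool size is the
"number of searchers" knob from the table above.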
-- 
View this message in context: 
http://www.nabble.com/IndexSearcher-on-multi-core-CPU-machine-tf3249889.html#a9034207
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: IndexSearcher on multi-core CPU machine

2007-02-19 Thread dmitri

I use
-
searcher = new IndexSearcher(indexLocation);
-

So readers are created under the hood.


dmitri 


karl wettin-3 wrote:
> 
> 
> 18 feb 2007 kl. 22.52 skrev dmitri:
> 
>> With org.apache.lucene.store.MMapDirectory throughput is better but  
>> I still
>> have to create multiple IndexSearcher instances to have 100% CPU
>> utilization.
>>
>> With multiple IndexSearchers search times are better under  
>> multithreaded
>> load.
>>
>> concurrent  1 searcher   5 searchers   10 searchers
>> threads
>> 1   180  177   167
>> 100 4887 4056  3775
> 
> Are you using a single or multiple IndexReaders?
> 
> -- 
> karl



Re: IndexSearcher on multi-core CPU machine

2007-02-19 Thread dmitri

I don't think so, as sorting is on integer fields.
-
dmitri



Paul Smith-2 wrote:
> 
> are you using Locale-sensitive sorting at all?
> 
> https://issues.apache.org/jira/browse/LUCENE-806
> 
> Just wondering if you're seeing the same problem we are having.
> 
> cheers,
> 
> Paul Smith
> 



Re: IndexSearcher on multi-core CPU machine

2007-02-19 Thread dmitri

I haven't tried using several IndexSearchers over a single IndexReader.
Do you think it can help?
---
dmitri


karl wettin-3 wrote:
> 
> 
> What are the effects if you supply the same reader to IndexSearcher:s?
> 



Scoring while sorting

2007-02-21 Thread dmitri

What is the point of calculating the score if the result set is going to be
sorted by some field?

Is it OK to replace a query of several terms (a OR b OR c) with a
MatchAllDocsQuery and RangeFilters (from a to a, from b to b, from c to c)
if sorting is needed?
Won't it be faster?
-
dmitri
-- 
View this message in context: 
http://www.nabble.com/Scoring-while-sorting-tf3270213.html#a9092111
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





ParallelSearcher in multi-node environment

2007-02-27 Thread dmitri

Hi,

I want to execute a parallel search over several machines, but
ParallelSearcher doesn't look ideal for this. It creates threads and sends
many requests to the underlying Searchables (over the network) for a single
search.
Is there a decent implementation of parallel search over remote indexes
somewhere?

--
Thank you
  Dmitri

-- 
View this message in context: 
http://www.nabble.com/ParallelSearcher-in-multi-node-environment-tf3301080.html#a9182802
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: [ANN] ParallelSearcher in multi-node environment

2007-03-01 Thread dmitri

E.g. I've changed the original ParallelSearcher to use a thread pool
(java.util.concurrent.ThreadPoolExecutor from JDK 1.5).
But implementing a multi-host installation still requires a lot of changes,
since ParallelSearcher calls the underlying Searchables too many times (e.g.
a separate network call for every document).
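
To make the idea concrete, here is a rough sketch of a pooled fan-out that
issues exactly one call per node and merges the results locally. It is not
the modified ParallelSearcher itself: Shard and topScores are hypothetical
stand-ins for a remote Searchable, and scores are plain doubles to keep the
example JDK-only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FanOutSearch {
    // Hypothetical stand-in for one remote index: a single call returns
    // that node's best k scores, instead of one call per document.
    interface Shard {
        List<Double> topScores(String query, int k);
    }

    // Submits one task per shard to a shared pool, then merges the
    // per-shard results into a global top-k list (highest score first).
    public static List<Double> search(List<Shard> shards, String query, int k,
                                      ExecutorService pool) {
        List<Future<List<Double>>> pending = new ArrayList<>();
        for (Shard shard : shards) {
            Callable<List<Double>> call = () -> shard.topScores(query, k);
            pending.add(pool.submit(call));
        }
        List<Double> merged = new ArrayList<>();
        try {
            for (Future<List<Double>> f : pending) {
                merged.addAll(f.get());
            }
        } catch (Exception e) {
            throw new IllegalStateException("shard request failed", e);
        }
        merged.sort(Collections.reverseOrder());
        return new ArrayList<>(merged.subList(0, Math.min(k, merged.size())));
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Shard a = (q, k) -> Arrays.asList(0.9, 0.2);
        Shard b = (q, k) -> Arrays.asList(0.7, 0.5);
        System.out.println(search(Arrays.asList(a, b), "foo", 3, pool));
        pool.shutdown();
    }
}
```

The pool is created once and reused across searches, so no threads are
spawned per request; the number of network round trips per search equals
the number of shards.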

Dmitri 



Re: Which stemmer?

2012-11-26 Thread Dmitri Mamrukov





IndexSearcher and multi-threaded performance

2008-11-11 Thread Dmitri Bichko
Hi,

I'm pretty new to Lucene, so please bear with me if this has been
covered before.

The wiki suggests sharing a single IndexSearcher between threads for
best performance
(http://wiki.apache.org/lucene-java/ImproveSearchingSpeed).  I've
tested running the same set of queries with: multiple threads sharing
the same searcher, with a separate searcher for each thread, both
shared/private with a RAMDirectory in-memory index, and (just for fun)
in multiple JVMs running concurrently (the results are in milliseconds
to complete the whole job):

threads  multi-jvm  shared  per-thread  ram-shared  ram-thread
  1          72997   70883       72573       60308       60012
  2          33147   48762       35973       25498       25734
  4          16229   46828       21267       13127       27164
  6          13088   47240       14028        9858       29917
  8           9775   47020       10983        8948       10440
 10           8721   50132       11334        9587       11355
 12           7290   49002       11798        9832
 16           9365   47099       12338       11296

The shared searcher indeed behaves better with a ram-based index, but
what's going on with the disk-based one?  It's basically not scaling
beyond two threads. Am I just doing something completely wrong here?

The test consists of about 1,500 Boolean OR queries with 1-10
PhraseQueries each, with 1-20 Terms per PhraseQuery.  I'm using a
HitCollector to count the hits, so I'm not retrieving any results.
The index is about 5GB and 20 million documents.
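
For reference, a test of this shape can be sketched as below. This is not
the actual benchmark code, only a minimal JDK-only harness under
assumptions: runJob and the search callback are hypothetical names, and the
dummy loop in main stands in for executing a real query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

class ThreadScalingBench {
    // Runs every query exactly once, spread over nThreads workers, and
    // returns the wall-clock time in milliseconds for the whole job.
    public static <Q> long runJob(List<Q> queries, int nThreads, Consumer<Q> search) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        AtomicInteger next = new AtomicInteger(0);
        long start = System.nanoTime();
        List<Future<?>> workers = new ArrayList<>();
        for (int t = 0; t < nThreads; t++) {
            workers.add(pool.submit(() -> {
                int i;
                // Each worker claims the next unprocessed query until none are left.
                while ((i = next.getAndIncrement()) < queries.size()) {
                    search.accept(queries.get(i));
                }
            }));
        }
        try {
            for (Future<?> w : workers) {
                w.get();
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("benchmark worker failed", e);
        } finally {
            pool.shutdown();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        List<Integer> queries = new ArrayList<>();
        for (int i = 0; i < 1500; i++) {
            queries.add(i);
        }
        AtomicLong sink = new AtomicLong(); // keeps the dummy work observable
        for (int n : new int[] {1, 2, 4, 8}) {
            long ms = runJob(queries, n, q -> {
                long x = 0;
                for (int i = 0; i < 200_000; i++) {
                    x += (i ^ q); // placeholder for running one query
                }
                sink.addAndGet(x);
            });
            System.out.println(n + " threads: " + ms + " ms");
        }
    }
}
```

In the shared case the callback would close over a single searcher; in the
per-thread case it could fetch its searcher from a ThreadLocal instead.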

This is running on an 8 x quad-core Opteron machine with plenty of RAM to spare.

Any idea why I would see this behaviour?

Thanks,
Dmitri




Re: IndexSearcher and multi-threaded performance

2008-11-11 Thread Dmitri Bichko
I re-ran the no-readonly ram tests:

threads  ram-thread  ram-shared
 1            64043       53610
 2            26999       25260
 3            27173       17265
 4            22205       13222
 5            20795       11098
 6            17593        9852
 7            17163        8987
 8            17275        9052
 9            19392       10266
10            27809       10397
11            25987       10724
12            26550       10832

The pattern is the same, but the difference at 4 and 6 is less
pronounced - it was probably just a hiccup (I'm not using terribly
sophisticated test methodology here), it's also possible I didn't give
the JVM enough RAM (this run was with 16GB, just to be on the safe
side).

Still, looks like the extra resource management overhead for
ram-thread beats whatever lock-contention ram-shared introduces.

I'm rerunning everything with readonly set and nio, I'll post the
results once it's done.

Cheers,
Dmitri

On Tue, Nov 11, 2008 at 5:40 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
>
> Nice results, thanks!
>
> The poor disk-based scaling may be fixed by NIOFSDirectory, if you are on
> Unix.  If you are on Windows it won't help (and will likely be worse than
> FSDirectory), because of an apparent bug in Sun's JVM on Windows whereby
> NIO positional file reads seem to share a lock under the hood.
>
> The poor ram-thread result  for 4 & 6 threads is odd.  Those numbers ought
> to be at least as good as ram-shared.  Is it possible those columns are
> swapped?  Because the ram-shared case should have been hurt by using a
> non-read-only IndexReader.
>
> Mike



Re: IndexSearcher and multi-threaded performance

2008-11-11 Thread Dmitri Bichko
32 cores, actually :)

I reran the test with readonly turned on (I changed how the time is
measured a little, it should be more consistent):

threads  fs-thread  ram-thread  fs-shared  ram-shared
 1           71877       54739      73986       61595
 2           34949       26735      43719       28935
 3           25581       26885      38412       19624
 4           20511       31742      38712       15059
 5           19235       24345      39685       12509
 6           16775       26896      39592       10841
 7           17147       18296      46678       10183
 8           18327       19043      39886       10048
 9           16885       18721      40342        9483
10           17832       30757      44706       10975
11           17251       21199      39947        9704
12           17267       36284      40208       10996

I can't seem to get NIOFSDirectory working, though.  Calling
NIOFSDirectory.getDirectory("foo") just returns an FSDirectory.

Any ideas?

Cheers,
Dmitri

On Tue, Nov 11, 2008 at 5:09 PM, Mark Miller <[EMAIL PROTECTED]> wrote:
> Nice! An 8 core machine with a test ready to go!
>
> How about trying the read only mode that was added to 2.4 on your
> IndexReader?
>
> And if you are on Unix and could try trunk and use the new
> NIOFSDirectory implementation...that would be awesome.
>
> Those two additions are our current hope for what you're seeing...would be
> nice to know if we need to try for more (or if we need to petition the smart
> people that work on that stuff to try for more ;) ).
>
> - Mark



Re: IndexSearcher and multi-threaded performance

2008-11-12 Thread Dmitri Bichko
From the user's perspective, a public constructor would be the most
obvious, and would be consistent with RAMDirectory.

Dmitri

On Wed, Nov 12, 2008 at 4:50 AM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
>
> I think we really should open up a non-static way to choose a different
> FSDirectory impl?  EG maybe add optional Class to FSDirectory.getDirectory?
>  Or maybe give NIOFSDirectory a public ctor?  Or something?
>
> Mike
>
> Mark Miller wrote:
>
>> Mark Miller wrote:
>>>
>>> That's a good point, and points out a bug in Solr trunk for me. Frankly I
>>> don't see how it's done. There is no code I can see/find to use it rather
>>> than FSDirectory. Still assuming there must be a way, but I don't see it...
>>>
>> Ah - brain freeze. What else is new :) You have to set the system property
>> to change implementations: org.apache.lucene.FSDirectory.class is the
>> property, set it to the class. Been a long time...
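
A minimal sketch of what Mark describes, assuming the property is read when
FSDirectory is first used, so it has to be set before any directory is
opened. The property name comes from the message above; the fully qualified
class name org.apache.lucene.store.NIOFSDirectory and the helper
useNioDirectory are assumptions made for the example.

```java
class SelectDirectoryImpl {
    // Property name as given in the thread.
    static final String PROPERTY = "org.apache.lucene.FSDirectory.class";
    // Assumed fully qualified name of the NIO-based implementation.
    static final String NIO_IMPL = "org.apache.lucene.store.NIOFSDirectory";

    // Call this before the first FSDirectory.getDirectory(...) in the JVM.
    public static void useNioDirectory() {
        System.setProperty(PROPERTY, NIO_IMPL);
    }

    public static void main(String[] args) {
        useNioDirectory();
        System.out.println(PROPERTY + "=" + System.getProperty(PROPERTY));
    }
}
```

The same effect should be obtainable without code changes by passing
-Dorg.apache.lucene.FSDirectory.class=org.apache.lucene.store.NIOFSDirectory
on the java command line.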



Re: IndexSearcher and multi-threaded performance

2008-11-12 Thread Dmitri Bichko
Nice!

At 8 threads nio-shared catches up with ram-shared.  Here's the complete table:

threads  fs-thread  nio-thread  ram-thread  fs-shared  nio-shared  ram-shared
 1           71877       70461       54739      73986       72155       61595
 2           34949       34945       26735      43719       33019       28935
 3           25581       28732       26885      38412       23383       19624
 4           20511       21235       31742      38712       18000       15059
 5           19235       21060       24345      39685       14636       12509
 6           16775       17685       26896      39592       12649       10841
 7           17147       18766       18296      46678       11201       10183
 8           18327       17588       19043      39886       10439       10048
 9           16885       16483       18721      40342        9455        9483
10           17832       17428       30757      44706        8947       10975
11           17251       16405       21199      39947        8597        9704
12           17267       17967       36284      40208        8462       10996

And it behaves very well with more threads:

nio-shared
1   71066
2   33206
3   22824
4   18168
5   15198
6   13086
7   11616
8   10698
9   9919
10  9657
11  9409
12  8977
13  9210
14  8757
15  9282
16  9260
17  9010
18  8230
19  8439
20  8486
21  8631
22  8417
23  8154
24  8685
25  7878
26  8398
27  8265
28  8266
29  7951
30  8606
31  8385
32  8630
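
To read the table above in terms of scaling, speedup is the single-thread
time divided by the n-thread time, and efficiency is speedup divided by n.
A small sketch, with the class and method names being illustrative only and
the timings in main copied from the nio-shared column above:

```java
class Speedup {
    // How many times faster the n-thread run is than the 1-thread run.
    public static double speedup(double singleThreadMs, double nThreadMs) {
        return singleThreadMs / nThreadMs;
    }

    // Speedup per thread; 1.0 would mean perfectly linear scaling.
    public static double efficiency(double singleThreadMs, double nThreadMs, int threads) {
        return speedup(singleThreadMs, nThreadMs) / threads;
    }

    public static void main(String[] args) {
        double base = 71066; // nio-shared, 1 thread
        int[] threads = {8, 16, 32};
        double[] times = {10698, 9260, 8630}; // nio-shared at 8, 16, 32 threads
        for (int i = 0; i < threads.length; i++) {
            System.out.printf("%2d threads: speedup %.2f, efficiency %.2f%n",
                    threads[i],
                    speedup(base, times[i]),
                    efficiency(base, times[i], threads[i]));
        }
    }
}
```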

That solves it for me, but I do see a fair amount of free time on this
machine - if there are other things you want to benchmark, I'd be
happy to do it.

Cheers,
Dmitri

On Tue, Nov 11, 2008 at 9:43 PM, Mark Miller <[EMAIL PROTECTED]> wrote:
> Mark Miller wrote:
>>
>> That's a good point, and points out a bug in Solr trunk for me. Frankly I
>> don't see how it's done. There is no code I can see/find to use it rather
>> than FSDirectory. Still assuming there must be a way, but I don't see it...
>>
> Ah - brain freeze. What else is new :) You have to set the system property
> to change implementations: org.apache.lucene.FSDirectory.class is the
> property, set it to the class. Been a long time...



Retrieving payloads for terms matched by a query

2009-05-21 Thread Dmitri Bichko
Hi,

I may be missing something obvious, but how do I get the payloads for
the specific token positions that were matched by a query?

For example, if I have a phrase query like "A keyword B" that matches
the field "A keyword B A", I can get the payloads for A and B with
IndexReader.termPositions(), but that doesn't tell me which of the two
positions of A matched the query.  I can kind of work back to it, but
it quickly becomes difficult for more complex queries.
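
To pin down what I mean, here is the bookkeeping in plain Java. This is not
the Lucene API, just an illustration: Token and matchedPayloads are
hypothetical names, and the payloads are strings for readability.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PhrasePayloads {
    // One indexed token: its term text and the payload stored at that position.
    static final class Token {
        final String term;
        final String payload;

        Token(String term, String payload) {
            this.term = term;
            this.payload = payload;
        }
    }

    // For every exact occurrence of the phrase, returns the payloads of the
    // positions that took part in that occurrence (and only those).
    public static List<List<String>> matchedPayloads(List<Token> field, List<String> phrase) {
        List<List<String>> matches = new ArrayList<>();
        if (phrase.isEmpty()) {
            return matches;
        }
        for (int start = 0; start + phrase.size() <= field.size(); start++) {
            boolean ok = true;
            for (int j = 0; j < phrase.size() && ok; j++) {
                ok = field.get(start + j).term.equals(phrase.get(j));
            }
            if (ok) {
                List<String> payloads = new ArrayList<>();
                for (int j = 0; j < phrase.size(); j++) {
                    payloads.add(field.get(start + j).payload);
                }
                matches.add(payloads);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // The field "A keyword B A": the two A tokens carry different payloads.
        List<Token> field = Arrays.asList(
                new Token("A", "a1"), new Token("keyword", "k1"),
                new Token("B", "b1"), new Token("A", "a2"));
        System.out.println(matchedPayloads(field, Arrays.asList("A", "keyword", "B")));
    }
}
```

Here only the first A (payload a1) belongs to the match; the trailing A
(payload a2) does not, which is exactly the distinction I can't get from
termPositions() alone.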

Here's what I'm trying to accomplish: I have several entity classes,
I'd want to create an index using the class names as tokens, and
storing the specific entity ids in the payloads.  That way, I should
be able to run queries in terms of the classes (ie '"CLASS_A
CLASS_B"~10 NOT CLASS_C', etc) and then report the actual entities for
the matching documents. (I realize I'll need custom tokenizers/filters
to identify and tag the entities and handle class references in
queries, but that part seems pretty straightforward).

Does this sound workable?

Thanks,
Dmitri




Hardware recommendation

2005-09-09 Thread Dmitri Bichko
Hi,

I'm putting together a cheap indexing server for an "explorative" lucene
project and had a few questions about which route to go.

I am going with a Socket 939 platform - does it make sense to get the
dual core Athlon 64 X2, or is it better to stick with a faster clocked
"plain" Athlon 64?

Also, would Lucene benefit from running in 64 bit mode, or does it
prefer "compatibility" 32 bit?

I figure most indexing apps will be heavily IO bound, so I am stressing
that, while staying with commodity components, so:

WD SATA disks (250GB, 16MB cache, SATAII 3Gb/s)
starting out with 4 of these (plus system disks), on the onboard
controller (RAID0)

If need be I can add two disk cages, 5 disks each with two decent SATA
RAID controllers (64/128MB cache, NCQ, that sort of thing); the nForce4
PCI-Express should stand up to this, I'm hoping.

And of course I am limited to 4GB RAM.

I have three main applications in mind:

Indexing PubMed/Medline article abstracts: this would be an index of
about 15 million records with a couple of identifier fields, a title and
a 1-3 paragraph abstract.  Mostly the searches will be keyword searches
on the text fields.  Potentially I could add full-length papers to this
as well (a lot fewer records though).

Second one is indexing a couple hundred thousand MS Office documents and
PDF files (Google Appliance sort of thing).

And finally a genetic database repository a la LuceGene, or SRS.  This
would have more complex records (ie many fields, but little data with
each), which are mostly retrieved on unique identifiers (very little
text searching).  This would probably run to a few tens of millions of
records, maybe around 100 million eventually.

Given these applications, what else should I be thinking about,
hardware-wise?

Thanks,
Dmitri



Sentence classification with Lucene

2025-02-17 Thread Dmitri Geller
Hi all, I would like to classify a sentence into one or two categories. 
I see this classification roughly this way:


```
unknown:
   example1
   example2
   ...
   exampleN

class1:
   example1
   example2
   ...
   exampleN

class2:
   example1
   example2
   ...
   exampleN

...

classN:
   example1
   example2
   ...
   exampleN

...
```

There are about 25-30 classes.
About 10-30 examples per class.
One sentence can get one or two classes assigned.

As far as I understand, this can be done with Lucene Core and should be
fairly standard functionality.

Can you point me to a Java example for this?

Thanks in advance!
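
To show the behaviour I have in mind, here is a rough plain-Java sketch. It
does not use Lucene at all: OverlapClassifier, the Jaccard word-overlap
score and the minScore threshold are assumptions made for the illustration,
not an existing classifier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class OverlapClassifier {
    // Lower-cased, whitespace-separated words of a sentence.
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Jaccard similarity: shared words divided by all distinct words.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Scores each class by its best-matching example and returns the one or
    // two best classes that reach minScore, or "unknown" if none does.
    public static List<String> classify(Map<String, List<String>> examples,
                                        String sentence, double minScore) {
        Set<String> query = tokens(sentence);
        Map<String, Double> best = new HashMap<>();
        for (Map.Entry<String, List<String>> e : examples.entrySet()) {
            double top = 0.0;
            for (String example : e.getValue()) {
                top = Math.max(top, jaccard(query, tokens(example)));
            }
            best.put(e.getKey(), top);
        }
        List<String> ranked = new ArrayList<>(best.keySet());
        ranked.sort((x, y) -> Double.compare(best.get(y), best.get(x)));
        List<String> out = new ArrayList<>();
        for (String c : ranked) {
            if (out.size() < 2 && best.get(c) >= minScore) {
                out.add(c);
            }
        }
        if (out.isEmpty()) {
            out.add("unknown");
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> examples = new HashMap<>();
        examples.put("crypto", Arrays.asList("bitcoin price rises", "blockchain news"));
        examples.put("sports", Arrays.asList("football match tonight"));
        System.out.println(classify(examples, "bitcoin football match", 0.1));
    }
}
```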





RE: Re: Sentence classification with Lucene

2025-02-19 Thread Dmitri Geller

Yes, something like lucene-classification [1].
But there are multiple classifiers in this package.
Which one is better suited? (Imagine I collect more samples per
class, about 30-40 samples per class.)

Any good Java examples using these classifiers?


Another question:
in case I want my classification to work "semantically":

For example:
For the class "crypto" I can have these samples:
- crypto
- bitcoin
- stablecoin
- blockchain

and in case the input text contains "Eterium" - what happens in this 
case, will it match "crypto" ?


I mean, as far as I understand, the models in lucene-classification do not
have knowledge about semantic similarity between words, right?
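
My concern, sketched in plain Java under the assumption that matching is
purely by exact terms (no synonyms, stemming or vectors): LexicalMatchDemo
and sharesTerm are made-up names, and this models only term overlap, not
what any particular Lucene classifier does internally.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LexicalMatchDemo {
    // True if at least one word of the input is literally one of the samples.
    public static boolean sharesTerm(String input, List<String> samples) {
        Set<String> vocabulary = new HashSet<>();
        for (String s : samples) {
            vocabulary.add(s.toLowerCase());
        }
        for (String word : input.toLowerCase().split("\\s+")) {
            if (vocabulary.contains(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> crypto = Arrays.asList("crypto", "bitcoin", "stablecoin", "blockchain");
        System.out.println(sharesTerm("bitcoin rally", crypto));
        System.out.println(sharesTerm("Eterium", crypto));
    }
}
```

With exact terms only, "Eterium" shares nothing with the crypto samples, so
a term-based model would have no signal to place it in that class.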





On 2025/02/19 13:45:52 Tommaso Teofili wrote:
> Hi,
>
> if you have 30 classes with 10 samples per class, I'd say that's not an
> optimal distribution.
> Apart from that, you may use one of the text classifiers from
> lucene-classification [1], is anything like this what you had in mind?
> Alternatively you can also do things outside of Lucene and use Lucene only,
> for example, to store vectors and find nearest neighbors.
>
> Regards,
> Tommaso
>
> [1] :
> https://lucene.apache.org/core/10_1_0/classification/org/apache/lucene/classification/package-summary.html
>