IndexSearcher on multi-core CPU machine
We have a search-only (no updates) web app running on a machine with two dual-core CPUs (2x Dual Core AMD Opteron(tm) Processor 280) and 8G of RAM. Lucene 2.0 is used.
My index is optimized and non-compound: 9G, holding 6.5M documents.
Searches include term queries, range filters and sorts.

When I use a single IndexSearcher and search with multiple threads, the CPUs are partially idle.

To get 100% CPU utilization I have to create several IndexSearchers.

With org.apache.lucene.store.MMapDirectory throughput is better, but I still have to create multiple IndexSearcher instances to get 100% CPU utilization.

With multiple IndexSearchers, search times are better under multithreaded load.
The following are average search times (in ms) for different numbers of parallel threads and IndexSearchers:

concurrent
threads      1 searcher   5 searchers   10 searchers
    1           180           177           167
    2           201           184           174
    4           241           197           188
    5           339           236           220
   10           663           454           420
   20          1172           917           880
   50          2599          2143          1912
  100          4887          4056          3775

Maximum search times with fewer searchers differ by a factor of 2-3.

Search is CPU bound (no IO wait is observed).
Is there any way to better utilize the server other than creating several IndexSearchers?
I need to squeeze as much performance as possible out of the machine, as we have strict performance requirements.

--
View this message in context: http://www.nabble.com/IndexSearcher-on-multi-core-CPU-machine-tf3249889.html#a9034207
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
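The measurements in the message above come from running N concurrent threads against a pool of K searcher instances. A minimal sketch of such a harness follows, using only the JDK. `Searcher` is a hypothetical stand-in interface, not a Lucene class; a real test would presumably wrap a call to an IndexSearcher inside it. The round-robin assignment of searchers to requests is also an assumption, since the message does not say how requests were distributed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for one searcher instance; not a Lucene type. */
interface Searcher {
    int search(String query) throws Exception;
}

final class SearchBenchmark {
    /**
     * Runs `threads` workers, each issuing `queriesPerThread` searches against
     * a round-robin pool of searchers, and returns the mean latency in ms.
     */
    static double averageLatencyMs(List<Searcher> pool, int threads,
                                   int queriesPerThread, String query) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        AtomicInteger next = new AtomicInteger();
        List<Future<Long>> results = new ArrayList<>();
        try {
            for (int t = 0; t < threads; t++) {
                results.add(executor.submit(() -> {
                    long elapsedNanos = 0;
                    for (int i = 0; i < queriesPerThread; i++) {
                        // Pick the next searcher in the pool, wrapping around.
                        Searcher s = pool.get(Math.floorMod(next.getAndIncrement(), pool.size()));
                        long start = System.nanoTime();
                        s.search(query);
                        elapsedNanos += System.nanoTime() - start;
                    }
                    return elapsedNanos;
                }));
            }
            long totalNanos = 0;
            for (Future<Long> f : results) {
                totalNanos += f.get();
            }
            // Mean time per search, converted from nanoseconds to milliseconds.
            return totalNanos / 1_000_000.0 / ((long) threads * queriesPerThread);
        } finally {
            executor.shutdown();
        }
    }
}
```

Passing a one-element pool corresponds to the "1 searcher" column, a five-element pool to "5 searchers", and so on.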
Re: IndexSearcher on multi-core CPU machine
I use

    searcher = new IndexSearcher(indexLocation);

so the readers are created under the hood.

dmitri

karl wettin-3 wrote:
> On 18 Feb 2007, at 22:52, dmitri wrote:
>
>> With org.apache.lucene.store.MMapDirectory throughput is better but
>> I still have to create multiple IndexSearcher instances to have 100%
>> CPU utilization.
>>
>> With multiple IndexSearchers search times are better under
>> multithreaded load.
>>
>> concurrent
>> threads      1 searcher   5 searchers   10 searchers
>>     1           180           177           167
>>   100          4887          4056          3775
>
> Are you using a single or multiple IndexReaders?
>
> --
> karl

--
View this message in context: http://www.nabble.com/IndexSearcher-on-multi-core-CPU-machine-tf3249889.html#a9043584
Re: IndexSearcher on multi-core CPU machine
I don't think so, as sorting is on integer fields.

dmitri

Paul Smith-2 wrote:
> Are you using Locale-sensitive sorting at all?
>
> https://issues.apache.org/jira/browse/LUCENE-806
>
> Just wondering if you're seeing the same problem we are having.
>
> cheers,
>
> Paul Smith

--
View this message in context: http://www.nabble.com/IndexSearcher-on-multi-core-CPU-machine-tf3249889.html#a9044054
Re: IndexSearcher on multi-core CPU machine
I haven't tried using several IndexSearchers over a single IndexReader. Do you think it can help?

dmitri

karl wettin-3 wrote:
> What are the effects if you supply the same reader to the IndexSearchers?

--
View this message in context: http://www.nabble.com/IndexSearcher-on-multi-core-CPU-machine-tf3249889.html#a9044066
Scoring while sorting
What is the point of calculating the score if the result set is going to be sorted by some field?

Is it OK to replace a query on several terms (a OR b OR c) with a MatchAllDocsQuery and RangeFilters (from a to a, from b to b, from c to c) if sorting is needed? Won't it be faster?

dmitri

--
View this message in context: http://www.nabble.com/Scoring-while-sorting-tf3270213.html#a9092111
ParallelSearcher in multi-node environment
Hi,

I want to execute a parallel search over several machines, but ParallelSearcher doesn't look ideal for this. For a single search it creates new threads and spawns many requests to the underlying Searchables (over the network).

Is there a decent implementation of parallel search over remote indexes somewhere?

--
Thank you
Dmitri

--
View this message in context: http://www.nabble.com/ParallelSearcher-in-multi-node-environment-tf3301080.html#a9182802
Re: [ANN] ParallelSearcher in multi-node environment
For example, I've changed the original ParallelSearcher to use a thread pool (java.util.concurrent.ThreadPoolExecutor from JDK 1.5). But a multi-host installation still requires a lot of changes, since ParallelSearcher calls the underlying Searchables too many times (e.g. a separate network call for every document).

Dmitri

--
View this message in context: http://www.nabble.com/ParallelSearcher-in-multi-node-environment-tf3301080.html#a9245525
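One way to picture the alternative the message is asking for: each node is contacted once per search and returns its own complete top-k list, and the caller merges those lists. The sketch below uses only the JDK. `SearchNode` and `Hit` are hypothetical names introduced for illustration and are not Lucene types; how a real node would serve the request over the network is left open.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

/** Hypothetical result record: which node, which document, what score. */
final class Hit {
    final String node;
    final int doc;
    final float score;

    Hit(String node, int doc, float score) {
        this.node = node;
        this.doc = doc;
        this.score = score;
    }
}

/** Hypothetical remote node; a single call returns that node's whole top-k. */
interface SearchNode {
    List<Hit> topK(String query, int k) throws Exception;
}

final class FanOutSearch {
    /** Sends one batched request per node through a shared pool and merges the results. */
    static List<Hit> search(List<SearchNode> nodes, ExecutorService pool,
                            String query, int k) throws Exception {
        List<Callable<List<Hit>>> calls = new ArrayList<>();
        for (SearchNode node : nodes) {
            calls.add(() -> node.topK(query, k));
        }
        List<Hit> merged = new ArrayList<>();
        // invokeAll blocks until every node has answered (or failed).
        for (Future<List<Hit>> f : pool.invokeAll(calls)) {
            merged.addAll(f.get());
        }
        // Highest score first, then keep only the global top-k.
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(merged.subList(0, Math.min(k, merged.size())));
    }
}
```

The pool is passed in rather than created per search, which matches the thread-pool change described above; the number of network round trips is one per node regardless of how many documents match.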
IndexSearcher and multi-threaded performance
Hi,

I'm pretty new to Lucene, so please bear with me if this has been covered before.

The wiki suggests sharing a single IndexSearcher between threads for best performance (http://wiki.apache.org/lucene-java/ImproveSearchingSpeed). I've tested running the same set of queries with: multiple threads sharing the same searcher, with a separate searcher for each thread, both shared/private with a RAMDirectory in-memory index, and (just for fun) in multiple JVMs running concurrently (the results are in milliseconds to complete the whole job):

threads  multi-jvm  shared  per-thread  ram-shared  ram-thread
   1       72997    70883     72573       60308       60012
   2       33147    48762     35973       25498       25734
   4       16229    46828     21267       13127       27164
   6       13088    47240     14028        9858       29917
   8        9775    47020     10983        8948       10440
  10        8721    50132     11334        9587       11355
  12        7290    49002     11798        9832
  16        9365    47099     12338       11296

The shared searcher indeed behaves better with a ram-based index, but what's going on with the disk-based one? It's basically not scaling beyond two threads. Am I just doing something completely wrong here?

The test consists of about 1,500 Boolean OR queries with 1-10 PhraseQueries each, with 1-20 Terms per PhraseQuery. I'm using a HitCollector to count the hits, so I'm not retrieving any results. The index is about 5GB and 20 million documents.

This is running on an 8 x quad-core Opteron machine with plenty of RAM to spare.

Any idea why I would see this behaviour?

Thanks,
Dmitri
Re: IndexSearcher and multi-threaded performance
I re-ran the no-readonly ram tests:

threads  ram-thread  ram-shared
   1       64043       53610
   2       26999       25260
   3       27173       17265
   4       22205       13222
   5       20795       11098
   6       17593        9852
   7       17163        8987
   8       17275        9052
   9       19392       10266
  10       27809       10397
  11       25987       10724
  12       26550       10832

The pattern is the same, but the difference at 4 and 6 is less pronounced - it was probably just a hiccup (I'm not using terribly sophisticated test methodology here). It's also possible I didn't give the JVM enough RAM (this run was with 16GB, just to be on the safe side).

Still, it looks like the extra resource management overhead for ram-thread outweighs whatever lock contention ram-shared introduces.

I'm rerunning everything with readonly set and nio; I'll post the results once it's done.

Cheers,
Dmitri

On Tue, Nov 11, 2008 at 5:40 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
>
> Nice results, thanks!
>
> The poor disk-based scaling may be fixed by NIOFSDirectory, if you are on
> Unix. If you are on Windows it won't help (and will likely be worse than
> FSDirectory), because of an apparent bug in Sun's JVM on Windows whereby
> NIO positional file reads seem to share a lock under the hood.
>
> The poor ram-thread result for 4 & 6 threads is odd. Those numbers ought
> to be at least as good as ram-shared. Is it possible those columns are
> swapped? Because the ram-shared case should have been hurt by using a
> non-read-only IndexReader.
>
> Mike
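The "NIO positional file reads" mentioned in the quoted reply can be illustrated with the JDK alone. The sketch below is not Lucene code; it only shows the underlying mechanism: `FileChannel.read(ByteBuffer, long)` takes an explicit offset and leaves the channel's own position untouched, so several threads can read through one open channel without coordinating a shared file pointer. The class and method names are made up for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PositionalReadDemo {
    /** Reads one byte at `offset` without moving the channel's own position. */
    static byte readAt(FileChannel channel, long offset) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(1);
        while (buf.hasRemaining()) {
            if (channel.read(buf, offset + buf.position()) < 0) {
                throw new IOException("offset past end of file: " + offset);
            }
        }
        return buf.get(0);
    }

    /** Has `threads` workers read every byte of `file` through one shared channel. */
    static long sumConcurrently(Path file, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            List<Future<Long>> parts = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int worker = t;
                parts.add(pool.submit(() -> {
                    long sum = 0;
                    // Worker w reads offsets w, w + threads, w + 2 * threads, ...
                    for (long pos = worker; pos < size; pos += threads) {
                        sum += readAt(channel, pos) & 0xFF;
                    }
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

By contrast, a seek followed by a read on a shared file handle has to be serialized, because the two steps together depend on the handle's single position. That difference is one plausible reading of why the shared disk-based searcher stops scaling in the tables above.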
Re: IndexSearcher and multi-threaded performance
32 cores, actually :)

I reran the test with readonly turned on (I changed how the time is measured a little; it should be more consistent):

threads  fs-thread  ram-thread  fs-shared  ram-shared
   1       71877      54739       73986      61595
   2       34949      26735       43719      28935
   3       25581      26885       38412      19624
   4       20511      31742       38712      15059
   5       19235      24345       39685      12509
   6       16775      26896       39592      10841
   7       17147      18296       46678      10183
   8       18327      19043       39886      10048
   9       16885      18721       40342       9483
  10       17832      30757       44706      10975
  11       17251      21199       39947       9704
  12       17267      36284       40208      10996

I can't seem to get NIOFSDirectory working, though. Calling NIOFSDirectory.getDirectory("foo") just returns an FSDirectory.

Any ideas?

Cheers,
Dmitri

On Tue, Nov 11, 2008 at 5:09 PM, Mark Miller <[EMAIL PROTECTED]> wrote:
> Nice! An 8 core machine with a test ready to go!
>
> How about trying the read only mode that was added to 2.4 on your
> IndexReader?
>
> And if you are on unix and could try trunk and use the new
> NIOFSDirectory implementation...that would be awesome.
>
> Those two additions are our current hope for what you're seeing...would be
> nice to know if we need to try for more (or if we need to petition the smart
> people that work on that stuff to try for more ;) ).
>
> - Mark
Re: IndexSearcher and multi-threaded performance
From the user perspective: a public constructor would be the most obvious, and would be consistent with RAMDirectory.

Dmitri

On Wed, Nov 12, 2008 at 4:50 AM, Michael McCandless <[EMAIL PROTECTED]> wrote:
>
> I think we really should open up a non-static way to choose a different
> FSDirectory impl? E.g. maybe add an optional Class to FSDirectory.getDirectory?
> Or maybe give NIOFSDirectory a public ctor? Or something?
>
> Mike
>
> Mark Miller wrote:
>
>> Mark Miller wrote:
>>>
>>> That's a good point, and it points out a bug in Solr trunk for me. Frankly I
>>> don't see how it's done. There is no code I can see/find to use it rather
>>> than FSDirectory. Still assuming there must be a way, but I don't see it...
>>>
>> Ah - brain freeze. What else is new :) You have to set the system property
>> to change implementations: org.apache.lucene.FSDirectory.class is the
>> property, set it to the class. Been a long time...
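In code form, the tip quoted above amounts to setting one system property. The property name is taken from the message; the fully qualified class name `org.apache.lucene.store.NIOFSDirectory` is an assumption about where the class lives, and `DirectoryImplConfig` is a made-up helper. The property presumably has to be set before Lucene first opens a directory, so treat the timing as something to verify against the version in use.

```java
final class DirectoryImplConfig {
    /** Property name as quoted in the message above. */
    static final String PROPERTY = "org.apache.lucene.FSDirectory.class";

    /** Assumed fully qualified name of the NIO-based implementation. */
    static final String NIO_IMPL = "org.apache.lucene.store.NIOFSDirectory";

    /** Selects the NIO implementation; returns the previous value, or null if unset. */
    static String useNio() {
        return System.setProperty(PROPERTY, NIO_IMPL);
    }
}
```

The same setting can be passed on the command line as `-Dorg.apache.lucene.FSDirectory.class=org.apache.lucene.store.NIOFSDirectory`, which avoids any question about ordering inside the application.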
Re: IndexSearcher and multi-threaded performance
Nice! At 8 threads nio-shared catches up with ram-shared. Here's the complete table:

threads  fs-thread  nio-thread  ram-thread  fs-shared  nio-shared  ram-shared
   1       71877      70461       54739       73986      72155       61595
   2       34949      34945       26735       43719      33019       28935
   3       25581      28732       26885       38412      23383       19624
   4       20511      21235       31742       38712      18000       15059
   5       19235      21060       24345       39685      14636       12509
   6       16775      17685       26896       39592      12649       10841
   7       17147      18766       18296       46678      11201       10183
   8       18327      17588       19043       39886      10439       10048
   9       16885      16483       18721       40342       9455        9483
  10       17832      17428       30757       44706       8947       10975
  11       17251      16405       21199       39947       8597        9704
  12       17267      17967       36284       40208       8462       10996

And it behaves very well with more threads:

threads  nio-shared
   1       71066
   2       33206
   3       22824
   4       18168
   5       15198
   6       13086
   7       11616
   8       10698
   9        9919
  10        9657
  11        9409
  12        8977
  13        9210
  14        8757
  15        9282
  16        9260
  17        9010
  18        8230
  19        8439
  20        8486
  21        8631
  22        8417
  23        8154
  24        8685
  25        7878
  26        8398
  27        8265
  28        8266
  29        7951
  30        8606
  31        8385
  32        8630

That solves it for me, but I do see a fair amount of free time on this machine - if there are other things you want to benchmark, I'd be happy to do it.

Cheers,
Dmitri
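To put "behaves very well" into numbers, the timings can be turned into speedup (single-thread time divided by n-thread time) and parallel efficiency (speedup divided by n). The sketch below does that arithmetic for a few rows of the nio-shared column; the class name `Scaling` is made up for the example.

```java
final class Scaling {
    /** Speedup of an n-thread run relative to the single-thread run. */
    static double speedup(long singleThreadMs, long nThreadMs) {
        return (double) singleThreadMs / nThreadMs;
    }

    /** Fraction of ideal linear scaling achieved with `threads` threads. */
    static double efficiency(long singleThreadMs, long nThreadMs, int threads) {
        return speedup(singleThreadMs, nThreadMs) / threads;
    }

    public static void main(String[] args) {
        // nio-shared timings (ms) copied from the table above.
        int[] threads = {1, 2, 4, 8, 16, 32};
        long[] millis = {71066, 33206, 18168, 10698, 9260, 8630};
        for (int i = 0; i < threads.length; i++) {
            System.out.printf("%2d threads: speedup %.2f, efficiency %.2f%n",
                    threads[i],
                    speedup(millis[0], millis[i]),
                    efficiency(millis[0], millis[i], threads[i]));
        }
    }
}
```

On these figures the speedup is roughly 6.6x at 8 threads and roughly 8.2x at 32, so most of the gain arrives by about 8 threads and the curve is nearly flat after that.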
Retrieving payloads for terms matched by a query
Hi,

I may be missing something obvious, but how do I get the payloads for the specific token positions that were matched by a query?

For example, if I have a phrase query like "A keyword B" that matches the field "A keyword B A", I can get the payloads for A and B with IndexReader.termPositions(), but that doesn't tell me which of the two positions of A matched the query. I can kind of work back to it, but it quickly becomes difficult for more complex queries.

Here's what I'm trying to accomplish: I have several entity classes, and I want to create an index using the class names as tokens, storing the specific entity ids in the payloads. That way, I should be able to run queries in terms of the classes (i.e. '"CLASS_A CLASS_B"~10 NOT CLASS_C', etc.) and then report the actual entities for the matching documents.

(I realize I'll need custom tokenizers/filters to identify and tag the entities and handle class references in queries, but that part seems pretty straightforward.)

Does this sound workable?

Thanks,
Dmitri
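The example in the message can be made concrete with a toy model that uses no Lucene classes at all. `Token` and `PhrasePayloads` are hypothetical names; the sketch only shows the information being asked for, namely the payloads at the positions an exact phrase actually matched. In Lucene itself, span queries such as SpanNearQuery are the feature that reports match positions; whether they cover this payload use case is not something the sketch settles.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical token: a term plus the payload stored at its position. */
final class Token {
    final String term;
    final String payload;

    Token(String term, String payload) {
        this.term = term;
        this.payload = payload;
    }
}

final class PhrasePayloads {
    /** Returns the payloads at every position covered by an exact phrase match. */
    static List<String> matchedPayloads(List<Token> field, List<String> phrase) {
        List<String> payloads = new ArrayList<>();
        if (phrase.isEmpty()) {
            return payloads;
        }
        for (int start = 0; start + phrase.size() <= field.size(); start++) {
            // Check whether the phrase occurs starting at this position.
            boolean match = true;
            for (int i = 0; i < phrase.size() && match; i++) {
                match = field.get(start + i).term.equals(phrase.get(i));
            }
            if (match) {
                for (int i = 0; i < phrase.size(); i++) {
                    payloads.add(field.get(start + i).payload);
                }
            }
        }
        return payloads;
    }
}
```

For the field "A keyword B A" and the phrase "A keyword B", only the payload of the first A is returned; the trailing A at the last position is not part of any match, which is exactly the distinction that a plain term-positions lookup loses.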
Hardware recommendation
Hi,

I'm putting together a cheap indexing server for an "explorative" Lucene project and have a few questions about which route to go.

I am going with a Socket 939 platform - does it make sense to get the dual core Athlon 64 X2, or is it better to stick with a faster clocked "plain" Athlon 64?

Also, would Lucene benefit from running in 64 bit mode, or does it prefer "compatibility" 32 bit?

I figure most indexing apps will be heavily IO bound, so I am stressing that, while staying with commodity components:

WD SATA disks (250GB, 16MB cache, SATAII 3Gb/s), starting out with 4 of these (plus system disks), on the onboard controller (RAID0).

If need be I can add two disk cages, 5 disks each, with two decent SATA RAID controllers (64/128MB cache, NCQ, that sort of thing); the nForce4 PCI-Express should stand up to this, I'm hoping.

And of course I am limited to 4GB RAM.

I have three main applications in mind:

Indexing PubMed/Medline article abstracts. This would be an index of about 15 million records with a couple of identifier fields, a title and a 1-3 paragraph abstract. Mostly the searches will be keyword searches on the text fields. Potentially I could add full-length papers to this as well (a lot fewer records though).

The second one is indexing a couple hundred thousand MS Office documents and PDF files (Google Appliance sort of thing).

And finally a genetic database repository a la LuceGene, or SRS. This would have more complex records (i.e. many fields, but little data in each), which are mostly retrieved on unique identifiers (very little text searching). This would probably run to a few tens of millions of records, maybe around 100 million eventually.

Given these applications, what else should I be thinking about, hardware-wise?

Thanks,
Dmitri
Sentence classification with Lucene
Hi all, I would like to classify a sentence into one or two categories.
I see this classification roughly this way:

```
unknown:
example1
example2
...
exampleN

class1:
example1
example2
...
exampleN

class2:
example1
example2
...
exampleN

...

classN:
example1
example2
...
exampleN

...
```

There are about 25-30 classes.
About 10-30 examples per class.
One sentence can get one or two classes assigned.

As far as I understand, this can be done with Lucene Core and should be quite standard functionality.
Can you point me to a Java example for this?

Thanks in advance!
RE: Re: Sentence classification with Lucene
Yes, something like lucene-classification [1].

But there are multiple classifiers in this package. Which one is better suited? (Imagine I collect more samples per class, about 30-40 samples per class.)
Are there any good Java examples using these classifiers?

Another question, in case I want my classification to work "semantically". For example, for the class "crypto" I can have these samples:
- crypto
- bitcoin
- stablecoin
- blockchain

If the input text contains "Ethereum", what happens? Will it match "crypto"?
I mean, as far as I understand, the models in lucene-classification do not have knowledge about semantic similarity between words, right?

On 2025/02/19 13:45:52 Tommaso Teofili wrote:
> Hi,
>
> if you have 30 classes with 10 samples per class, I'd say that's not an
> optimal distribution.
> Apart from that, you may use one of the text classifiers from
> lucene-classification [1], is anything like this what you had in mind?
> Alternatively you can also do things outside of Lucene and use Lucene only,
> for example, to store vectors and find nearest neighbors.
>
> Regards,
> Tommaso
>
> [1] :
> https://lucene.apache.org/core/10_1_0/classification/org/apache/lucene/classification/package-summary.html
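The question about semantic matching can be illustrated with a deliberately simple classifier built from the JDK only. This is not lucene-classification and makes no claim about how its classifiers work internally; `OverlapClassifier` is a hypothetical name, and Jaccard token overlap against the nearest example is an assumed stand-in for any purely lexical method.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class OverlapClassifier {
    private final Map<String, List<String>> examplesByClass = new LinkedHashMap<>();

    void addExamples(String label, String... examples) {
        examplesByClass.put(label, Arrays.asList(examples));
    }

    /** Lower-cases the text and splits it into distinct word tokens. */
    private static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    /** Jaccard similarity: shared tokens divided by all distinct tokens. */
    private static double jaccard(Set<String> a, Set<String> b) {
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        Set<String> all = new HashSet<>(a);
        all.addAll(b);
        return all.isEmpty() ? 0.0 : (double) shared.size() / all.size();
    }

    /** Returns the label of the most similar example, or "unknown" if nothing overlaps. */
    String classify(String sentence) {
        Set<String> input = tokens(sentence);
        String best = "unknown";
        double bestScore = 0.0;
        for (Map.Entry<String, List<String>> entry : examplesByClass.entrySet()) {
            for (String example : entry.getValue()) {
                double score = jaccard(input, tokens(example));
                if (score > bestScore) {
                    bestScore = score;
                    best = entry.getKey();
                }
            }
        }
        return best;
    }
}
```

With the four "crypto" samples from the message loaded, a sentence containing "bitcoin" is labelled "crypto", while a sentence that only mentions Ethereum comes back as "unknown", because no example shares a token with it. Closing that gap needs some notion of word similarity, which is where the suggestion quoted above about storing vectors and finding nearest neighbors comes in.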