Re: How to create a tokenizer in a way solr will recognize

Yaşar Arabacı Fri, 31 Jan 2025 09:17:14 -0800

Thanks a lot. That solved that problem.

I am now facing another problem:


----
Caused by: java.lang.IllegalArgumentException: resource tokenization/sentence-boundary-model.bin not found.
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:220) ~[?:?]
    at com.google.common.io.Resources.getResource(Resources.java:196) ~[?:?]
    at zemberek.tokenization.TurkishSentenceExtractor.fromDefaultModel(TurkishSentenceExtractor.java:51) ~[?:?]
    at zemberek.tokenization.TurkishSentenceExtractor.access$100(TurkishSentenceExtractor.java:28) ~[?:?]
    at zemberek.tokenization.TurkishSentenceExtractor$Singleton.<init>(TurkishSentenceExtractor.java:261) ~[?:?]
    at zemberek.tokenization.TurkishSentenceExtractor$Singleton.<clinit>(TurkishSentenceExtractor.java:256) ~[?:?]
    at zemberek.tokenization.TurkishSentenceExtractor.<clinit>(TurkishSentenceExtractor.java:33) ~[?:?]
    at com.github.yasar11732.lucene_zemberek.ZemberekTokenizerFactory.<clinit>(ZemberekTokenizerFactory.java:16) ~[?:?]
----

However, tokenization/sentence-boundary-model.bin is present inside
zemberek-tokenization-0.17.1.jar, which I also copied into the lib
directory.

Interestingly, a simple console application that uses the same
zemberek.tokenization.TurkishSentenceExtractor class (which in turn
loads tokenization/sentence-boundary-model.bin) works fine, so I
don't know why the resource is not found when the class is loaded
inside Solr.
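
One way to narrow this down is to compare which class loaders can
actually see the resource. The following is a minimal diagnostic
sketch, not a fix. It assumes the difference between the console
application and Solr lies in which class loader performs the lookup,
which is not confirmed here. As far as I can tell, Guava's
Resources.getResource(String) prefers the thread context class loader
and only falls back to Guava's own loader when none is set; treat that
as an assumption too. The class name ResourceVisibilityCheck and the
helper describe are hypothetical.

```java
import java.net.URL;

public class ResourceVisibilityCheck {

    /**
     * Returns a one-line report saying where the given loader finds the
     * resource, or that it cannot find it at all.
     */
    public static String describe(String label, ClassLoader loader, String resource) {
        if (loader == null) {
            // No loader to ask (for example, no context class loader is set).
            return label + ": <no loader>";
        }
        URL url = loader.getResource(resource);
        return label + ": " + (url == null ? "NOT FOUND" : url.toString());
    }

    public static void main(String[] args) {
        // Resource name taken from the stack trace; override via the first argument.
        String resource = args.length > 0
                ? args[0]
                : "tokenization/sentence-boundary-model.bin";

        // Loader that context-based lookups would use on this thread.
        System.out.println(describe("context class loader",
                Thread.currentThread().getContextClassLoader(), resource));

        // Loader that defined this class (in a plugin, the plugin's own loader).
        System.out.println(describe("defining class loader",
                ResourceVisibilityCheck.class.getClassLoader(), resource));
    }
}
```

Calling describe(...) with the same two loaders from inside the
factory's static initializer, before TurkishSentenceExtractor is
touched, would show whether the context class loader that Solr runs
the plugin under can see the jar in the lib directory.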

Best Regards,

On Fri, 31 Jan 2025 at 03:40, Chris Hostetter
<hossman_luc...@fucit.org> wrote:
>
>
> : However, I am getting "A SPI class of type
> : org.apache.lucene.analysis.TokenizerFactory with name
> : 'zemberekTokenizer' does not exist."
> :
> : I have defined NAME on my tokenizer factory as can be seen here:
> : https://github.com/yasar11732/lucene-zemberek/blob/master/src/main/java/com/github/yasar11732/lucene_zemberek/ZemberekTokenizerFactory.java#L14
> :
> : Is there any other step I should take before solr will recognize my
> : tokenizer?
>
> The key piece you are missing is Java-level "Service Provider"
> Interface registration of your class as an implementation of the
> TokenizerFactory "Service Interface" ...
>
> https://docs.oracle.com/javase/tutorial/sound/SPI-intro.html
>
> ...this is done using files under the META-INF/services/ path inside
> your jar. You can see an example of how Lucene registers some of its
> TokenizerFactories here...
>
> https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/resources/META-INF/services/org.apache.lucene.analysis.TokenizerFactory
>
> ...if you wanted to implement your own TokenFilter or CharFilter, those
> would need to go in their own corresponding "Interface"-based file name in
> your jar.
>
>
> Note: the "resources/" path is just where lucene keeps the source of that
> file in git, not a path that exists in the jar...
>
> $ jar tf lucene-analysis-common-9.11.1.jar | grep META-INF/services/org.apache.lucene
> META-INF/services/org.apache.lucene.analysis.CharFilterFactory
> META-INF/services/org.apache.lucene.analysis.TokenFilterFactory
> META-INF/services/org.apache.lucene.analysis.TokenizerFactory
>
>
> (I have very little experience with Maven, but I believe if you create a
> 'src/main/resources/META-INF/services/org.apache.lucene.analysis.TokenizerFactory'
> file in your repo, the Maven 'jar' plugin will do the right thing for you)
>
>
>
>
>
> -Hoss
> http://www.lucidworks.com/
