Hi Koorosh,
Lucene's analysis factories (tokenizers, token filters, char filters) are
discovered via Java SPI (see
https://docs.oracle.com/javase/tutorial/sound/SPI-intro.html). To
make your TokenFilter discoverable, you need to add the fully qualified
class name of your factory
(org.apache.lucene.analysis.autophrase.AutoPhrasingTokenFilterFactory)
on its own line to the file
resources/META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory
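
For context on the 'AutoPhrasing' name in the error below: as far as I
understand it, the SPI loader keys each registered factory by its simple
class name with the factory suffix stripped and the rest lower-cased, and
the test derives the matching name from the filter class. Here is a rough
sketch of that matching. The class and method names (SpiNameSketch,
factoryKey, filterKey) are made up for illustration and are not Lucene
API, and the suffix handling is an assumption based on the names listed
in the error output.

```java
import java.util.Locale;

public class SpiNameSketch {

    // Hypothetical helper: turn a factory's simple class name into a lookup key,
    // e.g. "AutoPhrasingTokenFilterFactory" -> "autophrasing".
    // The longer suffix is checked first so it is not mistaken for the shorter one.
    public static String factoryKey(String factorySimpleName) {
        String[] suffixes = {"TokenFilterFactory", "FilterFactory"};
        for (String suffix : suffixes) {
            if (factorySimpleName.endsWith(suffix)) {
                int end = factorySimpleName.length() - suffix.length();
                return factorySimpleName.substring(0, end).toLowerCase(Locale.ROOT);
            }
        }
        throw new IllegalArgumentException("Not a filter factory name: " + factorySimpleName);
    }

    // Same substring logic as the quoted test: strip "TokenFilter" (11 chars)
    // or "Filter" (6 chars), then lower-case so it can be compared to factoryKey.
    public static String filterKey(String filterSimpleName) {
        if (!filterSimpleName.endsWith("Filter")) {
            throw new IllegalArgumentException("Not a filter name: " + filterSimpleName);
        }
        int cut = filterSimpleName.endsWith("TokenFilter") ? 11 : 6;
        return filterSimpleName.substring(0, filterSimpleName.length() - cut)
                .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(factoryKey("AutoPhrasingTokenFilterFactory"));
        System.out.println(filterKey("AutoPhrasingTokenFilter"));
        System.out.println(factoryKey("ShingleFilterFactory"));
    }
}
```

If the two keys agree but the lookup still fails, the likely cause is
that the factory is missing from the services file rather than a naming
mismatch.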
Alan Woodward
www.flax.co.uk
On 16 Nov 2015, at 23:55, Koorosh Vakhshoori wrote:
> Hi all,
> I am in the process of creating a patch for Lucene. However, I can’t get
> the JUnit test TestAllAnalyzersHaveFactories to pass. I hope this is the
> right forum for help; if not, kindly direct me to the correct one.
> Any help is greatly appreciated!
>
> First, some background. The patch builds on Ted Sullivan's work,
> SOLR-7136. It is an enhanced version of AutoPhrase which I would like to
> submit to the community. The code includes a new TokenFilter,
> AutoPhrasingTokenFilter, with JUnit tests. I have created the following
> package:
>
> org.apache.lucene.analysis.autophrase
>
> This package contains the following class files:
>
> AutoPhraseDetector.java
> AutoPhrasingTokenFilter.java
> AutoPhrasingTokenFilterFactory.java
> package-info.java
>
> When running the tests under ant, TestAllAnalyzersHaveFactories fails
> with the following output (I have added some print statements for
> debugging):
> ============================================================
> -test:
> [junit4] <JUnit4> says ????! Master seed: 86F1C35C6CE11696
> [junit4] Your default console's encoding may not display certain
> unicode glyphs: US-ASCII
> [junit4] Executing 1 suite with 1 JVM.
> [junit4]
> [junit4] Started J0 PID(15156@localhost).
> [junit4] Suite:
> org.apache.lucene.analysis.core.TestAllAnalyzersHaveFactories
> [junit4] 1> clazzName: IndicNormalizationFilter
> [junit4] 1> simpleName: IndicNormalization
> [junit4] 1> clazzName: HyphenationCompoundWordTokenFilter
> [junit4] 1> simpleName: HyphenationCompoundWord
> [junit4] 1> clazzName: DictionaryCompoundWordTokenFilter
> [junit4] 1> simpleName: DictionaryCompoundWord
> [junit4] 1> clazzName: BulgarianStemFilter
> [junit4] 1> simpleName: BulgarianStem
> [junit4] 1> clazzName: ShingleFilter
> [junit4] 1> simpleName: Shingle
> [junit4] 1> clazzName: ReverseStringFilter
> [junit4] 1> simpleName: ReverseString
> [junit4] 1> clazzName: GreekLowerCaseFilter
> [junit4] 1> simpleName: GreekLowerCase
> [junit4] 1> clazzName: GreekStemFilter
> [junit4] 1> simpleName: GreekStem
> [junit4] 1> clazzName: HungarianLightStemFilter
> [junit4] 1> simpleName: HungarianLightStem
> [junit4] 1> clazzName: GermanNormalizationFilter
> [junit4] 1> simpleName: GermanNormalization
> [junit4] 1> clazzName: GermanLightStemFilter
> [junit4] 1> simpleName: GermanLightStem
> [junit4] 1> clazzName: GermanMinimalStemFilter
> [junit4] 1> simpleName: GermanMinimalStem
> [junit4] 1> clazzName: GermanStemFilter
> [junit4] 1> simpleName: GermanStem
> [junit4] 1> clazzName: EnglishPossessiveFilter
> [junit4] 1> simpleName: EnglishPossessive
> [junit4] 1> clazzName: EnglishMinimalStemFilter
> [junit4] 1> simpleName: EnglishMinimalStem
> [junit4] 1> clazzName: PorterStemFilter
> [junit4] 1> simpleName: PorterStem
> [junit4] 1> clazzName: KStemFilter
> [junit4] 1> simpleName: KStem
> [junit4] 1> clazzName: ItalianLightStemFilter
> [junit4] 1> simpleName: ItalianLightStem
> [junit4] 1> clazzName: HindiStemFilter
> [junit4] 1> simpleName: HindiStem
> [junit4] 1> clazzName: HindiNormalizationFilter
> [junit4] 1> simpleName: HindiNormalization
> [junit4] 1> clazzName: RussianLightStemFilter
> [junit4] 1> simpleName: RussianLightStem
> [junit4] 1> clazzName: ClassicFilter
> [junit4] 1> simpleName: Classic
> [junit4] 1> clazzName: StandardFilter
> [junit4] 1> simpleName: Standard
> [junit4] 1> clazzName: CzechStemFilter
> [junit4] 1> simpleName: CzechStem
> [junit4] 1> clazzName: ElisionFilter
> [junit4] 1> simpleName: Elision
> [junit4] 1> clazzName: DelimitedPayloadTokenFilter
> [junit4] 1> simpleName: DelimitedPayload
> [junit4] 1> clazzName: TokenOffsetPayloadTokenFilter
> [junit4] 1> simpleName: TokenOffsetPayload
> [junit4] 1> clazzName: NumericPayloadTokenFilter
> [junit4] 1> simpleName: NumericPayload
> [junit4] 1> clazzName: TypeAsPayloadTokenFilter
> [junit4] 1> simpleName: TypeAsPayload
> [junit4] 1> clazzName: AutoPhrasingTokenFilter
> [junit4] 1> simpleName: AutoPhrasing
> [junit4] 2> NOTE: reproduce with: ant test
> -Dtestcase=TestAllAnalyzersHaveFactories -Dtests.method=test
> -Dtests.seed=86F1C35C6CE11696 -Dtests.slow=true -Dtests.locale=zh_CN
> -Dtests.timezone=US/Samoa -Dtests.asserts=true
> -Dtests.file.encoding=UTF-8
> [junit4] ERROR 2.94s | TestAllAnalyzersHaveFactories.test <<<
> [junit4] > Throwable #1: java.lang.IllegalArgumentException: A
> SPI class of type org.apache.lucene.analysis.util.TokenFilterFactory
> with name 'AutoPhrasing' does not exist. You need to add the
> corresponding JAR file supporting this SPI to your classpath. The
> current classpath supports the following names: [apostrophe,
> arabicnormalization, arabicstem, bulgarianstem, brazilianstem,
> cjkbigram, cjkwidth, soraninormalization, soranistem, commongrams,
> commongramsquery, dictionarycompoundword, hyphenationcompoundword,
> decimaldigit, lowercase, stop, type, uppercase, czechstem,
> germanlightstem, germanminimalstem, germannormalization, germanstem,
> greeklowercase, greekstem, englishminimalstem, englishpossessive,
> kstem, porterstem, spanishlightstem, persiannormalization,
> finnishlightstem, frenchlightstem, frenchminimalstem, irishlowercase,
> galicianminimalstem, galicianstem, hindinormalization, hindistem,
> hungarianlightstem, hunspellstem, indonesianstem, indicnormalization,
> italianlightstem, latvianstem, asciifolding, capitalization,
> codepointcount, fingerprint, hyphenatedwords, keepword, keywordmarker,
> keywordrepeat, length, limittokencount, limittokenoffset,
> limittokenposition, removeduplicates, stemmeroverride, trim, truncate,
> worddelimiter, scandinavianfolding, scandinaviannormalization,
> edgengram, ngram, norwegianlightstem, norwegianminimalstem,
> patternreplace, patterncapturegroup, delimitedpayload, numericpayload,
> tokenoffsetpayload, typeaspayload, portugueselightstem,
> portugueseminimalstem, portuguesestem, reversestring,
> russianlightstem, shingle, snowballporter, serbiannormalization,
> classic, standard, swedishlightstem, synonym, turkishlowercase,
> elision]
> [junit4] > at
> __randomizedtesting.SeedInfo.seed([86F1C35C6CE11696:EA5FC86C21D7B6E]:0)
> [junit4] > at
> org.apache.lucene.analysis.util.AnalysisSPILoader.lookupClass(AnalysisSPILoader.java:135)
> [junit4] > at
> org.apache.lucene.analysis.util.TokenFilterFactory.lookupClass(TokenFilterFactory.java:42)
> [junit4] > at
> org.apache.lucene.analysis.core.TestAllAnalyzersHaveFactories.test(TestAllAnalyzersHaveFactories.java:168)
> [junit4] > at java.lang.Thread.run(Thread.java:745)
> [junit4] 2> NOTE: test params are: codec=CheapBastard,
> sim=ClassicSimilarity, locale=zh_CN, timezone=US/Samoa
> [junit4] 2> NOTE: Linux 2.6.32-358.el6.x86_64 amd64/Oracle
> Corporation 1.8.0_05
> (64-bit)/cpus=4,threads=1,free=136794808,total=160432128
> [junit4] 2> NOTE: All tests run in this JVM:
> [TestAllAnalyzersHaveFactories]
> [junit4] Completed [1/1] in 4.33s, 1 test, 1 error <<< FAILURES!
> [junit4]
> [junit4]
> [junit4] Tests with failures [seed: 86F1C35C6CE11696]:
> [junit4] -
> org.apache.lucene.analysis.core.TestAllAnalyzersHaveFactories.test
> [junit4]
> [junit4]
> [junit4] JVM J0: 0.66 .. 6.09 = 5.44s
> [junit4] Execution time total: 6.11 sec.
> [junit4] Tests summary: 1 suite, 1 test, 1 error
> ================================================
>
> Running the test under the debugger in Eclipse gives the same error
> message for a different factory class, 'DaitchMokotoffSoundex'. This
> may or may not be related to my issue; I am not sure.
>
> My guess is that there is some sort of class loader issue. My
> understanding of the test is that it makes sure there is a corresponding
> TokenFilterFactory for each TokenFilter. In this case that would be
> AutoPhrasingTokenFilterFactory. I checked that the class is actually
> built. The 'find' command shows the class at:
>
> build/analysis/common/classes/java/org/apache/lucene/analysis/autophrase/AutoPhrasingTokenFilterFactory.class
>
> The location is similar to other Filter factories.
>
> I have put in print statements and also run the test in the Eclipse
> debugger. As far as I can see, the test code sees
> AutoPhrasingTokenFilter. Looking at
> TestAllAnalyzersHaveFactories.java, at the line marked with '1>', the
> test code picks up the class AutoPhrasingTokenFilter. However, when it
> gets to the line marked with '2>', it fails:
>
> ===========================================
> public void test() throws Exception {
> 1> List<Class<?>> analysisClasses =
> TestRandomChains.getClassesForPackage("org.apache.lucene.analysis");
>
> ClassLoader cl = ClassLoader.getSystemClassLoader();
>
> URL[] urls = ((URLClassLoader)cl).getURLs();
> // System.out.println("ClassPath Start:");
> for(URL url: urls){
> // System.out.println(url.getFile());
> }
> // System.out.println("ClassPath Ends!");
>
> for (final Class<?> c : analysisClasses) {
> final int modifiers = c.getModifiers();
> if (
> // don't waste time with abstract classes
> Modifier.isAbstract(modifiers) || !Modifier.isPublic(modifiers)
> || c.isSynthetic() || c.isAnonymousClass() ||
> c.isMemberClass() || c.isInterface()
> || testComponents.contains(c)
> || crazyComponents.contains(c)
> || oddlyNamedComponents.contains(c)
> || c.isAnnotationPresent(Deprecated.class) // deprecated ones are typically back compat hacks
> || !(Tokenizer.class.isAssignableFrom(c) ||
> TokenFilter.class.isAssignableFrom(c) ||
> CharFilter.class.isAssignableFrom(c))
> ) {
> continue;
> }
>
> Map<String,String> args = new HashMap<>();
> args.put("luceneMatchVersion", Version.LATEST.toString());
>
> if (Tokenizer.class.isAssignableFrom(c)) {
> String clazzName = c.getSimpleName();
> assertTrue(clazzName.endsWith("Tokenizer"));
> String simpleName = clazzName.substring(0, clazzName.length() - 9);
> assertNotNull(TokenizerFactory.lookupClass(simpleName));
> TokenizerFactory instance = null;
> try {
> instance = TokenizerFactory.forName(simpleName, args);
> assertNotNull(instance);
> if (instance instanceof ResourceLoaderAware) {
> ((ResourceLoaderAware) instance).inform(loader);
> }
> assertSame(c, instance.create().getClass());
> } catch (IllegalArgumentException e) {
> if (e.getCause() instanceof NoSuchMethodException) {
> // there is no corresponding ctor available
> throw e;
> }
> // TODO: For now pass because some factories have not yet a default config that always works
> }
> } else if (TokenFilter.class.isAssignableFrom(c)) {
> String clazzName = c.getSimpleName();
> System.out.println("clazzName: " + clazzName);
> assertTrue(clazzName.endsWith("Filter"));
> String simpleName = clazzName.substring(0, clazzName.length()
> - (clazzName.endsWith("TokenFilter") ? 11 : 6));
> System.out.println("simpleName: " + simpleName);
> 2> assertNotNull(TokenFilterFactory.lookupClass(simpleName));
> =====================================================
>
> Here is the code for the factory class:
>
> package org.apache.lucene.analysis.autophrase;
>
> /*
> * Copyright 2015 Synopsys, Inc.
> *
> * Licensed under the Apache License, Version 2.0 (the "License"); you
> * may not use this file except in compliance with the License. You may
> * obtain a copy of the License at
> *
> * http://www.apache.org/licenses/LICENSE-2.0
> *
> * Unless required by applicable law or agreed to in writing, software
> * distributed under the License is distributed on an "AS IS" BASIS,
> * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> * See the License for the specific language governing permissions and
> * limitations under the License.
> */
>
> import java.io.IOException;
> import java.util.Map;
>
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.util.CharArraySet;
> import org.apache.lucene.analysis.util.ResourceLoader;
> import org.apache.lucene.analysis.util.ResourceLoaderAware;
> import org.apache.lucene.analysis.util.TokenFilterFactory;
>
> public class AutoPhrasingTokenFilterFactory extends TokenFilterFactory
> implements ResourceLoaderAware {
>
> private CharArraySet phraseSets;
> private final String phraseSetFiles;
> private final boolean ignoreCase;
> private final boolean emitSingleTokens;
> private final boolean quotePhrase;
> private final boolean emitAmbiguousPhrases;
>
> private String replaceWhitespaceWith = null;
>
> public AutoPhrasingTokenFilterFactory(Map<String, String> initArgs) {
> super( initArgs );
> phraseSetFiles = get(initArgs, "phrases");
> ignoreCase = getBoolean( initArgs, "ignoreCase", false);
> emitSingleTokens = getBoolean( initArgs, "includeTokens", false );
> quotePhrase = getBoolean( initArgs, "quotePhrase", false );
> emitAmbiguousPhrases = getBoolean( initArgs,
> "emitAmbiguousPhrases", false );
>
> String replaceWhitespaceArg = initArgs.get( "replaceWhitespaceWith" );
> if (replaceWhitespaceArg != null) {
> replaceWhitespaceWith = replaceWhitespaceArg;
> }
> }
>
> @Override
> public void inform(ResourceLoader loader) throws IOException {
> if (phraseSetFiles != null) {
> phraseSets = getWordSet(loader, phraseSetFiles, ignoreCase);
> }
> }
>
> @Override
> public TokenStream create( TokenStream input ) {
> AutoPhrasingTokenFilter autoPhraseFilter = new
> AutoPhrasingTokenFilter( input, phraseSets, emitSingleTokens );
> if (replaceWhitespaceWith != null) {
> autoPhraseFilter.setReplaceWhitespaceWith( new Character(
> replaceWhitespaceWith.charAt( 0 )) );
> }
> // It doesn't make sense to emit phrases in double quotes if the replaceWhitespaceWith character is set.
> if ((replaceWhitespaceWith == null) && quotePhrase) {
> autoPhraseFilter.setQuotePhrase(quotePhrase);
> }
> if (emitAmbiguousPhrases) {
> autoPhraseFilter.setEmitAmbiguousPhrases(emitAmbiguousPhrases);
> }
> return autoPhraseFilter;
> }
> }
>
> Thanks,
>
> Koorosh