The definition of CharArraySet is dangerously confusing and leads to bugs when 
used.
------------------------------------------------------------------------------------

                 Key: LUCENENET-414
                 URL: https://issues.apache.org/jira/browse/LUCENENET-414
             Project: Lucene.Net
          Issue Type: Bug
          Components: Lucene.Net Core
    Affects Versions: Lucene.Net 2.9.2
         Environment: Irrelevant
            Reporter: Vincent Van Den Berghe
            Priority: Minor
             Fix For: Lucene.Net 2.9.2


Right now, CharArraySet derives from System.Collections.Hashtable, but doesn't 
actually use this base type for storing elements.
However, the StandardAnalyzer.STOP_WORDS_SET is exposed as a 
System.Collections.Hashtable. The trivial code to build your own stopword set 
using the StandardAnalyzer.STOP_WORDS_SET and adding your own set of stopwords 
like this:

CharArraySet myStopWords = new CharArraySet(StandardAnalyzer.STOP_WORDS_SET,
    ignoreCase: false);
foreach (string domainSpecificStopWord in DomainSpecificStopWords)
    myStopWords.Add(domainSpecificStopWord);

... will fail silently. The CharArraySet constructor accepts an ICollection, so 
STOP_WORDS_SET is passed in as a plain Hashtable and is enumerated through the 
base Hashtable, which holds no elements. The resulting myStopWords will contain 
only the DomainSpecificStopWords, and none of those from STOP_WORDS_SET.

One workaround would be to replace the first line with this:

CharArraySet myStopWords = new CharArraySet(
    StandardAnalyzer.STOP_WORDS_SET.Count + DomainSpecificStopWords.Length,
    ignoreCase: false);
foreach (string standardStopWord in
         (CharArraySet)StandardAnalyzer.STOP_WORDS_SET)
    myStopWords.Add(standardStopWord);

... but this relies on an implementation detail (STOP_WORDS_SET is really an 
UnmodifiableCharArraySet, which is itself a CharArraySet). It works because the 
cast forces the foreach to use CharArraySet.GetEnumerator(), which is declared 
as a "new" method that hides the Hashtable one. That hiding is itself a code 
smell.
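
The hiding behaviour described above can be reproduced without Lucene.Net. The 
following is only a minimal sketch: HidingSet, AddItem and HidingDemo are 
hypothetical names standing in for CharArraySet and are not part of the 
library. It assumes nothing more than a class that derives from Hashtable, 
stores its elements elsewhere, and hides GetEnumerator() with "new".

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical stand-in for CharArraySet: derives from Hashtable but keeps
// its elements in a private list, never in the base Hashtable.
public class HidingSet : Hashtable
{
    private readonly List<string> items = new List<string>();

    public void AddItem(string s) { items.Add(s); }

    // "new" hides Hashtable.GetEnumerator() instead of overriding it, so this
    // method is only used when the static type is HidingSet.
    public new IEnumerator GetEnumerator() { return items.GetEnumerator(); }
}

public static class HidingDemo
{
    // Mimics a constructor that receives the set as a non-generic collection.
    public static int CountViaInterface(IEnumerable source)
    {
        int n = 0;
        foreach (object o in source) n++;
        return n;
    }

    // The static type HidingSet makes foreach bind to the hiding enumerator.
    public static int CountViaDerived(HidingSet source)
    {
        int n = 0;
        foreach (object o in source) n++;
        return n;
    }

    public static void Main()
    {
        HidingSet set = new HidingSet();
        set.AddItem("the");
        set.AddItem("and");

        Hashtable asBase = set; // how the set is exposed to callers
        Console.WriteLine(CountViaInterface(asBase));          // prints 0
        Console.WriteLine(CountViaDerived((HidingSet)asBase)); // prints 2
    }
}
```

The same object yields a different number of elements depending only on the 
static type it is enumerated through, which is why the cast in the workaround 
matters.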

At least two possibilities exist to solve this problem:
- Make CharArraySet store its elements in the base Hashtable with a custom 
equality comparer, instead of using its own implementation.
- Make CharArraySet use HashSet<char[]> (HashSet<T> is available since .NET 3.5).
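
As a rough illustration of the second option: arrays compare by reference in 
.NET, so a HashSet<char[]> would need a content-based comparer to behave like 
a word set. The sketch below is an assumption about how that could look, not 
the Lucene.Net implementation; CharArrayComparer and ComparerDemo are 
hypothetical names, and the case folding is deliberately simplistic.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer giving char[] value semantics inside a HashSet.
public sealed class CharArrayComparer : IEqualityComparer<char[]>
{
    private readonly bool ignoreCase;

    public CharArrayComparer(bool ignoreCase) { this.ignoreCase = ignoreCase; }

    // Simplistic case folding; a real implementation may need more care.
    private char Fold(char c)
    {
        return ignoreCase ? char.ToLowerInvariant(c) : c;
    }

    public bool Equals(char[] x, char[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null || x.Length != y.Length) return false;
        for (int i = 0; i < x.Length; i++)
        {
            if (Fold(x[i]) != Fold(y[i])) return false;
        }
        return true;
    }

    // Must agree with Equals: equal content gives an equal hash code.
    public int GetHashCode(char[] obj)
    {
        if (obj == null) return 0;
        int h = 17;
        unchecked
        {
            foreach (char c in obj) h = h * 31 + Fold(c);
        }
        return h;
    }
}

public static class ComparerDemo
{
    public static void Main()
    {
        HashSet<char[]> words = new HashSet<char[]>(new CharArrayComparer(false));
        words.Add("the".ToCharArray());
        words.Add("the".ToCharArray()); // same content, so not added again

        Console.WriteLine(words.Count);                         // prints 1
        Console.WriteLine(words.Contains("the".ToCharArray())); // prints True
    }
}
```

With this approach the set's only enumerator is the one HashSet<T> provides, so 
there is no hidden base-class enumerator to pick by mistake.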





