Hi Suneel,
Just for context, I've implemented the following:
@Override
protected void map(Text key, BehemothDocument value, Context context)
        throws IOException, InterruptedException {
    String sContent = value.getText();
    if (sContent == null) {
        // No text available? Skip this document.
        context.getCounter("LuceneTokenizer", "BehemothDocWithoutText")
                .increment(1);
        return;
    }
    // Could be any other Analyzer; ideally create it once (e.g. in setup())
    // and reuse it across map() calls rather than per document.
    Analyzer analyzer = new StandardAnalyzer(matchVersion);
    // The Analyzer will construct the Tokenizer, TokenFilter(s), and
    // CharFilter(s), and pass the resulting Reader to the Tokenizer.
    TokenStream ts = analyzer.tokenStream(key.toString(),
            new StringReader(sContent));
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    StringTuple document = new StringTuple();
    try {
        ts.reset(); // Resets this stream to the beginning. (Required)
        while (ts.incrementToken()) {
            if (termAtt.length() > 0) {
                document.add(new String(termAtt.buffer(), 0,
                        termAtt.length()));
            }
        }
        ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
    } finally {
        ts.close(); // Release resources associated with this stream.
    }
    context.write(key, document);
}
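For anyone following along, the consume loop above follows the standard Lucene 4.x TokenStream lifecycle contract: reset(), then incrementToken() until it returns false, then end(), with close() in a finally block. Here is a minimal, Lucene-free sketch of that state machine, just to illustrate why reset() is required before consuming; the WhitespaceStream class and all its names are made up for illustration and are not part of Lucene:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Lucene TokenStream: same lifecycle,
// no Lucene dependency. reset() must be called before the first
// incrementToken(); end() does end-of-stream bookkeeping; close()
// releases resources.
public class WhitespaceStream {
    private final String[] tokens;
    private int pos;
    private boolean ready;

    public WhitespaceStream(String text) {
        String trimmed = text.trim();
        this.tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public void reset() {               // analogous to TokenStream.reset()
        pos = 0;
        ready = true;
    }

    public boolean incrementToken() {   // analogous to TokenStream.incrementToken()
        if (!ready) {
            throw new IllegalStateException("reset() not called before incrementToken()");
        }
        if (pos >= tokens.length) {
            return false;
        }
        pos++;
        return true;
    }

    public String term() {              // analogous to reading CharTermAttribute
        return tokens[pos - 1];
    }

    public void end() { ready = false; } // end-of-stream operations
    public void close() { }              // release resources

    // Mirrors the consume loop in the map() method above.
    public static List<String> consume(WhitespaceStream ts) {
        List<String> out = new ArrayList<>();
        try {
            ts.reset();
            while (ts.incrementToken()) {
                if (!ts.term().isEmpty()) {
                    out.add(ts.term());
                }
            }
            ts.end();
        } finally {
            ts.close();
        }
        return out;
    }

    public static void main(String[] args) {
        // prints: [hello, behemoth, world]
        System.out.println(consume(new WhitespaceStream("hello  behemoth world")));
    }
}
```

The finally block mirrors the real code: close() must run even if incrementToken() throws, which is the same reason the map() above wraps the loop in try/finally.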
I'll be testing and will update if anything else comes up.
Thanks
Lewis
On Mon, May 11, 2015 at 2:12 PM, Lewis John Mcgibbney <
[email protected]> wrote:
> I found Mike's blog post regarding Lucene 4.X from a while ago [0].
> In the 'Other Changes' section Mike states "Analyzers must always
> provide a reusable token stream, by implementing the
> Analyzer.createComponents method (reusableTokenStream has been removed
> and tokenStream is now final, in Analyzer)."
> This provides a good bit more context, so I'm going to continue down the
> createComponents route with the aim of implementing the newer 4.x Lucene
> API.
> In the meantime, if you get any updates or have a code sample it would be
> very much appreciated.
> Thanks
> Lewis
>
> [0]
> http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html
>
> On Mon, May 11, 2015 at 2:03 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Hi Suneel,
>>
>> On Sat, May 9, 2015 at 11:21 AM, Suneel Marthi <[email protected]>
>> wrote:
>>
>>> Mahout 0.9 and 0.10.0 are using Lucene 4.6.1. There's been a change in
>>> the
>>> TokenStream workflow in Lucene post-Lucene 4.5.
>>>
>>
>> Yes I know that after looking into the codebase. Thanks for clarifying!
>>
>>
>>>
>>> What exactly are you trying to do and where are you stuck now? It would
>>> help if you posted a code snippet or something.
>>>
>>>
>> In particular I am working on the following implementation [0], which uses
>> the following code:
>>
>> TokenStream stream = analyzer.reusableTokenStream(key.toString(), new
>> StringReader(sContent.toString()));
>>
>> Of note here is that the analyzer object is instantiated as type
>> DefaultAnalyzer [1]. As you've noted, the analyzer.reusableTokenStream
>> API has since been removed, so I am wondering what the suggested
>> replacement API semantics are in order to achieve the desired upgrade.
>> Thanks in advance again for any input.
>> Lewis
>>
>> [0]
>> https://github.com/DigitalPebble/behemoth/blob/master/mahout/src/main/java/com/digitalpebble/behemoth/mahout/LuceneTokenizerMapper.java#L52-L53
>> [1]
>> http://svn.apache.org/repos/asf/mahout/tags/mahout-0.7/core/src/main/java/org/apache/mahout/vectorizer/DefaultAnalyzer.java
>>
>>
>>
>
>
>
> --
> Lewis
>
--
Lewis