The termBuffer is just a buffer of arbitrary length (the length is over-allocated with some additional chars, so that a new buffer does not need to be allocated whenever a new char is added; it works the same way as StringBuffer). termLength() contains the number of "valid" chars in the buffer. If you have an empty Token, termLength()==0 and it is irrelevant what the buffer contains and how long it is. If there are 5 chars in a buffer of size 10, termLength() returns 5, so you know that the first 5 chars are valid. The remaining 5 chars at the end of the buffer are just garbage.
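This contract can be illustrated with a small, self-contained sketch (plain Java, no Lucene dependency; the termBuffer/termLength variables below are stand-ins for Token.termBuffer()/Token.termLength(), not the real API):

```java
// Sketch of the termBuffer/termLength contract: the buffer is
// over-allocated, so only the first termLength chars are valid.
public class TermBufferDemo {
    public static void main(String[] args) {
        char[] termBuffer = new char[10];   // over-allocated, like Token's buffer
        String term = "hello";
        term.getChars(0, term.length(), termBuffer, 0);
        int termLength = term.length();     // only these 5 chars are valid

        // Wrong: reads the whole buffer, including 5 trailing garbage chars
        String wrong = new String(termBuffer);
        // Right: respects termLength
        String right = new String(termBuffer, 0, termLength);

        System.out.println(right);          // prints "hello"
        System.out.println(wrong.length()); // prints 10
    }
}
```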
Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: David Ginzburg [mailto:davidginzb...@gmail.com]
> Sent: Sunday, October 18, 2009 11:41 AM
> To: java-user@lucene.apache.org
> Subject: Re: localToken contains a termBuffer with 10 empty chars ('')
>
> Sorry, what does it mean to respect the termLength()?
>
> On Sun, Oct 18, 2009 at 11:37 AM, Uwe Schindler <u...@thetaphi.de> wrote:
>
> > You must also respect termLength(), which returns the number of "valid"
> > chars in the term buffer.
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> > > -----Original Message-----
> > > From: David Ginzburg [mailto:davidginzb...@gmail.com]
> > > Sent: Sunday, October 18, 2009 2:28 AM
> > > To: java-user@lucene.apache.org
> > > Subject: localToken contains a termBuffer with 10 empty chars ('')
> > >
> > > Hi,
> > > I have written my own weighted synonym filter and tried to integrate it
> > > inside an analyzer.
> > > The analyzer, as defined in the schema.xml, has the following field type:
> > >
> > > <fieldType name="Company_Name" class="solr.TextField"
> > >            positionIncrementGap="100">
> > >   <analyzer type="index">
> > >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >     <filter class="DTSynonymFactory"
> > >             FreskoFunction="SimilarityProbManual.txt"
> > >             ignoreCase="true" expand="false"/>
> > >     <!--<filter class="solr.EnglishPorterFilterFactory"
> > >             protected="protwords.txt"/>-->
> > >     <!--<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>-->
> > >   </analyzer>
> > >   <analyzer type="query">
> > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >             words="stopwords.txt"/>
> > >     <!--<filter class="solr.EnglishPorterFilterFactory"
> > >             protected="protwords.txt"/>-->
> > >     <!--<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>-->
> > >   </analyzer>
> > > </fieldType>
> > >
> > > The problem is that, in the Token next(Token reusableToken) method of
> > > DTSynonymFilter, I always get a token with a termBuffer containing 10
> > > empty chars.
> > >
> > > I have debugged and stepped into Solr code and found that in class
> > > DocInverterPerField, at
> > >
> > >     Token token = stream.next(localToken); // line 134
> > >
> > > localToken contains a termBuffer with 10 empty chars ('').
> > >
> > > What am I doing wrong???
> > > The Java code:
> > >
> > > import com.google.common.collect.ArrayListMultimap;
> > > import java.io.IOException;
> > > import java.util.LinkedList;
> > > import java.util.List;
> > > import org.apache.lucene.analysis.Token;
> > > import org.apache.lucene.analysis.TokenFilter;
> > > import org.apache.lucene.analysis.TokenStream;
> > > import org.apache.lucene.analysis.payloads.PayloadHelper;
> > > import org.apache.lucene.index.Payload;
> > >
> > > /**
> > >  * @author david
> > >  */
> > > public class DTSynonymFilter extends TokenFilter {
> > >
> > >     public DTSynonymFilter(TokenStream input,
> > >                            ArrayListMultimap<String, Synonym> syns) {
> > >         super(input);
> > >         this.synsMap = syns;
> > >         System.out.println("in DTSynonymFilter synsMap ");
> > >     }
> > >
> > >     public static final String SYNONYM = "<SYNONYM>";
> > >     TokenFilter tf;
> > >     private LinkedList<Token> synonymTokenQueue = new LinkedList<Token>();
> > >     private ArrayListMultimap<String, Synonym> synsMap = null;
> > >     private LinkedList<Token> buffer;
> > >
> > >     private Token nextTok(Token target) throws IOException {
> > >         if (buffer != null && !buffer.isEmpty()) {
> > >             return buffer.removeFirst();
> > >         } else {
> > >             return input.next(target);
> > >         }
> > >     }
> > >
> > >     private void pushTok(Token t) {
> > >         if (buffer == null) {
> > >             buffer = new LinkedList<Token>();
> > >         }
> > >         buffer.addFirst(t);
> > >     }
> > >
> > >     @Override
> > >     public Token next(Token reusableToken) throws IOException {
> > >         if (synonymTokenQueue.size() > 0) {
> > >             return synonymTokenQueue.removeFirst();
> > >         }
> > >         if (reusableToken == null) {
> > >             return null;
> > >         }
> > >
> > >         reusableToken.setPayload(new Payload(new byte[]{(byte) 1}));
> > >
> > >         // System.out.println("trying to get synonyms for " + reusableToken);
> > >         // System.out.println(synsMap.get(reusableToken.term()));
> > >         List<Synonym> syns = synsMap.get(reusableToken.term());
> > >         for (Synonym synonym : synsMap.get(reusableToken.term())) {
> > >             System.out.println(synonym);
> > >         }
> > >         Payload boostPayload;
> > >
> > >         for (Synonym synonym : syns) {
> > >             // Token(char[] startTermBuffer, int termBufferOffset,
> > >             //       int termBufferLength, int start, int end)
> > >             // Token synToken = new Token(synonym.getToken().toCharArray(),
> > >             //     reusableToken.startOffset(), reusableToken.endOffset(),
> > >             //     synonym.getToken().length(), 0);
> > >             Token newTok = new Token(reusableToken.startOffset(),
> > >                                      reusableToken.endOffset(), SYNONYM);
> > >             newTok.setTermBuffer(synonym.getToken().toCharArray(), 0,
> > >                                  synonym.getToken().length());
> > >             // set the position increment to zero
> > >             // this tells lucene the synonym is
> > >             // in the exact same location as the originating word
> > >             newTok.setPositionIncrement(0);
> > >             boostPayload = new Payload(
> > >                     PayloadHelper.encodeFloat(synonym.getWieght()));
> > >             newTok.setPayload(boostPayload);
> > >             synonymTokenQueue.add(newTok);
> > >         }
> > >         return reusableToken;
> > >     }
> > > }
> > >
> > >
> > > import DTSynonymFilter;
> > > import com.google.common.collect.ArrayListMultimap;
> > > import java.io.File;
> > > import java.io.IOException;
> > > import java.util.List;
> > > import java.util.Map;
> > > import java.util.logging.Level;
> > > import java.util.logging.Logger;
> > > import org.apache.lucene.analysis.Token;
> > > import org.apache.lucene.analysis.TokenStream;
> > > import org.apache.solr.analysis.BaseTokenFilterFactory;
> > > import org.apache.solr.analysis.TokenizerFactory;
> > > import org.apache.solr.common.ResourceLoader;
> > > import org.apache.solr.common.util.StrUtils;
> > > import org.apache.solr.util.plugin.ResourceLoaderAware;
> > >
> > > /**
> > >  * @author david
> > >  */
> > > public class DTSynonymFactory extends BaseTokenFilterFactory
> > >         implements ResourceLoaderAware {
> > >
> > >     boolean informed = false;
> > >     String synonyms = null;
> > >
> > >     public DTSynonymFactory() {
> > >         // this.syns = ArrayListMultimap.create();
> > >     }
> > >
> > >     final static Logger log =
> > >             Logger.getLogger(DTSynonymFactory.class.getName());
> > >
> > >     private static TokenizerFactory loadTokenizerFactory(
> > >             ResourceLoader loader, String cname, Map<String, String> args) {
> > >         TokenizerFactory tokFactory =
> > >                 (TokenizerFactory) loader.newInstance(cname);
> > >         tokFactory.init(args);
> > >         return tokFactory;
> > >     }
> > >
> > >     private ArrayListMultimap<String, Synonym> syns = null;
> > >
> > >     public DTSynonymFilter create(TokenStream input) {
> > >         Thread.dumpStack();
> > >         try {
> > >             Thread.sleep(5000);
> > >         } catch (InterruptedException ex) {
> > >             Logger.getLogger(DTSynonymFactory.class.getName())
> > >                     .log(Level.SEVERE, null, ex);
> > >         }
> > >         if (syns != null) {
> > >             System.out.println("in create() syns is " + syns
> > >                     + " syns size is " + " ");
> > >             return new DTSynonymFilter(input, syns);
> > >         } else {
> > >             System.out.println("in create() syns is " + syns
> > >                     + " and informed is " + informed);
> > >             return new DTSynonymFilter(input, null);
> > >         }
> > >     }
> > >
> > >     @Override
> > >     public void inform(ResourceLoader loader) {
> > >         try {
> > >             synonyms = args.get("FreskoFunction");
> > >             System.out.println("in DTSynonymFilter.inform() synonyms file is "
> > >                     + synonyms);
> > >             boolean ignoreCase = getBoolean("ignoreCase", false);
> > >             System.out.println("in DTSynonymFilter.inform() ignoreCase is "
> > >                     + ignoreCase);
> > >             boolean expand = getBoolean("expand", true);
> > >             System.out.println("in DTSynonymFilter.inform() expand is "
> > >                     + expand);
> > >             String tf = args.get("tokenizerFactory");
> > >
> > >             TokenizerFactory tokFactory = null;
> > >             if (tf != null) {
> > >                 tokFactory = loadTokenizerFactory(loader, tf, args);
> > >             }
> > >             if (tf != null) {
> > >                 System.out.println("TokenizerFactory loaded ");
> > >             }
> > >             if (synonyms != null) {
> > >                 List<String> wlist = null;
> > >                 try {
> > >                     File synonymFile = new File(synonyms);
> > >                     if (synonymFile.exists()) {
> > >                         wlist = loader.getLines(synonyms);
> > >                     } else {
> > >                         List<String> files = StrUtils.splitFileNames(synonyms);
> > >                         for (String file : files) {
> > >                             wlist = loader.getLines(file.trim());
> > >                         }
> > >                     }
> > >                 } catch (Exception e) {
> > >                     e.printStackTrace();
> > >                     throw new RuntimeException(e);
> > >                 }
> > >                 syns = ArrayListMultimap.create();
> > >                 populateSynMap("\\|", wlist);
> > >                 if (syns == null) {
> > >                     System.out.println("sysns after create and populate is null!!!!!!");
> > >                     Thread.sleep(5000);
> > >                 } else {
> > >                     System.out.println("after crete the size of syns is "
> > >                             + syns.size());
> > >                     informed = true;
> > >                 }
> > >                 // synMap = new SynonymMap(ignoreCase);
> > >                 // parseRules(wlist, synMap, "=>", ",", expand, tokFactory);
> > >             } else {
> > >                 throw new RuntimeException("Could not find synonyms");
> > >             }
> > >         } catch (Exception e) {
> > >             e.printStackTrace();
> > >             throw new RuntimeException(e);
> > >         }
> > >     }
> > > }
> > >
> > > Thanks in advance
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
>
> --
> All that is necessary for evil to triumph is for good men to do nothing
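For completeness, the reusable-token contract under discussion can be sketched without Lucene (the Token/TokenStream types below are simplified, hypothetical stand-ins, not the real Lucene API): a filter's next(reusableToken) should first pull a token from its input and propagate null at end of stream, rather than returning the untouched reusableToken, whose freshly allocated buffer is exactly the "10 empty chars" observed above.

```java
import java.util.Arrays;
import java.util.Iterator;

public class FilterPatternDemo {
    // Minimal stand-ins for Lucene's Token / TokenStream (for illustration only)
    static class Token {
        char[] buf = new char[10];   // over-allocated term buffer
        int len = 0;                 // number of valid chars
        String term() { return new String(buf, 0, len); }
        void setTerm(String s) { s.getChars(0, s.length(), buf, 0); len = s.length(); }
    }

    interface TokenStream {
        Token next(Token reusable);  // returns null when the stream is exhausted
    }

    // A source stream that refills the caller's reusable token
    static TokenStream fromWords(String... words) {
        Iterator<String> it = Arrays.asList(words).iterator();
        return reusable -> {
            if (!it.hasNext()) return null;  // end of stream
            reusable.setTerm(it.next());     // fill the reusable token
            return reusable;
        };
    }

    // A pass-through filter that delegates correctly
    static TokenStream filter(TokenStream input) {
        return reusable -> {
            Token t = input.next(reusable);  // pull from upstream first
            if (t == null) return null;      // propagate end-of-stream
            // ... inspect or modify t here, respecting t.len ...
            return t;
        };
    }

    public static void main(String[] args) {
        TokenStream ts = filter(fromWords("foo", "bar"));
        Token reusable = new Token();
        for (Token t = ts.next(reusable); t != null; t = ts.next(reusable)) {
            System.out.println(t.term());    // prints "foo", then "bar"
        }
    }
}
```

The key design point is that the same Token instance circulates through the whole chain; each stage overwrites its buffer and length instead of allocating a new object per token.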