PatternReplaceCharFilter would probably work, or maybe a custom CharFilter? Any *CharFilter has the advantage of preserving the original text offsets, which matters for highlighting.
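To make the PatternReplaceCharFilter idea concrete, here is a rough sketch of a pattern that might do the job. It is exercised with plain java.util.regex so that it runs without Lucene on the classpath; the class name DataBlockPattern and the helper strip are made up for illustration. The same Pattern could in principle be handed to PatternReplaceCharFilter with an empty replacement, but the constructor arguments differ between Lucene versions, so check the Javadoc for the release in use.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: shows a regex that matches one
// DATA_BEGIN..DATA_END block at a time.
class DataBlockPattern {
    // DOTALL lets '.' cross line breaks; the reluctant '.*?' stops at the
    // first DATA_END, so consecutive blocks are matched separately.
    static final Pattern DATA_BLOCK =
            Pattern.compile("DATA_BEGIN.*?DATA_END", Pattern.DOTALL);

    // Removes every DATA_BEGIN..DATA_END block (markers included).
    static String strip(String text) {
        return DATA_BLOCK.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "Data for 2010\n"
                + "Description: This section has a general description.\n"
                + "DATA_BEGIN\n"
                + "01 3243.433 43534.324 45345.2443\n"
                + "DATA_END\n"
                + "Data for 2011\n";
        System.out.print(strip(sample));
    }
}
```

One thing to keep in mind: a regex can only match across a block that is held in a buffer, so for very large data sections this is not truly streaming.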
Steve

> -----Original Message-----
> From: Glen Newton [mailto:glen.new...@gmail.com]
> Sent: Monday, February 27, 2012 12:57 PM
> To: java-user@lucene.apache.org
> Subject: Re: Customizing indexing of large files
>
> Hi,
>
> Understood.
> Write a custom FileReader that filters out the text you do not want.
> This will do it streaming.
>
> Glen
>
> On Mon, Feb 27, 2012 at 12:46 PM, Prakash Reddy Bande
> <praka...@altair.com> wrote:
> > Hi,
> >
> > Description is multiline, in addition there is other text also. So, essentially what I need is to jump to DATA_END as soon as I hit DATA_BEGIN.
> >
> > I am creating the field using the constructor Field(String name, Reader reader) and using StandardAnalyzer. Right now I am using FileReader, which is causing all the text to be indexed/tokenized.
> >
> > Amount of text I am interested in is also pretty large, description is just one such example. So, I really want some stream based implementation to avoid keeping large amount of text in memory. Maybe a custom TokenStream, but I don't know what to implement in TokenStream. The only abstract method is incrementToken, I have no idea what to do in it.
> >
> > Regards,
> >
> > Prakash Bande
> > Director - Hyperworks Enterprise Software
> > Altair Eng. Inc.
> > Troy MI
> > Ph: 248-614-2400 ext 489
> > Cell: 248-404-0292
> >
> > -----Original Message-----
> > From: Glen Newton [mailto:glen.new...@gmail.com]
> > Sent: Monday, February 27, 2012 12:05 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Customizing indexing of large files
> >
> > I'd suggest writing a perl script or
> > insert-favourite-scripting-language-here script to pre-filter this
> > content out of the files before it gets to Lucene/Solr
> > Or you could just grep for "Data" and "Description" (or is
> > 'Description' multi-line)?
> >
> > -Glen Newton
> >
> > On Mon, Feb 27, 2012 at 11:55 AM, Prakash Reddy Bande
> > <praka...@altair.com> wrote:
> >> Hi,
> >>
> >> I want to customize the indexing of some specific kind of files I have. I am using 2.9.3 but upgrading is possible.
> >> This is how my file's data looks
> >>
> >> *****************************
> >> Data for 2010
> >> Description: This section has a general description of the data.
> >> DATA_BEGIN
> >> Month P1 P2 P3
> >> 01 3243.433 43534.324 45345.2443
> >> 02 3242.324 234234.24 323.2343
> >> ...
> >> ...
> >> ...
> >> ...
> >> DATA_END
> >> Data for 2011
> >> Description: This section has a general description of the data.
> >> DATA_BEGIN
> >> Month P1 P2 P3
> >> 01 3243.433 43534.324 45345.2443
> >> 02 3242.324 234234.24 323.2343
> >> ...
> >> ...
> >> ...
> >> ...
> >> DATA_END
> >> *****************************
> >>
> >> I would like to use a StandardAnalyzer, but do not want to index the data of the columns, i.e. skip all those numbers. Basically, as soon as I hit the keyword DATA_BEGIN, I want to jump to DATA_END.
> >> So, what is the best approach? Using a custom Reader, custom tokenizer or some other mechanism.
> >> Regards,
> >>
> >> Prakash Bande
> >> Altair Eng. Inc.
> >> Troy MI
> >> Ph: 248-614-2400 ext 489
> >> Cell: 248-404-0292
> >>
> >
> > --
> > -
> > http://zzzoot.blogspot.com/
> > -
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
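
For the streaming custom Reader suggested earlier in the thread, a minimal sketch using only java.io might look like the following. It assumes DATA_BEGIN and DATA_END each sit alone on a line, as in the sample file; the class name SkipDataBlockReader and the helper filter are hypothetical, and line endings are normalized to "\n" as a simplification.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical Reader that drops every line from DATA_BEGIN through DATA_END.
// Only one line is buffered at a time, so large files are processed streaming.
class SkipDataBlockReader extends Reader {
    private final BufferedReader in;
    private String pending = "";      // current kept line, not yet fully handed out
    private int pos = 0;              // next unread index in pending
    private boolean skipping = false; // true while inside a data block

    SkipDataBlockReader(Reader source) {
        this.in = new BufferedReader(source);
    }

    // Loads the next line that lies outside a data block; false at end of input.
    private boolean fill() throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            String marker = line.trim();
            if (skipping) {
                if (marker.equals("DATA_END")) skipping = false;
            } else if (marker.equals("DATA_BEGIN")) {
                skipping = true;
            } else {
                pending = line + "\n"; // assumption: normalize line endings
                pos = 0;
                return true;
            }
        }
        return false;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        if (pos >= pending.length() && !fill()) return -1;
        int n = Math.min(len, pending.length() - pos);
        pending.getChars(pos, pos + n, buf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    // Convenience for trying the reader on an in-memory string.
    static String filter(String text) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[64];
        try (Reader r = new SkipDataBlockReader(new StringReader(text))) {
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(filter("Data for 2010\nDescription: text\n"
                + "DATA_BEGIN\n01 3243.433\nDATA_END\nData for 2011\n"));
    }
}
```

An instance wrapping a FileReader could be passed to the Field(String name, Reader reader) constructor mentioned above. Unlike a CharFilter, a plain Reader does not report corrected offsets, so token offsets would likely not line up with the original file when highlighting.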