Hi, I have a similar situation. I also considered using $, but to avoid (potential) problems with tokenisers, I instead defined a string in a config file that is guaranteed never to occur in a document and never to be searched for, e.g.
dfgjkjrkruigduhfkdgjrugr

Cheers, Adrian

--
Java Software Developer
http://www.databasesandlife.com/

On 15/02/2008, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> I haven't really been following this thread that closely, but...
>
> : Why not just use $$$$$$$$? Check to ensure that it makes
> : it through whatever analyzer you choose, though. For instance,
> : LetterTokenizer will remove it...
>
> 1) I'm 99% sure you can do something like this...
>
>     Document doc = new Document();
>     for (int i = 0; i < pages.length; i++) {
>         doc.add(new Field("text", pages[i], Field.Store.NO,
>                 Field.Index.TOKENIZED));
>         doc.add(new Field("text", "$$", Field.Store.NO,
>                 Field.Index.UN_TOKENIZED));
>     }
>
> ...and you'll get your magic token regardless of whether it would normally
> make it through your analyzer. In fact, you want it to be something your
> analyzer could never produce, even if it appears in the original text, so
> you don't get false boundaries (i.e. if you use an analyzer that lowercases
> everything, then "A" makes a perfectly fine boundary token).
>
> 2) If your goal is just to be able to query for phrases without crossing
> page boundaries, it's a lot simpler to use a really big
> positionIncrementGap with your analyzer (and add each page as a separate
> Field instance). Boundary tokens like these are really only necessary if
> you want more complex queries (like "find X and Y on the same page but
> not in the same sentence").
>
> -Hoss
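And for completeness, the simpler positionIncrementGap route from point 2
needs only a tiny analyzer subclass. Again just a sketch against the same
2.x-era API as the loop above; the name PageGapAnalyzer and the gap of
10000 are illustrative choices of mine, not anything from Hoss's mail:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // The gap returned here is added between successive Field instances
    // that share a name, so token positions on page N+1 start at least
    // 10000 past the last token of page N, and no phrase (or sloppy
    // phrase) query can match across the boundary.
    public class PageGapAnalyzer extends StandardAnalyzer {
        public int getPositionIncrementGap(String fieldName) {
            return 10000; // anything larger than your longest page, in tokens
        }
    }

Index with new IndexWriter(dir, new PageGapAnalyzer(), true) and add each
page as its own "text" Field, exactly as in the loop above but without
the "$$" marker.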