Re: NO_NORM and TOKENIZED

2008-03-05 Thread Michael McCandless


Correct, they are logically orthogonal, and I agree the API is  
somewhat confusing since "NO_NORMS" is mixing up two things.


To get a tokenized field without norms you can create the field with  
Index.TOKENIZED, and then call setOmitNorms(true).
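For example, a rough, untested sketch against the 2.3-era API (the field
name and text here are made up):

  Document doc = new Document();
  Field body = new Field("body", "some text to analyze",
      Field.Store.NO, Field.Index.TOKENIZED);
  body.setOmitNorms(true);   // tokenized, but no norms stored for this field
  doc.add(body);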


Note that norms "spread" during merges, so, if you really want  
NO_NORMS for a given field X then every doc in the index must have  
its field X indexed with NO_NORMS.  Ie, build a clean index if you  
decide to turn off norms for field X.


Mike

Tobias Hill wrote:


Hi,

I am quite new to the Lucene API. I find the Field-constructor
unintuitive. Maybe I have misunderstood it. Let's find out...

It can be used either as:
new Field("field", "data", Store.NO, TOKENIZED)

or:
new Field("field", "data", Store.NO, NO_NORM)


As I understand it NO_NORM and TOKENIZED are not settings for
a one-dimensional behaviour - on the contrary they are rather
orthogonal.

I.e. it is quite likely that I would want _both_ TOKENIZED and
NO_NORM. This is especially true for fields that are of approx. equal
and short length over the doc-space.

- Am I right in my reasoning (which means that the API is a bit  
unclear)?

Or
- Have I misunderstood something fundamental about TOKENIZED and  
NO_NORM?


Thankful for any feedback on this,
Tobias




Re: Incorrect Token Offset when using multiple fieldable instance

2008-03-05 Thread Michael McCandless


This is how Lucene has worked for quite some time (since 1.9).

When there are multiple fields with the same name in one Document,  
each field's offset starts from the last offset (offset of the last  
token) seen in the previous field.  If tokens are skipped at the end  
there's no way IndexWriter can know (because tokenStream doesn't  
return them).  It's as if we need the ability to query a tokenStream  
for its "final" offset or something.


One workaround might be to insert an "end marker" token, with the  
true end offset, which is a term you would never search on?
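Something along these lines might work (a rough, untested sketch against
the 2.3 TokenStream API; the class name and sentinel term are made up, and
the true end offset has to be passed in by whoever knows the text length):

  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  /** Appends one sentinel token carrying the true end offset of the text. */
  public class EndMarkerFilter extends TokenFilter {
    private final int finalOffset;
    private boolean done = false;

    public EndMarkerFilter(TokenStream in, int finalOffset) {
      super(in);
      this.finalOffset = finalOffset;
    }

    public Token next() throws IOException {
      Token t = input.next();
      if (t != null) {
        return t;
      }
      if (!done) {
        done = true;
        // a term you would never search on, positioned at the true end
        return new Token("_end_", finalOffset, finalOffset);
      }
      return null;
    }
  }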


Mike

Renaud Delbru wrote:


Hi,

I currently use multiple fieldable instances for indexing sentences  
of a document.
When there is only one single fieldable instance, the token offset  
generation performed in DocumentWriter is correct.
The problem appears when there are two or more fieldable instances.  
In DocumentWriter$FieldData#invertField method, if the field is  
tokenized, instead of updating offset attribute with  
stringValue.length() (which is performed if the field is not  
tokenized, line 1458), you update the offset attribute with the end  
offset of the last token (line 1503: offset = offsetEnd+1;).
As a consequence, if a token has been filtered (for example a  
stopword, a dot, a space, etc.), the offset attribute is updated  
with the end offset of the last token that was not filtered. In this  
case, you store an incorrect offset in the offset attribute (the  
offset is shifted back) and all the following fieldable instances  
will have their offsets shifted back.


Is it a bug? Or is it a desired behavior (and in this case, why)?

Regards.

--
Renaud Delbru,
E.C.S., Ph.D. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/




Re: More IndexDeletionPolicy questions

2008-03-05 Thread Michael McCandless


OK got it.  Yes, sharing an index over NFS requires your own  
DeletionPolicy.
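Something along these lines could work (a rough, untested sketch only; the
"in use" registry of segments files is an assumed application-side
structure you'd have to maintain yourself):

  import java.util.List;
  import java.util.Set;
  import org.apache.lucene.index.IndexCommitPoint;
  import org.apache.lucene.index.IndexDeletionPolicy;

  // Keep any commit point whose segments file is still registered as in
  // use by an open reader; delete the rest, except the newest commit.
  public class ReaderAwareDeletionPolicy implements IndexDeletionPolicy {
    private final Set inUse;   // segments_N names currently held by readers

    public ReaderAwareDeletionPolicy(Set inUseSegmentsFiles) {
      this.inUse = inUseSegmentsFiles;
    }

    public void onInit(List commits) { onCommit(commits); }

    public void onCommit(List commits) {
      // the last entry is the newest commit; never delete it
      for (int i = 0; i < commits.size() - 1; i++) {
        IndexCommitPoint commit = (IndexCommitPoint) commits.get(i);
        if (!inUse.contains(commit.getSegmentsFileName())) {
          commit.delete();
        }
      }
    }
  }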


So presumably you have readers on different machines accessing the  
index over NFS, and then one machine that does the writing to that  
index?  How do you plan to get which commit point each of these  
readers is currently using back to the single writer?


Mike

Tim Brennan wrote:

The bigger picture here is NFS-safety.  When I run a search, I hand  
off the search results to another thread so that they can be  
processed as necessary -- in particular so that they can be JOINed  
with a SQL DB -- but I don't want to completely lock the index from  
writes while doing a bunch of SQL calls.  Using the commit point  
tracking I can make sure my appropriate snapshot stays around until  
I'm completely done with it, even if I'm using an NFS mounted  
filesystem that doesn't have delete-last semantics.


--tim





>> It seems like it should be pretty simple -- keep a list of open
>> IndexReaders, track what Segment files they're pointing to, and in
>> onCommit don't delete those segments.
>
> This implies you have multiple readers in a single JVM?  If so, you
> should not need to make a custom deletion policy to handle this case
> -- the OS should be properly protecting open files from deletion.
> Can you shed more light on the bigger picture here?
>
>> Unfortunately it ends up being very difficult to directly determine
>> what Segment an IndexReader is pointing to.  Is there some
>> straightforward way that I'm missing -- all I've managed to do so
>> far is to remember the most recent one from onCommit/onInit and use
>> that one...that works OK, but makes bootstrapping a pain if you
>> try to open a Reader before you've opened the writer once.
>>
>> Also, when I use IndexReader.reopen(), can I assume that the newly
>> returned reader is pointing at the "most recent" segment?  I think
>> so...
>
> Yes, except you have a synchronization challenge: if the writer is in
> the process of committing just as your reader opens you can't be
> certain whether the reader got the new commit or the previous one.
> If you have external synchronization to ensure reader only re-opens
> after writer has fully committed then this isn't an issue.
>
>> Here's a sequence of steps that happen in my app:
>>
>> 0) open writer, onInit tells me that seg_1 is the most recent segment
>>
>> 1) Open reader, assume it is pointing to seg_1 from (0)
>>
>> 2) New write commits into seg_2, don't delete seg_1 b/c of (1)
>>
>> 3) Call reader.reopen() on the reader from (1)...new reader is
>> pointing to seg_2 now?
>>
>> 4) seg_1 stays around until the next time I open or commit a
>> writer, then it is removed.
>>
>> Does that seem reasonable?
>>
>> --tim






Re: Incorrect Token Offset when using multiple fieldable instance

2008-03-05 Thread Renaud Delbru
Do you know if there will be side-effects if we replace in 
DocumentWriter$FieldData#invertField

offset = offsetEnd+1;
by
offset = stringValue.length();

I still don't understand the reason for this choice for incrementing 
the start offset.


Regards.

Michael McCandless wrote:


This is how Lucene has worked for quite some time (since 1.9).

When there are multiple fields with the same name in one Document, 
each field's offset starts from the last offset (offset of the last 
token) seen in the previous field.  If tokens are skipped at the end 
there's no way IndexWriter can know (because tokenStream doesn't 
return them).  It's as if we need the ability to query a tokenStream 
for its "final" offset or something.


One workaround might be to insert an "end marker" token, with the true 
end offset, which is a term you would never search on?


Mike

Renaud Delbru wrote:


Hi,

I currently use multiple fieldable instances for indexing sentences 
of a document.
When there is only one single fieldable instance, the token offset 
generation performed in DocumentWriter is correct.
The problem appears when there is two or more fieldable instances. In 
DocumentWriter$FieldData#invertField method, if the field is 
tokenized, instead of updating offset attribute with 
stringValue.length() (which is performed if the field is not 
tokenized, line 1458), you update the offset attribute with the end 
offset of the last token (line 1503: offset = offsetEnd+1;).
As a consequence, if a token has been filtered (for example a 
stopword, a dot, a space, etc.), the offset attribute is updated with 
the end offset of the last token not filtered. In this case, you 
store inside the offset attribute an incorrect offset (the offset is 
shift back) and all the next fieldable instances will have their 
offset shifted back.


Is it a bug ? Or is it a desired behavior (in this case, why ?) ?

--
Renaud Delbru,
E.C.S., Ph.D. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/




Re: Security filtering from external DB

2008-03-05 Thread Gabriel Landais

Jake Mannix wrote:

Gabriel,
  You can make this search much more efficient as follows: say that you have
a method

public BooleanQuery createQuery(Collection allowedUUIDs);

  that works as you describe.  Then you can easily create a useful reusable
filter as follows:

  Filter filter = new CachingWrapperFilter(new
QueryFilter(createQuery(uuidCollection)));

  this filter will cache BitSets keyed on IndexReader instances, so hang
onto it and use it for
the entire time your uuidCollection doesn't change.

   Hope this helps.

  -jake


Hi,
 yes it helps!
 As a result, I will not use my own BitSet cache mechanism. I have been 
dumb when I believed that I couldn't create a SHOULD query on >1024 
terms, as I can create inner BooleanQuery...

 Just as note, QueryFilter is deprecated, QueryWrapperFilter replaces it.

The new code :

private Query createQuery(Collection<String> visibleItems) {
    int maxTerms = BooleanQuery.getMaxClauseCount() - 1;
    int i = 0;

    BooleanQuery query = new BooleanQuery();
    BooleanQuery subQuery = new BooleanQuery();

    for (String uuid : visibleItems) {
        subQuery.add(new TermQuery(new Term(KEY_UUID, uuid)),
                BooleanClause.Occur.SHOULD);
        i++;
        // start a new sub-query once the current one is full
        if (i % maxTerms == 0) {
            query.add(subQuery, BooleanClause.Occur.SHOULD);
            subQuery = new BooleanQuery();
        }
    }

    // everything fit into a single sub-query: return it directly
    if (query.getClauses().length == 0) {
        return subQuery;
    }
    // otherwise don't forget the last, partially filled sub-query
    if (subQuery.getClauses().length > 0) {
        query.add(subQuery, BooleanClause.Occur.SHOULD);
    }
    return query;
}

private Filter createFilter(Collection visibleItems) throws 
SimExplorerTechnicalException {
Filter filter = new CachingWrapperFilter(new 
QueryWrapperFilter(createQuery(visibleItems)));

return filter;
}

 Not tested yet, but looks good.
 Thanks a lot!
Gabriel




RE: NO_NORM and TOKENIZED

2008-03-05 Thread spring
Hm, what exactly does NO_NORM mean?

Thank you





Re: NO_NORM and TOKENIZED

2008-03-05 Thread Michael McCandless


NO_NORMS means "do index the field as a single token (ie, do not  
tokenize the field), and do not store norms for it".
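Ie, roughly (untested, field name and value made up):

  doc.add(new Field("sku", "ABC-123", Field.Store.YES, Field.Index.NO_NORMS));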


Mike

On Mar 5, 2008, at 5:20 AM, <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> wrote:


Hm, what exactly does NO_NORM mean?

Thank you





Re: More IndexDeletionPolicy questions

2008-03-05 Thread Tim Brennan
No, I have multiple readers in the same VM.  I track open readers within my VM 
and save those readers' commit points until the readers are gone.

--tim



- "Michael McCandless" <[EMAIL PROTECTED]> wrote:

> OK got it.  Yes, sharing an index over NFS requires your own  
> DeletionPolicy.
> 
> So presumably you have readers on different machines accessing the  
> index over NFS, and then one machine that does the writing to that  
> index?  How so you plan to get which commit point each of these  
> readers is currently using back to the single writer?
> 
> Mike
> 
> Tim Brennan wrote:
> 
> > The bigger picture here is NFS-safety.  When I run a search, I hand 
> 
> > off the search results to another thread so that they can be  
> > processed as necessary -- in particular so that they can be JOINed 
> 
> > with a SQL DB -- but I don't want to completely lock the index from 
> 
> > writes while doing a bunch of SQL calls.  Using the commit point  
> > tracking I can make sure my appropriate snapshot stays around until 
> 
> > I'm completely done with it, even if I'm using an NFS mounted  
> > filesystem that doesn't have delete-last semantics.
> >
> > --tim
> >
> >
> >
> >
> >>> It seems like it should be pretty simple -- keep a list of open
> >>> IndexReaders, track what Segment files they're pointing to, and
> in
> >>
> >>> onCommit don't delete those segments.
> >>
> >> This implies you have multiple readers in a single JVM?  If so,
> you
> >> should not need to make a custom deletion policy to handle this
> case
> >>
> >> -- the OS should be properly protecting open files from deletion.  
> 
> >> Can you  shed more light on the bigger picture here?
> >>
> >>> Unfortunately it ends up being very difficult to directly
> determine
> >>> what Segment an IndexReader is pointing to.  Is there some
> >>> straightforward way that I'm missing -- all I've managed to do so
> >>> far is to remember the most recent one from onCommit/onInit and
> use
> >>
> >>> that onethat works OK, but makes bootstrapping a pain if you
> >>> try to open a Reader before you've opened the writer once.
> >>>
> >>> Also, when I use IndexReader.reopen(), can I assume that the
> newly
> >>
> >>> returned reader is pointing at the "most recent" segment?  I
> think
> >>
> >>> so...
> >>
> >> Yes, except you have a synchronization challenge: if the writer is
> in
> >>
> >> the process of committing just as your reader opens you can't be
> >> certain whether the reader got the new commit or the previous one.
> >> If you have external synchronization to ensure reader only
> re-opens
> >> after writer has fully committed then this isn't an issue.
> >>
> >>>
> >>> Here's a sequence of steps that happen in my app:
> >>>
> >>> 0) open writer, onInit tells me that seg_1 is the most recent
> >> segment
> >>>
> >>> 1) Open reader, assume it is pointing to seg_1 from (0)
> >>>
> >>> 2) New write commits into seg_2, don't delete seg_1 b/c of (1)
> >>>
> >>> 3) Call reader.reopen() on the reader from (1)new reader is
> >>> pointing to seg_2 now?
> >>>
> >>> 4) seg_1 stays around until the next time I open or commit a
> >>> writer, then it is removed.
> >>>
> >>>
> >>> Does that seem reasonable?
> >>>
> >>>
> >>> --tim



Re: Incorrect Token Offset when using multiple fieldable instance

2008-03-05 Thread Michael McCandless


Well, first off, sometimes the thing being indexed isn't a string, so  
you have no stringValue to get its length.  It could be a Reader or a  
TokenStream.


Second off, it's conceivable that an analyzer computes its own  
"interesting" offsets that are not in fact simple indices into the  
stringValue, though I would expect that to be the exception not the  
rule.


I can't think of any other harm ... so if neither of these apply in  
your situation then it should be OK?


I do agree this seems like a bug.  EG, if you use Highlighter on a  
multi-valued field indexed with stored field & term vectors and say  
the first field ended with a stop word that was filtered out, then  
your offsets will be off and the wrong parts will be highlighted in  
all but the first field (I think?).  I think we really need some way  
for the tokenStream to "declare" its final offset at the end.


Mike

Renaud Delbru wrote:

Do you know if there will be side-effects if we replace in  
DocumentWriter$FieldData#invertField

offset = offsetEnd+1;
by
offset = stringValue.length();

I still not understand the reason of such choice for the  
incrementation of the start offset.


Regards.

Michael McCandless wrote:


This is how Lucene has worked for quite some time (since 1.9).

When there are multiple fields with the same name in one Document,  
each field's offset starts from the last offset (offset of the  
last token) seen in the previous field.  If tokens are skipped at  
the end there's no way IndexWriter can know (because tokenStream  
doesn't return them).  It's as if we need the ability to query a  
tokenStream for its "final" offset or something.


One workaround might be to insert an "end marker" token, with the  
true end offset, which is a term you would never search on?


Mike

Renaud Delbru wrote:


Hi,

I currently use multiple fieldable instances for indexing  
sentences of a document.
When there is only one single fieldable instance, the token  
offset generation performed in DocumentWriter is correct.
The problem appears when there is two or more fieldable  
instances. In DocumentWriter$FieldData#invertField method, if the  
field is tokenized, instead of updating offset attribute with  
stringValue.length() (which is performed if the field is not  
tokenized, line 1458), you update the offset attribute with the  
end offset of the last token (line 1503: offset = offsetEnd+1;).
As a consequence, if a token has been filtered (for example a  
stopword, a dot, a space, etc.), the offset attribute is updated  
with the end offset of the last token not filtered. In this case,  
you store inside the offset attribute an incorrect offset (the  
offset is shift back) and all the next fieldable instances will  
have their offset shifted back.


Is it a bug ? Or is it a desired behavior (in this case, why ?) ?

--
Renaud Delbru,
E.C.S., Ph.D. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/




Re: More IndexDeletionPolicy questions

2008-03-05 Thread Michael McCandless


Oh then I don't think you need a custom deletion policy.

A single NFS client emulates delete-on-last-close semantics.  Ie, if  
the deletion of the file, and the held-open file handles, happen  
through a single NFS client, then it emulates delete-on-last-close  
semantics for you, by creating those .nfsXXX files on the server.


Test to be sure though!

Mike

Tim Brennan wrote:

No, I have multiple readers in the same VM.  I track open readers  
within my VM and save those reader's commit points until the  
readers are gone.


--tim



- "Michael McCandless" <[EMAIL PROTECTED]> wrote:


OK got it.  Yes, sharing an index over NFS requires your own
DeletionPolicy.

So presumably you have readers on different machines accessing the
index over NFS, and then one machine that does the writing to that
index?  How so you plan to get which commit point each of these
readers is currently using back to the single writer?

Mike

Tim Brennan wrote:


The bigger picture here is NFS-safety.  When I run a search, I hand



off the search results to another thread so that they can be
processed as necessary -- in particular so that they can be JOINed



with a SQL DB -- but I don't want to completely lock the index from



writes while doing a bunch of SQL calls.  Using the commit point
tracking I can make sure my appropriate snapshot stays around until



I'm completely done with it, even if I'm using an NFS mounted
filesystem that doesn't have delete-last semantics.

--tim








Re: Why indexing database is necessary? (RE: indexing database)

2008-03-05 Thread Shalin Shekhar Mangar
Hi,

We have built a data import tool which can read from databases and add
the content to Solr. We found that making content available for full text
search and faceted search was a common use case and usually everyone
ends up writing a custom ETL based tool for this task. Therefore we're
contributing this back to the Solr project.

Please look at https://issues.apache.org/jira/browse/SOLR-469 for
details. A user guide is provided at
http://wiki.apache.org/solr/DataImportHandler

I realize that having such a tool for lucene would also be helpful for
a large audience. However, currently we're more focused on Solr since
we don't use lucene directly in our own production environments.

On Wed, Mar 5, 2008 at 12:22 AM, Darren Hartford <[EMAIL PROTECTED]> wrote:
> Indexing with lucene/nutch on top of/instead of DB indexing for:
>
>  1) relevancy scoring
>  2) alias searching (i.e. a large amount of aliases, like first names)
>  3) highlighting
>  4) cross-datasource searching (multi DB, DB + XML files, etc).
>
>  As for best approach to externally index, I do not have any direct
>  pointers.  I would recommend looking at an ETL tool that can be extended
>  for this purpose (I've started writing a plugin for Pentaho, but got
>  pulled off and haven't finished it -- and that was for Solr, not
>  lucene/nutch).
>
>  -D
>
>
>
>  > -Original Message-
>  > From: Duan, Nick [mailto:[EMAIL PROTECTED]
>  > Sent: Tuesday, March 04, 2008 1:33 PM
>  > To: java-user@lucene.apache.org
>  > Subject: Why indexing database is necessary? (RE: indexing database)
>  >
>  > Could anyone provide any insight on why someone would use nutch/lucene
>  > or any other search engines to index relational databases? With use
>  > cases if possible?  Shouldn't the database's own indexing mechanism be
>  > used since it is more efficient?
>  >
>  > If there is such a need of indexing the database content using search
>  > engines, what would be the best approach other than de-normalizing the
>  > database?
>  >
>  > Thanks a lot in advance!
>  >
>  > ND
>  > -Original Message-
>  > From: payo [mailto:[EMAIL PROTECTED]
>  > Sent: Tuesday, March 04, 2008 12:36 PM
>  > To: [EMAIL PROTECTED]
>  > Subject: indexing database
>  >
>  >
>  > hi to all
>  >
>  > i can index a database with nutch?
>  >
>  > i am use nutch 0.8.1
>  >
>  > thanks
>  > --
>  > View this message in context:
>  > http://www.nabble.com/indexing-database-tp15832696p15832696.html
>  > Sent from the Nutch - User mailing list archive at Nabble.com.
>  >
>  >



-- 
Regards,
Shalin Shekhar Mangar.




RE: C++ as token in StandardAnalyzer?

2008-03-05 Thread Tom Conlon
Hi Donna - See previous post below that may help. Tom

Hi,

In case this is of help to others:

Crux of problem: 
I wanted numbers and characters such as # and + to be considered.

Solution:
implement a LowercaseWhitespaceAnalyzer and a
LowercaseWhitespaceTokenizer.

i.e.
IndexWriter writer = new IndexWriter(INDEX_DIR,
    new LowercaseWhitespaceAnalyzer(), true);
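One way to get that effect without writing a separate tokenizer class
(an untested sketch that just composes the core whitespace tokenizer with
the lowercase filter, so "C++", "C#" and "asp.net" survive as tokens):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class LowercaseWhitespaceAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // split on whitespace only, then lowercase each token
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}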

Tom
===
Diagnostics:

StandardAnalyzer

Enter Querystring: (C++ AND C#)  Searching for: +c +c
Enter Querystring: (C\+\+ AND C\#)   Searching for: +c +c
Enter Querystring: ("moss 2007" or "sharepoint 2007") and "asp.net"
Searching for: ("moss 2007" "sharepoint 2007") asp.net

SimpleAnalyser
--
Enter Querystring: C++ Searching for: c
Enter Querystring: C#  Searching for: c
Enter Querystring: ("moss 2007" or "sharepoint 2007") and "asp.net"
Searching for: (moss or sharepoint) and "asp net"

WhitespaceAnalyzer
--
Enter Querystring: (C++ AND C#)  Searching for: +C++ +C#
Enter Querystring: ("moss 2007" or "sharepoint 2007") and "asp.net"
Searching for: ("moss 2007" or "sharepoint 2007") and asp.net

KeywordAnalyzer
---
Enter Querystring: (C++ AND C#)  Searching for: +C++ +C#
Enter Querystring: ("moss 2007" or "sharepoint 2007") and "asp.net"
Searching for: (moss 2007 or sharepoint 2007) and asp.net

StopAnalyzer

Enter Querystring: (C\++ AND C\#)  Searching for: +c +c
Enter Querystring: ("MOSS 2007" or "SHAREPOINT 2007") and "ASP.NET"
Searching for: (moss sharepoint) "asp net"
 

 

-Original Message-
From: Donna L Gresh [mailto:[EMAIL PROTECTED] 
Sent: 04 March 2008 19:22
To: java-user@lucene.apache.org
Subject: C++ as token in StandardAnalyzer?

I saw some discussion in the archives some time ago about the fact that 
C++ is tokenized as C in the StandardAnalyzer; this seems to still be the
case. I was wondering if there is a simple way for me to get the
behavior I want for C++ (that it is tokenized as C++) in particular, and
perhaps for other more idiosyncratic terms I may have in my own
application-- Thanks Donna






Using a thesaurus/ontology

2008-03-05 Thread Borgman, Lennart
Is there any possibility to use a thesaurus or an ontology when 
indexing/searching with Lucene?




Re: Using a thesaurus/ontology

2008-03-05 Thread Mathieu Lecarme

Borgman, Lennart wrote:

Is there any possibility to use a thesaurus or an ontology when 
indexing/searching with Lucene?
  
Yes, the WordNet contrib does that. And with a token filter, it's easy to 
use your own.

What do you want to do?

M.





RE: Using a thesaurus/ontology

2008-03-05 Thread Duan, Nick
Nutch has a ontology plugin based on Jena.
http://wiki.apache.org/nutch/OntologyPlugin

I haven't used it.  Just by looking at the source code, it seems it is just
an OWL parser.  So apparently it only works with sources defined in OWL
format, not others such as RDF.  I think you need to extend the source
code to create your own Query processor to include related terms defined
in the ontology (with thesaurus being considered as a subset of
ontology).

If anyone has experience using the plug-in, I'd really appreciate any
suggestions and comments.

ND 
-Original Message-
From: Borgman, Lennart [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, March 05, 2008 9:09 AM
To: java-user@lucene.apache.org
Subject: Using a thesaurus/ontology

Is there any possibility to use a thesaurus or an ontology when
indexing/searching with Lucene?




applying patch in Eclipse to get SpanHighlighter functionality

2008-03-05 Thread Donna L Gresh
I have downloaded the Lucene (core, 2.3.1) code and created a project 
using Eclipse (pointing to src/java) to use it. That works fine, along 
with the contrib highlighter jar file from the standard distribution.

I have also successfully added an additional Eclipse project for the 
(standard) Highlighter, pointing to ..contrib/highlighter/src/java . I can 
use this project instead of the standard highlighter jar file and that 
works fine too.

I am having trouble figuring out how to apply the patch (assuming that's 
possible) to get the SpanHighlighter functionality. When I try to apply 
the patch using team->apply patch and pointing to the patch xml file, I am 
getting errors that seem to imply that the patch is looking for things 
under contrib/highlighter which it can't find. 

I admit to knowing very little about patches; I am just interested in 
getting the SpanHighlighter functionality. 

Can anyone point me in the right direction? Thank you.

Donna  Gresh


Re: applying patch in Eclipse to get SpanHighlighter functionality

2008-03-05 Thread Mark Miller
Are you using subclipse to apply the patch? It's not very good at it. I 
use TortoiseSVN for patching, as it's much smarter about these things. 
With TortoiseSVN, you just patch from the root dir and it knows you are 
referring to the contrib folder that's under the root directory (the 
directory you apply the patch to). I think subclipse also makes patches 
with absolute URLs, which is not good either, though you can still use 
them with the ignore stuff it has.


Donna L Gresh wrote:
I have downloaded the Lucene (core, 2.3.1) code and created a project 
using Eclipse (pointing to src/java) to use it. That works fine, along 
with the contrib highlighter jar file from the standard distribution.


I have also successfully added an additional Eclipse project for the 
(standard) Highlighter, pointing to ..contrib/highlighter/src/java . I can 
use this project instead of the standard highlighter jar file and that 
works fine too.


I am having trouble figuring out how to apply the patch (assuming that's 
possible) to get the SpanHighlighter functionality. When I try to apply 
the patch using team->apply patch and pointing to the patch xml file, I am 
getting errors that seem to imply that the patch is looking for things 
under contrib/highlighter which it can't find. 

I admit to knowing very little about patches; I am just interested in 
getting the SpanHighlighter functionality. 


Can anyone point me in the right direction? Thank you.

Donna  Gresh

  





Reusing same IndexSearcher

2008-03-05 Thread Mindaugas Žakšauskas
Hi,

Another newbie here...using Lucene 2.3.1 on Linux. Hopefully someone
could advise me on /subj/.

Both IndexSearcher Javadoc and Lucene FAQ says the IndexSearcher
should be reused as it's thread safe. That's OK.
Now if I have index changed, I need to reopen the IndexReader that is
associated with it. How do I do this as IndexSearcher has no setter
method for IndexReader?

Let's speak in Java. Say, we've got a static singleton accessor method:

...
private static Searcher instance;
...
public static Searcher getLuceneSearcher () {
   if( instance == null ) {
  IndexReader reader = IndexReader.open( "/tmp/index_folder" );
  instance = new IndexSearcher( reader );// simple yet boring
   } else {// here goes the fun part
  IndexReader r_old = instance.getIndexReader();
  IndexReader r_new = r_old.reopen();
  if( r_old != r_new ) {
 r_old.close();  // thanks for nice Javadoc, guys!
 // what to do now? there's no instance.setIndexReader( r_new )!
  }
   }
   return instance;
}

Of course, I could create a new IndexSearcher on the else branch and
return it. However, this approach resulted in the infamous "too many open
files" exception. Lifting the `ulimit -n` to hundreds of thousands of
files didn't really help as the same exception was still being thrown
(actual resource usage fluctuating around 2000 of open files). Then
from `lsof` output I noticed that the same segment file was being open
more than once, apparently from different instances of
IndexSearchers/IndexReaders and went the path shown above. Maybe I'm
just plain wrong.

Really appreciate your advice.

Regards,
Mindaugas




Re: Reusing same IndexSearcher

2008-03-05 Thread Michael McCandless


Actually you do need to make a new IndexSearcher every time you  
reopen a new IndexReader.


However, that should not lead to leaking file descriptors.  All open  
files are held by IndexReader (not IndexSearcher), so as long as you  
are properly closing your IndexReader's you shouldn't use up file  
descriptors.
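Ie, roughly (an untested sketch; "searcher" is whatever IndexSearcher
reference you hand out, and you'd publish the new one under your own
synchronization):

  IndexReader oldReader = searcher.getIndexReader();
  IndexReader newReader = oldReader.reopen();
  if (newReader != oldReader) {
    // assumes no other thread is still searching with the old searcher
    searcher = new IndexSearcher(newReader);
    oldReader.close();   // releases the old segments' file descriptors
  }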


Mike

Mindaugas Žakšauskas wrote:


Hi,

Another newbie here...using Lucene 2.3.1 on Linux. Hopefully anyone
could advice me on /subj/.

Both IndexSearcher Javadoc and Lucene FAQ says the IndexSearcher
should be reused as it's thread safe. That's OK.
Now if I have index changed, I need to reopen the IndexReader that is
associated with it. How do I do this as IndexSearcher has no setter
method for IndexReader?

Let's speak in Java. Say, we've got a static singletone accessor  
method:


...
private static Searcher instance;
...
public static Searcher getLuceneSearcher () {
   if( instance == null ) {
  IndexReader reader = IndexReader.open( "/tmp/index_folder" );
  instance = new IndexSearcher( reader );// simple yet boring
   } else {// here goes the fun  
part

  IndexReader r_old = instance.getIndexReader();
  IndexReader r_new = r_old.reopen();
  if( r_old != r_new ) {
 r_old.close();  // thanks for nice  
Javadoc, guys!
 // what to do now? there's no instance.setIndexReader 
( r_new )!

  }
   }
   return instance;
}

Of course, I could create a new IndexSearcher on the else branch and
return it. However, this approach resulted the infamous "too many open
files" exception. Lifting the `ulimit -n` to hundreds of thousands of
files didn't really help as the same exception was still being thrown
(actual resource usage fluctuating around 2000 of open files). Then
from `lsof` output I noticed that the same segment file was being open
more than once, apparently from different instances of
IndexSearchers/IndexReaders and went the path shown above. Maybe I'm
just plain wrong.

Really appreciate your advice.

Regards,
Mindaugas




Re: Reusing same IndexSearcher

2008-03-05 Thread Mindaugas Žakšauskas
Hi,

Thanks for your reply.

I can't think of any way to ensure fair file descriptor usage when
there are many active instances of IndexSearcher (all containing
IndexReader) running. Our project installations tend to run on heavily
loaded sites, where a lot of information is read and written at the
same time.

My original idea was not to operate at IndexReader level, but only
provide a single IndexSearcher (which contains IndexReader).
IndexSearcher being (thread) safe, none of the other code should be
aware of Lucene internals (think encapsulation).

Anyway, if what you're saying is correct, I think the Javadoc and FAQ
should be a little bit more specific about that. Also, I've looked at the
code of IndexSearcher and could not find a single reason why setting a
new IndexReader would hurt. But that's probably just me having a
hard day.

Can anyone comment if this approach is relevant and would help?
http://www.xman.org/jlinux/server.html

Regards,
Mindaugas

On Wed, Mar 5, 2008 at 5:41 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
>
>  Actually you do need to make a new IndexSearcher every time you
>  reopen a new IndexReader.
>
>  However, that should not lead to leaking file descriptors.  All open
>  files are held by IndexReader (not IndexSearcher), so as long as you
>  are properly closing your IndexReader's you shouldn't use up file
>  descriptors.
>
>  Mike




Re: applying patch in Eclipse to get SpanHighlighter functionality

2008-03-05 Thread Donna L Gresh
Thanks Mark-
I'm very much a newbie in all this patching stuff, but I don't think I'm 
using anything other than built-in Eclipse functionality using team->apply 
patch. And it's clearly not working well.

I took a look at TortoiseSVN but I think it's way overkill for me-- oh 
well; maybe I'll just wait until it makes it into a release :)

Donna Gresh



Mark Miller <[EMAIL PROTECTED]> wrote on 03/05/2008 12:07:21 PM:

> Are you using subclipse to apply the patch? Its not very good at it. I 
> use TortoiseSVN for patching, as its much smarter about these things. 
> With TortoiseSVN, you just patch from the root dir and it knows you are 
> referring to the contrib folder thats under the root directory (the 
> directory you apply the patch to). I think subclipse also makes patches 
> with absolute URLs, which is not good either, though you can still use 
> them with the ignore stuff it has.
> 
> Donna L Gresh wrote:
> > I have downloaded the Lucene (core, 2.3.1) code and created a project 
> > using Eclipse (pointing to src/java) to use it. That works fine, along 

> > with the contrib highlighter jar file from the standard distribution.
> >
> > I have also successfully added an additional Eclipse project for the 
> > (standard) Highlighter, pointing to ..contrib/highlighter/src/java . I 
can 
> > use this project instead of the standard highlighter jar file and that 

> > works fine too.
> >
> > I am having trouble figuring out how to apply the patch (assuming 
that's 
> > possible) to get the SpanHighlighter functionality. When I try to 
apply 
> > the patch using team->apply patch and pointing to the patch xml file, 
I am 
> > getting errors that seem to imply that the patch is looking for things 

> > under contrib/highlighter which it can't find. 
> >
> > I admit to knowing very little about patches; I am just interested in 
> > getting the SpanHighlighter functionality. 
> >
> > Can anyone point me in the right direction? Thank you.
> >
> > Donna  Gresh
> >


storing position - keyword

2008-03-05 Thread 1world1love

Greetings all. I am indexing a set of documents where I am extracting terms
and mapping them to a controlled vocabulary and then placing the matched
vocabulary in a keyword field. What I want to know is if there is a way to
store the original term location with the keyword field?

Example Text: "The quick brown fox jumped over the lazy dog" -->

Controlled Vocabulary Terms: "physical activity", "exercise", "sedentary
lifestyle", "canine"

I am storing these controlled terms in a keyword field so they are stored
and searchable exactly. 

What I would like to be able to do is to highlight the context of the
original term or phrase that is associated with a mapped term. So in the
example above, if the controlled term is "sedentary lifestyle", I would like
to highlight "lazy".

A couple of notes:

There can be multiple mapped terms for an original term or phrase.

The algorithm that handles the mapping provides the start and end position
of the original text

Thanks in advance for any ideas,

j
-- 
View this message in context: 
http://www.nabble.com/storing-position---keyword-tp15855654p15855654.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Boolean Query search performance

2008-03-05 Thread Beard, Brian

I'm using lucene 2.2.0.

I'm in the process of re-writing some queries to build BooleanQueries
instead of using query parser.

Bypassing query parser provides almost an order of magnitude improvement
for very large queries, but then the search performance takes 20-30%
longer. I'm adding boost values as a way to sort (long story, no
workaround I can see for now).

If I do a query.toString(), both queries give different results, which
is probably a clue (additional parens with the BooleanQuery)

Query.toString the old way using queryParser:
+(id:1^2.0 id:2 ... ) +type:CORE

Query.toString the new way using BooleanQuery:
+((id:1^2.0) (id:2) ... ) +type:CORE

Does anyone have any ideas as to why there is a performance difference in
searching? I would think I could get the search time to be the same,
since after the query is formed everything remains the same. The same
number of hits are going through the hitCollector that gets called after
this, so all variables seem constant. QueryParser must be doing some
optimization I'm not taking advantage of. Code snippet follows.

The previous way using queryParser was:

  String queryStr = "(id:1^2 OR id:2 ... id:n) AND type:CORE";
  Query query = parser.parse(queryStr);


The new way using BooleanQuery is...

  BooleanQuery totalBooleanQuery = new BooleanQuery();
  BooleanClause docTypeClause = 
new BooleanClause( new TermQuery( 
new Term("type", "CORE")), BooleanClause.Occur.MUST);

  BooleanQuery idBooleanQuery = new BooleanQuery();
  for (String uid : uniqueIdMap.keySet()) {
TermQuery termQuery = 
 new TermQuery(new Term("id", uid));
termQuery.setBoost(loopBoostValue);
idBooleanQuery.add(
 new BooleanClause(termQuery, BooleanClause.Occur.SHOULD));
  } 

  totalBooleanQuery.add(docTypeClause);
  totalBooleanQuery.add(idBooleanQuery, Occur.MUST); 







changing scoring formula

2008-03-05 Thread sumittyagi

Is there any way to change the score of the documents?
Actually I want to modify the scores of the documents dynamically: every time,
for a given query, the results will be sorted according to "lucene scoring
formula + an equation".
How can I do that? I saw the Lucene scoring page but I am not getting
exactly how to do that...
Please advise me.
-- 
View this message in context: 
http://www.nabble.com/changing-scoring-formula-tp15860538p15860538.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: More IndexDeletionPolicy questions

2008-03-05 Thread Tim Brennan
Ha, you know it never occurred to me that the driver might do this for 
me...I'll test it out.  Thanks,

--tim


- "Michael McCandless" <[EMAIL PROTECTED]> wrote:

> Oh then I don't think you need a custom deletion policy.
> 
> A single NFS client emulates delete-on-last-close semantics.  Ie, if 
> 
> the deletion of the file, and the held-open file handles, happen  
> through a single NFS client, then it emulates delete-on-last-close  
> semantics for you, by creating those .nfsXXX files on the server.
> 
> Test to be sure though!
> 
> Mike
> 
> Tim Brennan wrote:
> 
> > No, I have multiple readers in the same VM.  I track open readers  
> > within my VM and save those reader's commit points until the  
> > readers are gone.
> >
> > --tim
> >
> >
> >
> > - "Michael McCandless" <[EMAIL PROTECTED]> wrote:
> >
> >> OK got it.  Yes, sharing an index over NFS requires your own
> >> DeletionPolicy.
> >>
> >> So presumably you have readers on different machines accessing the
> >> index over NFS, and then one machine that does the writing to that
> >> index?  How so you plan to get which commit point each of these
> >> readers is currently using back to the single writer?
> >>
> >> Mike
> >>
> >> Tim Brennan wrote:
> >>
> >>> The bigger picture here is NFS-safety.  When I run a search, I
> hand
> >>
> >>> off the search results to another thread so that they can be
> >>> processed as necessary -- in particular so that they can be
> JOINed
> >>
> >>> with a SQL DB -- but I don't want to completely lock the index
> from
> >>
> >>> writes while doing a bunch of SQL calls.  Using the commit point
> >>> tracking I can make sure my appropriate snapshot stays around
> until
> >>
> >>> I'm completely done with it, even if I'm using an NFS mounted
> >>> filesystem that doesn't have delete-last semantics.
> >>>
> >>> --tim
> >>>
> >>>
> >>>
> >>>
> > It seems like it should be pretty simple -- keep a list of open
> > IndexReaders, track what Segment files they're pointing to, and
> >> in
> 
> > onCommit don't delete those segments.
> 
>  This implies you have multiple readers in a single JVM?  If so,
> >> you
>  should not need to make a custom deletion policy to handle this
> >> case
> 
>  -- the OS should be properly protecting open files from deletion.
>  
> >>
>  Can you  shed more light on the bigger picture here?
> 
> > Unfortunately it ends up being very difficult to directly
> >> determine
> > what Segment an IndexReader is pointing to.  Is there some
> > straightforward way that I'm missing -- all I've managed to do
> so
> > far is to remember the most recent one from onCommit/onInit and
> >> use
> 
> > that onethat works OK, but makes bootstrapping a pain if
> you
> > try to open a Reader before you've opened the writer once.
> >
> > Also, when I use IndexReader.reopen(), can I assume that the
> >> newly
> 
> > returned reader is pointing at the "most recent" segment?  I
> >> think
> 
> > so...
> 
>  Yes, except you have a synchronization challenge: if the writer
> is
> >> in
> 
>  the process of committing just as your reader opens you can't be
>  certain whether the reader got the new commit or the previous
> one.
>  If you have external synchronization to ensure reader only
> >> re-opens
>  after writer has fully committed then this isn't an issue.
> 
> >
> > Here's a sequence of steps that happen in my app:
> >
> > 0) open writer, onInit tells me that seg_1 is the most recent
>  segment
> >
> > 1) Open reader, assume it is pointing to seg_1 from (0)
> >
> > 2) New write commits into seg_2, don't delete seg_1 b/c of (1)
> >
> > 3) Call reader.reopen() on the reader from (1)new reader is
> > pointing to seg_2 now?
> >
> > 4) seg_1 stays around until the next time I open or commit a
> > writer, then it is removed.
> >
> >
> > Does that seem reasonable?
> >
> >
> > --tim

Re: changing scoring formula

2008-03-05 Thread Michael Stoppelman
Sumit,

The class you'll end up subclassing from would be:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html
or
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/DefaultSimilarity.html

On an IndexSearcher you can setSimilarity() to the class that you want to use
for scoring; see:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/IndexSearcher.html
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Searcher.html#setSimilarity(org.apache.lucene.search.Similarity)
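For example, a rough sketch (untested; which factor you override and the
index path are just illustrative assumptions, not the only option):

import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;

public class MySimilarity extends DefaultSimilarity {
    public float tf(float freq) {
        // e.g. dampen term frequency more aggressively than the default sqrt(freq)
        return (float) Math.log(1 + freq);
    }
}

// ...
IndexSearcher searcher = new IndexSearcher("/path/to/index"); // path assumed
searcher.setSimilarity(new MySimilarity());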

Does that help?

-M

On Wed, Mar 5, 2008 at 1:16 PM, sumittyagi <[EMAIL PROTECTED]> wrote:

>
> is there any way to change the score of the documents.
> Actually i want to modify the scores of the documents dynamically,
> everytime
> for a given query the results will be sorted according to "lucene scoring
> formula + an equation".
> how can i do that...i saw that lucene scoring page but i am not getting
> exactly how to do that...
> please advice me
> --
> View this message in context:
> http://www.nabble.com/changing-scoring-formula-tp15860538p15860538.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>


Re: storing position - keyword

2008-03-05 Thread Karl Wettin

1world1love wrote:

Greetings all. I am indexing a set of documents where I am extracting terms
and mapping them to a controlled vocabulary and then placing the matched
vocabulary in a keyword field.


One could also say you are classifying your data based on keywords in
the text?


What I want to know is if there is a way to store the original term location
with the keyword field?


You can always store values in a field, but the term and the stored
value are not coupled. Thus you would need to store the positions per
document in each field in a machine readable format that you then parse:

doc.addField("f", "keyword:12,32;54,32", Field.Store.YES, ..

But that is a way expensive solution.


Example Text: "The quick brown fox jumped over the lazy dog" -->

Controlled Vocabulary Terms: "physical activity", "exercise", "sedentary
lifestyle", "canine"

I am storing these controlled terms in a keyword field so they are stored
and searchable exactly. 


This is known as faceted classification.





What I would like to be able to do is to highlight the context of the
original term or phrase that is associated with a mapped term. So in the
example above, if the controlled term is "sedentary lifestyle", I would like
to highlight "lazy".

There can be multiple mapped terms for an original term or phrase.

The algorithm that handles the mapping provides the start and end
position of the original text



Are you aware of the highlighter contrib module?

http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/

The simplest solution is to add a new facet Term per classification in the
text, use the start and end positions within the text field, and have the
highlighter load the text and highlight this text field.

Matching a document with the same terms occurring multiple times will
cause a greater score than if it occurs only once. This is probably
problematic for you.

Instead you could add a single Term, ignore the built-in positions, and
store all the positions in the payload of that single Term.


for (String facet : facets) {
  doc.addField(
  "f", new SingleTokenTokenStream(
  facet, new Payload(offsets.toByteArray())
  )
  );
}

(This is dry coded, you will need to implement some of these things.)

You also need to modify the highlighter so it can read this data.



   karl





Re: Boolean Query search performance

2008-03-05 Thread Karl Wettin

Beard, Brian wrote:

I'm using lucene 2.2.0.

I'm in the process of re-writing some queries to build BooleanQueries
instead of using query parser.

Bypassing query parser provides almost an order of magnitude improvement
for very large queries, but then the search performance takes 20-30%
longer.


Have you benchmarked the time spent creating the query and the time 
spent searching using the query separately? QueryParser can be sort of 
expensive. It also seems to me that you are creating a very large query 
given you add one criterion per entity id.
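For example, something like this (a sketch only; "buildTotalBooleanQuery"
and the other names are just placeholders for your own snippet):

  long t0 = System.currentTimeMillis();
  Query query = buildTotalBooleanQuery(uniqueIdMap);   // your construction code
  long t1 = System.currentTimeMillis();
  searcher.search(query, hitCollector);
  long t2 = System.currentTimeMillis();
  System.out.println("build: " + (t1 - t0) + " ms, search: " + (t2 - t1) + " ms");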


karl




I'm adding boost values as a way to sort (long story, no
workaround I can see for now).

If I do a query.toString(), both queries give different results, which
is probably a clue (additional paren's with the BooleanQuery)

Query.toString the old way using queryParser:
+(id:1^2.0 id:2 ... ) +type:CORE

Query.toString the new way using BooleanQuery:
+((id:1^2.0) (id:2) ... ) +type:CORE

Does anyone have any ideas as to why the performance difference in
searching. I would think I could get the search time to be the same,
since after the query is formed everything remains the same. The same
number of hits are going through the hitCollector that gets called after
this, so all variables seem constant. QueryParser must be doing some
optimization I'm not taking advantage of. Code snippet follows.

The previous way using queryParser was:

  String queryStr = (id:1^2 OR id:2  id:n) AND type:CORE
  Query query = parser.parse(queryStr);


The new way using BooleanQuery is...

  BooleanQuery totalBooleanQuery = new BooleanQuery();
  BooleanClause docTypeClause = 
new BooleanClause( new TermQuery( 
new Term("type", "CORE")), BooleanClause.Occur.MUST);


  BooleanQuery idBooleanQuery = new BooleanQuery();
  for (String uid : uniqueIdMap.keySet()) {
	TermQuery termQuery = 
 new TermQuery(new Term("id", uid));

termQuery.setBoost(loopBoostValue);
idBooleanQuery.add(
 new BooleanClause(termQuery, BooleanClause.Occur.SHOULD));
  } 


  totalBooleanQuery.add(docTypeClause);
  totalBooleanQuery.add(idBooleanQuery, Occur.MUST); 








Re: storing position - keyword

2008-03-05 Thread 1world1love

First off Karl, thanks for your reply and your time.



karl wettin-3 wrote:
> 
> One could also say you are classifying your data based on keywords in
> the text?
> 

I probably didn't explain myself very well or more specifically provide a
good example. In my case, there really isn't any relationship between the
mapped terms per document. That is to say that an individual term or phrase
in the document is mapped to a concrete concept in a controlled vocabulary.
The concept doesn't represent a class of anything and no relationship exists
between the concepts. They would never be grouped by any means. It is more a
matter of replacing some arbitrary word or phrase with an adjudicated
version.

The example I gave did in fact use classifications for the terms, but that
is not exactly the point that I was trying to convey. I suppose a better
example would be where each term or phrase in the sentence mapped to an
equivalent in another language:

dog -> canis
dog -> cane
dog -> chien

So that if you searched for "canis", then any document with "dog" would be
returned (unless the context inferred that dog meant something else). By the
same token, if the text was "here we go" or "let's go", then it may map to
"vamos" or "vamonos".

To confuse matters more, it is not really a matter of synonyms, as the
original term is discarded from the index and there is only one mapped term
per original term or phrase and the algorithm determines the controlled
meaning from the context.


karl wettin-3 wrote:
> 
> 
> You can always store values in a field, but the term and the stored
> value is not coupled. Thus you would need to store the positions per
> document in each field in machine readable format you then parse:
> 
> doc.addField("f", "keyword:12,32;54,32", Field.Store.YES, ..
> 
> But that is a way expensive solution.
> 

Indeed, though doesn't an analyzed field have some other information attached
to it?

Forgive me if this is a naive question. I am fairly new to Lucene.


karl wettin-3 wrote:
>  
> 
> This is known as faceted classification.
> 
> 
> 
> 

Again, I am not overly familiar with these disciplines, but I always thought
of facets as an organizational strategy. As I said, my example betrayed me a
bit, as I am not that interested in organizing these documents, rather
providing a controlled vocabulary from which to search as opposed to any
random text.



karl wettin-3 wrote:
> 
> Are you aware of the hightlighter contrib module?
> 
> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/
> 
> The simplest solution is to a new facet Term per classification in text
> and use the text start and end positions of the text field, and have the
> hightligher to load the text and highlight this text field.
> 

This is actually not a web based application and the highlighting would
really only be used for analyzing performance of the mapping algorithms. The
main issue is that we do need to be able to provide the location of the
original term for each mapped keyword.



karl wettin-3 wrote:
> 
> Matching a document with the same terms occuring multiple times will
> cause a greater score than it only occuring once. This is probably
> problematic for you.
> 

It may not be that big of an issue.


karl wettin-3 wrote:
> 
> Instead you could add a single Term, ignore the built in positions and
> store them for all positions in the payload of that single Term.
> 
> 
> for (String facet : facets) {
>doc.addField(
>"f", new SingleTokenTokenStream(
>facet, new Payload(offsets.toByteArray())
>)
>);
> }
> 
> (This is dry coded, you will need to implement some of them things.)
> 
> You also need to modify the highligher so it can read this data.
> 
> 

Something like this seems like it might work well for my purposes. I will
look at this further.

Thanks again,

J






-- 
View this message in context: 
http://www.nabble.com/storing-position---keyword-tp15855654p15865186.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Boolean Query search performance

2008-03-05 Thread Chris Hostetter

: If I do a query.toString(), both queries give different results, which
: is probably a clue (additional paren's with the BooleanQuery)
: 
: Query.toString the old way using queryParser:
: +(id:1^2.0 id:2 ... ) +type:CORE
: 
: Query.toString the new way using BooleanQuery:
: +((id:1^2.0) (id:2) ... ) +type:CORE

i didn't look too closely at the pseudo code you posted, but the 
additional parens normally indicate that you are actually creating an 
extra layer of BooleanQueries (ie: a BooleanQuery with only one clause for 
each term) ... but the rewrite method should optimize those away (even 
back in lucene 2.2) ... if you look at query.rewrite(reader).toString() 
then the queries *really* should be the same; if they aren't, then that 
may be your culprit.
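ie, something like (a sketch; "parsedQuery" / "builtQuery" are just
placeholders for whatever you have):

  IndexReader reader = searcher.getIndexReader();
  System.out.println(parsedQuery.rewrite(reader).toString());
  System.out.println(builtQuery.rewrite(reader).toString());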



-Hoss

