Postal Code Radius Search

2007-08-29 Thread Mike
I've searched the mailing list archives, the web, read the FAQ, etc and I
don't see anything relevant so here it goes…

I'm trying to implement radius-based searching using zip/postal codes.
 (The user enters their zip code and I show nearby matches under x miles
away sorted by linear distance.)  I already have the data required to pull
this off (zip codes, long/lat coordinates, etc.)   Extreme accuracy is not a
requirement.  It just needs to be an approximation (plus or minus a few
miles.)

What I'm looking for is a little direction.  How have others implemented
this type of search?  What are the pros/cons of various methods?  I have a
few ideas but obviously none of them are very good or I guess I wouldn't be
here asking.  ;)

By the way, my index is updated about every 10 minutes and holds about
25,000 records.  However, this may increase in the next year or so to
hundreds of thousands.  So whatever I do needs to be fairly scalable.  The
items being searched as well as the people searching will be located all
over the world.   Some areas may be busier than others so there is an
opportunity for caching more common locales.

Thank you for your time.  I'd appreciate any suggestions that you can give.

- Mike
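
For reference, the usual approach here is a cheap bounding-box prefilter on the stored lat/long values followed by an exact great-circle (haversine) distance check and a sort by distance. Below is a minimal plain-Java sketch of that idea, not code from this thread; the ZipCode holder, the 69-miles-per-degree-of-latitude approximation, and the in-memory candidate list are illustrative assumptions (with hundreds of thousands of records you would push the bounding-box filter into a range query on indexed lat/long fields instead).

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class RadiusSearchSketch {

    static final double EARTH_RADIUS_MILES = 3958.8;

    static class ZipCode {
        String code;
        double lat;   // degrees
        double lon;   // degrees
        ZipCode(String code, double lat, double lon) {
            this.code = code; this.lat = lat; this.lon = lon;
        }
    }

    // Great-circle (haversine) distance in miles between two lat/lon points.
    static double distanceMiles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
    }

    // Cheap rectangular prefilter (one degree of latitude is roughly 69 miles),
    // then the exact distance check, then a sort by distance from the center.
    static List<ZipCode> withinRadius(final ZipCode center, List<ZipCode> candidates,
                                      final double radiusMiles) {
        double latDelta = radiusMiles / 69.0;
        double lonDelta = radiusMiles / (69.0 * Math.cos(Math.toRadians(center.lat)));
        List<ZipCode> hits = new ArrayList<ZipCode>();
        for (ZipCode z : candidates) {
            if (Math.abs(z.lat - center.lat) > latDelta
                    || Math.abs(z.lon - center.lon) > lonDelta) {
                continue; // outside the bounding box, skip the trig
            }
            if (distanceMiles(center.lat, center.lon, z.lat, z.lon) <= radiusMiles) {
                hits.add(z);
            }
        }
        Collections.sort(hits, new Comparator<ZipCode>() {
            public int compare(ZipCode a, ZipCode b) {
                return Double.compare(distanceMiles(center.lat, center.lon, a.lat, a.lon),
                                      distanceMiles(center.lat, center.lon, b.lat, b.lon));
            }
        });
        return hits;
    }
}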


Re: Lucene 4.0 Index Format Finalization Timetable

2011-12-07 Thread Mike Sokolov
My personal view, as a bystander with no more information than you, is 
that one has to assume there will be further index format changes before 
a 4.0 release.  This is based on the number of changes in the last 9 
months, and the amount of activity on the dev list.


For us the implication is we need to stick w/3.x for now.  You might be 
in a different situation if you really need the 4.0 changes.  Maybe you 
can just stick w/the current trunk and take responsibility for patching 
critical bugfixes, hoping you won't have to recreate your index too many 
times...


-Mike

On 12/06/2011 09:48 PM, Jamie Johnson wrote:

I suppose that's fair enough.  Some quick googling suggests that this has
been asked many times with pretty much the same response.  Sorry to
add to the noise.

On Tue, Dec 6, 2011 at 9:34 PM, Darren Govoni  wrote:
   

I asked here[1] and it said "Ask again later."

[1] http://8ball.tridelphia.net/


On 12/06/2011 08:46 PM, Jamie Johnson wrote:
 

Thanks Robert.  Is there a timetable for that?  I'm trying to gauge
whether it is appropriate to push for my organization to move to the
current lucene 4.0 implementation (we're using solr cloud which is
built against trunk) or if it's expected there will be changes to what
is currently on trunk.  I'm not looking for anything hard, just trying
to plan as much as possible understanding that this is one of the
implications of using trunk.

On Tue, Dec 6, 2011 at 6:48 PM, Robert Muir wrote:
   

On Tue, Dec 6, 2011 at 6:41 PM, Jamie Johnson wrote:
 

Is there a timetable for when it is expected to be finalized?
   

it will be finalized when Lucene 4.0 is released.

--
lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Obtaining IDF values for the terms in a document set

2011-12-15 Thread Mike O'Leary
We have a large set of documents that we would like to index with a customized 
stopword list. We have run tests by indexing a random set of about 10% of the 
documents, and we'd like to generate a list of the terms in that smaller set 
and their IDF values as a way to create a starter set of stopwords for the 
larger document set by selecting the terms that have the lowest IDF values. 
First of all, is this the best way to create a stopword list? Second, is there 
a straightforward way to generate a list of terms and their IDF values from a 
Lucene index?
Thanks,
Mike


RE: Obtaining IDF values for the terms in a document set

2011-12-15 Thread Mike O'Leary
Hi Simon,
I guess in a sense we are interested in obtaining a list of the top N terms, 
but they would be the top terms in the sense that they have the lowest IDF 
values. These would be the terms that appear in all or almost all documents in 
the document set. This is not a count of the number of term occurrences in 
documents, it is a count of documents that contain at least one occurrence of a 
given term. Lucene must be storing IDF values for the terms of a document set 
somewhere in order to compute TF/IDF values when searching. I am wondering if 
there is an easy way to iterate through all of the terms that occur in the 
document set and obtain their IDF values.
Thanks,
Mike

-Original Message-
From: Simon Willnauer [mailto:simon.willna...@googlemail.com] 
Sent: Thursday, December 15, 2011 11:44 AM
To: java-user@lucene.apache.org
Subject: Re: Obtaining IDF values for the terms in a document set

On Thu, Dec 15, 2011 at 6:33 PM, Mike O'Leary  wrote:
> We have a large set of documents that we would like to index with a 
> customized stopword list. We have run tests by indexing a random set of about 
> 10% of the documents, and we'd like to generate a list of the terms in that 
> smaller set and their IDF values as a way to create a starter set of 
> stopwords for the larger document set by selecting the terms that have the 
> lowest IDF values. First of all, is this the best way to create a stopword 
> list? Second, is there a straightforward way to generate a list of terms and 
> their IDF values from a Lucene index?
> Thanks,
> Mike

hey mike,

I can certainly help you with generating the list of your top N terms; whether that 
is the best or right way to generate the stopword list I am not sure, but maybe 
somebody else will step up.

to get the top N terms out of your index you can simply iterate the terms in a 
field and put the top N terms based on the docFreq() on a heap. something like 
this:

 static class TermAndDF {
   String term;
   int df;
   TermAndDF(String term, int df) {
     this.term = term;
     this.df = df;
   }
 }

 int queueSize = N;
 // e.g. an org.apache.lucene.util.PriorityQueue ordered by df (smallest kept df on top)
 PriorityQueue queue = ...

 final TermEnum termEnum = reader.terms(new Term(field));
 try {
   do {
     final Term term = termEnum.term();
     if (term == null || !term.field().equals(field)) break;
     int docFreq = termEnum.docFreq();
     if (queue.size() < queueSize) {
       queue.add(new TermAndDF(term.text(), docFreq));
     } else if (queue.top().df < docFreq) {
       // reuse the smallest entry and re-sort the heap
       TermAndDF tnFrq = queue.top();
       tnFrq.term = term.text();
       tnFrq.df = docFreq;
       queue.updateTop();
     }
   } while (termEnum.next());
 } finally {
   termEnum.close();
 }

another way of doing it is to use index pruning and drop terms with docFreq 
above a threshold after you have indexed your doc set.
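
If you want the actual IDF number rather than the raw docFreq, the default
similarity just derives it from that same document frequency at query time, so a
one-line sketch on top of the loop above (3.x API) would be:

  Similarity sim = new DefaultSimilarity();
  float idf = sim.idf(docFreq, reader.numDocs()); // ln(numDocs / (docFreq + 1)) + 1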

simon

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Tamper resistant index

2012-01-09 Thread Mike C
Hi,

I'm investigating storing syslog data using Lucene (via Solr or
Elasticsearch, undecided at present). The syslogs belong to systems
under the scope of the PCI DSS (Data Security Standard), and one of
the requirements is to ensure logs aren't tampered with. I'm looking
for advice on how to accomplish this.

Looking through the Lucene documentation, I believe there doesn't
exist any built-in functionality to secure index data through digital
signatures or HMACs. Is this the case, or have I overlooked something?
I see there is a lucenetransform project
(http://code.google.com/p/lucenetransform/) that offers encryption,
but not digital signatures. I'm not concerned about hiding the
contents of the data, just need to ensure it hasn't been tampered
with. At present I use Splunk, which signs and verifies blocks of
indexed data. Unfortunately its pricing model doesn't scale well,
hence looking for a lucene-based solution.

I suppose I could add a digital signature programmatically to each
lucene Document/Syslog, though it seems like a lot of overhead.
Lucenetransform's approach does seem to suggest that I could provide a
digital-signature version of Directory (and IndexInput/IndexOutput);
however, before I go down that rabbit hole, I decided to check in here.
Any advice or suggestions appreciated.
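
For what it's worth, a minimal sketch of the per-document idea, assuming a shared
HMAC key; the field names, the HmacSHA256 choice, and the key handling are
illustrative only, and this protects just the stored field values (an auditor would
recompute the HMAC from the stored fields and compare):

import java.nio.charset.Charset;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class SignedSyslogDocFactory {

    private final Mac mac;

    public SignedSyslogDocFactory(byte[] key) throws Exception {
        mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
    }

    // Builds one Lucene document per syslog line and adds an "hmac" field computed
    // over the stored values, so tampering with the stored content is detectable later.
    public Document build(String timestamp, String message) {
        Document doc = new Document();
        doc.add(new Field("timestamp", timestamp, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("message", message, Field.Store.YES, Field.Index.ANALYZED));
        byte[] sig = mac.doFinal((timestamp + "\n" + message).getBytes(Charset.forName("UTF-8")));
        doc.add(new Field("hmac", toHex(sig), Field.Store.YES, Field.Index.NO));
        return doc;
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}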

Kind Regards,

Mike C.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Retrieving offsets

2012-01-19 Thread Mike Sokolov

I think you have hit on all the best solutions.

The Jira issues you mentioned do indeed hold out some promising 
solutions here, but they are a ways away, requiring some significant 
re-plumbing and I'm not sure there is a lot of attention being paid to 
that at the moment.  You should vote for those issues, I think.  But in 
the meantime, I think your payload solution is probably the best in 
terms of efficiency; you can find code that does that kind of thing in 
LUCENE-3318 if you poke around a bit.  However it might be simplest to 
just use the existing highlighters to do this sort of thing, and not 
worry about spans?


-Mike
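
A minimal sketch of the payload idea (3.x analysis API); the filter name and the
4-bytes-per-int encoding are just illustrative, and LUCENE-3318 has more complete code:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

// Copies each token's start/end character offsets into its payload, so the
// offsets can be read back at search time from the matching positions.
public final class OffsetPayloadFilter extends TokenFilter {

    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    public OffsetPayloadFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        byte[] data = new byte[8];
        encodeInt(offsetAtt.startOffset(), data, 0);
        encodeInt(offsetAtt.endOffset(), data, 4);
        payloadAtt.setPayload(new Payload(data));
        return true;
    }

    private static void encodeInt(int value, byte[] buf, int pos) {
        buf[pos]     = (byte) (value >>> 24);
        buf[pos + 1] = (byte) (value >>> 16);
        buf[pos + 2] = (byte) (value >>> 8);
        buf[pos + 3] = (byte) value;
    }
}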

On 1/19/2012 9:46 PM, Nishad Prakash wrote:


I'm going to cry.  There is no way to retrieve offsets for position, 
rather than for term?



On 1/13/2012 6:33 PM, Nishad Prakash wrote:

I'm having a set of issues in trying to use Lucene that are all
connected to the difficulty of retrieving offsets.  I need some advice
on how best to proceed, or a pointer if this has been answered
somewhere.

My app requires that I display all portions of the documents where the
search term or terms are found.  Because of this, I always use
IndexReader.getSpans(), since knowing only which documents matched
isn't enough.  However, this still leaves me with a lot of unresolved
problems.

- I cannot find any standard way to map the returned span positions to
offsets.  For single term queries, I can get at offsets by writing a
custom TermVectorMapper.  For more complex queries, I have to (I
think) use rewrite(), extract the target terms, then load their term
vectors and go through them to find the positions that match what's in
the span, and pull up the corresponding offsets.  This
is...surprising.  We took considerable pains during indexing to
maintain the offset information through several layers of analysis
filters, but now we can't get to it while searching without
considerably more pain.  Am I missing something obvious?

- More generally, I would like to be able to iterate over positions in
a document, collecting offset information for those positions as I go.
Is there any way to do this?  I didn't find such an iterator, but I
may not know where to look.  Everything I did find was tied to
iterating over positions for specific terms, which is not relevant
here.

Right now, I can think of these options:
1) get at offsets via term vectors; try to make that as fast as
possible by "short-circuiting" how much of the term vector we load.
2) Maintain external per-document position->offset maps outside
Lucene.
3) Maybe store offsets as payload?

But is there already a (non-term-vector based) way of getting at
offsets that I don't know about?  My ideal solution would be an
iterable position->offset map for each document; failing that, an
enhancement to getSpans() that returns offset information along with
position.

It seems like LUCENE-2878 and LUCENE-3318 are concerned with at least
some of these issues, but the comments are a bit inside-baseball for
me at this stage.  So I would greatly appreciate any advice on this
issue.

nishad

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Searching by similarity using term vectors

2012-02-14 Thread Mike O'Leary
If I have indexed a set of documents using term vectors, is there support in 
Lucene to treat a list of query terms as a small document, create a term vector 
for it, and find documents by computing similarity between the query's term 
vector and the term vectors in the index? If so, what API functions are 
provided to do this kind of search? It looks like the standard method of search 
treats a list of query terms as a Boolean query. Is there an alternative search 
function that doesn't do this?
Thanks,
Mike
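
One piece worth looking at is the contrib MoreLikeThis class, which does roughly
this: it treats a chunk of text as a small document, selects its interesting terms,
and builds a query from them; under the hood that query is still a boosted
BooleanQuery of the selected terms, scored by the usual vector-space scoring. A
hedged sketch against the 3.x contrib API follows; the "contents" field name is an
assumption, and the like(...) signature varies a bit between versions:

import java.io.StringReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similar.MoreLikeThis;

public class SimilarityQuerySketch {

    // Treats the query terms as a tiny pseudo-document and searches for similar docs.
    public static TopDocs findSimilar(IndexReader reader, String queryTerms) throws Exception {
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] { "contents" }); // assumed field name
        mlt.setMinTermFreq(1);  // keep every term of the short pseudo-document
        mlt.setMinDocFreq(1);
        Query query = mlt.like(new StringReader(queryTerms));
        IndexSearcher searcher = new IndexSearcher(reader);
        try {
            return searcher.search(query, 10);
        } finally {
            searcher.close();
        }
    }
}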


Re: Concurrency and multiple merge threads

2012-02-19 Thread Mike McCandless
Sounds like a nice machine!

It's frustrating that RAMFile even has any sync'd methods... Lucene is write 
once, so once a RAMFile is written we don't need any sync to read it.  Maybe on 
creating a RAMInputStream we could make a new ReadOnlyRAMFile, holding the same 
buffers without sync.

That said the ops inside the sync are tiny so it's strange if this really is 
the cause of the contention... It could just be a profiling ghost and something 
else is the real bottleneck...

Mike

On Feb 18, 2012, at 9:21 PM, Benson Margulies  wrote:

> Using Lucene 3.5.0, on a 32-core machine, I have coded something shaped like:
> 
> make a writer on a RAMDirectory.
> 
> start:
> 
>  Create a near-real-time searcher from it.
> 
>  farm work out to multiple threads, each of which performs a search
> and retrieves some docs.
> 
>  When all are done, write some new docs.
> 
> back to start.
> 
> The returns of adding threads diminish faster than I would like.
> According to YourKit, a major contribution when I try 16 is conflict
> on the RAMFile monitor.
> 
> The conflict shows five Lucene Merge Threads holding the monitor, plus
> my own threads. I'm not sure that I'm interpreting this correctly;
> perhaps there were five different occasions when a merge thread
> blocked my threads.
> 
> In any case, I'm fairly stumped as to how my threads manage to
> materially block each other, since the synchronized methods used on
> the search side in RAMFile are pretty tiny.
> 
> YourKit claims that the problem is in RAMFile.numBuffers, but I have
> not been able to catch this being called in a search.
> 
> I did spot the following backtrace.
> 
> In any case, I'd be grateful if anyone could tell me if this is a
> familiar story or one for which there's a solution.
> 
> 
>RAMFile.getBuffer(int) line: 75
>RAMInputStream.switchCurrentBuffer(boolean) line: 107
>RAMInputStream.seek(long) line: 144
>SegmentNorms.bytes() line: 163
>SegmentNorms.bytes() line: 143
>ReadOnlySegmentReader(SegmentReader).norms(String) line: 599
>TermQuery$TermWeight.scorer(IndexReader, boolean, boolean) line: 107
>BooleanQuery$BooleanWeight.scorer(IndexReader, boolean, boolean) line: 298 
>
>IndexSearcher.search(Weight, Filter, Collector) line: 577
>IndexSearcher.search(Weight, Filter, int, Sort, boolean) line: 517
>IndexSearcher.search(Weight, Filter, int, Sort) line: 487
>IndexSearcher.search(Query, Filter, int, Sort) line: 400
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Lucene's use of vectors

2012-03-01 Thread Mike O'Leary
In the Javadoc page for the Similarity class, it says,

"Lucene combines Boolean model (BM) of Information Retrieval with Vector Space 
Model (VSM) of Information Retrieval - documents "approved" by BM are scored by 
VSM."

Is the Vector Space Model that is referred to here different than the term 
vectors that can optionally be stored in index fields? It sounds like the 
vector space model is used by Lucene in all cases in order to determine ranking 
of returned results, not only when indexing with term vectors is enabled. If 
you have indexed without term vectors, what does Lucene use to score "approved" 
documents? And if you have indexed with term vectors, what does that enable you 
to do that you couldn't do with an index without term vectors?

Is there a kind of search in Lucene in which documents are "approved" by VSM as 
well as scored by it, or does that even make sense? I understand how 
similarity works when comparing two documents, but I can't imagine that it 
would work to search by comparing a term vector from a set of search terms 
against each of the term vectors in an index one at a time. Is there a more 
efficient way of searching using a term vector of search terms - other than 
using its terms in a Boolean search that is?

I am asking because my boss asked me about all of the ways that Lucene uses 
vectors in indexing and search, and my answer revealed a lot of gaps in my 
understanding of it.
Thanks,
Mike


Highlighting in Luke?

2012-03-13 Thread Mike O'Leary
I sent this message to the Luke discussion forum, but there isn't a lot of 
activity there these days, so I thought I would ask my question here too.

I was asked if Luke supports highlighting of matched terms in its search 
results display. I looked through the code, and it doesn't look to me like 
there is a way to change strings that are displayed as search results so that 
some words are displayed with bold, italic or some other highlighting feature 
and others are not. Is this true, or did I overlook something?
Thanks,
Mike


surround parser match-all query

2012-05-06 Thread Mike Sokolov
does anybody know how to express a MatchAllDocsQuery in surround query 
parser language?  I've tried


*

and()

but those don't parse.  I looked at the grammar and I don't think there 
is a way.  Please let us all know if you know otherwise!


Thanks

-Mike Sokolov

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: surround parser match-all query

2012-05-06 Thread Mike Sokolov
No, that doesn't work either - it works for the lucene query parser, but 
not for the *surround* query parser, which I'm using because it has a 
syntax for span queries.


On 5/6/2012 6:10 PM, Vladimir Gubarkov wrote:

Do you mean

*:*

?

On Mon, May 7, 2012 at 1:26 AM, Mike Sokolov  wrote:

does anybody know how to express a MatchAllDocsQuery in surround query
parser language?  I've tried

*

and()

but those don't parse.  I looked at the grammar and I don't think there is a
way.  Please let us all know if you know otherwise!

Thanks

-Mike Sokolov

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: surround parser match-all query

2012-05-06 Thread Mike Sokolov
I think what I have in mind would be purely an artifact of the parser; a 
term that would always be optimized away, either vanishing or gobbling 
up the whole query.  So if you had  n(A,*), you would just get "A".  If 
you had and(A, not(*)) (is that the surround syntax for not?) you would 
get nothing, if you had * you would get all the documents.  Maybe this 
could be done without having to actually generate a query internally, 
but could happen during parsing.  It's kind of a weird case, but I am 
trying to translate from one query language to another, and it would be 
convenient to have this as an option.


-Mike

On 5/6/2012 7:28 PM, Robert Muir wrote:

Hi Mike: whereas for the normal queryparser this Query doesn't consult
the positions file and is trivial, how would such a query be
implemented for the surround parser? As a single span that matches all
positions for the whole document? Maybe it could be a "fake span" for
each document of 0 ... Integer.MAX_VALUE?

I think it would be nice to have as long as its not going to be too
inefficient...

On Sun, May 6, 2012 at 5:26 PM, Mike Sokolov  wrote:

does anybody know how to express a MatchAllDocsQuery in surround query
parser language?  I've tried

*

and()

but those don't parse.  I looked at the grammar and I don't think there is a
way.  Please let us all know if you know otherwise!

Thanks

-Mike Sokolov

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org







-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: surround parser match-all query

2012-05-06 Thread Mike Sokolov
Hmm - I looked at Spans more carefully, and it appears as if your idea 
about a "fake" Query (some kind of SpanAllQuery would be called for) 
would work well, and would probably be much simpler to implement.  It 
wouldn't preclude the kind of optimization I was talking about either, 
but I don't know if it would be worth the trouble.


It turns out in my very specific case I have a term that appears in 
every document in a particular field, so I am just using a search for 
that at the moment.


-Mike

On 5/6/2012 8:04 PM, Mike Sokolov wrote:
I think what I have in mind would be purely an artifact of the parser; 
a term that would always be optimized away, either vanishing or 
gobbling up the whole query.  So if you had  n(A,*), you would just 
get "A".  If you had and(A, not(*)) (is that the surround syntax for 
not?) you would get nothing, if you had * you would get all the 
documents.  Maybe this could be done without having to actually 
generate a query internally, but could happen during parsing.  It's 
kind of a weird case, but I am trying to translate from one query 
language to another, and it would be convenient to have this as an 
option.


-Mike

On 5/6/2012 7:28 PM, Robert Muir wrote:

Hi Mike: whereas for the normal queryparser this Query doesn't consult
the positions file and is trivial, how would such a query be
implemented for the surround parser? As a single span that matches all
positions for the whole document? Maybe it could be a "fake span" for
each document of 0 ... Integer.MAX_VALUE?

I think it would be nice to have as long as its not going to be too
inefficient...

On Sun, May 6, 2012 at 5:26 PM, Mike Sokolov  
wrote:

does anybody know how to express a MatchAllDocsQuery in surround query
parser language?  I've tried

*

and()

but those don't parse.  I looked at the grammar and I don't think 
there is a

way.  Please let us all know if you know otherwise!

Thanks

-Mike Sokolov

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org







-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to extract highest TF-IDF terms from Lucene index?

2012-05-09 Thread Mike McCandless
There is a tool named HighFreqTerms, in contrib/misc, that does this...

Mike

Sent from my iPad

On May 9, 2012, at 4:18 PM, Michael Berkovsky  
wrote:

> Hi,
> 
> Assuming that there is a large lucene collection, and I want to extract top
> N terms with highest TF/IDF scores from some field.
> The collection does not have term vectors stored. Does Lucene have some
> utility to do this?
> 
> Thanks!
> Michael

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Approches/semantics for arbitrarily combining boolean and proximity search operators?

2012-05-16 Thread Mike Sokolov
It sounds to me as if there could be a market for a new kind of query that 
would implement:


A w/5 (B and C)

in the way that people understand it to mean - the same A near both B 
and C, not just any A.


Maybe it's too hard to implement using rewrites into existing SpanQueries?

In terms of the PositionIterator work  - instead of A being within 5 in a 
"minimum" distance sense, what we want is that its "maximum" distance to 
all the terms in the other query (B and C) is 5.  I'm not sure if any 
query in that branch covers this case though, either, but if I recall, 
there was a way to implement extensions to it that were fairly natural.


-Mike
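
For concreteness, the overmatching rewrite Trejkaz describes below might look
roughly like this with the 3.x span API; the field and term names are placeholders
and this is not the code from either system:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Rewrites  A w/5 (B and C)  as  (A near B) AND (A near C): every document the
// user expects is matched, but a document where different occurrences of A are
// near B and near C also matches (the overmatch mentioned below).
public class ProximityRewriteSketch {

    public static Query rewrite(String field, String a, String b, String c, int slop) {
        SpanQuery spanA = new SpanTermQuery(new Term(field, a));
        SpanQuery aNearB = new SpanNearQuery(
                new SpanQuery[] { spanA, new SpanTermQuery(new Term(field, b)) }, slop, false);
        SpanQuery aNearC = new SpanNearQuery(
                new SpanQuery[] { spanA, new SpanTermQuery(new Term(field, c)) }, slop, false);
        BooleanQuery combined = new BooleanQuery();
        combined.add(aNearB, BooleanClause.Occur.MUST);
        combined.add(aNearC, BooleanClause.Occur.MUST);
        return combined;
    }
}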

On 5/16/2012 7:15 PM, Trejkaz wrote:

On Thu, May 17, 2012 at 7:11 AM, Chris Harris  wrote:

but also crazier ones, perhaps like

agreement w/5 (medical and companion)
(dog or dragon) w/5 (cat and cow)
(daisy and (dog or dragon)) w/25 (cat not cow)

[skip]

Everything in your post matches our experience. We ended up writing
something which transforms the query as well but had to give up on
certain crazy things people tried, such as this form:

(A and B) w/5 (C and D)

For this one:

   A w/5 (B and C)

We found the user expected the same A to be within 5 terms of both a B
and a C, and rewrote it to match that but also match more than they
asked for. So far, there have been no complaints about the overmatches
(it's documented.)

There is probably an extremely accurate way to rewrite it, but it
couldn't be figured out at the time. Maybe start with spans for A and
then remove spans not-near a B and spans not-near a C, which would
leave you with only spans near an A. The problem is that if you expand
the query to something like this, it gets quite a bit more complex, so
a user query which is already complex could turn into a really hard to
understand mess...

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



filter by term frequency

2012-06-16 Thread Mike Sokolov
I imagine this is a question that comes up from time to time, but I 
haven't been able to find a definitive answer anywhere, so...


I'm wondering whether there is some type of Lucene query that filters by 
term frequency.   For example, suppose I want to find all documents that 
have exactly 2 occurrences of some word.  I know that the frequency is 
stored and used in scoring , but I don't think it is exposed in a simple 
way at the query level.  It looks to me as if CustomScoreQuery might be 
a convenient way to monkey with scores?  But it doesn't seem to use that 
for filtering, just sorting.  Perhaps a Collector could then impose a 
score threshold later? Any suggestions here?


-Mike
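
A minimal sketch of one low-level way to get the exact-count check, walking the
postings directly with the 3.x API; it handles a single term only and doesn't
compose with other query clauses, which is really the open question above:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class ExactTermFreqSketch {

    // Returns the ids of documents in which the term occurs exactly wantedFreq times.
    public static List<Integer> docsWithFreq(IndexReader reader, String field,
                                             String text, int wantedFreq) throws IOException {
        List<Integer> docs = new ArrayList<Integer>();
        TermDocs td = reader.termDocs(new Term(field, text));
        try {
            while (td.next()) {
                if (td.freq() == wantedFreq) {
                    docs.add(td.doc());
                }
            }
        } finally {
            td.close();
        }
        return docs;
    }
}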

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: filter by term frequency

2012-06-17 Thread Mike Sokolov

Thanks, Jack!

On 6/16/2012 5:26 PM, Jack Krupansky wrote:
If you were a *Solr* user, I could say "try the 'termfreq' function 
query":


   termfreq(field,term) returns the number of times the term appears 
in the field for that document.

   Example Syntax: termfreq(text,'memory')

See:
http://wiki.apache.org/solr/FunctionQuery#tf

Lucene does have "FunctionQuery", "ValueSource", and 
"TermFreqValueSource".


See:
http://lucene.apache.org/solr/api/org/apache/solr/search/function/FunctionQuery.html 



-- Jack Krupansky

-Original Message- From: Mike Sokolov
Sent: Saturday, June 16, 2012 2:33 PM
To: java-user@lucene.apache.org
Subject: filter by term frequency

I imagine this is a question that comes up from time to time, but I
haven't been able to find a definitive answer anywhere, so...

I'm wondering whether there is some type of Lucene query that filters by
term frequency.   For example, suppose I want to find all documents that
have exactly 2 occurrences of some word.  I know that the frequency is
stored and used in scoring, but I don't think it is exposed in a simple
way at the query level.  It looks to me as if CustomScoreQuery might be
a convenient way to monkey with scores?  But it doesn't seem to use that
for filtering, just sorting.  Perhaps a Collector could then impose a
score threshold later? Any suggestions here?

-Mike

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Fast way to get the start of document

2012-06-23 Thread Mike Sokolov
I got the sense from Paul's post that he wanted a solution that didn't 
require changing his index, although I'm not sure there is one.  Paul if 
you're willing to re-index, you could also store the length of the text 
as a numeric field, retrieve that and use it to drive the decision about 
whether to highlight.


-Mike Sokolov
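
A minimal sketch of the length-field idea (Lucene 3.x API); the field names and the
cutoff are placeholders, and at search time the small length field can be loaded
through a FieldSelector (e.g. MapFieldSelector) so the huge stored text isn't read
just to make the decision:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;

public class LengthGateSketch {

    private static final int TOO_HUGE_CHARS = 2000000; // whatever cutoff fits

    // Index time: store the text plus its length as a cheap numeric field.
    public static Document buildDoc(String text) {
        Document doc = new Document();
        doc.add(new Field("text", text, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new NumericField("textLength", Field.Store.YES, false).setIntValue(text.length()));
        return doc;
    }

    // Search time: consult the length field before deciding to run the highlighter.
    public static boolean shouldHighlight(Document storedDoc) {
        String len = storedDoc.get("textLength");
        return len != null && Integer.parseInt(len) <= TOO_HUGE_CHARS;
    }
}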

On 6/23/2012 6:17 PM, Jack Krupansky wrote:
Simply have two fields, "full_body" and "limited_body". The former 
would index but not store the full document text from Tika (the 
"content" metadata.) The latter would store but not necessarily index 
the first 10K or so characters of the full text. Do searches on the 
full body field and highlighting on the limited body field.


-- Jack Krupansky

-Original Message- From: Paul Hill
Sent: Friday, June 22, 2012 2:23 PM
To: java-user@lucene.apache.org
Subject: Fast way to get the start of document

Our Hit highlighting (Using the older Highlighter) is wired with a 
"too huge" limit, so we could skip the multi-million character files, 
not just for highlighter.setMaxDocCharsToAnalyze, but if a document is 
really above the too huge limit, we don't
even try, and just produce a fragment from the front of the document.  
This results in almost reasonable response times, even for result 
sets of crazy huge documents (or ones with just 1 huge doc). I think 
this is all pretty normal.  Tell me if I'm wrong.


Given the above, while timing what was going on, I realized that I was 
reading in the entire body of the text in the skip highlighting case 
just to grab the 1st 100 or so characters.

I was doing

String text = fieldable.stringValue(); // Oh my!

Is there a way to _not_ read the whole multi-million characters in and 
only _start_ reading the contents of a large field?  See code below 
which got me no better results.

Some details

1.  Using Lucene 3.4

2.  Storing the (Tika) parse text of documents

a.  These are human produced documents; PDF, word etc. often 10K 
of characters, sometimes 100Ks, but very occasionally a few million)


3.  At this time, we store positions, but not offsets.

4.  We are using the old Highlighter, not the 
FastVectorHighlighter (because of #3 above).


5.  A basic search result is a page of 10 documents with short 
"blurb" (one fragment that shows a good hit).


I would be willing to live with a token stream to gen the intro blurb, 
but using the following code when under the too large code path 
(forget the highlighting) can add .5 seconds (compared to not reading 
anything which is not a solution just a comparison).

So here is my code.
    Fieldable textFld = doc.getFieldable(TEXT);
    if ( fullTextLength <= EXTRA_LARGE_DOC_HIGHLIGHT_LIMIT ) {
        blurb = highlightBlurb(scoreDoc, document, textFld, workingBlurbLen);
    } else {
        logger.debug("--- didn't call highlighter textLength = " + fullTextLength);
        TokenStream tokenStream = TokenSources.getAnyTokenStream(indexReader, scoreDoc.doc, TEXT, document, analyzer);
        OffsetAttribute offset = tokenStream.addAttribute(OffsetAttribute.class);
        CharTermAttribute charTerm = tokenStream.addAttribute(CharTermAttribute.class);
        StringBuilder blurbB = new StringBuilder("");
        while (tokenStream.incrementToken() && blurbB.length() < workingBlurbLen) {
            blurbB.append(charTerm.toString());
            blurbB.append(" ");
        }
        blurb = blurbB.toString();
    }
What could I do in the else that is faster?  Is not having offsets 
affecting this code path?
While you're answering the above, I will be running some stats to 
suggest to management why we SHOULD store offsets, so we can use 
FastVectorHighlighter,

but I'm afraid I might still want the too-huge-to-highlight path.

-Paul

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Fast way to get the start of document

2012-06-25 Thread Mike Sokolov
I should also mention FastVectorHighlighter - are you using that?  I 
believe it would find a highlight at the end of a huge document much 
faster.  It would still read the whole doc into memory, but wouldn't 
have to analyze it.  There are also some limiting parameters there which 
prevent blowups for very large docs (hl.phraseLimit; see LUCENE-3234)


-Mike

On 06/25/2012 01:03 PM, Paul Hill wrote:

Mike and Jack,

Thanks for the suggestions.

As Mike suggested,  I already have the pre-stored length field.
I DO NOT read in the whole doc just to make the decision on "too huge", but I 
DO read it to _obtain_ the trivial
Intro. fragment instead of an excellent highlighted fragment.  I wanted a 
(memory saving) stream, so I could read just a little (1st buffer).

I am willing to change the index, so one solution is to not store an additional 
"reasonable_body_for_highlight_frament_generation", but a
smaller "just the 1st page" field only for too-huge documents that I use only when I want 
to only get the "Intro fragment" (with possibly no highlights).
But Jack's suggestion makes my think I should consider adding as many initial 
pages as I can get away with for too-huge documents and then I might just luck 
out and find a decent high-lightable section.
(Adding 10 pages only to the 1 in 5000 document that is too-huge, doesn't seem 
like much overhead for an index).

Our choice is that we'd like to hit highlight docs which are as huge as 
possible, because we are working with
customers who tend to be verbose, very verbose on occasion, and would love to 
find the perfect quote in Appendix Q of a 95 page report (but maybe I need
to have a talk with product management about this).  It is a tradeoff where we 
can try to educate the users and tell them that we are sorry that their query 
is slow, but if they want a faster response try using less common words
and a few more of them and you won't run into your too-huge documents unless 
you really want to see them.

So is there NO way to read the "all_text" field and only read _the_start_ of it?
Otherwise, I'm thinking I'll go with an extra 1st page field for the too-huge 
documents.

-Paul

   

-Original Message-
From: Mike Sokolov [mailto:soko...@ifactory.com]
Sent: Saturday, June 23, 2012 7:16 PM
To: java-user@lucene.apache.org
Cc: Jack Krupansky
Subject: Re: Fast way to get the start of document

I got the sense from Paul's post that he wanted a solution that didn't require 
changing his index, although
I'm not sure there is one.  Paul if you're willing to re-index, you could also 
store the length of the text as a
numeric field, retrieve that and use it to drive the decision about whether to 
highlight.

-Mike Sokolov

On 6/23/2012 6:17 PM, Jack Krupansky wrote:
 

Simply have two fields, "full_body" and "limited_body". The former
would index but not store the full document text from Tika (the
"content" metadata.) The latter would store but not necessarily index
the first 10K or so characters of the full text. Do searches on the
full body field and highlighting on the limited body field.

-- Jack Krupansky

-Original Message- From: Paul Hill
Sent: Friday, June 22, 2012 2:23 PM
To: java-user@lucene.apache.org
Subject: Fast way to get the start of document

Our Hit highlighting (Using the older Highlighter) is wired with a
"too huge" limit, so we could skip the multi-million character files,
not just for highlighter.setMaxDocCharsToAnalyze, but if a document is
really above the too huge limit, we don't even try, and just produce a
fragment from the front of the document.
This results in almost reasonable response to time, even for a result
sets of crazy huge documents (or ones with just 1 huge doc). I think
this is all pretty normal.  Tell me if I'm wrong.

Given the above, while timing what was going on, I realized that I was
reading in the entire body of the text in the skip highlighting case
just to grab the 1st 100 or so characters.
I was doing

String text = fieldable.stringValue(); // Oh my!

Is there a way to _not_ read the whole multi-million characters in and
only _start_ reading the contents of a large field?  See code below
which got me no better results.
Some details

1.  Using Lucene 3.4

2.  Storing the (Tika) parse text of documents

a.  These are human produced documents; PDF, word etc. often 10K
of characters, sometimes 100Ks, but very occasionally a few million)

3.  At this time, we store positions, but not offsets.

4.  We are using the old Highlighter, not the
FastVectorHighlighter (because of #3 above).

5.  A basic search result is a page of 10 documents with short
"blurb" (one fragment that shows a good hit).

I would be willing to live with a token stream to gen the intro blurb,
but using the following code when under 

Re: find meaningful words through Lucene

2012-06-27 Thread Mike Sokolov
Maybe high frequency terms that are not evenly distributed throughout 
the corpus would be a better definition.  Discriminative terms.  I'm 
sure there is something in the machine learning literature about 
unsupervised clustering that would help here.  But I don't know what it 
is :)


-Mike

On 06/27/2012 05:09 AM, Ian Lea wrote:

All words are important if they help people find what they want.

Maybe you want high frequency terms.  See contrib class
org.apache.lucene.misc.HighFreqTerms.


--
Ian.


On Wed, Jun 27, 2012 at 3:04 AM, 齐保元  wrote:
   

meaningful just means the word is more important than others, like keywords/keyphrases.





 

Please define meaningful.

--
Ian.


On Tue, Jun 26, 2012 at 10:39 AM,  wrote:
   

hi, does anyone know how to extract meaningful words from a Lucene index?
 

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

   

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

   


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Problem with TermVector offsets and positions not being preserved

2012-07-19 Thread Mike O'Leary
I created an index using Lucene 3.6.0 in which I specified that a certain text 
field in each document should be indexed, stored, analyzed with no norms, with 
term vectors, offsets and positions. Later I looked at that index in Luke, and 
it said that term vectors were created for this field, but offsets and 
positions were not. The code I used for indexing couldn't be simpler. It looks 
like this for the relevant field:

doc.add(new Field("ReportText", reportTextContents, Field.Store.YES, 
Field.Index.ANALYZED_NO_NORMS, Field.TermVector.WITH_POSITIONS_OFFSETS);

The indexer adds these documents to the index and commits them. I ran the 
indexer in a debugger and watched the Lucene code set the Field instance 
variables called storeTermVector, storeOffsetWithTermVector and 
storePositionWithTermVector to true for this field.

When the indexing was done, I ran a simple program in a debugger that opens an 
index, reads each document and writes out its information as XML. The values of 
storeOffsetWithTermVector and storePositionWithTermVector in the ReportText 
Field objects were false. Is there something other than specifying 
Field.TermVector.WITH_POSITIONS_OFFSETS when constructing a Field that needs to 
be done in order for offsets and positions to be saved in the index? Or are 
there circumstances under which the Field.TermVector setting for a Field object 
is ignored? This doesn't make sense to me, and I could swear that offsets and 
positions were being saved in some older indexes I created that I unfortunately 
no longer have around for comparison. I'm sure that I am just overlooking 
something or have made some kind of mistake, but I can't see what it is at the 
moment. Thanks for any help or advice you can give me.
Mike


RE: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Mike O'Leary
Hi Robert,
I put together the following two small applications to try to separate the 
problem I am having from my own software and any bugs it contains. One of the 
applications is called CreateTestIndex, and it comes with the Lucene In Action 
book's source code that you can download from Manning Publications. I changed 
it a tiny bit to get rid of a special analyzer that is irrelevant to what I am 
looking at, to get rid of a few warnings about deprecated functions, and to add 
a loop that writes names of fields and their TermVector, offset and position 
settings to the console.

The other application is called DumpIndex, and I got it from a web site somewhere 
about 6 months ago. I changed a few lines to get rid of deprecated function 
warnings and added the same line of code to it that writes field information to 
the console.

What I am seeing is that when I run CreateTestIndex, when the fields are first 
created, added to a document, and are about to be added to the index, the 
fields for which Field.TermVector.WITH_POSITIONS_OFFSETS is specified correctly 
print out that the values of field.isTermVectorStored(), 
field.isStoreOffsetWithTermVector() and field.isStorePositionWithTermVector() 
are true. When I run DumpIndex on the index that was created, those fields 
print out true for field.isTermVectorStored() and false for the other two 
functions.
Thanks,
Mike

This is the source code for CreateTextIndex:


package myLucene;

/**
 * Copyright Manning Publications Co.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
*/

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;
import java.util.Date;
import java.util.List;
import java.util.ArrayList;
import java.text.ParseException;

public class CreateTestIndex {
  
  public static Document getDocument(String rootDir, File file) throws 
IOException {
Properties props = new Properties();
props.load(new FileInputStream(file));

Document doc = new Document();

// category comes from relative path below the base directory
String category = file.getParent().substring(rootDir.length());//1
category = category.replace(File.separatorChar, '/');  //1

String isbn = props.getProperty("isbn"); //2
String title = props.getProperty("title");   //2
String author = props.getProperty("author"); //2
String url = props.getProperty("url");   //2
String subject = props.getProperty("subject");   //2

String pubmonth = props.getProperty("pubmonth"); //2

System.out.println(title + "\n" + author + "\n" + subject + "\n" + pubmonth 
+ "\n" + category + "\n-");

doc.add(new Field("isbn", // 3
  isbn,   // 3
  Field.Store.YES,// 3
  Field.Index.NOT_ANALYZED)); // 3
doc.add(new Field("category", // 3
  category,   // 3
  Field.Store.YES,// 3
  Field.Index.NOT_ANALYZED)); // 3
doc.add(new Field("title",// 3
  title,  // 3
  Field.Store.YES,// 3
  Field.Index.ANALYZED,   // 3
  Field.TermVector.WITH_POSITIONS_OFFSETS));   // 3
doc.add(new Field("title2",   // 3
  title.toLowerCase(),// 3
  Field.Store.YES,// 3
  Field.Index.NOT_ANALYZED_NO_NORMS,   // 3
  Field.TermVector.WITH_POSITIONS_OFFSETS));  // 3

// split multiple authors into unique fiel

RE: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Mike O'Leary
I neglected to mention that CreateTestIndex uses a collection of data files 
with .properties extensions that are included in the Lucene In Action source 
code download.
Mike

-Original Message-
From: Mike O'Leary [mailto:tmole...@uw.edu] 
Sent: Friday, July 20, 2012 2:10 PM
To: java-user@lucene.apache.org
Subject: RE: Problem with TermVector offsets and positions not being preserved

Hi Robert,
I put together the following two small applications to try to separate the 
problem I am having from my own software and any bugs it contains. One of the 
applications is called CreateTestIndex, and it comes with the Lucene In Action 
book's source code that you can download from Manning Publications. I changed 
it a tiny bit to get rid of a special analyzer that is irrelevant to what I am 
looking at, to get rid of a few warnings about deprecated functions, and to add 
a loop that writes names of fields and their TermVector, offset and position 
settings to the console.

The other application is called DumpIndex, and got it from a web site somewhere 
about 6 months ago. I changed a few lines to get rid of deprecated function 
warnings and added the same line of code to it that writes field information to 
the console.

What I am seeing is that when I run CreateTestIndex, when the fields are first 
created, added to a document, and are about to be added to the index, the 
fields for which Field.TermVector.WITH_POSITIONS_OFFSETS is specified correctly 
print out that the values of field.isTermVectorStored(), 
field.isStoreOffsetWithTermVector() and field.isStorePositionWithTermVector() 
are true. When I run DumpIndex on the index that was created, those fields 
print out true for field.isTermVectorStored() and false for the other two 
functions.
Thanks,
Mike

This is the source code for CreateTextIndex:


package myLucene;

/**
 * Copyright Manning Publications Co.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
*/

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;
import java.util.Date;
import java.util.List;
import java.util.ArrayList;
import java.text.ParseException;

public class CreateTestIndex {
  
  public static Document getDocument(String rootDir, File file) throws 
IOException {
Properties props = new Properties();
props.load(new FileInputStream(file));

Document doc = new Document();

// category comes from relative path below the base directory
String category = file.getParent().substring(rootDir.length());//1
category = category.replace(File.separatorChar, '/');  //1

String isbn = props.getProperty("isbn"); //2
String title = props.getProperty("title");   //2
String author = props.getProperty("author"); //2
String url = props.getProperty("url");   //2
String subject = props.getProperty("subject");   //2

String pubmonth = props.getProperty("pubmonth"); //2

System.out.println(title + "\n" + author + "\n" + subject + "\n" + pubmonth 
+ "\n" + category + "\n-");

doc.add(new Field("isbn", // 3
  isbn,   // 3
  Field.Store.YES,// 3
  Field.Index.NOT_ANALYZED)); // 3
doc.add(new Field("category", // 3
  category,   // 3
  Field.Store.YES,// 3
  Field.Index.NOT_ANALYZED)); // 3
doc.add(new Field("title",// 3
  title,  // 3
  Field.Store.YES,// 3
  Field.Index.ANALYZED,   // 3
 

RE: Problem with TermVector offsets and positions not being preserved

2012-07-20 Thread Mike O'Leary
Hi Robert,
I'm not trying to determine whether a document has term vectors, I'm trying to 
determine whether the term vectors that are in the index have offsets and 
positions stored. Shouldn't the Field instance variables called 
storeOffsetWithTermVector and storePositionWithTermVector be set to true for a 
field that is defined to store offsets and positions in term vectors? They are 
set to true in 3.5, but not in 3.6. When I open an index that I created with 
3.6 in Luke, it says the fields in question have term vectors enabled, but 
offsets and positions are not stored. Maybe once term vectors with offsets and 
positions are created, it doesn't matter anymore what the values of 
storeOffsetWithTermVector and storePositionWithTermVector happen to be, but I'd 
like to find out for sure if offsets and positions are being handled right in 
3.6 or not because I need to produce indexes that a co-worker can use with a UI 
that uses fast vector term highlighting, and I'd like to be sure I have created 
indexes that work for him.
Thanks,
Mike

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Friday, July 20, 2012 4:05 PM
To: java-user@lucene.apache.org
Subject: Re: Problem with TermVector offsets and positions not being preserved

I think it's wrong for DumpIndex to look at term vector information from the 
Document that was retrieved from IndexReader.document; that's basically just a 
way of getting access to your stored fields.

This tool should be using something like IndexReader.getTermFreqVector for the 
document to determine if it has term vectors.

On Fri, Jul 20, 2012 at 5:10 PM, Mike O'Leary  wrote:
> Hi Robert,
> I put together the following two small applications to try to separate the 
> problem I am having from my own software and any bugs it contains. One of the 
> applications is called CreateTestIndex, and it comes with the Lucene In 
> Action book's source code that you can download from Manning Publications. I 
> changed it a tiny bit to get rid of a special analyzer that is irrelevant to 
> what I am looking at, to get rid of a few warnings about deprecated 
> functions, and to add a loop that writes names of fields and their 
> TermVector, offset and position settings to the console.
>
> The other application is called DumpIndex, and got it from a web site 
> somewhere about 6 months ago. I changed a few lines to get rid of deprecated 
> function warnings and added the same line of code to it that writes field 
> information to the console.
>
> What I am seeing is that when I run CreateTestIndex, when the fields are 
> first created, added to a document, and are about to be added to the index, 
> the fields for which Field.TermVector.WITH_POSITIONS_OFFSETS is specified 
> correctly print out that the values of field.isTermVectorStored(), 
> field.isStoreOffsetWithTermVector() and field.isStorePositionWithTermVector() 
> are true. When I run DumpIndex on the index that was created, those fields 
> print out true for field.isTermVectorStored() and false for the other two 
> functions.
> Thanks,
> Mike
>
> This is the source code for CreateTextIndex:
>
> //
> //
> package myLucene;
>
> /**
>  * Copyright Manning Publications Co.
>  *
>  * Licensed under the Apache License, Version 2.0 (the "License");
>  * you may not use this file except in compliance with the License.
>  * You may obtain a copy of the License at
>  *
>  * http://www.apache.org/licenses/LICENSE-2.0
>  *
>  * Unless required by applicable law or agreed to in writing, software
>  * distributed under the License is distributed on an "AS IS" BASIS,
>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>  * See the License for the specific language governing permissions and
>  * limitations under the License. */
>
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.Fieldable;
> import org.apache.lucene.document.NumericField;
> import org.apache.lucene.document.DateTools;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.IndexWriterConfig;
> import org.apache.lucene.util.Version;
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import java.util.Properties;
> import java.util.Date;
> import java.util.List;
> import java.util.ArrayList;
> import java.text.ParseException;
>
> public class CreateTestIndex {
>
>   public static Document getDocument(String rootDir, File file) throws 
> IOException {

RE: Problem with TermVector offsets and positions not being preserved

2012-07-26 Thread Mike O'Leary
Hi Robert,
Thanks for your help. This cleared up all of the things I was having trouble 
understanding about offsets and positions in term vectors.
Mike

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Friday, July 20, 2012 5:59 PM
To: java-user@lucene.apache.org
Subject: Re: Problem with TermVector offsets and positions not being preserved

On Fri, Jul 20, 2012 at 8:24 PM, Mike O'Leary  wrote:
> Hi Robert,
> I'm not trying to determine whether a document has term vectors, I'm trying 
> to determine whether the term vectors that are in the index have offsets and 
> positions > stored.

Right: what I'm trying to tell you is that offsets and positions are not an 
index-wide setting for a field: it's per-document.

I think all the tools you are using to check these values are not doing it 
correctly:
1. DumpIndex is wrongly using values from the Document returned by 
IndexReader.document(), but that doesn't and never did retrieve these values 
(it would be 2 extra disk seeks per document to figure out the term vector 
flags) 2. I havent looked at Luke, but its probably printing the "global"
bits from FieldInfos. It used to be that we wrote some bits for these options, 
I don't ever know what the purpose was since these options can be controlled 
on/off at a per-document level: they make no sense.
Because of this we stopped writing these bits in 3.6 (we only write into 
FieldInfos if the field has any term vectors at all), and thats probably whats 
confusing you there.

Again, if you really want to validate that a specific document has 
offsets/positions in its term vectors, you need to check that specific document 
with IndexReader.getTermFreqVector, there is no other way, since this can be 
controlled on a per-document basis for a field.
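
A minimal sketch of that per-document check with the 3.x API; the field name and
the term index 0 are placeholders:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;

public class TermVectorCheck {

    // Reports whether positions and offsets actually made it into one document's term vector.
    public static void check(IndexReader reader, int docNum, String field) throws Exception {
        TermFreqVector tfv = reader.getTermFreqVector(docNum, field);
        if (tfv == null || tfv.size() == 0) {
            System.out.println("no term vector for doc " + docNum + " field " + field);
        } else if (tfv instanceof TermPositionVector) {
            TermPositionVector tpv = (TermPositionVector) tfv;
            int[] positions = tpv.getTermPositions(0);          // positions of the first term
            TermVectorOffsetInfo[] offsets = tpv.getOffsets(0); // null if offsets were not stored
            System.out.println("positions stored: " + (positions != null));
            System.out.println("offsets stored:   " + (offsets != null));
        } else {
            System.out.println("term vector stored without positions/offsets");
        }
    }
}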


--
lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Small Vocabulary

2012-08-06 Thread Mike Sokolov
There was some interesting work done on optimizing queries including 
very common words (stop words) that I think overlaps with your problem. 
See this blog post 
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2 
from the Hathi Trust.


The upshot in a nutshell was that queries including terms with very 
large postings lists (ie high occurrences) were slow, and the approach 
they took to dealing with this was to index n-grams (ie pairs and 
triplets of adjacent tokens).  However I'm not sure this would help much 
if your queries will typically include only a single token.
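
If you do end up trying that, the ShingleFilter from the analysis module is the 
usual way to get n-grams of adjacent tokens at index time. A rough, untested 
sketch against the 4.0-style Analyzer API (the tokenizer choice is just an 
assumption for illustration):

Analyzer shingleAnalyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // index adjacent-token pairs and triples ("shingles") alongside the single tokens
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_40, reader);
    TokenStream sink = new ShingleFilter(source, 2, 3);  // emit 2- and 3-grams
    return new TokenStreamComponents(source, sink);
  }
};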


-Mike

On 07/30/2012 09:07 AM, Carsten Schnober wrote:

Dear list,
I'm considering to use Lucene for indexing sequences of part-of-speech
(POS) tags instead of words; for those who don't know, POS tags are
linguistically motivated labels that are assigned to tokens (words) to
describe its morpho-syntactic function. Instead of sequences of words, I
would like to index sequences of tags, for instance "ART ADV ADJA NN".
The aim is to be able to search (efficiently) for occurrences of "ADJA".

The question is whether Lucene can be applied to deal with that data
cleverly because the statistical properties of such pseudo-texts is very
distinct from natural language texts and make me wonder whether Lucene's
inverted indexes are suitable. Especially the small vocabulary size (<50
distinct tokens, depending on the tagging system) is problematic, I suppose.

First trials for which I have implemented an analyzer that just outputs
Lucene tokens such as "ART", "ADV", "ADJA", etc. yield results that are
not exactly perfect regarding search performance, in a test corpus with
a few million tokens. The number of tokens in production mode is
expected to be much larger, so I wonder whether this approach is
promising at all.
Does Lucene (4.0?) provide optimization techniques for extremely small
vocabulary sizes?

Thank you very much,
Carsten Schnober


   


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Supporting advanced search methods in a user interface

2012-08-16 Thread Mike O'Leary
I would like to know if anyone has ideas (or pointers to discussions) about 
good ways to support advanced search options, such as the various kinds of 
SpanQuery, in a search application user interface that is understandable to 
non-expert users.
Thanks,
Mike


RE: Problem with TermVector offsets and positions not being preserved

2012-08-22 Thread Mike O'Leary
I have one more question about term vector positions and offsets being 
preserved. My co-worker is working on updating the documents in an index with a 
field that contains a numerical value derived from the term frequencies and 
inverse document frequencies of terms in the document. His first pass at doing 
this calculates these values, writes them along with document ids to a text 
file and then updates the documents by reading lines from the file, searching 
for the document that contains the id, adding the field to the document, and 
replacing the document in the index. Some of the fields in these documents have 
term vectors with offsets and positions. After the revised document is updated 
in the index, those fields' term vector offsets and positions are still found. 
After closing the searcher, reader and writer that are used in this process, 
the fields that have term vectors no longer have positions and offsets in them. 
His code looks like this:

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, _analyzer);
IndexWriter writer = new IndexWriter(indexDir, config);
IndexReader reader = IndexReader.open(writer, true);
IndexSearcher searcher = new IndexSearcher(reader);

while ((s = in.readLine()) != null) {
String[] tokens = s.split(",");
float fieldValue = Float.parseFloat(tokens[1].trim());
NumericField nField = new NumericField("freqVal", Field.Store.YES, true);
nField.setFloatValue(fieldValue);
String docId = tokens[0].trim();
Term docIdTerm = new Term("DocId", docId);
TermQuery query = new TermQuery(docIdTerm);
TopDocs hits = searcher.search(query, 2);
  
if (hits.scoreDocs.length != 1) {
throw new Exception("Unexpected number of documents in index with docId 
= " + docId);
}
int docNum = hits.scoreDocs[0].doc;
Document doc = searcher.doc(docNum);
doc.add(nField);
writer.updateDocument(docIdTerm, doc);
}
displayTermVectorInfo(dir);   // for debugging
writer.close();
displayTermVectorInfo(dir);   // for debugging
reader.close();
searcher.close();

private static void displayTermVectorInfo(Directory dir) throws IOException, 
CorruptIndexException {
IndexReader reader = null;

try {
reader = IndexReader.open(dir);

for (int i = 0; i < reader.numDocs(); i++) {
Document doc = reader.document(i);
List<Fieldable> docFields = doc.getFields();

for (Fieldable field : docFields) {
TermFreqVector termFreqVector = reader.getTermFreqVector(i, 
field.name());
  
if (termFreqVector != null && termFreqVector instanceof 
TermPositionVector) {
TermPositionVector termPositionVector = 
(TermPositionVector)termFreqVector;
System.out.println("Field " + field.name());

for (int j = 0; j < termFreqVector.size(); j++) {
TermVectorOffsetInfo[] offsets = 
termPositionVector.getOffsets(j);

for (TermVectorOffsetInfo offsetInfo : offsets) {
System.out.println("offset: " + 
offsetInfo.getStartOffset() + " " + offsetInfo.getEndOffset());
}
}
for (int k = 0; k < termFreqVector.size(); k++) {
int[] positions = 
termPositionVector.getTermPositions(k);

for (int position : positions) {
System.out.println("position: " + position);
}
}
}
}
}
} finally {
if (reader != null) {
reader.close();
}
}
}

The first time displayTermVectorInfo is called, it displays offsets and 
positions for the fields that have term vectors with offsets and positions. The 
second time it is called, it doesn't display anything because none of the term 
vectors satisfy termFreqVector instanceof TermPositionVector. Is it supposed to 
work this way? What is it about closing the writer that alters the term vectors 
in the affected fields? Is there a way to add a field to the documents in an 
index in which this doesn't occur?
Thanks,
Mike


-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Friday, July 20, 2012 5:59 PM
To: java-user@lucene.apache.org
Subject: Re: Problem with TermVector offsets and positions not being preserved

On Fri, Jul 20, 2012 at 8:24 PM, Mike O'Leary  wrote:
> Hi Robert,
> I'm not trying to determine whether a document has term vectors, I'm trying 
> to determine whether the term vectors that are in the index have offsets and 
> positions > stored.

Right: what i'm trying to tell you is that offsets and positions is not an 
index-wide setting for a field: its per-document.

I thin

RE: Problem with TermVector offsets and positions not being preserved

2012-08-24 Thread Mike O'Leary
So for Lucene 3.6, is the right way to do this to create a new Document and add 
new Fields based on the old Fields (with the settings you want them to have for 
term vector offsets and positions, etc.) and then call updateDocument on that 
new Document?
Thanks,
Mike

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Friday, August 24, 2012 9:52 AM
To: java-user@lucene.apache.org
Subject: Re: Problem with TermVector offsets and positions not being preserved

Calling IR.document does not restore your 'original Document'
completely. This is really an age-old trap.
So don't update documents this way: its fine to fetch their contents but 
nothing goes thru the effort to ensure that things like term vectors parameters 
are the same as what you originally provided. This would require extra disk 
seeks.

See https://issues.apache.org/jira/browse/LUCENE-3312 for an effort to fix this 
trap for google summer of code.
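
A sketch of the rebuild-the-document route (3.6 API; contentsText here stands 
in for the original source text, which you have to re-supply yourself, and the 
field names are just examples):

// Rebuild the document from your own source, stating the term-vector flags
// explicitly, instead of reusing the Document returned by IndexReader.document().
Document fresh = new Document();
fresh.add(new Field("DocId", docId, Field.Store.YES, Field.Index.NOT_ANALYZED));
fresh.add(new Field("contents", contentsText, Field.Store.YES, Field.Index.ANALYZED,
                    Field.TermVector.WITH_POSITIONS_OFFSETS));
NumericField freqVal = new NumericField("freqVal", Field.Store.YES, true);
freqVal.setFloatValue(fieldValue);
fresh.add(freqVal);
writer.updateDocument(new Term("DocId", docId), fresh);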

On Wed, Aug 22, 2012 at 5:23 PM, Mike O'Leary  wrote:
> I have one more question about term vector positions and offsets being 
> preserved. My co-worker is working on updating the documents in an index with 
> a field that contains a numerical value derived from the term frequencies and 
> inverse document frequencies of terms in the document. His first pass at 
> doing this calculates these values, writes them along with document ids to a 
> text file and then updates the documents by reading lines from the file, 
> searching for the document that contains the id, adding the field to the 
> document, and replacing the document in the index. Some of the fields in 
> these documents have term vectors with offsets and positions. After the 
> revised document is updated in the index, those fields' term vector offsets 
> and positions are still found. After closing the searcher, reader and writer 
> that are used in this process, the fields that have term vectors no longer 
> have positions and offsets in them. His code looks like this:
>
> IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, 
> _analyzer); IndexWriter writer = new IndexWriter(indexDir, config); 
> IndexReader reader = IndexReader.open(writer, true); IndexSearcher 
> searcher = new IndexSearcher(reader);
>
> while ((s = in.readLine()) != null) {
> String[] tokens = s.split(",");
> float fieldValue = Float.parseFloat(tokens[1].trim());
> NumericField nField = new NumericField("freqVal", Field.Store.YES, true);
> nField.setFloatValue(fieldValue);
> String docId = tokens[0].trim();
> Term docIdTerm = new Term("DocId", docId);
> TermQuery query = new TermQuery(docIdTerm);
> TopDocs hits = searcher.search(query, 2);
>
> if (hits.scoreDocs.length != 1) {
> throw new Exception("Unexpected number of documents in index with 
> docId = " + docId);
> }
> int docNum = hits.scoreDocs[0].doc;
> Document doc = searcher.doc(docNum);
> doc.add(nField);
> writer.updateDocument(docIdTerm, doc); }
> displayTermVectorInfo(dir);   // for debugging
> writer.close();
> displayTermVectorInfo(dir);   // for debugging
> reader.close();
> searcher.close();
>
> private static void displayTermVectorInfo(Directory dir) throws IOException, 
> CorruptIndexException {
> IndexReader reader = null;
>
> try {
> reader = IndexReader.open(dir);
>
> for (int i = 0; i < reader.numDocs; i++) {
> Document doc = reader.document(j);
> List docFields = doc.getFields();
>
> for (Fieldable field : docFields) {
> TermFreqVector termFreqVector = 
> reader.getTermFreqVector(i, field.name());
>
> if (termFreqVector != null && termFreqVector instanceof 
> TermPositionVector) {
> TermPositionVector termPositionVector = 
> (TermPositionVector)termFreqVector;
> System.out.println("Field " + field.name());
>
> for (int j = 0; j < termFreqVector.size(); j++) {
> TermVectorOffsetInfo[] offsets = 
> termPositionVector.getOffsets(j);
>
> for (TermVectorOffsetInfo offsetInfo : offsets) {
> System.out.println("offset: " + 
> offsetInfo.getStartOffset() + " " + offsetInfo.getEndOffset());
> }
> }
> for (int k = 0; k < termFreqVector.size(); k++) {
> int[] positions = 
> termPositionVector.getTermPositions(k);
>
> for (int position : positions) {
> System.out.println("position: " +

Uses for IndexWriter.commit(commitUserData)/IndexCommit.getUserData()

2012-09-21 Thread Mike O'Leary
I was looking at IndexWriter.commit(commitUserData) and 
IndexCommit.getUserData() as possible ways to save metadata about documents in 
an index, but I realized that the metadata we are looking at could easily get 
to have way too many map entries to work well. This pair of functions looks 
useful though, and I would like to ask if anyone could describe use cases where 
it works well to save data in a commitUserData map while indexing for later use 
in the application.
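
For concreteness, the call shape I have in mind is roughly this (3.x API; the 
"lastSeq" key and the variables are made up):

Map<String, String> userData = new HashMap<String, String>();
userData.put("lastSeq", String.valueOf(lastProcessedSeq));  // hypothetical checkpoint value
writer.commit(userData);

// ...and later, e.g. on startup, read it back from the most recent commit:
Map<String, String> committed = IndexReader.getCommitUserData(dir);
String lastSeq = committed.get("lastSeq");
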
Thanks,
Mike


Lucene 4.0 PerFieldAnalyzerWrapper question

2012-09-25 Thread Mike O'Leary
I am updating an analyzer that uses a particular configuration of the 
PerFieldAnalyzerWrapper to work with Lucene 4.0. A few of the fields use a 
custom analyzer and StandardTokenizer and the other fields use the 
KeywordAnalyzer and KeywordTokenizer. The older version of the analyzer looks 
like this:

public class MyPerFieldAnalyzer extends Analyzer {
  PerFieldAnalyzerWrapper _analyzer;

  public MyPerFieldAnalyzer() {
Map<String, Analyzer> analyzerMap = new HashMap<String, Analyzer>();

analyzerMap.put("IDNumber", new KeywordAnalyzer());
...
...

_analyzer = new PerFieldAnalyzerWrapper(new CustomAnalyzer(), analyzerMap);
  }

  @Override
  public TokenStream tokenStream(String fieldname, Reader reader) {
TokenStream stream = _analyzer.tokenStream(fieldname, reader);
return stream;
  }
}

In older versions of Lucene it is necessary to define a tokenStream function, 
but in 4.0 it is not (in fact, Analyzer.tokenStream is declared final, so you can't override it). 
Instead, it is necessary to define a createComponents function that takes the 
same arguments as the tokenStream function and returns a TokenStreamComponents 
object. The TokenStreamComponents constructor has a Tokenizer argument and a 
TokenStream argument. I assume I can just use the same code to provide the 
TokenStream object as was used in the older analyzer's tokenStream function, 
but I don't see how to provide a Tokenizer object, unless it is by creating a 
separate map of field names to Tokenizers that works the same way the analyzer 
map does. Is that the best way to do this, or is there a better way? For 
example, would it be better to inherit from AnalyzerWrapper instead of from 
Analyzer? In that case I would need to define getWrappedAnalyzer and 
wrappedComponents functions. I think in that case I would still need to put the 
same kind of logic in the wrapComponents function that specifies which 
tokenizer to use with which field, though. It looks like the 
PerFieldAnalyzerWrapper itself assumes that the same tokenizer will be used 
with all fields, as its wrapComponents function ignores the fieldname 
parameter. I would appreciate any help in finding out the best way to update 
this analyzer and to write the required function(s).
Thanks,
Mike


RE: Lucene 4.0 PerFieldAnalyzerWrapper question

2012-09-25 Thread Mike O'Leary
Hi Chris,
In a nutshell, my question is, what should I put in place of ??? to make this 
into a Lucene 4.0 analyzer?

public class MyPerFieldAnalyzer extends Analyzer {
  PerFieldAnalyzerWrapper _analyzer;

  public MyPerFieldAnalyzer() {
Map<String, Analyzer> analyzerMap = new HashMap<String, Analyzer>();

analyzerMap.put("IDNumber", new KeywordAnalyzer());
...
...

_analyzer = new PerFieldAnalyzerWrapper(new CustomAnalyzer(),  analyzerMap);
  }

  @Override
  public TokenStreamComponents createComponents(String fieldname, Reader 
reader) {
Tokenizer source = ???;
TokenStream stream = _analyzer.tokenStream(fieldname, reader);
return new TokenStreamComponents(source, stream);
  }
}

I must be missing something obvious. Can you tell me what it is?
Thanks,
Mike

-Original Message-
From: Chris Male [mailto:gento...@gmail.com] 
Sent: Tuesday, September 25, 2012 5:18 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 4.0 PerFieldAnalyzerWrapper question

Hi Mike,

I don't really understand what problem you're having.

PerFieldAnalyzerWrapper, like all AnalyzerWrappers, uses 
Analyzer.PerFieldReuseStrategy which means it caches the TokenStreamComponents 
per field.  The TokenStreamComponents cached are created by retrieving the 
wrapped Analyzer through
AnalyzerWrapper.getWrappedAnalyzer(Field) and calling createComponents.  In 
PerFieldAnalyzerWrapper, getWrappedAnalyzer pulls the Analyzer from the Map you 
provide.

Consequently to use your custom Analyzers and KeywordAnalyzer, all you need to 
do is define your custom Analyzer using the new Analyzer API (that is using 
TokenStreamComponents), create your Map from that Analyzer and KeywordAnalyzer 
and pass it into PerFieldAnalyzerWrapper.  This seems to be what you're doing 
in your code sample.

Are you able to expand on the problem you're encountering?

On Wed, Sep 26, 2012 at 11:57 AM, Mike O'Leary  wrote:

> I am updating an analyzer that uses a particular configuration of the 
> PerFieldAnalyzerWrapper to work with Lucene 4.0. A few of the fields 
> use a custom analyzer and StandardTokenizer and the other fields use 
> the KeywordAnalyzer and KeywordTokenizer. The older version of the 
> analyzer looks like this:
>
> public class MyPerFieldAnalyzer extends Analyzer {
>   PerFieldAnalyzerWrapper _analyzer;
>
>   public MyPerFieldAnalyzer() {
> Map<String, Analyzer> analyzerMap = new HashMap<String, Analyzer>();
>
> analyzerMap.put("IDNumber", new KeywordAnalyzer());
> ...
> ...
>
> _analyzer = new PerFieldAnalyzerWrapper(new CustomAnalyzer(), 
> analyzerMap);
>   }
>
>   @Override
>   public TokenStream tokenStream(String fieldname, Reader reader) {
> TokenStream stream = _analyzer.tokenStream(fieldname, reader);
> return stream;
>   }
> }
>
> In older versions of Lucene it is necessary to define a tokenStream 
> function, but in 4.0 it is not (in fact, TokenStream is declared 
> final, so you can't). Instead, it is necessary to define a 
> createComponents function that takes the same arguments as the 
> tokenStream function and returns a TokenStreamComponents object. The 
> TokenStreamComponents constructor has a Tokenizer argument and a 
> TokenStream argument. I assume I can just use the same code to provide 
> the TokenStream object as was used in the older analyzer's tokenStream 
> function, but I don't see how to provide a Tokenizer object, unless it 
> is by creating a separate map of field names to Tokenizers that works 
> the same way the analyzer map does. Is that the best way to do this, 
> or is there a better way? For example, would it be better to inherit 
> from AnalyzerWrapper instead of from Analyzer? In that case I would 
> need to define getWrappedAnalyzer and wrappedComponents functions. I 
> think in that case I would still need to put the same kind of logic in 
> the wrapComponents function that specifies which tokenizer to use with 
> which field, though. It looks like the PerFieldAnalyzerWrapper itself 
> assumes that the same tokenizer will be used with all fields, as its 
> wrapComponents function ignores the fieldname parameter. I would 
> appreciate any help in finding out the best way to update this analyzer and 
> to write the required function(s).

Thanks,
> Mike
>



--
Chris Male | Open Source Search Developer | elasticsearch | www.elasticsearch.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Lucene 4.0 PerFieldAnalyzerWrapper question

2012-09-25 Thread Mike O'Leary
Hi Chris,
So if I change my analyzer to inherit from AnalyzerWrapper, I need to define a 
getWrappedAnalyzer function and a wrapComponents function. I think 
getWrappedAnalyzer is straightforward, but I don't understand who is calling 
wrapComponents and for what purpose, so I don't know how to define it. This is 
my modified analyzer code with ??? in the places I don't know how to define.
Thanks,
Mike

public class MyPerFieldAnalyzer extends AnalyzerWrapper {
  Map<String, Analyzer> _analyzerMap = new HashMap<String, Analyzer>();
  Analyzer _defaultAnalyzer;

  public MyPerFieldAnalyzer() {
    _analyzerMap.put("IDNumber", new KeywordAnalyzer());
    ...
    ...

    _defaultAnalyzer = new CustomAnalyzer();
  }

  @Override
  protected Analyzer getWrappedAnalyzer(String fieldName) {
    Analyzer analyzer;

    if (_analyzerMap.containsKey(fieldName)) {
      analyzer = _analyzerMap.get(fieldName);
    } else {
      analyzer = _defaultAnalyzer;
    }
    return analyzer;
  }

  @Override
  public TokenStreamComponents wrapComponents(String fieldname,  
TokenStreamComponents components) {
Tokenizer tokenizer = ???;
TokenStream tokenStream = ???;
return new TokenStreamComponents(tokenizer, tokenStream);
  }
}

-Original Message-
From: Chris Male [mailto:gento...@gmail.com] 
Sent: Tuesday, September 25, 2012 5:34 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 4.0 PerFieldAnalyzerWrapper question

Ah I see.

The problem is that we don't really encourage wrapping of Analyzers.  Your 
Analyzer wraps a PerFieldAnalyzerWrapper consequently it needs to extend 
AnalyzerWrapper, not Analyzer.  AnalyzerWrapper handles the createComponents 
call and just requires you to give it the Analyzer(s) you've wrapped through 
getWrappedAnalyzer.

You can avoid all this entirely of course by not extending Analyzer but instead 
just instantiating a PerFieldAnalyerWrapper instance directly instead of your 
MyPerFieldAnalyzer.

On Wed, Sep 26, 2012 at 12:25 PM, Mike O'Leary  wrote:

> Hi Chris,
> In a nutshell, my question is, what should I put in place of ??? to 
> make this into a Lucene 4.0 analyzer?
>
> public class MyPerFieldAnalyzer extends Analyzer {
>   PerFieldAnalyzerWrapper _analyzer;
>
>   public MyPerFieldAnalyzer() {
> Map<String, Analyzer> analyzerMap = new HashMap<String, Analyzer>();
>
> analyzerMap.put("IDNumber", new KeywordAnalyzer());
> ...
> ...
>
> _analyzer = new PerFieldAnalyzerWrapper(new CustomAnalyzer(),  
> analyzerMap);
>   }
>
>   @Override
>   public TokenStreamComponents createComponents(String fieldname, 
> Reader
> reader) {
> Tokenizer source = ???;
> TokenStream stream = _analyzer.tokenStream(fieldname, reader);
> return new TokenStreamComponents(source, stream);
>   }
> }
>
> I must be missing something obvious. Can you tell me what it is?
> Thanks,
> Mike
>
> -Original Message-
> From: Chris Male [mailto:gento...@gmail.com]
> Sent: Tuesday, September 25, 2012 5:18 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 4.0 PerFieldAnalyzerWrapper question
>
> Hi Mike,
>
> I don't really understand what problem you're having.
>
> PerFieldAnalyzerWrapper, like all AnalyzerWrappers, uses 
> Analyzer.PerFieldReuseStrategy which means it caches the 
> TokenStreamComponents per field.  The TokenStreamComponents cached are 
> created by by retrieving the wrapped Analyzer through
> AnalyzerWrapper.getWrappedAnalyzer(Field) and calling createComponents.
>  In PerFieldAnalyzerWrapper, getWrappedAnalyzer pulls the Analyzer 
> from the Map you provide.
>
> Consequently to use your custom Analyzers and KeywordAnalyzer, all you 
> need to do is define your custom Analyzer using the new Analyzer API 
> (that is using TokenStreamComponents), create your Map from that 
> Analyzer and KeywordAnalyzer and pass it into PerFieldAnalyzerWrapper.  
> This seems to be what you're doing in your code sample.
>
> Are you able to expand on the problem you're encountering?
>
> On Wed, Sep 26, 2012 at 11:57 AM, Mike O'Leary  wrote:
>
> > I am updating an analyzer that uses a particular configuration of 
> > the PerFieldAnalyzerWrapper to work with Lucene 4.0. A few of the 
> > fields use a custom analyzer and StandardTokenizer and the other 
> > fields use the KeywordAnalyzer and KeywordTokenizer. The older 
> > version of the analyzer looks like this:
> >
> > public class MyPerFieldAnalyzer extends Analyzer {
> >   PerFieldAnalyzerWrapper _analyzer;
> >
> >   public MyPerFieldAnalyzer() {
> > Map<String, Analyzer> analyzerMap = new HashMap<String, Analyzer>();
> >
> > analyzerMap.put("IDNumber", new KeywordAnalyzer());
> > ...
> > ...
> >
> > _analyzer = new PerFieldAnal

RE: Lucene 4.0 PerFieldAnalyzerWrapper question

2012-09-26 Thread Mike O'Leary
Hi Chris,
So it sounds like instead of defining a new class that gets instantiated to 
create an analyzer, I could just do this:

public class MyPerFieldAnalyzer {
  public static Analyzer getMyPerFieldAnalyzer() {
    Map<String, Analyzer> analyzerMap = new HashMap<String, Analyzer>();

    analyzerMap.put("IDNumber", new KeywordAnalyzer());
    ...
    ...

    return new PerFieldAnalyzerWrapper(new CustomAnalyzer(), analyzerMap);
  }
}

Which is much simpler than all of the things I was thinking I would need to do.
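
At indexing time it then just drops into the writer config as usual, e.g. (4.0 
API; the directory variable is hypothetical):

Analyzer analyzer = MyPerFieldAnalyzer.getMyPerFieldAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter writer = new IndexWriter(directory, config);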
Thanks very much,
Mike

-Original Message-
From: Chris Male [mailto:gento...@gmail.com] 
Sent: Tuesday, September 25, 2012 6:32 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 4.0 PerFieldAnalyzerWrapper question

Mike,

On Wed, Sep 26, 2012 at 1:05 PM, Mike O'Leary  wrote:

> Hi Chris,
> So if I change my analyzer to inherit from AnalyzerWrapper, I need to 
> define a getWrappedAnalyzer function and a wrapComponents function. I 
> think getWrappedAnalyzer is straightforward, but I don't understand 
> who is calling wrapComponents and for what purpose, so I don't know 
> how to define it. This is my modified analyzer code with ??? in the 
> places I don't know how to define.
> Thanks,
> Mike
>
> public class MyPerFieldAnalyzer extends AnalyzerWrapper {
>   Map _analyzerMap = new HashMap();
>   Analyzer _defaultAnalyzer;
>
>   public MyPerFieldAnalyzer() {
> _analyzerMap.put("IDNumber", new KeywordAnalyzer());
> ...
> ...
>
> _defaultAnalyzer = new CustomAnalyzer();
>   }
>
>   @Override
>   protected Analyzer getWrappedAnalyzer(String fieldName) {
> Analyzer analyzer;
>
> if (analyzerMap.containsKey(fieldName) {
>   analyzer = analyzerMap.get(fieldName);
> } else {
>   analyzer = defaultAnalyzer;
> }
>   }
>

I'm not sure if you have missed it but PerFieldAnalyzerWrapper supports having 
a default Analyzer.


>
>   @Override
>   public TokenStreamComponents wrapComponents(String fieldname,  
> TokenStreamComponents components) {
> Tokenizer tokenizer = ???;
> TokenStream tokenStream = ???;
> return new TokenStreamComponents(tokenizer, tokenStream);
>   }
> }
>

wrapComponents is useful for when you need to change the components retrieved 
from the wrapped Analyzer.  Adding a new Tokenizer or TokenFilter for example.  
But you don't need to do this, and can just return the components parameter 
unchanged.
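
In other words, a minimal pass-through override is enough here (sketch):

@Override
protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components) {
  // no extra Tokenizer or TokenFilter needed: hand back what the wrapped Analyzer built
  return components;
}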


>
> -Original Message-
> From: Chris Male [mailto:gento...@gmail.com]
> Sent: Tuesday, September 25, 2012 5:34 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 4.0 PerFieldAnalyzerWrapper question
>
> Ah I see.
>
> The problem is that we don't really encourage wrapping of Analyzers.  
> Your Analyzer wraps a PerFieldAnalyzerWrapper consequently it needs to 
> extend AnalyzerWrapper, not Analyzer.  AnalyzerWrapper handles the 
> createComponents call and just requires you to give it the Analyzer(s) 
> you've wrapped through getWrappedAnalyzer.
>
> You can avoid all this entirely of course by not extending Analyzer 
> but instead just instantiating a PerFieldAnalyerWrapper instance 
> directly instead of your MyPerFieldAnalyzer.
>
> On Wed, Sep 26, 2012 at 12:25 PM, Mike O'Leary  wrote:
>
> > Hi Chris,
> > In a nutshell, my question is, what should I put in place of ??? to 
> > make this into a Lucene 4.0 analyzer?
> >
> > public class MyPerFieldAnalyzer extends Analyzer {
> >   PerFieldAnalyzerWrapper _analyzer;
> >
> >   public MyPerFieldAnalyzer() {
> > Map<String, Analyzer> analyzerMap = new HashMap<String, Analyzer>();
> >
> > analyzerMap.put("IDNumber", new KeywordAnalyzer());
> > ...
> > ...
> >
> > _analyzer = new PerFieldAnalyzerWrapper(new CustomAnalyzer(), 
> > analyzerMap);
> >   }
> >
> >   @Override
> >   public TokenStreamComponents createComponents(String fieldname, 
> > Reader
> > reader) {
> > Tokenizer source = ???;
> > TokenStream stream = _analyzer.tokenStream(fieldname, reader);
> > return new TokenStreamComponents(source, stream);
> >   }
> > }
> >
> > I must be missing something obvious. Can you tell me what it is?
> > Thanks,
> > Mike
> >
> > -Original Message-
> > From: Chris Male [mailto:gento...@gmail.com]
> > Sent: Tuesday, September 25, 2012 5:18 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Lucene 4.0 PerFieldAnalyzerWrapper question
> >
> > Hi Mike,
> >
> > I don't really understand what problem you're having.
> >
> > PerFieldAnalyzerWrapper, like 

proposed change to CharTokenizer

2010-10-14 Thread Mike Sokolov
Background: I've been trying to enable hit highlighting of XML documents 
in such a way that the highlighting preserves the well-formedness of the 
XML.


I thought I could get this to work by implementing a CharFilter that 
extracts text from XML (somewhat like HTMLStripCharFilter, except I am 
using an XML parser - however I think the concept is also applicable to 
HTMLStripCharFilter) while preserving the offsets of the text in the 
original XML document so as to enable highlighting.


I ran into a problem in CharTokenizer.incrementToken(), which calls 
correctOffset() as follows:


offsetAtt.setOffset(correctOffset(start), correctOffset(start+length));

The issue is that the end offset is computed as the offset of the 
beginning of the *next* block of text rather than the offset of the end 
of *this* block of text.


In my test case:

bold text regular text

I get tokens like this ([] showing token boundaries):

   [bold] [text][regular][text]

instead of:

   [bold][text][regular][text]

I don't think this problem can be fixed by jiggling offsets, or indeed 
by wrapping or extending CharTokenizer in any straightforward way.  The 
fix I found is to change the line in CharTokenizer.incrementToken() to:


offsetAtt.setOffset(correctOffset(start), 
correctOffset(start+length-1)+1);


Again, conceptually, this computes the corrected offset of the last 
character in the token, and then marks the end of the token as the 
immediately following position, rather than including all the garbage 
characters in between the end of this token and the beginning of the next.


My impression is that this change should be completely 
backwards-compatible since its behavior will be identical for 
CharFilters that don't actually perform character deletion, and AFAICT 
the only existing CharFilter performs replacements and expansions (of 
ligatures and the like).  But my knowledge of Lucene is far from 
comprehensive.

Does this seem like a reasonable patch?

-Mike

Michael Sokolov
Engineering Director
www.ifactory.com
@iFactoryBoston

PubFactory: the revolutionary e-publishing platform from iFactory


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Where does Lucene recognise it has encountered a new term for the first time?

2010-12-15 Thread Mike Cawson
I’m using Lucene to index database records and text documents.

I want to provide efficient fuzzy queries over the data so I’m using a 
secondary 
Lucene index for all of the distinct terms encountered in the primary index.

Each ‘document’ in the secondary index is a term from the primary index with 
fields for its q-grams, phonetic key(s) and synonyms.

It’s easy to populate the secondary index after indexing all of the records and 
text documents using an IndexReader. However, to keep the secondary index up to 
date I need to recognise when new terms are encountered for the first time, but 
even looking deep into Lucene code and stepping through the indexing process 
hasn’t revealed where this occurs – I presume because it doesn’t happen in a 
single place but rather once in the in-memory term cache, once when the cache 
is 
flushed into a segment, and again when segments are optimised.

Is this correct? Can anyone suggest how to maintain a secondary index of terms? 
Perhaps only when the main index is optimised?

Thanks, Mike




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Scoring problem with MultiPhraseQuery?

2010-12-15 Thread Mike Cawson
I'm using MultiPhraseQuery to implement a fuzzy phrase query.

E.g. the user enters "blue lorry" and I expand 'blue' to 'turquoise' and 'glue', 
and 'lorry' to 'truck', 'van', 'lory' and 'lorrie'. I can then construct a 
MultiPhraseQuery with those lists of terms.
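
For reference, the construction looks roughly like this (the "body" field name 
is made up):

MultiPhraseQuery query = new MultiPhraseQuery();
query.add(new Term[] { new Term("body", "blue"), new Term("body", "turquoise"),
                       new Term("body", "glue") });
query.add(new Term[] { new Term("body", "lorry"), new Term("body", "truck"),
                       new Term("body", "van"), new Term("body", "lory"),
                       new Term("body", "lorrie") });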

The search works correctly, but the score is always the total number of terms 
(N) that I put into the MultiPhraseQuery (N=8 in this example)!

I've tried using a boost of 1/N but the boost appears to be ignored.

I can't think of a reason why this should be intentional behaviour, so I assume 
there's a bug.

I'm using Lucene 3.0.

Thanks,
Mike Cawson


   

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-21 Thread mike anderson
[x] ASF Mirrors (linked in our release announcements or via the Lucene
website)

[] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)

[x] I/we build them from source via an SVN/Git checkout.

[] Other (someone in your company mirrors them internally or via a
downstream project)


On Tue, Jan 18, 2011 at 4:04 PM, Grant Ingersoll wrote:

> As devs of Lucene/Solr, due to the way ASF mirrors, etc. works, we really
> don't have a good sense of how people get Lucene and Solr for use in their
> application.  Because of this, there has been some talk of dropping Maven
> support for Lucene artifacts (or at least make them external).  Before we do
> that, I'd like to conduct an informal poll of actual users out there and see
> how you get Lucene or Solr.
>
> Where do you get your Lucene/Solr downloads from?
>
> [] ASF Mirrors (linked in our release announcements or via the Lucene
> website)
>
> [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [] I/we build them from source via an SVN/Git checkout.
>
> [] Other (someone in your company mirrors them internally or via a
> downstream project)
>
> Please put an X in the box that applies to you.  Multiple selections are OK
> (for instance, if one project uses a mirror and another uses Maven)
>
> Please do not turn this thread into a discussion on Maven and it's
> (de)merits, I simply want to know, informally, where people get their JARs
> from.  In other words, no discussion is necessary (we already have that
> going on d...@lucene.apache.org which you are welcome to join.)
>
> Thanks,
> Grant


Re: QueryValidator

2011-05-05 Thread Mike Sokolov
It's an idea - sorry I don't have an implementation I can share easily; 
it's embedded in our application code and not easy to refactor.  I'm not 
sure where this would fit in the solr architecture; maybe some subclass 
of SearchHandler?  I guess the query rewriter would need to be aware of 
which parser it's trying to avoid errors in.  In our case, we have a 
limited case where we always use a single parser, but I think solr 
exposes a pluggable extensible architecture with a lot of different 
parsers, so a more general solution will be more complex, and I don't 
have it :)
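
The gist of what we do is just catch-and-retry, though; a stripped-down sketch 
of option (3) from the message quoted below, using QueryParser.escape (class 
and variable names are just placeholders):

Query parse(QueryParser parser, String userInput) {
  try {
    return parser.parse(userInput);
  } catch (ParseException e) {
    try {
      // quote/escape all special punctuation and try once more
      return parser.parse(QueryParser.escape(userInput));
    } catch (ParseException e2) {
      return null;  // or report the error to the user (option 1)
    }
  }
}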


-Mike


On 05/05/2011 10:00 AM, Bernd Fehling wrote:

Hi Michael

sounds excellent to me.

Is it a QParserPlugin or what is it?

Regards
Bernd



Am 05.05.2011 14:05, schrieb Michael Sokolov:
In our applications, we catch ParseException and then take one of the 
following actions:


1) report an error to the user
2) rewrite the query, stripping all punctuation, and try again
3) rewrite the query, quoting all punctuation, and try again

would that work for you?

On 5/5/2011 3:26 AM, Bernd Fehling wrote:

Dear list,

I need a QueryValidator and don't mind writing one but don't want
to reinvent the wheel in case there is already something.

Is this the right list for talking about a QueryValidator or
should it belong to SOLR?

What do I mean with a QueryValidator?
I think about something like validating the query before or after 
parsing it.

Currently invalid queries [e.g. text:(:foo AND bar) ] throw exceptions
which pop up to the top. Not only that they show up in the logs
(which is good) they also give unuseful result page to jetty (which 
is bad).

And they also waste time for searching what can't be searched.

What should the QueryValidator do?
- check the query against the searchable fields of the schema 
(validate it)

- give options of fallback strategies
-- let it through as raw
-- remove specific chars (e.g. all ":" which have not a valid search 
field before)

-- ...
- in case of an invalid query don't try to start a search but give a 
clean no-hit-page

with a status and the cause

Actually it must be located somewhere around the parser.

What do think of this?

Regards,
Bernd

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: new to lucene, non standard index

2011-05-05 Thread Mike Sokolov
Are the tokens unique within a document? If so, why not store a document 
for every doc/token pair with fields:


id (doc#/token#)
doc-id (doc#)
token
weight1
weight2
frequency

Then search for token, sort by weight1, weight2 or frequency.

If the token matches are unique within a document you will only get each 
document listed once.  If they aren't unique, it's not clear what you 
want to sort by anyway....
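
A sketch of what I mean, one Lucene document per (doc, token) pair, reusing the 
field names from the list above (3.0 API; the writer/searcher variables and the 
weight values are assumed to come from your own loop):

// index side: one small document per (source doc, token) pair
Document pair = new Document();
pair.add(new Field("id", docId + "/" + token, Field.Store.YES, Field.Index.NOT_ANALYZED));
pair.add(new Field("doc-id", docId, Field.Store.YES, Field.Index.NOT_ANALYZED));
pair.add(new Field("token", token, Field.Store.NO, Field.Index.NOT_ANALYZED));
NumericField w1 = new NumericField("weight1", Field.Store.YES, true);
w1.setFloatValue(weight1);
pair.add(w1);
// ... same pattern for weight2 and frequency ...
writer.addDocument(pair);

// query side: a single term query on "token", sorted by whichever stat you want
TopDocs hits = searcher.search(new TermQuery(new Term("token", q)), null, 10,
    new Sort(new SortField("weight1", SortField.FLOAT, true)));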


-Mike

On 05/05/2011 04:12 PM, Chris Schilling wrote:

Hi,

I am trying to figure out how to solve this problem:

I have about 500,000 files that I would like to index, but the files are 
structured.  So, each file has the following layout:

doc1
token1, weight11, frequency1, weight21
token2, weight12, frequency2, weight22
.
.
.

etc for 500,000 docs.

Basically, I would like to index the tokens for each doc.  When I search for a 
token, I would like to be able to return the top docs sorted by weight1, 
frequency, or weight2.

So, in my naive setup, I loop through the files in the directory, then I loop 
through the lines of the file.   In side of the loop through each file, I call 
this function:

public Document processKeywords(Document doc, String keyword, Float 
weight1, Float weight2, Integer frequency) throws Exception {
Document doc = new Document();
doc.add(new Field("keywords", keyword, Field.Store.NO, 
Field.Index.ANALYZED));
doc.add(new NumericField(keyword+"weight1", 
Field.Store.YES, true).setFloatValue(weight1));   
doc.add(new NumericField(keyword+"weight2", 
Field.Store.YES, true).setFloatValue(weight2));   
doc.add(new NumericField(keyword+"frequency", 
Field.Store.YES, true).setFloatValue(frequency));   
return doc;
}

So, for each token, I create 3 new fields each time. Notice how I am trying to index the 
keyword in the "keywords" field.  For the weights and frequency, I create a new 
field with a name based on the keyword.  On average, I have 100 tokens per document, so 
each document will have about 300 distinct fields.

When running my program, the lucene portion eats up tons of memory and when it 
gets to the max alloted by the JVM (I have tried allowing up to 4 Gb), the 
program slows to a crawl.  I assume it is spending all of its time in garbage 
collection due to all these fields.

My code above seems like a very hacky way of accomplishing what I want (sorting 
documents based on keyword search using different numeric fields associated 
with that keyword).

FYI, here is the main search code, where q is the token I am searching for and sortby is 
the field I want to use to sort.  I setup a QP to search for the keyword in the 
"keywords" field.  Then, I can extract the stats that I indexed for the given 
query keyword.

private static final QueryParser parser = new QueryParser(Version.LUCENE_30, 
"keywords", new StandardAnalyzer(Version.LUCENE_30));

public void search(String q, String sortby) throws IOException, 
ParseException {
Query query = parser.parse(q);
long start = System.currentTimeMillis();
TopDocs hits = this.is.search(query, null, 10, new Sort(new 
SortField(q+"sortby", SortField.FLOAT, true)));
long end = System.currentTimeMillis();
System.out.println("Found " + hits.totalHits +
" document(s) (in " + (end - start) +
" milliseconds) that matched query '" +
q + "':");
for(ScoreDoc scoreDoc : hits.scoreDocs) {
Document doc = this.is.doc(scoreDoc.doc);
String hash = doc.get("hash");
System.out.println(hash + " " + doc.get(q+"sortby") + " 
" + hash);
}
}

I am pretty new to Lucene, so I hope this makes sense.  I tried to pare my 
problem down as much as possible.  Like I said, the main problem I am running 
into is that after processing about 3 documents, the indexing slows to a 
crawl and seems to spend all of its time in the garbage collector.  I am 
looking for a more efficient/effective way of solving this problem.  Code 
tidbits would help, but are not necessary :)

Thanks for your help,
Chris S.
   


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: new to lucene, non standard index

2011-05-05 Thread Mike Sokolov
I think the solution I gave you will work.  The only problem is if a 
token appears twice in the same doc:


doc1 has foo with two different sets of weights and frequencies...

but I think you're saying that doesn't happen

On 05/05/2011 06:09 PM, Chris Schilling wrote:

Hey Mike,

Let me clarify:

The tokens are not unique.  Let's say doc1 contains the token
foo and has the properties weight1 = 0.75, weight2 = 0.90, frequency = 10

Now, let's say doc2 also contains the token
foo with properties: weight1 = 0.8, weight2 = 0.75, frequency = 5

Now, I want to search for all the documents that contain foo, but I want them 
sorted by frequency.

Then, I would have doc1, doc2.

Now, I want to search for all the documents that contain foon, but I want them 
sorted by weight1.
Then, I would have doc2, doc1

Does that clarify?


On May 5, 2011, at 3:01 PM, Mike Sokolov wrote:

   

Are the tokens unique within a document? If so, why not store a document for 
every doc/token pair with fields:

id (doc#/token#)
doc-id (doc#)
token
weight1
weight2
frequency

Then search for token, sort by weight1, weight2 or frequency.

If the token matches are unique within a document you will only get each 
document listed once.  If they aren't unique, it's not clear what you want to 
sort by anyway

-Mike

On 05/05/2011 04:12 PM, Chris Schilling wrote:
 

Hi,

I am trying to figure out how to solve this problem:

I have about 500,000 files that I would like to index, but the files are 
structured.  So, each file has the following layout:

doc1
token1, weight11, frequency1, weight21
token2, weight12, frequency2, weight22
.
.
.

etc for 500,000 docs.

Basically, I would like to index the tokens for each doc.  When I search for a 
token, I would like to be able to return the top docs sorted by weight1, 
frequency, or weight2.

So, in my naive setup, I loop through the files in the directory, then I loop 
through the lines of the file.   In side of the loop through each file, I call 
this function:

public Document processKeywords(Document doc, String keyword, Float 
weight1, Float weight2, Integer frequency) throws Exception {
Document doc = new Document();
doc.add(new Field("keywords", keyword, Field.Store.NO, 
Field.Index.ANALYZED));
doc.add(new NumericField(keyword+"weight1", 
Field.Store.YES, true).setFloatValue(weight1));   
doc.add(new NumericField(keyword+"weight2", 
Field.Store.YES, true).setFloatValue(weight2));   
doc.add(new NumericField(keyword+"frequency", 
Field.Store.YES, true).setFloatValue(frequency));   
return doc;
}

So, for each token, I create 3 new fields each time. Notice how I am trying to index the 
keyword in the "keywords" field.  For the weights and frequency, I create a new 
field with a name based on the keyword.  On average, I have 100 tokens per document, so 
each document will have about 300 distinct fields.

When running my program, the lucene portion eats up tons of memory and when it 
gets to the max alloted by the JVM (I have tried allowing up to 4 Gb), the 
program slows to a crawl.  I assume it is spending all of its time in garbage 
collection due to all these fields.

My code above seems like a very hacky way of accomplishing what I want (sorting 
documents based on keyword search using different numeric fields associated 
with that keyword).

FYI, here is the main search code, where q is the token I am searching for and sortby is 
the field I want to use to sort.  I setup a QP to search for the keyword in the 
"keywords" field.  Then, I can extract the stats that I indexed for the given 
query keyword.

private static final QueryParser parser = new QueryParser(Version.LUCENE_30, 
"keywords", new StandardAnalyzer(Version.LUCENE_30));

public void search(String q, String sortby) throws IOException, 
ParseException {
Query query = parser.parse(q);
long start = System.currentTimeMillis();
TopDocs hits = this.is.search(query, null, 10, new Sort(new 
SortField(q+"sortby", SortField.FLOAT, true)));
long end = System.currentTimeMillis();
System.out.println("Found " + hits.totalHits +
" document(s) (in " + (end - start) +
" milliseconds) that matched query '" +
q + "':");
for(ScoreDoc scoreDoc : hits.scoreDocs) {
Document doc = this.is.doc(scoreDoc.doc);
String hash = doc.get("hash");
System.out.println(hash + " &q

Re: Sharding Techniques

2011-05-10 Thread Mike Sokolov



Down to basics, Lucene searches work by locating terms and resolving
documents from them. For standard term queries, a term is located by a
process akin to binary search. That means that it uses log(n) seeks to
get the term. Let's say you have 10M terms in your corpus. If you stored
that in a single field in a single index with a single segment, it would
take log(10M) ~= 24 seeks to locate a term. This is of course very
simplified.

When you have 63 indexes, log(n) works against you. Even with the
unrealistic assumption that the 10M terms are evenly distributed and
without duplicates, the number of seeks for a search that hits all parts
will still be 63 * log(10M/63) ~= 63 * 18 = 1134. And we haven't even
begun to estimate the merging part.
This is true, but if the indexes are kept on 63 separate servers, those 
seeks will be carried out in parallel.  The OP did indicate his indexes 
would be on different servers, I think?  I still agree with your overall 
point - at this scale a single server is probably best.  And if there 
are performance issues, I think the usual approach is to create multiple 
mirrored copies (slaves) rather than sharding.  Sharding is useful for 
very large indexes: indexes too big to store on disk and cache in memory 
on one commodity box.


-Mike

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



highlighting performance

2011-06-20 Thread Mike Sokolov
Our apps use highlighting, and I expect that highlighting is an 
expensive operation since it requires processing the text of the 
documents, but I ran a test and was surprised just how expensive it is.  
I made a test index with three fields: path, modified, and contents.  I 
made the index using org.apache.lucene.demo.IndexFiles modified so that 
the contents field is stored and analyzed:


  doc.add(new Field("contents", false, buf.toString(),
  Store.YES, Index.ANALYZED, 
TermVector.WITH_POSITIONS_OFFSETS));


There are about 8000 documents in the index, and the contents field 
averages around 7500 bytes.  The total index directory size is about 242M.


I ran a modified version of the demo.SearchFiles class that doesn't 
print anything out (printing results takes most of the time for faster 
queries), and runs random queries drawn from the text of the documents: 
these are a mix of (mostly) term queries, and about 20% phrase queries 
(that are phrases from the text).


I compared a few cases: no field access, un-highlighted retrieval, 
highlighting, Highlighter and FastVectorHighlighter, always asking for 
10 top scoring docs per query, and running at least 1000 queries for 
each case.


No field access at all gets about 7000 qps; basically we just call 
searcher.search(query, 10)


Then there is a big cost for retrieving the stored documents from the index:

Retrieving each document (calling searcher.doc(docID)) and the path field 
only (a small field) gets about 250 qps


As a comparison, if I don't store the contents field in the index (and 
don't retrieve it at all), I get similar performance to the no retrieval 
case (around 7000 qps).  OK - so there is a fair amount of I/O required 
to retrieve the stored doc; this may be unavoidable, although do 
consider that for highlighting only a small portion of the doc may 
ultimately be required.


Then another big penalty is paid for highlighting:

Highlighter gets about 60 qps

And finally I am really mystified about this one:

FastVectorHighlighter gets about 20 qps. There is a lot of variance here 
(say 9-44 qps), although always worse than Highlighter.


If these results hold up I'll be astonished, since they imply:

(1) FVH is not fast
(2) Highlighting consumes most processing time  (around 80%) in the best 
case, as compared to just retrieving un-highlighted documents.


and the follow on is that at least for users that need highlighting, 
there is hardly any point in optimizing anything else!


I thought maybe FVH required a lot of memory, so I changed the -Xmx512m 
(from the default: 64m I think), but this had no effect.


I also tried optimizing the index, and although this improved query 
performance somewhat across the board, it actually accentuated the cost 
of highlighting since the most marked improvement was in the basic 
unhighlighted query.


Here is what the highlighting looks like:

For FVH we allocate a single SimpleFragsListBuilder, 
SimpleFragmentBuilder, preTags[1], postTags[1] and DefaultEncoder so 
these don't have to be created for each query. We also cache the 
FastVectorHighlighter itself, and we call:


highlighter.getBestFragment(highlighter.getFieldQuery(query), 
searcher.getIndexReader(), hits[i].doc, "contents", 40, flb, fb, 
preTags, postTags, encoder);


once for each result.

In the Highlighter case, we also cache the Highlighter and call:

highlighter.getBestFragment(analyzer, "contents", doc.get("contents"));

does this performance profile match up with your expectations?  Did I do 
something stupid? Please let me know if I can provide more info.  I'm 
considering what can be done to speed up highlighting, but don't want to 
go off half-cocked..


--
Michael Sokolov
Engineering Director
www.ifactory.com


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Advanced NearSpanQuery

2011-07-13 Thread Mike Sokolov
Can you wrap a SpanNearQuery around a DisjunctionSumQuery with 
minNrShouldMatch=8?


-Mike

On 07/13/2011 08:53 AM, Jeroen Lauwers wrote:

Hi,

I was wondering if anyone could help me on this:

I want to search for:

1.   a set of words (eg. 10)

2.   only a couple of words may come in between (eg. 3) in the result 
document

3.   of the supplied set of (10) words, at least 8 must be present (or in 
other words: 2 of the supplied words can be missing)

I use the SpanNearQuery for (1.) and (2.), but it is the third part that's 
lacking.

Any ideas?

Jeroen

   


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Advanced NearSpanQuery

2011-07-13 Thread Mike Sokolov

Sorry for the misdirection ...

On 07/13/2011 11:37 AM, Simon Willnauer wrote:

I don't think this is possible with spans today. Once
https://issues.apache.org/jira/browse/LUCENE-2878 is due this should
be possible with a boolean query I think.

to work around this you need to write a SpanOR query with a
minShouldMatch functionality though.

simon

On Wed, Jul 13, 2011 at 5:09 PM, Jeroen Lauwers  wrote:
   

Hi Mike,

Thanks for your quick reply, but do not seem to find any documentation on 
"DisjunctionSumQuery" and I'm not familiar with that concept.

Could you point me in the right direction?

Jeroen

-Original Message-
From: Mike Sokolov [mailto:soko...@ifactory.com]
Sent: woensdag 13 juli 2011 15:23
To: java-user@lucene.apache.org
Cc: Jeroen Lauwers
Subject: Re: Advanced NearSpanQuery

Can you wrap a SpanNearQuery around an DisjunctionSumQuery with 
minNrShouldMatch=8?

-Mike

On 07/13/2011 08:53 AM, Jeroen Lauwers wrote:
 

Hi,

I was wondering if anyone could help me on this:

I want to search for:

1.   a set of words (eg. 10)

2.   only a couple of words may come in between (eg. 3) in the result 
document

3.   of the supplied set of (10) words, at least 8 must be present (or in 
other words: 2 of the supplied words can be missing)

I use the SpanNearQuery for (1.) and (2.), but it is the third part that's 
lacking.

Any ideas?

Jeroen


   

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


 

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

   


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: An incorrect sentence in Javadoc at o.a.l.queryparser.surround.parser?

2014-12-04 Thread Mike Drob
I believe this is already filed as
https://issues.apache.org/jira/browse/SOLR-4572

Getting the wiki page fixed would be great as well, though!

On Wed, Dec 3, 2014 at 7:44 PM, Shinichiro Abe 
wrote:

> Hi,
>
> That Javadoc says "N is ordered, and W is unordered."
>
> https://github.com/apache/lucene-solr/blob/trunk/lucene/queryparser/src/java/org/apache/lucene/queryparser/surround/parser/QueryParser.java#L39
>
> "W is ordered, and N is unordered."
> I think this is correct because WQuery() returns ordered SrndQuery and
> NQuery() returns not-ordered SrndQuery.
>
> May I file a Jira?
>
> I know a solr wiki page (and some slides on the Web) which are copied from
> this Javadoc.
> I'd like to fix the solr wiki page.
>
> Regards,
> Shinichiro Abe
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Spatial Search with Nested Polygons

2015-03-26 Thread Mike Hansen
I was wondering about the feasibility / difficultly of implementing a
solution to the following problem with Lucene.

For each document, I have a series of nested polygons each associated
with a numerical value.  My search query gives a point, and I want to
return all of the documents whose largest polygon contains the point
(that part is easy).  Additionally, I'd like to have access to the
numerical value of the smallest polygon which contains the point
(something like makeDistanceValueSource).

Thanks,
--Mike

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Spatial Search with Nested Polygons

2015-03-26 Thread Mike Hansen
On Thu, Mar 26, 2015 at 9:06 PM, david.w.smi...@gmail.com
 wrote:
> The second, (non-easy) part seems like it could be pretty slow:
>
> To determine “the smallest polygon which contains the point” for the
> current matching document, you’d have to iterate over them in
> smallest-to-largest-1 order and check containment, so that you know which
> corresponding value to return.  There will be a performance hit for sure.

There are a few things which could probably help with performance.
Each document has only around say 30 polygons. You could do a binary
search which would help reduce the cost. Additionally, I have a
distinguished point contained inside of all the nested polygons so I
can pre-compute the minimum and maximum distances from that point to
the edge of the polygon and use that to also reduce the number of
containment checks to do.  I expect that there will be on the order of 
500-1000 documents considered for each search.

> This sounds like a custom ValueSource/FunctionValues that does that logic…
> perhaps by grabbing the shapes from SerializedDVStrategy’s shape providing
> ValueSource.  If you provide the shapes using a Spatial4j ShapeCollection
> with the order from biggest to smallest, you can know the index of which
> shape matches, and then pull the i-th numeric value you need from a list of
> numbers in BinaryDocValues.  The largest shape could be kept out of here
> since you don’t need it.

Thanks -- this is very helpful.

--Mike

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Accent insensitive search for greek characters

2017-09-27 Thread Mike Sokolov
These are only used in classical Greek I think, explaining probably why they 
are not covered by the simpler filter.
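
For reference, the filter just slots into an analysis chain; a minimal sketch 
(assumes the lucene-analysis-icu module is on the classpath, and the tokenizer 
choice is only an example):

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    TokenStream sink = new ICUFoldingFilter(source);  // folds Greek (and other) diacritics
    return new TokenStreamComponents(source, sink);
  }
};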

On September 27, 2017 9:48:37 AM EDT, Ahmet Arslan  
wrote:
>I may be wrong about ASCIIFoldingFilter. Please go with the
>ICUFoldingFilter.
>Ahmet
>On Wednesday, September 27, 2017, 3:47:01 PM GMT+3, Chitra
> wrote:  
> 
> Hi Ahmet,                      Thank you so much for the reply.
>
>I have tried but it seems, ASCIIFoldingFilter is not supporting greek
>accent characters and it supports only Latin like accent characters. Am
>I missing anything?
>
>
>
>Chitra
>
>
>
>On Wed, Sep 27, 2017 at 5:47 PM, Ahmet Arslan 
>wrote:
>
>
>
>Hi,
>Yes ICUFoldingFilter or ASCIIFoldingFilter could be used.
>ahmet 
>
> 
> 
>On Wednesday, September 27, 2017, 1:54:43 PM GMT+3, Chitra
> wrote: 
>
>
>
>
>
>Hi,
>In Lucene, I want to search Greek characters accent-insensitively, by
>removing or replacing accent marks with similar characters.
>
>Example: we are trying to convert Greek Extended characters to basic
>Greek Unicode to provide an accent-insensitive search...
>
>
>Kindly suggest a good way to achieve this. Does ICUFoldingFilter solve
>my use-case?
>
>-- 
>Regards,
>Chitra
>
>
>
>
>
>-- 
>Regards, Chitra

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: FunctionValues vs DoubleValuesSource

2017-10-13 Thread Mike Sokolov
Oh thanks Alan, that's a good suggestion, but I already wrote max and sum
DoubleValuesSources since it was easy enough. If you think that's a good
approach I could post a patch.
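
For the archive, the expressions route would look roughly like this (dvsA
and dvsB stand in for the DoubleValuesSource instances I already have;
method names as I read them from the expressions module):

import org.apache.lucene.expressions.Expression;
import org.apache.lucene.expressions.SimpleBindings;
import org.apache.lucene.expressions.js.JavascriptCompiler;
import org.apache.lucene.search.DoubleValuesSource;

SimpleBindings bindings = new SimpleBindings();
bindings.add("a", dvsA);   // my existing DoubleValuesSource objects
bindings.add("b", dvsB);

// combine them with a javascript expression; compile() throws a checked
// ParseException, so this has to live in a method that handles it
Expression expr = JavascriptCompiler.compile("max(a, b)");
DoubleValuesSource combined = expr.getDoubleValuesSource(bindings);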

On October 13, 2017 3:57:30 AM EDT, Alan Woodward  wrote:
>Hi,
>
>Yes, moving stuff over to DoubleValuesSource is only half done at the
>moment, unfortunately!
>
>Can you use the expressions module to do what you want?  The
>SimpleBindings class allows you to map arbitrary DoubleValuesSource
>objects to specific names, and then you can combine them using
>javascript functions.
>
>Alan Woodward
>www.flax.co.uk
>
>
>> On 12 Oct 2017, at 23:25, Michael McCandless
> wrote:
>> 
>> Hi Mike,
>> 
>> It looks like FunctionValues is a very old API used by many function
>> queries, while DoubleValuesSource is relatively new (introduced in
>> https://issues.apache.org/jira/browse/LUCENE-5325).
>> 
>> This comment (
>>
>https://issues.apache.org/jira/browse/LUCENE-5325?focusedCommentId=15235324&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15235324)
>> on the issue seems to refer to wrapper classes to convert between the
>old
>> and new APIs?
>> 
>> I admit the situation is rather confusing; but we've been gradually
>working
>> on cutting over modules to the new API.  Patches welcome!
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>> 
>> On Tue, Oct 10, 2017 at 3:39 PM, Sokolov, Michael
>
>> wrote:
>> 
>>> Hi, I'm trying to implement a complex set of values computed
>according to
>>> some externally-driven specification, so I am looking at these APIs.
>My
>>> question is whether there is any way to mix them. I have implemented
>some
>>> DoubleValuesSources and now I want combine them using sum, max, etc.
>I
>>> noticed these handy classes over in
>o.a.l.queries.function.valuesource,
>>> but they seem to be of a different flavor than the DVS API is
>designed for.
>>> EG DVS is kind of an iterative API while the functions appear to be
>random
>>> access (you pass them a docid). I could code up my own DVS for
>functions
>>> like max, sum and so on, but I wonder if there is some kind of
>adapter, or
>>> at least a reasonable strategy that would let one mix these apis?
>>> 
>>> -Mike
>>> 
>>> 
>>>
>-
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>> 
>>> 

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

RE: run in eclipse error

2017-10-17 Thread Mike Sokolov
Checkstyle has a OneTopLevelClass rule that would enforce this.
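
i.e. something along these lines in the checkstyle config, under the
TreeWalker section (module name per the Checkstyle docs):

<module name="TreeWalker">
  <!-- fail when a source file declares more than one top-level class -->
  <module name="OneTopLevelClass"/>
</module>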

On October 17, 2017 3:45:01 AM EDT, Uwe Schindler  wrote:
>Hi,
>
>this has nothing to do with the Java version. I generally ignore this
>Eclipse-failure as I only develop in Eclipse, but run from command
>line. The reason for this behaviour is a problem with Eclipse's
>resource management/compiler with the way how some classes in Solr
>(especially facet component) are setup.
>
>In general, it is nowadays a no-go to have so-called "non-inner"
>pkg-private classes. These are classes which share the same source code
>file, but are not nested in the main class. Instead they appear next to
>each other in the source file. This is a relic from Java 1.0 and should
>really no longer be used!
>
>Unfortunately some Solr developers still create such non-nested
>classes. Whenever I see them I change them to be static inner classes.
>The problem with the bug caused by this is that Eclipse randomly fails
>(it depends on the order in which it compiles). The problem is that
>Eclipse (but also other tools) cannot relate the non-inner class file
>to a source file and therefore cannot figure out when it needs to be
>recompiled.
>
>BTW, the same problem applies to other build systems like javac and Ant
>when they need to compile. When you change such a non-nested class, it
>fails to compile in most cases unless you do "ant clean". The problem is
>again that the compiler cannot relate the class files to source code
>files!
>
>We should really fix those classes to be static and inner - or place
>them in separate source files. I am looking to find a solution to
>detect this with forbiddenapis or our Source Code Regexes, if anybody
>has an idea: tell me!
>
>Uwe
>
>-
>Uwe Schindler
>Achterdiek 19, D-28357 Bremen
>http://www.thetaphi.de
>eMail: u...@thetaphi.de
>
>> -Original Message-
>> From: 380382...@qq.com [mailto:380382...@qq.com]
>> Sent: Tuesday, October 17, 2017 4:43 AM
>> To: java-user 
>> Subject: run in eclipse error
>> 
>> I am trying to run Solr in Eclipse, but got the error "The type
>> FacetDoubleMerger is already defined". I don't know why. Is the JDK
>> version wrong?
>> Does git master need to use Java 9 for development?
>> 
>> 
>> 380382...@qq.com
>
>
>-
>To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>For additional commands, e-mail: java-user-h...@lucene.apache.org

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Lucene config issue cannot run demo

2017-11-10 Thread Mike Lynott
The Lucene demo calls for a file that's not provided.

On this page:
http://lucene.apache.org/core/7_1_0/demo/overview-summary.html#overview_description

we are told to run this command:


java org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}

Error message received:

Error: Could not find or load main class org.apache.lucene.demo.IndexFiles

Suggestions?
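
A sketch of the invocation with the classpath spelled out, in case the
problem is simply that the demo and core jars are not on it (the jar names
are assumptions to be checked against the 7.1.0 binary distribution; use ';'
instead of ':' as the separator on Windows):

java -cp lucene-core-7.1.0.jar:lucene-demo-7.1.0.jar:lucene-analyzers-common-7.1.0.jar \
  org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}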

Mike Lynott
Sent from Mail for Windows 10



Sample code?

2018-05-02 Thread Mike Lynott
The sample code for SynonymGraphFilterFactory is written (I assume) for
Solr. Could someone provide a Java translation? Thanks.
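
What I'm after is, I assume, something along these lines in plain Java:
building the SynonymMap programmatically and wrapping the token stream with
SynonymGraphFilter (class names from the analyzers-common synonym package;
the mappings here are made up):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;

SynonymMap.Builder builder = new SynonymMap.Builder(true);    // true = dedup
builder.add(new CharsRef("couch"), new CharsRef("sofa"), true);
builder.add(new CharsRef("tv"), new CharsRef("television"), true);
SynonymMap synonyms = builder.build();                        // throws IOException

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    // third argument is ignoreCase
    TokenStream result = new SynonymGraphFilter(source, synonyms, true);
    return new TokenStreamComponents(source, result);
  }
};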

This is what I see (the Solr schema XML example was stripped by the mailing
list archive):

Mike L


Thanks to Vincenzo D'Amore

2018-05-12 Thread Mike Lynott
I don't know how to connect to the request I mailed on 5/2 about sample
SynonymGraphFilterFactory code. I had naively expected that Vincenzo's
response would come back to my Gmail. I'd appreciate guidance on this
because I can't find
any on the Apache/Lucene site. My spam folder doesn't have a message either.

Thanks very much Vincenzo. Grazie! This will help a great deal.

Mike


Lucene API to retrieve matched words

2018-09-05 Thread Mike Grishaber
Hello All,

 

I am trying to find a way to retrieve a list of the words that matched a
query.  I'm not looking for highlighting, just a list of the words.  So if I
search for 'ski' and I match on 'skier' and 'skiis', I would like to get
back a list that includes 'skier' and 'skiis'.

Is there an API call that provides this?

 

Thanks

Mike



[ANNOUNCE] Apache Lucene 8.5.2 released

2020-05-26 Thread Mike Drob
26 May 2020, Apache Lucene™ 8.5.2 available

The Lucene PMC is pleased to announce the release of Apache Lucene 8.5.2.

Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for nearly
any application that requires full-text search, especially cross-platform.

This release contains one bug fix. The release is available for immediate
download at:

https://lucene.apache.org/core/downloads.html

Lucene 8.5.2 Bug Fixes:

   - LUCENE-9350 : Don't
   cache automata on FuzzyQuery

Please report any feedback to the mailing lists (
https://lucene.apache.org/core/discussion.html
)

Note: The Apache Software Foundation uses an extensive mirroring network for
distributing releases. It is possible that the mirror you are using may not
have
replicated the release yet. If that is the case, please try another mirror.
This also applies to Maven access.


Re: [VOTE] Lucene logo contest

2020-06-16 Thread Mike Drob
C. The current Lucene logo

Committer, not PMC

On Tue, Jun 16, 2020 at 9:31 AM Gus Heck  wrote:

> From the comments, I sense some confusion (or perhaps I was confused)...
> at least as I read the vote mail, there are 3 options and 4 links; the
> first link doesn't appear to be presented as an option, but rather as
> background info (and if you scroll down it has several variations, the last
> of which is A).
>
> On Tue, Jun 16, 2020 at 9:36 AM Ilan Ginzburg  wrote:
>
>> A is cleaner and more modern but C is a lot friendlier and "warmer" (and
>> less pretentious).
>> Depending on what the logo is expected to convey, A or C.
>>
>> On Tue, Jun 16, 2020 at 3:12 PM Dawid Weiss 
>> wrote:
>>
>>> A is nice and modern... but I still like the current logo better, so
>>> for me it's "C".
>>>
>>> Dawid
>>>
>>> On Tue, Jun 16, 2020 at 12:08 AM Ryan Ernst  wrote:
>>> >
>>> > Dear Lucene and Solr developers!
>>> >
>>> > In February a contest was started to design a new logo for Lucene [1].
>>> That contest concluded, and I am now (admittedly a little late!) calling a
>>> vote.
>>> >
>>> > The entries are labeled as follows:
>>> >
>>> > A. Submitted by Dustin Haver [2]
>>> >
>>> > B. Submitted by Stamatis Zampetakis [3] Note that this has several
>>> variants. Within the linked entry there are 7 patterns and 7 color
>>> palettes. Any vote for B should contain the pattern number, like B1 or B3.
>>> If a B variant wins, we will have a followup vote on the color palette.
>>> >
>>> > C. The current Lucene logo [4]
>>> >
>>> > Please vote for one of the three (or nine depending on your
>>> perspective!) above choices. Note that anyone in the Lucene+Solr community
>>> is invited to express their opinion, though only Lucene+Solr PMC cast
>>> binding votes (indicate non-binding votes in your reply, please). This vote
>>> will close one week from today, Mon, June 22, 2020.
>>> >
>>> > Thanks!
>>> >
>>> > [1] https://issues.apache.org/jira/browse/LUCENE-9221
>>> > [2]
>>> https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png
>>> > [3]
>>> https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf
>>> > [4]
>>> https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>


[ANNOUNCE] Apache Lucene 8.8.2 released

2021-04-12 Thread Mike Drob
The Lucene PMC is pleased to announce the release of Apache Lucene 8.8.2.

Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for nearly
any application that requires full-text search, especially cross-platform.

This release contains three bug fixes. The release is available for
immediate download at:

  

### Lucene 8.8.2 Release Highlights:

 * LUCENE-9870: Fix Circle2D intersectsLine t-value (distance) range clamp
 * LUCENE-9744: NPE on a degenerate query in
MinimumShouldMatchIntervalsSource$MinimumMatchesIterator.getSubMatches().
 * LUCENE-9762: DoubleValuesSource.fromQuery (also used by
FunctionScoreQuery.boostByQuery) could throw an exception when the query
implements TwoPhaseIterator and when the score is requested repeatedly

Please read CHANGES.txt for a full list of changes:

  


Note: The Apache Software Foundation uses an extensive mirroring network for
distributing releases. It is possible that the mirror you are using may not
have
replicated the release yet. If that is the case, please try another mirror.
This also applies to Maven access.


FacetsCollector ScoreMode

2022-03-21 Thread Mike Drob
Hey all,

I was looking into some performance issues and was a little confused about
one aspect of FacetsCollector - why does it always specify
ScoreMode.COMPLETE?

Especially for the case where we are counting facets, without collecting
the documents, it seems like we should be able to get away without scoring.
I've tested it locally and it seems to work, but I'm wondering what nuance
I am missing.

The default behaviour is keepScores == false, so I feel like we should be
able to adjust the score mode used based on that.
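
Concretely, what I tested is roughly this change inside FacetsCollector
(just a sketch, using the existing keepScores flag):

@Override
public ScoreMode scoreMode() {
  // only ask for scores when the caller actually wants them kept
  return keepScores ? ScoreMode.COMPLETE : ScoreMode.COMPLETE_NO_SCORES;
}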

Thanks,
Mike


[ANNOUNCE] Apache Lucene 8.11.2 released

2022-06-21 Thread Mike Drob
The Lucene PMC is pleased to announce the release of Apache Lucene 8.11.2.

Apache Lucene is a high-performance, full-featured text search engine library 
written entirely in Java. It is a technology suitable for nearly any 
application that requires full-text search, especially cross-platform.

This release contains numerous bug fixes, optimizations, and improvements, some 
of which are highlighted below. The release is available for immediate download 
at:

  

### Lucene 8.11.2 Release Highlights:

Bug fixes

* LUCENE-10564: Make sure SparseFixedBitSet#or updates ramBytesUsed.
* LUCENE-10477: Highlighter: WeightedSpanTermExtractor.extractWeightedSpanTerms
  now calls Query#rewrite multiple times if necessary.

Optimizations

* LUCENE-10481: FacetsCollector will not request scores if it does not use them.

Please read CHANGES.txt for a full list of changes:

  

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Fuzzy Query Similarity

2022-07-08 Thread Mike Drob
Hi folks,

I'm working with some fuzzy queries and trying my best to understand what
is the expected behaviour of the searcher. I'm not sure if this is a
similarity bug or an incorrect usage on my end.

The problem is that when I do a fuzzy search for a term "spark~", instead of
matching documents with spark first, it will match other documents that
have multiple other near terms like "spar" and "spars". I see the same
thing with both ClassicSimilarity and BM25.

This is from a much smaller (two document) index when I was trying to
isolate and reproduce the issue, but I see comparable behaviour with more
varied scoring on a much larger corpus. The two documents are:

addDoc("spark spark", writer); // exact match

addDoc("spar spars", writer); // multiple fuzzy terms

The non-zero edit distance terms get a slight down-boost, but it's not
enough to overcome their sum exceeding even the TF boost for the desired
document.

A full reproducible unit test is at
https://github.com/apache/lucene/commit/dbf8e788cd2c2a5e1852b8cee86cb21a792dc546

What is the recommended approach to get the document with exact term
matching for me again? I don't see an option to tweak the internal boost
provided by FuzzyQuery, that's one idea I had. Or is this a different
change that needs to be fixed at the lucene level rather than application
level?

Thanks,
Mike



More detail:


The first document with the field "spark spark" has a score explanation:

1.4054651 = sum of:
  1.4054651 = weight(field:spark in 0) [ClassicSimilarity], result of:
1.4054651 = score(freq=2.0), product of:
  1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
1 = docFreq, number of documents containing term
2 = docCount, total number of documents with field
  1.4142135 = tf(freq=2.0), with freq of:
2.0 = freq, occurrences of term within document
  0.70710677 = fieldNorm

And a document with the field "spar spars" comes in ever so slightly higher
at

1.5404116 = sum of:
  0.74536043 = weight(field:spar in 1) [ClassicSimilarity], result of:
0.74536043 = score(freq=1.0), product of:
  0.75 = boost
  1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
1 = docFreq, number of documents containing term
2 = docCount, total number of documents with field
  1.0 = tf(freq=1.0), with freq of:
1.0 = freq, occurrences of term within document
  0.70710677 = fieldNorm
  0.79505116 = weight(field:spars in 1) [ClassicSimilarity], result of:
0.79505116 = score(freq=1.0), product of:
  0.8 = boost
  1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
1 = docFreq, number of documents containing term
2 = docCount, total number of documents with field
  1.0 = tf(freq=1.0), with freq of:
1.0 = freq, occurrences of term within document
  0.70710677 = fieldNorm


Re: Fuzzy Query Similarity

2022-07-11 Thread Mike Drob
Hi Uwe, thanks for all the pointers!

I tried using BooleanSimilarity and the resulting scores were even more
divergent! 1.0 for the exact match vs 1.55 (= 0.8 + 0.75) for the multiple
terms that were close. That makes sense given that TF is ignored, but it still
doesn't help me down-boost the other terms.

On 2022/07/09 16:23:37 Uwe Schindler wrote:
> Hi
> > FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
> > matches, or even to incorporate the edit distance more generally into
> > the per-term score, although it does seem like that would be something
> > people would generally expect.
> 
> Actually it does this:
> 
>   * By default FuzzyQuery uses a rewrite method that expands all terms
> as should clauses into a boolean query:
> MultiTermQuery.TopTermsBlendedFreqScoringRewrite(maxExpansions)
>   * TopTermsReqrite basically keeps track of a "boost" factor for each
> term and sorts the "best" terms in a PQ:
> 
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopTermsRewrite.java#L109-L160
>   * For each collected term the term enumeration sets a boost (1.0 for
> exact match):
> 
> https://github.com/apache/lucene/blob/dd4e8b82d711b8f665e91f0d74f159ef1e63939f/lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java#L248-L256
> 

Thanks for the link to this calculation. I spent a long time trying to find it 
but kept missing.

There are some interesting things happening here in how longer terms end up
more similar. Starting from "spark" we say that "spar" is 75% similar because
it's a 4 character term that needs a single edit (1/4), and "spare" is 80%
similar because it's a 5 character term with a single edit (1/5). I don't have
enough information yet to say whether this is expected in the application or
not, but it explains how we get the scores, so there's something satisfying
about at least that bit.
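
In code form, the boost I'm describing comes out to roughly this (just the
arithmetic implied by those examples, not a copy of FuzzyTermsEnum):

// 1 - editDistance / length of the candidate term
static float fuzzyBoost(int editDistance, int candidateTermLength) {
  return 1.0f - (float) editDistance / candidateTermLength;
}

// fuzzyBoost(1, 4) -> 0.75f   ("spark" matching "spar")
// fuzzyBoost(1, 5) -> 0.8f    ("spark" matching "spare")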

As a hacky idea, I tried changing the boost in FuzzyTermsEnum from that 
computed similarity to it squared, which worked for this exact case but didn't 
keep up with adding a third fuzzy term to that competing document.

After thinking about this more, I suspect that what I really want is for 
FuzzyQuery to score as the max of any of the matching terms, rather than the 
sum? This would be a big change though. I don't know that it's fair for 
multiple approximate matches to outweigh a single exact match here. We get so 
close to what I need with TestFuzzyQuery.testSingleQueryExactMatchScoresHighest 
but it doesn't quite make it all the way.

What do you think?

> So in short the exact term gets a boost factor of 1 in the resulting 
> term query, all other terms a lower one.
> 
> Uwe
> 
> -- 
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail:u...@thetaphi.de
> 

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Fuzzy Query Similarity

2022-07-12 Thread Mike Drob
On Mon, Jul 11, 2022 at 3:36 PM Mike Drob  wrote:

> Hi Uwe, thanks for all the pointers!
>
> I tried using BooleanSimilarity and the resulting scores were even more
> divergent! 1.0 for the exact match vs 1.55 (= 0.8 + 0.75) for the multiple
> terms that were close. Which makes sense with ignoring TF but still doesn't
> help me down-boost the other terms.
>
> On 2022/07/09 16:23:37 Uwe Schindler wrote:
> > Hi
> > > FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
> > > matches, or even to incorporate the edit distance more generally into
> > > the per-term score, although it does seem like that would be something
> > > people would generally expect.
> >
> > Actually it does this:
> >
> >   * By default FuzzyQuery uses a rewrite method that expands all terms
> > as should clauses into a boolean query:
> > MultiTermQuery.TopTermsBlendedFreqScoringRewrite(maxExpansions)
> >   * TopTermsReqrite basically keeps track of a "boost" factor for each
> > term and sorts the "best" terms in a PQ:
> >
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TopTermsRewrite.java#L109-L160
> >   * For each collected term the term enumeration sets a boost (1.0 for
> > exact match):
> >
> https://github.com/apache/lucene/blob/dd4e8b82d711b8f665e91f0d74f159ef1e63939f/lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java#L248-L256
> >
>
> Thanks for the link to this calculation. I spent a long time trying to
> find it but kept missing.
>
> There's some interesting things happening here by making longer terms more
> similar. Starting from "spark" we say that "spar" is 75% similar because
> it's a 4 character term that needs a single edit (1/4) and "spare" is 80%
> similar because it's a 5 character term with a single edit (1/5). I don't
> have enough information yet to say if this is expected in the application
> or not, but it explains how we get the scores so there's something
> satisfying about at least that bit.
>
> As a hacky idea, I tried changing the boost in FuzzyTermsEnum from that
> computed similarity to it squared, which worked for this exact case but
> didn't keep up with adding a third fuzzy term to that competing document.
>
> After thinking about this more, I suspect that what I really want is for
> FuzzyQuery to score as the max of any of the matching terms, rather than
> the sum? This would be a big change though. I don't know that it's fair for
> multiple approximate matches to outweigh a single exact match here. We get
> so close to what I need with
> TestFuzzyQuery.testSingleQueryExactMatchScoresHighest but it doesn't quite
> make it all the way.
>
It looks like if I remove the hard-coded use of the Boolean rewrite method and
let it fall back to the default DisjunctionMax rewrite, I get the behaviour
that I want.
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/MultiTermQuery.java#L184

What are the use cases where we need a summation of the scores instead of
taking the max?


> What do you think?
>
> > So in short the exact term gets a boost factor of 1 in the resulting
> > term query, all other terms a lower one.
> >
> > Uwe
> >
> > --
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > https://www.thetaphi.de
> > eMail:u...@thetaphi.de
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Using Lucene 8.5.1 vs 8.5.2

2022-07-26 Thread Mike Drob
I would use 8.5.2 if possible when considering fuzzy queries. The
automaton can be very large, but if you're not caching the query then the
extra footprint is not significant, since it needs to be computed at some
point anyway to evaluate the query.

Really though, I would use 8.11 over either of those.

Mike Drob

On Tue, Jul 26, 2022 at 1:03 PM Baris Kazar  wrote:

> Dear Folks,-
>  May I please ask if using 8.5.1 is ok wrt 8.5.2?
> The only change was the following where fuzzy query was fixed for a major
> bug (?).
> How much does this affect fuzzy query performance? Has the dev team done a
> study comparing the LUCENE-9350 bug vs the LUCENE-9068 bug?
> https://lucene.apache.org/core/8_5_2/changes/Changes.html
> https://issues.apache.org/jira/browse/LUCENE-9350
> Best regards
>
>
>


Re: Lucene V8 Support

2022-09-15 Thread Mike Drob
Hi Fergal,

You should not expect much support on version 8 going forward. It will
probably get critical security releases and not much else.

Mike

On Thu, Sep 15, 2022 at 8:31 AM Fergal Gavin 
wrote:

> Hi there,
>
> We are a user of the Lucene core library in our product.
>
> With the release of Lucene 9.0.0 in Dec 07 2021 (
> https://mvnrepository.com/artifact/org.apache.lucene/lucene-core/9.0.0),
> we were wondering what the future support model for version 8 of the
> library will be? Will it be discontinued now that 9.0x is available or
> maintained into the future in parallel with V9? Does the Lucene project
> have a version 8 support cut-off date in mind?
>
> I do see that a recent release of version 8 of the library (8.11.2,
> https://mvnrepository.com/artifact/org.apache.lucene/lucene-core/8.11.2)
> on Jun 18, 2022 so perhaps this indicates that version 8 will be supported
> for the foreseeable future?
>
> Regards,
>
> Fergal.
>


Read past EOF

2009-04-28 Thread Mike Streeton
I have an index that works fine on Lucene 2.3.2 but fails to open in 2.4.1; it
always fails with a "read past EOF". The index does contain some field names
with German umlaut characters in them.

Any ideas?

Many Thanks

Mike

CheckIndex v2.3.2


NOTE: testing will be more thorough if you run java with 
'-ea:org.apache.lucene', so assertions are enabled

Opening index @ C:/index/german

Segments file=segments_9 numSegments=1 version=FORMAT_SHARED_DOC_STORE [Lucene 
2.3]
  1 of 1: name=_3 docCount=235535
compound=true
numFiles=1
size (MB)=301.684
no deletions
test: open reader.OK
test: fields, norms...OK [70 fields]
test: terms, freq, prox...OK [1475862 terms; 25448796 terms/docs pairs; 
28642994 tokens]
test: stored fields...OK [13560464 total field count; avg 57.573 fields 
per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq vector 
fields per doc]

No problems were detected with this index.

CheckIndex v2.4.1


NOTE: testing will be more thorough if you run java with 
'-ea:org.apache.lucene...', so assertions are enabled

Opening index @ C:/index/german

Segments file=segments_9 numSegments=1 version=FORMAT_SHARED_DOC_STORE [Lucene 
2.3]
  1 of 1: name=_3 docCount=235535
compound=true
hasProx=true
numFiles=1
size (MB)=301.684
no deletions
test: open reader.FAILED
WARNING: fixIndex() would remove reference to this segment; full exception:
java.io.IOException: read past EOF
  at org.apache.lucene.store.BufferedIndexInput.refill(Unknown Source)
  at org.apache.lucene.store.BufferedIndexInput.readBytes(Unknown Source)
  at org.apache.lucene.store.BufferedIndexInput.readBytes(Unknown Source)
  at org.apache.lucene.store.IndexInput.readString(Unknown Source)
  at org.apache.lucene.index.FieldInfos.read(Unknown Source)
  at org.apache.lucene.index.FieldInfos.(Unknown Source)
  at org.apache.lucene.index.SegmentReader.initialize(Unknown Source)
  at org.apache.lucene.index.SegmentReader.get(Unknown Source)
  at org.apache.lucene.index.SegmentReader.get(Unknown Source)
  at org.apache.lucene.index.CheckIndex.checkIndex(Unknown Source)
  at org.apache.lucene.index.CheckIndex.main(Unknown Source)

WARNING: 1 broken segments (containing 235535 documents) detected
WARNING: would write new segments file, and 235535 documents would be lost, if 
-fix were specified



RE: Read past EOF

2009-04-28 Thread Mike Streeton
An update: I have managed to get it not to fail by debugging and changing the
value of org.apache.lucene.store.IndexInput.preUTF8Strings to true. The value is
always false when it fails.

Mike 

-Original Message-
From: Mike Streeton [mailto:mike.stree...@connexica.com] 
Sent: 28 April 2009 12:53
To: java-user@lucene.apache.org
Subject: Read past EOF

I have an index that works fine on Lucene 2.3.2 but fails to open in 2.4.1; it
always fails with a "read past EOF". The index does contain some field names
with German umlaut characters in them.

Any ideas?

Many Thanks

Mike

CheckIndex v2.3.2


NOTE: testing will be more thorough if you run java with 
'-ea:org.apache.lucene', so assertions are enabled

Opening index @ C:/index/german

Segments file=segments_9 numSegments=1 version=FORMAT_SHARED_DOC_STORE [Lucene 
2.3]
  1 of 1: name=_3 docCount=235535
compound=true
numFiles=1
size (MB)=301.684
no deletions
test: open reader.OK
test: fields, norms...OK [70 fields]
test: terms, freq, prox...OK [1475862 terms; 25448796 terms/docs pairs; 
28642994 tokens]
test: stored fields...OK [13560464 total field count; avg 57.573 fields 
per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq vector 
fields per doc]

No problems were detected with this index.

CheckIndex v2.4.1


NOTE: testing will be more thorough if you run java with 
'-ea:org.apache.lucene...', so assertions are enabled

Opening index @ C:/index/german

Segments file=segments_9 numSegments=1 version=FORMAT_SHARED_DOC_STORE [Lucene 
2.3]
  1 of 1: name=_3 docCount=235535
compound=true
hasProx=true
numFiles=1
size (MB)=301.684
no deletions
test: open reader.FAILED
WARNING: fixIndex() would remove reference to this segment; full exception:
java.io.IOException: read past EOF
  at org.apache.lucene.store.BufferedIndexInput.refill(Unknown Source)
  at org.apache.lucene.store.BufferedIndexInput.readBytes(Unknown Source)
  at org.apache.lucene.store.BufferedIndexInput.readBytes(Unknown Source)
  at org.apache.lucene.store.IndexInput.readString(Unknown Source)
  at org.apache.lucene.index.FieldInfos.read(Unknown Source)
  at org.apache.lucene.index.FieldInfos.(Unknown Source)
  at org.apache.lucene.index.SegmentReader.initialize(Unknown Source)
  at org.apache.lucene.index.SegmentReader.get(Unknown Source)
  at org.apache.lucene.index.SegmentReader.get(Unknown Source)
  at org.apache.lucene.index.CheckIndex.checkIndex(Unknown Source)
  at org.apache.lucene.index.CheckIndex.main(Unknown Source)

WARNING: 1 broken segments (containing 235535 documents) detected
WARNING: would write new segments file, and 235535 documents would be lost, if 
-fix were specified


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Limiting search result for web search engine

2010-02-02 Thread Mike Polzin
I am working on building a web search engine and I would like to build a results
page similar to what Google does. The functionality I am looking to include is
what I refer to as "rolling up" sites, meaning that even if a particular site
(defined by its base URL) has many relevant hits on various pages for the
search keywords, that site is only shown once in the results listing, with a
link to the most relevant hit on that site. What I do not want is to have one
site dominate a search results page.

Does it make sense to just do the search, get the hits list and then
programmatically remove the results which, although they meet the search
criteria, are not as relevant? Is there a way to do this through queries?
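
Something like the post-processing pass below is what I had in mind (a
sketch only; it assumes each document stores its base URL in a "site"
field, which is my own convention):

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

// keep only the best-scoring hit per site; hits come back in score order
List<Document> rollUpBySite(IndexSearcher searcher, Query query) throws IOException {
  Set<String> seenSites = new HashSet<String>();
  List<Document> rolledUp = new ArrayList<Document>();
  for (ScoreDoc sd : searcher.search(query, 1000).scoreDocs) {
    Document doc = searcher.doc(sd.doc);
    String site = doc.get("site");    // assumed field holding the base URL
    if (seenSites.add(site)) {        // first (= highest-scoring) hit wins
      rolledUp.add(doc);
    }
  }
  return rolledUp;
}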

Thanks in advance!

Mike


  

How to calculate payloads in queries too

2010-04-11 Thread Mike Schultz

I am interested in using payloads in the following way.  I store
Func(index-term) as a payload at index-term when indexing.  When querying I
want to compute Func(query-term) as well.  Then my similarity returns some
other function, Gunc(Func(index-term1),Func(query-term)).

As an example, maybe I'm stripping plural-s and I want to index the singular
along with the payload true/false as to whether the original term was plural
or not.  Then when I query with a term I have a true/false value for it as
well.  My similarity score can then rank plurals higher when plurals are
queried and vice versa.

As another example, say, I have a part of speech tagger and I'm lucky enough
to get long queries that I can also pos tag.

I would think the normal symmetry of query/index would make this usage just
fall out of the code but I only see the use case of index payloads.  I could
hack something so that, for example, I don't use a filter to strip plural s,
and instead do it later in the query object.  Then I could calculate my
Func(query-term) and pass back a similarity that knows about the query
payload.  Is there a better way?
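
To make the plural example concrete, the indexing half would be a small filter
along these lines (a sketch written against the newer CharTermAttribute/BytesRef
flavor of the API, so adjust for your version; the 1-byte payload encoding is my
own convention). The query half, computing Func(query-term) and getting it into
the similarity, is exactly what I'm asking about.

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

final class PluralPayloadFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  PluralPayloadFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    int len = termAtt.length();
    boolean plural = len > 1 && termAtt.charAt(len - 1) == 's';
    if (plural) {
      termAtt.setLength(len - 1);   // index the singular form
    }
    // Func(index-term): 1 if the original token was plural, 0 otherwise
    payloadAtt.setPayload(new BytesRef(new byte[] { (byte) (plural ? 1 : 0) }));
    return true;
  }
}
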
-- 
View this message in context: 
http://n3.nabble.com/How-to-calculate-payloads-in-queries-too-tp712743p712743.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to calculate payloads in queries too

2010-04-12 Thread Mike Schultz

I see the payload in the token now.
-- 
View this message in context: 
http://n3.nabble.com/How-to-calculate-payloads-in-queries-too-tp712743p713413.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing

2007-08-24 Thread Mike Klaas
Note that Solr is expressly designed for this kind of thing: every
time you commit, a new searcher is opened in the background, warmed,
and then swapped with the current one.  It also supports autocommit
after X updates, or after the oldest update passes X milliseconds
without being committed.


-Mike

On 22-Aug-07, at 7:39 AM, Jonathan Ariel wrote:

I'm not reindexing the entire index. I'm just committing the updates. But I'm
not sure how it would affect performance to commit in real time. I think
right now I have like 10 updates per minute.

On 8/22/07, Erick Erickson <[EMAIL PROTECTED]> wrote:


There are several approaches. First, is your index small
enough to fit in RAM? You might consider just putting it all in
RAM and searching that.

A more complex solution would be to keep the increments
in a separate RAMDir AND your FSDir, search both and
keep things coordinated. Something like

open FSDir
create RAMDir
while (whatever) {
   get request
   if (modification) {
   write to FSDir and RAMDir
  }
   if (search) {
 search FSDir
 open RAMDir reader
 search RAMDir
 close RAMDir reader (but not writer!)
  }
}

close FSDir
close RAMDir
start again from the top.



Warning: I haven't done this, but it *should* work. The sticky
part seems to me to be coordinating deletes since the
open FSDir may contain documents also in the RAMDir,
but that's "an exercise for the reader",

You could also define the problem away and just live
with a 5 minute latency.

Best
Erick

On 8/22/07, Jonathan Ariel <[EMAIL PROTECTED]> wrote:


Hi,
I'm new to this list. So first of all Hello to everyone!

So right now I have a little issue I would like to discuss with you.
Suppose that you are in a really big application where the data in your
database is updated really fast. I reindex Lucene every 5 min, but since
my application lists everything from Lucene there are like 5 minutes (in
the worst case) where I don't see new stuff.
What do you think would be the best approach to this problem?

Thanks!

Jonathan






-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Highlighter that works with phrase and span queries

2007-08-27 Thread Mike Klaas

Mark,

I'm still interested in integrating this into Solr--this is a feature  
that has been requested a few times.  It would be easier to do so if  
it were a contrib/...


thanks for the great work,
-Mike

On 27-Aug-07, at 4:21 AM, Mark Miller wrote:

I am a bit unclear about your question. The patch you mention  
extends the original Highlighter to support phrase and span  
queries. It does not include any major performance increases over  
the original Highlighter (in fact, it takes a bit longer to  
Highlight a Span or Phrase query than it does to just highlight  
Terms).


Will it be released with the next version of Lucene? Doesn't look  
like it, but anything is possible. A few people are using it, but  
there has not been widespread interest that I have seen. My guess  
is that there are just not enough people trying to highlight Span  
queries -- which I'd blame on a lack of Span support in the default  
Lucene Query syntax.


Whether it is included soon or not, the code works well and I will  
continue to support it.


- Mark

Michael Stoppelman wrote:
Is this jar going to be in the next release of lucene? Also, are  
these the

same as the changes in the following patch:
https://issues.apache.org/jira/secure/attachment/12362653/spanhighlighter10.patch


-M

On 6/27/07, Mark Miller <[EMAIL PROTECTED]> wrote:




I have not looked at any highlighting code yet. Is there already an


extension


of PhraseQuery that has getSpans() ?



Currently I am using this code originally by M. Harwood:
Term[] phraseQueryTerms = ((PhraseQuery)  
query).getTerms();

int i;
SpanQuery[] clauses = new SpanQuery 
[phraseQueryTerms.length];


for (i = 0; i < phraseQueryTerms.length; i++) {
clauses[i] = new SpanTermQuery(phraseQueryTerms[i]);
}

SpanNearQuery sp = new SpanNearQuery(clauses,
((PhraseQuery) query).getSlop(), false);
sp.setBoost(query.getBoost());

I don't think it is perfect logic for PhraseQuery's edit  
distance, but

it approximates extremely well in most cases.

I wonder if this approach to Highlighting would be worth it in  
the end.
Certainly, it would seem to require that you store offsets or you  
would

have to re-tokenize anyway.

Some more interesting "stuff" on the current Highlighter methods:

We can gain a lot of speed on the implementation of the current
Highlighter if we grab from the source text in bigger chunks.  
Ronnie's
Highlighter appears to be faster than the original due to two  
things: he
doesn't have to re-tokenize text and he rebuilds the original  
document
in large pieces. Depending on how you want to look at it, he  
loses most
of the speed gained from just looking at the Query tokens instead  
of all
tokens to pulling the Term offset information (which appears  
pretty slow).


If you use a SimpleAnalyzer on docs around 1800 tokens long, you can
actually match the speed of Ronnies highlighter with the current
highlighter if you just rebuild the highlighted documents in bigger
pieces i.e. instead of going through each token and adding the  
source

text that it covers, build up the offset information until you get
another hit and then pull from the source text into the  
highlighted text
in one big piece rather than a tokens worth at a time. Of course  
this is
not compatible with the way the Fragmenter currently works. If  
you use

the StandardAnalyzer instead of SimpleAnalyzer, Ronnie's highlighter
wins because it takes so darn long to re-analyze.

It is also interesting to note that it is very difficult to see in a
gain in using TokenSources to build a TokenStream. Using the
StandardAnalyzer, it takes docs that are 1800 tokens just to be  
as fast

as re-analyzing. Notice I didn't say fast, but "as fast". Anything
smaller, or if you're using a simpler analyzer, and TokenSources is
certainly not worth it. It just takes too long to pull TermVector  
info.


- Mark



 
-

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]








-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing time linear?

2007-08-28 Thread Mike Klaas

On 23-Aug-07, at 2:48 AM, Barry Forrest wrote:


Hi list,

I'm trying to estimate how long it will take to index 10 million  
documents.

If I measure how long it takes to index say 10,000 documents, can I
extrapolate?  Will it take roughly 1000 times longer to do the  
whole set?


Segment merging is logarithmic, so you will get some rare, large  
delays at high doc counts that would not manifest at 10k.


Analysis is linear.

-Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucille, a (new) Python port of Lucene

2007-08-28 Thread Mike Klaas

Not to mention Lupy.

Hasn't it been relatively well-established that trying to create a  
performant search engine in a dynamic interpreted language is a show- 
stopper?  After several failed ports of lucene (I can add to this my  
own, unreleased, attempt) I just don't see the point, except as an  
academic exercise.   This is true even with selective optimization in  
c.  I think that the core engine needs to be in c/java to achieve  
feasibility--there's nothing stopping a cool dynamic language  
wrapping the core (see Lucy).


good luck,
-Mike

On 28-Aug-07, at 5:33 PM, Erik Hatcher wrote:


Why Lucille in light of PyLucene?

Erik


On Aug 28, 2007, at 10:55 AM, Dan Callaghan wrote:


Dear list,

I have recently begun a Python port of Lucene, named Lucille. It is
still very much a work in progress, but I hope to have a
feature-complete release compatible with Lucene 2.1 done in the  
near future.


The project homepage is at: http://www.djc.id.au/lucille/

Contributions, feedback, and questions are most welcome!

P.S. A big thanks to the Lucene contributors for their hard work in
building a great piece of software.

--
Dan Callaghan <[EMAIL PROTECTED]>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Highlighter that works with phrase and span queries

2007-08-29 Thread Mike Klaas
I just meant whether it would live in a lucene release (somewhere  
under contrib/) or just in JIRA.  Would including the functionality  
in Solr help get it into lucene?


-Mike

On 29-Aug-07, at 4:58 AM, Mark Miller wrote:

It kind of is a contrib -- its really just a new Scorer class (with  
some axillary helper classes) for the old contrib Highlighter.  
Since the contrib Highlighter is pretty hardened at this point, I  
figured that was the best way to go. Or do you mean something  
different?


- Mark

Mike Klaas wrote:

Mark,

I'm still interested in integrating this into Solr--this is a  
feature that has been requested a few times.  It would be easier  
to do so if it were a contrib/...


thanks for the great work,
-Mike

On 27-Aug-07, at 4:21 AM, Mark Miller wrote:

I am a bit unclear about your question. The patch you mention  
extends the original Highlighter to support phrase and span  
queries. It does not include any major performance increases over  
the original Highlighter (in fact, it takes a bit longer to  
Highlight a Span or Phrase query than it does to just highlight  
Terms).


Will it be released with the next version of Lucene? Doesn't look  
like it, but anything is possible. A few people are using it, but  
there has not been widespread interest that I have seen. My guess  
is that there are just not enough people trying to highlight Span  
queries -- which I'd blame on a lack of Span support in the  
default Lucene Query syntax.


Whether it is included soon or not, the code works well and I  
will continue to support it.


- Mark

Michael Stoppelman wrote:
Is this jar going to be in the next release of lucene? Also, are  
these the

same as the changes in the following patch:
https://issues.apache.org/jira/secure/attachment/12362653/spanhighlighter10.patch


-M

On 6/27/07, Mark Miller <[EMAIL PROTECTED]> wrote:



I have not looked at any highlighting code yet. Is there  
already an



extension


of PhraseQuery that has getSpans() ?



Currently I am using this code originally by M. Harwood:
Term[] phraseQueryTerms = ((PhraseQuery)  
query).getTerms();

int i;
SpanQuery[] clauses = new SpanQuery 
[phraseQueryTerms.length];


for (i = 0; i < phraseQueryTerms.length; i++) {
clauses[i] = new SpanTermQuery(phraseQueryTerms 
[i]);

}

SpanNearQuery sp = new SpanNearQuery(clauses,
((PhraseQuery) query).getSlop(), false);
sp.setBoost(query.getBoost());

I don't think it is perfect logic for PhraseQuery's edit  
distance, but

it approximates extremely well in most cases.

I wonder if this approach to Highlighting would be worth it in  
the end.
Certainly, it would seem to require that you store offsets or  
you would

have to re-tokenize anyway.

Some more interesting "stuff" on the current Highlighter methods:

We can gain a lot of speed on the implementation of the current
Highlighter if we grab from the source text in bigger chunks.  
Ronnie's
Highlighter appears to be faster than the original due to two  
things: he
doesn't have to re-tokenize text and he rebuilds the original  
document
in large pieces. Depending on how you want to look at it, he  
loses most
of the speed gained from just looking at the Query tokens  
instead of all
tokens to pulling the Term offset information (which appears  
pretty slow).


If you use a SimpleAnalyzer on docs around 1800 tokens long,  
you can

actually match the speed of Ronnies highlighter with the current
highlighter if you just rebuild the highlighted documents in  
bigger
pieces i.e. instead of going through each token and adding the  
source

text that it covers, build up the offset information until you get
another hit and then pull from the source text into the  
highlighted text
in one big piece rather than a tokens worth at a time. Of  
course this is
not compatible with the way the Fragmenter currently works. If  
you use
the StandardAnalyzer instead of SimpleAnalyzer, Ronnie's  
highlighter

wins because it takes so darn long to re-analyze.

It is also interesting to note that it is very difficult to see  
in a

gain in using TokenSources to build a TokenStream. Using the
StandardAnalyzer, it takes docs that are 1800 tokens just to be  
as fast

as re-analyzing. Notice I didn't say fast, but "as fast". Anything
smaller, or if you're using a simpler analyzer, and  
TokenSources is
certainly not worth it. It just takes too long to pull  
TermVector info.


- Mark



-- 
---

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]








 
-

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-

Re: Search performance question

2007-09-06 Thread Mike Klaas


On 6-Sep-07, at 4:41 AM, makkhar wrote:



Hi,

   I have an index which contains more than 20K documents. Each  
document has

the following structure :

field : ID (Index and store)  typical value  
- "1000"

field : parameterName(index and store)  typical value -
"/mcp/data/parent1/parent2/child1/child2/status"
field : parameterValue(index and not store)typical value - "draft"

When I search for a term which results in "all" the documents getting
returned, the search time is more than 1 sec. I have still not done
hits.doc(), which I understand, would be even worse.

My problem is, I am expecting the search itself to happen in the  
order of a
few milliseconds irrespective of the number of documents it  
matched. Am I

expecting too much ?


20K docs is not very many.  I would expect a simple TermQuery to be  
on the order of milliseconds, _after_ the OS has cached the index in  
memory.  Does the time improve after some warmup?


-MIke


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Extract terms not by reader, but by documents

2007-09-06 Thread Mike Klaas

On 6-Sep-07, at 11:48 AM, Grant Ingersoll wrote:



On Sep 6, 2007, at 1:32 PM, Rafael Rossini wrote:

Karl, I´m aware of IndexReader.getTermFreqVector, with this I can  
get all
terms of a document, but I want all terms of a document that  
matched a

query.

Grant,


Yes, I think I understand.  You want to know what terms from your
query matched in a given document.


Yep, that´s what I want. In the contrib/highlighter package, the
query.rewrite.extractTerms is used to match the terms in the  
documents. So




Can you point to where this is taking place in the contrib/ 
highlighter?  I am not a highlighter expert, but I would like to  
see it.  The only place I see a call to extractTerms is in  
QueryTermExtractor.java


The document is re-analyzed, or the token stream is retrieved from  
term vector reconstruction.  That is all.


-Mike
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Indexing Speed using Java Lucene 2.0 and Lucene.NET 2.0

2007-09-10 Thread Mike Klaas

On 10-Sep-07, at 5:59 AM, Laxmilal Menaria wrote:


Hello Everyone,

I have created an index application using Java Lucene 2.0 in Java and
Lucene.Net 2.0 in VB.NET. Both applications have the same logic. But when I
indexed a database with 14000 rows from both applications on the same
machine, I was surprised that Java Lucene took 198 seconds, more than double
the .NET time of 87 seconds. Is there any specific reason for that, or
anything else I should look at?

I have also tried the same with Java Lucene 2.2. It also took about the same
time (190 seconds).
Seconds)


Are you using the same index settings? (mergefactor, maxBufferedDocs,  
etc)?  Are you using StandardAnalyzer (the trunk version should be  
many times faster)?


-Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Storing Host and IP Information in Lucene

2007-09-11 Thread Mike Klaas

On 10-Sep-07, at 8:37 PM, AnkitSinghal wrote:



But i think the query like host:example* will not work in this case
Actually it was typo in my question. I want to search for above  
type of

query only.


Hosts are best stored in reverse domain format:

xyz.example.com -> com.example.xyz

Then you can query docs from example.com via:
(com.example com.example.*)

If you want 'example' to be searchable as a term, then additionally  
store the host in a different, tokenized field.
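
i.e. something like this for the reversal (plain string munging, nothing
Lucene-specific):

// xyz.example.com -> com.example.xyz
static String reverseHost(String host) {
  String[] parts = host.split("\\.");
  StringBuilder sb = new StringBuilder(host.length());
  for (int i = parts.length - 1; i >= 0; i--) {
    if (sb.length() > 0) {
      sb.append('.');
    }
    sb.append(parts[i]);
  }
  return sb.toString();
}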


-Mike


Ankit


Daniel Noll-3 wrote:


On Monday 10 September 2007 23:53:06 AnkitSinghal wrote:
And if i make the field as UNTOKENIZED  i cannot search for  
queries like

host:xyz.* .


I'm not sure why that wouldn't work.  If the stored token is
xyz.example.com,
then xyz.* will certainly match it.

Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





--
View this message in context: http://www.nabble.com/Storing-Host- 
and-IP-Information-in-Lucene-tf4414865.html#a12607238

Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Tokenization question

2007-09-13 Thread Mike Klaas

On 13-Sep-07, at 12:37 PM, Dan Luria wrote:


What I do is

Doc1 = source_doc
Doc2 = new Document()
foreach (field f in doc1.getfields) {
Doc2.Add(new Field(doc1.getField(key), doc1.getField(value));
}

but when i pull the fields from Doc1, i never get the tokenized  
field..

it just doesnt appear.

so my question is -- i can see that field in the index, and search
against it, but how do i transfer it to a different index?

(PS: The above is pseudo-code... not syntax)


Indexed-only document fields are not stored anywhere.  There are bits and
pieces of the document all over the place (this is the nature of an
inverted index).

You can (quite time-consumingly) reconstruct them by iterating over the
whole index.  I think Luke can do this.
whole index.  I think luke can do this.


-Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: BoostingTermQuery performance

2007-10-02 Thread Mike Klaas

On 2-Oct-07, at 3:44 PM, Peter Keegan wrote:

I have been experimenting with payloads and BoostingTermQuery,  
which I think
are excellent additions to Lucene core. Currently,  
BoostingTermQuery extends

SpanQuery. I would suggest changing this class to extend TermQuery and
refactor the current version to something like 'BoostingSpanQuery'.

The reason is rooted in performance. In my testing, I compared query
throughput using TermQuery against 2 versions of BoostingTermQuery  
- the
current one that extends SpanQuery and one that extends TermQuery  
(which
I've included, below). Here are the results (qps = queries per  
second):


TermQuery:200 qps
BoostingTermQuery (extends SpanQuery): 97 qps
BoostingTermQuery (extends TermQuery): 130 qps

Here is a version of BoostingTermQuery that extends TermQuery. I  
had to
modify TermQuery and TermScorer to make them public. A code review  
would be

in order, and I would appreciate your comments on this suggestion.


Awesome!  I wasn't aware that there was such a difference.  With a  
performance gap that large, it is definitely worth having the option.


Payloads have the potential to be a heavily-used feature in Lucene,  
and performance will be key for that.


-Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Generalized proximity query performance

2007-10-05 Thread Mike Klaas

On 5-Oct-07, at 10:54 AM, Chris Hostetter wrote:

: I am using a hand rolled query of the following form (implemented  
with

: SpanNearQuery, not a sloppy PhraseQuery):
: a b c => +(a AND b AND c) OR "a b"~5 OR "b c"~5
:
: The obvious solution, "a b c"~5, is not applicable for my issues,  
because I
: would like to allow for the possibility that a and b are near  
each other in

: one field, while c is in another field.

Hmmm.. can you give some more concrete examples of what you mean by this?
both in terms of the use case you are trying to satisfy, and in terms of
how your current code works ... you don't have to post code or give away
trade secrets, just describe it as a black box (ie: what is the input?,
how do you know when to use fieldA vs fieldC, how do you decide when to
make a span query vs an OR query?)

based on what you've described so far, it's hard to understand what it is
you are doing -- which is important to understand how to help you make it
better/faster.


I understand the OP to want a PhraseQuery that has an intention  
(rather than side-effect) of doing proximity-based scoring.


"phrase query here"~1000 is the current hack that performs fine for N  
< 3 query terms, but fails currently for N >= 3 since it requires  
that all the terms be present.  For larger queries, this effectively  
nullifies the usefulness of the phrase query approach.


It doesn't seem to me that writing a variant of PhraseQuery that has  
the desired functionality would be _too_ hard, but I haven't looked  
into it in depth.


-Mike



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Generalized proximity query performance

2007-10-05 Thread Mike Klaas

On 5-Oct-07, at 11:27 AM, Chris Hostetter wrote:



that's what i thought first too, and it is a problem i'd eventually like
to tackle ... it was the part about "c" being in a different field from
"a" and "b" that confused me ... i don't know exactly what is being
suggested here.


I'm thinking of the dismax model: you still want each keyword to  
match (though possibly in different fields).  I don't really think  
that that is appropriate to throw into a single query class.   
Having separate match/boost clauses is better.


-Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



TermDocs.skipTo

2007-10-29 Thread Mike Streeton
Are there any issues surrounding TermDocs.skipTo()? I have an index that works
okay if I use TermDocs.next() to find the next doc id, but using skipTo() to go
to the one after a given point can sometimes miss.

e.g. Iterating using TermDocs.next() and TermDocs.doc() returns 1,50,1,2 but
using TermDocs.skipTo(51) returns false, indicating that no doc id > 50 exists.

I will try and create a sample index to show this.

Many Thanks

Mike


Reuse TermDocs

2007-11-05 Thread Mike Streeton
Can TermDocs be reused i.e. can you do.

TermDocs docs = reader.termDocs();
docs.seek(term1);
int i = 0;
while (docs.next()) {
i++;
}
docs.seek(term2);
int j = 0;
while (docs.next()) {
j++;
}

Reuse does seem to work but I get ArrayIndexOutOfBoundsExceptions from 
BitVector if I reuse the same one over a period of time.

Many Thanks

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search performance using BooleanQueries in BooleanQueries

2007-11-06 Thread Mike Klaas

On 29-Oct-07, at 9:43 AM, Paul Elschot wrote:


On Friday 26 October 2007 09:36:58 Ard Schrijvers wrote:

+prop1:a +prop2:b +prop3:c +prop4:d +prop5:e

is much faster than

(+(+(+(+prop1:a +prop2:b) +prop3:c) +prop4:d) +prop5:e)

where the second one is a result from BooleanQuery in  
BooleanQuery, and

all have Occur.MUST.



SImplifying boolean queries like this is not available in Lucene,  
but it

would have a positive effect on search performance, especially when
prop1:a and prop2:b have a high document frequency.


Wait--shouldn't the outer-most BooleanQuery provide most of this  
speedup already (since it should be skipTo'ing between the nested  
BooleanQueries and the outermost)?  Is it the indirection and sub- 
query management that is causing the performance difference, or  
differences in skipTo behaviour?
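
(For anyone who wants to measure that, here is a rough sketch of doing
the flattening by hand.  It assumes every clause is Occur.MUST and that
no per-clause boosts matter -- my assumptions, not something Lucene
checks for you:)

import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class FlattenMust {
    // Collapse (+(+(+a +b) +c) +d) into (+a +b +c +d).
    public static BooleanQuery flatten(BooleanQuery nested) {
        BooleanQuery flat = new BooleanQuery();
        collect(nested, flat);
        return flat;
    }

    private static void collect(BooleanQuery q, BooleanQuery flat) {
        BooleanClause[] clauses = q.getClauses();
        for (int i = 0; i < clauses.length; i++) {
            Query sub = clauses[i].getQuery();
            if (sub instanceof BooleanQuery
                    && clauses[i].getOccur() == BooleanClause.Occur.MUST) {
                collect((BooleanQuery) sub, flat);  // pull nested MUST clauses up
            } else {
                flat.add(clauses[i]);               // keep anything else as-is
            }
        }
    }
}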


-Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search performance using BooleanQueries in BooleanQueries

2007-11-06 Thread Mike Klaas

On 6-Nov-07, at 3:02 PM, Paul Elschot wrote:


On Tuesday 06 November 2007 23:14:01 Mike Klaas wrote:



Wait--shouldn't the outer-most BooleanQuery provide most of this
speedup already (since it should be skipTo'ing between the nested
BooleanQueries and the outermost)?  Is it the indirection and sub-
query management that is causing the performance difference, or
differences in skipTo behaviour?


The usual Lucene answer to performance questions: it depends.

After every hit, next() needs to be called on a subquery before
skipTo() can be used to find the next hit. It is currently not  
defined which

subquery will be used for this first next().

The structure of the scorers normally follows the structure of
the BooleanQueries, so the indirection over the deep subquery
scorers could well be relevant to performance, too.

Which of these factors actually dominates performance is hard
to predict in advance. The point of skipTo() is that it tries to avoid
disk I/O as much as possible for the first time that the query is
executed. Later executions are much more likely to hit the OS cache,
and then the indirections will be more relevant to performance.

I'd like to have a good way to do a performance test on a first
query execution, in the sense that it does not hit the OS cache
for its skipTo() executions, but I have not found a good way yet.


Interesting--thanks for the thoughtful answer.

-Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



TermDocs.skipTo error

2007-11-09 Thread Mike Streeton
I have posted before about a problem with TermDocs.skipTo() but never managed 
to reproduce it. I have now got it to fail using the following program; please 
can someone try it and see if they get the stack trace:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array 
index out of range: 101306
  at org.apache.lucene.util.BitVector.get(BitVector.java:72)
  at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:118)
  at 
org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:176)
  at org.apache.lucene.index.MultiTermDocs.skipTo(MultiReader.java:413)
  at Test4.test(Test4.java:88)
  at main(Test4.java:69)

The program creates a test index; if you run it a second time it will not 
re-create the index. Change the directory name on line 33.

Many Thanks

Mike

Ps I am using Lucene 2.2 and java 1.6 u1



import java.io.IOException;
import java.util.Random;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class Test4 {

  /**
   * @param args
   * @throws IOException
   * @throws LockObtainFailedException
   * @throws CorruptIndexException
   */
  public static void main(String[] args) throws Exception {
Random rand = new Random(0);
Directory[] dirs = new Directory[10];
for (int i = 0; i < dirs.length; i++) {
  dirs[i] = FSDirectory.getDirectory("c:\\temp\\lucenetest\\"
  + Integer.toString(i));
  if (!IndexReader.indexExists(dirs[i])) {
IndexWriter writer = new IndexWriter(dirs[i],
new StandardAnalyzer(), true);
for (int j = 0; j < 100000; j++) {
  // each index gets 100,000 small documents with two untokenized fields
  Document doc = new Document();
  doc.add(new Field("i", Integer.toString(rand.nextInt(100)),
      Store.YES, Index.UN_TOKENIZED));
  doc.add(new Field("j", Integer.toString(rand.nextInt(1000)),
      Store.YES, Index.UN_TOKENIZED));
  writer.addDocument(doc);
  if (j % 10000 == 0) {
    System.out.println(j);
  }
}
writer.optimize();
writer.close();
writer = null;
  }
  IndexReader reader = IndexReader.open(dirs[i]);
  for (int j = 0; j < 1000; j++) {
reader.deleteDocument(rand.nextInt(reader.maxDoc()));
  }
  reader.close();
}
IndexReader[] readers = new IndexReader[dirs.length];
for (int i = 0; i < dirs.length; i++) {
  readers[i] = IndexReader.open(dirs[i]);
}
IndexReader reader = new MultiReader(readers);
TermDocs docs = reader.termDocs();
for (int i = 0; i < 100; i++) {
  for (int j = 0; j < 1000; j++) {
try {
  test(docs, Integer.toString(i), 
Integer.toString(j));
} catch (Exception e) {
  System.err.println("Failed at i="+i+" j="+j);
  throw e;
}
  }
}
docs.close();
reader.close();
  }

  private static void test(TermDocs docs, String i, String j)
  throws IOException {
docs.seek(new Term("i", i));
while (docs.next());
docs.seek(new Term("j", j));
while (docs.next());
docs.seek(new Term("i", i));
if (docs.next()) {
  while (docs.skipTo(docs.doc()+1000));
}
docs.seek(new Term("j", j));
if (docs.next()) {
  while (docs.skipTo(docs.doc()+1000));
}
  }

}



RE: TermDocs.skipTo error

2007-11-09 Thread Mike Streeton
Erick,
   Sorry, the numbers are just printed out for debugging while it is building the 
index. I will try it with Lucene 2.1 and see what happens.

Thanks

Mike

-Original Message-
From: Erick Erickson [mailto:[EMAIL PROTECTED]
Sent: 09 November 2007 15:59
To: java-user@lucene.apache.org
Subject: Re: TermDocs.skipTo error

FWIW, running Lucene 2.1, Java 1.5 all I get is some numbers being printed
out
0
1
2
.
.
.
90,000


and ran through the above 4 times or so

Erick

On Nov 9, 2007 5:51 AM, Mike Streeton <[EMAIL PROTECTED]>
wrote:

> I have posted before about a problem with TermDocs.skipTo () but never
> managed to reproduce it. I have now got it to fail using the following
> program, please can someone try it and see if they get the stack trace:
>
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array
> index out of range: 101306
>  at org.apache.lucene.util.BitVector.get(BitVector.java:72)
>  at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java
> :118)
>  at org.apache.lucene.index.SegmentTermDocs.skipTo(
> SegmentTermDocs.java:176)
>  at org.apache.lucene.index.MultiTermDocs.skipTo(MultiReader.java:413)
>  at Test4.test(Test4.java:88)
>  at main(Test4.java:69)
>
> The program creates a test index, if you run it a second time it will not
> create the index. Change the directory name on line 33.
>
> Many Thanks
>
> Mike
>
> Ps I am using Lucene 2.2 and java 1.6 u1
>
>
>
> import java.io.IOException;
> import java.util.Random;
>
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.Field.Index;
> import org.apache.lucene.document.Field.Store;
> import org.apache.lucene.index.CorruptIndexException;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.MultiReader;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.TermDocs;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.store.LockObtainFailedException;
>
> public class Test4 {
>
>  /**
>   * @param args
>   * @throws IOException
>   * @throws LockObtainFailedException
>   * @throws CorruptIndexException
>   */
>  public static void main(String[] args) throws Exception {
>Random rand = new Random(0);
>Directory[] dirs = new Directory[10];
>for (int i = 0; i < dirs.length; i++) {
>  dirs[i] = FSDirectory.getDirectory
> ("c:\\temp\\lucenetest\\"
>  + Integer.toString(i));
>  if (!IndexReader.indexExists(dirs[i])) {
>IndexWriter writer = new IndexWriter(dirs[i],
>new StandardAnalyzer(), true);
>for (int j = 0; j < 10; j++) {
>  Document doc = new Document();
>  doc.add(new Field("i", Integer.toString(
> rand.nextInt(100)),
>  Store.YES, Index.UN_TOKENIZED));
>  doc.add(new Field("j",
>  Integer.toString(rand.nextInt(1000)),
> Store.YES,
>  Index.UN_TOKENIZED));
>  writer.addDocument(doc);
>  if (j % 1 == 0) {
>System.out.println(j);
>  }
>}
>writer.optimize();
>writer.close();
>writer = null;
>  }
>  IndexReader reader = IndexReader.open(dirs[i]);
>  for (int j = 0; j < 1000; j++) {
>reader.deleteDocument(rand.nextInt(reader.maxDoc
> ()));
>  }
>  reader.close();
>}
>IndexReader[] readers = new IndexReader[dirs.length];
>for (int i = 0; i < dirs.length; i++) {
>  readers[i] = IndexReader.open(dirs[i]);
>}
>IndexReader reader = new MultiReader(readers);
>TermDocs docs = reader.termDocs();
>for (int i = 0; i < 100; i++) {
>  for (int j = 0; j < 1000; j++) {
>try {
>  test(docs, Integer.toString(i),
> Integer.toString(j));
>

RE: TermDocs.skipTo error

2007-11-09 Thread Mike Streeton
I have tried this again using Lucene 2.1 and, as Erick found, it works okay. I 
have tried it on JDK 1.6 u1 and u3: both work, but both fail when using Lucene 
2.2.

Mike

-Original Message-
From: Mike Streeton [mailto:[EMAIL PROTECTED]
Sent: 09 November 2007 16:05
To: java-user@lucene.apache.org
Subject: RE: TermDocs.skipTo error

Erick,
   Sorry the numbers are just printed out for debugging when it is building the 
index. I will try it with lucene 2.1 and see what happens

Thanks

Mike

-Original Message-
From: Erick Erickson [mailto:[EMAIL PROTECTED]
Sent: 09 November 2007 15:59
To: java-user@lucene.apache.org
Subject: Re: TermDocs.skipTo error

FWIW, running Lucene 2.1, Java 1.5 all I get is some numbers being printed
out
0
1
2
.
.
.
90,000


and ran through the above 4 times or so

Erick

On Nov 9, 2007 5:51 AM, Mike Streeton <[EMAIL PROTECTED]>
wrote:

> I have posted before about a problem with TermDocs.skipTo () but never
> managed to reproduce it. I have now got it to fail using the following
> program, please can someone try it and see if they get the stack trace:
>
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array
> index out of range: 101306
>  at org.apache.lucene.util.BitVector.get(BitVector.java:72)
>  at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java
> :118)
>  at org.apache.lucene.index.SegmentTermDocs.skipTo(
> SegmentTermDocs.java:176)
>  at org.apache.lucene.index.MultiTermDocs.skipTo(MultiReader.java:413)
>  at Test4.test(Test4.java:88)
>  at main(Test4.java:69)
>
> The program creates a test index, if you run it a second time it will not
> create the index. Change the directory name on line 33.
>
> Many Thanks
>
> Mike
>
> Ps I am using Lucene 2.2 and java 1.6 u1
>
>
>
> import java.io.IOException;
> import java.util.Random;
>
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.Field.Index;
> import org.apache.lucene.document.Field.Store;
> import org.apache.lucene.index.CorruptIndexException;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.MultiReader;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.TermDocs;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.store.LockObtainFailedException;
>
> public class Test4 {
>
>  /**
>   * @param args
>   * @throws IOException
>   * @throws LockObtainFailedException
>   * @throws CorruptIndexException
>   */
>  public static void main(String[] args) throws Exception {
>Random rand = new Random(0);
>Directory[] dirs = new Directory[10];
>for (int i = 0; i < dirs.length; i++) {
>  dirs[i] = FSDirectory.getDirectory
> ("c:\\temp\\lucenetest\\"
>  + Integer.toString(i));
>  if (!IndexReader.indexExists(dirs[i])) {
>IndexWriter writer = new IndexWriter(dirs[i],
>new StandardAnalyzer(), true);
>for (int j = 0; j < 10; j++) {
>  Document doc = new Document();
>  doc.add(new Field("i", Integer.toString(
> rand.nextInt(100)),
>  Store.YES, Index.UN_TOKENIZED));
>  doc.add(new Field("j",
>  Integer.toString(rand.nextInt(1000)),
> Store.YES,
>  Index.UN_TOKENIZED));
>  writer.addDocument(doc);
>  if (j % 1 == 0) {
>System.out.println(j);
>  }
>}
>writer.optimize();
>writer.close();
>writer = null;
>  }
>  IndexReader reader = IndexReader.open(dirs[i]);
>  for (int j = 0; j < 1000; j++) {
>reader.deleteDocument(rand.nextInt(reader.maxDoc
> ()));
>  }
>  reader.close();
>}
>IndexReader[] readers = new IndexReader[dirs.length];
>for (int i = 0; i < dirs.length; i++) {
>  readers[i] = IndexReader.open(dirs[i]);
>}
>IndexReader reader = 

RE: TermDocs.skipTo error

2007-11-09 Thread Mike Streeton
I have just tried this again using the index I built with Lucene 2.1 but 
running the test with Lucene 2.2, and it works okay, so it seems to be 
something related to an index built with Lucene 2.2.

Mike

-Original Message-
From: Mike Streeton [mailto:[EMAIL PROTECTED]
Sent: 09 November 2007 16:34
To: java-user@lucene.apache.org
Subject: RE: TermDocs.skipTo error

I have tried this again using Lucene 2.1 and as Erick found it works okay, I 
have tried it on jdk 1.6 u1 and u3 both work, but both fail when using lucene 
2.2

Mike

-Original Message-
From: Mike Streeton [mailto:[EMAIL PROTECTED]
Sent: 09 November 2007 16:05
To: java-user@lucene.apache.org
Subject: RE: TermDocs.skipTo error

Erick,
   Sorry the numbers are just printed out for debugging when it is building the 
index. I will try it with lucene 2.1 and see what happens

Thanks

Mike

-Original Message-
From: Erick Erickson [mailto:[EMAIL PROTECTED]
Sent: 09 November 2007 15:59
To: java-user@lucene.apache.org
Subject: Re: TermDocs.skipTo error

FWIW, running Lucene 2.1, Java 1.5 all I get is some numbers being printed
out
0
1
2
.
.
.
90,000


and ran through the above 4 times or so

Erick

On Nov 9, 2007 5:51 AM, Mike Streeton <[EMAIL PROTECTED]>
wrote:

> I have posted before about a problem with TermDocs.skipTo () but never
> managed to reproduce it. I have now got it to fail using the following
> program, please can someone try it and see if they get the stack trace:
>
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array
> index out of range: 101306
>  at org.apache.lucene.util.BitVector.get(BitVector.java:72)
>  at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java
> :118)
>  at org.apache.lucene.index.SegmentTermDocs.skipTo(
> SegmentTermDocs.java:176)
>  at org.apache.lucene.index.MultiTermDocs.skipTo(MultiReader.java:413)
>  at Test4.test(Test4.java:88)
>  at main(Test4.java:69)
>
> The program creates a test index, if you run it a second time it will not
> create the index. Change the directory name on line 33.
>
> Many Thanks
>
> Mike
>
> Ps I am using Lucene 2.2 and java 1.6 u1
>
>
>
> import java.io.IOException;
> import java.util.Random;
>
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.Field.Index;
> import org.apache.lucene.document.Field.Store;
> import org.apache.lucene.index.CorruptIndexException;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.MultiReader;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.TermDocs;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.store.LockObtainFailedException;
>
> public class Test4 {
>
>  /**
>   * @param args
>   * @throws IOException
>   * @throws LockObtainFailedException
>   * @throws CorruptIndexException
>   */
>  public static void main(String[] args) throws Exception {
>Random rand = new Random(0);
>Directory[] dirs = new Directory[10];
>for (int i = 0; i < dirs.length; i++) {
>  dirs[i] = FSDirectory.getDirectory
> ("c:\\temp\\lucenetest\\"
>  + Integer.toString(i));
>  if (!IndexReader.indexExists(dirs[i])) {
>IndexWriter writer = new IndexWriter(dirs[i],
>new StandardAnalyzer(), true);
>for (int j = 0; j < 10; j++) {
>  Document doc = new Document();
>  doc.add(new Field("i", Integer.toString(
> rand.nextInt(100)),
>  Store.YES, Index.UN_TOKENIZED));
>  doc.add(new Field("j",
>  Integer.toString(rand.nextInt(1000)),
> Store.YES,
>  Index.UN_TOKENIZED));
>  writer.addDocument(doc);
>  if (j % 1 == 0) {
>System.out.println(j);
>  }
>}
>writer.optimize();
>writer.close();
>writer = null;
>  }
>  IndexReader reader = IndexReader.open(dirs[i]);
>  for (int j = 0; j < 1000; j++) {
>   

RE: TermDocs.skipTo error

2007-11-12 Thread Mike Streeton
Yonik,
Thanks for this. I have checked and it is not at the end of the index, but 
something funny is happening. It is a MultiReader; each reader within it has 
100,000 documents, but when it fails it is trying to access a document, e.g. 
123,456, in a segment that only has 100,000.

Many Thanks

Mike

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: 10 November 2007 22:49
To: java-user@lucene.apache.org
Subject: Re: TermDocs.skipTo error

On Nov 9, 2007 11:40 AM, Mike Streeton <[EMAIL PROTECTED]> wrote:
> I have just tried this again using the index I built with lucene 2.1 but 
> running the test using lucene 2.2 and it works okay, so it seems to be 
> something related to an index built using lucene 2.2.

I bet you are triggering an issue in multi-level skip lists by
attempting to skipTo a target past maxDoc (which per the javadoc,
seems like it should be legal).

So a short term workaround for you might be to first test if the
target you are trying to skip to is less than maxDoc().
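
(Sketched out, that guard might look like this -- the helper method is
mine, not part of Lucene:)

import java.io.IOException;

import org.apache.lucene.index.TermDocs;

public class SafeSkip {
    // Never hand skipTo() a target at or beyond maxDoc; treat such
    // targets as "no more documents" instead.
    public static boolean skipTo(TermDocs docs, int target, int maxDoc)
            throws IOException {
        if (target >= maxDoc) {
            return false;
        }
        return docs.skipTo(target);
    }
}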

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: TermDocs.skipTo error

2007-11-14 Thread Mike Streeton
I have now managed to quantify the error: it only affects indexes built with 
Lucene 2.2 and occurs after a period of time reusing a TermDocs object. I have 
modified my test app to be a little more verbose about the conditions it fails 
under. Hopefully someone can track the bug down in Lucene. I have run the test 
again after it fails, changing the loop iterators so it repeats the failing 
iteration first, and it works okay.

Many Thanks

Mike



import java.io.File;
import java.io.IOException;
import java.util.Random;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class Test4 {

/**
 * @param args
 * @throws IOException
 * @throws LockObtainFailedException
 * @throws CorruptIndexException
 */
	public static void main(String[] args) throws Exception {
		Random rand = new Random(0);
		FSDirectory[] dirs = new FSDirectory[10];
		boolean build = false;
		for (int i = 0; i < dirs.length; i++) {
			dirs[i] = FSDirectory.getDirectory("c:" + File.separator + "temp"
					+ File.separator + "lucenetest" + File.separator
					+ Integer.toString(i));
			if (!IndexReader.indexExists(dirs[i])) {
				if (!build) {
					System.out.println("Building Test Index Start");
				}
				build = true;
				System.out.println("Building Index: " + dirs[i].getFile()
						+ " Start");
				// each test index gets 100,000 small documents
				IndexWriter writer = new IndexWriter(dirs[i],
						new StandardAnalyzer(), true);
				for (int j = 0; j < 100000; j++) {
					Document doc = new Document();
					doc.add(new Field("i", Integer.toString(rand.nextInt(100)),
							Store.YES, Index.UN_TOKENIZED));
					doc.add(new Field("j", Integer.toString(rand.nextInt(1000)),
							Store.YES, Index.UN_TOKENIZED));
					writer.addDocument(doc);
				}
				writer.optimize();
				writer.close();
				writer = null;
				System.out.println("Building Index: " + dirs[i].getFile()
						+ " Complete");
			}
			// delete 1000 random documents from each index
			IndexReader reader = IndexReader.open(dirs[i]);
			for (int j = 0; j < 1000; j++) {
				reader.deleteDocument(rand.nextInt(reader.maxDoc()));
			}
			reader.close();
		}
		if (build) {
			System.out.println("Building Test Index Complete");
		}
		System.out.println("Test Start");
		IndexReader[] readers = new IndexReader[dirs.length];
		for (int i = 0; i < dirs.length; i++) {
			readers[i] = IndexReader.open(dirs[i]);
		}
		IndexReader reader = new MultiReader(readers);
		// reuse a single TermDocs across every seek()/skipTo() below
		TermDocs docs = reader.termDocs();
		for (int i = 0; i < 100; i++) {
			for (int j = 0; j < 1000; j++) {
				try {
					test(reader, docs, Integer.toString(i), Integer.toString(j));
				} catch (Exception e) {
					System.err.println("maxdoc=" + reader.maxDoc());
					System.err.println("Test Failed at i=" + i + " j=" + j);
					throw e;
				}
			}

Re: Custom query parser

2007-11-22 Thread Mike Klaas

On 22-Nov-07, at 8:49 AM, Nicolas Lalevée wrote:


On Thursday 22 November 2007, Matthijs Bierman wrote:

Hi Nicolas,

Why can't you extend the QueryParser and override the methods you  
want

to modify?


Because the query parser I would like to have is a very basic user one,
a la Google. The syntax I would like is nothing more than:
"type:text OR foo -bar"
I also don't want the user to be hit with a parsing error popup, and
furthermore, even with errors, I don't want to drop the failing terms.
A search for "bla:foo:bar" should fall back to "bla foo bar".


This is usually best handled by preprocessing the query and  
validating it yourself.


So I have the parser, but it seems, unless I have missed something,  
that I

have to reimplement (copy and paste) the QueryParser#getFieldQuery


You could override it, call the super method, then if the resulting  
query is a PhraseQuery, extract the components and replace the query  
with a span query of prefix queries, or somesuch.
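
(Roughly along these lines -- just a sketch against the 2.x QueryParser
API.  I've only converted the phrase into an ordered SpanNearQuery here;
the prefix-matching spans would still need to be plugged in where the
SpanTermQuerys are built:)

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SpanFieldQueryParser extends QueryParser {

    public SpanFieldQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
    }

    // Reuse the parser's own analysis, then rewrite any resulting
    // PhraseQuery into an ordered SpanNearQuery.
    protected Query getFieldQuery(String field, String queryText)
            throws ParseException {
        Query q = super.getFieldQuery(field, queryText);
        if (!(q instanceof PhraseQuery)) {
            return q;  // single terms etc. pass through untouched
        }
        Term[] terms = ((PhraseQuery) q).getTerms();
        SpanQuery[] clauses = new SpanQuery[terms.length];
        for (int i = 0; i < terms.length; i++) {
            clauses[i] = new SpanTermQuery(terms[i]);
        }
        return new SpanNearQuery(clauses, 0, true);
    }
}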


-Mike
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: index and access to lines of a CSV file

2007-12-13 Thread Mike Klaas

On 13-Dec-07, at 3:26 PM, Tobias Rothe wrote:

I have a quick question.  I am handling huge CSV files. They start  
with a key in the first column and are followed by data.
I need to retrieve this data randomly based on the key.  So it is  
kind of a search where I give a unique key and ideally get access to  
the right line.
The file contains about 200,000 lines or more.  I am not sure if  
Lucene can handle things like that and I did not really find a hint  
on this topic.  So I hope to find help here.


Is that all you're doing?  You could accomplish the above with Lucene  
but it isn't really needed for that.  You need some kind of on-disk  
key->value mapper.  Something like a berkeley db hashtable or btree  
should work (store each line as a key/value pair).
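
(If it does end up in Lucene anyway, the lookup is just an untokenized
key field plus a TermQuery.  A sketch -- the field names are placeholders:)

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class CsvLookup {
    // One document per CSV line: the key is searchable but not tokenized,
    // the rest of the line is only stored.
    public static void addLine(IndexWriter writer, String key, String line)
            throws Exception {
        Document doc = new Document();
        doc.add(new Field("key", key, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("line", line, Field.Store.YES, Field.Index.NO));
        writer.addDocument(doc);
    }

    public static String lookup(IndexSearcher searcher, String key)
            throws Exception {
        Hits hits = searcher.search(new TermQuery(new Term("key", key)));
        return hits.length() > 0 ? hits.doc(0).get("line") : null;
    }
}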


-Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: thoughts/suggestions for analyzing/tokenizing class names

2007-12-17 Thread Mike Klaas

On 15-Dec-07, at 3:14 PM, Beyer,Nathan wrote:

I have a few fields that use package names and class names and I've  
been

looking for some suggestions for analyzing these fields.

A few examples -

Text (class name)
- "org.apache.lucene.document.Document"
Queries that would match
- "org.apache" , "org.apache.lucene.document"

Text (class name + method signature)
-- "org.apache.lucene.document.Document#add(Fieldable)"
Queries that would match
-- "org.apache.lucene", "org.apache.lucene.document.Document#add"

Any thoughts on how to approach tokenizing these types of texts?


Perhaps it would help to include some examples of queries you _don't_  
want to match.  For all the examples above, simply tokenizing  
alphanumeric components would suffice.


-Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: thoughts/suggestions for analyzing/tokenizing class names

2007-12-17 Thread Mike Klaas

Either index them as a series of tokens:

org
org.apache
org.apache.lucene
org.apache.lucene.document
org.apache.lucene.document.Document

or index them as a single token, and use prefix queries (this is what  
I do for reverse domain names):


classname:(org.apache org.apache.*)

Note that "classname:org.apache*" would probably be wrong--you might  
not want to match


org.apache-fake.lucene.document
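
(For the first option, a plain helper that expands a name into its dotted
prefixes before indexing -- a sketch, not a real Lucene TokenStream:)

import java.util.ArrayList;
import java.util.List;

public class DottedPrefixes {
    // "org.apache.lucene.document.Document" ->
    // [org, org.apache, org.apache.lucene, org.apache.lucene.document,
    //  org.apache.lucene.document.Document]
    public static List<String> expand(String name) {
        List<String> tokens = new ArrayList<String>();
        int dot = name.indexOf('.');
        while (dot != -1) {
            tokens.add(name.substring(0, dot));
            dot = name.indexOf('.', dot + 1);
        }
        tokens.add(name);
        return tokens;
    }
}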

regards,
-Mike

On 17-Dec-07, at 9:39 AM, Beyer,Nathan wrote:


Good point.

I don't want the sub-package names on their own to match.

Text (class name)
 - "org.apache.lucene.document.Document"
Queries that would match
 - "org.apache", "org.apache.lucene.document"
Queries that DO NOT match
 - "apache", "lucene", "document"

-Nathan

-Original Message-
From: Mike Klaas [mailto:[EMAIL PROTECTED]
Sent: Monday, December 17, 2007 11:29 AM
To: java-user@lucene.apache.org
Subject: Re: thoughts/suggestions for analyzing/tokenizing class names

On 15-Dec-07, at 3:14 PM, Beyer,Nathan wrote:


I have a few fields that use package names and class names and I've
been
looking for some suggestions for analyzing these fields.

A few examples -

Text (class name)
- "org.apache.lucene.document.Document"
Queries that would match
- "org.apache" , "org.apache.lucene.document"

Text (class name + method signature)
-- "org.apache.lucene.document.Document#add(Fieldable)"
Queries that would match
-- "org.apache.lucene", "org.apache.lucene.document.Document#add"

Any thoughts on how to approach tokenizing these types of texts?


Perhaps it would help to include some examples of queries you _don't_
want to match.  For all the examples above, simply tokenizing
alphanumeric components would suffice.

-Mike

--


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: thoughts/suggestions for analyzing/tokenizing class names

2007-12-17 Thread Mike Klaas


On 17-Dec-07, at 11:39 AM, Beyer,Nathan wrote:


Would using Field.Index.UN_TOKENIZED be the same as tokenizing a field
into one token?


Indeed.
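
(I.e. something along these lines, paired with the exact-or-prefix query
from earlier in the thread -- the field name is just the example one:)

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ClassNameField {
    // Index the whole class name as a single token.
    public static void add(Document doc, String className) {
        doc.add(new Field("classname", className,
                Field.Store.YES, Field.Index.UN_TOKENIZED));
    }

    // classname:(org.apache org.apache.*) built programmatically: exact
    // match OR anything under the package (the trailing '.' avoids
    // matching org.apache-fake...).
    public static Query query(String pkg) {
        BooleanQuery q = new BooleanQuery();
        q.add(new TermQuery(new Term("classname", pkg)),
                BooleanClause.Occur.SHOULD);
        q.add(new PrefixQuery(new Term("classname", pkg + ".")),
                BooleanClause.Occur.SHOULD);
        return q;
    }
}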

-Mike



-Original Message-
From: Mike Klaas [mailto:[EMAIL PROTECTED]
Sent: Monday, December 17, 2007 12:53 PM
To: java-user@lucene.apache.org
Subject: Re: thoughts/suggestions for analyzing/tokenizing class names

Either index them as a series of tokens:

org
org.apache
org.apache.lucene
org.apache.lucene.document
org.apache.lucene.document.Document

or index them as a single token, and use prefix queries (this is what
I do for reverse domain names):

classname:(org.apache org.apache.*)

Note that "classname:org.apache*" would probably be wrong--you might
not want to match

org.apache-fake.lucene.document

regards,
-Mike

On 17-Dec-07, at 9:39 AM, Beyer,Nathan wrote:


Good point.

I don't want the sub-package names on their own to match.

Text (class name)
 - "org.apache.lucene.document.Document"
Queries that would match
 - "org.apache", "org.apache.lucene.document"
Queries that DO NOT match
 - "apache", "lucene", "document"

-Nathan

-Original Message-
From: Mike Klaas [mailto:[EMAIL PROTECTED]
Sent: Monday, December 17, 2007 11:29 AM
To: java-user@lucene.apache.org
Subject: Re: thoughts/suggestions for analyzing/tokenizing class  
names


On 15-Dec-07, at 3:14 PM, Beyer,Nathan wrote:


I have a few fields that use package names and class names and I've
been
looking for some suggestions for analyzing these fields.

A few examples -

Text (class name)
- "org.apache.lucene.document.Document"
Queries that would match
- "org.apache" , "org.apache.lucene.document"

Text (class name + method signature)
-- "org.apache.lucene.document.Document#add(Fieldable)"
Queries that would match
-- "org.apache.lucene", "org.apache.lucene.document.Document#add"

Any thoughts on how to approach tokenizing these types of texts?


Perhaps it would help to include some examples of queries you _don't_
want to match.  For all the examples above, simply tokenizing
alphanumeric components would suffice.

-Mike

- 
-


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

--


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Pagination ...

2007-12-26 Thread Mike Richmond
You might want to take a look at Solr (http://lucene.apache.org/solr/).  You
could either use Solr directly, or see how they implement paging.
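
(If you stay with plain Lucene, the usual pattern is to run the
TopFieldDocs search once and slice the ScoreDoc array per page -- a rough
sketch; the page size and the max-hits cap are placeholders:)

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopFieldDocs;

public class Pager {
    // Fetch the top maxHits once, then hand out pageSize docs per page.
    public static Document[] page(IndexSearcher searcher, Query query,
            Sort sort, int pageNum, int pageSize, int maxHits)
            throws Exception {
        TopFieldDocs top = searcher.search(query, null, maxHits, sort);
        ScoreDoc[] hits = top.scoreDocs;
        int start = pageNum * pageSize;
        int end = Math.min(start + pageSize, hits.length);
        if (start >= end) {
            return new Document[0];
        }
        Document[] docs = new Document[end - start];
        for (int i = start; i < end; i++) {
            docs[i - start] = searcher.doc(hits[i].doc);
        }
        return docs;
    }
}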


--Mike


On Dec 26, 2007 12:12 PM, Zhou Qi <[EMAIL PROTECTED]> wrote:

> Using the search function for pagination will carry out an unnecessary index
> search when you are going to the previous or next page. Generally, most of the
> information need (e.g. 80%) can be satisfied by the first 100 documents
> (20%). In Lucene, the number of returned documents is set to 100 for the sake
> of speed.
>
> I am not quite sure my way of pagination is best, but it works fine under
> test pressure: just keep the first search result in cache and fetch the
> snippet when the document is presented in the current page.
>
> 2007/12/26, Dragon Fly <[EMAIL PROTECTED]>:
> >
> >
> > Any advice on this? Thanks.
> >
> > > From: [EMAIL PROTECTED]
> > > To: java-user@lucene.apache.org
> > > Subject: Pagination ...
> > > Date: Sat, 22 Dec 2007 10:19:30 -0500
> > >
> > >
> > > Hi,
> > >
> > > What is the most efficient way to do pagination in Lucene? I have
> always
> > done the following because this "flavor" of the search call allows me to
> > specify the top N hits ( e.g. 1000) and a Sort object:
> > >
> > > TopFieldDocs topFieldDocs = searcher.search(query, null, 1000,
> > SORT_BY_DATE);
> > >
> > > Is it the best way? Thank you.
> > >
> >
>

