Sounds useful. I suppose this means one would have a custom function for
within-bucket reordering? E.g. for a web search you might reorder based on
URL length, if you think shorter URLs are an indicator of higher quality. It
also sounds like something that can easily sit outside Lucene or
Sorry about not responding to this before now, been a little busy :).
For those of you who don't know me, I am a committer on the Nutch
project. I have been working with Wikia since early July and more
actively since the beginning of November. Before Wikia I helped start
another search engin
Ryan McKinley wrote:
Andrzej Bialecki wrote:
Lukas Vlcek wrote:
So starring will be accommodated only during the indexing phase. Does it
mean it
will be a pretty static value, not a dynamically changing variable...
correct?
In other words, if I add my stars to some document it won't affect the
scoring
On Jan 8, 2008 11:48 PM, chris.b <[EMAIL PROTECTED]> wrote:
>
> Wrapping the WhitespaceAnalyzer with the n-gram filter creates unigrams
> and
> the n-grams that I indicate, while maintaining the whitespace. :)
> The reason I'm doing this is because I only wish to index names with more
> than one token.
This is done by Lucene's scorers. A good place to start is
http://lucene.apache.org/java/docs/scoring.html - scorers
are described in the "Algorithm" section. "Offsets" are used
by the phrase scorers and by the span scorer.
Doron
On Jan 8, 2008 11:24 PM, Marjan Celikik < [EMAIL PROTECTED]> wrote:
>
Wrapping the WhitespaceAnalyzer with the n-gram filter creates unigrams and
the n-grams that I indicate, while maintaining the whitespace. :)
The reason I'm doing this is because I only wish to index names with more
than one token.
--
View this message in context:
http://www.nabble.com/Basic-Nam
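The n-gram expansion being described can be illustrated in isolation. This is just the string manipulation an n-gram filter performs on a single token, not Lucene's actual filter class; the class and method names below are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

class NGrams {
    // Produce all character n-grams of one token for sizes min..max,
    // mirroring what an n-gram token filter emits per input token.
    static List<String> ngrams(String token, int min, int max) {
        List<String> out = new ArrayList<String>();
        for (int n = min; n <= max; n++) {
            for (int i = 0; i + n <= token.length(); i++) {
                out.add(token.substring(i, i + n));
            }
        }
        return out;
    }
}
```

For example, ngrams("abc", 2, 3) yields [ab, bc, abc]; running each whitespace-separated token through this, while also emitting the token itself, gives the unigrams-plus-n-grams behavior described above.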
yes, no worries.
i just check in advance what fields are available and build the Sort
object accordingly. Eventually BCC would be there... but not
necessarily at first.
Anyway, got it to work! Thanks for your help.
All the best,
Michael
On Jan 8, 2008, at 4:37 PM, Doron Cohen wrote:
H
Hi Michael, I think you mean the exception thrown when you
search and sort with a field that was not yet indexed:
RuntimeException: field "BCC" does not appear to be indexed
I think the current behavior is correct, otherwise an application
might (by a bug) attempt to sort by a wrong field, th
my mistake, I thought I was looking at the solr mailing list ;)
If you change your analyzer, it does not change the tokens that are
already in the index -- you will need to re-index for any changes to
take effect.
ryan
Michael Prichard wrote:
Meaning that it says "field is not indexed". Wh
Doron Cohen wrote:
Hi Marjan,
Lucene processes the query in what can be called
one-doc-at-a-time fashion.
For the example query - x y - (not the phrase query "x y") - all
documents containing either x or y are considered a match.
When processing the query - x y - the posting lists of these two
index ter
Meaning that it says "field is not indexed". Where is the
sortMissingLast attribute?
thanks.
On Jan 8, 2008, at 4:13 PM, Ryan McKinley wrote:
what do you mean by "fail"? -- there is the sortMissingLast attribute
Michael Prichard wrote:
ok... i should read the manual more often.
i went ahead a
what do you mean by "fail"? -- there is the sortMissingLast attribute
Michael Prichard wrote:
ok... i should read the manual more often.
i went ahead and just added untokenized, unstored sort fields
question: if I put a field in to sort on, but say I have not indexed any
as of yet... will
Hi Chris,
A null pointer exception can be caused by not checking
newToken for null after this line:
Token newToken = input.next();
I think Hoss meant to call next() on the input as long as returned
tokens do not satisfy the check for being a named entity.
Also, this code assumes white space i
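Doron's suggestion can be sketched with stand-in classes. Token and TokenStream here are minimal mock-ups of the Lucene interfaces, and isNamedEntity is a toy placeholder for the poster's real check, not anything from the Lucene API:

```java
import java.io.IOException;

// Minimal stand-in for Lucene's Token.
class Token {
    private final String text;
    Token(String text) { this.text = text; }
    String termText() { return text; }
}

// Minimal stand-in for Lucene's TokenStream.
abstract class TokenStream {
    abstract Token next() throws IOException;
}

class NamedEntityFilter extends TokenStream {
    private final TokenStream input;
    NamedEntityFilter(TokenStream input) { this.input = input; }

    // Keep pulling tokens until one passes the named-entity check;
    // return null at end of stream instead of dereferencing it (the
    // missing null check is what causes the NPE discussed above).
    Token next() throws IOException {
        Token newToken = input.next();
        while (newToken != null && !isNamedEntity(newToken.termText())) {
            newToken = input.next();
        }
        return newToken;
    }

    // Toy placeholder check: treat capitalized tokens as named entities.
    private boolean isNamedEntity(String termText) {
        return termText.length() > 0 && Character.isUpperCase(termText.charAt(0));
    }
}
```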
I should note that this technique is probably not easily applicable to the
current Lucene scoring mechanism without additional development.
On 1/8/08, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>
> After checking the Lucene API of ParallelReader it seems that the star
> score could be stored in a different
After checking the Lucene API of ParallelReader, it seems that the star score
could be stored in a different index which shares the same identifiers for the
documents. Such an index could be small (partitioned into many small indices?) so
the updates can be fast. Is that what you meant, Andrzej? ;-)
Anyway,
Andrzej Bialecki wrote:
Lukas Vlcek wrote:
So starring will be accommodated only during the indexing phase. Does it
mean it
will be a pretty static value, not a dynamically changing variable...
correct?
In other words, if I add my stars to some document it won't affect the
scoring immediately but after
ok... i should read the manual more often.
i went ahead and just added untokenized, unstored sort fields
question: if I put a field in to sort on, but say I have not indexed
any as of yet... will the Sort fail? For example, say I have a BCC
field and nothing has been indexed with that yet
Lukas Vlcek wrote:
So starring will be accommodated only during the indexing phase. Does it mean it
will be a pretty static value, not a dynamically changing variable... correct?
In other words, if I add my stars to some document it won't affect the
scoring immediately but only after an indexing cycle. Correct?
So starring will be accommodated only during the indexing phase. Does it mean it
will be a pretty static value, not a dynamically changing variable... correct?
In other words, if I add my stars to some document it won't affect the
scoring immediately but only after an indexing cycle. Correct?
On 1/8/08, Dennis Ku
I'm surprised they aren't keeping *any* logs, or so they claim. Seems foolish
to me from a data-mining perspective.
"A Wikia employee told me today that people were already asking what the
most popular search terms were. He said there was no way of finding out as
no logs are kept." [1]
[1]
http://r
Star ratings are being stored but not accounted for in the score as of
yet. The plan is to include them in future indexing scores. :)
Dennis
Mike Klaas wrote:
On 7-Jan-08, at 11:49 PM, Lukas Vlcek wrote:
This would be great!
I am particularly interested in how they are going about customized
On 7-Jan-08, at 11:49 PM, Lukas Vlcek wrote:
This would be great!
I am particularly interested in how they are going about customized
search (if
they have a plan to do it). I mean, if they can reorder raw search
results
based on some kind of collective knowledge (which is probably kept
outsid
Hi Marjan,
Lucene processes the query in what can be called
one-doc-at-a-time fashion.
For the example query - x y - (not the phrase query "x y") - all
documents containing either x or y are considered a match.
When processing the query - x y - the posting lists of these two
index terms are traversed, and
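The one-doc-at-a-time traversal Doron describes can be modeled as a toy union of two sorted posting lists. This is illustrative only; Lucene's real scorers also accumulate a score for each matching document as the lists advance:

```java
import java.util.ArrayList;
import java.util.List;

class DocAtATime {
    // Doc-at-a-time union of two sorted posting lists (doc ids ascending),
    // modeling how an OR query - x y - advances both term lists together:
    // the smaller current doc id is emitted, and each list that contained
    // it steps forward.
    static List<Integer> union(int[] xPostings, int[] yPostings) {
        List<Integer> matches = new ArrayList<Integer>();
        int i = 0, j = 0;
        while (i < xPostings.length || j < yPostings.length) {
            int x = i < xPostings.length ? xPostings[i] : Integer.MAX_VALUE;
            int y = j < yPostings.length ? yPostings[j] : Integer.MAX_VALUE;
            int doc = Math.min(x, y);
            matches.add(doc);      // doc contains x, y, or both
            if (x == doc) i++;
            if (y == doc) j++;
        }
        return matches;
    }
}
```

With x in docs {1, 3, 5} and y in docs {2, 3, 6}, the union visits each matching document exactly once: [1, 2, 3, 5, 6].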
Is it possible to sort on a tokenized field? For example, I break
email address into pieces, i.e.
[EMAIL PROTECTED]
becomes
[EMAIL PROTECTED]
michael.prichard
michael
prichard
email.com
email
so when sorting on this field I get some strange results. Do I need
to create another field jus
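The usual answer (a sketch against the Lucene 2.x field API; the field names are illustrative, not from the original message) is to keep the tokenized field for searching and add a parallel untokenized field used only for sorting:

```java
// "email" is searched as before; "email_sort" exists only so Sort has a
// single indexed term per document to order by.
doc.add(new Field("email", emailAddress,
                  Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("email_sort", emailAddress.toLowerCase(),
                  Field.Store.NO, Field.Index.UN_TOKENIZED));
```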
Hi Sachin,
If you need a self-join, you may need to retrieve the data from the
second query and merge it into each Document object. Then you can do
the query in one shot. (It's redundant, but do not try to normalize
data in the index.)
Lucene is an index. Just like an index in a SQL database, which c
I think the problem is that he is calling getBestFrags on every hit
result for 200-page documents. So he is probably getting the document
for every result and running the Highlighter on each. That's some slow
stuff there. The first simple thought is to page your results and only
call getBestFrags for
Are you just trying to search or are you trying to highlight?
Usually, you do your search, and then highlight 1 or more documents.
You can also speed up highlighting by using term vectors.
-Grant
On Jan 8, 2008, at 9:38 AM, Yannick Caillaux wrote:
Hello,
First, sorry for my bad English.
Cool. I just realized that compass also has an annotation value of
analyzer. Now I'll just have to find out if you can truly have more
than one per index.
Thanks!
I *think* you want to consider writing your own analyzer that tokenizes
your special fields however you want to. Then use PerFieldAnalyzerWrapper
at both index and query time to break the stream up appropriately for this
(and only this) field.
This would give you two tokens for the attribute fiel
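A sketch of the PerFieldAnalyzerWrapper setup being suggested, against the Lucene 2.x API. AttributeAnalyzer is hypothetical, standing in for whatever custom analyzer splits "SomeValue=1.9" into its two tokens:

```java
// Default analyzer for every field, with a special case for "attribute";
// pass the same wrapper to both IndexWriter and the query parser so the
// field is tokenized identically at index and query time.
PerFieldAnalyzerWrapper analyzer =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer());
analyzer.addAnalyzer("attribute", new AttributeAnalyzer()); // hypothetical
```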
Following your suggestion (I think), I built a tokenfilter with the following
code for next():
public final Token next() throws IOException {
    Token newToken = input.next();
    termText = newToken.termText();
    Character tempChar = termText.charAt
It's often a mistake to try to force Lucene to act like a database. Is it
possible to just use the database for the join portion and Lucene
for the text search?
Otherwise I agree with Developer Developer. You need to provide
a higher level idea of *what* it is you're trying to accomplish to get
go
I have an index that contains a couple special fields that I need to
tokenize differently than the rest. The case is that I basically
have a key/value pair stored as the value. The field name is
"attribute" and its value is "SomeValue=1.9"
I need to tokenize the value so that I can search on t
Provide more details please.
Can you not use a boolean query and filters if need be?
On Jan 8, 2008 7:23 AM, sachin <[EMAIL PROTECTED]> wrote:
>
> I need to write a Lucene query, something similar to SQL self-joins.
>
> My current implementation is very primitive. I fire the first query, get the
> res
Hello,
First, sorry for my bad English.
I have an index including 100 Dublin Core records. I indexed
title and creator, and I added a field "fulltext" containing the PDF
document referenced by the DC record. (A PDF document is about 200 pages.)
There's no problem indexing them. But when I try
On Jan 8, 2008, at 2:55 AM, Lukas Vlcek wrote:
BTW:
1) If they have made any improvements/changes to Nutch (or Lucene/
Hadoop)
code and they keep it closed, then how can they claim they are using
open-source
algorithms?
They are "using" it, they just aren't sharing it. Many companies out
I need to write a Lucene query, something similar to SQL self-joins.
My current implementation is very primitive. I fire the first query, get the
results, and based on the result of the first query I fire a second query and
then merge the results from both queries. The whole processing is very
expensive. Doi
Did you try SpanFirstQuery?
I had the same need in my application, for implementing type-ahead
functionality over the titles and I found that storing them as un_tokenized
gives the best performance (of course, I don't run any query, but iterate
over the terms in my solution).
Span queries are expe
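For reference, the SpanFirstQuery being suggested might look like this (a sketch against the Lucene 2.x span API; field and term values are illustrative):

```java
// Match documents whose "title" field contains "lucene" within the first
// position -- i.e. titles that start with that term.
SpanFirstQuery query = new SpanFirstQuery(
    new SpanTermQuery(new Term("title", "lucene")), 1);
```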
Hello,
I am having a problem with PrefixQuery:
I have a field named item_title which is indexed as:
doc.add(new Field("item_title", item_title.trim().toLowerCase(),
Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
and I am forming my query like:
PrefixQuery pq = new PrefixQuery((
On Mon, 2008-01-07 at 14:20 -0800, Otis Gospodnetic wrote:
> Please post your results, Lars!
Tried the patch, and it failed to compile (plain Lucene compiled fine).
In the process, I looked at TermQuery and found that it'd be easier to
copy that code and just hardcode 1.0f for all norms. Did tha
Otis Gospodnetic wrote:
Is your user field stored? If so, you could find the target Document, get the
user field value, modify it, and re-add it to the Document (or something
close to this -- I am doing this with one of the indices on simpy.com and
it's working well).
No, it's not stored. I'm