Hi Otmar,
Shouldn't Occur.SHOULD alone do what you ask? Documents that match all
terms in the query would be scored higher than documents that match fewer
than all terms.
-sujit
On Fri, Mar 25, 2016 at 2:20 AM, Otmar Caduff wrote:
> Hi all
> In Lucene, I know of the possibility of Occur.SHOULD
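A minimal sketch of the suggestion, using the Lucene 5.x-era BooleanQuery.Builder API and a hypothetical "body" field (earlier versions build the query with BooleanQuery.add directly):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ShouldQueryDemo {
  public static void main(String[] args) {
    // Every SHOULD clause is optional, but each matched clause adds to the
    // score, so documents matching all three terms rank highest.
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (String term : new String[] {"lucene", "search", "index"}) {
      builder.add(new TermQuery(new Term("body", term)), BooleanClause.Occur.SHOULD);
    }
    Query query = builder.build();
    System.out.println(query); // body:lucene body:search body:index
  }
}
```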
Hi Ali,
I agree with the others that there is no good way to do what you are
looking for if you want to assign lucene-like scores to your external
results, but if you have some objective measure of goodness that doesn't
depend on your lucene scores, you can apply it to both result sets and
merge t
I did something like this sometime back. The objective was to find patterns
surrounding some keywords of interest so I could find keywords similar to
the ones I was looking for, sort of like a poor man's word2vec. It uses
SpanQuery as Jigar said, and you can find the code here (I believe it was
wri
Hi Shouvik, not sure if you have already considered this, but you could put
the database primary key for the record into the index - ie, reverse your
insert to do DB first, get the record_id and then add this to the Lucene
index as "record_id" field. During retrieval you can minimize the network
tr
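A sketch of the reversed insert, assuming a Lucene 4+ field API and a hypothetical "record_id" field name: the DB generates the primary key first, then the key is indexed untokenized and stored so hits can be joined back to the database.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;

public class RecordIdIndexing {
  // StringField is indexed as a single untokenized term and (here) stored,
  // so it can be both filtered on and read back from search results.
  static Document makeDoc(long recordId) {
    Document doc = new Document();
    doc.add(new StringField("record_id", String.valueOf(recordId), Field.Store.YES));
    // ... add the searchable content fields, then writer.addDocument(doc)
    return doc;
  }
}
```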
Hi John,
Take a look at the PerFieldAnalyzerWrapper. As the name suggests, it allows
you to create different analyzers per field.
-sujit
On Fri, Sep 19, 2014 at 6:50 AM, John Cecere wrote:
> I've considered this, but there are two problems with it. First of all, it
> feels like I'm still taki
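A minimal sketch of the PerFieldAnalyzerWrapper setup, with hypothetical field names (constructors shown are Lucene 5.x-style; 4.x versions take a Version argument):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class PerFieldDemo {
  static Analyzer buildAnalyzer() {
    // Fields listed in the map get their own analyzer; everything else
    // falls back to the default (first constructor argument).
    Map<String, Analyzer> perField = new HashMap<>();
    perField.put("id", new KeywordAnalyzer());
    return new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
    // pass the wrapper to IndexWriterConfig and to the QueryParser
  }
}
```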
Hi Arjen,
This is kind of a spin on your last observation that your list of stop
words doesn't change frequently. You could have a custom filter that
attempts to stem the incoming token, and only if the stem matches a
stopword, set the keyword attribute on the original token.
That way your
Hi Arjen,
You could also mark a token as "keyword" so the stemmer passes it through
unchanged. For example, per the Javadocs for PorterStemFilter:
http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html
Note: This filter is aware of the KeywordAttr
Hi Rafaela,
I built something along these lines as a proof of concept. All data in the
index was unstored and only fields which were searchable (tokenized and
indexed) were kept in the index. The full record was encrypted and stored in a
MongoDB database. A custom Solr component did the search
Hi Michael,
Instead of putting the annotation in Payloads, why not put them in as
"synonyms", ie at the same spot as the original string (see SynonymFilter in
the LIA book). So your string would look like (to the index):
W. A. Mozart was born in Salzburg
artist                   city
so you ca
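A sketch of a filter that injects such annotation tokens, assuming a hypothetical lookup map (e.g. {"mozart": "artist", "salzburg": "city"}): the trick is emitting the annotation with a position increment of zero, which is how SynonymFilter stacks synonyms on the original token.

```java
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public final class AnnotationInjectFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);
  private final Map<String, String> annotations; // hypothetical token -> annotation map
  private String pending;
  private AttributeSource.State state;

  public AnnotationInjectFilter(TokenStream input, Map<String, String> annotations) {
    super(input);
    this.annotations = annotations;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pending != null) {
      restoreState(state);                 // keep the original token's offsets
      termAtt.setEmpty().append(pending);
      posIncrAtt.setPositionIncrement(0);  // same position as the original token
      pending = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    pending = annotations.get(termAtt.toString());
    if (pending != null) {
      state = captureState();
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending = null;
  }
}
```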
Hi Uwe,
I see, makes sense, thanks very much for the info. Sorry about giving you
the wrong info, Carsten.
-sujit
On Apr 15, 2013, at 1:06 PM, Uwe Schindler wrote:
> Hi,
>
>> -----Original Message-----
>> From: Sujit Pal [mailto:sujitatgt...@gmail.com] On Behalf Of SUJIT PAL
>>
>> To: java-user@lucene.apache.org
>> Subject: Re: Statically store sub-collections for search (faceted search?)
>>
>> Am 12.04.2013 20:08, schrieb SUJIT PAL:
>>> Hi Carsten,
>>>
>>> Why not use your idea of the BooleanQuery but wrap it in a Filter instead?
Hi Carsten,
Why not use your idea of the BooleanQuery but wrap it in a Filter instead?
Since you are not doing any scoring (only filtering), the max boolean clauses
limit should not apply to a filter.
-sujit
On Apr 12, 2013, at 7:34 AM, Carsten Schnober wrote:
> Dear list,
> I would like to c
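A sketch of that suggestion with Lucene 3.x/4.x-era APIs and a hypothetical "docid" field (note BooleanQuery.setMaxClauseCount may still need raising for the query itself to accept that many clauses):

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class FilteredSearchDemo {
  // Wrap the big BooleanQuery in a QueryWrapperFilter so it only restricts
  // the result set; scoring comes entirely from the user's query.
  static TopDocs searchWithIdFilter(IndexSearcher searcher, Query userQuery,
                                    List<String> ids) throws IOException {
    BooleanQuery idQuery = new BooleanQuery();
    for (String id : ids) {
      idQuery.add(new TermQuery(new Term("docid", id)), BooleanClause.Occur.SHOULD);
    }
    Filter filter = new QueryWrapperFilter(idQuery);
    return searcher.search(userQuery, filter, 10);
  }
}
```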
Hi Jerome,
How about this one?
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ISOLatin1AccentFilterFactory
Regards,
Sujit
On Mar 22, 2013, at 9:22 AM, Jerome Blouin wrote:
> Hello,
>
> I'm looking for an analyzer that allows performing accent insensitive search
> in latin l
Hi Glen,
I don't believe you can attach a single payload to multiple tokens. What I did
for a similar requirement was to combine the tokens into a single "_"-delimited
token and attach the payload to it. For example:
The Big Bad Wolf huffed and puffed and blew the house of the Three Li
> - Compute Sim(query q, doc
> d).
> - Reorder the results based on the Sim(query q, doc d) results.
>
> Is this the best way? I can't see a way to compute the Sim() metric at
> any other time, because in scorePayload(), we don't have access to the
> full payload, nor to
Hi Stephen,
We are doing something similar, and we store as a multifield with each
document as (d,z) pairs where we store the z's (scores) as payloads for
each d (topic). We have had to build a custom similarity which
implements the scorePayload function. So to find docs for a given d
(topic), we
Hi Grant,
Not sure if this qualifies as a "bet you didn't know", but one could use
Lucene term vectors to construct document vectors for similarity,
clustering and classification tasks. I found this out recently (although
I am probably not the first one), and I think this could be quite
useful.
-
Hi Mead,
You may want to check out the permuterm index idea.
http://www-nlp.stanford.edu/IR-book/html/htmledition/permuterm-indexes-1.html
Basically you write a custom filter that takes a term and generates all
rotations of it. On the query side, you convert your query so
it's always a p
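The rotation step is plain string manipulation. A minimal sketch, using the conventional '$' end-of-term marker from the IR book: index every rotation, and a wildcard query such as "he*lo" can be rewritten as a prefix query on the rotation "lo$he".

```java
import java.util.ArrayList;
import java.util.List;

public class Permuterm {
  // Append the '$' end marker, then emit every rotation of the marked term.
  // Each rotation would be indexed as a separate token by the custom filter.
  public static List<String> rotations(String term) {
    String marked = term + "$";
    List<String> result = new ArrayList<>();
    for (int i = 0; i < marked.length(); i++) {
      result.add(marked.substring(i) + marked.substring(0, i));
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(rotations("hello"));
    // [hello$, ello$h, llo$he, lo$hel, o$hell, $hello]
  }
}
```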
Hi Paul,
Since you have modified the StandardAnalyzer (I presume you mean
StandardFilter), why not do a check on the term.text() and if it's all
punctuation, skip the analysis for that term? Something like this in
your StandardFilter:
public final boolean incrementToken() throws IOException {
Ch
tion is a bit academic at this point, we are
planning on multiplying the "docboost" into the SCORE values as they are
added into the index.
-sujit
On Wed, 2011-10-12 at 18:16 -0700, Sujit Pal wrote:
> Hi,
>
> Question about Payload Query and Document Boosts. We are using Lucene
>
Hi,
Question about Payload Query and Document Boosts. We are using Lucene
3.2 and Payload queries, with our own PayloadSimilarity class which
overrides the scorePayload method like so:
{code}
@Override
public float scorePayload(int docId, String fieldName,
int start, int end, byte[] pa
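For reference, a complete sketch of that override against the Lucene 3.x Similarity API, assuming the payload was written with PayloadHelper.encodeFloat at index time:

```java
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

public class PayloadSimilarity extends DefaultSimilarity {
  // Decode the float stored in the payload and use it as the payload score;
  // fall back to a neutral 1.0 when a term carries no payload.
  @Override
  public float scorePayload(int docId, String fieldName, int start, int end,
                            byte[] payload, int offset, int length) {
    if (payload == null || length == 0) {
      return 1.0f;
    }
    return PayloadHelper.decodeFloat(payload, offset);
  }
}
```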
Depending on what you wanted to do with the Javabean (I assume you want
to make some or all its fields searchable since you are writing to
Lucene), you could use reflection to break it up into field name value
pairs and write them out to the IndexWriter using something like this:
Document d = new
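The reflection step can be done with the stdlib bean introspector. A sketch that flattens a bean into (name, value) string pairs; each pair would then become a Lucene Field added to the Document (the Book class is a hypothetical example bean):

```java
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.LinkedHashMap;
import java.util.Map;

public class BeanFields {
  // Walk the bean's getters (excluding Object's) and collect name/value
  // pairs; callers would turn each entry into a Lucene Field.
  public static Map<String, String> toFields(Object bean) throws Exception {
    Map<String, String> fields = new LinkedHashMap<>();
    for (PropertyDescriptor pd :
         Introspector.getBeanInfo(bean.getClass(), Object.class).getPropertyDescriptors()) {
      if (pd.getReadMethod() == null) {
        continue; // skip write-only properties
      }
      Object value = pd.getReadMethod().invoke(bean);
      if (value != null) {
        fields.put(pd.getName(), value.toString());
      }
    }
    return fields;
  }

  // Hypothetical example bean.
  public static class Book {
    public String getTitle() { return "Lucene in Action"; }
    public String getAuthor() { return "Hatcher"; }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(toFields(new Book())); // title and author pairs
  }
}
```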
08:21 +0200, Simon Willnauer wrote:
> On Wed, Jun 22, 2011 at 8:53 PM, Sujit Pal wrote:
> > Hello,
> >
> > I am currently in need of a LowerCaseFilter and StopFilter that will
> > recognize KeywordAttribute, similar to the way PorterStemFilter
> > currently does (on
Hello,
I am currently in need of a LowerCaseFilter and StopFilter that will
recognize KeywordAttribute, similar to the way PorterStemFilter
currently does (on trunk). Specifically, when KeywordAttribute.isKeyword()
is true for a term, they should not lowercase or remove it, respectively.
This can be ach
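A sketch of the lowercasing half against the Lucene 3.1-era attribute API (the simple per-char lowercasing here ignores supplementary characters, which the real LowerCaseFilter handles):

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

public final class KeywordAwareLowerCaseFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);

  public KeywordAwareLowerCaseFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Tokens flagged as keywords pass through unchanged, the way
    // PorterStemFilter skips them.
    if (!keywordAtt.isKeyword()) {
      char[] buffer = termAtt.buffer();
      int length = termAtt.length();
      for (int i = 0; i < length; i++) {
        buffer[i] = Character.toLowerCase(buffer[i]);
      }
    }
    return true;
  }
}
```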
Hi Leroy,
Would it make sense to index as Lucene documents the unit to be
searched? So if you want paragraphs to be shown in search results, you
could parse the source document during indexing into paragraphs and
index them as separate Lucene documents.
-sujit
On Wed, 2011-05-25 at 15:46 -0400,
Thank you Koji. I opened LUCENE-3141 for this.
https://issues.apache.org/jira/browse/LUCENE-3141
-sujit
On Tue, 2011-05-24 at 22:33 +0900, Koji Sekiguchi wrote:
> (11/05/24 3:28), Sujit Pal wrote:
> > Hello,
> >
> > My version: Lucene 3.1.0
> >
> > I
I meant to check out the Semantic vectors project, but never got around
to it, so there is nothing in the blog (sujitpal.blogspot.com) that
talks about semantic vectors at the moment. It's on my (rather long) todo
list, though... Sorry about that...
-sujit
On Mon, 2011-05-23 at 21:22 -0300, Diego C
Hello,
My version: Lucene 3.1.0
I've had to customize the snippet for highlighting based on our
application requirements. Specifically, instead of the snippet being a
set of relevant fragments in the text, I need it to be the first
sentence where a match occurs, with a fixed size from the beginni
Hi Deepak,
Would something like this work in your case?
"Arcos Bioscience"^2.0 "Arcos" "Bioscience"
ie, a BooleanQuery with the full phrase boosted OR'd with a query on
each word?
-sujit
On Tue, 2011-04-26 at 14:46 -0400, Deepak Konidena wrote:
> Hi,
>
> Currently when I type in Arcos Bioscie
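That query can be built programmatically too. A sketch with the Lucene 3.x-era mutable query API and a hypothetical "name" field, assuming lowercasing analysis:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class BoostedPhraseDemo {
  // The exact phrase is boosted, so documents matching it outrank
  // documents that match only one of the individual words.
  static BooleanQuery build() {
    PhraseQuery phrase = new PhraseQuery();
    phrase.add(new Term("name", "arcos"));
    phrase.add(new Term("name", "bioscience"));
    phrase.setBoost(2.0f);

    BooleanQuery query = new BooleanQuery();
    query.add(phrase, BooleanClause.Occur.SHOULD);
    query.add(new TermQuery(new Term("name", "arcos")), BooleanClause.Occur.SHOULD);
    query.add(new TermQuery(new Term("name", "bioscience")), BooleanClause.Occur.SHOULD);
    return query;
  }
}
```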
I don't know if there is already an analyzer available for this, but you
could use GATE or UIMA for Named Entity Extraction against names and
expand the query to include the extra names that are used synonymously.
You could do this outside Lucene or inline using a custom Lucene
tokenizer that embed
but not all the other methods that are calculating
> the similarity scores...
>
>
> those methods are called and they have the implementation you have in
> DefaultSimilarityClass.. right ?
>
>
>
>
> On 1 March 2011 21:12, Sujit Pal wrote:
> One
One way to do this currently is to build a per field similarity wrapper
(that triggers off the field name). I believe there is some work going
on with Lucene Similarity that would make it pluggable for this sort of
stuff, but in the meantime, this is what I did:
public class MyPerFieldSimilarityWr
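A sketch of that pattern against the Lucene 3.x Similarity API, whose callbacks receive the field name; the "title" special case is a hypothetical example:

```java
import org.apache.lucene.search.DefaultSimilarity;

public class MyPerFieldSimilarity extends DefaultSimilarity {
  // Branch on the field name to change scoring behavior per field, e.g.
  // turning off length normalization for short title fields.
  @Override
  public float lengthNorm(String fieldName, int numTerms) {
    if ("title".equals(fieldName)) {
      return 1.0f; // no length normalization for titles
    }
    return super.lengthNorm(fieldName, numTerms);
  }
}
```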