Hi all, has anybody tried to use LSI (latent semantic indexing) for
indexing in Lucene?
On Mar 17, 2009, at 5:44 AM, liat oren wrote:
Thanks for all the answers.
I am new to Lucene, and these emails are the first time I have heard of
bigrams, so I read about them a bit.
Question - if I query for "cat animal" - or use boosting - "cat^2
animal^0.5" - will the results return ONLY documents that contain both?
Can you provide a little more info on what you want to do? For
instance, you could just have a buffering TokenFilter that stores up
the tokens and counts them and then spits them back out, but I somehow
suspect that is not what you are after.
-Grant
On Mar 17, 2009, at 6:03 AM, Ильдар Аши
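A minimal sketch of such a buffering filter, written against the pre-2.9
next()-based TokenStream API; the class name and the freq() accessor are
made up:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Buffers the entire input stream, counts term frequencies,
    // then replays the buffered tokens unchanged.
    public class CountingBufferFilter extends TokenFilter {
      private final List<Token> buffer = new ArrayList<Token>();
      private final Map<String, Integer> counts = new HashMap<String, Integer>();
      private int pos = -1; // -1 until the input has been drained

      public CountingBufferFilter(TokenStream input) {
        super(input);
      }

      public Token next() throws IOException {
        if (pos < 0) { // first call: drain and count the whole input
          for (Token t = input.next(); t != null; t = input.next()) {
            buffer.add((Token) t.clone());
            Integer n = counts.get(t.term());
            counts.put(t.term(), n == null ? 1 : n + 1);
          }
          pos = 0;
        }
        return pos < buffer.size() ? buffer.get(pos++) : null;
      }

      // Counts are complete once next() has returned null.
      public int freq(String term) {
        Integer n = counts.get(term);
        return n == null ? 0 : n;
      }
    }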
: The final "production" computation is one-time, still, I have to recurrently
: come back and correct some errors, then retry...
this doesn't really seem like a problem ideally suited for Lucene ... this
seems like the type of problem sequential batch crunching could solve
better...
first pass
: > I suppose SpanTermQuery could override the weight/scorer methods so that
: > it behaved more like a TermQuery if it was executed directly ... but
: > that's really not what it's intended for.
:
: This is currently the only way to boost a term via payloads.
: BoostingTermQuery extends SpanTermQuery
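At search time that looks roughly like the sketch below; the field name, the
payload encoding, and the searcher variable are assumptions:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.payloads.BoostingTermQuery;

    // Scores each occurrence of the term by the payload written at index time.
    Query q = new BoostingTermQuery(new Term("body", "cat"));
    searcher.setSimilarity(new DefaultSimilarity() {
      public float scorePayload(String field, byte[] payload, int offset, int length) {
        // The encoding is up to you; here the payload's first byte is the boost.
        return payload == null ? 1.0f : payload[offset];
      }
    });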
I've recently upgraded to Solr 1.3, which uses Lucene 2.4. One of the reasons I
upgraded was the nicer SearchComponent architecture, which let me
add a needed feature to the default request handler. Simply put, I needed to
filter a query based on some additional parameters. So I subclassed
Query
On Mar 17, 2009, at 2:32 PM, Aaron Schon wrote:
how would I go about recommending that Jane Doe connect to Frank
Jones? I hope you can help a newbie by pointing out where I should be
looking.
You might want to read something about it to get you started:
"Programming Collective Intelligence"
http
You may want to try Filters (starting from TermFilter) for this, especially
those based on the default OpenBitSet (see the intersection count method)
because of your interest in stop words.
10k OpenBitSets for 39 M docs will probably not fit in memory in one go,
but that can be worked around by kee
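A sketch of that bitset approach, with made-up field and term names (reader
is an open IndexReader):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.util.OpenBitSet;

    // One OpenBitSet of matching doc ids per term.
    static OpenBitSet bitsFor(IndexReader reader, Term term) throws IOException {
      OpenBitSet bits = new OpenBitSet(reader.maxDoc());
      TermDocs td = reader.termDocs(term);
      while (td.next()) bits.set(td.doc());
      td.close();
      return bits;
    }

    // Co-occurrence count of two terms = cardinality of the intersection:
    long both = OpenBitSet.intersectionCount(
        bitsFor(reader, new Term("body", "cat")),
        bitsFor(reader, new Term("body", "animal")));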
Michael McCandless wrote:
Is this a one-time computation? If so, couldn't you wait a long time
for the machine to simply finish it?
The final "production" computation is one-time, still, I have to
recurrently come back and correct some errors, then retry...
With the simple approach (doing 100 million 2-term AND queries), how
long do you estimate it'd take?
Hello,
I was trying to use the DuplicateFilter API in contrib/queries for Lucene in
an application, but it doesn't seem to be accepted as a valid argument to the
searcher.search function. I'm using Apache Lucene 2.4.0.
Here's what I did:
DuplicateFilter df = new DuplicateFilter("NAME");
df.setKeepMode(...);
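In 2.4, DuplicateFilter extends org.apache.lucene.search.Filter, so it should
work with the search overloads that take a Filter; a sketch, assuming the
contrib lucene-queries jar is on the classpath and with a made-up query:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    // search(Query, Filter, int) accepts any Filter, DuplicateFilter included:
    TopDocs hits = searcher.search(new TermQuery(new Term("CITY", "paris")), df, 10);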
Hi Paul,
If you do not store all the data inside Lucene, you have to get your
updated data from your spreadsheet again. Even if you did store all
the data, you would have to update the document by creating a new one
and adding it to the index using updateDocument(). You cannot update
just one single field.
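A sketch of that replace-the-whole-document pattern; the row-number field
name and values are hypothetical, and writer is an open IndexWriter:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;

    // Rebuild the full document for the changed row...
    Document doc = new Document();
    doc.add(new Field("rowNum", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
    // ...add the other 10 fields here, then atomically replace the old version:
    writer.updateDocument(new Term("rowNum", "42"), doc);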
I am using Lucene to index rows in a spreadsheet; each row is a
Document, and the document indexes 10 fields from the row plus the row
number used to relate the Document to the row.
So when someone modifies one of the 10 fields I am interested in, I
have to update the document with the new value.
Hi Paul,
On 3/17/2009 at 9:18 AM, Paul Libbrecht wrote:
> what is the official pom.xml fragment to be used for the contribs
> package of lucene?
> It seems to be only of type pom inside the maven repository... does it
> mean that I have to fetch "sub-contribs" ?
Your POM should include dependencies
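Each contrib is published as its own artifact, so your POM pulls in whichever
ones you need; a sketch (artifact chosen for illustration):

    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers</artifactId>
      <version>2.4.0</version>
    </dependency>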
Have a look at the Lucene sister project Mahout (http://lucene.apache.org/mahout).
In there is the Taste collaborative filtering project, which is all
about recommendations.
On Mar 17, 2009, at 9:32 AM, Aaron Schon wrote:
Hi all, Apologies if this question is off-topic, but I was wonderi
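A rough sketch of Taste's classic user-based recommender; the data file (one
user,item,preference triple per line) is made up, and class names and ID
types shifted between early Mahout releases:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    DataModel model = new FileDataModel(new File("connections.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood hood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender rec = new GenericUserBasedRecommender(model, hood, similarity);
    List<RecommendedItem> top = rec.recommend(userId, 5); // 5 suggestions for userId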
I'm not sure this would fall primarily under recommenders... I would assume
Facebook is doing "look-ahead" on connections, i.e. A->B, B->C, so suggest
A->C. Then they weight the suggestions by the number of indirect links between
A and C and probably other factors (which is where the generic "
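The look-ahead itself is a few lines of plain Java outside Lucene; a sketch
over a made-up adjacency map:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    class FriendSuggester {
      // Suggestions for one person, weighted by the number of mutual connections.
      static Map<String, Integer> suggest(Map<String, Set<String>> graph, String person) {
        Map<String, Integer> scores = new HashMap<String, Integer>();
        for (String friend : graph.get(person)) {      // A -> B
          for (String fof : graph.get(friend)) {       // B -> C, so consider A -> C
            if (!fof.equals(person) && !graph.get(person).contains(fof)) {
              Integer n = scores.get(fof);
              scores.put(fof, n == null ? 1 : n + 1);  // weight = # of indirect links
            }
          }
        }
        return scores;
      }
    }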
You might try looking in a list that talks about recommender systems.
Google hits:
- http://en.wikipedia.org/wiki/Recommendation_system
- ACM Recommender Systems 2009 http://recsys.acm.org/
- A Guide to Recommender Systems
http://www.readwriteweb.com/archives/recommender_systems.php
2009/3/17 Aaro
Hi all, apologies if this question is off-topic, but I was wondering if there
is a way of leveraging Lucene (or another mechanism) to store information
about connections and recommend "people you might know" as done on FB or LI.
The data is as follows:
john_sm...@somedomain.com, jane_...@other
Hello Luceners,
what is the official pom.xml fragment to be used for the contribs
package of lucene?
It seems to be only of type pom inside the maven repository... does it
mean that I have to fetch "sub-contribs" ?
paul
Hello.
Can I get access to the terms of a field and their frequencies while indexing
a document?
Thanks.
Is this a one-time computation? If so, couldn't you wait a long time
for the machine to simply finish it?
With the simple approach (doing 100 million 2-term AND queries), how
long do you estimate it'd take?
I think you could do this with your own analyzer (as you
suggested)... it would run norm
OK - thanks for the explanation. So this is not just a simple search ...
I'll go away and leave you and Michael and the other experts to talk
about clever solutions.
--
Ian.
On Tue, Mar 17, 2009 at 11:35 AM, Adrian Dimulescu
wrote:
> Ian Lea wrote:
>>
>> Adrian - have you looked any further
OK, thanks a lot, I must be very poor at searching ;-)... I kind of
missed this information.
Thx again.
-Ray-
On Tue, Mar 17, 2009 at 12:25 PM, Uwe Schindler wrote:
> It is possible in two ways:
>
> 1. Use the analyzer class and generate a TokenStream/Tokenizer from it.
> Then
> add the fie
Ian Lea wrote:
Adrian - have you looked any further into why your original two term
query was too slow? My experience is that simple queries are usually
extremely fast.
Let me first point out that it is not "too slow" in absolute terms; it
is only too slow for my particular needs of attempting the num
It is possible in two ways:
1. Use the analyzer class and generate a TokenStream/Tokenizer from it. Then
add the field using the c'tor taking a TokenStream. If you want to
additionally store the field, you have to add another field with the same
field name, but with store enabled and indexing disabled. After th
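A sketch of option 1, with a hypothetical field name and analyzer:

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    String text = "some field value";
    TokenStream ts = new StandardAnalyzer().tokenStream("body", new StringReader(text));
    Document doc = new Document();
    doc.add(new Field("body", ts));                                    // indexed, not stored
    doc.add(new Field("body", text, Field.Store.YES, Field.Index.NO)); // stored copy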
org.apache.lucene.analysis.PerFieldAnalyzerWrapper
There's plenty of info about it on the web, even some recent
discussion on this list which will be in the archives.
--
Ian.
On Tue, Mar 17, 2009 at 11:17 AM, Raymond Balmès
wrote:
> I was looking for calling a different analyzer for each field
I work on Windows.
I copied my jar to the lib directory, so it is now together with the other
jars Luke uses (Lucene, etc.),
and added the text below to the classpath file (which exists in the
luke-src-0.9.1 directory).
2009/3/17 Ian Lea
> Added that classpathentry to what? That means nothing to me
I was looking into calling a different analyzer for each field of a
document... it looks like it is not possible.
Do I have that right?
-Ray-
Michael McCandless wrote:
I don't understand how this would address the "docFreq does
not reflect deletions".
Bad mail-quoting, sorry. I am not interested in document deletion; I
just index Wikipedia once, and want to get a co-occurrence-based
similarity distance between words called NGD (normalized Google distance).
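For reference, NGD is the normalized Google distance of Cilibrasi and
Vitanyi; with f(x) the number of documents containing term x, f(x,y) the
number containing both, and N the total number of documents:

    NGD(x,y) = \frac{\max(\log f(x), \log f(y)) - \log f(x,y)}
                    {\log N - \min(\log f(x), \log f(y))}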
This is all getting very complicated!
Adrian - have you looked any further into why your original two term
query was too slow? My experience is that simple queries are usually
extremely fast. Standard questions: have you warmed up the searcher?
How large is the index? How many occurrences of yo
Added that classpathentry to what? That means nothing to me.
I'd run it from the command line as
$ java -cp whatever -jar whatever.jar
or
$ export CLASSPATH=whatever
$ java -jar whatever.jar
Those examples are Unix-based. If you're on Windows I imagine there
are equivalents. Or maybe your c
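One gotcha: when -jar is used, both -cp and CLASSPATH are silently ignored
and the classpath comes from the jar's manifest alone. To add extra jars,
drop -jar and name the main class instead (Luke's main class assumed here;
on Windows use ; rather than : as the separator):

    $ java -cp lukeall.jar:myclasses.jar org.getopt.luke.Luke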
Adrian Dimulescu wrote:
Thank you.
I suppose the solution for this is not to create an index but to store
co-occurrence frequencies at the Analyzer level.
I don't understand how this would address the "docFreq does
not reflect deletions".
You can use the shingles analyzer (under contrib/analyzers)
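A sketch of the shingle (token n-gram) wrapper from contrib/analyzers; the
parameters are illustrative:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // Emits unigrams plus word bigrams: "cat animal" -> cat, "cat animal", animal
    Analyzer shingles = new ShingleAnalyzerWrapper(new StandardAnalyzer(), 2);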
Hi Ian,
Thanks for the answer.
Yes, I meant running it from the command line.
They are already in the classpath - I added this part:
2009/3/17 Ian Lea
> Well, assuming that when you say "invoke Luke's jar outside java" you
> mean that you are trying to run Luke from the command line e.g. $ java
Well, assuming that when you say "invoke Luke's jar outside java" you
mean that you are trying to run Luke from the command line e.g. $ java
-jar lukexxx.jar, it simply sounds like your classes are not on the
classpath. Add them.
--
Ian.
On Tue, Mar 17, 2009 at 10:20 AM, liat oren wrote:
> Hi
Hi,
I edited Luke's code so it also uses my classes (I added the jar to the
classpath and put it in the lib folder).
When I run it from Java it works fine.
Now I try to build it and invoke Luke's jar outside Java, and get the
following error:
Exception in thread "main" java.lang.NoClassDefFoundError
Thanks for all the answers.
I am new to Lucene, and these emails are the first time I have heard of
bigrams, so I read about them a bit.
Question - if I query for "cat animal" - or use boosting - "cat^2
animal^0.5" - will the results return ONLY documents that contain both?
From what I saw unt
Thank you.
I suppose the solution for this is not to create an index but to store
co-occurrence frequencies at the Analyzer level.
Adrian.
On Mon, Mar 16, 2009 at 11:37 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:
>
> Be careful: docFreq does not take deletions into account.
>
Mark Miller wrote:
Hmmm - you can probably get qsol to do it:
http://myhardshadow.com/qsol. I think you can set up any token to
expand to anything with a regex matcher and use group capturing in the
replacement (I don't fully remember though, been a while since I've
used it).
So you could do