LSI as indexing algorithm with Lucene

2009-03-17 Thread nitin gopi
Hi all, has anybody tried to use LSI (latent semantic indexing) for indexing in Lucene?

Re: Scores between words. Boosting?

2009-03-17 Thread Grant Ingersoll
On Mar 17, 2009, at 5:44 AM, liat oren wrote: Thanks for all the answers. I am new to Lucene and in these emails it's the first time I heard of bigrams, so I read about them a bit. Question - if I query for "cat animal" - or use boosting - "cat^2 animal^0.5" - will the results return ONLY

Re: get terms of a field and their frequencies during indexing the document

2009-03-17 Thread Grant Ingersoll
Can you provide a little more info on what you want to do? For instance, you could just have a buffering TokenFilter that stores up the tokens and counts them and then spits them back out, but I somehow suspect that is not what you are after. -Grant On Mar 17, 2009, at 6:03 AM, Ильдар Аши
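
A minimal sketch of a counting filter along these lines, written against the pre-2.9 TokenStream API; for simplicity it counts terms as they pass through rather than buffering them, and the class name is illustrative and may need adjusting for your Lucene version:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Counts each term as it streams past during analysis of a field.
    public class CountingTokenFilter extends TokenFilter {
      private final Map<String, Integer> counts = new HashMap<String, Integer>();

      public CountingTokenFilter(TokenStream input) {
        super(input);
      }

      public Token next(Token reusableToken) throws IOException {
        Token t = input.next(reusableToken);
        if (t != null) {
          Integer c = counts.get(t.term());
          counts.put(t.term(), c == null ? 1 : c + 1);
        }
        return t;
      }

      public Map<String, Integer> getCounts() {
        return counts;
      }
    }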

Re: number of hits of pages containing two terms

2009-03-17 Thread Chris Hostetter
: The final "production" computation is one-time, still, I have to recurrently : come back and correct some errors, then retry... this doesn't really seem like a problem ideally suited for Lucene ... this seems like the type of problem sequential batch crunching could solve better... first pas

Re: sloppyFreq question

2009-03-17 Thread Chris Hostetter
: > I suppose SpanTermQuery could override the weight/scorer methods so that : > it behaved more like a TermQuery if it was executed directly ... but : > that's really not what it's intended for. : : This is currently the only way to boost a term via payloads. : BoostingTermQuery extends SpanTerm

NPE in MultiSegmentReader$MultiTermDocs.doc

2009-03-17 Thread Comron Sattari
I've recently upgraded to Solr 1.3 using Lucene 2.4. One of the reasons I upgraded was because of the nicer SearchComponent architecture that let me add a needed feature to the default request handler. Simply put, I needed to filter a query based on some additional parameters. So I subclassed Query

Re: "People you might know" ( a la Facebook) - *slightly offtopic*

2009-03-17 Thread Petite Abeille
On Mar 17, 2009, at 2:32 PM, Aaron Schon wrote: how would I go about recommending Jane Doe connecting to Frank Jones? Hope you can help a newbie by pointing where I should be looking? You might as well read something about it to get you started: "Programming Collective Intelligence" http

Re: number of hits of pages containing two terms

2009-03-17 Thread Paul Elschot
You may want to try Filters (starting from TermFilter) for this, especially those based on the default OpenBitSet (see the intersection count method) because of your interest in stop words. 10k OpenBitSets for 39 M docs will probably not fit in memory in one go, but that can be worked around by kee
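
A rough sketch of the bit-set approach, assuming the Lucene 2.4 TermDocs and OpenBitSet APIs; the field and term names are placeholders:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.util.OpenBitSet;

    // One bit set per term: bit i is set if document i contains the term.
    static OpenBitSet bitsForTerm(IndexReader reader, String field, String text) throws Exception {
      OpenBitSet bits = new OpenBitSet(reader.maxDoc());
      TermDocs td = reader.termDocs(new Term(field, text));
      while (td.next()) {
        bits.set(td.doc());
      }
      td.close();
      return bits;
    }

    // Co-occurrence count for a pair of terms, without running a query:
    // long both = OpenBitSet.intersectionCount(bitsForTerm(r, "text", "cat"),
    //                                          bitsForTerm(r, "text", "animal"));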

Re: number of hits of pages containing two terms

2009-03-17 Thread Adrian Dimulescu
Michael McCandless wrote: Is this a one-time computation? If so, couldn't you wait a long time for the machine to simply finish it? The final "production" computation is one-time, still, I have to recurrently come back and correct some errors, then retry... With the simple approach (doing 100

Error using DuplicateFilter in contrib/queries

2009-03-17 Thread Densel Santhmayor
Hello, I was trying to use the DuplicateFilter API in contrib/queries for Lucene in an application, but it doesn't seem to be accepted as a valid argument to the searcher.search function. I'm using Apache Lucene 2.4.0. Here's what I did. DuplicateFilter df=new DuplicateFilter("NAME"); df.setKeepMo
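
For reference: DuplicateFilter extends Filter, so it is meant for the search overloads that take a Filter argument. A hedged sketch against the 2.4-era API; the index path and field names are placeholders:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.DuplicateFilter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    // Pass the DuplicateFilter through the (Query, Filter, int) overload.
    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    DuplicateFilter df = new DuplicateFilter("NAME");
    TopDocs hits = searcher.search(new TermQuery(new Term("CITY", "london")), df, 10);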

Re: What is the best way to modify a Document object

2009-03-17 Thread Simon Willnauer
Hi Paul, If you do not store all the data inside Lucene, you have to get your updated data from your spreadsheet again. Even if you stored all the data, you would have to update the document by creating a new one and adding it to the index using updateDocument(). You cannot update just one single
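
A hedged sketch of the update Simon describes, written from memory of the Lucene 2.4 API; the "row" key field, the path, and the other field names are placeholders:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    // Rebuild the whole Document for the changed row; updateDocument()
    // deletes the old document matching the "row" term and adds the new one.
    IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/path/to/index"),
        new StandardAnalyzer(), false, IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("row", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("name", "new cell value", Field.Store.YES, Field.Index.ANALYZED));
    // ... re-add the other fields from the spreadsheet row ...
    writer.updateDocument(new Term("row", "42"), doc);
    writer.close();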

What is the best way to modify a Document object

2009-03-17 Thread Paul Taylor
I am using Lucene to index rows in a spreadsheet; each row is a Document, and the document indexes 10 fields from the row plus the row number, which is used to relate the Document to the row. So when someone modifies one of the 10 fields I am interested in for a row, I have to update the document wi

RE: Lucene-contrib maven artifact id?

2009-03-17 Thread Steven A Rowe
Hi Paul, On 3/17/2009 at 9:18 AM, Paul Libbrecht wrote: > what is the official pom.xml fragment to be used for the contribs > package of lucene? > It seems to be only of type pom inside the maven repository... does it > mean that I have to fetch "sub-contribs" ? Your POM should include dependenci

Re: "People you might know" ( a la Facebook) - *slightly offtopic*

2009-03-17 Thread Grant Ingersoll
Have a look at the Lucene sister project: Mahout: http://lucene.apache.org/mahout . In there is the Taste collaborative filtering project which is all about recommendations. On Mar 17, 2009, at 9:32 AM, Aaron Schon wrote: Hi all, Apologies if this question is off-topic, but I was wonderi

RE: "People you might know" ( a la Facebook) - *slightly offtopic*

2009-03-17 Thread Max Metral
I'm not sure this would fall primarily under recommenders... I would assume Facebook is doing "look-ahead" on connections. i.e. A->B, B->C, so suggest A->C. Then they weight the suggestions by the number of indirect links between A and C and probably other factors (which is where the generic "
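
The look-ahead idea is easy to sketch outside Lucene; a plain-Java illustration (the graph representation and method name are hypothetical, and it assumes every user appears as a key in the map):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    // Suggest friends-of-friends, ranked by the number of mutual connections.
    static Map<String, Integer> suggest(String user, Map<String, Set<String>> graph) {
      Map<String, Integer> scores = new HashMap<String, Integer>();
      Set<String> direct = graph.get(user);
      for (String friend : direct) {
        for (String candidate : graph.get(friend)) {
          if (!candidate.equals(user) && !direct.contains(candidate)) {
            Integer c = scores.get(candidate);
            scores.put(candidate, c == null ? 1 : c + 1); // one more indirect link
          }
        }
      }
      return scores; // higher count = stronger "people you might know" candidate
    }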

Re: "People you might know" ( a la Facebook) - *slightly offtopic*

2009-03-17 Thread Glen Newton
You might try looking in a list that talks about recommender systems. Google hits: - http://en.wikipedia.org/wiki/Recommendation_system - ACM Recommender Systems 2009 http://recsys.acm.org/ - A Guide to Recommender Systems http://www.readwriteweb.com/archives/recommender_systems.php 2009/3/17 Aaro

"People you might know" ( a la Facebook) - *slightly offtopic*

2009-03-17 Thread Aaron Schon
Hi all, Apologies if this question is off-topic, but I was wondering if there is a way of leveraging Lucene (or other mechanism) to store the information about connections and recommend People you might know as done in FB or LI. The data is as follows: john_sm...@somedomain.com, jane_...@other

Lucene-contrib maven artifact id?

2009-03-17 Thread Paul Libbrecht
Hello Luceners, what is the official pom.xml fragment to be used for the contribs package of lucene? It seems to be only of type pom inside the maven repository... does it mean that I have to fetch "sub-contribs"? paul

get terms of a field and their frequencies during indexing the document

2009-03-17 Thread Ильдар Аширбаев
Hello. Can I get access to the terms of a field and their frequencies while indexing the document? Thanks.

Re: number of hits of pages containing two terms

2009-03-17 Thread Michael McCandless
Is this a one-time computation? If so, couldn't you wait a long time for the machine to simply finish it? With the simple approach (doing 100 million 2-term AND queries), how long do you estimate it'd take? I think you could do this with your own analyzer (as you suggested)... it would run norm

Re: number of hits of pages containing two terms

2009-03-17 Thread Ian Lea
OK - thanks for the explanation. So this is not just a simple search ... I'll go away and leave you and Michael and the other experts to talk about clever solutions. -- Ian. On Tue, Mar 17, 2009 at 11:35 AM, Adrian Dimulescu wrote: > Ian Lea wrote: >> >> Adrian - have you looked any further

Re: Different analyzer per field?

2009-03-17 Thread Raymond Balmès
OK, thanks a lot, I must be very poor at searching ;-)... I kind of missed this information. Thx again. -Ray- On Tue, Mar 17, 2009 at 12:25 PM, Uwe Schindler wrote: > It is possible in two ways: > > 1. Use the analyzer class and generate a TokenStream/Tokenizer from it. > Then > add the fie

Re: number of hits of pages containing two terms

2009-03-17 Thread Adrian Dimulescu
Ian Lea wrote: Adrian - have you looked any further into why your original two term query was too slow? My experience is that simple queries are usually extremely fast. Let me first point out that it is not "too slow" in absolute terms, it is only for my particular needs of attempting the num

RE: Different analyzer per field?

2009-03-17 Thread Uwe Schindler
It is possible in two ways: 1. Use the analyzer class and generate a TokenStream/Tokenizer from it. Then add the field using the c'tor taking a TokenStream. If you want to additionally store the field, you have to add another field with the same field name, but no index and store enabled. After th
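
A sketch of the first option, assuming the 2.4-era Field(String, TokenStream) constructor; the field name and analyzer choice are placeholders:

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Analyze the value yourself with whichever analyzer this field needs,
    // then add a second, stored-but-not-indexed copy of the raw text.
    String text = "some field value";
    TokenStream stream = new StandardAnalyzer().tokenStream("title", new StringReader(text));
    Document doc = new Document();
    doc.add(new Field("title", stream));
    doc.add(new Field("title", text, Field.Store.YES, Field.Index.NO));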

Re: Different analyzer per field?

2009-03-17 Thread Ian Lea
org.apache.lucene.analysis.PerFieldAnalyzerWrapper There's plenty of info about it on the web, even some recent discussion on this list which will be in the archives. -- Ian. On Tue, Mar 17, 2009 at 11:17 AM, Raymond Balmès wrote: > I was looking for calling a different analyzer for each field
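
For reference, a minimal PerFieldAnalyzerWrapper sketch; the field names and analyzer choices are illustrative:

    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // StandardAnalyzer for every field except "id", which is kept as a single token.
    PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    analyzer.addAnalyzer("id", new KeywordAnalyzer());
    // pass this analyzer to both the IndexWriter and the QueryParser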

Re: Luke - Build a jar that uses my own jar

2009-03-17 Thread liat oren
I work on Windows. I copied my jar to the lib directory - so it is now together with the other jars Luke uses (Lucene, etc.). And I added the text below to the classpath file (it exists in the luke-src-0.9.1 directory). 2009/3/17 Ian Lea > Added that classpathentry to what? That means nothing to me

Different analyzer per field?

2009-03-17 Thread Raymond Balmès
I was looking to call a different analyzer for each field of a document... it looks like that is not possible. Do I have it right? -Ray-

Re: number of hits of pages containing two terms

2009-03-17 Thread Adrian Dimulescu
Michael McCandless wrote: I don't understand how this would address the "docFreq does not reflect deletions". Bad mail-quoting, sorry. I am not interested in document deletion; I just index Wikipedia once, and want to get a co-occurrence-based similarity distance between words called NGD (norm
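
The acronym is cut off in the archive; NGD usually stands for normalized Google distance, commonly defined (this is the textbook formula, not necessarily the exact variant used here) as

    NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y)))

where f(x) and f(y) are the number of documents containing each term, f(x, y) is the number containing both, and N is the total number of documents - which is why the per-pair co-occurrence counts matter.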

Re: number of hits of pages containing two terms

2009-03-17 Thread Ian Lea
This is all getting very complicated! Adrian - have you looked any further into why your original two term query was too slow? My experience is that simple queries are usually extremely fast. Standard questions: have you warmed up the searcher? How large is the index? How many occurrences of yo

Re: Luke - Build a jar that uses my own jar

2009-03-17 Thread Ian Lea
Added that classpathentry to what? That means nothing to me. I'd run it from the command line as $ java -cp whatever -jar whatever.jar or $ export CLASSPATH=whatever $ java -jar whatever.jar Those examples are unix based. If you're on Windows I imagine there are equivalents. Or maybe your c

Re: number of hits of pages containing two terms

2009-03-17 Thread Michael McCandless
Adrian Dimulescu wrote: Thank you. I suppose the solution for this is to not create an index but to store co-occurence frequencies at Analyzer level. I don't understand how this would address the "docFreq does not reflect deletions". You can use the shingles analyzer (under contrib/analyzer
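
A hedged sketch of wrapping a tokenizer with the shingle filter from contrib/analyzers (constructor arguments and availability may vary by release; the tokenizer and sample text are placeholders):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.shingle.ShingleFilter;

    // Emits word bigrams ("please divide", "divide this", ...) alongside the unigrams,
    // so pair frequencies can be read straight from the index.
    TokenStream shingles = new ShingleFilter(
        new WhitespaceTokenizer(new StringReader("please divide this sentence")), 2);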

Re: Luke - Build a jar that uses my own jar

2009-03-17 Thread liat oren
Hi Ian, Thanks for the answer. Yes, I meant running it from the command line. They are already in the classpath - I added this part: 2009/3/17 Ian Lea > Well, assuming that when you say "invoke Luke's jar outside java" you > mean that you are trying to run Luke from the command line e.g. $ java

Re: Luke - Build a jar that uses my own jar

2009-03-17 Thread Ian Lea
Well, assuming that when you say "invoke Luke's jar outside java" you mean that you are trying to run Luke from the command line e.g. $ java -jar lukexxx.jar, it simply sounds like your classes are not on the classpath. Add them. -- Ian. On Tue, Mar 17, 2009 at 10:20 AM, liat oren wrote: > Hi

Luke - Build a jar that uses my own jar

2009-03-17 Thread liat oren
Hi, I edited Luke's code so it also uses my classes (I added the jar to the class-path and put it in the lib folder). When I run it from Java it works fine. Now I try to build it and invoke Luke's jar outside java and get the following error: Exception in thread "main" java.lang.NoClassDefFoundError

Re: Scores between words. Boosting?

2009-03-17 Thread liat oren
Thanks for all the answers. I am new to Lucene and in these emails it's the first time I heard of bigrams, so I read about them a bit. Question - if I query for "cat animal" - or use boosting - "cat^2 animal^0.5" - will the results return ONLY documents that contain both? From what I saw unt
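
For the "cat animal" question: with the default QueryParser the two terms are OR'ed, so results are not limited to documents containing both, and boosts only change ranking, never which documents match. A sketch of making both terms required, against the 2.4 QueryParser API; the field name and analyzer are placeholders:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    QueryParser parser = new QueryParser("text", new StandardAnalyzer());
    parser.setDefaultOperator(QueryParser.AND_OPERATOR);
    Query q = parser.parse("cat animal");         // now equivalent to +cat +animal
    Query boosted = parser.parse("cat^2 animal"); // boosts affect scoring, not matching
    // note: parse() throws ParseException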

Re: number of hits of pages containing two terms

2009-03-17 Thread Adrian Dimulescu
Thank you. I suppose the solution for this is to not create an index but to store co-occurrence frequencies at the Analyzer level. Adrian. On Mon, Mar 16, 2009 at 11:37 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > > Be careful: docFreq does not take deletions into account. >

Re: Lucene 2.9

2009-03-17 Thread Luis Alves
Mark Miller wrote: Hmmm - you can probably get qsol to do it: http://myhardshadow.com/qsol. I think you can setup any token to expand to anything with a regex matcher and use group capturing in the replacement (I don't fully remember though, been a while since I've used it). So you could do