Re: Calculate Term Co-occurrence Matrix

2010-08-24 Thread Ivan Provalov
nd > trigrams with their frequencies > > Thank you very much > > > On Mon, Aug 23, 2010 at 3:22 PM, Ivan Provalov > wrote: > > > Ahmed, if you want the raw score, you can do it the > way you describe below. > > > > > > > > --- On Sun, 8

Re: Calculate Term Co-occurrence Matrix

2010-08-23 Thread Ivan Provalov
s, bigrams and > trigrams with their frequencies > > Thank you very much > > > On Mon, Aug 23, 2010 at 3:22 PM, Ivan Provalov > wrote: > > > Ahmed, if you want the raw score, you can do it the > way you describe below. > > > > > > > > --

Re: Calculate Term Co-occurrence Matrix

2010-08-23 Thread Ivan Provalov
get the > > matching score ? > > > > for example, "damaged"  co-occurs with "shipment" > with a probability = 0.4 > > ?? > > > > > > On Sun, Aug 22, 2010 at 5:35 AM, Ivan Provalov > wrote: > > > >> Ahmed, > >>
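As an illustration of the kind of score asked about in this thread, here is a minimal sketch that estimates a document-level co-occurrence probability, P(other term | term), from raw counts. It is plain Python over a toy in-memory collection and is not the LUCENE-474 collocations code; the function names and the sample documents are made up for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count documents per term and per unordered term pair."""
    term_df = Counter()
    pair_df = Counter()
    for text in docs:
        # Use the set of terms so each document counts at most once per term.
        terms = sorted(set(text.lower().split()))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    return term_df, pair_df


def cooccurrence_probability(term, other, term_df, pair_df):
    """Estimate P(other occurs in a document | term occurs in it)."""
    if term_df[term] == 0:
        return 0.0
    pair = tuple(sorted((term, other)))
    return pair_df[pair] / term_df[term]


# Hypothetical toy documents, chosen only to illustrate the calculation.
docs = [
    "damaged shipment arrived late",
    "damaged box returned",
    "shipment arrived on time",
    "damaged shipment refund requested",
    "damaged screen replaced",
    "damaged cable replaced",
]
term_df, pair_df = cooccurrence_counts(docs)
p = cooccurrence_probability("damaged", "shipment", term_df, pair_df)
print(p)
```

In this toy collection "damaged" appears in five documents and two of those also contain "shipment", so the estimate is 2/5. Against a real index the same counts would come from document frequencies rather than an in-memory list.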

Re: Calculate Term Co-occurrence Matrix

2010-08-22 Thread Ivan Provalov
orrect ?? > > On Sun, Aug 22, 2010 at 2:47 PM, ahmed algohary wrote: > > > Thanks! It is exactly what I need. But, isn't there a > way to get the > > matching score ? > > > > for example, "damaged"  co-occurs with "shipment" > with a pr

Re: Calculate Term Co-occurrence Matrix

2010-08-21 Thread Ivan Provalov
Ahmed, FYI, I updated the term collocations package I mentioned earlier with a few fixes and changes which will make it work for Lucene 3.0.2. This may help your task. See: https://issues.apache.org/jira/browse/LUCENE-474 Thanks, Ivan Provalov --- On Sat, 8/21/10, Otis Gospodnetic wrote

Re: Calculate Term Co-occurrence Matrix

2010-08-19 Thread Ivan Provalov
I used this before almost as is, with a couple of fixes: http://issues.apache.org/jira/browse/LUCENE-474 Thanks, IP --- On Thu, 8/19/10, ahmed algohary wrote: > From: ahmed algohary > Subject: Calculate Term Co-occurrence Matrix > To: java-user@lucene.apache.org > Date: Thursday, August 19, 20

Re: Reverse Lucene queries

2010-07-23 Thread Ivan Provalov
00). http://project.carrot2.org Ivan Provalov On Jul 23, 2010, at 6:55 AM, Grant Ingersoll wrote: On Jul 23, 2010, at 5:06 AM, Karl Wettin wrote: 23 jul 2010 kl. 08.30 skrev sk...@sloan.mit.edu: Hi all, I have an interesting problem...instead of going from a query to a document collection, is it possible

Re: IR meetup in Michigan - lucene's scaling performance and relevance tuning

2010-07-21 Thread Ivan Provalov
cene's scaling performance and  > relevance tuning > To: java-user@lucene.apache.org > Date: Tuesday, July 20, 2010, 2:16 PM > are there such events in Russia? > > Best Regards > Alexander Aristov > > > On 20 July 2010 17:59, Ivan Provalov > wrote: >

IR meetup in Michigan - lucene's scaling performance and relevance tuning

2010-07-20 Thread Ivan Provalov
We are organizing a meetup in Michigan on IR. The first meeting is on August 19. We will be talking about Lucene's scalability and relevance tuning, followed by a discussion. Feel free to sign up: http://www.meetup.com/Michigan-Information-Retrieval-Enthusiasts-Group Thanks, Ivan pro

Re: Stemming and Wildcard Queries

2010-05-21 Thread Ivan Provalov
lan in this > tool, we show all of the words in the index that start with > plan. Here are some of the related words: > plan > plane > planes > planet > planificaci > planned > plannedoutages.xls > planner > planners > > Just a thought. > Herb > >

Stemming and Wildcard Queries

2010-05-20 Thread Ivan Provalov
ssues with the TermVector? Any suggestions? Ivan Provalov

Re: Relevancy Practices

2010-05-03 Thread Ivan Provalov
ns. 5. Some of the tools we use constantly - Lucene’s query Explanation and Luke. Thanks, Ivan Provalov --- On Thu, 4/29/10, Grant Ingersoll wrote: > From: Grant Ingersoll > Subject: Relevancy Practices > To: java-user@lucene.apache.org > Date: Thursday, April 29, 2010,

Re: Lucene Partition Size

2010-04-12 Thread Ivan Provalov
loading the full 400Gb as a single index > on local disc. > > >     karl > > 8 apr 2010 kl. 22.07 skrev Ivan Provalov: > > > Karl, > > > > We have not done the same scale local-disk test.  > Our network  > > parameters are > > > > -  Networ

Re: Lucene Partition Size

2010-04-08 Thread Ivan Provalov
, Karl Wettin wrote: > From: Karl Wettin > Subject: Re: Lucene Partition Size > To: java-user@lucene.apache.org > Date: Thursday, April 8, 2010, 2:44 PM > > 8 apr 2010 kl. 20.05 skrev Ivan Provalov: > > > We are using Lucene for searching of 200+ mln > documents (perio

Lucene Partition Size

2010-04-08 Thread Ivan Provalov
We are using Lucene to search 200+ million documents (periodical publications). Is there any limitation on the size of the Lucene index (file size, number of docs, etc...)? We are partitioning the indexes at about 10 million documents per partition (each partition is on a separate box, some m

TREC-3 Runs

2010-03-12 Thread Ivan Provalov
Just to follow up on our previous discussion, here are a few runs in which we tested some of Lucene's different scoring mechanisms and other options. We used the Lucene patches for LnbLtcSimilarity and BM25 and the contrib module for SweetSpotSimilarity. Lucene Default: 0.149 Lucene BM25:

Re: BM25 Scoring Patch

2010-02-17 Thread Ivan Provalov
listic IDF (what is > > > >> >> > interesting only at search > time). > > > >> >> > If you set BM25Similarity > at indexing time > > > >> >> > some basic stats are not > stored > > > >> >> > cor

Re: BM25 Scoring Patch

2010-02-16 Thread Ivan Provalov
ly you got a 24% improvement with > lnb.btc (right?) I > am guessing that we won't be able to draw many conclusions > at all due to > bias. > > On Tue, Feb 16, 2010 at 2:01 PM, Ivan Provalov > wrote: > > > Robert, Joaquin, > > > > Sorry, I made an error

Re: BM25 Scoring Patch

2010-02-16 Thread Ivan Provalov
r.uwaterloo.ca/slides/buettcher_reliable_evaluation.pdf > > This might help explain why you see such a difference in > MAP score! > > On Tue, Feb 16, 2010 at 12:15 PM, Ivan Provalov > wrote: > > > Joaquin, Robert, > > > > I followed Joaquin's recommend

Re: BM25 Scoring Patch

2010-02-16 Thread Ivan Provalov
same reason, I've only found a few > collections where BM25's doc > >> length normalization is really significantly > better than Lucene's. > >> > >> In my opinion, the results on a particular test > collection or 2 have > >> perhaps

BM25 Scoring Patch

2010-02-16 Thread Ivan Provalov
I applied the Lucene patch mentioned in https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP numbers on the TREC-3 collection using topics 151-200. I am now getting worse results compared to Lucene DefaultSimilarity. I suspect I am not using it correctly. I have single field docum
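For readers unfamiliar with what is being compared in this thread, the following is a minimal sketch of the textbook Okapi BM25 score for a single query term. It is not the code from the LUCENE-2091 patch; the function name, parameter names, and the default values k1=1.2 and b=0.75 are assumptions taken from common descriptions of the formula.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document's score."""
    # Robertson/Sparck Jones style inverse document frequency.
    idf = math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Document length normalization: 1.0 when the document has average length.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Term-frequency saturation controlled by k1.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)


# Hypothetical statistics, for illustration only.
short_doc = bm25_term_score(tf=2, doc_len=100, avg_doc_len=300, doc_freq=10, num_docs=1000)
long_doc = bm25_term_score(tf=2, doc_len=900, avg_doc_len=300, doc_freq=10, num_docs=1000)
print(short_doc, long_doc)
```

With the same term frequency, the shorter document scores higher than the longer one. That is the document length normalization effect discussed in the replies.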

Re: TREC Data and Topic-Specific Index

2010-02-11 Thread Ivan Provalov
e is some bias and error involved. > > On Wed, Feb 10, 2010 at 9:14 AM, Ivan Provalov > wrote: > > > Robert, > > > > Thank you for your reply.  What would be > considered a large difference?  We > > started applying the Sweet Spot Similarity.  It > giv

Re: TREC Data and Topic-Specific Index

2010-02-10 Thread Ivan Provalov
ith MAP. And with all > measures, > whether you look at bpref or map, my advice is to only > consider large > differences only when evaluating some potential > improvement! > > On Sun, Feb 7, 2010 at 6:49 PM, Ivan Provalov > wrote: > > > Robert, > > > >

TREC Data and Topic-Specific Index

2010-02-07 Thread Ivan Provalov
Robert, We are using TREC-3 data and Ad Hoc topics 151-200. The relevance judgments list contains 97,319 entries, of which 68,559 are unique document ids. The TIPSTER collection which was used in TREC-3 is around 750,000 documents. Should we (a) index the entire 750,000 document collection
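Counts like the ones quoted above (total judgment entries versus unique document ids) can be reproduced with a few lines of code. The sketch below assumes the usual four-column TREC qrels layout (topic, iteration, document id, relevance). The function name and the sample document ids are hypothetical.

```python
def summarize_qrels(lines):
    """Return (number of judgment entries, number of unique document ids)."""
    entries = 0
    doc_ids = set()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            # Skip blank or malformed lines rather than guessing their meaning.
            continue
        _topic, _iteration, doc_id, _relevance = parts
        entries += 1
        doc_ids.add(doc_id)
    return entries, len(doc_ids)


# Hypothetical sample lines in qrels format, for illustration only.
sample = [
    "151 0 AP880212-0047 1",
    "151 0 WSJ870320-0123 0",
    "152 0 AP880212-0047 0",
    "",
]
entries, unique_docs = summarize_qrels(sample)
print(entries, unique_docs)
```

The same document can be judged for several topics, which is why the number of entries is larger than the number of unique document ids.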

Re: Average Precision - TREC-3

2010-01-28 Thread Ivan Provalov
Great points, Robert! I agree, we have a lot of fine tuning ahead of us. I think we probably have achieved the baseline with our MAP of 0.14. We should move on to stage two and apply some of the suggestions to improve the overall scores. These are just the first steps. Both you and Grant

Re: Average Precision - TREC-3

2010-01-28 Thread Ivan Provalov
, 9:34 AM > > On Jan 27, 2010, at 1:36 PM, Ivan Provalov wrote: > > > Robert, Grant: > > > > Thank you for your replies.  > > > > Our goal is to fine-tune our existing system to > perform better on relevance. > > What kind of documents do you ha

Re: Average Precision - TREC-3

2010-01-27 Thread Ivan Provalov
lucene benchmark > pkg prints out, but instead simply use the benchmark pkg to > run the queries > and generate the trec_top_file (submission.txt), which I > hand to trec_eval > > > On Wed, Jan 27, 2010 at 1:36 PM, Ivan Provalov > wrote: > > > Robert, Grant:

Re: Average Precision - TREC-3

2010-01-27 Thread Ivan Provalov
't introduce the relevance issues (content pre-processing steps, query parsing steps, etc...). Thank you, Ivan Provalov --- On Wed, 1/27/10, Robert Muir wrote: > From: Robert Muir > Subject: Re: Average Precision - TREC-3 > To: java-user@lucene.apache.org > Date: Wednesday, Janu

Average Precision - TREC-3

2010-01-26 Thread Ivan Provalov
We are looking into making some improvements to relevance ranking of our search platform based on Lucene. We started by running the Ad Hoc TREC task on the TREC-3 data using "out-of-the-box" Lucene. The reason to run this old TREC-3 (TIPSTER Disk 1 and Disk 2; topics 151-200) data was that the
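The MAP figures quoted throughout these threads are means of per-topic average precision. As a minimal sketch of that calculation (not a replacement for trec_eval, which handles many details omitted here), the code below uses hypothetical function names and toy topic and document ids.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision for one topic: mean of precision at each relevant hit."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of average precision over all judged topics."""
    topics = list(qrels)
    if not topics:
        return 0.0
    total = sum(average_precision(runs.get(t, []), qrels[t]) for t in topics)
    return total / len(topics)


# Hypothetical rankings and judgments, for illustration only.
runs = {"151": ["d1", "d2", "d3"], "152": ["d4", "d5"]}
qrels = {"151": {"d1", "d3"}, "152": {"d5"}}
map_score = mean_average_precision(runs, qrels)
print(round(map_score, 4))
```

Here topic 151 has average precision (1/1 + 2/3) / 2 and topic 152 has (1/2) / 1, so the mean is 2/3.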