> …s, bigrams and
> trigrams with their frequencies
>
> Thank you very much
>
> On Mon, Aug 23, 2010 at 3:22 PM, Ivan Provalov wrote:
>
> > Ahmed, if you want the raw score, you can do it the
> > way you describe below.
> >
> > --- On Sun, 8…
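For illustration, one way to get unigram, bigram, and trigram frequencies out of a Lucene 3.x index is to index through a ShingleAnalyzerWrapper so the n-grams become index terms, then enumerate them. This is a minimal sketch, not the LUCENE-474 package; the field name "contents" and the index path are assumptions, and docFreq() counts documents containing each n-gram rather than total occurrences:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.store.FSDirectory;

    public class NgramFrequencies {
        public static void main(String[] args) throws Exception {
            // Index with, e.g.,
            //   new ShingleAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30), 3)
            // so that unigrams, bigrams, and trigrams are all stored as terms.
            IndexReader reader = IndexReader.open(
                    FSDirectory.open(new File("/path/to/index")));
            // Enumerate every term of the "contents" field together with its
            // document frequency (number of documents containing it).
            TermEnum terms = reader.terms(new Term("contents", ""));
            do {
                Term t = terms.term();
                if (t == null || !"contents".equals(t.field())) break;
                System.out.println(t.text() + "\t" + terms.docFreq());
            } while (terms.next());
            terms.close();
            reader.close();
        }
    }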
> > …get the
> > matching score ?
> >
> > for example, "damaged" co-occurs with "shipment"
> > with a probability = 0.4
> > ??
> >
> > On Sun, Aug 22, 2010 at 5:35 AM, Ivan Provalov wrote:
> >
> >> Ahmed,
> >>
> …correct ??
>
> On Sun, Aug 22, 2010 at 2:47 PM, ahmed algohary wrote:
>
> > Thanks! It is exactly what I need. But, isn't there a
> > way to get the matching score ?
> >
> > for example, "damaged" co-occurs with "shipment"
> > with a probability = 0.4…
Ahmed,
FYI, I updated the term collocations package I mentioned earlier with a few
fixes and changes that make it work with Lucene 3.0.2. This may help with your
task.
See:
https://issues.apache.org/jira/browse/LUCENE-474
Thanks,
Ivan Provalov
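For the matching-score question above, a raw document-level estimate can be computed directly from the index: take P("shipment" | "damaged") as the fraction of documents containing "damaged" that also contain "shipment". A minimal sketch on the Lucene 3.x API follows; the field name and index path are assumptions, and this is not how LUCENE-474 itself scores collocations:

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.store.FSDirectory;

    public class CooccurrenceScore {

        // Counts documents in which both terms occur, by intersecting
        // the two posting lists with a two-pointer walk.
        static int countCooccurringDocs(IndexReader reader, Term a, Term b)
                throws IOException {
            int both = 0;
            TermDocs docsA = reader.termDocs(a);
            TermDocs docsB = reader.termDocs(b);
            boolean moreB = docsB.next();
            while (docsA.next() && moreB) {
                while (moreB && docsB.doc() < docsA.doc()) {
                    moreB = docsB.next();
                }
                if (moreB && docsB.doc() == docsA.doc()) {
                    both++;
                }
            }
            docsA.close();
            docsB.close();
            return both;
        }

        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(
                    FSDirectory.open(new File("/path/to/index")));
            Term damaged = new Term("contents", "damaged");
            Term shipment = new Term("contents", "shipment");
            int both = countCooccurringDocs(reader, damaged, shipment);
            int dfDamaged = reader.docFreq(damaged);
            // P(shipment | damaged), estimated over documents.
            double score = dfDamaged == 0 ? 0.0 : (double) both / dfDamaged;
            System.out.println("co-occurrence score = " + score);
            reader.close();
        }
    }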
--- On Sat, 8/21/10, Otis Gospodnetic wrote:
I used this before almost as-is with a couple of fixes:
http://issues.apache.org/jira/browse/LUCENE-474
Thanks,
IP
--- On Thu, 8/19/10, ahmed algohary wrote:
> From: ahmed algohary
> Subject: Calculate Term Co-occurrence Matrix
> To: java-user@lucene.apache.org
> Date: Thursday, August 19, 2010, …
…00).
http://project.carrot2.org
Ivan Provalov
On Jul 23, 2010, at 6:55 AM, Grant Ingersoll wrote:
On Jul 23, 2010, at 5:06 AM, Karl Wettin wrote:
On Jul 23, 2010, at 8:30 AM, sk...@sloan.mit.edu wrote:
Hi all, I have an interesting problem... instead of going from a query
to a document collection, is it possible…
> …Lucene's scaling performance and
> relevance tuning
> To: java-user@lucene.apache.org
> Date: Tuesday, July 20, 2010, 2:16 PM
>
> Are there such events in Russia?
>
> Best Regards
> Alexander Aristov
>
> On 20 July 2010 17:59, Ivan Provalov wrote:
>
We are organizing a meetup in Michigan on IR. The first meeting is on August
19. We will be talking about Lucene's scalability and relevance tuning,
followed by a discussion. Feel free to sign up:
http://www.meetup.com/Michigan-Information-Retrieval-Enthusiasts-Group
Thanks,
Ivan Provalov
> …plan in this
> tool, we show all of the words in the index that start with
> plan. Here are some of the related words:
> plan
> plane
> planes
> planet
> planificaci
> planned
> plannedoutages.xls
> planner
> planners
>
> Just a thought.
> Herb
>
>
…issues with the TermVector?
Any suggestions?
Ivan Provalov
…ns.
5. Some of the tools we use constantly - Lucene’s query Explanation and
Luke.
Thanks,
Ivan Provalov
--- On Thu, 4/29/10, Grant Ingersoll wrote:
> From: Grant Ingersoll
> Subject: Relevancy Practices
> To: java-user@lucene.apache.org
> Date: Thursday, April 29, 2010, …
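On the Explanation point, a minimal sketch of pulling the scoring breakdown for a top hit on the Lucene 3.x API; the index path, field name, and query text are placeholders:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class ExplainTopHit {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher(
                    IndexReader.open(FSDirectory.open(new File("/path/to/index"))));
            Query query = new QueryParser(Version.LUCENE_30, "contents",
                    new StandardAnalyzer(Version.LUCENE_30)).parse("damaged shipment");
            TopDocs hits = searcher.search(query, 10);
            if (hits.scoreDocs.length > 0) {
                // Shows how tf, idf, norms, and boosts combined into the score.
                Explanation exp = searcher.explain(query, hits.scoreDocs[0].doc);
                System.out.println(exp.toString());
            }
            searcher.close();
        }
    }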
> …loading the full 400Gb as a single index
> on local disc.
>
> karl
>
> On 8 Apr 2010, at 22:07, Ivan Provalov wrote:
>
> > Karl,
> >
> > We have not done the same scale local-disk test.
> > Our network
> > parameters are
> >
> > - Networ…
--- On Thu, 4/8/10, Karl Wettin wrote:
> From: Karl Wettin
> Subject: Re: Lucene Partition Size
> To: java-user@lucene.apache.org
> Date: Thursday, April 8, 2010, 2:44 PM
>
> On 8 Apr 2010, at 20:05, Ivan Provalov wrote:
>
> > We are using Lucene to search 200+ mln
> > documents (periodical…
We are using Lucene to search 200+ mln documents (periodical
publications). Is there any limitation on the size of the Lucene index (file
size, number of docs, etc.)?
We are partitioning the indexes at about 10 mln documents per partition (each
partition is on a separate box, some m…
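On the query side, partitions like these can be searched as one logical index. Here is a minimal local sketch using ParallelMultiSearcher on the Lucene 3.x API; the paths, field name, and query are placeholders, and in a one-box-per-partition setup the Searchables would sit behind a remoting layer rather than on local paths:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ParallelMultiSearcher;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.search.Searcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class ShardedSearch {
        public static void main(String[] args) throws Exception {
            // One IndexSearcher per partition (~10 mln docs each).
            Searchable[] shards = new Searchable[] {
                new IndexSearcher(IndexReader.open(
                        FSDirectory.open(new File("/indexes/part1")))),
                new IndexSearcher(IndexReader.open(
                        FSDirectory.open(new File("/indexes/part2")))),
            };
            // Fans the query out to every shard in parallel and merges hits.
            Searcher searcher = new ParallelMultiSearcher(shards);
            TopDocs hits = searcher.search(
                    new TermQuery(new Term("contents", "shipment")), 10);
            System.out.println("total hits: " + hits.totalHits);
            searcher.close();
        }
    }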
Just to follow up on our previous discussion, here are a few runs in which we
tested some of Lucene's different scoring mechanisms and other options.
We used the Lucene patches for LnbLtcSimilarity and BM25, and the contrib
module for SweetSpotSimilarity.
Lucene Default: 0.149
Lucene BM25: …
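For reference, a minimal sketch of wiring in the contrib SweetSpotSimilarity at search time on the Lucene 3.x API; the index path is a placeholder, and the same Similarity generally has to be used at indexing time too so the stored length norms match:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.misc.SweetSpotSimilarity;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class SweetSpotSetup {
        public static void main(String[] args) throws Exception {
            // Defaults shown; the length-normalization plateau can be
            // tuned to the collection via the class's setters.
            SweetSpotSimilarity sim = new SweetSpotSimilarity();

            IndexSearcher searcher = new IndexSearcher(
                    IndexReader.open(FSDirectory.open(new File("/path/to/index"))));
            searcher.setSimilarity(sim); // used for scoring at search time
            // ... run queries as usual ...
            searcher.close();
        }
    }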
> > >> >> > …probabilistic IDF (which is interesting only at search
> > >> >> > time). If you set BM25Similarity at indexing time,
> > >> >> > some basic stats are not stored cor…
> …ly you got a 24% improvement with
> lnb.ltc (right?). I am guessing that we won't be able to
> draw many conclusions at all due to bias.
>
> On Tue, Feb 16, 2010 at 2:01 PM, Ivan Provalov wrote:
>
> > Robert, Joaquin,
> >
> > Sorry, I made an error…
> …r.uwaterloo.ca/slides/buettcher_reliable_evaluation.pdf
>
> This might help explain why you see such a difference in
> MAP score!
>
> On Tue, Feb 16, 2010 at 12:15 PM, Ivan Provalov wrote:
>
> > Joaquin, Robert,
> >
> > I followed Joaquin's recommend…
> >> …same reason, I've only found a few
> >> collections where BM25's doc
> >> length normalization is really significantly
> >> better than Lucene's.
> >>
> >> In my opinion, the results on a particular test
> >> collection or 2 have
> >> perhaps…
I applied the Lucene patch mentioned in
https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP numbers on
the TREC-3 collection using topics 151-200. I am not getting worse results
compared to Lucene DefaultSimilarity. I suspect I am not using it correctly.
I have single-field docum…
> …there is some bias and error involved.
>
> On Wed, Feb 10, 2010 at 9:14 AM, Ivan Provalov wrote:
>
> > Robert,
> >
> > Thank you for your reply. What would be
> > considered a large difference? We
> > started applying the Sweet Spot Similarity. It
> > giv…
> …with MAP. And with all measures,
> whether you look at bpref or MAP, my advice is to only
> consider large differences when evaluating some potential
> improvement!
>
> On Sun, Feb 7, 2010 at 6:49 PM, Ivan Provalov wrote:
>
> > Robert,
> >
Robert,
We are using TREC-3 data and Ad Hoc topics 151-200. The relevance judgments
list contains 97,319 entries, of which 68,559 are unique document ids. The
TIPSTER collection which was used in TREC-3 is around 750,000 documents.
Should we (a) index the entire 750,000 document collection…
Great points, Robert!
I agree, we have a lot of fine tuning ahead of us.
I think we probably have achieved the baseline with our MAP of 0.14. We should
move on to stage two and apply some of the suggestions to improve the overall
scores.
These are just the first steps. Both you and Grant…
> …, 9:34 AM
>
> On Jan 27, 2010, at 1:36 PM, Ivan Provalov wrote:
>
> > Robert, Grant:
> >
> > Thank you for your replies.
> >
> > Our goal is to fine-tune our existing system to
> > perform better on relevance.
>
> What kind of documents do you ha…
> …Lucene benchmark
> pkg prints out, but instead simply use the benchmark pkg to
> run the queries
> and generate the trec_top_file (submission.txt), which I
> hand to trec_eval.
>
> On Wed, Jan 27, 2010 at 1:36 PM, Ivan Provalov wrote:
>
> > Robert, Grant:
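That workflow maps onto the benchmark contrib's quality package. A minimal sketch on the Lucene 3.x API; the topics/qrels file names and the "title", "contents", and "docname" field names are placeholders:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.PrintWriter;
    import org.apache.lucene.benchmark.quality.Judge;
    import org.apache.lucene.benchmark.quality.QualityBenchmark;
    import org.apache.lucene.benchmark.quality.QualityQuery;
    import org.apache.lucene.benchmark.quality.QualityQueryParser;
    import org.apache.lucene.benchmark.quality.trec.TrecJudge;
    import org.apache.lucene.benchmark.quality.trec.TrecTopicsReader;
    import org.apache.lucene.benchmark.quality.utils.SimpleQQParser;
    import org.apache.lucene.benchmark.quality.utils.SubmissionReport;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class TrecRun {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher(
                    IndexReader.open(FSDirectory.open(new File("/path/to/index"))));

            // TREC topics and relevance judgments.
            QualityQuery[] qqs = new TrecTopicsReader()
                    .readQueries(new BufferedReader(new FileReader("topics.151-200")));
            Judge judge = new TrecJudge(
                    new BufferedReader(new FileReader("qrels.151-200")));

            // Build a query from each topic's <title> against the body field.
            QualityQueryParser qqParser = new SimpleQQParser("title", "contents");

            // Run all topics and write a trec_top_file for trec_eval.
            QualityBenchmark run =
                    new QualityBenchmark(qqs, qqParser, searcher, "docname");
            SubmissionReport submitLog =
                    new SubmissionReport(new PrintWriter("submission.txt"), "lucene");
            run.execute(judge, submitLog, new PrintWriter(System.out));
            searcher.close();
        }
    }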
…n't
introduce the relevance issues (content pre-processing steps, query parsing
steps, etc.).
Thank you,
Ivan Provalov
--- On Wed, 1/27/10, Robert Muir wrote:
> From: Robert Muir
> Subject: Re: Average Precision - TREC-3
> To: java-user@lucene.apache.org
> Date: Wednesday, Janu…
We are looking into making some improvements to relevance ranking of our search
platform based on Lucene. We started by running the Ad Hoc TREC task on the
TREC-3 data using "out-of-the-box" Lucene. The reason to run this old TREC-3
(TIPSTER Disk 1 and Disk 2; topics 151-200) data was that the…