Hello Luceners
I have started a new project and need to index pdf documents.
There are several projects around that allow extracting the content,
such as PDFBox, Xpdf, and PJ Classic.
As far as I have studied the FAQs and examples, all of these
tools allow simple text extraction.
Which of these open sour
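All of these libraries follow the same basic pattern: pull plain text out of the PDF, then hand it to Lucene as a field. A minimal sketch using PDFBox (package names and the Field constants differ between PDFBox and Lucene versions, and the field names here are just placeholders):

import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfToLucene {
    // Extract plain text with PDFBox and wrap it in a Lucene Document.
    // Field names ("path", "contents") are placeholders; the Field.Index
    // constants follow the older Lucene API.
    public static Document toDocument(File pdf) throws Exception {
        PDDocument pd = PDDocument.load(pdf);
        try {
            String text = new PDFTextStripper().getText(pd);
            Document doc = new Document();
            doc.add(new Field("path", pdf.getPath(),
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("contents", text,
                    Field.Store.NO, Field.Index.ANALYZED));
            return doc;
        } finally {
            pd.close();
        }
    }
}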
Hi all,
Lucene's documentation says that not using the compound file format greatly
increases the number of file descriptors used by indexing and by searching. Can
you please tell me what this means? Which files are opened during indexing and
searching? I know something about it but am still not very clear. I have some oth
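For reference, the setting in question is a one-liner on the writer. A sketch against the Lucene 2.x-era API (the index path is a placeholder): each non-compound segment is stored as several files (.tis, .frq, .prx, ...), and readers and writers hold a descriptor for each, while the compound format merges them into a single .cfs file per segment.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class CompoundFormatDemo {
    public static void main(String[] args) throws Exception {
        // Lucene 2.x-era constructor; newer versions configure this elsewhere.
        IndexWriter writer = new IndexWriter("/path/to/index",
                new StandardAnalyzer(), true);
        // One .cfs file per segment instead of many, so far fewer file
        // descriptors are held open during indexing and searching,
        // at a small indexing-speed cost.
        writer.setUseCompoundFile(true);
        writer.close();
    }
}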
On 8/13/07, mark harwood <[EMAIL PROTECTED]> wrote:
> I would presume that (like a lot of things) there is a power law at play in the
> popularity of publication sources (i.e. a small number of popular sources and
> a lot of unpopular ones).
> The "Zipf" plugin in Luke can be used to illustrate thi
Have a look at the DisjunctionMaxQuery class. I don't think it is
exactly what you are looking for, but it might give you some ideas on
how to proceed, as it sounds similar to what you are trying to do.
Hope this helps,
Grant
On Aug 13, 2007, at 2:20 PM, Walt Stoneburner wrote:
Here's a s
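For concreteness, a sketch of what a DisjunctionMaxQuery looks like (the mutable 2.x-style API shown here; recent Lucene passes the clauses to the constructor instead, and the "title"/"body" fields are hypothetical). The score of a matching document is the best clause's score plus tieBreaker times the others, rather than the plain sum a BooleanQuery would produce:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class DisMaxExample {
    // Score = best matching clause + 0.1 * the other matching clauses.
    public static Query acrossFields(String text) {
        DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(0.1f);
        dmq.add(new TermQuery(new Term("title", text)));
        dmq.add(new TermQuery(new Term("body", text)));
        return dmq;
    }
}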
I figured out the answer to 2[a] - it's because by default CustomScoreQuery
does weight normalization. To disable that, one should use
customQuery.setStrict(true). Once I do this, I get the original values that
I stored during the indexing process.
Help with the other two questions ([1] and [2]b)
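For anyone hitting the same issue, a sketch of the working setup, assuming the function-query classes from contrib and a hypothetical "popularity" field holding the value stored at index time:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.FieldScoreQuery;

public class StrictCustomScore {
    public static CustomScoreQuery build() {
        // Value source reading the raw per-document "popularity" values.
        FieldScoreQuery popularity =
                new FieldScoreQuery("popularity", FieldScoreQuery.Type.FLOAT);
        CustomScoreQuery q = new CustomScoreQuery(
                new TermQuery(new Term("body", "lucene")), popularity);
        // Skip query-weight normalization so the stored values come through
        // unchanged.
        q.setStrict(true);
        return q;
    }
}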
On 8/13/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
> Have you tried the very simple technique of just making an OR clause
> containing all the sources for a particular query and just letting
> it run? I was surprised at the speed...
I think the TermsFilter that I use does exactly that.
>
> But
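The OR-clause version Erick describes is just a BooleanQuery of SHOULD clauses over the source field; a sketch (the "source" field name is a placeholder, and the mutable BooleanQuery API shown is the older one; recent Lucene uses BooleanQuery.Builder). A TermsFilter from contrib enumerates the same terms but skips scoring entirely:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class SourceOrClause {
    // One SHOULD clause per allowed source value; combine the result with
    // the user's query via a MUST clause in an enclosing BooleanQuery.
    public static BooleanQuery restrictToSources(String[] sources) {
        BooleanQuery sourceClause = new BooleanQuery();
        for (String s : sources) {
            sourceClause.add(new TermQuery(new Term("source", s)),
                    BooleanClause.Occur.SHOULD);
        }
        return sourceClause;
    }
}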
A few questions on custom score queries:
[1] I need to rank matches by some combination of keyword match, popularity
and recency of the doc. I read the docs about CustomScoreQuery, and it seems to
be a reasonable fit. An alternate way of achieving my goals is to use a
custom sort. What are the trade-
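For comparison, the custom-sort alternative looks roughly like this (the "popularity" field is hypothetical, and SortField.INT is the older constant; recent versions use SortField.Type). Passing this to Searcher.search(query, sort) replaces relevance ranking wholesale, which is the main trade-off against folding the signals into the score as CustomScoreQuery does:

import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class PopularitySort {
    // Order hits by the indexed "popularity" field, descending (the boolean
    // flag), breaking ties by relevance score.
    public static final Sort BY_POPULARITY_THEN_SCORE =
            new Sort(new SortField[] {
                    new SortField("popularity", SortField.INT, true),
                    SortField.FIELD_SCORE
            });
}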
On 14 Aug 2007, at 00:17, lucene user wrote:
What if my concern is more in terms of having a large number of requests per
second? When should I start to be worried and start thinking about more than
low-end hardware?
I have served one request every 10 milliseconds, 24/7, on a single machine.
I don't think that's all that large, though I have only been working with
Lucene for a short while. I have two corpuses with 445834 documents (3.43M
terms) and 132217 documents (1.6M terms). I don't have trouble querying either
of these with Luke.
----- Original Message -----
From: lucene user
That is wonderful to hear. (I love that I am not stressing the technology
near its limits.)
What if my concern is more in terms of having a large number of requests per
second? When should I start to be worried and start thinking about more than
low-end hardware?
Thanks!
On 8/12/07, karl wettin
There is also a Use Cases item on the Wiki...
On Aug 13, 2007, at 3:26 PM, Peter Keegan wrote:
I suppose it could go under performance or HowTo/Interesting uses of
SpanQuery.
Peter
On 8/13/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
Thanks for writing this up. Do you think this is an appr
Donna,
If I understand the problem correctly, it is: given a [job
description], find [candidates] that we would not otherwise find. That
seems to be a "user-weighted similarity" problem more than a simple
search problem.
IOW:
1. Given a [job description], create a set of queries that look for
I suppose it could go under performance or HowTo/Interesting uses of
SpanQuery.
Peter
On 8/13/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> Thanks for writing this up. Do you think this is an appropriate subject
> for the Wiki performance page?
>
> Erick
>
> On 8/13/07, Peter Keegan <[EMAIL P
Thanks for writing this up. Do you think this is an appropriate subject
for the Wiki performance page?
Erick
On 8/13/07, Peter Keegan <[EMAIL PROTECTED]> wrote:
>
> I've been experimenting with using SpanQuery to perform what is
> essentially
> a limited type of database 'join'. Each document in
Hoss wrote:
this would be meaningless even if it were easier...
http://wiki.apache.org/lucene-java/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03
FAQ: "Can I filter by score?"
-Hoss
I've read the warnings referenced there, but I still have a problem to
solve. We have "fact-based" infor
I've been experimenting with using SpanQuery to perform what is essentially
a limited type of database 'join'. Each document in the index contains 1 or
more 'rows' of meta data from another 'table'. The meta data are simple
tokens representing a column name/value pair ( e.g. color$red or
location$1
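A guess at the kind of setup Peter describes, as a sketch: suppose each meta-data row was indexed as consecutive tokens in a "meta" field, with a large position gap between rows. A SpanNearQuery whose slop is smaller than that gap then only matches column values sitting in the same row, which is the "limited join". Field name, token scheme, and the inter-row gap are all assumptions here:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class RowJoin {
    // Match two column name/value tokens only when they occur in the same
    // row, i.e. within a span narrower than the gap separating rows.
    public static SpanQuery sameRow(String colVal1, String colVal2) {
        return new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("meta", colVal1)), // e.g. "color$red"
                new SpanTermQuery(new Term("meta", colVal2))  // e.g. "location$1"
        }, 10, false); // slop 10, well below the inter-row gap; unordered
    }
}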
Here's a scenario I just ran into, though I don't know how to make
Lucene do it (or even if it can).
I have two lists; to keep things simple, let's assume (A B C D E F G) and (X Y).
I want to form a query so that when matches appear from both lists,
results rank higher than if many elements matche
I would presume that (like a lot of things) there is a power law at play in the
popularity of publication sources (i.e. a small number of popular sources and a
lot of unpopular ones).
The "Zipf" plugin in Luke can be used to illustrate this distribution for the
values in your "publication source"
There is no *Lucene* limitation of a 2GB index file. I've had no trouble
with single indexes over 8G. If you're referring to this page...
http://wiki.apache.org/lucene-java/LuceneFAQ?highlight=%282gb%29
then it's talking about an *operating system* limitation. So I wouldn't
worry about this unless
Um, because I didn't write the code? You can always contribute a patch.
On 8/13/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:
>
> Thanks Erick, but unfortunately NumberTools works only with the long primitive
> type. I am wondering why you didn't include methods for double and float.
>
>
>
> On 8/1
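One workaround that keeps NumberTools: first map the double onto a long whose signed ordering matches the double ordering, then encode that long. This is essentially the bit trick Lucene later shipped as NumericUtils.doubleToSortableLong; a sketch:

import org.apache.lucene.document.NumberTools;

public class DoublePadder {
    // Double.doubleToLongBits orders positives correctly but reverses
    // negatives; flipping the low 63 bits of negative values fixes that,
    // so the resulting longs sort exactly like the original doubles.
    public static String doubleToString(double val) {
        long bits = Double.doubleToLongBits(val);
        if (bits < 0) {
            bits ^= 0x7fffffffffffffffL;
        }
        return NumberTools.longToString(bits);
    }
}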
Have you tried the very simple technique of just making an OR clause
containing all the sources for a particular query and just letting
it run? I was surprised at the speed...
But before doing *any* of that, you need to find out, and tell us, what
exactly is taking the time. Are you opening a new
On 13 Aug 2007, at 12:49, Lukas Vlcek wrote:
But I am looking for a more IR-oriented application of this information. I
remember that once I read on the Lucene mailing list that somebody suggested
utilizing previously issued user queries for suggestions of
similar/other/related queries or for typo
Enis,
thanks for the excellent answer!
Lukas
On 8/13/07, Enis Soztutar <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> Lukas Vlcek wrote:
> > Enis,
> >
> > Thanks for your time.
> > I gave a quick glance at Pig and it seems good (it seems to be directly
> > based on Hadoop, which I am starting to play with :-).
Hi Rohit,
You need to create the index reader in the subdirectory where you created
the index files. Lucene's IndexReader won't find your index if you
simply move the index to a subdirectory.
Yes, if you have several index directories, you need to combine them.
But you can achieve this by u
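A sketch of both points against the older path-based API (MultiReader is one way to combine several indexes; a MultiSearcher over several IndexSearchers is another):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;

public class CombinedReader {
    // Open a reader on each index directory explicitly (pointing at the
    // subdirectory itself, not its parent), then present them as one
    // logical index.
    public static IndexReader open(String[] indexDirs) throws Exception {
        IndexReader[] readers = new IndexReader[indexDirs.length];
        for (int i = 0; i < indexDirs.length; i++) {
            readers[i] = IndexReader.open(indexDirs[i]);
        }
        return new MultiReader(readers);
    }
}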
To me, it looks like what you are trying to achieve is more suitable for a
database, where it can help you do grouping and sorting, etc. But if you
still want to achieve it using Lucene, you might want to post some code so
that I can go through it and see why it uses so many resources.
Ch
Hi,
Lukas Vlcek wrote:
Enis,
Thanks for your time.
I gave a quick glance at Pig and it seems good (it seems to be directly based
on Hadoop, which I am starting to play with :-). It is obvious that a huge
amount of data (like user queries or access logs) should be stored in flat
files, which makes it co
Hi John,
I think you are spending too much time on I/O; it may be better to index into a
RAMDirectory first. See http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
kai
-----Original Message-----
From: Erick Erickson [mailto:[EMAIL PROTECTED]
Sent: Monday, August 13, 2007 1:57
To: java-user@lucene.apache.org
Subject: Re: In
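The RAMDirectory-first pattern that wiki page describes, sketched against the Lucene 2.x-era API (directory handling and the batch boundary are placeholders; you would fill the RAMDirectory with a separate writer, then flush each batch to disk):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class RamFirstIndexing {
    // Merge an in-memory batch into the on-disk index in one sequential
    // write, instead of many small random writes per document.
    public static void flushBatch(Directory disk, RAMDirectory ram)
            throws Exception {
        IndexWriter diskWriter =
                new IndexWriter(disk, new StandardAnalyzer(), false);
        diskWriter.addIndexes(new Directory[] { ram });
        diskWriter.close();
    }
}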
Hi All,
A bit of self-promotion again :) I hope you don't find it off-topic;
after all, some folks are using Carrot2 with Lucene and Solr, and Nutch has
a Carrot2-based clustering plugin.
Staszek
[EMAIL PROTECTED]