Hi Mark -
Having gone down this path for the past year, I echo comments from others
that scalability/availability/failover is a lot of work. We migrated away
from a custom system based on Lucene running on Windows to Solr running on
Linux. It took us 6 months to get our system to a solid five-n
Thank you, Grant, that really helps me :P
On 7/27/06, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
You could store Term Vectors for your documents, and then look up the
individual document vectors based on the query results. If you need
help w/ Term Vectors, check out Lucene in Action, search this li
Yes, I have closed the IndexWriter, but it still doesn't work.
2006/7/27, Michael McCandless <[EMAIL PROTECTED]>:
> I met this problem: when searching, I add documents to the index. Although I
> instantiate a new IndexSearcher, I can't retrieve the newly added
> documents. I have to close the program an
On 7/27/06, Mark Miller <[EMAIL PROTECTED]> wrote:
I thought I read that Solr requires an OS that
supports hard links, and thought that Windows only supports soft links.
For the default index distribution method from master to searcher,
yes, hard-links are currently needed.
The distribution mec
Otis Gospodnetic wrote:
I think we have an RMI example in Lucene in Action.
You could also look at how Nutch does it. I think the code is in
org.apache.nutch.ipc package.
I'm not sure why a cross-platform requirement rules out Solr; I would think it
would be exactly the opposite.
As for the 10m limit,
Rossini,
I think what you read might have been that searching a Lucene index
that lives in HDFS would be slow. As far as I understand things, the thing
to do is to copy the index to a local disk, out of HDFS, and then search it
with Lucene from there.
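A minimal sketch of that copy-then-search step, assuming Hadoop's FileSystem API (signatures may differ across versions; the paths are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.search.IndexSearcher;

    // Pull the index out of HDFS onto local disk, then search it locally.
    FileSystem fs = FileSystem.get(new Configuration());
    fs.copyToLocalFile(new Path("/hdfs/indexes/part-0"), new Path("/local/index"));
    IndexSearcher searcher = new IndexSearcher("/local/index");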
Otis
- Original Message
I think we have an RMI example in Lucene in Action.
You could also look at how Nutch does it. I think the code is in
org.apache.nutch.ipc package.
I'm not sure why a cross-platform requirement rules out Solr; I would think it
would be exactly the opposite.
As for the 10m limit, it depends. It depends on
I think (rough sketch below):
- Get the number of documents from the IndexReader.
- Loop from 0 to that number.
- If reader.isDeleted(docId) == false:
  get the doc
  output the doc's fields' content
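A rough sketch of that loop (Lucene 2.x-era API; the index path is hypothetical):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;

    IndexReader reader = IndexReader.open("/path/to/index");
    for (int docId = 0; docId < reader.maxDoc(); docId++) {
        if (!reader.isDeleted(docId)) {
            Document doc = reader.document(docId);
            // write doc's stored fields out as a CSV row here
        }
    }
    reader.close();

Only stored fields come back from reader.document(), so the text you want in the CSV must have been stored at index time.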
Otis
- Original Message
From: MALCOLM CLARK <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, July 27, 200
I know there has been a lot of discussion on distributed search... I am
looking for a cross-platform solution, which seems to kill Solr's
approach... Everyone seems to have implemented this, but only as
proprietary code... it would seem that just using the RMI searcher would
allow a simple solutio
Hi,
I'm going to attempt to output several thousand documents from a 3+ million
document collection into a CSV file.
What is the most efficient method of retrieving all the text from the fields of
each document, one by one? Please help!
Thanks,
Malcolm
Otis,
You mentioned the Hadoop project. I checked it out not long ago and
read something saying it did not support Lucene indexes. Is it possible to
index and then search in HDFS?
Regards,
Rossini
On 7/27/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
Michael,
Certainly paralleli
Michael,
Certainly parallelizing on a set of servers would work (hmm... Hadoop?), but if
you want to do this on a single machine you should tune some of the IndexWriter
params. You didn't mention them, so I assume you haven't tuned anything yet. If
you have Lucene in Action, check out section 2.7.1
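For reference, the knobs usually meant here live on IndexWriter; a hedged sketch (Lucene 2.x-era API; the values are illustrative, not tuned):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
    writer.setMaxBufferedDocs(1000); // buffer more docs in RAM before flushing a segment
    writer.setMergeFactor(30);       // merge segments less often: faster indexing, slower searching
    // ... addDocument() calls ...
    writer.close();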
Yes - parallelizing works great - we built a shared-nothing, JavaSpaces-based
system at X1, and on an 11-way cluster we were able to index 350 office documents
per second - this included the binary-to-text conversion, using the Stellent INSO
libraries. The trick is to create separate indexes and, if you do no
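For what it's worth, the separate-indexes approach typically ends with a merge step; a hedged sketch (Lucene 2.x-era API; paths hypothetical):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    IndexWriter writer = new IndexWriter("/path/to/merged", new StandardAnalyzer(), true);
    Directory[] shards = {
        FSDirectory.getDirectory("/path/to/shard0", false),
        FSDirectory.getDirectory("/path/to/shard1", false)
    };
    writer.addIndexes(shards); // merges the per-node indexes into one
    writer.close();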
Is this the W3 Ent collection you are indexing?
MC
I built an indexer that runs through email and its attachments, rips out
content and whatnot, and then creates a Document and adds it to an
index. It works w/ no problem. The issue is that it takes around 3-5
seconds per email, and I have seen up to 10-15 seconds for emails w/
attachments. I n
I am curious about the potential use of document scoring as a means to
extract additional data from an index. Specifically, I would like the
score to be a count of how many times a particular field matched a set
of terms.
For example, I am indexing movie-stars (each document is a movie-star).
A
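One hedged way to approximate this (my sketch, not from this thread): flatten Lucene's scoring so an OR query over the terms scores roughly one point per matching term. CountingSimilarity is an illustrative name (Lucene 2.x-era Similarity API):

    import org.apache.lucene.search.DefaultSimilarity;

    class CountingSimilarity extends DefaultSimilarity {
        public float tf(float freq) { return freq > 0 ? 1.0f : 0.0f; } // each matching term counts once
        public float idf(int docFreq, int numDocs) { return 1.0f; }    // ignore term rarity
        public float coord(int overlap, int maxOverlap) { return 1.0f; }
        public float lengthNorm(String fieldName, int numTokens) { return 1.0f; }
        public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }
    }

Install it with searcher.setSimilarity(new CountingSimilarity()) before searching.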
Ok, I just tested it.
So consider:
String string = "word -foo";
String[] fields = { "title", "body" };
For the MultiFieldQueryParser I have:
MultiFieldQueryParser qp = new MultiFieldQueryParser(fields,
SearchEngine.ANALYZER);
Query fieldsQuery = qp.parse(string);
System.out.
You could store Term Vectors for your documents, and then look up the
individual document vectors based on the query results. If you need
help w/ Term Vectors, check out Lucene in Action, search this list,
or http://www.cnlp.org/apachecon2005
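A minimal sketch of that lookup (Lucene 2.x-era API; it assumes the field was indexed with term vectors enabled, and the path and field name are hypothetical):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    IndexReader reader = IndexReader.open("/path/to/index");
    TermFreqVector tfv = reader.getTermFreqVector(docId, "body");
    if (tfv != null) { // null if no term vector was stored for this doc/field
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + ": " + freqs[i]);
        }
    }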
-Grant
On Jul 27, 2006, at 4:52 AM, Jia Mi wr
Hi John,
> Just for the record - I've been using the JavaMail POP and IMAP providers in
> the past, and they were prone to hanging with some servers, and
> resource-intensive. I've also been using Outlook (proper, not Outlook Express -
> this is AFAIK impossible to work with) via a Java-COM bridge suc
I met this problem: when searching, I add documents to the index. Although I
instantiate a new IndexSearcher, I can't retrieve the newly added
documents. I have to close and restart the program; then it will
be ok.
Did you close your IndexWriter (so it flushes all changes to disk)
be
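The usual pattern, as a hedged sketch (Lucene 2.x-era API; the path, analyzer, and doc are placeholders):

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;

    IndexWriter writer = new IndexWriter("/path/to/index", analyzer, false);
    writer.addDocument(doc);
    writer.close(); // flushes all changes to disk

    // A searcher only sees the index as of when it was opened,
    // so open a fresh one after the writer has closed:
    IndexSearcher searcher = new IndexSearcher("/path/to/index");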
I met this problem: when searching, I add documents to the index. Although I
instantiate a new IndexSearcher, I can't retrieve the newly added
documents. I have to close and restart the program; then it will
be ok.
The platform is Windows XP. Is it the fault of XP?
Thank you in advance.
I didn't describe the context fully. The app is a server that receives updates
randomly a couple of hundred times a day, and I want the index to be up to date
at all times. If I received several updates at once I could batch them, but
that is quite unlikely.
_
Björn Ekengren
Bankaktiebol
On Thu, 2006-07-27 at 11:06 +0200, Björn Ekengren wrote:
> Thanks everybody for the feedback. I now rewrote my app like this:
>
> synchronized (searcher.getWriteLock()) {
>     IndexReader reader = searcher.getIndexSearcher().getIndexReader();
>     try {
>
Thanks everybody for the feedback. I now rewrote my app like this:
synchronized (searcher.getWriteLock()) {
    IndexReader reader = searcher.getIndexSearcher().getIndexReader();
    try {
        reader.deleteDocuments(new Term("id", id));
        reader.cl
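For context, a hedged sketch of the full delete-then-re-add pattern this appears to implement (getWriteLock() and getIndexSearcher() are the poster's own helpers, not Lucene API; the writer setup is illustrative):

    synchronized (searcher.getWriteLock()) {
        IndexReader reader = searcher.getIndexSearcher().getIndexReader();
        try {
            reader.deleteDocuments(new Term("id", id)); // drop the old version
        } finally {
            reader.close(); // commits the deletion
        }
        IndexWriter writer = new IndexWriter("/path/to/index", analyzer, false);
        try {
            writer.addDocument(updatedDoc); // add the new version
        } finally {
            writer.close();
        }
    }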
Hi everyone,
I am developing an application using Lucene, and I know how to get the
term freq via the IndexReader for the whole corpus. But I wonder if I can
get the term freq statistics just within the query results; e.g., I want the
hot words from just the recent two weeks added into the Lucene indic
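One hedged way to do this (Lucene 2.x-era API; names are illustrative, and it assumes the "content" field was indexed with term vectors): run the query, then aggregate term frequencies over only the matching docs:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.TermFreqVector;
    import org.apache.lucene.search.Hits;

    Hits hits = searcher.search(query);
    Map counts = new HashMap();
    for (int i = 0; i < hits.length(); i++) {
        TermFreqVector tfv = reader.getTermFreqVector(hits.id(i), "content");
        if (tfv == null) continue;
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        for (int j = 0; j < terms.length; j++) {
            Integer old = (Integer) counts.get(terms[j]);
            counts.put(terms[j], new Integer((old == null ? 0 : old.intValue()) + freqs[j]));
        }
    }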
Erick Erickson wrote:
As Miles said, use the DateTools (Lucene) class with a DAY resolution.
That'll give you a YYYYMMDD format, which won't blow up your query with a
"TooManyClauses" exception...
Remember that Lucene deals with strings, so you want to store things in
easily-manipulated string
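A minimal example of that suggestion (Lucene 1.9+ API):

    import java.util.Date;
    import org.apache.lucene.document.DateTools;

    String day = DateTools.dateToString(new Date(), DateTools.Resolution.DAY);
    // e.g. "20060727" - lexicographic order matches chronological order, and
    // DAY resolution means at most one term per day, so a range query's
    // rewritten BooleanQuery stays small and avoids TooManyClauses.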
> I don't think it really matters whether you do deletes on the same
> IndexReader -- what matters is whether there have been any deletes
> done to the index prior to opening the reader since it was last
> optimized. The reason being that deleting a document just causes a
> record of the deletion
: I looked at the implementation of 'read(int[], int[])' in
: 'SegmentTermDocs' and saw that it did the following things:
: - check if the document has a frequency higher than 1, and if so read
: it;
: - check if the document has been deleted, and if so don't add it to the
: result;
: - store the
On Thu, 2006-07-27 at 08:59 +0200, Björn Ekengren wrote:
> > > When I close my application containing index writers, the
> > > lock files are left in the temp directory, causing a "Lock obtain
> > > timed out" error upon the next restart.
> >
> > My guess is that you keep a writer open even though
: Unfortunately this is not that easy, because I must be able to retrieve
: only one article, and if I index all the content in one document then the
: whole document will be retrieved instead of the single article.
I didn't say you had to *only* index the article contents in "group"
documents ... y