On Friday 03 August 2007 16:03:22 Doron Cohen wrote:
> What is the anticipated cause of corruption? Malicious?
> Hardware fault? This somewhat reminds of discussions in
> the list about encrypting the index. See LUCENE-737
> and a discussion pointed by it. One of the opinions
> there was that encry
Not sure exactly how to understand "corrupted indexes": in the sense that you
could not read or use the indexes, or something else?
thanks
DT
www.ejinz.com
EjinZ Search Engine
- Original Message -
From: "Doron Cohen" <[EMAIL PROTECTED]>
To:
Sent: Friday, August 03, 2007 1:03 AM
Subject: Re: How do
What is the anticipated cause of corruption? Malicious?
Hardware fault? This somewhat reminds of discussions in
the list about encrypting the index. See LUCENE-737
and a discussion pointed by it. One of the opinions
there was that encryption should be handled at a lower
level (OS/FS). Wouldn't that
Andreas Knecht wrote:
> We're considering using the new IndexWriter.deleteDocuments call rather
> than the IndexReader.delete call. Are there any performance
> improvements that this may provide, other than the benefit of not having
> to switch between readers/writers?
>
> We've looked at LUCENE
Hi,
We're considering using the new IndexWriter.deleteDocuments call rather
than the IndexReader.delete call. Are there any performance
improvements that this may provide, other than the benefit of not having
to switch between readers/writers?
We've looked at LUCENE-565, but there's no cle
I am working on the SpanTermQuery implementation for you; give me until today. Sorry, I
was out in meetings for 2 days.
Enjoy,
Shailendra
On 8/3/07, Cedric Ho <[EMAIL PROTECTED]> wrote:
>
> Hi Paul,
>
> Doesn't SpanFirstQuery only match those spans with a position less than a
> certain end position?
>
> I am rather l
Hi Paul,
Doesn't SpanFirstQuery only match those spans with a position less than a
certain end position?
I am rather looking for a query that would score a document higher when
terms appear near the start, but not totally discard those where terms
appear near the end.
Regards,
Cedric
On 8/2/07, Paul Elschot
If you are just retrieving your custom id and you have more stored
fields (and they are not tiny) you certainly do want to use a field
selector. I would suggest SetBasedFieldSelector.
- Mark
testn wrote:
Hi,
Why don't you consider using FieldSelector? LoadFirstFieldSelector has an
ability t
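Mark's suggestion above can be pictured with a small, self-contained sketch. This is not the Lucene API itself; `FieldDecider` and `FieldSelectorSketch` are hypothetical names that only mirror the shape of SetBasedFieldSelector, which decides per field name whether a stored field should be loaded:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the FieldSelector contract: decide per field
// name whether it should be loaded from the index at all.
interface FieldDecider {
    boolean shouldLoad(String fieldName);
}

public class FieldSelectorSketch {
    // Mirrors the idea of SetBasedFieldSelector: load only the named fields,
    // skipping all other (possibly large) stored fields.
    public static FieldDecider onlyFields(Set<String> fieldsToLoad) {
        return fieldsToLoad::contains;
    }

    public static void main(String[] args) {
        FieldDecider selector = onlyFields(new HashSet<>(Arrays.asList("myId")));
        System.out.println(selector.shouldLoad("myId"));    // true
        System.out.println(selector.shouldLoad("bigBody")); // false
    }
}
```

In real Lucene code the selector would be passed to `IndexReader.document(docId, selector)` so that only the custom-id field is materialized.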
On Thursday 02 August 2007 19:28:48 Mohammad Norouzi wrote:
> You should not store them in an Array structure since they will take up
> memory.
> A BitSet is the best structure to store them.
You can't store strings in a BitSet.
What I would do is return a List but make a custom subclass of
"Zach Bailey" <[EMAIL PROTECTED]> wrote:
> Unfortunately, I am not sure the leader of the project would feel good
> about running code from trunk, save without an explicit endorsement from
> a majority of the developers or contributors for that particular code
> (do those people keep up with t
I have been meaning to write up a Wiki page on this general topic but
have not quite made time yet ...
Sharing an index with a shared filesystem will work, however there are
some caveats:
* This is somewhat uncharted territory because it's fairly recent
fixes to Lucene that have enabled
Mark,
Thanks so much for your response.
Unfortunately, I am not sure the leader of the project would feel good
about running code from trunk, save without an explicit endorsement from
a majority of the developers or contributors for that particular code
(do those people keep up with this list
Rajesh,
I forgot to mention this, but we did investigate this option as well and
even prototyped it for an internal project. It ended up being too slow
for us.
It was adding a lot of overhead even to small updates, IIRC, mainly due
to the fact that the index was essentially stored as a files
One more alternative, though I am not sure if anyone
is using it.
Apache Compass has added a plug-in to allow storing
Lucene index files inside the database. This should
work in clustered environment as all nodes will share
the same database instance.
I am not sure the impact it will have on perf
Some quick info:
NFS should work, but I think you'll want to be working off the trunk.
Also, sharing an index over NFS is supposed to be slow. The standard so
far, if you are not partitioning the index, is to use a unix/linux
filesystem and hardlinks + rsync to efficiently share index changes
Thanks for your response --
Based on my understanding, hadoop and nutch are essentially the same
thing, with nutch being derived from hadoop, and are primarily intended
to be standalone applications.
We are not looking for a standalone application, rather we must use a
framework to implement
Hello,
I've been asked to devise some way to discover and correct data in Lucene
indexes that have been "corrupted." The word "corrupt", in this case, has a
few different meanings, some of which strike me as exceedingly difficult to
grok. What concerns me are the cases where we don't know that
Why don't you check out Hadoop and Nutch? They should provide what you are
looking for.
Zach Bailey wrote:
>
> Hi,
>
> It's been a couple of days now and I haven't heard anything on this
> topic, while there has been substantial list traffic otherwise.
>
> Am I asking in the wrong place? Was I
Hi,
It's been a couple of days now and I haven't heard anything on this
topic, while there has been substantial list traffic otherwise.
Am I asking in the wrong place? Was I unclear?
I know there are people out there that have used/are using Lucene in a
clustered environment. I am just looki
In terms of PDF documents...
PDFBox should work just fine with any Latin-based language; at this
time certain PDFs that have CJK characters can pose some issues. In
general English/French/Spanish should be fine.
Some PDFs use custom encodings that make it impossible to extract text
and
Hey Michael,
Have you given it a try? I would think they would work, but haven't
actually done it. Set up a small test that reads in a PDF in French
or Spanish and give it a try. You might have to worry about
encodings or something, but the structure of the files should be the
same, i.
Check out..
http://wiki.apache.org/lucene-java/LuceneFAQ#head-e7d23f91df094d7baeceb46b04d518dc426d7d2e
heybluez wrote:
>
> Yea, I have seen those. I guess the question is what do you all use to
> extract text from Word, Excel, PPT and PDF? Can I use POI, PDFBox and
> so on? This is what I
Alternatively, construct a parenthesized query that
reflects what you want. If you do, make sure that OR is capitalized,
or make REAL SURE you understand the Lucene syntax and construct
your query with that syntax.
Erick
On 8/2/07, testn <[EMAIL PROTECTED]> wrote:
>
>
> You can create two queries
Yea, I have seen those. I guess the question is what do you all use to
extract text from Word, Excel, PPT and PDF? Can I use POI, PDFBox and
so on? This is what I use now to extract English.
Thanks,
Michael
testn wrote:
If you can extract token stream from those files already, you can simp
Yes, you are right, thanks for the great reply! I skimmed it too quickly today,
so I re-read it now and got the point you meant. I just tried Lucene 2.2.0 (I was
using 2.0.0) and I could add, delete and update docs smoothly! Based on
the tests I did so far, similar to the tests I presented in my f
Thanks! Will look forward to 2.3 then.
Michael McCandless-2 wrote:
>
>
> Honestly I don't really think this is a good idea.
>
> While LUCENE-843 has proven stable so far (knock on wood!), it is
> still a major change and I do worry (less with time :) that maybe I
> broke something subtle some
Just use Nutch. If you look in the Crawl.java class in Nutch, you
can pretty easily tear out the appropriate pieces. Question is, do
you really need all of that? If so, why not just use Nutch?
-Grant
On Aug 2, 2007, at 2:32 AM, Srinivasarao Vundavalli wrote:
How can we use nutch APIs in
Honestly I don't really think this is a good idea.
While LUCENE-843 has proven stable so far (knock on wood!), it is
still a major change and I do worry (less with time :) that maybe I
broke something subtle somewhere.
While a few brave people have tested the trunk in their production
worlds and
How did you encode your integer into a String? Did you use int2sortableStr?
is_maximum wrote:
>
> Hi
> I am using NumberUtils to encode and decode numbers while indexing and
> searching, when I am going to decode the number retrieved from an index it
> throws exception for some fields
> the exce
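The point of int2sortableStr (from Solr's NumberUtils, mentioned above) is that the encoded strings sort lexicographically in the same order as the numbers, and decoding a field that was never encoded that way throws. Here is a minimal sketch of the idea only; `SortableInt` and its fixed-width hex format are my own illustration, not Solr's actual wire format:

```java
public class SortableInt {
    // Encode an int as a fixed-width string whose lexicographic order
    // matches numeric order: shift into the unsigned range, then zero-pad
    // as 8 hex digits. (A sketch of the idea behind int2sortableStr,
    // not the real NumberUtils encoding.)
    static String encode(int value) {
        long shifted = (long) value - Integer.MIN_VALUE; // 0 .. 2^32-1
        return String.format("%08x", shifted);
    }

    static int decode(String s) {
        long shifted = Long.parseLong(s, 16);
        return (int) (shifted + Integer.MIN_VALUE);
    }

    public static void main(String[] args) {
        // Lexicographic order of encodings matches numeric order of values.
        System.out.println(encode(-5).compareTo(encode(3)) < 0); // true
        System.out.println(decode(encode(42))); // 42
    }
}
```

A decode of a plain, unencoded number string would produce garbage or an exception, which is consistent with the failure described in the thread when only some fields were encoded.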
Mike, as a committer, what do you think?
Thanks!
Peter Keegan wrote:
>
> I've built a production index with this patch and done some query stress
> testing with no problems.
> I'd give it a thumbs up.
>
> Peter
>
> On 7/30/07, testn <[EMAIL PROTECTED]> wrote:
>>
>>
>> Hi guys,
>>
>> Do you t
If you can extract a token stream from those files already, you can simply use
different analyzers to analyze those token streams appropriately. Check out
the Lucene contrib analyzers at
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/
heybluez wr
20,000 queries continuously? Sounds a bit too much. Can you elaborate more
what you need to do? Probably you won't need that many queries.
Chew Yee Chuang wrote:
>
> Hi,
>
> Thanks for the link provided, actually I've go through those article when
> I
> developing the index and search functio
Hi,
Why don't you consider using FieldSelector? LoadFirstFieldSelector can
load only the first field in the document without
loading all the fields. After that, you can keep the whole document if you
like. It should help improve performance.
is_maximum wrote:
>
You can create two queries from two query parsers, one with AND and the other
one with OR. After you create both of them, call setBoost() to give them
different boost levels and then join them together using a BooleanQuery with
BooleanClause.Occur.SHOULD. That should do the trick.
askarzaidi w
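The ranking effect of that boosted AND + OR combination can be shown with a toy scorer. This is purely illustrative; `AndOrBoostSketch` and its scoring are my own stand-in, not Lucene's BooleanQuery scoring, which would be built from two parsed queries joined with Occur.SHOULD clauses:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AndOrBoostSketch {
    // Toy scorer for the trick in the thread: an OR part gives every
    // partial match some score, and a boosted AND part lifts documents
    // that match every term to the top.
    static double score(Set<String> docTerms, List<String> queryTerms,
                        double andBoost) {
        long matched = queryTerms.stream().filter(docTerms::contains).count();
        double orScore = matched;                                  // one point per matched term
        double andScore = (matched == queryTerms.size()) ? andBoost : 0.0;
        return orScore + andScore;
    }

    public static void main(String[] args) {
        List<String> q = Arrays.asList("lucene", "index");
        Set<String> both = new HashSet<>(Arrays.asList("lucene", "index"));
        Set<String> one  = new HashSet<>(Collections.singletonList("lucene"));
        System.out.println(score(both, q, 5.0)); // 7.0 (full match, boosted)
        System.out.println(score(one, q, 5.0));  // 1.0 (partial match kept)
    }
}
```

Full matches dominate the ranking, but OR-only matches still appear, which is exactly the behavior askarzaidi asked for.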
Hey Guys,
Quick question:
I do this in my code for searching:
queryParser.setDefaultOperator(QueryParser.Operator.AND);
Lucene is OR by default so I change it to AND for my requirements. Now, I
have a requirement to do OR as well. I mean while doing AND I'd like to
include results from OR too .
You should not store them in an Array structure since they will take up
memory.
A BitSet is the best structure to store them.
On 8/2/07, makkhar <[EMAIL PROTECTED]> wrote:
>
>
> Here's my index structure:
>
> Document -> contract ID -id (index AND store)
> -> paramName
Yes, it decreases performance, but it is the only solution.
I've spent many weeks trying to find the best way to retrieve my own IDs, and
found this way as the last one.
Now I am storing the ids in a BitSet structure and it's fast enough:
public void collect(...){
idBitSet.set(Integer.valueOf(searcher.doc(id).get("MyOwnI
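The collector pattern in that (truncated) snippet can be sketched in a self-contained way. `BitSetCollectorSketch` is a hypothetical name standing in for a Lucene HitCollector subclass; only `java.util.BitSet` here is real API:

```java
import java.util.BitSet;

public class BitSetCollectorSketch {
    // Sketch of the HitCollector pattern from the thread: record matching
    // Lucene doc ids in a BitSet instead of materializing Document objects.
    // (Note the caveat raised elsewhere in the thread: a BitSet holds int
    // doc ids; it cannot hold your own String ids directly.)
    final BitSet matches = new BitSet();

    // Stands in for HitCollector.collect(int doc, float score).
    void collect(int doc, float score) {
        matches.set(doc);
    }

    public static void main(String[] args) {
        BitSetCollectorSketch c = new BitSetCollectorSketch();
        c.collect(3, 1.0f);
        c.collect(7, 0.5f);
        System.out.println(c.matches.cardinality()); // 2
        System.out.println(c.matches.get(3));        // true
    }
}
```

Calling `searcher.doc(id)` inside `collect()`, as the original snippet does, loads a stored document per hit and is the expensive part; collecting only the int doc ids and resolving stored fields later (ideally with a FieldSelector) keeps the hot loop cheap.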
Here's my index structure:
Document -> contract ID -id (index AND store)
-> paramName -name (index AND store)
-> paramValue -value (index BUT NOT store)
When I get back 2 hits, each document contains the ID and paramName; I have
no interest in paramN
Hi,
The solution you suggested will definitely work but will definitely slow
down my search by an order of magnitude. The problem I am trying to solve is
performance; that's why I need the collection of IDs and not the whole
documents.
- thanks for the prompt reply.
is_maximum wrote:
>
> y
What is the structure of your index?
If you haven't already, then add a new field to your index that stores the
contractId. For all other fields, set the "store" flag to false while
indexing.
You can now safely retrieve the value of this contractId field based on
your search results.
Regards,
kapil
Yes, if you extend your class from HitCollector and override the collect()
method with the following signature, you can get the IDs:
public void collect(int id, float score)
On 8/2/07, makkhar <[EMAIL PROTECTED]> wrote:
>
>
> Hi all,
>
>Can I get just a list of document Ids given a search criteria ? To
>
Hi all,
Can I get just a list of document Ids given a search criteria ? To
elaborate here is my situation:
I store 2 contracts in the file system index each with some
parameterName and Value. Given a search criterion - (paramValue='draft'). I
need to get just an ArrayList of Strings conta
Yes, you are correct: I close the IndexWriter and then add more docs. What's wrong?
It worked out fine, and the docs I add appear to NEW INSTANCES OF INDEX
SEARCHERS after calling close on the IndexWriter.
As for creating a new IndexWriter, I tried to, but I ran into the lock
exception
> Subject: RE: IndexReader deletes more that expected
> Date: Wed, 1 Aug 2007 09:07:32 -0700
> From: [EMAIL PROTECTED]
> To: java-user@lucene.apache.org
>
> If I'm reading this correctly, there's something a little wonky here. In
> your example code, you close the IndexWriter and
Cedric,
SpanFirstQuery could be a solution without payloads.
You may want to give it your own Similarity.sloppyFreq() .
Regards,
Paul Elschot
On Thursday 02 August 2007 04:07, Cedric Ho wrote:
> Thanks for the quick response =)
>
> On 8/1/07, Shailendra Sharma <[EMAIL PROTECTED]> wrote:
> > Yes
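Paul's suggestion above (SpanFirstQuery with a custom Similarity.sloppyFreq()) amounts to weighting a match by how early it occurs, rather than cutting off at an end position. A minimal sketch of such a decay function, with `PositionDecaySketch` as a hypothetical name and 1/(position+1) as one arbitrary choice of decay:

```java
public class PositionDecaySketch {
    // Sketch of the idea behind overriding Similarity.sloppyFreq() for a
    // SpanFirstQuery: weight a match by its distance from the start of the
    // document, decaying smoothly instead of cutting off, so terms that
    // appear near the end still contribute a small (nonzero) score.
    static float positionWeight(int position) {
        return 1.0f / (position + 1);
    }

    public static void main(String[] args) {
        System.out.println(positionWeight(0));  // 1.0 (match at the very start)
        System.out.println(positionWeight(50)); // small but still positive
    }
}
```

This matches Cedric's requirement: early matches score higher, but documents whose terms appear only near the end are down-weighted rather than discarded.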