Hi Ching,
I do not think Lucene is the issue for 2000 documents. As Anshum mentioned,
please give more details about your environment.
Also check the CPU usage and the timestamp of the index's .fdt file while it
hangs. Logs would also help to identify the real cause. I used to work with Lucene 2.4
and recently 3.0.2. No sim
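
As a minimal sketch of the ".fdt timestamp" check suggested above: the snippet below lists every *.fdt file in an index directory with its size and last-modified time, using only the JDK. The class name FdtTimestamps and the idea of passing the index path as the first argument are assumptions for illustration, not anything from the thread. If the timestamps keep advancing while the application appears hung, indexing is probably still writing stored fields.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.FileTime;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of Lucene, only inspects files on disk.
class FdtTimestamps {
    /** Returns one "name size bytes, modified time" line per *.fdt file in indexDir. */
    static List<String> describeFdtFiles(Path indexDir) throws IOException {
        List<String> lines = new ArrayList<>();
        // The glob restricts the listing to stored-fields data files.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(indexDir, "*.fdt")) {
            for (Path p : stream) {
                FileTime modified = Files.getLastModifiedTime(p);
                lines.add(p.getFileName() + " " + Files.size(p) + " bytes, modified " + modified);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the index directory is given as the first argument.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : ".");
        for (String line : describeFdtFiles(indexDir)) {
            System.out.println(line);
        }
    }
}
```

Running it a few times in a row while the indexer seems stuck shows whether the file is still growing.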
Hi all,
I only want to index the latest week's data; older data can
be deleted. So I'd like to know about Lucene's delete performance and
whether it will have an impact on search performance when I do lots of
delete operations in the meantime. Thanks
--
Best Regards
Jeff Zhang
-
There's a deleteAll() method on IndexWriter, which is very fast. After you
commit(), none of the documents will be visible to searchers anymore. Once the
last searcher is closed, the documents will completely disappear from the
index. All in all it's quite a good approach to take.
You can also consi
Jeff,
I would suggest not deleting documents off the back of the index unless you can
optimize your index regularly. (Depending on your volume, this could be every
day or once a week)
I would suggest having two indexes, one that is "this" week and one that is
"last" week and a multi-index searc
Note that deleteAll does not require you to optimize anything. It literally
removes all segments from the index in one shot, and when the files are
unreferenced, they will be removed entirely.
Shai
On Wed, Oct 13, 2010 at 4:53 PM, Dan OConnor wrote:
> Jeff,
> I would suggest not deleting documen
Hi there,
I'm currently trying to work out how I can determine the type
(string/number/date/etc.) of a term. I've not seen any off-the-shelf way to do
it, so I am trying to store a payload against each term that records the type.
I'm having a little trouble retrieving a payload I'd stored onto the
One more suggestion:
With Lucene 2.1 you might be using the Hits API to search, which preloads
the documents.
See
https://issues.apache.org/jira/browse/LUCENE-954?focusedCommentId=12579258&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12579258
The performance hit i
Hi,
Thank you for your suggestions. I found the reason: PDFBox
seems to have problems parsing large documents (20MB). I have a few of them
among those 2000 docs, and those are the ones throwing OutOfMemory errors. The
app exits and the JVM dies. I am running on a 32-bit machine.
-- Ching
On
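
The thread does not say which workaround was eventually used. One possible guard, sketched here purely as an assumption, is to check file sizes before handing anything to the PDF parser and to set aside files at or above a limit. The class name LargePdfFilter and the 20MB default are hypothetical; the default simply mirrors the document size mentioned in the message above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-filter: does not call PDFBox, only inspects file sizes.
class LargePdfFilter {
    // Assumption: 20MB, the size reported as problematic in the thread.
    static final long DEFAULT_MAX_BYTES = 20L * 1024 * 1024;

    /** Returns the candidates strictly smaller than maxBytes; larger ones are reported and skipped. */
    static List<Path> smallEnough(List<Path> candidates, long maxBytes) throws IOException {
        List<Path> accepted = new ArrayList<>();
        for (Path p : candidates) {
            if (Files.size(p) < maxBytes) {
                accepted.add(p);
            } else {
                // Skipped files could be indexed later in a JVM with a larger heap.
                System.err.println("Skipping large file: " + p);
            }
        }
        return accepted;
    }
}
```

Skipped files are only logged here; whether to retry them with more heap or a different extractor is a separate decision.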
What version of PDFBox are you running?
PDFBox 0.72 does not work properly with some PDF documents. See more in
https://issues.apache.org/jira/browse/PDFBOX-361.
So, I wrote an extractor (a copy of the original, in fact) based on the trunk
version (1.2.1, actually). Furthermore, this version is faster e
Hello,
Of course, if you actually want a rolling last-7-days effect and not this
week vs. previous week, then you could go with smaller indices, say daily ones.
Then you'd always add new docs to the latest index and remove the oldest
index completely every 24 hours.
You could go hourly
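
The daily-index bookkeeping described above can be sketched without any Lucene calls. The example below only computes which index names are live for a 7-day window and which existing ones have expired. The "index-yyyy-MM-dd" naming scheme, the class name RollingDailyIndexes and the DAYS_TO_KEEP constant are assumptions made for illustration. Opening the live indexes for search and deleting the expired directories is left to the application.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical bookkeeping for one-index-per-day rotation.
class RollingDailyIndexes {
    static final int DAYS_TO_KEEP = 7; // assumption: a 7-day rolling window

    /** Name of the index holding documents for the given day, e.g. index-2010-10-13. */
    static String indexNameFor(LocalDate day) {
        return "index-" + day; // LocalDate.toString() yields ISO yyyy-MM-dd
    }

    /** Names of the indexes that should be searched, newest first. */
    static List<String> liveIndexes(LocalDate today) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < DAYS_TO_KEEP; i++) {
            names.add(indexNameFor(today.minusDays(i)));
        }
        return names;
    }

    /** Names among the existing indexes that fall outside the window and can be removed. */
    static List<String> expiredIndexes(List<String> existing, LocalDate today) {
        List<String> live = liveIndexes(today);
        List<String> expired = new ArrayList<>();
        for (String name : existing) {
            if (!live.contains(name)) {
                expired.add(name);
            }
        }
        return expired;
    }
}
```

New documents would always go to indexNameFor(today); an hourly variant would need only a finer-grained name and a larger window count.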
Hi Group,
I have an issue when using MultiFieldQueryParser. I would like to use one query
against a number of fields; however, I get a
java.lang.IllegalArgumentException: queries.length != fields.length
Looked at the javadoc, and it looks like the only way to run one query against
multiple fie
I'm not quite sure what you mean by "run a query against multiple fields".
But would creating your own BooleanQuery, where each clause was the parsed
result against a specific field, work?
If this is irrelevant, could you give a couple of examples of what you're
looking to accomplish?
Best
Erick
O
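
One way to picture the "one clause per field" idea is at the query-string level. The sketch below repeats the same user query for each field and ORs the groups together, relying on the field grouping form field:(...) from Lucene's query syntax. It builds a string only and does not call Lucene; the class name MultiFieldQueryString and the field names in the usage line are hypothetical, and the user query is assumed to need no escaping.

```java
import java.util.StringJoiner;

// Hypothetical helper: produces query-parser syntax, it does not parse anything.
class MultiFieldQueryString {
    /** Builds "f1:(query) OR f2:(query) ..." so one query text is applied to every field. */
    static String acrossFields(String userQuery, String... fields) {
        StringJoiner joined = new StringJoiner(" OR ");
        for (String field : fields) {
            // Assumption: userQuery contains no characters that need escaping.
            joined.add(field + ":(" + userQuery + ")");
        }
        return joined.toString();
    }
}
```

For example, acrossFields("lucene delete", "title", "body") yields title:(lucene delete) OR body:(lucene delete), which a single-field query parser could then parse into a BooleanQuery with one clause per field.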
I use PDFBox version 1.1.0; I did find a workaround now. Just wondering
which tools you use to extract text from PDF? Thanks.
On Wed, Oct 13, 2010 at 11:36 AM, Fabiano Nunes wrote:
> What version of PDFBox are you running?
> PDFBox 0.72 does not work properly with some PDF documents. See more
Ching wrote:
> I use PDFBox version 1.1.0; I did find a workaround now. Just wondering
> which tools you use to extract text from PDF? Thanks.
Ching, in UpLib I use a patched version of xpdf which reports the
bounding box and font information for each word (as well as the Unicode
characters o