So, in version 3, I have to pass a version parameter to the constructor
for StandardAnalyzer. Since Version.LUCENE_CURRENT is deprecated, I'd
like this to be the same as the version of the index I'm using. Is
there a way of getting a value of Version for an index? I don't see
obvious methods on
Simon Willnauer wrote:
> Hey Bill,
> let me clarify what Version is used for since I think that caused
> little confusion.
Thanks.
> The Version constant was mainly introduced to help
> users with backwards compatibility and upgrading their codebase to a
> new version of lucene without breaking
Simon Willnauer wrote:
> On Fri, Sep 17, 2010 at 8:14 PM, Bill Janssen wrote:
> > Simon Willnauer wrote:
> >
> >> Hey Bill,
> >> let me clarify what Version is used for since I think that caused
> >> little confusion.
> >
> > Thanks.
>
Bill Janssen wrote:
> ...is there any attribute or static
> method somewhere in Lucene which will return a value of
> org.apache.lucene.util.Version that corresponds to the version of the
> code? That's what I'm looking for. Version.LUCENE_CURRENT looks good,
> but it&
Simon Willnauer wrote:
> On Fri, Sep 17, 2010 at 11:45 PM, Bill Janssen wrote:
> > Simon Willnauer wrote:
> >
> >> On Fri, Sep 17, 2010 at 8:14 PM, Bill Janssen wrote:
> >> > Simon Willnauer wrote:
> >> >
> >> >> Hey Bill,
>
I thought that since I'm updating UpLib's Lucene code, I should tackle
the issue of document languages, as well. Right now I'm using an
off-the-shelf language identifier, textcat, to figure out which language
a Web page or PDF is (mainly) written in. I then want to analyze that
document with an a
Robert Muir wrote:
> On Fri, Sep 24, 2010 at 9:58 PM, Bill Janssen wrote:
>
> > I thought that since I'm updating UpLib's Lucene code, I should tackle
> > the issue of document languages, as well. Right now I'm using an
> > off-the-shelf language i
Ching wrote:
> I use PDFBox version 1.1.0; I did find a workaround now. Just wondering
> which tools do you use to extract text from pdf? Thanks.
Ching, in UpLib I use a patched version of xpdf which reports the
bounding box and font information for each word (as well as the Unicode
characters o
Hasan Diwan wrote:
> On 27 October 2010 18:16, Troy Wical wrote:
> > Depends on what your trying to index, I suppose. Maildir or mbox? For some
> > time now, off and on, I have been working to index an ezmlm mailing list
> > archive. In the end, I went with Swish-E and have made quite a bit of
Grant Ingersoll wrote:
> Where do you get your Lucene/Solr downloads from?
>
> [x] ASF Mirrors (linked in our release announcements or via the Lucene
> website)
>
> [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [x] I/we build them from source via an SVN/Git checkout.
Clemens Wyss wrote:
> > 1) Docs in different languages -- every document is one language
> > 2) Each document has fields in different languages
> We mainly have 1)-models
I've recently done this for UpLib. I run a language-guesser over the
document to identify the primary language when the docu
7;d have
to see numbers on that from some reasonable corpus to be convinced it
would be worth it.
Bill
>
> paul
>
>
> Le 19 janv. 2011 à 19:21, Bill Janssen a écrit :
>
> > Clemens Wyss wrote:
> >
> >>> 1) Docs in different languages -- every document
Paul Libbrecht wrote:
> I did several changes of this sort and the precision and recall
> measures went better in particular in presence of language-indication
> failure which happened to be very common in our authoring environment.
There are two kinds of failures: no language, or wrong languag
I hope this help.
>
> Dominique
> www.zoonix.fr
> www.crawl-anywhere.com
>
>
>
> Le 20/01/11 00:29, Bill Janssen a écrit :
> > Paul Libbrecht wrote:
> >
> >> I did several changes of this sort and the precision and recall
> >> measures went be
James Wilson wrote:
> I have completed a project to do the exact same thing. I put the pdf
> text in XML files. Then after I do a Lucene search I read the text from
> the XML files. I do not store the text in the Lucene index. That would
> bloat the index and slow down my searches. FYI -- I
shrinath.m wrote:
> Consider we've offline HTML pages, no parsing while crawling, now what ?
> Any tokenizer someone has built for this ?
In UpLib, which uses PyLucene, I use BeautifulSoup to simplify Web pages
by selecting only text between certain tags, before indexing them.
These are offline
Lucille apparently doesn't require gcj.
Bill
> Why Lucille in light of PyLucene?
>
> Erik
>
>
> On Aug 28, 2007, at 10:55 AM, Dan Callaghan wrote:
>
> > Dear list,
> >
> > I have recently begun a Python port of Lucene, named Lucille. It is
> > still very much a work in progress, but I h
I've got a DB of about 2 pages which I thought I'd update to
Lucene 2.2. I removed the old index (2.0 based) completely, and
started re-indexing all the documents. I do this in stages, of about
50 pages at a time, serially, starting a new JVM each time, and reading
in the existing index, then
Here's the code I'm using:
try {
// Now add the documents to the index
IndexWriter writer = new IndexWriter(index_loc, new
StandardAnalyzer(), !index_loc.exists());
writer.setMaxFieldLength(Integer.MAX_VALUE);
try {
for (in
I just tried re-indexing with lucene-core-2.0.0.jar and the same
indexing code; works great. So what am I doing wrong with 2.2?
Bill
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
> Are you really sure in your 2.2 test you are starting with no prior
> index?
I'd ask that too, but yes, I'm really really sure. Building a
completely new index each time.
Works with 2.0.0. Fails with 2.2.0. Works with 2.2.0 *if* I remove
the optimization step.
Bill
---
> You are not hitting any other exception before this one right?
>
> Can you change your test case so that the "catch" clause is run
> before the "finally" clause? I wonder if you are hitting some
> interesting exception and then trying to optimize, which then
> masks the original exception.
Yes
> I'm going to run the same software on an
> Intel machine and see what happens.
So, I ran the same codebase with lucene-core-2.2.0.jar on an Intel Mac
Pro, OS X 10.5.0, Java 1.5, and no exception is raised. Different
corpus, about 5 pages instead of 2. This is reinforcing my
thinking th
> Hmmm ... how many chunks of "about 50 pages" do you do before hitting this?
> Roughly how many docs are in the index when it happens?
Oh, gosh, not sure. I'm guessing it's about half done.
> Can you describe the docs/fields you're adding?
I've got 1735 documents, 18969 pages -- average page s
> Do you have another PPC machine to reproduce this on? (To rule out
> bad RAM/hard-drive on the first one).
I'll dig up an old laptop and try it there.
> Another thing to try is turning on the infoStream
> (IndexWriter.setInfoStream(...)) and capture & post the resulting log.
> It will be very
> > Another thing to try is turning on the infoStream
> > (IndexWriter.setInfoStream(...)) and capture & post the resulting log.
> > It will be very large since it takes quite a while for the error to
> > occur...
>
> I can do that.
Here's what I see:
Optimizing...
merging segments _ram_a (1 doc
> > Another thing to try is turning on the infoStream
> > (IndexWriter.setInfoStream(...)) and capture & post the resulting log.
> > It will be very large since it takes quite a while for the error to
> > occur...
>
> I can do that.
Here's a more complete dump. I've modified the code so that I n
> Can you try running with the trunk version of Lucene (2.3-dev) and see
> if the error still occurs? EG you can download this AM's build here:
>
>
> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/288/artifact/artifacts
Still there. Here's the dump with last night's build:
/L
> Are you still getting the original exception too or just the Array out =20=
>
> of bounds one now? Also, are you doing anything else to the index =20
> while this is happening? The code at the point in the exception below =20=
>
> is trying to properly handle deleted documents.
Just the arra
> Could you post this part of the code (deleting) too?
Here it is:
private static void remove (File index_file, String[] doc_ids, int start) {
String number;
String list;
Term term;
TermDocs matches;
if (debug_mode)
System.err.println("in
> Have you tried another PPC machine?
No. It's in another location, but perhaps I can get it tomorrow. On
the other hand, the success when using 2.0 makes it likely to me that
the machine isn't the problem.
OK, I've reverted to my original codebase (where I first create a
reader and do the dele
> Also, could you try out the CheckIndex tool in 2.3-dev before and
> after the deletes?
Great idea! I don't suppose there's a jar file of it?
Bill
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail:
So, it's a little clearer. I get the Array-out-of-bounds exception if
I'm re-indexing some already indexed documents -- if there are
deletions involved. I get the CorruptIndexException if I'm indexing
freshly -- no deletions. Here's an example of that (with the latest
nightly). I removed the ex
> Your errors seem to happen around the same area (~20K docs). If you
> skip the first say ~18K docs does the error still happen? We need to
> somehow narrow this down.
I'm trying to boil down the documents to a set which I can deploy on a
DVD-ROM, so I can move the same set around from machine
> I'll see if I can get back to this over the weekend.
I got a chance to copy my corpus to another G4 and try indexing with
Lucene 2.2. This one seems OK! Same texts. So now I'm inclined to
believe that it *is* the machine, rather than the code. Whew! Though
that doesn't explain why 2.0 works
> > Hmmm, it still sounds like you are hitting a threading issue that is
> > probably exacerbated by the multicore platform of the newer machine.
>
> Exactly what I was thinking.
> What are the details of the CPUs of these two systems?
Ah, good point. The bad machine is a dual-processor 1GHz G4
> - Fetch and index some pages (containing word and pdf documents) on
> daily basis.
> - Extract all pages that contain some provided keywords after fetching
> the pages.
> - Create some bulletin from fetched pages, bulletin will be in pdf
> format and are categorized based on keywords.
> - provide
> > the unix program pdf2text can convert keeping the text places, but I wanted
> > to ask you guys if you know something better,
>
> AFAIK, PDFBox has a lower-level API that allows you to get hold of text
> positions.
In UpLib, I use xpdf-3.02pl2 with a patch which gives me position and
font in
> Problem I am having is that some of them has multiple columns. and multiple
> word boxes. Does the xpdf patch extract different columns and wordboxes?
It tells you where each word is. Columns you have to do for yourself.
Bill
> > In UpLib, I use xpdf-3.02pl2 with a patch which gives me positi
I do this with uplib (http://uplib.parc.com/) with fair success.
Originally I thought I'd need Lucene plus a relational database to
store metadata about the documents for metadata searches. So far,
though, I've been able to store the metadata in Lucene and use the
same Lucene DB for both metadata
I'd like to suggest a minor change in the QueryParser.jj. I thought
I'd describe it here and get some feedback before posting a patch.
The issue is that I can't get my hands on some clauses (typically
PhraseQuery instances) in my subclass of MultiFieldQueryParser, which
I'd like to do to implemen
I was thinking of trying something, and wondered if someone else
already had it working...
I'd like to take a document, and use it as a query to find other
documents in my index that 'match' it. I'm talking about short
documents, like newspaper articles or email messages. Seems to me
that there
MoreLikeThis is just what I wanted. Thanks.
Bill
> Yes, I believe Dave did something like that on searchmorph.org and somebody=
> else did this on some some with RFCs. What's that called? Query by examp=
> le? I think so, try define:Query By Example on Google.
>
> Take a look at=
> MoreLik
> For example, given terms "female", "John" and "London" - all 3 may
> have equal IDF but should a document representing a female in London
> be given equal weighting to a document representing the rarer example
> of a female who happens to be called "John"?
Not to mention multi-word phrase tokeni
Try looking at the "retrieveInterestingTerms" method on the class MoreLikeThis.
http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/similar/MoreLikeThis.html
Bill
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For addi
> docfreqs (idfs) do not take into account deleted docs.
> This is more of an engineering tradeoff rather than a feature.
> If we could cheaply and easily update idfs when documents are deleted
> from an index, we would.
Wow. So is it fair to say that the stored IDF is really the
cumulative IDF f
> The difference between IndexReader.maxDoc() and numDocs() tells you
> how many documents have been marked for deletion but still take up
> space in the index.
But not which terms have an odd IDF value because of those deleted
documents. How much does the IDF value contribute to the "score" in
s
Dawid Weiss wrote:
> You could also try splitting the document into paragraphs and use Carrot2's
> Lingo algorithm (www.carrot2.org) on a paragraph-level to extract clusters.
> Labelling routine in Lingo should extract 'key' phrases; this analysis is
> heavily frequency-based, but... you know, y
> Dawid Weiss wrote:
> > You could also try splitting the document into paragraphs and use Carrot2's
> > Lingo algorithm (www.carrot2.org) on a paragraph-level to extract clusters.
> > Labelling routine in Lingo should extract 'key' phrases; this analysis is
> > heavily frequency-based, but... y
I've got a some code developed for Lucene 1.4.1, that works around the
problem of having both (1) multiple default fields, and (2) the AND
operator for query elements. In 1.4.1, MultiFieldQueryParser
effectively only allowed the OR operator.
I'm wondering if this has changed in 1.9. Will I be ab
I've got a daemon process which keeps an IndexSearcher open on an
index and responds to query requests by sending back document
identifiers. I've also got other processes updating the index by
re-indexing existing documents, deleting obsolete documents, and
adding new documents. Is there any way
ls to getCurrentVersion()) in order to explicitly
re-load the index. Some postings about transactional updates make me
hopeful that there is some automatic system at work.
Bill
> Bill Janssen wrote:
> > I've got a daemon process which keeps an IndexSearcher open on an
> > index and resp
I don't see how to adjust the value of IndexWriter's
WRITE_LOCK_TIMEOUT in 1.9. Since the property
org.apache.lucene.writeLockTimeout is no longer consulted, the value
of IndexWriter.WRITE_LOCK_TIMEOUT is final, and there's no setter,
what's the deal?
Bill
---
Daniel Naber ponders:
> Seems these have been forgotten. They can easily be added, but I still
> wonder what the use case is to set these values?
The default value isn't magic. The appropriate value is
context-specific. I've got some people using Lucene on machines with
slow disks, and we need
Let's stop this thread.
> Can i use lucene to search the internet.
No.
You may be able to use Lucene to *index* the internet, and then search
the resulting index. Read the book "Lucene in Action" for a better idea
of what this would entail.
Bill
---
I presume the patch that gives us a way of overriding the default
timeout for write locks has made it into the source DB, but I really
need a jar file to point people at which contains it. Any chance of
a 1.9.2 release?
Bill
-
T
> Hi.
>
> Is it correct that in Release 1.9.1 a WRITE_LOCK_TIMEOUT is hardcoded
> and there is no way to set it from outside?
>
> I've seen a check-in in the CVS from a few days ago which added
> getters/setters for this, but ... there is no release containing
> this, right?
>
> So, my que
In case anyone else was wondering:
I got curious about how one would replace FieldCache, and discovered
that you can create an instance of a class which implements
FieldCache, and then simply assign it to
org.apache.lucene.search.FieldCache.DEFAULT.
> 2) your use case sounds like it could best be
Hi!
I've got oodles of email stored in MH (one file per message,
hierarchical directories) format. I'm looking for an IMAP server that
will use Lucene to index that mail and perform the various search
parts of the IMAP protocol. Ideally, the mail would not have to be
converted to another email f
> The JDK comes with some classes that will let you get to
> that elegantly.
You mean clumsily :-).
Bill
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
I've got a situation where I'm searching over a number of different
repositories, each containing a different set of documents. I'd like
to run searches over, say, 4 different indices, then combine the
results outside of Java to present to the user. Is there any way of
normalizing search scores o
results outside of Java without some such calibration.
Bill
> I think Chuck and friends have provided just such a patch, but we
> haven't applied it yet.... :(
>
> Otis
>
> --- Bill Janssen <[EMAIL PROTECTED]> wrote:
> > I've got a situation where I'm
Thanks for pointing this out, Marvin. I wish Sun (or someone) would
document and register this particular character set encoding with
IANA, so that it could be used outside of Java. As it stands now,
it's essentially a bastard encoding, good for nothing, and one of the
warts of Java.
Lucene prob
63 matches
Mail list logo