recommended way to identify a version to pass to StandardAnalyzer constructor?

2010-09-16 Thread Bill Janssen
So, in version 3, I have to pass a version parameter to the constructor for StandardAnalyzer. Since Version.LUCENE_CURRENT is deprecated, I'd like this to be the same as the version of the index I'm using. Is there a way of getting a value of Version for an index? I don't see obvious methods on

Re: recommended way to identify a version to pass to StandardAnalyzer constructor?

2010-09-17 Thread Bill Janssen
Simon Willnauer wrote: > Hey Bill, > let me clarify what Version is used for since I think that caused > little confusion. Thanks. > The Version constant was mainly introduced to help > users with backwards compatibility and upgrading their codebase to a > new version of lucene without breaking

Re: recommended way to identify a version to pass to StandardAnalyzer constructor?

2010-09-17 Thread Bill Janssen
Simon Willnauer wrote: > On Fri, Sep 17, 2010 at 8:14 PM, Bill Janssen wrote: > > Simon Willnauer wrote: > > > >> Hey Bill, > >> let me clarify what Version is used for since I think that caused > >> little confusion. > > > > Thanks. >

Re: recommended way to identify a version to pass to StandardAnalyzer constructor?

2010-09-17 Thread Bill Janssen
Bill Janssen wrote: > ...is there any attribute or static > method somewhere in Lucene which will return a value of > org.apache.lucene.util.Version that corresponds to the version of the > code? That's what I'm looking for. Version.LUCENE_CURRENT looks good, > but it's

Re: recommended way to identify a version to pass to StandardAnalyzer constructor?

2010-09-19 Thread Bill Janssen
Simon Willnauer wrote: > On Fri, Sep 17, 2010 at 11:45 PM, Bill Janssen wrote: > > Simon Willnauer wrote: > > > >> On Fri, Sep 17, 2010 at 8:14 PM, Bill Janssen wrote: > >> > Simon Willnauer wrote: > >> > > >> >> Hey Bill, >

finding the analyzer for a language...

2010-09-24 Thread Bill Janssen
I thought that since I'm updating UpLib's Lucene code, I should tackle the issue of document languages, as well. Right now I'm using an off-the-shelf language identifier, textcat, to figure out which language a Web page or PDF is (mainly) written in. I then want to analyze that document with an a

Re: finding the analyzer for a language...

2010-09-25 Thread Bill Janssen
Robert Muir wrote: > On Fri, Sep 24, 2010 at 9:58 PM, Bill Janssen wrote: > > > I thought that since I'm updating UpLib's Lucene code, I should tackle > > the issue of document languages, as well. Right now I'm using an > > off-the-shelf language i

Re: Indexing is hung or doesn't complete

2010-10-13 Thread Bill Janssen
Ching wrote: > I use PDFBox version 1.1.0; I did find a workaround now. Just wondering > which tools do you use to extract text from pdf? Thanks. Ching, in UpLib I use a patched version of xpdf which reports the bounding box and font information for each word (as well as the Unicode characters o

Re: Email Indexing

2010-10-28 Thread Bill Janssen
Hasan Diwan wrote: > On 27 October 2010 18:16, Troy Wical wrote: > > Depends on what you're trying to index, I suppose. Maildir or mbox? For some > > time now, off and on, I have been working to index an ezmlm mailing list > > archive. In the end, I went with Swish-E and have made quite a bit of

Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-18 Thread Bill Janssen
Grant Ingersoll wrote: > Where do you get your Lucene/Solr downloads from? > > [x] ASF Mirrors (linked in our release announcements or via the Lucene > website) > > [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.) > > [x] I/we build them from source via an SVN/Git checkout.

Re: AW: Best practices for multiple languages?

2011-01-19 Thread Bill Janssen
Clemens Wyss wrote: > > 1) Docs in different languages -- every document is one language > > 2) Each document has fields in different languages > We mainly have 1)-models I've recently done this for UpLib. I run a language-guesser over the document to identify the primary language when the docu

Re: AW: Best practices for multiple languages?

2011-01-19 Thread Bill Janssen
I'd have to see numbers on that from some reasonable corpus to be convinced it would be worth it. Bill > > paul > > > On 19 Jan 2011 at 19:21, Bill Janssen wrote: > > > Clemens Wyss wrote: > > > >>> 1) Docs in different languages -- every document

Re: AW: Best practices for multiple languages?

2011-01-19 Thread Bill Janssen
Paul Libbrecht wrote: > I did several changes of this sort and the precision and recall > measures went better in particular in presence of language-indication > failure which happened to be very common in our authoring environment. There are two kinds of failures: no language, or wrong languag

Re: AW: Best practices for multiple languages?

2011-01-20 Thread Bill Janssen
I hope this helps. > > Dominique > www.zoonix.fr > www.crawl-anywhere.com > > > > On 20/01/11 00:29, Bill Janssen wrote: > > Paul Libbrecht wrote: > > > >> I did several changes of this sort and the precision and recall > >> measures went be

Re: about pdf search

2011-03-07 Thread Bill Janssen
James Wilson wrote: > I have completed a project to do the exact same thing. I put the pdf > text in XML files. Then after I do a Lucene search I read the text from > the XML files. I do not store the text in the Lucene index. That would > bloat the index and slow down my searches. FYI -- I

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-11 Thread Bill Janssen
shrinath.m wrote: > Consider we've offline HTML pages, no parsing while crawling, now what ? > Any tokenizer someone has built for this ? In UpLib, which uses PyLucene, I use BeautifulSoup to simplify Web pages by selecting only text between certain tags, before indexing them. These are offline

Re: Lucille, a (new) Python port of Lucene

2007-08-28 Thread Bill Janssen
Lucille apparently doesn't require gcj. Bill > Why Lucille in light of PyLucene? > > Erik > > > On Aug 28, 2007, at 10:55 AM, Dan Callaghan wrote: > > > Dear list, > > > > I have recently begun a Python port of Lucene, named Lucille. It is > > still very much a work in progress, but I h

lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-28 Thread Bill Janssen
I've got a DB of about 2 pages which I thought I'd update to Lucene 2.2. I removed the old index (2.0 based) completely, and started re-indexing all the documents. I do this in stages, of about 50 pages at a time, serially, starting a new JVM each time, and reading in the existing index, then

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-28 Thread Bill Janssen
Here's the code I'm using: try { // Now add the documents to the index IndexWriter writer = new IndexWriter(index_loc, new StandardAnalyzer(), !index_loc.exists()); writer.setMaxFieldLength(Integer.MAX_VALUE); try { for (in

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-28 Thread Bill Janssen
I just tried re-indexing with lucene-core-2.0.0.jar and the same indexing code; works great. So what am I doing wrong with 2.2? Bill - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-28 Thread Bill Janssen
> Are you really sure in your 2.2 test you are starting with no prior > index? I'd ask that too, but yes, I'm really really sure. Building a completely new index each time. Works with 2.0.0. Fails with 2.2.0. Works with 2.2.0 *if* I remove the optimization step. Bill

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-28 Thread Bill Janssen
> You are not hitting any other exception before this one right? > > Can you change your test case so that the "catch" clause is run > before the "finally" clause? I wonder if you are hitting some > interesting exception and then trying to optimize, which then > masks the original exception. Yes
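
The masking effect raised in the quoted question can be reproduced without Lucene. What follows is only a sketch in plain Java: the class MaskedExceptionDemo and its addDocuments() and optimize() methods are hypothetical stand-ins, not the indexing code from this thread. It shows an exception thrown from a finally block replacing the one thrown in the try block, and one possible way of catching first so that the original failure survives.

```java
// Hypothetical stand-ins, not Lucene code: shows how an exception thrown
// from a finally block replaces the one thrown in the try block.
final class MaskedExceptionDemo {

    // Imitates an indexing step that fails.
    static void addDocuments() {
        throw new IllegalStateException("original failure while adding documents");
    }

    // Imitates an optimize step that fails as a consequence.
    static void optimize() {
        throw new RuntimeException("secondary failure during optimize");
    }

    // optimize() runs in finally, so its exception is the one the caller sees.
    static String maskedMessage() {
        try {
            try {
                addDocuments();
            } finally {
                optimize();
            }
        } catch (RuntimeException e) {
            return e.getMessage();
        }
        return "no exception";
    }

    // Catching first records the original failure and skips the optimize step.
    static String reportedFirstMessage() {
        String first = null;
        try {
            try {
                addDocuments();
            } catch (RuntimeException e) {
                first = e.getMessage(); // remember the original failure
                throw e;
            } finally {
                if (first == null) {
                    optimize();         // only optimize after a clean run
                }
            }
        } catch (RuntimeException e) {
            return e.getMessage();
        }
        return "no exception";
    }

    public static void main(String[] args) {
        System.out.println(maskedMessage());
        System.out.println(reportedFirstMessage());
    }
}
```

If the first line printed is the optimize failure, the original exception was hidden, which is the situation the question is trying to rule out.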

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-28 Thread Bill Janssen
> I'm going to run the same software on an > Intel machine and see what happens. So, I ran the same codebase with lucene-core-2.2.0.jar on an Intel Mac Pro, OS X 10.5.0, Java 1.5, and no exception is raised. Different corpus, about 5 pages instead of 2. This is reinforcing my thinking th

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-28 Thread Bill Janssen
> Hmmm ... how many chunks of "about 50 pages" do you do before hitting this? > Roughly how many docs are in the index when it happens? Oh, gosh, not sure. I'm guessing it's about half done. > Can you describe the docs/fields you're adding? I've got 1735 documents, 18969 pages -- average page s

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
> Do you have another PPC machine to reproduce this on? (To rule out > bad RAM/hard-drive on the first one). I'll dig up an old laptop and try it there. > Another thing to try is turning on the infoStream > (IndexWriter.setInfoStream(...)) and capture & post the resulting log. > It will be very

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
> > Another thing to try is turning on the infoStream > > (IndexWriter.setInfoStream(...)) and capture & post the resulting log. > > It will be very large since it takes quite a while for the error to > > occur... > > I can do that. Here's what I see: Optimizing... merging segments _ram_a (1 doc

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
> > Another thing to try is turning on the infoStream > > (IndexWriter.setInfoStream(...)) and capture & post the resulting log. > > It will be very large since it takes quite a while for the error to > > occur... > > I can do that. Here's a more complete dump. I've modified the code so that I n

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
> Can you try running with the trunk version of Lucene (2.3-dev) and see > if the error still occurs? EG you can download this AM's build here: > > > http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/288/artifact/artifacts Still there. Here's the dump with last night's build: /L

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
> Are you still getting the original exception too or just the Array out > of bounds one now? Also, are you doing anything else to the index > while this is happening? The code at the point in the exception below > is trying to properly handle deleted documents. Just the arra

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
> Could you post this part of the code (deleting) too? Here it is: private static void remove (File index_file, String[] doc_ids, int start) { String number; String list; Term term; TermDocs matches; if (debug_mode) System.err.println("in

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
> Have you tried another PPC machine? No. It's in another location, but perhaps I can get it tomorrow. On the other hand, the success when using 2.0 makes it likely to me that the machine isn't the problem. OK, I've reverted to my original codebase (where I first create a reader and do the dele

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
> Also, could you try out the CheckIndex tool in 2.3-dev before and > after the deletes? Great idea! I don't suppose there's a jar file of it? Bill

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-29 Thread Bill Janssen
So, it's a little clearer. I get the Array-out-of-bounds exception if I'm re-indexing some already indexed documents -- if there are deletions involved. I get the CorruptIndexException if I'm indexing freshly -- no deletions. Here's an example of that (with the latest nightly). I removed the ex

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-11-30 Thread Bill Janssen
> Your errors seem to happen around the same area (~20K docs). If you > skip the first say ~18K docs does the error still happen? We need to > somehow narrow this down. I'm trying to boil down the documents to a set which I can deploy on a DVD-ROM, so I can move the same set around from machine

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-12-02 Thread Bill Janssen
> I'll see if I can get back to this over the weekend. I got a chance to copy my corpus to another G4 and try indexing with Lucene 2.2. This one seems OK! Same texts. So now I'm inclined to believe that it *is* the machine, rather than the code. Whew! Though that doesn't explain why 2.0 works

Re: lucene-core-2.2.0.jar broken? CorruptIndexException?

2007-12-02 Thread Bill Janssen
> > Hmmm, it still sounds like you are hitting a threading issue that is > > probably exacerbated by the multicore platform of the newer machine. > > Exactly what I was thinking. > What are the details of the CPUs of these two systems? Ah, good point. The bad machine is a dual-processor 1GHz G4

Re: Does Lucene save an offline version of web pages?

2008-04-27 Thread Bill Janssen
> - Fetch and index some pages (containing word and pdf documents) on > daily basis. > - Extract all pages that contain some provided keywords after fetching > the pages. > - Create some bulletin from fetched pages, bulletin will be in pdf > format and are categorized based on keywords. > - provide

Re: text extraction from pdf

2008-05-14 Thread Bill Janssen
> > the unix program pdf2text can convert keeping the text places, but I wanted > > to ask you guys if you know something better, > > AFAIK, PDFBox has a lower-level API that allows you to get hold of text > positions. In UpLib, I use xpdf-3.02pl2 with a patch which gives me position and font in

Re: text extraction from pdf

2008-05-15 Thread Bill Janssen
> Problem I am having is that some of them has multiple columns. and multiple > word boxes. Does the xpdf patch extract different columns and wordboxes? It tells you where each word is. Columns you have to do for yourself. Bill > > In UpLib, I use xpdf-3.02pl2 with a patch which gives me positi

Re: Using lucene as a database... good idea or bad idea?

2008-07-29 Thread Bill Janssen
I do this with uplib (http://uplib.parc.com/) with fair success. Originally I thought I'd need Lucene plus a relational database to store metadata about the documents for metadata searches. So far, though, I've been able to store the metadata in Lucene and use the same Lucene DB for both metadata

overriding addClause()?

2006-10-23 Thread Bill Janssen
I'd like to suggest a minor change in the QueryParser.jj. I thought I'd describe it here and get some feedback before posting a patch. The issue is that I can't get my hands on some clauses (typically PhraseQuery instances) in my subclass of MultiFieldQueryParser, which I'd like to do to implemen

using a document as a query?

2007-01-30 Thread Bill Janssen
I was thinking of trying something, and wondered if someone else already had it working... I'd like to take a document, and use it as a query to find other documents in my index that 'match' it. I'm talking about short documents, like newspaper articles or email messages. Seems to me that there

Re: using a document as a query?

2007-01-31 Thread Bill Janssen
MoreLikeThis is just what I wanted. Thanks. Bill > Yes, I believe Dave did something like that on searchmorph.org and somebody else did this on some some with RFCs. What's that called? Query by example? I think so, try define:Query By Example on Google. > > Take a look at MoreLik

Re: Reduction based "more like this"?

2007-02-09 Thread Bill Janssen
> For example, given terms "female", "John" and "London" - all 3 may > have equal IDF but should a document representing a female in London > be given equal weighting to a document representing the rarer example > of a female who happens to be called "John"? Not to mention multi-word phrase tokeni

Re: keywords in a document

2007-04-09 Thread Bill Janssen
Try looking at the "retrieveInterestingTerms" method on the class MoreLikeThis. http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/similar/MoreLikeThis.html Bill

Re: strange idf in Lucene 2.1

2007-04-12 Thread Bill Janssen
> docfreqs (idfs) do not take into account deleted docs. > This is more of an engineering tradeoff rather than a feature. > If we could cheaply and easily update idfs when documents are deleted > from an index, we would. Wow. So is it fair to say that the stored IDF is really the cumulative IDF f

Re: strange idf in Lucene 2.1

2007-04-12 Thread Bill Janssen
> The difference between IndexReader.maxDoc() and numDocs() tells you > how many documents have been marked for deletion but still take up > space in the index. But not which terms have an odd IDF value because of those deleted documents. How much does the IDF value contribute to the "score" in s
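
To get a rough feel for how much stale counts can move an IDF value, here is a small sketch in plain Java. It assumes the formula that DefaultSimilarity is believed to have used in Lucene 2.x, idf = 1 + ln(numDocs / (docFreq + 1)); the class StaleIdfDemo and all of the sample counts are made up for illustration and are not taken from this thread.

```java
// Illustration only: compares an IDF computed from counts that still include
// deleted documents with one computed from live documents only.
final class StaleIdfDemo {

    // Assumed formula: idf = 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Documents marked for deletion but still occupying space in the index.
    static int deletedCount(int maxDoc, int numDocs) {
        return maxDoc - numDocs;
    }

    public static void main(String[] args) {
        int maxDoc = 1000;       // hypothetical: includes deleted documents
        int numDocs = 600;       // hypothetical: live documents only
        int staleDocFreq = 200;  // hypothetical: term count including deleted docs
        int liveDocFreq = 20;    // hypothetical: term count among live docs

        System.out.println("deleted docs: " + deletedCount(maxDoc, numDocs));
        System.out.println("stale idf:    " + idf(staleDocFreq, maxDoc));
        System.out.println("live idf:     " + idf(liveDocFreq, numDocs));
    }
}
```

With these invented numbers the term looks much more common than it really is among live documents, so its stale IDF comes out lower than the live one. The size of the gap depends entirely on how the deletions are distributed across terms, which the deleted-document count alone does not reveal.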

Re: Keyphrase Extraction

2007-05-08 Thread Bill Janssen
Dawid Weiss wrote: > You could also try splitting the document into paragraphs and use Carrot2's > Lingo algorithm (www.carrot2.org) on a paragraph-level to extract clusters. > Labelling routine in Lingo should extract 'key' phrases; this analysis is > heavily frequency-based, but... you know, y

Re: Keyphrase Extraction (via Lingo)

2007-05-09 Thread Bill Janssen
> Dawid Weiss wrote: > > You could also try splitting the document into paragraphs and use Carrot2's > > Lingo algorithm (www.carrot2.org) on a paragraph-level to extract clusters. > > Labelling routine in Lingo should extract 'key' phrases; this analysis is > > heavily frequency-based, but... y

multi-field query parser with AND operator?

2006-01-04 Thread Bill Janssen
I've got some code developed for Lucene 1.4.1, that works around the problem of having both (1) multiple default fields, and (2) the AND operator for query elements. In 1.4.1, MultiFieldQueryParser effectively only allowed the OR operator. I'm wondering if this has changed in 1.9. Will I be ab

notification of active IndexSearchers when index is modified?

2006-01-19 Thread Bill Janssen
I've got a daemon process which keeps an IndexSearcher open on an index and responds to query requests by sending back document identifiers. I've also got other processes updating the index by re-indexing existing documents, deleting obsolete documents, and adding new documents. Is there any way

Re: notification of active IndexSearchers when index is modified?

2006-01-19 Thread Bill Janssen
ls to getCurrentVersion()) in order to explicitly re-load the index. Some postings about transactional updates make me hopeful that there is some automatic system at work. Bill > Bill Janssen wrote: > > I've got a daemon process which keeps an IndexSearcher open on an > > index and resp
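
The explicit re-load approach mentioned above (compare the index version on each request and reopen when it changes) could be sketched roughly as follows. This is plain Java with no Lucene dependency: ReloadOnChange is a hypothetical helper, the LongSupplier is a stand-in for the getCurrentVersion() call, and the Supplier is a stand-in for whatever opens a new IndexSearcher.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper: hands out a cached searcher and reopens it whenever
// the reported index version differs from the one that was loaded.
final class ReloadOnChange<S> {
    private final LongSupplier currentVersion; // stand-in for getCurrentVersion()
    private final Supplier<S> opener;          // stand-in for opening a searcher
    private long loadedVersion;
    private S searcher;

    ReloadOnChange(LongSupplier currentVersion, Supplier<S> opener) {
        this.currentVersion = currentVersion;
        this.opener = opener;
        // Read the version before opening, so a change that lands in between
        // is picked up as a reload on the next call rather than missed.
        this.loadedVersion = currentVersion.getAsLong();
        this.searcher = opener.get();
    }

    // Returns the cached searcher, reopening it first if the index changed.
    // Closing the old searcher is left out of this sketch.
    synchronized S get() {
        long version = currentVersion.getAsLong();
        if (version != loadedVersion) {
            searcher = opener.get();
            loadedVersion = version;
        }
        return searcher;
    }
}
```

A daemon would call get() at the start of each query. The cost is one version check per request, and requests already in flight keep using the searcher they were given.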

Adjusting WRITE_LOCK_TIMEOUT in 1.9.1

2006-03-09 Thread Bill Janssen
I don't see how to adjust the value of IndexWriter's WRITE_LOCK_TIMEOUT in 1.9. Since the property org.apache.lucene.writeLockTimeout is no longer consulted, the value of IndexWriter.WRITE_LOCK_TIMEOUT is final, and there's no setter, what's the deal? Bill

Re: Setting the COMMIT lock timeout.

2006-03-13 Thread Bill Janssen
Daniel Naber ponders: > Seems these have been forgotten. They can easily be added, but I still > wonder what the use case is to set these values? The default value isn't magic. The appropriate value is context-specific. I've got some people using Lucene on machines with slow disks, and we need

Re: Can i use lucene to search the internet.

2006-03-23 Thread Bill Janssen
Let's stop this thread. > Can i use lucene to search the internet. No. You may be able to use Lucene to *index* the internet, and then search the resulting index. Read the book "Lucene in Action" for a better idea of what this would entail. Bill

Any plans for a 1.9.2 release? Need timeout setting!

2006-03-30 Thread Bill Janssen
I presume the patch that gives us a way of overriding the default timeout for write locks has made it into the source DB, but I really need a jar file to point people at which contains it. Any chance of a 1.9.2 release? Bill

Re: WRITE_LOCK_TIMEOUT

2006-04-05 Thread Bill Janssen
> Hi. > > Is it correct that in Release 1.9.1 a WRITE_LOCK_TIMEOUT is hardcoded > and there is no way to set it from outside? > > I've seen a check-in in the CVS from a few days ago which added > getters/setters for this, but ... there is no release containing > this, right? > > So, my que

Re: Fetch Documents Without Retrieveing All Fields

2006-04-10 Thread Bill Janssen
In case anyone else was wondering: I got curious about how one would replace FieldCache, and discovered that you can create an instance of a class which implements FieldCache, and then simply assign it to org.apache.lucene.search.FieldCache.DEFAULT. > 2) your use case sounds like it could best be

IMAP server that uses Lucene?

2006-05-28 Thread Bill Janssen
Hi! I've got oodles of email stored in MH (one file per message, hierarchical directories) format. I'm looking for an IMAP server that will use Lucene to index that mail and perform the various search parts of the IMAP protocol. Ideally, the mail would not have to be converted to another email f

Re: Find version of Lucene library

2005-03-08 Thread Bill Janssen
> The JDK comes with some classes that will let you get to > that elegantly. You mean clumsily :-). Bill

Normalizing search scores over multiple indices

2005-04-04 Thread Bill Janssen
I've got a situation where I'm searching over a number of different repositories, each containing a different set of documents. I'd like to run searches over, say, 4 different indices, then combine the results outside of Java to present to the user. Is there any way of normalizing search scores o

Re: Normalizing search scores over multiple indices

2005-04-04 Thread Bill Janssen
results outside of Java without some such calibration. Bill > I think Chuck and friends have provided just such a patch, but we > haven't applied it yet.... :( > > Otis > > --- Bill Janssen <[EMAIL PROTECTED]> wrote: > > I've got a situation where I'm

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread Bill Janssen
Thanks for pointing this out, Marvin. I wish Sun (or someone) would document and register this particular character set encoding with IANA, so that it could be used outside of Java. As it stands now, it's essentially a bastard encoding, good for nothing, and one of the warts of Java. Lucene prob