Re: HTML text extraction

2006-06-20 Thread Otis Gospodnetic
John, I also wrote about using NekoHTML, I think. I prefer that to JTidy. That also tells you what Simpy.com uses. Otis - Original Message From: John Wang <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, June 21, 2006 1:39:41 AM Subject: HTML text extraction Can

Re: Search within multiple different subfolders

2006-06-20 Thread Shaghayegh Sahebie
Thank you Erick, You are right! Lucene searches the index, not directories, but in this way we search the whole index and just add a new constraint (e.g. file path) to match proper results. Is there a way not to search the whole index, only some branches of an index, without using a constraint

Re: HTML text extraction

2006-06-20 Thread Daniel Noll
John Wang wrote: Can someone please suggest an HTML text extraction library? In the Lucene book, it recommends Tidy. Seems JTidy is not really being maintained. We use this library to do our HTML parsing work: http://htmlparser.sourceforge.net/ It's fairly resilient to bad code, including thin

HTML text extraction

2006-06-20 Thread John Wang
Can someone please suggest an HTML text extraction library? In the Lucene book, it recommends Tidy. Seems JTidy is not really being maintained. Otis, what do you guys use at Simpy? Thanks -john

Re: addIndexes() is taking infinite time ...

2006-06-20 Thread heritrix . lucene
hi, thanks for your reply. Now i restarted my application with maxBufferedDocs=10,000. And i am sorry to say that i was adding those indexes one by one. :-) Anyway, can you please explain addIndexes() to me? I want to know what exactly happens while adding these.. With Regards, On 6/20/06, Otis G

Re: use of lucene app..

2006-06-20 Thread Mike Richmond
Hello Bruce, is there a way to set lucene so that it only parses/crawls through a given portion of a website... Lucene does not crawl anything on its own. It is simply a search engine library. All indexing and crawling must be done by an application that you create. also, when lucene retu

use of lucene app..

2006-06-20 Thread bruce
hi.. is there a way to set lucene so that it only parses/crawls through a given portion of a website... i have a college site. i'm looking at simply extracting all the information for a given section of the site, ie the registrar section... if i can determine that i want all the pages underneath

Re: Lucene as syslog storage

2006-06-20 Thread Benjamin Stein
I've personally indexed over 1,000,000 documents and Lucene doesn't even breathe hard. We are in the hundreds of millions and growing, and Lucene does tend to sweat a little bit, although it can certainly handle it. You're going to have to understand the internals of Lucene a bit more.

Re: Modifying the stored norm type

2006-06-20 Thread Dan Climan
>Paul Elschot <[EMAIL PROTECTED]> >>On Tuesday 20 June 2006 12:02, Marcus Falck wrote: >> After a lot of debugging and some API doc reading I have come to the > conclusion that the static encodeNorm method of the Similarity class > will encode my boost value into a single byte decimal number. >>

Re: Custom ScoreDocComparator and normalized Scores

2006-06-20 Thread Chris Hostetter
First off: why do you need the normalized scores in your equation? For the purposes of comparing the calculated values in order to sort them, it shouldn't matter if they are normalized or not. Second: I strongly suggest you take a look at FunctionQuery ... it was created for the express purpose

Re: How to search for European words with and without special characters

2006-06-20 Thread Chris Hostetter
take a look at the ISOLatin1AccentFilter .. it doesn't seem to do exactly what you want (replacing "ü" with "ue" .. it just uses "u") but it should give you an idea of what you can do. There was also a discussion recently about how you can use a modified version of this Filter at index time to ge

Re: Indexing Dash concatenated words vs SynonymAnalyzer

2006-06-20 Thread Yonik Seeley
On 6/20/06, Martin Braun <[EMAIL PROTECTED]> wrote: german words are often dash-concatenated, e.g. West-Berlin or something like "C*-algebras and W*-algebras". I tend to write my own analyzer like the SynonymAnalyzer from the LIA-Book. I want to Index these words like this: West-Berlin => Westb

Re: SynonymsQuery

2006-06-20 Thread Ziv Gome
Hi All, NG Vinny. - Written in response to NG Vinny's post from 12-Jun; for some reason I could not add it to the thread. :-( The problem lies in versions. The published code for SynonymsQuery was originally written for Lucene 1.4.3. It did not compile as is with Lucene 2.0. The required c

Re: Modifying the stored norm type

2006-06-20 Thread Yonik Seeley
On 6/20/06, Marcus Falck <[EMAIL PROTECTED]> wrote: So I guess I will have to get lucene to store a 4 byte norm in the form of a float instead of the single byte? Can you store the float in a field instead? That seems like it might be a bit easier than modifying how lucene stores norms. -Yon
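Yonik's suggestion, storing the full-precision float as its own field rather than squeezing it into the one-byte norm, might look like this (a minimal sketch against the Lucene 2.0-era Field API; the field names and class name are mine):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch (assumed Lucene 2.0-era API): keep the precise boost as a stored,
// untokenized field instead of relying on the lossy one-byte norm.
public class BoostFieldSketch {
    public static Document withBoostField(String text, float boost) {
        Document doc = new Document();
        doc.add(new Field("contents", text, Field.Store.YES, Field.Index.TOKENIZED));
        // Store the full 4-byte float as text; parse it back at search time.
        doc.add(new Field("boost", Float.toString(boost),
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        return doc;
    }
}
```

At search time the value would come back via something like `Float.parseFloat(hits.doc(i).get("boost"))`, at the cost of one stored-field load per hit.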

Re: faceting and categorizing on color?

2006-06-20 Thread James Pine
> First off, let me clear up something regarding your > index field structure, > you mentioned that you currently have documents that > look like this... > > : IMAGE 1 > : COLORS F0 FF FFF000 00 F0 FF > : E0 EE EEE000 00 > > If you are indexing it as Fie

using lucene Lock inter-jvm

2006-06-20 Thread jm
Hi, I am trying to peruse lucene's Lock for my own purposes, I need to lock several java processes and I thought I could reuse the Lock stuff. I understand lucene locks work across jvm. But I cannot make it work. I tried to reproduce my problem in a small class: public class SysLock { privat
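A minimal shape for what jm describes might be the following (a sketch against the Lucene 2.0-era Lock API; the path, lock name, and timeout are made up, and the exception-on-timeout behavior is an assumption). Note that FSDirectory of that era kept its lock files under java.io.tmpdir by default (configurable via the org.apache.lucene.lockDir system property), so both JVMs must end up pointing at the same lock file:

```java
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Lock;

// Sketch: reusing Lucene's Lock to serialize two JVMs. Both processes
// must create the lock with the same name against the same directory.
public class CrossJvmLockSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.getDirectory("/tmp/locks", true);
        Lock lock = dir.makeLock("my-app.lock");
        try {
            lock.obtain(10000L);  // wait up to 10 s for the other JVM
            try {
                // ... critical section shared between processes ...
            } finally {
                lock.release();   // always release, or the lock file lingers
            }
        } catch (IOException e) {
            System.out.println("could not obtain lock: " + e.getMessage());
        }
    }
}
```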

Re: Modifying the stored norm type

2006-06-20 Thread Paul Elschot
On Tuesday 20 June 2006 12:02, Marcus Falck wrote: > Hi again, > > > > After a lot of debugging and some API doc reading I have come to the > conclusion that the static encodeNorm method of the Similarity class > will encode my boost value into a single byte decimal number. > > And I will loos

Re: How to search for European words with and without special characters

2006-06-20 Thread Otis Gospodnetic
I think you'll want to write your own Analyzer + Tokenizer, detect tokens with umlauts, and then emit two tokens at the same position (think of them as synonyms), one being the original one with the umlaut, and the other one with the umlaut transformed according to the rules (e.g. ü -> ue). Hm,
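Otis's idea, emitting the original token plus a transliterated twin at the same position, can be sketched as a TokenFilter (against the Lucene 1.4/2.0-era TokenStream API; the class name and the umlaut mapping are my assumptions):

```java
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Sketch: whenever a token contains an umlaut, emit the original token
// and then a transliterated twin at the same position (position
// increment 0), so both "müller" and "mueller" match at search time.
public class UmlautSynonymFilter extends TokenFilter {
    private Token pending;  // transliterated twin waiting to be emitted

    public UmlautSynonymFilter(TokenStream in) { super(in); }

    public Token next() throws IOException {
        if (pending != null) { Token t = pending; pending = null; return t; }
        Token t = input.next();
        if (t == null) return null;
        String text = t.termText();
        String mapped = text.replaceAll("ü", "ue").replaceAll("ö", "oe")
                            .replaceAll("ä", "ae").replaceAll("ß", "ss");
        if (!mapped.equals(text)) {
            Token twin = new Token(mapped, t.startOffset(), t.endOffset());
            twin.setPositionIncrement(0);  // stack on the same position
            pending = twin;
        }
        return t;
    }
}
```

The same filter would need to wrap both the indexing and the query-time analyzer so the two sides agree on token positions.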

Re: addIndexes() is taking infinite time ...

2006-06-20 Thread Otis Gospodnetic
If you can tell how many indices you've merged, you must be merging them one at a time, and the pre- and post-merge optimize() calls are costing you. Also, that maxBufferedDocs looks pretty low. Unless you are working with very large documents and a small heap, you should be able to bump that up mu

Re: addIndexes() is taking infinite time ...

2006-06-20 Thread Volodymyr Bychkoviak
I guess that you're adding those indexes one by one. You should add all indexes at once rather than adding them one by one. The addIndexes() method takes an array of directories/readers to add indexes. IndexWriter performs optimize() after adding indexes, so with your big index it can take long enoug
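Volodymyr's point can be sketched like this (the paths and class name are made up; assumes the Lucene 2.0-era IndexWriter and FSDirectory APIs):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch: merge all source indexes in a single addIndexes() call so the
// implicit optimize() runs once, not once per source index.
public class MergeAllAtOnce {
    public static void main(String[] args) throws Exception {
        Directory[] sources = new Directory[] {
            FSDirectory.getDirectory("/idx/part1", false),
            FSDirectory.getDirectory("/idx/part2", false),
            FSDirectory.getDirectory("/idx/part3", false),
        };
        IndexWriter writer = new IndexWriter("/idx/merged",
                                             new StandardAnalyzer(), true);
        writer.addIndexes(sources);  // one merge + one optimize, not N of each
        writer.close();
    }
}
```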

lucene in combination with pattern recognition...

2006-06-20 Thread bruce
hi... i'm looking at a problem and i can't figure out how to "easily" solve it... basically, i'm trying to figure out if there's a way to use lucene/nutch with some form of pattern matching to extract course information from a College/Registrar's course section... Assume I can point to a Registrar

Re: Exact Match Searches and Stop Words

2006-06-20 Thread Steven Rowe
Hugh Ross wrote: The problem is that the standard analyzer removes the stop word (i.e. "of") before indexing and searching. Is there a workaround for this? See my response to a similar question here: In

Exact Match Searches and Stop Words

2006-06-20 Thread Hugh Ross
I am running an exact match search using an index which was created using the StandardAnalyzer. My problem is that when I do an exact match search on this index and look for a phrase such as content:"Head of State" using the query parser (StandardAnalyzer again for searching), the search brings b
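One common workaround (a sketch, assuming the Lucene 2.0-era StandardAnalyzer constructor that takes a stop-word array) is to index and parse with an empty stop set, so "of" survives on both sides; the index would need to be rebuilt with the same analyzer:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// Sketch: StandardAnalyzer's default stop set drops "of", so the phrase
// "Head of State" becomes "head state" in both the index and the query.
// An empty stop-word list keeps "of"; use the same analyzer for both.
public class KeepStopWordsSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer noStops = new StandardAnalyzer(new String[0]);
        Query q = new QueryParser("content", noStops).parse("\"Head of State\"");
        System.out.println(q);  // phrase query that still contains "of"
    }
}
```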

Re: How to do pagination on fethed result using lucene...

2006-06-20 Thread Grant Ingersoll
You may want to look into the new lazy field loading (on the trunk), in case you have 1 or 2 large fields that you don't necessarily display right away (unless a user clicks on a result). This can speed up document loading significantly. heritrix.lucene wrote: Hi, Actually i forgot to write t

Indexing Dash concatenated words vs SynonymAnalyzer

2006-06-20 Thread Martin Braun
Hello all, german words are often dash-concatenated, e.g. West-Berlin or something like "C*-algebras and W*-algebras". I tend to write my own analyzer like the SynonymAnalyzer from the LIA-Book. I want to Index these words like this: West-Berlin => Westberlin | West | Berlin | "West Berlin" C*-

Re: Search within multiple different subfolders

2006-06-20 Thread Erick Erickson
I'm a little confused. What does a directory have to do with searching? You can certainly build the *index* from files in a directory, but Lucene doesn't search files themselves, it searches the *index* built from the files. If I'm interpreting your problem correctly, you'll have to build an inde

Re: Modifying the stored norm type

2006-06-20 Thread karl wettin
On Tue, 2006-06-20 at 13:59 +0200, karl wettin wrote: > On Tue, 2006-06-20 at 12:02 +0200, Marcus Falck wrote: > > > So I guess I will have to get lucene to store a 4 byte norm in the > > form of a float instead of the single byte? > > > > Is this do able or is it just madness? And will it slow t

RE: How to search for European words with and without special characters

2006-06-20 Thread Mile Rosu
Hello Supriya, One possibility would be to search for both müller and mueller from the interface. It means you should "normalize" the search query in some way. This solution would not affect the content of the existing index (no reindexing needed). Greets, Mile -Original Mes

Search within multiple different subfolders

2006-06-20 Thread Shaghayegh Sahebie
Hi all, I'm a new user of Lucene and wanted to know if Lucene can search multiple data directories that do not share a common parent. e.g. I have a directory named "father" with a subDirectory called "son", and another directory called "mother" with a subDirectory cal

How to search for European words with and without special characters

2006-06-20 Thread Supriya Kumar Shyamal
Hi All, I have a question regarding indexing and searching German characters. For example, when I search for the word "müller" I also want to find the word "mueller". How can I achieve this in Lucene? Thanks, supriya -- Mit freundlichen Grüßen / Regards Supriya Kumar Shyamal Software

addIndexes() is taking infinite time ...

2006-06-20 Thread heritrix . lucene
Hi all, I had five different indexes: 1 having 15469008 documents 2 having 7734504 documents 3 having 7734504 documents 4 having 7734504 documents 5 having 7734504 documents Which sums to 46407024. The constant values are maxMergeFactor = 1000 maxBufferedDocs = 1000 I wrote a simple program which

Re: Modifying the stored norm type

2006-06-20 Thread karl wettin
On Tue, 2006-06-20 at 12:02 +0200, Marcus Falck wrote: > So I guess I will have to get lucene to store a 4 byte norm in the > form of a float instead of the single byte? > > Is this doable, or is it just madness? And will it slow the search > timings down or will it just eat more memory? It is

Modifying the stored norm type

2006-06-20 Thread Marcus Falck
Hi again, After a lot of debugging and some API doc reading I have come to the conclusion that the static encodeNorm method of the Similarity class will encode my boost value into a single byte decimal number. And I will lose a lot of resolution and will get severe rounding errors. (please
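The resolution loss Marcus describes can be made concrete with a small re-implementation of the one-byte norm encoding (modeled on Lucene's SmallFloat scheme, 3 mantissa bits counting the implicit leading bit; this is an illustration, not the exact shipped code):

```java
// A minimal re-implementation of a one-byte float encoding in the style of
// Lucene's norm encoding: 5 exponent bits, 2 explicit mantissa bits.
// Nearby boost values collapse to the same byte, which is the rounding
// error Marcus is seeing.
public class NormResolutionDemo {
    static byte floatToByte(float f) {
        int bits = Float.floatToIntBits(f);
        int smallfloat = bits >> 21;        // keep sign, exponent, 2 mantissa bits
        int fzero = (63 - 15) << 3;         // zero point of the tiny exponent
        if (smallfloat < fzero) return (bits <= 0) ? 0 : 1;   // underflow
        if (smallfloat >= fzero + 0x100) return -1;           // clamp to max
        return (byte) (smallfloat - fzero);
    }

    static float byteToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = ((b & 0xff) << 21) + ((63 - 15) << 24);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // 1.0 and 1.1 collapse to the same byte; 1.3 lands on 1.25.
        System.out.println(byteToFloat(floatToByte(1.0f)));   // 1.0
        System.out.println(byteToFloat(floatToByte(1.1f)));   // 1.0
        System.out.println(byteToFloat(floatToByte(1.3f)));   // 1.25
    }
}
```

With only two explicit mantissa bits, the representable values near 1.0 are 1.0, 1.25, 1.5, 1.75, 2.0, ..., so any boost between those steps is silently rounded down.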

Get exact matching "Field" name from matching Documents

2006-06-20 Thread Vikas Khengare
Hi, I am doing pretty well with getting results in the form of Document objects. But now I want only those fields in which the query string was found in each matching Document. So If I have Document as { Field("EmpName","John");

Custom ScoreDocComparator and normalized Scores

2006-06-20 Thread Gustavo Comba
Hi, I'm trying to sort the search results by a "combination" of the "lucene score" and the value of a document field. The "combination" is something like that: scoreWeight * i.score + fieldWeight * getFieldValue(i.doc) I expect results between 0 and scoreWeight + fieldWeight

Re: How to do pagination on fethed result using lucene...

2006-06-20 Thread heritrix . lucene
Hi, Actually i forgot to write that my application is web based and i am running this on tomcat server. assuming your application is web based, the general consensus is to start by implementing your app so that each page reexecutes the search; reexecuting the search is not feasible as every time

Re: How to do pagination on fethed result using lucene...

2006-06-20 Thread Chris Hostetter
: I have built a small application that gives some thousand results. I want to : display results as google displays, using pagination. : Here my question is, how I'll maintain the sequence of displayed results. : : Should i associate the "Hits" object with the session. assuming your applicati
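The consensus approach Chris describes, re-executing the search on every page request and reading only that page's slice, might look like this (a sketch against the Lucene 2.0-era Hits API; the "title" field name is an assumption):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Sketch: re-run the search per page and read only the requested slice.
// Hits loads stored fields lazily, so skipping to page N does not pull
// the stored fields of every earlier hit.
public class PageThroughHits {
    public static void printPage(IndexSearcher searcher, Query query,
                                 int page, int pageSize) throws Exception {
        Hits hits = searcher.search(query);
        int start = page * pageSize;
        int end = Math.min(start + pageSize, hits.length());
        for (int i = start; i < end; i++) {
            Document d = hits.doc(i);
            System.out.println(hits.score(i) + "\t" + d.get("title"));
        }
    }
}
```

Since the index is unchanged between requests, the same query returns hits in the same order, so no per-session state is needed beyond the query string and the page number.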