RE: addIndexes() is taking infinite time ...

2006-06-21 Thread Mike Streeton
From memory, addIndexes() also does an optimization beforehand; this might be what is taking the time. Mike www.ardentia.com the home of NetSearch -Original Message- From: heritrix.lucene [mailto:[EMAIL PROTECTED] Sent: 22 June 2006 05:05 To: java-user@lucene.apache.org Subject: Re: ad

Re: HTML text extraction

2006-06-21 Thread 张瑾
Please send it to me, thanks very much! 2006/6/21, Liao Xuefeng <[EMAIL PROTECTED]>: hi, i wrote my own html parser to do html2text and it works well. i can send you my code if it matches your requirements. -Original Message- From: John Wang [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 21

Re: addIndexes() is taking infinite time ...

2006-06-21 Thread heritrix . lucene
No, I haven't tried. I can try it today. One thing I'm wondering is what role the file system plays here. I mean, is there any difference between indexing on FAT32 and indexing on EXT3? I'll have to find out. Can anybody shed some light on this? With regards On 6/22/06,

What is a "Lazy Field"...

2006-06-21 Thread heritrix . lucene
Hi, Can anybody please tell me what a "Lazy Field" is? I have noticed this term come up in discussion several times. With Regards,

Re: addIndexes() is taking infinite time ...

2006-06-21 Thread Daniel Noll
heritrix.lucene wrote: hi Otis, Now this time it took 10 Hr 34 Min. to merge the indexes. During merging i noticed it was not completely using the CPU. I have 512MB RAM. and here i found it used up to 256 MB. Are there some more possibilities to make it faster ... Have you tested how fast

Re: addIndexes() is taking infinite time ...

2006-06-21 Thread heritrix . lucene
hi Otis, Now this time it took 10 Hr 34 Min. to merge the indexes. During merging i noticed it was not completely using the CPU. I have 512MB RAM. and here i found it used up to 256 MB. Are there some more possibilities to make it faster ... With Regards, On 6/21/06, heritrix. lucene <[E

Re: Creating initial index using FSDirectory

2006-06-21 Thread Erick Erickson
From an e-mail from Kent Fitch in the thread "Detecting index existance": Try IndexReader static method indexExists: http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#indexExists(java.lang.String)

Creating initial index using FSDirectory

2006-06-21 Thread Leandro Saad
Hi all. I'm writing an Avalon component that wraps Lucene. My problem is that I can't start the component using FSDirectory unless the index files are already in place (segment, etc.), or I set the rewrite flag to true. In my case, I'd like to create the index file structure only the first time I

Phrase Frequency For Analysis

2006-06-21 Thread Nader Akhnoukh
Hi, I've looked through the archives and it looks like this question has been asked in one form or another a few times, but without a satisfactory solution. I am trying to get the most frequently occurring phrases in a document and in the index as a whole. The goal is to compare the two to get some

Re: HTML text extraction

2006-06-21 Thread Daniel Noll
Simon Courtenage wrote: I also use htmlparser, which is rather good. I've had to customize it, though, to parse strings containing html source rather than accept urls of resources to fetch etc. Also it crashes on meta tags that don't have name attributes (something I discovered only a couple

RE: lucene consulting/support/help

2006-06-21 Thread Larry Ogrodnek
http://wiki.apache.org/jakarta-lucene/Support -Original Message- From: bruce [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 21, 2006 2:11 PM To: java-user@lucene.apache.org Subject: lucene consulting/support/help hi... anybody on the list provide consulting/support for lucene/nutch...

Re: Giving weight to partial matches

2006-06-21 Thread Chris Hostetter
: content field, and return results only if either title or content : matches ALL the words searched. So searching for "miracle cure for : cancer" might yield: : : (+title:miracle +title:cure +title:for +title:cancer)^5.0 : (+content:miracle +content:cure +content:for +content:cancer) first off, a

RE: Custom ScoreDocComparator and normalized Scores

2006-06-21 Thread Chris Hostetter
: Thanks Chris, I didn't know about the "solr" package; it isn't in the release : distribution, is it? I'm going to read about it to see if it matches our : needs. Solr is a separate (incubation) project that builds on top of Lucene, but the FunctionQuery classes have no dependencies outside of the Lu

lucene consulting/support/help

2006-06-21 Thread bruce
hi... anybody on the list provide consulting/support for lucene/nutch... get back to me with your contact info if you do... thanks -bruce - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PRO

Re: Modifying the stored norm type

2006-06-21 Thread Paul Elschot
On Tuesday 20 June 2006 18:42, Dan Climan wrote: > >Paul Elschot <[EMAIL PROTECTED]> > >>On Tuesday 20 June 2006 12:02, Marcus Falck wrote: > >> After a lot of debugging and some API doc reading I have come to the > > conclusion that the static encodeNorm method of the Similarity class > > will en

Re: Modifying the stored norm type

2006-06-21 Thread Paul Elschot
On Wednesday 21 June 2006 12:13, karl wettin wrote: > On Tue, 2006-06-20 at 18:01 +0200, Paul Elschot wrote: > > On Tuesday 20 June 2006 12:02, Marcus Falck wrote: > > > encodeNorm method of the Similarity class will encode my boost value > > into a single byte decimal number. And I will lose a l

Re: lucene Index search

2006-06-21 Thread Otis Gospodnetic
Hi, You may want to look at the Lucene highlighter, if you are after "keyword in context". Otis P.S. Please use java-user list for questions. - Original Message From: "Ngo, Anh (ISS Southfield)" <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org Sent: Wednesday, June 21, 2006 12:17:38 P

Re: lucene...

2006-06-21 Thread Otis Gospodnetic
Hi Bruce, You want to use Nutch. Nutch uses Lucene under the hood, and provides all the crawling stuff. Otis - Original Message From: bruce <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, June 21, 2006 12:21:28 PM Subject: lucene... hi... after reading through the

lucene...

2006-06-21 Thread bruce
hi... after reading through the docs for lucene/nutch, i'm trying to straighten out how it all works... if i want to crawl through a portion of a web site for the purpose of extracting information, it appears that this would work. however, i'm not sure if i need lucene/nutch or both.. i don't nee

RE: hi - testing

2006-06-21 Thread bruce
thanks to all who replied!!! -Original Message- From: Yonik Seeley [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 21, 2006 8:59 AM To: java-user@lucene.apache.org; [EMAIL PROTECTED] Subject: Re: hi - testing Normally the way I see if I'm correctly sending something to a list is to sen

Re: Modifying the stored norm type

2006-06-21 Thread Yonik Seeley
On 6/21/06, karl wettin <[EMAIL PROTECTED]> wrote: Marcus is trying to use the norms to enforce results in chronological order when matching a TB-sized corpus. He can't get any speed by sorting on a date field. Once a FieldCache entry is populated, sorting on a DateField should be about the sam

Re: hi - testing

2006-06-21 Thread Yonik Seeley
Normally the way I see if I'm correctly sending something to a list is to send the first post I really want to send, and go check an archive of the list a little later. -Yonik On 6/21/06, bruce <[EMAIL PROTECTED]> wrote: hi.. can someone please respond to this so i can see if i'm getting throu

hi - testing

2006-06-21 Thread bruce
hi.. can someone please respond to this so i can see if i'm getting through.. thanks -bruce

Re: using lucene Lock inter-jvm

2006-06-21 Thread Yonik Seeley
On 6/21/06, Michael McCandless <[EMAIL PROTECTED]> wrote: Does anyone know of any reasons not to switch Lucene's FSDirectory locking to the java.nio.channels.FileLock? EG, are there any performance issues that people are aware of? It's available since Java 1.4. Good question Michael, no reaso
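For readers unfamiliar with the API being discussed, here is a minimal sketch of taking an inter-process lock with java.nio.channels.FileLock. It is not Lucene's FSDirectory code: the class name `NioLockSketch`, the method `runExclusively`, and the lock file name are hypothetical, and the sketch uses java.nio.file helpers from JDKs much newer than the Java 1.4 mentioned in the thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only.
final class NioLockSketch {

    /**
     * Runs the action while holding an exclusive OS-level lock on lockFile.
     * Returns false, without running the action, if another process holds the lock.
     */
    static boolean runExclusively(Path lockFile, Runnable action) {
        try (FileChannel channel = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            // tryLock() does not block; it returns null when another process holds the lock.
            FileLock lock = channel.tryLock();
            if (lock == null) {
                return false;
            }
            try {
                action.run();
                return true;
            } finally {
                lock.release();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed location; any writable path would do.
        Path lockFile = Paths.get(System.getProperty("java.io.tmpdir"), "sketch-write.lock");
        boolean ran = runExclusively(lockFile, () -> System.out.println("holding the lock"));
        System.out.println(ran ? "released the lock" : "lock is held by another process");
    }
}
```

One design point to keep in mind: a FileLock is held on behalf of the whole JVM, so it arbitrates between processes but not between threads of one process. A second tryLock() on the same file from the same JVM throws OverlappingFileLockException rather than returning null.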

Possible improvement in BooleanQuery

2006-06-21 Thread Satuluri, Venu_Madhav
The method BooleanQuery.add( Query q, BooleanClause.Occur o) accepts Query objects that are null for its first parameter i.e. it doesn't throw any exception. However, when we try to get the string form of the same BooleanQuery object, it throws a NullPointerException from within the toString() code
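The failure mode described here (a null accepted on add, then an exception much later in toString()) can be shown with a small stand-in class. This is only a sketch: `ClauseListSketch` is hypothetical and is not Lucene's BooleanQuery, and rejecting null at add time is one possible remedy, not necessarily the one the poster goes on to propose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for a query that collects clauses; not a Lucene class.
final class ClauseListSketch {
    private final List<Object> clauses = new ArrayList<>();

    /** Fails fast: a null clause is rejected here, where the caller's mistake is visible. */
    ClauseListSketch add(Object clause) {
        clauses.add(Objects.requireNonNull(clause, "clause must not be null"));
        return this;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Object clause : clauses) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Safe to dereference: add() never lets a null into the list.
            sb.append(clause.toString());
        }
        return sb.toString();
    }
}
```

With this shape, the stack trace of the NullPointerException points at the add call that passed the null, instead of at a later toString() far from the cause.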

Re: HTML text extraction

2006-06-21 Thread John Wang
Thanks everyone for your responses! I will try them out. -John On 6/20/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: John, I also wrote about using NekoHTML, I think. I prefer that to JTidy. That also tells you what Simpy.com uses. Otis - Original Message From: John Wang <[EMAI

Re: using lucene Lock inter-jvm

2006-06-21 Thread Michael McCandless
CC'ing java-dev to talk about details of locking. I can reproduce this on Windows XP, Java 1.4.2: two separate JVMs are able to get the Lock at the same time. The code looks correct to me. Strangely, if I make a separate standalone test that just uses java.io.File.createNewFile directly, it wo
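A standalone check of the kind mentioned, using java.io.File.createNewFile directly, might look like the sketch below. The class and method names (`LockFileSketch`, `obtain`, `release`) are hypothetical and this is not the test from the message or Lucene's Lock implementation.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical lock-file helper, for illustration only.
final class LockFileSketch {

    /** Returns true only if this call created the file, i.e. this caller now "owns" the lock. */
    static boolean obtain(File lockFile) {
        try {
            // createNewFile() checks for existence and creates the file in one atomic step.
            return lockFile.createNewFile();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Releases the lock by deleting the file. */
    static void release(File lockFile) {
        lockFile.delete();
    }
}
```

Running the same obtain call from two separate JVMs against one path is the interesting experiment: only one of them should see true. Note that the createNewFile Javadoc itself advises against using the method for file locking and points to the FileLock facility instead.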

RE: Giving weight to partial matches

2006-06-21 Thread Gustavo Comba
Hi, You can use something like: (title:miracle title:cure title:for title:cancer)^5.0 +((+title:miracle +title:cure +title:for +title:cancer) (+content:miracle +content:cure +content:for +content:cancer)) Should do the work. Regards, Gustavo -Mens
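As I read the suggestion, the required group keeps the "ALL words in title or content" restriction, while the extra optional title clause only adds score for each title word that matches. A rough sketch of assembling that query string from user input follows; `PartialMatchQuerySketch` and its methods are hypothetical, the 5.0 boost is taken from the thread, and no escaping of query-syntax characters is attempted.

```java
// Hypothetical string builder for the query shape suggested above; not a Lucene API.
final class PartialMatchQuerySketch {

    /** Joins the words as field:word clauses, each prefixed with '+' when required. */
    static String clause(String field, String[] words, boolean required) {
        StringBuilder sb = new StringBuilder();
        for (String word : words) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            if (required) {
                sb.append('+');
            }
            sb.append(field).append(':').append(word);
        }
        return sb.toString();
    }

    /** Optional boosted "any title word" clause, plus a required "all words in title or content" group. */
    static String build(String userInput) {
        String[] words = userInput.trim().split("\\s+");
        String anyTitleWord = "(" + clause("title", words, false) + ")^5.0";
        String allInTitle = "(" + clause("title", words, true) + ")";
        String allInContent = "(" + clause("content", words, true) + ")";
        return anyTitleWord + " +(" + allInTitle + " " + allInContent + ")";
    }
}
```

For the input "miracle cure for cancer" this reproduces the query string quoted in the message, which could then be handed to a query parser.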

Giving weight to partial matches

2006-06-21 Thread Chun Wei Ho
I am performing searches on an index that includes a title field and a content field, and return results only if either title or content matches ALL the words searched. So searching for "miracle cure for cancer" might yield: (+title:miracle +title:cure +title:for +title:cancer)^5.0 (+content:mira

Re: using lucene Lock inter-jvm

2006-06-21 Thread jm
ok, in case somebody has the same problem: The problem is the true value in FSDirectory directory = FSDirectory.getDirectory("C:\\temp\\a", true); it deletes the previous lock file, which belongs to the lock acquired by the first process. Changing it to false prevents the lock being deleted and loc

RE: BooleanQuery

2006-06-21 Thread WATHELET Thomas
Ok thanks a lot. Before, I used TermQuery for the field doccontent; now I use a Query object from QueryParser.parse and it works perfectly. -Original Message- From: Gustavo Comba [mailto:[EMAIL PROTECTED] Sent: 21 June 2006 16:00 To: java-user@lucene.apache.org Subject: RE: BooleanQuery Hel

RE: BooleanQuery

2006-06-21 Thread Gustavo Comba
Hello, I don't know how you are parsing your query, but maybe the query you are looking for is something like: +(doccontent:avian doccontent:influenza) +doctype:AM +docdate:[2005033122000 TO 2006062022000] Regards, Gustavo -Original Message-

RE: BooleanQuery

2006-06-21 Thread Mile Rosu
You should specify the field name for influenza as well. Like this: +doccontent:avian +doccontent:influenza +doctype:AM +docdate:[2005033122000 TO 2006062022000] Mile -Original Message- From: WATHELET Thomas [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 21, 2006 4:40 PM To: j
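In the query parser syntax a field prefix applies only to the term that directly follows it, so an unqualified word such as influenza is searched in the default field instead. A small sketch of qualifying every word before parsing is shown below; `FieldQuerySketch` and `requireAll` are hypothetical names, and the helper does no escaping of special query characters.

```java
// Hypothetical helper for building a field-qualified query string; not a Lucene API.
final class FieldQuerySketch {

    /** Prefixes every whitespace-separated word with "+field:" so each word is required in that field. */
    static String requireAll(String field, String words) {
        StringBuilder sb = new StringBuilder();
        for (String word : words.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // blank input yields one empty token; skip it
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('+').append(field).append(':').append(word);
        }
        return sb.toString();
    }
}
```

Calling requireAll("doccontent", "avian influenza") gives "+doccontent:avian +doccontent:influenza", to which the doctype and docdate clauses from the thread can be appended.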

BooleanQuery

2006-06-21 Thread WATHELET Thomas
Why do I retrieve hits with this query: +doccontent:avian +doctype:AM +docdate:[2005033122000 TO 2006062022000] and not with this one: +doccontent:avian influenza +doctype:AM +docdate:[2005033122000 TO 2006062022000]

Re: Search within multiple different subfolders

2006-06-21 Thread Erick Erickson
Shagheyegh: I'm hardly the lucene expert, but I don't think you can search just a portion of the index. But that's effectively what you're doing if you restrict the search to "son and.". However, depending on your problem space, you could build separate indexes. To continue the example, you

RE: HTML text extraction

2006-06-21 Thread Liao Xuefeng
hi, i wrote my own html parser to do html2text and it works well. i can send you my code if it matches your requirements. -Original Message- From: John Wang [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 21, 2006 1:40 PM To: java-user@lucene.apache.org Subject: HTML text extraction Can someo

RE: Custom ScoreDocComparator and normalized Scores

2006-06-21 Thread Gustavo Comba
Thanks Chris, I didn't know about the "solr" package; it isn't in the release distribution, is it? I'm going to read about it to see if it matches our needs. The need for normalization comes from converting a list of values into a "polynomial"-like ranking function. We define our "ranking" in a way lik

Re: Modifying the stored norm type

2006-06-21 Thread karl wettin
On Tue, 2006-06-20 at 18:01 +0200, Paul Elschot wrote: > On Tuesday 20 June 2006 12:02, Marcus Falck wrote: > encodeNorm method of the Similarity class will encode my boost value > into a single byte decimal number. And I will lose a lot of > resolution and will get severe rounding errors. > Are
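To make the resolution loss concrete, here is a sketch of squeezing a float into a single byte by keeping only the top bits of its IEEE-754 representation. The shift and offset constants are assumptions chosen for illustration, and `ByteNormSketch` is a hypothetical class; it should not be read as the exact code behind Similarity.encodeNorm.

```java
// Hypothetical single-byte float codec, for illustration only.
final class ByteNormSketch {
    private static final int SHIFT = 21;        // assumed: low 21 bits of the float are dropped
    private static final int OFFSET = 48 << 3;  // assumed: moves the useful exponent range into 1..255

    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> SHIFT;
        if (small <= OFFSET) {
            // Zero and negatives map to 0; tiny positives map to the smallest non-zero code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= OFFSET + 0x100) {
            return -1; // overflow: clamp to the largest code (0xFF)
        }
        return (byte) (small - OFFSET);
    }

    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = ((b & 0xff) << SHIFT) + (OFFSET << SHIFT);
        return Float.intBitsToFloat(bits);
    }

    /** Shows what a value becomes after being stored in one byte and read back. */
    static float roundTrip(float f) {
        return decode(encode(f));
    }

    public static void main(String[] args) {
        for (float boost : new float[] {0.89f, 1.0f, 1.1f, 1.3f}) {
            System.out.println(boost + " -> " + roundTrip(boost));
        }
    }
}
```

With these constants only four distinct values survive in each power-of-two range, so 1.1 comes back as 1.0 and 0.89 comes back as 0.875. That is the kind of rounding that makes fine-grained boosts, such as a date encoded as a boost, collapse into the same stored byte.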

Re: faceting and categorizing on color?

2006-06-21 Thread Chris Hostetter
: I thought that having: F0 FF FF 00... : in one field and then searching for FF in it would : match all documents that contain that "word" so I ... : the counts were equal. I guess I am still not clear on : what the differences/advantages/disadvantages are : between th

Re: HTML text extraction

2006-06-21 Thread Chris Hostetter
if you just want something to extract the text from HTML, without trying to extract structure (ie: you don't care about title vs h1 vs bold vs meta keywords) then the HTMLStripReader (or HTMLStripWhitespaceTokenizerFactory) Yonik wrote for Solr might be useful. It wasn't intended to deal with fu

Re: HTML text extraction

2006-06-21 Thread Simon Courtenage
I also use htmlparser, which is rather good. I've had to customize it, though, to parse strings containing html source rather than accept urls of resources to fetch etc. Also it crashes on meta tags that don't have name attributes (something I discovered only a couple of days ago). Simon Dan

RE: HTML text extraction

2006-06-21 Thread Rob Staveley (Tom)
I found that CyberNeko left style and script in the text and JTidy produced better output, but both of them use DOM and were therefore subject to OutOfMemory errors (JTidy being worse than CyberNeko). I've since then moved over to TagSoup, which I needed to customise to strip style script (a simple