Re: file format of index

2006-07-13 Thread Beady Geraghty
I think that I may be misreading the documentation. I didn't see a description of the Long and Int types under the "Primitive Types" section while reading the descriptions of Byte, UInt32, UInt64, and VInt. So, for some reason, I thought that Long and Int were byte-order sensitive. Upon re-readi

file format of index

2006-07-13 Thread Beady Geraghty
As I understand from earlier answers to my question, one can create an index on machine A and use it (search and merge with other indices) on machine B. I was reading the file format today. http://lucene.apache.org/java/docs/fileformats.html The index has Byte UInt32 UInt64 in most place

Re: HTMLParser

2006-07-13 Thread Yonik Seeley
I've never used HTMLParser, but if you have malformed, incomplete, or optional HTML that would otherwise choke an HTML parser, you could use Solr's HTMLStripping: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e It's pretty stand-alone, s

Re: Are Search Joins Possible between two Physically separate Indexes?

2006-07-13 Thread Paul Borgermans
Though I'm a newbie (which means I may be completely wrong), I don't think this is possible "out of the box". The quickest would be to write a filter which looks up document IDs in the first index and applies this to the second index to get the desired subset to search over. I may need this too,

Are Search Joins Possible between two Physically separate Indexes?

2006-07-13 Thread Dejan Nenov
Here is a use case I am trying to address. I have two separate indexes, which contain sets of the same document pool/corpus. The two indexes have a different set of indexed fields. One of the indexed fields is an external DocumentID. I would like to perform searches, like a relational join, expre

HTMLParser

2006-07-13 Thread Ross Rankin
Since I cannot seem to access the HTMLParser mailing list and I saw the library recommended here, I thought someone here that has used it successfully can help me out. I have HTML text stored in a database field which I want to add to a Lucene document, but I want to remove the HTML tags, so HTM

Re: lengthnorm again

2006-07-13 Thread Yonik Seeley
On 7/13/06, Zhao, Xin <[EMAIL PROTECTED]> wrote: Hi, I am sure this is a question that has been asked before. :-) I have done some research too, but still don't quite understand. I indexed 20 terms under field name "mesh", and set the boost accordingly from 20 to 1 (just some arbitrary numbers). But when

lengthnorm again

2006-07-13 Thread Zhao, Xin
Hi, I am sure this is a question that has been asked before. :-) I have done some research too, but still don't quite understand. I indexed 20 terms under field name "mesh", and set the boost accordingly from 20 to 1 (just some arbitrary numbers). But when I checked the index from Luke, the boosts all app

Re: modify existing non-indexed field

2006-07-13 Thread Doron Cohen
> can't access the file: > http://cdoronc.20m.com/tmp/indexingThreads.zip Yes, this Web host sometimes behaves strangely when clicking a link from a mail program. Please try to copy cdoronc.20m.com/tmp to the Web Browser (e.g. Firefox), click . This should show the content of that tmp folder, inc

Re: accented characters, wildcards and other problems

2006-07-13 Thread Otis Gospodnetic
Bok Tomi, What do you mean by "terms are misrepresented"? What should they be, and what are you seeing? > What I'm not clear on is how can I see the problematic *terms* in the list of > terms, but not the documents they're stored in? Are you saying that the content got indexed, but the file n

Re: Out of memory error

2006-07-13 Thread Suba Suresh
Definitely. Thanks for both the suggestions. Yes it is 300MB.(typo) suba suresh. Rob Staveley (Tom) wrote: Let us know how you get on. There are a lot of people fighting very similar battles on this list. -Original Message- From: Suba Suresh [mailto:[EMAIL PROTECTED] Sent: 13 July 2

RE: Out of memory error

2006-07-13 Thread Rob Staveley (Tom)
Let us know how you get on. There are a lot of people fighting very similar battles on this list. -Original Message- From: Suba Suresh [mailto:[EMAIL PROTECTED] Sent: 13 July 2006 15:30 To: java-user@lucene.apache.org Subject: Re: Out of memory error Thanks. I am using the getText(PDDo

Re: Out of memory error

2006-07-13 Thread Ben Litchfield
By 300MG I assume you mean 300MB. You can also try extracting the text outside of Lucene by using a PDFBox command line app. java org.pdfbox.ExtractText you may need to increase the JRE memory like this java -Xmx512m org.pdfbox.ExtractText OR java -Xmx1024m org.pdfbox.ExtractText If this is

Re: Out of memory error

2006-07-13 Thread Suba Suresh
Thanks. I am using the getText(PDDocument) method of the PDFTextStripper. I will try the other suggestion. suba suresh. Rob Staveley (Tom) wrote: If you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument), you are going to get

RE: Out of memory error

2006-07-13 Thread Rob Staveley (Tom)
If you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument), you are going to get a large String and may need a 1G heap. If, however, you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText(org.pdf

Re: Can I do "Google Suggest" Like Search?

2006-07-13 Thread Mark Miller
Another option is to use Sun's free and soon to be open source Java Studio Creator2. It's a great way to do JSF and provides an AJAX google suggest type component. You can hook this component up to a lucene search and *BOOM*...google suggest. Here is a link to a "did you mean" tutorial as well (i

Out of memory error

2006-07-13 Thread Suba Suresh
I am indexing different document formats with Lucene 1.9. One of the PDF files I am indexing is 300MG. Whenever the index writer hits that file, it stops indexing with an "Out of Memory" exception. I am using the PDFBox library to index. I have set the following merge factors in my code. write

Re: Can I do "Google Suggest" Like Search?

2006-07-13 Thread karl wettin
On Wed, 2006-05-24 at 13:11 +0530, Vikas Khengare wrote: > So when I type “L” it will give me search options names which will > start from “L”. Then when I will type “Lu” then it should give me > options for names which are starting from “Lu”. & so on …… Vikas, the Jira now contains code that do

accented characters, wildcards and other problems

2006-07-13 Thread Tomi NA
I've done a bit of testing with accented characters (Croatian, to be specific) and can't really explain what I see when I explore the index with luke. I've used accented characters in directory names, file names and file contents. Now, in the list of terms (in "Top ranking terms", "Overview" tab)

Re: question regarding Field.Index.UN_TOKENIZED

2006-07-13 Thread Ramesh Salla
Are you using the StandardAnalyzer at indexing time? Which one do you use at query time? Ramesh Reddy On Mon, 2006-07-10 at 18:37 -0700, Chris Hostetter wrote: > : I'm storing a field in an index with that option > : (Field.Index.UN_TOKENIZED). > > the key to understanding your

RE: Searching for a phrase which spans on 2 pages

2006-07-13 Thread Ramesh Salla
Yes, this can easily be done using the TokenStream class and hence getting the BestTokens. But of course you have to have this content in the index. DONE Ramesh Reddy On Wed, 2006-07-12 at 12:43 +0100, Mike Streeton wrote: > The simplest solution is always the best - when storing the p

Re: modify existing non-indexed field

2006-07-13 Thread dan2000
can't access the file: Forbidden Remote Host: [62.172.205.164] You do not have permission to access http://cdoronc.20m.com/tmp/indexingThreads.zip Data files must be stored on the same site they are linked from. Thank you for using 20m.com -- View this message in context: http://www.nabble.

QueryFilter and Memory

2006-07-13 Thread Chun Wei Ho
Hi, I've been trying to adjust the weightings for my searches (thanks Chris for his replies on that thread), and have been using ConstantScoreQuery to even out scores from portions in my query that I want to match but not to contribute to the ranking of that result. I convert a BooleanQuery/Term