QueryFilter and Memory
Hi, I've been trying to adjust the weightings for my searches (thanks Chris for his replies on that thread), and have been using ConstantScoreQuery to even out scores from portions of my query that I want to match but not to contribute to the ranking of that result. I convert a BooleanQuery/TermQuery (partialQuery) to a constant-score one (before adding them to the overall BooleanQuery that gets searched) as follows:

constantPartialQuery = new ConstantScoreQuery(new QueryFilter(partialQuery));

These partial queries are ad hoc (created according to user input) and not reused. It worked, but after some extended testing (like running a day of queries) I get a Java heap OutOfMemory error. I'm wondering:

(a) Is there a better way to change a query to a constant-score query other than what I did above?
(b) I'm subclassing QueryFilter not to cache the query results (which might be the cause), since the QueryFilters are not reused. Does anyone have an opinion on what else might be contributing to the memory problem?

Thanks.
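For illustration, a minimal sketch of the kind of non-caching filter described in (b), written against the Lucene 1.9/2.0-era Filter API; the class name NonCachingQueryFilter is made up and this is not the poster's actual code:

    import java.io.IOException;
    import java.util.BitSet;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    // Re-runs the wrapped query on every call instead of holding a per-reader
    // BitSet in a cache, so ad hoc filters can be garbage-collected after use.
    public class NonCachingQueryFilter extends Filter {
        private final Query query;

        public NonCachingQueryFilter(Query query) {
            this.query = query;
        }

        public BitSet bits(IndexReader reader) throws IOException {
            final BitSet bits = new BitSet(reader.maxDoc());
            new IndexSearcher(reader).search(query, new HitCollector() {
                public void collect(int doc, float score) {
                    bits.set(doc); // mark every matching document
                }
            });
            return bits;
        }
    }

It would then be used the same way as before: constantPartialQuery = new ConstantScoreQuery(new NonCachingQueryFilter(partialQuery));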
Re: modify existing non-indexed field
Can't access the file:

Forbidden. Remote Host: [62.172.205.164]. You do not have permission to access http://cdoronc.20m.com/tmp/indexingThreads.zip. Data files must be stored on the same site they are linked from. Thank you for using 20m.com
RE: Searching for a phrase which spans on 2 pages
Yes, this can be easily done using the TokenStream class and hence getting the BestTokens. But of course you have to have this content in the index.

Ramesh Reddy

On Wed, 2006-07-12 at 12:43 +0100, Mike Streeton wrote:
> The simplest solution is always the best - when storing the page, do not break up sentences. So a page will be all the sentences that occur on it. If a sentence starts on one page and finishes on the next, it will be included in both pages in the index.
>
> Hope this helps
>
> Mike
>
> www.ardentia.com the home of NetSearch
>
> -----Original Message-----
> From: Mile Rosu [mailto:[EMAIL PROTECTED]]
> Sent: 11 July 2006 15:55
> To: java-user@lucene.apache.org
> Subject: Searching for a phrase which spans on 2 pages
>
> Hello,
>
> I am working on an application similar to Google Books which allows searching on documents which represent a scanned page. Of course, one might search for a phrase starting at the end of one page and ending at the beginning of the next one. In this case I do not know how I might treat this. Both pages should be returned as hit results.
> Do you have any idea on how this situation might be handled?
>
> Thank you,
> Mile Rosu
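As a rough illustration of Mike's approach (the helper class and field names below are made up, not from the original mail): each page becomes one Document, and a sentence that starts on this page but finishes on the next is indexed with both pages.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class PageDocBuilder {
        // Builds the Document for one page. 'spillOver' is the sentence that
        // starts on this page and finishes on the next; the next page also
        // includes that sentence in its own text, so a phrase query spanning
        // the page break can match either page.
        public static Document buildPageDoc(int pageNo, String[] sentences, String spillOver) {
            StringBuffer text = new StringBuffer();
            for (int i = 0; i < sentences.length; i++) {
                text.append(sentences[i]).append(' ');
            }
            if (spillOver != null) {
                text.append(spillOver);
            }
            Document doc = new Document();
            doc.add(new Field("page", Integer.toString(pageNo), Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("contents", text.toString(), Field.Store.NO, Field.Index.TOKENIZED));
            return doc;
        }
    }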
Re: question regarding Field.Index.UN_TOKENIZED
Are you using the StandardAnalyzer at the time of indexing? Which one do you use at the time of querying?

Ramesh Reddy

On Mon, 2006-07-10 at 18:37 -0700, Chris Hostetter wrote:
> : I'm storing a field in an index with that option
> : (Field.Index.UN_TOKENIZED).
>
> The key to understanding your problem is to realize that...
>
> UN_TOKENIZED == Not Analyzed
>
> ...personally, I think the name of the constant is misleading.
>
> : The String that is being stored is: NORTH SAFETY PRODUCT (all uppercase)
> : When I try a wildcard query against that field, it only produces results
> : if the query term is capitalized.
>
> That's because the term you've put in the index is capitalized.
>
> -Hoss
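A short sketch of the usual workaround when a field is UN_TOKENIZED (so no analyzer ever lowercases it): normalize case yourself on both the indexing side and the query side. The field name here is only an example.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardQuery;

    // indexing side: lowercase the raw value before storing it un-tokenized
    Document doc = new Document();
    doc.add(new Field("manufacturer", "NORTH SAFETY PRODUCT".toLowerCase(),
            Field.Store.YES, Field.Index.UN_TOKENIZED));

    // query side: lowercase the user's pattern too, since WildcardQuery is not analyzed
    Query q = new WildcardQuery(new Term("manufacturer", "NORTH SAF*".toLowerCase()));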
accented characters, wildcards and other problems
I've done a bit of testing with accented characters (Croatian, to be specific) and can't really explain what I see when I explore the index with Luke. I've used accented characters in directory names, file names and file contents. Now, in the list of terms (in "Top ranking terms", "Overview" tab) I see that 2 out of 5 terms are misrepresented, but are indexed nonetheless. The file names containing the problematic characters contain these characters themselves, i.e. if the file name is "file[x].txt", the file contents are "test[x]", where [x] represents the accented character. What I'm not clear on is: how can I see the problematic *terms* in the list of terms, but not the documents they're stored in?

That's one issue. The other is somewhat simpler, I expect. A search for "test*" returns no results. According to the FAQ, it should, so what am I missing?

t.n.a.
Re: Can I do "Google Suggest" Like Search?
On Wed, 2006-05-24 at 13:11 +0530, Vikas Khengare wrote:
> So when I type “L” it will give me search options names which will
> start from “L”. Then when I will type “Lu” then it should give me
> options for names which are starting from “Lu”. & so on ……

Vikas, the Jira now contains code that does just that. It is a trie you will have to train with user queries (that return something), and it is not based on the document corpus.

http://issues.apache.org/jira/browse/LUCENE-625

I'd be more than happy to hear what you think of the API.

-- karl
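For anyone who would rather derive suggestions from the index itself instead of training the LUCENE-625 trie on past queries, here is a rough corpus-based alternative sketch (the field name, class name and limit are assumptions): walk the term dictionary for terms starting with the typed prefix.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class PrefixSuggester {
        // Collect up to 'max' indexed terms in 'field' that start with 'prefix'.
        public static List suggest(IndexReader reader, String field, String prefix, int max)
                throws IOException {
            List suggestions = new ArrayList();
            TermEnum terms = reader.terms(new Term(field, prefix));
            try {
                do {
                    Term t = terms.term();
                    if (t == null || !t.field().equals(field) || !t.text().startsWith(prefix)) {
                        break; // walked past the prefix range
                    }
                    suggestions.add(t.text());
                } while (suggestions.size() < max && terms.next());
            } finally {
                terms.close();
            }
            return suggestions;
        }
    }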
Out of memory error
I am indexing different document formats with Lucene 1.9. One of the PDF files I am indexing is 300MG. Whenever the index writer hits that file it stops the indexing with an "Out of Memory" exception. I am using the PDFBox library to index. I have set the following merge factors in my code:

writer.setMergeFactor(1000);
writer.setMaxMergeDocs(999);
writer.setMaxBufferedDocs(1000);
writer.setMaxFieldLength(Integer.MAX_VALUE);

I would like any help and suggestions.

thanks,
suba suresh.
Re: Can I do "Google Suggest" Like Search?
Another option is to use Sun's free and soon-to-be open source Java Studio Creator 2. It's a great way to do JSF and provides an AJAX Google-suggest-type component. You can hook this component up to a Lucene search and *BOOM*... Google suggest. Here is a link to a "did you mean" tutorial as well (it may give some hints for the implementation of suggest too): http://today.java.net/pub/a/today/2005/08/09/didyoumean.html

- Mark

On 7/13/06, karl wettin <[EMAIL PROTECTED]> wrote:
> On Wed, 2006-05-24 at 13:11 +0530, Vikas Khengare wrote:
> > So when I type "L" it will give me search options names which will
> > start from "L". Then when I will type "Lu" then it should give me
> > options for names which are starting from "Lu". & so on ……
>
> Vikas, the Jira now contains code that does just that. It is a trie you
> will have to train with user queries (that return something), and it is
> not based on the document corpus.
>
> http://issues.apache.org/jira/browse/LUCENE-625
>
> I'd be more than happy to hear what you think of the API.
>
> -- karl
RE: Out of memory error
If you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument), you are going to get a large String and may need a 1G heap.

If, however, you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText(org.pdfbox.pdmodel.PDDocument,%20java.io.Writer) to go via a temporary file, you will not need so much RAM, but you need to use http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.io.Reader) to construct your Lucene field (rather than http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.lang.String,%20org.apache.lucene.document.Field.Store,%20org.apache.lucene.document.Field.Index)).

-----Original Message-----
From: Suba Suresh [mailto:[EMAIL PROTECTED]
Sent: 13 July 2006 14:55
To: java-user@lucene.apache.org
Subject: Out of memory error

I am indexing different document formats with lucene 1.9. One of the pdf file I am indexing is 300MG. Whenever the index writer hits that file it stops the indexing with "Out of Memory" exception. I am using the pdf box library to index. I have set the following merge factors in my code.

writer.setMergeFactor(1000);
writer.setMaxMergeDocs(999);
writer.setMaxBufferedDocs(1000);
writer.setMaxFieldLength(Integer.MAX_VALUE);

I would like any help and suggestions.

thanks,
suba suresh.
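A rough sketch of that Reader-based route, assuming the old org.pdfbox package names and an already-open IndexWriter; the class and field names are just examples, not Rob's actual code:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.Writer;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    public class PdfIndexer {
        // Stream the extracted text through a temporary file and give Lucene a
        // Reader, so the PDF body never has to sit in memory as one huge String.
        public static void indexPdf(IndexWriter writer, File pdfFile) throws IOException {
            PDDocument pdf = PDDocument.load(new FileInputStream(pdfFile));
            File tmp = File.createTempFile("pdftext", ".txt");
            Writer out = new FileWriter(tmp);
            try {
                new PDFTextStripper().writeText(pdf, out);
            } finally {
                out.close();
                pdf.close();
            }
            Document doc = new Document();
            doc.add(new Field("path", pdfFile.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("contents", new FileReader(tmp))); // indexed and tokenized, not stored
            writer.addDocument(doc);
        }
    }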
Re: Out of memory error
Thanks.

I am using the getText(PDDocument) method of the PDFTextStripper. I will try the other suggestion.

suba suresh.

Rob Staveley (Tom) wrote:
> If you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument), you are going to get a large String and may need a 1G heap.
>
> If, however, you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText(org.pdfbox.pdmodel.PDDocument,%20java.io.Writer) to go via a temporary file, you will not need so much RAM, but you need to use http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.io.Reader) to construct your Lucene field (rather than http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.lang.String,%20org.apache.lucene.document.Field.Store,%20org.apache.lucene.document.Field.Index)).
>
> -----Original Message-----
> From: Suba Suresh [mailto:[EMAIL PROTECTED]
> Sent: 13 July 2006 14:55
> To: java-user@lucene.apache.org
> Subject: Out of memory error
>
> I am indexing different document formats with lucene 1.9. One of the pdf file I am indexing is 300MG. Whenever the index writer hits that file it stops the indexing with "Out of Memory" exception. I am using the pdf box library to index. I have set the following merge factors in my code.
>
> writer.setMergeFactor(1000);
> writer.setMaxMergeDocs(999);
> writer.setMaxBufferedDocs(1000);
> writer.setMaxFieldLength(Integer.MAX_VALUE);
>
> I would like any help and suggestions.
>
> thanks,
> suba suresh.
Re: Out of memory error
By 300MG I assume you mean 300MB. You can also try extracting the text outside of Lucene by using a PDFBox command line app:

java org.pdfbox.ExtractText

You may need to increase the JRE memory like this:

java -Xmx512m org.pdfbox.ExtractText
OR
java -Xmx1024m org.pdfbox.ExtractText

If this is still giving you an out of memory error then it is possibly an issue with PDFBox; if that is the case then please create an issue and attach/upload the PDF on the PDFBox site.

Ben

> Thanks.
>
> I am using the getText(PDDocument) method of the PDFTextStripper. I will try the other suggestion.
>
> suba suresh.
>
> Rob Staveley (Tom) wrote:
> > If you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument), you are going to get a large String and may need a 1G heap.
> >
> > If, however, you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText(org.pdfbox.pdmodel.PDDocument,%20java.io.Writer) to go via a temporary file, you will not need so much RAM, but you need to use http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.io.Reader) to construct your Lucene field (rather than http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.lang.String,%20org.apache.lucene.document.Field.Store,%20org.apache.lucene.document.Field.Index)).
> >
> > -----Original Message-----
> > From: Suba Suresh [mailto:[EMAIL PROTECTED]
> > Sent: 13 July 2006 14:55
> > To: java-user@lucene.apache.org
> > Subject: Out of memory error
> >
> > I am indexing different document formats with lucene 1.9. One of the pdf file I am indexing is 300MG. Whenever the index writer hits that file it stops the indexing with "Out of Memory" exception. I am using the pdf box library to index. I have set the following merge factors in my code.
> >
> > writer.setMergeFactor(1000);
> > writer.setMaxMergeDocs(999);
> > writer.setMaxBufferedDocs(1000);
> > writer.setMaxFieldLength(Integer.MAX_VALUE);
> >
> > I would like any help and suggestions.
> >
> > thanks,
> > suba suresh.
RE: Out of memory error
Let us know how you get on. There are a lot of people fighting very similar battles on this list.

-----Original Message-----
From: Suba Suresh [mailto:[EMAIL PROTECTED]
Sent: 13 July 2006 15:30
To: java-user@lucene.apache.org
Subject: Re: Out of memory error

Thanks.

I am using the getText(PDDocument) method of the PDFTextStripper. I will try the other suggestion.

suba suresh.

Rob Staveley (Tom) wrote:
> If you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument), you are going to get a large String and may need a 1G heap.
>
> If, however, you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText(org.pdfbox.pdmodel.PDDocument,%20java.io.Writer) to go via a temporary file, you will not need so much RAM, but you need to use http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.io.Reader) to construct your Lucene field (rather than http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.lang.String,%20org.apache.lucene.document.Field.Store,%20org.apache.lucene.document.Field.Index)).
>
> -----Original Message-----
> From: Suba Suresh [mailto:[EMAIL PROTECTED]
> Sent: 13 July 2006 14:55
> To: java-user@lucene.apache.org
> Subject: Out of memory error
>
> I am indexing different document formats with lucene 1.9. One of the pdf file I am indexing is 300MG. Whenever the index writer hits that file it stops the indexing with "Out of Memory" exception. I am using the pdf box library to index. I have set the following merge factors in my code.
>
> writer.setMergeFactor(1000);
> writer.setMaxMergeDocs(999);
> writer.setMaxBufferedDocs(1000);
> writer.setMaxFieldLength(Integer.MAX_VALUE);
>
> I would like any help and suggestions.
>
> thanks,
> suba suresh.
Re: Out of memory error
Definitely. Thanks for both the suggestions. Yes, it is 300MB (typo).

suba suresh.

Rob Staveley (Tom) wrote:
> Let us know how you get on. There are a lot of people fighting very similar battles on this list.
>
> -----Original Message-----
> From: Suba Suresh [mailto:[EMAIL PROTECTED]
> Sent: 13 July 2006 15:30
> To: java-user@lucene.apache.org
> Subject: Re: Out of memory error
>
> Thanks.
>
> I am using the getText(PDDocument) method of the PDFTextStripper. I will try the other suggestion.
>
> suba suresh.
>
> Rob Staveley (Tom) wrote:
> > If you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument), you are going to get a large String and may need a 1G heap.
> >
> > If, however, you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText(org.pdfbox.pdmodel.PDDocument,%20java.io.Writer) to go via a temporary file, you will not need so much RAM, but you need to use http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.io.Reader) to construct your Lucene field (rather than http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.lang.String,%20org.apache.lucene.document.Field.Store,%20org.apache.lucene.document.Field.Index)).
> >
> > -----Original Message-----
> > From: Suba Suresh [mailto:[EMAIL PROTECTED]
> > Sent: 13 July 2006 14:55
> > To: java-user@lucene.apache.org
> > Subject: Out of memory error
> >
> > I am indexing different document formats with lucene 1.9. One of the pdf file I am indexing is 300MG. Whenever the index writer hits that file it stops the indexing with "Out of Memory" exception. I am using the pdf box library to index. I have set the following merge factors in my code.
> >
> > writer.setMergeFactor(1000);
> > writer.setMaxMergeDocs(999);
> > writer.setMaxBufferedDocs(1000);
> > writer.setMaxFieldLength(Integer.MAX_VALUE);
> >
> > I would like any help and suggestions.
> >
> > thanks,
> > suba suresh.
Re: accented characters, wildcards and other problems
Bok Tomi,

What do you mean by "terms are misrepresented"? What should they be, and what are you seeing?

> What I'm not clear on is how can I see the problematic *terms* in the list of
> terms, but not the documents they're stored in?

Are you saying that the content got indexed, but the file names did not?

Out of curiosity (note my last name): what analyzer/tokenizer are you using? Is there an equivalent of the Porter stemmer for Croatian? I could use that. :)

Otis

----- Original Message -----
From: Tomi NA <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, July 13, 2006 8:19:31 AM
Subject: accented characters, wildcards and other problems

I've done a bit of testing with accented characters (Croatian, to be specific) and can't really explain what I see when I explore the index with Luke. I've used accented characters in directory names, file names and file contents. Now, in the list of terms (in "Top ranking terms", "Overview" tab) I see that 2 out of 5 terms are misrepresented, but are indexed nonetheless. The file names containing the problematic characters contain these characters themselves, i.e. if the file name is "file[x].txt", the file contents are "test[x]", where [x] represents the accented character. What I'm not clear on is: how can I see the problematic *terms* in the list of terms, but not the documents they're stored in?

That's one issue. The other is somewhat simpler, I expect. A search for "test*" returns no results. According to the FAQ, it should, so what am I missing?

t.n.a.
Re: modify existing non-indexed field
> can't access the file:
> http://cdoronc.20m.com/tmp/indexingThreads.zip

Yes, this Web host sometimes behaves strangely when clicking a link from a mail program. Please try copying cdoronc.20m.com/tmp into the Web browser (e.g. Firefox) and opening it. This should show the content of that tmp folder, including the downloadable file indexingThreads.zip.

Hope this works,
Doron
lengthnorm again
Hi,

I am sure this is a question that has been asked before. :-) I have done some research too, but still don't quite understand. I indexed 20 terms under the field name "mesh", and set the boosts accordingly from 20 down to 1 (just some arbitrary numbers). But when I checked the index from Luke, the boosts all appear to be 1. I saw a previous post saying it is because the boost shown in Luke is the product of the index-time boost and the lengthNorm. But if that is the case, aren't they supposed to be different instead of all having the value "1"? I guess I still don't fully understand lengthNorm.

Thank you,
Xin
Re: lengthnorm again
On 7/13/06, Zhao, Xin <[EMAIL PROTECTED]> wrote:
> Hi, I am sure this is a question that has been asked before. :-) I have done
> some research too, but still don't quite understand. I indexed 20 terms under
> the field name "mesh", and set the boosts accordingly from 20 down to 1 (just
> some arbitrary numbers). But when I checked the index from Luke, the boosts
> all appear to be 1. I saw a previous post saying it is because the boost shown
> in Luke is the product of the index-time boost and the lengthNorm. But if that
> is the case, aren't they supposed to be different instead of all having the
> value "1"? I guess I still don't fully understand lengthNorm.

I can't explain what you are seeing, but it sounds like your understanding of what it should be is correct. I guess you are either misinterpreting Luke's output, not indexing the docs correctly, or perhaps Luke has a bug.

Did you index the terms with different boosts in separate documents? There is only one norm per document for a specific indexed field.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
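To illustrate Yonik's point, a small sketch (the field and term values are made up): the index-time boost only survives as part of the single norm stored per document per field, so differently boosted values need to live in separate documents before Luke will show different numbers.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Two documents, same one-term "mesh" field, different index-time boosts.
    Document doc1 = new Document();
    Field f1 = new Field("mesh", "ankle", Field.Store.YES, Field.Index.TOKENIZED);
    f1.setBoost(20.0f);
    doc1.add(f1);

    Document doc2 = new Document();
    Field f2 = new Field("mesh", "ankle", Field.Store.YES, Field.Index.TOKENIZED);
    f2.setBoost(1.0f);
    doc2.add(f2);

    // What gets stored (and what Luke displays) is roughly:
    //   norm(doc, "mesh") = fieldBoost * documentBoost * Similarity.lengthNorm("mesh", numTerms)
    // encoded into a single byte, so some precision is lost as well.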
HTMLParser
Since I cannot seem to access the HTMLParser mailing list and I saw the library recommended here, I thought someone here who has used it successfully could help me out. I have HTML text stored in a database field which I want to add to a Lucene document, but I want to remove the HTML tags, so HTMLParser looked like it would fit the bill. First, it does not seem to be parsing... hence my first problem, and it also is throwing an exception, along with this phrase sprinkled around: "(No such file or directory)". I think I may be using it wrong, so here's what I have done. In my object where I create my document, I have the following code:

StringExtractor extract = new StringExtractor(record.get("column14").toString().trim());
try {
    value = extract.extractStrings(false);
} catch (ParserException pe) {
    System.out.println("Index Long Description Parser Exception:" + pe.getMessage());
    value = "";
}

What I get out in value is like the following:

Crystal Clear III and 3D combfilter for natural, sharp images with enhanced quality Compact and sleek design Incredible Surround (No such file or directory)

So the tags are still there, and oddly the "(No such file or directory)" phrase is added, which is not in the original text. Then I get a ParserException. What am I doing wrong?

Thanks,
Ross
Are Search Joins Possible between two Physically separate Indexes?
Here is a use case I am trying to address. I have two separate indexes, which contain sets of the same document pool/corpus. The two indexes have a different set of indexed fields. One of the indexed fields is an external DocumentID.

I would like to perform searches, like a relational join, expressing: "Return all fields (from both indexes) for document IDs that exist in both indexes and where field-X in Index-1 contains 'foo' and field-Y in Index-2 contains 'bar'."

How would you approach this? Do we need to handle the join logic ourselves, or is there an API approach - possibly around MultiSearcher - that is meant to address this use case?

Dejan
Re: Are Search Joins Possible between two Physically separate Indexes?
Though I'm a newbie (which means I may be completely wrong), I don't think this is possible "out of the box". The quickest would be to write a filter which looks up document IDs in the first index and applies this to the second index to get the desired subset to search over.

I may need this too, so I'm curious what the experts have to say.

Regards
Paul

On 7/13/06, Dejan Nenov <[EMAIL PROTECTED]> wrote:
> Here is a use case I am trying to address. I have two separate indexes, which
> contain sets of the same document pool/corpus. The two indexes have a
> different set of indexed fields. One of the indexed fields is an external
> DocumentID.
>
> I would like to perform searches, like a relational join, expressing: "Return
> all fields (from both indexes) for document IDs that exist in both indexes and
> where field-X in Index-1 contains 'foo' and field-Y in Index-2 contains 'bar'."
>
> How would you approach this? Do we need to handle the join logic ourselves, or
> is there an API approach - possibly around MultiSearcher - that is meant to
> address this use case?
>
> Dejan

--
http://walhalla.wordpress.com
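A hand-rolled sketch of that idea against the 1.9/2.0-era API (the index paths and the field names fieldX, fieldY and docId are assumptions): collect the external DocumentIDs that match in the first index, then require one of them alongside the second index's clause. Note that BooleanQuery's default maxClauseCount of 1024 limits how many ids this can OR together.

    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    IndexSearcher s1 = new IndexSearcher("/path/to/index1");
    IndexSearcher s2 = new IndexSearcher("/path/to/index2");

    // step 1: which external DocumentIDs match fieldX:foo in the first index?
    Hits first = s1.search(new TermQuery(new Term("fieldX", "foo")));
    Set ids = new HashSet();
    for (int i = 0; i < first.length(); i++) {
        ids.add(first.doc(i).get("docId")); // "docId" must be a stored field in index 1
    }

    // step 2: require one of those ids together with fieldY:bar in the second index
    BooleanQuery idClause = new BooleanQuery();
    for (Iterator it = ids.iterator(); it.hasNext();) {
        idClause.add(new TermQuery(new Term("docId", (String) it.next())), BooleanClause.Occur.SHOULD);
    }
    BooleanQuery join = new BooleanQuery();
    join.add(new TermQuery(new Term("fieldY", "bar")), BooleanClause.Occur.MUST);
    join.add(idClause, BooleanClause.Occur.MUST);
    Hits joined = s2.search(join);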
Re: HTMLParser
I've never used HTMLParser, but if you have malformed, incomplete, or optional HTML that would otherwise choke an HTML parser, you could use Solr's HTMLStripping:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e

It's pretty stand-alone, so it should be trivial to rip it out of Solr and re-use it in your Lucene project.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

On 7/13/06, Ross Rankin <[EMAIL PROTECTED]> wrote:
> Since I cannot seem to access the HTMLParser mailing list and I saw the
> library recommended here, I thought someone here that has used it successfully
> can help me out. I have HTML text stored in a database field which I want to
> add to a Lucene document, but I want to remove the HTML tags, so HTMLParser
> looked like it would fit the bill. First, it does not seem to be parsing…
> hence my first problem and it also is throwing an exception along with this
> phrase sprinkled around "(No such file or directory)". I think I may be using
> it wrong, so here's what I have done. In my object where I create my document,
> I have the following code:
>
> StringExtractor extract = new StringExtractor(record.get("column14").toString().trim());
> try {
>     value = extract.extractStrings(false);
> } catch (ParserException pe) {
>     System.out.println("Index Long Description Parser Exception:" + pe.getMessage());
>     value = "";
> }
>
> What I get out in value is like the following:
>
> Crystal Clear III and 3D combfilter for natural, sharp images with enhanced
> quality Compact and sleek design Incredible Surround (No such file or directory)
>
> So the tags are still there and oddly the '(No such file or directory)' phrase
> is added which is not in the original text. Then I get a ParserException. What
> am I doing wrong?
>
> Thanks,
> Ross
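If pulling the Solr class in isn't convenient, here is an alternative sketch that relies only on the JDK's own HTML parser callback to keep the text nodes and drop the tags; this is neither the HTMLParser nor the Solr API, just a fallback, and the class name is made up:

    import java.io.IOException;
    import java.io.StringReader;

    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    public class HtmlText {
        // Parse the HTML and collect only the visible text nodes.
        public static String strip(String html) throws IOException {
            final StringBuffer text = new StringBuffer();
            HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                public void handleText(char[] data, int pos) {
                    text.append(data).append(' ');
                }
            };
            new ParserDelegator().parse(new StringReader(html), callback, true);
            return text.toString().trim();
        }
    }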
file format of index
As I understand from earlier answers to my question, one can create an index on machine A and use it (search and merge with other indices) on machine B. I was reading the file format today:
http://lucene.apache.org/java/docs/fileformats.html

The index has Byte, UInt32 and UInt64 in most places, making it byte-order independent in those places. However, a few spots have Long and Int. For example, in the Compound Files section there is DataOffset --> Long, and in Term Vectors there is TVXVersion --> Int.

Is this an oversight in the documentation, or is the document correct, and does this indicate that there will be a problem using an index on a big-endian machine which was created on a little-endian machine? If both machines have the same endianness, maybe the usage of an index created elsewhere is still fine?

I am starting to create quite a bit of code with the assumption that porting and merging an index is okay anywhere. I would appreciate some more input as to whether these fields matter or not and whether the documentation is correct.

Thank you in advance.
Re: file format of index
I think that I may be misreading the documentation. I didn't see a description of the Long and Int types under the "Primitive Types" section while reading about Byte, UInt32, UInt64 and VInt, so for some reason I thought that Long and Int are byte-order sensitive. Upon re-reading the document, I see that "All other data types are defined as sequences of bytes, so file formats are byte-order independent." I think that I should be fine. Sorry for posting before reading more carefully.

On 7/13/06, Beady Geraghty <[EMAIL PROTECTED]> wrote:
> As I understand from earlier answers to my question, one can create an index
> on machine A and use it (search and merge with other indices) on machine B.
> I was reading the file format today:
> http://lucene.apache.org/java/docs/fileformats.html
>
> The index has Byte, UInt32 and UInt64 in most places, making it byte-order
> independent in those places. However, a few spots have Long and Int. For
> example, in the Compound Files section there is DataOffset --> Long, and in
> Term Vectors there is TVXVersion --> Int.
>
> Is this an oversight in the documentation, or is the document correct, and
> does this indicate that there will be a problem using an index on a big-endian
> machine which was created on a little-endian machine? If both machines have
> the same endianness, maybe the usage of an index created elsewhere is still
> fine?
>
> I am starting to create quite a bit of code with the assumption that porting
> and merging an index is okay anywhere. I would appreciate some more input as
> to whether these fields matter or not and whether the documentation is correct.
>
> Thank you in advance.