QueryFilter and Memory
Hi, I've been trying to adjust the weightings for my searches (thanks Chris for his replies on that thread), and have been using ConstantScoreQuery to even out scores from portions of my query that I want to match but not to contribute to the ranking of that result. I convert a BooleanQuery/TermQuery (partialQuery) to a constant-score one (before adding them to the overall BooleanQuery that gets searched) as follows:

constantPartialQuery = new ConstantScoreQuery(new QueryFilter(partialQuery));

These partial queries are ad hoc (created according to user input) and not reused. It worked, but after some extended testing (like running a day of queries) I get a Java heap OutOfMemory error. I'm wondering:

(a) Is there a better way to change a query to a constant-score query other than what I did above?
(b) I'm subclassing QueryFilter not to cache the query results (which might be the cause), since the QueryFilters are not reused. Does anyone have an opinion on what else might be contributing to the memory problem?

Thanks.
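For illustration, a minimal sketch of the kind of non-caching filter described in (b), written against the Lucene 1.9/2.0-era Filter API; the class name NonCachingQueryFilter is made up and this is not the poster's actual code:

    import java.io.IOException;
    import java.util.BitSet;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    // Re-runs the wrapped query on every call instead of holding a per-reader
    // BitSet in a cache, so ad hoc filters can be garbage-collected after use.
    public class NonCachingQueryFilter extends Filter {
        private final Query query;

        public NonCachingQueryFilter(Query query) {
            this.query = query;
        }

        public BitSet bits(IndexReader reader) throws IOException {
            final BitSet bits = new BitSet(reader.maxDoc());
            new IndexSearcher(reader).search(query, new HitCollector() {
                public void collect(int doc, float score) {
                    bits.set(doc); // mark every matching document
                }
            });
            return bits;
        }
    }

It would then be used the same way as before: constantPartialQuery = new ConstantScoreQuery(new NonCachingQueryFilter(partialQuery));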
Re: modify existing non-indexed field
Can't access the file:

Forbidden. Remote Host: [62.172.205.164]. You do not have permission to access http://cdoronc.20m.com/tmp/indexingThreads.zip. Data files must be stored on the same site they are linked from. Thank you for using 20m.com
RE: Searching for a phrase which spans on 2 pages
Yes, this can be easily done using the TokenStream class and hence getting the BestTokens. But of course you have to have this content in the index.

Ramesh Reddy

On Wed, 2006-07-12 at 12:43 +0100, Mike Streeton wrote:
> The simplest solution is always the best - when storing the page, do not break up sentences. So a page will be all the sentences that occur on it. If a sentence starts on one page and finishes on the next, it will be included in both pages in the index.
>
> Hope this helps
>
> Mike
>
> www.ardentia.com the home of NetSearch
>
> -----Original Message-----
> From: Mile Rosu [mailto:[EMAIL PROTECTED]]
> Sent: 11 July 2006 15:55
> To: java-user@lucene.apache.org
> Subject: Searching for a phrase which spans on 2 pages
>
> Hello,
>
> I am working on an application similar to Google Books which allows searching on documents which represent a scanned page. Of course, one might search for a phrase starting at the end of one page and ending at the beginning of the next one. In this case I do not know how I might treat this. Both pages should be returned as hit results.
> Do you have any idea on how this situation might be handled?
>
> Thank you,
> Mile Rosu
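As a rough illustration of Mike's approach (the helper class and field names below are made up, not from the original mail): each page becomes one Document, and a sentence that starts on this page but finishes on the next is indexed with both pages.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class PageDocBuilder {
        // Builds the Document for one page. 'spillOver' is the sentence that
        // starts on this page and finishes on the next; the next page also
        // includes that sentence in its own text, so a phrase query spanning
        // the page break can match either page.
        public static Document buildPageDoc(int pageNo, String[] sentences, String spillOver) {
            StringBuffer text = new StringBuffer();
            for (int i = 0; i < sentences.length; i++) {
                text.append(sentences[i]).append(' ');
            }
            if (spillOver != null) {
                text.append(spillOver);
            }
            Document doc = new Document();
            doc.add(new Field("page", Integer.toString(pageNo), Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("contents", text.toString(), Field.Store.NO, Field.Index.TOKENIZED));
            return doc;
        }
    }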
Re: question regarding Field.Index.UN_TOKENIZED
Are you using the StandardAnalyzer at the time of indexing? Which one do you use at the time of querying?

Ramesh Reddy

On Mon, 2006-07-10 at 18:37 -0700, Chris Hostetter wrote:
> : I'm storing a field in an index with that option
> : (Field.Index.UN_TOKENIZED).
>
> The key to understanding your problem is to realize that...
>
> UN_TOKENIZED == Not Analyzed
>
> ...personally, I think the name of the constant is misleading.
>
> : The String that is being stored is: NORTH SAFETY PRODUCT (all uppercase)
> : When I try a wildcard query against that field, it only produces results
> : if the query term is capitalized.
>
> That's because the term you've put in the index is capitalized.
>
> -Hoss
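A short sketch of the usual workaround when a field is UN_TOKENIZED (so no analyzer ever lowercases it): normalize case yourself on both the indexing side and the query side. The field name here is only an example.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardQuery;

    // indexing side: lowercase the raw value before storing it un-tokenized
    Document doc = new Document();
    doc.add(new Field("manufacturer", "NORTH SAFETY PRODUCT".toLowerCase(),
            Field.Store.YES, Field.Index.UN_TOKENIZED));

    // query side: lowercase the user's pattern too, since WildcardQuery is not analyzed
    Query q = new WildcardQuery(new Term("manufacturer", "NORTH SAF*".toLowerCase()));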
accented characters, wildcards and other problems
I've done a bit of testing with accented characters (Croatian, to be specific) and can't really explain what I see when I explore the index with Luke. I've used accented characters in directory names, file names and file contents. Now, in the list of terms (in "Top ranking terms", "Overview" tab) I see that 2 out of 5 terms are misrepresented, but are indexed nonetheless. The file names containing the problematic characters contain these characters themselves, i.e. if the file name is "file[x].txt", the file contents are "test[x]", where [x] represents the accented character. What I'm not clear on is: how can I see the problematic *terms* in the list of terms, but not the documents they're stored in?

That's one issue. The other is somewhat simpler, I expect. A search for "test*" returns no results. According to the FAQ, it should, so what am I missing?

t.n.a.
Re: Can I do "Google Suggest" Like Search?
On Wed, 2006-05-24 at 13:11 +0530, Vikas Khengare wrote:
> So when I type “L” it will give me search options names which will
> start from “L”. Then when I will type “Lu” then it should give me
> options for names which are starting from “Lu”. & so on ……

Vikas, the Jira now contains code that does just that. It is a trie you will have to train with user queries (that return something), and it is not based on the document corpus.

http://issues.apache.org/jira/browse/LUCENE-625

I'd be more than happy to hear what you think of the API.

-- karl
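For anyone who would rather derive suggestions from the index itself instead of training the LUCENE-625 trie on past queries, here is a rough corpus-based alternative sketch (the field name, class name and limit are assumptions): walk the term dictionary for terms starting with the typed prefix.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class PrefixSuggester {
        // Collect up to 'max' indexed terms in 'field' that start with 'prefix'.
        public static List suggest(IndexReader reader, String field, String prefix, int max)
                throws IOException {
            List suggestions = new ArrayList();
            TermEnum terms = reader.terms(new Term(field, prefix));
            try {
                do {
                    Term t = terms.term();
                    if (t == null || !t.field().equals(field) || !t.text().startsWith(prefix)) {
                        break; // walked past the prefix range
                    }
                    suggestions.add(t.text());
                } while (suggestions.size() < max && terms.next());
            } finally {
                terms.close();
            }
            return suggestions;
        }
    }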
Out of memory error
I am indexing different document formats with Lucene 1.9. One of the PDF files I am indexing is 300MG. Whenever the index writer hits that file it stops the indexing with an "Out of Memory" exception. I am using the PDFBox library to index. I have set the following merge factors in my code:

writer.setMergeFactor(1000);
writer.setMaxMergeDocs(999);
writer.setMaxBufferedDocs(1000);
writer.setMaxFieldLength(Integer.MAX_VALUE);

I would like any help and suggestions.

thanks,
suba suresh.
Re: Can I do "Google Suggest" Like Search?
Another option is to use Sun's free and soon-to-be open source Java Studio Creator 2. It's a great way to do JSF and provides an AJAX Google-suggest-type component. You can hook this component up to a Lucene search and *BOOM*... Google suggest. Here is a link to a "did you mean" tutorial as well (it may give some hints for the implementation of suggest too): http://today.java.net/pub/a/today/2005/08/09/didyoumean.html

- Mark

On 7/13/06, karl wettin <[EMAIL PROTECTED]> wrote:
> On Wed, 2006-05-24 at 13:11 +0530, Vikas Khengare wrote:
> > So when I type "L" it will give me search options names which will
> > start from "L". Then when I will type "Lu" then it should give me
> > options for names which are starting from "Lu". & so on ……
>
> Vikas, the Jira now contains code that does just that. It is a trie you
> will have to train with user queries (that return something), and it is
> not based on the document corpus.
>
> http://issues.apache.org/jira/browse/LUCENE-625
>
> I'd be more than happy to hear what you think of the API.
>
> -- karl
RE: Out of memory error
If you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument), you are going to get a large String and may need a 1G heap.

If, however, you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText(org.pdfbox.pdmodel.PDDocument,%20java.io.Writer) to go via a temporary file, you will not need so much RAM, but you need to use http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.io.Reader) to construct your Lucene field (rather than http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.lang.String,%20org.apache.lucene.document.Field.Store,%20org.apache.lucene.document.Field.Index)).

-----Original Message-----
From: Suba Suresh [mailto:[EMAIL PROTECTED]
Sent: 13 July 2006 14:55
To: java-user@lucene.apache.org
Subject: Out of memory error

I am indexing different document formats with lucene 1.9. One of the pdf file I am indexing is 300MG. Whenever the index writer hits that file it stops the indexing with "Out of Memory" exception. I am using the pdf box library to index. I have set the following merge factors in my code.

writer.setMergeFactor(1000);
writer.setMaxMergeDocs(999);
writer.setMaxBufferedDocs(1000);
writer.setMaxFieldLength(Integer.MAX_VALUE);

I would like any help and suggestions.

thanks,
suba suresh.
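A rough sketch of that Reader-based route, assuming the old org.pdfbox package names and an already-open IndexWriter; the class and field names are just examples, not Rob's actual code:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.Writer;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    public class PdfIndexer {
        // Stream the extracted text through a temporary file and give Lucene a
        // Reader, so the PDF body never has to sit in memory as one huge String.
        public static void indexPdf(IndexWriter writer, File pdfFile) throws IOException {
            PDDocument pdf = PDDocument.load(new FileInputStream(pdfFile));
            File tmp = File.createTempFile("pdftext", ".txt");
            Writer out = new FileWriter(tmp);
            try {
                new PDFTextStripper().writeText(pdf, out);
            } finally {
                out.close();
                pdf.close();
            }
            Document doc = new Document();
            doc.add(new Field("path", pdfFile.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("contents", new FileReader(tmp))); // indexed and tokenized, not stored
            writer.addDocument(doc);
        }
    }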
Re: Out of memory error
Thanks.

I am using the getText(PDDocument) method of the PDFTextStripper. I will try the other suggestion.

suba suresh.

Rob Staveley (Tom) wrote:
> If you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument), you are going to get a large String and may need a 1G heap.
>
> If, however, you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText(org.pdfbox.pdmodel.PDDocument,%20java.io.Writer) to go via a temporary file, you will not need so much RAM, but you need to use http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.io.Reader) to construct your Lucene field (rather than http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.lang.String,%20org.apache.lucene.document.Field.Store,%20org.apache.lucene.document.Field.Index)).
>
> -----Original Message-----
> From: Suba Suresh [mailto:[EMAIL PROTECTED]
> Sent: 13 July 2006 14:55
> To: java-user@lucene.apache.org
> Subject: Out of memory error
>
> I am indexing different document formats with lucene 1.9. One of the pdf file I am indexing is 300MG. Whenever the index writer hits that file it stops the indexing with "Out of Memory" exception. I am using the pdf box library to index. I have set the following merge factors in my code.
>
> writer.setMergeFactor(1000);
> writer.setMaxMergeDocs(999);
> writer.setMaxBufferedDocs(1000);
> writer.setMaxFieldLength(Integer.MAX_VALUE);
>
> I would like any help and suggestions.
>
> thanks,
> suba suresh.
Re: Out of memory error
By 300MG I assume you mean 300MB. You can also try extracting the text outside of Lucene by using a PDFBox command line app:

java org.pdfbox.ExtractText

You may need to increase the JRE memory like this:

java -Xmx512m org.pdfbox.ExtractText
OR
java -Xmx1024m org.pdfbox.ExtractText

If this is still giving you an out of memory error then it is possibly an issue with PDFBox; if that is the case then please create an issue and attach/upload the PDF on the PDFBox site.

Ben

> Thanks.
>
> I am using the getText(PDDocument) method of the PDFTextStripper. I will try the other suggestion.
>
> suba suresh.
>
> Rob Staveley (Tom) wrote:
> > If you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument), you are going to get a large String and may need a 1G heap.
> >
> > If, however, you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText(org.pdfbox.pdmodel.PDDocument,%20java.io.Writer) to go via a temporary file, you will not need so much RAM, but you need to use http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.io.Reader) to construct your Lucene field (rather than http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.lang.String,%20org.apache.lucene.document.Field.Store,%20org.apache.lucene.document.Field.Index)).
> >
> > -----Original Message-----
> > From: Suba Suresh [mailto:[EMAIL PROTECTED]
> > Sent: 13 July 2006 14:55
> > To: java-user@lucene.apache.org
> > Subject: Out of memory error
> >
> > I am indexing different document formats with lucene 1.9. One of the pdf file I am indexing is 300MG. Whenever the index writer hits that file it stops the indexing with "Out of Memory" exception. I am using the pdf box library to index. I have set the following merge factors in my code.
> >
> > writer.setMergeFactor(1000);
> > writer.setMaxMergeDocs(999);
> > writer.setMaxBufferedDocs(1000);
> > writer.setMaxFieldLength(Integer.MAX_VALUE);
> >
> > I would like any help and suggestions.
> >
> > thanks,
> > suba suresh.
RE: Out of memory error
Let us know how you get on. There are a lot of people fighting very similar battles on this list.

-----Original Message-----
From: Suba Suresh [mailto:[EMAIL PROTECTED]
Sent: 13 July 2006 15:30
To: java-user@lucene.apache.org
Subject: Re: Out of memory error

Thanks.

I am using the getText(PDDocument) method of the PDFTextStripper. I will try the other suggestion.

suba suresh.

Rob Staveley (Tom) wrote:
> If you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument), you are going to get a large String and may need a 1G heap.
>
> If, however, you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText(org.pdfbox.pdmodel.PDDocument,%20java.io.Writer) to go via a temporary file, you will not need so much RAM, but you need to use http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.io.Reader) to construct your Lucene field (rather than http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.lang.String,%20org.apache.lucene.document.Field.Store,%20org.apache.lucene.document.Field.Index)).
>
> -----Original Message-----
> From: Suba Suresh [mailto:[EMAIL PROTECTED]
> Sent: 13 July 2006 14:55
> To: java-user@lucene.apache.org
> Subject: Out of memory error
>
> I am indexing different document formats with lucene 1.9. One of the pdf file I am indexing is 300MG. Whenever the index writer hits that file it stops the indexing with "Out of Memory" exception. I am using the pdf box library to index. I have set the following merge factors in my code.
>
> writer.setMergeFactor(1000);
> writer.setMaxMergeDocs(999);
> writer.setMaxBufferedDocs(1000);
> writer.setMaxFieldLength(Integer.MAX_VALUE);
>
> I would like any help and suggestions.
>
> thanks,
> suba suresh.
Re: Out of memory error
Definitely. Thanks for both the suggestions. Yes, it is 300MB (typo).

suba suresh.

Rob Staveley (Tom) wrote:
> Let us know how you get on. There are a lot of people fighting very similar battles on this list.
>
> -----Original Message-----
> From: Suba Suresh [mailto:[EMAIL PROTECTED]
> Sent: 13 July 2006 15:30
> To: java-user@lucene.apache.org
> Subject: Re: Out of memory error
>
> Thanks.
>
> I am using the getText(PDDocument) method of the PDFTextStripper. I will try the other suggestion.
>
> suba suresh.
>
> Rob Staveley (Tom) wrote:
> > If you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument), you are going to get a large String and may need a 1G heap.
> >
> > If, however, you are using http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText(org.pdfbox.pdmodel.PDDocument,%20java.io.Writer) to go via a temporary file, you will not need so much RAM, but you need to use http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.io.Reader) to construct your Lucene field (rather than http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.lang.String,%20org.apache.lucene.document.Field.Store,%20org.apache.lucene.document.Field.Index)).
> >
> > -----Original Message-----
> > From: Suba Suresh [mailto:[EMAIL PROTECTED]
> > Sent: 13 July 2006 14:55
> > To: java-user@lucene.apache.org
> > Subject: Out of memory error
> >
> > I am indexing different document formats with lucene 1.9. One of the pdf file I am indexing is 300MG. Whenever the index writer hits that file it stops the indexing with "Out of Memory" exception. I am using the pdf box library to index. I have set the following merge factors in my code.
> >
> > writer.setMergeFactor(1000);
> > writer.setMaxMergeDocs(999);
> > writer.setMaxBufferedDocs(1000);
> > writer.setMaxFieldLength(Integer.MAX_VALUE);
> >
> > I would like any help and suggestions.
> >
> > thanks,
> > suba suresh.
Re: accented characters, wildcards and other problems
Bok Tomi,

What do you mean by "terms are misrepresented"? What should they be, and what are you seeing?

> What I'm not clear on is how can I see the problematic *terms* in the list of
> terms, but not the documents they're stored in?

Are you saying that the content got indexed, but the file names did not?

Out of curiosity (note my last name): what analyzer/tokenizer are you using? Is there an equivalent of the Porter stemmer for Croatian? I could use that. :)

Otis

----- Original Message -----
From: Tomi NA <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, July 13, 2006 8:19:31 AM
Subject: accented characters, wildcards and other problems

I've done a bit of testing with accented characters (Croatian, to be specific) and can't really explain what I see when I explore the index with Luke. I've used accented characters in directory names, file names and file contents. Now, in the list of terms (in "Top ranking terms", "Overview" tab) I see that 2 out of 5 terms are misrepresented, but are indexed nonetheless. The file names containing the problematic characters contain these characters themselves, i.e. if the file name is "file[x].txt", the file contents are "test[x]", where [x] represents the accented character. What I'm not clear on is: how can I see the problematic *terms* in the list of terms, but not the documents they're stored in?

That's one issue. The other is somewhat simpler, I expect. A search for "test*" returns no results. According to the FAQ, it should, so what am I missing?

t.n.a.
Re: modify existing non-indexed field
> can't access the file:
> http://cdoronc.20m.com/tmp/indexingThreads.zip

Yes, this Web host sometimes behaves strangely when clicking a link from a mail program. Please try copying cdoronc.20m.com/tmp into the Web browser (e.g. Firefox) and opening it. This should show the content of that tmp folder, including the downloadable file indexingThreads.zip.

Hope this works,
Doron
lengthnorm again
Hi,

I am sure this is a question that has been asked before. :-) I have done some research too, but still don't quite understand. I indexed 20 terms under the field name "mesh", and set the boosts accordingly from 20 down to 1 (just some arbitrary numbers). But when I checked the index from Luke, the boosts all appear to be 1. I saw a previous post saying it is because the boost shown in Luke is the product of the index-time boost and the lengthNorm. But if that is the case, aren't they supposed to be different instead of all having the value "1"? I guess I still don't fully understand lengthNorm.

Thank you,
Xin
Re: lengthnorm again
On 7/13/06, Zhao, Xin <[EMAIL PROTECTED]> wrote:
> Hi, I am sure this is a question that has been asked before. :-) I have done
> some research too, but still don't quite understand. I indexed 20 terms under
> the field name "mesh", and set the boosts accordingly from 20 down to 1 (just
> some arbitrary numbers). But when I checked the index from Luke, the boosts
> all appear to be 1. I saw a previous post saying it is because the boost shown
> in Luke is the product of the index-time boost and the lengthNorm. But if that
> is the case, aren't they supposed to be different instead of all having the
> value "1"? I guess I still don't fully understand lengthNorm.

I can't explain what you are seeing, but it sounds like your understanding of what it should be is correct. I guess you are either misinterpreting Luke's output, not indexing the docs correctly, or perhaps Luke has a bug.

Did you index the terms with different boosts in separate documents? There is only one norm per document for a specific indexed field.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
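To illustrate Yonik's point, a small sketch (the field and term values are made up): the index-time boost only survives as part of the single norm stored per document per field, so differently boosted values need to live in separate documents before Luke will show different numbers.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Two documents, same one-term "mesh" field, different index-time boosts.
    Document doc1 = new Document();
    Field f1 = new Field("mesh", "ankle", Field.Store.YES, Field.Index.TOKENIZED);
    f1.setBoost(20.0f);
    doc1.add(f1);

    Document doc2 = new Document();
    Field f2 = new Field("mesh", "ankle", Field.Store.YES, Field.Index.TOKENIZED);
    f2.setBoost(1.0f);
    doc2.add(f2);

    // What gets stored (and what Luke displays) is roughly:
    //   norm(doc, "mesh") = fieldBoost * documentBoost * Similarity.lengthNorm("mesh", numTerms)
    // encoded into a single byte, so some precision is lost as well.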
HTMLParser
Since I cannot seem to access the HTMLParser mailing list and I saw the library recommended here, I thought someone here who has used it successfully could help me out. I have HTML text stored in a database field which I want to add to a Lucene document, but I want to remove the HTML tags, so HTMLParser looked like it would fit the bill. First, it does not seem to be parsing... hence my first problem, and it also is throwing an exception, along with this phrase sprinkled around: "(No such file or directory)". I think I may be using it wrong, so here's what I have done. In my object where I create my document, I have the following code:

StringExtractor extract = new StringExtractor(record.get("column14").toString().trim());
try {
    value = extract.extractStrings(false);
} catch (ParserException pe) {
    System.out.println("Index Long Description Parser Exception:" + pe.getMessage());
    value = "";
}

What I get out in value is like the following:

Crystal Clear III and 3D combfilter for natural, sharp images with enhanced quality Compact and sleek design Incredible Surround (No such file or directory)

So the tags are still there, and oddly the "(No such file or directory)" phrase is added, which is not in the original text. Then I get a ParserException. What am I doing wrong?

Thanks,
Ross
Are Search Joins Possible between two Physically separate Indexes?
Here is a use case I am trying to address. I have two separate indexes, which contain sets of the same document pool/corpus. The two indexes have a different set of indexed fields. One of the indexed fields is an external DocumentID.

I would like to perform searches, like a relational join, expressing: "Return all fields (from both indexes) for document IDs that exist in both indexes and where field-X in Index-1 contains 'foo' and field-Y in Index-2 contains 'bar'."

How would you approach this? Do we need to handle the join logic ourselves, or is there an API approach - possibly around MultiSearcher - that is meant to address this use case?

Dejan
Re: Are Search Joins Possible between two Physically separate Indexes?
Though I'm a newbie (which means I may be completely wrong), I don't think this is possible "out of the box". The quickest would be to write a filter which looks up document IDs in the first index and applies this to the second index to get the desired subset to search over.

I may need this too, so I'm curious what the experts have to say.

Regards
Paul

On 7/13/06, Dejan Nenov <[EMAIL PROTECTED]> wrote:
> Here is a use case I am trying to address. I have two separate indexes, which
> contain sets of the same document pool/corpus. The two indexes have a
> different set of indexed fields. One of the indexed fields is an external
> DocumentID.
>
> I would like to perform searches, like a relational join, expressing: "Return
> all fields (from both indexes) for document IDs that exist in both indexes and
> where field-X in Index-1 contains 'foo' and field-Y in Index-2 contains 'bar'."
>
> How would you approach this? Do we need to handle the join logic ourselves, or
> is there an API approach - possibly around MultiSearcher - that is meant to
> address this use case?
>
> Dejan

--
http://walhalla.wordpress.com
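A hand-rolled sketch of that idea against the 1.9/2.0-era API (the index paths and the field names fieldX, fieldY and docId are assumptions): collect the external DocumentIDs that match in the first index, then require one of them alongside the second index's clause. Note that BooleanQuery's default maxClauseCount of 1024 limits how many ids this can OR together.

    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    IndexSearcher s1 = new IndexSearcher("/path/to/index1");
    IndexSearcher s2 = new IndexSearcher("/path/to/index2");

    // step 1: which external DocumentIDs match fieldX:foo in the first index?
    Hits first = s1.search(new TermQuery(new Term("fieldX", "foo")));
    Set ids = new HashSet();
    for (int i = 0; i < first.length(); i++) {
        ids.add(first.doc(i).get("docId")); // "docId" must be a stored field in index 1
    }

    // step 2: require one of those ids together with fieldY:bar in the second index
    BooleanQuery idClause = new BooleanQuery();
    for (Iterator it = ids.iterator(); it.hasNext();) {
        idClause.add(new TermQuery(new Term("docId", (String) it.next())), BooleanClause.Occur.SHOULD);
    }
    BooleanQuery join = new BooleanQuery();
    join.add(new TermQuery(new Term("fieldY", "bar")), BooleanClause.Occur.MUST);
    join.add(idClause, BooleanClause.Occur.MUST);
    Hits joined = s2.search(join);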
Re: HTMLParser
I've never used HTMLParser, but if you have malformed, incomplete, or optional HTML that would otherwise choke an HTML parser, you could use Solr's HTMLStripping:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e

It's pretty stand-alone, so it should be trivial to rip it out of Solr and re-use it in your Lucene project.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

On 7/13/06, Ross Rankin <[EMAIL PROTECTED]> wrote:
> Since I cannot seem to access the HTMLParser mailing list and I saw the
> library recommended here, I thought someone here that has used it successfully
> can help me out. I have HTML text stored in a database field which I want to
> add to a Lucene document, but I want to remove the HTML tags, so HTMLParser
> looked like it would fit the bill. First, it does not seem to be parsing…
> hence my first problem and it also is throwing an exception along with this
> phrase sprinkled around "(No such file or directory)". I think I may be using
> it wrong, so here's what I have done. In my object where I create my document,
> I have the following code:
>
> StringExtractor extract = new StringExtractor(record.get("column14").toString().trim());
> try {
>     value = extract.extractStrings(false);
> } catch (ParserException pe) {
>     System.out.println("Index Long Description Parser Exception:" + pe.getMessage());
>     value = "";
> }
>
> What I get out in value is like the following:
>
> Crystal Clear III and 3D combfilter for natural, sharp images with enhanced
> quality Compact and sleek design Incredible Surround (No such file or directory)
>
> So the tags are still there and oddly the '(No such file or directory)' phrase
> is added which is not in the original text. Then I get a ParserException. What
> am I doing wrong?
>
> Thanks,
> Ross
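If pulling the Solr class in isn't convenient, here is an alternative sketch that relies only on the JDK's own HTML parser callback to keep the text nodes and drop the tags; this is neither the HTMLParser nor the Solr API, just a fallback, and the class name is made up:

    import java.io.IOException;
    import java.io.StringReader;

    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    public class HtmlText {
        // Parse the HTML and collect only the visible text nodes.
        public static String strip(String html) throws IOException {
            final StringBuffer text = new StringBuffer();
            HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                public void handleText(char[] data, int pos) {
                    text.append(data).append(' ');
                }
            };
            new ParserDelegator().parse(new StringReader(html), callback, true);
            return text.toString().trim();
        }
    }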
file format of index
As I understand from earlier answers to my question, one can create an index on machine A and use it (search and merge with other indices) on machine B. I was reading the file format today:
http://lucene.apache.org/java/docs/fileformats.html

The index has Byte, UInt32 and UInt64 in most places, making it byte-order independent in those places. However, a few spots have Long and Int. For example, in the Compound Files section there is DataOffset --> Long, and in Term Vectors there is TVXVersion --> Int.

Is this an oversight in the documentation, or is the document correct, and does this indicate that there will be a problem using an index on a big-endian machine which was created on a little-endian machine? If both machines have the same endianness, maybe the usage of an index created elsewhere is still fine?

I am starting to create quite a bit of code with the assumption that porting and merging an index is okay anywhere. I would appreciate some more input as to whether these fields matter or not and whether the documentation is correct.

Thank you in advance.
Re: file format of index
I think that I may be misreading the documentation. I didn't see a description of the Long and Int types under the "Primitive Types" section while reading about Byte, UInt32, UInt64 and VInt, so for some reason I thought that Long and Int are byte-order sensitive. Upon re-reading the document, I see that "All other data types are defined as sequences of bytes, so file formats are byte-order independent." I think that I should be fine. Sorry for posting before reading more carefully.

On 7/13/06, Beady Geraghty <[EMAIL PROTECTED]> wrote:
> As I understand from earlier answers to my question, one can create an index
> on machine A and use it (search and merge with other indices) on machine B.
> I was reading the file format today:
> http://lucene.apache.org/java/docs/fileformats.html
>
> The index has Byte, UInt32 and UInt64 in most places, making it byte-order
> independent in those places. However, a few spots have Long and Int. For
> example, in the Compound Files section there is DataOffset --> Long, and in
> Term Vectors there is TVXVersion --> Int.
>
> Is this an oversight in the documentation, or is the document correct, and
> does this indicate that there will be a problem using an index on a big-endian
> machine which was created on a little-endian machine? If both machines have
> the same endianness, maybe the usage of an index created elsewhere is still
> fine?
>
> I am starting to create quite a bit of code with the assumption that porting
> and merging an index is okay anywhere. I would appreciate some more input as
> to whether these fields matter or not and whether the documentation is correct.
>
> Thank you in advance.