I think that I may be misreading the documentation.
I didn't see a description of the Long and Int types under the "Primitive
Types" section while reading the descriptions of Byte, UInt32, UInt64, and
VInt. So, for some reason, I thought that Long and Int were byte-order
sensitive.
Upon re-readi
As I understand from earlier answers to my question,
one can create an index on machine A
and use it (search and merge with other indices) on machine B.
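For illustration, here is a minimal sketch of why a fixed on-disk byte order makes a file portable between machines. It assumes, as the file-format page describes for UInt32, that a 32-bit value is stored as four bytes with the high-order byte first. The class and method names are hypothetical, and it uses java.nio.ByteBuffer rather than Lucene's own I/O classes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class FixedByteOrder {
    // Encode a 32-bit unsigned value as four bytes, high-order byte first.
    // The byte order is set explicitly, so the result does not depend on
    // the native byte order of the machine that runs this code.
    public static byte[] writeUInt32(long value) {
        return ByteBuffer.allocate(4)
                .order(ByteOrder.BIG_ENDIAN)
                .putInt((int) value)
                .array();
    }

    // Decode four bytes written by writeUInt32 back into an unsigned value.
    public static long readUInt32(byte[] bytes) {
        int raw = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getInt();
        return raw & 0xFFFFFFFFL; // mask to treat the int as unsigned
    }

    public static void main(String[] args) {
        byte[] onDisk = writeUInt32(0x01020304L);
        System.out.println(Arrays.toString(onDisk)); // [1, 2, 3, 4]
        System.out.println(readUInt32(onDisk) == 0x01020304L); // true
    }
}
```

Because both the writer and the reader agree on the order of the bytes in the file, the machine that reads the file does not need to know anything about the machine that wrote it.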
I was reading the file format today.
http://lucene.apache.org/java/docs/fileformats.html
The index has Byte UInt32 UInt64 in most place
I've never used HTMLParser, but if you have malformed, incomplete, or
optional HTML that would otherwise choke an HTML parser, you could use
Solr's HTMLStripping:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e
It's pretty stand-alone, s
Though I'm a newbie (which means I may be completely wrong), I don't
think this is possible "out of the box". The quickest would be to
write a filter which looks up document IDs in the first index and
applies them to the second index to get the desired subset to search
over.
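A minimal sketch of that two-step idea, in plain Java rather than against the Lucene API: in-memory maps (external DocumentID to field text) stand in for the two indexes, and substring matching stands in for a real query. All names here are hypothetical and only illustrate the lookup-then-filter flow.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class IdJoinSketch {
    // Step 1: collect the external DocumentIDs of every document in the
    // first "index" whose text contains the term.
    public static Set<String> matchingIds(Map<String, String> index, String term) {
        Set<String> ids = new HashSet<>();
        for (Map.Entry<String, String> e : index.entrySet()) {
            if (e.getValue().contains(term)) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }

    // Step 2: search the second "index", keeping only hits whose
    // DocumentID was produced by step 1.
    public static List<String> searchWithin(Map<String, String> index,
                                            String term,
                                            Set<String> allowedIds) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : index.entrySet()) {
            if (allowedIds.contains(e.getKey()) && e.getValue().contains(term)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> indexA = new LinkedHashMap<>();
        indexA.put("doc1", "lucene index format");
        indexA.put("doc2", "pdf extraction");
        indexA.put("doc3", "lucene scoring");

        Map<String, String> indexB = new LinkedHashMap<>();
        indexB.put("doc1", "author smith");
        indexB.put("doc2", "author smith");
        indexB.put("doc3", "author jones");

        Set<String> allowed = matchingIds(indexA, "lucene");
        System.out.println(searchWithin(indexB, "smith", allowed)); // [doc1]
    }
}
```

With real indexes the set of IDs from the first search can get large, so how that set is built and cached would matter much more than it does in this toy version.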
I may need this too,
Here is a use case I am trying to address.
I have two separate indexes, which contain sets of the same document
pool/corpus.
The two indexes have a different set of indexed fields.
One of the indexed fields is an external DocumentID.
I would like to perform searches, like a relational join, expre
Since I cannot seem to access the HTMLParser mailing list and I saw the
library recommended here, I thought someone here that has used it
successfully can help me out.
I have HTML text stored in a database field which I want to add to a
Lucene document, but I want to remove the HTML tags, so HTM
On 7/13/06, Zhao, Xin <[EMAIL PROTECTED]> wrote:
Hi,
I am sure this is a question that has been asked before. :-) I have done some research too, but still don't quite
understand. I indexed 20 terms under field name "mesh", and set the boost accordingly from 20
to 1 (just some arbitrary numbers). But when
Hi,
I am sure this is a question that has been asked before. :-) I have done some research
too, but still don't quite understand. I indexed 20 terms under field name
"mesh", and set the boost accordingly from 20 to 1 (just some arbitrary
numbers). But when I checked the index from Luke, the boosts all app
> can't access the file:
> http://cdoronc.20m.com/tmp/indexingThreads.zip
Yes, this Web host sometimes behaves strangely when clicking a link from a
mail program. Please try to copy
cdoronc.20m.com/tmp
to the Web Browser (e.g. Firefox), click .
This should show the content of that tmp folder, inc
Bok Tomi,
What do you mean by "terms are misrepresented"? What should they be, and what
are you seeing?
> What I'm not clear on is how can I see the problematic *terms* in the list of
> terms, but not the documents they're stored in?
Are you saying that the content got indexed, but the file n
Definitely. Thanks for both the suggestions. Yes, it is 300MB (typo).
suba suresh.
Rob Staveley (Tom) wrote:
Let us know how you get on. There are a lot of people fighting very similar
battles on this list.
-Original Message-
From: Suba Suresh [mailto:[EMAIL PROTECTED]
Sent: 13 July 2
Let us know how you get on. There are a lot of people fighting very similar
battles on this list.
-Original Message-
From: Suba Suresh [mailto:[EMAIL PROTECTED]
Sent: 13 July 2006 15:30
To: java-user@lucene.apache.org
Subject: Re: Out of memory error
Thanks.
I am using the getText(PDDo
By 300MG I assume you mean 300MB.
You can also try extracting the text outside of lucene by using a
PDFBox command line app.
java org.pdfbox.ExtractText
You may need to increase the JRE memory like this:
java -Xmx512m org.pdfbox.ExtractText
OR
java -Xmx1024m org.pdfbox.ExtractText
If this is
Thanks.
I am using the getText(PDDocument) method of the PDFTextStripper. I will
try the other suggestion.
suba suresh.
Rob Staveley (Tom) wrote:
If you are using
http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument),
you are going to get
If you are using
http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#getText(org.pdfbox.pdmodel.PDDocument),
you are going to get a large String and may need a 1G heap.
If, however, you are using
http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html#writeText(org.pdf
Another option is to use Sun's free and soon-to-be open-source Java Studio
Creator 2. It's a great way to do JSF and provides an AJAX component in the
style of Google Suggest. You can hook this component up to a Lucene search and
*BOOM*...Google Suggest.
Here is a link to a "did you mean" tutorial as well (i
I am indexing different document formats with Lucene 1.9. One of the PDF
files I am indexing is 300MG. Whenever the index writer hits that file, it
stops the indexing with an "Out of Memory" exception. I am using the PDFBox
library to index. I have set the following merge factors in my code.
write
On Wed, 2006-05-24 at 13:11 +0530, Vikas Khengare wrote:
> So when I type “L” it will give me search options names which will
> start from “L”. Then when I will type “Lu” then it should give me
> options for names which are starting from “Lu”. & so on ……
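A minimal sketch of the prefix lookup described in the quoted text, using a sorted set from the JDK rather than a Lucene index. The class and method names are hypothetical, and matching is case-sensitive in this version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixSuggest {
    private final TreeSet<String> names = new TreeSet<>();

    public void add(String name) {
        names.add(name);
    }

    // Return every stored name that starts with the given prefix, in sorted
    // order. In a sorted set, all such names lie in the range
    // [prefix, prefix + '\uffff'), so one range view is enough. A name whose
    // next character is '\uffff' itself would be missed by this upper bound.
    public List<String> suggest(String prefix) {
        SortedSet<String> range = names.subSet(prefix, prefix + Character.MAX_VALUE);
        return new ArrayList<>(range);
    }

    public static void main(String[] args) {
        PrefixSuggest s = new PrefixSuggest();
        s.add("Lucene");
        s.add("Luke");
        s.add("Larry");
        s.add("Solr");
        System.out.println(s.suggest("L"));  // [Larry, Lucene, Luke]
        System.out.println(s.suggest("Lu")); // [Lucene, Luke]
    }
}
```

Each extra character typed simply narrows the range, which is why the list shrinks from "L" to "Lu".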
Vikas,
the Jira now contains code that do
I've done a bit of testing with accented characters (Croatian, to be
specific) and can't really explain what I see when I explore the index
with Luke.
I've used accented characters in directory names, file names and file contents.
Now, in the list of terms (in "Top ranking terms", "Overview" tab)
Are you using the StandardAnalyzer at indexing time?
Which one do you use at query time?
Ramesh Reddy
On Mon, 2006-07-10 at 18:37 -0700, Chris Hostetter wrote:
> : I'm storing a field in an index with that option
> : (Field.Index.UN_TOKENIZED).
>
> the key to understanding your
Yes, this can be easily done using the TokenStream class and hence getting
the BestTokens.
But of course you have to have this content in the index.
DONE
Ramesh Reddy
On Wed, 2006-07-12 at 12:43 +0100, Mike Streeton wrote:
> The simplest solution is always the best - when storing the p
can't access the file:
Forbidden
Remote Host: [62.172.205.164]
You do not have permission to access
http://cdoronc.20m.com/tmp/indexingThreads.zip
Data files must be stored on the same site they are linked from.
Thank you for using 20m.com
--
View this message in context:
http://www.nabble.
Hi,
I've been trying to adjust the weightings for my searches (thanks
Chris for his replies on that thread), and have been using
ConstantScoreQuery to even out scores from portions in my query that I
want to match but not to contribute to the ranking of that result.
I convert a BooleanQuery/Term
23 matches