In version 6.1.0 BigIntegerPoint seems to have been moved into the main
module (no longer in the sandbox).
However:
1) BigIntegerPoint seems to be a class for searching a 128-bit integer, not
for sorting. NumericDocValuesField supports long, not BigInteger, so for
sorting I used SortedDocValuesField.
2) BigIntegerPoint name
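Regarding (1), sorting via SortedDocValuesField works because Lucene compares the stored BytesRef values as unsigned bytes, so the BigInteger must first be turned into a fixed-width, order-preserving byte[]. A minimal sketch in plain Java (the 16-byte width and the helper names are my assumptions for illustration, not Lucene API):

```java
import java.math.BigInteger;
import java.util.Arrays;

public class BigIntSortableBytes {
    // Encode a BigInteger into a fixed-width byte[] whose unsigned
    // lexicographic order matches numeric order: sign-extend to WIDTH
    // bytes, then flip the sign bit so negatives sort before positives.
    static final int WIDTH = 16; // 128 bits, matching BigIntegerPoint's range

    static byte[] encode(BigInteger value) {
        byte[] twos = value.toByteArray(); // big-endian two's complement
        if (twos.length > WIDTH) {
            throw new IllegalArgumentException("value does not fit in " + WIDTH + " bytes");
        }
        byte[] out = new byte[WIDTH];
        // sign-extend: fill the leading bytes with 0x00 or 0xFF
        byte pad = (byte) (value.signum() < 0 ? 0xFF : 0x00);
        Arrays.fill(out, 0, WIDTH - twos.length, pad);
        System.arraycopy(twos, 0, out, WIDTH - twos.length, twos.length);
        out[0] ^= 0x80; // flip the sign bit
        return out;
    }

    // Unsigned lexicographic comparison, as Lucene compares BytesRef values.
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return 0;
    }
}
```

The encoded bytes can then be wrapped in a BytesRef and handed to SortedDocValuesField.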
To index into postings, use TextField (analyzes text into tokens) or
StringField (indexes entire string as one token). E.g. you could map
boolean true to StringField("true").
See BigIntegerPoint in lucene's sandbox module.
Mike McCandless
http://blog.mikemccandless.com
On Wed, Aug 10, 2016 at
Thanks for the suggestion about postings (I think you mean "postings
format"; I just found mentions on Google now :)). I am still having
difficulty finding an example of how to use postings. Is there any example
of how to use postings in code, or even just a link to one?
*Moving on to docvalues*:
in version 6.1, docvalues (like
It would make little sense to use points for a boolean field in the 1D case,
since there are only two possible values; postings would likely be faster
and use less disk space thanks to their skipping capabilities and better
doc ID compression. Even with multiple dimensions, postings might still be
a
You shouldn't need to use setNumericPrecisionStep at all: that's how
Lucene's old numerics worked.
For boolean type, Lucene will still use one byte per value when you index
it as points (or as a term) ... I don't know how to get that down to only 1
bit :)
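If one bit per value is wanted, it can at least be done on the application side before anything reaches Lucene: java.util.BitSet packs a boolean[] at eight flags per byte. A sketch in plain JDK code (how the packed bytes are then stored in Lucene, e.g. in a stored or binary docvalues field, is left to the application):

```java
import java.util.BitSet;

public class BoolPack {
    // Pack a boolean[] into a byte[] (8 flags per byte) using BitSet.
    static byte[] pack(boolean[] flags) {
        BitSet bits = new BitSet(flags.length);
        for (int i = 0; i < flags.length; i++) {
            if (flags[i]) bits.set(i);
        }
        return bits.toByteArray(); // trailing all-false bytes are trimmed
    }

    // Unpack; the original length must be stored alongside the bytes,
    // because BitSet.toByteArray() drops trailing false values.
    static boolean[] unpack(byte[] packed, int length) {
        BitSet bits = BitSet.valueOf(packed);
        boolean[] flags = new boolean[length];
        for (int i = 0; i < length; i++) {
            flags[i] = bits.get(i);
        }
        return flags;
    }
}
```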
Mike McCandless
http://blog.mikemccandle
In addition, in the previous version of my code I used
TYPE.setNumericPrecisionStep to set the precision of a number in
docvalues. Now I see it is deprecated.
So I have a similar question in this case too: is it still possible
to use less space for (byte, boolean, short, int) types?
OK, thanks, so I can use them.
But what about the boolean type? I could compress it using bits. Is there a
pack function for boolean arrays?
2016-08-10 11:25 GMT+02:00 Michael McCandless :
It's partially right!
E.g. IndexWriter will use less memory, and so you'll get better indexing
throughput with a ShortPoint and BytePoint.
But index size will be the same, because Lucene's default codec does a good
job compressing these values.
Mike McCandless
http://blog.mikemccandless.com
Sorry, but I was developing a ShortPoint and BytePoint to use less
memory space. Is that wrong?
2016-08-09 22:01 GMT+02:00 Michael McCandless :
It's best to index numerics using the new dimensional points, e.g. IntPoint.
Mike McCandless
http://blog.mikemccandless.com
On Tue, Aug 9, 2016 at 10:12 AM, Cristian Lorenzetto <
cristian.lorenze...@gmail.com> wrote:
How do I encode a short or a byte type in a BytesRef in Lucene 6.1?
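One plausible answer is to mirror the sign-bit trick Lucene's NumericUtils uses for int and long, scaled down to 16 bits: flip the sign bit and write big-endian, so that unsigned byte order matches numeric order, then wrap the array in a BytesRef. A sketch (the helper names are mine, not Lucene API; the same idea works for byte with a single output byte):

```java
public class ShortSortableBytes {
    // Encode a short into 2 bytes whose unsigned lexicographic order
    // matches numeric order: XOR the sign bit, then write big-endian.
    static byte[] encode(short value) {
        int flipped = value ^ 0x8000; // maps [-32768,32767] onto [0,65535]
        return new byte[] { (byte) (flipped >>> 8), (byte) flipped };
    }

    static short decode(byte[] bytes) {
        int raw = ((bytes[0] & 0xFF) << 8) | (bytes[1] & 0xFF);
        return (short) (raw ^ 0x8000);
    }
}
```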
Sent: Monday, March 3, 2014 12:09 PM
To: java-user@lucene.apache.org
Subject: encoding problem when retrieving document field value
Hi :)
My index (Lucene 3.5) contains a field called title. Its value is
indexed (analyzed and stored) with the WhitespaceAnalyzer and can
contain html entities such as
On Tue, Mar 4, 2014 at 4:44 AM, Jack Krupansky wrote:
> What is the hex value for that second character returned that appears to
> display as an apostrophe? Hex 92 (decimal 146) is listed as "Private Use
> 2", so who knows what it might display as.
Well, if they're dealing with HTML, then it wil
come about picking a PU Unicode
character?
-- Jack Krupansky
Hi :)
My index (Lucene 3.5) contains a field called title. Its value is
indexed (analyzed and stored) with the WhitespaceAnalyzer and can
contain html entities such as ’ or °.
My problem is that when I retrieve values from this field, some of the
html entities are missing.
For example :
Lu
Well, assuming there are no objections to the current approach, and
performance checks out, I'll try to get this into 3.1...
Mike
On Tue, Nov 10, 2009 at 4:33 AM, Benjamin Heilbrunn wrote:
Hi,
I applied
http://issues.apache.org/jira/secure/attachment/12411342/Lucene-1260.patch
That's exactly what I was looking for.
The problem is that from now on I'm on a patched version, and I'm not
very happy about breaking compatibility with the "original" jars...
So is there a chance that this p
On Mon, Nov 9, 2009 at 12:19 PM, Benjamin Heilbrunn wrote:
Indeed the most recent patch there looks almost exactly like what
you're
Hi Mike,
thanks for your reply.
After making my post I found this (without taking a deeper look):
http://issues.apache.org/jira/browse/LUCENE-1260
Looks like a solution for that problem.
Why wasn't it applied to Lucene?
Benjamin
On Mon, Nov 9, 2009 at 11:04 AM, Benjamin Heilbrunn wrote:
Hi,
I've got a problem concerning the encoding of norms.
I want to use int values (0-255) instead of float-interpreted bytes.
In my own Similarity-Class, which I use for indexing and searching, I
implemented the static methods encodeNorms, decodeNorms and
getNormDecoder.
But because they are s
Hi,
I know that a new encoding technique, PFOR, is being implemented in the
Lucene project [1]. Have you heard about the "Group Varint" encoding
technique from Google? There is a technical explanation in Jeffrey
Dean's talk, "Challenges in Building Large-Scale Inform
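For reference, the Group Varint idea mentioned here can be sketched in a few lines: a single tag byte stores the byte-lengths (1-4) of the next four integers, so the decoder avoids the per-byte continuation-bit tests of ordinary varints. A toy encoder/decoder in plain Java (illustrative only, not Lucene's or Google's implementation):

```java
import java.io.ByteArrayOutputStream;

public class GroupVarint {
    // Encode 4 ints: a tag byte holds four 2-bit length codes (bytes-1),
    // followed by each value in that many little-endian bytes.
    static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int tag = 0;
        int[] lens = new int[4];
        for (int i = 0; i < 4; i++) {
            int len = 1;
            for (int v = values[i]; (v >>>= 8) != 0; ) len++;
            lens[i] = len;
            tag |= (len - 1) << (i * 2);
        }
        out.write(tag);
        for (int i = 0; i < 4; i++) {
            for (int b = 0; b < lens[i]; b++) {
                out.write((values[i] >>> (8 * b)) & 0xFF);
            }
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] data) {
        int tag = data[0] & 0xFF;
        int[] values = new int[4];
        int pos = 1;
        for (int i = 0; i < 4; i++) {
            int len = ((tag >>> (i * 2)) & 0x3) + 1;
            int v = 0;
            for (int b = 0; b < len; b++) {
                v |= (data[pos++] & 0xFF) << (8 * b);
            }
            values[i] = v;
        }
        return values;
    }
}
```

Encoding {1, 300, 70000, 5} takes 1 tag byte plus 1+2+3+1 value bytes, and the decoder reads the four lengths from the tag in one step.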
I've been using the one in ICU for some time...
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
On Fri, Mar 27, 2009 at 2:57 PM, Zhang, Lisheng <
lisheng.zh...@broadvision.com> wrote:
Hi,
What's the best free tool for encoding detection? For example, we have
an ASCII file README.txt which needs to be indexed, but we need to
know its encoding before we can convert it to a Java String.
I saw some free tools on the market, but have no experience with any
of them yet. What i
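For a sense of what such tools do: the trivial part is checking for a byte-order mark; everything beyond that (as in ICU's CharsetDetector) is statistical guessing over byte patterns. A minimal BOM check in plain Java, as a sketch of the easy case only:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniff {
    // Trivial byte-order-mark check on the first bytes of a file.
    // Returns null when no BOM is present, which is exactly when a
    // statistical detector (e.g. ICU's CharsetDetector) is needed.
    static Charset fromBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: fall back to statistical detection
    }
}
```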
Hi Michael,
Michael McCandless wrote:
Also, this issue was just opened:
https://issues.apache.org/jira/browse/LUCENE-1419
which would make it possible for classes in the same package
(oal.index) to use their own indexing chain. With that fix, if you
make your own classes in oal.index pa
Hi,
Michael McCandless wrote:
This looks right, though you would also need to modify SegmentMerger
to read & write your new format when merging segments.
Another thing you could do is grep for "omitTf" which should touch
exactly the same places you need to touch.
Ok, thanks for the pointers.
Hi,
We are trying to modify the positional encoding of a term occurrence for
experimentation purposes. One solution we adopted is to use payloads to
store our own positional information encoding, but with this solution,
it becomes difficult to measure the increase or decrease of index size.
It
Typically when people have encoding problems in their Lucene applications,
the origin of the problem is in the way they fetch the data before
indexing it ... if you can make a String object, and System.out.println
that string and see what you expect, then handing that string to Lucene as
a field value sh
Internally Lucene deals with pure Java Strings; when writing those strings
to and reading those strings back from disk, Lucene always uses the stock
Java "modified UTF-8" format, regardless of what your file.encoding
system property may be.
Typically when people have encoding problem
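The "modified UTF-8" mentioned here is the same variant the JDK itself uses in DataOutputStream.writeUTF: a 2-byte big-endian length prefix, U+0000 written as the overlong pair 0xC0 0x80, and supplementary characters as surrogate pairs. A quick pure-JDK demonstration:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    // Serialize a string with the JDK's modified UTF-8 writer:
    // a 2-byte big-endian length prefix, then the encoded bytes,
    // with U+0000 encoded as the two bytes 0xC0 0x80 (never a raw 0x00).
    static byte[] writeModifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DataOutputStream(bytes).writeUTF(s);
        return bytes.toByteArray();
    }
}
```

The absence of embedded zero bytes is the point of the format: C-style string handling never sees a terminator inside the data.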
Hi
I want to index data with UTF-8 encoding, so when adding a field to a document
I am using the code new String(value.getBytes("utf-8")).
On the other hand, when I am going to search, I was using the same snippet of
code to convert to UTF-8, but it did not work, so finally I found somewhere
tha
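For reference, the snippet above is a no-op at best: new String(value.getBytes("utf-8")) encodes the string's chars to UTF-8 bytes and then decodes those bytes with the platform default charset, so it only "works" when that default happens to be UTF-8. A pure-JDK sketch of the difference:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // The problematic pattern: encodes to UTF-8 bytes, then decodes them
    // with the *platform default* charset -- a no-op on UTF-8 platforms,
    // mojibake everywhere else.
    static String roundTripWithDefaultCharset(String value) throws Exception {
        return new String(value.getBytes("utf-8")); // decoder charset unspecified!
    }

    // The reliable approach: always pair getBytes and new String with an
    // explicit charset, so the result is the same on every platform.
    static String roundTripExplicit(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

The same rule applies when reading files: open them with an InputStreamReader constructed with an explicit Charset rather than relying on the default.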
source code, and I need some advice on how significant my
modification work would be. What I am interested in so far is the gap encoding
scheme in Lucene, which is used in DocumentWriter.writePostings() to record
the gap positions of a term within a document. writePostings(), in turn,
calls the writeVInt() method to record the gap, which is the byte-aligned
coding scheme
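The writeVInt() format referred to here is Lucene's variable-length integer: seven payload bits per byte, low-order bits first, with the high bit set on every byte except the last. A self-contained sketch of the scheme (not Lucene's actual IndexOutput/IndexInput code):

```java
import java.io.ByteArrayOutputStream;

public class VInt {
    // Lucene-style VInt: low 7 bits per byte, high bit = "more bytes follow".
    // Values below 128 take a single byte, which is why small gaps are cheap.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Read one VInt starting at pos[0]; pos[0] is advanced past it.
    static int readVInt(byte[] data, int[] pos) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = data[pos[0]++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```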
It sounds like you may be confused by a couple of different things:
1) you are getting a parse exception because the '"' character is
meaningful to the query parser ... it thinks you are trying to do a phrase
search but you haven't finished the phrase; try escaping it with \"
2) just because you
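Point (1) can be sketched as a tiny helper that mirrors what QueryParser.escape(String) does (use the real method in application code; this illustrative version does not handle the two-character operators && and || as a unit):

```java
public class QueryEscape {
    // Characters the classic query parser treats as syntax; prefixing
    // each with a backslash makes it a literal character in the query.
    static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("1\"")); // prints 1\"
    }
}
```

So the search phrase 1" becomes 1\" before being handed to the parser, and no ParseException is thrown.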
I'm trying to search my index using this search phrase: 1"
That returns zero search results and throws a ParseException: Lexical error at
line... I can see that 1" is part of that particular document by searching
that same document using a different search term.
How should the Lucene in
Hello,
On Jan 27, 2006, at 11:44, John Haxby wrote:
I've attached the perl script -- feed
http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt to it.
Thanks! Works great!
It's based on a slightly different principle to yours. You seem to
look for things like "mumble mumble LETTER X m
petite_abeille wrote:
I would love to see this. I presently have a somewhat unwieldy
conversion table [1] that I would love to get rid of :))
[snip]
[1] http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt
I've attached the perl script -- feed
http://www.unicode.org/Public/4.1.0/u
Hello,
On Jan 26, 2006, at 12:01, John Haxby wrote:
I have a perl script here that I used to generate downgrading table
for a C program. I can let you have the perl script as is, but if
there's enough interest(*) I'll use it to generate, say,
CompoundAsciiFilter since it converts compound cha
you're indexing. As
Erik says, you need to make sure that you're reading files with the
proper encoding; removing accents and adding dots won't help.
jch
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
with the Turkish text "düzenlediğimiz kampanyamıza" the Lucene
index will contain re-encoded data with �k��
Reading encoded files is your application's responsibility. You need
to be sure to read the files in using the proper encoding. Once read
properly into Java, all will
Sent: Thursday, January 26, 2006 03:01
To: java-user@lucene.apache.org
Subject: Re: encoding
arnaudbuffet wrote:
For text files, the data could be in different languages, so different
encodings. If the data are in Turkish, for example, all special characters
and accents are not recognized in my Lucene index. Is there a way to
resolve this problem? How do I work with the encoding?
I've been lo
Hello,
I have a problem with the data I try to index with Lucene. I browse a
directory and index text from different types of files through parsers.
For text files, the data could be in different languages, so different
encodings. If the data are in Turkish, for example, all special characters
and accent
Please ignore my previous post, I have solved the problem.
Turned out that my IDE(eclipse) didn't use UTF-8 encoding by default.
Edwin
case 'é':
In this example the 'ä' character causes a problem.
I think the code is messed up because of the wrong character encoding of
the Java file.
Does anyone know if I'm correct and, more importantly, how to solve this
problem?
Thanks,
Edwin Mol