In version 6.1.0 BigIntegerPoint seems to have been moved into the main module (it is
no longer in the sandbox).
However:
1) BigIntegerPoint seems to be a class for searching 128-bit integers, not for
sorting. NumericDocValuesField supports long, not BigInteger, so for sorting I
used SortedDocValuesField (see the sketch below).
2) BigIntegerPoint name
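For point (1), here is roughly what I did. This is an untested sketch: the field
name is just an example, and the fixed-width, sign-flipped encoding is my own
assumption so that SortedDocValuesField's byte-wise order matches numeric order:

    import java.math.BigInteger;
    import java.util.Arrays;
    import org.apache.lucene.document.BigIntegerPoint; // lucene-sandbox module
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.SortedDocValuesField;
    import org.apache.lucene.util.BytesRef;

    // Encode into 16 fixed-width big-endian bytes, flipping the sign bit so that
    // unsigned byte-wise comparison matches signed numeric order.
    static BytesRef encodeForSort(BigInteger value) {
        byte[] fixed = new byte[16];
        Arrays.fill(fixed, (byte) (value.signum() < 0 ? 0xFF : 0x00)); // sign-extend
        byte[] raw = value.toByteArray(); // two's complement, variable length
        System.arraycopy(raw, 0, fixed, 16 - raw.length, raw.length);
        fixed[0] ^= 0x80; // flip the sign bit
        return new BytesRef(fixed);
    }

    // at index time:
    BigInteger v = new BigInteger("123456789012345678901234567890");
    Document doc = new Document();
    doc.add(new BigIntegerPoint("num", v));                     // for searching
    doc.add(new SortedDocValuesField("num", encodeForSort(v))); // for sorting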
To index into postings, use TextField (which analyzes text into tokens) or
StringField (which indexes the entire string as a single token). E.g. you could
map boolean true to StringField("true").
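A minimal sketch (untested; the field name is just an example):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;

    Document doc = new Document();
    // the whole value is indexed into the postings as a single token
    doc.add(new StringField("active", "true", Field.Store.NO));
    // at search time:
    TermQuery q = new TermQuery(new Term("active", "true"));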
See BigIntegerPoint in lucene's sandbox module.
Mike McCandless
http://blog.mikemccandless.com
On Wed, Aug 10, 2016 at
thanks for the suggestion about postings (I think you mean the "postings format";
I just found mentions of it on Google now :)). I still have difficulty finding an
example of how to use postings. Is there any example of how to use postings in
code? Even just a link to one?
*Moving on to docvalues*:
in version 6.1 docvalues (like
It would make little sense to use points for a boolean field in the 1D case
since there are only two possible values; postings would likely be faster
and use less disk space thanks to their skipping capabilities and better
doc ID compression. Even with multiple dimensions, postings might still be
a
You shouldn't need to use setNumericPrecisionStep at all: that's how
Lucene's old numerics worked.
For a boolean type, Lucene will still use one byte per value when you index
it as points (or as a term) ... I don't know how to get that down to only 1
bit :)
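For example, a sketch with core's BinaryPoint, which indexes exactly the bytes
you give it (the field name is just an example):

    import org.apache.lucene.document.BinaryPoint;
    import org.apache.lucene.document.Document;

    Document doc = new Document();
    doc.add(new BinaryPoint("flag", new byte[] { 1 })); // one byte, one dimension
    // exact-match query:
    // BinaryPoint.newExactQuery("flag", new byte[] { 1 })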
Mike McCandless
http://blog.mikemccandless.com
In addition, in the previous version of my code I used
TYPE.setNumericPrecisionStep to set the precision step of a number in
docvalues. Now I see it is deprecated.
So I have a similar question in this case too: is it still possible
to use less space for (byte, boolean, short, int) types?
OK, thanks, so I can do that.
But what about the boolean type? I could compress it using bits. Is there a
pack function for boolean arrays?
2016-08-10 11:25 GMT+02:00 Michael McCandless :
> It's partially right!
>
> E.g. IndexWriter will use less memory, and so you'll get better indexing
> throughput with a ShortPoint and BytePoint.
It's partially right!
E.g. IndexWriter will use less memory, and so you'll get better indexing
throughput with a ShortPoint and BytePoint.
But index size will be the same, because Lucene's default codec does a good
job compressing these values.
Mike McCandless
http://blog.mikemccandless.com
On
Sorry, but I was developing a ShortPoint and BytePoint precisely to use less
memory space. Is that wrong?
2016-08-09 22:01 GMT+02:00 Michael McCandless :
> It's best to index numerics using the new dimensional points, e.g. IntPoint.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, Au
It's best to index numerics using the new dimensional points, e.g. IntPoint.
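E.g. (untested sketch; the field name is just an example):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.IntPoint;

    Document doc = new Document();
    doc.add(new IntPoint("count", 42)); // indexed as a 1D point
    // at search time:
    // IntPoint.newExactQuery("count", 42)
    // IntPoint.newRangeQuery("count", 0, 100)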
Mike McCandless
http://blog.mikemccandless.com
On Tue, Aug 9, 2016 at 10:12 AM, Cristian Lorenzetto <
cristian.lorenze...@gmail.com> wrote:
> how do I encode a short or a byte type in a BytesRef in Lucene 6.1?
>
Hi :)
I found the source of the problem. It is indeed the input string. It
comes from a CSV export from a relational database. The InputStream of
this CSV file was read with the wrong charset (ISO8859-1 instead of
CP1252). So the right single quote was returned as this character
correspond
On Tue, Mar 4, 2014 at 4:44 AM, Jack Krupansky wrote:
> What is the hex value for that second character returned that appears to
> display as an apostrophe? Hex 92 (decimal 146) is listed as "Private Use
> 2", so who knows what it might display as.
Well, if they're dealing with HTML, then it wil
What is the hex value for that second character returned that appears to
display as an apostrophe? Hex 92 (decimal 146) is listed as "Private Use
2", so who knows what it might display as. All that is important is the
binary/hex value.
Out of curiosity, how did your application come about pic
Hi :)
I've got this result directly from tncTitle in the following code:
    field = doc.getFieldable(IndexConstants.FIELD_TNC_TITLE);
    if (field != null) {
        tncTitle = field.stringValue();
    }
PS: in my previous email, the copy/paste of the apostrophe's HTML entity number
made it appear correctly althou
Hi G. Long,
Most likely, the problem is in your application. Lucene does not change the
value stored in the index. For stored fields, Lucene does not deal with
entities, it's just binary data to Lucene. From your application perspective,
it is String in -> String out. I think maybe you strip th
I've been using the one in ICU for some time...
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
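Typical usage looks something like this (a sketch):

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;

    static String detectAndDecode(byte[] bytes) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(bytes);
        CharsetMatch match = detector.detect(); // best guess
        // match.getName() is e.g. "UTF-8" or "windows-1252"
        return match.getString(); // the input decoded with the detected charset
    }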
On Fri, Mar 27, 2009 at 2:57 PM, Zhang, Lisheng <
lisheng.zh...@broadvision.com> wrote:
> Hi,
>
> What's the best free tool for encoding detection? For example we have
> a AS
The source data for my index is already in standard UTF-8 and available as a
simple byte array. I need to do some simple tokenization of the data (check
for whitespace and special characters that control position increment). What
is the most efficient way to index this data and avoid unnecessary
c
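For reference, the straightforward baseline I am trying to improve on looks like
this (a sketch against a recent Lucene API; the field name is just an example):

    import java.io.StringReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.TextField;

    // utf8Bytes is the source byte[] described above: one decode, no detection
    String text = new String(utf8Bytes, StandardCharsets.UTF_8);
    Document doc = new Document();
    doc.add(new TextField("body", new StringReader(text)));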
Internally Lucene deals with pure Java Strings; when writing those strings
to and reading those strings back from disk, Lucene always uses the stock
Java "modified UTF-8" format, regardless of what your file.encoding
system property may be.
Typically, when people have encoding problems in their
Hello,
On Jan 27, 2006, at 11:44, John Haxby wrote:
I've attached the perl script -- feed
http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt to it.
Thanks! Works great!
It's based on a slightly different principle to yours. You seem to
look for things like "mumble mumble LETTER X m
petite_abeille wrote:
I would love to see this. I presently have a somewhat unwieldy
conversion table [1] that I would love to get rid of :))
[snip]
[1] http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt
I've attached the perl script -- feed
http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt to it.
Hello,
On Jan 26, 2006, at 12:01, John Haxby wrote:
I have a perl script here that I used to generate a downgrading table
for a C program. I can let you have the perl script as is, but if
there's enough interest(*) I'll use it to generate, say,
CompoundAsciiFilter since it converts compound cha
arnaudbuffet wrote:
if I try to index a text file encoded in Western 1252, for example with the Turkish text
"düzenlediğimiz kampanyamıza", the lucene index will contain re-encoded data with
�k��
ISOLatin1AccentFilter.removeAccents() converts that string to
"duzenlediğimiz kampanyamıza"
On Jan 26, 2006, at 7:26 PM, arnaudbuffet wrote:
I cannot find the ISOLatin1AccentFilter class in my lucene jar, but
I found one on Google, attached to this mail; could you tell me if it
is the right one?
This used to be in contrib/analyzers but has been moved into the core
(Subversion only fo
arnaudbuffet wrote:
For text files, the data could be in different languages and hence different
encodings. If the data is in Turkish, for example, the special characters and
accents are not recognized in my lucene index. Is there a way to resolve the
problem? How do I work with the encoding?
I've been looking