Re: encoding in byteref?

2016-08-18 Thread Cristian Lorenzetto
In the 6.1.0 version BigIntegerPoint seems to have been moved into the main module (no longer in sandbox). However: 1) BigIntegerPoint seems to be a class for searching a 128-bit integer, not for sorting. NumericDocValuesField supports long, not BigInteger, so for sorting I used SortedDocValuesField. 2) BigIntegerPoint name
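SortedDocValuesField orders documents by unsigned byte comparison of its BytesRef, so sorting on a BigInteger means encoding it into fixed-width bytes whose byte order matches numeric order. A minimal JDK-only sketch of such an encoding (the 16-byte width, class name, and the SortedDocValuesField usage shown in the comment are my assumptions, not from the thread):

```java
import java.math.BigInteger;
import java.util.Arrays;

public class BigIntSortEncoder {
    // Encode a BigInteger into a fixed-width big-endian array whose unsigned
    // byte-wise order matches numeric order. Values must fit in `width` bytes.
    static byte[] encode(BigInteger value, int width) {
        byte[] raw = value.toByteArray();                    // two's complement, big-endian
        byte[] out = new byte[width];
        byte pad = value.signum() < 0 ? (byte) 0xFF : 0x00;  // sign-extend into the padding
        Arrays.fill(out, 0, width - raw.length, pad);
        System.arraycopy(raw, 0, out, width - raw.length, raw.length);
        out[0] ^= 0x80;                                      // flip sign bit so negatives sort first
        return out;
    }

    public static void main(String[] args) {
        // -5 must sort before 3 under unsigned byte comparison
        byte[] neg = encode(BigInteger.valueOf(-5), 16);
        byte[] pos = encode(BigInteger.valueOf(3), 16);
        System.out.println(Arrays.compareUnsigned(neg, pos) < 0);  // true
        // Indexing would then look like:
        // doc.add(new SortedDocValuesField("myField", new BytesRef(encode(big, 16))));
    }
}
```

This mirrors the sortable-bytes idea Lucene itself uses for its numeric encodings (sign-bit flip plus fixed width).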

Re: encoding in byteref?

2016-08-11 Thread Michael McCandless
To index into postings, use TextField (which analyzes text into tokens) or StringField (which indexes the entire string as one token). E.g. you could map boolean true to StringField("true"). See BigIntegerPoint in Lucene's sandbox module. Mike McCandless http://blog.mikemccandless.com On Wed, Aug 10, 2016 at
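A minimal sketch of that suggestion (the field name "published" and Store.NO are my assumptions): index the boolean as a single untokenized term with StringField, then match it with a TermQuery:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class BooleanFieldSketch {
    public static void main(String[] args) {
        boolean published = true;
        Document doc = new Document();
        // The whole value "true"/"false" becomes one term in the postings
        doc.add(new StringField("published", Boolean.toString(published), Field.Store.NO));
        // Finding all published documents is then a plain term lookup:
        Query q = new TermQuery(new Term("published", "true"));
    }
}
```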

Re: encoding in byteref?

2016-08-10 Thread Cristian Lorenzetto
Thanks for the suggestion about postings (I think you mean "postings format"; I just found mentions on Google now :)). Anyway, I'm having difficulty finding an example of how to use postings. Any example of how to use postings in code? Just a link, for example? *Passing to docvalues*: in version 6.1 docvalues (like

Re: encoding in byteref?

2016-08-10 Thread Adrien Grand
It would make little sense to use points for a boolean field in the 1D case since there are only two possible values; postings would likely be faster and use less disk space thanks to their skipping capabilities and better doc ID compression. Even with multiple dimensions, postings might still be a

Re: encoding in byteref?

2016-08-10 Thread Michael McCandless
You shouldn't need to use setNumericPrecisionStep at all: that's how Lucene's old numerics worked. For boolean type, Lucene will still use one byte per value when you index it as points (or as a term) ... I don't know how to get that down to only 1 bit :) Mike McCandless http://blog.mikemccandle

Re: encoding in byteref?

2016-08-10 Thread Cristian Lorenzetto
In addition, in the previous version of my code I used TYPE.setNumericPrecisionStep to set the precision of a number in docvalues. Now I see it is deprecated, so I have a similar question in this case too: is it still possible to use less space for (byte, boolean, short, int) types? 2016

Re: encoding in byteref?

2016-08-10 Thread Cristian Lorenzetto
OK, thanks, so I can use them. But what about the boolean type? I could compress it using bits. Is there a pack function for boolean arrays? 2016-08-10 11:25 GMT+02:00 Michael McCandless : > It's partially right! > > E.g. IndexWriter will use less memory, and so you'll get better indexing > throughput with a ShortP

Re: encoding in byteref?

2016-08-10 Thread Michael McCandless
It's partially right! E.g. IndexWriter will use less memory, and so you'll get better indexing throughput with a ShortPoint and BytePoint. But index size will be the same, because Lucene's default codec does a good job compressing these values. Mike McCandless http://blog.mikemccandless.com On

Re: encoding in byteref?

2016-08-10 Thread Cristian Lorenzetto
Sorry, but I was developing a ShortPoint and a BytePoint to use less memory space. Is that wrong? 2016-08-09 22:01 GMT+02:00 Michael McCandless : > It's best to index numerics using the new dimensional points, e.g. IntPoint. > > Mike McCandless > > http://blog.mikemccandless.com > > On Tue, Au

Re: encoding in byteref?

2016-08-09 Thread Michael McCandless
It's best to index numerics using the new dimensional points, e.g. IntPoint. Mike McCandless http://blog.mikemccandless.com On Tue, Aug 9, 2016 at 10:12 AM, Cristian Lorenzetto < cristian.lorenze...@gmail.com> wrote: > how to encode a short or a byte type in byteRef in lucene 6.1? >
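The 6.x point API looks roughly like this (the field name "count" and the StoredField pairing are my assumptions — points are index-only, so a separate stored field is needed if you want the value back at search time):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.search.Query;

public class IntPointSketch {
    public static void main(String[] args) {
        Document doc = new Document();
        doc.add(new IntPoint("count", 42));      // BKD-tree indexed for fast search
        doc.add(new StoredField("count", 42));   // retrievable copy, if needed
        // Queries come from static factory methods on IntPoint:
        Query exact = IntPoint.newExactQuery("count", 42);
        Query range = IntPoint.newRangeQuery("count", 0, 100);
    }
}
```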

Re: encoding problem when retrieving document field value

2014-03-04 Thread G.Long
Hi :) I found the source of the problem. It is indeed the input string. It comes from a CSV export from a relational database. The InputStream of this CSV file was read with the wrong charset (ISO-8859-1 instead of CP1252), so the right single quote was returned as this character correspond
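The mix-up is easy to reproduce with the JDK alone: byte 0x92 is the right single quotation mark (U+2019) in windows-1252, but maps to the invisible C1 control U+0092 ("Private Use 2") in ISO-8859-1. A small demo (class name is mine):

```java
import java.nio.charset.Charset;

public class CharsetDemo {
    public static void main(String[] args) {
        byte[] raw = { (byte) 0x92 };  // the byte from the CSV export
        // Decoded with the wrong charset, 0x92 becomes the C1 control U+0092:
        String wrong = new String(raw, Charset.forName("ISO-8859-1"));
        // Decoded with the charset the file was actually written in, it is U+2019:
        String right = new String(raw, Charset.forName("windows-1252"));
        System.out.println((int) wrong.charAt(0)); // 146 (U+0092, "Private Use 2")
        System.out.println((int) right.charAt(0)); // 8217 (U+2019, the apostrophe)
    }
}
```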

Re: encoding problem when retrieving document field value

2014-03-03 Thread Trejkaz
On Tue, Mar 4, 2014 at 4:44 AM, Jack Krupansky wrote: > What is the hex value for that second character returned that appears to > display as an apostrophe? Hex 92 (decimal 146) is listed as "Private Use > 2", so who knows what it might display as. Well, if they're dealing with HTML, then it wil

Re: encoding problem when retrieving document field value

2014-03-03 Thread Jack Krupansky
What is the hex value for that second character returned that appears to display as an apostrophe? Hex 92 (decimal 146) is listed as "Private Use 2", so who knows what it might display as. All that matters is the binary/hex value. Out of curiosity, how did your application come about pic

Re: encoding problem when retrieving document field value

2014-03-03 Thread G.Long
Hi :) I've got this result directly from tncTitle in the following code: field = doc.getFieldable(IndexConstants.FIELD_TNC_TITLE); if (field != null) { tncTitle = field.stringValue(); } ps: in my previous email, the copy/paste of the apostrophe html number made it appear correctly althou

RE: encoding problem when retrieving document field value

2014-03-03 Thread Uwe Schindler
Hi G. Long, Most likely the problem is in your application. Lucene does not change the value stored in the index. For stored fields, Lucene does not deal with entities; it's just binary data to Lucene. From your application's perspective, it is String in -> String out. I think maybe you strip th

Re: Encoding detection free software?

2009-03-27 Thread Robert Muir
I've been using the one in ICU for some time... http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html On Fri, Mar 27, 2009 at 2:57 PM, Zhang, Lisheng < lisheng.zh...@broadvision.com> wrote: > Hi, > > What's the best free tool for encoding detection? For example we have > a AS
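Usage of that class is short; a sketch against the ICU4J API (input string and variable names are mine — and note detection is statistical, so short inputs can misdetect):

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class DetectSketch {
    public static void main(String[] args) {
        byte[] data = "some input whose encoding is unknown".getBytes();
        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);
        CharsetMatch match = detector.detect();     // best guess
        System.out.println(match.getName());        // e.g. "ISO-8859-1"
        System.out.println(match.getConfidence());  // 0-100 score
        String decoded = match.getString();         // text decoded with the guessed charset
    }
}
```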

Re: encoding question.

2007-07-19 Thread Peter Keegan
The source data for my index is already in standard UTF-8 and available as a simple byte array. I need to do some simple tokenization of the data (check for whitespace and special characters that control position increment). What is the most efficient way to index this data and avoid unnecessary c

RE: encoding question.

2007-02-14 Thread Benson Margulies
Internally Lucene deals with pure Java Strings; when writing those strings to and reading those strings back from disk, Lucene always uses the stock Java "modified UTF-8" format, regardless of what your file.encoding system property may be. typc

Re: encoding question.

2007-02-14 Thread Chris Hostetter
Internally Lucene deals with pure Java Strings; when writing those strings to and reading those strings back from disk, Lucene always uses the stock Java "modified UTF-8" format, regardless of what your file.encoding system property may be. Typically when people have encoding problems in their
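In other words, the corruption usually happens on the way in: bytes written in one charset get decoded with another before the String ever reaches Lucene. The classic symptom is reproducible with the JDK alone (class name is mine):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "\u00e9";                               // "é"
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);  // 0xC3 0xA9
        // Decoding UTF-8 bytes as Latin-1 (a common wrong platform default):
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled);                              // "Ã©"
        // Decoding with the charset the bytes were written in round-trips cleanly:
        System.out.println(new String(utf8, StandardCharsets.UTF_8)); // "é"
    }
}
```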

Re: encoding

2006-01-28 Thread petite_abeille
Hello, On Jan 27, 2006, at 11:44, John Haxby wrote: I've attached the perl script -- feed http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt to it. Thanks! Works great! It's based on a slightly different principle to yours. You seem to look for things like "mumble mumble LETTER X m

Re: encoding

2006-01-27 Thread John Haxby
petite_abeille wrote: I would love to see this. I presently have a somewhat unwieldy conversion table [1] that I would love to get rid of :)) [snip] [1] http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt I've attached the perl script -- feed http://www.unicode.org/Public/4.1.0/u

Re: encoding

2006-01-26 Thread petite_abeille
Hello, On Jan 26, 2006, at 12:01, John Haxby wrote: I have a perl script here that I used to generate downgrading table for a C program. I can let you have the perl script as is, but if there's enough interest(*) I'll use it to generate, say, CompoundAsciiFilter since it converts compound cha

Re: encoding

2006-01-26 Thread John Haxby
arnaudbuffet wrote: if I try to index a text file encoded in Western 1252, for example with the Turkish text "düzenlediğimiz kampanyamıza", the Lucene index will contain re-encoded data with �k�� ISOLatin1AccentFilter.removeAccents() converts that string to "duzenlediğimiz kampanyamıza"
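ISOLatin1AccentFilter is long gone from current Lucene (ASCIIFoldingFilter superseded it), but the same downgrading idea can be sketched with the JDK alone: decompose to NFD, then strip the combining marks. Unlike the removeAccents() result quoted above, this also folds ğ to g, while the dotless ı survives because it has no decomposition (class and method names are mine):

```java
import java.text.Normalizer;

public class FoldSketch {
    // Decompose accented characters (NFD), then drop the combining marks.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("d\u00fczenledi\u011fimiz"));  // "duzenledigimiz"
        System.out.println(fold("kampanyam\u0131za"));         // "kampanyamıza" (ı unchanged)
    }
}
```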

Re: RE : encoding

2006-01-26 Thread Erik Hatcher
On Jan 26, 2006, at 7:26 PM, arnaudbuffet wrote: I do not find the ISOLatin1AccentFilter class in my lucene jar, but I find one on google attach to this mail, could you tell me if it is the good one? This used to be in contrib/analyzers but has been moved into the core (Subversion only fo

RE : encoding

2006-01-26 Thread arnaudbuffet
PROTECTED] Sent: Thursday, 26 January 2006 03:01 To: java-user@lucene.apache.org Subject: Re: encoding arnaudbuffet wrote: >For text files, data could be in different languages so different >encodings. If data are in Turkish for example, all special characters and >accents are not recognized

Re: encoding

2006-01-26 Thread John Haxby
arnaudbuffet wrote: For text files, the data could be in different languages and so in different encodings. If the data are in Turkish, for example, none of the special characters and accents are recognized in my Lucene index. Is there a way to resolve the problem? How do I work with the encoding? I've been looking