Re: encoding in byteref?

2016-08-18 Thread Cristian Lorenzetto
in the 6.1.0 version BigIntegerPoint seems to have been moved into the main module (no longer in sandbox). However: 1) BigIntegerPoint seems to be a class for searching a 128-bit integer, not for sorting. NumericDocValuesField supports long, not BigInteger, so for sorting I used SortedDocValuesField. 2) BigIntegerPoint name

Re: encoding in byteref?

2016-08-11 Thread Michael McCandless
To index into postings, use TextField (analyzes text into tokens) or StringField (indexes entire string as one token). E.g. you could map boolean true to StringField("true"). See BigIntegerPoint in lucene's sandbox module. Mike McCandless http://blog.mikemccandless.com On Wed, Aug 10, 2016 at

Re: encoding in byteref?

2016-08-10 Thread Cristian Lorenzetto
thanks for the suggestion about postings (I think you mean "postings format"; I just found mentions on Google now :)). I still have difficulty finding an example of how to use postings. Any example of how to use postings in code? Just a link, for example? *Passing to docvalues*: in version 6.1 docvalues (like

Re: encoding in byteref?

2016-08-10 Thread Adrien Grand
It would make little sense to use points for a boolean field in the 1D case, since there are only two possible values; postings would likely be faster and use less disk space thanks to their skipping capabilities and better doc ID compression. Even with multiple dimensions, postings might still be a

Re: encoding in byteref?

2016-08-10 Thread Michael McCandless
You shouldn't need to use setNumericPrecisionStep at all: that's how Lucene's old numerics worked. For boolean type, Lucene will still use one byte per value when you index it as points (or as a term) ... I don't know how to get that down to only 1 bit :) Mike McCandless http://blog.mikemccandle

Re: encoding in byteref?

2016-08-10 Thread Cristian Lorenzetto
In addition, in the previous version of my code I used TYPE.setNumericPrecisionStep for setting the precision of a number in docvalues. Now I see it is deprecated, so I have a similar question in this case too: is it still possible to use less space for (byte, boolean, short, int) types? 2016

Re: encoding in byteref?

2016-08-10 Thread Cristian Lorenzetto
OK, thanks, so I can do that. But what about the boolean type? I could compress it using bits. Is there a pack function for boolean arrays? 2016-08-10 11:25 GMT+02:00 Michael McCandless : > It's partially right! > > E.g. IndexWriter will use less memory, and so you'll get better indexing > throughput with a ShortP
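Michael's reply left the bit-packing question open; for completeness, plain Java can pack a `boolean[]` at one bit per flag with `java.util.BitSet`. A minimal sketch (the `BoolPack` class and method names are mine, not a Lucene API):

```java
import java.util.BitSet;

public class BoolPack {
    // Pack a boolean array into a byte array, 8 flags per byte.
    static byte[] pack(boolean[] flags) {
        BitSet bits = new BitSet(flags.length);
        for (int i = 0; i < flags.length; i++) {
            if (flags[i]) bits.set(i);
        }
        return bits.toByteArray();
    }

    // Unpack; the original length must be supplied because BitSet
    // does not store trailing false bits.
    static boolean[] unpack(byte[] packed, int length) {
        BitSet bits = BitSet.valueOf(packed);
        boolean[] flags = new boolean[length];
        for (int i = 0; i < length; i++) {
            flags[i] = bits.get(i);
        }
        return flags;
    }
}
```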

Re: encoding in byteref?

2016-08-10 Thread Michael McCandless
It's partially right! E.g. IndexWriter will use less memory, and so you'll get better indexing throughput with a ShortPoint and BytePoint. But index size will be the same, because Lucene's default codec does a good job compressing these values. Mike McCandless http://blog.mikemccandless.com On

Re: encoding in byteref?

2016-08-10 Thread Cristian Lorenzetto
Sorry, but I was developing a ShortPoint and BytePoint for using less memory space. Is that wrong? 2016-08-09 22:01 GMT+02:00 Michael McCandless : > It's best to index numeric using the new dimensional points, e.g. IntPoint. > > Mike McCandless > > http://blog.mikemccandless.com > > On Tue, Au

Re: encoding in byteref?

2016-08-09 Thread Michael McCandless
It's best to index numerics using the new dimensional points, e.g. IntPoint. Mike McCandless http://blog.mikemccandless.com On Tue, Aug 9, 2016 at 10:12 AM, Cristian Lorenzetto < cristian.lorenze...@gmail.com> wrote: > how to encode a short or a byte type in byteRef in lucene 6.1? >

encoding in byteref?

2016-08-09 Thread Cristian Lorenzetto
how to encode a short or a byte type in BytesRef in Lucene 6.1?
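For reference, Lucene's point types encode integers big-endian with the sign bit flipped, so that unsigned byte-wise comparison matches numeric order. A standalone sketch of that trick for `short` (the `ShortBytes` class is hypothetical and produces a plain `byte[]` rather than an actual `BytesRef`):

```java
public class ShortBytes {
    // Encode a short as 2 bytes, big-endian with the sign bit flipped,
    // so unsigned byte-wise comparison matches numeric order (the same
    // trick Lucene's point encoding uses for int/long).
    static byte[] encode(short v) {
        int u = (v & 0xFFFF) ^ 0x8000;  // flip the sign bit
        return new byte[] { (byte) (u >>> 8), (byte) u };
    }

    static short decode(byte[] b) {
        int u = ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
        return (short) (u ^ 0x8000);
    }
}
```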

Re: encoding problem when retrieving document field value

2014-03-04 Thread G.Long
nt: Monday, March 3, 2014 12:09 PM To: java-user@lucene.apache.org Subject: encoding problem when retrieving document field value Hi :) My index (Lucene 3.5) contains a field called title. Its value is indexed (analyzed and stored) with the WhitespaceAnalyzer and can contains html entities such as

Re: encoding problem when retrieving document field value

2014-03-03 Thread Trejkaz
On Tue, Mar 4, 2014 at 4:44 AM, Jack Krupansky wrote: > What is the hex value for that second character returned that appears to > display as an apostrophe? Hex 92 (decimal 146) is listed as "Private Use > 2", so who knows what it might display as. Well, if they're dealing with HTML, then it wil

Re: encoding problem when retrieving document field value

2014-03-03 Thread Jack Krupansky
come about picking a PU Unicode character? -- Jack Krupansky -Original Message- From: G.Long Sent: Monday, March 3, 2014 12:09 PM To: java-user@lucene.apache.org Subject: encoding problem when retrieving document field value Hi :) My index (Lucene 3.5) contains a field called title. It

Re: encoding problem when retrieving document field value

2014-03-03 Thread G.Long
p://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: G.Long [mailto:jde...@gmail.com] Sent: Monday, March 03, 2014 6:09 PM To: java-user@lucene.apache.org Subject: encoding problem when retrieving document field value Hi :) My index (Lucene 3.5) contains a field called title. It

RE: encoding problem when retrieving document field value

2014-03-03 Thread Uwe Schindler
M > To: java-user@lucene.apache.org > Subject: encoding problem when retrieving document field value > > Hi :) > > My index (Lucene 3.5) contains a field called title. Its value is indexed > (analyzed and stored) with the WhitespaceAnalyzer and can contains html > entities such as ’ or

encoding problem when retrieving document field value

2014-03-03 Thread G.Long
Hi :) My index (Lucene 3.5) contains a field called title. Its value is indexed (analyzed and stored) with the WhitespaceAnalyzer and can contain html entities such as &#146; or &#176; My problem is that when I retrieve values from this field, some of the html entities are missing. For example : Lu
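Context for the replies below: code point 146 is a C1 control in Unicode, but Windows-1252 assigns that byte to U+2019 (right single quotation mark), and browsers apply that mapping to numeric references like `&#146;`. An indexer can do the same; a sketch (`EntityFix` is a made-up name):

```java
public class EntityFix {
    // Map a numeric character reference in the 128..159 range through
    // Windows-1252, the way browsers do; &#146; (0x92) becomes U+2019.
    static int fixCodePoint(int cp) {
        if (cp >= 128 && cp <= 159) {
            byte[] raw = { (byte) cp };
            return new String(raw,
                    java.nio.charset.Charset.forName("windows-1252")).codePointAt(0);
        }
        return cp;  // everything else is already a valid code point
    }
}
```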

Re: Change norm encoding

2009-11-10 Thread Michael McCandless
Well, assuming there are no objections to the current approach, and performance checks out, I'll try to get this into 3.1... Mike On Tue, Nov 10, 2009 at 4:33 AM, Benjamin Heilbrunn wrote: > Hi, > > I applied > http://issues.apache.org/jira/secure/attachment/12411342/Lucene-1260.patch > That's

Re: Change norm encoding

2009-11-10 Thread Benjamin Heilbrunn
Hi, I applied http://issues.apache.org/jira/secure/attachment/12411342/Lucene-1260.patch That's exactly what I was looking for. The problem is that from now on I'm on a patched version, and I'm not very happy with breaking compatibility with the "original" jars... So is there a chance that this p

Re: Change norm encoding

2009-11-09 Thread Michael McCandless
On Mon, Nov 9, 2009 at 12:19 PM, Benjamin Heilbrunn wrote: > After making my post i found this (without taking a deeper look): > > http://issues.apache.org/jira/browse/LUCENE-1260 > > Looks like a solution for that problem. Indeed the most recent patch there looks almost exactly like what you're

Re: Change norm encoding

2009-11-09 Thread Benjamin Heilbrunn
Hi Mike, thanks for your reply. After making my post I found this (without taking a deeper look): http://issues.apache.org/jira/browse/LUCENE-1260 Looks like a solution for that problem. Why wasn't it applied to Lucene? Benjamin -

Re: Change norm encoding

2009-11-09 Thread Michael McCandless
On Mon, Nov 9, 2009 at 11:04 AM, Benjamin Heilbrunn wrote: > i've got a problem concerning encoding of norms. > I want to use int values (0-255) instead of float interpreted bytes. > > In my own Similarity-Class, which I use for indexing and searching, I > implemente

Change norm encoding

2009-11-09 Thread Benjamin Heilbrunn
Hi, I've got a problem concerning the encoding of norms. I want to use int values (0-255) instead of float-interpreted bytes. In my own Similarity class, which I use for indexing and searching, I implemented the static methods encodeNorms, decodeNorms and getNormDecoder. But because they are s
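A hypothetical replacement pair that stores the 0-255 int verbatim in the norm byte, rather than Lucene's lossy 8-bit float format, might look like this (`IntNorms` is my name; wiring it into a Similarity subclass is the part the LUCENE-1260 patch discussed below addresses):

```java
public class IntNorms {
    // Store an int in 0..255 directly in the norm byte, instead of the
    // lossy float-to-byte encoding Lucene uses by default.
    static byte encode(int value) {
        if (value < 0 || value > 255) {
            throw new IllegalArgumentException("out of range: " + value);
        }
        return (byte) value;
    }

    static int decode(byte b) {
        return b & 0xFF;  // reinterpret the byte as unsigned
    }
}
```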

Posting List Encoding: Group Varint Encoding

2009-05-05 Thread Renaud Delbru
Hi, I know that a new encoding technique, PFOR, is being implemented in the Lucene project [1]. Have you heard about the "Group Varint" encoding technique from Google ? There is a technical explanation in the talk of Jeffrey Dean, "Challenges in Building Large-Scale Inform
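For readers unfamiliar with the technique: Group Varint encodes four integers at a time behind a single tag byte holding four 2-bit length codes, so decoding needs no per-byte continuation-bit tests. A minimal sketch (this is my illustration of the scheme from the talk, not the PFOR work referenced above; `GroupVarint` is a made-up name):

```java
import java.io.ByteArrayOutputStream;

public class GroupVarint {
    // Encode exactly four ints: one tag byte stores four 2-bit length
    // codes (bytes-1 per value), followed by each value little-endian.
    static byte[] encode(int[] v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int tag = 0;
        byte[][] bodies = new byte[4][];
        for (int i = 0; i < 4; i++) {
            int n = numBytes(v[i]);
            tag |= (n - 1) << (i * 2);
            bodies[i] = new byte[n];
            for (int j = 0; j < n; j++) bodies[i][j] = (byte) (v[i] >>> (8 * j));
        }
        out.write(tag);
        for (byte[] b : bodies) out.write(b, 0, b.length);
        return out.toByteArray();
    }

    static int numBytes(int v) {
        if ((v >>> 8) == 0) return 1;
        if ((v >>> 16) == 0) return 2;
        if ((v >>> 24) == 0) return 3;
        return 4;
    }

    static int[] decode(byte[] data) {
        int tag = data[0] & 0xFF, pos = 1;
        int[] v = new int[4];
        for (int i = 0; i < 4; i++) {
            int n = ((tag >>> (i * 2)) & 3) + 1;
            for (int j = 0; j < n; j++) v[i] |= (data[pos++] & 0xFF) << (8 * j);
        }
        return v;
    }
}
```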

Re: Encoding detection free software?

2009-03-27 Thread Robert Muir
I've been using the one in ICU for some time... http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html On Fri, Mar 27, 2009 at 2:57 PM, Zhang, Lisheng < lisheng.zh...@broadvision.com> wrote: > Hi, > > What's the best free tool for encoding detection

Encoding detection free software?

2009-03-27 Thread Zhang, Lisheng
Hi, What's the best free tool for encoding detection? For example we have an ASCII file README.txt, which needs to be indexed, but we need to know its encoding before we can convert it to a Java String. I saw some free tools on the market, but have no experience with any of them yet. What i
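Robert's ICU suggestion above is the robust route; as an illustration of what detection involves, a crude pure-Java heuristic can check for a BOM and then test strict UTF-8 validity (`SniffEncoding` is a made-up name and this is far weaker than ICU's CharsetDetector):

```java
import java.io.CharConversionException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class SniffEncoding {
    // Guess an encoding: a UTF-8 BOM, or bytes that decode as strict
    // UTF-8, mean UTF-8; otherwise fall back to ISO-8859-1.
    static String guess(byte[] data) {
        if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            dec.decode(ByteBuffer.wrap(data));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return "ISO-8859-1";
        }
    }
}
```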

Re: Modification of positional information encoding

2008-10-15 Thread Michael McCandless
Renaud Delbru wrote: Hi Michael, Michael McCandless wrote: Also, this issue was just opened: https://issues.apache.org/jira/browse/LUCENE-1419 which would make it possible for classes in the same package (oal.index) to use their own indexing chain. With that fix, if you make your ow

Re: Modification of positional information encoding

2008-10-14 Thread Renaud Delbru
Hi Michael, Michael McCandless wrote: Also, this issue was just opened: https://issues.apache.org/jira/browse/LUCENE-1419 which would make it possible for classes in the same package (oal.index) to use their own indexing chain. With that fix, if you make your own classes in oal.index pa

Re: Modification of positional information encoding

2008-10-13 Thread Renaud Delbru
Hi, Michael McCandless wrote: This looks right, though you would also need to modify SegmentMerger to read & write your new format when merging segments. Another thing you could do is grep for "omitTf" which should touch exactly the same places you need to touch. Ok, thanks for the pointers.

Re: Modification of positional information encoding

2008-10-13 Thread Michael McCandless
Renaud Delbru wrote: Hi, We are trying to modify the positional encoding of a term occurrence for experimentation purposes. One solution we adopted is to use payloads to store our own positional information encoding, but with this solution, it becomes difficult to measure the increase or

Modification of positional information encoding

2008-10-13 Thread Renaud Delbru
Hi, We are trying to modify the positional encoding of a term occurrence for experimentation purposes. One solution we adopted is to use payloads to store our own positional information encoding, but with this solution, it becomes difficult to measure the increase or decrease of index size. It

Re: encoding question.

2007-07-19 Thread Peter Keegan
ally when people have encoding problems in their lucene applications, the origin of the problem is in the way they fetch the data before indexing it ... if you can make a String object, and System.out.println that string and see what you expect, then handing that string to Lucene as a field value sh

RE: encoding question.

2007-02-14 Thread Benson Margulies
@lucene.apache.org Subject: Re: encoding question. Internally Lucene deals with pure Java Strings; when writing those strings to and reading those strings back from disk, Lucene always uses the stock Java "modified UTF-8" format, regardless of what your file.encoding system property may be. typc

Re: encoding question.

2007-02-14 Thread Chris Hostetter
Internally Lucene deals with pure Java Strings; when writing those strings to and reading those strings back from disk, Lucene always uses the stock Java "modified UTF-8" format, regardless of what your file.encoding system property may be. Typically when people have encoding problem
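Java's "modified UTF-8" mentioned here is the format `DataOutput.writeUTF` produces; it differs from standard UTF-8 in that NUL is written as the two bytes `C0 80` (and supplementary characters as encoded surrogate pairs). A small demonstration (the class name is mine):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ModifiedUtf8Demo {
    // Return the modified-UTF-8 bytes writeUTF produces for a string,
    // with the 2-byte length prefix stripped for easy comparison.
    static byte[] writeModified(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            byte[] withLength = bos.toByteArray();
            byte[] body = new byte[withLength.length - 2];
            System.arraycopy(withLength, 2, body, 0, body.length);
            return body;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Standard UTF-8 would encode the same `"a\0"` string in two bytes; modified UTF-8 needs three.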

encoding question.

2007-02-13 Thread Mohammad Norouzi
Hi, I want to index data with utf-8 encoding, so when adding a field to a document I am using the code new String(value.getBytes("utf-8")). On the other hand, when I am going to search I was using the same snippet of code to convert to utf-8, but it did not work, so finally I found somewhere tha
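As the replies above note, the `new String(value.getBytes("utf-8"))` idiom is a no-op at best: `getBytes` produces UTF-8 bytes, but the no-charset String constructor decodes them with the platform default, corrupting the text whenever that default isn't UTF-8. Encoding matters only at the byte-to-char boundary; a sketch of reading UTF-8 input correctly (`Utf8Read` is a made-up name):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8Read {
    // Read a stream as UTF-8 text. Once you have a String it is already
    // Unicode; there is nothing left to "convert to UTF-8".
    static String readUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) sb.append(buf, 0, n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```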

Advices on a replacement of Lucene gap encoding scheme?

2007-02-01 Thread Thang Luong Minh
ource code, and I need some advice on how significant my modification work would be. What I am interested in so far is the gap encoding scheme in Lucene which is used in DocumentWriter.writePostings() to record the gap positions of a term within a document. The writePostings(), in turn, calls the writ

Advices on a replacement of Lucene gap encoding scheme?

2007-02-01 Thread Thang Luong Minh
would be. What I am interested in so far is the gap encoding scheme in Lucene which is used in DocumentWriter.writePostings() to record the gap positions of a term within a document. The writePostings(), in turn, calls the writeVInt() method to record the gap, which is the byte-aligned coding scheme
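The scheme under discussion, delta gaps written as byte-aligned VInts (7 payload bits per byte, high bit meaning "more bytes follow"), can be sketched standalone like this (`GapVInt` is my name, not Lucene code):

```java
import java.io.ByteArrayOutputStream;

public class GapVInt {
    // Delta-encode ascending positions and write each gap as a VInt,
    // the byte-aligned scheme writeVInt() uses: low 7 bits carry data,
    // a set high bit means another byte follows.
    static byte[] encode(int[] positions) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int p : positions) {
            int gap = p - prev;
            prev = p;
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }
}
```

Small gaps cost one byte each, which is why frequent terms with dense positions compress well under this scheme.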

Re: Question regarding URL encoding

2006-07-17 Thread Chris Hostetter
it sounds like you may be confused by a couple of different things: 1) you are getting a parse exception because the '"' character is meaningful to the query parser ... it thinks you are trying to do a phrase search but you haven't finished the phrase, try escaping it with \" 2) just because you

Question regarding URL encoding

2006-07-17 Thread Van Nguyen
I'm trying to search my index using this search phrase: 1" That returns zero search results and throws a ParseException: Lexical error at line... I can see that 1" is part of that particular document by searching that same document using a different search term. How should the Lucene in
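As Chris notes in the reply above, the fix is to backslash-escape the quote. Lucene's classic QueryParser ships an `escape(String)` helper for exactly this; a minimal standalone version of the idea (`QueryEscape` is a made-up name and the character list only approximates the parser's real syntax set):

```java
public class QueryEscape {
    // Backslash-escape characters the classic query parser treats as
    // syntax, so user input like 1" parses as a literal term.
    static String escape(String s) {
        String special = "+-!():^[]\"{}~*?|&/\\";
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (special.indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }
}
```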

Re: encoding

2006-01-28 Thread petite_abeille
Hello, On Jan 27, 2006, at 11:44, John Haxby wrote: I've attached the perl script -- feed http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt to it. Thanks! Works great! It's based on a slightly different principle to yours. You seem to look for things like "mumble mumble LETTER X m

Re: encoding

2006-01-27 Thread John Haxby
petite_abeille wrote: I would love to see this. I presently have a somewhat unwieldy conversion table [1] that I would love to get rid of :)) [snip] [1] http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt I've attached the perl script -- feed http://www.unicode.org/Public/4.1.0/u
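On the JVM side, much of such a downgrading table can be generated at runtime instead: normalize to NFD and strip combining marks (`AsciiFold` is a made-up name; note that characters with no canonical decomposition, such as Turkish dotless i, pass through unchanged):

```java
import java.text.Normalizer;

public class AsciiFold {
    // Decompose to NFD ("LETTER X + combining mark") and drop the
    // marks; a tiny stand-in for a generated conversion table.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
    }
}
```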

Re: encoding

2006-01-26 Thread petite_abeille
Hello, On Jan 26, 2006, at 12:01, John Haxby wrote: I have a perl script here that I used to generate downgrading table for a C program. I can let you have the perl script as is, but if there's enough interest(*) I'll use it to generate, say, CompoundAsciiFilter since it converts compound cha

Re: encoding

2006-01-26 Thread John Haxby
u're indexing. As Erik says, you need to make sure that you're reading files with the proper encoding and removing accent and adding dots won't help. jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]

Re: RE : encoding

2006-01-26 Thread Erik Hatcher
with the Turkish text "düzenlediğimiz kampanyamıza" the lucene index will contain re-encoded data with �k�� Reading encoded files is your application's responsibility. You need to be sure to read the files in using the proper encoding. Once read properly into Java all will

RE : encoding

2006-01-26 Thread arnaudbuffet
PROTECTED] Sent: Thursday, January 26, 2006 03:01 To: java-user@lucene.apache.org Subject: Re: encoding arnaudbuffet wrote: >For text files, data could be in different languages so different >encodings. If data are in Turkish for example, all special characters and >accents are not recognized

Re: encoding

2006-01-26 Thread John Haxby
arnaudbuffet wrote: For text files, data could be in different languages so different encodings. If data are in Turkish for example, all special characters and accents are not recognized in my lucene index. Is there a way to resolve the problem? How do I work with the encoding? I've been lo

encoding

2006-01-26 Thread arnaudbuffet
Hello, I've a problem with data I try to index with lucene. I browse a directory and index text from different types of files through parsers. For text files, data could be in different languages so different encodings. If data are in Turkish for example, all special characters and accent

Re: International Stemmers and Character Encoding

2005-06-11 Thread Edwin Mol
Please ignore my previous post, I have solved the problem. Turned out that my IDE(eclipse) didn't use UTF-8 encoding by default. Edwin - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [

International Stemmers and Character Encoding

2005-06-11 Thread Edwin Mol
case 'é':: In this example the 'ä' character causes a problem. I think the code is messed up because of the wrong character encoding of the Java file. Does anyone know if I'm correct and, more importantly, how to solve this problem. Thanks, Edwin Mol -
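One portable fix for this class of problem is to keep source files pure ASCII and write non-ASCII characters as `\u` escapes, which the compiler resolves before file-encoding settings can interfere (`EscapeDemo` is a made-up name):

```java
public class EscapeDemo {
    // '\u00E9' is é written portably: the escape survives any source
    // file encoding, unlike a literal é saved in the wrong charset.
    static boolean isEAcute(char c) {
        return c == '\u00E9';
    }
}
```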