Lucene 3.4 : shift bug in possibly invalid use of NumericTokenStream
I got this exception while indexing with Lucene 3.4:

Exception in thread "Thread-0" java.lang.IllegalArgumentException: Illegal shift value, must be 0..31
    at org.apache.lucene.util.NumericUtils.intToPrefixCoded(NumericUtils.java:157)
    at org.apache.lucene.analysis.NumericTokenStream.incrementToken(NumericTokenStream.java:217)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:185)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:278)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:766)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2067)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2041)
    at com.adxpose.affinity.IndexerHelper.index(IndexerHelper.java:797)
    at com.adxpose.affinity.IndexerHelper$Clerk.run(IndexerHelper.java:433)
    at java.lang.Thread.run(Thread.java:662)

It is not clear to me why NumericTokenStream is being called here, as my analyzer does not use it. Any clues much appreciated.

thx,
thushara
Re: Lucene 3.4 : shift bug in possibly invalid use of NumericTokenStream
Yes, there is one. This is how the field is being created:

    new NumericField("timestamp", Field.Store.NO, true);

Thus, the field is not stored, but indexed.

thx,
thushara

On Fri, Dec 16, 2011 at 3:28 PM, Uwe Schindler wrote:
> Do you have NumericFields? If yes, how are they configured?
Re: Lucene 3.4 : shift bug in possibly invalid use of NumericTokenStream
Yes, I use this field to set a timestamp (an int). And I'm not using the special constructor, so I must be using the default precision step.

Java version: 1.6.0_24

mpire@seafcmr16:~$ java -version
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

Also: I have only seen this when multiple threads within the app are writing to a single Lucene index. But it is rare.

I'm attaching the indexing code.

Could you also point me to the JVM bug you suspect to be the cause?

thx,
thushara

On Fri, Dec 16, 2011 at 4:07 PM, Uwe Schindler wrote:
> Hi,
>
> Thanks, this *may* cause the exception, but it is impossible that the
> exception stack trace you are posting occurs in Lucene's code with a
> default precision step on a numeric field, as you use here. I assume it's
> a 32bit integer (NumericField.setIntValue or setFloatValue)?
>
> Please provide us your full Java version (java -version) and ideally the
> full source code you use during indexing. The only chance you can get this
> Exception is by a JVM bug.
Re: Lucene 3.4 : shift bug in possibly invalid use of NumericTokenStream
This is difficult to repro. I'm not using any JVM flags.

It does seem that the following code could never call NumericUtils.intToPrefixCoded with a shift > 31 (or shift < 0), so I tend to agree this must be a JVM bug.

Looking through all logs I have for December, I only found one instance of this issue. It seems it has nothing to do with concurrency; in that case it must have to do with the value set in the NumericField, so the bug must be triggered by a particular timestamp.

from: http://javasourcecode.org/html/open-source/lucene/lucene-3.3.0/org/apache/lucene/analysis/NumericTokenStream.java.html

    public boolean incrementToken() {
      if (valSize == 0)
        throw new IllegalStateException("call set???Value() before usage");
      if (shift >= valSize)
        return false;

      clearAttributes();
      final char[] buffer;
      switch (valSize) {
        case 64:
          buffer = termAtt.resizeBuffer(NumericUtils.BUF_SIZE_LONG);
          termAtt.setLength(NumericUtils.longToPrefixCoded(value, shift, buffer));
          break;
        case 32:
          buffer = termAtt.resizeBuffer(NumericUtils.BUF_SIZE_INT);
          termAtt.setLength(NumericUtils.intToPrefixCoded((int) value, shift, buffer));
          break;
        default:
          // should not happen
          throw new IllegalArgumentException("valSize must be 32 or 64");
      }

      typeAtt.setType((shift == 0) ? TOKEN_TYPE_FULL_PREC : TOKEN_TYPE_LOWER_PREC);
      posIncrAtt.setPositionIncrement((shift == 0) ? 1 : 0);
      shift += precisionStep;
      return true;
    }

On Sun, Dec 18, 2011 at 2:50 PM, Uwe Schindler wrote:
> Hi,
>
> Can you try 1.6.0_29 or disable hotspot by using the "-Xint" JVM startup
> flag (just to test, I know, it's slow then)? Are you *not* using
> "-XX:+AggressiveOpts" as JVM parameter?
>
> The JVM bug which may lead to this is a sign-flip bug:
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5091921 (see also
> http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2011-March/004942.html)
>
> Otherwise, is all fine if you remove the numeric field? The code you are
> using can never cause such behavior, this is extensively tested.
>
> Uwe
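The claim that incrementToken() can never pass a shift outside 0..31 when used from a single thread can be checked by tracing its loop by hand. The sketch below is stand-alone Java and does not call Lucene; the class and method names are made up for illustration, and the precision step of 4 is an assumption about the default in effect here.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone trace of the shift bookkeeping in incrementToken(); not Lucene code.
class ShiftTrace {

    // Returns every shift value that would be handed to the prefix encoder
    // when one consumer drains the stream on its own.
    static List<Integer> shifts(int valSize, int precisionStep) {
        List<Integer> seen = new ArrayList<Integer>();
        int shift = 0;                 // state after reset()
        while (shift < valSize) {      // mirrors "if (shift >= valSize) return false;"
            seen.add(shift);           // the value passed to intToPrefixCoded
            shift += precisionStep;    // mirrors "shift += precisionStep;"
        }
        return seen;
    }

    public static void main(String[] args) {
        // 32-bit value, assumed precision step of 4
        System.out.println(shifts(32, 4)); // prints [0, 4, 8, 12, 16, 20, 24, 28]
    }
}
```

Under these assumptions the largest shift ever used is 28, so an "Illegal shift value" can only arise if something changes the shift between the bounds check and the encoding call.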
Re: Lucene 3.4 : shift bug in possibly invalid use of NumericTokenStream
Actually, a single timestamp field is being used by several threads. Sorry, I missed that, and thanks to both Peter and Uwe for the explanations.

[In my code snippet I was trying to simplify, so I missed this. I'm constructing one timestamp field and passing it to all threads in the ctor.]

On Mon, Dec 19, 2011 at 5:07 AM, Uwe Schindler wrote:
> Hi,
>
> NumericUtils is an internal implementation class, you should not use it.
> What do you want to do? There is no need to call any of its methods during
> indexing or searching. Everything else is advanced. In the latter case you
> should RTFM of BytesRef and related classes (possibly watch the flexible
> indexing talk done by me in Berlin, Barcelona or San Francisco). Lucene
> moved to binary terms in 4.0 and no longer uses character based terms, so
> the code is different. BytesRef is just a wrapper around a byte[].
>
> Uwe
>
> > -----Original Message-----
> > From: Peter Karich
> > Sent: Monday, December 19, 2011 1:40 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Lucene 3.4 : shift bug in possibly invalid use of
> > NumericTokenStream
> >
> > BTW: how can I use NumericUtils.longToPrefixCoded in 4.0 ?
> >
> > Peter.
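To illustrate why one field instance shared by several threads could produce the "Illegal shift value" error, here is a minimal stand-alone sketch. It is an assumption about the failure mode, not a confirmed diagnosis: the class below is a hypothetical stand-in that only imitates the shift bookkeeping of a token stream, and the interleaving is replayed by hand rather than with real threads so that it is deterministic.

```java
// Hypothetical stand-in for a token stream with mutable per-instance state.
// It imitates only the shift bookkeeping; it is not the Lucene class.
class SharedStreamSketch {
    static final int VAL_SIZE = 32;       // 32-bit value, as for an int timestamp
    static final int PRECISION_STEP = 4;  // assumed default precision step

    int shift = 0;

    // First half of incrementToken(): the bounds check.
    boolean hasMore() {
        return shift < VAL_SIZE;
    }

    // Second half: encode at the current shift, then advance.
    int emit() {
        if (shift < 0 || shift > 31) {
            throw new IllegalArgumentException("Illegal shift value, must be 0..31");
        }
        int used = shift;
        shift += PRECISION_STEP;
        return used;
    }

    // Replays one unlucky schedule: consumer A checks, consumer B emits, A emits.
    // Returns true if A's emit fails with the illegal-shift exception.
    static boolean unluckySchedule(SharedStreamSketch a, SharedStreamSketch b) {
        a.shift = 28;                        // both consumers are at the last token
        b.shift = 28;
        boolean aMayProceed = a.hasMore();   // A passes the check (28 < 32)
        b.emit();                            // B emits; if b == a, shift is now 32
        try {
            if (aMayProceed) {
                a.emit();                    // A encodes with whatever shift is now
            }
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        SharedStreamSketch shared = new SharedStreamSketch();
        System.out.println("shared instance fails: " + unluckySchedule(shared, shared));
        System.out.println("separate instances fail: "
                + unluckySchedule(new SharedStreamSketch(), new SharedStreamSketch()));
    }
}
```

In the indexer this would correspond to each thread constructing its own timestamp NumericField instead of receiving a single one through the constructor.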
lucene gosen diff btn jars
I'm testing lucene-gosen for Japanese tokenization and wondering what the differences are between the two jars provided (ipadic / chasen). In my preliminary testing, I'm not seeing any difference in tokenization between these two jars. (The jar with no dictionary did not work; I assume I need to make a custom dictionary, header.sen, available, which I did not try.)

I tried to tokenize this phrase:

ゴルフが大好きなあなた。 アメリカにあるベスト・ゴルフコース情報が満載のイエローページ・ジャパンでは、オンラインまたはガイド・ブックからもあらゆる情報が簡単に入手できます。 詳しい情報は

which Google translates as:

You love golf. Best golf course information in the United States is in the Yellow Pages Japan is full of, any information can be obtained easily from online or book guide. For more information

I'm getting identical tokenization from both jars, namely:

ゴルフ / Golf
大好き / I love
あなた / You
アメリカ / America
ベスト / best
ゴルフコース / Golf course
情報 / information
満載 / save
イエロ / Hierro
ページ / page
ジャパン / Japan
オンライン / online
ガイド / guide
ブック / book
あらゆる / all
情報 / information
簡単 / simple
入手 / obtaining
できる / able to
詳しい / detailed
情報 / information

Note: translations are based on Google Translate.

Any pointers you can provide as to the difference between the two methods of tokenizing would be highly appreciated.

thx,
thushara