Hi Uwe, thanks a lot, I will try that.
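
For anyone reading this thread later, the precision loss described in the quoted messages below can be reproduced without an index. This is only a minimal sketch under two assumptions: that the default similarity computes the length norm as 1/sqrt(number of terms), and that it stores the norm in one byte with 3 mantissa bits (the scheme I believe SmallFloat.floatToByte315 implements in Lucene 4.x). The class and method names here are made up for illustration and are not Lucene API:

```java
// Pure-Java illustration with no Lucene dependency.
// ASSUMPTION: encode/decode mirror the 1-byte norm encoding used by the
// default similarity in Lucene 4.x; check the Lucene source before relying on it.
class NormPrecisionDemo {

    // ASSUMPTION: length norm = 1 / sqrt(number of terms), no boost applied.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Squeeze a float into one byte: 3 mantissa bits, 5 exponent bits.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ((63 - 15) << 3)) {
            // Too small to represent: zero/negative -> 0, otherwise smallest value.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ((63 - 15) << 3) + 0x100) {
            // Too large to represent: clamp to the largest value.
            return -1;
        }
        return (byte) (small - ((63 - 15) << 3));
    }

    // Expand the stored byte back into a float.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // Field lengths that occur in the example documents: 2 to 5 terms.
        for (int terms = 2; terms <= 5; terms++) {
            float exact = lengthNorm(terms);
            byte stored = encode(exact);
            System.out.println(terms + " terms: exact=" + exact
                    + " stored byte=" + stored
                    + " decoded=" + decode(stored));
        }
    }
}
```

Under these assumptions, fields with three and four terms both decode to a norm of 0.5, which would be consistent with docs 3, 4, 5 and 6 sharing one score. A two-term field decodes to 0.625, and 0.5 / 0.625 = 0.8 is close to the ratio between 0.5981058 and 0.74763227 in the output below.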

Uwe Schindler wrote
> Hi andy,
>
> unfortunately, that is not easy to show with one simple code example. You
> have to change the Similarity used.
>
> Before you start, you should be sure that this affects your users. The
> example you gave shows very short documents. Lucene is optimized to handle
> larger documents; for short documents, the document statistics do not
> behave in an ideal way - that's the main issue here. Instead of trying to
> change the very basic Lucene statistics, you should first verify that this
> affects a large part of your user queries and documents, not just this
> example, which looks like a special case. Otherwise it is not an option.
>
> Please read the Lucene documentation on how to change the similarity,
> specifically the length norm, while indexing/searching:
> http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/package-summary.html#changingScoring
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@
>
>> -----Original Message-----
>> From: andy [mailto:yhlweb@]
>> Sent: Wednesday, February 12, 2014 10:53 AM
>> To: java-user@.apache
>> Subject: RE: Length of the filed does not affect the doc score accurately
>> for chinese analyzer(SmartChineseAnalyzer)
>>
>> Thanks Uwe, could you please give me a more detailed example of how to
>> change the Lucene behavior?
>>
>> Uwe Schindler wrote
>> > Hi Erick,
>> >
>> > a statement like "Adding &debug=all to the query will show you if
>> > this is the case" will not help a Lucene user, as it is only available
>> > in the Solr server. But Andy uses Lucene directly. In his case he
>> > should use IndexSearcher's explain functionality to retrieve a
>> > structured output of how the documents are scored for this query, for
>> > debugging:
>> >
>> > http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query, int)
>> >
>> > But yes, the length norm is encoded with loss of precision in Lucene
>> > (it is a float value encoded to only 1 byte). With Lucene 4 there are
>> > ways to change that behavior, but that involves changing the
>> > similarity implementation and using a different DocValues type for
>> > encoding the norms. In most cases this is not needed, because users
>> > won't notice.
>> >
>> > Uwe
>> >
>> >> -----Original Message-----
>> >> From: Erick Erickson [mailto:erickerickson@]
>> >> Sent: Wednesday, January 15, 2014 1:30 PM
>> >> To: java-user
>> >> Subject: Re: Length of the filed does not affect the doc score
>> >> accurately for chinese analyzer(SmartChineseAnalyzer)
>> >>
>> >> The lengths of fields are encoded and lose some precision, so I
>> >> suspect the lengths calculated for the two documents are the same
>> >> after encoding.
>> >>
>> >> Adding &debug=all to the query will show you if this is the case.
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Wed, Jan 15, 2014 at 3:39 AM, andy <yhlweb@> wrote:
>> >> > Hi guys,
>> >> >
>> >> > As the topic says, it seems that the length of the field does not
>> >> > affect the doc score accurately for the Chinese analyzer in my
>> >> > source code.
>> >> >
>> >> > Index source code:
>> >> >
>> >> > private static Directory DIRECTORY;
>> >> >
>> >> > @BeforeClass
>> >> > public static void before() throws IOException {
>> >> >     DIRECTORY = new RAMDirectory();
>> >> >     Analyzer chineseanalyzer = new SmartChineseAnalyzer(Version.LUCENE_40);
>> >> >     IndexWriterConfig indexWriterConfig =
>> >> >             new IndexWriterConfig(Version.LUCENE_40, chineseanalyzer);
>> >> >     // Indexed and stored, with norms kept so field length can affect the score.
>> >> >     FieldType nameType = new FieldType();
>> >> >     nameType.setIndexed(true);
>> >> >     nameType.setStored(true);
>> >> >     nameType.setOmitNorms(false);
>> >> >     try {
>> >> >         IndexWriter indexWriter = new IndexWriter(DIRECTORY, indexWriterConfig);
>> >> >
>> >> >         List<String> nameList = new ArrayList<String>();
>> >> >         nameList.add("咨询公司");
>> >> >         nameList.add("飞鹰咨询管理咨询公司");
>> >> >         nameList.add("北京中标咨询公司");
>> >> >         nameList.add("重庆咨询公司");
>> >> >         nameList.add("商务咨询服务公司");
>> >> >         nameList.add("法律咨询公司");
>> >> >         for (int i = 0; i < nameList.size(); i++) {
>> >> >             Document document = new Document();
>> >> >             document.add(new Field("name", nameList.get(i), nameType));
>> >> >             document.add(new Field("id", String.valueOf(i + 1), nameType));
>> >> >             indexWriter.addDocument(document);
>> >> >         }
>> >> >         indexWriter.commit();
>> >> >     } catch (IOException e) {
>> >> >         // TODO Auto-generated catch block
>> >> >         e.printStackTrace();
>> >> >     }
>> >> > }
>> >> >
>> >> > Search snippet:
>> >> >
>> >> > @Test
>> >> > public void testChinese() throws IOException, ParseException {
>> >> >     String keyword = "咨询公司";
>> >> >     System.out.println("Searching for:" + keyword);
>> >> >     System.out.println();
>> >> >     IndexReader indexReader = DirectoryReader.open(DIRECTORY);
>> >> >     IndexSearcher indexSearcher = new IndexSearcher(indexReader);
>> >> >     Query query = new QueryParser(Version.LUCENE_40, "name",
>> >> >             new SmartChineseAnalyzer(Version.LUCENE_40)).parse(keyword);
>> >> >     TopDocs topDocs = indexSearcher.search(query, 15);
>> >> >     System.out.println("Search Result:");
>> >> >     if (null != topDocs && 0 < topDocs.totalHits) {
>> >> >         for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
>> >> >             System.out.println("doc id:" + indexSearcher.doc(scoreDoc.doc).get("id"));
>> >> >             String name = indexSearcher.doc(scoreDoc.doc).get("name");
>> >> >             System.out.println("content of Field:" + name);
>> >> >             dumpCNTokens(name);
>> >> >             System.out.println("score:" + scoreDoc.score);
>> >> >             System.out.println("-------------------------------------------");
>> >> >         }
>> >> >     } else {
>> >> >         System.out.println("no results");
>> >> >     }
>> >> > }
>> >> >
>> >> > And the search result is as follows:
>> >> >
>> >> > Searching for:咨询公司
>> >> >
>> >> > Search Result:
>> >> > doc id:1
>> >> > content of Field:咨询公司
>> >> > Terms:咨询 公司
>> >> > score:0.74763227
>> >> > -------------------------------------------
>> >> > doc id:2
>> >> > content of Field:飞鹰咨询管理咨询公司
>> >> > Terms:飞鹰 咨询 管理 咨询 公司
>> >> > score:0.6317303
>> >> > -------------------------------------------
>> >> > doc id:3
>> >> > content of Field:北京中标咨询公司
>> >> > Terms:北京 中标 咨询 公司
>> >> > score:0.5981058
>> >> > -------------------------------------------
>> >> > doc id:4
>> >> > content of Field:重庆咨询公司
>> >> > Terms:重庆 咨询 公司
>> >> > score:0.5981058
>> >> > -------------------------------------------
>> >> > doc id:5
>> >> > content of Field:商务咨询服务公司
>> >> > Terms:商务 咨询 服务 公司
>> >> > score:0.5981058
>> >> > -------------------------------------------
>> >> > doc id:6
>> >> > content of Field:法律咨询公司
>> >> > Terms:法律 咨询 公司
>> >> > score:0.5981058
>> >> > -------------------------------------------
>> >> >
>> >> > Docs 3, 4, 5 and 6 have the same score, but I think doc 4 and doc 6
>> >> > should have a higher score than docs 3 and 5, because docs 4 and 6
>> >> > have three terms while docs 3 and 5 have four terms.
>> >> > Am I right? Can anyone give me an explanation? And how do I get the
>> >> > expected result?
>> >> >
>> >> > --
>> >> > View this message in context:
>> >> > http://lucene.472066.n3.nabble.com/Length-of-the-filed-does-not-affect-the-doc-score-accurately-for-chinese-analyzer-SmartChineseAnalyz-tp4111390.html
>> >> > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Length-of-the-filed-does-not-affect-the-doc-score-accurately-for-chinese-analyzer-SmartChineseAnalyz-tp4111390p4116850.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.

--
View this message in context: http://lucene.472066.n3.nabble.com/Length-of-the-filed-does-not-affect-the-doc-score-accurately-for-chinese-analyzer-SmartChineseAnalyz-tp4111390p4117051.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org