RE: Length of the filed does not affect the doc score accurately for chinese analyzer(SmartChineseAnalyzer)

Uwe Schindler Wed, 12 Feb 2014 00:11:23 -0800

Hi Erick,

a statement like "Adding &debug=all to the query will show you if this is the 
case" will not help a Lucene user, as that parameter is only available in the 
Solr server. But Andy uses Lucene directly. In his case he should use 
IndexSearcher's explain functionality to retrieve a structured breakdown of 
how each document is scored for this query, which helps with debugging:

http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query, int)

But yes, the length norm is encoded with loss of precision in Lucene (it is a 
float value encoded into a single byte). With Lucene 4 there are ways to change 
that behavior, but that involves changing the similarity implementation and 
using a different DocValues type for encoding the norms. In most cases this is 
not needed, because users won't notice.
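
As a rough illustration of that precision loss, here is a minimal, 
stdlib-only sketch. It assumes the default length norm is 1/sqrt(numTerms) 
with no index-time boost, and it re-implements from memory the one-byte 
encoding that Lucene 4's DefaultSimilarity uses (SmallFloat.floatToByte315 and 
byte315ToFloat). The class and method names are made up for this example and 
are not Lucene API.

```java
// Hypothetical demo class; not part of Lucene.
class NormPrecisionSketch {

    // Assumed to mirror SmallFloat.floatToByte315: keep only the top bits
    // of the float (part of the exponent plus two mantissa bits).
    public static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;
        int zeroPoint = (63 - 15) << 3;
        if (small <= zeroPoint) {
            // Underflow: zero and negatives map to 0, tiny positives to 1.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= zeroPoint + 0x100) {
            return (byte) -1; // Overflow: clamp to the largest encodable value.
        }
        return (byte) (small - zeroPoint);
    }

    // Assumed to mirror SmallFloat.byte315ToFloat.
    public static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << 21;
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Length norm as it would come back from the index after the
    // lossy round trip (assumption: norm = 1/sqrt(numTerms), boost = 1).
    public static float storedNorm(int numTerms) {
        float exact = (float) (1.0 / Math.sqrt(numTerms));
        return decode(encode(exact));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            float exact = (float) (1.0 / Math.sqrt(n));
            System.out.println("terms=" + n + " exact=" + exact
                    + " stored=" + storedNorm(n));
        }
    }
}
```

Under these assumptions, fields with three and four terms both come back 
with a stored norm of 0.5, while a two-term field comes back as 0.625. That 
would be consistent with Andy's output: docs 3 to 6 share one score, and 
doc 1 scores 0.74763227 / 0.5981058 = 1.25 times higher, which is exactly 
0.625 / 0.5.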

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Wednesday, January 15, 2014 1:30 PM
> To: java-user
> Subject: Re: Length of the filed does not affect the doc score accurately for
> chinese analyzer(SmartChineseAnalyzer)
> 
> The lengths of fields are encoded and lose some precision, so I suspect the
> field lengths calculated for these documents are the same after
> encoding.
> 
> Adding &debug=all to the query will show you if this is the case.
> 
> Best
> Erick
> 
> On Wed, Jan 15, 2014 at 3:39 AM, andy <yhl...@sohu.com> wrote:
> > Hi guys,
> >
> > As the subject says, it seems that the length of the field does not
> > affect the doc score accurately for the Chinese analyzer in my source code.
> >
> > index source code
> >
> >  private static Directory DIRECTORY;
> >
> >
> >     @BeforeClass
> >     public static void before() throws IOException {
> >           DIRECTORY = new RAMDirectory();
> >           Analyzer chineseanalyzer = new
> > SmartChineseAnalyzer(Version.LUCENE_40);
> >           IndexWriterConfig indexWriterConfig = new
> > IndexWriterConfig(Version.LUCENE_40,chineseanalyzer);
> >           FieldType nameType = new FieldType();
> >           nameType.setIndexed(true);
> >           nameType.setStored(true);
> >           nameType.setOmitNorms(false);
> >           try {
> >               IndexWriter indexWriter = new IndexWriter(DIRECTORY,
> > indexWriterConfig);
> >
> >               List<String> nameList = new ArrayList<String>();
> >
> >               nameList.add("咨询公司");
> >               nameList.add("飞鹰咨询管理咨询公司");
> >               nameList.add("北京中标咨询公司");
> >               nameList.add("重庆咨询公司");
> >               nameList.add("商务咨询服务公司");
> >               nameList.add("法律咨询公司");
> >               for (int i = 0; i < nameList.size(); i++) {
> >                   Document document = new Document();
> >                   document.add(new Field("name", nameList.get(i),
> > nameType));
> >                   document.add(new
> > Field("id",String.valueOf(i+1),nameType));
> >                   indexWriter.addDocument(document);
> >             }
> >               indexWriter.commit();
> >           } catch (IOException e) {
> >               // TODO Auto-generated catch block
> >               e.printStackTrace();
> >           }
> >     }
> >
> > search snippet:
> >  @Test
> >     public void testChinese() throws IOException, ParseException {
> >         String keyword = "咨询公司";
> >         System.out.println("Searching for:" + keyword);
> >         System.out.println();
> >         IndexReader indexReader = DirectoryReader.open(DIRECTORY);
> >         IndexSearcher indexSearcher = new IndexSearcher(indexReader);
> >         Query query = null;
> >         query = new QueryParser(Version.LUCENE_40,"name",new
> > SmartChineseAnalyzer(Version.LUCENE_40)).parse(keyword);
> >         TopDocs topDocs = indexSearcher.search(query,15);
> >         System.out.println("Search Result:");
> >         if (null !=topDocs && 0 < topDocs.totalHits) {
> >             for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
> >                 System.out.println("doc id:" +
> > indexSearcher.doc(scoreDoc.doc).get("id"));
> >                 String name = indexSearcher.doc(scoreDoc.doc).get("name");
> >                 System.out.println("content of Field:" + name);
> >                 dumpCNTokens(name);
> >                 System.out.println("score:" + scoreDoc.score);
> >
> > System.out.println("-------------------------------------------");
> >             }
> >         } else {
> >             System.out.println("no results");
> >         }
> >
> >     }
> >
> >
> > And search result as follows:
> > Searching for:咨询公司
> >
> > Search Result:
> > doc id:1
> > content of Field:咨询公司
> > Terms:咨询        公司
> > score:0.74763227
> > -------------------------------------------
> > doc id:2
> > content of Field:飞鹰咨询管理咨询公司
> > Terms:飞鹰        咨询      管理      咨询      公司
> > score:0.6317303
> > -------------------------------------------
> > doc id:3
> > content of Field:北京中标咨询公司
> > Terms:北京        中标      咨询      公司
> > score:0.5981058
> > -------------------------------------------
> > doc id:4
> > content of Field:重庆咨询公司
> > Terms:重庆        咨询      公司
> > score:0.5981058
> > -------------------------------------------
> > doc id:5
> > content of Field:商务咨询服务公司
> > Terms:商务        咨询      服务      公司
> > score:0.5981058
> > -------------------------------------------
> > doc id:6
> > content of Field:法律咨询公司
> > Terms:法律        咨询      公司
> > score:0.5981058
> > -------------------------------------------
> >
> > Docs 3, 4, 5 and 6 have the same score, but I think docs 4 and 6
> > should have a higher score than docs 3 and 5, because docs 4 and 6
> > have three terms while docs 3 and 5 have four terms.
> > Am I right? Can anyone give me an explanation? And how can I get the
> > expected result?
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Length-of-the-filed-does-not-affect
> > -the-doc-score-accurately-for-chinese-analyzer-SmartChineseAnalyz-tp41
> > 11390.html Sent from the Lucene - Java Users mailing list archive at
> > Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >

