Re: Document category identification in query

Weiwei Wang Mon, 21 Dec 2009 05:30:17 -0800

Your search service sounds very interesting. I once also thought about a
similar application for shopping.


2009/12/21 fei liu <liufeipeki...@gmail.com>

> that's a good idea. This approad will work.
>
> Supposing someone submit a query "nice rice restaurant", I think he/she
> wants to find a chinese restaurant. But you will find the category of
> "restaurant" retrieved with higher score than "chinese restaurant".
> Usually you may find no category retrieved by your approach, because users'
> query doesn't share the same words with your categories' name.
> But you could have a try. If you have interesting results, let me know.
> Thanks!
>
> That's my opinion.
>
> 2009/12/21 Alex <azli...@gmail.com>
>
> > Hi !
> >
> > Many thanks to both of you for your suggestions and answers!
> >
> > What Weiwei Wang suggests is a part of the solution I am willing to
> > implement. I will definitely use the suggest-as-you-type approach in the
> > query form as it will allow for pre-emptive disambiguation and I believe,
> > will give very satisfying results.
> >
> > However, search users are wild beasts and I can't count on them to always
> > use the given suggestions. All I can count on is very erratic, sparse and
> > ambiguous queries :) So I need an almost fool proof solution.
> >
> > To answer your question :
> > "BTW, I do not understand why you need to know the category of user
> input"
> > I am trying to understand the user intent behind the query to filter out
> > results based on a given category of locations. If a user queries "Fast
> > Food
> > in Nanjing" I don't want to return all the documents that contain the
> words
> > "Fast" and "Food" and "Nanjing". I use a custom algorithm to figure out
> the
> > intended location first. Then using the Spatial contrib I filter out the
> > results based on a given area that was identified earlier. Finally I sort
> > the results according to distance from the location point / centroid
> found
> > earlier.
> >
> >  Identifying the category allows me two things :
> >
> > 1) Filter out irrelevant results : I dont want my resultset to include a
> > Supermarket in "Nanjing" where the "food" is fresh and service is "fast"
> > just because the query words were included in the description of the
> > location. Since I am using custom, distance based, sorting of the
> results,
> > I
> > can't afford to have the supermarket be the top result because it is the
> > closest to the location centroid identified earlier. The user intent was
> > clearly "fast food" and not a supermarket !
> >
> > 2) Understand user intent to provide targetted advertizing.
> >
> > 3) Understanding the category of location a user is looking for also
> allows
> > to calculate more accurately the bounding box  = the maximum distance at
> > which the location should be located to be relevant to the user. A user
> > looking for Pizza in New York is expecting to have his results within a
> > radius of a maximum of 1 or 2 miles. If he is looking for a Theme Park he
> > will probably be willing to go further away to find it. So identifying
> the
> > category of the location the user is looking for lets me calculate the
> > didstance radius more acurately.
> >
> >
> >
> >
> > Fei Liu
> >
> > Thanks a lot for the papers you pointed me to. I cam accross them earlier
> > in
> > my research and re-reading them gave me new insights. However I believe
> > that
> > the Two steps approach you are recomending is not very viable under heavy
> > load as it requires two passes on the index.
> > I believe however that Identifying the dominant category(ies) of the
> > resultset when no category could be clearly identified using the query
> > alone, can be very valuable if sent back to the user as an information
> and
> > a
> > category link !
> >
> > Now what I think I will do to pre-emptively identify the location
> > category(ies) implied in the query :
> >
> > 1 - use my own custom category set and index their names using the
> synonym
> > analyzer provided with Lucene and also use some sort of normalization
> such
> > as stemmin. maybe also using snowball analyzer.
> > 2 - break the query into Shingles (word based grams) and analyze each
> > shingle using the analyzers that were used in (1). then query Lucene with
> > these analyzed shingles against the category index built earlier.
> >
> > Hopefully the category with the highest Lucene score should be the one
> > intended by the user....
> >
> > Later on, I also intend to use some sort of training based approach using
> > search queries that would have been tagged with the relevant location
> > categories.
> >
> > What do you guys think ?
> >
> > Would this be a viable approach ?
> >
> > Thanks for all !
> >
> >
> > Cheers
> >
> > Alex
> >
>
>
>
> --
> -------------------------
> Liu Fei（刘飞）
> Institute of Software
> School of Electronics Engineering & Computer Science
> Peking University
> Beijing, 100871, PRC.
>



-- 
Weiwei Wang
Alex Wang
王巍巍
Room 403, Mengmin Wei Building
Computer Science Department
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Homepage: http://cs.nju.edu.cn/rl/weiweiwang

Re: Document category identification in query

Reply via email to