Hi,

It's been a long time since I floated the first idea for this (see HSEARCH-917). After a lot more thought, and given that I have been stuck on this one for a long time, it's probably better to agree on a plan before putting together some code.

Note that this plan is based on our usage of Hibernate Search in a lot of applications over several years, and I think our usage pattern is quite common. Even so, I'm pretty sure there are other search patterns out there which might be interesting, and it would be nice to include them in this proposal if they don't fit.

I. How do we search at my company?
-------------------------------------------------------

We mainly use Search for 2 things:
- autocompletion;
- search engines: a search form to filter a list of items, usually with a plain text field and several structured fields (mostly drop-down choices).

We usually sort with business rules, not by score. Users usually like it better as it's more predictable. For example, we sort our autocompletion results alphabetically.

An interesting note here is probably that we work on structured data, not on CMS content. This might be considered a detail but you'll see it's important.

We use analyzers to:
- split the words (the WordDelimiterFilter - yeah, I have a Solr background :));
- filter the input (ASCIIFoldingFilter, LowerCaseFilter...);
- possibly do simple stemming (with our own very minimal stemmers).

We sometimes use Search to find the elements to apply business rules to, when it's really hard to use the database to do so. Search provides a convenient way to denormalize the data.

II. On why we can't use the DSL out of the box
--------------------------------------------------------------------

The Hibernate Search DSL is great and I must admit this is the DSL which taught me how to build DSLs for our own usage. It's intuitive, well thought out, definitely a nice piece of code.

So, why don't we use it for our plain text queries?
(Disclaimer: we use it under the hood, we just have to do a lot of things manually outside of the DSL)

Several reasons:
1/ the aforementioned detail about sorting: as we don't sort by score, loosely matching results are not pushed down the list, so we badly need AND in plain text search;
2/ we often need to add a clause only if the text isn't empty or the object isn't null, and we then need more logic than the fluent approach allows (I don't have any ideas/proposals for this one but I think it's worth mentioning).

And why it is not ideal:
3/ wildcards and analyzers are really a pain with Lucene and you need to implement your own cleaning code to get a working wildcard query.

1/ is definitely our biggest problem.

III. So let's add an AND option...
-----------------------------------------------

Yeah, well, that's not so easy. Let's take a look at the code, especially our dear friend ConnectedMultiFieldsTermQueryBuilder.

When I started to look at HSEARCH-917, I thought it would be quite easy to build Lucene queries using a Lucene QueryParser instead of all the machinery in ConnectedMultiFieldsTermQueryBuilder. It's not.

Here are pointers to the main problems I have:
1/ getAllTermsFromText is cute when you want to OR the terms but really bad when you need AND, especially when you use analyzers which return several tokens for a term (this is the case when you use the SynonymFilter or the WordDelimiterFilter);
2/ the fieldBridge thing is quite painful for plain text search as we are not sure that all the fields have the same fieldBridge and, thus, the search terms might be different for each field after applying the fieldBridge.

These problems are not easy to solve in a general way. That's why I haven't made any progress on this problem.

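To make the target of the AND option concrete, here is a minimal sketch of the query shape being asked for: every term is required, but each term may match in any of the searched fields. It is plain Java with no Lucene dependency and only builds the textual form in classic Lucene query syntax. The class and method names are hypothetical, the terms are split on whitespace instead of going through an analyzer, and nothing is escaped, so it deliberately sidesteps both problems listed above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of Hibernate Search or Lucene.
final class AndAcrossFieldsSketch {

    // Builds "+(f1:t1 f2:t1) +(f1:t2 f2:t2)": one required group per term,
    // each group OR-ing the same term over all the fields.
    static String build(String text, List<String> fields) {
        List<String> requiredGroups = new ArrayList<>();
        // Assumption: whitespace splitting stands in for real analysis.
        for (String term : text.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // empty input produces no clause at all
            }
            List<String> perField = new ArrayList<>();
            for (String field : fields) {
                perField.add(field + ":" + term);
            }
            requiredGroups.add("+(" + String.join(" ", perField) + ")");
        }
        return String.join(" ", requiredGroups);
    }
}
```

Flattening all the analyzed tokens into a single list loses exactly the per-term grouping that this shape relies on, which is why an approach that is fine for OR does not carry over to AND.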
Let's illustrate the problem:
- you search for "several words in my content" (without the quotes: it's not a phrase query, just terms);
- you search in the fields title, summary and content, so you expect to find at least one occurrence of each term in one of these fields;
- for some reason, you have a different fieldBridge on one of the fields, and it's quite hard to define "at least one occurrence of each term in one of these fields" as the fieldBridge might transform the text.

My point is that I don't see a way to fix the current DSL without breaking some cases, even if we might consider them weird (note that the current code only works because only the OR operator is supported).

From my perspective, a plainText branch of the DSL could ignore the fieldBridge machinery but I'm not sure it's a good idea. That's why I would like some feedback about this before moving in this direction.

I took a look at the new features of Lucene 4.7 and the new SimpleQueryParser looks kinda interesting as it's really simple and could be a good starting point to come up with a QueryParser which simply does the job for our plain text search queries.

IV. About wildcard queries
--------------------------------------

Let's say it frankly: wildcard queries are a pain in Lucene.

Let's take an example:
- You index "Parking" and you have a LowerCaseFilter, so your index contains "parking";
- You search for Parking without wildcard: it will work;
- You search for Parki* with wildcard: yeah, it won't work.

This is due to the fact that the analyzers are ignored for wildcard queries. Usually, this is because the ? and * characters could be altered or removed by the filters used in your analyzers.

While we all understand the Lucene point of view from a technical perspective, I don't think we can keep this position for Hibernate Search as a user friendly search framework on top of Hibernate.

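As an illustration of the kind of cleaning this requires, here is a minimal sketch in plain Java (no Lucene dependency) that lowercases a wildcard term and strips accents while leaving ? and * untouched. The class and method names are hypothetical, and the accent stripping via Unicode decomposition is only an approximation of what LowerCaseFilter plus ASCIIFoldingFilter do at indexing time: ASCIIFoldingFilter also folds characters that do not decompose, such as ø.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of Hibernate Search or Lucene.
final class WildcardTermNormalizer {

    static String normalize(String term) {
        // Decompose accented characters into base letter + combining mark...
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        // ...then drop the combining marks, which removes the accents.
        String folded = decomposed.replaceAll("\\p{M}+", "");
        // Lowercasing does not affect the '*' and '?' wildcard characters.
        return folded.toLowerCase(Locale.ROOT);
    }
}
```

With this, Parki* becomes parki*, which matches the indexed token "parking" from the example above. It only works as long as the indexing analyzer does nothing more than lowercasing and accent folding; with stemming or word splitting the normalized prefix can still miss.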
At Open Wide, we have a quite complex method which rewrites a search as a working autocompletion search which might work most of the time (with a high value of most...). It's kinda ugly, far from perfect and I'm wondering if we could have something more clever in Search.

I once talked with Emmanuel about having different analyzers for Indexing, Querying (this is the Solr way) and Wildcard/Fuzzy search (this is IMHO a good idea as the way you want to normalize your wildcard query highly depends on the analyzer used to index your data).

V. The "don't add this clause if null/empty" problem
----------------------------------------------------------------------------

Ideas welcome!

VI. Provide not so bad default analyzers
---------------------------------------------------------

I think it would be nice to provide default analyzers for plain text. Not necessarily ones including complex/debatable things like stemmers, but at least something which gives a good taste of Search before going into more details.

Why would it be interesting? As a French speaking person, I see so many search engines out there which don't normalize accented characters that it would be nice to have something working by default.

VII. Conclusion
----------------------

I would really like to make some quick progress on III. I'm pretty sure we're not the only ones having a lot of MultiFieldQueryParser instantiations in our Search code to deal with this. And I'm not even talking about the numerous times when one of our developers used the DSL without realizing it would use the OR operator.

Comments welcome.

--
Guillaume
_______________________________________________
hibernate-dev mailing list
hibernate-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/hibernate-dev