[hibernate-dev] HSEARCH-2358 "fields" attribute in Elasticsearch search results is being ignored

2016-09-28 Thread Yoann Rodiere
Hi,

I wanted to start a discussion about this issue.

It's about stored field retrieval. When searching, Elasticsearch can return
field values in two different ways:

 * through the "_source" attribute [1], which basically provides a
copy-paste of the JSON that was submitted when indexing
 * or through the "fields" attribute [2], which only works for stored
fields and provides the actual value that Elasticsearch stored

The main difference really boils down to formatting. With the "_source"
attribute, there's no formatting involved: you get exactly what was
originally submitted. With the "fields" attribute, the value is formatted
according to the first format in the mapping's format list [3].

The thing is, Elasticsearch allows admins to set multiple formats for a
given field. This won't change the output format, but will allow using any
one of these formats when submitting information. Since these "extra"
formats probably aren't understood by Hibernate Search, this means that
using the "_source" attribute to retrieve field values becomes unreliable
as soon as someone else adds/changes documents in Elasticsearch...
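
To illustrate the "_source" concern, here is a minimal Java sketch. It assumes
(hypothetically) that the client only understands ISO-8601 date-times with an
offset, while the mapping also accepts epoch milliseconds as an extra input
format (for instance a format list such as
"strict_date_optional_time||epoch_millis"). The class and method names are made
up for the example and are not Hibernate Search APIs.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

final class SourceFormatDemo {

    // Assumption: the only format the client knows how to read back.
    private static final DateTimeFormatter EXPECTED =
            DateTimeFormatter.ISO_OFFSET_DATE_TIME;

    // Tries to interpret a raw "_source" value; empty if the format is unknown.
    static Optional<OffsetDateTime> parseSourceValue(String raw) {
        try {
            return Optional.of(OffsetDateTime.parse(raw, EXPECTED));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Value written in the expected format: readable.
        System.out.println(parseSourceValue("2016-09-28T14:56:00+02:00").isPresent());
        // Value written by a third party using an extra input format
        // (epoch milliseconds): "_source" returns it verbatim, so parsing fails.
        System.out.println(parseSourceValue("1475067360000").isPresent());
    }
}
```

Under these assumptions the first call succeeds and the second returns an empty
result, even though Elasticsearch itself accepted both values at indexing time.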

So we have two solutions:

 1. Either we only use the "fields" attribute to retrieve field values, and
we force users to have the output format set to something HSearch will
understand, but allow extra input formats.
 2. or we use the "_source" attribute to retrieve field values, and then we
force both output and input format on users, and do not allow extra formats.

I'd be in favor of 1, which seems more rational to me. It only has one
downside: if we go on with this approach, Calendar values (and
ZonedDateTime, ZonedTime, etc.) will have to be stored as String, not as
Date, since Elasticsearch doesn't store the timezone, just the UTC
timestamp. We're currently working around this by inspecting the "_source",
which contains the original timezone (since it's just the raw, originally
submitted JSON).
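
The timezone loss can be sketched in plain Java, without any Elasticsearch
involved. This assumes the stored value is reduced to a UTC epoch timestamp and
that the "_source" holds the value as an ISO-8601 string including the zone;
both the serialized form and the helper names are assumptions made for the
example, not the actual Hibernate Search format.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

final class TimezoneLossDemo {

    // What a date-typed stored field keeps: only the UTC timestamp.
    static long toStoredTimestamp(ZonedDateTime value) {
        return value.toInstant().toEpochMilli();
    }

    // Reading the stored value back: the original zone is unknown, so use UTC.
    static ZonedDateTime fromStoredTimestamp(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
    }

    // Reading the raw submitted string back (hypothetical "_source" content).
    static ZonedDateTime fromSource(String raw) {
        return ZonedDateTime.parse(raw);
    }

    public static void main(String[] args) {
        ZonedDateTime original =
                ZonedDateTime.of(2016, 9, 28, 14, 56, 0, 0, ZoneId.of("Europe/Paris"));

        ZonedDateTime viaFields = fromStoredTimestamp(toStoredTimestamp(original));
        ZonedDateTime viaSource = fromSource(original.toString());

        System.out.println(viaFields.isEqual(original));                      // same instant
        System.out.println(viaFields.getZone().equals(original.getZone()));   // zone preserved?
        System.out.println(viaSource.equals(original));                       // full round trip?
    }
}
```

The value read back from the timestamp denotes the same instant but carries UTC
instead of the original zone, whereas the value parsed from the raw string is
equal to the original, zone included.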

What do you think?

[1]
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
[2]
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-fields.html
[3]
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/mapping-date-format.html#custom-date-formats

Yoann Rodière 
Hibernate NoORM Team
___
hibernate-dev mailing list
hibernate-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/hibernate-dev

Re: [hibernate-dev] HSEARCH-2358 "fields" attribute in Elasticsearch search results is being ignored

2016-09-28 Thread Guillaume Smet
Hi Yoann,

On Wed, Sep 28, 2016 at 2:56 PM, Yoann Rodiere wrote:

> I'd be in favor of 1, which seems more rational to me. It only has one
> downside: if we go on with this approach, Calendar values (and
> ZonedDateTime, ZonedTime, etc.) will have to be stored as String, not as
> Date, since Elasticsearch doesn't store the timezone, just the UTC
> timestamp. We're currently working around this by inspecting the "_source",
> which contains the original timezone (since it's just the raw, originally
> submitted JSON).
>
> What do you think?
>

I'm not sure you completely understood the consequences of storing dates as
strings.

You won't be able to use these sorts of features:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-daterange-aggregation.html
which are used very often when dealing with dates.

I don't think storing dates as strings is a viable alternative.

IMHO, the choice is between:
- using _source as we currently do it. I'm not sure allowing people to
directly inject data into Elasticsearch and bypass Hibernate Search is
something we can support in the long run so I think it would be acceptable
if we document that we don't expect people to index documents directly (or
at least that they should carefully follow the HS indexing format - which
looks like an acceptable thing).
- using fields and being aware that we will get back UTC values from
projections on these fields

-- 
Guillaume


Re: [hibernate-dev] HSEARCH-2358 "fields" attribute in Elasticsearch search results is being ignored

2016-09-28 Thread Yoann Rodiere
On 28 September 2016 at 15:23, Guillaume Smet wrote:
>
> You won't be able to use these sorts of features:
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-daterange-aggregation.html
> which are used very often when dealing with dates.
>
> I don't think storing dates as strings is a viable alternative.
>

Right. I didn't know about these.


> IMHO, the choice is between:
> - using _source as we currently do it. I'm not sure allowing people to
> directly inject data into Elasticsearch and bypass Hibernate Search is
> something we can support in the long run so I think it would be acceptable
> if we document that we don't expect people to index documents directly (or
> at least that they should carefully follow the HS indexing format - which
> looks like an acceptable thing).
> - using fields and being aware that we will get back UTC values from
> projections on these fields
>

... and the latter is a no-go for ZonedDateTime et al., since the point of
those classes is to preserve the timezone/offset.
Maybe we could use "_source" only when we really need to, but I doubt
there's an elegant way to do this, so I guess we'd better not.

Anyway, it seems we're down to only one acceptable solution... Unless
anyone has another view on all this, I'll index ZonedDateTime etc. as dates
and use "_source" for value retrieval, and I'll close HSEARCH-2358 as "Won't
fix".

Thanks for the insight!

Yoann Rodière 
Hibernate NoORM Team