Re: [DISCUSS] Nested YAML configs for new features

[email protected] Mon, 29 Nov 2021 08:51:14 -0800

Maybe we can make our query language more expressive 😊

We might anyway want to introduce e.g. a LIKE filtering option to find/discover 
flattened config parameters?
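
As a rough illustration of what such a filter could do over dot-flattened setting names, here is a minimal Python sketch. It is not the Settings virtual table implementation: `select_like` and the sample settings are hypothetical, and it assumes `%` is the only wildcard, so underscores in setting names stay literal.

```python
import re
from typing import Dict


def select_like(settings: Dict[str, str], pattern: str) -> Dict[str, str]:
    """Return the settings whose name matches a LIKE-style pattern.

    Assumption: only '%' is a wildcard (any run of characters);
    every other character, including '.' and '_', is matched literally.
    """
    literal_parts = [re.escape(part) for part in pattern.split("%")]
    regex = re.compile(".*".join(literal_parts))
    return {name: value for name, value in settings.items() if regex.fullmatch(name)}


# Hypothetical flattened settings, named after the examples in this thread.
settings = {
    "track_warnings.enabled": "true",
    "track_warnings.coordinator_read_size.warn_threshold": "10kb",
    "hinted_handoff_enabled": "true",
}

print(sorted(select_like(settings, "track_warnings.%")))
```

A prefix pattern such as `track_warnings.%` then picks out one feature's settings, which is the discovery use case mentioned above.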


From: Benjamin Lerer <[email protected]>
Date: Monday, 29 November 2021 at 16:41
To: [email protected] <[email protected]>
Subject: Re: [DISCUSS] Nested YAML configs for new features
>
> I don’t think it’s necessarily a requirement that we use the flattened
> version in vtables. At the very least we can make use of sets, lists, etc.
> But we can probably also use UDTs if this improves clarity.


In my opinion part of the issue is on the query side. How do we select a
nested set or a specific set easily? UDTs are not great for this type of
query. For collections we can use CONTAINS and element or range selection,
but insertion might be the problem.

On Mon, 29 Nov 2021 at 17:23, Bowen Song <[email protected]> wrote:

> In ElasticSearch, the default is a flattened format with almost all
> lines commented out. See
>
> https://github.com/elastic/elasticsearch/blob/master/distribution/src/config/elasticsearch.yml
>
> I guess they chose to do that because users can uncomment individual
> lines to make changes. In a structured config file, the user will have
> to uncomment all lines containing the parent keys to get it to work. For
> example, if someone wants to set the config keyABB to a non-default
> value, they will have to correctly uncomment 3 lines: keyA, keyAB and
> keyABB, which can be annoying and makes it easy to make a mistake. If
> either of the first two keys is not uncommented, the YAML file will still
> be valid but the config may end up as something like keyX.keyAB.keyABB
> and just get silently ignored by the database.
>
>     keyX:
>       keyY:
>         keyZ: value
>     # keyA:
>     #   keyAA:
>     #     keyAAA: value
>     #   keyAB:
>     #     keyABA: value
>     #     keyABB: value
>
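
One way to avoid the silent-ignore case described above is to reject any setting path the database does not recognise. The following is a minimal Python sketch, not how Cassandra or Elasticsearch validate their configs: `KNOWN_SETTINGS`, `leaf_paths` and `unknown_settings` are hypothetical names, and `parsed` is an assumed parse result for a file where keyAB and keyABB were uncommented but keyA was not.

```python
from typing import Any, Dict, Iterator, Set

# Hypothetical set of setting paths the database understands.
KNOWN_SETTINGS: Set[str] = {
    "keyX.keyY.keyZ",
    "keyA.keyAA.keyAAA",
    "keyA.keyAB.keyABA",
    "keyA.keyAB.keyABB",
}


def leaf_paths(config: Dict[str, Any], prefix: str = "") -> Iterator[str]:
    """Yield the dotted path of every leaf value in a nested mapping."""
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from leaf_paths(value, path)
        else:
            yield path


def unknown_settings(config: Dict[str, Any]) -> Set[str]:
    """Return the paths present in the config that are not known settings."""
    return set(leaf_paths(config)) - KNOWN_SETTINGS


# Assumed parse result: keyAB landed under keyX because keyA stayed commented.
parsed = {"keyX": {"keyY": {"keyZ": "value"}, "keyAB": {"keyABB": "value"}}}

print(unknown_settings(parsed))  # {'keyX.keyAB.keyABB'}
```

Failing startup when this set is non-empty would turn the mistake into an error instead of a setting that is quietly dropped.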
> On 29/11/2021 15:54, Benjamin Lerer wrote:
> > I do not think that supporting both options is an issue. The settings
> > virtual table would have to use the flattened version.
> > If we support both formats, the question would be: what should be the one
> > used by default in the configuration file?
> >
> > On Fri, 26 Nov 2021 at 15:40, [email protected] <[email protected]> wrote:
> >
> >> This is the approach I favour for config files also. We had a much less
> >> engaged discussion on this topic only a few months ago, so glad to see
> >> more people getting involved now.
> >>
> >> I would however personally prefer to see the configuration file slowly
> >> deprecated (if perhaps never retired), in favour of virtual tables, so
> >> that operators may easily set configurations for the entire cluster.
> >> Ideally it would be possible to specify configuration per cluster, per
> >> DC and per node, with the most specific configuration applying. I would
> >> like to see a similar hierarchy for Keyspace, Table and Per-Query
> >> options. Ideally only the barest minimum number of options would be
> >> necessary to supply in a config file, and only on first launch – seed
> >> nodes, for instance.
> >>
> >> So whatever design we employ here, we should IMO be aiming for it to be
> >> compatible with a CQL representation also.
> >>
> >>
> >> From: Bowen Song <[email protected]>
> >> Date: Wednesday, 24 November 2021 at 18:15
> >> To: [email protected] <[email protected]>
> >> Subject: Re: [DISCUSS] Nested YAML configs for new features
> >> Since you mentioned ElasticSearch, I'm actually pretty happy with their
> >> config file syntax. It allows the user to completely flatten out the
> >> entire config file. To give people who aren't familiar with ElasticSearch
> >> an idea, here is a config file we use:
> >>
> >>      cluster.name: foobar
> >>
> >>      node.remote_cluster_client: false
> >>      node.name: "foo.example.com"
> >>      node.master: true
> >>      node.data: true
> >>      node.ingest: true
> >>      node.ml: false
> >>
> >>      xpack.ml.enabled: false
> >>      xpack.security.enabled: false
> >>      xpack.security.audit.enabled: false
> >>      xpack.watcher.enabled: false
> >>
> >>      action.auto_create_index: "+.,-*"
> >>
> >>      network.host: _global_
> >>
> >>      discovery.zen.hosts_provider: file
> >>      discovery.zen.minimum_master_nodes: 2
> >>
> >>      http.publish_host: "foo.example.com"
> >>      http.publish_port: 443
> >>      http.bind_host: 127.0.0.1
> >>
> >>      transport.publish_host: "bar.example.com"
> >>      transport.bind_host: 0.0.0.0
> >>
> >>      indices.fielddata.cache.size: 1GB
> >>      indices.breaker.total.use_real_memory: false
> >>
> >>      path.logs: /var/log/elasticsearch
> >>      path.data: /var/lib/elasticsearch/data
> >>
> >> As you can see we can use the flat (grep-able) syntax for everything.
> >> This is also human readable because we can group options together by
> >> inserting empty lines between them.
> >>
> >> The equivalent of the above in a structured syntax will be:
> >>
> >>      cluster:
> >>           name: foobar
> >>
> >>      node:
> >>           remote_cluster_client: false
> >>           name: "foo.example.com"
> >>           master: true
> >>           data: true
> >>           ingest: true
> >>           ml: false
> >>
> >>      xpack:
> >>           ml:
> >>               enabled: false
> >>           security:
> >>               enabled: false
> >>               audit:
> >>                   enabled: false
> >>           watcher:
> >>               enabled: false
> >>
> >>      action:
> >>           auto_create_index: "+.,-*"
> >>
> >>      network:
> >>           host: _global_
> >>
> >>      discovery:
> >>           zen:
> >>               hosts_provider: file
> >>               minimum_master_nodes: 2
> >>
> >>      http:
> >>           publish_host: "foo.example.com"
> >>           publish_port: 443
> >>           bind_host: 127.0.0.1
> >>
> >>      transport:
> >>           publish_host: "bar.example.com"
> >>           bind_host: 0.0.0.0
> >>
> >>      indices:
> >>           fielddata:
> >>               cache:
> >>                   size: 1GB
> >>           breaker:
> >>               total:
> >>                   use_real_memory: false
> >>
> >>      path:
> >>           logs: /var/log/elasticsearch
> >>           data: /var/lib/elasticsearch/data
> >>
> >> This may be easier to read for some people, but it is a total nightmare
> >> for "grep" - so many keys have identical names, such as "enabled".
> >>
> >> Also, for the virtual tables, it would be a lot easier to represent
> >> individual values in a virtual table when the config is flat and keys
> >> are unique. The virtual tables would need to either support the encoding
> >> and decoding of the structured config into a flat structure, or use JSON
> >> encoded string value. The use of JSON would make querying individual
> >> values much harder.
> >>
> >> On 22/11/2021 16:16, Joseph Lynch wrote:
> >>> Isn't one of the primary reasons to have a YAML configuration instead
> >>> of a properties file to allow typed and structured (implies nested)
> >>> configuration? I think it makes a lot of sense to group related
> >>> configuration options (e.g. a feature) into a typed class when we're
> >>> talking about more than one or two related options.
> >>>
> >>> It's pretty standard elsewhere in the JVM ecosystem to encode YAML as
> >>> period-encoded key->value pairs when required (usually when providing
> >>> a property or override layer); Spring and Elasticsearch YAMLs both
> >>> come to mind. It seems pretty reasonable to support dot encoding and
> >>> decoding, for example {"a": {"b": 12}} -> '"a.b": 12'.
> >>>
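
The dot encoding mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not the encoder used by Spring, Elasticsearch or Cassandra; `flatten` is a hypothetical helper and the sample keys are borrowed from this thread.

```python
from typing import Any, Dict


def flatten(config: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Encode a nested mapping as dot-separated key -> value pairs."""
    flat: Dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            # Descend into non-empty sections; their keys extend the path.
            flat.update(flatten(value, path))
        else:
            # Leaf value (an empty mapping is kept as a leaf).
            flat[path] = value
    return flat


print(flatten({"a": {"b": 12}}))  # {'a.b': 12}
```

Lists and sets are left untouched as leaf values here; how to encode collections is a separate decision.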
> >>> Regarding quickly telling what configuration a node is running, I think
> >>> we should lean on virtual tables for "what is the current
> >>> configuration" now that we have them. As others have said, the written
> >>> cassandra.yaml is not necessarily the current configuration ... and
> >>> also grep -C or -A exist for this reason.
> >>>
> >>> -Joey
> >>>
> >>> On Mon, Nov 22, 2021 at 4:14 AM Benjamin Lerer <[email protected]> wrote:
> >>>> I do not have a strong opinion for one or the other but wanted to
> >>>> raise the issue I see with the "Settings" virtual table.
> >>>>
> >>>> Currently the "Settings" virtual table converts nested options into
> >>>> flat options using a "_" separator. For those options it allows a user
> >>>> to query the whole set of options through some hack.
> >>>> If we decide to move to more nesting (more than one level), it seems
> >>>> to me that we need to change the way this table behaves and how we
> >>>> can query its data.
> >>>>
> >>>> We would need to start using "." as a nesting separator to ensure that
> >>>> things are consistent between the configuration and the table, and add
> >>>> support for LIKE restrictions for filtering queries so that operators
> >>>> can select the precise set of settings they are looking for.
> >>>>
> >>>> Doing so is not really complicated in itself but might impact some
> >>>> users.
> >>>> On Fri, 19 Nov 2021 at 22:39, David Capwell <[email protected]> wrote:
> >>>>
> >>>>>> it is really handy to grep
> >>>>>> cassandra.yaml on some config key and you know the value instantly.
> >>>>> You can still do that
> >>>>>
> >>>>> $ grep -A2 coordinator_read_size conf/cassandra.yaml
> >>>>> #     coordinator_read_size:
> >>>>> #         warn_threshold_kb: 0
> >>>>> #         abort_threshold_kb: 0
> >>>>>
> >>>>> I was also arguing we should support nested and flat, so if your
> >>>>> infra works better with flat then you could use
> >>>>>
> >>>>> track_warnings.coordinator_read_size.warn_threshold_kb: 0
> >>>>> track_warnings.coordinator_read_size.abort_threshold_kb: 0
> >>>>>
> >>>>>> On Nov 19, 2021, at 1:34 PM, David Capwell <[email protected]> wrote:
> >>>>>>> With the flat structure it turns into properties file - would it be
> >>>>>>> possible to support both formats - nested yaml and flat properties?
> >>>>>> For the majority of our configs yes, but there is a subset where flat
> >>>>>> properties are annoying:
> >>>>>> hinted_handoff_disabled_datacenters - set type, so you could do
> >>>>>> hinted_handoff_disabled_datacenters="a,b,c,d" but we would need to
> >>>>>> deal with separators as the format doesn’t support sets
> >>>>>> seed_provider.parameters - this is a map type… so would need to do
> >>>>>> something like seed_provider.parameters="{\"a\": \"a\"}" …. Maybe we
> >>>>>> special case maps as dynamic fields?  Then
> >>>>>> seed_provider.parameters.a=a?  We have ParameterizedClass all over
> >>>>>> the code
> >>>>>> So, as long as we define how to deal with java collections, we could
> >>>>>> in theory support properties files (not arguing for that in this
> >>>>>> thread) as well as system properties.
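
To make the two collection cases above concrete, here is a minimal Python sketch of one possible convention, not anything Cassandra implements: sets are written as a separator-delimited string, and map entries become dynamic fields under the map's key. `parse_set`, `collect_map` and the comma separator are assumptions for illustration.

```python
from typing import Dict, Set


def parse_set(raw: str, separator: str = ",") -> Set[str]:
    """Split a flat property value into a set, ignoring blank items.

    Assumption: the separator never appears inside an element.
    """
    return {item.strip() for item in raw.split(separator) if item.strip()}


def collect_map(properties: Dict[str, str], prefix: str) -> Dict[str, str]:
    """Gather 'prefix.<name>' properties into a map keyed by <name>."""
    dotted = prefix + "."
    return {
        key[len(dotted):]: value
        for key, value in properties.items()
        if key.startswith(dotted)
    }


# Flat properties using the two examples from the message above.
properties = {
    "hinted_handoff_disabled_datacenters": "a,b,c,d",
    "seed_provider.parameters.a": "a",
}

datacenters = parse_set(properties["hinted_handoff_disabled_datacenters"])
parameters = collect_map(properties, "seed_provider.parameters")
print(sorted(datacenters), parameters)
```

The dynamic-field form avoids escaping JSON inside a property value, at the cost of the parser needing to know which keys are maps.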
> >>>>>>> On Nov 19, 2021, at 1:22 PM, Jacek Lewandowski <[email protected]> wrote:
> >>>>>>> With the flat structure it turns into properties file - would it be
> >>>>>>> possible to support both formats - nested yaml and flat properties?
> >>>>>>>
> >>>>>>>
> >>>>>>> - - -- --- ----- -------- -------------
> >>>>>>> Jacek Lewandowski
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, Nov 19, 2021 at 10:08 PM Caleb Rackliffe <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> If it's nested, "track_warnings" would still work if you're
> >>>>>>>> grepping around in vim or less.
> >>>>>>>>
> >>>>>>>> I'd have to concede the point about grep output, although there
> >>>>>>>> are tools like https://github.com/kislyuk/yq that could probably
> >>>>>>>> be bent to do what you want.
> >>>>>>>>
> >>>>>>>> On Fri, Nov 19, 2021 at 1:08 PM Stefan Miklosovic <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hi David,
> >>>>>>>>>
> >>>>>>>>> while I do not oppose nested structure, it is really handy to
> >>>>>>>>> grep cassandra.yaml on some config key and you know the value
> >>>>>>>>> instantly. This is not possible (easily and quickly) when it is
> >>>>>>>>> nested, as it is on two lines. Or maybe my grepping is just not
> >>>>>>>>> advanced enough to cover this case? If it is flat, I can just
> >>>>>>>>> grep "track_warnings" and I have them all.
> >>>>>>>>>
> >>>>>>>>> Can you elaborate on your last bullet point? Parsing layer ...
> >>>>>>>>> What do you mean specifically?
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>>
> >>>>>>>>> On Fri, 19 Nov 2021 at 19:36, David Capwell <[email protected]> wrote:
> >>>>>>>>>> This has been brought up in a few tickets, so pushing to the dev
> >>>>>>>>>> list.
> >>>>>>>>>>
> >>>>>>>>>> CASSANDRA-15234 - Standardise config and JVM parameters
> >>>>>>>>>> CASSANDRA-16896 - hard/soft limits for queries
> >>>>>>>>>> CASSANDRA-17147 - Guardrails prototype
> >>>>>>>>>>
> >>>>>>>>>> In short, do we as a project wish to move "new features" into
> >>>>>>>>>> nested YAML when the feature has "enough" to justify the nesting?
> >>>>>>>>>> I would really like to focus this discussion on new features
> >>>>>>>>>> rather than retroactively grouping (leaving that to
> >>>>>>>>>> CASSANDRA-15234), as there is already a place to talk about that.
> >>>>>>>>>>
> >>>>>>>>>> To get things started, let's start with the track-warning
> >>>>>>>>>> feature (hard/soft limits for queries). Currently the configs
> >>>>>>>>>> look as follows (assuming 15234)
> >>>>>>>>>>
> >>>>>>>>>> track_warnings:
> >>>>>>>>>>     enabled: true
> >>>>>>>>>>     coordinator_read_size:
> >>>>>>>>>>         warn_threshold: 10kb
> >>>>>>>>>>         abort_threshold: 1mb
> >>>>>>>>>>     local_read_size:
> >>>>>>>>>>         warn_threshold: 10kb
> >>>>>>>>>>         abort_threshold: 1mb
> >>>>>>>>>>     row_index_size:
> >>>>>>>>>>         warn_threshold: 100mb
> >>>>>>>>>>         abort_threshold: 1gb
> >>>>>>>>>>
> >>>>>>>>>> or should this be "flat"
> >>>>>>>>>>
> >>>>>>>>>> track_warnings_enabled: true
> >>>>>>>>>> track_warnings_coordinator_read_size_warn_threshold: 10kb
> >>>>>>>>>> track_warnings_coordinator_read_size_abort_threshold: 1mb
> >>>>>>>>>> track_warnings_local_read_size_warn_threshold: 10kb
> >>>>>>>>>> track_warnings_local_read_size_abort_threshold: 1mb
> >>>>>>>>>> track_warnings_row_index_size_warn_threshold: 100mb
> >>>>>>>>>> track_warnings_row_index_size_abort_threshold: 1gb
> >>>>>>>>>>
> >>>>>>>>>> For me I prefer nested for a few reasons
> >>>>>>>>>> * easier to enforce consistency as the configs can use shared
> >>>>>>>>>> types; in the track warnings patch I had mismatches across
> >>>>>>>>>> configs (warn vs warns, fail vs abort, etc.) before going
> >>>>>>>>>> nested, now everything reuses the same types
> >>>>>>>>>> * even though it is longer, it can be clearer how things are
> >>>>>>>>>> related
> >>>>>>>>>> * parsing layer can add support for mixed or purely flat
> >>>>>>>>>> depending on user preference (example:
> >>>>>>>>>> track_warnings.row_index_size.abort_threshold, using the '.'
> >>>>>>>>>> notation to represent nested structures)
> >>>>>>>>>>
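
The parsing-layer idea in the last bullet (accepting nested, flat, or mixed input) can be sketched as a normalisation step that expands '.'-separated keys into nested sections. This is a minimal Python illustration under assumed semantics, not the actual Cassandra loader: `expand` and `merge` are hypothetical helpers, and treating a setting defined twice as an error is my assumption.

```python
from typing import Any, Dict


def merge(target: Dict[str, Any], source: Dict[str, Any]) -> None:
    """Merge source into target, refusing to overwrite existing values."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            merge(target[key], value)
        elif key in target:
            raise ValueError(f"conflicting definitions for {key!r}")
        else:
            target[key] = value


def expand(config: Dict[str, Any]) -> Dict[str, Any]:
    """Expand dotted keys into nested mappings; nested input passes through."""
    result: Dict[str, Any] = {}
    for key, value in config.items():
        if isinstance(value, dict):
            value = expand(value)
        *parents, leaf = key.split(".")
        node = result
        for part in parents:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(f"{part!r} is both a value and a section")
            node = child
        if isinstance(value, dict) and isinstance(node.get(leaf), dict):
            merge(node[leaf], value)
        elif leaf in node:
            raise ValueError(f"duplicate setting: {key}")
        else:
            node[leaf] = value
    return result


# Mixed input: one nested section plus one flat, dotted override.
mixed = {
    "track_warnings": {"enabled": True, "row_index_size": {"warn_threshold": "100mb"}},
    "track_warnings.row_index_size.abort_threshold": "1gb",
}
print(expand(mixed))
```

After this step the rest of the loader only ever sees the nested form, so shared types can be applied regardless of how the operator wrote the file.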
> >>>>>>>>>> Thoughts?
> >>>>>>>>>>
> >>>>>>>>>>
> >> ---------------------------------------------------------------------
> >>>>>>>>>> To unsubscribe,e-mail:[email protected]
> >>>>>>>>>> For additional commands,e-mail:[email protected]
