Re: [DISCUSS] Nested YAML configs for new features

Bowen Song Mon, 29 Nov 2021 08:22:39 -0800

In ElasticSearch, the default is a flattened format with almost all lines commented out. See https://github.com/elastic/elasticsearch/blob/master/distribution/src/config/elasticsearch.yml

I guess they chose to do that because users can uncomment individual lines to make changes. In a structured config file, the user will have to uncomment all lines containing the parent keys to get it to work. For example, if someone wants to set the config keyABB to a non-default value, they will have to correctly uncomment 3 lines: keyA, keyAB and keyABB, which can be annoying and makes it easy to make a mistake. If either of the first two keys is not uncommented, the YAML file will still be valid but a config like keyX.keyAB.keyABB might just get silently ignored by the database.


   keyX:
      keyY:
        keyZ: value
   # keyA:
   #   keyAA:
   #     keyAAA: value
   #   keyAB:
   #     keyABA: value
   #     keyABB: value

On 29/11/2021 15:54, Benjamin Lerer wrote:

I do not think that supporting both options is an issue. The settings
virtual table would have to use the flattened version.
If we support both formats, the question would be: what should be the one
used by default in the configuration file?

On Fri, 26 Nov 2021 at 15:40, bened...@apache.org <bened...@apache.org> wrote:

This is the approach I favour for config files also. We had a much less
engaged discussion on this topic only a few months ago, so glad to see more
people getting involved now.

I would however personally prefer to see the configuration file slowly
deprecated (if perhaps never retired), in favour of virtual tables, so that
operators may easily set configurations for the entire cluster. Ideally it
would be possible to specify configuration per cluster, per DC and per
node, with the most specific configuration applying. I would like to see a
similar hierarchy for Keyspace, Table and Per-Query options. Ideally only
the barest minimum number of options would be necessary to supply in a
config file, and only on first launch – seed nodes, for instance.

So whatever design we employ here, we should IMO be aiming for it to be
compatible with a CQL representation also.


From: Bowen Song <bo...@bso.ng.INVALID>
Date: Wednesday, 24 November 2021 at 18:15
To: dev@cassandra.apache.org <dev@cassandra.apache.org>
Subject: Re: [DISCUSS] Nested YAML configs for new features

Since you mentioned ElasticSearch, I'm actually pretty happy with their
config file syntax. It allows the user to completely flatten out the
entire config file. To give people who aren't familiar with ElasticSearch
an idea, here is a config file we use:

     cluster.name: foobar

     node.remote_cluster_client: false
     node.name: "foo.example.com"
     node.master: true
     node.data: true
     node.ingest: true
     node.ml: false

     xpack.ml.enabled: false
     xpack.security.enabled: false
     xpack.security.audit.enabled: false
     xpack.watcher.enabled: false

     action.auto_create_index: "+.,-*"

     network.host: _global_

     discovery.zen.hosts_provider: file
     discovery.zen.minimum_master_nodes: 2

     http.publish_host: "foo.example.com"
     http.publish_port: 443
     http.bind_host: 127.0.0.1

     transport.publish_host: "bar.example.com"
     transport.bind_host: 0.0.0.0

     indices.fielddata.cache.size: 1GB
     indices.breaker.total.use_real_memory: false

     path.logs: /var/log/elasticsearch
     path.data: /var/lib/elasticsearch/data

As you can see we can use the flat (grep-able) syntax for everything.
This is also human readable because we can group options together by
inserting empty lines between them.

The equivalent of the above in a structured syntax will be:

     cluster:
          name: foobar

     node:
          remote_cluster_client: false
          name: "foo.example.com"
          master: true
          data: true
          ingest: true
          ml: false

     xpack:
          ml:
              enabled: false
          security:
              enabled: false
              audit:
                  enabled: false
          watcher:
              enabled: false

     action:
          auto_create_index: "+.,-*"

     network:
          host: _global_

     discovery:
          zen:
              hosts_provider: file
              minimum_master_nodes: 2

     http:
          publish_host: "foo.example.com"
          publish_port: 443
          bind_host: 127.0.0.1

     transport:
          publish_host: "bar.example.com"
          bind_host: 0.0.0.0

     indices:
          fielddata:
              cache:
                  size: 1GB
          breaker:
              total:
                  use_real_memory: false

     path:
          logs: /var/log/elasticsearch
          data: /var/lib/elasticsearch/data

This may be easier to read for some people, but it is a total nightmare
for "grep" - so many keys have identical names, such as "enabled".
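
One way to get grep-friendly output from a nested file is to flatten it before searching. The following is only a minimal Python sketch, assuming the YAML has already been parsed into plain dicts; the function name `flatten` and the sample excerpt are hypothetical and not taken from any project.

```python
def flatten(tree, prefix=""):
    """Yield (dotted_key, value) pairs for every leaf of a nested dict."""
    for key, value in tree.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into the sub-tree, carrying the parent path along.
            yield from flatten(value, path)
        else:
            yield path, value


# Hypothetical excerpt of the structured config above, already parsed.
nested = {
    "xpack": {
        "ml": {"enabled": False},
        "security": {"enabled": False, "audit": {"enabled": False}},
    },
    "path": {"logs": "/var/log/elasticsearch"},
}

# Each leaf becomes one line with a unique key, so "enabled" is no longer ambiguous.
flat_lines = [f"{key}: {value}" for key, value in flatten(nested)]
print("\n".join(flat_lines))
```

Every output line then carries its full path, so a plain grep for "xpack.security" would return only the related options.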

Also, for the virtual tables, it would be a lot easier to represent
individual values in a virtual table when the config is flat and keys
are unique. The virtual tables would need to either support the encoding
and decoding of the structured config into a flat structure, or use a
JSON-encoded string value. The use of JSON would make querying individual
values much harder.

On 22/11/2021 16:16, Joseph Lynch wrote:

Isn't one of the primary reasons to have a YAML configuration instead
of a properties file to allow typed and structured (implies nested)
configuration? I think it makes a lot of sense to group related
configuration options (e.g. a feature) into a typed class when we're
talking about more than one or two related options.

It's pretty standard elsewhere in the JVM ecosystem to encode YAMLs to
period encoded key->value pairs when required (usually when providing
a property or override layer), Spring and Elasticsearch yamls both
come to mind. It seems pretty reasonable to support dot encoding and
decoding, for example {"a": {"b": 12}} -> '"a.b": 12'.
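
The decoding direction can be sketched in a few lines. This is a hedged illustration in Python rather than anything from Cassandra, Spring or Elasticsearch; the names `leaves` and `unflatten` are hypothetical, and the sketch assumes keys never contain a literal "." and that a scalar and a nested block never share the same path.

```python
def leaves(config, prefix=()):
    """Yield (path_tuple, value) for every leaf; keys may be dotted, nested, or both."""
    for key, value in config.items():
        path = prefix + tuple(key.split("."))
        if isinstance(value, dict):
            yield from leaves(value, path)
        else:
            yield path, value


def unflatten(config):
    """Decode dotted keys into a nested dict, merging with already-nested keys."""
    tree = {}
    for path, value in leaves(config):
        node = tree
        for part in path[:-1]:
            # Create intermediate levels on demand, reusing existing ones.
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return tree


mixed = {
    "track_warnings": {"enabled": True},
    "track_warnings.coordinator_read_size.warn_threshold_kb": 0,
}
print(unflatten(mixed))
```

Because both styles are reduced to the same list of paths first, a file could mix nested blocks and dotted keys and still end up as one tree.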

Regarding quickly telling what configuration a node is running I think
we should lean on virtual tables for "what is the current
configuration" now that we have them, as others have said the written
cassandra.yaml is not necessarily the current configuration ... and
also grep -C or -A exist for this reason.

-Joey

On Mon, Nov 22, 2021 at 4:14 AM Benjamin Lerer <ble...@apache.org> wrote:

I do not have a strong opinion for one or the other but wanted to raise the
issue I see with the "Settings" virtual table.

Currently the "Settings" virtual table converts nested options into flat
options using a "_" separator. For those options it allows a user to query
the whole set of options through some hack.
If we decide to move to more nesting (more than one level), it seems to me
that we need to change the way this table is behaving and how we can query
its data.

We would need to start using "." as a nesting separator to ensure that
things are consistent between the configuration and the table and add
support for LIKE restrictions for filtering queries to allow operators to
select the precise set of settings that they are looking for.

Doing so is not really complicated in itself but might impact some users.
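
To make the proposal concrete, the effect of a LIKE restriction over "."-separated setting names can be emulated outside the database. This Python sketch is an assumption-laden illustration, not the virtual table's behaviour: `select_settings` is a hypothetical helper, only "%" is treated as a wildcard, and the sample rows reuse names from this thread.

```python
import re


def select_settings(settings, like):
    """Return the settings whose name matches a LIKE-style pattern ('%' = any run)."""
    # Escape the literal pieces and join them with '.*' where '%' appeared.
    regex = re.compile(".*".join(re.escape(piece) for piece in like.split("%")))
    return {name: value for name, value in settings.items() if regex.fullmatch(name)}


# Hypothetical flat view of the settings, one row per dotted name.
settings = {
    "track_warnings.enabled": "true",
    "track_warnings.row_index_size.warn_threshold": "100mb",
    "track_warnings.row_index_size.abort_threshold": "1gb",
    "hinted_handoff_enabled": "true",
}

selected = select_settings(settings, "track_warnings.row_index_size.%")
print(selected)
```

With "." reserved as the nesting separator, a prefix pattern selects exactly one sub-tree, which is harder to guarantee when "_" serves as both separator and part of ordinary names.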

On Fri, 19 Nov 2021 at 22:39, David Capwell <dcapw...@apple.com.invalid> wrote:

it is really handy to grep
cassandra.yaml on some config key and you know the value instantly.

You can still do that

$ grep -A2 coordinator_read_size conf/cassandra.yaml
#     coordinator_read_size:
#         warn_threshold_kb: 0
#         abort_threshold_kb: 0

I was also arguing we should support nested and flat, so if your infra
works better with flat then you could use

track_warnings.coordinator_read_size.warn_threshold_kb: 0
track_warnings.coordinator_read_size.abort_threshold_kb: 0

On Nov 19, 2021, at 1:34 PM, David Capwell <dcapw...@apple.com> wrote:

With the flat structure it turns into properties file - would it be
possible to support both formats - nested yaml and flat properties?

For the majority of our configs yes, but there is a subset where flat
properties are annoying

hinted_handoff_disabled_datacenters - set type, so you could do
hinted_handoff_disabled_datacenters="a,b,c,d" but we would need to deal
with separators as the format doesn't support

seed_provider.parameters - this is a map type… so would need to do
something like seed_provider.parameters="{\"a\": \"a\"}" …. Maybe we special
case maps as dynamic fields?  Then seed_provider.parameters.a=a?  We have
ParameterizedClass all over the code

So, as long as we define how to deal with java collections; we could in
theory support properties files (not arguing for that in this thread) as
well as system properties.
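
The two collection cases above could be handled roughly as follows. This is a Python sketch under stated assumptions, not how Cassandra parses anything: `parse_properties` is hypothetical, the caller declares which keys are sets and which are maps, set members are split on commas with no escaping, and map entries use the "dynamic field" form suggested above.

```python
def parse_properties(lines, set_keys=(), map_keys=()):
    """Parse key=value lines, treating declared keys as sets or as map prefixes."""
    config = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, raw = line.partition("=")
        key, raw = key.strip(), raw.strip()
        # A key such as "seed_provider.parameters.a" belongs to the declared map.
        map_key = next((m for m in map_keys if key.startswith(m + ".")), None)
        if map_key is not None:
            config.setdefault(map_key, {})[key[len(map_key) + 1:]] = raw
        elif key in set_keys:
            # Assumption: commas separate members and never appear inside one.
            config[key] = {item.strip() for item in raw.split(",") if item.strip()}
        else:
            config[key] = raw
    return config


parsed = parse_properties(
    [
        "hinted_handoff_disabled_datacenters=a,b,c,d",
        "seed_provider.parameters.a=a",
    ],
    set_keys={"hinted_handoff_disabled_datacenters"},
    map_keys={"seed_provider.parameters"},
)
print(parsed)
```

The separator problem mentioned above is left open here: a datacenter name containing a comma would need an escaping rule that this sketch does not define.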

On Nov 19, 2021, at 1:22 PM, Jacek Lewandowski <lewandowski.ja...@gmail.com> wrote:

With the flat structure it turns into properties file - would it be
possible to support both formats - nested yaml and flat properties?


- - -- --- ----- -------- -------------
Jacek Lewandowski


On Fri, Nov 19, 2021 at 10:08 PM Caleb Rackliffe <calebrackli...@gmail.com> wrote:

If it's nested, "track_warnings" would still work if you're grepping around
vim or less.

I'd have to concede the point about grep output, although there are tools
like https://github.com/kislyuk/yq that could probably be bent to do what
you want.

On Fri, Nov 19, 2021 at 1:08 PM Stefan Miklosovic <
stefan.mikloso...@instaclustr.com> wrote:

Hi David,

while I do not oppose nested structure, it is really handy to grep
cassandra.yaml on some config key and you know the value instantly.
This is not possible (easily and quickly) when it is nested, as it is on
two lines. Or maybe my grepping is just not advanced enough to cover
this case? If it is flat, I can just grep "track_warnings" and I have
them all.

Can you elaborate on your last bullet point? Parsing layer ... What do
you mean specifically?

Thanks

On Fri, 19 Nov 2021 at 19:36, David Capwell <dcapw...@gmail.com> wrote:

This has been brought up in a few tickets, so pushing to the dev list.

CASSANDRA-15234 - Standardise config and JVM parameters
CASSANDRA-16896 - hard/soft limits for queries
CASSANDRA-17147 - Guardrails prototype

In short, do we as a project wish to move "new features" into nested
YAML when the feature has "enough" to justify the nesting?  I would
really like to focus this discussion on new features rather than
retroactively grouping (leaving that to CASSANDRA-15234), as there is
already a place to talk about that.

To get things started, let's start with the track-warning feature
(hard/soft limits for queries), currently the configs look as follows
(assuming 15234)

track_warnings:
    enabled: true
    coordinator_read_size:
        warn_threshold: 10kb
        abort_threshold: 1mb
    local_read_size:
        warn_threshold: 10kb
        abort_threshold: 1mb
    row_index_size:
        warn_threshold: 100mb
        abort_threshold: 1gb

or should this be "flat"

track_warnings_enabled: true
track_warnings_coordinator_read_size_warn_threshold: 10kb
track_warnings_coordinator_read_size_abort_threshold: 1mb
track_warnings_local_read_size_warn_threshold: 10kb
track_warnings_local_read_size_abort_threshold: 1mb
track_warnings_row_index_size_warn_threshold: 100mb
track_warnings_row_index_size_abort_threshold: 1gb

For me I prefer nested for a few reasons
* easier to enforce consistency as the configs can use shared types;
in the track warnings patch I had mismatches across configs (warn vs
warns, fail vs abort, etc.) before going nested, now everything reuses
the same types
* even though it is longer, it can be clearer how things are related
* parsing layer can add support for mixed or purely flat depending on
user preference (example:
track_warnings.row_index_size.abort_threshold, using the '.' notation
to represent nested structures)
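
The shared-types point in the first bullet can be illustrated with a small sketch. It is written in Python for brevity and is not the patch's code: the class names `Thresholds` and `TrackWarnings` and the loader `load_track_warnings` are hypothetical, and threshold values are kept as plain strings rather than parsed sizes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """One shared type, so every feature spells its limits the same way."""
    warn_threshold: str
    abort_threshold: str


@dataclass(frozen=True)
class TrackWarnings:
    enabled: bool
    coordinator_read_size: Thresholds
    local_read_size: Thresholds
    row_index_size: Thresholds


def load_track_warnings(raw):
    """Build the typed config from a parsed nested mapping."""
    groups = ("coordinator_read_size", "local_read_size", "row_index_size")
    # Thresholds(**...) raises TypeError on a misspelled key such as "warns_threshold".
    return TrackWarnings(
        enabled=raw["enabled"],
        **{name: Thresholds(**raw[name]) for name in groups},
    )


config = load_track_warnings(
    {
        "enabled": True,
        "coordinator_read_size": {"warn_threshold": "10kb", "abort_threshold": "1mb"},
        "local_read_size": {"warn_threshold": "10kb", "abort_threshold": "1mb"},
        "row_index_size": {"warn_threshold": "100mb", "abort_threshold": "1gb"},
    }
)
print(config.row_index_size.abort_threshold)
```

With the flat naming, each of the six threshold keys is an independent field, so a "warns" or "fail" variant can slip in unnoticed; with the shared type the mismatch fails at load time.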

Thoughts?

---------------------------------------------------------------------

To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org
