
main: introduce schema commitlog scheduling group #18715

Closed

Conversation

@piodul (Contributor) commented May 16, 2024

Currently, we do not explicitly set a scheduling group for the schema commitlog, which causes it to run in the default scheduling group (called "main"). However:

  • It is important and significant enough that it should run in a scheduling group that is separate from the main one;
  • It should not run in the existing "commitlog" group, as user writes may sometimes need to wait for schema commitlog writes (e.g. a read barrier performed to learn the schema needed to interpret the user write), and we want to avoid priority inversion issues.

Therefore, introduce a new scheduling group dedicated to the schema commitlog.
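
To make the intent concrete, here is a minimal Seastar-level sketch (the function name and the way the group is passed around are illustrative assumptions, not the actual Scylla code): work submitted through with_scheduling_group has its CPU and I/O accounted to that group instead of "main".

#include <functional>

#include <seastar/core/future.hh>
#include <seastar/core/scheduling.hh>
#include <seastar/core/with_scheduling_group.hh>

// Illustrative only: run a schema commitlog write under the dedicated group,
// so it neither runs in "main" nor competes with the user "commitlog" group.
seastar::future<> write_schema_entry(seastar::scheduling_group schema_commitlog_sg,
                                     std::function<seastar::future<>()> do_write) {
    return seastar::with_scheduling_group(schema_commitlog_sg, std::move(do_write));
}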

Fixes: #15566

  • It's not a regression, so conservatively not marking for backport

@piodul added the backport/none (Backport is not required) label on May 16, 2024
@piodul (Contributor Author) commented May 16, 2024

I ran the following script:

from cassandra.cluster import Cluster

c = Cluster()
s = c.connect()

while True:
    s.execute("CREATE KEYSPACE IF NOT EXISTS ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")
    s.execute("DROP KEYSPACE ks")

and watched whether the scylla_io_queue_total_write_bytes metric goes up in the schema_commitlog scheduling group. Indeed, most of the data is written in that scheduling group, but there is also some non-trivial traffic in the main scheduling group.

In order to obtain a backtrace, I added an error injection which aborts if seastar tries to process a write in the main scheduling group; it looks like sstable writes are done there.
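
For reference, a rough sketch of what such a check can look like; the real change used Scylla's error-injection framework, whose API is not reproduced here, and the helper below is a hypothetical stand-in:

#include <cstdlib>

#include <seastar/core/scheduling.hh>
#include <seastar/util/log.hh>

static seastar::logger sg_check_log("sg_check");

// Hypothetical check: abort (so a backtrace/core can be collected) if the
// current task runs in the default ("main") scheduling group.
inline void abort_if_in_main_scheduling_group() {
    auto sg = seastar::current_scheduling_group();
    if (sg == seastar::default_scheduling_group()) {
        sg_check_log.error("write issued from scheduling group '{}'", sg.name());
        std::abort();
    }
}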

@@ -1024,6 +1024,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
dbcfg.memtable_to_cache_scheduling_group = make_sched_group("memtable_to_cache", "mt2c", 200);
dbcfg.gossip_scheduling_group = make_sched_group("gossip", "gms", 1000);
dbcfg.commitlog_scheduling_group = make_sched_group("commitlog", "clog", 1000);
dbcfg.schema_commitlog_scheduling_group = make_sched_group("schema_commitlog", "sclg", 1000);
A reviewer (Contributor) commented on this diff:

sclog or schema_clog perhaps, for consistency with the previous name (stretching my intelligence trying to contribute to this patch to the extreme)

@piodul (Contributor Author) replied:

I'd love to use "sclog", but there is a hard-coded 4-character limit on the size of the short name of the scheduling group.
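
For context, make_sched_group is Scylla's own wrapper; the sketch below shows what an equivalent direct Seastar call could look like, assuming the create_scheduling_group overload that also takes a short name:

#include <seastar/core/future.hh>
#include <seastar/core/scheduling.hh>

// Sketch: create the schema commitlog group with 1000 shares and the
// 4-character short name "sclg" (hence not "sclog").
seastar::future<seastar::scheduling_group> make_schema_commitlog_sg() {
    return seastar::create_scheduling_group("schema_commitlog", "sclg", 1000);
}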

@kostja (Contributor) commented May 16, 2024

> I ran the following script: [...] In order to obtain a backtrace, I added an error injection which aborts if seastar tries to process a write in the main scheduling group; it looks like sstable writes are done there.

We don't force-flush the sstables any more when changing schema tables. Are there any flushes in the log?

@avikivity (Member) commented:

I think it's too minor to have its own scheduling group.

What about gossip? It's for "related" work.

@scylladb-promoter (Contributor) commented:

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest with topology changes
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 6 hr 33 min
  • Builder: i-03450de8b53d36ec1 (m5ad.8xlarge)

@kostja (Contributor) commented May 17, 2024

@avikivity the goal here is to avoid priority inversion, since the schema commit log is used during the raft read barrier, which is in turn used when performing fencing on the data write path. I believe @kbr-scylla established the need for this during benchmarks.

@piodul (Contributor Author) commented May 17, 2024

> We don't force-flush the sstables any more when changing schema tables. Are there any flushes in the log?

It doesn't look like we are doing explicit flushes. Flushes don't happen every time a keyspace is created or dropped, but around once in 10 iterations of the script loop. They seem to be triggered by the dirty_memory_manager due to memory pressure.

@kostja (Contributor) commented May 17, 2024

> > We don't force-flush the sstables any more when changing schema tables. Are there any flushes in the log?
>
> It doesn't look like we are doing explicit flushes. Flushes don't happen every time a keyspace is created or dropped, but around once in 10 iterations of the script loop. They seem to be triggered by the dirty_memory_manager due to memory pressure.

Perhaps the settings of the default dirty memory manager, which is used for system tables, could be tweaked now to increase the limits: since we no longer need to flush the memtables at every update to the system tables, we don't need to make the settings so tight.

system.paxos is already using user memory (check out maybe_write_in_user_memory()), but that's another extreme.

@avikivity (Member) commented:

> @avikivity the goal here is to avoid priority inversion, since the schema commit log is used during the raft read barrier, which is in turn used when performing fencing on the data write path. I believe @kbr-scylla established the need for this during benchmarks.

I don't object to moving schema commitlog to another scheduling group. It's a good thing to do.

I want to see if we can reuse another low-throughput group instead of adding a new one.

@piodul (Contributor Author) commented May 21, 2024

> I don't object to moving schema commitlog to another scheduling group. It's a good thing to do.
>
> I want to see if we can reuse another low-throughput group instead of adding a new one.

@avikivity I must admit that I am a bit confused about how we choose scheduling groups for particular workloads. We have named groups for dedicated workloads: "gossip" for gossip, "commitlog" for the (non-schema) commitlog, etc. I know that we put background work into "streaming", and we also have "main" for unclassified stuff.

Why would the schema commitlog belong to gossip? How should one think about choosing the scheduling group?

@denesb (Contributor) commented May 21, 2024

> @avikivity I must admit that I am a bit confused about how we choose scheduling groups for particular workloads. We have named groups for dedicated workloads: "gossip" for gossip, "commitlog" for the (non-schema) commitlog, etc. I know that we put background work into "streaming", and we also have "main" for unclassified stuff.
>
> Why would the schema commitlog belong to gossip? How should one think about choosing the scheduling group?

Gossip is administrative work that is as important as user workloads. Hence the gossip sg also has 1000 shares (just like the statement sg).

@kbr-scylla (Contributor) commented:

If we want to put the schema commitlog and gossip in the same group, then the group should not be named "gossip" but perhaps something like "administrative" or "internal".

Anyway, I have already queued the patch. I think it is better than leaving the schema commitlog in the "main" group. If we want to merge the schema group with the gossip group, that can be done in a follow-up patch.

@avikivity (Member) commented:

> I don't object to moving schema commitlog to another scheduling group. It's a good thing to do.
> I want to see if we can reuse another low-throughput group instead of adding a new one.
>
> @avikivity I must admit that I am a bit confused about how we choose scheduling groups for particular workloads. We have named groups for dedicated workloads: "gossip" for gossip, "commitlog" for the (non-schema) commitlog, etc. I know that we put background work into "streaming", and we also have "main" for unclassified stuff.
>
> Why would the schema commitlog belong to gossip? How should one think about choosing the scheduling group?

We don't have a secret procedure for assigning scheduling groups.

I agree that placing the schema commitlog in gossip is incongruous. We could rename it to "metadata", and move all of group0 there. After all, it's wrong that DDL statements or topology movements compete with repair.

We may want to keep the actual tablet load balancer calculations in maintenance, since they can be heavy and nothing depends on them.

@kbr-scylla (Contributor) commented:

> After all, it's wrong that DDL statements or topology movements compete with repair.

Thank you for reminding me that queries done by the group 0 state machine should be moved out of the "statement" scheduling group. It should be a GH issue, but I keep forgetting to open it.

@denesb (Contributor) commented May 21, 2024

> > After all, it's wrong that DDL statements or topology movements compete with repair.
>
> Thank you for reminding me that queries done by the group 0 state machine should be moved out of the "statement" scheduling group. It should be a GH issue, but I keep forgetting to open it.

Statement verbs support tenants -- they can switch between a predefined set of scheduling groups. Currently there are two: statement and default (system). There is an open PR for a third: streaming.
That said, tenants are not free; they each have their own connection. So maybe the best solution is to move the DDL verbs to the gossip connection. This cannot be done if the verbs are shared with other users.

Labels: backport/none (Backport is not required), promoted-to-master
Merging this pull request may close: Use a dedicated scheduling group for schema commitlog
6 participants