
main: introduce schema commitlog scheduling group #18715

Closed

Conversation

@piodul (Contributor) commented May 16, 2024

Currently, we do not explicitly set a scheduling group for the schema commitlog, which causes it to run in the default scheduling group (called "main"). However:

  • It is important and significant enough that it should run in a scheduling group that is separate from the main one;
  • It should not run in the existing "commitlog" group, as user writes may sometimes need to wait for schema commitlog writes (e.g. a read barrier performed to learn the schema needed to interpret the user write), and we want to avoid priority inversion issues.

Therefore, introduce a new scheduling group dedicated to the schema commitlog.
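
To make the intent concrete, here is a minimal Seastar-level sketch (the function name and the way the group is passed around are illustrative assumptions, not the actual Scylla code): work submitted through with_scheduling_group has its CPU and I/O accounted to that group instead of "main".

#include <functional>

#include <seastar/core/future.hh>
#include <seastar/core/scheduling.hh>
#include <seastar/core/with_scheduling_group.hh>

// Illustrative only: run a schema commitlog write under the dedicated group,
// so it neither runs in "main" nor competes with the user "commitlog" group.
seastar::future<> write_schema_entry(seastar::scheduling_group schema_commitlog_sg,
                                     std::function<seastar::future<>()> do_write) {
    return seastar::with_scheduling_group(schema_commitlog_sg, std::move(do_write));
}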

Fixes: #15566

  • It's not a regression, so conservatively not marking for backport

@piodul added the backport/none (Backport is not required) label on May 16, 2024
@piodul (Contributor Author) commented May 16, 2024

I ran the following script:

from cassandra.cluster import Cluster

c = Cluster()
s = c.connect()

while True:
    s.execute("CREATE KEYSPACE IF NOT EXISTS ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")
    s.execute("DROP KEYSPACE ks")

and watched whether the scylla_io_queue_total_write_bytes metric goes up in the schema_commitlog scheduling group. Indeed, most of the data is written in that scheduling group, but there is also some non-trivial traffic in the main scheduling group.

In order to obtain a backtrace, I added an error injection which aborts if seastar tries to process a write in the main scheduling group; it looks like sstable writes are done there.
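
For reference, a rough sketch of what such a check can look like; the real change used Scylla's error-injection framework, whose API is not reproduced here, and the helper below is a hypothetical stand-in:

#include <cstdlib>

#include <seastar/core/scheduling.hh>
#include <seastar/util/log.hh>

static seastar::logger sg_check_log("sg_check");

// Hypothetical check: abort (so a backtrace/core can be collected) if the
// current task runs in the default ("main") scheduling group.
inline void abort_if_in_main_scheduling_group() {
    auto sg = seastar::current_scheduling_group();
    if (sg == seastar::default_scheduling_group()) {
        sg_check_log.error("write issued from scheduling group '{}'", sg.name());
        std::abort();
    }
}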

@@ -1024,6 +1024,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
dbcfg.memtable_to_cache_scheduling_group = make_sched_group("memtable_to_cache", "mt2c", 200);
dbcfg.gossip_scheduling_group = make_sched_group("gossip", "gms", 1000);
dbcfg.commitlog_scheduling_group = make_sched_group("commitlog", "clog", 1000);
dbcfg.schema_commitlog_scheduling_group = make_sched_group("schema_commitlog", "sclg", 1000);
A reviewer (Contributor) commented on this diff:

sclog or schema_clog perhaps, for consistency with the previous name (stretching my intelligence trying to contribute to this patch to the extreme)

@piodul (Contributor Author) replied:

I'd love to use "sclog", but there is a hard-coded 4-character limit on the size of the short name of the scheduling group.
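
For context, make_sched_group is Scylla's own wrapper; the sketch below shows what an equivalent direct Seastar call could look like, assuming the create_scheduling_group overload that also takes a short name:

#include <seastar/core/future.hh>
#include <seastar/core/scheduling.hh>

// Sketch: create the schema commitlog group with 1000 shares and the
// 4-character short name "sclg" (hence not "sclog").
seastar::future<seastar::scheduling_group> make_schema_commitlog_sg() {
    return seastar::create_scheduling_group("schema_commitlog", "sclg", 1000);
}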

@kostja (Contributor) commented May 16, 2024

> I ran the following script: [...] In order to obtain a backtrace, I added an error injection which aborts if seastar tries to process a write in the main scheduling group; it looks like sstable writes are done there.

We don't force-flush the sstables any more when changing schema tables. Are there any flushes in the log?

@avikivity (Member) commented:

I think it's too minor to have its own scheduling group.

What about gossip? It's for "related" work.

@scylladb-promoter (Contributor) commented:

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest with topology changes
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 6 hr 33 min
  • Builder: i-03450de8b53d36ec1 (m5ad.8xlarge)

@kostja (Contributor) commented May 17, 2024

@avikivity the goal here is to avoid priority inversion, since the schema commit log is used during the raft read barrier, which is in turn used when performing fencing on the data write path. I believe @kbr-scylla established the need for this during benchmarks.

@piodul (Contributor Author) commented May 17, 2024

> We don't force-flush the sstables any more when changing schema tables. Are there any flushes in the log?

It doesn't look like we are doing explicit flushes. Flushes don't happen every time a keyspace is created or dropped, but around once in 10 iterations of the script loop. They seem to be triggered by the dirty_memory_manager due to memory pressure.

@kostja (Contributor) commented May 17, 2024

> > We don't force-flush the sstables any more when changing schema tables. Are there any flushes in the log?
>
> It doesn't look like we are doing explicit flushes. Flushes don't happen every time a keyspace is created or dropped, but around once in 10 iterations of the script loop. They seem to be triggered by the dirty_memory_manager due to memory pressure.

Perhaps the settings of the default dirty memory manager, which is used for system tables, could be tweaked now to increase the limits: since we no longer need to flush the memtables at every update to the system tables, we don't need to make the settings so tight.

system.paxos is already using user memory (check out maybe_write_in_user_memory()), but that's another extreme.

@avikivity (Member) commented:

> @avikivity the goal here is to avoid priority inversion, since the schema commit log is used during the raft read barrier, which is in turn used when performing fencing on the data write path. I believe @kbr-scylla established the need for this during benchmarks.

I don't object to moving schema commitlog to another scheduling group. It's a good thing to do.

I want to see if we can reuse another low-throughput group instead of adding a new one.

@piodul (Contributor Author) commented May 21, 2024

> I don't object to moving schema commitlog to another scheduling group. It's a good thing to do.
>
> I want to see if we can reuse another low-throughput group instead of adding a new one.

@avikivity I must admit that I am a bit confused about how we choose scheduling groups for particular workloads. We have named groups for dedicated workloads: "gossip" for gossip, "commitlog" for the (non-schema) commitlog, etc. I know that we put background work into "streaming", and we also have "main" for unclassified stuff.

Why would the schema commitlog belong to gossip? How should one think about choosing the scheduling group?

@denesb (Contributor) commented May 21, 2024

> @avikivity I must admit that I am a bit confused about how we choose scheduling groups for particular workloads. We have named groups for dedicated workloads: "gossip" for gossip, "commitlog" for the (non-schema) commitlog, etc. I know that we put background work into "streaming", and we also have "main" for unclassified stuff.
>
> Why would the schema commitlog belong to gossip? How should one think about choosing the scheduling group?

Gossip is administrative work that is as important as user workloads. Hence the gossip sg also has 1000 shares (just like the statement sg).

@kbr-scylla (Contributor) commented:

If we want to put the schema commitlog and gossip in the same group, then the group should not be named "gossip" but perhaps something like "administrative" or "internal".

Anyway, I have already queued the patch. I think it is better than leaving the schema commitlog in the "main" group. If we want to merge the schema group with the gossip group, that can be done in a follow-up patch.

@avikivity (Member) commented:

> I don't object to moving schema commitlog to another scheduling group. It's a good thing to do.
> I want to see if we can reuse another low-throughput group instead of adding a new one.
>
> @avikivity I must admit that I am a bit confused about how we choose scheduling groups for particular workloads. We have named groups for dedicated workloads: "gossip" for gossip, "commitlog" for the (non-schema) commitlog, etc. I know that we put background work into "streaming", and we also have "main" for unclassified stuff.
>
> Why would the schema commitlog belong to gossip? How should one think about choosing the scheduling group?

We don't have a secret procedure for assigning scheduling groups.

I agree that placing the schema commitlog in gossip is incongruous. We could rename it to "metadata", and move all of group0 there. After all, it's wrong that DDL statements or topology movements compete with repair.

We may want to keep the actual tablet load balancer calculations in maintenance, since they can be heavy and nothing depends on them.

@kbr-scylla (Contributor) commented:

> After all, it's wrong that DDL statements or topology movements compete with repair.

Thank you for reminding me that queries done by the group 0 state machine should be moved out of the "statement" scheduling group. It should be a GH issue, but I keep forgetting to open it.

@denesb (Contributor) commented May 21, 2024

> > After all, it's wrong that DDL statements or topology movements compete with repair.
>
> Thank you for reminding me that queries done by the group 0 state machine should be moved out of the "statement" scheduling group. It should be a GH issue, but I keep forgetting to open it.

Statement verbs support tenants -- they can switch between a predefined set of scheduling groups. Currently there are two: statement and default (system). There is an open PR for a third: streaming.
That said, tenants are not free; they each have their own connection. So maybe the best solution is to move the DDL verbs to the gossip connection. This cannot be done if the verbs are shared with other users.

Labels: backport/none (Backport is not required), promoted-to-master
Merging this pull request may close: Use a dedicated scheduling group for schema commitlog
6 participants