> Let me know your thoughts and looking forward to a good LPC MC discussion!
>
Nice write up Joel, thanks for taking the time to compile this in such
detail!

After going through the details of the interface proposal using cgroup v2
controllers, and based on our offline discussion, I would like to note down
an idea for a new pseudo filesystem interface for core scheduling. We could
include this in the API discussion during the core scheduling MC as well.

coreschedfs: pseudo filesystem interface for Core Scheduling
------------------------------------------------------------

The basic requirement of core scheduling is simple - we need to group a set
of tasks into a trust group that can share a core. So we don’t really need a
nested hierarchy for the trust groups. Cgroup v2 follows a unified nested
hierarchy model, which causes considerable confusion if the trusted tasks
are at different levels of the hierarchy and we need to allow them to share
a core. Cgroup v2's single hierarchy model makes it difficult to regroup
tasks at different levels of nesting for core scheduling. As noted in this
mail, we could use a multi-file approach and other interfaces like prctl to
overcome this limitation.

The idea proposed here to overcome the above limitation is a new pseudo
filesystem - “coreschedfs”. It is basically a flat filesystem with a maximum
nesting level of 1. That is, the root directory can have sub-directories for
sub-groups, but those sub-directories cannot have further sub-directories
representing trust groups. The root directory represents the system-wide
trust group and the sub-directories represent trusted groups. Each
directory, including the root directory, has the following set of
files/directories:

- cookie_id: User-exposed id for a cookie. This can be compared to a file
  descriptor. It could be used in a programmatic API to join/leave a group.
- properties: An interface to specify how child tasks of this group should
  behave. Can be used for specifying future flag requirements as well.
  Current list of properties include:
    NEW_COOKIE_FOR_CHILD: All fork() for tasks in this group will result in
                          creation of a new trust group
    SAME_COOKIE_FOR_CHILD: All fork() for tasks in this group will end up in
                           this same group
    ROOT_COOKIE_FOR_CHILD: All fork() for tasks in this group go to the
                           root group
- tasks: Lists the tasks in this group. Main interface for adding and
  removing tasks in a group.
- <pid>: A directory per task that is a member of this trust group.
- <pid>/properties: Same as the parent properties file, but overrides the
  group setting for this task.

This pseudo filesystem can be mounted anywhere in the root filesystem; I
propose the default to be “/sys/kernel/coresched”. When coresched is
enabled, the kernel internally creates the framework for this filesystem.
The filesystem gets mounted at the default location and the admin can change
this if needed. All tasks are in the root group by default. The admin or
programs can then create trusted groups on top of this filesystem.

Hooks will be placed in fork() and exit() to make sure that the filesystem’s
view of tasks is up-to-date with the system. APIs manipulating core
scheduling trusted groups should also make sure that the filesystem's view
is updated.

Note: The above idea is very similar to cgroups v1. Since there is no
unified hierarchy in cgroup v1, most of the features of coreschedfs could be
implemented as a cgroup v1 controller. As no new v1 controllers are allowed,
I feel the best alternative for a simple API is a new filesystem -
coreschedfs.

The advantages of this approach are:
- Detached from the cgroup unified hierarchy, and hence the very simple
  requirement of core scheduling can be easily materialized.
- Admin can have fine-grained control of groups using shell and scripting.
- Can have programmatic access to this using existing APIs like mkdir,
  rmdir, write, read.
  Or can come up with new APIs using the cookie_id, which can wrap the above
  Linux APIs or use a new system call for core scheduling.
- Fine-grained permission control using Linux filesystem permissions and
  ACLs.

Disadvantages are:
- Yet another pseudo filesystem.
- Very similar to cgroup v1 and might be re-implementing features that are
  already provided by cgroups v1.

Use Cases
---------

Use case 1: Google cloud
------------------------
Since we no longer depend on cgroup v2 hierarchies, there will not be any
issue of nesting and sharing. The main daemon can create trusted groups in
the filesystem and provide the required permissions for each group. Then the
init processes for each job can be added to their respective groups, where
they can create child tasks as needed. Multiple jobs under the same customer
which need to share a core can be housed in one group.

Use case 2: Chrome browser
--------------------------
We start with one group for the first task and then set properties to
NEW_COOKIE_FOR_CHILD.

Use case 3: Chrome VMs
----------------------
Similar to the Chrome browser, the VM task can put each vcpu in its own
group.

Use case 4: Oracle use case
---------------------------
This is also similar to use case 1 with this interface. All tasks that need
to be in the root group can be easily added by the admin.

Use case 5: General virtualization
----------------------------------
The requirement is that each VM should be isolated. This can be easily done
by creating a new group per VM.

Please have a look at the above proposal and let us know your thoughts. We
shall include this in the interface discussion at the core scheduling MC as
well.

Thanks,
Vineeth
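
To make the "programmatic access using mkdir, write, read" point a bit more
concrete, here is a rough sketch in Python. coreschedfs does not exist, so
this is only an illustration under assumptions: a temporary directory stands
in for the mount point, the helper names (create_group, set_fork_policy,
add_task, list_tasks) are hypothetical, and the file formats (one property
name in "properties", one pid per line in "tasks") are guesses rather than
part of the proposal. In a real pseudo filesystem the kernel would create
these files on mkdir and would reject nested groups itself.

```python
import os
import tempfile

# Property names come from the proposal; storing exactly one of them in the
# "properties" file is an assumption made for this sketch.
FORK_POLICIES = (
    "NEW_COOKIE_FOR_CHILD",
    "SAME_COOKIE_FOR_CHILD",
    "ROOT_COOKIE_FOR_CHILD",
)


def create_group(root, name):
    """Create a trust group as a direct sub-directory of the root group."""
    # Mimic the flat layout (nesting level of 1): the group name must be a
    # single path component. The real filesystem would enforce this itself.
    if name in ("", ".", "..") or os.path.basename(name) != name:
        raise ValueError("trust groups cannot be nested: %r" % name)
    path = os.path.join(root, name)
    os.mkdir(path)
    return path


def set_fork_policy(group, policy):
    """Write the fork() behaviour for child tasks of this group."""
    if policy not in FORK_POLICIES:
        raise ValueError("unknown property: %r" % policy)
    with open(os.path.join(group, "properties"), "w") as f:
        f.write(policy + "\n")


def add_task(group, pid):
    """Add a task to the group by appending its pid to the tasks file."""
    with open(os.path.join(group, "tasks"), "a") as f:
        f.write("%d\n" % pid)


def list_tasks(group):
    """Return the pids currently listed in the group's tasks file."""
    try:
        with open(os.path.join(group, "tasks")) as f:
            return [int(line) for line in f if line.strip()]
    except FileNotFoundError:
        return []


if __name__ == "__main__":
    # A temporary directory stands in for the proposed mount point.
    with tempfile.TemporaryDirectory() as root:
        vm1 = create_group(root, "vm1")
        set_fork_policy(vm1, "SAME_COOKIE_FOR_CHILD")
        add_task(vm1, os.getpid())
        print(list_tasks(vm1))
```

This would map to use case 5: one create_group() call per VM, with the VM's
tasks written into that group's tasks file.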