OK, I got curious about when NIX started to happen. Basically, in 2011
or so, we had wrapped up the Blue Gene work, the last Blue Gene
systems having been shipped, and jmk and I were thinking about what to
do; there was still DOE money left. We decided to revive the k10 work
from 2005 or so. We had stopped the k10 work in 2006, when Fred
Johnson, DOE program manager of FAST-OS, asked the FAST-OS researchers
to start focusing on the upcoming petaflop HPC systems, which were not
going to be x86 clusters, and (so long ago!) were not going to run
Linux. So we went full circle: DOE funded Plan 9 on k10, then we
shifted gears in 2006 to Blue Gene (PowerPC), then in 2011, it was
back to ... K10.

I wrote a note summarizing what jmk and I came up with and sent it out
to the lsub folks and jmk on April 21. A lot of it is not what we did
:-) but the core idea, of application cores, we did do.

So, below, the note, showing the core idea in April 2011. The "in May"
reference is lsub's kind offer to fly me out and give me a place to
stay for May 2011. The result was, for me, a chance to work with and
learn from very smart researchers! The 'we' mentioned in the note is
jmk and me. Note that the idea is very unformed. By the time I got
there Nemo had figured out the core operation of switching between AC
and TC, and Charles had convinced me that, since we're running on a
shared memory machine, we might want to take advantage of that fact.

I pushed hard on having only 2M pages, which we later continued on
Harvey. A standard HPC noise benchmark (github.com/rminnich/ftq)
showed this worked very well. Nemo came up with a very nice idea: once
the break got above 1G, just use 1G pages, because only 1 or 2 programs
would need them, and we'd save lots of page table pages in doing this.
It worked well.
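
To put rough numbers on that page table saving (my back-of-the-envelope
arithmetic, not anything taken from the NIX kernel): with x86-64 style
paging, each 4 KiB table page holds 512 entries, so the leaf-level cost
of mapping a big break looks like this:

/*
 * Back-of-the-envelope page table arithmetic, assuming x86-64 style
 * paging with 512 8-byte entries per 4 KiB table page.  Only the
 * leaf level is counted; the upper levels add a handful more pages.
 */
#include <stdio.h>

int
main(void)
{
	unsigned long long region = 4ULL << 30;	/* a 4 GiB break, say */
	unsigned long long sizes[] = { 4096, 2ULL << 20, 1ULL << 30 };
	const char *names[] = { "4K", "2M", "1G" };
	int i;

	for (i = 0; i < 3; i++) {
		unsigned long long leaves = region / sizes[i];
		unsigned long long tables = (leaves + 511) / 512;
		printf("%s pages: %llu entries in ~%llu leaf table pages (%llu KiB)\n",
		    names[i], leaves, tables, tables * 4);
	}
	return 0;
}

For a 4 GiB break that works out to roughly 8 MiB of leaf tables with
4K pages, about 16 KiB with 2M pages, and a single table page with 1G
pages, which is the saving Nemo was after.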

In retrospect, it's not clear that just having 2M pages is a good
idea. 4K seems clearly too small, but 64K seems a better size all
around.

What was really incredible was just how little of the kernel we had to
change to get it to work, and just how quickly we had the basic system
(about 2 weeks). And it worked so well. We made a minor change to exec
and to rc to make it trivially easy to schedule processes on an AC.
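
I won't try to reproduce that change from memory. To give a feel for
what "trivially easy" means here, a purely hypothetical Plan 9 C
sketch, with an invented "ac" ctl message standing in for whatever the
real mechanism was:

/*
 * Hypothetical sketch only -- not the actual NIX change to exec/rc.
 * The idea: a process asks to be moved to an application core and
 * then execs its program there.  The "ac" ctl message is invented.
 */
#include <u.h>
#include <libc.h>

void
runonac(char *cmd, char **argv)
{
	char ctl[64];
	int fd;

	snprint(ctl, sizeof ctl, "/proc/%d/ctl", getpid());
	fd = open(ctl, OWRITE);
	if(fd < 0)
		sysfatal("open %s: %r", ctl);
	if(write(fd, "ac", 2) != 2)	/* invented ctl message: move me to an AC */
		sysfatal("move to AC: %r");
	close(fd);
	exec(cmd, argv);		/* never returns on success */
	sysfatal("exec %s: %r", cmd);
}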

Finally, why did something like this never go further? Because GPUs
came along a few years later, and that's where all the parallelism in
HPC is nowadays. NIX was a nice idea, but it did not survive into the
GPU era.

"
I think we came to a good conclusion about what to do in May.

The idea is to base our work on the Plan 9 k8 port for the 9k kernel.
I would like to explore the concept of application cores. An
application core is a core that only runs user mode programs. Right
now there are lots of questions about how to do this, but application
cores are seen as a next step in manycore systems. In a system of N^2
cores, vendors are telling me that something like N of them will be
able to run a kernel, and that N^2-N will not be able to. Application
cores save power, heat, money, and die space.

The idea is that to prototype we can run a full-up Plan 9 kernel on
core 0, then have a driver (/dev/apcore) with a clone file
(/dev/apcore/clone) that a process can open to gain access to a core.
The application core process can be assembled by writing to a ctl file
to do such operations as allocating and writing memory, setting
registers, etc., then launched via a write to the ctl file. The
application core process talks to the kernel via typed IPC channels
such as we have today -- it will look kind of like an ioproc but the
channels will be highly optimized like the ones in Barrelfish.

All the models we need for this mode exist in Plan 9.

Here's what's neat: we can have application cores, but we can also
have core 0 running a traditional time-shared kernel (Plan 9 in this
case). That way, if you run out of application cores, the traditional
time-shared model is there on core 0. I think this hybrid model is
going to be very powerful.

So I will be booting the 9k/k8 kernel to make sure I know how. That
way, when I get there, we can get a quick start.

thanks
"

On Thu, Dec 26, 2024 at 9:54 PM Ron Minnich <rminn...@p9f.org> wrote:
>
> Hello, I more or less started that project with a white paper early in
> 2011 so may be able to help. NIX was inspired by what we learned from
> the Blue Gene work and other Plan 9 work sponsored by DOE FAST-OS,
> which ran from 2005-2011. During those years, DOE FAST-OS sponsored
> the amd64 compiler, k10 kernel, blue gene port, and NIX, to name a few
> things. I was at both LANL and SNL over that time period.
>
> A group of us spent May of 2011 at lsub getting the initial NIX system
> to work. It was a very productive month :-) The group at lsub were as
> good as it gets, and then we had jmk and Charles there too. Quite the
> Dream Team.
>
> What would you like to know? I also have an initial broken port to
> 9front if you'd like to try to bring it to life.
>
> ron
>
> On Thu, Dec 26, 2024 at 9:13 PM Andreas.Elding via 9fans
> <9fans@9fans.net> wrote:
> > 
> > Hello,
> > 
> > I was wondering if anyone has any experience using the NIX HPC environment? 
> > Traditionally, there's a scheduler that keeps track of the resources in the 
> > system, what nodes are busy and with which jobs, how much ram is in use and 
> > such.
> > 
> > I'm finding very sparse information on the NIX project, so I turn here to 
> > ask if anyone has actually used it and can share some details?
> > 
> > The site with the most information on it seems to be https://lsub.org/nix/  
> > but the research papers that I have found there are not too detailed 
> > (perhaps I've only found previews?).
> > 
> > Any extra information would be appreciated.
> > 

------------------------------------------
9fans: 9fans
Permalink: 
https://9fans.topicbox.com/groups/9fans/T7692a612f26c8ec5-M00c88605e06db8f37099b970