I would be dismayed if autothreading used threads to accomplish its goals. Simple iteration in a single interpreter should be more than sufficient.
For sure. No point in doing 10_000 cycles to set up a scratch area for a single boolean test that might take 10 cycles.
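To make the "simple iteration" idea concrete, here is a rough sketch in Python rather than Perl 6. AnyJunction and its autothread method are made-up names for illustration, not any real API, and only any()-style semantics are assumed:

```python
class AnyJunction:
    """Hypothetical stand-in for an any() junction; not a real Perl 6 API."""

    def __init__(self, *values):
        self.values = values

    def autothread(self, test):
        # Apply the test to each value in turn, in the calling thread.
        # No threads, no scratch area: any() stops at the first true result.
        for v in self.values:
            if test(v):
                return True
        return False


calls = []  # records which values the test was actually run on


def is_three(x):
    calls.append(x)
    return x == 3


j = AnyJunction(1, 2, 3, 4)
result = j.autothread(is_three)
```

Here the test runs on 1, 2 and 3 and never touches 4, so the cost is a handful of cheap calls rather than any thread setup.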
A software transaction (an atomic { } block) behaves in many ways like setting up a new interpreter thread and exiting it at the end. I expect these to be more lightweight than real threads, and in some cases reducible to no-ops.
So, implementing autothreading using STM as the 'localizing' engine is another possibility within a single process/thread.
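A very rough sketch of what I mean by 'localizing', again in Python. This is a toy write log, not real STM: there is no conflict detection or retry, and Transaction and autothread_localized are hypothetical names. The assumption is only that each branch gets a private view of shared state that can be committed or thrown away:

```python
class Transaction:
    """Toy write-buffering transaction over a shared dict (hypothetical)."""

    def __init__(self, store):
        self.store = store
        self.writes = {}  # pending writes, private to this transaction

    def read(self, key):
        # Reads see this transaction's own pending writes first.
        if key in self.writes:
            return self.writes[key]
        return self.store[key]

    def write(self, key, value):
        # Writes stay local until commit() is called.
        self.writes[key] = value

    def commit(self):
        # Publish the pending writes to the shared store.
        self.store.update(self.writes)


def autothread_localized(store, values, body):
    """Run body once per value, each in its own uncommitted transaction."""
    results = []
    for v in values:
        txn = Transaction(store)
        results.append(body(txn, v))
        # txn is dropped without commit, so the store is left untouched.
    return results


store = {"counter": 0}


def bump_and_test(txn, v):
    txn.write("counter", txn.read("counter") + v)
    return txn.read("counter") > 2


results = autothread_localized(store, [1, 2, 3], bump_and_test)
```

Each branch starts from counter == 0 and sees only its own increment, and the shared store is unchanged afterwards. A branch that never writes pays for nothing beyond an empty dict, which is the sense in which the transaction could shrink towards a no-op.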
Sam.