Jerome pointed me to an accounting error in the DMA API debugging code, and while I can't figure it out yet, I did notice some extreme slowness. It is due to the nouveau driver calling unpopulate quite a lot (now that unbind and unpopulate are squashed). This is with GNOME Shell; I think GNOME 2 did not have this issue, but I can't recall.
Anyhow, these patches fix the 50% perf regression I saw, along with some minor bugs I noticed.