are looking for. I recognize that it disrupts your current views/plans on how this should be done, but I do want to work with you to find a suitable middle ground that covers most of the possibilities.
In case you are looking at my code to follow the above-described scenarios, please make sure you pull the latest stuff from my github repository. I have been pushing new stuff since my original announcement.

> I still foresee problems with tiling, we generally don't encourage
> accel code to live in the kernel, and you'll really want a
> tiled->untiled blit for this thing,

Accel code should not go into the kernel (I fully agree with that) and there is nothing here that would require us to do so. Restricting my comments to Radeon GPUs (which are the only ones I know well enough), the shaders for blit copy already live in the kernel, irrespective of the VCRTCM work. I rely on them to move the frame buffer out of VRAM to the CTD device, but I don't add any additional features.

As for detiling, I think it should be the responsibility of the receiving CTD device, not the GPU pushing the data (Alan mentioned that during the initial set of comments, and although I didn't say anything to it at the time, that has been my view as well). Even if you wanted to use the GPU for detiling (I'll explain shortly why you should not), it would not require any new accel code in the kernel; it would merely require one bit flip in the setup of the blit copy that already lives in the kernel.

However, detiling in the GPU is a bad idea for two reasons. I tried it as an experiment on Radeon GPUs and watched with a PCI Express analyzer what happens on the bus (yeah, I have some "heavy weapons" in my lab). Normally a tile is a contiguous array of memory locations in VRAM. If the blit-copy function is told to assume a tiled source and a linear destination (detiling), it will read a contiguous set of addresses in VRAM, but then scatter 8 rows of 8 pixels each onto a non-contiguous set of destination addresses. If the destination is the PCI Express bus, that results in 8 32-byte write transactions instead of 2 128-byte transactions per tile (see the sketch at the end of this mail), which will choke the throughput of the bus right there.

BTW, this is the crux of the blit-copy performance improvement that you got from me back in October. Since blit copy deals with copying a linear array, playing with the tiled/non-tiled bits only affects the order in which addresses are accessed, so the trick was to get rid of short PCIe transactions and also to shape the linear-to-rectangle mapping so that the address pattern is more friendly to the host.

> also for Intel GPUs where you have
> UMA, would you read from the UMA.

Yes, the read would be from UMA. I have not yet looked at Intel GPUs in detail, so I don't have an answer for you on what problems would pop up and how to solve them, but I'll be glad to revisit the Intel discussion once I do some homework. My initial thought is that frame buffers on Intel are, at the end of the day, pages in system memory, so anyone/anything can get to them if they are correctly mapped.

> It also doesn't solve the optimus GPU problem in any useful fashion,
> since it can't deal with all the use cases, so we still have to write
> an alternate solution that can deal with them, so we just end up with
> two answers.

Can you elaborate on the specific use cases that concern you? I have had this case in mind and I think I can make it work. First I would have to add CTD functionality to the Intel driver; that should be straightforward. Once I get there, I'll be ready to experiment and we'll probably be in a better position to discuss the specifics (i.e. when we have something working to compare with what you did in the PRIME experiment), but it would be good to know your specific concerns early.
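Since the transaction counts are the whole argument above, here is a stand-alone back-of-the-envelope sketch (not driver code; the tile geometry, surface pitch, and max-payload values are assumptions picked just for illustration) that spells out the arithmetic for one 8x8 tile of 32-bit pixels:

/*
 * Back-of-the-envelope sketch (not driver code) of the PCIe write
 * pattern described above.  Assumes 8x8 tiles of 32-bit pixels
 * (256 bytes per tile), a 128-byte max write payload, and a made-up
 * 1920-pixel-wide linear destination surface.
 */
#include <stdio.h>

#define TILE_W      8                /* pixels per tile row */
#define TILE_H      8                /* tile rows */
#define BPP         4                /* bytes per 32-bit pixel */
#define PITCH       (1920 * BPP)     /* hypothetical linear surface pitch */
#define MAX_PAYLOAD 128              /* max bytes per PCIe write burst */

int main(void)
{
	int tile_bytes = TILE_W * TILE_H * BPP;   /* 256 bytes */
	int row_bytes  = TILE_W * BPP;            /* 32 bytes  */

	/* Tiled destination: the whole tile is one contiguous run of
	 * addresses, so it splits into 256/128 = 2 write transactions. */
	int tiled_writes = (tile_bytes + MAX_PAYLOAD - 1) / MAX_PAYLOAD;

	/* Linear (de-tiled) destination: consecutive tile rows land
	 * PITCH bytes apart, so each 32-byte row becomes its own write
	 * transaction: 8 per tile. */
	int detiled_writes = TILE_H;

	printf("tiled dest : %d writes of up to %d bytes per %d-byte tile\n",
	       tiled_writes, MAX_PAYLOAD, tile_bytes);
	printf("linear dest: %d writes of %d bytes (rows %d bytes apart)\n",
	       detiled_writes, row_bytes, PITCH);
	return 0;
}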
thanks,

Ilija