To quote one of your own sources right away:
Right now, the best available evidence suggests that when AMD and Nvidia talk about asynchronous compute, they are talking about two very different capabilities. “Asynchronous compute,” in fact, isn’t necessarily the best name for what’s happening here. The question is whether or not Nvidia GPUs can run graphics and compute workloads concurrently. AMD can, courtesy of its ACE units.
It’s been suggested that AMD’s approach is more like Hyper-Threading, which allows the GPU to work on disparate compute and graphics workloads simultaneously without a loss of performance, whereas Nvidia may be leaning on the CPU for some of its initial setup steps and attempting to schedule simultaneous compute + graphics workload for ideal execution. Obviously that process isn’t working well yet.
...
“We actually just chatted with Nvidia about Async Compute, indeed the driver hasn’t fully implemented it yet, but it appeared like it was. We are working closely with them as they fully implement Async Compute.”
Here’s what that likely means, given Nvidia’s own presentations at GDC and the various test benchmarks that have been assembled over the past week. Maxwell does not have a GCN-style configuration of asynchronous compute engines and it cannot switch between graphics and compute workloads as quickly as GCN
...
Ext3h goes on to say that preemption in Nvidia’s case is only used when switching between graphics contexts (1x graphics + 31 compute mode) and “pure compute context,” but claims that this functionality is “utterly broken” on Nvidia cards at present. He also states that while Maxwell 2 (GTX 900 family) is capable of parallel execution, “The hardware doesn’t profit from it much though, since it has only little ‘gaps’ in the shader utilization either way. So in the end, it’s still just sequential execution for most workload, even though if you did manage to stall the pipeline in some way by constructing an unfortunate workload, you could still profit from it.”
In the end, that is a fairly exact confirmation of my statement about how Async Compute is supposed to work:
Hardware in the GPU makes sure that tasks are processed the way they are with Hyper-Threading (HT).
Edit:
AnandTech agrees with me as well:
This from a technical perspective is all that you need to offer a basic level of asynchronous compute support: expose multiple queues so that asynchronous jobs can be submitted. Past that, it's up to the driver/hardware to handle the situation as it sees fit; true async execution is not guaranteed. Frustratingly then, NVIDIA never enabled true concurrency via asynchronous compute on Maxwell 2 GPUs. This despite stating that it was technically possible. For a while NVIDIA never did go into great detail as to why they were holding off, but it was always implied that this was for performance reasons, and that using async compute on Maxwell 2 would more likely than not reduce performance rather than improve it.
There’s a maxim in the consumer electronics industry that if you want to know what’s wrong with the current product, wait for the next one to be released. And in the case of the Pascal launch, this definitely ended up being true. Now that Pascal is upon us and NVIDIA has fixed that which ills Maxwell 2, we finally know why NVIDIA has held off from enabling concurrency with asynchronous compute on Maxwell 2 all this time.
The issue, as it turns out, is that while Maxwell 2 supported a sufficient number of queues, how Maxwell 2 allocated work wasn’t very friendly for async concurrency. Under Maxwell 2 and earlier architectures, GPU resource allocation had to be decided ahead of execution. Maxwell 2 could vary how the SMs were partitioned between the graphics queue and the compute queues, but it couldn’t dynamically alter them on-the-fly. As a result, it was very easy on Maxwell 2 to hurt performance by partitioning poorly, leaving SM resources idle because they couldn’t be used by the other queues.
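The partitioning problem described above can be sketched with a rough toy model. This is not taken from AnandTech or NVIDIA: the function names, the SM count, and the work amounts are made up for illustration, and the work is assumed to be perfectly divisible across SMs, which real workloads are not.

```python
# Toy model (hypothetical names and numbers): fixed SM split vs. on-the-fly rebalancing.
# "Work" is measured in SM-time units and assumed to be perfectly divisible.

def static_partition_time(total_sms, graphics_sms, graphics_work, compute_work):
    """Finish time when the SM split is decided before execution and never changes."""
    compute_sms = total_sms - graphics_sms
    if graphics_sms <= 0 or compute_sms <= 0:
        raise ValueError("each queue needs at least one SM in this toy model")
    # Each queue can only use its own SMs; the slower side sets the finish time.
    return max(graphics_work / graphics_sms, compute_work / compute_sms)


def dynamic_time(total_sms, graphics_work, compute_work):
    """Idealized finish time when idle SMs can be handed to the other queue at any point."""
    return (graphics_work + compute_work) / total_sms


def idle_sm_time(total_sms, finish_time, graphics_work, compute_work):
    """SM-time units that sat unused until everything was finished."""
    return total_sms * finish_time - (graphics_work + compute_work)


if __name__ == "__main__":
    sms, gfx, cmp_ = 16, 120, 40
    good = static_partition_time(sms, 12, gfx, cmp_)   # split happens to match the workload
    bad = static_partition_time(sms, 8, gfx, cmp_)     # even split, compute side finishes early
    dyn = dynamic_time(sms, gfx, cmp_)
    print("good static split:", good)
    print("bad static split: ", bad, "idle SM-time:", idle_sm_time(sms, bad, gfx, cmp_))
    print("dynamic balancing:", dyn)
```

In this model a well-chosen static split is as fast as dynamic balancing, but a poorly chosen one leaves half the SMs idle once the compute queue runs dry, which is the failure mode the quote attributes to Maxwell 2.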
Edit2:
Here is the summary from WCCFtech:
What we do find, however, is that the Titan X is likely allowing the benchmarks request for async compute to go through, but instead those workloads are placed directly into the 3D render queue. So Async is still on, and NVIDIA’s driver is aware if it, it’s just not scheduling it as would be proper. What might be happening is that some kind of other, still efficient method of dealing with those specific types of requests is being used instead.
In other words, the Titan X simply flattens async compute and puts everything into the 3D render queue.
Edit3:
Since Pascal and Maxwell are nearly identical, Pascal cannot do real AC either.
Pascal is just considerably better at interrupting one task and picking up another one.
That makes it look as if Pascal does real AC, but that is not the case.
For real AC in the HT sense (as it was originally intended, and as it has been explained at various events), both tasks have to keep running in parallel.
Here is another link to AMD's slides, where you can also see examples of pre-emption and async compute:
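The difference between switching tasks and running them in parallel can be sketched with a small toy model. Everything in it is an assumption for illustration only: the function names, the millisecond figures, the switch cost, and the idea that the graphics task leaves a fixed fraction of the shaders unused that compute work can fill perfectly.

```python
# Toy model (hypothetical names and numbers): pre-emption vs. truly concurrent execution.

def preemptive_time(graphics_time, compute_time, switches=2, switch_cost=0.0):
    """Tasks take turns on the GPU; every switch between them costs extra time."""
    return graphics_time + compute_time + switches * switch_cost


def concurrent_time(graphics_time, compute_time, graphics_utilization):
    """Compute work fills the shader capacity that the graphics task leaves idle."""
    if not 0.0 < graphics_utilization <= 1.0:
        raise ValueError("graphics_utilization must be in (0, 1]")
    # Shader time left unused while the graphics task is running.
    idle_capacity = (1.0 - graphics_utilization) * graphics_time
    # Whatever compute work does not fit into the gaps runs after graphics is done.
    leftover = max(0.0, compute_time - idle_capacity)
    return graphics_time + leftover


if __name__ == "__main__":
    print("pre-emption:          ", preemptive_time(10.0, 2.0, switches=2, switch_cost=0.25))
    print("concurrent, 75% busy: ", concurrent_time(10.0, 2.0, 0.75))
    print("concurrent, 100% busy:", concurrent_time(10.0, 2.0, 1.0))
```

With gaps in the shader utilization, the concurrent case hides the compute work entirely. With no gaps, it only saves the switch cost, which matches the remark quoted earlier that hardware with few gaps gains little from parallel execution.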
AMD Improves DirectX 12 Performance By Up To 46% With Asynchronous Compute Engines