Radeon RX 480 (8 GiByte) tested: price-breaker with 14-nm technology

Khabarak · 20 October 2016

Grestorn wrote:
Right, I'M the one who didn't get it....

You're really adorable, you know that?

Ahem... proper AC support needs matching hardware on the GPU, just like HT in CPUs.
Otherwise it doesn't work.
Also:
Parallelizing is not the same thing as async compute...
DX has been parallelizing for a very long time, just synchronously, so that a single task runs across more cores.
Without that, having more than one core per GPU wouldn't make any sense.

seahawk · 20 October 2016

Even Big Kepler can do AC, but it can only process 32 compute queues at once. Maxwell 2 can also serve 32 queues, and these may be 1 graphics plus 31 compute. The only thing that matters for performance is that it fixes how the SMs are distributed across the tasks before the tasks run, and can only change that distribution once a task has finished. That makes load balancing very difficult.

Pascal can also do 1+31, but it can additionally adjust the SM distribution dynamically. Technically, that no longer sets it apart from AMD's GCN solution, even though GCN can process 1+64 queues, which hardly matters in practice. If, like NV, you can keep your SMs largely busy with the graphics queue alone, it makes little difference to performance whether you run compute in parallel or slot it in between. AMD, however, has trouble keeping its many CUs busy, and that is where AC improves utilization and therefore performance.
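The difference between a fixed SM split (Maxwell 2 as described above) and a dynamically adjustable one (Pascal) can be illustrated with a toy model. This is only a sketch under loudly stated assumptions: work is measured in hypothetical "SM-milliseconds" and divides perfectly across SMs, there is no preemption or switching overhead, and the 20-SM GPU and the workload sizes are invented numbers, not measurements of any real card or of NVIDIA's actual scheduler.

```python
# Toy model only: hypothetical units and numbers, no real scheduler behavior.

def static_split_time(gfx_work: float, compute_work: float,
                      gfx_sms: int, total_sms: int) -> float:
    """Makespan when the SM split between the graphics queue and the compute
    queue is fixed before execution and cannot change until the work is done."""
    compute_sms = total_sms - gfx_sms
    if gfx_sms <= 0 or compute_sms <= 0:
        raise ValueError("each queue needs at least one SM in this model")
    # Each queue runs only on its own SMs; the GPU is done when the slower one is.
    return max(gfx_work / gfx_sms, compute_work / compute_sms)


def dynamic_split_time(gfx_work: float, compute_work: float, total_sms: int) -> float:
    """Makespan when idle SMs can be handed to the other queue on the fly,
    so (ideally) no SM sits idle until all work is finished."""
    return (gfx_work + compute_work) / total_sms


def idle_sm_time(makespan: float, gfx_work: float, compute_work: float,
                 total_sms: int) -> float:
    """SM-milliseconds during which some SM had nothing to do."""
    return makespan * total_sms - (gfx_work + compute_work)


if __name__ == "__main__":
    # Hypothetical workload: 300 SM-ms of graphics, 100 SM-ms of compute, 20 SMs.
    gfx, comp, sms = 300.0, 100.0, 20
    for gfx_sms in (10, 15):
        t = static_split_time(gfx, comp, gfx_sms, sms)
        print(f"static {gfx_sms}/{sms - gfx_sms}: {t} ms, "
              f"idle {idle_sm_time(t, gfx, comp, sms)} SM-ms")
    t = dynamic_split_time(gfx, comp, sms)
    print(f"dynamic: {t} ms, idle {idle_sm_time(t, gfx, comp, sms)} SM-ms")
```

With a 10/10 split, the compute SMs finish after 10 ms and then sit idle for 20 ms, so the frame takes 30 ms and 200 SM-ms are wasted. A well-guessed 15/5 split matches the dynamic case at 20 ms. That is the load-balancing problem in a nutshell: with a static split the ratio has to be guessed in advance, while dynamic reallocation does not depend on that guess.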

Khabarak · 20 October 2016

seahawk wrote:
Even Big Kepler can do AC, but it can only process 32 compute queues at once. Maxwell 2 can also serve 32 queues, and these may be 1 graphics plus 31 compute. The only thing that matters for performance is that it fixes how the SMs are distributed across the tasks before the tasks run, and can only change that distribution once a task has finished. That makes load balancing very difficult.

Pascal can also do 1+31, but it can additionally adjust the SM distribution dynamically. Technically, that no longer sets it apart from AMD's GCN solution, even though GCN can process 1+64 queues, which hardly matters in practice. If, like NV, you can keep your SMs largely busy with the graphics queue alone, it makes little difference to performance whether you run compute in parallel or slot it in between. AMD, however, has trouble keeping its many CUs busy, and that is where AC improves utilization and therefore performance.

Once again... async compute is like HT on CPUs: a high-priority and a low-priority task run in parallel without getting in each other's way.

Kepler, Maxwell, and every Pascal chip can only interrupt one task to work on the second one and then switch back.
That is not AC...

seahawk · 20 October 2016

Khabarak wrote:
Once again... async compute is like HT on CPUs: a high-priority and a low-priority task run in parallel without getting in each other's way.

Kepler, Maxwell, and every Pascal chip can only interrupt one task to work on the second one and then switch back.
That is not AC...

And that is simply wrong.

Khabarak · 20 October 2016

seahawk wrote:
And that is simply wrong.

Do you have anything more than a bare assertion?

seahawk · 20 October 2016

Asynchronous Concurrent Compute: Pascal Gets More Flexible - The NVIDIA GeForce GTX 1080 & GTX 1070 Founders Editions Review: Kicking Off the FinFET Generation

3DMark Time Spy: Looking at DX12 Asynchronous Compute Performance | PC Perspective

[Exclusive] Asynchronous Compute Investigated On Nvidia And AMD in Fable Legends DX12 Benchmark, Not Working on Maxwell

Asynchronous compute, AMD, Nvidia, and DX12: What we know so far | ExtremeTech

Khabarak · 20 October 2016

seahawk wrote:
Asynchronous Concurrent Compute: Pascal Gets More Flexible - The NVIDIA GeForce GTX 1080 & GTX 1070 Founders Editions Review: Kicking Off the FinFET Generation

3DMark Time Spy: Looking at DX12 Asynchronous Compute Performance | PC Perspective

[Exclusive] Asynchronous Compute Investigated On Nvidia And AMD in Fable Legends DX12 Benchmark, Not Working on Maxwell

Asynchronous compute, AMD, Nvidia, and DX12: What we know so far | ExtremeTech

To quote one of your own sources right away:

Right now, the best available evidence suggests that when AMD and Nvidia talk about asynchronous compute, they are talking about two very different capabilities. “Asynchronous compute,” in fact, isn’t necessarily the best name for what’s happening here. The question is whether or not Nvidia GPUs can run graphics and compute workloads concurrently. AMD can, courtesy of its ACE units.

It’s been suggested that AMD’s approach is more like Hyper-Threading, which allows the GPU to work on disparate compute and graphics workloads simultaneously without a loss of performance, whereas Nvidia may be leaning on the CPU for some of its initial setup steps and attempting to schedule simultaneous compute + graphics workload for ideal execution. Obviously that process isn’t working well yet.
...

“We actually just chatted with Nvidia about Async Compute, indeed the driver hasn’t fully implemented it yet, but it appeared like it was. We are working closely with them as they fully implement Async Compute.”

Here’s what that likely means, given Nvidia’s own presentations at GDC and the various test benchmarks that have been assembled over the past week. Maxwell does not have a GCN-style configuration of asynchronous compute engines and it cannot switch between graphics and compute workloads as quickly as GCN

...

Ext3h goes on to say that preemption in Nvidia’s case is only used when switching between graphics contexts (1x graphics + 31 compute mode) and “pure compute context,” but claims that this functionality is “utterly broken” on Nvidia cards at present. He also states that while Maxwell 2 (GTX 900 family) is capable of parallel execution, “The hardware doesn’t profit from it much though, since it has only little ‘gaps’ in the shader utilization either way. So in the end, it’s still just sequential execution for most workload, even though if you did manage to stall the pipeline in some way by constructing an unfortunate workload, you could still profit from it.”

In the end, a pretty precise confirmation of my statement about how async compute is supposed to work:

Hardware in the GPU ensures that tasks are processed the way HT processes them.

Edit:

AnandTech agrees with me as well:
This from a technical perspective is all that you need to offer a basic level of asynchronous compute support: expose multiple queues so that asynchronous jobs can be submitted. Past that, it's up to the driver/hardware to handle the situation as it sees fit; true async execution is not guaranteed. Frustratingly then, NVIDIA never enabled true concurrency via asynchronous compute on Maxwell 2 GPUs. This despite stating that it was technically possible. For a while NVIDIA never did go into great detail as to why they were holding off, but it was always implied that this was for performance reasons, and that using async compute on Maxwell 2 would more likely than not reduce performance rather than improve it.

There’s a maxim in the consumer electronics industry that if you want to know what’s wrong with the current product, wait for the next one to be released. And in the case of the Pascal launch, this definitely ended up being true. Now that Pascal is upon us and NVIDIA has fixed that which ills Maxwell 2, we finally know why NVIDIA has held off from enabling concurrency with asynchronous compute on Maxwell 2 all this time.

The issue, as it turns out, is that while Maxwell 2 supported a sufficient number of queues, how Maxwell 2 allocated work wasn’t very friendly for async concurrency. Under Maxwell 2 and earlier architectures, GPU resource allocation had to be decided ahead of execution. Maxwell 2 could vary how the SMs were partitioned between the graphics queue and the compute queues, but it couldn’t dynamically alter them on-the-fly. As a result, it was very easy on Maxwell 2 to hurt performance by partitioning poorly, leaving SM resources idle because they couldn’t be used by the other queues.

Edit2:
Here is WCCFtech's summary:
What we do find, however, is that the Titan X is likely allowing the benchmarks request for async compute to go through, but instead those workloads are placed directly into the 3D render queue. So Async is still on, and NVIDIA’s driver is aware of it, it’s just not scheduling it as would be proper. What might be happening is that some kind of other, still efficient method of dealing with those specific types of requests is being used instead.

So the Titan X simply steamrolls async compute and puts everything into the 3D render queue.

Edit3:
Since Pascal and Maxwell are nearly identical, Pascal can't do real AC either.
Pascal is just considerably better at interrupting one task and picking up another.
That makes it look as if Pascal does real AC, but it doesn't.
For real AC in the HT sense (as it is actually meant, and as it has been explained at various events), both tasks have to keep running in parallel.

Here is another link to AMD slides, where you can also see examples of preemption and async compute:

AMD Improves DirectX 12 Performance By Up To 46% With Asynchronous Compute Engines

seahawk · 20 October 2016

It's just silly to quote the Anand article only up to the point where they get to Pascal...

Khabarak · 20 October 2016

seahawk wrote:
It's just silly to quote the Anand article only up to the point where they get to Pascal...

OK, here is the final verdict on Pascal:

At the same time however I feel it’s important to note that the scheduling change alone won’t (and can’t) guarantee that Pascal will see significant gains from async compute across the board. Async compute itself is a catch-all term – there are lots of things you can do with asynchronous work submission/execution – so async doesn’t mean that a game is making significant use of concurrency. Furthermore the concurrency is still based on filling execution bubbles, and that means that there needs to be bubbles to fill in the first place. In other words, the greatest gains from async will come from scenarios where for whatever reason, the graphics queue and its synchronous shaders can’t completely saturate the GPU on its own.

Right now I think it’s going to prove significant that while NVIDIA introduced dynamic scheduling in Pascal, they also didn’t make the architecture significantly wider than Maxwell 2. As we discussed earlier in how Pascal has been optimized, it’s a slightly wider but mostly higher clocked successor to Maxwell 2. As a result there’s not too much additional parallelism needed to fill out GP104; relative to GM204, you only need 25% more threads, a relatively small jump for a generation. This means that while NVIDIA has made Pascal far more accommodating to asynchronous concurrent execution, there’s still no guarantee that any specific game will find bubbles to fill. Thus far there’s little evidence to indicate that NVIDIA’s been struggling to fill out their GPUs with Maxwell 2, and with Pascal only being a bit wider, it may not behave much differently in that regard.

Here AnandTech states quite clearly once again that async compute is essentially nothing other than HT.
And that HT doesn't work that way on Pascal, because there are no free resources for it anyway.

So what was that about your claim that async compute isn't like HT?
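The "filling execution bubbles" argument in the AnandTech quote can be made concrete with another toy model. All of the following are hypothetical assumptions: the graphics queue's GPU utilization per time slice is given as a fraction (the numbers are invented), compute work is measured in full-GPU time slices, concurrent compute can use exactly the idle fraction of each slice with no contention or switching cost, and any leftover compute runs afterwards on the whole GPU. The model does not say whether the bubbles are filled by dedicated hardware or by fine-grained scheduling; it only shows why the gain depends on how many bubbles there are.

```python
# Toy model only: invented utilization numbers, no contention or switch overhead.

def serial_time(gfx_busy: list[float], compute_work: float) -> float:
    """Graphics runs alone for its slices, then compute runs on the whole GPU."""
    return len(gfx_busy) + compute_work


def concurrent_time(gfx_busy: list[float], compute_work: float) -> float:
    """Compute fills the idle capacity of each graphics slice; leftovers run afterwards."""
    remaining = compute_work
    for busy in gfx_busy:
        # Idle capacity ("bubble") in this slice; 0.0 if graphics saturates the GPU.
        bubble = max(0.0, 1.0 - busy)
        remaining -= min(remaining, bubble)
    return len(gfx_busy) + remaining


if __name__ == "__main__":
    poorly_filled = [0.7] * 10   # hypothetical: graphics leaves 30 % of the GPU idle
    well_filled = [0.95] * 10    # hypothetical: graphics leaves only 5 % idle
    for name, busy in (("poorly filled", poorly_filled), ("well filled", well_filled)):
        s, c = serial_time(busy, 2.0), concurrent_time(busy, 2.0)
        print(f"{name}: serial {s:.2f}, concurrent {c:.2f}, "
              f"saved {100 * (s - c) / s:.1f} %")
```

In this sketch, concurrency saves about 17 % when graphics leaves large bubbles, but only about 4 % when graphics already keeps the GPU 95 % busy, which is the scenario AnandTech describes for a narrow, well-utilized architecture.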

JoM79 · 20 October 2016

Are you two done with your off-topic now?

Radeon RX 480 (8 GiByte) tested: price-breaker with 14-nm technology

That is the topic of this thread; you can settle the rest via PM. :nene:

DerLachs · 20 October 2016

Geizhals now lists the PowerColor Radeon RX 480 Red Dragon, 8GB GDDR5 Preisvergleich | Geizhals Deutschland and the Sapphire Nitro+ Radeon RX 480 8G D5 (1276MHz) Preisvergleich | Geizhals Deutschland.

seahawk · 21 October 2016

JoM79 wrote:
Are you two done with your off-topic now?

That is the topic of this thread; you can settle the rest via PM.

Although it isn't really off-topic when you point out AMD's superiority under the new APIs. And I think Khabarak did that very well and showed why you shouldn't buy NV cards. :daumen:

Cleriker · 21 October 2016

As you'll notice, I quickly gave both of you a thumbs up. Simply because you really could have done this via PM or in another thread and then posted only the results here. Be that as it may, it's a question of definition. Not just of AC, but of Nvidia's statement. What does compatibility mean? Not necessarily that it has to run the same way, only that the hardware can cope with the resulting task. That already works, you have to give them that. But they don't master the technique itself. They have just found a way to run "compatibly" with applications that use this technique.

seahawk · 21 October 2016

AnandTech is quite clear on this: Pascal can do AC to a comparable extent as AMD from GCN 3.0 onward. PCPer's Time Spy tests also show performance gains for NV when AC is used.

You also have to look at it from another angle: if you look at FPS per FLOP, AMD is simply bad under DX11. DX12 helps them finally turn their superior compute power into FPS.

NV's weakness under DX12 comes from their architecture already being very well utilized under DX11, and above all from their considerably lower compute power compared to AMD. And since AMD chips always offer more compute power than the NV competition, they are also the better buy for DX12, and therefore generally the better buy for everything except retro PCs.

Cleriker · 21 October 2016

Yeah, you could put it that way. Some people won't like reading it, though, regardless of the fact that it is also a compliment to Green under DX11.

DARPA · 21 October 2016

I don't find it off-topic; it's about a feature of the RX 480 (compared to its competition).

Isn't the advantage of GCN that the individual compute units can process graphics and compute tasks in parallel, while Pascal can only do one or the other (though it switches faster than Maxwell)?

Kaaruzo · 21 October 2016

seahawk wrote:
Although it isn't really off-topic when you point out AMD's superiority under the new APIs. And I think Khabarak did that very well and showed why you shouldn't buy NV cards.

You can see that "superiority" nicely in the BF1 benchmarks.

And because you shouldn't buy NV cards, NV has a market position AMD can only dream of. But at least your post amused me. So it had something good about it after all :lol:

Khabarak · 21 October 2016

Kaaruzo wrote:
You can see that "superiority" nicely in the BF1 benchmarks.

And because you shouldn't buy NV cards, NV has a market position AMD can only dream of. But at least your post amused me. So it had something good about it after all

The nice thing about Nvidia is that everyone looks at performance and nobody spends a second talking about all the unkept promises regarding advertised features.
If the performance is right, it apparently doesn't matter that the product doesn't even meet its own specifications...

Kaaruzo · 21 October 2016

Even if that were the case (which it isn't).

NV is superior to AMD even without meeting the specifications. That doesn't exactly speak for AMD :lol:

And the latest news makes that pretty clear.

Khabarak · 21 October 2016

Kaaruzo wrote:
Even if that were the case (which it isn't).

NV is superior to AMD even without meeting the specifications. That doesn't exactly speak for AMD

And the latest news makes that pretty clear.

To each their own.
I just don't like being lied to by the companies whose products I buy.

But hey, apparently it works with enough people.

You mean the news that AMD has now paid down roughly 606 million of its debt?
Well... things are slowly looking up.

And I'd like to see Nvidia if they ever manage to get an x86 license.
That'll be interesting.
