Democoder at B3D wrote:
These are the cases I typically notice in poorly optimized DX9 PS2.0 code:
1) using more registers than needed
a) not detecting when a register goes "dead" (not used by any instructions following it) and hence can be reused
b) not detecting cases where multiple scalars can be "packed" into a single register, or 2 "2-d" vectors can be packed. e.g. dot products produce scalars, but I typically see entire registers being wasted to store the result
2) inadequate copy propagation. This needlessly uses up an instruction slot and a register as well (e.g. mov r1, r0).
3) missing peephole optimizations: CMP into LRP, MIN/MAX, etc. The compiler doesn't always generate a fused multiply-add where possible.
4) missing algebraic simplification and strength reduction, i.e. a lack of algebraic reasoning/substitution. e.g. if someone wrote a "stupid cross product", would the compiler recognize it and substitute the two-instruction-swizzle-trick?
5) on Nvidia's architecture, common subexpression elimination (CSE) has to be balanced against register usage. CSE avoids recalculating a subexpression by caching its result in an extra register and reusing it later. However, on NV3x, calculating x = y+z+2 and w = y+z+5 might be faster if you do not spend an extra register to hold y+z.
Most compilers treat CSE as a perfectly reasonable optimization, and on most architectures it is, but Nvidia takes a severe performance hit on register usage. It is probably far more costly to use an extra 2 registers than to just execute an extra 2 instructions.
Unfortunately, in most compilers, CSE happens way before register allocation, which means the register allocator would have to "UN-CSE" the intermediate representation.
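Case 1a above (reusing a register once it goes dead) can be sketched in a few lines. This is only an illustration in Python, not how FXC or any driver does it: the instruction format, the virtual temp names (v0..v4) and the input names (N, L, H, k, s) are all made up, each virtual temp is assumed to be written exactly once, and an instruction is assumed to be allowed to write to a register it also reads.

```python
def allocate_registers(instrs, live_out):
    """Greedy register reuse for straight-line code.

    instrs:   list of (dest, sources) pairs over hypothetical virtual temps;
              each dest is assumed to be written exactly once.
    live_out: virtual temps still needed after the last instruction.
    Returns (virtual temp -> physical register index, registers used).
    """
    # Index of the last instruction that reads each name.
    last_use = {}
    for i, (_, srcs) in enumerate(instrs):
        for s in srcs:
            last_use[s] = i
    for v in live_out:
        last_use[v] = len(instrs)  # never dies inside this block

    assignment = {}
    free = []      # physical registers whose value is dead
    next_reg = 0
    for i, (dest, srcs) in enumerate(instrs):
        # A temp read for the last time here goes dead, so its register
        # can be handed straight to this instruction's destination.
        # Names not in `assignment` are inputs and occupy no temp register.
        for s in sorted(set(srcs)):
            if s in assignment and last_use[s] == i:
                free.append(assignment[s])
        if free:
            reg = min(free)
            free.remove(reg)
        else:
            reg = next_reg
            next_reg += 1
        assignment[dest] = reg
        if dest not in last_use:
            free.append(reg)  # result is never read: dead immediately
    return assignment, next_reg


# Hypothetical fragment: two dot products, each scaled, then summed.
program = [
    ("v0", ["N", "L"]),    # v0 = N . L
    ("v1", ["v0", "k"]),   # v1 = v0 * k
    ("v2", ["N", "H"]),    # v2 = N . H
    ("v3", ["v2", "s"]),   # v3 = v2 * s
    ("v4", ["v1", "v3"]),  # v4 = v1 + v3
]
assignment, used = allocate_registers(program, live_out={"v4"})
print(used, assignment)
```

Written naively, this fragment occupies five temporaries; with dead-register reuse the five virtual temps fit in two physical registers.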
Now do you see why I think Nvidia is having problems dealing with the output from FXC or a typical developer's hand-coded PS2.0? Most developers do CSE naturally, because they think that calculating an expression twice is a waste, so they assign it to a local variable and reuse it where they need it. But Nvidia takes a 50% speed reduction if you use more than a certain number of registers; therefore the drivers would actually have to "de-optimize" some of the natural optimizations that programmers and compilers perform.
That would require picking a "live register" that has a sequence of instructions which calculate a value and cache it there, which is then used by two other register definitions later. Once it finds this register, it would have to INLINE the expressions in place of where the register was used and thus expand the size of the shader.
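A rough sketch of that "UN-CSE" step, using the x = y+z+2 / w = y+z+5 example from above. Everything here is hypothetical (the tuple format, the function names, the single "add" opcode), and it only handles the easy case: the cached temp is defined once, is only read as the first operand, and nothing overwrites the inputs of the cached expression. A real driver would have to handle far messier code.

```python
def rematerialize(instrs, temp):
    """Inline the single definition of `temp` at each of its uses.

    instrs: straight-line list of (dest, op, a, b) tuples (hypothetical format).
    The asserts reject cases this simple sketch cannot handle safely.
    """
    defs = [ins for ins in instrs if ins[0] == temp]
    assert len(defs) == 1, "temp must be defined exactly once"
    _, d_op, d_a, d_b = defs[0]
    out = []
    for dest, op, a, b in instrs:
        if dest == temp:
            continue  # drop the cached definition
        assert b != temp, "sketch only inlines first-operand uses"
        # Never clobber an input of the expression being recomputed.
        assert dest not in (d_a, d_b)
        if a == temp:
            assert dest != b, "dest would be overwritten before b is read"
            out.append((dest, d_op, d_a, d_b))  # recompute into dest
            out.append((dest, op, dest, b))     # then apply the original op
        else:
            out.append((dest, op, a, b))
    return out


def run(instrs, env):
    """Tiny interpreter for checking equivalence; names are looked up, ints are literals."""
    env = dict(env)
    for dest, op, a, b in instrs:
        assert op == "add"
        va = env[a] if isinstance(a, str) else a
        vb = env[b] if isinstance(b, str) else b
        env[dest] = va + vb
    return env


cse = [
    ("t", "add", "y", "z"),  # t = y + z   (cached subexpression)
    ("x", "add", "t", 2),    # x = t + 2
    ("w", "add", "t", 5),    # w = t + 5
]
expanded = rematerialize(cse, "t")
for ins in expanded:
    print(ins)
print(len({i[0] for i in cse}), "temps ->", len({i[0] for i in expanded}), "temps")
```

The shader grows from 3 to 4 instructions and the number of written temporaries drops from 3 to 2, which is exactly the trade the driver would be making.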
But picking the right parts of the shader to do this on is extremely tricky, and much of the information about common subexpressions has already been erased by the time the shader has gone through MS's FXC and been turned into PS2.0 code.
Consider this
Code:
x = N.L * k
y = N.L * t0
z = x+y
A compiler without algebraic reasoning would not "get" that dot product follows the distributive law: (N.L)*k + (N.L)*t0 = (N*(k+t0)).L, at least when k and t0 are scalars. This code can be rewritten as
Code:
x = k + t0
y = N * x
x = y . L
with only two temporary registers required.
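A quick numeric sanity check of that rewrite, as a sketch only. It assumes k and t0 are scalars and N and L are 3-component vectors (the original does not say), and the function names are made up. Integer inputs are used so the comparison is exact; with floats the two forms can differ by rounding.

```python
def dot(a, b):
    """3-component dot product on plain tuples."""
    return sum(x * y for x, y in zip(a, b))


def original(N, L, k, t0):
    x = dot(N, L) * k    # x = N.L * k
    y = dot(N, L) * t0   # y = N.L * t0
    return x + y         # z = x + y


def rewritten(N, L, k, t0):
    x = k + t0               # x = k + t0
    y = [n * x for n in N]   # y = N * x  (scale every component of N)
    return dot(y, L)         # x = y . L


N, L = (1, 2, 3), (4, 5, 6)
print(original(N, L, 2, 3), rewritten(N, L, 2, 3))
```

If k or t0 were vectors (say a texture colour), N.L * k would be a vector while the rewritten form collapses to a scalar, so the substitution would not be valid there.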
More likely, the compiler will recognize N.L as a common subexpression and generate
Code:
temp = N.L
x = temp * k
y = temp * t0
temp = x + y
using 3 registers and 4 slots
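The register and slot counts for the two listings can be tallied mechanically. The helper below is a hypothetical sketch that simply counts distinct destination names and lines in the pseudo-code as written; it does not model a real register allocator, which might reuse a register once its value is dead.

```python
def cost(listing):
    """Return (distinct destination names, instruction slots) for
    lines of the form 'dest = expr'."""
    lines = [ln for ln in listing.strip().splitlines() if ln.strip()]
    dests = [ln.split("=", 1)[0].strip() for ln in lines]
    return len(set(dests)), len(dests)


algebraic = """
x = k + t0
y = N * x
x = y . L
"""

cse = """
temp = N.L
x = temp * k
y = temp * t0
temp = x + y
"""

print("algebraic rewrite:", cost(algebraic))
print("CSE version:      ", cost(cse))
```

By this count the algebraic rewrite needs 2 registers and 3 slots, against 3 registers and 4 slots for the CSE version.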
I predict if Nvidia does any good optimizing at all, it will work best under OpenGL2.0 because of the way the compiler is integrated with the driver.
-=-=-=-=-=-=-=-=-=-=-=-
E.T. is alive...
MMORPG.com - Gramp Staff
Comments
------------------------------------------------------------------------------------------
"Nearly all men can stand adversity, but if you want to test a man's character give him power"
-Abraham Lincoln
Coder, Webmaster and Modeler
What will he do next?