We have an optimization for non-uniform if/else where if all channels meet the
jump condition we emit a branch to jump straight to the ELSE block. Similarly,
if at the end of the THEN block we don't have any channels that would execute
the ELSE block, we emit a branch to jump straight to the AFTER block.
This optimization has a cost though: we need to emit the condition for the
branch and a branch instruction (which also comes with a 3 delay slot), so for
very small blocks (just a couple of ALU for example) emitting the branch
instruction is typically worse. Futher, if the condition for the branch is not
met, we still pay the cost for no benefit at all.
Here is an example:
nop ; fmul.ifa rf26, 0x3e800000, rf54
xor.pushz -, rf52, 2 ; nop
bu.alla 32, r:unif (0x00000000 / 0.000000)
nop ; nop
nop ; nop
nop ; nop
xor.pushz -, rf52, 3 ; nop
nop ; mov.ifa rf52, 0
nop ; mov.pushz -, rf52
nop ; mov.ifa rf26, 0x3f800000
The bu instruction here is setup to jump over the following 4 instructions
(the last 4 instructions in there). To do this, we pay the price of the xor
to generate the condition, the bu instruction, and the 3 delay slots right
after it, so we end up paying 6 instructions to skip over 4 which we pay
always, even if the branch is not taken and we still have to execute those
4 instructions. With this change, we produce:
nop ; fmul.ifa rf56, 0x3e800000, rf28
xor.pushz -, rf9, 3 ; nop
nop ; mov.ifa rf9, 0
nop ; mov.pushz -, rf9
nop ; mov.ifa rf56, 0x3f800000
Now we don't try to skip the small block, ever. At worse, if all channels
would have met the branch condition, we only pay the cost of the 4
instructions instead of 6, at best, if any channel wouldn't take the
branch, we save ourselves 5 cycles for the branch condition, the branch
instruction and its 3 delay slots.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23161>
Since 624e799cc3 ("nir: Drop nir_ssa_def::name and nir_register::name"), SSA
defs don't have names, making the name argument unused. Drop it from the
signature and fix the call sites. This was done with the help of the following
Coccinelle semantic patch:
@@
expression A, B, C, D, E;
@@
-nir_ssa_dest_init(A, B, C, D, E);
+nir_ssa_dest_init(A, B, C, D);
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23078>
Some values like the transform feedback offset or the number of output
vertices in VS can be obtained knowing how many vertices and primitive
type are used in the drawcall.
But when the primitive restart is enabled, doing this is quite more
complex, as we should parse the vertex buffer to know where is the
restart values, and so on.
In this case, delay this computation after the drawcall is executed, by
querying the GPU to know these values.
Similarly, this delay is also applied to compute the transform feedback
buffer offsets when there is a geometry shader, as we don't know
beforehand how many vertices it is going to output.
This fixes `spec@!opengl 3.1@primitive-restart-xfb flush` and
`spec@!opengl 3.1@primitive-restart-xfb generated`.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22716>
This helps by reducing the number of branches with their corresponding
delay slots, at the expense of additional register pressure. It also helps
a lot with SFU stalls, probably because removing control-flow blocks
gives us more QPU scheduling flexibility to hide them.
Shader-db results below correspond to the "closed shaders" set, since the
full set is very dominated by the massive impact this change has on Skia's
shaders (for the better), so this is probably more representative of real
impact:
total instructions in shared programs: 11887255 -> 11854898 (-0.27%)
instructions in affected programs: 538170 -> 505813 (-6.01%)
helped: 1653
HURT: 43
Instructions are helped.
total threads in shared programs: 385924 -> 385872 (-0.01%)
threads in affected programs: 236 -> 184 (-22.03%)
helped: 22
HURT: 48
Inconclusive result (%-change mean confidence interval includes 0).
total uniforms in shared programs: 3552808 -> 3547894 (-0.14%)
uniforms in affected programs: 157486 -> 152572 (-3.12%)
helped: 1673
HURT: 35
Uniforms are helped.
total max-temps in shared programs: 2062403 -> 2064720 (0.11%)
max-temps in affected programs: 18209 -> 20526 (12.72%)
helped: 168
HURT: 369
Max-temps are HURT.
total spills in shared programs: 1937 -> 1994 (2.94%)
spills in affected programs: 79 -> 136 (72.15%)
helped: 0
HURT: 1
total fills in shared programs: 2652 -> 2717 (2.45%)
fills in affected programs: 115 -> 180 (56.52%)
helped: 0
HURT: 1
total sfu-stalls in shared programs: 19349 -> 18010 (-6.92%)
sfu-stalls in affected programs: 2321 -> 982 (-57.69%)
helped: 674
HURT: 74
Sfu-stalls are helped.
total inst-and-stalls in shared programs: 11906604 -> 11872908 (-0.28%)
inst-and-stalls in affected programs: 541339 -> 507643 (-6.22%)
helped: 1656
HURT: 43
Inst-and-stalls are helped.
total nops in shared programs: 245740 -> 238085 (-3.12%)
nops in affected programs: 19282 -> 11627 (-39.70%)
helped: 1335
HURT: 76
Nops are helped.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22922>
Seems the only thing that really needs this is fpow(0, 0), which should
return NaN, but then gets multiplied with zero. Let's fix that by doing
a bcsel instead of fmul to select the result here. While we're at it,
get rid of the fabs for stop, which isn't needed.
This fixes a piglits failure for most (if not all?) drivers that doesn't
support legacy math rules.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Acked-by: Emma Anholt <emma@anholt.net>
Reviewed-by: Qiang Yu <yuq825@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22789>
If we are trying to lower register pressure this can make a big
difference in some cases. To avoid adding even more strategies,
merge this with disabling ubo load sorting, since they are basically
trying to do the same.
total instructions in shared programs: 12848024 -> 12844510 (-0.03%)
instructions in affected programs: 236537 -> 233023 (-1.49%)
helped: 195
HURT: 87
Instructions are helped.
total uniforms in shared programs: 3815601 -> 3814932 (-0.02%)
uniforms in affected programs: 31773 -> 31104 (-2.11%)
helped: 67
HURT: 115
Inconclusive result (value mean confidence interval includes 0).
total max-temps in shared programs: 2210803 -> 2210622 (<.01%)
max-temps in affected programs: 9362 -> 9181 (-1.93%)
helped: 114
HURT: 34
Max-temps are helped.
total spills in shared programs: 2556 -> 2330 (-8.84%)
spills in affected programs: 1391 -> 1165 (-16.25%)
helped: 39
HURT: 9
total fills in shared programs: 3840 -> 3317 (-13.62%)
fills in affected programs: 2379 -> 1856 (-21.98%)
helped: 39
HURT: 23
total sfu-stalls in shared programs: 21965 -> 21978 (0.06%)
sfu-stalls in affected programs: 2618 -> 2631 (0.50%)
helped: 45
HURT: 81
Inconclusive result (value mean confidence interval includes 0).
total inst-and-stalls in shared programs: 12869989 -> 12866488 (-0.03%)
inst-and-stalls in affected programs: 238771 -> 235270 (-1.47%)
helped: 193
HURT: 87
Inst-and-stalls are helped.
total nops in shared programs: 303501 -> 303274 (-0.07%)
nops in affected programs: 4159 -> 3932 (-5.46%)
helped: 87
HURT: 105
Inconclusive result (value mean confidence interval includes 0).
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22824>
rf0 is affected by restrictions in some scenarios so we rather use
a register that does not cause conflicts for scheduling.
total instructions in shared programs: 12850958 -> 12848024 (-0.02%)
instructions in affected programs: 331974 -> 329040 (-0.88%)
helped: 2559
HURT: 201
Instructions are helped.
total max-temps in shared programs: 2210893 -> 2210803 (<.01%)
max-temps in affected programs: 1486 -> 1396 (-6.06%)
helped: 96
HURT: 7
Max-temps are helped.
total sfu-stalls in shared programs: 21975 -> 21965 (-0.05%)
sfu-stalls in affected programs: 32 -> 22 (-31.25%)
helped: 16
HURT: 6
Sfu-stalls are helped.
total inst-and-stalls in shared programs: 12872933 -> 12869989 (-0.02%)
inst-and-stalls in affected programs: 332036 -> 329092 (-0.89%)
helped: 2560
HURT: 189
Inst-and-stalls are helped.
total nops in shared programs: 305911 -> 303501 (-0.79%)
nops in affected programs: 11215 -> 8805 (-21.49%)
helped: 2131
HURT: 3
Nops are helped.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22797>
tex_format should be `enum V3DX(Texture_Data_Formats)`, but using that enum
type in the header requires including `v3dx_pack.h`, which triggers circular
include dependencies issues, so use a `uint32_t` for now.
"fix" the one place that was using the correct enum, because doing so
triggers `-Wenum-int-mismatch` in GCC 13 as the function declaration
doesn't match the function definition.
Reported-by: Michel Dänzer <mdaenzer@redhat.com>
Signed-off-by: Eric Engestrom <eric@igalia.com>
Acked-by: Juan A. Suarez <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22739>
We have been stopping as soon as we find a conflict but that doesn't
mean we can't merge it in an earlier slot, so keep going. Going by
shader-db, this sometimes allows us to merge the final thrsw a bit
earlier and avoid emitting NOP instructions at the program end to
make up for its delay slots. I have not observed cases where this
helps with regular thrsw though, but it doesn't hurt to try with
those too.
total instructions in shared programs: 11526876 -> 11526354 (<.01%)
instructions in affected programs: 10760 -> 10238 (-4.85%)
helped: 236
HURT: 0
Instructions are helped.
total max-temps in shared programs: 2231705 -> 2231677 (<.01%)
max-temps in affected programs: 276 -> 248 (-10.14%)
helped: 27
HURT: 0
Max-temps are helped.
total inst-and-stalls in shared programs: 11545177 -> 11544655 (<.01%)
inst-and-stalls in affected programs: 10777 -> 10255 (-4.84%)
helped: 236
HURT: 0
Inst-and-stalls are helped.
total nops in shared programs: 321624 -> 321152 (-0.15%)
nops in affected programs: 751 -> 279 (-62.85%)
helped: 236
HURT: 0
Nops are helped.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22679>
Before testing the waddr for SFU we should first validate this
is indeed a valid (not NOP) magic write. Use the helper we have for
this which gets this right.
total instructions in shared programs: 12898957 -> 12850958 (-0.37%)
instructions in affected programs: 4328937 -> 4280938 (-1.11%)
helped: 19974
HURT: 439
Instructions are helped.
total max-temps in shared programs: 2211503 -> 2210893 (-0.03%)
max-temps in affected programs: 12924 -> 12314 (-4.72%)
helped: 509
HURT: 20
Max-temps are helped.
total sfu-stalls in shared programs: 22233 -> 21975 (-1.16%)
sfu-stalls in affected programs: 722 -> 464 (-35.73%)
helped: 297
HURT: 54
Sfu-stalls are helped.
total inst-and-stalls in shared programs: 12921190 -> 12872933 (-0.37%)
inst-and-stalls in affected programs: 4337977 -> 4289720 (-1.11%)
helped: 20015
HURT: 404
Inst-and-stalls are helped.
total nops in shared programs: 333743 -> 305911 (-8.34%)
nops in affected programs: 86902 -> 59070 (-32.03%)
helped: 14545
HURT: 76
Nops are helped.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22593>