This is actually wrong because if radv_handle_fbfetch_output() triggers
a decompression pass and graphics shaders (ESO) are saved/restored
they won't be updated because radv_bind_graphics_shaders() was called
before.
This fixes a very recent regression that I noticed while implementing
a new extension.
This reverts commit 9b912f00c7.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37194>
It doesn't make any sense ot have TCS but not TES (or vice versa), but
coverity doesn't realize that. Add an assertion that they are both
non-null before we start reading them.
Fixes: 50fd669294 ("anv: prep work for separate tessellation shaders")
CID: 1665360
CID: 1665327
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37266>
The intel_vue_map is only partially initialized before being used. All
used fields are initialized, but in debug paths the unitialzed fields
will also be read. To fix this initialize the struct to 0. In the brw
path this struct is part of the prog_data, and is rzalloc'd.
CID: 1665308
Reviewed-by: Iván Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37261>
RADV_DYNAMIC_VERTEX_INPUT_BINDING_STRIDE doesn't emit any context
registers, so it can be excluded for the late scissor workaround to
avoid re-emitting scissors all the time it's dirty.
This fixes a performance regression noticed with Cyberpunk on Vega10,
but other games are likely affected too. The late scissor workaround is
only applied on Raven/Vega10.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13828
Fixes: d7f401c2bb ("radv: bind the vertex binding strides like a normal dynamic state")
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37252>
This will allow ubo buffers to have arrays containing millions of
elements without excessive memory use on a remap table. Before this
change using the max sized array on radeonsi would result in 1.3GB
of memory being used for a remap table in a single shader.
There is also a small functional change here, previously if the
shader used more than GL_MAX_UNIFORM_BLOCK_SIZE mesa would ignore
and allow this as the original ARB_uniform_buffer_object spec
stated:
"If the amount of storage required for a uniform block exceeds
this limit, a program may fail to link."
However in OpenGL 4.3 the text was clarified and the "may" was
removed so with this change we enforce the max limit.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/9953
Acked-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36997>
This disables for now the "optimistic" SIMD heuristic that was
implemented for xe3+ and makes it dependent on a debugging option,
instead use the static analysis-based codepath that was used in
previous generations and was extended by previous commits in this MR
to model the xe3 trade-off between register use and thread
parallelism.
The reason is that the main assumption of the optimistic SIMD
heuristic didn't hold up with reality: Real-world testing on PTL shows
that there are many cases where SIMD32 shows performance degradation
relative to SIMD16 despite the ability of xe3 hardware to scale the
GRF file of a thread on demand, unfortunately that scenario seems to
be more pervasive than hoped when the optimistic SIMD heuristic was
implemented pre-silicon.
In many cases what seems to be going on is that even when the register
file is able to scale with the increased register use of SIMD32, the
thread parallelism of the EU is scaled down by a similar factor, so at
the bottom line SIMD32 (depending on the actual ratio of register use
between both variants) may not buy us anything, and it frequently
encounters constraints (like SIMD lowering and less effective
scheduling) that lead to worse codegen than SIMD16, easily tipping the
balance in favor of SIMD16. The extension of the performance analysis
pass that was done in a previous commit allows the original SIMD32
heuristic to take into account quantitatively this effect, and that
seems pretty effective at disabling SIMD32 shaders that underperform
judging from the statistically significant improvement of most Traci
test-cases that run on my PTL system (4 iterations, 5% significance),
no statistically significant regressions were observed:
Nba2K23-trace-dx11-2160p-ultra: 10.16% ±0.34%
Superposition-trace-dx11-2160p-extreme: 4.06% ±0.50%
TotalWarWarhammer3-trace-dx11-1080p-high: 3.52% ±0.76%
Payday3-trace-dx11-1440p-ultra: 2.41% ±0.81%
MetroExodus-trace-dx11-2160p-ultra: 2.28% ±0.78%
Borderlands3-trace-dx11-2160p-ultra: 1.89% ±0.65%
MountAndBlade2-trace-dx11-1440p-veryhigh: 1.81% ±0.40%
Blackops3-trace-dx11-1080p-high: 1.66% ±0.29%
HogwartsLegacy-trace-dx12-1080p-ultra: 1.53% ±0.22%
TotalWarPharaoh-trace-dx11-1440p-ultra: 1.44% ±0.31%
Fortnite-trace-dx11-2160p-epix: 1.44% ±0.27%
Naraka-trace-dx11-1440p-highest: 1.39% ±0.27%
PubG-trace-dx11-1440p-ultra: 1.30% ±0.49%
Destiny2-trace-dx11-1440p-highest: 1.10% ±0.23%
Factorio-trace-1080p-high: 1.10% ±1.77%
TerminatorResistance-trace-dx11-2160p-ultra: 1.08% ±0.31%
Ghostrunner2-trace-dx11-1440p-ultra: 1.05% ±0.15%
ShadowTombRaider-trace-dx11-2160p-ultra: 0.98% ±0.19%
CitiesSkylines2-trace-dx11-1440p-high: 0.67% ±0.19%
Palworld-trace-dx11-1080p-med: 0.44% ±0.22%
The downside is that this will reverse the large reduction in
compile-time we gained from the optimistic SIMD heuristic -- The
run-time of both shader-db and fossil-db jump back up by nearly 20%
with this change. I'm working on a better compromise based on
run-time feedback that will hopefully allow us to preserve the
compile-time benefit of the optimistic heuristic without the reduction
in run-time performance, but in the meantime it seems like the
run-time performance gap from SIMD32 is the more urgent issue to
address since it has an impact on titles across the board. Despite
the reversal of that compile-time improvement xe3 still achieves
slightly lower compile time on the average than previous generations
as a result of VRT, so this doesn't seem terribly tragic.
v2: Add bit to brw_get_compiler_config_value() (Lionel).
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
The current register allocation loop attempts to use a sequence of
pre-RA scheduling heuristics until register allocation is successful.
The sequence of scheduling heuristics is expected to be increasingly
aggressive at reducing the register pressure of the program (at a
performance cost), so that the instruction ordering chosen gives the
lowest latency achievable with the register space available.
Unfortunately that approach doesn't consistently give the best
performance on xe3+, since on recent platforms a schedule with higher
latency may actually give better performance if its lower register
pressure allows the use of a lower number of VRT register blocks which
allows the EU to run more threads in parallel.
This means that on xe3+ the scheduling mode with highest performance
is fundamentally dependent on the specific scenario (in particular
where in the thread count-register use curve the program is at, and
how effective the scheduler heuristics are at reducing latency for
each additional block of GRFs used), so it isn't possible to construct
a fixed sequence of the existing heuristics guaranteed to be ordered
by decreasing performance. In order to find the scheduling heuristic
with better performance we have to run multiple of them prior to
register allocation and do some arithmetic to account for the effect
on parallelism of the register pressure estimated in each case, in
order to decide which heuristic will give the best performance.
This sounds costly but it is similar to the approach taken by
brw_allocate_registers() when unable to allocate without spills in
order to decide which scheduling heuristic to use in order to minimize
the number of spills. In cases where that happens on xe3+ the
scheduling runs introduced here don't add to the scheduling runs done
to find the heuristic with minimum register pressure, we attempt to
determine the heuristic with lowest pressure and best performance in
the same loop, and then use one or the other depending on whether
register allocation succeeds without spills.
Significantly improves performance on PTL of the following Traci test
cases (4 iterations, 5% significance):
Nba2K23-trace-dx11-2160p-ultra: 4.48% ±0.38%
Fortnite-trace-dx11-2160p-epix: 1.61% ±0.28%
Superposition-trace-dx11-2160p-extreme: 1.37% ±0.26%
PubG-trace-dx11-1440p-ultra: 1.15% ±0.29%
GtaV-trace-dx11-2160p-ultra: 0.80% ±0.24%
CitiesSkylines2-trace-dx11-1440p-high: 0.68% ±0.19%
SpaceEngineers-trace-dx11-2160p-high: 0.65% ±0.34%
The compile-time cost of shader-db increases significantly by 3.7%
after this commit (15 iterations, 5% significance), the compile-time
of fossil-db doesn't change significantly in my setup.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
Mainly this involves changing 'struct state' so that the dep_ready
array is allocated with a dynamic size based on the number of VGRFs of
the program instead of assuming a fixed XE3_MAX_GRF count of GRF
dependencies. VGRF register dependencies are then handled by using
one dep_ready entry per VGRF allocation instead of one per hardware
register.
The ability to use the performance analysis pass pre-regalloc will
mostly be useful on xe3+, but this also has the side effect of saving
some memory on xe2 and earlier platforms since we no longer need to
allocate XE3_MAX_GRF dep_ready entries for them.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>