Copy propagation often eliminates all uses of an instruction. If we
detect that we've done so, we can eliminate the instruction ourselves
rather than leaving it hanging until the next DCE pass.
This saves some CPU time as other passes don't see dead code.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
The new def-based pass works better in many cases, and should be less
resource intensive. However, the limited visibility of the defs-based
pass due to many values not being SSA yet makes it unable to fully
replace the old pass. Try the new one, and if it can't make progress,
then try the old one. That way, things will mostly be handled by the
new pass, but everything that was being cleaned up still will be.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
While the limited visibility due to partial SSA is a downside to the new
pass, it has a huge number of advantages that make it worth switching
over even now. It's much more efficient, can eliminate redundant memory
loads across blocks, and doesn't generate loads of unnecessary copies
that other passes have to clean up. This means we also eliminate the
infighting between the old CSE, coalescing, and copy propagation passes.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
This has a number of advantages compared to the pass I wrote years ago:
- It can easily perform either Global CSE or block-local CSE, without
needing to roll any dataflow analysis, thanks to SSA def analysis.
This global CSE is able to detect and coalesce memory loads across
blocks. Although it may increase spilling a little, the reduction
in memory loads seems to more than compensate.
- Because SSA guarantees that values are never written more than once,
the new CSE pass can directly reuse an existing value. The old pass
emitted copies at the point where it discovered a value because it
had no idea whether it'd be mutated later. This led it to generate
a ton of trash for copy propagation to clean up later, and also a
nasty fragility where CSE, register coalescing, and copy propagation
could all fight one another by generating and cleaning up copies,
leading to infinite optimization loops unless we were really careful.
Generating less trash improves our CPU efficiency.
- It uses hash tables like nir_instr_set and nir_opt_cse, instead of
linearly walking lists and comparing each element. This is much more
CPU efficient.
- It doesn't use liveness analysis, which is one of the most expensive
analysis passes that we have. Def analysis is cheaper.
In addition to CSE'ing SSA values, we continue to handle flag writes,
as this is a huge source of CSE'able values. These remain block local.
However, we can simply track the last flag write, rather than creating
entire sets of instruction entries like the old pass. Much simpler.
The only real downside to this pass is that, because the backend is
currently only partially SSA, it has limited visibility and isn't able
to see all values. However, the results appear to be good enough that
the new pass can effectively replace the old pass in almost all cases.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
Like NIR, we print SSA defs as %1, %2, and so on. The number here is
the VGRF number. VGRFs that don't correspond to a SSA def remain
printed as vgrf1, vgrf2, and so on.
This makes it much easier to see what values are SSA and which aren't.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
Even without a full use list, simply tracking the number of uses will
let us tell "this is the only use of the def" or "we've just replaced
all uses of a def". It's inexpensive to calculate and will be useful.
(rebased by Kenneth Graunke)
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
Our code to initialize gl_SubgroupInvocation uses multiple instructions
some of which are partial writes. This makes it difficult to analyze
expressions involving gl_SubgroupInvocation, which appear very
frequently in compute shaders.
To make this easier, we add a new virtual opcode which initializes
a full VGRF to the value of gl_SubgroupInvocation. (We also expand
it to UD for SIMD8 so there are not partial write issues.) We then
lower it to the original code later on in compilation, after we've
done the bulk of our optimizations.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
The support is incomplete and largely untested, but more importantly
glsl ir is depreciated at this point. This feature was added to support
building additional passes but that shouldn't ever be needed from here
on.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29469>
for drivers that don't support PIPE_CAP_SHAREABLE_SHADERS,
the zombie shader mechanism is used, storing shaders to delete after
the next flush
the zombie mechanism also calls bind_*_state(pipe, NULL) during deletion,
however, which breaks drivers in the following scenario:
* create_all_shaders(pipe_A)
* bind_vs(pipe_A, vs_A)
* bind_fs(pipe_A, fs_A)
* draw(pipe_A)
* makeCurrent(pipe_B)
* delete_vs(pipe_B, vs_B)
* vs_B must only be deleted on pipe_A
* zombie_shader_add(pipe_A, vs_B)
* makeCurrent(pipe_A)
* free_zombie_shaders(pipe_A)
* bind_vs(pipe_A, NULL)
* delete_vs(pipe_A, vs_B)
* draw(pipe_A)
* boom
the problem being that bind_vs(pipe_A, NULL) was called when deleting
vs_B, but it was actually vs_A which was bound
to solve this, just flag the shader state for updating and let st figure it out
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11122
cc: mesa-stable
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29680>
Most passes want to preserve this specific combination of metadata, so let's add
an alias for the combination. The alias communicates that the control flow graph
is preserved, rather than a particular statement about e.g. dominance
preservation.
You don't need to understand dominance to write a simple
nir_shader_instructions_pass. And since you were going to cargo cult the
metadata anyway, this way you'll cargo cult a version you're more likely to
understand.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Acked-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Konstantin Seurer <konstantin.seurer@gmail.com>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29745>
The semantics of discard differ between GLSL and HLSL and
their various implementations. Subsequently, numerous application
bugs occurred and SPV_EXT_demote_to_helper_invocation was written
in order to clarify the behavior. In NIR, we now have 3 different
intrinsics for 2 things, and while demote and terminate have clear
semantics, discard still doesn't and can mean either of the two.
This patch entirely removes nir_intrinsic_discard and
nir_intrinsic_discard_if and replaces all occurences either with
nir_intrinsic_terminate{_if} or nir_intrinsic_demote{_if} in the
case that the NIR option 'discard_is_demote' is being set.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27617>
Vectorization can make the constant layout worse and increase
the constant register usage. The worst scenario is vectorization
lowered indirect register access, where we access i-th element
and later we access i-1 or i+1 (most notably glamor and gsk shaders).
In this case we already added constants 1..n where n is the array
size, however we can reuse them unless the lowered ladder gets
vectorized later.
Thus prevent vectorization of the specific patterns from lowered
indirect access.
This is quite a heavy hammer, we could in theory estimate how many
slots will the current ubos and constants need and only disable
vectorization when we are close to the limit. However, this would
likely need a global shader analysis each time r300_should_vectorize_inst
is called, which we want to avoid.
So for now just don't vectorize anything that loads constants if we
already have lot of uniforms.
This is the final missing piece to make glamor work on R400.
shader-db R420:
total instructions in shared programs: 107288 -> 107290 (<.01%)
instructions in affected programs: 236 -> 238 (0.85%)
helped: 2
HURT: 3
total temps in shared programs: 17730 -> 17726 (-0.02%)
temps in affected programs: 41 -> 37 (-9.76%)
helped: 4
HURT: 0
total cycles in shared programs: 163251 -> 163251 (0.00%)
cycles in affected programs: 478 -> 478 (0.00%)
helped: 2
HURT: 3
GAINED: 7 (2 glamor and 5 GSK shaders)
RV370 is quite similar instruction/temp-wise, but we don't gain any
shader there, because they are all over the 64 instructions limit...
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10787
Signed-off-by: Pavel Ondračka <pavel.ondracka@gmail.com>
Reviewed-by: Filip Gawin <filip.gawin@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29734>
The commit 973e6f3b implemented compaction of the stream-number space. The
functions `update_vertex_elements(_sw)` began using the post-compacted
stream-numbers/indices when maintaining the `stream_usage_mask` and
when reading from the arrays `vtxstride` and `stream_freq`.
But, the `stream_instancedata_mask`, with which the `stream_usage_mask`
is compared/bitwise-anded, maintains bits for the pre-compacted indices.
Additionally, the information within the arrays is stored using the
pre-compacted indices.
The functions have a disagreement, regarding the type (pre- vs post-
compacted) of indices, with the rest of the relevant source. This change
removes the disagreement by having them use pre-compacted indices when
maintaining the `stream_usage_mask` and when reading from the arrays.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11283
Fixes: 973e6f3b ("gallium: remove start_slot parameter from pipe_context::set_vertex_buffers")
Reviewed-by: Axel Davy <davyaxel0@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29704>