Even though the hardware does not naively support these configurations,
there are many potential benefits to advertising them. These
configurations can theoretically use half the memory bandwidth for loads
and stores. For large matrices, that can be the limiting in performance.
The current implementation, however, has a number of significant
problems.
The conversion from float16 to float32 is performed in the driver during
conversion from NIR. As a result, many common usage patterns end up
doing back-to-back conversions to and from float16 between matrix
multiplications (when the result of one multiplication is used as the
accumulator for the next).
The float16 version of the matrix waste half the possible register
space. Each float16 value sits alone in a dword. This is done so that
the per-invocation slice of an 8x8 float16 result matrix and an 8x8
float32 result matrix will have the same number of elements. This makes
it possible to do straightforward implementations of all the unary_op
type conversions in NIR.
It would be possible to perform N:M element type conversions in the
backend using specialized NIR intrinsics. However, per #10961, this
would be very, very painful. My hope is that, once a suitable resolution
for that issue can be found, support for these configs can be restored.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28834>
The r0.5 thread payload register contains Surface State Offset bits
[27:6] as bits [31:10], so we need to shift the register right by 4 in
order to get the surface state offset expected in ExBSO mode, which is
the only extended descriptor encoding supported by the UGM shared
function for SS addressing on Xe2+.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29543>
Our code to initialize gl_SubgroupInvocation uses multiple instructions
some of which are partial writes. This makes it difficult to analyze
expressions involving gl_SubgroupInvocation, which appear very
frequently in compute shaders.
To make this easier, we add a new virtual opcode which initializes
a full VGRF to the value of gl_SubgroupInvocation. (We also expand
it to UD for SIMD8 so there are not partial write issues.) We then
lower it to the original code later on in compilation, after we've
done the bulk of our optimizations.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28666>
The semantics of discard differ between GLSL and HLSL and
their various implementations. Subsequently, numerous application
bugs occurred and SPV_EXT_demote_to_helper_invocation was written
in order to clarify the behavior. In NIR, we now have 3 different
intrinsics for 2 things, and while demote and terminate have clear
semantics, discard still doesn't and can mean either of the two.
This patch entirely removes nir_intrinsic_discard and
nir_intrinsic_discard_if and replaces all occurences either with
nir_intrinsic_terminate{_if} or nir_intrinsic_demote{_if} in the
case that the NIR option 'discard_is_demote' is being set.
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Faith Ekstrand <faith.ekstrand@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27617>
Pass in the nir_src and check if it's constant, handling it via CPU-side
arithmetic instead of emitting instructions. While we can constant fold
these via our optimization passes, we have to do opt_algebraic to fold
the binary operation with constant sources into a MOV of an immediate,
then opt_copy_propagation to put it in the next expression, and so on,
until the entire expression is folded. This can take several iterations
of the optimization loop, which is inefficient.
For example, gfxbench5/aztec-ruins/normal/7 has load/store_scratch
intrinsics with constant sources, and this patch removes a number of
optimization passes according to INTEL_DEBUG=optimizer.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29624>
Geometry shaders write outputs multiple times, with EmitVertex()
between them. The value of output variables becomes undefined after
calling EmitVertex(), so we don't need to preserve those. This lets
us recreate new registers after each EmitVertex(), assuming we aren't
in control flow, allowing them to have separate live ranges. It also
means that those registers are more likely to be written once, rather
than having multiple writes, which can make optimization easier.
This is pretty much a total hack, but it's helpful.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29624>
In 2c65d90bc8 I forgot to add the new SHADER_OPCODE_READ_MASK_REG
opcode to the list of barrier instruction in the scheduler. Let's just
use a single opcode for all ARF registers that need special
scoreboarding and put the register as source (nicer for the debug
output).
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 2c65d90bc8 ("intel/brw: ensure find_live_channel don't access arch register without sync")
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29446>
This allows us to not generate 64-bit iadd3 on Intel but continue
generating it for NVIDIA.
No shader-db or fossil-db changes.
v2: Add nir_lower_iadd3_64 flag so we can continue to generate 64-bit
iadd3 on NVIDIA platforms.
v3: s/bit_size == 64/s == 64/. This cut-and-paste bug prevented any of
the optimizations from ever occuring.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29148>
Gfx11-12 can support SLM block loads via OWord Block Load messages
(notably, the aligned version, not the unaligned version).
A while back we deleted the SHADER_OPCODE_OWORD_BLOCK_READ opcode.
Rather than bring it back, we continue using UNALIGNED_OWORD_BLOCK_READ
for SLM block access (like we do for SSBOs) but switch it over to the
aligned variant when lowering logical sends. We do ensure the alignment
is at least 16B, however. This is ugly, but it's probably not worth
bringing back a whole extra opcode for a legacy HDC block load quirk.
References: BSpec 47652 and 1689
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/9960
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29429>
This bit from the comment should have been a big red flag:
There are currently zero instances of fsign(double(x))*IMM in
shader-db or any test suite, so it is hard to care at this time.
The implementation of that path was incorrect. The XOR instructions
should be predicated like the OR instruction in the non-multiplication
path. As a result, dsign(zero_value) * x will not produce the correct
result.
Instead of fixing this code that is never exercised by anything, replace
it with the simple lowering in NIR.
No shader-db or fossil-db changes on any Intel platform.
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29095>
Fixes fs-uint-to-float-of-extract-int8.shader_test and
fs-uint-to-float-of-extract-int16.shader_test added by piglit!883.
No shader-db or fossil-db changes on any Intel platform.
v2: Expand the comment explaining the potential problem. Suggested by
Caio.
Fixes: 29ce110be6 ("i965/fs: Remove extract virtual opcodes.")
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27891>
When emitting a sampler message, we allocate a temporary destination
large enough to hold 4 values (or 5 for sparse). This is the maximum
size needed to hold any result. However, we shrink the size written by
the sampler message to skip writing any trailing components that NIR
tells us are never read. So we may not write the entire temporary.
The NIR texture instruction has a destination VGRF which is sized
assuming that all components are present. We issue a LOAD_PAYLOAD
instruction to copy our sampler result temporary to the NIR destination.
When we reduce the response length of the sampler messages, then some of
these temporary components have undefined values. The correct way to
indicate that is by using a BAD_FILE source. Unfortunately, we were
naively reading offsets of the temporary that were never written, but
are still part of a larger VGRF. This complicates things.
For example, sampling and only using RGB (not RGBA) was producing this:
txl_logical(8) (written: 3) vgrf3+0.0:F, ...
undef(8) (written: 4) vgrf4:UD
load_payload(8) (written: 4) vgrf4:F, vgrf3+0.0:F, vgrf3+1.0:F, vgrf3+2.0:F, vgrf3+3.0:F
The last source, vgrf3+3.0:F, is undefined, and should be BAD_FILE.
Doing so allows VGRF splitting and other optimizations to work better.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28971>
With the previous commit, we now have new builder helpers that will
allocate a temporary destination for us. So we can eliminate a lot
of the temporary naming and declarations, and build up expressions.
In a number of cases here, the code was confusingly mixing D-type
addresses with UD-immediates, or expecting a UD destination. But the
underlying values should always be positive anyway. To accomodate the
type inference restriction that the base types much match, we switch
these over to be purely UD calculations. It's cleaner to do so anyway.
Compared to the old code, this may in some cases allocate additional
temporary registers for subexpressions.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28957>