The intel_perf_counter_pass::pass field is actually useless and
invalid.
Once you have mapped all the counters to all the metrics, the order of
the metrics capture is dictated by intel_perf_get_n_passes().
When reading values that is the order we should follow.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 2001a80d4a ("anv: Implement VK_KHR_performance_query")
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18893>
__builtin_clz(value - 1) is undefined for with value=1 (because
__builtin_clz(0) is undefined).
Because we set rt_pipeline->stack_size = 1 when a ray tracing pipeline
doesn't need any stack allocation to differentiate from a dynamic size
(rt_pipeline->stack_size = 0) we can run into this undefinied behavior
issue.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: f68d64dac0 ("anv: Add support for vkCmdSetRayTracingPipelineStackSizeKHR")
Reviewed-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19781>
In 23c7142cd6 ("anv: disable SIMD16 for RT shaders") we were forcing the SIMD8
using the mechanism for subgroup size control, which is problematic since it has
other effects on the shader behavior.
The code was changed to select the SIMD in a different way in the previous patches,
so we can revert the behavior to the original semantics.
Fixes dEQP-VK.subgroups.builtin_var.ray_tracing.subgroupsize.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19601>
The code builds up the dynamic array of objects (spirv_objs) and
collect pointers to each of them into another dynamic
array (spirv_ptr_objs).
If the growth of the first array cause a reallocation, it is
possible that the previous pointers end up invalid.
Fixes: 77e929a527 ("intel/clc: allow multiple CL files to be compiled together")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19730>
On Intel HW we use the same mechanism for internal operations surfaces
as well as application surfaces (VkDescriptor).
This change splits the surface pool in 2, one part dedicated to
internal allocations, the other to application VkDescriptors.
To do so, the STATE_BASE_ADDRESS::SurfaceStateBaseAddress points to a
4Gb area, with the following layout :
- 1Gb of binding table pool
- 2Gb of internal surface states
- 1Gb of bindless surface states
That way any entry from the binding table can refer to both internal &
bindless surface states but none of the driver allocations interfere
with the allocation of the application.
Based off a change from Sviatoslav Peleshko.
v2: Allocate image view null surface state from bindless heap (Sviatoslav)
Removed debug stuff (Sviatoslav)
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/7110
Cc: mesa-stable
Tested-by: Sviatoslav Peleshko <sviatoslav.peleshko@globallogic.com>
Reviewed-by: Tapani Pälli <tapani.palli@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19275>
Some review feedback of an earlier commit caused me to rearrange some
code quite a bit. I wasn't paying enough attention while applying the
later commits, and these breaks should have been returns. As it is, the
result of the imin or imax analysis is overwritten by the default case
handling... effectively the original commit does nothing. :(
Tiger Lake and Ice Lake had similar results. (Ice Lake shown)
total instructions in shared programs: 19914090 -> 19904772 (-0.05%)
instructions in affected programs: 121258 -> 111940 (-7.68%)
helped: 445 / HURT: 0
total cycles in shared programs: 855291535 -> 855266659 (<.01%)
cycles in affected programs: 2737005 -> 2712129 (-0.91%)
helped: 426 / HURT: 17
LOST: 0
GAINED: 3
Skylake and Broadwell had similar results. (Skylake shown)
total cycles in shared programs: 842395356 -> 842338259 (<.01%)
cycles in affected programs: 5460985 -> 5403888 (-1.05%)
helped: 458 / HURT: 0
Haswell and Ivy Bridge had similar results. (Haswell shown)
total instructions in shared programs: 16710449 -> 16708449 (-0.01%)
instructions in affected programs: 44101 -> 42101 (-4.54%)
helped: 75 / HURT: 0
total cycles in shared programs: 882760230 -> 882727923 (<.01%)
cycles in affected programs: 2867797 -> 2835490 (-1.13%)
helped: 62 / HURT: 10
No shader-db change on any other Intel platform.
No fossil-db changes on any Intel platform.
Fixes: 5ec75ca10d ("intel/compiler: Teach signed integer range analysis about imax and imin")
Reviewed-by: Marcin Ślusarz <marcin.slusarz@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19602>
The back-end swizzles dwords so that our indirect scratch messages match
the memory layout of spill/fill messages for better cache coherency.
The swizzle happens at a DWORD granularity. If a read or write crosses
a DWORD boundary, the first bit will get correctly swizzled but whatever
piece lands in the next dword will not because the scatter instructions
assume sequential addresses for all bytes. For DWORD writes, this is
handled naturally as part of scalarizing. For smaller writes, we need
to be sure that a single write never escapes a dword.
Fixes: fd04f858b0 ("intel/nir: Don't try to emit vector load_scratch instructions")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/7364
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19580>
Because dup_mem_intrinsic() retains the SSA offset from the original
intrinsic and only modifies it by adding a constant, we can compute the
alignment based on the original alignment and the constant offset. This
is both easier and more accurate.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19580>
Many Intel platforms can only perform 32x16 bit multiplication. The
straightforward way to implement 32x32 bit multiplications is by
splitting one of the operands into high and low parts called H and L,
repsectively. The full multiplication can be implemented as:
((A * H) << 16) + (A * L)
On Intel platforms, special register accesses can be used to eliminate
the shift operation. This results in three instructions and a temporary
register for most values.
If H or L is 1, then one (or both) of the multiplications will later be
eliminated. On some platforms it may be possible to eliminate the
multiplication when H is 256.
If L is zero (note that H cannot be zero), one of the multiplications
will also be eliminated.
Instead of splitting the operand into high and low parts, it may
possible to factor the operand into two 16-bit factors X and Y. The
original multiplication can be replaced with (A * (X * Y)) = ((A * X) *
Y). This requires two instructions without a temporary register.
I may have gone a bit overboard with optimizing the factorization
routine. It was a fun brainteaser, and I couldn't put it down. :) On my
1.3GHz Ice Lake, a standalone test could chug through 1,000,000 randomly
selected values in about 5.7 seconds. This is about 9x the performance
of the obvious, straightforward implementation that I started with.
v2: Drop an unnecessary return. Rearrange logic slightly and rename
variables in factor_uint32 to better match the names used in the large
comment. Both suggested by Caio. Rearrange logic to avoid possibly
using `a` uninitialized. Noticed by Marcin.
v3: Use DIV_ROUND_UP instead of open coding it. Noticed by Caio.
Tiger Lake, Ice Lake, Haswell, and Ivy Bridge had similar results. (Ice Lake shown)
total instructions in shared programs: 19912558 -> 19912526 (<.01%)
instructions in affected programs: 3432 -> 3400 (-0.93%)
helped: 10 / HURT: 0
total cycles in shared programs: 856413218 -> 856412810 (<.01%)
cycles in affected programs: 122032 -> 121624 (-0.33%)
helped: 9 / HURT: 0
No shader-db changes on any other Intel platforms.
Tiger Lake and Ice Lake had similar results. (Ice Lake shown)
Instructions in all programs: 141997227 -> 141996923 (-0.0%)
Instructions helped: 71
Cycles in all programs: 9162524757 -> 9162523886 (-0.0%)
Cycles helped: 63
Cycles hurt: 5
No fossil-db changes on any other Intel platforms.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17718>
This is especially helpful for a*isign(a) generated by idiv_by_const
optimization. On many GPUs, isign(a) is lowered to imax(imin(a, 1),
-1).
There are no changes on fossil-db because ANV uses a different
optimization path for idiv with a constant denominator. A future MR
will change this.
NOTE: This commit used to help a few hundred shader-db shaders, but
now none are affected. I suspect this is due to some change in the
idiv_by_const optimization. This could possibly be dropped.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17718>