When the system is under memory pressure (which can happen, for
example, during CI runs), don't immediately give up on the exec ioctl
(which, for Vulkan, will result in the device being declared lost).
Instead, retry a few times, just like we do for i915.ko.
This is a trade-off.
One of the reasons to *not* have unified behavior regarding ENOMEM
between i915.ko and xe.ko is the fact that xe.ko uses vm_bind, so if
the user tried to bind more memory than is available, we'll just keep
getting ENOMEM for as long as we retry the ioctl. We now have a retry
limit, so we'll eventually return the error.
On the other hand, if the problem is other applications consuming all
the memory, having the retry loop may really help avoid unnecessarily
marking the device as lost, since one of our retries may eventually
succeed.
I believe the trade-off of "we'll now eventually succeed in some cases
where it's possible to succeed, at the expense of retrying for a few
seconds until giving up in cases where we would never be able to
succeed" is an improvement.
If xe.ko ever gives us a way to differentiate between the two
different reasons for ENOMEM, we'll be able to make things much
better. We can also tune our timeouts if needed.
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37559>
If nothing has freed memory by that point, return the error, which
may make the upper layers report the device as lost. It could be that
the system is swapping very heavily and that waiting a little
longer would make it work, but let's try 16s for now.
v2: Bring down the timeout from ~60s to ~16s (José).
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37559>
If the ioctl is returning ENOMEM, incessantly retrying does not seem
to be the best way to proceed. After the second retry, sleep 0.1ms,
then more each time, giving the CPU some time to run the other threads
and processes, in the hope that whatever is eating all the memory
might eventually return it.
If the problem is the current thread, then busy looping won't help
either, so here we at least save some power before the user kills the
app.
v2: Adjust the control flow and the sleep time.
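The retry policy described above can be sketched as follows. This is a
minimal illustration, not the driver code: submit_fn, fake_submit,
submit_with_backoff and the doubling of the sleep time are all
assumptions made for the example (the text only says "sleep 0.1ms, then
more each time").

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callback standing in for the real exec ioctl wrapper.
 * It returns 0 on success or a negative errno value. */
typedef int (*submit_fn)(void *ctx);

/* Hypothetical fake used to exercise the loop: fails with -ENOMEM a
 * fixed number of times, then succeeds. */
struct fake_ctx {
   int failures_left;
   int calls;
};

static int fake_submit(void *data)
{
   struct fake_ctx *ctx = data;
   ctx->calls++;
   if (ctx->failures_left > 0) {
      ctx->failures_left--;
      return -ENOMEM;
   }
   return 0;
}

/* Sleep for the given number of nanoseconds (early wakeups are ignored). */
static void sleep_ns(uint64_t ns)
{
   struct timespec ts = {
      .tv_sec = (time_t)(ns / 1000000000ull),
      .tv_nsec = (long)(ns % 1000000000ull),
   };
   nanosleep(&ts, NULL);
}

/* Retry on -ENOMEM up to max_retries times. The first two retries are
 * immediate; later ones sleep first, starting at 0.1ms and doubling
 * (the doubling is an assumption of this sketch). */
static int submit_with_backoff(submit_fn submit, void *ctx,
                               unsigned max_retries)
{
   uint64_t sleep_time_ns = 100000; /* 0.1ms */
   int ret = submit(ctx);

   for (unsigned retry = 1; ret == -ENOMEM && retry <= max_retries; retry++) {
      if (retry > 2) {
         sleep_ns(sleep_time_ns);
         sleep_time_ns *= 2;
      }
      ret = submit(ctx);
   }

   return ret;
}
```

Any error other than -ENOMEM is returned right away, and once the retry
budget is exhausted the caller still sees -ENOMEM, so the upper layers
can decide whether to report the device as lost.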
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37559>
By dumping the contents of a HiZ buffer before and after fast-clearing,
I've observed that a zeroed HiZ block corresponds to the CLEAR state
until gfx12. The fast-clearing application was piglit's bin/hiz. I ran
this test on a couple bare metal platforms (ICL and BDW) and many
simulated ones (SKL, TGL, DG2, and LNL).
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36383>
With EXT_shader_object, it became possible to compile shaders
independently and then use them together later, so we cannot rely on the
lack of task shader data to decide that no task shader will be used. The
flag VK_SHADER_CREATE_NO_TASK_SHADER_BIT_EXT exists for that purpose,
but it doesn't really make any difference for us. Always assume that if
the mesh shader is reading the task payload, it's going to be used with
one, as otherwise the application is doing it wrong.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13983
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37648>
Sometimes the compute shader workgroup size requires a larger SIMD width
than the minimum in order to fit in the available threads. In that case
we'll skip the SIMD8 shader, and need to try SIMD16 regardless of how
the register pressure estimate looks.
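As a rough illustration of the rule above, here is a sketch of how the
minimum SIMD width could interact with the pressure heuristic. The
function names, the max_threads parameter and the exact decision order
are assumptions for this example, not the compiler's actual code.

```c
#include <stdbool.h>

/* Smallest SIMD width (8, 16 or 32) for which the workgroup fits in
 * max_threads hardware threads, or 0 if no width fits. max_threads is
 * a hypothetical per-platform limit. */
static unsigned min_simd_for_workgroup(unsigned workgroup_size,
                                       unsigned max_threads)
{
   for (unsigned simd = 8; simd <= 32; simd *= 2) {
      unsigned threads = (workgroup_size + simd - 1) / simd;
      if (threads <= max_threads)
         return simd;
   }
   return 0;
}

/* Decide whether to compile at width simd. Widths below the required
 * minimum are skipped, the required minimum is always tried, and the
 * pressure estimate may only veto widths above it. */
static bool should_compile_simd(unsigned simd, unsigned required_simd,
                                bool pressure_too_high)
{
   if (simd < required_simd)
      return false;
   if (simd == required_simd)
      return true;
   return !pressure_too_high;
}
```

For example, a 1024-invocation workgroup with a limit of 64 threads
needs at least SIMD16, so SIMD16 is compiled even when the pressure
estimate is high.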
Fixes: 3af4e63061 ("brw: Skip compilation of larger SIMDs when pressure is too high")
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Tested-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37649>
This allows us to skip the entire backend compilation process for
large SIMD widths when register pressure is high enough that we'd
likely decide to prefer a smaller one in the end anyway. The hope
is to make the same decisions as before, but with less CPU overhead.
We are making mostly the same decisions as before:
| API / Platform | Total Shaders | Changed | % Identical |
----------------------------------------------------------
| VK / Arc A770  |       905,525 |   1,157 |     99.872% |
| VK / Arc B580  |       788,127 |      53 |     99.993% |
| VK / Panther   |       786,333 |      13 |     99.998% |
| GL / Arc A770  |       308,618 |     269 |     99.913% |
| GL / Arc B580  |       264,066 |      13 |     99.995% |
| GL / Panther   |       273,212 |       0 |    100.000% |
Improves compile times on my i7-12700K:
| Game                      | Arc B580 | Arc A770 |
---------------------------------------------------
| Assassins Creed: Odyssey  |  -13.47% |  -10.98% |
| Borderlands 3 (DX12)      |  -10.05% |  -11.31% |
| Dark Souls 3              |  -21.06% |  -21.08% |
| Oblivion Remastered       |  -11.10% |   -9.82% |
| Phasmophobia              |  -32.73% |  -31.00% |
| Red Dead Redemption 2     |  -20.10% |  -14.38% |
| Total War: Warhammer III  |  -10.11% |  -14.44% |
| Wolfenstein Youngblood    |  -15.91% |  -13.47% |
| Shadow of the Tomb Raider |  -30.23% |  -25.86% |
It seems to have nearly no effect on compile times on Xe3 unfortunately,
as only 1,014 shaders in fossil-db even fail SIMD32 compilation in the
first place, and we want to let most of the "might succeed" cases
through to the backend for throughput analysis.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36750>
This tries to calculate an underestimate (lower bound) for the register
pressure at various SIMD widths, by counting live values in the NIR
shader. This fundamentally won't be accurate, but it can give us an
idea of whether it's even worth trying a certain SIMD-width compile.
Doing this at the NIR level means we:
- Can use SSA structure rather than fuzzy liveness intervals
- Can avoid the backend scheduler aggressively trying to hide latency,
presenting an overinflated view of the register pressure
- Have divergence information on-hand, making it easier to "scale up"
- Can skip cloning and optimizing NIR for compute shader SIMD widths
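To illustrate the idea of scaling a live-value count by SIMD width,
here is a simplified sketch. It does not use NIR: struct value_info,
the linear instruction indices, and the 8-dword register size are all
assumptions made for the example, and rounding each value up to whole
registers is a simplification, so the result is only a rough estimate.

```c
#include <stdbool.h>

/* Hypothetical, simplified description of one SSA value. */
struct value_info {
   unsigned def_ip;      /* instruction index that defines the value */
   unsigned last_use_ip; /* last instruction index that reads it */
   unsigned dwords;      /* 32-bit components per invocation */
   bool divergent;       /* false if every invocation holds the same value */
};

/* Assumed register size of 8 dwords (32 bytes); this varies by platform. */
#define REG_DWORDS 8

/* Registers needed by one value: divergent values need one copy per
 * invocation, uniform values need a single copy. */
static unsigned regs_for_value(const struct value_info *v,
                               unsigned simd_width)
{
   unsigned dwords = v->dwords * (v->divergent ? simd_width : 1);
   return (dwords + REG_DWORDS - 1) / REG_DWORDS;
}

/* Maximum, over all instruction indices, of the registers held by the
 * values that are live at that index. */
static unsigned estimate_pressure(const struct value_info *vals,
                                  unsigned num_vals, unsigned num_ips,
                                  unsigned simd_width)
{
   unsigned max_live = 0;

   for (unsigned ip = 0; ip < num_ips; ip++) {
      unsigned live = 0;
      for (unsigned i = 0; i < num_vals; i++) {
         if (vals[i].def_ip <= ip && ip <= vals[i].last_use_ip)
            live += regs_for_value(&vals[i], simd_width);
      }
      if (live > max_live)
         max_live = live;
   }

   return max_live;
}
```

Because divergence is known per value, the same liveness data can be
reused for every SIMD width: only the per-value register count changes.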
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36750>
We were doing a lot of NIR work repeatedly for each SIMD variant of
compute and mesh shaders. Instead, do it once before cloning, and
just do one final optimization loop and out-of-SSA for each.
fossil-db results on Arc B580:
Totals:
Instrs: 233771096 -> 233794024 (+0.01%); split: -0.01%, +0.02%
Subgroup size: 15922768 -> 15922736 (-0.00%); split: +0.00%, -0.00%
Send messages: 12095619 -> 12098234 (+0.02%); split: -0.00%, +0.02%
Loop count: 137562 -> 137523 (-0.03%)
Cycle count: 32600323744 -> 32667411252 (+0.21%); split: -0.06%, +0.27%
Spill count: 540908 -> 542027 (+0.21%); split: -0.07%, +0.28%
Fill count: 700938 -> 698983 (-0.28%); split: -0.73%, +0.45%
Scratch Memory Size: 37266432 -> 37304320 (+0.10%); split: -0.10%, +0.20%
Max live registers: 72691728 -> 72692987 (+0.00%); split: -0.00%, +0.00%
Non SSA regs after NIR: 67690309 -> 67688352 (-0.00%); split: -0.01%, +0.00%
Totals from 3576 (0.45% of 789301) affected shaders:
Instrs: 6932956 -> 6955884 (+0.33%); split: -0.41%, +0.74%
Subgroup size: 88816 -> 88784 (-0.04%); split: +0.09%, -0.13%
Send messages: 329168 -> 331783 (+0.79%); split: -0.02%, +0.81%
Loop count: 8753 -> 8714 (-0.45%)
Cycle count: 15153678820 -> 15220766328 (+0.44%); split: -0.14%, +0.58%
Spill count: 213751 -> 214870 (+0.52%); split: -0.18%, +0.71%
Fill count: 282616 -> 280661 (-0.69%); split: -1.82%, +1.13%
Scratch Memory Size: 13056000 -> 13093888 (+0.29%); split: -0.27%, +0.56%
Max live registers: 834757 -> 836016 (+0.15%); split: -0.11%, +0.26%
Non SSA regs after NIR: 995033 -> 993076 (-0.20%); split: -0.48%, +0.28%
Looking at a few of the shaders with substantial instruction count
increases, it appears that it is largely due to more loops being
unrolled, which is probably actually a good thing.
The compile time impact of this patch appears to be negligible.
However, doing postprocessing before SIMD cloning allows us to
examine the postprocessed SSA-form NIR for improvements in an
upcoming patch.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36750>
brw_postprocess_nir contains a lot of stuff these days. The first part
does a bunch of lowering and cleanup optimizations in SSA form. The
second part does some post-optimization lowering and the out-of-SSA
conversion.
We may want to do additional work before the post-optimization/post-SSA
phase. Splitting this allows us to insert such tasks in the "middle".
For convenience, brw_postprocess_nir() becomes a wrapper which invokes
both parts, so callers can continue working as they did until they have
a reason to do otherwise.
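The shape of the split can be sketched like this. The struct and the
function names below are hypothetical placeholders chosen for the
example; they are not the names used in the driver.

```c
/* Hypothetical stand-in for the shader being processed. */
struct shader_sketch {
   int lowering_done; /* set once the SSA-form lowering/cleanup has run */
   int in_ssa;        /* nonzero while the shader is still in SSA form */
};

/* First part: lowering and cleanup optimizations, still in SSA form. */
static void postprocess_ssa_part(struct shader_sketch *s)
{
   s->lowering_done = 1;
}

/* Second part: post-optimization lowering and out-of-SSA conversion. */
static void postprocess_out_of_ssa_part(struct shader_sketch *s)
{
   s->in_ssa = 0;
}

/* Convenience wrapper: existing callers keep calling one function and
 * get both parts in order. */
static void postprocess_sketch(struct shader_sketch *s)
{
   postprocess_ssa_part(s);
   postprocess_out_of_ssa_part(s);
}
```

A caller that wants to do extra work in the "middle" would call the two
parts itself and insert that work between them, while the shader is
still in SSA form.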
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36750>
This allows us to lower known subgroup size cases earlier, giving us
some earlier optimization opportunities. We would need to know the
actual SIMD width to handle certain cases, but we can just pass 0 here,
which will lead to get_subgroup_size returning 0 - the same as leaving
this unset. We can come back to that later during the per-SIMD-width
postprocessing.
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36750>
float_controls2 may have marked these as needing to preserve NaN or
other values. If so, our newly contracted ffma needs to as well.
Fixes dEQP-VK.spirv_assembly.instruction.compute.float_controls2.*.input_args.mat_det_testedWithout_NotNan*
when nir_opt_algebraic is run after this pass.
Cc: mesa-stable
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36750>
If the (NIR) destination is a register (i.e., not an SSA value), the
destination of the BRW instruction will not be is_scalar. This occurs in
some shaders in Final Fantasy XVI (and
finalfantasytype0_1.rdc.2826e29da3722a83.1.foz).
If the destination is not is_scalar, revert most of this code to the
state prior to f3593df877. This means:
- Allocate a SIMD1 register and UNDEF it.
- Emit a SIMD1 MOV_RELOC_IMM to that register.
- Emit an additional MOV to expand the SIMD1 result.
Closes: #12520
Fixes: f3593df877 ("brw/nir: Treat load_reloc_const_intel as convergent")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37384>