Gfx 12.5 moved scratch to a surface and SURFTYPE_SCRATCH has this pitch
restriction:
RENDER_SURFACE_STATE::Surface Pitch
For surfaces of type SURFTYPE_SCRATCH, valid range of pitch is:
[63,262143] -> [64B, 256KB]
The pitch of the surface is the scratch size per thread and the surface
should be large enough to accommodate every physical thread.
So here adding a new field to intel_device_info, setting it in
intel_device_info_init_common() so even offline tools can have it set.
And finally adding a check to fail shader compilation if needed
scratch is larger than supported.
This issue can be reproduced in debug builds when running
dEQP-VK.protected_memory.stack.stacksize_1024 on Gfx 12.5 or newer
platforms.
Ref: BSpec 43862 (r52666)
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30271>
We want to use actual sparse-capable queues as the default
trtt->queue, not copy queues that may have a companion_rcs_batch.
Before this patch, if we expose more than one queue *and* the
application creates a copy queue first, we'll end up setting
trtt->queue as the copy queue, which will GPU hang when we submit the
TR-TT batches as they don't support the pipe_control commands we
issue.
The trtt->queue queue is used for binding/unbinding buffers in code
paths where there's no specific queue coming from user space, such as
when we're creating or destroying a sparse resource.
This is not a problem yet on i915.ko since we are exposing
only a single queue, and it is not a problem for xe.ko since TR-TT is
not the default there. This is also not a problem in applications
that create the render or compute queue first. We plan to expose more
queues when using TR-TT, so this would become a problem without this
patch.
None of VK-GL-CTS seems to exercise that, and none of the Steam games
I tested exercise that as well. I was able to reproduce this issue
using our internal tracing tool.
v2: New implementation that doesn't break when we only have a compute
queue (Lionel).
Fixes: 04bfe828db ("anv/sparse: allow sparse resouces to use TR-TT as its backend")
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30252>
On Gen12 (the oldest we support on Mesa right now for TR-TT) we
started having per-engine TR-TT registers and we are supposed to make
all contexts share the same TR-TT programming.
On LNL+, this is documented in the BSpec page for the TRTT_CNTRL
register (68417), with more details in HSDs 14020454786 and
16022013154.
On Gen12 platforms this information is a little harder to find and
there's a whole trail of HSDs leading up to 1209977595, which links to
the documents that describe the programming. BSpec for TR-TT on Gen12
is very confusing as it still contains registers and other information
from Gen11 that were not removed.
Regarding the additional BLT and COMP registers, please notice that on
the BSpec pages for the TR-TT registers, the "Register Instance"
section only lists the GFX registers as non-privileged. However, the
"User Mode Privileged Commands" lists the other instances of the TR-TT
Regsiters as non-privileged, which matches what we see: there's no
need to put these addresses in the FORCE_TO_NONPRIV registers.
Notice that for now, when TR-TT is being used we only expose a single
queue, so this change effectively does nothing until we start exposing
extra queues. I left that part for later to help bisectability.
v2:
- s/trtt_init_context_state/trtt_init_queues_state/ (José)
- pass device as the argument to init_queues_state (José)
v3:
- use async_submit_end (José)
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30252>
Having this as a separate batch was the normal behavior until
7da5b1caef ("anv: move trtt submissions over to the
anv_async_submit").
While it certainly sounds better to do everything related to TR-TT
initialization in one batch, we need to revert it back to be a
separate batch (but now using the new anv_async_submit infrastructure)
because we'll want to run this batch on every engine.
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30252>
On MTL ChromeOS boards, during AI based video conference, we were
observing a lot of overhead from invalidations. Upon debug, it was found
that we were using clflush in this function and that isn't efficient.
With this change, while executing compute workloads like zoo models, we
are getting ~25% performance improvements in a best case scenario.
Rework:
* Jordan: Call intel_clflushopt_range() rather than
__builtin_ia32_clflushopt() because intel_mem.c is not compiled
with -mclflushopt.
Backport-to: 24.1 24.2
Signed-off-by: Sushma Venkatesh Reddy <sushma.venkatesh.reddy@intel.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30238>
This implements an undocumented workaround for a hardware bug that
affects draw calls with a pixel shader that has 0 push constant cycles
when TBIMR is enabled, which has been seen to lead to a hang with
Fallout 3 and Metal Gear Rising Revengeance. This hardware bug has
been reported as HSDES#22020184996 which is still pending a resolution
by the hardware team. However since this workaround found empirically
has been confirmed to fix the issue reliably and it's relatively
harmless it seems worth checking in already even though no final W/A
number is available nor has the W/A json file been updated.
To avoid the issue we simply pad the push constant payload to be at
least 1 register. This is enabled via a brw_wm_prog_key since the
driver needs to be in agreement with the compiler on whether the dummy
push constant cycle is present, and it can be avoided in cases where
the driver knows that TBIMR will be disabled (e.g. for BLORP).
Related: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10728
Related: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11399
Fixes: 57decad976 ("intel/xehp: Enable TBIMR by default.")
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30031>
Pretty big change... Sorry for that.
I can't exactly remember why I created 2 heaps. I think it's because I
mistakenly thought the samplers in the binding sampler pointers needed
to be indexed from the binding table. But that's not the case, they
just need to be in the dynamic state heap.
In the future, this change will allow to also allocate buffers for
push constant data in the newly created dynamic_visible_pool which
will be useful on < Gfx12.0 where this is the only place push constant
data can live for compute shaders.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30047>
If you have something like this :
con 32 %66 = @load_reg (%62) (base=0, legacy_fabs=0, legacy_fneg=0)
con 32 %27 = @resource_intel (%22 (0xdeaddead), %66, %67, %17 (0x0)) (desc_set=2, binding=96, resource_intel=0, resource_block_intel=-1)
Just copying the brw_reg in ssa_values[] is not enough for the
load_reg intrinsic. We need to call get_nir_src() to force some logic
to create the register correct.
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: b8209d69ff ("intel/fs: Add support for new-style registers")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30050>
When I915_PARAM_HAS_CONTEXT_FREQ_HINT is not supported the
intel_ioctl(fd, DRM_IOCTL_I915_GETPARAM, &gp) will return -1 and that
will cause i915_gem_get_param() to return false.
val will be different than 1 when not using GuC submission, so we are
forcing val check to ensure this holds good in platforms that doesn't
support GuC submission.
Fixes: d52dd5a9 ("anv/drirc: add option to provide low latency hint")
Signed-off-by: Sushma Venkatesh Reddy <sushma.venkatesh.reddy@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30234>
We were generating odd instructions like:
math inv(8) g93<1>HF g85<8,8,1>HF null<8,8,1>F { align1 1Q @7 $4 };
It's unclear whether the type of the null operand matters, but sometimes
these things don't get ignored properly. Out of caution, retype the
null source to match the actual operand's type. It'll at least look
less surprising in assembly dumps.
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30193>
So for this entry we want the CPU mapping to be WC but GPU caches
can be WB.
This way GPU don't need to snoop to CPU caches and at the end of
workloads L3 cache is flushed, so CPU access is coherent after get
the signal that workload was finished.
With this the transient(XD) L3 flushes will only affect displayable
buffers.
Ref: Bspec 71582 (r59285)
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Jianxun Zhang <jianxun.zhang@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29950>
My initial understating was that L3_CACHE_POLICY would be the CPU
caching mode but that has nothing to do with CPU caching, it is the
GPU caching mode.
Due this miss understating we were using a not optimal PAT index that
will be fixed in the next patches, so to avoid such issues in future
adding comments to intel_device_info_pat_entry struct.
Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Jianxun Zhang <jianxun.zhang@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29950>
Xe KMD renamed XE_PERF to XE_OBSERVATION to better match with Intel
specification and avoid confusion.
This uAPI rename will land in the same kernel version that added
the uAPI being renamed.
There is no uAPI change, just renames.
Sync xe_drm.h with 63347fe031e3 ("drm/xe/uapi: Rename xe perf layer as xe observation layer").
Acked-by: Caio Oliveira <caio.oliveira@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30027>
Commit 94989b45a5 ("anv,driconf: Add fake non device local memory WA
for Total War: Warhammer 3") implemented a workaround to make
Warhammer 3 work on ADL, but the game still doesn't work on LNL, which
uses xe.ko, and MTL, which uses i915.ko: it still fails at launch
claiming it couldn't allocate memory.
So in this implementation, instead of clearing DEVICE_LOCAL_BIT we
just duplicate our memory types, one having the bit and one not
having.
v2:
- Check for VK_MAX_MEMORY_TYPES (José)
- Invert the order of the memory types (José)
- Fix white space issue (José)
v3:
- Comment our non-spec-compliance (José)
- Remove useless lines (José)
Link: https://gitlab.freedesktop.org/mesa/mesa/-/issues/8721
Fixes: 94989b45a5 ("anv,driconf: Add fake non device local memory WA for Total War: Warhammer 3")
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30162>
Handle '\n' when inside the MSGDESC start condition,
otherwise the lexer would apply its default rule (write
to stdout).
Without that, newlines were "leaking" to the output when
parsing a multiple line "MsgDesc". E.g. given the file
example.asm below
```
send(8) nullUD g126UD nullUD 0x02000000 0x00000000
thread_spawner MsgDesc: mlen 1 ex_mlen 0 rlen 0
{ align1 WE_all 1Q @1 EOT };
```
the assembler would produce one extra newline
```
$ brw_asm -t hex -g tgl example.asm
31 01 03 80 04 00 00 00 0c 7e 00 70 00 00 00 00
```
Reviewed-by: Sagar Ghuge <sagar.ghuge@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30100>
Commit f695a9fed2 moved the 64-bit float <-> 16-bit float conversion
splitting into a core NIR pass, so the code remaining here is only
needed for 64-bit integer types.
Presumably in an attempt to remove the float handling, it replaced
simple bit_size == 64 checks with this expression:
(full_type & (nir_type_int64 | nir_type_uint64))
I believe that the intended expression was:
(full_type == nir_type_int64 || full_type == nir_type_uint64)
Unfortunately, the former is incorrect. Any integer or unsigned
NIR type would trigger the former expression. For example:
nir_type_uint32 & (nir_type_int64 | nir_type_uint64) => nir_type_uint
This meant that we were splitting e.g. u2f16 on 32-bit unsigned types
into u2f32 and f2f16, when we can easily natively handle that case.
To fix this, we go back to simple bit_size == 64 checks. This pass is
already run after nir_lower_fp16_casts which will split the float case,
so we will never see it here.
fossil-db on Alchemist shows a -1.14% reduction in affected shaders for
google-meet-clvk shaders. In another ChromeOS workload, it improves
performance by around 8% on Meteorlake.
Thanks to Sushma Venkatesh Reddy for finding this performance issue!
Fixes: f695a9fed2 ("intel/compiler: use nir_lower_fp16_casts")
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30091>