In 2183bc73a6 ("nvk: Use suld for EDB uniform texel buffers"), we
started using suld instead of tld for EDB uniform texel buffers because
we needed it for correctness. However, it's slow as mud. Using
suld.constant seems to fix the performance regression. I don't know if
it's quite tld performance, but it's close.
Backport-to: 25.0
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33612>
This is way faster than suld.sys, which is what we're using today. So
far I haven't seen it matter for anything but texel buffers but it
likely helps some app somewhere.
Backport-to: 25.0
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33612>
This should never happen as the client should always give us aligned
addresses. However, in the off chance that it does, aligning down is
probably safer than aligning up as it won't cause the top end of the
range increase and potentially fault.
Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33610>
The tricks we play for texel buffers with VK_EXT_descriptor_buffer don't
work with tld with very large buffers. suld, on the other hand, doesn't
seem to have these limitations.
Fixes: 3b94c5c22a ("nvk: Lower descriptors for VK_EXT_descriptor_buffer buffer views")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33610>
We can theoretically hit this if CmdProcessGeneratedCommandsEXT is
called with a state command buffer that doesn't have compute shader set
if execute commands bind a shader. We do, however, need to still call
nvk_cmd_upload_qmd() because it also uploads push constants and we need
those regardless of whether or not there's a shader bound.
Fixes: 976f22a5da ("nvk: Implement CmdProcess/ExecuteGeneratedCommandsEXT")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33610>
For the instructions we parse with brw_gram.y, don't unconditionally
call brw_eu_inst_set_cond_modifier(). Do it like we do in
brw_generator::generate_code() and only call it if we have a
cond_modifier to set.
Why? Because for ONE_SRC instructions, CondCtrl (bits 95:92) only
exists if Src.IsImm is false. If Src.Imm is true, then bits 95:64 are
actually Src0.ImmValue[63:32]. If we unconditionally call
brw_eu_inst_set_cond_modifier(), we'll end up zeroing bits 95:92 for
ONE_SRC instructions with 64bit immediates. See BSpec page
Structure_EU_INSTRUCTION_BASIC_ONE_SRC (56880).
This issue can be reproduced with src/intel/executor if you try to
have the following instruction:
mov(16) g10<1>Q 0xfedcba9876543210:Q { align1 WE_all 1H };
our parser will end up zeroing the top bits, so the value of the
immediate will be 0x0edcba9876543210.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33559>
It's entirely possible that the same GEM handle is referenced across
different Vulkan object instances. As per the warnings in xf86drm.h,
the caller is responsible for reference counting.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33451>
Since Xe2, the registers are bigger and even the instruction
structures got updated to have 6 bits.
The way I detected this issue was when I tried to use
src/intel/executor to add the following instruction:
add(8) g6.8<1>UD g4<8,8,1>UD 0x00000008UD { align1 WE_all 1Q I@1 };
Executor would read this and end up emitting an add with dst being
g6<1>UD instead of what we wanted. It turns out that inside
brw_gram.y, at dstoperand and dstoperandex we do:
$$.subnr = $$.subnr * brw_type_size_bytes($4);
which would overflow subnr back to 0.
The overflow doesn't seem to be a problem with code we emit directly
(unlike the code we parse, like above) due to the fact that we seem to
treat Xe2 registers as smaller all the way until we call phys_nr() and
phys_subnr() during code generation. The phys_subnr() function can
generate a value that would overflow reg.subnr, but this value is
never written back to reg.subnr, it's just returned as an unsigned
int.
Fixes: e9f63df2f2 ("intel/dev: Enable LNL PCI IDs without INTEL_FORCE_PROBE")
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33539>
The intent here was to check if the tile we're trying to merge
vertically (prev_y_tile) has already been merged horizontally into a
neighboring tile, but I used the slot_mask which also contains the tiles
that have been merged into the prev_y_tile, so the check was too
conservative and would fail even if another tile had been merged into
prev_y_tile. This meant that we would fail to ever create 2x2 regions of
tiles. Fix this by just testing prev_y_tile's bit in the mask.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33534>
Even though we always try to merge a horizontally or vertically adjacent
tile, when we try to merge a vertically adjacent tile it may not
actually be adjacent because it was merged horizontally and the current
tile wasn't or vice versa. We have to detect this and reject merging it.
Fixes: 3fdaad0948 ("tu: Implement bin merging for fragment density map")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33534>
This custom builder implements fine-grained instance node bounds
calculation by looking at all AABBs at tree depth 2.
Shaves off 0.3ms in the start scene for Indiana Jones: The Great Circle
on Deck (roughly 29.1ms->28.7ms).
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32797>
This allows drivers to inject custom functions to calculate the bounds
of instance nodes. For example, this can be used to determine instance
bounds by transforming the AABBs of all child nodes at some level in the
BVH. When instance transforms contain rotations of close to 45°, this
can yield a tighter AABB than just taking the instance's top-level AABB
and rotating it.
Reviewed-by: Konstantin Seurer <konstantin.seurer@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32797>
For decode this is also done in decode_video.
This breaks if app doesn't call vkCmdEncodeVideoKHR before end, eg:
vkCmdBeginVideoCodingKHR
vkCmdControlVideoCodingKHR
vkCmdEndVideoCodingKHR
Cc: mesa-stable
Reviewed-by: Dave Airlie <airlied@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33582>
KGSL unconditionally supports preemption so we cannot ignore it.
On a6xx, we have to emit VSC addresses per-bin or make the amble include
these registers, because CP_SET_BIN_DATA5_OFFSET will use the
register instead of the pseudo register and its value won't survive
across preemptions. The blob seems to take the second approach and
emits the preamble lazily. We chose the per-bin approach but blob's
should be a better one.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12627
Signed-off-by: Danylo Piliaiev <dpiliaiev@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33580>