This logic doesn't really do what it pretends to; we don't expose the
RGTC features unless we actually have RGTC support. This is about to
change, but for that logic to work, we need to be able to tell whether
we're using a fallback format, and we can't do that unless we keep the
format as RGTC.
Reviewed-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Acked-by: Eric Engestrom <eric@igalia.com>
Tested-by: Eric Engestrom <eric@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18248>
If we get CPU access (such as a read) after an upload transfer, we need
to ensure that the host has handled the upload. Do this by stalling
when the buffer is mapped. (The previous commit ensures we don't try to
do a pointless upload for an already mapped buffer.)
Signed-off-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18604>
Typically GLSL mediump lowering will have lowered all the ALU ops
generating the values to 16-bit, and once vars_to_ssa happens the mediump
temps disappear. However, if they don't disappear (for example, the var
gets indirected and eventually gets lowered to scratch or indirect
lowering), then you don't want the storage upconverted to 32-bit.
Also, if a CS shared var is declared mediump, then storing it as 16-bit
avoids conversions around the loads and stores, assuming the ALU ops
related to it are 16-bit. For gfxbench aztec ruins, the CS shared var sizes
are cut in half, improving overall perf by 0.805549% +/- 0.0953482% (n=6) on
gl-5-normal.
freedreno shader-db:
total instructions in shared programs: 2917577 -> 2917743 (<.01%)
instructions in affected programs: 46141 -> 46307 (0.36%)
total last-baryf in shared programs: 109712 -> 109492 (-0.20%)
last-baryf in affected programs: 638 -> 418 (-34.48%)
total full in shared programs: 190275 -> 190218 (-0.03%)
full in affected programs: 156 -> 99 (-36.54%)
total constlen in shared programs: 492596 -> 492600 (<.01%)
constlen in affected programs: 8 -> 12 (50.00%)
total cat6 in shared programs: 33019 -> 33107 (0.27%)
cat6 in affected programs: 3604 -> 3692 (2.44%)
total stp in shared programs: 3626 -> 3670 (1.21%)
stp in affected programs: 3336 -> 3380 (1.32%)
total ldp in shared programs: 1718 -> 1762 (2.56%)
ldp in affected programs: 1680 -> 1724 (2.62%)
(this is all in aztec ruins)
total sstall in shared programs: 195656 -> 195182 (-0.24%)
sstall in affected programs: 3249 -> 2775 (-14.59%)
total (ss) in shared programs: 52823 -> 52966 (0.27%)
(ss) in affected programs: 1733 -> 1876 (8.25%)
total systall in shared programs: 507928 -> 508687 (0.15%)
systall in affected programs: 103010 -> 103769 (0.74%)
total (sy) in shared programs: 23185 -> 23196 (0.05%)
(sy) in affected programs: 1276 -> 1287 (0.86%)
total waves in shared programs: 435290 -> 435302 (<.01%)
waves in affected programs: 12 -> 24 (100.00%)
total loops in shared programs: 407 -> 405 (-0.49%)
loops in affected programs: 9 -> 7 (-22.22%)
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18452>
I don't know of any GPUs doing 16-bit atomic accesses, nor do I know of
anybody wanting that in shaders. But deqp has GLES CTS cases that set
mediump on shared variables, so just skip lowering for those vars.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18452>
If every use was a conversion to 16-bit, then ir3_cf would fold it into the
bary instruction. But if something had generated a highp comparison of
the mediump input with a mediump op result, it would get stuck as highp,
even though we could have used 16-bit values without upconverting.
This fixes dEQP-GLES2.functional.shaders.algorithm.rgb_to_hsl_fragment on
ANGLE on turnip, closing #7043. fossil-db results are mixed:
fossil-db:
Totals from 697 (4.65% of 14988) affected shaders:
MaxWaves: 10712 -> 10736 (+0.22%)
Instrs: 82394 -> 83572 (+1.43%); split: -1.31%, +2.74%
CodeSize: 178280 -> 180118 (+1.03%); split: -0.46%, +1.49%
NOPs: 15887 -> 16067 (+1.13%); split: -7.48%, +8.61%
MOVs: 1297 -> 1328 (+2.39%); split: -6.86%, +9.25%
Full: 3730 -> 3842 (+3.00%); split: -1.80%, +4.80%
(ss): 1877 -> 1849 (-1.49%); split: -5.59%, +4.10%
(sy): 1249 -> 1255 (+0.48%); split: -1.04%, +1.52%
(ss)-stall: 6809 -> 6364 (-6.54%); split: -13.85%, +7.31%
(sy)-stall: 17059 -> 17257 (+1.16%); split: -6.51%, +7.67%
Cat0: 17220 -> 17400 (+1.05%); split: -6.90%, +7.94%
Cat1: 5307 -> 6366 (+19.95%); split: -6.93%, +26.89%
Cat2: 39138 -> 39101 (-0.09%); split: -0.31%, +0.22%
Cat3: 16772 -> 16741 (-0.18%)
Cat5: 1269 -> 1276 (+0.55%)
I tried to pick the apps that looked the most impacted for testing, and
indeed the results are mixed:
cookie_run_kingdom: +0.275514% +/- 0.0883816% (n=68)
trex_200: +0.0943847% +/- 0.0297073% (n=1463)
command_and_conquer_rivals: no difference (n=131)
war_planet_online: no difference (n=120)
lego_legacy: -0.192131% +/- 0.152083% (n=99)
among_us: -0.625227% +/- 0.385419% (n=60)
Given that the perf results are small and go both ways, and apparently
we're an outlier in not always lowering mediump inputs to 16-bit, just do
it for consistency with other drivers.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18506>
During the lower_regioning() pass, required_exec_type() returns
BRW_REGISTER_TYPE_UQ when processing SHADER_OPCODE_SHUFFLE instructions
of type BRW_REGISTER_TYPE_DF. MTL has float64 support but lacks int64
support, so shader compilation fails.
To fix that, we could make required_exec_type() return
BRW_REGISTER_TYPE_DF in that case, but the SHADER_OPCODE_SHUFFLE virtual
instruction runs in the integer pipeline (see inferred_exec_pipe()).
So replace the has_64bit check with has_64bit_int instead. This properly
handles both older and newer platforms by making this function return
BRW_REGISTER_TYPE_UD. lower_exec_type() then takes care of generating
two 32-bit operations to accomplish the same thing.
While at it, also drop the 'devinfo->verx10 == 70' check, as
GFX7_FEATURES falls into the same category as MTL: it has float64 but no
int64 support.
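The type selection described above can be sketched roughly as follows. This
is a minimal illustration, not the Mesa code: enum exec_type, struct
device_caps, shuffle_exec_type() and shuffle_op_count() are hypothetical
names made up for this example. Only the idea of keying the decision on int64
support rather than on any 64-bit support comes from the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the compiler's types; not the Mesa definitions. */
enum exec_type { EXEC_TYPE_UD, EXEC_TYPE_UQ };

struct device_caps {
   bool has_64bit_float;
   bool has_64bit_int;
};

/* Execution type for a shuffle-like op, which runs on the integer pipeline.
 * Only int64 support matters here; float64 support is deliberately ignored.
 */
static enum exec_type
shuffle_exec_type(const struct device_caps *caps, unsigned type_bytes)
{
   assert(type_bytes == 4 || type_bytes == 8);
   if (type_bytes == 8 && caps->has_64bit_int)
      return EXEC_TYPE_UQ;
   return EXEC_TYPE_UD;
}

/* Number of operations needed once the op is split to the execution type. */
static unsigned
shuffle_op_count(const struct device_caps *caps, unsigned type_bytes)
{
   unsigned exec_bytes =
      shuffle_exec_type(caps, type_bytes) == EXEC_TYPE_UQ ? 8 : 4;
   return type_bytes / exec_bytes;
}
```

With a device that has float64 but no int64, a 64-bit shuffle in this sketch
gets a 32-bit execution type and is split into two operations.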
Fixes at least these crucible tests:
func.uniform-subgroup.exclusive.fadd64.q0
func.uniform-subgroup.exclusive.fmin64.q0
func.uniform-subgroup.exclusive.fmax64.q0
Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18577>
Add the structs to vn_physical_device, just like we do for the 1.1 and
1.2 structs.
Prepares for Vulkan 1.3 enablement. No intended change in behavior.
Tested on an Intel Tigerlake GPU on the CrOS device volteer.
I tested only a small subset of dEQP because this branch only touches
the code for VkPhysicalDevice{Features2,Properties2}.
vulkan-cts-1.3.3.0
dEQP-VK.api.info.*
dEQP-VK.api.smoke.*
pass/skip/fail = 3796/9/0
I tested Dota 2 on borealis on volteer, with non-Proton Vulkan. The
game launches and reaches the main menu. Same with Hades with DX on
Proton 7.
Signed-off-by: Chad Versace <chadversary@chromium.org>
Reviewed-by: Yiwei Zhang <zzyiwei@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18158>
Motivation is easier sorting and readability.
- In VN_ADD_TO_PNEXT_OF, re-arrange params to allow sorting. Param1 is
invariant in each block. Param2 is sType.
- In VN_ADD_EXT_TO_PNEXT_OF, make its initial params match those of
VN_ADD_TO_PNEXT_OF.
- Then sort the macro calls.
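As an illustration of why the parameter order helps sorting, here is a
minimal sketch of a pNext-style chain macro. Everything in it is hypothetical
(ADD_TO_PNEXT_OF, enum struct_type, the struct names and the embedded base
struct); it is not the venus macro and it does not use the Vulkan headers. It
only shows a macro whose first parameter is the chain head and whose second is
the structure type.

```c
#include <stddef.h>

/* Hypothetical structure types; real code would use VkStructureType. */
enum struct_type { S_TYPE_HEAD, S_TYPE_16BIT_STORAGE, S_TYPE_8BIT_STORAGE };

/* Common header embedded as the first member of every chained struct. */
struct base { enum struct_type sType; void *pNext; };
struct chain_head { struct base base; };
struct ext_8bit { struct base base; int storage_buffer_8bit; };
struct ext_16bit { struct base base; int storage_buffer_16bit; };

/* Param1 is the chain head, param2 the structure type, param3 the struct. */
#define ADD_TO_PNEXT_OF(head, s_type, elem)     \
   do {                                          \
      (elem).base.sType = (s_type);              \
      (elem).base.pNext = (head).base.pNext;     \
      (head).base.pNext = &(elem).base;          \
   } while (0)

/* Count the structs in a chain, including the head. */
static size_t
chain_length(const struct base *first)
{
   size_t n = 0;
   for (const struct base *s = first; s != NULL; s = s->pNext)
      n++;
   return n;
}

static size_t
example_chain_length(void)
{
   struct chain_head features = {
      .base = { .sType = S_TYPE_HEAD, .pNext = NULL },
   };
   struct ext_8bit storage_8bit = { 0 };
   struct ext_16bit storage_16bit = { 0 };

   /* Param1 is the same on every line, so the calls sort by param2. */
   ADD_TO_PNEXT_OF(features, S_TYPE_16BIT_STORAGE, storage_16bit);
   ADD_TO_PNEXT_OF(features, S_TYPE_8BIT_STORAGE, storage_8bit);

   return chain_length(&features.base);
}
```

Because each call prepends to the chain, the order of the calls does not
affect which structs end up linked, so the calls are free to be sorted.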
Signed-off-by: Chad Versace <chadversary@chromium.org>
Reviewed-by: Yiwei Zhang <zzyiwei@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18158>
Make the variable name more closely match the type name.
This also allows them to sort correctly.
argb_4444_formats -> _4444_formats
eight_bit_storage -> _8bit_storage
sixteen_bit_storage -> _16bit_storage
While touching vn_physical_device.[ch], also run clang-format.
Signed-off-by: Chad Versace <chadversary@chromium.org>
Reviewed-by: Yiwei Zhang <zzyiwei@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18158>
If the swapchain image is acquired in a different cmdbuf than the one it is
presented with, the acquire semaphore will have already been submitted
by this point, and the swapchain should be flagged as such.
cc: mesa-stable
Reviewed-by: Adam Jackson <ajax@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18557>
This extension adds new NONE attachment load / store operations.
They behave like the DONT_CARE variants, except that DONT_CARE doesn't
guarantee that the original contents of the memory within the render
area are preserved, whereas the new operations do (with some caveats).
Our implementation was not destroying data with DONT_CARE anyway,
so we already support the new semantics: we don't need to do anything
specific for the new operations, and the current behavior does what is
expected.
We pass all the tests under:
dEQP-VK.renderpass*.load_store_op_none.*
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18570>
If the render area is not aligned to tile boundaries, we have partially
covered tiles in the framebuffer. In this case, we always need to load the tile
buffer from memory in order to preserve the contents outside the render area
on the tile buffer store. However, if in this scenario we know we won't be
storing the tile buffer we can skip the load safely.
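The decision above can be sketched as a small predicate. This is a simplified
illustration under stated assumptions: struct render_area and both function
names are hypothetical, alignment is checked only as "offset and size are
multiples of the tile size", and it covers only the load forced by partial
tile coverage, not loads requested by the attachment's own load operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical render area description, in pixels. */
struct render_area { uint32_t x, y, width, height; };

/* Simplified check: offset and size must be multiples of the tile size. */
static bool
area_is_tile_aligned(const struct render_area *a,
                     uint32_t tile_w, uint32_t tile_h)
{
   assert(tile_w > 0 && tile_h > 0);
   return a->x % tile_w == 0 && a->y % tile_h == 0 &&
          a->width % tile_w == 0 && a->height % tile_h == 0;
}

/* Does partial tile coverage force a load of the tile buffer? */
static bool
needs_tile_load_for_partial_cover(const struct render_area *a,
                                  uint32_t tile_w, uint32_t tile_h,
                                  bool will_store)
{
   /* Fully covered tiles have nothing outside the area to preserve. */
   if (area_is_tile_aligned(a, tile_w, tile_h))
      return false;

   /* The load only exists to protect the store; no store, no load. */
   return will_store;
}
```

For example, an area starting at x = 10 with 64x64 tiles forces a load only
when the tile buffer will actually be stored.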
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18570>
This is a trivial implementation where we just insert a UBO descriptor
pointing to the actual data and then treat it as a normal UBO everywhere
else. In theory an indirect CP_LOAD_STATE would be more efficient than
ldc.k to preload inline uniform blocks to constants. However, we will
always need the UBO descriptor anyway, even if we lower the limits
enough to always be able to preload them, because with variable pointers
we may have a pointer that could point to either an inline uniform block
or a regular uniform block. So, using an indirect CP_LOAD_STATE should be
an optimization on top of this.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17960>