The updated board has a stabilized GPU, and now I just need to decide
whether I'm building a farm of them or not. The new firmware flash
needs to remind the kernel how to do NFS (no v2, thanks). Also, the
full run is long, so the TEST_PHASE_TIMEOUT variable needs to go past
20 minutes now.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18674>
Right now there is a call to rc_get_variables, which performs a global
analysis of the whole shader, for every IF encountered. As a result,
shaders with a lot of IFs compile very slowly. The pathological cases
are shaders using relative addressing, where the lowered array access
can result in tens of IFs.
This patch restructures the pass to call rc_get_variables just once at
the beginning and reuse the gathered info afterwards. We can do this
because even though we transform the shader in the meantime (for
example by adding extra MOVs), the transformations are not significant
enough to influence the variable info we rely on.
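Schematically (foreach_if and lower_if are placeholder names for
illustration, not the actual pass code):

    /* Before: a whole-shader analysis per IF, O(IFs * shader size). */
    foreach_if(c, inst) {
       struct rc_list *vars = rc_get_variables(c);
       lower_if(c, inst, vars);
    }

    /* After: analyze once up front and reuse, O(shader size + IFs).
     * The MOVs inserted along the way don't change the variable info
     * that lower_if consults. */
    struct rc_list *vars = rc_get_variables(c);
    foreach_if(c, inst)
       lower_if(c, inst, vars);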
This reduces CPU time for my shader-db by more than half. I also
checked that the generated code for all shaders in shader-db is
identical.
Signed-off-by: Pavel Ondračka <pavel.ondracka@gmail.com>
Reviewed-by: Filip Gawin <filip@gawin.net>
Acked-by: Emma Anholt <emma@anholt.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18678>
This extension was promoted to Vulkan 1.3, so we should be setting its
properties directly in the VkPhysicalDeviceVulkan13Properties struct,
which the common Mesa code will use to populate outgoing properties.
Apparently, only the properties struct was promoted and not the
features struct.
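A minimal sketch of the idea (the function name and values here are
placeholders, not v3dv's actual limits): fill the core 1.3 struct and
let the common code copy it into the outgoing property structs on
query.

    #include <vulkan/vulkan.h>

    static void
    get_vulkan13_properties(VkPhysicalDeviceVulkan13Properties *p)
    {
       /* fields promoted from VK_EXT_texel_buffer_alignment */
       p->uniformTexelBufferOffsetAlignmentBytes = 32;  /* placeholder */
       p->uniformTexelBufferOffsetSingleTexelAlignment = VK_FALSE;
       p->storageTexelBufferOffsetAlignmentBytes = 32;  /* placeholder */
       p->storageTexelBufferOffsetSingleTexelAlignment = VK_FALSE;
    }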
Reviewed-by: Eric Engestrom <eric@igalia.com>
Tested-by: Eric Engestrom <eric@igalia.com>
Fixes: ee62a4c751 ("v3dv: implement VK_EXT_texel_buffer_alignment")
Fixes: dEQP-VK.api.info.get_physical_device_properties2.properties.basic
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18697>
../mesa/src/amd/common/ac_surface.c:2324:48: warning: implicit conversion from enumeration type 'AddrResourceType' (aka 'enum _AddrResourceType') to different enumeration type 'enum gfx9_resource_type' [-Wenum-conversion]
surf->u.gfx9.resource_type = AddrSurfInfoIn.resourceType;
../mesa/src/amd/common/ac_surface.c:3046:38: warning: implicit conversion from enumeration type 'const enum gfx9_resource_type' to different enumeration type 'AddrResourceType' (aka 'enum _AddrResourceType') [-Wenum-conversion]
input.resourceType = surf->u.gfx9.resource_type;
../mesa/src/amd/common/ac_surface.c:3069:38: warning: implicit conversion from enumeration type 'const enum gfx9_resource_type' to different enumeration type 'AddrResourceType' (aka 'enum _AddrResourceType') [-Wenum-conversion]
input.resourceType = surf->u.gfx9.resource_type;
The enums are compatible, so let's just add some casts.
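Concretely, the fix is presumably along these lines at the warning
sites quoted above:

    /* explicit casts between the two layout-compatible enums */
    surf->u.gfx9.resource_type =
       (enum gfx9_resource_type)AddrSurfInfoIn.resourceType;

    input.resourceType = (AddrResourceType)surf->u.gfx9.resource_type;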
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18694>
If we emit a ldunif to load the ubo/ssbo base address and then
immediately move it to the unifa register, we can have the ldunif
write directly to unifa and avoid the mov in between. Copy propagation
won't do this for us because it only works with temp registers.
Also, since we can't read from unifa, we must be careful to disallow
reuse of the ldunif result by a future ldunif of the same base
address. We do that by only reusing ldunif results that live in temp
registers.
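A rough sketch of the peephole (the helper names are illustrative,
not the actual v3d compiler API):

    /* before: ldunif tmp ; mov unifa, tmp
     * after:  ldunif unifa              (mov dropped) */
    if (is_ldunif(prev) && is_mov_to_unifa(inst) &&
        mov_reads(inst, prev->dst) && has_single_use(prev->dst)) {
       set_ldunif_dest_unifa(prev);  /* write unifa directly */
       remove_instruction(inst);     /* intermediate mov is gone */

       /* unifa can't be read back, so this result must not be
        * reused by a later ldunif of the same base address; only
        * results living in temp registers stay eligible for reuse. */
       forget_ldunif_result(prev);
    }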
total instructions in shared programs: 12468943 -> 12455139 (-0.11%)
instructions in affected programs: 1661233 -> 1647429 (-0.83%)
helped: 8307
HURT: 3994
total uniforms in shared programs: 3704532 -> 3704522 (<.01%)
uniforms in affected programs: 339 -> 329 (-2.95%)
helped: 7
HURT: 0
total max-temps in shared programs: 2148158 -> 2148290 (<.01%)
max-temps in affected programs: 9320 -> 9452 (1.42%)
helped: 175
HURT: 295
total spills in shared programs: 2202 -> 2202 (0.00%)
spills in affected programs: 0 -> 0
helped: 0
HURT: 0
total fills in shared programs: 3059 -> 3057 (-0.07%)
fills in affected programs: 27 -> 25 (-7.41%)
helped: 1
HURT: 0
total sfu-stalls in shared programs: 21167 -> 21056 (-0.52%)
sfu-stalls in affected programs: 497 -> 386 (-22.33%)
helped: 209
HURT: 127
total inst-and-stalls in shared programs: 12490110 -> 12476195 (-0.11%)
inst-and-stalls in affected programs: 1662875 -> 1648960 (-0.84%)
helped: 8312
HURT: 3987
total nops in shared programs: 316563 -> 313553 (-0.95%)
nops in affected programs: 24269 -> 21259 (-12.40%)
helped: 2158
HURT: 1006
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18667>
The upstream kernel supported everything needed to stop doing
kernel-side relocs before a6xx was even fully supported upstream.
Take advantage of this to drop some extra overhead in OUT_RELOC() and
its equivalent in the pack macros.
Signed-off-by: Rob Clark <robdclark@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18646>
The current stack size is a significant limiter for occupancy, and
hence we need smaller stacks in LDS.
Rhys earlier had a patch that just put the N entries closest to the
root in LDS and the rest in scratch. However, this is not ideal for
performance, as most of the activity happens away from the root, near
the leaves. Of course we can't just switch it around, as the leaf
activity likely isn't happening all the way at the end of the stack.
So what we do is turn the LDS stack into a kind of ring buffer by
always accessing it using the stack index modulo the buffer size
(always a power of two, so we can mask efficiently). If we then run
out of free space in this buffer, we evict the entries closest to the
root to scratch, and if we hit the "bottom" of the LDS space, we load
them back from scratch.
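In plain C, the addressing works roughly like this (a single-lane
sketch with illustrative names; the real code is lane-parallel shader
code and moves batches per wave):

    #include <stdint.h>

    #define LDS_STACK_SIZE 16 /* entries per lane, power of two */
    #define BATCH          8  /* entries moved between LDS and scratch */

    static uint32_t lds_stack[LDS_STACK_SIZE];
    static uint32_t sp;  /* logical stack pointer, never wrapped */
    static uint32_t low; /* logical index of the oldest entry in LDS */

    void scratch_store_batch(uint32_t first); /* hypothetical spill */
    void scratch_load_batch(uint32_t first);  /* hypothetical refill */

    static void
    push(uint32_t node)
    {
       if (sp - low == LDS_STACK_SIZE) {
          /* window full: evict the entries closest to the root */
          scratch_store_batch(low);
          low += BATCH;
       }
       lds_stack[sp++ & (LDS_STACK_SIZE - 1)] = node;
    }

    static uint32_t
    pop(void) /* assumes the caller checked the stack isn't empty */
    {
       if (sp == low && low > 0) {
          /* hit the "bottom" of the LDS window: refill from scratch */
          low -= BATCH;
          scratch_load_batch(low);
       }
       return lds_stack[--sp & (LDS_STACK_SIZE - 1)];
    }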
Some rough perf numbers for indication with Q2RTX:
| evicting | LDS entries | perf |
|----------|-------------|------|
| no | 76 | 55% |
| no | 32 | 100% |
| no | 24 | 105% |
| yes | 32 | 95% |
| yes | 16 | 100% |
| yes | 8 | 90% |
| yes | 4 | 75% |
(For the case with 4 entries we need to do some extra accounting, as
a full batch may not be available to evict.)
So an obvious choice is to use a stack of 16 entries.
One might wonder if Q2RTX perf is mainly good due to BVHs with very
little geometry and hence low depth, so I also did some profiling with
Control. This is done with RGP instruction timing, so these are
instructions executed, not weighted for enabled masks, i.e. divergence
effects are included.
| game | LDS entries | scratch action | fraction of iterations |
|---------|-------------|----------------|------------------------|
| Control | 8 | store | 10.3% |
| Control | 8 | load | 34.8% |
| Control | 16 | store | 0.58% |
| Control | 16 | load | 2.62% |
| Q2RTX | 16 | store | 1.00% |
| Q2RTX | 16 | load | 3.07% |
So Q2RTX doesn't seem like an unreasonably good case for this
algorithm.
On the implementation side, we can always place the scratch stack at
address 0 by just reserving the scratch space and, in the case of a
fixed callstack size, moving that up. In the dynamic case, the dynamic
stack base already takes any reserved scratch space into account.
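Illustratively (hypothetical names, just to show the layout):

    /* the traversal spill area always starts at scratch offset 0, so
     * its addressing needs no extra base; a fixed-size callstack, if
     * present, is shifted up past the reservation */
    const uint32_t rt_stack_scratch_base = 0;
    const uint32_t callstack_scratch_base =
       rt_stack_scratch_base + rt_stack_scratch_size;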
Reviewed-by: Konstantin Seurer <konstantin.seurer@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18541>