614 Commits

Author SHA1 Message Date
Iago Toral Quiroga 4483cd24af broadcom/compiler: sink uniform loads
total instructions in shared programs: 13014428 -> 13000420 (-0.11%)
instructions in affected programs: 743624 -> 729616 (-1.88%)
helped: 1392
HURT: 611

total threads in shared programs: 415858 -> 415874 (<.01%)
threads in affected programs: 16 -> 32 (100.00%)
helped: 8
HURT: 0

total uniforms in shared programs: 3720410 -> 3711652 (-0.24%)
uniforms in affected programs: 113442 -> 104684 (-7.72%)
helped: 635
HURT: 29

total max-temps in shared programs: 2154268 -> 2144876 (-0.44%)
max-temps in affected programs: 61279 -> 51887 (-15.33%)
helped: 1124
HURT: 187

total spills in shared programs: 4002 -> 3870 (-3.30%)
spills in affected programs: 265 -> 133 (-49.81%)
helped: 6
HURT: 0

total fills in shared programs: 5788 -> 5560 (-3.94%)
fills in affected programs: 603 -> 375 (-37.81%)
helped: 6
HURT: 0

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15168>
2022-03-02 08:09:11 +00:00
Iago Toral Quiroga e228642cf5 broadcom/compiler: move constants before their first user
For us they are basically uniforms too so we want to make their
lifespans short to facilitate allocating them to accumulators.

total instructions in shared programs: 13043585 -> 13015385 (-0.22%)
instructions in affected programs: 8326040 -> 8297840 (-0.34%)
helped: 24939
HURT: 19894

total threads in shared programs: 415860 -> 415858 (<.01%)
threads in affected programs: 4 -> 2 (-50.00%)
helped: 0
HURT: 1

total uniforms in shared programs: 3721953 -> 3720451 (-0.04%)
uniforms in affected programs: 96134 -> 94632 (-1.56%)
helped: 744
HURT: 435

total max-temps in shared programs: 2173431 -> 2154260 (-0.88%)
max-temps in affected programs: 264598 -> 245427 (-7.25%)
helped: 10858
HURT: 841

total spills in shared programs: 4005 -> 4010 (0.12%)
spills in affected programs: 700 -> 705 (0.71%)
helped: 5
HURT: 10

total fills in shared programs: 5801 -> 5817 (0.28%)
fills in affected programs: 1346 -> 1362 (1.19%)
helped: 6
HURT: 11

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15168>
2022-03-02 08:09:11 +00:00
Iago Toral Quiroga a1998a9f43 broadcom/compiler: disallow TMU spills if max tmu spills is 0
If we are compiling with a strategy that does not allow TMU spills
we should not allow spilling anything that is not a uniform.
Otherwise the RA cost/benefit algorithm may choose to spill a
temp that is not uniform and that will cause us to immediately
fail the strategy and fall back to the next one, even if we
could've instead chosen to spill more uniforms to compile the
program successfully with that strategy.
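
A minimal sketch of the idea, not the actual register allocator code: the struct, the function name and the "lower cost is a better spill candidate" convention are all hypothetical. It only shows that non-uniform temps are filtered out when the strategy's TMU spill budget is 0.

```c
#include <stdbool.h>

/* Hypothetical, simplified spill candidate. */
struct toy_spill_candidate {
    bool is_uniform; /* spilling a uniform does not need the TMU */
    float cost;      /* assumed convention: lower cost is a better candidate */
};

/* Returns the index of the best spill candidate, or -1 if nothing can be
 * spilled. When max_tmu_spills is 0, only uniforms are eligible. */
static int
toy_choose_spill(const struct toy_spill_candidate *c, int count,
                 unsigned max_tmu_spills)
{
    int best = -1;
    for (int i = 0; i < count; i++) {
        if (!c[i].is_uniform && max_tmu_spills == 0)
            continue; /* would require a TMU spill, which is not allowed */
        if (best < 0 || c[i].cost < c[best].cost)
            best = i;
    }
    return best;
}
```

With a budget of 0, a cheap non-uniform temp is skipped in favor of a more expensive uniform, so the strategy keeps going instead of failing immediately.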

Some relevant shader-db stats:

total instructions in shared programs: 13040711 -> 13043585 (0.02%)
instructions in affected programs: 234238 -> 237112 (1.23%)
helped: 73
HURT: 172

total threads in shared programs: 415664 -> 415860 (0.05%)
threads in affected programs: 196 -> 392 (100.00%)
helped: 98
HURT: 0

total uniforms in shared programs: 3717266 -> 3721953 (0.13%)
uniforms in affected programs: 12831 -> 17518 (36.53%)
helped: 6
HURT: 100

total max-temps in shared programs: 2174177 -> 2173431 (-0.03%)
max-temps in affected programs: 4597 -> 3851 (-16.23%)
helped: 79
HURT: 21

total spills in shared programs: 4010 -> 4005 (-0.12%)
spills in affected programs: 55 -> 50 (-9.09%)
helped: 5
HURT: 0

total fills in shared programs: 5820 -> 5801 (-0.33%)
fills in affected programs: 186 -> 167 (-10.22%)
helped: 5
HURT: 0

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15168>
2022-03-02 08:09:11 +00:00
Iago Toral Quiroga cbb4d0dded broadcom/compiler: increase cost of TMU spills to 10
Our cost was 5 which matches the number of instructions we have to
add for a TMU spill (a fill is 4 instructions).

Uniform spills on the other hand add an extra instruction for each
fill and remove one instruction for the spill itself. These have
a cost of 1.

Therefore, if we have a single spill+fill, we end up with +9
instructions if it is a TMU spill and +0 instructions with a uniform
spill, so making the former only 5 times more costly is probably
not a good idea, and this is without even considering the added
latency of the TMU accesses.
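
The arithmetic above can be checked with a small sketch. The instruction deltas come from this commit message; the enum and function names are hypothetical and only illustrate the accounting.

```c
#include <stdbool.h>

/* Instruction deltas as described in the commit message. */
enum {
    TOY_TMU_SPILL_INSTRS = 5,      /* added per TMU spill */
    TOY_TMU_FILL_INSTRS = 4,       /* added per TMU fill */
    TOY_UNIFORM_SPILL_INSTRS = -1, /* the spill removes one instruction */
    TOY_UNIFORM_FILL_INSTRS = 1,   /* one extra instruction per fill */
};

/* Hypothetical helper: net instruction delta for a spilled temp with the
 * given number of spills and fills. */
static int
toy_spill_instr_delta(bool is_uniform, int spills, int fills)
{
    if (is_uniform) {
        return spills * TOY_UNIFORM_SPILL_INSTRS +
               fills * TOY_UNIFORM_FILL_INSTRS;
    }
    return spills * TOY_TMU_SPILL_INSTRS + fills * TOY_TMU_FILL_INSTRS;
}
```

For a single spill+fill this gives +9 for a TMU spill and +0 for a uniform spill, which is the gap the new cost of 10 is meant to reflect.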

Relevant shader-db changes show this causes a marginal instruction
count increase in a few shaders but better thread counts and lower
TMU spilling overall:

total instructions in shared programs: 13037315 -> 13040711 (0.03%)
instructions in affected programs: 370106 -> 373502 (0.92%)
helped: 187
HURT: 321

total threads in shared programs: 415090 -> 415664 (0.14%)
threads in affected programs: 574 -> 1148 (100.00%)
helped: 287
HURT: 0

total uniforms in shared programs: 3706674 -> 3717266 (0.29%)
uniforms in affected programs: 63075 -> 73667 (16.79%)
helped: 40
HURT: 395

total max-temps in shared programs: 2176080 -> 2174177 (-0.09%)
max-temps in affected programs: 15838 -> 13935 (-12.02%)
helped: 316
HURT: 34

total spills in shared programs: 4247 -> 4010 (-5.58%)
spills in affected programs: 2599 -> 2362 (-9.12%)
helped: 107
HURT: 14

total fills in shared programs: 6121 -> 5820 (-4.92%)
fills in affected programs: 3622 -> 3321 (-8.31%)
helped: 108
HURT: 13

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15168>
2022-03-02 08:09:11 +00:00
Iago Toral Quiroga cf99584f51 broadcom/compiler: move uniforms right before their first use after scheduling
On V3D the quality of the code we generate is significantly affected by
how we decide to assign accumulators during register allocation, which
is determined by liveness, favoring short-lived temps.

There are many shaders that end up doing a whole lot of uniform loads
first, and using them later, which is very inconvenient for our register
allocation process because this increases uniform liveness and causes
us to use accumulators less efficiently, leading to significant churn.

To fix this, we move uniforms right before their first use in the same
block. We need to do this after NIR scheduling, which means doing it in
non-SSA form, because the scheduler has a tendency to undo this
optimization. It is not easy to modify the scheduler to avoid that,
since it works in more abstract terms, using instruction dependencies,
estimated register pressure and instruction delay information to do its
work, which are very different concepts.
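
A rough sketch of the transformation on a toy linear block, not the actual NIR pass: `toy_instr` and `toy_sink_uniform_loads` are hypothetical names, and the sketch assumes each temp is written only once within the block, which the real non-SSA code cannot assume.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_SRCS 2

/* Hypothetical, simplified instruction. */
struct toy_instr {
    bool is_uniform_load;
    int dest;              /* temp written */
    int src[TOY_MAX_SRCS]; /* temps read, -1 if unused */
};

static bool
toy_reads(const struct toy_instr *inst, int temp)
{
    for (int i = 0; i < TOY_MAX_SRCS; i++) {
        if (inst->src[i] == temp)
            return true;
    }
    return false;
}

/* Moves each uniform load down so it sits right before its first user in
 * the same block. Loads without a user in the block stay where they are. */
static void
toy_sink_uniform_loads(struct toy_instr *block, int count)
{
    /* Walk backwards so that moving a load only shifts instructions that
     * were already visited. */
    for (int i = count - 1; i >= 0; i--) {
        if (!block[i].is_uniform_load)
            continue;

        int user = -1;
        for (int j = i + 1; j < count; j++) {
            if (toy_reads(&block[j], block[i].dest)) {
                user = j;
                break;
            }
        }
        if (user <= i + 1)
            continue; /* no user in this block, or already adjacent */

        /* Shift the instructions between the load and its user up by one
         * slot and place the load right before the user. */
        struct toy_instr load = block[i];
        memmove(&block[i], &block[i + 1],
                (size_t)(user - 1 - i) * sizeof(*block));
        block[user - 1] = load;
    }
}
```

The shorter the distance between a load and its first use, the shorter the live range the register allocator has to cover for that temp.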

total instructions in shared programs: 13316738 -> 13033613 (-2.13%)
instructions in affected programs: 10389172 -> 10106047 (-2.73%)
helped: 55442
HURT: 16144

total threads in shared programs: 413722 -> 415048 (0.32%)
threads in affected programs: 1428 -> 2754 (92.86%)
helped: 680
HURT: 17

total loops in shared programs: 1716 -> 1690 (-1.52%)
loops in affected programs: 26 -> 0
helped: 26
HURT: 0

total uniforms in shared programs: 3704313 -> 3705181 (0.02%)
uniforms in affected programs: 687730 -> 688598 (0.13%)
helped: 2920
HURT: 7384

total max-temps in shared programs: 2364785 -> 2175190 (-8.02%)
max-temps in affected programs: 1215387 -> 1025792 (-15.60%)
helped: 49667
HURT: 1556

total spills in shared programs: 4241 -> 4248 (0.17%)
spills in affected programs: 642 -> 649 (1.09%)
helped: 11
HURT: 19

total fills in shared programs: 6115 -> 6125 (0.16%)
fills in affected programs: 1276 -> 1286 (0.78%)
helped: 11
HURT: 21

total sfu-stalls in shared programs: 34381 -> 36578 (6.39%)
sfu-stalls in affected programs: 16055 -> 18252 (13.68%)
helped: 3647
HURT: 5206

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15056>
2022-02-24 11:36:00 +00:00
Iago Toral Quiroga c4a78a2d2a broadcom/compiler: fix register class patching for postponed spills
If we have a postponed spill, the temp we create at ip is no longer
the spilled temp and therefore is affected by the thrsw injection.

Fixes corruption in the additive blending animation demo from
Three.js.

Fixes: f3c3228522 ('broadcom/compiler: do not rebuild the interference graph after each spill')
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15112>
2022-02-22 11:17:10 +00:00
Iago Toral Quiroga a4b164b57b broadcom/compiler: only patch temps that existed before the current spill
When we spill we add new temps. We should be careful not to access
liveness for these until we have re-computed it after all spills and
fills for the spilled temp have been processed, so as to avoid
out-of-bounds accesses to the c->temp_start and c->temp_end arrays.

This fixes a crash in a Three.js demo when we try to patch register
classes after a TMU spill. The crash happened because we would
incorrectly try to patch the same temps we had just added for the spill
itself, which is not only unnecessary but also incorrect, since these
temps would not have liveness information available yet and thus would
cause out-of-bounds accesses.

Fixes: f3c3228522 ('broadcom/compiler: do not rebuild the interference graph after each spill')
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15107>
2022-02-22 06:41:51 +00:00
Jose Maria Casanova Crespo 90f966e05f v3dv/v3d: Fix copyright holder to Raspberry Pi Ltd
Acked-by: Iago Toral Quiroga <itoral@igalia.com>
Reviewed-by: Juan A. Suarez <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15057>
2022-02-18 11:50:07 +01:00
Iago Toral Quiroga 750eeecf4e broadcom/compiler: document that spill_base is used for spills and scratch
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15041>
2022-02-18 08:38:19 +00:00
Iago Toral Quiroga 8883975209 broadcom/compiler: drop spill_count and add spilling boolean
We added spill_count to handle uniform batch spills, which we no longer do.
What we want now is a way to know if we are spilling registers.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15041>
2022-02-18 08:38:19 +00:00
Iago Toral Quiroga f3c3228522 broadcom/compiler: do not rebuild the interference graph after each spill
Instead, we only recompute liveness and we add new nodes and
interferences to the graph manually (we also need to patch
register classes in some cases).

To assist in this process, we also add an ip counter to our
instructions that we also recompute after each spill, which we use
to identify registers that cross thrsw boundaries introduced with
TMU spills and fills and adjust their register classes accordingly
(removing their capacity to use accumulators).
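
The register class patching can be pictured with the sketch below. Everything in it is hypothetical (type names, the two toy classes, and the exact "strictly before and strictly after" crossing test); it only illustrates using per-instruction ip values to find temps that are live across a new thrsw.

```c
#include <stdbool.h>

/* Hypothetical register classes: one that may use accumulators and one
 * restricted to the physical register file. */
enum toy_reg_class { TOY_CLASS_ANY, TOY_CLASS_PHYS_ONLY };

struct toy_temp {
    int start_ip; /* ip of the first definition */
    int end_ip;   /* ip of the last use */
    enum toy_reg_class reg_class;
};

/* Assumed crossing test: the temp is defined before the thrsw and still
 * used after it. */
static bool
toy_temp_crosses_ip(const struct toy_temp *t, int ip)
{
    return t->start_ip < ip && t->end_ip > ip;
}

/* Removes accumulator capacity from every temp live across thrsw_ip and
 * returns how many temps were patched. */
static int
toy_patch_classes_for_thrsw(struct toy_temp *temps, int count, int thrsw_ip)
{
    int patched = 0;
    for (int i = 0; i < count; i++) {
        if (temps[i].reg_class == TOY_CLASS_ANY &&
            toy_temp_crosses_ip(&temps[i], thrsw_ip)) {
            temps[i].reg_class = TOY_CLASS_PHYS_ONLY;
            patched++;
        }
    }
    return patched;
}
```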

This significantly reduces the CPU cost of spills. Using
shaders/closed/gputest/piano/7.shader_test as reference:

Compile time up to the first successful compile strategy in main is
~24s and with this change it is ~11s. With this speed up, we can now
try all 2-thread compile strategies (including the fallback scheduler)
in only ~15s.

A full shader-db run results in:
Total CPU time (seconds): 9904.67 -> 9087.98 (-8.25%)

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15041>
2022-02-18 08:38:19 +00:00
Iago Toral Quiroga 59caaa7fb3 broadcom/compiler: reset spill/fill counts after lowering thread count.
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15041>
2022-02-18 08:38:19 +00:00
Iago Toral Quiroga 92d819aaa0 broadcom/compiler: fix end of TMU sequence check
We may be pipelining TMU writes and reads, in which case we can
see both TMUWT and LDTMU at the end of a TMU sequence, so we should
not assume that a TMUWT always terminates a sequence.

Also, we had a bug where we were using inst instead of scan_inst
to check if we find another TMUWT after the current instruction.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15041>
2022-02-18 08:38:19 +00:00
Iago Toral Quiroga 40e091267d broadcom/compiler: define max number of tmu spills for compile strategies
Instead of whether they are allowed to spill or not. This is more flexible.
Also, while we are not currently enabling spilling on any 4-thread strategies,
should we do that in the future, always prefer a 4-thread compile.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15041>
2022-02-18 08:38:19 +00:00
Iago Toral Quiroga 919aedbfec broadcom/compiler: choose compile strategy with lowest spilling
Until now we would only allow spilling as a last resort in the
last 2 strategies. However, it is possible that in some cases
earlier strategies may produce fewer spills if we allowed spilling
on them.

Likewise, the fallback scheduler can sometimes produce fewer spills
than 2 threads with optimizations disabled.

With this change, we start allowing all our 2-thread strategies to
spill, and instead of choosing the first strategy that is successful,
we choose the one that doesn't spill or the one with the least amount
of spilling.
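
The selection rule can be sketched as follows. This assumes the per-strategy results are already available in an array, whereas the real compiler tries strategies one by one; the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of trying one compile strategy. */
struct toy_strategy_result {
    bool compiled;   /* did register allocation succeed? */
    unsigned spills; /* number of spills it needed */
};

/* Returns the index of the first strategy that compiles without spilling,
 * else the successful strategy with the fewest spills (earliest wins on
 * ties), else -1 if no strategy compiled. */
static int
toy_pick_strategy(const struct toy_strategy_result *r, size_t count)
{
    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (!r[i].compiled)
            continue;
        if (r[i].spills == 0)
            return (int)i; /* cannot do better, stop trying */
        if (best < 0 || r[i].spills < r[best].spills)
            best = (int)i;
    }
    return best;
}
```

Stopping at the first spill-free strategy keeps the common case cheap; only shaders that spill everywhere pay for trying every strategy.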

It should be noted that this may incur a significant increase
in compile times. We will address this in a follow-up patch.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15041>
2022-02-18 08:38:19 +00:00
Iago Toral Quiroga 7561ea8fa1 broadcom/compiler: allow ldunifa with read-only SSBOs
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14830>
2022-02-03 07:35:07 +00:00
Iago Toral Quiroga 0a8449b07c broadcom/compiler: fix offset alignment for ldunifa when skipping
The intention was to align the address to 4 bytes (32-bit), not
16 bytes.
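
To illustrate the difference, the sketch below aligns a byte offset down to 4 and to 16 bytes. The helper names are hypothetical, and the exact form of the original bug in the driver is an assumption here; only the intended 4-byte alignment comes from the commit message.

```c
#include <stdint.h>

/* Intended behavior: align a byte offset down to a 32-bit boundary. */
static uint32_t
toy_align_down_4(uint32_t offset)
{
    return offset & ~0x3u;
}

/* Unintended behavior: align down to a 16-byte boundary, which can skip
 * back over up to three 32-bit words. */
static uint32_t
toy_align_down_16(uint32_t offset)
{
    return offset & ~0xfu;
}
```

For a byte offset of 22, the first helper yields 20 while the second yields 16, so the load would start a full word too early.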

Fixes: bdb6201ea1 ("broadcom/compiler: use ldunifa with unaligned constant offset")

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14830>
2022-02-03 07:35:07 +00:00
Iago Toral Quiroga 5cec893384 broadcom/compiler: update comment on load_uniform fast-path
The comment for 16-bit applies to 8-bit uniforms as well.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14648>
2022-01-25 09:08:26 +00:00
Iago Toral Quiroga 296fde31aa broadcom/compiler: allow vectorization to larger scalar type
Allow vectorizing operations from a smaller bit-size into
scalar operations of a larger bit-size. This allows us to
turn 2x8-bit into an equivalent scalar 16-bit load/store.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14648>
2022-01-25 09:08:26 +00:00
Iago Toral Quiroga a248ff0b5b broadcom/compiler: support 8-bit loads via ldunifa
This generalizes the support we added for 16-bit to also handle
8-bit loads via ldunifa. The story is the same: we align the address
to 32-bit downwards and we skip any bytes that are not of interest.
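
A minimal sketch of "align down, then skip", assuming the buffer is visible as an array of 32-bit words and that the byte at the lowest address sits in the least significant bits of each word. The function name is hypothetical and this is plain C, not the code the compiler emits.

```c
#include <stdint.h>

/* Reads the 8-bit value at byte_offset from a buffer that can only be
 * accessed in 32-bit words. */
static uint8_t
toy_load_u8_from_words(const uint32_t *words, uint32_t byte_offset)
{
    uint32_t aligned = byte_offset & ~0x3u; /* align down to 32-bit */
    uint32_t skip = byte_offset - aligned;  /* bytes to skip: 0..3 */
    uint32_t word = words[aligned / 4];     /* the 32-bit load */

    /* Drop the skipped bytes and keep the byte of interest. */
    return (uint8_t)((word >> (skip * 8)) & 0xffu);
}
```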

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14648>
2022-01-25 09:08:26 +00:00
Iago Toral Quiroga 4630f5f016 broadcom/compiler: handle to/from 8-bit integer conversions
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14648>
2022-01-25 09:08:26 +00:00
Iago Toral Quiroga 1b530d948d broadcom/compiler: support 8-bit general store access
Just like with 16-bit, this mode only supports scalar access, but
we are already lowering all non 32-bit accesses to scalar.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14648>
2022-01-25 09:08:26 +00:00
Iago Toral Quiroga f7ff462421 broadcom/compiler: support 16-bit uniforms
Since ldunif is a 32-bit instruction we need to demote these to
UBO loads, like we do for indirect indexing, with the exception
of scalar 16-bit uniforms with an offset that is 32-bit aligned.

For the exception where we can use ldunif we read a 32-bit slot
from memory where the uniform data is in the lower 16-bit and we
will read garbage in the upper 16-bit which we won't use anyway.

It should be noted that by using ldunif, we are consuming
32-bit from the uniform stream, but this is fine because
if there is valid uniform data in the upper 16-bit (i.e.
we had an ivec2 uniform aligned to a 32-bit address), since
we scalarize 16-bit loads, we would see another load uniform
with an unaligned offset for the second component, which we
will demote to UBO.
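
The rule described above can be summarized in a small predicate. It is a sketch with a hypothetical name and signature, and it assumes the offset is a compile-time constant (indirect indexing is demoted to UBO loads regardless).

```c
#include <stdbool.h>

/* Returns true if a uniform load with a constant byte offset can use
 * ldunif, false if it has to be demoted to a UBO load. */
static bool
toy_uniform_uses_ldunif(unsigned bit_size, unsigned num_components,
                        unsigned const_offset)
{
    if (bit_size == 32)
        return true;

    /* The exception: scalar 16-bit uniforms at a 32-bit aligned offset.
     * The upper 16 bits of the slot are read but never used. */
    return bit_size == 16 && num_components == 1 && const_offset % 4 == 0;
}
```

For a scalarized 16-bit two-component uniform at offset 4, the first component (offset 4) takes the ldunif path and the second (offset 6) is demoted.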

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14648>
2022-01-25 09:08:26 +00:00
Iago Toral Quiroga 49a8fa152c broadcom/compiler: support f32 to f16 RTZ and RTE rounding modes
These are required by VK_KHR_16bit_storage. Our hardware, however,
doesn't provide any mechanism to decide on the rounding mode of
the conversion and it seems to be using RTE, so we implement
RTZ in software.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14648>
2022-01-25 09:08:26 +00:00
Iago Toral Quiroga 1f639d5310 broadcom/compiler: implement 32-bit/16-bit conversion opcodes
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14648>
2022-01-25 09:08:26 +00:00
Iago Toral Quiroga bdb6201ea1 broadcom/compiler: use ldunifa with unaligned constant offset
If we know we have a load with a constant offset, then even if it
is not aligned to 32-bit we can still produce an aligned offset
and then skip over the bytes we don't need.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14648>
2022-01-25 09:08:26 +00:00
Iago Toral Quiroga 2eb6910d96 broadcom/compiler: support ldunifa with some 16-bit loads
Even though ldunifa is strictly 32-bit we may be able to use it
to load 16-bit values that sit at 32-bit aligned addresses.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14648>
2022-01-25 09:08:26 +00:00
Iago Toral Quiroga 2a420bdf92 broadcom/compiler: lower packing after vectorization
The vectorization pass can inject 32_2x16 (un)packing opcodes
upon successful vectorization of 16-bit operations into 32-bit
counterparts, so make sure we lower these to something our
backend can handle.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14648>
2022-01-25 09:08:26 +00:00
Iago Toral Quiroga 4b24373137 broadcom/compiler: implement TMU general 16-bit load/store
This allows us to implement 16-bit access on uniform and
storage buffers.

Notice that V3D hardware can only do general access on scalar
16-bit elements, which we currently enforce by running a lowering
pass during shader compile.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14648>
2022-01-25 09:08:26 +00:00
Iago Toral Quiroga 2443e45e76 broadcom/compiler: better document vectorization implications
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14648>
2022-01-25 09:08:26 +00:00
Iago Toral Quiroga 765d9feb46 broadcom/compiler: add lowering pass to scalarize non 32-bit general load/store
V3D hardware doesn't support vector access for general TMU load/store
operations like the ones we use for UBO and SSBO, so we need to split
these to scalar operations.

It should be noted that we also have a vectorization pass (which runs
later, during optimization), that may reconstruct some of these into
32-bit operations when possible (i.e. when the resulting operation
is 32-bit aligned).

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14648>
2022-01-25 09:08:26 +00:00
Dave Airlie ccbf700d6c nir: remove gl.h include from nir headers.
This saves a lot of pointless gl.h includes across the board,
it moves the one place that needs GLenum into a separate file
only used in those passes that require it.

Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14605>
2022-01-19 21:54:58 +00:00
Thomas H.P. Andersen c32c9014f5 broadcom/compiler: fix compile warning -Wabsolute-value
fixes a compile warning with clang

Reviewed-by: Juan A. Suarez <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14302>
2022-01-03 20:20:37 +00:00
Alejandro Piñeiro 1c4f76672d broadcom/compiler: avoid unneeded sint/unorm clamping when lowering stores
They are being used on integer to integer stores. From the Vulkan spec,
final paragraph of 16.4.4 "Texel Output Format Conversion":
    "Each component is converted based on its type and size (as
     defined in the Format Definition section for each
     VkFormat). ... Integer outputs are converted such that their value
     is preserved. The converted value of any integer that cannot be
     represented in the target format is undefined."

I didn't find an equivalent quote for OpenGL as all conversion entries
are focused on float to integer, fixed-point to integer, etc, and not
on integer to integer. Didn't find any test failure with this change.

We didn't get any shader-db stats change with shaderdb (even
overriding to OpenGL 4.4 to get more shaders built), so as a reference
Vulkan shader-db stats with the pattern
dEQP-VK.image.*.with_format.*.*
   total instructions in shared programs: 37534 -> 36522 (-2.70%)
   instructions in affected programs: 12080 -> 11068 (-8.38%)
   helped: 241
   HURT: 0
   Instructions are helped.

   total uniforms in shared programs: 9100 -> 8550 (-6.04%)
   uniforms in affected programs: 3004 -> 2454 (-18.31%)
   helped: 229
   HURT: 0

   total max-temps in shared programs: 6110 -> 6014 (-1.57%)
   max-temps in affected programs: 402 -> 306 (-23.88%)
   helped: 43
   HURT: 0
   Max-temps are helped.

   total nops in shared programs: 1523 -> 1526 (0.20%)
   nops in affected programs: 21 -> 24 (14.29%)
   helped: 3
   HURT: 6
   Inconclusive result (value mean confidence interval includes 0).

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14194>
2021-12-15 11:53:20 +00:00
Iago Toral Quiroga 2630c8f546 broadcom/compiler: improve thrsw merge
Instead of stopping the merge process when we find an instruction
with an incompatible signal (such as a small immediate), keep
going and see if we can merge the thrsw in a previous instruction
that is compatible.

total instructions in shared programs: 13409835 -> 13356648 (-0.40%)
instructions in affected programs: 3556860 -> 3503673 (-1.50%)
helped: 17457
HURT: 18
Instructions are helped.

total max-temps in shared programs: 2353971 -> 2352956 (-0.04%)
max-temps in affected programs: 13960 -> 12945 (-7.27%)
helped: 703
HURT: 0
Max-temps are helped.

total spills in shared programs: 12301 -> 12301 (0.00%)
total sfu-stalls in shared programs: 32596 -> 32499 (-0.30%)
sfu-stalls in affected programs: 225 -> 128 (-43.11%)
helped: 79
HURT: 3
Sfu-stalls are helped.

total nops in shared programs: 347204 -> 325234 (-6.33%)
nops in affected programs: 99834 -> 77864 (-22.01%)
helped: 11515
HURT: 158
Nops are helped.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14172>
2021-12-14 09:50:17 +00:00
Juan A. Suarez Romero fd47c939f4 st/pbo: add the image format in the download FS
In the V3D driver there is a NIR lowering step for `image_store`
intrinsic, where the image store format is required for doing the proper
lowering.

Thus, let's define it for the download FS instead of
keeping it as NONE.

v2 (Ilia):
 - Use format only for drivers not supporting format-less writing.

v4 (Ilia):
 - Use PIPE_CAP_IMAGE_STORE_FORMATTED to reduce combinations.

v5 (Ilia):
 - Use indirect array for download FS in drivers without
   formatless-store support.

Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Reviewed-by: Ilia Mirkin <imirkin@alum.mit.edu>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13409>
2021-12-03 15:32:36 +00:00
Iago Toral Quiroga cc7db1fc53 broadcom/compiler: improve documentation for Z writes
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14037>
2021-12-03 10:39:08 +00:00
Iago Toral Quiroga a65c605365 broadcom/compiler: track passthrough Z writes
In some cases we need to make the shaders write the Z value produced
from rasterization (FEP). Track these instances because they are relevant
to early EZ setup.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14037>
2021-12-03 10:39:08 +00:00
Iago Toral Quiroga 6d4a645c90 broadcom/compiler: emit passthrough Z write if shader reads Z
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14037>
2021-12-03 10:39:08 +00:00
Iago Toral Quiroga 996f147fef broadcom/compiler: relax restriction on VPM inst in last thread end slot
According to the documentation, only vpmwt is disallowed in the last delay
slot of the thread end.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13975>
2021-11-29 14:06:43 +00:00
Iago Toral Quiroga 6923dd687c broadcom/compiler: allow color TLB writes in last instruction
Only Z writes are disallowed.

total instructions in shared programs: 11578449 -> 11577369 (<.01%)
instructions in affected programs: 38132 -> 37052 (-2.83%)
helped: 1080
HURT: 0
Instructions are helped.

total max-temps in shared programs: 2334416 -> 2334395 (<.01%)
max-temps in affected programs: 218 -> 197 (-9.63%)
helped: 21
HURT: 0
Max-temps are helped.

total inst-and-stalls in shared programs: 11607890 -> 11606810 (<.01%)
inst-and-stalls in affected programs: 38265 -> 37185 (-2.82%)
helped: 1080
HURT: 0
Inst-and-stalls are helped.

total nops in shared programs: 338316 -> 337236 (-0.32%)
nops in affected programs: 2625 -> 1545 (-41.14%)
helped: 1080
HURT: 0
Nops are helped.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13964>
2021-11-29 06:44:07 +00:00
Alejandro Piñeiro a9b4aef0f2 broadcom/compiler: make shaderdb debug output compatible with shaderdb's report tool
Even though the option is called shaderdb, it is not really used by
shaderdb (for V3D shaderdb uses the debug option "precompile"). And in
fact, right now the output format is not compatible with shaderdb.

This commit tries to fix that, and as we are here, also try to make
the option more useful for the Vulkan case, as that debug option also
works with v3dv.

We can't really fully imitate shaderdb use with OpenGL (run with a set
of glsl shader tests), but we can at least assign a unique name (the
pipeline sha1 in text format) so we can compare executions of the same
vulkan application. For that remember to disable the on-disk cache.

Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13938>
2021-11-24 13:02:08 +00:00
Iago Toral Quiroga 79dee14cc2 broadcom/compiler: don't move ldvary earlier if current instruction has ldunif
If we did, we would have the instruction coming right after ldvary write
to the same implicit destination as ldvary at the same time. We prevent
this when merging instructions, but we should make sure we prevent this
when we move ldvary around for pipelining too.

Reviewed-by: Juan A. Suarez <jasuarez@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13921>
2021-11-23 10:52:24 +00:00
Iago Toral Quiroga 7fec4f4135 broadcom/compiler: fix scoreboard locking checks
According to the spec the hardware locks the scoreboard on the first
or last thread switch (selected via shader state) and any TLB accesses
executed before this are not synchronized by hardware.

This change updates the logic to ensure we respect this requirement
and that we don't assume that the lock is acquired automatically
on the first TLB access, which is not valid at least since V3D 4.1+.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13910>
2021-11-22 12:53:43 +00:00
Iago Toral Quiroga bd7584c16b broadcom/compiler: don't allow RF writes from signals after thrend
Writes to physical registers are not allowed after thread end. We
were checking this for ALU writes, but we need to check it for
signal writes too.

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13910>
2021-11-22 12:53:43 +00:00
Juan A. Suarez Romero 457dbb81f5 broadcom/compiler: apply constant folding on early GS lowering
This solves a case where a NIR geometry shader was storing the output in
a non-constant:

  vec4 32 ssa_1 = load_const (0xc0800000 /* -4.000000 */, 0xc1100000 /* -9.000000 */, 0x40400000 /* 3.000000 */, 0x40e00000 /* 7.000000 */)
  vec1 32 ssa_7 = load_const (0x00000000 /* 0.000000 */)
  vec1 32 ssa_8 = load_const (0x00000001 /* 0.000000 */)
  vec1 32 ssa_9 = iadd ssa_7, ssa_8
  vec1 32 ssa_19 = mov ssa_1.x
  intrinsic store_output (ssa_19, ssa_9) (1, 1, 0, 160, 288) /* base=1 */ /* wrmask=x */ /* component=0 */ /* src_type=float32 */ /* location=32 slots=2 gs_streams(x=0 y=0 z=0 w=0) */

When lowering the VPM output we check if the destination (ssa_9 in this
case) is a constant to add to the VPM offset. We run a constant folding
optimization in an earlier VS lowering, and we should do the same for
GS.

This fixes multiple dEQP-VK.pipeline.interface_matching.* failures.

Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13884>
2021-11-22 09:32:50 +00:00
Juan A. Suarez Romero 7b21635057 broadcom/compiler: handle array of structs in GS/FS inputs
While fragment and geometry shaders were handling structs as inputs, they
weren't doing it for arrays of structures.

This fixes multiple dEQP-VK.pipeline.interface_matching.* failures and
assertions.

v2:
 - Fix style (Iago).

Signed-off-by: Juan A. Suarez Romero <jasuarez@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13884>
2021-11-22 09:32:50 +00:00
Iago Toral Quiroga 5e536c97a9 broadcom/compiler: fix early fragment tests setup
When early fragment tests are mandated by the shader, we must use
the Z value produced by the FEP even if there are elements that
would typically require late fragment tests (such as discards,
sample to coverage, etc).

This change means we also need to be a bit more careful when
we promote shaders to use early fragment tests so we don't
promote anything with discards for example.

Fixes:
dEQP-VK.fragment_operations.early_fragment.discard_early_fragment_tests_depth
dEQP-VK.fragment_operations.early_fragment.discard_early_fragment_tests_stencil

Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13837>
2021-11-18 07:39:32 +00:00
Connor Abbott 508f917d8c util/dag: Make edge data a uintptr_t
Nobody was actually using it as a pointer, and I'm going to introduce a
shared function which relies on it not being a pointer so let's fix this
once and for all.

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13722>
2021-11-17 13:41:47 +00:00
Iago Toral Quiroga 0cb58f80d2 v3d: use V3D_MAX_DRAW_BUFFERS instead of hardcoded constant
Reviewed-by: Alejandro Piñeiro <apinheiro@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13775>
2021-11-12 11:04:07 +00:00