Commit Graph

18 Commits

Author SHA1 Message Date
Caio Oliveira 388bac06ce brw: Add brw_dpas_inst
Fixed the types in brw_inst::bits so the struct is packed correctly.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:25:03 +00:00
Caio Oliveira 0fcce2722f brw: Add brw_send_inst
Move all the SEND specific fields from brw_inst into brw_send_inst.
This new instruction kind will contain all variants of SENDs plus the
virtual opcodes that were already relying on those SEND fields.

Use the `as_send()` helper to go from a brw_inst into the brw_send_inst
when applicable.  Some of the code was changed to use the brw_send_inst
type directly.

Until other kinds are added, all the instructions are allocated the same
amount of space as brw_send_inst.  This ensures that all
brw_transform_inst() calls are still valid.  This will change after
a few patches so that BASE instructions can use less memory.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:25:01 +00:00
Kenneth Graunke 6281a12822 brw: Remove brw_inst::no_dd_check/no_dd_clear
These dependency hints were primarily useful for the vec4 backend, where
it was common to write subsets of a vec4's components across multiple
instructions.  In the scalar backend, we rarely used them.  They also no
longer exist on Tigerlake and later in favor of software scoreboarding.

Dropping this allows us to clean up the IR a bit.

We still use the hardware hints in the generator in a couple places:

   - Gfx9-12.0 scratch headers
   - Quad swizzles
   - Indirect MOV lowering

In theory we might want them back if we moved that lowering to the IR.
For scratch at least, I suspect it won't have a huge impact, as we're
already incurring the cost of spills/fills.  The others are fairly rare
as well, so it may not be worth keeping.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36730>
2025-09-12 00:24:55 +00:00
Francisco Jerez dfc2a89d96 intel/brw: Allow using performance analysis pass pre-register allocation.
Mainly this involves changing 'struct state' so that the dep_ready
array is allocated with a dynamic size based on the number of VGRFs of
the program instead of assuming a fixed XE3_MAX_GRF count of GRF
dependencies.  VGRF register dependencies are then handled by using
one dep_ready entry per VGRF allocation instead of one per hardware
register.

The ability to use the performance analysis pass pre-regalloc will
mostly be useful on xe3+, but this also has the side effect of saving
some memory on xe2 and earlier platforms since we no longer need to
allocate XE3_MAX_GRF dep_ready entries for them.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:57 +00:00
Francisco Jerez 3936a43496 intel/brw/xe3+: Tweak render target write timings in performance modeling pass.
Reduce the cycle-count cost estimate used by the performance model for
render target writes on xe3+ in order to match the real-world
observation of shaders with latency lower than the previously
estimated cost of its render target write.

In a shader used by Factorio this would have led us to incorrectly
model the shader as fillrate-bound, even though in reality the shader
is EU-bound and benefits from the higher parallelism of SIMD32, so the
subsequent commit that re-enables the static analysis-based SIMD32
heuristic on PTL would lead to a ~2% regression without this tweak.

There appear to be no other regressions nor other changes from this in
combination with the subsequent commit that enables it to have an
effect, but it is possible that the real cycle count cost of a render
target write still lies below the estimated value, ~400 is just the
upper bound that can be inferred from the behavior of this test case.

Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:57 +00:00
Francisco Jerez 6ccf2a375a intel/brw/xe3+: Adjust weights of discard control flow for non-EU-fused platforms.
Currently on platforms without EU fusion (all platforms other than
gfx12.x) we were using a constant discard_weight = 1.0 regardless of
SIMD width.  This was far from ideal, in particular since it made the
performance analysis pass fully insensitive to the presence of discard
jumps, even though the scheduler is able to move code past a discard
statement so the range of the program under discard control flow can
vary and have a material effect on the relative performance of SIMD16
vs. SIMD32, since the scheduler is typically more constrained in
SIMD32 dispatch mode.

In order to fix this use a discard_weight lower than 1.0 for all
dispatch modes, so that the performance analysis pass accounts for the
presence and range of discard control flow.  In addition use a lower
discard_weight for SIMD16 dispatch like we do on Gfx12.x in order to
account for the higher likelihood of divergent discard in SIMD32 mode.

The specific weights were determined iteratively on PTL based on the
final FPS result of several traces that are sensitive to the dispatch
width of one or more fragment shaders that use discard, in order to
ensure that in none of those cases we end up using the
lower-performing dispatch width variant.  This avoids regressions
between 3.7% and 0.8% in Superposition-trace-dx11-2160p-extreme,
BaldursGate3-trace-dx11-1440p-ultra and
MetroExodus-trace-dx11-2160p-ultra after enabling the static
analysis-based SIMD32 heuristic in PTL.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> (v1)

v2: Limit to xe3+ for now since performance effect seems to be a wash
    on xe2.

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:57 +00:00
Francisco Jerez 1272ff5ed1 intel/brw/xehp+: Adjust performance model weights of LSC atomic ops.
The LSC implements several optimizations for atomic operations on a
memory addresses that are uniform across all lanes, in which case its
cost is approximately O(1) instead of O(exec_size).  Even cases where
memory offsets are non-uniform but packed in a cacheline appear to
have a cost that is non-linear with the number of lanes.

In order to approximate this behavior more closely approximate its
back-end cost as roughly 1300 cycles instead of the previous 400 *
exec_size/8.  This fixes some cases where we were incorrectly
predicting the SIMD32 shader would be bound by the throughput of LSC
atomic operations, even though the observed cost per lane of the LSC
operations was significantly lower in SIMD32 mode so it would have the
best performance.

Clearly this is still a rough approximation and it might be possible
to obtain a more accurate result by plumbing divergence analysis data
all the way down to codegen, however the goal of the performance
analysis pass isn't to provide an exact prediction of the performance
of a shader (that's not really possible in general via static analysis
without solving the halting problem), but to provide a good enough
approximation at a low cost -- And the constant approximation seems to
be strictly better in practice than the approximation we were using
before, there appear to be no regressions from this change, and
ShadowTombRaider-trace-dx11-2160p-ultra shows 5.7% better performance
on PTL with a subsequent commit that re-enables the use of the static
analysis-based SIMD32 heuristic on xe3+.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:56 +00:00
Francisco Jerez 6eea9659db intel/brw/xe3+: Model trade-off between parallelism and GRF use in performance analysis.
This extends the performance analysis pass used in previous
generations to make it more useful to deal with the performance
trade-off encountered on xe3 hardware as a result of VRT.  VRT allows
the driver to request a per-thread GRF allocation different from the
128 GRFs that were typical in previous platforms, but this comes at
either a thread parallelism cost or benefit depending on the number of
GRF register blocks requested.

This makes a number of decisions more difficult for the compiler since
certain optimizations potentially trade off run-time in a thread
against the total number of threads that can run in parallel
(e.g. consider scheduling and how reordering an instruction to avoid a
stall can increase GRF use and therefore reduce thread-level
parallelism when trying to improve instruction-level parallelism).

This patch provides a simple heuristic tool to account for the
combined interaction of register pressure and other single-threaded
factors that affect performance.  This is expressed with the
redefinition of the pre-existing brw_performance::throughput estimate
as the number of invocations per cycle per EU that would be achieved
if there were enough threads to reach full load (in this sense this is
to be considered a heuristic since the penalty from VRT may be lower
than expected from this model at low EU load).

This will be used e.g. in order to decide whether to use a more
aggressive latency-minimizing mode during scheduling or a mode more
effective at minimizing register pressure (it makes sense to take the
path that will lead to the most invocations being serviced per cycle
while under load).  This also allows us to re-enable the old PS SIMD32
heuristic on xe3+, and due to this change it is able to identify cases
where the combined effect of poorer scheduling and higher GRF use of
the SIMD32 variant makes it more favorable to use SIMD16 only (see
last patch of the MR for details and numbers).

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>
2025-09-10 02:15:56 +00:00
Kenneth Graunke b848fa4595 brw: Rename is_send_from_grf to is_send, replace other is_send() helper
The is_send() helper is just a wrapper around inst->is_send_from_grf()
now, so we can combine the two.  Trim the name from is_send_from_grf()
to is_send(), as it's shorter, and also matches is_math().

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34040>
2025-08-08 22:12:05 +00:00
Ian Romanick fa74c31b22 brw: Allow additional flags registers on Xe2+
Xe2 adds two more flags registers. We barely use the second flags
register on previous platforms, so the omission was not previously
noticed.

There are several efforts in progress that will add using of more flags
registers.

Reviewed-by: Francisco Jerez <currojerez@riseup.net>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35415>
2025-07-24 23:08:08 +00:00
Sagar Ghuge bea9d79cb9 intel/compiler: Add support for MSAA typed load/store messages
Signed-off-by: Sagar Ghuge <sagar.ghuge@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32690>
2025-03-07 23:06:14 +00:00
Kenneth Graunke 88309a9818 brw: Rename shared function enums for clarity
Our name for this enum was brw_message_target, but it's better known as
shared function ID or SFID.  Call it brw_sfid to make it easier to find.

Now that brw only supports Gfx9+, we don't particularly care whether
SFIDs were introduced on Gfx4, Gfx6, or Gfx7.5.  Also, the LSC SFIDs
were confusingly tagged "GFX12" but aren't available on Gfx12.0; they
were introduced with Alchemist/Meteorlake.

GFX6_SFID_DATAPORT_SAMPLER_CACHE in particular was confusing.  It sounds
like the SFID to use for the sampler on Gfx6+, however it has nothing to
do with the sampler at all.  BRW_SFID_SAMPLER remains the sampler SFID.
On Haswell, we ran out of messages on the main data cache data port, and
so they introduced two additional ones, for more messages.  The modern
Tigerlake PRMs simply call these DP_DC0, DP_DC1, and DP_DC2.  I think
the "sampler" name came from some idea about reorganizing messages that
never materialized (instead, the LSC came as a much larger cleanup).

Recently we've adopted the term "HDC" for the legacy data cluster, as
opposed to "LSC" for the modern Load/Store Cache.  To make clear which
SFIDs target the legacy HDC dataports, we use BRW_SFID_HDC0/1/2.

We were also citing the G45, Sandybridge, and Ivybridge PRMs for a
compiler that supports none of those platforms.  Cite modern docs.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33650>
2025-02-27 08:49:24 +00:00
Caio Oliveira d2c39b1779 intel/brw: Always have a (non-DO) block after a DO in the CFG
Make the "block after DO" more stable so that adding instructions after
a DO doesn't require repairing the CFG.  Use a new SHADER_OPCODE_FLOW
instruction that is a placeholder representing "go to the next block"
and disappears at code generation.

For some context, there are a few facts about how CFG currently works

- Blocks are assumed to not be empty;
- DO is always by itself in a block, i.e. starts and ends a block;
- There are no empty blocks;
- Predicated WHILE and CONTINUE will link to the "block after DO";
- When nesting loops, it is possible that the "block after DO" is
  another "DO".

Reasons and further explanations for those are in the brw_cfg.c comments.

What makes this new change useful is that a pass might want to add
instructions between two DO instructions.  When that happens, a new
block must be created and any predicated WHILE and CONTINUE must be
repaired.

So, instead of requiring a repair (which has proven to be tricky in
the past), this change adds a block that can be "virtually" empty but
allow instructions to be added without further changes.

One alternative design would be allowing empty blocks, that would be
a deeper change since the blocks are currently assumed to be not empty
in various places.  We'll save that for when other changes are made to
the CFG.

The problem described happens in brw_opt_combine_constants, and a
different patch will clean that up.

Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33536>
2025-02-24 23:25:06 +00:00
Caio Oliveira cf3bb77224 intel/brw: Rename fs_visitor to brw_shader
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32536>
2025-02-11 09:13:28 +00:00
Caio Oliveira 352a63122f intel/brw: Rename files brw_fs.cpp/h to brw_shader.cpp/h
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/32536>
2025-02-11 09:13:28 +00:00
Kenneth Graunke ae60338142 brw: Lower MEMORY_FENCE and INTERLOCK in lower_logical_sends
We teach lower_logical_sends to lower these to SHADER_OPCODE_SEND
and drop all the corresponding generator and eu_emit code.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33297>
2025-02-08 01:07:22 +00:00
Caio Oliveira 1ade9a05d8 intel/brw: Use brw prefix instead of namespace for analysis implementations
Also drop the 'fs' prefix when applicable.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33048>
2025-02-05 21:47:07 +00:00
Caio Oliveira 0ebb75743d intel/brw: Use brw_analysis prefix for performance analysis files
Move declaration to the common header and rename definition file.

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33048>
2025-02-05 21:47:06 +00:00