brw: Eliminate duplicate fills

When the register allocator decides to spill a value, all reads of that
value are filled. This can result in cases where the same value is
filled many times in a single block. In those cases, the result of an
earlier fill may still be available when a later fill occurs.

This optimization replaces the later fill with a move from the result of
the earlier fill.

v2: Use FIXED_GRF for register overlap tests. Since this is after
register allocation, the VGRF values will not tell the whole truth.

v3: Use brw_transform_inst. Suggested by Caio. Add
brw_scratch_inst::offset instead of storing it as a source. Suggested by
Lionel.

v4: In intervening spill to the same location also invalidates the
value. 🤦

v5: Don't eliminate a fill if its destination partially overlaps the
preceeding fill destination. Fixes failures in cooperative matrix CTS.

shader-db:

Lunar Lake, Meteor Lake, and DG2 had similar results. (Lunar Lake shown)
total instructions in shared programs: 17249903 -> 17249653 (<.01%)
instructions in affected programs: 35550 -> 35300 (-0.70%)
helped: 20 / HURT: 0

total cycles in shared programs: 893092398 -> 893101836 (<.01%)
cycles in affected programs: 2501720 -> 2511158 (0.38%)
helped: 6 / HURT: 14

total fills in shared programs: 1901 -> 1776 (-6.58%)
fills in affected programs: 1757 -> 1632 (-7.11%)
helped: 20 / HURT: 0

fossil-db:

Lunar Lake, Meteor Lake, and DG2 had similar results. (Lunar Lake shown)
Totals:
Instrs: 929949528 -> 926770338 (-0.34%)
Cycle count: 105126671329 -> 104851299099 (-0.26%); split: -0.28%, +0.02%
Fill count: 6520785 -> 5021518 (-22.99%)

Totals from 54281 (2.69% of 2018922) affected shaders:
Instrs: 239616289 -> 236437099 (-1.33%)
Cycle count: 22051883404 -> 21776511174 (-1.25%); split: -1.33%, +0.08%
Fill count: 6406295 -> 4907028 (-23.40%)

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37827>
This commit is contained in:
Ian Romanick
2025-06-26 10:09:53 -07:00
committed by Marge Bot
parent d2e3707ecc
commit 0c089a5c32

View File

@@ -149,6 +149,80 @@ brw_opt_fill_and_spill(brw_shader &s)
}
}
/* Optimize multiple fills from the same offset in a single block. */
foreach_inst_in_block(brw_inst, inst, block) {
if (inst->opcode != SHADER_OPCODE_LSC_FILL)
continue;
brw_reg inst_dst = brw_lower_vgrf_to_fixed_grf(devinfo, inst,
inst->dst);
foreach_inst_in_block_starting_from(brw_inst, scan_inst, inst) {
/* Instruction is a fill from the same location as the previous
* fill.
*/
brw_reg scan_dst = brw_lower_vgrf_to_fixed_grf(devinfo, scan_inst,
scan_inst->dst);
if (scan_inst->opcode == SHADER_OPCODE_LSC_FILL &&
scan_inst->force_writemask_all == inst->force_writemask_all &&
scan_inst->as_scratch()->offset == inst->as_scratch()->offset &&
scan_inst->size_written == inst->size_written &&
scan_inst->group == inst->group &&
scan_inst->as_scratch()->use_transpose == inst->as_scratch()->use_transpose) {
const unsigned reg_count = DIV_ROUND_UP(scan_inst->size_written, REG_SIZE);
const unsigned max_reg_count = 2 * reg_unit(devinfo);
/* If the resulting MOV would try to write more than 2
* registers, skip the optimization.
*
* FINISHME: It shouldn't be hard to generate multiple MOV
* instructions below to handle this case.
*/
if (reg_count > max_reg_count)
continue;
if (scan_dst.equals(inst_dst)) {
scan_inst = brw_transform_inst(s, scan_inst, BRW_OPCODE_NOP);
} else {
/* This can occur for fills in wider SIMD modes. In SIMD32
* on Xe2, a fill to r16 followed by a fill to r17 from the
* same location can't be trivially replaced. The resulting
* `mov(32) r17, r16` would have the same problems of memcpy
* with overlapping ranges.
*
* FINISHME: This is fixable, but it required emitting two
* MOVs with hald SIMD size. It might also "just work" if
* scan_dst.nr < inst_dst.nr.
*/
if (regions_overlap(scan_dst, scan_inst->size_written,
inst_dst, inst->size_written)) {
break;
}
scan_inst = brw_transform_inst(s, scan_inst, BRW_OPCODE_MOV);
scan_inst->src[0] = inst->dst;
}
s.shader_stats.fill_count--;
block_progress = true;
} else {
/* A spill to the same location invalidates the value. */
if (scan_inst->opcode == SHADER_OPCODE_LSC_SPILL &&
scratch_intersects(devinfo, inst->as_scratch(),
scan_inst->as_scratch())) {
break;
}
/* Write to the register being filled invalidates the value. */
if (regions_overlap(scan_dst, scan_inst->size_written,
inst_dst, inst->size_written)) {
break;
}
}
}
}
if (block_progress) {
foreach_inst_in_block_safe(brw_inst, inst, block) {
if (inst->opcode == BRW_OPCODE_NOP)