35ac517780
This defines a new pre-RA scheduling mode similar to BRW_SCHEDULE_PRE
but more aggressive at optimizing for minimum latency rather than
minimum register usage. The main motivation is that on recent xe3
platforms we use a register allocation heuristic that packs variables
more tightly at the bottom of the register file instead of the
round-robin heuristic we used on previous platforms, since as a result
of VRT there is a parallelism penalty when a program uses more GRF
registers than necessary. Unfortunately the xe3 tight-packing
heuristic severely constrains the work of the post-RA scheduler due to
the false dependencies introduced during register allocation, so we
can do a better job by making the scheduler aware of instruction
latencies before the register allocator introduces any false
dependencies.
This can lead to higher register pressure, but only when the scheduler
decides it could save cycles by extending a live range. It makes
sense to preserve the preexisting BRW_SCHEDULE_PRE as a separate mode
since some workloads can still benefit from neglecting latencies
pre-RA due to the trade-off mentioned between parallelism and GRF use,
a future commit will introduce a more accurate estimate of the
expected relative performance of BRW_SCHEDULE_PRE
vs. BRW_SCHEDULE_PRE_LATENCY taking into account this trade-off.
In theory this could also be helpful on earlier pre-xe3 platforms, but
the benefit should be significantly smaller due to the different RA
heuristic so it hasn't been tested extensively pre-xe3.
The following Traci tests are improved significantly by this change on
PTL (nearly all tests that run on my system are affected positively):
Ghostrunner2-trace-dx11-1440p-ultra: 7.12% ±0.36%
SpaceEngineers-trace-dx11-2160p-high: 5.77% ±0.43%
HogwartsLegacy-trace-dx12-1080p-ultra: 4.40% ±0.03%
Naraka-trace-dx11-1440p-highest: 3.06% ±0.43%
MetroExodus-trace-dx11-2160p-ultra: 2.26% ±0.60%
Fortnite-trace-dx11-2160p-epix: 2.12% ±0.53%
Nba2K23-trace-dx11-2160p-ultra: 1.98% ±0.30%
Control-trace-dx11-1440p-high: 1.93% ±0.36%
GodOfWar-trace-dx11-2160p-ultra: 1.62% ±0.47%
TotalWarPharaoh-trace-dx11-1440p-ultra: 1.55% ±0.18%
MountAndBlade2-trace-dx11-1440p-veryhigh: 1.51% ±0.37%
Destiny2-trace-dx11-1440p-highest: 1.44% ±0.34%
GtaV-trace-dx11-2160p-ultra: 1.26% ±0.27%
ShadowTombRaider-trace-dx11-2160p-ultra: 1.10% ±0.58%
Borderlands3-trace-dx11-2160p-ultra: 0.95% ±0.43%
TerminatorResistance-trace-dx11-2160p-ultra: 0.87% ±0.22%
BaldursGate3-trace-dx11-1440p-ultra: 0.84% ±0.28%
CitiesSkylines2-trace-dx11-1440p-high: 0.82% ±0.22%
PubG-trace-dx11-1440p-ultra: 0.72% ±0.37%
Palworld-trace-dx11-1080p-med: 0.71% ±0.26%
Superposition-trace-dx11-2160p-extreme: 0.69% ±0.19%
The compile-time cost of shader-db increases significantly by 1.85%
after this commit (14 iterations, 5% significance), the compile-time
of fossil-db doesn't change significantly in my setup.
v2: Addressed interaction with 81594d0db1,
since the code that calculates deps, delays and exits is no longer
mode-independent after this change. Instead of reverting that
commit (which is non-trivial and would have a greater compile-time
hit) simply reconstruct the scheduler object during the transition
between BRW_SCHEDULE_PRE_LATENCY and any other PRE mode that
doesn't require instruction latencies.
Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36618>