From cd70e49394ee9110e3461dfe866afc63c2efc953 Mon Sep 17 00:00:00 2001
From: Ian Romanick <ian.d.romanick@intel.com>
Date: Tue, 17 Oct 2023 09:48:38 -0700
Subject: [PATCH] intel/brw: Allow SIMD16 F and HF type conversion moves

On DG2, the lowering generated for these MOV instructions is
**awful**. The original SIMD16 MOV

    { 18}   67: mov(16) vgrf54+0.0:HF, vgrf46+0.0:F NoMask group0

is lowered to SIMD8 MOVs:

    { 18}  118: mov(8) vgrf54+0.0:HF, vgrf46+0.0:F NoMask group0
    { 18}  119: mov(8) vgrf54+0.16:HF, vgrf46+1.0:F NoMask group8

These MOVs violate Gfx12.5 region restrictions, so these are further
lowered:

    { 17}  119: mov(8) vgrf83<2>:HF, vgrf46+0.0:F NoMask group0
    { 19}  120: mov(8) vgrf54+0.0:UW, vgrf83<2>:UW NoMask group0
    { 19}  122: mov(8) vgrf84<2>:HF, vgrf46+1.0:F NoMask group8
    { 19}  123: mov(8) vgrf54+0.16:UW, vgrf84<2>:UW NoMask group8

The shader-db and fossil-db results are nothing to get excited
about. However, the affect on vk_cooperative_matrix_perf is substantial. In one subtest

shader: shaders/shmemfp16.spv

cooperativeMatrixProps = 8x8x16   A = float16_t B = float16_t C = float16_t D = float16_t scope = subgroup
TILE_M=128 TILE_N=128, TILE_K=32 BLayout=0

performance on my DG2 improved by ~60% due to a MASSIVE reduction in spills and fills:

-Native code for unnamed compute shader (null) (src_hash 0x00000000) (sha1 c6a41b1c4e7aa2da327a39a70ed36c822a4b172f)
-SIMD32 shader: 32484 instructions. 1 loops. 1893868 cycles. 737:1820 spills:fills, 442 sends, scheduled with mode none. Promoted 1 constants. Compacted 519744 to 492224 bytes (5%)
-   START B0 (20782 cycles)
+Native code for unnamed compute shader (null) (src_hash 0x00000000) (sha1 621e960daad5b5579b176717f24a315e7ea560a1)
+SIMD32 shader: 23918 instructions. 1 loops. 1089894 cycles. 432:1166 spills:fills, 442 sends, scheduled with mode none. Promoted 1 constants. Compacted 382688 to 353232 bytes (8%)

shader-db:

All Gfx9 and later platforms had similar results. (Meteor Lake shown)
total instructions in shared programs: 19656270 -> 19653981 (-0.01%)
instructions in affected programs: 61810 -> 59521 (-3.70%)
helped: 116 / HURT: 0

total cycles in shared programs: 823368888 -> 823375854 (<.01%)
cycles in affected programs: 1165284 -> 1172250 (0.60%)
helped: 51 / HURT: 57

fossil-db:

DG2 and Meteor Lake had similar results. (Meteor Lake shown)
*** Shaders only in 'before' results are ignored:
fossil-db/steam-dxvk/total_war_warhammer3/2a3ed2ca632a7cb7/fs.32,
fossil-db/steam-dxvk/total_war_warhammer3/18b9d4a3b1961616/fs.32,
fossil-db/steam-dxvk/total_war_warhammer3/04ac9f3146a6db19/fs.32,
fossil-db/steam-dxvk/total_war_warhammer3/f37ebec6aa1b379a/fs.32,
fossil-db/steam-dxvk/total_war_warhammer3/255c987feb0d4310/fs.32, and 25
more
from 1 apps: fossil-db/steam-dxvk/total_war_warhammer3

Totals:
Instrs: 160946537 -> 160928389 (-0.01%); split: -0.01%, +0.00%
Cycles: 14125908620 -> 14125873958 (-0.00%); split: -0.00%, +0.00%

Totals from 1002 (0.15% of 652134) affected shaders:
Instrs: 411261 -> 393113 (-4.41%); split: -4.41%, +0.00%
Cycles: 16676735 -> 16642073 (-0.21%); split: -0.48%, +0.27%

Tiger Lake
Totals:
Instrs: 164511816 -> 164497202 (-0.01%); split: -0.01%, +0.00%
Cycles: 13801675722 -> 13801629397 (-0.00%); split: -0.00%, +0.00%
Subgroup size: 7955168 -> 7955152 (-0.00%)
Send messages: 8544494 -> 8544486 (-0.00%)

Totals from 997 (0.15% of 651454) affected shaders:
Instrs: 460820 -> 446206 (-3.17%); split: -3.17%, +0.00%
Cycles: 16265514 -> 16219189 (-0.28%); split: -0.84%, +0.56%
Subgroup size: 17552 -> 17536 (-0.09%)
Send messages: 26045 -> 26037 (-0.03%)

Ice Lake
Totals:
Instrs: 165504747 -> 165489970 (-0.01%); split: -0.01%, +0.00%
Cycles: 15145244554 -> 15145149627 (-0.00%); split: -0.00%, +0.00%
Subgroup size: 8107032 -> 8107016 (-0.00%)
Send messages: 8598680 -> 8598672 (-0.00%)
Spill count: 45427 -> 45423 (-0.01%)
Fill count: 74749 -> 74747 (-0.00%)

Totals from 1125 (0.17% of 656115) affected shaders:
Instrs: 521676 -> 506899 (-2.83%); split: -2.83%, +0.00%
Cycles: 19555434 -> 19460507 (-0.49%); split: -0.59%, +0.10%
Subgroup size: 21616 -> 21600 (-0.07%)
Send messages: 28623 -> 28615 (-0.03%)
Spill count: 603 -> 599 (-0.66%)
Fill count: 1362 -> 1360 (-0.15%)

Skylake
*** Shaders only in 'after' results are ignored:
fossil-db/steam-native/red_dead_redemption2/cef460b80bad8485/fs.16,
fossil-db/steam-native/red_dead_redemption2/cd5fe081e2e5529d/fs.16
from 1 apps: fossil-db/steam-native/red_dead_redemption2

Totals:
Instrs: 141607617 -> 141593776 (-0.01%); split: -0.01%, +0.00%
Cycles: 14257812441 -> 14257661671 (-0.00%); split: -0.00%, +0.00%
Subgroup size: 7743752 -> 7743736 (-0.00%)
Send messages: 7552728 -> 7552720 (-0.00%)
Spill count: 43660 -> 43661 (+0.00%)
Fill count: 71301 -> 71303 (+0.00%)

Totals from 1017 (0.16% of 636964) affected shaders:
Instrs: 392454 -> 378613 (-3.53%); split: -3.53%, +0.00%
Cycles: 16622974 -> 16472204 (-0.91%); split: -1.04%, +0.13%
Subgroup size: 19840 -> 19824 (-0.08%)
Send messages: 23021 -> 23013 (-0.03%)
Spill count: 484 -> 485 (+0.21%)
Fill count: 1155 -> 1157 (+0.17%)

Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28281>
---
 src/intel/compiler/brw_eu_validate.c          |  6 ++-
 .../compiler/brw_fs_lower_simd_width.cpp      | 47 ++++++++-----------
 2 files changed, 24 insertions(+), 29 deletions(-)

diff --git a/src/intel/compiler/brw_eu_validate.c b/src/intel/compiler/brw_eu_validate.c
index ff235673b7d..f7ad8ce066e 100644
--- a/src/intel/compiler/brw_eu_validate.c
+++ b/src/intel/compiler/brw_eu_validate.c
@@ -1137,7 +1137,8 @@ special_restrictions_for_mixed_float_mode(const struct brw_isa_info *isa,
     *    "No SIMD16 in mixed mode when destination is f32. Instruction
     *     execution size must be no more than 8."
     */
-   ERROR_IF(exec_size > 8 && dst_type == BRW_REGISTER_TYPE_F,
+   ERROR_IF(exec_size > 8 && dst_type == BRW_REGISTER_TYPE_F &&
+            opcode != BRW_OPCODE_MOV,
             "Mixed float mode with 32-bit float destination is limited "
             "to SIMD8");
 
@@ -1212,7 +1213,8 @@ special_restrictions_for_mixed_float_mode(const struct brw_isa_info *isa,
        *     Align1 and Align16."
        */
       ERROR_IF(exec_size > 8 && dst_is_packed &&
-               dst_type == BRW_REGISTER_TYPE_HF,
+               dst_type == BRW_REGISTER_TYPE_HF &&
+               opcode != BRW_OPCODE_MOV,
                "Align1 mixed float mode is limited to SIMD8 when destination "
                "is packed half-float");
 
diff --git a/src/intel/compiler/brw_fs_lower_simd_width.cpp b/src/intel/compiler/brw_fs_lower_simd_width.cpp
index 4ea06b00e8b..87b7a1b11a6 100644
--- a/src/intel/compiler/brw_fs_lower_simd_width.cpp
+++ b/src/intel/compiler/brw_fs_lower_simd_width.cpp
@@ -113,34 +113,27 @@ get_fpu_lowered_simd_width(const fs_visitor *shader,
    if (inst->is_3src(compiler) && !devinfo->supports_simd16_3src)
       max_width = MIN2(max_width, inst->exec_size / reg_count);
 
-   /* From the SKL PRM, Special Restrictions for Handling Mixed Mode
-    * Float Operations:
-    *
-    *    "No SIMD16 in mixed mode when destination is f32. Instruction
-    *     execution size must be no more than 8."
-    *
-    * FIXME: the simulator doesn't seem to complain if we don't do this and
-    * empirical testing with existing CTS tests show that they pass just fine
-    * without implementing this, however, since our interpretation of the PRM
-    * is that conversion MOVs between HF and F are still mixed-float
-    * instructions (and therefore subject to this restriction) we decided to
-    * split them to be safe. Might be useful to do additional investigation to
-    * lift the restriction if we can ensure that it is safe though, since these
-    * conversions are common when half-float types are involved since many
-    * instructions do not support HF types and conversions from/to F are
-    * required.
-    */
-   if (is_mixed_float_with_fp32_dst(inst) && devinfo->ver < 20)
-      max_width = MIN2(max_width, 8);
+   if (inst->opcode != BRW_OPCODE_MOV) {
+      /* From the SKL PRM, Special Restrictions for Handling Mixed Mode
+       * Float Operations:
+       *
+       *    "No SIMD16 in mixed mode when destination is f32. Instruction
+       *     execution size must be no more than 8."
+       *
+       * Testing indicates that this restriction does not apply to MOVs.
+       */
+      if (is_mixed_float_with_fp32_dst(inst) && devinfo->ver < 20)
+         max_width = MIN2(max_width, 8);
 
-   /* From the SKL PRM, Special Restrictions for Handling Mixed Mode
-    * Float Operations:
-    *
-    *    "No SIMD16 in mixed mode when destination is packed f16 for both
-    *     Align1 and Align16."
-    */
-   if (is_mixed_float_with_packed_fp16_dst(inst) && devinfo->ver < 20)
-      max_width = MIN2(max_width, 8);
+      /* From the SKL PRM, Special Restrictions for Handling Mixed Mode
+       * Float Operations:
+       *
+       *    "No SIMD16 in mixed mode when destination is packed f16 for both
+       *     Align1 and Align16."
+       */
+      if (is_mixed_float_with_packed_fp16_dst(inst) && devinfo->ver < 20)
+         max_width = MIN2(max_width, 8);
+   }
 
    /* Only power-of-two execution sizes are representable in the instruction
     * control fields.