From cd70e49394ee9110e3461dfe866afc63c2efc953 Mon Sep 17 00:00:00 2001
From: Ian Romanick
Date: Tue, 17 Oct 2023 09:48:38 -0700
Subject: [PATCH] intel/brw: Allow SIMD16 F and HF type conversion moves

On DG2, the lowering generated for these MOV instructions is
**awful**. The original SIMD16 MOV

    { 18} 67: mov(16) vgrf54+0.0:HF, vgrf46+0.0:F NoMask group0

is lowered to SIMD8 MOVs:

    { 18} 118: mov(8) vgrf54+0.0:HF, vgrf46+0.0:F NoMask group0
    { 18} 119: mov(8) vgrf54+0.16:HF, vgrf46+1.0:F NoMask group8

These MOVs violate Gfx12.5 region restrictions, so these are further
lowered:

    { 17} 119: mov(8) vgrf83<2>:HF, vgrf46+0.0:F NoMask group0
    { 19} 120: mov(8) vgrf54+0.0:UW, vgrf83<2>:UW NoMask group0
    { 19} 122: mov(8) vgrf84<2>:HF, vgrf46+1.0:F NoMask group8
    { 19} 123: mov(8) vgrf54+0.16:UW, vgrf84<2>:UW NoMask group8

The shader-db and fossil-db results are nothing to get excited
about. However, the effect on vk_cooperative_matrix_perf is
substantial. In one subtest

    shader: shaders/shmemfp16.spv
    cooperativeMatrixProps = 8x8x16 A = float16_t B = float16_t C = float16_t D = float16_t scope = subgroup
    TILE_M=128 TILE_N=128, TILE_K=32 BLayout=0

performance on my DG2 improved by ~60% due to a MASSIVE reduction in
spills and fills:

-Native code for unnamed compute shader (null) (src_hash 0x00000000) (sha1 c6a41b1c4e7aa2da327a39a70ed36c822a4b172f)
-SIMD32 shader: 32484 instructions. 1 loops. 1893868 cycles. 737:1820 spills:fills, 442 sends, scheduled with mode none. Promoted 1 constants. Compacted 519744 to 492224 bytes (5%)
-   START B0 (20782 cycles)
+Native code for unnamed compute shader (null) (src_hash 0x00000000) (sha1 621e960daad5b5579b176717f24a315e7ea560a1)
+SIMD32 shader: 23918 instructions. 1 loops. 1089894 cycles. 432:1166 spills:fills, 442 sends, scheduled with mode none. Promoted 1 constants. Compacted 382688 to 353232 bytes (8%)

shader-db:

All Gfx9 and later platforms had similar results.
(Meteor Lake shown)
total instructions in shared programs: 19656270 -> 19653981 (-0.01%)
instructions in affected programs: 61810 -> 59521 (-3.70%)
helped: 116 / HURT: 0

total cycles in shared programs: 823368888 -> 823375854 (<.01%)
cycles in affected programs: 1165284 -> 1172250 (0.60%)
helped: 51 / HURT: 57

fossil-db:

DG2 and Meteor Lake had similar results. (Meteor Lake shown)
*** Shaders only in 'before' results are ignored:
fossil-db/steam-dxvk/total_war_warhammer3/2a3ed2ca632a7cb7/fs.32, fossil-db/steam-dxvk/total_war_warhammer3/18b9d4a3b1961616/fs.32, fossil-db/steam-dxvk/total_war_warhammer3/04ac9f3146a6db19/fs.32, fossil-db/steam-dxvk/total_war_warhammer3/f37ebec6aa1b379a/fs.32, fossil-db/steam-dxvk/total_war_warhammer3/255c987feb0d4310/fs.32, and 25 more from 1 apps: fossil-db/steam-dxvk/total_war_warhammer3
Totals:
Instrs: 160946537 -> 160928389 (-0.01%); split: -0.01%, +0.00%
Cycles: 14125908620 -> 14125873958 (-0.00%); split: -0.00%, +0.00%

Totals from 1002 (0.15% of 652134) affected shaders:
Instrs: 411261 -> 393113 (-4.41%); split: -4.41%, +0.00%
Cycles: 16676735 -> 16642073 (-0.21%); split: -0.48%, +0.27%

Tiger Lake
Totals:
Instrs: 164511816 -> 164497202 (-0.01%); split: -0.01%, +0.00%
Cycles: 13801675722 -> 13801629397 (-0.00%); split: -0.00%, +0.00%
Subgroup size: 7955168 -> 7955152 (-0.00%)
Send messages: 8544494 -> 8544486 (-0.00%)

Totals from 997 (0.15% of 651454) affected shaders:
Instrs: 460820 -> 446206 (-3.17%); split: -3.17%, +0.00%
Cycles: 16265514 -> 16219189 (-0.28%); split: -0.84%, +0.56%
Subgroup size: 17552 -> 17536 (-0.09%)
Send messages: 26045 -> 26037 (-0.03%)

Ice Lake
Totals:
Instrs: 165504747 -> 165489970 (-0.01%); split: -0.01%, +0.00%
Cycles: 15145244554 -> 15145149627 (-0.00%); split: -0.00%, +0.00%
Subgroup size: 8107032 -> 8107016 (-0.00%)
Send messages: 8598680 -> 8598672 (-0.00%)
Spill count: 45427 -> 45423 (-0.01%)
Fill count: 74749 -> 74747 (-0.00%)

Totals from 1125 (0.17% of 656115) affected shaders:
Instrs: 521676 -> 506899 (-2.83%); split: -2.83%, +0.00%
Cycles: 19555434 -> 19460507 (-0.49%); split: -0.59%, +0.10%
Subgroup size: 21616 -> 21600 (-0.07%)
Send messages: 28623 -> 28615 (-0.03%)
Spill count: 603 -> 599 (-0.66%)
Fill count: 1362 -> 1360 (-0.15%)

Skylake
*** Shaders only in 'after' results are ignored:
fossil-db/steam-native/red_dead_redemption2/cef460b80bad8485/fs.16, fossil-db/steam-native/red_dead_redemption2/cd5fe081e2e5529d/fs.16 from 1 apps: fossil-db/steam-native/red_dead_redemption2
Totals:
Instrs: 141607617 -> 141593776 (-0.01%); split: -0.01%, +0.00%
Cycles: 14257812441 -> 14257661671 (-0.00%); split: -0.00%, +0.00%
Subgroup size: 7743752 -> 7743736 (-0.00%)
Send messages: 7552728 -> 7552720 (-0.00%)
Spill count: 43660 -> 43661 (+0.00%)
Fill count: 71301 -> 71303 (+0.00%)

Totals from 1017 (0.16% of 636964) affected shaders:
Instrs: 392454 -> 378613 (-3.53%); split: -3.53%, +0.00%
Cycles: 16622974 -> 16472204 (-0.91%); split: -1.04%, +0.13%
Subgroup size: 19840 -> 19824 (-0.08%)
Send messages: 23021 -> 23013 (-0.03%)
Spill count: 484 -> 485 (+0.21%)
Fill count: 1155 -> 1157 (+0.17%)

Reviewed-by: Lionel Landwerlin
Part-of:
---
 src/intel/compiler/brw_eu_validate.c          |  6 ++-
 .../compiler/brw_fs_lower_simd_width.cpp      | 47 ++++++++-----------
 2 files changed, 24 insertions(+), 29 deletions(-)

diff --git a/src/intel/compiler/brw_eu_validate.c b/src/intel/compiler/brw_eu_validate.c
index ff235673b7d..f7ad8ce066e 100644
--- a/src/intel/compiler/brw_eu_validate.c
+++ b/src/intel/compiler/brw_eu_validate.c
@@ -1137,7 +1137,8 @@ special_restrictions_for_mixed_float_mode(const struct brw_isa_info *isa,
     *    "No SIMD16 in mixed mode when destination is f32. Instruction
     *     execution size must be no more than 8."
     */
-   ERROR_IF(exec_size > 8 && dst_type == BRW_REGISTER_TYPE_F,
+   ERROR_IF(exec_size > 8 && dst_type == BRW_REGISTER_TYPE_F &&
+            opcode != BRW_OPCODE_MOV,
             "Mixed float mode with 32-bit float destination is limited "
             "to SIMD8");

@@ -1212,7 +1213,8 @@ special_restrictions_for_mixed_float_mode(const struct brw_isa_info *isa,
        *     Align1 and Align16."
        */
       ERROR_IF(exec_size > 8 && dst_is_packed &&
-               dst_type == BRW_REGISTER_TYPE_HF,
+               dst_type == BRW_REGISTER_TYPE_HF &&
+               opcode != BRW_OPCODE_MOV,
                "Align1 mixed float mode is limited to SIMD8 when destination "
                "is packed half-float");

diff --git a/src/intel/compiler/brw_fs_lower_simd_width.cpp b/src/intel/compiler/brw_fs_lower_simd_width.cpp
index 4ea06b00e8b..87b7a1b11a6 100644
--- a/src/intel/compiler/brw_fs_lower_simd_width.cpp
+++ b/src/intel/compiler/brw_fs_lower_simd_width.cpp
@@ -113,34 +113,27 @@ get_fpu_lowered_simd_width(const fs_visitor *shader,
    if (inst->is_3src(compiler) && !devinfo->supports_simd16_3src)
       max_width = MIN2(max_width, inst->exec_size / reg_count);

-   /* From the SKL PRM, Special Restrictions for Handling Mixed Mode
-    * Float Operations:
-    *
-    *    "No SIMD16 in mixed mode when destination is f32. Instruction
-    *     execution size must be no more than 8."
-    *
-    * FIXME: the simulator doesn't seem to complain if we don't do this and
-    * empirical testing with existing CTS tests show that they pass just fine
-    * without implementing this, however, since our interpretation of the PRM
-    * is that conversion MOVs between HF and F are still mixed-float
-    * instructions (and therefore subject to this restriction) we decided to
-    * split them to be safe. Might be useful to do additional investigation to
-    * lift the restriction if we can ensure that it is safe though, since these
-    * conversions are common when half-float types are involved since many
-    * instructions do not support HF types and conversions from/to F are
-    * required.
-    */
-   if (is_mixed_float_with_fp32_dst(inst) && devinfo->ver < 20)
-      max_width = MIN2(max_width, 8);
+   if (inst->opcode != BRW_OPCODE_MOV) {
+      /* From the SKL PRM, Special Restrictions for Handling Mixed Mode
+       * Float Operations:
+       *
+       *    "No SIMD16 in mixed mode when destination is f32. Instruction
+       *     execution size must be no more than 8."
+       *
+       * Testing indicates that this restriction does not apply to MOVs.
+       */
+      if (is_mixed_float_with_fp32_dst(inst) && devinfo->ver < 20)
+         max_width = MIN2(max_width, 8);

-   /* From the SKL PRM, Special Restrictions for Handling Mixed Mode
-    * Float Operations:
-    *
-    *    "No SIMD16 in mixed mode when destination is packed f16 for both
-    *     Align1 and Align16."
-    */
-   if (is_mixed_float_with_packed_fp16_dst(inst) && devinfo->ver < 20)
-      max_width = MIN2(max_width, 8);
+      /* From the SKL PRM, Special Restrictions for Handling Mixed Mode
+       * Float Operations:
+       *
+       *    "No SIMD16 in mixed mode when destination is packed f16 for both
+       *     Align1 and Align16."
+       */
+      if (is_mixed_float_with_packed_fp16_dst(inst) && devinfo->ver < 20)
+         max_width = MIN2(max_width, 8);
+   }

    /* Only power-of-two execution sizes are representable in the instruction
     * control fields.