intel/brw: Temporarily disable result=float16 matrix configs

Even though the hardware does not natively support these configurations, there are many potential benefits to advertising them. These configurations can theoretically use half the memory bandwidth for loads and stores. For large matrices, that can be the limiting factor in performance.

The current implementation, however, has a number of significant problems.

The conversion from float16 to float32 is performed in the driver during conversion from NIR. As a result, many common usage patterns end up doing back-to-back conversions to and from float16 between matrix multiplications (when the result of one multiplication is used as the accumulator for the next).

The float16 version of the matrix wastes half the possible register space. Each float16 value sits alone in a dword. This is done so that the per-invocation slice of an 8x8 float16 result matrix and an 8x8 float32 result matrix will have the same number of elements. This makes it possible to do straightforward implementations of all the unary_op type conversions in NIR.

It would be possible to perform N:M element type conversions in the backend using specialized NIR intrinsics. However, per #10961, this would be very, very painful. My hope is that, once a suitable resolution for that issue can be found, support for these configs can be restored.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28834>
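
As a rough illustration of the register-space point in the message above, the sketch below compares the footprint of an 8x8 result matrix when every element occupies its own dword against a tightly packed layout. The helper names (`padded_bytes`, `packed_bytes`, `wasted_bytes`) are hypothetical and do not exist in Mesa, and the arithmetic counts the whole matrix rather than a per-invocation slice.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration only; none of these names exist in Mesa.
constexpr std::size_t kDwordBytes = 4;

// Register bytes used when each element sits alone in a dword.
constexpr std::size_t padded_bytes(std::size_t elems)
{
   return elems * kDwordBytes;
}

// Register bytes needed if elements of the given size were tightly packed.
constexpr std::size_t packed_bytes(std::size_t elems, std::size_t elem_bytes)
{
   return elems * elem_bytes;
}

// Bytes lost to padding for elements of the given size.
constexpr std::size_t wasted_bytes(std::size_t elems, std::size_t elem_bytes)
{
   return padded_bytes(elems) - packed_bytes(elems, elem_bytes);
}

// An 8x8 matrix has 64 elements. float32 elements fill their dwords
// exactly, while float16 elements leave half of each dword unused.
static_assert(wasted_bytes(8 * 8, 4) == 0, "float32 fills each dword");
static_assert(wasted_bytes(8 * 8, 2) * 2 == padded_bytes(8 * 8),
              "float16-in-dword wastes half the space");
```

Under these assumptions, the float16 layout occupies exactly as many bytes as the float32 layout, which is what keeps the element counts equal and is also why half of the space goes unused.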
2024-04-04 15:12:19 -07:00
parent 33dd38f9d5
commit ea6e10c0b2
3 changed files with 23 additions and 91 deletions
@@ -4348,35 +4348,10 @@ fs_nir_emit_cs_intrinsic(nir_to_brw_state &ntb,

      dest = retype(dest, dest_type);
      fs_reg src0 = retype(get_nir_src(ntb, instr->src[0]), dest_type);
-      const fs_reg dest_hf = dest;

      fs_builder bld16 = bld.exec_all().group(16, 0);
      fs_builder bldn = devinfo->ver >= 20 ? bld16 : bld.exec_all().group(8, 0);

-      /* DG2 cannot have the destination or source 0 of DPAS be float16. It is
-       * still advantageous to support these formats for memory and bandwidth
-       * savings.
-       *
-       * The float16 source must be expanded to float32.
-       */
-      if (devinfo->verx10 == 125 && dest_type == BRW_TYPE_HF &&
-          !s.compiler->lower_dpas) {
-         dest = bldn.vgrf(BRW_TYPE_F, rcount);
-
-         if (src0.file != ARF) {
-            const fs_reg src0_hf = src0;
-
-            src0 = bldn.vgrf(BRW_TYPE_F, rcount);
-
-            for (unsigned i = 0; i < 4; i++) {
-               bld16.MOV(byte_offset(src0, REG_SIZE * i * 2),
-                         byte_offset(src0_hf, REG_SIZE * i));
-            }
-         } else {
-            src0 = retype(src0, BRW_TYPE_F);
-         }
-      }
-
      bldn.DPAS(dest,
                src0,
                retype(get_nir_src(ntb, instr->src[2]), src_type),
@@ -4385,14 +4360,6 @@ fs_nir_emit_cs_intrinsic(nir_to_brw_state &ntb,
                rcount)
         ->saturate = nir_intrinsic_saturate(instr);

-      /* Compact the destination to float16 (from float32). */
-      if (!dest.equals(dest_hf)) {
-         for (unsigned i = 0; i < 4; i++) {
-            bld16.MOV(byte_offset(dest_hf, REG_SIZE * i),
-                      byte_offset(dest, REG_SIZE * i * 2));
-         }
-      }
-
      cs_prog_data->uses_systolic = true;
      break;
   }