intel/brw: Temporarily disable result=float16 matrix configs

Even though the hardware does not natively support these configurations,
there are many potential benefits to advertising them. These
configurations can theoretically use half the memory bandwidth for loads
and stores. For large matrices, that can be the limiting factor in
performance.

The current implementation, however, has a number of significant
problems.

The conversion from float16 to float32 is performed in the driver during
conversion from NIR. As a result, many common usage patterns end up
doing back-to-back conversions to and from float16 between matrix
multiplications (when the result of one multiplication is used as the
accumulator for the next).
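
As a rough illustration of the cost, the sketch below counts conversion
passes for a chain of multiplications where each result feeds the next
accumulator. It is a hypothetical model, not Mesa code: the names are
invented here, and it assumes every multiplication in the chain has a
non-null float16 accumulator, so each one expands its accumulator to
float32 and compacts its result back to float16.

```cpp
#include <cassert>

// Hypothetical model of the conversion traffic; none of these names
// exist in Mesa.
struct ConversionCount {
   unsigned expand;   // float16 -> float32 passes
   unsigned compact;  // float32 -> float16 passes
};

// Conversions when every multiplication expands its accumulator and
// compacts its result, as happens when the conversion is emitted per
// multiplication during translation from NIR.
constexpr ConversionCount per_multiply_conversions(unsigned chain_length)
{
   return ConversionCount{chain_length, chain_length};
}

// Idealized lower bound: expand once before the chain and compact once
// after it, keeping the intermediate accumulators in float32.
constexpr ConversionCount hoisted_conversions(unsigned chain_length)
{
   return chain_length == 0 ? ConversionCount{0, 0} : ConversionCount{1, 1};
}

// Number of compact-then-expand pairs that sit back-to-back between
// consecutive multiplications in the chain.
constexpr unsigned redundant_pairs(unsigned chain_length)
{
   return chain_length == 0 ? 0 : chain_length - 1;
}
```

Under these assumptions, a chain of four multiplications performs four
expansions and four compactions, three pairs of which are back-to-back
round trips through float16.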

The float16 version of the matrix wastes half the possible register
space. Each float16 value sits alone in a dword. This is done so that
the per-invocation slice of an 8x8 float16 result matrix and an 8x8
float32 result matrix will have the same number of elements. This makes
it possible to do straightforward implementations of all the unary_op
type conversions in NIR.
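
The register cost of that layout can be sketched with a little
arithmetic. The helper names below are hypothetical, and the figure of 8
elements per invocation is an assumption (an 8x8 matrix spread over 8
invocations), used only to make the numbers concrete.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration; not part of Mesa.

// Bytes used by one invocation's slice when every float16 element sits
// alone in a 32-bit dword (the layout described above).
constexpr std::size_t padded_hf_bytes(std::size_t elements)
{
   return elements * 4;
}

// Bytes the same slice would need if two float16 values shared a dword.
constexpr std::size_t packed_hf_bytes(std::size_t elements)
{
   return elements * 2;
}

// Bytes used by a float32 slice with the same element count.
constexpr std::size_t f32_bytes(std::size_t elements)
{
   return elements * 4;
}

// The padded float16 slice matches the float32 slice in size, which is
// what keeps the element counts equal between the two result types.
static_assert(padded_hf_bytes(8) == f32_bytes(8),
              "padded float16 and float32 slices have the same size");

// A packed layout would need only half of that register space.
static_assert(packed_hf_bytes(8) * 2 == padded_hf_bytes(8),
              "padding doubles the register footprint of float16");
```

The same 2:1 ratio holds for any element count, since the padding is
applied per element.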

It would be possible to perform N:M element type conversions in the
backend using specialized NIR intrinsics. However, per #10961, this
would be very, very painful. My hope is that, once a suitable resolution
for that issue can be found, support for these configs can be restored.

Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28834>
Ian Romanick
2024-04-04 15:12:19 -07:00
parent 33dd38f9d5
commit ea6e10c0b2
3 changed files with 23 additions and 91 deletions
@@ -4348,35 +4348,10 @@ fs_nir_emit_cs_intrinsic(nir_to_brw_state &ntb,
dest = retype(dest, dest_type);
fs_reg src0 = retype(get_nir_src(ntb, instr->src[0]), dest_type);
const fs_reg dest_hf = dest;
fs_builder bld16 = bld.exec_all().group(16, 0);
fs_builder bldn = devinfo->ver >= 20 ? bld16 : bld.exec_all().group(8, 0);
/* DG2 cannot have the destination or source 0 of DPAS be float16. It is
* still advantageous to support these formats for memory and bandwidth
* savings.
*
* The float16 source must be expanded to float32.
*/
if (devinfo->verx10 == 125 && dest_type == BRW_TYPE_HF &&
!s.compiler->lower_dpas) {
dest = bldn.vgrf(BRW_TYPE_F, rcount);
if (src0.file != ARF) {
const fs_reg src0_hf = src0;
src0 = bldn.vgrf(BRW_TYPE_F, rcount);
for (unsigned i = 0; i < 4; i++) {
bld16.MOV(byte_offset(src0, REG_SIZE * i * 2),
byte_offset(src0_hf, REG_SIZE * i));
}
} else {
src0 = retype(src0, BRW_TYPE_F);
}
}
bldn.DPAS(dest,
src0,
retype(get_nir_src(ntb, instr->src[2]), src_type),
@@ -4385,14 +4360,6 @@ fs_nir_emit_cs_intrinsic(nir_to_brw_state &ntb,
rcount)
->saturate = nir_intrinsic_saturate(instr);
/* Compact the destination to float16 (from float32). */
if (!dest.equals(dest_hf)) {
for (unsigned i = 0; i < 4; i++) {
bld16.MOV(byte_offset(dest_hf, REG_SIZE * i),
byte_offset(dest, REG_SIZE * i * 2));
}
}
cs_prog_data->uses_systolic = true;
break;
}