Sadly I don't see an obvious way to use it for int8 matrices, so the
code is a bit of a mess right now.
It allows us to vectorize loads/stores more often, as we can simply
transpose row/col-major matrices when needed.
The movm optimization is also only enabled for 16-bit types, even
though we _could_ do it for 32-bit types. It's not clear yet whether
using it for 32-bit types is an overall advantage.
Reviewed-by: Mel Henning <mhenning@darkrefraction.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37998>