We need to emit two 32-bit load messages to load a full dvec4. If only
one or two double components are needed, dead code elimination will
remove the second message.
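
The message count follows from simple size arithmetic. Below is a minimal sketch, assuming each 32-bit load message returns one 16-byte vec4 per vertex; `loads_needed` is a hypothetical helper for illustration, not a function in the driver.

```cpp
#include <cassert>

// Hypothetical helper, not driver code: number of 32-bit vec4 load
// messages needed to cover `components` double-precision components.
// Assumption: one message returns 16 bytes (four 32-bit channels)
// per vertex, and a double occupies 8 bytes.
constexpr unsigned loads_needed(unsigned components)
{
    constexpr unsigned bytes_per_double = 8;
    constexpr unsigned bytes_per_message = 16;
    // Round up to a whole number of messages.
    return (components * bytes_per_double + bytes_per_message - 1) /
           bytes_per_message;
}

// One or two doubles fit in the first message, so the second is dead.
static_assert(loads_needed(1) == 1, "a double fits in one message");
static_assert(loads_needed(2) == 1, "a dvec2 fits in one message");
// A full dvec4 is 32 bytes and needs both messages.
static_assert(loads_needed(4) == 2, "a dvec4 needs two messages");
```
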
We also need to shuffle the results of the two 32-bit messages to form
valid 64-bit SIMD4x2 data.
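
The shuffle can be pictured with a small host-side model. This is only a sketch under stated assumptions, not the driver's implementation: it assumes each 32-bit message returns two doubles for vertex 0 in its lower half and two doubles for vertex 1 in its upper half, and that the 64-bit SIMD4x2 layout wants one full dvec4 per register. `Reg`, `Dvec4Pair` and `shuffle_model` are hypothetical names.

```cpp
#include <array>
#include <cstdint>

// One 32-byte register modeled as four 64-bit slots (hypothetical type).
using Reg = std::array<std::uint64_t, 4>;

// Assumed 64-bit SIMD4x2 layout: one register per vertex, each holding
// that vertex's x, y, z, w doubles.
struct Dvec4Pair {
    Reg vertex0;
    Reg vertex1;
};

// Illustrative model only. Assumed input layout:
//   load0 = { v0.x, v0.y, v1.x, v1.y }   (first 32-bit message)
//   load1 = { v0.z, v0.w, v1.z, v1.w }   (second 32-bit message)
inline Dvec4Pair shuffle_model(const Reg &load0, const Reg &load1)
{
    // Gather each vertex's xy from the first load and zw from the second.
    Reg v0 = { load0[0], load0[1], load1[0], load1[1] };
    Reg v1 = { load0[2], load0[3], load1[2], load1[3] };
    return Dvec4Pair{ v0, v1 };
}
```

In this model, dropping the second message leaves only the xy components of each vertex defined, which matches the case where one or two double components are used.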
v2:
- use byte_offset() instead of offset() (Iago)
- keep the constant offset as an immediate, like the original code did (Juan)
Reviewed-by: Matt Turner <mattst88@gmail.com>