Previously OUT_BATCH was just a macro around an inline function which
does
brw->batch.map[brw->batch.used++] = dword;
When making consecutive calls to intel_batchbuffer_emit_dword() the
compiler isn't able to recognize that we're writing consecutive memory
locations or that it doesn't need to write batch.used back to memory
each time.
We can avoid both of these problems by making a local pointer to the
next location in the batch in BEGIN_BATCH().
Cuts 18k from the .text size.
text data bss dec hex filename
4946956 195152 26192 5168300 4edcac i965_dri.so before
4928956 195152 26192 5150300 4e965c i965_dri.so after
This series (including commit c0433948) improves performance of Synmark
OglBatch7 by 8.01389% +/- 0.63922% (n=83) on Ivybridge.
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
So that everything writing to the batch between BEGIN_BATCH() and
ADVANCE_BATCH() goes through OUT_BATCH.
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
BEGIN_BATCH() and ADVANCE_BATCH() will contain "do {" and "} while (0)"
respectively to allow declaring local variables used by intervening
OUT_BATCH macros. As such, BEGIN_BATCH() and ADVANCE_BATCH() need to be
in the same control flow.
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
This field should always be set for gen8. In the bdw PRM, Volume 2d:
Command Reference: Structures under INTERFACE_DESCRIPTOR_DATA, DWORD
6, Bits 9:0, Number of Threads in GPGPU Thread Group:
"This field should not be set to 0 even if the barrier is disabled,
since an accurate value is needed for proper pre-emption."
In the HSW PRM, the it doesn't mention that it must always be set, but
it should not hurt.
Reported-by: Kristian Høgsberg <krh@bitplanet.net>
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Reviewed-by: Ben Widawsky <ben@bwidawsk.net>
I needed to rewrite this a bit for safety checking in the next commit.
Despite being a static inline of the same thing that was being done, we
lose 36 bytes of code for some reason.
Now that RCL generation is in the kernel, we don't have any other
callers. Oddly, the compiler generates another 8 bytes of code for
this, but the simplification is worth it.
Now that we don't resize the CL as we build (it's set up at the top by
vc4_start_draw()), we can store the pointers instead of offsets from
the base. Saves a bit of math in emitting relocs (about 60 bytes of
code).
vkDestroyColorAttachmentView
vkDestroyDepthStencilView
These functions are not in the 0.132 header, but adding them will help
us attain the type-safety API updates more quickly.