Age | Commit message (Collapse) | Author |
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
References: https://bugs.freedesktop.org/show_bug.cgi?id=47597
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Do as told; both the LRI and WAIT_FOR_EVENT need to be in the same
cacheline for an unspecified reason.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
So that in the cache where we are driving multiple independent screens
each having their own device, we do not share the global reserved
request in the event of an allocation failure.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
In the case where the kernel is inserting semaphores to serialise work
between rings, we want to only delay the surface that is coming from the
other ring and not interfere with work already queued.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Avoid offsetting the overhead of the render copy only to be penalised by
the overhead of the semaphore. So compromise.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
The exported function is not used, so mark it static and strengthen the
assertions.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Avoid having to walk the full relocation array for the few entries that
need to be updated for the batch buffer offset.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Having reduced all the vb code for these generations to the same set of
routines, we can refactor them into a single set of functions.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
So it appears that we end up performing a context switch on an empty
batch, but already has a mode. This is caught later, too late, by
assertions. However, we can change the guards slightly to prevent those
assertions without altering the code too greatly. And I can then think
how to detect where we are setting a mode on the batch but doing no
work - which is likely masking a bigger bug.
Reported-by: Jiri Slaby <jirislaby@gmail.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=47597
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Keeping a set of pinned batches in userspace is considerably faster as
we can avoid the blit overhead. However, combining the two approaches
yields even greater performance, as fast as without either w/a, and yet
stable.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
This is to make it easier to extend in future.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Track the most recent ring each bo is executed on, and prefer to keep it
on that ring for the next operation.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
As we may write preparatory instructions into the batch before checking
for a flush.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=26345
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Oops, I thought the 'busy' bit was now used and apparently forgot it is
used to control the periodic flushing...
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Further experimentation...
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
The aim is to improve GPU concurrency by keeping it busy. The possible
complication is that we incur more overhead due to small batches.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Previously, before every operation we would look to see if the GPU was
idle and we were running under a DRI compositor. If the GPU was idle, we
would flush the batch in the hope that we reduce the cost of the context
switch and copy from the compositor (by completing the work earlier).
However, we would complete the work far too earlier and as a result
would need to flush the batch before every single operation resulting in
extra overhead and reduced performance. For example, the gtkperf
circles benchmark under gnome-shell/compiz would be 2x slower on
Ivybridge.
Reported-by: Michael Larabel <michael@phoronix.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
E.g. that BLT can always write to cacheable memory, inflexible fences
are a thing of the past, etc.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
A few callers of kgem_bo_is_mappable operate on freed bo, and so need to
avoid the assert(bo->refcnt).
References: https://bugs.freedesktop.org/show_bug.cgi?id=47597
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
References: https://bugs.freedesktop.org/show_bug.cgi?id=47597
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
x11perf -copywinwin10 on gm45 with c2d L9400:
before: 553,000 op/s
after: 565,000 op/s
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Reported-by: Jiri Slaby <jirislaby@gmail.com>
References: https://bugs.freedesktop.org/show_bug.cgi?id=47597
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Reported-by: Ognian Tenchev <drJeckyll@Jeckyll.net>
References: https://bugs.freedesktop.org/show_bug.cgi?id=56324
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
The only path where this is correct already handles it as the special
case that it is, everywhere else it just nonsense.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Based on a suggestion by Chad Versace (taken from a patch for mesa).
This allows for a faster upload of pixel data through a ShmImage, or for
complete replacement of a GPU bo.
Using a modified version of x11perf to upload to a pixmap rather than
scanout on an IVB i7-3720qm:
Before:
40000000 trep @ 0.0007 msec (1410000.0/sec): ShmPutImage 10x10 square
4000000 trep @ 0.0110 msec ( 90700.0/sec): ShmPutImage 100x100 square
160000 trep @ 0.1689 msec ( 5920.0/sec): ShmPutImage 500x500 square
After:
40000000 trep @ 0.0007 msec (1450000.0/sec): ShmPutImage 10x10 square
6000000 trep @ 0.0061 msec ( 164000.0/sec): ShmPutImage 100x100 square
400000 trep @ 0.1126 msec ( 8880.0/sec): ShmPutImage 500x500 square
However, the real takeaway from this is that the overheads for
ShmPutImage are substantial, only hitting around 70% expected efficiency,
and overshadowed by PutImage, which for reference is
60000000 trep @ 0.0006 msec (1800000.0/sec): PutImage 10x10 square
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
As we now regularly retire and so discard the temporary large buffers,
we find them in short supply and ourselves wasting lots of time creating
and destroying the transient buffers.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
As we know that such operations are likely to be slow and consume
precious GTT space, mark them as candidates for flushing.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
The DRI2 code tries to create pixmaps with non-zero width/height,
whoops.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
Preliminary prime support.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
The subtly is that we need to reset the mode correctly after
submitting the batch which was not handled by kgem_flush(). If we fail
to set the appropriate mode then the next operation will be on a random
ring, which can prove fatal with SandyBridge+.
Reported-by: Reinis Danne <reinis.danne@gmail.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|
|
In order to properly track when the GPU is idle, we need to account for
the completion order that may differ on architectures like SandyBridge
with multiple mostly independent rings.
Reported-by: Clemens Eisserer <linuxhippy@gmail.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=54127
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
|