I’m writing this article from a plane flying from Montréal to Helsinki, on my way back from the X.Org Developer’s Conference. This is the second time I’ve attended XDC; the first was last year in Spain.
This year’s conference has been pretty interesting. I’ve met a lot of people I’ve been working with throughout the year, from a wide range of organizations and projects. That was pretty cool! I’ll try to summarize the main discussions we had.
A while ago, I started working on a new project called libliftoff. Discussing it was the highlight of the conference for me, so let me break it down in detail. What problem does libliftoff try to solve?
What is libliftoff?
Compositors are, among other things, responsible for taking buffers from clients, drawing them together on a single buffer, and displaying that buffer on screen. I currently have my text editor and a terminal open; the compositor will copy these two window buffers over to the screen’s buffer. The copy is usually performed using OpenGL.
However, copying buffers can be pretty wasteful. It often requires alpha-blending, conversion between different formats, keeping the render engine[1] awake and a lot of other things. Copying takes some time (increasing latency) and drains your battery.
To improve performance, many GPUs come with a feature called hardware planes. Planes can make the display engine[2] perform the composition. This is called direct scan-out and allows the compositor to avoid copying entirely.
On Android, hardware planes are used extensively via a piece of software called the Hardware Composer. Many Wayland compositors use cursor planes, but more general usage of planes is very rare. Weston is one of the few compositors that make partial use of planes.
Using planes is not straightforward. Planes come with an opaque set of restrictions; it’s not always possible to put a buffer into a plane. The restrictions are hardware-specific and can often be a little bit surprising. For instance, with some buffer formats, Intel hardware can only position planes at even coordinates. Some buffers are allocated in memory that cannot be scanned out, so they can’t be put into a plane at all. Also, display hardware generally has bandwidth limitations, and using too many planes with large buffers can fail. On some ARM hardware, some planes can’t overlap.
Today, compositors have no way to query these limitations. Designing an API to query hardware constraints is difficult, because they are very different from one piece of hardware to another – and new GPUs may have even weirder limitations. The only way to make sure a given configuration works is to try it.
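To make the "just try it" model concrete, here is a toy sketch in Python. Nothing in it is a real API: `Buffer`, `Placement` and `try_commit` are made-up names, the limits are invented, and the three checks are simplified stand-ins for the kinds of restrictions listed above. The point is only that the compositor gets a yes/no answer and never a reason.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buffer:
    width: int
    height: int
    fmt: str                      # illustrative format name
    scanout_capable: bool = True  # False: memory the display engine can't read


@dataclass(frozen=True)
class Placement:
    buffer: Buffer
    x: int
    y: int


# Hypothetical limits of an imaginary display engine (assumptions, not real data)
EVEN_COORD_FORMATS = {"NV12"}
MAX_PIXELS = 1920 * 1080 * 2  # crude stand-in for a bandwidth budget


def try_commit(placements):
    """Toy stand-in for a test-only commit: answers True or False, nothing more."""
    total_pixels = 0
    for p in placements:
        if not p.buffer.scanout_capable:
            return False
        if p.buffer.fmt in EVEN_COORD_FORMATS and (p.x % 2 or p.y % 2):
            return False
        total_pixels += p.buffer.width * p.buffer.height
    return total_pixels <= MAX_PIXELS


video = Buffer(1280, 720, "NV12")
print(try_commit([Placement(video, 10, 10)]))  # True
print(try_commit([Placement(video, 11, 10)]))  # False: odd x coordinate
```

From the compositor's point of view, the second call fails for no visible reason: moving the window by one pixel is enough to make a working configuration stop working.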
For this reason, designing and implementing an algorithm to use planes effectively can become pretty complicated. I’ve been wondering if sharing the code to use planes between compositors could help, and began designing libliftoff for this purpose.
I organized a workshop at XDC to discuss libliftoff. The goal was to get both compositor writers and driver experts in the same room and figure out how to make these planes useful for compositors.
I was amazed by how well the libliftoff project was received. Sitting around the table were:
- On the compositor side, the wlroots gang (Drew DeVault, Scott Anderson and myself), Roman Gilg for KDE, Daniel Stone for Weston, and Keith Packard for the X server.
- On the driver side, a lot of different developers from many vendors: AMD, Arm, Google, Nvidia, Qualcomm and more.
Everyone seemed pretty excited about the idea and provided a lot of valuable feedback. Thanks to everyone who participated in the discussion!
Short-term plans include turning the libliftoff experiment into something today’s compositors can actually use. I also need to figure out how to properly support multiple outputs: because each has its own timings, it’s tricky to migrate planes from one output to another. Finally, it would be nice to assign a priority to each layer to put layers that are frequently updated on a plane before the others.
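The layer priority idea could look something like the following Python sketch. This is not libliftoff code: the `LayerPriorities` class and its methods are hypothetical, and counting buffer updates is just one possible way to decide which layers deserve a plane first.

```python
from collections import Counter


class LayerPriorities:
    """Counts how often each layer got a new buffer (hypothetical helper)."""

    def __init__(self):
        self.updates = Counter()

    def layer_updated(self, layer):
        self.updates[layer] += 1

    def order(self, layers):
        # Most frequently updated layers first. sorted() is stable, so
        # layers with the same count keep their original relative order.
        return sorted(layers, key=lambda layer: -self.updates[layer])


prio = LayerPriorities()
for _ in range(60):
    prio.layer_updated("video")
for _ in range(3):
    prio.layer_updated("terminal")

print(prio.order(["panel", "terminal", "video"]))
# ['video', 'terminal', 'panel']
```

A real implementation would probably want the counts to decay over time, so that a layer which stopped updating eventually gives its plane back.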
We also discussed some longer-term plans.
First, we want to fix an issue related to memory: clients will often allocate buffers with memory that cannot be directly scanned out on a plane. Notably, on Intel, clients may use a Y_TILED format which cannot be scanned out on older hardware. On AMD, clients will allocate memory in a region that cannot be accessed from the display engine.
To fix this issue, we want to implement a feedback loop: the compositor could send a hint to clients to make them use a buffer that can be put on a plane. Typically, the compositor could say “right now you are using Y_TILED so I can’t put you on a plane, but if you were using X_TILED I could”. The client could then decide to switch its buffer format. I sent a Wayland patch almost one year ago for this. The hints could be computed by libliftoff, in which case the compositor could just forward them to clients.
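The feedback loop described above can be modelled in a few lines of Python. This is only a sketch of the idea, not the Wayland protocol: `compute_hint`, `Client` and `handle_hint` are hypothetical names, and layouts are reduced to plain strings.

```python
def compute_hint(current_layout, scanout_layouts):
    """Return a layout the client should switch to, or None if no hint is needed.

    scanout_layouts is the compositor's (assumed) list of layouts it could
    put on a plane, in order of preference.
    """
    if current_layout in scanout_layouts or not scanout_layouts:
        return None
    return scanout_layouts[0]


class Client:
    def __init__(self, layout, supported):
        self.layout = layout
        self.supported = supported

    def handle_hint(self, hint):
        # A hint is only a suggestion: the client may ignore it, and it
        # must ignore it if it can't render into that layout.
        if hint is not None and hint in self.supported:
            self.layout = hint


client = Client("Y_TILED", supported={"Y_TILED", "X_TILED", "LINEAR"})
scanout_layouts = ["X_TILED", "LINEAR"]

hint = compute_hint(client.layout, scanout_layouts)
print(hint)  # X_TILED
client.handle_hint(hint)
print(compute_hint(client.layout, scanout_layouts))  # None: nothing left to ask
```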
The second issue is about the kernel API. Right now the only way to know if we can use a plane is to just try to use it via an atomic test-only commit. This makes it quite tedious to figure out the best way to use planes, since we need to try a lot of combinations, basically brute-forcing the solution. Moreover, figuring out the optimal solution and hints is highly hardware-specific.
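Here is a rough Python sketch of what this brute-forcing looks like. It is a deliberately naive search, not what libliftoff does: `best_assignment` is a made-up function, `try_commit` stands in for a test-only commit, and `fake_driver` encodes invented rules that a real compositor would not be able to see.

```python
from itertools import combinations, permutations


def best_assignment(layers, planes, try_commit):
    """Try layer-to-plane mappings, largest first.

    Returns (mapping, number of test commits performed). Layers missing
    from the mapping would have to be composited with the render engine.
    """
    tests = 0
    for count in range(min(len(layers), len(planes)), 0, -1):
        for chosen_layers in combinations(layers, count):
            for chosen_planes in permutations(planes, count):
                mapping = dict(zip(chosen_planes, chosen_layers))
                tests += 1
                if try_commit(mapping):
                    return mapping, tests
    return {}, tests


def fake_driver(mapping):
    # Invented hardware rules: the cursor plane only accepts the pointer
    # layer, and the overlay plane rejects the editor layer.
    for plane, layer in mapping.items():
        if plane == "cursor" and layer != "pointer":
            return False
        if plane == "overlay" and layer == "editor":
            return False
    return True


mapping, tests = best_assignment(
    ["editor", "terminal", "pointer"], ["overlay", "cursor"], fake_driver
)
print(mapping)  # {'overlay': 'terminal', 'cursor': 'pointer'}
print(tests)    # 5
```

Even with three layers and two planes, five test commits are needed before one succeeds. The number of candidates grows very quickly with more layers and planes, which is why better hints from the driver side would help.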
To fix this, we could make the kernel provide more information, but because constraints are very different from one piece of hardware to another, designing a general enough interface is difficult. Another solution would be to add vendor-specific plugins to libliftoff, allowing each driver to add code to make better decisions. This seems to be the best way to go so far.
Here’s the lightning talk summarizing the workshop discussions (slides):
Scott Anderson also wrote a summary with more details about the Wayland protocol idea.
Three years ago at XDC, James Jones from Nvidia presented the allocator project to fix the GBM/EGLStreams situation. This year, he gave a new talk focusing on GBM, Nouveau and transitions. In contrast to previous proposals, this one aims at building an incremental approach on top of existing APIs.
To reduce bandwidth usage in the render engine, some GPUs can use compressed buffers. The GPU can perform OpenGL/Vulkan operations on these compressed buffers directly and more efficiently than on uncompressed buffers. In short, we want to use compressed buffers when rendering.
However, compressed buffers cannot be scanned out directly on a plane. They need to be uncompressed first. One way to do it is by uncompressing the buffer into a new buffer but this performs a copy which can be slow.
Some GPUs have a more optimized way to do this: they can uncompress the buffer in-place. So if a Wayland client renders into a compressed buffer, this buffer can be uncompressed in-place and then scanned out on a plane. This process of uncompressing a buffer in-place is called a transition[3].
A workshop was organized to continue discussing transitions. There, we discussed when and where the transition should be performed. One idea is to do it in the compositor, if the compositor decides to put the buffer in a plane. libliftoff could help figure out when a transition can be applied. Another idea would be to perform the transition in the client right before submitting the buffer to the compositor. The client would need feedback from the compositor to know when it needs to perform the transition – that’s something the Wayland protocol update I mentioned earlier can do.
Here’s the workshop summary:
Variable Refresh Rate (VRR)
Harry Wentland from AMD gave a talk about Adaptive Sync (the DisplayPort technology), Variable Refresh Rate (the HDMI technology) and FreeSync (the AMD technology). All of these allow screens – which usually have a fixed refresh rate like 60Hz – to wait a little bit more for the next frame.
The primary use-case for this feature is gaming. Games generally submit new frames at a variable rate: depending on the complexity of the scene, the next frame can come in faster or slower. Games also want to reduce latency, that is to avoid delay between the time at which the frame is rendered and the time at which it’s displayed on screen.
In the case of a fixed refresh rate screen, if a frame takes a little bit longer than usual to render, the refresh deadline will be missed. This results in lag and stutter. VRR allows the screen to wait a little bit longer and avoid the missed frame.
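A small Python calculation shows the difference. The numbers are assumptions chosen for illustration: a 60Hz fixed screen, and a VRR screen that can refresh anywhere between 40Hz and 60Hz (so between roughly 16.7ms and 25ms after the previous refresh, which is taken to happen at t=0). Both function names are made up.

```python
import math


def present_time_fixed(render_done_ms, refresh_ms):
    """Fixed rate: the frame is shown at the next refresh boundary."""
    return math.ceil(render_done_ms / refresh_ms) * refresh_ms


def present_time_vrr(render_done_ms, min_interval_ms, max_interval_ms):
    """VRR: the screen may refresh anywhere inside [min, max] after the last refresh."""
    last_refresh = 0.0
    while True:
        earliest = last_refresh + min_interval_ms
        latest = last_refresh + max_interval_ms
        if render_done_ms <= latest:
            return max(render_done_ms, earliest)
        # Too late even for VRR: the screen repeats the old frame and
        # a new window starts from there.
        last_refresh = latest


refresh_60hz = 1000 / 60  # about 16.7ms

# A frame that takes 18ms to render just misses the 16.7ms deadline.
print(round(present_time_fixed(18, refresh_60hz), 1))  # 33.3
print(present_time_vrr(18, refresh_60hz, 25))          # 18
```

With a fixed rate the late frame waits for the next refresh and shows up about 15ms after it was ready; with VRR it is displayed as soon as it is done.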
Another use-case is video players. Videos have a fixed frame rate, but it’s usually different from the screen’s refresh rate. Video players need to resort to frame interpolation and end up with imperfect timings. VRR would allow lowering the screen refresh rate to get perfect timing.
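To illustrate the timing mismatch, this Python sketch counts how many screen refreshes each video frame stays on screen when the player simply shows the most recent frame at every refresh. The 24FPS video and the 48Hz refresh rate are example values; whether a given VRR screen can actually run at 48Hz is an assumption. `refreshes_per_frame` is a made-up helper.

```python
import math
from fractions import Fraction


def refreshes_per_frame(video_fps, screen_hz, frames):
    """Count the refreshes during which each of the first `frames` video frames is visible."""
    counts = []
    for i in range(frames):
        start = Fraction(i, video_fps)     # frame becomes current
        end = Fraction(i + 1, video_fps)   # next frame replaces it
        # Refresh number k happens at time k / screen_hz. Count the
        # refreshes with start <= k / screen_hz < end.
        first = math.ceil(start * screen_hz)
        last = math.ceil(end * screen_hz)
        counts.append(last - first)
    return counts


print(refreshes_per_frame(24, 60, 4))  # [3, 2, 3, 2]
print(refreshes_per_frame(24, 48, 4))  # [2, 2, 2, 2]
```

At 60Hz, frames alternate between 50ms and 33ms on screen instead of a steady 41.7ms. At 48Hz every frame gets exactly two refreshes.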
VRR could also be useful to reduce battery usage when the screen doesn’t change often, for instance when the user types in a text editor. Instead of rendering at 60FPS, the compositor could temporarily reduce the frame-rate.
The gaming use-case is the simplest. The compositor can submit frames a little bit later than the deadline and the hardware will cope with it. Other use-cases require more work. To make VRR useful for video players, we need some kind of timing API for frame submissions. To make VRR useful for reducing battery usage, the compositor would need to smooth frame-rate changes, otherwise the screen flickers (this is a limitation of VRR screens).
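One possible way to smooth frame-rate changes is to move towards the target rate in bounded steps rather than jumping there at once. The Python sketch below shows that idea only; the `ramp` function is hypothetical, and the step size is an arbitrary value, not something measured on real screens.

```python
def ramp(current_hz, target_hz, max_step_hz):
    """Return the intermediate refresh rates from current_hz to target_hz.

    Each step changes the rate by at most max_step_hz.
    """
    if max_step_hz <= 0:
        raise ValueError("max_step_hz must be positive")
    rates = []
    while current_hz != target_hz:
        delta = target_hz - current_hz
        step = max(-max_step_hz, min(max_step_hz, delta))
        current_hz += step
        rates.append(current_hz)
    return rates


# Slowing down from 60Hz to 40Hz, changing by at most 8Hz per frame
print(ramp(60, 40, 8))  # [52, 44, 40]
```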
It seems to me that for the time being, we should focus on a Wayland protocol for gaming only. The two other use-cases need more work, more APIs (both for the kernel and for Wayland) and more experiments.
I worked on a project called Chamelium during my Intel internship. It’s basically a screen emulator you can send commands to via the network. It’s used in ChromeOS and i915’s CI farms.
I gave a talk about the project and my work:
That’s it! Apart from these three topics, there are also lots of other things we discussed, but they won’t fit in a single blog post. Thanks a lot to everyone who attended XDC, thanks to Mark Filion for organizing the event, and thanks to all sponsors for making this possible!
[1] The part of the GPU responsible for executing OpenGL/Vulkan commands.
[2] The part of the GPU responsible for sending a video stream to the screen, as opposed to the part performing rendering.
[3] In fact, transitions are more general than just uncompression: they can change the buffer layout arbitrarily.