Tuesday, November 1, 2011

Software occlusion culling trip. Part I: Packing the luggage.

Occlusion culling in computer graphics, and in games especially, has become more and more important over the last several years because the complexity of virtual worlds grows very quickly. Games also use much more complex materials with a very heavy shading cost. But traditional hardware occlusion culling (HOC), which relies on the extremely efficient and scalable power of the GPU, often becomes a bottleneck in current games: rendering itself uses a lot of GPU power, the GPU is often more heavily utilized than the CPU (especially a multicore CPU), and any additional GPU work not directly related to rendering may cause relatively big drops in the frame rate. So developers have started looking for a way to keep occlusion culling, because it removes a lot of invisible objects from rendering, without overloading the GPU.

One such “solution” is software occlusion culling (SOC); the difference lies in which kind of processing power is used. SOC is very similar to HOC in principle, but it uses the CPU to decide whether a given object is visible or not. In an age where CPUs keep increasing their core count this does not seem such a bad idea, because SOC is well suited to parallelization and can use all the power of modern CPUs. To be precise, SOC has drawbacks of its own and it is not the only solution for occlusion culling. There are techniques that precompute visibility for a static scene and then use this precomputed data to quickly answer the question “Which objects are visible from the current point of view?” But they work well only with an almost static (often indoor) environment and only a very small number of dynamic objects in the scene. So if you want both occlusion culling and a relatively dynamic (potentially destructible) environment, and you do not want to load the GPU even more, SOC is the only suitable approach.

This topic has always been very interesting to me, and over this series of posts I want to implement software occlusion culling step by step. I know that the technique is not novel and that there are examples where it has been implemented successfully in commercial products (the Frostbite 2 engine), but I am interested in gaining experience in the many areas around it. The areas related to SOC are software rasterization, SIMD optimization, multithreading and, of course, C++.

The first step on our trip is understanding how the OC algorithm (HOC in particular) works and why HOC becomes less suitable for games with heavy GPU usage. OC is a very good optimization for virtual worlds that contain a large number of objects while only a relatively small part of the world is visible from the current camera position at any given time. The OC algorithm is built on two basic concepts: occluder objects and occluded objects. Occluder objects are renderable objects that hide other objects behind themselves. Not all objects are good occluders. Good occluders are, for example, walls, big buildings and so on; in other words, anything big enough to hide a lot of other objects. Occluded objects are objects that are excluded from rendering because they lie entirely behind an occluder. Note that OC does not have to involve rendering any objects: it can be done fully analytically, that is, by evaluating mathematical equations. For example, PVS (Potentially Visible Set, one kind of OC algorithm) uses the analytical approach, while HOC and SOC use the so-called numerical approach. Look at the image below to understand the roles of occluder objects (red) and occluded objects (black). The green triangle is the camera view frustum as seen in the XOZ plane. As we can see, occluders can significantly reduce the number of visible objects when OC is enabled.
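To make the analytical flavour concrete, here is a minimal sketch of an occlusion test done purely with arithmetic, in a top-down XOZ view. It assumes the camera sits at the origin looking along +Z, the occluder is a wall segment parallel to the X axis, and the tested object is reduced to a segment at a single depth. The names `Wall`, `Object`, `pointHidden` and `isOccluded` are made up for this illustration; a real PVS system is far more involved.

```cpp
#include <cassert>

// Top-down (XOZ) view with the camera at the origin looking along +Z.
// All names are hypothetical and only illustrate the analytical idea.
struct Wall   { float xMin, xMax, z; };  // occluder: segment parallel to the X axis
struct Object { float xMin, xMax, z; };  // occludee: its extent at depth z

// A point (x, z) is hidden when it lies behind the wall and the ray from the
// camera to the point crosses the wall's depth inside [xMin, xMax].
bool pointHidden(const Wall& w, float x, float z) {
    assert(w.z > 0.0f && w.xMin <= w.xMax);
    if (z <= w.z) return false;          // in front of the wall (or level with it)
    const float xAtWall = x * w.z / z;   // where the ray crosses the wall's depth
    return xAtWall >= w.xMin && xAtWall <= w.xMax;
}

// The region hidden by the wall is convex, so a segment is fully hidden
// exactly when both of its end points are hidden.
bool isOccluded(const Wall& w, const Object& o) {
    return pointHidden(w, o.xMin, o.z) && pointHidden(w, o.xMax, o.z);
}
```

With a wall spanning x from -1 to 1 at depth 2, an object spanning -1.5 to 1.5 at depth 4 is reported as occluded, while an object spanning 1.5 to 3 at the same depth is not, because one of its ends sticks out from behind the wall.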

Now we will try to answer the question “Why does HOC become a less suitable approach for games with heavy GPU usage?” The answer is hidden in the basic principle of how the CPU and GPU work together. Most of the time they work independently and asynchronously: the CPU is just a supplier of work for its consumer, the GPU. During rendering the CPU just calls GAPI (graphics API) functions, the graphics card driver collects these commands in its internal command buffer and from time to time flushes the buffer to the GPU for execution. There is no implicit synchronization between the CPU and GPU in typical usage; even the Present() GAPI function is asynchronous and is just another command in the driver's buffer, like many others. But there are some exceptions where synchronization is necessary. One of them is when we read data back from the GPU or ask the GPU to return a query result synchronously.

Let’s talk a little about the query mechanism in modern GPUs. There are a lot of different queries; a detailed description of most of them can be found in the official DirectX documentation or the OpenGL specification. For HOC we are interested in one specific kind of query, the occlusion query. What is it? An occlusion query helps us answer the question “How many pixels of a rendered object were actually drawn?” Before rendering some geometry we ask the GAPI to begin an occlusion query, then we render our geometry, and after that we ask the GAPI for the result: how many pixels were rendered. Notice that the third step, when we ask the GAPI for the result, may be either synchronous or asynchronous. Often it is synchronous, or effectively synchronous: the query itself is asynchronous, but we poll the GAPI for the result in a loop until it arrives. To understand why synchronizing the CPU and GPU to retrieve the results of an occlusion query is a really bad idea, let’s look at the image below. It illustrates the driver command buffer, where green rectangles are ordinary GPU commands like SetTexture, DrawPrimitive and so on, the two red rectangles are, respectively, the acquisition of an occlusion query and the retrieval of its result, and the blue rectangle between them is the DrawPrimitive command for which we want the number of rendered pixels.
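To get a feeling for what an occlusion query counts, here is a tiny CPU model of it. This is not the DirectX or OpenGL API: `DepthBuffer` and `drawRect` are hypothetical names, the "geometry" is just an axis-aligned rectangle at constant depth, and the returned number plays the role of the query result.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical CPU model of a depth buffer; a smaller depth is closer to the camera.
class DepthBuffer {
public:
    DepthBuffer(int width, int height)
        : width_(width), height_(height),
          depth_(static_cast<std::size_t>(width) * height,
                 std::numeric_limits<float>::infinity()) {}

    // Draws the rectangle [x0, x1) x [y0, y1) at constant depth z and returns
    // how many pixels passed the depth test. This is the number an occlusion
    // query wrapped around the draw call would report.
    int drawRect(int x0, int y0, int x1, int y1, float z) {
        assert(0 <= x0 && x0 <= x1 && x1 <= width_);
        assert(0 <= y0 && y0 <= y1 && y1 <= height_);
        int passed = 0;
        for (int y = y0; y < y1; ++y) {
            for (int x = x0; x < x1; ++x) {
                float& stored = depth_[static_cast<std::size_t>(y) * width_ + x];
                if (z < stored) {  // depth test: keep the closer surface
                    stored = z;
                    ++passed;
                }
            }
        }
        return passed;
    }

private:
    int width_;
    int height_;
    std::vector<float> depth_;
};
```

On an 8x8 buffer, drawing a wall over the region (0, 0)-(4, 8) at depth 1 reports 32 pixels. Drawing an object over (0, 0)-(4, 4) at depth 2 afterwards reports 0, so that object is fully hidden and could be skipped.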

As you can see in the image, our CPU is slightly ahead of the GPU, meaning the CPU has produced more commands than the GPU can execute right now. The CPU does not wait while the GPU executes these commands; it just continues to write commands into the command buffer as usual and does not even know that the GPU is currently busy and cannot perform all these commands immediately. But what happens when we issue an occlusion query and wait for its result? We explicitly make the CPU wait until the GPU has finished its work. It means that the CPU now waits for the GPU until the red vertical line. And if we have a lot of these queries, overall performance may drop significantly. Also don’t forget the additional GPU work introduced by drawing the occluder geometry and the test geometry of the occluded objects.
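The stall can be estimated with a very rough timeline model. The sketch below is only an assumption-laden toy: `Command` and `cpuStallTime` are invented names, costs are in arbitrary time units, and the driver's batching of the command buffer is ignored. The CPU records commands one after another, the GPU executes them in order, and a command flagged `waitForGpu` blocks the CPU until the GPU has caught up.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical cost model; the time unit is arbitrary.
struct Command {
    double cpuCost;     // time the CPU spends recording the command
    double gpuCost;     // time the GPU spends executing it
    bool   waitForGpu;  // true for a synchronous "retrieve query result"
};

// Returns the total time the CPU spends stalled waiting for the GPU.
double cpuStallTime(const std::vector<Command>& commands) {
    double cpuTime = 0.0;  // when the CPU finished recording the latest command
    double gpuTime = 0.0;  // when the GPU finishes everything submitted so far
    double stalled = 0.0;
    for (const Command& c : commands) {
        assert(c.cpuCost >= 0.0 && c.gpuCost >= 0.0);
        cpuTime += c.cpuCost;
        // The GPU starts a command only once it has been recorded and the
        // previous command is done.
        gpuTime = std::max(gpuTime, cpuTime) + c.gpuCost;
        if (c.waitForGpu) {
            stalled += gpuTime - cpuTime;  // never negative: gpuTime >= cpuTime here
            cpuTime = gpuTime;             // the CPU resumes when the GPU is done
        }
    }
    return stalled;
}
```

With three commands that each cost 1 unit on the CPU and 3 units on the GPU, the CPU never stalls as long as nothing waits. If the third command is a synchronous result retrieval, the CPU sits idle for 7 units.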

Now we know the answers to both questions. Let’s think about how SOC can help us in this situation. Over the last several years CPUs have increased their core count relatively quickly, and it is now typical for a gamer’s PC to have a CPU with 4 or more cores. Unfortunately PC game developers are not as experienced in multicore programming as our colleagues from the console world, where the Xbox 360 has three cores (six hardware threads) and the PS3’s Cell processor has one general-purpose core plus six SPEs available to games. On the PC a typical game engine uses only 2 threads (there are exceptions, of course, and the most advanced studios often implement much more sophisticated multithreading): the first is the entry point of the game, where the main command flow is executed, and the second is a so-called background task thread, usually for loading content in the background so that we don’t interrupt the main thread and cause short FPS drops. But CPU utilization is not uniform across these threads: the main thread uses almost 100% of one core, while the second thread usually uses 20-40% of another. So typically we use no more than 70% of a dual-core processor, 35% of a quad-core and only 17.5% of an eight-core one. That shows that we have a lot of unused processing power which we can use to improve our game and make it even better. As I mentioned earlier, SOC is well suited to parallel processing, so we can relatively easily use this power to accelerate the OC algorithm. Moreover, since occlusion culling is done entirely on the CPU, we should not see any negative effects of CPU-GPU synchronization.
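The utilization figures are simple arithmetic: one fully busy core plus a fraction of a second one, divided by the number of cores. A minimal sketch, with `cpuUtilization` as a made-up helper name:

```cpp
#include <cassert>

// Fraction of the total CPU capacity in use when the main thread saturates one
// core and the background thread uses `backgroundLoad` of another core
// (0.2 to 0.4 in the text above).
double cpuUtilization(int coreCount, double backgroundLoad) {
    assert(coreCount >= 2 && backgroundLoad >= 0.0 && backgroundLoad <= 1.0);
    return (1.0 + backgroundLoad) / coreCount;
}
```

With a background load of 0.4 this gives roughly 0.7 for two cores, 0.35 for four and 0.175 for eight, so the more cores the machine has, the larger the share that sits idle.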

So let’s think about what we need to implement a fairly good SOC. We will do this step by step and describe the common principles. Here is the list of steps to achieve our goal:
· Simple software rasterization algorithm.
· Simple single-threaded software occlusion culling algorithm.
· SIMD optimization of the software rasterization and culling algorithms.
· Job system to parallelize SOC effectively.
· An automated tool to generate occluder geometry from the rendering geometry without an artist’s involvement.
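To give a taste of the first two steps in the list above, here is a deliberately naive, single-threaded sketch. It makes several simplifying assumptions that a real implementation would not: occluder triangles are already in screen space and have one constant depth, pixels are sampled at their centers (which is not strictly conservative), and the tested object is reduced to a screen rectangle with its nearest depth. `Vec2`, `edge` and `OcclusionBuffer` are names invented for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

struct Vec2 { float x, y; };

// Twice the signed area of triangle (a, b, p); its sign tells on which side
// of the edge ab the point p lies.
float edge(const Vec2& a, const Vec2& b, const Vec2& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

struct OcclusionBuffer {
    int width, height;
    std::vector<float> depth;  // smaller is closer; starts "infinitely far"

    OcclusionBuffer(int w, int h)
        : width(w), height(h),
          depth(static_cast<std::size_t>(w) * h,
                std::numeric_limits<float>::infinity()) {}

    // Step 1: rasterize an occluder triangle at constant depth z.
    void drawOccluder(Vec2 a, Vec2 b, Vec2 c, float z) {
        const float area = edge(a, b, c);
        if (area == 0.0f) return;          // degenerate triangle covers nothing
        if (area < 0.0f) std::swap(b, c);  // make the winding consistent
        const int x0 = std::max(0, static_cast<int>(std::floor(std::min({a.x, b.x, c.x}))));
        const int y0 = std::max(0, static_cast<int>(std::floor(std::min({a.y, b.y, c.y}))));
        const int x1 = std::min(width, static_cast<int>(std::ceil(std::max({a.x, b.x, c.x}))));
        const int y1 = std::min(height, static_cast<int>(std::ceil(std::max({a.y, b.y, c.y}))));
        for (int y = y0; y < y1; ++y) {
            for (int x = x0; x < x1; ++x) {
                const Vec2 p{x + 0.5f, y + 0.5f};  // sample at the pixel center
                if (edge(a, b, p) >= 0.0f && edge(b, c, p) >= 0.0f && edge(c, a, p) >= 0.0f) {
                    float& d = depth[static_cast<std::size_t>(y) * width + x];
                    d = std::min(d, z);
                }
            }
        }
    }

    // Step 2: an object is visible if any pixel of its screen rectangle
    // [x0, x1) x [y0, y1) is not covered by a closer occluder.
    bool isVisible(int x0, int y0, int x1, int y1, float nearestZ) const {
        assert(0 <= x0 && x0 <= x1 && x1 <= width);
        assert(0 <= y0 && y0 <= y1 && y1 <= height);
        for (int y = y0; y < y1; ++y)
            for (int x = x0; x < x1; ++x)
                if (nearestZ <= depth[static_cast<std::size_t>(y) * width + x])
                    return true;
        return false;
    }
};
```

On a 16x16 buffer with one occluder triangle (0, 0), (16, 0), (0, 16) at depth 1, an object at depth 2 covering (0, 0)-(4, 4) is culled, while the same object moved to (12, 12)-(16, 16), outside the triangle, stays visible. The SIMD and job system steps would later speed up exactly these two loops.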

That does not sound very hard but, as always, the devil is in the details.
