Earlier this month I decided to rework the back end of the renderer to improve performance on lower spec machines.
I made the following changes:
- Visibility buffer has been removed. I had conceived of the visibility buffer sometime ago when counting the prohibitive cost of overdraw in the shaders. It was a decent solution, but represented too much overhead for what it saved.
- Hierarchical z buffer replaces the visibility buffer. This is the depth buffer expressed as a tiled hierarchy of maximum depth values to allow for the early z reject of whole tiles in the rasteriser.
- Single pixel shader. Previously a triangles’ pixel shader was a series of function pointers to small shader code blocks. I had thought about using ASMJit or LLVM to output some machine code and stitch it together, but felt I needed a more simpler sort term solution. So now every game surface has the same shader applied: a perspective correct 3 channel colour pass and a point sampled mip-mapped texture pass. The renderer now expects the same attributes and shading pathway for every scene triangle. This simplified code flow in the back end
The hierarchical z buffer has the following constituents:
- Level 0 is the native tile depth buffer containing the current scene depth per pixel
- Level 1 records maximum depth for each 4×4 tile
- Level 2 records maximum depth for each 16×16 tile
- Level 3 records maximum depth for each 64×64 tile
- 2 x 64-bit masks for tracking writes at level 2 & 3
Depth writes from valid triangle fragments must be propagated up the hierarchy, which is handled as follows: for each fragment that writes to level 0 the shader returns a maximum depth value for the 4×4 pixel block, which is written into level 1. The level 2 & 3 tiles this block resides in are bit-marked.
After all the fragments from a triangle have been shaded the level 2 then level 3 depth tiles whose bits are set in their respective masks are updated by writing up maximum values for each 4×4 block. SIMD is used to accelerate the process.
In the rasteriser we step interpolated depth values from tile corners using pre-computed step tables in the same way we step edge values. By taking the minimum depth for the 4 corners of each tile we have a conservative depth value for the raster tile which can be compared to the corresponding depth tile entry. If the minimum raster depth in that tile is less than the maximum buffer depth we can trivially reject the whole tile before it is emitted to the fragment stream.
The depth buffer hierarchy is cleared following bin processing.
Depth interpolant setup is done in the tiles ahead of rasterisation using SIMD to process 4 triangles at a time. A real win has come from the deferred nature of the binned renderer: if no triangle fragments survive rasterisation then we avoid attribute fetch, attribute shading and vertex lighting calculations for that triangle completely!
The hierarchical depth buffer’s effectiveness hangs largely on the order triangles arrive at the buffer. You want to process them front to back ( closest to the camera first) to get the most out of it. I considered briefly doing a full per-triangle depth sort in the bins but quickly discounted it. Even with the modest scenes Lamorna Engine is intended to handle that could mean sorting thousands of triangles at a time. And I reasoned it was unnecessary to sort triangles, only models. Since triangles in the bin are laid down according to draw call, I implemented a draw call sort, then ensured that lower numbered draw calls contained large occluding models such as the map elements. I also depth sort the animated models prior to rendering as well, since they have the highest poly counts.
I am really impressed with the effectiveness of the hierarchical depth buffer. Not only did it not take long to implement, but it gave a large performance boost on lower spec machines.
I have a number of dev machines for testing purposes, and my low spec one is a dual core Core i5 2.3Ghz 4Gb HP laptop, circa 2011. Previously versions of the demo have run very choppily on this machine, 30-40 fps, but with the new changes the demo barely drops a frame below 60.