For some reason, information on collision detection & response is quite sparse compared to all the other subjects. Worse yet, there are some bad information out there that are praised to be good resources. Pieces of information are also often discoherent, such as resources on collision detection says nothing about collision response and vice versa. It took me a while to find good information on this subject, and I have finally implemented a working collision system that works reasonably well for games.

A new collision object representation
This time, I chose convex polyhedrons to be the collision representation for the small objects in my game, such as trees, rocks, and so on. Concave objects, such as houses, are broken down into convex polyhedrons. I am aware of some of the convex decomposition techniques, but I didn’t want to do them, as I think that would take too much time. Instead, I manually decompose concave mesh into convex polyhedrons in Blender.

As for terrain, I use a completely different representation. This time, I render the terrain from top down in orthographic view, then store away the depthmap into some texture I can easily use later. Therefore, querying terrain height at any arbitrary point becomes very fast.

Unlike last time, we don’t have the fortune to have a uniform representation for all collidables in our game world, so we have to write two separate solutions. However, it is widely more efficient compared to last time. Instead of feeding huge numbers of polygons into the collision system, I can approximate them with cheap convex shapes now. As for terrain, it just becomes four texture reads.

First, let’s talk about how per-object collision works.


GJK is a powerful technique that queries the minimal distance between two convex objects, and it is also what I mainly use for my per-object collision system. I will only briefly summarize it here since there are already really good information on this topic, which I will link later.

First, imagine convex objects to be made of infinite small points. If a convex object A, and a convex object B are considered colliding, that must mean some of the points in this space belong to both A and B. If we subtract all points belong to B from all points belong to A, we must be subtracting a small subset of points from themselves. Therefore, some of these subtractions result in zero.

Imagine the subtractions we did form a new shape, C. For A and B to intersect, C must contain the origin. That is due to some of the subtractions resulting in zero. If A and B do not intersect, then C does not contain origin.

Now, to check if A and B intersect, we can check if C contains the origin. This is the essence of GJK; we have reduced the problem to checking whether or not the origin is contained by some convex shape.

How do we obtain C? Do we have to subtract all the points from B by all the points from A? No. If we can find support functions for A and B, then we can combine them to be the support function for C.

Given the support function of C, we can use it to map directions to extreme points in C. It can also be shown that for any convex shape, if it contains an origin, a tetrahedron made up of four points from the convex shape can capture the origin (meaning contain the origin). That means, we can use the support functions to find extreme points, and try to let these extreme points form a tetrahedron that contains the origin. That is what GJK does.

GJK does this by evolving a group of points into that tetrahedron that would capture the origin, we will call this group of points a simplex. For the 3D case, a simplex can be a point, a line segment, a triangle, or a tetrahedron.

In each iteration of GJK, it finds which voronoi region the origin is in relative to the simplex, then keep only the subset of that simplex which is contained by that voronoi region’s feature, and then change the search direction towards that voronoi region.

We can check the progress by finding the closest point on the simplex to the origin. Functions that do this, such as ClosestPointOnLineSegmentToPoint(), ClosestPointOnTriangleToPoint(), and ClosestPointonTetrahedronToPoint(), are well covered in Ericson’s book “Real Time Collision Detection”. The distance from this closest point to the origin tells us our progress. If this closest point stops getting closer, that means we will never be able to capture the origin. In that case, we return the distance as our minimal distance.

I know in Casey’s lecture on GJK, the termination condition is quite different. He simply checks if the extreme point in the search direction can get to the other side of the origin. That is enough for determining whether or not two objects collide, but it is not enough to find the minimal distance and the closest point on the simplex, which is vital to us, since we will need these information to resolve collision. Furthermore, some of the interesting optimizations I did will require GJK to reach the simplex that has the minimal distance in order to work.

Lastly, a word of warning. What I said above is purely in theory. There is more to termination condition in 3D GJK, such as clever ways of checking progression. I always find it reassuring to actually keep a minimal distance every iteration. Furthermore, you will also need to watch out for degenerate simplices, since most algorithms that determines a point’s voronoi regions will have unexpected behaviors if a simplex is degenerate, such as when a triangle becomes collinear and a tetrahedron becomes collinear in certain faces or coplanar. These are indications that GJK cannot make further progresses, because the new points is still within the current simplex.

In addition, a naive GJK implementation doesn’t work well with quadric shapes, such as spheres or ellipsoids, which are often shapes used to represent a player’s collision volume. In order for GJK to work for them, a tolerance value must be added to the progression check.

More details on GJK, read Erin’s talk.

Accounting for movement in GJK

As I described. GJK is an algorithm for computing the minimal distance between two convex shapes. It cannot be directly used as a collision detection algorithm in games, since it doesn’t account for movement.

One simple trick in Ericson’s book can change that, though. Keep in mind that GJK only requires the geometry’s support function to work, it does not care about the explicit definition of that geometry. We can turn one of the object to be swept across space by its motion by changing the support function for that object.

support function only returns points from B if search direction D is along motion vector

Consider the above swept sphere. The support function of the convex hull of the moving sphere will only return points from sphere B if the search direction is along the motion vector. In other words, we can simply do a dot product between the search direction and motion vector and use it to determine whether to use A’s support mapping or B’s support mapping. This effectively gives us a support function for a swept geometry.

Incorporating GJK into game’s collision detection

In my game, I am using a capsule to approximate player’s collision volume and convex polyhedrons for other geometries in the world. There are two pieces of information that we need to handle collision correctly each frame. One is time-of-impact (TOI) and contact point. We use TOI to stop the player right before collision and use contact point to compute the normal plane, which is then used to reflect player’s motion in order to achieve “gliding”.

My use of capsule is intentional, as it simplifies things a lot.

Obtaining time-of-impact

To obtain time of impact, one GJK isn’t enough. It only tells me whether my moving capsule will collide with any object, but doesn’t say when it will collide with them. As recommended by Ericson in his book, bisection method can be used here to obtain a relatively accurate TOI. The concept of bisection method is simply halving the interval to ensure that the margin within the interval is where the possible TOI is. It is analogous to binary search.

finding TOI with bisection method

However, as I have stated before, quadric shapes don’t play well with GJK. We can simplify the problem a lot here by reducing the capsule into its inner line segment. This is inspired by Randy Gaul’s answer to my post. His suggestion is reduce the capsule to a line segment in order to compute contact point, but I realized it can also work in collision detection phase.

I am not sure if this is the canonical definition of a capsule, but the “capsule” that I am using is defined by two parameters: the inner line segment and a radius. The inner line segment defines its height, and the radius expands that line segment out radially so it actually has a volume.

Given the above definition, it is clear that an object can only intersect with the capsule if the minimal distance between the inner line segment and the object is less or equal to the capsule’s radius. Therefore, instead of feeding a capsule into GJK, I can instead feed its inner line segment to GJK, and then compare the capsule’s radius with the returned minimal distance to detect collision.

Obtaining contact point

After determining TOI, we move the capsule forward by that amount, and we now determine the contact point between capsule and the closest object. As Randy Gaul suggested, the capsule is now reduced to line segment to find the contact point. That gives the advantage of only having to deal with non-quadric shapes and expect a non-tetrahedron simplex in the last iteration of GJK, since it’s impossible for the two to intersect when the radius of the capsule is non-zero, and our capsule’s radius is obviously non-zero.

But how do we compute closest point on either the capsule or the colliding object? Think back to the GJK algorithm. If the two objects are non-intersecting (which is obviously the case here), GJK is terminated by detecting a lack of progression (namely, comparing the minimal distance with the last iteration). Recall that we keep a closest point on the simplex to origin. This is not the closest point we need, though, as it is a point belong to the minkowski difference instead of the two original convex shapes. At the last terminating iteration, we convert the point into its barycentric form relative to the simplex. Because the current closest point on the simplex is a direct subtraction between A and B, we must be able to find two simplices on A and B that correspond to the end minkowski simplex as “source”, with the same dimension.

there must be a 1 to 2 mapping for simplex kept during GJK

In order to find the corresponding simplex in A or B, we have to keep track of the source points when we construct the simplex in GJK. By retrieving these points and applying the barycentric coordinates to them, we can obtain the closest point on either A or B.

After obtaining the contact point, the rest is old news; reflecting the motion back towards the collision normal so that the resultant motion vector is along the collision normal plane, and all that goodies.

That’s it!

That is all for player vs object collision detection & response routine. I wanted to include terrain collision in this post as well, but it’s getting quite long, so I will save it for next time. And as always, a video of the working collision system. The pink spheres are debug drawings of the closest point computed by this algorithm:

Continuing from last post, DDGI requires a high amount of memory to store its depth maps. I present a compression scheme that compresses the data to 5~10% of its original size. Note this is completely orthogonal to BC texture compression or any bit-saving tricks, so this compression ratio can be improved further with those existing techniques.

Memory Constraint

For DDGI to work effectively, we need a 16x16 depth map attached to each probe, with each texel storing both depth and depth^2. If we use R16G16 to store each texel, then that would give us 1KB per probe, not counting the gutter padding.

Let’s say we have a generous budget of 1GB for light probes, and we have a 200x200x50 meter^3 world. Placing probes one meter apart, we would need 2 million probes, which will take up 1.9G. So a dumb uniform distribution of probes won’t cut it.

A smarter way to go about it is building a voxel tree around geometry, like the one talked about by Remedy (Multi-scale GI talk). This will save us a lot of memory, but turned out that isn’t enough either. First of all, dynamic objects need probes as well, so we still need to scatter a decent amount of them out in the opening. Secondly, this space optimization is pretty meaningless if your world has geometry everywhere. So not a really robust solution to our memory issue.


In Monter, I have probes placed conservatively and one meter apart. The probes eat up about 300MB. That’s pretty bad, because we can’t scale with that. Firstly, the algorithm breaks down in some places because the probes are just too damn far apart:

As you can see, either probes can’t fit in that crevasse, or it’s clipped against the wall and has nobody to interpolate with. This is pretty glaring, and the only way to solve this is placing more probes. So I bumped the probe resolution to be 0.5 meters apart:

Much better, But we just increased the memory footprint to 2.4GB … which makes it unusable. As I’ve said, the dominating factor is the radial depth map, let’s try to compress that.

”Dead” Probes

An obvious thought: probes are small distance apart from each other, so do we really need to store the true depth value? No, because during interpolation, probes must be within some distance threshold. If a probe happens to be farther from the shading point than the threshold, the shading point will query another probe that’s closer instead, because it’s guaranteed that such a probe will exist given our uniform probe density. We can just clip the depth value at max probe distance.

depth_to_store = min(true_depth, max_dist_from_probe_to_probe);

We just reduced the range of the depth value from [0, inf] to [0, N], N being the maximum distance possible from one probe to another during interpolation. Just some small value, anyway.

This makes our lives a lot easier. Immediately, a lot of probes that are farther from geometry than this maximum distance will have its radial depth data consist of only the value N. Then we can just toss all their depth maps out of the window and slap a label on them. Every time their depth gets queried, always return depth N for all directions.

This ends up working pretty well. For my case, the depth storage size has been compressed to 50% of its original size. Which gives us 1.2GB. I mean it’s not bad, but not great either. We can do better.

Dictionary Compression

Another thought: can we dictionary compress these depth map blocks? We see a clipped depth map of value N is very common, but are there any unknown configurations that are just as common, but we are not taking advantage of?

Well, there’s simply no tradeoff here. Dictionary compression is just better because what we did is simply a subset of dictionary compression. It’s only going to achieve a better compression ratio, not worse.

So here’s what we are going to do. We queue up light probe baking requests, and process them one by one. This process spits out a 16x16 depth block per request. Then we try to match this depth block with the previous depth blocks. If we find a match, we just spit out its index, otherwise we stamp the new depth block in the depth block list and spit out that new index.

The block matching criteria is peak pixel difference thresholding. As long as the biggest per-texel difference is below 0.01, I accept that block as a match. I don’t enforce an exact zero difference, and this implies lossy compression.

My look-back window size is 10k.

GPU implementation

This algorithm is slow if implemented as a single thread. Since my baking process is all on the GPU, I’ve added this in-place streaming compressor onto the GPU as well. Instead of processing one light probe, I batch a whole bunch per dispatch. This causes issues, however. Turns out the compressor works best if blocks within vicinity are available for matching, and the probe indices are ordered in a spatially coherent way. By batching probes, these close-by probes cannot cross-talk. I can run one probe per dispatch, but that will negate any latency-hiding capability of the GPU.

Instead, I swizzle the indices like so:

int SwizzleProbeIndex(int FlatIndex, int ProbeCount)
    int BlockSize = 100;
    int BlockCount = 500;
    int GroupSize = BlockSize*BlockCount;
    int GroupIndex = FlatIndex / GroupSize;
    int GroupHead = GroupIndex * GroupSize;
    int GroupTail = GroupHead + GroupSize;
    if (GroupTail >= ProbeCount) 
        return FlatIndex;
    int Offset = FlatIndex - GroupHead;
    Offset = ((BlockSize * Offset) % GroupSize) + Offset/BlockCount;
    return GroupHead + Offset;

This code iterates through the indices at large steps, which means a batch contains distant probes instead of neighbor probes. This enables neighbor block sharing without performance penalty.


The results are very nice, for the following configurations:
(dilation is applied to the probe volume to cover open ground)

density=0.5 meters, dilation=3 meters: 4.5% of its original size
density=1.0 meters, dilation=1.5 meters: 9% of its original size

If we map this level of efficiency to our original problem, then we get to store 2.4GB worth of DDGI probes in 168MB, awesome!

The compression ratio seems to improve as dilation gets bigger due to more “dead” probes, which are easy targets to get compressed. This is cool because it permits more probes in the open field.

Global illumination is one key effect I want to achieve with Monter. For a low-poly style game, GI is an important effect that gives the fidelity, which the low-poly game asset itself lacks. I’ve attempted to achieve this effect with irradiance probes, drawing ideas from both DDGI for probe interpolation and Remedy’s GI talk for optimal probe placement.

Global Illumination with Irradiance Probes

Before light probes, Monter used a constant term for indirect illumination, so that surfaces in shadow are not completely dark. It doesn’t account for indirect illumination or skylight illumination, which are key elements to producing realistic lighting. So I am aiming to finally bring global illumination into this game with the help of irradiance probes.

flat ambient term, looks awful!

this is what we want: indirect illumination!

An irradiance probe is a probe that captures incoming light from all directions towards a single point in space. I will twist this definition and say each probe encodes only the incoming indirect light + skylight, which is my specific use case. For simplicity’s sake, let’s just say we are totally spamming these probes everywhere.

How are these probes useful? Well, imagine we need indirect illumination at a surface point. We can look up the nearest eight probes, where these probes will form a box around the surface point. Then we just do a trilinear interpolation among these eight probes to approximate the incoming indirect light at the surface point. After that, we just convolve the incoming light with the surface’s BRDF, et voila, we will have something close to the indirect illumination at the surface. Of course there can be severe issues with this approach (probes stuck in walls, drawing irradiance from probes on the other side of a wall, etc), but we will talk about those later. For now, this is sort of the big picture we are working with.

Okay, cool, shall we start spamming those probes?

Irradiance Encoded as Spherical Harmonics

For each probe, we need to store the incoming light from every single direction, since we don’t know where the surface point will be facing at shading time. This poses a challenge as to how we should store them. There are potentially very many light probes, so we need the memory footprint of each probe to be low.

Luckily, people have figured this out. The incoming light can be modelled as a spherical function, and a spherical function can be decomposed into spherical harmonics.

Spherical harmonics sound scary, I know. I’ve been intimidated by them for quite a while, especially because introductory texts illustrate them as the following:

Good thing is, we don’t have to care about that. These guys are far more friendly when put in Cartesian form (at least for the first couple of bands). In short, they are just scalar functions that take in a 3d direction as input:

You can very clearly see that these guys are just simple spherical functions! In fact, after we fold the constants and place these equations into code, they just look like this:

float SH(float3 Dir, int Basis)
    float Res = 0.0;
    if (Basis == 0)
        Res = 0.28209479177387814347;
    else if (Basis == 1)
        Res = -0.48860251190291992159 * Dir.y;
    else if (Basis == 2)
        Res = 0.48860251190291992159 * Dir.z;
    else if (Basis == 3)
        Res = -0.48860251190291992159 * Dir.x;
    return Res;

Look a lot more friendly now, don’t they?

Now I’m not going to pretend I know anything about spherical harmonics, but here’s the big picture. First, we select a few spherical harmonics that will serve as the basis functions to reconstruct the original function. Then, for each of the spherical harmonics, we _project_ the original spherical functions into it to obtain a coefficient. We compute these coefficients such that the sum of all spherical harmonics, each scaled by their own coefficient, equals the original spherical function. Now, if we only select the first few spherical harmonics, we can only obtain a very rough estimate, but that estimate is already good enough for our purpose. Diffuse GI is quite low-detail and a low order encoding is sufficient.

Anyways, after that, we just end up with a handful of floats encoding the influence of each spherical harmonics if we were to reconstruct the original spherical function using them. In this case, the spherical function is just the incoming skylight+indirect light.

Spherical harmonics are scalar functions, so we need three sets of spherical harmonics coefficients, one set for each color channel. I ended up using FORMAT_R11G11B10_FLOAT, which is good enough for storing these coefficients. I’m using the first four spherical harmonics, so that means I need four coefficients each channel. That means that the irradiance of each probe can be stored at as small as 16 bytes, pretty good!

Baking the Probes

To start out, let’s lay the probes wastefully in a very dense 3D regular grid in space. This way we ensure every surface in the game world will be able to extrapolate indirect light from some probes.

To bake these probes, I could rasterize a cubemap around each probe and then turn them into spherical harmonics coefficients, but this idea just isn’t that appealing to me. First of all, the cubemap resolution will be pretty small, let’s say we are doing 64x64 texels per face. You have to realize that, for each of those faces, the _entire_ scene geometry will have to go through the raster pipe _six_ times per probe. Not to mention, all those triangles are going to end up as tiny pixels on the cubemap faces. It’s a terrible case for raster.

Instead, I chose to go with ray tracing, since we already have a ray tracing system in place. We can simply prep all the rays for some probes, and then dispatch all these rays within one single compute dispatch, fully saturating the GPU shader cores.

A diagram showing the baking pipeline

We can nicely pipeline the baking stages as a handful of compute passes, shown as above. A nice side effect is that this actually allows real-time probe update. We can have a scheme that selects a few key probes, and send them through this pipeline to recompute.

Probe Interpolation

Here comes the real problem, you see, since we laid the probes in a 3D grid, we can easily fall into situations like these:

We want to sample indirect illumination at the shading point. Unfortunately, probe 0 and 2 falls inside a wall. Clearly we can’t use data from probe 0 or 2, but a trilinear interpolation doesn’t know that. We can improve, though, by adding a geometric factor. Let’s say, on top of the trilinear weight, we only take data from probes that are above the shading surface’s tangent plane:

Using dot product of the probe’s relative positions and shading surface normal, we can tell which half-space the probes lie within w.r.t the surface’s tangent plane. We can incorporate them into our weighting function to eliminate probes from behind the surface.

However, when we put this trick in practice, we still get corner cases with light leaks:

an example of GI with trilinear + geometric interpolation

Turns out, the tangent plane test alone isn’t enough to eliminate bad interpolations alone. Here is one of many corner cases where this traditional probe interpolation breaks down:

Despite being in the positive half space of the shading surface’s tangent plane, probe 1 is placed inside the ceiling, and therefore should not be sampled from. As you can see, we just do not have enough data to reliably determine which probes to use during interpolation.

DDGI’s Contribution

Like I mentioned in the beginning of the post, a very recent technique was developed to combat this sort of artifact, namely DDGI. The big contribution from them is the extra visibility data they store for each probe. They store the spherical depth and depth^2, and use this information to reliably detect which probes should not be visible to the shading surface. Normally if you do this in a boolean fashion, you’d end up with sharp discontinuities. What’s cooler is that the DDGI paper has come up with a heuristic that would eliminate light leaks _and_ maintain smooth blending at the same time. I don’t believe I can do a better job explaining their ideas than the paper, but I do want to share a couple of important implementation details that made this technique work, which might have been overlooked during its original presentation/paper.

Storage of spherical depth

Spherical depth can be interpreted as a scalar spherical function, just like spherical light. I have tried encoding them as spherical harmonics, but the ringing artifacts caused the algorithm to break, so I didn’t go that route. I ended using an octahedral map to encode spherical depth and depth^2, just like the DDGI paper proposed.

This scheme was referred to as “octahedral encoding” during the talk, which really confused me. I think it’s much more correct to call this octahedral-mapping. All it does, really, is mapping 3d points on a unit sphere to a 2d point on a unit square. Exploiting this mapping function, we can pack a spherical function onto a unit texture. I’ve found out that this algorithm only works well enough with a 16x16 oct-map, which is also what the original paper proposed as well. Pushing the resolution any lower, and I started seeing artifacts.

Process of octahedral mapping

A 16x16 resolution oct-map can only store 256 discrete samples on the unit sphere, sampling any directions other than those discrete points will require interpolation. A nice thing about this mapping is that we can use hardware bilinear filter to do interpolation, since the neighbor oct-mapped texels are neighbors when unfolded back to the unit sphere space as well. The only thing we need to be aware of are sampling on the edges. Padding the texture borders is more complicated than just duplicating them.

Here’s how we should stitch the texels on an oct-map. If we examine the sphere unpacking process, we can see each border of the unit square are all edges of the tetrahedron that has been cut open.

Essentially, when we pad the texture borders, we need to “stitch” together these open edges, so that the bilinear filter can pick up the texels that correspond to their real neighbors on the unit sphere. Here’s roughly how the stitching will look:

oct-map stitched

And here are the texel borders actually laid out. I’ve numbered the texels so you can see the arrangement of the border:

oct-map padded with border to allow seamless bilinear filter

Implementation details of spherical depth storage

Depth map of each probe is prefiltered to allow smooth interpolation. One issue that arises from this is that big depth values can dominate the entire filtered depth function. In order to get this working, depths need to be truncated before storing them into depth maps and filtering them. This truncation is okay, knowing that no shading surface will query probes that are farther than a certain distance.

Another crucial detail is dealing with probes that fall inside geometry. Imagine a case like this:

Probe 0 and 2 are inside the wall, and therefore should be discarded during interpolation. These probes’ depth maps indicate that they should be directly visible to the shading surface however, because their visibility is unobstructed inside the wall; no triangle blocks their view of the shading surface:

The dark green arrows indicate that, from probe 0 and 2’s point of view, the shading surface should be within their reach, since these two rays are unobstructed on their paths.

Special care should be taken when a probe sees a wall that is facing outward. To fix this, I force the depth value to 0 when ray hits the backface. This will turn all probes to be completely invisible inside walls, except cases where the probe is literally on the surface itself.

rays are clamped to a depth of zero, since they’ve hit a backface triangle

Now that we have depth-based probe interpolation to prevent light leak, let’s take a look at our GI, see how it looks:

GI of light probes, incorporating spherical depths

Looks much better, but still has some artifacts. As I have alluded, this depth-based probe interpolation has one weakness. If the probe itself is placed directly on the shading surface or very close to it, then the depth data will be rendered useless. We can combat this by pulling the sampling point outwards along the shading surface’s normal and the view direction towards the camera. By adding this bias, we can finally achieve GI without much artifacts:

GI of light probes, incorporating spherical depths and sampling bias

Final thoughts

Now, we have decent GI, but how much have I paid for it? The above picture looks quite nice, but it’s because light probes are only 0.5 meters apart. My tiny world is 100x100x40 meters, to fill this space with probes, that means we will have to allocate and bake 800,000 probes.

That’s no issue if we only consider lighting data. We are storing 16 bytes per probe for irradiance since we are using spherical harmonics. The big problem is the visibility data: we need 1KB of visibility data per probe for this interpolation scheme to work! (16x16 res oct-map of depth and depth^2 value, each is stored as a 16-bit floating point, which totals to 1024 bytes). To store this many probes, we need 784MB … not good! The actual implementation of course uses a much smarter probe placement scheme, but I will save that for the next post.

Now, I’m not saying optimal probe placement completely solves the problem. We potentially need _that_ many probes once we have a lot of surfaces in the game asset. It’s really bad that each probe needs 1KB of data. What’s worse is that this algorithm really only holds up for high density probe fields. Here’s a GI picture with probes placed 1 meter apart (slightly sparser than what I used before).

GI artifacts by placing probes 1 meter apart

Now this is no light leak, but it’s not pleasant looking either. I am forced to place probes at a high density near the surfaces to remove these artifacts. After visualizing my voxelized probe placement, it really reminds me of light maps. If I were to encode static GI, I will probably end up with a similar amount of light map texels as there are probes, and I don’t need 1KB of visibility data for those lightmap texels … This is really making me wonder, is this scheme really worth it?

To give you a general idea, this is the probe resolution required for the algorithm to _just_ work …

Anyways, that’s it for obtaining GI from probe interpolation. I still haven’t said a word about my probe placement yet. The probe placement I am using is derived from Remedy’s GI talk in 2015. I might make the next blog post on that, stay tuned!
Last time I covered the design of acceleration structure (AS) for my game engine’s ray tracing system. This time, I will go into depth on how ray traversal is done with such an AS. As I go on to describe how ray traversal is done, it should become much clearer on why the AS is designed this way.

In a traditional ray tracer, all you have to do is to build an AS over some static geometry during startup, and for the rest of the program, you just keep tracing some rays through the AS. That part is well-covered in literatures such as Physically Based Rendering, but it becomes messier when we bring ray tracing to real-time games.

Rigid Body Transformation

The first obstacle is to track rigid body transformations. Building BVH over a mesh is an expensive operation, so we only do it on startup time. Transforming the mesh then would invalidate the BVH we’ve built. It is not uncommon for a mesh to be continuously transformed throughout multiple frames, and we must have a way to handle that.

The simplest solution is to transform all the vertices and refit the BVH every frame. However, this can cause problems. First of all, BVH is composed of a hierarchy of axis-aligned bounding boxes (AABB), subdividing the space that the mesh occupies as the level goes deeper. The subdivision is carefully chosen at build-time to have optimal ray traversal performance. Expanding/shrinking the AABBs to fit the transformed mesh will no longer guarantee the same traversal characteristics. Especially if the mesh is rotated, the AABBs will change drastically and lead to almost crippled BVHs in terms of performance. Another obvious thing is we need to transform all the vertices first, which of course takes time.

A much better solution is to transform the ray to object space instead of transforming mesh to world space. This will preserve the same intersection, and it removes the need to transform vertices. And of course, the biggest win here is the BVH is no longer updated; we still get the optimal subdivision that we had from the first time we built it.

This is why each BLAS in the AS stores an inverse transformation matrix. Rays will be transformed to local space using this inverse transform before the actual traversal starts.


One thing I’d like to be supported is instancing. In the game, we inevitably have multiple entities that share the same static mesh. Storing duplicate BVHs and mesh data for these entities is a waste. Therefore, the AS is pulled into two levels. Top level entities who share the same mesh will point to the same bottom level AS.

Deformable Mesh

This is the trickiest case to handle. Vertices actually change, so we need to apply skinning every frame. This also means extra storage for these deformed vertices _and_ the BVHs that are refitted specifically for these deformed vertices. Here we just have to eat the cost of refitting BVH. BVH’s quality will suffer, but refitting is quite fast using GPU compute (covered in last post).

Two-Level Ray Traversal

The traversal itself is quite simple, but the fact that AS is two-level might seem daunting at first, so I will cover that briefly.

The algorithm used to traverse a single BVH is the same as the one in Physically Based Rendering. I don’t think I can do a better job explaining it, so here’s the link.

The Top-level AS is a BVH of instances, each instance node storing pointers to bottom-level AS and storage for the inverse transform. You can also say that it’s a BVH of BVHs. Applying the typical BVH traversal to top-level AS, we eventually hit one of the leaf nodes (instance nodes in this case). Now we stop traversal, remember this location in the top-level AS, then context switch to the bottom-level AS that it’s pointing to. This is also where we apply the inverse transform to our rays (though we should still preserve rays in global space for traversal of other instances).

After we context switched to the bottom-level AS, loading up the BVH address and transforming our ray, we again apply the same BVH traversal algorithm. Except this time, the leaf nodes are triangle nodes. A hit with the leaf node will become a candidate to ray intersection.

After the bottom-level traversal, we pop back to top-level. We continue this loop until we are done with all top-level nodes.

Now we are done, that’s all there is to it. The real work here is maintaining the two-level AS. The actual ray traversal is light work in comparison.
I started doing some work to add a GPU ray tracing system to my engine. I want to experiment and see what sort of effects can be enabled on current gen GPUS, even without special hardware support. And of course, before we can shoot some rays, we need an acceleration structure (AS) for all the geometries in the scene. There are some popular ones, such as BVH (which also happens to by my choice of AS in this case). However, it remains an issue as to how to efficiently maintain AS for a highly dynamic scene like Monter’s, with animatable meshes and dynamic geometries.

Problem Statement

I want high quality BVH for fastest ray tracing possible, so I rolled with a SAH-based top-down BVH builder. Despite producing high quality BVHs, it has two problems: it is slow, and it can’t run on the GPU.

Instead of trying to come up with some fancy GPU builder that would run per frame, I’ve sidestepped the issue and came up with a scheme of maintaining AS that removes the need of any fast BVH builder.

Here’s what the scheme achieves:

1. Zero rebuilds.
2. Allows instancing (reusing same geom but with different transforms).
3. Handles skeletal animated mesh (skinned mesh).

AS Maintenance Overview

The AS is divided into two levels. The higher level ASs store the instance data (transforms, BVH pointer and geometry pointer), and the lower level ASs store the actual BVH and geometry data. This separation allows instancing and easy transformations as you will see in a bit. Borrowing the terms from DXR, let’s call the higher level “Top Level Acceleration Structure” (TLAS) and the lower level “Bottom Level Acceleration Structure” (BLAS).

At startup, the system builds a BLAS for each mesh that might be used, even deformable meshes, but without skinning applied. These BLASs are sub allocated within the same buffer, let’s call it BLAS buffer. Recall that a BLAS stores the actual BVH and geometry data, so that’s what we need to actually do ray tracing.

During each frame, the system builds a fresh instance array. During that time, a transform is assigned to each instance according to the world data. Also, a BLAS pointer is assigned to each instance. This allows multiple instances pointing at the same BLAS, realizing instancing.

To make sure this idea is concrete, let’s start with a simple example. Say we have three instances in the scene, with two instances sharing the same underlying mesh but different transforms. It would look like this:

illustration of the simple AS example

As you can see, instance A and B use the same mesh, so they share the same BLAS. This saves memory footprint for duplicate geometries. In addition, note that BLAS 2 is created even though it’s not used. This is because the system creates BLASs for all meshes, even if they are not used. This becomes significant in a second.

Handling animated meshes

Recall that the BLASs we have so far are built around static meshes. Once the mesh starts animating, BLAS is no longer accurate. To handle this, BLAS buffer is partitioned into two parts: permanent and transient. The permanent segment stores BLAS of all meshes in static form. Transient segment stores BLAS for all animated meshes.

Extending upon our last example, let’s say we add a new instance D that is associated with an animated mesh. To accommodate that, the AS system will “build” a brand new BLAS for instance D, and place it inside the transient segment. All BLASs in the transient segment are flushed and “rebuilt” every frame:

illustration of an AS example with deformable mesh

BLAS Refit (as a Compute Pass)

But hold on a second, wasn’t a fresh BVH build every frame too expensive?

The trick is to “refit” instead of “rebuild”. We can make the assumption that the structure of a mesh does not change much during animation. Recall that BLAS stores both geometry data and BVH data. All we have to do is to skin the vertices and place them in the geometry data, then “refit” the BVH to the skinned geometry. The permanent BLASs act as mold that are copied from and refitted to create BVHs for animated meshes.

A BVH refit is a relatively cheap operation, compared to a full rebuild. I don’t think BVH refit on the GPU is well covered on the internet, so I’m going to try talk about mine. It’s definitely not the best, but it gets the job done.

To achieve fast BVH refits, I store the indices to leaf nodes and a list of parent pointers for all nodes. We start off with the BVH:

Using the leaf node indices, we can launch as many GPU threads as there are leaf nodes, and immediately start chewing on the leaf nodes.

To handle leaf node refit, I just recalculate AABB of the underlying primitives that are pointed to by these leaf nodes. Pretty simple.

After finishing processing its node, each thread tries to grab the parent node. However, for every internal node, there will be two threads trying to grab it. To get around this, I have an atomic lock on each node that only allows one thread in. The thread that fails to grab the parent gets killed.

Once a thread grabs the parent, it knows for sure the subtree from which it comes up are all done processing. But it doesn’t know if the other child of the node has done processing yet. In this case, each thread must “spinlock” until both children are done processing (communicated through their locks):

Once both children are done, the thread just union both its children’s AABBs and update that as parent’s new AABB. Recurse this process until the last thread has reached root, and there you have it. Now your BVH has been refitted completely.

Don’t literally spinlock your GPU thread (like a busy wait). GPU execute groups of threads in lockstep. If you spinlock on one thread, then all the neighbor threads are locked too. If the thread you are waiting on happens to be one of them, then it will be a deadlock. The way my compute shader is set up is that each thread is executing a while loop, and at the end of while loop, there is a barrier that ensures all threads get to execute some code at each iteration, so all threads are synchronized. If for any thread whose dependent lock is not ready, I just skip the iteration for that thread. Instructions will still be executed for neighboring threads, they will just get masked off for the threads that are still waiting.

TLAS rebuild

Another important detail I glossed over is the implementation of TLAS. So far, our TLAS is just an array. If we were to do a ray traversal, then we would have to linearly burn through all the instances, which is quite inefficient, especially if the instance count is big.

An obvious solution is to build a BVH of instances. That’s exactly what a TLAS is. Since the number of instances can never get as big as the number of polygons in the scene, it’s okay to do a full rebuild on these instances every frame.


And there you have it. Following this process allows me to maintain a correct BVH for all instances in the scene, be it deformable or newly spawned. There are some complications with the actual ray traversal through a two-level BVH like this, but that will be the topic for another post.