A fast and easy particle system using GPU instancing
These days, pretty much any game engine comes bundled with some form of particle system that handles the creation, movement and rendering of a high number of (typically) small, homogeneous objects within a scene.
An example of a simple particle system simulating fire
When working on CocosSharp, we quickly realised that our particle system performance wasn’t ideal — something that was particularly noticeable on mobile platforms. Digging into the implementation, I thought one could redesign things to allow for GPU-based instancing (we’ll get to this in a bit) to significantly reduce the CPU overhead and thereby boost the overall rendering performance.
While an instancing-based particle system is by no means a novel idea, I found it pretty tough to find accessible resources online that described the nitty-gritty of such an implementation (at least for the specific approach I have in mind). So the motivation for this guide is, at the very least, to make developers aware of such an approach in the instance (that’s a pun) that they’re seeking to squeeze more performance out of their particle rendering.
If you’ve got a good understanding of particle systems and simply want to see the code, then skip straight to the implementation.
- Before we get started
- The standard (and slow) implementation
- Using instancing to speed things up
- Is it worth it? Let me work it…out
- Some limitations and alternative approaches
- Wrapping up and sample project
Before we get started
While the sample implementation uses C# and MonoGame 3.5, the aim is to make the guide platform-agnostic, so hopefully it should be equally relevant regardless of your chosen language and tools.
This is by no means the definitive approach for fast particle rendering. At the conclusion, I’ll discuss the limitations and advantages relative to other popular techniques.
The standard (and slow) implementation
When designing a particle system it is important to keep in mind certain unique characteristics that distinguish them from other renderable objects — namely,
- we’re usually rendering thousands of particles concurrently
- even within a three-dimensional scene, particles are typically rendered as flat sprites — that is, we geometrically represent a particle as a single quad
- generally homogeneous, in that the size, colour and underlying texture are consistent across particles (we’ll see later how one can add some variability)
- the (collision-free) movement path of particles is generally deterministic, with small random perturbations to give the illusion of a chaotic system
- collisions between particles are usually ignored
Again, the above list is by no means a strict requirement, but rather features of a particle system’s typical use cases (particularly for game development).
Using sprites to represent each individual particle
The life of a particle
Additionally, any given particle is usually only rendered for a short, finite time, and so any particle system implementation typically employs the following process:
- We emit a particle
- For every update step, we update the particle transform (position, scale, rotation etc.)
- The particle is rendered with the new transform
- After a fixed time t the particle is destroyed and removed from the scene
- We continue emitting further particles up to some particle or time limit
Note that, obviously, we can emit new particles while older particles are still alive — otherwise, it would be a pretty boring particle system!
Laying out our data
Now that we understand the particle lifecycle, a natural approach to setting up our data is to associate each particle with an affine transform, represented as a 4x4 (floating-point) matrix, that encapsulates all the information about the current position, rotation and scale. So we maintain an array of transforms, one entry per particle, i.e.
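Sketched in C#, this amounts to little more than a preallocated array (the sizing is covered next):

```csharp
// One affine transform per particle. The array is sized once, up-front,
// for the maximum number of particles alive at any one time.
Matrix[] particleTransforms = new Matrix[maxParticles];
```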
We know that the number of particles being rendered is dynamic over time so the question is how large should we make this array?
A cornerstone of high-performance game development (or otherwise) is to keep dynamic memory allocation to a minimum.
Rather than constantly rejigging the size of this array to match the number of active particles, we instead allocate the array once for the maximal number of particles that may be visible. For instance, if we have a particle life of 2 seconds and an emission rate of 1000 particles/second, then we require an array of size 2000.
Finally, we need to lay out our rendering data — that is, the vertex buffer that will be passed on to the GPU. Remember that we’re assuming a particle is rendered as a flat quad, meaning that we will require 4 vertices per quad.
Each vertex typically packs the associated position (three floats), texture coordinate (two floats) and colour (four bytes) — a total of 24 bytes of data per vertex and hence 96 bytes per particle. Moreover, each particle transform is a 4x4 floating-point matrix, meaning we have a further 64 associated bytes per particle. This may not seem like much, but once we scale the system to render hundreds of thousands or millions of particles, the required amount of data starts to add up!
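As a concrete sketch, the per-vertex layout might look as follows (MonoGame ships an equivalent built-in type, VertexPositionColorTexture):

```csharp
// One vertex of a particle quad: 24 bytes in total.
struct ParticleVertex
{
    public Vector3 Position;  // 12 bytes
    public Color Color;       //  4 bytes (packed RGBA)
    public Vector2 TexCoord;  //  8 bytes
}
```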
The CPU update-GPU render cycle
Once familiar with the standard design and data involved in creating a particle system, the actual computational steps are relatively straightforward. First, for a given game we set up a run-loop that will periodically request the particle system to perform the following:
In MonoGame, a run-loop is automatically set up for you within a Game class instance, periodically calling the Update and Draw methods.
- Compute each particle’s new transform
- Update the corresponding vertex buffer
- Load vertex buffer onto GPU
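A rough sketch of that per-frame cycle (ComputeTransform and WriteQuadVertices are hypothetical helpers standing in for the per-particle maths and vertex writing):

```csharp
// The standard CPU-bound cycle: recompute every transform, rebuild the
// full vertex buffer and re-upload it to the GPU every frame.
void UpdateParticles(float totalSeconds)
{
    for (int i = 0; i < activeParticleCount; i++)
    {
        particleTransforms[i] = ComputeTransform(i, totalSeconds); // CPU work
        WriteQuadVertices(vertices, i, particleTransforms[i]);     // CPU work
    }

    // The expensive part: pushing the whole buffer across to the GPU.
    dynamicVertexBuffer.SetData(vertices);
}
```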
Schematic representation of the CPU-GPU render cycle for a particle system. Note that a lot of the work is performed on the CPU.
Hopefully, one can see that the bulk of the work is performed by the CPU, and moreover there’s the further cost of constantly transferring a potentially large vertex buffer over to the GPU to render. If the particle system was the sole element of your scene then this may not be too big of a problem, but once you start adding other renderable objects, other CPU computations (e.g. physics calculations), input handling and so on, lessening the burden on the CPU can dramatically improve the overall performance of your game.
Using instancing to speed things up
Simply put, geometry instancing provides the opportunity to repeatedly render the same vertex buffer on the GPU within the same draw call. More importantly, we can further include a corresponding instance buffer to specify attributes that are unique to each instance of the rendered objects. For example, a common use of the buffer would be to contain the affine transform matrices of each instance.
A schematic representation showcasing the steps involved when performing geometry instancing on the GPU.
In this way, we can repeatedly render a complex geometric object in different positions, scales, rotations (and more) within the scene using a single draw call.
There’s instancing and then there’s instancing
The problem with the standard approach of creating an instance buffer filled with transforms is that we’re still doing a lot of the heavy computation on the CPU. Moreover, we still need to transfer this large buffer of matrices over to the GPU.
Instead, while we will still employ geometry instancing for our solution, the approach we take is to first slim down the data of our instance buffer to simply contain the starting time of each particle. Then, in the upcoming section, we show how we can perform essentially all the computations related to updating a particle’s state on the GPU.
The proposed alternative instancing implementation. Notice how we’re highlighting that the passed-in instance buffer is much slimmer, which will then be subsequently expanded on the GPU.
Now that the theory is behind us, let’s get to coding! In my case, I kick-start a new MonoGame Visual Studio project that provides the skeleton code for a game. In particular, an instance of the Game class will initialise our content, and it’s from here that the game’s run-loop makes its periodic Update and Draw calls.
For more details on getting started with MonoGame check out this great post here
We’re going to encapsulate the functionality of our particle system within a FastParticleSystem class. To begin, we define the properties of our system.
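A sketch of what those properties might look like (the member names here are illustrative rather than the sample project’s exact API):

```csharp
// Core user-facing properties of the system. Buffers are created later
// via RefreshParticleSystemBuffers.
public class FastParticleSystem
{
    readonly GraphicsDevice graphicsDevice;

    public int ParticleCount { get; set; }          // Maximum live particles
    public float ParticleLifetime { get; set; }     // Seconds a particle lives
    public Texture2D ParticleTexture { get; set; }  // Shared sprite texture
    public Effect ParticleEffect { get; set; }      // Our custom shader

    VertexBuffer vertexBuffer;    // The single shared quad
    IndexBuffer indexBuffer;      // Two triangles
    VertexBuffer instanceBuffer;  // Per-particle data (start time etc.)

    public FastParticleSystem(GraphicsDevice device)
    {
        graphicsDevice = device;
    }
}
```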
Once a user has created and customised their particle system, they call RefreshParticleSystemBuffers on the instance to generate both the vertex and instance buffers. In the former case, the vertex buffer is a very small one consisting of just four vertices — a single quad.
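A minimal sketch of the quad’s construction, using MonoGame’s built-in VertexPositionTexture type:

```csharp
// The single shared quad: a unit square centred at the origin, with
// matching texture coordinates. Every particle instance reuses these
// same four vertices.
var quadVertices = new[]
{
    new VertexPositionTexture(new Vector3(-0.5f, -0.5f, 0f), new Vector2(0f, 1f)),
    new VertexPositionTexture(new Vector3(-0.5f,  0.5f, 0f), new Vector2(0f, 0f)),
    new VertexPositionTexture(new Vector3( 0.5f, -0.5f, 0f), new Vector2(1f, 1f)),
    new VertexPositionTexture(new Vector3( 0.5f,  0.5f, 0f), new Vector2(1f, 0f)),
};

vertexBuffer = new VertexBuffer(graphicsDevice,
    VertexPositionTexture.VertexDeclaration,
    quadVertices.Length, BufferUsage.WriteOnly);
vertexBuffer.SetData(quadVertices);
```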
There’s also an associated index buffer whose initialisation I’ve omitted from the snippet for the sake of brevity.
In MonoGame, the provided interface for geometry instancing requires the use of indexed-primitives — that is, your vertex buffer must have a corresponding index buffer
Next, we create our instance buffer, which will store data specific to each particle. The key idea is that, unlike other approaches that store entire matrix transforms per instance, our buffer is going to be far simpler: we solely specify each particle’s starting time.
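A sketch of that buffer. The TEXCOORD1 semantic and the evenly staggered start times are assumptions for illustration; ParticleCount and ParticleLifetime are the system properties from earlier:

```csharp
// Per-instance data is a single float: the particle's emission time.
// A custom vertex declaration describes it to the GPU.
static readonly VertexDeclaration InstanceDeclaration = new VertexDeclaration(
    new VertexElement(0, VertexElementFormat.Single,
        VertexElementUsage.TextureCoordinate, 1));

var instanceData = new float[ParticleCount];
for (int i = 0; i < ParticleCount; i++)
    instanceData[i] = i * (ParticleLifetime / ParticleCount); // stagger starts

instanceBuffer = new VertexBuffer(graphicsDevice, InstanceDeclaration,
    ParticleCount, BufferUsage.WriteOnly);
instanceBuffer.SetData(instanceData);
```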
Finally, once the buffers are set, we render them.
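A sketch of the draw step, using MonoGame 3.5’s XNA-style DrawInstancedPrimitives signature. The second binding’s final argument (1) tells the GPU to advance the instance stream once per instance rather than once per vertex:

```csharp
// Bind the shared quad alongside the instance buffer, then issue a
// single draw call covering every particle.
graphicsDevice.SetVertexBuffers(
    new VertexBufferBinding(vertexBuffer, 0, 0),
    new VertexBufferBinding(instanceBuffer, 0, 1));
graphicsDevice.Indices = indexBuffer;

foreach (var pass in ParticleEffect.CurrentTechnique.Passes)
{
    pass.Apply();
    graphicsDevice.DrawInstancedPrimitives(
        PrimitiveType.TriangleList,
        0,              // base vertex
        0,              // min vertex index
        4,              // vertices in the quad
        0,              // start index
        2,              // primitives per instance (two triangles)
        ParticleCount); // number of instances
}
```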
Warning: As of MonoGame 3.5, DrawInstancedPrimitives is only supported by DirectX targets; however, the aim is to provide full cross-platform support.
Remember, MonoGame is an open-source project and relies on the community to help improve the framework. If you’re interested in contributing, check out the guide here to get started.
The particle shader
In the previous section we described how our instance buffer consists of the starting time of a given particle, but we didn’t describe how this would result in updating the state (position, size, colour etc.) of a given particle. In short, we will compute this directly on the GPU via our custom vertex shader.
If you’re not familiar with shaders and the programmable GPU pipeline, a good introduction can be found here.
For MonoGame users, documentation for creating and using custom effects (i.e. shaders) can be found here.
As an example, let’s say we want to simply move a particle vertically down (in two dimensions) over a distance d for a given lifetime. We could then parameterise the position of our particle relative to time — that is, x(t) = x0 and y(t) = y0 - d*t,
where t is the time normalised to lie between 0 and 1 (i.e. 0 corresponds to when the particle is emitted, 1 corresponds to the end of its full lifetime). Translating this into code, we define the shader constants.
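A sketch of those constants (the names are illustrative, not the sample project’s exact ones):

```hlsl
// Global shader constants, set from the game code each frame.
float CurrentTime;        // Total elapsed game time, in seconds
float ParticleLifetime;   // How long a single particle lives, in seconds
float FallDistance;       // The distance d travelled over a full lifetime
float4x4 WorldViewProjection;
```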
which would be initialised during the Draw call, i.e.
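For instance, assuming the illustrative parameter names above (they must match the shader’s globals exactly):

```csharp
// Inside the game's Draw method: push this frame's values into the effect.
ParticleEffect.Parameters["CurrentTime"].SetValue(
    (float)gameTime.TotalGameTime.TotalSeconds);
ParticleEffect.Parameters["ParticleLifetime"].SetValue(2.0f);
ParticleEffect.Parameters["FallDistance"].SetValue(5.0f);
ParticleEffect.Parameters["WorldViewProjection"].SetValue(world * view * projection);
```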
and subsequently, within the vertex shader, we compute the current time of the particle and normalise it to lie between 0 and 1. Finally, we use this normalised time to fully determine the position of our given particle.
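A minimal vertex shader sketch of this, assuming the timing constants above are declared as shader globals and the instance buffer supplies the start time under TEXCOORD1:

```hlsl
struct VertexShaderInput
{
    float3 Position  : POSITION0;   // From the shared quad
    float2 TexCoord  : TEXCOORD0;   // From the shared quad
    float  StartTime : TEXCOORD1;   // From the instance buffer
};

struct VertexShaderOutput
{
    float4 Position : SV_POSITION;
    float2 TexCoord : TEXCOORD0;
};

VertexShaderOutput MainVS(VertexShaderInput input)
{
    VertexShaderOutput output;

    // Age of this particle, wrapped so it is re-emitted at the end of
    // its life, then normalised to lie in [0, 1].
    float t = frac((CurrentTime - input.StartTime) / ParticleLifetime);

    // The position is fully determined by t: a simple vertical fall.
    float3 position = input.Position;
    position.y -= FallDistance * t;

    output.Position = mul(float4(position, 1), WorldViewProjection);
    output.TexCoord = input.TexCoord;
    return output;
}
```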
Once we get the hang of expressing the particle state as a function of time, anything is possible. For example, x(t) = r·cos(2πt), y(t) = r·sin(2πt) would give us a circular particle path, where r is a constant radius that could be passed in via the shader. Of course, we’re not limited to simply parameterising the position, but can do the same for a particle’s colour, rotation, size and more.
Typically, a particle system is attempting to simulate a complex, chaotic system, which can be achieved by incorporating some randomness into the starting and/or desired ending state of each distinct particle. For example, considering again a simple particle moving vertically, we may want to specify a range of x starting positions.
So we would initialise each particle to randomly select a starting horizontal position within this desired range. To incorporate this into our code, we first update our instance buffer to include a random value between 0 and 1.
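A sketch of the updated buffer, with the instance element growing to two floats (the extra TEXCOORD2 semantic is an assumption; it just needs to match the shader’s input):

```csharp
// Two floats per instance: the start time, plus a random value in [0, 1]
// used to vary the particle's starting position.
static readonly VertexDeclaration InstanceDeclaration = new VertexDeclaration(
    new VertexElement(0, VertexElementFormat.Single,
        VertexElementUsage.TextureCoordinate, 1),   // start time
    new VertexElement(sizeof(float), VertexElementFormat.Single,
        VertexElementUsage.TextureCoordinate, 2));  // random value

var random = new Random();
var instanceData = new float[ParticleCount * 2];
for (int i = 0; i < ParticleCount; i++)
{
    instanceData[2 * i]     = i * (ParticleLifetime / ParticleCount);
    instanceData[2 * i + 1] = (float)random.NextDouble();
}

instanceBuffer = new VertexBuffer(graphicsDevice, InstanceDeclaration,
    ParticleCount, BufferUsage.WriteOnly);
instanceBuffer.SetData(instanceData);
```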
We then subsequently update the draw call to specify the range of starting positions a particle can take
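For example (XStartMin/XStartMax are illustrative parameter names; the values are in world units):

```csharp
// Inside Draw: the horizontal range particles may start within.
ParticleEffect.Parameters["XStartMin"].SetValue(-2.0f);
ParticleEffect.Parameters["XStartMax"].SetValue(2.0f);
```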
Finally, we update our vertex shader code to incorporate our randomness
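Within the vertex shader, the per-instance random value (exposed here as input.Random via the assumed TEXCOORD2 semantic) maps onto the configured range, e.g.:

```hlsl
// Slots into the vertex shader: pick this particle's fixed horizontal
// starting position from the configured range using its random value.
float xStart = lerp(XStartMin, XStartMax, input.Random);
position.x += xStart;
```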
In general, it may not be necessary to constantly refresh the instance buffer with new random values over time. Aside from the performance benefit of avoiding this computation, if the number of particles rendered is relatively large, refreshing this large array of random values may not be visually noticeable anyway.
Putting it together
Finally, one can also utilise the instance index that’s passed into the vertex shader to add further variability. Specifically, within our shader we can define our input as
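a struct carrying the system-generated instance index. SV_InstanceID is supplied automatically by the GPU (Direct3D shader model 4 and up) as the zero-based index of the current instance, i.e. the particle number:

```hlsl
struct VertexShaderInput
{
    float3 Position   : POSITION0;
    float2 TexCoord   : TEXCOORD0;
    float  StartTime  : TEXCOORD1;
    float  Random     : TEXCOORD2;
    uint   InstanceId : SV_InstanceID;  // Filled in by the GPU
};
```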
which allows us to distinguish each instance via the passed-in instance id.
Combining all the elements discussed together, one can construct quite complex and sophisticated particle systems. For example, I was able to create a very slick looking galaxy by doing the following
- Emitting particles radially
- Radius is a function of both time as well as the particle id
- Use variable starting and ending colours within a random range
- Colour alpha value is a function of time
- Depth of particles is a function of the particle id to enable depth-sorting
Galaxy consisting of 200,000 particles!
Again, your particle system doesn’t necessarily have to be this complex; this is more a highlight of what is possible.
Is it worth it? Let me work it…out
Despite the relatively complex computations performed in the vertex shader within my example, I was able to render over 400,000 particles (800,000 triangles) at a consistent 60 frames per second on an integrated GPU (Surface Pro 4) — which, personally, I thought was fantastic, given my experience of standard particle systems struggling with fewer than a hundred thousand. However, there are a few things to keep in mind when comparing performance:
- Resolution: within my example, my game was running at 1024x768. In general, the higher the resolution, the lower the frame-rate.
- Fill-rate: similarly, the size of your particle sprites affects performance.
- Complexity of shader computations: simply translating particle positions versus, for instance, applying trigonometric calculations, colour blending, collisions etc.
- Alpha testing and culling can also impact the time spent processing each particle
This isn’t to say that faster implementations don’t exist (we’ll discuss some alternatives in the upcoming section), but simply that when making comparisons, we need to be sure we’re comparing apples with apples.
Some limitations and alternative approaches
Both the strength and weakness of this implementation is that we rely on the state of a particle being fully determined as a function of time. It is for this reason that we’re able to pass in a slimmed-down instance buffer consisting of particle time (and random values to give the illusion of chaos) and then rely on the GPU to determine all the properties of the system. As soon as a particle’s state can’t be determined ahead of time (such as when incorporating collision effects), this approach becomes far less viable.
Additionally, with respect to collision-free particle systems, there are a few additional changes we could have made to further help performance:
Use a geometry shader: in our approach, the vertex shader is called a total of four times for any given particle, resulting in a lot of repeated computations, such as determining the particle time. As an alternative, we could have the vertex shader process points rather than quads, which would subsequently be expanded into quads by the geometry shader. This would avoid repeated computations, because each point would be uniquely mapped to a particle. As an additional benefit, the geometry shader gives us the opportunity to discard particles that aren’t visible, potentially improving performance.
Use a compute shader: I have come across users speeding up their particle systems by employing compute shaders — a shader stage for performing arbitrary computations on the GPU. The advertised benefit is that we off-load the heavy computations of determining particle state to the compute shader, leaving the vertex and fragment shaders to solely perform the rendering, resulting in a more streamlined pipeline. It would be interesting to see what kind of gains could be achieved with this approach.
As of MonoGame 3.5, both geometry and compute shaders are unsupported.
Wrapping up and sample project
I hope you enjoyed this guide and that I’ve convinced you that, at the very least, there are better alternatives to the standard CPU update-GPU render approach when designing a particle system.
The sample project demonstrating my funky galaxy system can be found here, and if you have any feedback or suggestions please let me know down in the comments. Thanks!