The quest for the "best" random numbers on the GPU


While working with Perlin noise for procedural terrain generation, I noticed that the resulting terrain exhibited some unpleasant repeating patterns. It didn’t take me long to realize that the culprit was the infamous one-liner we have all been guilty of using at some point to get quick and dirty random numbers in shaders:

float rand(vec2 co) {
    return fract(sin(dot(co.xy, vec2(12.9898, 78.233))) * 43758.5453);
}

The function conveniently returns a floating point value between 0 and 1 for a given 2D coordinate. However, its behavior is machine dependent, since the sine implementation is a black box that varies between vendors. It also tends to produce moiré patterns and breaks down for sufficiently large inputs. So, can we do better? Are there more robust techniques that offer better quality at a similar (or even lower) performance cost?

What we are looking for

We are mainly interested in generating random numbers for the purpose of computer graphics. The literature on pseudorandom number generators (PRNGs) clearly distinguishes between cryptographically secure and non-cryptographic generators. We care about the latter, because we only need random-looking output and don’t care too much about the statistical properties of the generated numbers. Performance, in contrast, is a high priority: the chosen algorithm must cost about the same as the infamous one-liner.

Interestingly, achieving random-looking output is not straightforward. The human brain is great at recognizing patterns, and many common PRNGs fail visibly in a simple 2D plot of their output. So, even if we are not aiming for statistical quality, we might need to keep it in mind anyway.

Another restriction I imposed on myself is that the chosen approach must be self-contained and implemented entirely on GPU hardware, i.e. doing texture lookups is not a valid option. A single texture would not provide enough random values for many applications, and it can also be argued that modern GPUs handle multiple ALU instructions more efficiently than a single texture lookup.

The terms “wide” and “deep”, which to my knowledge were first coined by Nathan Reed in this blog post, are also something to take into account. PRNGs are designed to work by going “deep”: a single instance of the PRNG is initialized with a fixed seed, and its last output is used as the seed for the next call. These generators rely on sequentially updating their internal state after each call to produce “randomness”, so they do not yield good results when only the first number in the sequence is used, particularly if the initial seeds are highly correlated. This is especially problematic on the GPU, since we cannot share state between threads¹, and we must be able to use adjacent initial seeds. For example, in GPGPU programming, the invocation id is usually used as the initial seed. For shaders in the graphics pipeline, any value that is unique to the shader invocation is used, such as the vertex or pixel position, UV coordinate, instance id, etc.

Hash functions, on the other hand, are stateless and designed to go “wide”. They do not depend on a single instance that updates its state after each sequential call, and produce random values that are well-distributed when using neighboring seeds. Unlike PRNGs, a hash function allows us to independently generate a random number for each shader invocation that is sufficiently random with respect to the ones produced by other invocations, and without sharing any state.

What are our options

So we need a hash function. Coming up with one from scratch is a bit pointless, since there are many existing ones to choose from. Ideally, we are looking for a hash function that is cheap (a handful of ALU instructions), produces well-distributed output for adjacent seeds, and shows no visible patterns when plotted.

A great starting point is this 2020 paper by Mark Jarzynski and Marc Olano, Hash Functions for GPU Rendering. Quoting the paper:

[…] many of the GPU hashes in use today have only appeared in blog posts or shadertoy code, and have never been formally evaluated or compared to other alternatives. This leaves anyone looking for a good GPU hash with little guidance to decide which to use.

Indeed, the state of the art regarding GPU hash functions seems to be scattered across different sources. The paper provides a useful baseline to work with, and any of the recommended hashes can be a good choice. An especially popular one, probably because of Nathan Reed’s article, is the 32-bit PCG hash.

I also came across this amazing article by Chris Wellons, which describes a clever Monte Carlo-based approach, the hash prospector, to find the best candidate from a pool of billions of potential integer hash functions. Essentially, the program has a list of possible operations that serve as the building blocks of the hash function, and randomly combines five of them (or another configurable number) to evaluate the avalanche behavior of the result.

Although it uses pure brute force to find good hashes, the prospector was quite successful: it found that the construction “XOR-right-shift, multiply, XOR-right-shift, multiply, XOR-right-shift” delivers pretty good results. This aligns with more theoretical/empirical findings that claim the multiply-XORshift technique creates good avalanche behavior. TheIronBorn used some optimization techniques to discover the best parameters currently known for the XORshift-multiply-XORshift construction, i.e. the ones that minimize the bias. This is the resulting hash, which will be referred to as lowbias32 from now on:

uint hash(uint x) {
    // lowbias32: XORshift-multiply-XORshift with TheIronBorn's constants.
    x ^= x >> 16;
    x *= 0x21F0AAADu;
    x ^= x >> 15;
    x *= 0xD35A2D97u;
    x ^= x >> 15;
    return x;
}

It might be worth mentioning that a user named fp64 described another potential operation called “xorsquare”, which might yield better theoretical results at the cost of a bit of performance. I decided to ignore it for now, since the current “best” construction seems to have enough quality for all my use cases.

Hash comparison

Choosing between lowbias32 and other, more theoretically grounded hashes is mostly a matter of personal preference.² When comparing their output visually, they all have sufficient quality that it’s very hard (at least for me) to find any obvious patterns. Still, for completeness, I used PractRand to compare several common 32-bit integer hashes. PractRand, which seems to be the state of the art when it comes to PRNG statistical testing, reads a random stream of bytes from standard input and performs statistical tests on it. Quoting an article by the PCG author:

Much like experiments in other fields of science, these tests typically produce a p-value, but whereas scientists usually desire results where the null hypothesis—that the observations were merely due to chance—is ruled out, we desire the opposite: a result confirming that the observed behavior is consistent with chance.

If PractRand detects that the observed output is not consistent with chance, it means there is a clear bias in the PRNG algorithm, and the test fails. The number of bytes read up to that point indicates the “strength” of the algorithm: surviving more bytes usually means higher statistical quality. I also calculated the bias of each hash function using the hash prospector. Bias measures the correlation between which output bits flip for a particular flipped input bit, so lower is better.

32-bit Hash Function                       | PractRand | Bias
-------------------------------------------|-----------|-----------
iqint1                                     | 1 KB      | 704.6363…
xxHash                                     | 16 KB     | -
Wang Hash                                  | 16 KB     | 36.00093…
MurmurHash3                                | 16 KB     | -
lowbias32 (with TheIronBorn’s parameters)  | 64 KB     | 0.107602…
PCG (version from Jarzynski and Olano)     | 4 MB      | 7.511839…

Some popular hashes have not been considered for the table above because they have very poor visual quality.
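
To make the bias column more concrete, here is a rough sketch of how avalanche behavior can be estimated (a simplification; the prospector’s actual bias metric is more involved): flip each input bit in turn and count how often each output bit flips, which should be exactly 50% of the time for an ideal hash.

#include <stdio.h>
#include <stdint.h>

/* Hash under test; lowbias32 from above. */
static uint32_t hash(uint32_t x) {
    x ^= x >> 16;
    x *= 0x21F0AAADu;
    x ^= x >> 15;
    x *= 0xD35A2D97u;
    x ^= x >> 15;
    return x;
}

int main(void) {
    enum { TRIALS = 1 << 20 };
    /* flips[i][j]: how often output bit j flipped when input bit i flipped. */
    static uint64_t flips[32][32];
    for (uint32_t x = 0; x < TRIALS; x++) {
        uint32_t h = hash(x);
        for (int i = 0; i < 32; i++) {
            uint32_t diff = h ^ hash(x ^ (1u << i));
            for (int j = 0; j < 32; j++)
                flips[i][j] += (diff >> j) & 1u;
        }
    }
    /* Report the worst deviation from the ideal 50% flip rate. */
    double worst = 0.0;
    for (int i = 0; i < 32; i++) {
        for (int j = 0; j < 32; j++) {
            double dev = (double)flips[i][j] / TRIALS - 0.5;
            if (dev < 0.0) dev = -dev;
            if (dev > worst) worst = dev;
        }
    }
    printf("worst flip-rate deviation from 0.5: %f\n", worst);
    return 0;
}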

PractRand was designed with PRNGs in mind, not hash functions. Since hash functions lack an internal state, we use a counter as the input, which has the additional advantage of letting us check how the hash performs with adjacent seeds. It’s important to note that chaining the hash calls, i.e. using the previous output as the seed for the next call, would yield better PractRand results, but it would no longer reflect the “wide” usage pattern we care about. I use the following C code to output the random stream for PractRand:

#include <stdio.h>
#include <stdint.h>

uint32_t hash(uint32_t x) {
    /* Hash to test; e.g. lowbias32 from above. */
    x ^= x >> 16;
    x *= 0x21F0AAADu;
    x ^= x >> 15;
    x *= 0xD35A2D97u;
    x ^= x >> 15;
    return x;
}

int main() {
    uint32_t i = 0;
    while (1) {
        uint32_t random = hash(i++);
        fwrite((void*)&random, sizeof(random), 1, stdout);
    }
    return 0; /* Never reached */
}

In terms of performance, there is little to separate them: PCG and lowbias32 take almost the same time, while MurmurHash3 takes a bit longer. This is to be expected: the XORshift-multiply scheme used by lowbias32 is also the final mixing step of MurmurHash3, so MurmurHash3 will always perform worse simply because it executes more operations.

I personally decided to go for lowbias32. Its bias is significantly better than that of the alternatives, while it also performs well enough in PractRand. This makes sense, as the hash prospector explicitly optimizes for low bias, while PCG doesn’t. PCG seemed to perform better in formal statistical testing, but I wanted to prioritize a low bias over statistical robustness. lowbias32 is also really fast, as it only involves 2 integer multiplications, 3 bit shifts, and 3 bitwise XORs.

Input dimensionality

The methods we have discussed so far accept just one dimension of input (a single 32-bit unsigned integer). There are multidimensional hash functions that can handle vectors (multiple 32-bit integers) for both input and output. These functions intertwine their inputs to ensure that a change in any input component affects all outputs.

For now, let’s focus on N -> 1 hash functions, i.e. functions that convert N-dimensional input to a single output. These functions are useful when our seed is a 2D or 3D vector, like a vertex position or a UV coordinate. A common way to convert a 1 -> 1 hash function to N -> 1 is to do a linear combination of the input components using a prime multiplier on each dimension:

\[ \text{hash}(X \cdot a + Y \cdot b + Z), \]

where \(X\), \(Y\), \(Z\) are the input components, and \(a\) and \(b\) are large prime numbers.

This method works fine in general, but the output might exhibit some repeating patterns depending on the chosen prime, and the general quality of the hash suffers greatly, as shown by the Jarzynski and Olano paper.
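
As an illustration, a 3 -> 1 variant built this way might look like the following sketch (the hash_linear name is mine, and the two constants are xxHash’s 32-bit primes, chosen purely as examples):

uint hash_linear(uvec3 x) {
    // a = 0x9E3779B1, b = 0x85EBCA77: example primes borrowed from xxHash.
    // The quality of the result is sensitive to this choice.
    return hash(x.x * 0x9E3779B1u + x.y * 0x85EBCA77u + x.z);
}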

An alternative method, first shown by Perlin in his classic 1985 paper³, involves nesting the hash:

\[ \text{hash}(X + \text{hash}(Y + \text{hash}(Z))), \]

where \(X\), \(Y\), \(Z\) are the input components.

While this method avoids the quality issues of the linear combination approach, it requires multiple calls to the hash function, which negatively impacts performance. This may not be a problem if the hash is cheap enough, and the significant quality improvement usually justifies the extra cost. I would personally default to the nested method in most cases and use the linear combination method only when the target hardware is limited and known not to handle integer arithmetic well.

In GLSL we can take advantage of function overloading to implement the different variants:

uint hash(uint x) {
    // lowbias32 body from above
}
uint hash(uvec2 x) { return hash(x.x + hash(x.y)); }
uint hash(uvec3 x) { return hash(x.x + hash(x.y + hash(x.z))); }
uint hash(uvec4 x) { return hash(x.x + hash(x.y + hash(x.z + hash(x.w)))); }

Output dimensionality

Sometimes we need to generate multiple random numbers in a single shader invocation. For instance, in path tracing we require many independent random numbers per pixel to sample BRDFs. Similarly, in procedural terrain generation we usually combine several layers of noise, and each one must be independent of the others to avoid self-similarity.

If the number of independent random numbers needed per invocation is low, a common way to convert a 1 -> 1 hash to 1 -> N is to apply a translation to the input by offsetting it with an arbitrary number:

\[ [\text{hash}(X), \text{hash}(X + a), \text{hash}(X + b)], \]

where \(X\) is the input or seed, and \(a\) and \(b\) are the arbitrary numbers.
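
In GLSL, such a helper might look like the following sketch (the hash3 name is mine, and the offsets are arbitrary constants):

uvec3 hash3(uint x) {
    // The offsets just need to be fixed and distinct; their values are arbitrary.
    return uvec3(hash(x), hash(x + 0x9E3779B9u), hash(x + 0x85EBCA6Bu));
}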

If we need a larger quantity of uncorrelated random numbers per invocation with better quality, we can use PRNGs. The approach involves using one of the previously mentioned hash functions to generate an initial seed for the PRNG. Each shader invocation has its own instance of the PRNG, eliminating the need for tracking a global state, as the state is maintained locally. As we mentioned earlier, PRNGs are not designed to go “wide”, but hashing their initial seeds ensures that their output remains uncorrelated.

Some sources, like this one, recommend against using different seeds when parallelizing PRNGs, suggesting that jump-ahead should be preferred when possible. I guess this might apply to more serious simulations where the statistical quality of the generated numbers is very important. In my own experience with computer graphics, chaining a hash and a PRNG delivers good results, and I have yet to find any easy-to-spot patterns when using this method.

Choosing a PRNG for chaining

I will not go too in-depth on choosing a non-cryptographic PRNG. Mersenne Twister, xorshift, PCG, and xoshiro/xoroshiro are all great options. The last two (PCG and xoshiro) are more modern, so they are usually the recommended choices nowadays. Their authors were involved in an interesting technical feud over which algorithm is superior, throwing statistical tests at each other until one emerged victorious. I accept the importance of mathematical theory and rigorous analysis, but I think that this choice is mostly a matter of fanaticism.

I personally use PCG, but I don’t have a special preference for one or another.
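
To sketch the chaining in GLSL, here is the widely circulated 32-bit PCG variant (PCG-RXS-M-XS), with the state seeded through our hash; the function names are my own:

// Per-invocation PRNG state; plain register storage, nothing is shared.
uint rng_state;

void seed_rng(uint invocation_id) {
    // Hashing the invocation id decorrelates adjacent seeds.
    rng_state = hash(invocation_id);
}

uint rand_pcg() {
    // Advance the underlying LCG, then apply the RXS-M-XS output permutation.
    uint state = rng_state;
    rng_state = rng_state * 747796405u + 2891336453u;
    uint word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    return (word >> 22u) ^ word;
}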

Random floating point numbers

In shaders we usually need random floats, not integers. High-quality random floating point numbers are difficult to obtain directly using only floating point operations. As we saw earlier, integer arithmetic has some great properties that are leveraged by PRNGs and hashes to obtain high-quality random sequences with good statistical properties.

Fortunately, modern GPU hardware provides great support for integer arithmetic, achieving performance similar to that of floating point operations. In the case of Nvidia, here is a nice table comparing the throughput of arithmetic instructions. In the past this wasn’t the case, as int32 math on the GPU was either prohibitively expensive or outright impossible. Back then it made sense to consider the 24-bit integer multiplication intrinsics, or the one-liner from the beginning of the article that only involves fp32 operations.

GLSL, HLSL, CUDA and OpenCL all use the IEEE 754 binary32 format for single-precision floating point numbers.⁴ A well-known technique exploits this format to construct a random floating point number in the half-open range \([0,1)\) starting from a random 32-bit integer.

The real value represented by a given 32-bit IEEE 754 binary32 data is

\[ (-1)^{\text{sign}} \cdot 2^{\text{exponent}-127} \cdot (1 + \text{significand}). \]

To construct the random float, we select any 23 bits from the random integer to use as the significand.⁵ We then set the sign bit to 0 and the exponent to 127, which yields a float in \([1,2)\), and finally subtract 1 to map it to \([0,1)\). In GLSL:

float uint_to_normalized_float(uint x) {
    // Keep the top 23 bits as the significand (x >> 9) and OR in
    // sign = 0, exponent = 127 (0x3F800000): a float in [1, 2).
    return uintBitsToFloat(x >> 9 | 0x3F800000u) - 1.0;
}

To use floating point numbers as the input for the integer hash, we can simply interpret the bits of the input single-precision float as a 32-bit unsigned integer. In GLSL:

uint float_to_uint(float x) {
    // Reinterpret the bits; no numeric conversion takes place.
    return floatBitsToUint(x);
}
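
Putting the pieces together, a drop-in replacement for the “fract sin” one-liner could look like the following sketch:

float rand(uvec2 coord) {
    // Nested lowbias32 overload from earlier, mapped to [0, 1).
    return uint_to_normalized_float(hash(coord));
}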

Conclusion

My current setup employs the lowbias32 hash function as the primary RNG mechanism in shaders. It uses the nesting method to accept 1 to 4 input parameters, depending on the dimensionality of the shader invocation’s unique identifier. For example, in compute shaders the unique identifier is the thread id, while in vertex shaders I usually use the vertex position. If more than one random number is required per invocation, I first try the translation trick. Sometimes I need more than 4 numbers, or better quality in general, so I create an instance of the PCG PRNG that is local to the invocation and feed it the output of lowbias32 as the initial seed. The floating point conversion described above is almost always applied at the end to obtain a useful floating point value.

There are many options when it comes to random number generation, and most are objectively better than the “fract sin” one-liner, so there is little reason to use it nowadays. Diving into the RNG rabbit hole has been a great learning experience, but hopefully this article saves valuable time for those who are busy exploring other rabbit holes!


  1. Even if we could, it would involve using shared memory and atomic operations, which is a big no-no in terms of performance. ↩︎

  2. Mathematical rigor and empirical justifications of statistical quality are much more important in “serious” applications - as I said, we only care about the hash looking random. ↩︎

  3. Perlin, K. (1985). An image synthesizer. ACM SIGGRAPH Computer Graphics, 19(3), 287–296. https://doi.org/10.1145/325165.325247 ↩︎

  4. Some hardware and older versions of these frameworks might not support the IEEE 754 standard, but it is safe to say that most do. They might also not adhere strictly to the standard because they add additional caveats or constraints that are not found in the original IEEE 754, but they certainly use the same internal bit representation, which is what matters to us. ↩︎

  5. It is possible to use any 23 of the bits because, theoretically, the chosen PRNG produces each bit as if it were a perfect coin toss, i.e. each bit has a 50% chance of being 0 or 1 and is independent of the others. However, in practice there will always be a small bias, and it is usually a good idea to pick the higher bits. ↩︎

#graphics #shaders #rng
