Graphics Programming with Rust and WGPU Part 1: Introduction to Shaders
I've recently been playing around with programming 3D graphical applications in Rust. This is supposed to become the guide I wish I had when I started. I want to show both the concepts and the implementation, while the writing helps me structure my own process.
Understanding the GPU
There are a few moving parts that we need to keep track of when writing a graphical application (this is true for every language). First and most importantly, an application that executes code on the GPU has to deal with two separate memory regions: one that is private to the GPU, and main memory, which is accessible to the CPU.
In the same way that we have separate memory regions, we now have multiple compute units. There is the CPU that we are programming directly, and there is the GPU that the main program can delegate tasks to and collect results from. As the GPU is a peripheral to our main thread of control running on the CPU, the CPU is tasked with loading the program and the data into GPU memory. The programs that are executed on the GPU are called "shaders". They are typically loaded by our main application at runtime, compiled for the architecture of the particular GPU in our system, and then transferred to the GPU. The program flow (including loading the shader and preparing the data that the GPU performs its computations on) is controlled by the CPU. The GPU acts as a high-performance "extra" computer, similar to a control computer dispatching work to the "cloud" or to a more powerful server machine.
In a sense a GPU is 'just another CPU' to run stuff on. This would however ignore the fact that the GPU is built differently from a CPU, which makes it much better suited to graphics-related tasks. A CPU is optimized for general workloads: both its instructions and its memory access machinery are built to suit as many different kinds of tasks as possible. CPU memory controllers, for example, try to keep good performance for random memory accesses. The GPU, however, is built to efficiently run very similar operations on many distinct pieces of data (like computing the color of a pixel).
A GPU is essentially a highly parallel batch processor. For a video game there is at least one batch computation performed per frame, which takes the 3D geometry of the scene and draws it onto the screen. A single batch operation is modeled as a pipeline, which is typically divided into multiple stages. There are different types of pipelines, the most common being the render pipeline. A pipeline is the interface provided by the GPU driver. It holds the information that tells the GPU which parts of its memory need to be linked to which parts of the shader code, and it keeps track of the settings used for the various (in some cases optional) special features.
A GPU is controlled by 'running' data through a particular pipeline. Assuming our goal is to render an image, there are situations in which a GPU will execute multiple different pipelines to generate a single image. The process for executing any sort of pipeline is as follows:
- Prepare the data and shader code for processing on the GPU
- Send Data to the GPU memory from main memory
- Execute Pipeline on the GPU
- Read back Computation result from GPU (or have it render the calculation result as output image)
- Repeat
Sometimes the data will already be on the GPU, in which case the second step (the upload) can be skipped.
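To make the cycle above a bit more tangible, here is a toy sketch in plain Rust. It does not talk to a real GPU and it is not the wgpu API: MockGpu, upload, execute and read_back are names I made up for illustration. The only point is the shape of the control flow, with the CPU side driving every step.

```rust
// A toy stand-in for a GPU (hypothetical, not a real API): it owns memory
// that the "CPU side" can only reach through explicit upload and read-back.
struct MockGpu {
    memory: Vec<f32>,
}

impl MockGpu {
    fn new() -> Self {
        MockGpu { memory: Vec::new() }
    }

    // Upload: copy data from main memory into GPU memory.
    fn upload(&mut self, data: &[f32]) {
        self.memory = data.to_vec();
    }

    // Execute: run the same small program on every element.
    fn execute(&mut self, shader: fn(f32) -> f32) {
        for value in self.memory.iter_mut() {
            *value = shader(*value);
        }
    }

    // Read back: copy the result into main memory again.
    fn read_back(&self) -> Vec<f32> {
        self.memory.clone()
    }
}

// Prepare: the "shader" is defined on the CPU side.
fn double(x: f32) -> f32 {
    x * 2.0
}

// One full cycle: upload, execute, read back.
fn run_once(gpu: &mut MockGpu, input: &[f32]) -> Vec<f32> {
    gpu.upload(input);
    gpu.execute(double);
    gpu.read_back()
}

fn main() {
    let mut gpu = MockGpu::new();
    println!("{:?}", run_once(&mut gpu, &[1.0, 2.0, 3.0]));

    // The data is still in GPU memory, so the upload can be skipped.
    gpu.execute(double);
    println!("{:?}", gpu.read_back());
}
```

The second half of main shows the shortcut mentioned above: when the data already lives in GPU memory, only the execute and read-back steps are repeated.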
Data on a GPU
For the next part we assume that the GPU is used to render a frame of a 3D scene, meaning that we want to display the view into a 3D world that we 'see' with the help of the GPU and a lot of math. In this context, data on a GPU comes in two forms: uniforms and buffers. To explain this in a bit more detail, we need to take a look at how a 3D render is generated (from a mathematical standpoint) and how a GPU achieves its high computational throughput.
Any 3D scene consists of a set of objects. These objects must be illuminated and the resulting light pattern captured by our camera. Objects are represented by their surface geometry, which must be described in such a way that the computer can process it. This is done by representing every object as a set of triangles in 3D space. The GPU does not care about their topology but simply treats every triangle as a single unit. The topology is implicit in the vertices of the triangles. The only thing our rendering engine cares about is triangles. The idea here is that every triangle can be processed independently of the others, which means that a multi-core machine can work on many triangles simultaneously. This assumption of independent (more or less identical) processes executed in parallel is the use case that the GPU hardware is optimized for, and it is what gives the GPU its advantage over the CPU for graphics applications.
The vertices of the triangles (along with additional per-vertex information) are stored in buffers. Buffers hold the 'input' to the render pipeline. In contrast to buffers, uniforms store information that acts like environment variables for the program running on the GPU. While the direct program input comes from the buffers and is different for every shader invocation, uniforms hold information that is static for the entire execution of the render pipeline. As such, shaders may read additional information (like the camera, or textures, which are bound in a similar way) and apply that information to the vertex or pixel being processed.
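The difference between the two kinds of data can be sketched on the CPU in a few lines of Rust. This is only an analogy under my own assumptions: Vertex, Uniforms, shade_vertex and run_pipeline are hypothetical names, and the 'shader' merely adds an offset. What matters is that every invocation sees a different buffer element but the same uniform value.

```rust
// Per-vertex data: one entry per shader invocation (lives in a buffer).
#[derive(Clone, Copy)]
struct Vertex {
    position: [f32; 3],
}

// Uniform data: one value shared by every invocation of a pipeline run.
struct Uniforms {
    offset: [f32; 3],
}

// Stand-in for a shader: it sees exactly one vertex plus the shared uniforms.
fn shade_vertex(vertex: &Vertex, uniforms: &Uniforms) -> [f32; 3] {
    [
        vertex.position[0] + uniforms.offset[0],
        vertex.position[1] + uniforms.offset[1],
        vertex.position[2] + uniforms.offset[2],
    ]
}

// Stand-in for one pipeline execution. Each call to `shade_vertex` is
// independent of the others, which is what a GPU exploits to run them in
// parallel; here they simply run one after another.
fn run_pipeline(buffer: &[Vertex], uniforms: &Uniforms) -> Vec<[f32; 3]> {
    buffer.iter().map(|v| shade_vertex(v, uniforms)).collect()
}

fn main() {
    let buffer = [
        Vertex { position: [0.0, 0.0, 0.0] },
        Vertex { position: [1.0, 0.0, 0.0] },
        Vertex { position: [0.0, 1.0, 0.0] },
    ];
    let uniforms = Uniforms { offset: [10.0, 0.0, 0.0] };
    for position in run_pipeline(&buffer, &uniforms) {
        println!("{:?}", position);
    }
}
```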
How to draw an image and a bit of history
The task of the GPU is to take this set of triangle coordinates along with some other data and compose a colorful image out of it. To turn a bunch of textures and an even larger bunch of triangles into a fancy computer generated image, quite a bit of calculation needs to be performed.
In the very early days of GPUs these calculations were hard-wired on the GPU and could not really be altered. To have the GPU produce a usable image, the data needed to be prepared carefully so that the GPU produced the intended result. Now, with ever more transistors being put onto a single chip, GPUs have become 'programmable'. This means that instead of hard-wired stages there are many processors placed beside each other on a single die, sharing large amounts of memory with each other. These processors are in no way as fast as their CPU cousins but are instead optimised for crunching numbers in situations where a lot of numbers need to be processed in a very similar way. This means that the processors on a GPU have very wide SIMD instructions and possibly many sets of internal registers to allow for rapid task switching. In contrast to a CPU, however, they are clocked lower (from hundreds of MHz up to around 2 GHz, depending on the generation, instead of 3-5 GHz). The memory architecture is also optimized for fetching data sequentially instead of full random access, limiting the speed of random accesses in favour of being able to load one triangle after another very quickly (which is a nice example of making the common case fast).
As described before, a GPU is somewhat of a batch computer and each batch is split into multiple stages. In the case of rendering an image there may be a plethora of stages, but in the minimal case a programmer has to care about two of them: the vertex stage and the fragment stage. In the vertex stage, the on-screen position of every corner of every triangle is determined. In the fragment stage, the color of every pixel inside the visible triangle area is determined. In between, some fancy hard-wired GPU circuits determine which pixels need to be run through the fragment stage for any given triangle. At the end the pixels are merged together, most often using a z-buffer algorithm, to determine the final color of the pixel on the screen.
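The z-buffer merge at the end of the pipeline is simple enough to sketch on the CPU. The following Rust snippet is my own minimal illustration, not how any particular GPU implements it: Fragment and Framebuffer are hypothetical types, and I assume that a smaller depth value means 'closer to the camera'.

```rust
// A fragment: a candidate color for one pixel, together with its depth.
#[derive(Clone, Copy)]
struct Fragment {
    x: usize,
    y: usize,
    depth: f32, // assumption of this sketch: smaller means closer
    color: [u8; 3],
}

struct Framebuffer {
    width: usize,
    color: Vec<[u8; 3]>,
    depth: Vec<f32>,
}

impl Framebuffer {
    // Start with a black image and "infinitely far away" depth everywhere.
    fn new(width: usize, height: usize) -> Self {
        Framebuffer {
            width,
            color: vec![[0, 0, 0]; width * height],
            depth: vec![f32::INFINITY; width * height],
        }
    }

    // Keep the fragment only if it is closer than what is already stored.
    fn merge(&mut self, fragment: Fragment) {
        let index = fragment.y * self.width + fragment.x;
        if fragment.depth < self.depth[index] {
            self.depth[index] = fragment.depth;
            self.color[index] = fragment.color;
        }
    }

    fn pixel(&self, x: usize, y: usize) -> [u8; 3] {
        self.color[y * self.width + x]
    }
}

fn main() {
    let mut fb = Framebuffer::new(2, 2);
    // Three triangles cover the same pixel; the order of arrival is arbitrary.
    fb.merge(Fragment { x: 0, y: 0, depth: 0.8, color: [255, 0, 0] });
    fb.merge(Fragment { x: 0, y: 0, depth: 0.3, color: [0, 255, 0] });
    fb.merge(Fragment { x: 0, y: 0, depth: 0.5, color: [0, 0, 255] });
    println!("{:?}", fb.pixel(0, 0)); // prints [0, 255, 0]
}
```

Because the comparison only needs the depth stored for that one pixel, the triangles can arrive in any order and the closest one still wins.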
As GPU programmers we get to write (at least) two shaders (that's the name for a GPU program): the vertex shader and the fragment shader. Assuming sensible geometry, the job of the vertex shader is to determine the position of a given triangle vertex on the screen. For this, linear transformations are used to first place a given object at the correct location inside a scene, and then to transform the scene in such a way that we get something that resembles the scene as viewed from the position of a camera. This means a lot of matrix multiplications on quite a lot of triangle vertices (for large scenes this easily goes into the millions of vertices). The calculations, however, are essentially always the same (meaning we don't branch in our program a lot). This is why GPUs can outperform the CPU: a GPU can process many hundreds of vertices simultaneously, while even with some assembly magic a multi-threaded program running on a CPU might be able to process ~20 vertices simultaneously.
The fragment shader's job is to determine the color of a pixel in the visible part of the triangle. In preparation for this calculation, special-purpose hardware might sample textures to determine the base color of the pixel under ideal lighting conditions. This color is then mixed with reflections and other lighting effects (like the color and brightness of possibly many different light sources) to determine the color of the pixel, assuming this particular triangle is visible. This part is where shaders get their name from, as they calculate the 'shade' of any given pixel. Assuming we have only a single triangle, the fragment shader would be executed once for every pixel in that triangle. If we have multiple triangles, then in the simplest case the fragment shader is executed for every pixel of each of them, even where they overlap. This means that the fragment shader will be executed very often.
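The chain of transformations in the vertex stage boils down to multiplying a 4x4 matrix with a 4-component vector, twice. Here is a small CPU-side sketch in Rust. The helper names (mul_vec, translation, scaling, transform_vertex) are mine, and the scaling matrix is only a stand-in for a real view-projection matrix, which would be more involved.

```rust
// A 4x4 matrix stored as four columns, mirroring how WGSL's mat4x4<f32>
// is constructed from four vec4 columns.
type Mat4 = [[f32; 4]; 4];
type Vec4 = [f32; 4];

// Matrix times column vector.
fn mul_vec(m: &Mat4, v: Vec4) -> Vec4 {
    let mut out = [0.0; 4];
    for col in 0..4 {
        for row in 0..4 {
            out[row] += m[col][row] * v[col];
        }
    }
    out
}

// Translation by (x, y, z): identity with the offset in the last column.
fn translation(x: f32, y: f32, z: f32) -> Mat4 {
    [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [x, y, z, 1.0],
    ]
}

// Uniform scaling; used here as a placeholder for a view-projection matrix.
fn scaling(s: f32) -> Mat4 {
    [
        [s, 0.0, 0.0, 0.0],
        [0.0, s, 0.0, 0.0],
        [0.0, 0.0, s, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
}

// First place the vertex in the world, then move it into the camera's view.
fn transform_vertex(model: &Mat4, view_proj: &Mat4, position: [f32; 3]) -> Vec4 {
    // w = 1.0 marks a position, so translations apply to it.
    let world = mul_vec(model, [position[0], position[1], position[2], 1.0]);
    mul_vec(view_proj, world)
}

fn main() {
    let model = translation(1.0, 2.0, 3.0);
    let view_proj = scaling(2.0);
    let clip = transform_vertex(&model, &view_proj, [1.0, 1.0, 1.0]);
    println!("{:?}", clip); // prints [4.0, 6.0, 8.0, 1.0]
}
```

Note that the function contains no branch that depends on the vertex data: every vertex takes exactly the same path through the code.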
More than one pass
Sometimes it is necessary to calculate things before an image can be rendered to the screen. It is necessary, for example, to calculate a 'shadow map', which is needed to determine whether a particular part of any given triangle is in shadow or is illuminated by a light source. As the light sources and the camera move, shadow maps need to be regenerated for every frame. As such we now need to run two different pipelines, with different shaders and data (the shadow map does not care about textures, for example), in order to generate a frame. This makes it necessary for multiple pipelines to coexist and to be run in sequence before an image can be produced, but more on that later.
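The two-pass idea can be shown with a heavily simplified sketch in Rust. I assume here that every surface point has already been projected into the light's point of view, so that it is described only by the shadow-map texel it falls into and its distance to the light. SurfacePoint, build_shadow_map and is_lit are hypothetical names, and a real shadow map is a 2D depth texture rendered by the GPU rather than a Vec.

```rust
// A surface point as seen from the light: which shadow-map texel it falls
// into, and how far it is from the light.
#[derive(Clone, Copy)]
struct SurfacePoint {
    texel: usize,
    distance_to_light: f32,
}

// Pass 1: for each texel, record the distance of the closest surface.
fn build_shadow_map(points: &[SurfacePoint], size: usize) -> Vec<f32> {
    let mut map = vec![f32::INFINITY; size];
    for p in points {
        if p.distance_to_light < map[p.texel] {
            map[p.texel] = p.distance_to_light;
        }
    }
    map
}

// Pass 2: a point is lit if nothing closer to the light occupies its texel.
// The small bias keeps a surface from shadowing itself through rounding.
fn is_lit(point: SurfacePoint, shadow_map: &[f32]) -> bool {
    const BIAS: f32 = 0.001;
    point.distance_to_light <= shadow_map[point.texel] + BIAS
}

fn main() {
    let occluder = SurfacePoint { texel: 0, distance_to_light: 1.0 };
    let hidden = SurfacePoint { texel: 0, distance_to_light: 4.0 };
    let free = SurfacePoint { texel: 1, distance_to_light: 2.0 };

    // The second pass can only start once the first one has finished.
    let shadow_map = build_shadow_map(&[occluder, hidden, free], 2);
    for p in [occluder, hidden, free] {
        println!("texel {} lit: {}", p.texel, is_lit(p, &shadow_map));
    }
}
```

The dependency is the important part: the second pass reads what the first pass wrote, which is why the two pipelines have to run in sequence.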
An example Shader in detail
struct VertexInput {
    @location(0) position: vec3<f32>,
    @location(1) tex_coords: vec2<f32>,
    @location(2) normal: vec3<f32>,
};
struct InstanceInput {
    @location(5) transform_matrix_0: vec4<f32>,
    @location(6) transform_matrix_1: vec4<f32>,
    @location(7) transform_matrix_2: vec4<f32>,
    @location(8) transform_matrix_3: vec4<f32>,
    @location(9) scale: vec4<f32>,
};
struct VertexOutput {
    @builtin(position) clip_position: vec4<f32>,
    @location(0) tex_coords: vec2<f32>,
    // the normal direction in the world reference frame
    @location(1) world_normal: vec3<f32>,
    // the location of the vertex in the world reference frame
    @location(2) position: vec3<f32>,
};
@vertex
fn vs_main(
    model: VertexInput,
    instance: InstanceInput,
) -> VertexOutput {
    let instance_transform = mat4x4<f32>(
        instance.transform_matrix_0,
        instance.transform_matrix_1,
        instance.transform_matrix_2,
        instance.transform_matrix_3,
    );
    let inverse_scale_matrix = mat4x4<f32>(
        vec4<f32>(1.0/instance.scale.x, 0.0, 0.0, 0.0),
        vec4<f32>(0.0, 1.0/instance.scale.y, 0.0, 0.0),
        vec4<f32>(0.0, 0.0, 1.0/instance.scale.z, 0.0),
        vec4<f32>(0.0, 0.0, 0.0, 1.0),
    );
    var out: VertexOutput;
    out.tex_coords = model.tex_coords;
    
    // translate the 3d vectors for position and normal to homogeneous coordinates
    // also calculate the vectors in the "world coordinate system"
    // this is needed for calculating the lighting in the fragment shader
    out.world_normal = (inverse_scale_matrix * instance_transform * vec4<f32>(model.normal, 0.0)).xyz;
    var world_position: vec4<f32> = instance_transform * vec4<f32>(model.position, 1.0);
    out.position = world_position.xyz;
    // this is the thing that really matters to the clipping and rasterization process
    out.clip_position = observer.view_proj * world_position;
    return out;
}
The above shows the core of a fairly simple vertex shader. It is not complete, so let's go through it in detail.
This shader is written in WGSL, the shader language of WebGPU, and it looks a bit like Rust. The @vertex marker declares that this is a vertex shader. The vertex shader takes a VertexInput and an InstanceInput and produces a VertexOutput. The arguments of the function marked as @vertex are the things that are expected to be different for every invocation of the vertex shader, and are therefore the things that are loaded into the buffers before execution.
There is one thing, however, that shows up in the vertex shader without being declared as either input or output: the observer variable (aka the camera). This is because, for the vertex shader, the camera is declared as a uniform, meaning it does not change from one invocation of the shader to the next inside a single render pass. I think of it as something similar to an environment variable, as the camera will not change between vertices inside a render pass (one complete execution of the pipeline). The data stored in the camera may change between two executions of the same render pipeline, but it is constant throughout all calculations performed as part of a single pipeline execution.
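On the CPU side, the data behind VertexInput and InstanceInput has to be laid out in memory exactly as the @location attributes expect. The following Rust sketch shows one plausible layout, assuming tightly packed f32 fields. The struct and function names are mine, and the snippet does not call wgpu; it only computes the strides and byte offsets that a vertex buffer layout description would need.

```rust
use std::mem::size_of;

// Mirrors `VertexInput`: locations 0, 1 and 2.
#[repr(C)]
#[derive(Clone, Copy)]
struct Vertex {
    position: [f32; 3],   // @location(0)
    tex_coords: [f32; 2], // @location(1)
    normal: [f32; 3],     // @location(2)
}

// Mirrors `InstanceInput`: locations 5 to 9. A matrix cannot be passed as
// a single vertex attribute, so it travels as four vec4 columns.
#[repr(C)]
#[derive(Clone, Copy)]
struct Instance {
    transform: [[f32; 4]; 4], // @location(5) to @location(8)
    scale: [f32; 4],          // @location(9)
}

// Byte offset of each attribute inside one `Vertex`.
fn vertex_attribute_offsets() -> [usize; 3] {
    let position = 0;
    let tex_coords = position + size_of::<[f32; 3]>();
    let normal = tex_coords + size_of::<[f32; 2]>();
    [position, tex_coords, normal]
}

// Byte offset of each attribute inside one `Instance`.
fn instance_attribute_offsets() -> [usize; 5] {
    let column = size_of::<[f32; 4]>();
    [0, column, 2 * column, 3 * column, 4 * column]
}

fn main() {
    // The stride is the distance in bytes from one element to the next.
    println!("vertex stride:   {} bytes", size_of::<Vertex>());
    println!("vertex offsets:  {:?}", vertex_attribute_offsets());
    println!("instance stride: {} bytes", size_of::<Instance>());
    println!("instance offsets: {:?}", instance_attribute_offsets());
}
```

The #[repr(C)] attribute matters here: without it the Rust compiler is free to reorder the fields, and the offsets would no longer be predictable.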
One thing that is interesting here is that in the above (fairly simple) vertex shader there is no branch instruction. This means that exactly the same code is executed for every vertex, which in turn means that the GPU can make very good use of its very wide SIMD instructions, as there is no branch that could force some of the SIMD lanes to sit idle.
To be able to execute a full render pass, we also need at least the fragment shader. For this simple example, we shall use the following fragment shader:
@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
    let object_color: vec4<f32> = textureSample(t_diffuse, s_diffuse, in.tex_coords);
    
    let light_dir = normalize(light.position - in.position);
    let light_distance = length(light.position - in.position);
    let distance_factor = (1.0/(light_distance*light_distance));
    
    let diffuse_strength = 3.0 * max(dot(in.world_normal, light_dir), 0.0) * distance_factor;
    let diffuse_color = light.color * diffuse_strength;
    let ambient_strength = 0.001;
    let ambient_color = light.color * ambient_strength;
    let result = (ambient_color + diffuse_color) * object_color.xyz;
    return vec4<f32>(result, object_color.a);
}
This fragment shader receives the VertexOutput produced by the vertex shader and calculates the color of a single pixel inside a given triangle. We can see that it makes reference to a light struct as well as a t_diffuse and an s_diffuse, all of which are again declared outside the fragment shader function body. t_diffuse and s_diffuse are a texture and the corresponding texture sampler (which together are used to generate the base color of the pixel before lighting).
For a fragment shader, many things are linearly interpolated between the different vertices. This is done by special hardware in the GPU before the fragment shader is executed. For example, the color returned by the sampler is sampled at a point that is linearly interpolated from the texture coordinates of the triangle's vertices.
The observer and the light are uniforms, while t_diffuse and s_diffuse are a texture and a sampler respectively. For the light and the observer a data structure is declared and a binding generated. The sampler and the texture do not need a data type declaration because their types are already built into the language.
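To get a feeling for the numbers that fs_main produces, the lighting formula can be replayed on the CPU. The Rust sketch below follows the same formula as the shader but skips the texture lookup and takes the base color as a parameter instead. The small vector helpers and the shade function are my own names, not part of any library.

```rust
type Vec3 = [f32; 3];

fn sub(a: Vec3, b: Vec3) -> Vec3 {
    [a[0] - b[0], a[1] - b[1], a[2] - b[2]]
}

fn dot(a: Vec3, b: Vec3) -> f32 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

fn length(a: Vec3) -> f32 {
    dot(a, a).sqrt()
}

fn normalize(a: Vec3) -> Vec3 {
    let l = length(a);
    [a[0] / l, a[1] / l, a[2] / l]
}

// Same formula as `fs_main`, with the sampled texture color replaced by
// the `object_color` parameter.
fn shade(
    position: Vec3,
    normal: Vec3,
    light_position: Vec3,
    light_color: Vec3,
    object_color: Vec3,
) -> Vec3 {
    let to_light = sub(light_position, position);
    let light_dir = normalize(to_light);
    let light_distance = length(to_light);
    // Inverse-square falloff with distance.
    let distance_factor = 1.0 / (light_distance * light_distance);

    // Surfaces facing away from the light get no diffuse contribution.
    let diffuse_strength = 3.0 * dot(normal, light_dir).max(0.0) * distance_factor;
    let ambient_strength = 0.001;
    let strength = ambient_strength + diffuse_strength;
    [
        light_color[0] * strength * object_color[0],
        light_color[1] * strength * object_color[1],
        light_color[2] * strength * object_color[2],
    ]
}

fn main() {
    let white = [1.0, 1.0, 1.0];
    let orange = [1.0, 0.5, 0.0];
    // A surface at the origin facing up the z axis, light two units above.
    let lit = shade([0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 2.0], white, orange);
    // The same surface with the light two units below it.
    let unlit = shade([0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, -2.0], white, orange);
    println!("lit:   {:?}", lit);
    println!("unlit: {:?}", unlit);
}
```

With the light behind the surface only the tiny ambient term remains, which is why unlit faces come out almost black with this shader.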
It is again noticeable that there is no branch in the fragment shader either. The last thing left to do is to declare all the things that are used in the shaders as bindings (even though this is normally done above the definition of the shader functions).
struct Light {
    position: vec3<f32>,
    color: vec3<f32>,
}
struct Camera {
    view_proj: mat4x4<f32>,
};
@group(1) @binding(0)
var<uniform> observer: Camera;
@group(2) @binding(0)
var<uniform> light: Light;
@group(0) @binding(0)
var t_diffuse: texture_2d<f32>;
@group(0) @binding(1)
var s_diffuse: sampler;
Now we have both a fragment shader and a vertex shader, as well as all uniform and texture assets, declared. This means that in principle the shaders can be executed, provided that all the data is properly placed onto the GPU and linked into the shaders beforehand. This is the job of the GPU driver, which is given commands via the wgpu API/library. So the fragment shader and vertex shader, together with all the bindings they need, form a complete GPU program. Besides this, we still need to determine where the output of the fragment shader is stored (in this case the frame that we want to render), so there is a little bit more bookkeeping and setup to do before we can let the GPU loose on our vertices.
In the next part we will look at how the data is prepared on the CPU side of things, and how to set everything up so that we get an actual image on the screen.
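One detail of 'properly placing' the data deserves a small preview. In WGSL a vec3<f32> is 12 bytes large but aligned to 16 bytes, so the color field of the Light struct starts at byte 16, not at byte 12. A CPU-side struct has to reproduce that gap by hand. The following Rust sketch shows one way to do it; LightUniform, CameraUniform and the padding field names are my own choices, not something wgpu prescribes.

```rust
use std::mem::size_of;

// CPU-side mirror of the WGSL `Light` struct. The explicit padding fields
// reproduce the 16-byte alignment that WGSL applies to vec3<f32>.
#[repr(C)]
#[derive(Clone, Copy)]
struct LightUniform {
    position: [f32; 3],
    _padding0: f32, // fills bytes 12..16 so that `color` starts at byte 16
    color: [f32; 3],
    _padding1: f32, // rounds the struct size up to a multiple of 16
}

impl LightUniform {
    fn new(position: [f32; 3], color: [f32; 3]) -> Self {
        LightUniform {
            position,
            _padding0: 0.0,
            color,
            _padding1: 0.0,
        }
    }
}

// CPU-side mirror of the WGSL `Camera` struct: a single 4x4 matrix of f32
// needs no extra padding.
#[repr(C)]
#[derive(Clone, Copy)]
struct CameraUniform {
    view_proj: [[f32; 4]; 4],
}

fn main() {
    let light = LightUniform::new([0.0, 5.0, 0.0], [1.0, 1.0, 1.0]);
    println!("light at {:?} with color {:?}", light.position, light.color);
    println!("LightUniform:  {} bytes", size_of::<LightUniform>());
    println!("CameraUniform: {} bytes", size_of::<CameraUniform>());
}
```

If the padding is left out, the struct shrinks to 24 bytes and the shader reads the color from the wrong place, which tends to show up as strange lighting rather than as an error message.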