I've played around with programming 3D graphical applications in Rust recently. This is the guide I wish I had when I started. I want to show both the concepts and the implementation, and writing it down helps me structure my own process.

Some Structure to start with

There are a few moving parts that we need to keep track of when writing a graphical application (this is true for every language). First off, and most importantly, an application that executes code on the GPU needs to deal with two segregated memory regions: one region that is private to the GPU, and of course main memory, which is accessible to the CPU.
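To make this concrete, here is a minimal sketch in Rust using wgpu (the library used later in this guide). The names and sizes are purely illustrative, a Device and Queue are assumed to be set up already, and the bytemuck crate is assumed for byte casting:

// A sketch contrasting the two memory regions (names and sizes are illustrative).
fn upload_triangle(device: &wgpu::Device, queue: &wgpu::Queue) -> wgpu::Buffer {
    // Lives in main memory, fully under CPU control.
    let cpu_vertices: Vec<f32> = vec![0.0; 9]; // one triangle, three xyz positions

    // Allocated by the driver in memory the GPU works on; the CPU cannot
    // simply take a pointer into it.
    let gpu_vertices = device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("triangle positions"),
        size: (cpu_vertices.len() * std::mem::size_of::<f32>()) as u64,
        usage: wgpu::BufferUsages::VERTEX | wgpu::BufferUsages::COPY_DST,
        mapped_at_creation: false,
    });

    // Crossing the boundary between the two regions is always an explicit copy.
    queue.write_buffer(&gpu_vertices, 0, bytemuck::cast_slice(&cpu_vertices));
    gpu_vertices
}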

In the same way that we have segregated memory, there are now multiple compute units. There is the CPU that we are programming directly, and then there are "shaders", which are the programs that are executed on the GPU. For historical reasons, shaders are loaded by our main application at runtime, compiled by the compiler embedded in the GPU driver, and transferred to the GPU just before they are executed. The program flow (including loading the shader and preparing the data that the GPU performs its computations on) is controlled by the CPU. The GPU acts as a high-performance "extra" computer, similar to a control computer dispatching work to the "cloud" or to a more powerful server machine.
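With wgpu the load-and-compile step looks roughly like the following sketch. It assumes a recent wgpu version; the file path is illustrative and error handling is kept minimal on purpose:

// Load the WGSL source at runtime and hand it to wgpu for compilation.
fn load_shader(device: &wgpu::Device) -> wgpu::ShaderModule {
    // The shader source is read from disk when the application starts ...
    let source = std::fs::read_to_string("shaders/render.wgsl")
        .expect("failed to read shader source");
    // ... and handed to the driver stack, which compiles it for the actual GPU.
    device.create_shader_module(wgpu::ShaderModuleDescriptor {
        label: Some("render shader"),
        source: wgpu::ShaderSource::Wgsl(source.into()),
    })
}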

The structure of the programs that perform well on a GPU makes it unsuited for the bookkeeping tasks that are necessary during the setup of a GPU run. A GPU is essentially a highly parallel batch processor. For a video game, at least one batch computation is performed per frame. The structure of this batch operation is dictated by a pipeline and is typically divided into multiple stages. There are different types of pipelines, the most common being the render pipeline. A pipeline holds the information that tells the GPU what parts of its memory need to be linked to what parts of the shader code, while also keeping track of the settings used for the various (in some cases optional) fixed-function stages.
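In wgpu such a pipeline is described by a descriptor struct. The sketch below is not the exact setup used later; it assumes wgpu around version 0.19 (field names shift a bit between releases) and an illustrative vertex layout of position, texture coordinates and normal:

// A rough sketch of a render pipeline description (wgpu ~0.19, illustrative layout).
fn build_pipeline(
    device: &wgpu::Device,
    shader: &wgpu::ShaderModule,
    surface_format: wgpu::TextureFormat,
) -> wgpu::RenderPipeline {
    // Where in the buffer each vertex attribute lives and which @location it maps to.
    let vertex_attributes =
        wgpu::vertex_attr_array![0 => Float32x3, 1 => Float32x2, 2 => Float32x3];
    let vertex_layout = wgpu::VertexBufferLayout {
        array_stride: (8 * std::mem::size_of::<f32>()) as u64,
        step_mode: wgpu::VertexStepMode::Vertex,
        attributes: &vertex_attributes,
    };

    device.create_render_pipeline(&wgpu::RenderPipelineDescriptor {
        label: Some("render pipeline"),
        // `None` lets wgpu derive the bind group layouts from the shader itself.
        layout: None,
        vertex: wgpu::VertexState {
            module: shader,
            entry_point: "vs_main",
            buffers: &[vertex_layout],
        },
        fragment: Some(wgpu::FragmentState {
            module: shader,
            entry_point: "fs_main",
            targets: &[Some(wgpu::ColorTargetState {
                format: surface_format,
                blend: Some(wgpu::BlendState::REPLACE),
                write_mask: wgpu::ColorWrites::ALL,
            })],
        }),
        // Settings for the fixed-function stages.
        primitive: wgpu::PrimitiveState::default(),
        depth_stencil: None,
        multisample: wgpu::MultisampleState::default(),
        multiview: None,
    })
}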

In some situations, a GPU will execute multiple different pipelines to generate a single image. The process for executing any sort of pipeline is as follows:

  • Prepare the data and shader code for processing on the GPU
  • Send the data from main memory to GPU memory
  • Execute the pipeline on the GPU
  • Read back the computation result from the GPU (or have it render the result as the output image)
  • Repeat

Sometimes the data will already be on the GPU, in which case step 2 can be skipped.
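Expressed with wgpu, one iteration of this loop looks roughly like the following sketch (again assuming wgpu ~0.19; the pipeline, vertex buffer and target texture view are created during setup and only referenced here):

// One frame of the prepare / upload / execute / output cycle.
fn render_frame(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    frame_view: &wgpu::TextureView,
    pipeline: &wgpu::RenderPipeline,
    vertex_buffer: &wgpu::Buffer,
    vertex_count: u32,
    new_vertex_data: &[u8],
) {
    // Step 2: copy this frame's data from main memory into GPU memory.
    queue.write_buffer(vertex_buffer, 0, new_vertex_data);

    // Step 3: record and submit the batch that executes the pipeline.
    let mut encoder =
        device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
    {
        let mut pass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
            label: Some("main pass"),
            color_attachments: &[Some(wgpu::RenderPassColorAttachment {
                // Step 4: instead of reading the result back, render it into the
                // texture that will be shown on screen.
                view: frame_view,
                resolve_target: None,
                ops: wgpu::Operations {
                    load: wgpu::LoadOp::Clear(wgpu::Color::BLACK),
                    store: wgpu::StoreOp::Store,
                },
            })],
            depth_stencil_attachment: None,
            timestamp_writes: None,
            occlusion_query_set: None,
        });
        pass.set_pipeline(pipeline);
        // (any bind groups for uniforms and textures would be set here as well)
        pass.set_vertex_buffer(0, vertex_buffer.slice(..));
        pass.draw(0..vertex_count, 0..1);
    }
    queue.submit(std::iter::once(encoder.finish()));
    // Step 5: the surrounding application loop calls this function again for the next frame.
}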

Data on a GPU

For the next part we are assuming that the GPU is used to render a frame of a 3D scene, meaning that we want to display the view into a 3D world that we 'see' with the help of the GPU and a lot of math. In this context, data on a GPU comes in two forms, uniforms and buffers. To explain this in a bit more detail, we need to take a look at how a 3D render is generated (from a mathematical standpoint) and how a GPU achieves its high computational throughput.

Any 3D scene consists of a set of objects. These objects must be illuminated and the resulting light pattern captured by our camera. Objects are represented by their surface geometry, which must be described in such a way that it can be represented as a set of many triangles in 3D space. This is what our rendering engine cares about: everything needs to be broken down into triangles, which are then processed by the GPU to produce the final image on screen. The idea here is that every triangle can be processed independently of the others, which means that a multi-core machine can work on many triangles simultaneously. This assumption is the basis of the GPU's performance advantage over the CPU for graphics applications; rendering an image of a 3D scene simply happens to require the same calculations to be performed on a massive number of triangles. Buffers store the input data that is different for every execution of a shader; in our example this would be the list of triangles.
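On the Rust side such per-vertex data is usually a plain struct that is byte-copied into a buffer. A small sketch, assuming the bytemuck crate and a vertex layout like the one used in the shader example further down:

use bytemuck::{Pod, Zeroable};
use wgpu::util::DeviceExt; // for create_buffer_init

// One vertex as it is laid out in the buffer (a sketch; the exact attributes
// depend on what the shaders expect).
#[repr(C)]
#[derive(Clone, Copy, Pod, Zeroable)]
struct Vertex {
    position: [f32; 3],
    tex_coords: [f32; 2],
    normal: [f32; 3],
}

// Pack a list of triangle vertices into a GPU buffer. Three consecutive vertices
// form one triangle; the buffer supplies the data that differs between shader executions.
fn make_vertex_buffer(device: &wgpu::Device, vertices: &[Vertex]) -> wgpu::Buffer {
    device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
        label: Some("scene triangles"),
        contents: bytemuck::cast_slice(vertices),
        usage: wgpu::BufferUsages::VERTEX,
    })
}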

The task of the GPU is to take this set of triangle coordinates, along with some other data, and compose a colorful image out of it. To do this, a lot more information besides the triangles is needed. Objects in the 'real world' have a surface color and also a particular behaviour or 'texture' that influences how light reflects and scatters off of them. This sort of information is encoded in textures, which are a sort of uniform. Textures are stored on the GPU. When different triangles are processed, they might refer to the same texture inside the shader. This makes the texture part of the environment of the triangle-processing pipeline; as such, it is uniform over the triangles. A typical game will have hundreds of textures, with different triangles possibly referring to the same or to different textures.
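Getting a texture into GPU memory follows the same pattern as the buffer upload above. A sketch, assuming wgpu ~0.19 and already-decoded RGBA8 pixel data (width * height * 4 bytes):

// Create a texture on the GPU and copy the pixel data into it.
fn upload_texture(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    rgba_pixels: &[u8],
    width: u32,
    height: u32,
) -> wgpu::Texture {
    let size = wgpu::Extent3d { width, height, depth_or_array_layers: 1 };
    let texture = device.create_texture(&wgpu::TextureDescriptor {
        label: Some("diffuse texture"),
        size,
        mip_level_count: 1,
        sample_count: 1,
        dimension: wgpu::TextureDimension::D2,
        format: wgpu::TextureFormat::Rgba8UnormSrgb,
        usage: wgpu::TextureUsages::TEXTURE_BINDING | wgpu::TextureUsages::COPY_DST,
        view_formats: &[],
    });
    queue.write_texture(
        // where on the GPU to copy to
        wgpu::ImageCopyTexture {
            texture: &texture,
            mip_level: 0,
            origin: wgpu::Origin3d::ZERO,
            aspect: wgpu::TextureAspect::All,
        },
        // the pixel data and how it is laid out in main memory
        rgba_pixels,
        wgpu::ImageDataLayout {
            offset: 0,
            bytes_per_row: Some(4 * width),
            rows_per_image: Some(height),
        },
        size,
    );
    texture
}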

Programs on the GPU

To turn a bunch of textures and an even larger bunch of triangles into a fancy computer-generated image, quite a bit of calculation needs to be performed. In the very early days of GPUs these calculations were hard-wired on the GPU and could not really be altered. To have the GPU produce a usable image, the data needed to be prepared carefully so that the GPU produced the intended result. Now, with ever more transistors being put onto a single chip, GPUs are 'programmable'. This means that instead of hard-wired stages there are many processors placed beside each other on a single die, sharing large amounts of memory with each other. These processors are in no way as fast as their CPU cousins, but are instead optimised for workloads where a lot of numbers need to be processed in a very similar way. This means that the processors on a GPU have very wide SIMD instructions and possibly many sets of internal registers to allow for rapid task switching. In contrast to a CPU, however, they run at noticeably lower clock rates. The memory architecture is also optimized for fetching data sequentially instead of full random access, limiting the speed of random accesses in favour of being able to load one triangle after another very quickly (which is a nice example of making the common case fast).

As described before, a GPU is somewhat of a batch computer, and each batch is split into multiple stages. In the case of rendering an image there may be a plethora of stages, but in the minimal case a programmer has to care about two stages: the vertex stage and the fragment stage. In the vertex stage, the on-screen position of every corner of every triangle is determined. In the fragment stage, the color of every pixel inside the visible triangle area is determined. In between, some fancy hard-wired GPU circuits determine which pixels need to be run through the fragment stage for any given triangle. At the end the pixels are merged together, most often using a z-buffer algorithm to determine the final color of each pixel on the screen.

As GPU programmers we get to write (at least) two shaders (that's the name for a GPU program): the vertex shader and the fragment shader. Assuming sensible geometry, the job of the vertex shader is to determine the position of a given triangle vertex on the screen. For this, linear transformations are used to first place a given object at the correct location inside a scene, and then to transform the scene in such a way that we get something that resembles the scene as viewed from the point of a camera. This means a lot of matrix multiplications on quite a lot of triangle vertices (for large scenes this easily goes into the millions of vertices). The calculations, however, are essentially always the same (meaning we don't branch in our program a lot). This is why the GPU can outperform the CPU here: it can process many hundreds of vertices simultaneously, while even with some assembly magic a multi-threaded program running on a CPU might be able to process on the order of 20 vertices simultaneously.
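Written out, the chain of transformations applied to each vertex position (in homogeneous coordinates) typically looks something like the following, where M places the object in the world, V moves the world in front of the camera and P applies the perspective projection. The exact naming and number of matrices varies between engines; in the shader shown later, P and V are pre-multiplied on the CPU into a single view_proj matrix and M is the per-instance transform:

$$ v_{\text{clip}} = P \cdot V \cdot M \cdot v_{\text{model}} $$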

The fragment shader's job is to determine the color of a pixel in the visible part of a triangle. In preparation for this calculation, special-purpose hardware might sample textures to determine the base color of the pixel, given ideal lighting conditions. This color is then mixed with reflections and other lighting effects (like the color and brightness of possibly many different light sources) to determine the color of the pixel, assuming this particular triangle is visible. This part is where shaders get their name from, as they calculate the 'shade' of any given pixel. Assuming we have only a single triangle, the fragment shader would be executed once for every pixel in that triangle. If we have multiple triangles, we'd execute the fragment shader for every pixel in each of those triangles too, even where they overlap with the first one. This means that the fragment shader will be executed very often.

More than one pass

Sometimes it is necessary to calculate things before an image of the scene can be rendered. For example, a 'shadow map' may be needed to determine whether a particular part of any given triangle is in shadow or is illuminated by a light source. This means that as the light sources or the objects in the scene move, shadow maps need to be regenerated for every frame. So we now need to run two different pipelines, with different shaders and data (the shadow map does not care about textures, for example), in order to generate a frame. This makes it necessary for multiple pipelines to coexist and to be run in sequence before an image can be produced, but more on that later.
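To give a rough idea of what that looks like in practice, here is a sketch of two pipelines recorded back to back into one command submission (wgpu ~0.19; pipelines and texture views are assumed to be created during setup, and the actual draw calls are elided):

// Two pipelines in one frame: first fill the shadow map, then render the image.
fn render_with_shadows(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    shadow_map_view: &wgpu::TextureView,
    frame_view: &wgpu::TextureView,
    shadow_pipeline: &wgpu::RenderPipeline,
    main_pipeline: &wgpu::RenderPipeline,
) {
    let mut encoder =
        device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
    {
        // Pipeline 1: render depth from the light's point of view, no color output.
        let mut shadow_pass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
            label: Some("shadow pass"),
            color_attachments: &[],
            depth_stencil_attachment: Some(wgpu::RenderPassDepthStencilAttachment {
                view: shadow_map_view,
                depth_ops: Some(wgpu::Operations {
                    load: wgpu::LoadOp::Clear(1.0),
                    store: wgpu::StoreOp::Store,
                }),
                stencil_ops: None,
            }),
            timestamp_writes: None,
            occlusion_query_set: None,
        });
        shadow_pass.set_pipeline(shadow_pipeline);
        // ... draw the scene geometry with the shadow shaders ...
    }
    {
        // Pipeline 2: render the visible image; its shaders sample the shadow map.
        let mut main_pass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
            label: Some("main pass"),
            color_attachments: &[Some(wgpu::RenderPassColorAttachment {
                view: frame_view,
                resolve_target: None,
                ops: wgpu::Operations {
                    load: wgpu::LoadOp::Clear(wgpu::Color::BLACK),
                    store: wgpu::StoreOp::Store,
                },
            })],
            depth_stencil_attachment: None,
            timestamp_writes: None,
            occlusion_query_set: None,
        });
        main_pass.set_pipeline(main_pipeline);
        // ... bind the shadow map and draw the scene geometry again ...
    }
    // Both passes are part of the same submission and run in order.
    queue.submit(std::iter::once(encoder.finish()));
}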


An example Shader in detail

struct VertexInput {
    @location(0) position: vec3<f32>,
    @location(1) tex_coords: vec2<f32>,
    @location(2) normal: vec3<f32>,
};

struct InstanceInput {
    @location(5) transform_matrix_0: vec4<f32>,
    @location(6) transform_matrix_1: vec4<f32>,
    @location(7) transform_matrix_2: vec4<f32>,
    @location(8) transform_matrix_3: vec4<f32>,
    @location(9) scale: vec4<f32>,
};

struct VertexOutput {
    @builtin(position) clip_position: vec4<f32>,
    @location(0) tex_coords: vec2<f32>,
    // the normal direction in the world reference frame
    @location(1) world_normal: vec3<f32>,
    // the location of the vertex in the world reference frame
    @location(2) position: vec3<f32>,
};


@vertex
fn vs_main(
    model: VertexInput,
    instance: InstanceInput,
) -> VertexOutput {
    let instance_transform = mat4x4<f32>(
        instance.transform_matrix_0,
        instance.transform_matrix_1,
        instance.transform_matrix_2,
        instance.transform_matrix_3,
    );
    let inverse_scale_matrix = mat4x4<f32>(
        vec4<f32>(1.0/instance.scale.x, 0.0, 0.0, 0.0),
        vec4<f32>(0.0, 1.0/instance.scale.y, 0.0, 0.0),
        vec4<f32>(0.0, 0.0, 1.0/instance.scale.z, 0.0),
        vec4<f32>(0.0, 0.0, 0.0, 1.0),
    );
    var out: VertexOutput;
    out.tex_coords = model.tex_coords;
    
    // translate the 3d vectors for position and normal to homogeneous coordinates
    // also calculate the vectors in the "world coordinate system"
    // this is needed for calculating the lighting in the fragment shader
    out.world_normal = (inverse_scale_matrix * instance_transform * vec4<f32>(model.normal, 0.0)).xyz;
    var world_position: vec4<f32> = instance_transform * vec4<f32>(model.position, 1.0);
    out.position = world_position.xyz;

    // this is the thing that really matters to the clipping and rasterization process
    out.clip_position = observer.view_proj * world_position;
    return out;
}

The above shows the core of a fairly simple vertex shader. It's not complete (the uniform declarations follow further down), and we will go through it in detail here. This shader is written in WGSL, the shader language used by WebGPU, which looks a bit like Rust. The @vertex marker declares that this is a vertex shader; it takes a VertexInput and an InstanceInput and produces a VertexOutput. The arguments of the function marked as @vertex are the things that are expected to be different for every invocation of the vertex shader, and therefore are the things that are loaded into the buffers before execution.

There is one thing, however, that shows up in the vertex shader that is not declared as either input or output, and that is the observer variable (aka the camera). This is because, for the vertex shader, the camera is declared as a uniform, meaning it does not change from one invocation of the shader (inside a single render pass) to the next. I think of it as something similar to an environment variable, as the camera will not change between vertices inside a render pass (a complete execution of the pipeline). The data stored in the camera may change between two invocations of the same render pipeline, but is constant throughout all calculations performed as part of a single pipeline execution.
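On the CPU side this usually corresponds to a small struct that is copied into a uniform buffer once per frame, before the pass is recorded. A sketch (the struct and buffer names are hypothetical; bytemuck is assumed for the byte cast):

use bytemuck::{Pod, Zeroable};

// CPU-side mirror of the `Camera` uniform struct in the shader.
// The field layout must match the WGSL declaration.
#[repr(C)]
#[derive(Clone, Copy, Pod, Zeroable)]
struct CameraUniform {
    view_proj: [[f32; 4]; 4],
}

// Once per frame, before the render pass is recorded, the new camera matrix is
// copied into the uniform buffer that backs the `observer` binding.
fn update_camera(queue: &wgpu::Queue, camera_buffer: &wgpu::Buffer, view_proj: [[f32; 4]; 4]) {
    let uniform = CameraUniform { view_proj };
    queue.write_buffer(camera_buffer, 0, bytemuck::cast_slice(&[uniform]));
}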

One interesting thing here is that in the above (fairly simple) vertex shader there is no branch instruction. This means that exactly the same code is executed for every vertex, which in turn means that the GPU can make very good use of its very wide SIMD instructions, as there is no branch that could invalidate individual lanes of the SIMD operations.

To be able to execute a full render pass, we also need at least the fragment shader. For this simple example, we shall use the following fragment shader:

@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
    // base color of the surface at this pixel, sampled from the diffuse texture
    let object_color: vec4<f32> = textureSample(t_diffuse, s_diffuse, in.tex_coords);

    // direction and distance from this point on the surface to the light source
    let light_dir = normalize(light.position - in.position);
    let light_distance = length(light.position - in.position);
    // inverse-square falloff of the light intensity with distance
    let distance_factor = 1.0 / (light_distance * light_distance);

    // diffuse (Lambertian) term: strongest where the surface faces the light
    let diffuse_strength = 3.0 * max(dot(in.world_normal, light_dir), 0.0) * distance_factor;
    let diffuse_color = light.color * diffuse_strength;

    // small constant ambient term so unlit areas are not completely black
    let ambient_strength = 0.001;
    let ambient_color = light.color * ambient_strength;

    // combine the lighting with the surface color; keep the texture's alpha value
    let result = (ambient_color + diffuse_color) * object_color.xyz;
    return vec4<f32>(result, object_color.a);
}

This fragment shader receives the VertexOutput produced by the vertex shader and calculates the color of a single pixel inside a given triangle. We can see that it makes reference to a light struct as well as to t_diffuse and s_diffuse, all of which are again uniforms that are declared outside the fragment shader's function body. The s_diffuse and t_diffuse are a texture sampler and the corresponding texture (which together are used to generate the base color of the pixel before lighting). For a fragment shader, many values are linearly interpolated between the different vertices. This is done by special hardware in the GPU before the fragment shader is executed. So, for example, the texture coordinates handed to the sampler are linearly interpolated from the texture coordinates of the triangle's vertices.

The observer and the light are uniforms, while t_diffuse and s_diffuse are a texture and a sampler respectively. For the light and the observer a data structure is declared and a binding is generated. The sampler and the texture do not need a data type declaration because these types are already built into the language.

It is again noticeable that there is no branch in the fragment shader either. The last thing left to do is to declare all the things that are present in the shaders as bindings (even though this is normally done above the definition of the shader functions).

struct Light {
    position: vec3<f32>,
    color: vec3<f32>,
}
struct Camera {
    view_proj: mat4x4<f32>,
};
@group(1) @binding(0)
var<uniform> observer: Camera;
@group(2) @binding(0)
var<uniform> light: Light;

@group(0) @binding(0)
var t_diffuse: texture_2d<f32>;
@group(0)@binding(1)
var s_diffuse: sampler;

Now we have both a fragment shader and a vertex shader, as well as all uniform and texture assets declared. In principle, the shaders can now be executed, provided all the data is properly placed onto the GPU and linked into the shader before the shader is executed. This is the job of the GPU driver, which is given commands via the wgpu API/library. So the set of fragment shader and vertex shader, together with all their needed bindings, is a complete GPU program. Besides this, we still need to determine where the output of the fragment shader is stored (in this case the frame that we want to render), so there is a little bit more bookkeeping and setup to do before we can let the GPU loose on our vertices.
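As a small taste of that bookkeeping, here is a sketch of the CPU-side buffer and bind group that could back the light uniform at @group(2) @binding(0). The names are hypothetical, wgpu ~0.19 is assumed, and the matching bind group layout is created during pipeline setup:

use wgpu::util::DeviceExt;

// Create the uniform buffer holding the light data and link it to binding 0.
fn make_light_bind_group(
    device: &wgpu::Device,
    light_layout: &wgpu::BindGroupLayout,
    position: [f32; 3],
    color: [f32; 3],
) -> (wgpu::Buffer, wgpu::BindGroup) {
    // WGSL aligns vec3<f32> to 16 bytes, so each vec3 gets a padding float.
    let raw: [f32; 8] = [
        position[0], position[1], position[2], 0.0,
        color[0], color[1], color[2], 0.0,
    ];
    let buffer = device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
        label: Some("light uniform"),
        contents: bytemuck::cast_slice(&raw),
        usage: wgpu::BufferUsages::UNIFORM | wgpu::BufferUsages::COPY_DST,
    });
    let bind_group = device.create_bind_group(&wgpu::BindGroupDescriptor {
        label: Some("light bind group"),
        layout: light_layout,
        entries: &[wgpu::BindGroupEntry {
            binding: 0,
            resource: buffer.as_entire_binding(),
        }],
    });
    (buffer, bind_group)
}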

In the next step we will look at how the data is prepared on the CPU side of things, and we will see how to set everything up so that we get an actual image on the screen.