Blog Entries posted by Josh

  1. Josh
    Until now, all my experiments with voxel cone step tracing placed the center of the GI data at the world origin (0,0,0). In reality, we want the GI volume to follow the camera around so we can see the effect everywhere, with more detail up close. I feel my productivity has not been very good lately, but I am not being too hard on myself because this is very difficult stuff. The two-stage nature of it (rendering the voxel data and then using that data to render an effect) makes development very difficult. The intermediate voxel data consists of several LODs and is difficult to visualize. My work schedule lately has been to do nothing for several days, or just putter around with peripheral tasks, and then suddenly solve major problems in a short two-hour work session.
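    This is not the engine's actual code, but here is a minimal sketch of the general idea behind a camera-following volume: the volume's origin tracks the camera but is snapped to whole-voxel increments, so sub-voxel camera movement does not invalidate the voxel data. The type and member names are hypothetical.

// Hypothetical sketch: keep a GI volume centered on the camera, but snap its origin
// to whole-voxel increments so sub-voxel camera motion does not invalidate the data.
#include <cmath>

struct Vec3f { float x, y, z; };

struct GIVolume
{
    Vec3f origin{};     // current snapped center of the volume
    float voxelsize;    // size of one voxel in world units
    bool dirty = true;  // set when the voxel data must be rebuilt

    void Recenter(const Vec3f& camerapos)
    {
        // Snap the desired center to the voxel grid
        Vec3f snapped;
        snapped.x = std::round(camerapos.x / voxelsize) * voxelsize;
        snapped.y = std::round(camerapos.y / voxelsize) * voxelsize;
        snapped.z = std::round(camerapos.z / voxelsize) * voxelsize;

        // Only mark the volume for rebuilding if the snapped center actually changed
        if (snapped.x != origin.x || snapped.y != origin.y || snapped.z != origin.z)
        {
            origin = snapped;
            dirty = true;
        }
    }
};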
    Here you can see a single GI stage following the camera around properly. More will be added to increase the area the effect covers, and the edges of the final stage will fade out the effect for a smooth transition:

    My new video project1.mp4

    This all makes me wonder what "work" is when you are handling extremely difficult technical problems. I have no problem working 8+ hours a day on intermediate programming tasks, but when it comes to these really advanced problems I can't be "on" all day. This morning I went for a seven-mile walk. Was I subconsciously working during that time, so that I could later sit down and quickly solve a problem I had been completely stuck on?
    I definitely underestimated the difficulty of making this work as a robust engine feature that can be used reliably. There are a lot of nuances and small issues that come up when you start testing in a variety of scenes, and this information could easily fill an hour-long talk about the details of voxel cone step tracing. However, there is just one more step: making the moving volumes work with multiple GI stages. Once that is working I can proceed with more testing, look for artifacts to eliminate, and optimize speed.
    This is the last big feature I have to finish. It seems fitting that I should get one final big challenge before completing Ultra Engine, and I am enjoying it.
  2. Josh

    Before proceeding with multiple GI volumes, I decided to focus on just getting the lighting to look as close to perfect as possible, with a single stage.
    Injecting the ambient light into the voxel data made flat-lit areas appear much more "3D", with color bleeding and subtle contours everywhere.
    Lighting only:

    Lighting + albedo

    Some adjustments to the way the sky color is sampled gave a more lifelike appearance to outdoor lighting.
    Before:

    After. Notice the shaded area still has a lot of variation:

    Initial performance testing gives results consistent with my expectations. I'm running at half of 1920x1080 resolution, on a GeForce 1660 Ti, and performance is about a third of what it would be without GI. At full 1920x1080 resolution, that drops to 90 FPS. Because the effect is so resource-intensive, I plan to render it at half resolution, then upscale it and use an edge detection filter to fill in information for any pixels that need it. This card has only 1536 stream processors, about half as many as a 2080.

    Further experiments with motion did not resolve the artifacts I was experiencing earlier, and in fact caused new ones because of the flickering introduced by the GPU voxelization. You can read a detailed discussion of these issues on the Gamedev.net forum. My conclusion for now is that moving objects should not be baked into the voxelized data, because they cause a lot of flashing and flickering artifacts. These could be added in the future by storing a separate voxel grid for each dynamic object, along with some kind of data structure the shader can use to quickly find the objects a ray can pass through.
    This is great though, because it means voxelization only has to be updated when the camera moves a certain distance, or if a static object is created or deleted. You still have completely dynamic direct lighting, and the GI system just follows you around and generates indirect lighting on the fly. I could run the update in the background and then show a smooth transition between updates, and all the flickering problems would go away. Performance should be very good once I have further optimized the system. And every surface in your game can show reflections everywhere. Moving lights work really, really well, as you have seen.
    The end is in sight and I am very pleased with how this system is turning out. My goal was to create a completely dynamic system that provides better 3D reflections than cubemaps, requires no manual placement or baking of probes, and is fast enough to use on older mid-range discrete GPUs, and that is what we got.
  3. Josh
    I've begun working on an implementation of voxel cone tracing for global illumination. This technique could potentially offer a way to perform real-time indirect lighting on the entire scene, as well as real-time reflections that don't depend on having the reflected surface onscreen, as screen-space reflection does.
    I plan to perform the GI calculations all on a background CPU thread, compress the resulting textures using DXTC, and upload them to the GPU as they are completed. This means the cost of GI should be quite low, although there is going to be some latency in the time it takes for the indirect lighting to match changes to the scene. We might continue to use SSR for detailed reflections and only use GI for semi-static light bounces, or it might be fast enough for moving real-time reflections. The GPU-based implementations I have seen of this technique are technically impressive but suffer from terrible performance, and we want something fast enough to run in VR.
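    As a rough sketch of that plan (with hypothetical names, and placeholder functions standing in for the real voxelization, lighting, and DXTC compression work), the threading pattern could look something like this:

// Hypothetical sketch of the plan above: a worker thread rebuilds and compresses the GI
// volume, and the render thread uploads the newest result whenever one is ready.
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

class GIWorker
{
public:
    void Start()
    {
        worker = std::thread([this]
        {
            while (running)
            {
                auto lighting = CalculateLighting();   // slow CPU-side GI work goes here
                auto dxt = CompressToDXTC(lighting);   // block compression goes here
                {
                    std::lock_guard<std::mutex> lock(resultmutex);
                    compressed = std::move(dxt);
                }
                resultready = true;
            }
        });
    }

    void Stop() { running = false; if (worker.joinable()) worker.join(); }

    // Called from the render thread once per frame; uploads only finished results.
    void UploadIfReady()
    {
        if (!resultready) return;
        std::lock_guard<std::mutex> lock(resultmutex);
        // UploadVolumeTexture(compressed); // GPU upload of the compressed volume
        resultready = false;
    }

private:
    // Placeholders standing in for the real voxel lighting and DXTC compression:
    std::vector<unsigned char> CalculateLighting() { return {}; }
    std::vector<unsigned char> CompressToDXTC(const std::vector<unsigned char>& data) { return data; }

    std::vector<unsigned char> compressed;
    std::mutex resultmutex;
    std::atomic<bool> resultready{false};
    std::atomic<bool> running{true};
    std::thread worker;
};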
    The first step is to be able to voxelize models. The result of the voxelization operation is a bunch of points. These can be fed into a geometry shader that generates a box around each one:
void main()
{
    vec4 points[8];
    points[0] = projectioncameramatrix[0] * (geometry_position[0] + vec4(-0.5f * voxelsize.x, -0.5f * voxelsize.y, -0.5f * voxelsize.z, 0.0f));
    points[1] = projectioncameramatrix[0] * (geometry_position[0] + vec4(0.5f * voxelsize.x, -0.5f * voxelsize.y, -0.5f * voxelsize.z, 0.0f));
    points[2] = projectioncameramatrix[0] * (geometry_position[0] + vec4(0.5f * voxelsize.x, 0.5f * voxelsize.y, -0.5f * voxelsize.z, 0.0f));
    points[3] = projectioncameramatrix[0] * (geometry_position[0] + vec4(-0.5f * voxelsize.x, 0.5f * voxelsize.y, -0.5f * voxelsize.z, 0.0f));
    points[4] = projectioncameramatrix[0] * (geometry_position[0] + vec4(-0.5f * voxelsize.x, -0.5f * voxelsize.y, 0.5f * voxelsize.z, 0.0f));
    points[5] = projectioncameramatrix[0] * (geometry_position[0] + vec4(0.5f * voxelsize.x, -0.5f * voxelsize.y, 0.5f * voxelsize.z, 0.0f));
    points[6] = projectioncameramatrix[0] * (geometry_position[0] + vec4(0.5f * voxelsize.x, 0.5f * voxelsize.y, 0.5f * voxelsize.z, 0.0f));
    points[7] = projectioncameramatrix[0] * (geometry_position[0] + vec4(-0.5f * voxelsize.x, 0.5f * voxelsize.y, 0.5f * voxelsize.z, 0.0f));

    vec3 normals[6];
    normals[0] = (vec3(-1,0,0));
    normals[1] = (vec3(1,0,0));
    normals[2] = (vec3(0,-1,0));
    normals[3] = (vec3(0,1,0));
    normals[4] = (vec3(0,0,-1));
    normals[5] = (vec3(0,0,1));

    //Left
    geometry_normal = normals[0];
    gl_Position = points[0]; EmitVertex();
    gl_Position = points[4]; EmitVertex();
    gl_Position = points[3]; EmitVertex();
    gl_Position = points[7]; EmitVertex();
    EndPrimitive();

    //Right
    geometry_normal = normals[1];
    gl_Position = points[1]; EmitVertex();
    gl_Position = points[2]; EmitVertex();
    ...
}

    Here's a goblin whose polygons have been turned into Lego blocks.

    Now the thing most folks nowadays don't realize is that if you can voxelize a goblin, well then you can voxelize darn near anything.

    Global illumination will then be calculated on the voxels and fed to the GPU as a 3D texture. It's pretty complicated stuff but I am very excited to be working on this right now.

    If this works, then I think environment probes are going to completely go away forever. SSR might continue to be used as a low-latency high-resolution first choice when those pixels are available onscreen. We will see.
    It is also interesting that the whole second-pass reflective water technique will probably go away as well, since this technique should be able to handle water reflections just like any other material.
  4. Josh
    At this point I have successfully created a sparse octree class and can insert voxelized meshes into it. An octree is a way of subdividing space into eight blocks at each level of the tree:

    A sparse octree doesn't create the subnodes until they are used. For voxel data, this can save a lot of memory.
    It was difficult to get the rounding and all the math completely perfect (and it has to be completely perfect!) but now I have a nice voxel tree that can follow the camera around and is aligned correctly to the world axis and units. The code that inserts a voxel is pretty interesting: A voxel tree is created with a number of levels, and the size of the structure is equal to pow(2,levels). For example, an octree with 8 levels creates a 3D grid of 256x256x256 voxels. Individual voxels are then inserted to the top-level tree node, which recursively calls the SetSolid() function until the last level is reached. A voxel is marked as "solid" simply by having a voxel node at the last level (0). (GetChild() has the effect of finding the specified child and creating it if it doesn't exist yet.)
    A bitwise flag is used to test which subnode should be called at this level. I didn't really work out the math, I just intuitively went with this solution and it worked as I expected:
void VoxelTree::SetSolid(const int x, const int y, const int z, const bool solid)
{
    int flag = pow(2, level);
    if (x < 0 or y < 0 or z < 0) return;
    if (x >= flag * 2 or y >= flag * 2 or z >= flag * 2) return;
    flag = pow(2, level - 1);
    int cx = 0;
    int cy = 0;
    int cz = 0;
    if ((flag & x) != 0) cx = 1;
    if ((flag & y) != 0) cy = 1;
    if ((flag & z) != 0) cz = 1;
    if (solid)
    {
        if (level > 0)
        {
            GetChild(cx, cy, cz)->SetSolid(x & ~flag, y & ~flag, z & ~flag, true);
        }
    }
    else
    {
        if (level > 0)
        {
            if (kids[cx][cy][cz] != nullptr)
            {
                kids[cx][cy][cz]->SetSolid(x & ~flag, y & ~flag, z & ~flag, false);
            }
        }
        else
        {
            //Remove self
            auto parent = this->parent.lock();
            Assert(parent->kids[position.x][position.y][position.z] == Self());
            parent->kids[position.x][position.y][position.z] = nullptr;
        }
    }
}

    The voxel tree is built by adding all scene entities into the tree. From there it was easy to implement a simple raycast to see if anything was above each voxel, and color it black if another voxel is hit:

    And here is the same program using a higher resolution voxel tree. You can see it's not much of a stretch to implement ambient occlusion from here:

    At a voxel size of 0.01 meters (the first picture) the voxelization step took 19 milliseconds, so it looks like we're doing good on speed. I suspect the rest of this project will be more art than science. Stay tuned!
  5. Josh
    I added a raycast function to the voxel tree class and now I can perform raycasts between any two positions. This is perfect for calculating direct lighting. Shadows are calculated by performing a raycast between the voxel position and the light position, as shown in the screenshot below. Fortunately the algorithm seems to work great and there are no gaps or cracks in the shadow:

    Here is the same scene using a voxel size of 10 centimeters:

    If we move the light a little lower, you can see a shadow appearing near two edges of the floor:

    Why is this happening? Well, the problem is that at those angles, the raycast is hitting the neighboring voxel on the floor next to the voxel we are testing:

    You might think that if we just move one end of the ray up to the top of the voxel it will work fine, and you'd be right, in this situation.

    But with slightly different geometry, we have a new problem.

    So how do we solve this? At any given time, a voxel can have up to three faces that face the light (but it might have as few as one). In the image below I have highlighted the two voxel faces on the right-most voxel that face the light:

    If we check the neighboring voxels we can see that the voxel to the left is occupied, and therefore the left face does not make a good position to test from:

    But the top voxel is clear, so we will test from there:

    If we apply the same logic to the other geometry configuration I showed, we also get a correct result. Of course, if both neighboring voxels were solid then we would not need to perform a raycast at all because we know the light would be completely blocked at this position.
    The code to do this just checks which side of a voxel the light position is on. As it is written now, up to three raycasts may be performed per voxel:
if (lightpos.x < voxel->bounds.min.x)
{
    if (GetSolid(ix - 1, iy, iz) == false)
    {
        result = IntersectsRay(p0 - Vec3(voxelsize * 0.5f, 0.0f, 0.0f), lightpos);
    }
}
if (lightpos.x > voxel->bounds.max.x and result == false)
{
    if (GetSolid(ix + 1, iy, iz) == false)
    {
        result = IntersectsRay(p0 + Vec3(voxelsize * 0.5f, 0.0f, 0.0f), lightpos);
    }
}
if (lightpos.y < voxel->bounds.min.y and result == false)
{
    if (GetSolid(ix, iy - 1, iz) == false)
    {
        result = IntersectsRay(p0 - Vec3(0.0f, voxelsize * 0.5f, 0.0f), lightpos);
    }
}
if (lightpos.y > voxel->bounds.max.y and result == false)
{
    if (GetSolid(ix, iy + 1, iz) == false)
    {
        result = IntersectsRay(p0 + Vec3(0.0f, voxelsize * 0.5f, 0.0f), lightpos);
    }
}
if (lightpos.z < voxel->bounds.min.z and result == false)
{
    if (GetSolid(ix, iy, iz - 1) == false)
    {
        result = IntersectsRay(p0 - Vec3(0.0f, 0.0f, voxelsize * 0.5f), lightpos);
    }
}
if (lightpos.z > voxel->bounds.max.z and result == false)
{
    if (GetSolid(ix, iy, iz + 1) == false)
    {
        result = IntersectsRay(p0 + Vec3(0.0f, 0.0f, voxelsize * 0.5f), lightpos);
    }
}

    With this correction the artifact disappears:

    It even works correctly at a lower resolution:

    Now our voxel raycast algorithm is complete. The next step will be to calculate direct lighting on the voxelized scene using the lights that are present.
  6. Josh
    Now that we can voxelize models, enter them into a scene voxel tree structure, and perform raycasts we can finally start calculating direct lighting. I implemented support for directional and point lights, and I will come back and add spotlights later. Here we see a shadow cast from a single directional light:

    And here are two point lights, one red and one green. Notice the distance falloff creates a color gradient across the floor:

    The idea here is to first calculate direct lighting using raycasts between the light position and each voxel:

    Then once you have the direct lighting, you can calculate approximate global illumination by gathering a cone of samples for each voxel, which illuminates voxels not directly visible to the light source:

    And if we repeat this process we can simulate a second bounce, which really fills in all the hidden surfaces:

    When we convert model geometry to voxels, one of the important pieces of information we lose is the surface normal. Without normals it is difficult to calculate damping for the direct illumination calculation. It is easy to check surrounding voxels and determine that a voxel is embedded in a floor or something, but what do we do in the situation below?

    The thin wall of three voxels is illuminated, which will leak light into the enclosed room. This is not good:

    My solution is to calculate and store lighting for each face of each voxel.
Vec3 normal[6] = {
    Vec3(-1, 0, 0),
    Vec3(1, 0, 0),
    Vec3(0, -1, 0),
    Vec3(0, 1, 0),
    Vec3(0, 0, -1),
    Vec3(0, 0, 1)
};

for (int i = 0; i < 6; ++i)
{
    float damping = max(0.0f, normal[i].Dot(lightdir)); //normal damping
    if (!isdirlight) damping *= 1.0f - min(p0.DistanceToPoint(lightpos) / light->range[1], 1.0f); //distance damping
    voxel->directlighting[i] += light->color[0] * damping;
}

    This gives us lighting that looks more like the diagram below:

    When light samples are read, the appropriate face will be chosen and read from. In the final scene lighting on the GPU, I expect to be able to use the triangle normal to determine how much influence each sample should have. I think it will look something like this in the shader:
vec4 lighting = vec4(0.0f);
lighting += max(0.0f, dot(trinormal, vec3(-1.0f, 0.0f, 0.0f))) * texture(gimap, texcoord + vec2(0.0 / texwidth, 0.0));
lighting += max(0.0f, dot(trinormal, vec3(1.0f, 0.0f, 0.0f))) * texture(gimap, texcoord + vec2(1.0 / texwidth, 0.0));
lighting += max(0.0f, dot(trinormal, vec3(0.0f, -1.0f, 0.0f))) * texture(gimap, texcoord + vec2(2.0 / texwidth, 0.0));
lighting += max(0.0f, dot(trinormal, vec3(0.0f, 1.0f, 0.0f))) * texture(gimap, texcoord + vec2(3.0 / texwidth, 0.0));
lighting += max(0.0f, dot(trinormal, vec3(0.0f, 0.0f, -1.0f))) * texture(gimap, texcoord + vec2(4.0 / texwidth, 0.0));
lighting += max(0.0f, dot(trinormal, vec3(0.0f, 0.0f, 1.0f))) * texture(gimap, texcoord + vec2(5.0 / texwidth, 0.0));

    This means that to store a 256 x 256 x 256 grid of voxels we actually need a 3D RGB texture with dimensions of 256 x 256 x 1536. This is 288 megabytes. However, with DXT1 compression I estimate that number will drop to about 64 megabytes, meaning we could have eight voxel maps cascading out around the player and still only use about 512 megabytes of video memory. This is where those new 16-core CPUs will really come in handy!
    I added the lighting calculation for the normal Vec3(0,1,0) into the visual representation of our voxels and lowered the resolution. Although this is still just direct lighting it is starting to look interesting:

    The last step is to downsample the direct lighting to create what is basically a mipmap. We do this by taking the average values of each voxel node's children:
void VoxelTree::BuildMipmaps()
{
    if (level == 0) return;
    int contribs[6] = { 0 };
    for (int i = 0; i < 6; ++i)
    {
        directlighting[i] = Vec4(0);
    }
    for (int ix = 0; ix < 2; ++ix)
    {
        for (int iy = 0; iy < 2; ++iy)
        {
            for (int iz = 0; iz < 2; ++iz)
            {
                if (kids[ix][iy][iz] != nullptr)
                {
                    kids[ix][iy][iz]->BuildMipmaps();
                    for (int n = 0; n < 6; ++n)
                    {
                        directlighting[n] += kids[ix][iy][iz]->directlighting[n];
                        contribs[n]++;
                    }
                }
            }
        }
    }
    for (int i = 0; i < 6; ++i)
    {
        if (contribs[i] > 0) directlighting[i] /= float(contribs[i]);
    }
}

    If we start with direct lighting that looks like the image below:

    When we downsample it one level, the result will look something like this (not exactly, but you get the idea):

    Next we will begin experimenting with light bounces and global illumination using a technique called cone tracing.
  7. Josh
    I was having trouble with cone tracing and decided to first try a basic GI algorithm based on a pattern of raycasts. Here is the result:

    You can see this is pretty noisy, even with 25 raycasts per voxel. Cone tracing uses an average sample, which eliminates the noise problem, but it does introduce more inaccuracy into the lighting.
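    For reference, here is a minimal sketch of what a single cone trace against the downsampled voxel lighting might look like on the CPU. The Vec3/Vec4 types and the VoxelTree::SampleLighting lookup are assumptions on my part; the key idea is that the sampling footprint and step size grow with distance, so farther samples come from coarser mip levels and the result is an average rather than a noisy bundle of rays.

#include <algorithm> // std::max
#include <cmath>     // std::log2

// Hypothetical sketch: trace one cone through the mipmapped voxel lighting.
// Vec3, Vec4, and VoxelTree::SampleLighting() are assumed engine types, not a real API.
Vec4 ConeTrace(const VoxelTree& tree, const Vec3& origin, const Vec3& dir,
               float coneratio, float maxdistance, float voxelsize)
{
    Vec4 accum(0.0f, 0.0f, 0.0f, 0.0f); // xyz = gathered light, w = accumulated occlusion
    float dist = voxelsize;             // start one voxel out to avoid self-sampling
    while (dist < maxdistance && accum.w < 1.0f)
    {
        float radius = coneratio * dist; // cone footprint grows with distance
        float miplevel = std::log2(std::max(1.0f, radius / voxelsize));
        Vec4 s = tree.SampleLighting(origin + dir * dist, miplevel); // averaged voxel sample
        float weight = (1.0f - accum.w) * s.w; // front-to-back blending
        accum.x += weight * s.x;
        accum.y += weight * s.y;
        accum.z += weight * s.z;
        accum.w += weight;
        dist += radius; // step size also grows with the cone
    }
    return accum;
}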
    Next I wanted to try a more complex scene and get an estimate of performance. You may recognize the voxelized scene below as the "Sponza" scene frequently used in radiosity testing:

    Direct lighting takes 368 milliseconds to calculate, with a voxel size of 0.25 meters. If I cut the voxel grid down to 64x64x64 then lighting takes just 75 milliseconds.
    These speeds are good enough for soft GI that gradually adjusts as lighting changes, but I am not sure if this will be sufficient for our purposes. I'd like to do real-time screen-independent reflections.
    I thought about it, and I thought about it some more, and then when I was done with that I kept thinking about it. Here's the design I came up with:

    The final output is a 3D texture containing light data for all six possible directions.  (So a 256x256x256 grid of voxels would actually be 1536x256x256 RGB, equal to 288 megabytes.) The lit voxel array would also be six times as big. When a pixel is rendered, three texture lookups are performed on the 3D texture and multiplied by the normal of the pixel. If the voxel is empty, there is no GI information for that volume, so maybe a higher mipmap level could be used (if mipmaps are generated in the last step). The important thing is we only store the full-resolution voxel grid once.
    The downsampled voxel grid uses an alpha channel for coverage. For example, a voxel with 0.75 alpha would have six out of eight solid child voxels.
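    A minimal sketch of that coverage calculation, reusing the kids array from the octree code shown earlier (the function name is hypothetical):

// Hypothetical sketch: the alpha of a downsampled voxel is just the fraction of its
// eight children that are solid, so six solid children gives 6/8 = 0.75 coverage.
float ComputeCoverage(VoxelTree* node)
{
    int solidcount = 0;
    for (int x = 0; x < 2; ++x)
        for (int y = 0; y < 2; ++y)
            for (int z = 0; z < 2; ++z)
                if (node->kids[x][y][z] != nullptr) solidcount++;
    return float(solidcount) / 8.0f;
}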
    I do think voxelization is best performed on the CPU due to flexibility and the ability to cache static objects.
    Direct lighting, in this case, would be calculated from shadowmaps. So I have to implement the clustered forward renderer before going forward with this.
  8. Josh
    We left off on voxels when I realized the direct lighting needed to be performed on the GPU. So I had to go and implement a new clustered forward renderer before I could do anything else. Well, I did that and now I finally have voxel lighting calculation being performed with the same code that renders lighting. This gives us the data we need to perform cone step tracing for real-time dynamic global illumination.

    The shadows you see here are calculated using the scene shadowmaps, not by raycasting other voxels:

    I created a GPU timer to find out how much time the lighting took to process. On the CPU, a similar scene took 368 milliseconds to calculate direct lighting. On the GPU, on integrated graphics (so I guess it was still the CPU!), this scene took 11.61064 milliseconds to process. With a discrete GPU this difference would increase a lot. So that's great, and we're now at the third step in the diagram below:

    Next we will move this data into a downsampled cube texture and start performing the cone step tracing that gives us fast real-time GI.
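    As an aside, the GPU timer mentioned above is not shown here, but the general pattern with OpenGL timestamp queries looks roughly like this (a sketch assuming a GL-style backend; Vulkan offers the equivalent through vkCmdWriteTimestamp):

#include <GL/glew.h> // or whatever GL loader the engine uses

// Rough sketch of timing a block of GPU work with OpenGL timestamp queries.
double TimeLightingPassMilliseconds()
{
    GLuint queries[2];
    glGenQueries(2, queries);

    glQueryCounter(queries[0], GL_TIMESTAMP);   // timestamp before the lighting pass
    // ... issue the voxel lighting draw calls here ...
    glQueryCounter(queries[1], GL_TIMESTAMP);   // timestamp after the lighting pass

    // Reading the result forces a wait until the GPU has reached the second timestamp.
    GLuint64 start = 0, stop = 0;
    glGetQueryObjectui64v(queries[0], GL_QUERY_RESULT, &start);
    glGetQueryObjectui64v(queries[1], GL_QUERY_RESULT, &stop);

    glDeleteQueries(2, queries);
    return double(stop - start) / 1000000.0;    // nanoseconds to milliseconds
}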
  9. Josh
    The polygon voxelization process for our voxel GI system now takes vertex, material, and base texture colors into account. The voxel algorithm does not yet support a second color channel for emission, but I am building the whole system with that in mind. When I visualize the results of the voxel building the images are pretty remarkable! Of course the goal is to use this data for fast global illumination calculations but maybe they could be used to make a whole new style of game graphics.

    Direct lighting calculations on the CPU are fast enough that I am going to stick with this approach until I have to use the GPU. If several cascading voxel grids were created around the camera, and each updated asynchronously on its own thread, that might give us the speed we need to relieve the GPU from doing any extra work. The final volume textures could be compressed to DXT1 (12.5% their original size) and sent to the GPU.
    After direct lighting has been calculated, the next step is to downsample the voxel grid. I found the fastest way to do this is to iterate through just the solid voxels. This is how my previous algorithm worked:
for (x = 0; x < size / 2; ++x)
{
    for (y = 0; y < size / 2; ++y)
    {
        for (z = 0; z < size / 2; ++z)
        {
            //Downsample this 2x2 block
        }
    }
}

    A new, faster approach works by "downsampling" the set of solid voxels by dividing each value by two. There are some duplicated values, but that's fine:

for (const iVec3& i : solidvoxels)
{
    downsampledgrid->solidvoxels.insert(iVec3(i.x / 2, i.y / 2, i.z / 2));
}

for (const iVec3& i : downsampledgrid->solidvoxels)
{
    //Downsample this 2x2 block
}

    We can then iterate through just the solid voxels when performing the downsampling. A single call to memset will set all the voxel data to black / empty before the downsampling begins. This turns out to be much, much faster than iterating through every voxel on all three axes.
    Here are the results of the downsampling process. What you don't see here is the alpha value of each voxel. The goblin in the center ends up bleeding out to fill very large voxels, because the rest of the volume around him is empty space, but the alpha value of those voxels will be adjusted to give them less influence in the GI calculation.




    For a 128x128x128 voxel grid, with voxel size of 0.125 meters, my numbers are now:
    Voxelization: 607 milliseconds
    Direct lighting (all six directions): 109
    First downsample (to 64x64): 39
    Second downsample (to 32x32): 7
    Third downsample (to 16x16): 1
    Total: 763

    Note that voxelization, by far the slowest step here, does not have to be performed completely on all geometry each update. The elapsed time for direct lighting is within a tolerable range, so we are in the running to make GI calculations entirely on the CPU, relieving the GPU of extra work and compressing our data before it is sent over the PCI bridge.
    Also note that smaller voxel grids could be used, with more voxel grids spread across more CPU cores. If that were the case I would expect our processing time for each one to go down to 191 milliseconds total (39 milliseconds without the voxelization step), and the distance your GI covers would then be determined by your number of CPU cores.
    In fact there is a variety of ways this task could be divided between several CPU cores.
  10. Josh
    I've moved the GI calculation over to the GPU and our Vulkan renderer in Leadwerks Game Engine 5 beta now supports volume textures. After a lot of trial and error I believe I am closing in on our final techniques. Voxel GI always involves a degree of light leakage, but this can be mitigated by setting a range for the ambient GI. I also implemented a hard reflection which was pretty easy to do. It would not be much more difficult to store the triangles in a lookup table for each voxel in order to trace a finer polygon-based ray for results that would look the same as Nvidia RTX but perform much faster.
    The video below is only performing a single GI bounce at this time, and it is displaying the lighting on the scene voxels, not on the original polygons. I am pretty pleased with this progress and I think the final results will look great and run fast. In addition, the need for environment probes placed in the scene will soon forever be a thing of the past.

    VoxelGI_raytracingprogress.mp4

    There is still a lot of work to do on this, but I would say that this feature just went from something I was very overwhelmed and intimidated by to something that is definitely under control and feasible.
    Also, we might not use cascaded shadow maps (for directional lights) at all but instead rely on a voxel raytrace for directional light shadows. If it works, that would be my preference because CSMs waste so much space and drawing a big outdoors scene 3-4 times can be taxing.
  11. Josh
    I implemented light bounces and can now run the GI routine as many times as I want. When I use 25 rays per voxel and run the GI routine three times, here is the result. (The dark area in the middle of the floor is actually correct. That area should be lit by the sky color, but I have not yet implemented that, so it appears darker.)


    It's sort of working but obviously these results aren't usable yet. Making matters more difficult is the fact that people love to show their best screenshots and love to hide the problems their code has, so it is hard to find something reliable to compare my results to.
    I also found that the GI pass, unlike all previous passes, is very slow. Each pass takes about 30 seconds in release mode! I could try to optimize the C++ code, but something tells me that even optimized C++ code would not be fast enough. So it seems the GI passes will probably need to be performed in a shader. First, though, I am going to experiment a bit with some ideas I have for providing better-quality GI results.
     
  12. Josh
    After three days of intense work, I am proud to show you this amazing screenshot:

    What is so special about this image? I am now successfully uploading voxel data to the GPU and writing lighting into another texture, using a texture buffer object to store the voxel positions as unsigned char uvec3s. The gray color is the ambient light term coming from the Blinn-Phong shading used in the GI direct light calculation. The next step is to create a light grid for the clustered forward renderer so that each light can be added to the calculation. Since voxel grids are cubic, I think I can just use the orthographic projection method to split lights up into different cells. In fact, the GI direct light shader actually includes the same lighting shader file that all the model shaders use. Once I have that done, that will be the direct lighting step, and then I can move on to calculating a bounce with cone step tracing.
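    For anyone curious, a texture buffer setup along those lines might look roughly like the sketch below. Note that core OpenGL has no three-component 8-bit buffer texture format, so this sketch pads each position to four bytes and uses GL_RGBA8UI; that padding is my assumption, not necessarily what the engine does.

// Hypothetical sketch: upload solid voxel positions through a texture buffer object.
// Each position is padded to 4 bytes and exposed to the shader as a usamplerBuffer
// via GL_RGBA8UI, to be read with texelFetch().
#include <vector>

struct VoxelPosition { unsigned char x, y, z, pad; };

void UploadVoxelPositions(const std::vector<VoxelPosition>& positions, GLuint& tbo, GLuint& tbotexture)
{
    glGenBuffers(1, &tbo);
    glBindBuffer(GL_TEXTURE_BUFFER, tbo);
    glBufferData(GL_TEXTURE_BUFFER, positions.size() * sizeof(VoxelPosition),
                 positions.data(), GL_DYNAMIC_DRAW);

    glGenTextures(1, &tbotexture);
    glBindTexture(GL_TEXTURE_BUFFER, tbotexture);
    glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA8UI, tbo);
}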
    Clustered forward rendering, real-time global illumination, and physically-based rendering are all going to come together really nicely, but this is definitely one of the hardest features I have ever worked on!
    Here are a few wacky screenshots from the last few days.
    Why are half my voxels missing?!

    Why is only one texture layer being written to?!

    Ah, finally rendering to six texture layers simultaneously...

  13. Josh
    So far the new Voxel ray tracing system I am working out is producing amazing results. I expect the end result will look like Minecraft RTX, but without the enormous performance penalty of RTX ray tracing.
    I spent the last several days getting the voxel update speed fast enough to handle dynamic reflections, but the more I dig into this the more complicated it becomes. Things like a door sliding open are fine, but small objects moving quickly can be a problem. The worst case scenario is when the player is carrying an object in front of them. In the video below, the update speed is fast, but the limited resolution of the voxel grid makes the reflections flash quite a lot. This is due to the reflection of the barrel itself. The gun does not contribute to the voxel data, and it looks perfectly fine as it moves around the scene, aside from the choppy reflection of the barrel in motion.
    The voxel resolution in the above video is set to about 6 centimeters. I don't see increasing the resolution as an option that will go very far. I think what is needed is a separation of dynamic and static objects. A sparse voxel octree will hold all static objects. This needs to be precompiled and it cannot change, but it will handle a large amount of geometry with low memory usage. For dynamic objects, I think a per-object voxel grid should be used. The voxel grid will move with the object, so reflections of moving objects will update instantaneously, eliminating the problem we see above.
    We are close to having a very good 1.0 version of this system, and I may wrap this up soon, with the current limitations. You can disable GI reflections on a per-object basis, which is what I would recommend doing with dynamic objects like the barrels above. The GI and reflections are still dynamic and will adjust to changes in the environment, like doors opening and closing, elevators moving, and lights moving and turning on and off. (If those barrels above weren't moving, showing their reflections would be absolutely no problem, as I have demonstrated in previous videos.)
    In general, I think ray tracing is going to be a feature you can take advantage of to make your games look incredible, but it is something you have to tune. The whole "Hey Josh I created this one weird situation just to cause problems and now I expect you to account for this scenario AAA developers would purposefully avoid" approach will not work with ray tracing. At least not in the 1.0 release. You're going to want to avoid the bad situations that can arise, but they are pretty easy to prevent. Perhaps I can combine screen-space reflections with voxels for reflections of dynamic objects before the first release.
    If you are smart about it, I expect your games will look like this:
    I had some luck with real-time compression of the voxel data into BC3 (DXT5) format. It adds some delay to the updating, but if we are not trying to show moving reflections much then that might be a good tradeoff. Having only 25% of the data being sent to the GPU each frame is good for performance.
    Another change I am going to make is a system that triggers voxel refreshes, instead of constantly updating no matter what. If you sit still and nothing is moving, then the voxel data won't get recalculated and processed, which will make performance even faster. This makes sense if we expect most of the data not to change each frame.
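    A minimal sketch of such a refresh trigger, with hypothetical names and an assumed Vec3 type (the engine's actual heuristic may differ):

// Hypothetical sketch: only rebuild the voxel data when the camera has moved far
// enough from the last update position, or when something in the scene has changed.
bool GINeedsRefresh(const Vec3& camerapos, Vec3& lastupdatepos,
                    bool scenechanged, float refreshdistance)
{
    float dx = camerapos.x - lastupdatepos.x;
    float dy = camerapos.y - lastupdatepos.y;
    float dz = camerapos.z - lastupdatepos.z;
    float distsquared = dx * dx + dy * dy + dz * dz;

    if (scenechanged || distsquared > refreshdistance * refreshdistance)
    {
        lastupdatepos = camerapos;
        return true;    // kick off a new voxelization / lighting update
    }
    return false;       // nothing moved, reuse the existing voxel data
}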
    I haven't run any performance benchmarks yet, but from what I am seeing I think the performance penalty for using this system will be basically zero, even on integrated graphics. Considering what a dramatic upgrade in visuals this provides, that is very impressive.
    In the future, I think I will be able to account for motion in voxel ray tracing, as well as high-definition polygon raytracing for sharp reflections, but it's not worth delaying the release of the engine. Hopefully in this article I have shown there are many factors, and many approaches we can use to optimize for different aspects of the effect. For the 1.0 release of our new engine, I think we want to emphasize performance above all else.
  14. Josh
    Light is made up of individual particles called photons. A photon is a discrete quantum of electromagnetic energy. Photons are special because they have properties of both a particle and a wave. Photons have no rest mass, but they carry momentum and can interact with physical matter. The phenomenon of "solar pressure" is caused by photons bombarding a surface and exerting force. (This force actually has to be accounted for in orbital mechanics.) However, light also has a wavelength and frequency, similar to sound and other wave phenomena.
    Things are made visible when millions of photons scatter around the environment and eventually go right into your eyes, interacting with photoreceptor cells on the back surface of your interior eyeball (the retina). A "grid" of receptors connect into the optic nerve, which travels into your brain to the rear of your skull, where an image is constructed from the stimulus, allowing you to see the world around you.
    The point of that explanation is to demonstrate that lighting is a scatter problem. Rendering, on the other hand, is a gather problem. We don't care about what happens to every photon emitted from a light source, we only care about the final lighting on the screen pixels we can see. Physically-based rendering is a set of techniques and equations that attempt to model lighting as a gather problem, which is more efficient for real-time rendering. The technique allows us to model some behaviors of lighting without calculating the path of every photon.
    One important behavior we want to model is the Fresnel effect, where surfaces become more reflective at glancing angles. If you have ever been driving on a desert highway and seen a mirage on the road in the distance, you have experienced this. The road in the image below is perfectly dry but appears to be submerged in water.

    What's going on here? Remember when I explained that every bit of light you see is shot directly into your eyeballs? At a glancing angle, the light that is most likely to hit your eyes is the light bouncing off the surface from the opposite direction. Since more light is coming from one single direction, instead of being scattered from all around, a reflection becomes visible.
    PBR models this behavior using a BRDF image (Bidirectional reflectance distribution function). These are red/green images that act as a look-up table, given the angle between the camera-to-surface vector and the incoming light vector. They look something like this:

    You can have different BRDFs for leather, plastic, and all different types of materials. These cannot be calculated, but must be measured with a photometer from real-world materials. It's actually incredibly hard to find any collection of this data measured from real-world materials. I was only able to find one lab in Germany that was able to create these. There are also some online databases that provide the data as text tables. I have not tried converting any of these into images.
    Now with PBR lighting, the surrounding environment plays a more important role in the light calculation than it does with Blinn-Phong lighting. Therefore, PBR is only as good as the lighting environment data you have. For simple demos it's fine to use a skybox for this data, but that approach won't work for anything more complicated than a single model onscreen. In Leadwerks 4 we used environment probes, which create a small "skybox" for separate areas in the scene. These have two drawbacks. First, they are still 2D projections of the surrounding environment and do not provide accurate 3D reflections. Second, they are tedious to set up, so most of the screenshots you see in Leadwerks are not using them.

    Voxel ray tracing overcomes these problems. The 3D voxel structure provides better reflections with depth and volume, and it's dynamically constructed from the scene geometry around the camera, so there is no need to manually create anything.

    I finally got the voxel reflection data integrated into the PBR equation so that the BRDF is being used for the reflectance. In the screenshot below you can see the column facing the camera appears dull.

    When we view the same surface at a glancing angle, it becomes much more reflective, almost mirror-like, as it would in real life:

    You can observe this effect with any building that has a smooth concrete exterior. Also note the scene above has no ambient light. The shaded areas would be pure black if it wasn't for the global illumination effect. These details will give your environments a realistic lifelike look in our new engine.
  15. Josh

    The VK_KHR_dynamic_rendering extension has made its way into Vulkan 1.2.203 and I have implemented this in Ultra Engine. What does it do?
    Instead of creating renderpass objects ahead of time, dynamic rendering allows you to just specify the settings you need as you are filling in command buffers with rendering instructions. From the Khronos working group:
    In my experience, post-processing effects are where this hurt the most. The engine has a user-defined stack of post-processing effects, so there are many configurations possible. You had to store and cache a lot of renderpass objects for all possible combinations of settings. It's not impossible, but it made things very complicated. Basically, you have to know every little detail of how the renderpass object is going to be used in advance. I had several different functions like the code below for initializing renderpasses that were meant to be used at various points in the rendering routine.
    bool RenderPass::InitializePostProcess(shared_ptr<GPUDevice> device, const VkFormat depthformat, const int colorComponents, const bool lastpass) { this->clearmode = clearmode; VkFormat colorformat = __FramebufferColorFormat; this->colorcomponents = colorComponents; if (depthformat != 0) this->depthcomponent = true; this->device = device; std::array< VkSubpassDependency, 2> dependencies; dependencies[0] = {}; dependencies[0].srcSubpass = VK_SUBPASS_EXTERNAL; dependencies[0].dstSubpass = 0; dependencies[0].srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT; dependencies[0].srcAccessMask = 0; dependencies[0].dstStageMask = VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT; dependencies[0].dstAccessMask = VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT | VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT; dependencies[1] = {}; dependencies[1].srcSubpass = VK_SUBPASS_EXTERNAL; dependencies[1].dstSubpass = 0; dependencies[1].srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT; dependencies[1].srcAccessMask = 0; dependencies[1].dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT; dependencies[1].dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT; renderPassInfo = {}; renderPassInfo.sType = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO; renderPassInfo.attachmentCount = colorComponents; renderPassInfo.dependencyCount = colorComponents; if (depthformat == VK_FORMAT_UNDEFINED) { dependencies[0] = dependencies[1]; } else { renderPassInfo.attachmentCount++; renderPassInfo.dependencyCount++; } renderPassInfo.pDependencies = dependencies.data(); colorAttachment[0] = {}; colorAttachment[0].format = colorformat; colorAttachment[0].samples = VK_SAMPLE_COUNT_1_BIT; colorAttachment[0].initialLayout = VK_IMAGE_LAYOUT_UNDEFINED; colorAttachment[0].loadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE; colorAttachment[0].storeOp = VK_ATTACHMENT_STORE_OP_STORE; colorAttachment[0].stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE; colorAttachment[0].stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE; colorAttachment[0].finalLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL; if (lastpass) colorAttachment[0].finalLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR; VkAttachmentReference colorAttachmentRef = {}; colorAttachmentRef.attachment = 0; colorAttachmentRef.layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL; depthAttachment = {}; VkAttachmentReference depthAttachmentRef = {}; if (depthformat != VK_FORMAT_UNDEFINED) { colorAttachmentRef.attachment = 1; depthAttachment.format = depthformat; depthAttachment.samples = VK_SAMPLE_COUNT_1_BIT; depthAttachment.loadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE; depthAttachment.initialLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL;// VK_IMAGE_LAYOUT_UNDEFINED; depthAttachment.storeOp = VK_ATTACHMENT_STORE_OP_STORE; depthAttachment.stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE; depthAttachment.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE; depthAttachment.finalLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL; depthAttachmentRef.attachment = 0; depthAttachmentRef.layout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL; } colorAttachment[0].initialLayout = VK_IMAGE_LAYOUT_UNDEFINED; depthAttachment.initialLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL;// VK_IMAGE_LAYOUT_UNDEFINED; subpasses.push_back( {} ); subpasses[0].pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS; subpasses[0].colorAttachmentCount = colorComponents; subpasses[0].pColorAttachments = &colorAttachmentRef; subpasses[0].pDepthStencilAttachment = 
NULL; if (depthformat != VK_FORMAT_UNDEFINED) subpasses[0].pDepthStencilAttachment = &depthAttachmentRef; VkAttachmentDescription attachments[2] = { colorAttachment[0], depthAttachment }; renderPassInfo.subpassCount = subpasses.size(); renderPassInfo.pAttachments = attachments; renderPassInfo.pSubpasses = subpasses.data(); VkAssert(vkCreateRenderPass(device->device, &renderPassInfo, nullptr, &pass)); return true; } This gives you an idea of just how many render passes I had to create in advance:
    // Initialize Render Passes shadowpass[0] = make_shared<RenderPass>(); shadowpass[0]->Initialize(dynamic_pointer_cast<GPUDevice>(Self()), { VK_FORMAT_UNDEFINED }, depthformat, 0, true);//, CLEAR_DEPTH, -1); shadowpass[1] = make_shared<RenderPass>(); shadowpass[1]->Initialize(dynamic_pointer_cast<GPUDevice>(Self()), { VK_FORMAT_UNDEFINED }, depthformat, 0, true, true, true, 0); if (MULTIPASS_CUBEMAP) { cubeshadowpass[0] = make_shared<RenderPass>(); cubeshadowpass[0]->Initialize(dynamic_pointer_cast<GPUDevice>(Self()), { VK_FORMAT_UNDEFINED }, depthformat, 0, true, true, true, CLEAR_DEPTH, 6); cubeshadowpass[1] = make_shared<RenderPass>(); cubeshadowpass[1]->Initialize(dynamic_pointer_cast<GPUDevice>(Self()), { VK_FORMAT_UNDEFINED }, depthformat, 0, true, true, true, 0, 6); } //shaderStages[0] = TEMPSHADER->shaderStages[0]; //shaderStages[4] = TEMPSHADER->shaderStages[4]; posteffectspass = make_shared<RenderPass>(); posteffectspass->InitializePostProcess(dynamic_pointer_cast<GPUDevice>(Self()), VK_FORMAT_UNDEFINED, 1, false); raytracingpass = make_shared<RenderPass>(); raytracingpass->InitializeRaytrace(dynamic_pointer_cast<GPUDevice>(Self())); lastposteffectspass = make_shared<RenderPass>(); lastposteffectspass->InitializeLastPostProcess(dynamic_pointer_cast<GPUDevice>(Self()), depthformat, 1, false); lastcameralastposteffectspass = make_shared<RenderPass>(); lastcameralastposteffectspass->InitializeLastPostProcess(dynamic_pointer_cast<GPUDevice>(Self()), depthformat, 1, true); { std::vector<VkFormat> colorformats = { __FramebufferColorFormat ,__FramebufferColorFormat, VK_FORMAT_R8G8B8A8_SNORM, VK_FORMAT_R32_SFLOAT }; for (int earlyZPass = 0; earlyZPass < 2; ++earlyZPass) { for (int clearflags = 0; clearflags < 4; ++clearflags) { renderpass[clearflags][earlyZPass] = make_shared<RenderPass>(); renderpass[clearflags][earlyZPass]->Initialize(dynamic_pointer_cast<GPUDevice>(Self()), { VK_FORMAT_UNDEFINED }, depthformat, 1, false, false, false, clearflags, 1, earlyZPass); renderpassRGBA16[clearflags][earlyZPass] = make_shared<RenderPass>(); renderpassRGBA16[clearflags][earlyZPass]->Initialize(dynamic_pointer_cast<GPUDevice>(Self()), colorformats, depthformat, 4, false, false, false, clearflags, 1, earlyZPass); firstrenderpass[clearflags][earlyZPass] = make_shared<RenderPass>(); firstrenderpass[clearflags][earlyZPass]->Initialize(dynamic_pointer_cast<GPUDevice>(Self()), { VK_FORMAT_UNDEFINED }, depthformat, 1, false, true, false, clearflags, 1, earlyZPass); lastrenderpass[clearflags][earlyZPass] = make_shared<RenderPass>(); lastrenderpass[clearflags][earlyZPass]->Initialize(dynamic_pointer_cast<GPUDevice>(Self()), { VK_FORMAT_UNDEFINED }, depthformat, 1, false, false, true, clearflags, 1, earlyZPass); //for (int d = 0; d < 2; ++d) { for (int n = 0; n < 5; ++n) { if (n == 2 or n == 3) continue; rendertotexturepass[clearflags][n][earlyZPass] = make_shared<RenderPass>(); rendertotexturepass[clearflags][n][earlyZPass]->Initialize(dynamic_pointer_cast<GPUDevice>(Self()), colorformats, depthformat, n, true, false, false, clearflags, 1, earlyZPass); firstrendertotexturepass[clearflags][n][earlyZPass] = make_shared<RenderPass>(); firstrendertotexturepass[clearflags][n][earlyZPass]->Initialize(dynamic_pointer_cast<GPUDevice>(Self()), colorformats, depthformat, n, true, true, false, clearflags, 1, earlyZPass); // lastrendertotexturepass[clearflags][n] = make_shared<RenderPass>(); // lastrendertotexturepass[clearflags][n]->Initialize(dynamic_pointer_cast<GPUDevice>(Self()), depthformat, n, true, false, 
true, clearflags); } } } } } With dynamic rendering, you still have to fill in most of the same information, but you can just do it based on whatever the current state of things is, instead of looking for an object that hopefully matches the exact settings you want:
VkRenderingInfoKHR renderinfo = {};
renderinfo.sType = VK_STRUCTURE_TYPE_RENDERING_INFO_KHR;
renderinfo.renderArea = scissor;
renderinfo.layerCount = 1;
renderinfo.viewMask = 0;
renderinfo.colorAttachmentCount = 1;
targetbuffer->colorAttachmentInfo[0].imageLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
targetbuffer->colorAttachmentInfo[0].clearValue.color.float32[0] = 0.0f;
targetbuffer->colorAttachmentInfo[0].clearValue.color.float32[1] = 0.0f;
targetbuffer->colorAttachmentInfo[0].clearValue.color.float32[2] = 0.0f;
targetbuffer->colorAttachmentInfo[0].clearValue.color.float32[3] = 0.0f;
targetbuffer->colorAttachmentInfo[0].imageView = targetbuffer->imageviews[0];
renderinfo.pColorAttachments = targetbuffer->colorAttachmentInfo.data();
targetbuffer->depthAttachmentInfo.clearValue.depthStencil.depth = 1.0f;
targetbuffer->depthAttachmentInfo.clearValue.depthStencil.stencil = 0;
targetbuffer->depthAttachmentInfo.imageLayout = VK_IMAGE_LAYOUT_DEPTH_ATTACHMENT_OPTIMAL;
renderinfo.pDepthAttachment = &targetbuffer->depthAttachmentInfo;
device->vkCmdBeginRenderingKHR(cb->commandbuffer, &renderinfo);

    Then there is the way render passes affect the image layout state. With the TransitionImageLayout command, it is fairly easy to track the current state of the image layout, but render passes automatically switch the image layout after completion to a predefined state. Again, this is not impossible to handle in and of itself, but when you add these things into the complexity of designing a full engine, things start to get ugly.
void GPUCommandBuffer::EndRenderPass()
{
    vkCmdEndRenderPass(commandbuffer);
    for (int k = 0; k < currentrenderpass->layers; ++k)
    {
        for (int n = 0; n < currentrenderpass->colorcomponents; ++n)
        {
            if (currentdrawbuffer->colortexture[n]) currentdrawbuffer->colortexture[n]->imagelayout[0][currentdrawbuffer->baseface + k] = currentrenderpass->colorAttachment[n].finalLayout;
        }
        if (currentdrawbuffer->depthtexture != NULL and currentrenderpass->depthcomponent == true) currentdrawbuffer->depthtexture->imagelayout[0][currentdrawbuffer->baseface + k] = currentrenderpass->depthAttachment.finalLayout;
    }
    currentdrawbuffer = NULL;
    currentrenderpass = NULL;
}

    Another example where this was causing problems was with user-defined texture buffers. One beta tester wanted to implement some interesting effects that required rendering to some HDR color textures, but the system was so static it couldn't handle a user-defined color format in a texture buffer. Again, this is not impossible to overcome, but the practical outcome is I just didn't have enough time, because resources are finite.
    It's interesting that this extension also removes the need to create a Vulkan framebuffer object. I guess that means you can just start rendering to any combination of textures you want, so long as they use a format that is renderable by the hardware. Vulkan certainly changes a lot of conceptions we had in OpenGL.
    So this extension does eliminate a significant source of problems for me, and I am happy it was implemented.
  16. Josh
    I am surprised at how quickly Vulkan development is coming together. The API is ridiculously verbose, but at the same time it eliminates a lot of hidden states and implicit behavior that made OpenGL difficult to work with. I have vertex buffers working now. Vertices in the new engine will always use this layout:
struct VkVertex
{
    float position[3];
    float normal[3];
    float texcoords0[2];
    float texcoords1[2];
    float tangent[3];
    unsigned char color[4];
    unsigned char boneweights[4];
    unsigned char boneindices[4];
};

    Note there are no longer vertex binormals, as these are calculated in the vertex shader, with the assumption that the texture coordinates have no shearing. There are two sets of UV coordinates available to use. Up to 256 bones per mesh are supported.
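    For illustration, describing that layout to a Vulkan pipeline might look roughly like this. The attribute locations and formats below are my assumptions, not necessarily what the engine uses:

#include <array>
#include <cstddef>          // offsetof
#include <vulkan/vulkan.h>

// Assumed attribute locations and formats for the VkVertex struct shown above.
VkVertexInputBindingDescription GetVertexBinding()
{
    VkVertexInputBindingDescription binding = {};
    binding.binding = 0;
    binding.stride = sizeof(VkVertex);
    binding.inputRate = VK_VERTEX_INPUT_RATE_VERTEX;
    return binding;
}

std::array<VkVertexInputAttributeDescription, 8> GetVertexAttributes()
{
    std::array<VkVertexInputAttributeDescription, 8> attribs = {};
    // fields are: location, binding, format, offset
    attribs[0] = { 0, 0, VK_FORMAT_R32G32B32_SFLOAT, offsetof(VkVertex, position) };
    attribs[1] = { 1, 0, VK_FORMAT_R32G32B32_SFLOAT, offsetof(VkVertex, normal) };
    attribs[2] = { 2, 0, VK_FORMAT_R32G32_SFLOAT,    offsetof(VkVertex, texcoords0) };
    attribs[3] = { 3, 0, VK_FORMAT_R32G32_SFLOAT,    offsetof(VkVertex, texcoords1) };
    attribs[4] = { 4, 0, VK_FORMAT_R32G32B32_SFLOAT, offsetof(VkVertex, tangent) };
    attribs[5] = { 5, 0, VK_FORMAT_R8G8B8A8_UNORM,   offsetof(VkVertex, color) };       // normalized 0-1
    attribs[6] = { 6, 0, VK_FORMAT_R8G8B8A8_UNORM,   offsetof(VkVertex, boneweights) }; // normalized 0-1
    attribs[7] = { 7, 0, VK_FORMAT_R8G8B8A8_UINT,    offsetof(VkVertex, boneindices) }; // integer indices
    return attribs;
}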
    I am creating a few internal classes for Vulkan, out of necessity, and the structure of the new renderer is forming. It's very interesting stuff:
class VkMesh
{
public:
    Vk* environment;
    VkBuffer vertexBuffer;
    VmaAllocation allocation;
    VkBuffer indexBuffer;
    VmaAllocation indexallocation;

    VkMesh();
    ~VkMesh();
};

    I have hit the memory management part of Vulkan. Something that used to be neatly done for you is now pushed onto the developer for no apparent reason. I think this is really pointless because we're all going to end up using a bunch of open-source helper libraries anyway. It's like they are partially open-sourcing the driver.

    You can't just allocate memory buffers as you wish. From vulkan-tutorial.com:
    Nvidia explains it visually. It is better to allocate a smaller number of memory blocks and buffers and split them up:

    I added the Vulkan Memory Allocator library and it works. I honestly have no idea what it is doing, but I am able to delete the Vulkan instance with no errors so that's good.
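    For anyone curious, typical VMA usage looks roughly like the sketch below, based on the library's documented API rather than the engine's code: you create one allocator per device, and vmaCreateBuffer then creates the buffer and binds a sub-allocation of device memory to it in one call.

// Sketch of typical Vulkan Memory Allocator usage (not the engine's code).
#include <vk_mem_alloc.h>

VmaAllocator CreateAllocator(VkInstance instance, VkPhysicalDevice physicaldevice, VkDevice device)
{
    VmaAllocatorCreateInfo allocatorinfo = {};
    allocatorinfo.instance = instance;
    allocatorinfo.physicalDevice = physicaldevice;
    allocatorinfo.device = device;
    VmaAllocator allocator = VK_NULL_HANDLE;
    vmaCreateAllocator(&allocatorinfo, &allocator);
    return allocator;
}

bool CreateVertexBuffer(VmaAllocator allocator, VkDeviceSize size,
                        VkBuffer& buffer, VmaAllocation& allocation)
{
    VkBufferCreateInfo bufferinfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
    bufferinfo.size = size;
    bufferinfo.usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;

    VmaAllocationCreateInfo allocinfo = {};
    allocinfo.usage = VMA_MEMORY_USAGE_GPU_ONLY; // device-local memory

    // Creates the VkBuffer and binds a sub-allocation of a larger memory block to it
    return vmaCreateBuffer(allocator, &bufferinfo, &allocinfo,
                           &buffer, &allocation, nullptr) == VK_SUCCESS;
}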
    Shared contexts are also working, so we can have multiple windows, just like in the OpenGL renderer:

  17. Josh

    I have been putting off testing our new 3D engine on AMD hardware for a while. I knew it was not working as-is, but I was not too concerned. One of the promises of Vulkan is better support across the board and fewer driver bugs, due to the more explicit nature of the API. So when I finally tried out the engine on an R9 200 series card, what would actually happen? Would the promise of Vulkan be realized, or would developers continue to be plagued by problems on different graphics cards? Read on to find out.
    The first thing I had to do was run the new engine on a machine with an AMD graphics card. I removed the Nvidia card from my PC tower, inserted an AMD R9 200 series card into the PCI-E slot of the motherboard, and powered the machine up. Would the AMD card run Vulkan successfully?
    The first error I encountered while running the new 3D engine with Vulkan on an AMD graphics card was that the shadowmap texture format had been explicitly declared as depth-24 / stencil 8, and it should have checked the supported formats to find a depth format the AMD graphics card supported for Vulkan. That was easily fixed.
    The second issue was that my push constants structure was too big. Vulkan 1.1 and 1.2 only guarantee a minimum of 128 bytes for the push constants structure. I never encountered this limit before, but on the AMD graphics card the limit was exactly 128. I was able to eliminate an unneeded vec4 value to bring the structure size down from 144 bytes to 128 bytes.
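    Both problems can be caught up front by querying the device. The sketch below shows the general pattern; the candidate format list and function names are my own, not the engine's:

#include <vulkan/vulkan.h>

// Pick the first depth format the device supports for depth/stencil attachments.
VkFormat ChooseDepthFormat(VkPhysicalDevice physicaldevice)
{
    const VkFormat candidates[] = {
        VK_FORMAT_D24_UNORM_S8_UINT,
        VK_FORMAT_D32_SFLOAT_S8_UINT,
        VK_FORMAT_D32_SFLOAT
    };
    for (VkFormat format : candidates)
    {
        VkFormatProperties props = {};
        vkGetPhysicalDeviceFormatProperties(physicaldevice, format, &props);
        if (props.optimalTilingFeatures & VK_FORMAT_FEATURE_DEPTH_STENCIL_ATTACHMENT_BIT)
            return format;
    }
    return VK_FORMAT_UNDEFINED;
}

// Check the push constant block against the device limit (guaranteed to be >= 128 bytes).
bool PushConstantsFit(VkPhysicalDevice physicaldevice, size_t structsize)
{
    VkPhysicalDeviceProperties props = {};
    vkGetPhysicalDeviceProperties(physicaldevice, &props);
    return structsize <= props.limits.maxPushConstantsSize;
}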
    With these issues fixed, the new engine with Vulkan ran perfectly on my AMD graphics card. The engine also works without a hitch on Intel graphics, with the exception that the number of shader image units is severely restricted on that hardware. Overall it appears the promise of fewer driver bugs under Vulkan is holding true, although there is very wide variability in hardware capabilities that requires rendering fallbacks.
  18. Josh

    When it comes to complex projects I like to focus on whatever area of technology causes me the most uncertainty or worry. Start with the big problems, solve those, and as you continue development the work gets easier and easier. I decided at this stage I really wanted to see how well Vulkan graphics work on Mac computers.
    Vulkan doesn't run natively on Mac, but gets run through a translation library called MoltenVK. How well does MoltenVK actually work? There was only one way to find out...
    Preparing the Development Machine
    The first step was to set up a suitable development machine. The only Mac I currently own is a 2012 MacBook Pro. I had several other options to choose from:
    Use a remote service like MacInCloud to access a new Mac remotely running macOS Big Sur.
    Buy a new Mac Mini with an M1 chip ($699).
    Buy a refurbished Mac Mini ($299-499).

    What are my requirements?
    Compile universal binaries for use on Intel and ARM machines.
    Metal graphics.

    I found that the oldest version of Xcode that supports universal binaries is version 12.2. Running Xcode 12.2 requires macOS Catalina...which happens to be the last version of macOS my 2012 MBP supports! I tried upgrading the OS with the Mac App Store but ran into trouble because the hard drive was not formatted with the new-ish APFS drive format. I tried running Disk Utility in a safe boot, but the option to convert the file system to APFS was disabled in the program menu, no matter what I did. Finally, I created a bootable install USB drive from the Catalina installer and did a clean install of the OS from that.
    I was able to download Xcode 12.2 directly from Apple instead of the App Store and it installed without a hitch. I also installed the Vulkan SDK for Mac and the demos worked fine. The limitations on this Mac appear to be about the same as an Intel integrated chip, so it is manageable (128 shader image units accessible at any time). Maybe this is improved in newer hardware. Performance with this machine in macOS Catalina is actually great. I did replace the mechanical hard drive with an SSD years ago, so that certainly helps.
    Adding Support for Universal Binaries
    Mac computers are currently in another big CPU architecture shift, from Intel x64 to arm64. They previously did this in 2006 when they moved from PowerPC to Intel processors, and just like now, they used a "universal binary" format for static and shared libraries and executables.
    Compiling my own code for universal binaries worked with just one change: the stat64 structure seems to be removed in the ARM builds, but changing it to "stat" worked without any problems. The FreeImage texture loader plugin, on the other hand, required a full day's work before it would compile. There is a general pattern here: when I am working with just my own code, everything works nicely and neatly, but when I am interfacing with other APIs productivity drops by a factor of ten. This is why I am trying to work out all this cross-platform stuff now, so that I can get it all resolved and then my productivity will skyrocket.
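    For illustration, the change was along these lines. This is a generic sketch rather than the engine's actual file code; the FileSize helper and macro names are made up:

    #include <sys/stat.h>
    #include <cstdint>

    // Apple's arm64 headers drop the stat64 variants, and plain stat is already
    // 64-bit there, so a simple conditional keeps the old path on other platforms.
    #if defined(__APPLE__) && defined(__arm64__)
        #define STATSTRUCT struct stat
        #define STATFUNC stat
    #else
        #define STATSTRUCT struct stat64
        #define STATFUNC stat64
    #endif

    uint64_t FileSize(const char* path)
    {
        STATSTRUCT info = {};
        if (STATFUNC(path, &info) != 0) return 0;
        return (uint64_t)info.st_size;
    }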
    macOS File Access
    Since Mojave, macOS has been moving in the direction of requiring the developer to explicitly request access to parts of the file system, or the user to explicitly allow access. On one hand, it makes sense not to allow every app access to all your user files. On the other hand, this really cripples the functionality of Mac computers. ARM CPUs do not, in and of themselves, carry any restrictions I am aware of, but it does look like the Mac is planned to become another walled-garden ecosystem like iOS.
    I had to change the structure of user projects so that the project folders are included in the build. All files and subfolders in the blue folders are packaged into your Xcode application automatically, and the result is a single app file (which is really a folder) ready to publish to the Mac App Store.

    However, this means the Leadwerks-style "publish" feature is not really appropriate for the new editor. Maybe there will be an optional extension that allows you to strip unused files from a project?
    There are still some unanswered questions about how this will work with an editor that involves creating and modifying large numbers of files, but the behavior I have detailed above is the best for games and self-contained applications.
    macOS is now about as locked down as iOS. You cannot run code built on another machine unless it is signed with a certificate, which means Apple can probably turn anyone's program off at any time, or refuse to grant permission to distribute a program. So you might want to think twice before you buy into the Mac ecosystem.

    MoltenVK
    Integration with the MoltenVK library actually went pretty smoothly. However, importing the library into an Xcode project will produce an error like "different teamID for imported library" unless you add yet another setting to your entitlements list, "com.apple.security.cs.disable-library-validation", and set it to YES.
    I was disappointed to see that the maximum number of accessible textures per shader is 16 on my Nvidia GEForce 750, but a fallback will need to be written for this no matter what, because Intel integrated graphics chips have the same issue.
    Finally, after years of wondering whether it worked and months of work, I saw the beautiful sight of a plain blue background rendered with Metal:

    It looks simple, but now that I have the most basic Vulkan rendering working on Mac things will get much easier from here on out.
  19. Josh
    I was going to write about my thoughts on Vulkan, about what I like and don't like, what could be improved, and what ramifications this has for developers and the industry. But it doesn't matter what I think. This is the way things are going, and I have no say in that. I can only respond to these big industry-wide changes and make it work to my advantage. Overall, Vulkan does help us, in both a technical and business sense. That's as much as I feel like explaining.

    Beta subscribers can try the demo out here:
    This is the code it takes to add a depth buffer to the swap chain:
    //----------------------------------------------------------------
    // Depth attachment
    //----------------------------------------------------------------
    auto depthformat = VK_FORMAT_D24_UNORM_S8_UINT;

    VkImage depthimage = nullptr;
    VkImageCreateInfo image_info = {};
    image_info.sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO;
    image_info.pNext = NULL;
    image_info.imageType = VK_IMAGE_TYPE_2D;
    image_info.format = depthformat;
    image_info.extent.width = chaininfo.imageExtent.width;
    image_info.extent.height = chaininfo.imageExtent.height;
    image_info.extent.depth = 1;
    image_info.mipLevels = 1;
    image_info.arrayLayers = 1;
    image_info.samples = VK_SAMPLE_COUNT_1_BIT;
    image_info.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
    image_info.usage = VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT;
    image_info.queueFamilyIndexCount = 0;
    image_info.pQueueFamilyIndices = NULL;
    image_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
    image_info.flags = 0;
    vkCreateImage(device->device, &image_info, nullptr, &depthimage);

    VkMemoryRequirements memRequirements;
    vkGetImageMemoryRequirements(device->device, depthimage, &memRequirements);

    VmaAllocation allocation = {};
    VmaAllocationInfo allocinfo = {};
    VmaAllocationCreateInfo allocCreateInfo = {};
    allocCreateInfo.usage = VMA_MEMORY_USAGE_GPU_ONLY;
    VkAssert(vmaAllocateMemory(GameEngine::Get()->renderingthreadmanager->instance->allocator, &memRequirements, &allocCreateInfo, &allocation, &allocinfo));
    VkAssert(vkBindImageMemory(device->device, depthimage, allocinfo.deviceMemory, allocinfo.offset));

    VkImageView depthImageView;
    VkImageViewCreateInfo view_info = {};
    view_info.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;
    view_info.pNext = NULL;
    view_info.image = depthimage;
    view_info.format = depthformat;
    view_info.components.r = VK_COMPONENT_SWIZZLE_R;
    view_info.components.g = VK_COMPONENT_SWIZZLE_G;
    view_info.components.b = VK_COMPONENT_SWIZZLE_B;
    view_info.components.a = VK_COMPONENT_SWIZZLE_A;
    view_info.subresourceRange.aspectMask = VK_IMAGE_ASPECT_DEPTH_BIT;
    view_info.subresourceRange.baseMipLevel = 0;
    view_info.subresourceRange.levelCount = 1;
    view_info.subresourceRange.baseArrayLayer = 0;
    view_info.subresourceRange.layerCount = 1;
    view_info.viewType = VK_IMAGE_VIEW_TYPE_2D;
    view_info.flags = 0;
    VkAssert(vkCreateImageView(device->device, &view_info, NULL, &depthImageView));

    VkAttachmentDescription depthAttachment = {};
    depthAttachment.format = depthformat;
    depthAttachment.samples = VK_SAMPLE_COUNT_1_BIT;
    depthAttachment.loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
    depthAttachment.storeOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
    depthAttachment.stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
    depthAttachment.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
    depthAttachment.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
    depthAttachment.finalLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL;

    VkAttachmentReference depthAttachmentRef = {};
    depthAttachmentRef.attachment = 1;
    depthAttachmentRef.layout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL;

    VkPipelineDepthStencilStateCreateInfo depthStencil = {};
    depthStencil.sType = VK_STRUCTURE_TYPE_PIPELINE_DEPTH_STENCIL_STATE_CREATE_INFO;
    depthStencil.depthTestEnable = VK_TRUE;
    depthStencil.depthWriteEnable = VK_TRUE;
    depthStencil.depthCompareOp = VK_COMPARE_OP_LESS;
    depthStencil.depthBoundsTestEnable = VK_FALSE;
    depthStencil.minDepthBounds = 0.0f;
    depthStencil.maxDepthBounds = 1.0f;
    depthStencil.stencilTestEnable = VK_FALSE;
    depthStencil.front = {};
    depthStencil.back = {};

    I was hoping I would put a month into it and be up to speed with where we were with OpenGL, but it is much more complicated than that. Using Vulkan is going to be tough, but we will get through it, and I think the benefits will be worthwhile:

    - Vulkan makes our new renderer 80% faster.
    - Better compatibility (Mac, Intel on Linux).
    - There's a lot of demand for Vulkan products thanks to Khronos and Valve's promotion.
  20. Josh
    Having completed a hard-coded rendering pipeline for one single shader, I am now working to create a more flexible system that can handle multiple material and shader definitions. If there's one way I can describe Vulkan, it's "take every single possible OpenGL setting, put it into a structure, and create an immutable cached object based on those settings that you can then use and reuse". This design is pretty rigid, but it's one of the reasons Vulkan is giving us an 80% performance increase over OpenGL. Something as simple as disabling backface culling requires recreation of the entire graphics pipeline, and I think this option is going away. The only thing we use it for is the underside of tree branches and fronds, so that light appears to shine through them, but that is not really correct lighting. If you shine a flashlight on the underside of a palm frond, the surface won't brighten, because all we are showing is the lighting calculated for the front face.

    A more correct way to do this would be to calculate the lighting for the surface normal and for the reversed normal, and then add the results together for the final color. To give the geometry faces in both directions, a plugin could be added to the model editor that generates reversed triangles for all the faces of a selected part of the model. At first the design of Vulkan feels restrictive, but I also appreciate the fact that it has a design goal other than "let's just do what feels good".
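    To make the point about pipeline rigidity concrete, here is roughly where the cull mode lives in Vulkan. This is a generic illustration, not the engine's code:

    #include <vulkan/vulkan.h>

    // The cull mode is baked into the rasterization state of an immutable pipeline
    // object, so changing it means building (or fetching a cached) new VkPipeline.
    VkPipelineRasterizationStateCreateInfo rasterizer = {};
    rasterizer.sType = VK_STRUCTURE_TYPE_PIPELINE_RASTERIZATION_STATE_CREATE_INFO;
    rasterizer.polygonMode = VK_POLYGON_MODE_FILL;
    rasterizer.cullMode = VK_CULL_MODE_BACK_BIT; // changing this requires a different pipeline
    rasterizer.frontFace = VK_FRONT_FACE_COUNTER_CLOCKWISE;
    rasterizer.lineWidth = 1.0f;
    // This structure is one of many plugged into VkGraphicsPipelineCreateInfo,
    // and the resulting pipeline is created once and reused.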
    Using indirect drawing in Vulkan, we can create batches of batches, sorted by shader. This feature is also available in OpenGL, and in fact is used in our vegetation rendering system. Of course the code for all this is quite complex. Draw commands, instance IDs, material IDs, entity 4x4 matrices, and material data all have to be uploaded to the GPU in memory buffers, some of which are more or less static, some of which are updated each frame, and some of which are updated for each new visibility set. It is complicated stuff, but after some time I was able to get it working. The screenshot below shows a scene with five unique objects being drawn in a single draw call, accessing two different materials with different diffuse colors. That means an entire complex scene like The Zone will be rendered in one or just a few passes, with the GPU treating all geometry as if it were a single collapsed object, even as different objects are hidden and shown. Everyone knows that instanced rendering is faster than drawing unique objects, but at some point the number of batches can get high enough to become a bottleneck. Indirect rendering batches the batches to eliminate this slowdown.

    This is one of the features that will help our new renderer run an order of magnitude faster, for high-performance VR and regular 3D games.
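    To give a rough idea of the mechanics described above, the renderer fills a buffer of indirect commands and submits them all at once. This is a simplified sketch, not the engine's code; the Batch struct and function name are invented for the example:

    #include <vulkan/vulkan.h>
    #include <cstdint>
    #include <vector>

    struct Batch
    {
        uint32_t indexcount;
        uint32_t instancecount;
        uint32_t firstindex;
        uint32_t firstinstance;
        int32_t vertexoffset;
    };

    // Fill one indirect command per batch, then submit them all with a single call.
    void RecordIndirectDraws(VkCommandBuffer cmd, VkBuffer indirectbuffer, void* mappedmemory, const std::vector<Batch>& batches)
    {
        auto commands = static_cast<VkDrawIndexedIndirectCommand*>(mappedmemory);
        for (size_t n = 0; n < batches.size(); ++n)
        {
            commands[n].indexCount = batches[n].indexcount;       // indices in this mesh
            commands[n].instanceCount = batches[n].instancecount; // visible copies of this mesh
            commands[n].firstIndex = batches[n].firstindex;
            commands[n].vertexOffset = batches[n].vertexoffset;
            commands[n].firstInstance = batches[n].firstinstance; // used by the shader to look up matrices and material IDs
        }
        vkCmdDrawIndexedIndirect(cmd, indirectbuffer, 0, (uint32_t)batches.size(), sizeof(VkDrawIndexedIndirectCommand));
    }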
  21. Josh
    Shadow maps are now supported in the new Vulkan renderer, for point, spot, and box lights!

    Our new forward renderer eliminates all the problems that deferred renderers have with transparency, so shadows and lighting work great with transparent surfaces. Transparent objects even receive lighting from their back face automatically!

    There is some shadow acne, which I am not going to leave alone, because I want to try some ideas to eliminate it completely so you never have to adjust offsets or other settings. I also want to take another look at variance shadow maps, as these can produce much better results than plain depth map shadows. I also noticed some seams at the edges of point light shadows.
    Another interesting thing to note is that the new renderer handles light and shadows with orthographic projection really well.

    A new parameter has also been added to the JSON material loader. You can add a scale factor inside the normalTexture block to make normal maps appear bumpier. The default value is 1.0:
    "normalTexture": { "file": "./glass_dot3.tex", "scale": 2.0 } Also, the Context class has been renamed to "Framebuffer". Use CreateFramebuffer() instead of CreateContext().
  22. Josh
    One of the best points of Vulkan is how shaders are loaded from precompiled SPIR-V files. This means GLSL shaders either work or they don't. Unlike OpenGL, there is no different outcome on Intel, AMD, or Nvidia hardware. SPIR-V files can be compiled using a couple of different utilities. I favor LunarG's compiler because it supports #include directives.
    Shader.vert:
    #version 450
    #extension GL_ARB_separate_shader_objects : enable

    #include "VertexLayout.glsl"

    layout(push_constant) uniform pushBlock
    {
        vec4 materialcolor;
    } pushConstantsBlock;

    layout(location = 0) out vec3 fragColor;

    void main()
    {
        gl_Position = vec4(inPosition.xy, 0.0, 1.0);
        fragColor = inColor.rgb * pushConstantsBlock.materialcolor.rgb;
    }

    VertexLayout.glsl:

    layout(location = 0) in vec3 inPosition;
    layout(location = 1) in vec3 inNormal;
    layout(location = 2) in vec2 inTexCoords0;
    layout(location = 3) in vec2 inTexCoords1;
    layout(location = 4) in vec3 inTangent;
    layout(location = 5) in vec4 inColor;
    layout(location = 6) in vec4 inBoneWeights;
    layout(location = 7) in uvec4 inBoneIndices;

    If the shader compiles successfully, then you don't have to worry about whether it works on different manufacturers' hardware. It just works. So if someone writes a new post-processing effect they don't need to test on other hardware or worry about people asking for help when it doesn't work, because it always works the same.
    You can try it yourself with these files:
    Shaders.zip
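    For reference, turning a compiled .spv file into a usable shader module is a short, fixed routine. This is a generic sketch, not the engine's loader; the helper name and error handling are mine:

    #include <vulkan/vulkan.h>
    #include <fstream>
    #include <vector>
    #include <string>

    // Read a compiled SPIR-V file and wrap it in a VkShaderModule.
    VkShaderModule LoadShaderModule(VkDevice device, const std::string& path)
    {
        std::ifstream file(path, std::ios::binary | std::ios::ate);
        if (!file.is_open()) return VK_NULL_HANDLE;

        size_t size = (size_t)file.tellg();
        std::vector<char> code(size);
        file.seekg(0);
        file.read(code.data(), size);

        VkShaderModuleCreateInfo info = {};
        info.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
        info.codeSize = code.size(); // size in bytes
        info.pCode = reinterpret_cast<const uint32_t*>(code.data());

        VkShaderModule module = VK_NULL_HANDLE;
        if (vkCreateShaderModule(device, &info, nullptr, &module) != VK_SUCCESS) return VK_NULL_HANDLE;
        return module;
    }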
  23. Josh
    In Vulkan all shader uniforms are packed into a single structure declared in a GLSL shader like this:
    layout(push_constant) uniform pushBlock
    {
        vec4 color;
    } pushConstantsBlock;

    You can add more values, but the shaders all need to use the same structure, and it needs to be declared exactly the same inside the program.
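    On the C++ side that means keeping a struct whose layout matches the GLSL block byte for byte, and declaring the same size in the pipeline layout. A minimal sketch, using a plain float array in place of an engine vector type:

    // Must match the push_constant block in every shader: one vec4 = 16 bytes.
    struct PushConstantsBlock
    {
        float color[4];
    };
    static_assert(sizeof(PushConstantsBlock) == 16, "layout must match the GLSL block");

    // The same size and stage flags go into the pipeline layout:
    // VkPushConstantRange range = { VK_SHADER_STAGE_ALL, 0, sizeof(PushConstantsBlock) };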
    Like everything else in Vulkan, push constants are set inside a command buffer. But these shader values are likely to change constantly, every frame, so how do you handle that? The answer is to keep a pool of command buffers and retrieve an available one whenever this operation needs to be performed.
    void Vk::SetShaderGlobals(const VkShaderGlobals& shaderglobals)
    {
        VkCommandBuffer commandbuffer;
        VkFence fence;
        commandbuffermanager->GetManagedCommandBuffer(commandbuffer, fence);

        VkCommandBufferBeginInfo beginInfo = {};
        beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
        beginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
        vkBeginCommandBuffer(commandbuffer, &beginInfo);
        vkCmdPushConstants(commandbuffer, pipelineLayout, VK_SHADER_STAGE_ALL, 0, sizeof(shaderglobals), &shaderglobals);
        vkEndCommandBuffer(commandbuffer);

        VkSubmitInfo submitInfo = {};
        submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        submitInfo.commandBufferCount = 1;
        submitInfo.pCommandBuffers = &commandbuffer;
        vkQueueSubmit(devicequeue[0], 1, &submitInfo, fence);
    }

    I now have a rectangle that flashes on and off based on the current time, which is fed in through a shader uniform structure. Now at 1500 lines of code.
    You can download my command buffer manager code at the Leadwerks Github page:
  24. Josh
    Now that I have all the Vulkan knowledge I need, and most work is being done in GLSL shader code, development is moving faster. Before starting voxel ray tracing, another hard problem, I decided to work on some *relatively* easier things for a few days. I want tessellation to be an everyday feature in the new engine, so I decided to work out a useful implementation of it.
    While there are a ton of examples out there showing how to split a triangle up into smaller triangles, useful discussion of in-game tessellation techniques is much rarer. I think this is because there are several problems to solve before this technical feature can really be made practical.
    Tessellation Level
    The first problem is deciding how much to tessellate an object. Tessellation should use a single detail level per set of primitives being drawn, because cracks will appear when you apply displacement if each polygon uses a different tessellation level. I solved this with per-mesh settings for the tessellation parameters.
    Note: In Leadwerks Game Engine, a model is an entity with one or more surfaces. Each surface has a vertex array, an index array, and a material. In Turbo Game Engine, a model contains one or more LODs, and each LOD can have one or more meshes. A mesh object in Turbo is like a surface object in Leadwerks.
    We are not used to per-mesh settings. In fact, the material is the only parameter a mesh contains other than vertex and index data. But for tessellation parameters, that is exactly what we need, because the density of the mesh polygons gives us an idea of how detailed the tessellation should be. This command has been added:
    void Mesh::SetTessellation(const float detail, const float nearrange, const float farrange)

    Here is what the parameters do:

    - detail: the higher the detail, the more the polygons are split up. Default is 16.
    - nearrange: the distance below which tessellation stops increasing. Default is 1.0 meters.
    - farrange: the distance below which tessellation starts increasing. Default is 20.0 meters.

    This command gives you the ability to set properties that will give a roughly equal distribution of polygons in screen space. For convenience, a similar command has been added to the model class, which will apply the settings to all meshes in the model.
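    As a usage sketch (the LoadModel call, the file path, and the lods/meshes member names here are assumptions based on the description above, and the actual API may differ):

    // Apply one set of tessellation parameters to every mesh in the model:
    auto model = LoadModel(world, "Models/rocks.gltf");
    model->SetTessellation(16.0f, 1.0f, 20.0f);

    // Or tune a single mesh individually:
    model->lods[0]->meshes[0]->SetTessellation(32.0f, 0.5f, 10.0f);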
    Surface Displacement
    Another problem is culling offscreen polygons so the GPU doesn't have to process millions of extra vertices. I solved this by testing if all control vertices lie outside one edge of the camera frustum. (This is not the same as testing if any control point is inside the camera frustum, as I have seen suggested elsewhere. The latter will cause problems because it is still possible for a triangle to be visible even if all its corners are outside the view.) To account for displacement, I also tested the same vertices with maximum displacement applied.
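    The test itself is simple once stated. Here is a hypothetical CPU-side version of the same logic (the shader implementation differs in detail, and the type names are mine):

    #include <array>

    // A patch is culled only if all of its control points, padded by the maximum
    // displacement, lie outside the same frustum plane. Plane normals point inward.
    struct Vec3 { float x, y, z; };
    struct Plane { Vec3 normal; float d; };

    float Distance(const Plane& p, const Vec3& v)
    {
        return p.normal.x * v.x + p.normal.y * v.y + p.normal.z * v.z + p.d;
    }

    bool PatchOutsideFrustum(const std::array<Plane, 6>& frustum, const Vec3* points, int count, float maxdisplacement)
    {
        for (const auto& plane : frustum)
        {
            bool alloutside = true;
            for (int n = 0; n < count; ++n)
            {
                // Pad the test by the maximum displacement so displaced geometry is not culled.
                if (Distance(plane, points[n]) > -maxdisplacement) { alloutside = false; break; }
            }
            if (alloutside) return true; // every control point is behind this one plane
        }
        return false;
    }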
    To control the amount of displacement, a scale property has been added to the displacementTexture object scheme:
    "displacementTexture": { "file": "./harshbricks-height5-16.png", "scale": 0.05 } A Boolean value called "tessellation" has also been added to the JSON material scheme. This will tell the engine to choose a shader that uses tessellation, so you don't need to explicitly specify a shader file. Shadows will also be rendered with tessellation, unless you explicitly choose a different shadow shader.
    Here is the result:

    Surface displacement takes the object scale into account, so if you scale up an object the displacement will increase accordingly.
    Surface Curvature
    I also added an implementation of the PN Triangles technique. This treats a triangle's vertices as control points for a Bezier surface and projects the tessellated, curved surface outwards.
     
    You can see below that the shot using the PN Triangles technique eliminates the pointy edges of the sphere.


    The effect is good, although it is more computationally expensive, and if a strong displacement map is in use you can't really see a difference. Since vertex positions are being changed but texture coordinates still use the same interpolation function, texture coordinates can appear distorted. To counter this, texture coordinates would need to be recalculated from the new vertex positions.
    EDIT:
    I found a better algorithm that doesn't produce the texcoord errors seen above.


    Finally, a global tessellation factor has been added that lets you scale the effect for different hardware levels:
    void SetTessellationDetail(const float detail)

    The default setting is 1.0, so you can use this to scale the detail up or down any way you like.
  25. Josh

    Articles
    Before finalizing Ultra App Kit I want to make sure our 3D engine works correctly with the GUI system. This is going to be the basis of all our 3D tools in the future, so I want to get it right before releasing the GUI toolkit, and avoid breaking changes after the software is released.
    Below you can see our new 3D engine being rendered in a viewport created on a GUI application. The GUI is being rendered using Windows GDI+, the same system that draws the real OS interface, while the 3D rendering is performed with Vulkan 1.1. The GUI is using an efficient event-driven program structure with retained mode drawing, while Vulkan rendering is performed asynchronously in real time, on another thread. (The rendering thread can also be set to render only when the viewport needs to be refreshed.)

    The viewport resizes nicely with the window:

    During this process I learned there are two types of child window behavior. If a window is parented to another window it will appear on top of the parent, and it won’t have a separate icon appear in the Windows task bar. Additionally, if the WS_CHILD window style is used, then the child window coordinates will be relative to the parent, and moving the parent will instantly move the child window with it. We need both types of behavior. A splash screen is an example of the first, and a 3D viewport is an example of the second. Therefore, I have added a WINDOW_CHILD window creation flag you can use to control this behavior.
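    A rough sketch of how the flag is meant to be used; the creation function's exact parameters here are assumptions, not the final API:

    // Hypothetical: a main window, with a 3D viewport window parented to it.
    // WINDOW_CHILD makes the viewport's coordinates relative to the parent and
    // keeps it moving with the parent, like the WS_CHILD style described above.
    auto mainwindow = CreateWindow("Editor", 0, 0, 1280, 720, nullptr, WINDOW_TITLEBAR | WINDOW_RESIZABLE);
    auto viewport = CreateWindow("", 200, 0, 1080, 720, mainwindow, WINDOW_CHILD);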
    This design has been my plan going back several years, and at long last we have the result. This will be a strong foundation for creating game development tools like the new engine's editor, as well as other ideas I have.
    This is what "not cutting corners" looks like.