Blog Entries posted by Josh

Multisampled Shadowmaps

By Josh, July 3, 2018

Because variance shadow maps allow us to store pre-blurred shadow maps, they also allow us to take advantage of multisampled textures. MSAA is a technique that renders extra samples around the target pixel and averages the results. This can help bring out fine lines that are smaller than a pixel onscreen, and it also greatly reduces jagged edges. I wanted to see how well this would work for rendering shadow maps, and to see if I could reduce the ragged edge appearance that shadow maps are sometimes prone to.
Below is the shadow rendered at 1024x1024 with no multisampling and a 3x3 blur:

Using a 4X MSAA texture eliminates the appearance of jagged edges in the shadow:

Here they are side by side:

This is very exciting stuff because we are challenging some of the long-held limitations of real-time graphics.
- 2 comments
- 2,160 views
Multiple Shadows

By Josh, July 3, 2018

Texture arrays are a feature that allows you to pack multiple textures into a single one, as long as they all use the same format and size. In practice, this is just a convenience feature that packs all the textures into a single 3D texture. It allows things like cubemap lookups with a 3D texture, but the implementation is sort of inconsistent. It would be much better if we were just given 1000 texture units to use. However, texture arrays can be used to pack all scene shadow maps into a single texture so that they can be rendered in a single pass with the clustered forward renderer.
The results are great and speed is very fast. However, there are some limitations. I said early on that my top priority with the design of the new renderer is speed. That means I will make decisions that favor speed over other flexibility, and here is a situation where we will see that in action. All scene shadow maps need to be packed into a single array texture of fixed size, which means there is a hard upper limit on total shadow-casting lights in the world.
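The hard upper limit comes from the fact that every shadow-casting light has to claim layers in the array texture up front. Here is a rough sketch of that bookkeeping. The `ShadowAtlas` class and the choice of six consecutive layers per point light are my own assumptions for illustration, not the engine's actual code:

```cpp
#include <cassert>
#include <vector>

// Hypothetical bookkeeping for a fixed-size shadow map array texture.
class ShadowAtlas {
public:
    static constexpr int kFacesPerLight = 6;  // assumed: one cube shadow map per light

    explicit ShadowAtlas(int maxLights) : used_(maxLights, false) { assert(maxLights > 0); }

    // Returns the first array layer reserved for a light, or -1 when the array is full.
    int Acquire() {
        for (int i = 0; i < (int)used_.size(); ++i) {
            if (!used_[i]) { used_[i] = true; return i * kFacesPerLight; }
        }
        return -1;
    }

    // Frees the layers previously returned by Acquire().
    void Release(int firstLayer) {
        assert(firstLayer >= 0 && firstLayer % kFacesPerLight == 0);
        used_[firstLayer / kFacesPerLight] = false;
    }

private:
    std::vector<bool> used_;
};
```

When `Acquire()` returns -1 the light simply cannot cast a shadow until another light releases its slot.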
I've also discovered that my beautiful variance shadow maps use a ton of memory. At maximum quality they use an RGBA 32-bit floating point format, so that means a single 1024x1024 cubemap consumes 96 megabytes! (A standard shadow map at the same resolution uses 24 megabytes VRAM). Because all shadows are packed into a single texture, the driver can't even page the data in and out of video memory. If you don't have enough VRAM, you will get an OUT_OF_MEMORY error. So anticipating and handling this issue will be important. Hopefully I can just use appropriate defaults. I think I can cut the size of the VSMs down to 25%, but without the beautiful shadow scattering effect. Because the textures all have to be the same size, it is also impossible to set just one light to use higher resolution settings.
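The memory figures above follow directly from the texture dimensions. A quick sketch of the arithmetic (the function name is made up, and the two-channel 16-bit case is only my guess at one way to reach the 25% figure):

```cpp
#include <cassert>
#include <cstdint>

// Bytes used by one cube shadow map: six square faces, no mipmaps.
std::uint64_t CubemapBytes(std::uint64_t size, std::uint64_t channels, std::uint64_t bytesPerChannel) {
    assert(size > 0 && channels > 0 && bytesPerChannel > 0);
    return size * size * 6 * channels * bytesPerChannel;
}

// CubemapBytes(1024, 4, 4): RGBA 32-bit float, 96 * 1024 * 1024 bytes
// CubemapBytes(1024, 1, 4): one 32-bit channel, 24 * 1024 * 1024 bytes
// CubemapBytes(1024, 2, 2): assumed two-channel 16-bit layout, a quarter of the RGBA32F size
```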
If you want speed, I have to build more constraints into the engine. This is the kind of thing I was talking about. I want great graphics and the absolute fastest performance, so that is what I am doing.
Okay, so with all that information and disclaimers out of the way, I give you the first shot showing multiple lights being rendered with shadows in a single pass in our new forward renderer.

Here are three lights:

And here I lowered the shadow map resolution and added 50 randomly placed lights. There are some artifacts and glitches, but it's still a pretty cool shot. All running in real-time, in a single pass:

Keep in mind this is all before any indirect lighting has been added. The future looks bright!
- 5 comments
- 2,587 views
Variance Shadow Maps

By Josh, July 1, 2018

After a couple days of work I got point light shadows working in the new clustered forward renderer. This time around I wanted to see if I could get a more natural look for shadow edges, as well as reduce or eliminate shadow acne. Shadow acne is an effect that occurs when the resolution of the shadow map is too low, and incorrect depth comparisons start being made with the lit pixels. By default, any shadow mapping algorithm will look like this, because not every pixel onscreen has an exact match in the shadow map when the depth comparison is made:

We can add an offset to the shadow depth value to eliminate this artifact:
However, this can push the shadow back too far, and it's hard to come up with values that cover all cases. This is especially problematic with point lights that are placed very close to a wall. This is why the editor allows you to adjust the light range of each light, on an individual basis.
I came across a technique called variance shadow mapping. I saw this paper years ago, but never took the time to implement it because it just wasn't a big priority. This works by writing the depth and depth squared values into a GL_RG texture (I use 32-bit floating point values). The resulting image is then blurred and the variance of the values can be calculated from the average squared depth stored in the green channel.

Then we use Chebyshev's inequality to get an average shadow value:
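Here is a minimal sketch of that step, written as plain C++ rather than shader code. It assumes the blurred red channel holds the mean depth and the green channel holds the mean squared depth, as described above. The function name and the small variance clamp are my own additions for illustration:

```cpp
#include <algorithm>
#include <cassert>

// Upper bound on the fraction of light reaching a receiver at receiverDepth,
// given the blurred moments mean = E[d] and meanSq = E[d^2].
float ChebyshevUpperBound(float mean, float meanSq, float receiverDepth, float minVariance = 1e-5f) {
    assert(minVariance > 0.0f);
    if (receiverDepth <= mean) return 1.0f;  // receiver is in front of the average occluder: fully lit
    // variance = E[d^2] - E[d]^2, clamped to avoid dividing by zero on flat regions
    const float variance = std::max(meanSq - mean * mean, minVariance);
    const float d = receiverDepth - mean;
    return variance / (variance + d * d);    // one-sided Chebyshev inequality
}
```

The result is used directly as the light visibility for the pixel, so there is no hard depth comparison and no shadow offset involved.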

So it turns out, statistics is actually good for something useful after all. Here are the results:

The shadow edges are actually soft, without any graininess or pixelation. There is a black border on the edge of the cubemap faces, but I think this is caused by my calculated cubemap face not matching the one the hardware uses to perform the texture lookup, so I think it can be fixed.
As an added bonus, this eliminates the need for a shadow offset. Shadow acne is completely gone, even in the scene below with a light that is extremely close to the floor.

The banding you are seeing is added by the JPEG compression and is not visible in the original render.
Finally, because the texture filtering is so smooth, shadowmaps look much higher resolution than with PCF filtering. By increasing the light range, I can light the entire scene, and it looks great just using a 1024x1024 cube shadow map.

VSMs are also quite fast because they only require a single texture lookup in the final pass. So we get better image quality, and probably slightly faster speed. Taking extra time to pay attention to small details like this is going to make your games look great soon!
- 0 comments
- 5,382 views
Clustered Forward Rendering Victory

By Josh, June 26, 2018

I got the remaining glitches worked out, and the deal is that clustered forward rendering works great. It has more flexibility than deferred rendering and it performs a lot faster. This means we can use a better materials and lighting system, and at the same time have faster performance, which is great for VR especially. The video below shows a scene with 50 lights working with fast forward rendering.
One of the last things I added was switching from a fixed grid size of 16x16x16 to an arbitrary layout that can be set at any time. Right now I have it set to 16x8x64, but I will have to experiment to see what the optimum dimensions are.
There are a lot of things to add (like shadows!) but I have zero concern about everything else working. The hard part is done, and I can see that this technique works great.
- 13 comments
- 4,563 views
Taking Care of Business

By Josh, June 22, 2018

This is about financial stuff, and it's not really your job to care about that, but I still think this is cool and wanted to share it with you.
People are buying stuff on our website, and although the level of sales is much lower than Steam it has been growing. Unlike Steam, sales through our website are not dependent on a third party and cannot be endangered by flooded marketplaces, strange decisions, and other random events.
Every customer I checked who used a credit card has kept it on file for further purchases. The credit card number isn't actually stored on our server, and I never see it. Instead, PayPal stores the number and we store a token that can only be used for purchases through this domain, so it is impossible for a hacker to steal your credit card from our site. I feel better doing things this way because it's much safer for everyone.

Anyways, having a customer who has the ability to buy anything they want on your site at a moment's notice is a very good thing. You give them a good product and they can easily buy it, if it is something they want.
This is what I hoped to see by introducing these features to the site, so it is nice to see it working. Thank you for your support! I will work hard to bring you more great software.
- 5 comments
- 2,421 views
Clustered Forward Rendering Progress

By Josh, June 25, 2018

In order to get the camera frustum space dividing up correctly, I first implemented a tiled forward renderer, which just divides the screen up into a 2D grid. After working out the math with this, I was then able to add the third dimension and make an actual volumetric data structure to hold the lighting information. It took a lot of trial and error, but I finally got it working.
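To give an idea of what the lookup involves, here is a rough sketch of mapping a pixel to a grid cell. The names are hypothetical, and I am slicing depth linearly between the near and far planes purely as an assumption. Other depth distributions are possible and this is not necessarily what the renderer does:

```cpp
#include <algorithm>
#include <cassert>

struct GridSize { int x, y, z; };
struct Cell { int x, y, z; };

// Maps t in [0,1] to one of n slots, clamping values on or outside the edges.
static int Slot(float t, int n) { return std::max(0, std::min(int(t * n), n - 1)); }

// u, v: screen position in [0,1]. depth: view-space distance between nearZ and farZ.
Cell FindCell(const GridSize& grid, float u, float v, float depth, float nearZ, float farZ) {
    assert(grid.x > 0 && grid.y > 0 && grid.z > 0 && farZ > nearZ);
    const float w = (depth - nearZ) / (farZ - nearZ);  // assumed linear depth slicing
    return { Slot(u, grid.x), Slot(v, grid.y), Slot(w, grid.z) };
}

// Flattened index into a per-cell table of light lists.
int CellIndex(const GridSize& grid, const Cell& c) {
    return (c.z * grid.y + c.y) * grid.x + c.x;
}
```

Dropping the depth term from `FindCell` gives the 2D tiled version I started with.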

This screenshot shows the way the camera frustum is divided up into a cubic grid of 16x16x16 cells. Red and green show the XY position, while the blue component displays the depth:

And here you can see the depth by itself, enhanced for visibility:

I also added dithering to help hide light banding that can appear in gradients. Click on the image below to view it properly:

I still have some bugs to resolve, but the technique basically works. I have no complete performance benchmarks yet to share but I think this approach is a lot faster than deferred rendering. It also allows much more flexible lighting, so it will work well with the advanced lighting system I have planned.
- 1 comment
- 2,662 views
Clustered Forward Rendering - First Performance Metrics

By Josh, June 23, 2018

I was able to partially implement clustered forward rendering. At this time, I have not divided the camera frustum up into cells and I am just handing a single point light to the fragment shader, but instead of a naive implementation that would just upload the values in a shader uniform, I am going through the route of sending light IDs in a buffer. I first tried texture buffers because they have a large maximum size and I already have a GPUMemBlock class that makes them easy to work with. Because the GPU likes things to be aligned to 16 bytes, I am treating the buffer as an array of ivec4s, which makes the code a little trickier, thus we have a loop within a loop with some conditional breaks:
    vec4 CalculateLighting(in vec3 position, in vec3 normal)
    {
        vec4 lighting = vec4(0.0f);
        int n, i, lightindex, countlights[3];
        vec4 lightcolor;
        ivec4 lightindices;
        mat4 lightmatrix;
        vec2 lightrange;
        vec3 lightdir;
        float l, falloff;

        //Get light list offset
        int lightlistpos = 0;//texelFetch(texture12, sampleCoord, 0).x;

        //Point Lights
        countlights[0] = texelFetch(texture11, lightlistpos).x;
        for (n = 0; n <= countlights[0] / 4; ++n)
        {
            lightindices = texelFetch(texture11, lightlistpos + n);
            for (i = 0; i < 4; ++i)
            {
                if (n == 0 && i == 0) continue; //skip first slot since that contains the light count
                if (n * 4 + i > countlights[0]) break; //break if we go out of bounds of the light list
                lightindex = lightindices[i]; //read the slot for this iteration
                lightmatrix[3] = texelFetch(texture15, lightindex * 4 + 3);
                lightdir = position - lightmatrix[3].xyz;
                l = length(lightdir);
                falloff = max(0.0f, -dot(normal, lightdir / l));
                if (falloff <= 0.0f) continue;
                lightrange = texelFetch(texture15, lightindex * 4 + 4).xy;
                falloff *= max(0.0f, 1.0f - l / lightrange.y);
                if (falloff <= 0.0f) continue;
                lightmatrix[0] = texelFetch(texture15, lightindex * 4);
                lightmatrix[1] = texelFetch(texture15, lightindex * 4 + 1);
                lightmatrix[2] = texelFetch(texture15, lightindex * 4 + 2);
                lightcolor = vec4(lightmatrix[0].w, lightmatrix[1].w, lightmatrix[2].w, 1.0f);
                lighting += lightcolor * falloff;
            }
        }
        return lighting;
    }

I am testing with Intel graphics in order to get a better idea of where the bottlenecks are. My GeForce 1080 just chews through this without blinking an eye, so the slower hardware is actually helpful in tuning performance. I was dismayed at first when I saw my framerate drop from 700 to 200+. I created a simple scene in Leadwerks 4 with one point light and no shadows, and the performance was quite a bit worse on this hardware, so it looks like I am actually doing well. Here are the numbers:
- Turbo (uniform buffer): 220 FPS
- Turbo (texture buffer): 290 FPS
- Leadwerks 4: 90 FPS

Of course a discrete card will run much better. The depth pre-pass has a very slight beneficial effect in this scene, and as more lights and geometry are added, I expect the performance savings will become much greater.
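The light list layout that the shader above reads (the count in the first slot, then light IDs packed four to an ivec4) can be sketched on the CPU side like this. The `IVec4` alias and both function names are stand-ins I made up for the example. The real code goes through the GPUMemBlock class:

```cpp
#include <array>
#include <cassert>
#include <vector>

using IVec4 = std::array<int, 4>;  // stand-in for a 16-byte ivec4

// Packs the light count followed by the light IDs into ivec4 slots.
std::vector<IVec4> PackLightList(const std::vector<int>& lightIds) {
    const int count = (int)lightIds.size();
    std::vector<IVec4> packed(count / 4 + 1, IVec4{0, 0, 0, 0});
    packed[0][0] = count;                    // flat slot 0 holds the count
    for (int k = 0; k < count; ++k) {
        const int slot = k + 1;              // light k lives in flat slot k + 1
        packed[slot / 4][slot % 4] = lightIds[k];
    }
    return packed;
}

// Reads the list back using the same nested loop as the shader.
std::vector<int> UnpackLightList(const std::vector<IVec4>& packed) {
    assert(!packed.empty());
    std::vector<int> ids;
    const int count = packed[0][0];
    for (int n = 0; n <= count / 4; ++n) {
        for (int i = 0; i < 4; ++i) {
            if (n == 0 && i == 0) continue;  // skip the count
            if (n * 4 + i > count) break;    // past the end of the list
            ids.push_back(packed[n][i]);
        }
    }
    return ids;
}
```

Five lights, for example, occupy two ivec4s: the count plus three IDs in the first, and the remaining two IDs in the second.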
Post-processing effects like bloom require a texture with the scene rendered to it, so this system will still need to render to a single color texture when these effects are in use. The low quality settings, however, will render straight to the back buffer and thus provide a much better fallback for low-end hardware.

Here we can see the same shader working with lots of lights. To get good performance out of this, the camera frustum needs to be divided up into cells with a list of relevant lights for each cell.

There are two more benefits to this approach. Context multisample antialiasing can be used when rendering straight to the back buffer. Of course, we can do the same with deferred rendering and multisample textures now, so that is not that big of a deal.

What IS a big deal is the fact that transparency with shadows will work 100%, no problems. All the weird tricks and hacks we have tried to use to achieve this all go away. (The below image is one such hack that uses dithering combined with MSAA to provide 50% transparency...sort of.)

Everything else aside, our first tests reveal more than a 3X increase in performance over the lighting approach that Leadwerks 4 uses. Things look fantastic!
- 0 comments
- 1,662 views
Clustered Forward Rendering

By Josh, June 20, 2018

I decided I want the voxel GI system to render direct lighting on the graphics card, so in order to make that happen I need working lights and shadows in the new renderer. Tomorrow I am going to start my implementation of clustered forward rendering to replace the deferred renderer in the next game engine. This works by dividing the camera frustum up into sectors, as shown below.

A list of visible lights for each cell is sent to the GPU. If you think about it, this is really another voxel algorithm. The whole idea of voxels is that it costs too much processing power to calculate something expensive for each pixel, so lets calculate it for a 3D grid of volumes and then grab those settings for each pixel inside the volume. In the case of real-time global illumination, we also do a linear blend between the values based on the pixel position.
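The linear blend mentioned above is ordinary trilinear interpolation between the eight grid values surrounding a point. A minimal sketch, using a made-up `Grid3` type that stores one float per cell (real GI data would store colors):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Grid3 {
    int nx, ny, nz;
    std::vector<float> values;  // nx * ny * nz entries, x varies fastest
    float At(int x, int y, int z) const { return values[(z * ny + y) * nx + x]; }
};

inline float Lerp(float a, float b, float t) { return a + (b - a) * t; }

// (x, y, z) is in grid units, with 0 <= x <= nx - 1 and likewise for y and z.
float SampleBlended(const Grid3& g, float x, float y, float z) {
    assert(g.nx >= 2 && g.ny >= 2 && g.nz >= 2);
    assert(x >= 0 && y >= 0 && z >= 0 && x <= g.nx - 1 && y <= g.ny - 1 && z <= g.nz - 1);
    const int x0 = std::min(int(x), g.nx - 2);
    const int y0 = std::min(int(y), g.ny - 2);
    const int z0 = std::min(int(z), g.nz - 2);
    const float tx = x - x0, ty = y - y0, tz = z - z0;
    // Blend along x first, then y, then z.
    const float c00 = Lerp(g.At(x0, y0, z0), g.At(x0 + 1, y0, z0), tx);
    const float c10 = Lerp(g.At(x0, y0 + 1, z0), g.At(x0 + 1, y0 + 1, z0), tx);
    const float c01 = Lerp(g.At(x0, y0, z0 + 1), g.At(x0 + 1, y0, z0 + 1), tx);
    const float c11 = Lerp(g.At(x0, y0 + 1, z0 + 1), g.At(x0 + 1, y0 + 1, z0 + 1), tx);
    return Lerp(Lerp(c00, c10, ty), Lerp(c01, c11, ty), tz);
}
```

On the GPU a 3D texture with linear filtering does this blend in hardware.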
Here's a diagram of a spherical point light lying on the frustum.

But if we skew the frustum so that the lines are all perpendicular, we can see this is actually a voxel problem, and it's the light that is warped in a funny way, not the frustum. I couldn't figure out how to warp the sphere exactly right, but it's something like this.

For each pixel that is rendered, you transform it to the perpendicular grid above and perform lighting using only the lights that are present in that cell. This technique seems like a no-brainer, but it would not have been possible to do this when our deferred renderer first came to be. GPUs were not nearly as flexible back then as they are now, and things like a variable-length for loop would be a big no-no.

Well, something else interesting occurred to me while I was going over this. The new engine is an ambitious project, with a brand new editor to be built from scratch. That's going to take a lot of time. There's a lot of interest in the features I am working on now, and I would like to get them out sooner rather than later. It might be possible to incorporate the clustered forward renderer and voxel GI into Leadwerks Game Engine 4 (at which point I would probably call it 5) but keep the old engine architecture. This would give Leadwerks a big performance boost (not as big as the new architecture, but still probably 2-3x in some situations). The visuals would also make a giant leap forward into the future. And it might even be possible to release in time for Christmas. All the shaders would have to be modified, but if you just updated your project everything would run in the new Leadwerks Game Engine 5 without any problem. This would need to be a paid update, probably with a new app ID on Steam. The current Workshop contents would not be accessible from the new app ID, but we have the Marketplace for that.
This would also have the benefit of bringing the editor up to date with the new rendering methods, which would mean the existing editor could be used seamlessly with the new engine. We presently can't do this because the new engine and Leadwerks 4 use completely different shaders.
This could solve a lot of problems and give us a much smoother transition from here to where we want to go in the future:
- Leadwerks Game Engine 4 (deferred rendering, existing editor) [Now]
- Leadwerks Game Engine 5 (clustered forward rendering, real-time GI, PBR materials, existing architecture, existing editor) [Christmas 2018]
- Turbo Game Engine (clustered forward rendering, new architecture, new editor) [April 2020]

I just thought of this a couple hours ago, so I can't say right now for sure if we will go this route, but we will see. No matter what, I want to get a version 4.6 out first with a few features and fixes.
You can read more about clustered forward rendering in this article.
- 7 comments
- 9,607 views
Voxel Cone Tracing Part 5 - Hardware Acceleration

By Josh, June 18, 2018

I was having trouble with cone tracing and decided to first try a basic GI algorithm based on a pattern of raycasts. Here is the result:

You can see this is pretty noisy, even with 25 raycasts per voxel. Cone tracing uses an average sample, which eliminates the noise problem, but it does introduce more inaccuracy into the lighting.
Next I wanted to try a more complex scene and get an estimate of performance. You may recognize the voxelized scene below as the "Sponza" scene frequently used in radiosity testing:

Direct lighting takes 368 milliseconds to calculate, with voxel size of 0.25 meters. If I cut the voxel grid down to a 64x64x64 grid then lighting takes just 75 milliseconds.
These speeds are good enough for soft GI that gradually adjusts as lighting changes, but I am not sure if this will be sufficient for our purposes. I'd like to do real-time screen-independent reflections.
I thought about it, and I thought about it some more, and then when I was done with that I kept thinking about it. Here's the design I came up with:

The final output is a 3D texture containing light data for all six possible directions. (So a 256x256x256 grid of voxels would actually be 1536x256x256 RGB, equal to 288 megabytes.) The lit voxel array would also be six times as big. When a pixel is rendered, three texture lookups are performed on the 3D texture and multiplied by the normal of the pixel. If the voxel is empty, there is no GI information for that volume, so maybe a higher mipmap level could be used (if mipmaps are generated in the last step). The important thing is we only store the full-resolution voxel grid once.
The downsampled voxel grid uses an alpha channel for coverage. For example, a voxel with 0.75 alpha would have six out of eight solid child voxels.
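Both numbers above are easy to check. A small sketch of the arithmetic, with function names that are mine and a texel size that assumes uncompressed 8-bit RGB:

```cpp
#include <cassert>
#include <cstdint>

// Bytes for an n x n x n voxel grid storing one RGB8 sample per face direction.
std::uint64_t DirectionalGridBytes(std::uint64_t n) {
    assert(n > 0);
    const std::uint64_t directions = 6;     // -X, +X, -Y, +Y, -Z, +Z
    const std::uint64_t bytesPerTexel = 3;  // assumed uncompressed 8-bit RGB
    return n * n * n * directions * bytesPerTexel;
}

// Alpha stored in a downsampled voxel: the fraction of its eight children that are solid.
float Coverage(int solidChildren) {
    assert(solidChildren >= 0 && solidChildren <= 8);
    return solidChildren / 8.0f;
}

// DirectionalGridBytes(256) is 288 * 1024 * 1024 bytes, and Coverage(6) is 0.75.
```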
I do think voxelization is best performed on the CPU due to flexibility and the ability to cache static objects.
Direct lighting, in this case, would be calculated from shadowmaps. So I have to implement the clustered forward renderer before going forward with this.
- 0 comments
- 2,259 views
Voxel Cone Tracing Part 4 - Direct Lighting

By Josh, June 14, 2018

Now that we can voxelize models, enter them into a scene voxel tree structure, and perform raycasts we can finally start calculating direct lighting. I implemented support for directional and point lights, and I will come back and add spotlights later. Here we see a shadow cast from a single directional light:

And here are two point lights, one red and one green. Notice the distance falloff creates a color gradient across the floor:

The idea here is to first calculate direct lighting using raycasts between the light position and each voxel:

Then once you have the direct lighting, you can calculate approximate global illumination by gathering a cone of samples for each voxel, which illuminates voxels not directly visible to the light source:

And if we repeat this process we can simulate a second bounce, which really fills in all the hidden surfaces:

When we convert model geometry to voxels, one of the important pieces of information we lose is the normals. Without normals it is difficult to calculate damping for the direct illumination calculation. It is easy to check surrounding voxels and determine that a voxel is embedded in a floor or something, but what do we do in the situation below?

The thin wall of three voxels is illuminated, which will leak light into the enclosed room. This is not good:

My solution is to calculate and store lighting for each face of each voxel.
    Vec3 normal[6] = { Vec3(-1, 0, 0), Vec3(1, 0, 0), Vec3(0, -1, 0), Vec3(0, 1, 0), Vec3(0, 0, -1), Vec3(0, 0, 1) };
    for (int i = 0; i < 6; ++i)
    {
        float damping = max(0.0f, normal[i].Dot(lightdir)); //normal damping
        if (!isdirlight) damping *= 1.0f - min(p0.DistanceToPoint(lightpos) / light->range[1], 1.0f); //distance damping
        voxel->directlighting[i] += light->color[0] * damping;
    }

This gives us lighting that looks more like the diagram below:

When light samples are read, the appropriate face will be chosen and read from. In the final scene lighting on the GPU, I expect to be able to use the triangle normal to determine how much influence each sample should have. I think it will look something like this in the shader:
    vec4 lighting = vec4(0.0f);
    lighting += max(0.0f, dot(trinormal, vec3(-1.0f, 0.0f, 0.0f))) * texture(gimap, texcoord + vec2(0.0 / texwidth, 0.0));
    lighting += max(0.0f, dot(trinormal, vec3(1.0f, 0.0f, 0.0f))) * texture(gimap, texcoord + vec2(1.0 / texwidth, 0.0));
    lighting += max(0.0f, dot(trinormal, vec3(0.0f, -1.0f, 0.0f))) * texture(gimap, texcoord + vec2(2.0 / texwidth, 0.0));
    lighting += max(0.0f, dot(trinormal, vec3(0.0f, 1.0f, 0.0f))) * texture(gimap, texcoord + vec2(3.0 / texwidth, 0.0));
    lighting += max(0.0f, dot(trinormal, vec3(0.0f, 0.0f, -1.0f))) * texture(gimap, texcoord + vec2(4.0 / texwidth, 0.0));
    lighting += max(0.0f, dot(trinormal, vec3(0.0f, 0.0f, 1.0f))) * texture(gimap, texcoord + vec2(5.0 / texwidth, 0.0));

This means that to store a 256 x 256 x 256 grid of voxels we actually need a 3D RGB texture with dimensions of 256 x 256 x 1536. This is 288 megabytes. However, with DXT1 compression I estimate that number will drop to about 64 megabytes, meaning we could have eight voxel maps cascading out around the player and still only use about 512 megabytes of video memory. This is where those new 16-core CPUs will really come in handy!
I added the lighting calculation for the normal Vec3(0,1,0) into the visual representation of our voxels and lowered the resolution. Although this is still just direct lighting it is starting to look interesting:

The last step is to downsample the direct lighting to create what is basically a mipmap. We do this by taking the average values of each voxel node's children:
    void VoxelTree::BuildMipmaps()
    {
        if (level == 0) return;
        int contribs[6] = { 0 };
        for (int i = 0; i < 6; ++i)
        {
            directlighting[i] = Vec4(0);
        }
        for (int ix = 0; ix < 2; ++ix)
        {
            for (int iy = 0; iy < 2; ++iy)
            {
                for (int iz = 0; iz < 2; ++iz)
                {
                    if (kids[ix][iy][iz] != nullptr)
                    {
                        kids[ix][iy][iz]->BuildMipmaps();
                        for (int n = 0; n < 6; ++n)
                        {
                            directlighting[n] += kids[ix][iy][iz]->directlighting[n];
                            contribs[n]++;
                        }
                    }
                }
            }
        }
        for (int i = 0; i < 6; ++i)
        {
            if (contribs[i] > 0) directlighting[i] /= float(contribs[i]);
        }
    }

If we start with direct lighting that looks like the image below:

When we downsample it one level, the result will look something like this (not exactly, but you get the idea):

Next we will begin experimenting with light bounces and global illumination using a technique called cone tracing.
- 1 comment
- 3,476 views
Voxel Cone Tracing Part 2 - Sparse Octree

By Josh, June 12, 2018

At this point I have successfully created a sparse octree class and can insert voxelized meshes into it. An octree is a way of subdividing space into eight blocks at each level of the tree:

A sparse octree doesn't create the subnodes until they are used. For voxel data, this can save a lot of memory.
It was difficult to get the rounding and all the math completely perfect (and it has to be completely perfect!) but now I have a nice voxel tree that can follow the camera around and is aligned correctly to the world axis and units. The code that inserts a voxel is pretty interesting: A voxel tree is created with a number of levels, and the size of the structure is equal to pow(2,levels). For example, an octree with 8 levels creates a 3D grid of 256x256x256 voxels. Individual voxels are then inserted to the top-level tree node, which recursively calls the SetSolid() function until the last level is reached. A voxel is marked as "solid" simply by having a voxel node at the last level (0). (GetChild() has the effect of finding the specified child and creating it if it doesn't exist yet.)
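To make the level arithmetic concrete, here is a small sketch that lists which child is chosen at each level while descending to a given voxel. The function and struct names are hypothetical. It uses bit shifts instead of pow(), and it assumes the root sits at level `levels`, so the grid is (1 << levels) voxels wide:

```cpp
#include <cassert>
#include <vector>

struct ChildIndex { int cx, cy, cz; };

// Lists the child picked at each level on the way down to voxel (x, y, z).
std::vector<ChildIndex> DescentPath(int levels, int x, int y, int z) {
    assert(levels > 0 && levels < 31);
    const int size = 1 << levels;  // 8 levels gives a 256 x 256 x 256 grid
    assert(x >= 0 && y >= 0 && z >= 0 && x < size && y < size && z < size);
    std::vector<ChildIndex> path;
    for (int level = levels; level > 0; --level) {
        const int flag = 1 << (level - 1);  // the bit that splits this node in half
        path.push_back({ (x & flag) ? 1 : 0, (y & flag) ? 1 : 0, (z & flag) ? 1 : 0 });
    }
    return path;
}
```

Each coordinate bit, read from most significant to least, picks the low or high half of the node on that axis, which is why the path always has exactly `levels` steps.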
A bitwise flag is used to test which subnode should be called at this level. I didn't really work out the math, I just intuitively went with this solution and it worked as I expected:
    void VoxelTree::SetSolid(const int x, const int y, const int z, const bool solid)
    {
        int flag = pow(2, level);
        if (x < 0 or y < 0 or z < 0) return;
        if (x >= flag * 2 or y >= flag * 2 or z >= flag * 2) return;
        flag = pow(2, level - 1);
        int cx = 0;
        int cy = 0;
        int cz = 0;
        if ((flag & x) != 0) cx = 1;
        if ((flag & y) != 0) cy = 1;
        if ((flag & z) != 0) cz = 1;
        if (solid)
        {
            if (level > 0)
            {
                GetChild(cx, cy, cz)->SetSolid(x & ~flag, y & ~flag, z & ~flag, true);
            }
        }
        else
        {
            if (level > 0)
            {
                if (kids[cx][cy][cz] != nullptr)
                {
                    kids[cx][cy][cz]->SetSolid(x & ~flag, y & ~flag, z & ~flag, false);
                }
            }
            else
            {
                //Remove self
                auto parent = this->parent.lock();
                Assert(parent->kids[position.x][position.y][position.z] == Self());
                parent->kids[position.x][position.y][position.z] = nullptr;
            }
        }
    }

The voxel tree is built by adding all scene entities into the tree. From there it was easy to implement a simple raycast to see if anything was above each voxel, and color it black if another voxel is hit:

And here is the same program using a higher resolution voxel tree. You can see it's not much of a stretch to implement ambient occlusion from here:

At a voxel size of 0.01 meters (the first picture) the voxelization step took 19 milliseconds, so it looks like we're doing well on speed. I suspect the rest of this project will be more art than science. Stay tuned!
- 0 comments
- 2,267 views
Voxel Cone Tracing Part 3 - Raycasting

By Josh, June 13, 2018

I added a raycast function to the voxel tree class and now I can perform raycasts between any two positions. This is perfect for calculating direct lighting. Shadows are calculated by performing a raycast between the voxel position and the light position, as shown in the screenshot below. Fortunately the algorithm seems to work great and there are no gaps or cracks in the shadow:

Here is the same scene using a voxel size of 10 centimeters:

If we move the light a little lower, you can see a shadow appearing near two edges of the floor:

Why is this happening? Well, the problem is that at those angles, the raycast is hitting the neighboring voxel on the floor next to the voxel we are testing:

You might think that if we just move one end of the ray up to the top of the voxel it will work fine, and you'd be right, in this situation.

But with slightly different geometry, we have a new problem.

So how do we solve this? At any given time, a voxel can have up to three faces that face the light (but it might have as few as one). In the image below I have highlighted the two voxel faces on the right-most voxel that face the light:

If we check the neighboring voxels we can see that the voxel to the left is occupied, and therefore the left face does not make a good position to test from:

But the top voxel is clear, so we will test from there:

If we apply the same logic to the other geometry configuration I showed, we also get a correct result. Of course, if both neighboring voxels were solid then we would not need to perform a raycast at all because we know the light would be completely blocked at this position.
The code to do this just checks which side of a voxel the light position is on. As it is written now, up to three raycasts may be performed per voxel:
    if (lightpos.x < voxel->bounds.min.x)
    {
        if (GetSolid(ix - 1, iy, iz) == false)
        {
            result = IntersectsRay(p0 - Vec3(voxelsize * 0.5f, 0.0f, 0.0f), lightpos);
        }
    }
    if (lightpos.x > voxel->bounds.max.x and result == false)
    {
        if (GetSolid(ix + 1, iy, iz) == false)
        {
            result = IntersectsRay(p0 + Vec3(voxelsize * 0.5f, 0.0f, 0.0f), lightpos);
        }
    }
    if (lightpos.y < voxel->bounds.min.y and result == false)
    {
        if (GetSolid(ix, iy - 1, iz) == false)
        {
            result = IntersectsRay(p0 - Vec3(0.0f, voxelsize * 0.5f, 0.0f), lightpos);
        }
    }
    if (lightpos.y > voxel->bounds.max.y and result == false)
    {
        if (GetSolid(ix, iy + 1, iz) == false)
        {
            result = IntersectsRay(p0 + Vec3(0.0f, voxelsize * 0.5f, 0.0f), lightpos);
        }
    }
    if (lightpos.z < voxel->bounds.min.z and result == false)
    {
        if (GetSolid(ix, iy, iz - 1) == false)
        {
            result = IntersectsRay(p0 - Vec3(0.0f, 0.0f, voxelsize * 0.5f), lightpos);
        }
    }
    if (lightpos.z > voxel->bounds.max.z and result == false)
    {
        if (GetSolid(ix, iy, iz + 1) == false)
        {
            result = IntersectsRay(p0 + Vec3(0.0f, 0.0f, voxelsize * 0.5f), lightpos);
        }
    }

With this correction the artifact disappears:

It even works correctly at a lower resolution:

Now our voxel raycast algorithm is complete. The next step will be to calculate direct lighting on the voxelized scene using the lights that are present.
- 1 comment
- 3,642 views
Lua binding in Leadwerks 5

By Josh, March 27, 2018

The Leadwerks 5 API uses C++11 smart pointers for all complex objects the user interacts with. This design replaces the manual reference counting in Leadwerks 4 so that there is no Release() or AddRef() method anymore. To delete an object you just set all variables that reference that object to nullptr:
    auto model = CreateBox();
    model = nullptr; //poof!

In Lua this works the same way, with some caveats:
    local window = CreateWindow()
    local context = CreateContext(window)
    local world = CreateWorld()
    local camera = CreateCamera(world)
    camera:SetPosition(0,0,-5)
    local model = CreateBox()

    while true do
        if window:KeyHit(KEY_SPACE) then
            model = nil
        end
        world:Render()
    end

In the above example you would expect the box to disappear immediately, right? But it doesn't actually work that way. Lua uses garbage collection, and unless you are constantly calling the garbage collector each frame the model will not be immediately collected. One way to fix this is to manually call the garbage collector immediately after setting a variable to nil:
    if window:KeyHit(KEY_SPACE) then
        model = nil
        collectgarbage()
    end

However, this is not something I recommend doing. Instead, a change in the way we think about these things is needed. If we hide an entity and then set our variable to nil we can just defer the garbage collection until enough memory is accrued to trigger it:
if window:KeyHit(KEY_SPACE) then model:Hide()-- out of sight, out of mind model = nil end I am presently investigating the sol2 library for exposing the C++ API to Lua. Exposing a new class to Lua is pretty straightforward:
lua.new_usertype<World>("World", "Render", &World::Render, "Update", &World::Update); lua.set_function("CreateWorld",CreateWorld); However, there are some issues like downcasting shared pointers. Currently, this code will not work with sol2:
local a = CreateBox() local b = CreateBox() a:SetParent(b)-- Entity:SetParent() expects an Entity, not a Model, even though the Model class is derived from Entity There is also no support for default argument values like the last argument has in this function:
Entity::SetPosition(const float x,const float y,const float z,const bool global=false) This can be accomplished with overloads, but it would require A LOT of extra function definitions to mimic all the default arguments we use in Leadwerks.
I am talking to the developer now about these issues and we'll see what happens.
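To make the overload workaround concrete, here is a minimal C++ sketch that is independent of sol2. The `Entity` struct below is a hypothetical stand-in, not the real Leadwerks class, and the `lastCallWasGlobal` member exists only so the example can be checked. Each default argument becomes one extra overload that forwards to the full version, which is why the number of definitions grows so quickly.

```cpp
#include <cassert>

// Hypothetical stand-in for an engine class; not the real Leadwerks Entity.
struct Entity {
    float x = 0.0f, y = 0.0f, z = 0.0f;
    bool lastCallWasGlobal = false; // illustration only

    // Full version: every argument is explicit.
    void SetPosition(float px, float py, float pz, bool global) {
        x = px; y = py; z = pz;
        lastCallWasGlobal = global;
    }

    // Extra overload that mimics the default argument `global = false`.
    void SetPosition(float px, float py, float pz) {
        SetPosition(px, py, pz, false);
    }
};

// Usage: both call styles end up in the same four-argument function.
inline void OverloadExample() {
    Entity e;
    e.SetPosition(1.0f, 2.0f, 3.0f);       // behaves like global = false
    assert(!e.lastCallWasGlobal);
    e.SetPosition(1.0f, 2.0f, 3.0f, true); // explicit value
    assert(e.lastCallWasGlobal);
}
```

A binding layer could then register both overloads under one script-side name, at the cost of one additional definition per defaulted parameter.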
Three Types of Optimization

By Josh, May 15, 2018

In designing the new engine, I have found that there are three distinct types of optimization.
Streamlining
This is refinement. You make small changes and try to gain a small amount of performance. Typically, this is done as a last step before releasing code. The process can be ongoing, but suffers from diminishing returns after a while. When you eliminate unnecessary math based on guaranteed assumptions you are streamlining code. For example, a 4x4 matrix multiplication can skip the calculations to fill the right-most column if the matrices are guaranteed to be orthogonal (non-sheared).
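As a rough illustration of this kind of streamlining, the sketch below compares a general 4x4 product with one that skips the right-most column. It assumes row-major storage and that both inputs have (0, 0, 0, 1) as their right-most column, which is the usual case for matrices built only from rotation, scale, and translation. The `Mat4` alias and both function names are made up for the example and are not the engine's types.

```cpp
#include <array>
#include <cassert>

// Row-major 4x4 matrix: m[row][column]. Hypothetical type for illustration.
using Mat4 = std::array<std::array<float, 4>, 4>;

// General 4x4 product: computes all 16 elements.
Mat4 MultiplyFull(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

// Streamlined product, valid only when the right-most column of both
// inputs is (0, 0, 0, 1). That column of the result is then known in
// advance, so it is written directly instead of being computed.
Mat4 MultiplyAffine(const Mat4& a, const Mat4& b) {
    assert(a[0][3] == 0.0f && a[1][3] == 0.0f && a[2][3] == 0.0f && a[3][3] == 1.0f);
    assert(b[0][3] == 0.0f && b[1][3] == 0.0f && b[2][3] == 0.0f && b[3][3] == 1.0f);
    Mat4 r{}; // zero-initialized, so r[0..2][3] are already 0
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 3; ++j) // columns 0..2 only
            for (int k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    r[3][3] = 1.0f;
    return r;
}
```

The saving is small per call, which is typical of streamlining: it only pays off because the operation runs a very large number of times.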
Quality Degradation
This is when you downgrade the quality of your results within a certain tolerable level where it won't be noticed much. An example of this is using a low-resolution copy of a model when it is far away from the camera. Quality degradation can be pretty arbitrary, and can mask your true performance, so it's best to keep an option to disable this.
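A distance-based level-of-detail pick can be sketched in a few lines. The function name, the threshold list, and the `enableLOD` switch are all assumptions made for this example; the switch reflects the advice above to keep a way of turning the degradation off.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pick a level of detail from the camera distance.
// lodDistances must be ascending; lodDistances[i] is the distance at which
// level i+1 replaces level i. Level 0 is the full-detail model.
std::size_t SelectLOD(float distance,
                      const std::vector<float>& lodDistances,
                      bool enableLOD = true) {
    assert(distance >= 0.0f);
    if (!enableLOD) return 0; // degradation disabled: always full detail
    std::size_t level = 0;
    for (float threshold : lodDistances) {
        if (distance < threshold) break;
        ++level;
    }
    return level;
}
```

With thresholds of 10 and 50 units, an object 20 units away would use level 1, and the same object would use level 0 when `enableLOD` is false.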
Architectural
By designing algorithms in a way that makes maximum use of hardware and produces the best results, we can greatly increase performance. Architectural optimization produces groundbreaking changes that can be 10 or 100 times faster than the old architecture. An example of this is GPU hardware, which produces a massive performance increase over software rendering. We're seeing a lot of these types of improvements in Leadwerks Game Engine 5 because the entire system is being designed to make maximum use of modern graphics hardware.
Threaded Animation

By Josh, May 18, 2018

The animation update routine has been moved into its own thread now where it runs in the background as you perform your game logic. We can see in the screenshot below that animation updates for 1025 characters take about 20 milliseconds on average. (Intel graphics below, otherwise it would be 1000 FPS lol.)

In Leadwerks 4 this would automatically mean that your max framerate would be 50 FPS, assuming nothing else in the game loop took any time at all. Because of the asynchronous threaded design of Leadwerks 5, this otherwise expensive operation has no impact whatsoever on framerate! The GPU is being utilized by a good amount (96%) while the CPU usage is actually quite low at 10%, even though there are four threads running:

Although the performance here is within an acceptable limit for a game running with a 30 hz loop (it's under 33 milliseconds) it would be too slow for a 60 hz game. (Note that game frequency and framerate are two different things.) In order to get the animation time under the 16.667 milliseconds that a 60 hz game allows, we can split the animation job up onto several different threads. This job is very easily parallelized, so the animation time is just the single-threaded time divided by the number of threads. We can't make our game run faster by adding more threads unnecessarily; all we have to do is make sure the job is completed within the allocated amount of time so the engine keeps running at the correct speed.
When I split the task into two threads, the average update time is about 10 milliseconds, and CPU usage only goes up 2%. Splitting the task into 16 threads brings the average time down to 1-2 milliseconds, and CPU usage is still only at 15%. What does this mean? Well, it seems each thread is spending a lot of time paused (intentionally) and we haven't begun to scratch the surface of CPU utilization. So I will do my best to keep the CPU clear for all your game code and at the same time Leadwerks Game Engine 5 will be using A LOT of threads in the background for animation, physics, navigation, and rendering.
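Splitting an easily parallelized job like this across threads can be sketched with the standard library alone. `ParallelFor` is a hypothetical helper written for this example, not an engine function, and it creates fresh threads on every call for simplicity, whereas a real engine would more likely keep a pool of long-lived workers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Run job(i) for every i in [0, count), divided into contiguous chunks
// across up to threadCount worker threads. Blocks until all are done.
void ParallelFor(std::size_t count, unsigned threadCount,
                 const std::function<void(std::size_t)>& job) {
    assert(threadCount > 0);
    // Chunk size rounded up so every index is covered.
    const std::size_t chunk = (count + threadCount - 1) / threadCount;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t begin = std::min(count, static_cast<std::size_t>(t) * chunk);
        const std::size_t end = std::min(count, begin + chunk);
        if (begin == end) break; // nothing left to hand out
        workers.emplace_back([begin, end, &job] {
            for (std::size_t i = begin; i < end; ++i) job(i);
        });
    }
    for (auto& w : workers) w.join();
}
```

Each index is visited by exactly one thread, so a job that only writes to its own element (for example, one skeleton per index) needs no locking.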
The performance we're seeing with this system is absolutely incredible, beyond anything I imagined it would be when I started building an engine specifically designed for modern PC hardware. We're seeing results that are 10 times faster than anything else out there. In fact, here are 10,000 animated characters running at 200+ FPS on a GeForce 1070 with no LOD or any special optimization tricks. (Thanks to @AggrorJorn for the screen capture.)

It remains to be seen how performance is when lights, physics, and AI are added, but so far it looks extremely good. In fact, for anyone making an RTS game with lots of characters, Leadwerks 5 may be the only reasonable choice due to the insane performance it gets!
Voxel Cone Tracing

By Josh, May 22, 2018

I've begun working on an implementation of voxel cone tracing for global illumination. This technique could potentially offer a way to perform real-time indirect lighting on the entire scene, as well as real-time reflections that don't depend on having the reflected surface onscreen, as screen-space reflection does.
I plan to perform the GI calculations all on a background CPU thread, compress the resulting textures using DXTC, and upload them to the GPU as they are completed. This means the cost of GI should be quite low, although there is going to be some latency in the time it takes for the indirect lighting to match changes to the scene. We might continue to use SSR for detailed reflections and only use GI for semi-static light bounces, or it might be fast enough for moving real-time reflections. The GPU-based implementations I have seen of this technique are technically impressive but suffer from terrible performance, and we want something fast enough to run in VR.
The first step is to be able to voxelize models. The result of the voxelization operation is a bunch of points. These can be fed into a geometry shader that generates a box around each one:
void main()
{
    vec4 points[8];
    points[0] = projectioncameramatrix[0] * (geometry_position[0] + vec4(-0.5f * voxelsize.x, -0.5f * voxelsize.y, -0.5f * voxelsize.z, 0.0f));
    points[1] = projectioncameramatrix[0] * (geometry_position[0] + vec4(0.5f * voxelsize.x, -0.5f * voxelsize.y, -0.5f * voxelsize.z, 0.0f));
    points[2] = projectioncameramatrix[0] * (geometry_position[0] + vec4(0.5f * voxelsize.x, 0.5f * voxelsize.y, -0.5f * voxelsize.z, 0.0f));
    points[3] = projectioncameramatrix[0] * (geometry_position[0] + vec4(-0.5f * voxelsize.x, 0.5f * voxelsize.y, -0.5f * voxelsize.z, 0.0f));
    points[4] = projectioncameramatrix[0] * (geometry_position[0] + vec4(-0.5f * voxelsize.x, -0.5f * voxelsize.y, 0.5f * voxelsize.z, 0.0f));
    points[5] = projectioncameramatrix[0] * (geometry_position[0] + vec4(0.5f * voxelsize.x, -0.5f * voxelsize.y, 0.5f * voxelsize.z, 0.0f));
    points[6] = projectioncameramatrix[0] * (geometry_position[0] + vec4(0.5f * voxelsize.x, 0.5f * voxelsize.y, 0.5f * voxelsize.z, 0.0f));
    points[7] = projectioncameramatrix[0] * (geometry_position[0] + vec4(-0.5f * voxelsize.x, 0.5f * voxelsize.y, 0.5f * voxelsize.z, 0.0f));

    vec3 normals[6];
    normals[0] = (vec3(-1,0,0));
    normals[1] = (vec3(1,0,0));
    normals[2] = (vec3(0,-1,0));
    normals[3] = (vec3(0,1,0));
    normals[4] = (vec3(0,0,-1));
    normals[5] = (vec3(0,0,1));

    //Left
    geometry_normal = normals[0];
    gl_Position = points[0];
    EmitVertex();
    gl_Position = points[4];
    EmitVertex();
    gl_Position = points[3];
    EmitVertex();
    gl_Position = points[7];
    EmitVertex();
    EndPrimitive();

    //Right
    geometry_normal = normals[1];
    gl_Position = points[1];
    EmitVertex();
    gl_Position = points[2];
    EmitVertex();
    ...
}
Here's a goblin whose polygons have been turned into Lego blocks.

Now the thing most folks nowadays don't realize is that if you can voxelize a goblin, well then you can voxelize darn near anything.

Global illumination will then be calculated on the voxels and fed to the GPU as a 3D texture. It's pretty complicated stuff but I am very excited to be working on this right now.

If this works, then I think environment probes are going to completely go away forever. SSR might continue to be used as a low-latency high-resolution first choice when those pixels are available onscreen. We will see.
It is also interesting that the whole second-pass reflective water technique will probably go away as well, since this technique should be able to handle water reflections just like any other material.
Animation Tweening

By Josh, May 11, 2018

Leadwerks 5 uses a different engine architecture with a game loop that runs at either 30 (default) or 60 updates per second. Frames are passed to the rendering thread, which runs at an independent framerate that can be set to 60, 90, or unlimited. This is great for performance but there are some challenges in timing. In order to smooth out the motion of the frames, the results of the last two frames received are interpolated between. Animation is a big challenge for this. There could potentially be many, many bones, and interpolating entire skeletons could slow down the renderer.
In the screen capture below, I have slowed the game update loop down to 5 updates per second to exaggerate the problem that occurs when no interpolation is used:

My solution was to upload the 4x4 matrices of the previous two frames and perform the tweening inside the vertex shader:
//Vertex Skinning
mat4 animmatrix[8];
for (int n=0; n<4; ++n)
{
    if (vertex_boneweights[n] > 0.0f)
    {
        animmatrix[n] = GetAnimationMatrix(vertex_boneindices[n],0);
        animmatrix[n + 4] = GetAnimationMatrix(vertex_boneindices[n],1);
    }
}
vec4 vertexpos = vec4(vertex_position,1.0f);
vec4 modelvertexposition = vec4(0.0f); //must start at zero because the loop below accumulates into it
for (int n=0; n<4; ++n)
{
    if (vertex_boneweights[n] > 0.0f)
    {
        modelvertexposition += animmatrix[n] * vertexpos * vertex_boneweights[n] * rendertweening + animmatrix[n+4] * vertexpos * vertex_boneweights[n] * (1.0f - rendertweening);
    }
}
modelvertexposition = entitymatrix * modelvertexposition;
Bone matrices are retrieved from an RGBA floating point texture with this function:
mat4 GetAnimationMatrix(const in int index, const in int frame)
{
    ivec2 coord = ivec2(index * 4, gl_InstanceID * 2 + frame);
    mat4 bonematrix;
    bonematrix[0] = texelFetch(texture14, coord, 0);
    bonematrix[1] = texelFetch(texture14, coord + ivec2(1,0), 0);
    //bonematrix[2] = texelFetch(texture14, coord + ivec2(2,0), 0);
    bonematrix[2].xyz = cross(bonematrix[0].xyz,bonematrix[1].xyz); //removes one texture lookup!
    bonematrix[2].w = 0.0f;
    bonematrix[3] = texelFetch(texture14, coord + ivec2(3,0), 0);
    return bonematrix;
}
This slows down the shader because up to 24 texel fetches might be performed per vertex (three per matrix, for four bones and two frames), but it saves the CPU from having to precompute interpolated matrices for each bone. In VR, I think this cost savings is critical. Doing a linear interpolation between vertex positions is not exactly correct, but it's a lot faster than slerping a lot of quaternions and converting them to matrices, and the results are so close you can't tell any difference.
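The blend performed in the shader boils down to a weighted sum of two already-transformed positions. The C++ sketch below isolates that step; `Vec3` and `TweenPosition` are names invented for the example, and the weighting follows the shader above, where frame 0 is scaled by `rendertweening` and frame 1 by `1 - rendertweening`.

```cpp
#include <cassert>

// Minimal vector type for the example; not the engine's Vec3.
struct Vec3 { float x, y, z; };

// Linearly blend two positions that were each transformed by a different
// frame's bone matrix. tween = 1 returns `a`, tween = 0 returns `b`.
Vec3 TweenPosition(const Vec3& a, const Vec3& b, float tween) {
    assert(tween >= 0.0f && tween <= 1.0f);
    const float inv = 1.0f - tween;
    return { a.x * tween + b.x * inv,
             a.y * tween + b.y * inv,
             a.z * tween + b.z * inv };
}
```

This also shows where the small error comes from. Blending (1, 0, 0) and (0, 1, 0) halfway gives (0.5, 0.5, 0), whose length is about 0.707 rather than 1, so a point rotating about the origin cuts the corner slightly. Between two consecutive game frames the rotation is normally much smaller than 90 degrees, so the deviation is correspondingly smaller.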
There's actually a similar concept in 2D animation I remember reading about when I was a kid. The book is called The Illusion of Life: Disney Animation and it's a really interesting read with lots of nice illustrations.

Here is the same scene with interpolation enabled. It's recorded at 15 FPS so the screen capture still looks a little jittery, but you get the idea:

Adding interpolation brought this scene down to 130 FPS from 200 on an Intel chip, simply because of the increased number of texel fetches in the vertex shader. Each character consists of about 4000 vertices. I expect on a discrete card this would be running at pretty much the max framerate (1000 or so).

With this in place, I can now confirm that my idea for the fast rendering architecture in Leadwerks Game Engine 5 definitely works.
The next step will be to calculate animations on a separate thread (or maybe two). My test scene here is using a single skeleton shared by all characters, but putting the animation on its own thread will allow many more characters to all be animated uniquely.
First Animation Metrics

By Josh, May 10, 2018

I got skinned animation working in the new renderer, after a few failed attempts that looked like something from John Carpenter's The Thing. I set up a timer and updated a single animation on a model 10,000 times. Animation consists of two phases. First, all animations are performed to calculate the local position and quaternion rotation. Second, 4x4 matrices are calculated for the entire hierarchy in global space and copied into an array of floats. To test this, I placed this code inside the main loop:
float frame = Millisecs() / 10.0f;
auto tm = Millisecs();
for (int n = 0; n < 10000; ++n)
{
    monster->skeleton->SetAnimationFrame(frame, 1, 1, true);
    monster->UpdateSkinning();
}
int elapsed = Millisecs() - tm;
Print(elapsed);
The result in release mode was around 60 milliseconds. When I tested each command alone, I found that UpdateSkinning() took around 30-35 milliseconds while SetAnimationFrame() hovered around 20 milliseconds.
When I cut the number of iterations in half, the result was right around 30 milliseconds, which is within our window of time (33 ms) for games that run at 30 hz. If your game uses a 60 hz loop then you can cut that number in half. The model I am using also only has 24 bones, but models with up to 256 bones are supported (with a pretty linear performance cost).
Now this is with a single call to SetAnimationFrame. If the animation manager is in use there could potentially be many more calculations performed as animations are smoothly blended.
Splitting the animations up into multiple threads could be done easily, but most computers only have four CPUs, so I don't see this being useful on more than 2-3 threads. Let's say we dedicate two threads to animation. That means right now our theoretical limit is about 10,000 zombies onscreen at once. I would like to go higher, but I think this is probably our realistic limit for CPU-performed animations. The alternative would be to upload the animation sequences themselves to the GPU and perform all animation entirely on the GPU, but then we would lose all feedback on the CPU side like the current bone orientations. Perhaps a solution would be to have both a CPU and GPU animation system that produces identical results, and the CPU side stuff would only be called when needed, but that makes things pretty complicated and I am not sure I want to go down that road.
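The estimate above is simple budget arithmetic, sketched below. The function name is made up for the example, and it assumes the per-character cost stays constant and that the work scales perfectly across threads, neither of which is guaranteed in practice.

```cpp
#include <cassert>
#include <cmath>

// Estimate how many character updates fit into one game tick.
// msPerCharacter: measured cost of one update in milliseconds.
// tickRateHz:     game loop frequency (30 or 60 in the text above).
// threads:        number of threads dedicated to animation.
long MaxCharactersPerTick(double msPerCharacter, double tickRateHz, int threads) {
    assert(msPerCharacter > 0.0 && tickRateHz > 0.0 && threads > 0);
    const double budgetMs = 1000.0 / tickRateHz; // time available per tick
    const long perThread = static_cast<long>(std::floor(budgetMs / msPerCharacter));
    return perThread * threads;
}
```

Plugging in the measurement from the test (60 milliseconds for 10,000 updates, or 0.006 ms each) with a 30 hz loop and two threads gives a little over 11,000, which agrees with the rough figure of 10,000 quoted above.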
In reality, the final results will probably be quite a lot less than this when all functionality is added, but from this data we can reasonably extrapolate that Leadwerks 5 will support thousands of animated characters onscreen. According to the developers of the Dead Rising series, a few thousand is the practical limit of how many characters you would ever want onscreen, so this is encouraging. Of course, there is no limit on the number of offscreen characters, since animation is only performed for characters that appear onscreen.
Animation in Leadwerks 5

By Josh, May 6, 2018

The design of Leadwerks 4 was meant to be flexible and easy to use. In Leadwerks 5, our foremost design goals are speed and scalability. In practical terms that means that some options are going to go away in order to give you bigger games that run faster.
I'm working out the new animation system. There are a few different ways to approach this. In situations like this I find it is best to start by deciding the desired outcome and then figuring out how to achieve that. So what do we want?
- Fast performance with as many animated characters as possible. Hopefully, tens of thousands.
- Feedback to the CPU on the orientations of bones for things like parenting a weapon to the character's hand, firing a shot, collision detection with limbs, etc.
- Low memory usage (so we can have lots and lots of characters).
In Leadwerks 4, a bone is a type of entity. This is convenient because the same entity positioning commands work just fine with bones, it's easy to parent a weapon to a limb, and there is a lot of consistency. However, this comes at a cost of potential performance as well as memory consumption. A stripped-down Bone class without all the overhead of the entity system would be more efficient when we hit really large numbers of animated models.
So here's what I am thinking: Bones are a simplified class that do not have all the features of the entity system. The Model class has a "skeleton" member, which is the top-most bone in a hierarchy of bones for that model. You can call animation commands on bones only, and you cannot parent an entity to a bone, since the bone is not an entity. Instead you can attach it by making a copy that is weighted 100% to the bone you specify, and it becomes part of the animated model:
weapon->Attach(model->FindBone("r_hand"));
If you have any hierarchy in that weapon model, like a pivot to indicate where the firing position is, it would be lost, so you will need to retrieve those values in your script and save them before attaching the weapon.
This also means bones won't appear in the map as an editable entity, which I would argue is a good thing, since they clog up the hierarchy with thousands of extra entities.
When you call an animation command, it will be sent to the animation thread the next time the game syncs in the World::Update() command. Animations are then performed on a copy of all the visible skeletons in the scene, and their 4x4 matrices are retrieved during the next call to World::Update(). Animation data is then passed to the rendering thread where it is fed into a float texture the animation shader reads to retrieve the bone orientations for each instance in the batch it is rendering.
This means there is latency in the system and everything is always one frame behind, but your animations will all be performed on a separate thread and thus have basically no cost. In fact with the simplified bone class, it might not even be necessary to use a separate thread, since carrying out the animations is just a lot of quaternion Slerps and matrix multiplications. I'll have to try it and just see what the results are.
The bottlenecks here are going to be the number of animations we can calculate, the passing of data from the game thread to the animation thread and back, and the passing of data from the rendering thread to the GPU. It's hard to predict what we will hit first, but those are the things I have in mind right now.
It would be possible to carry out the animation transforms entirely on the GPU, but that would result in us getting no feedback whatsoever on the orientation of limbs. So that's not really useful for anything but a fake tech demo. I don't know, maybe it's possible to get the results asynchronously with a pixel buffer object.
In addition to animation, having tons of characters also requires changes to the physics and navmesh system, which I am already planning. The end result will be a much more scalable system that always provides fast performance for VR. As we are seeing, the optimizations made for VR are having a beneficial effect on general performance across the board. As explained above, this may sometimes require a little more work on your part to accomplish specific things, but the benefits are well worth it, as we will easily be able to run games with more characters than the video below, in VR, perhaps even on Intel graphics.
First performance demonstration

By Josh, April 20, 2018

I am proud to show off our first performance demonstration which proves that my idea for the Leadwerks 5 renderer works. To test the renderer I created 100,000 instanced boxes. The demo includes both regular and a mock VR mode that simulates single-pass stereoscopic rendering with a geometry shader.
The hardware I tested on is an Intel i7-4770R (for graphics too) which is a few years old.
Now this is not a perfect benchmark for several reasons. There is no frustum culling being performed, the renderer just adds everything into the scene and renders it. I am not showing threaded and non-threaded performance side by side. You may also see all objects disappear for a single frame occasionally, and some triangles may be discarded prematurely in stereoscopic mode.
However, the results are incredible and worth bragging about. In normal mode, with a polygon load of 1.2 million per frame, this little machine with integrated graphics is getting 115 FPS, and in mock VR mode (2.4 million polys) it is hovering right around 90! With 100,000 objects, on integrated graphics!
Alpha subscribers can download the test now.

The secret of this massive performance is an efficient architecture built to unlock the full power of your graphics hardware. Below you can see that GPU utilization is around 95%:
Second Performance Test: nearly 400% faster!

By Josh, April 27, 2018

After observing the behavior of the previous test, I rearranged the threading architecture for even more massive performance gains. This build runs at speeds in excess of 400 FPS with 100,000 entities... on Intel integrated graphics!

I've had more luck with concurrency in design than parallelism. (Images below are taken from here.)
Splitting the octree recursion up into separate threads produced only modest gains. It's difficult to optimize because the sparse octree is unpredictable.

Splitting different parts of the engine up into multiple threads did result in a massive performance boost.

The same test in Leadwerks 4 runs at about 9 FPS, making Leadwerks 5 more than 45 times faster under heavy loads like this.
Alpha subscribers can try the test out here.
Building a Zero-Overhead Renderer

By Josh, April 6, 2018

The Leadwerks 4 renderer was built for maximum flexibility. The Leadwerks 5 renderer is being built first and foremost for great graphics with maximum speed. This is the fundamental difference between the two designs. VR is the main driving force for this direction, but all games will benefit.
Multithreaded Design
Leadwerks 4 does make use of multithreading in some places but it is fairly simplistic. In Leadwerks 5 the entire architecture is based around separate threads, which is challenging but a lot of fun for me to develop. I worked out a way to create a command buffer on the main thread that stores a list of commands for the rendering thread to perform during the next rendering frame. (Thanks for the tip on Lambda functions @Crazycarpet) Each object in the main thread has a simplified object it is associated with that lives in the rendering thread. For example, each Camera has a RenderCamera object that corresponds to it. Here's how changes in the main thread get added to a command buffer to be executed when the rendering thread is ready:
void Camera::SetClearColor(const float r, const float g, const float b, const float a)
{
    clearcolor.x = r;
    clearcolor.y = g;
    clearcolor.z = b;
    clearcolor.w = a;
#ifdef LEADWERKS_5
    //Copy the members into locals first, since a lambda capture list cannot name members through this->
    auto rendercamera = this->rendercamera;
    auto clearcolor = this->clearcolor;
    GameEngine::cullingthreadcommandbuffer.push_back( [rendercamera, clearcolor]() { rendercamera->clearcolor = clearcolor; } );
#endif
}
The World::Render() command is still there for conceptual consistency, but what it really does is add all the accumulated commands onto a stack of command buffers for the rendering thread to evaluate whenever it's ready:
void World::Render(shared_ptr<Buffer> buffer)
{
    //Add render call onto command buffer
    GameEngine::cullingthreadcommandbuffer.push_back(std::bind(&RenderWorld::AddToRenderQueue, this->renderworld));

    //Copy command buffer onto culling command buffer stack
    GameEngine::CullingThreadCommandBufferMutex->Lock();
    GameEngine::cullingthreadcommandbufferstack.push_back(GameEngine::cullingthreadcommandbuffer);
    GameEngine::CullingThreadCommandBufferMutex->Unlock();

    //Clear the command buffer and start over
    GameEngine::cullingthreadcommandbuffer.clear();
}
The rendering thread is running in a loop inside a function that looks something like this:
shared_ptr<SharedObject> GameEngine::CullingThreadEntryPoint(shared_ptr<SharedObject> o)
{
    while (true)
    {
        //Get the number of command stacks that are queued
        CullingThreadCommandBufferMutex->Lock();
        int count = cullingthreadcommandbufferstack.size();
        CullingThreadCommandBufferMutex->Unlock();

        //For each command stack
        for (int i = 0; i < count; ++i)
        {
            //For each command
            for (int n = 0; n < cullingthreadcommandbufferstack[i].size(); ++n)
            {
                //Execute command
                cullingthreadcommandbufferstack[i][n]();
            }
        }

        //Remove executed command stacks, keeping any that were added in the meantime
        CullingThreadCommandBufferMutex->Lock();
        cullingthreadcommandbufferstack.erase(cullingthreadcommandbufferstack.begin(), cullingthreadcommandbufferstack.begin() + count);
        CullingThreadCommandBufferMutex->Unlock();

        //Render queued worlds
        for (auto it = RenderWorld::renderqueue.begin(); it != RenderWorld::renderqueue.end(); ++it)
        {
            (it->first)->Render(nullptr);
        }
    }
    return nullptr;
}
I am trying to design the system for maximum flexibility with the thread speeds so that we can experiment with different frequencies for each stage. This is why the rendering thread goes through and executes all commands in all accumulated command buffers before going on to actually render any queued world. This prevents the rendering thread from rendering an extra frame when another one has already been received (which shouldn't really happen, but we will see).
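The same producer/consumer idea can be written as a small self-contained class using only the standard library. This is a simplified sketch rather than the engine's code: `CommandQueue`, `Submit`, and `ExecutePending` are hypothetical names, and instead of erasing executed entries it swaps the whole pending list out while holding the lock, so the commands themselves run with the lock released.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical, simplified command queue shared by two threads.
class CommandQueue {
public:
    using Command = std::function<void()>;

    // Producer side (game thread): hand over one frame's batch of commands.
    void Submit(std::vector<Command> batch) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(batch));
    }

    // Consumer side (rendering thread): take every queued batch under the
    // lock, then run them in submission order without holding the lock.
    // Returns the number of commands that were executed.
    std::size_t ExecutePending() {
        std::vector<std::vector<Command>> batches;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batches.swap(pending_);
        }
        std::size_t executed = 0;
        for (auto& batch : batches) {
            for (auto& command : batch) {
                assert(command); // an empty std::function would throw when called
                command();
                ++executed;
            }
        }
        return executed;
    }

private:
    std::mutex mutex_;
    std::vector<std::vector<Command>> pending_;
};
```

Because the consumer drains every batch that has accumulated before doing anything else, it never acts on a stale frame when a newer one is already waiting, which is the behavior described above.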
As you can see, the previously expensive World::Render() command now does almost nothing before returning to your game loop. I am also going to experiment with running the game loop and the rendering loop at different speeds. So let's say previously your game was running at 60 FPS and 1/3 of that time was spent rendering the world. This left you with about 11 milliseconds to execute your game code, or things would start to slow down. With the new design your game code could have up to 33 milliseconds to execute without compromising the framerate. That means your code could be three times more complex, and you would not have to worry so much about efficiency, since the rendering thread will keep blazing away at a much faster rate.
The game loop is a lot simpler now, with just two commands needed to update and render the world. This gives you a chance to adjust some objects after physics and before rendering. A basic Leadwerks 5 program is really simple:
#include "Leadwerks.h"

using namespace Leadwerks;

int main(int argc, const char *argv[])
{
    auto window = CreateWindow("MyGame");
    auto context = CreateContext(window);
    auto world = CreateWorld();
    auto camera = CreateCamera(world);

    while (true)
    {
        if (window->KeyHit(KEY_ESCAPE) or window->Closed()) return 0;
        world->Update();
        world->Render(context);
    }
}
This may cause problems if you try to do something fancy like render a world to a buffer and then use that buffer as a texture in another world. We might lose some flexibility there, and if we do I will prioritize speed over having lots of options.
Clustered Forward Rendering
Leadwerks has used a deferred renderer since version 2.1. Version 2.0 was a forward renderer with shadowmaps, and it didn't work very well. At the time, GPUs were not very good at branching logic. If you had an if / else statement, the GPU would perform BOTH branches (including expensive texture lookups) and take the result of the "true" one. To get around this problem, the engine would generate a new version of a shader each time a new combination of lights was onscreen, causing periodic microfreezes when a new shader was loaded. In 2.1 we switched to a deferred renderer which eliminated these problems. Due to increasingly smart graphics hardware and more flexible modern APIs a new technique called clustered forward rendering is now possible, offering flexibility similar to a deferred renderer, with the increased speed of a forward renderer. Here is a nice article that describes the technique:
http://www.adriancourreges.com/blog/2016/09/09/doom-2016-graphics-study/

This approach is also more scalable. Extra renders to the normal buffer and other details can be skipped for better scaling on integrated graphics and slower hardware. I'm not really targeting slow hardware as a priority, but I wouldn't be surprised if it ran extremely fast on integrated graphics when the settings are turned down. Of course, the system requirements will be the same because we need modern API features to do this.
I'm still a little foggy on how custom post-processing effects will be implemented. There will definitely be more standard features built into the renderer. For example, SSR will be mixed with probe reflections and a quality setting (off, static, dynamic) will determine how much processing power is used for reflections. If improved performance and integration come at the cost of reduced flexibility in the post-process shaders, I will choose that option, but so far I don't foresee any problems.
Vulkan Graphics
The new renderer is being developed with OpenGL 4.1 so that I can make a more gradual progression, but I am very interested in moving to Vulkan once I have the OpenGL build worked out. Valve made an agreement with the developers of MoltenVK to release the SDK for free. This code translates Vulkan API calls into Apple's Metal API, so you basically have Vulkan running on Mac (sort of). I previously contacted the MoltenVK team about a special license for Leadwerks that would allow you guys to release your games on Mac without buying a MoltenVK license, but we did not reach any agreement and at the time the whole prospect seemed pretty shaky. With Valve supporting this I feel more confident going in this direction. In fact, due to the design of our engine, it would be possible to outsource the task of a Vulkan renderer without granting any source code access or complicating the internals of the engine one bit.
Lua in Leadwerks 5 is Solved

By Josh, April 3, 2018

When considering the script system in Leadwerks 5, I looked at alternatives including Squirrel, which is used by Valve in many games, but these gave me a deeper appreciation for the simplicity of Lua. There are only a handful of rules you need to learn to use the language, it’s fun to use, yet somehow it does everything you could ever need.
There were three big issues I had to solve. First, the Leadwerks 5 API makes extensive use of smart pointers, which our binding library tolua++ does not support. Second, I wanted better auto completion and a better user experience in the IDE in general. Third, if an external IDE is going to be used it needs to be able to interface with the Leadwerks debugging system.
To support smart pointers, I found a new library called sol2 that does everything we need. @Rick and I discussed the idea at great length and I am happy to say we’ve come up with a design that is simple to use and quite a bit easier than Leadwerks 4.x even. The binding code is nowhere near done but at this point I can see that everything will work.
@AggrorJorn suggested using Visual Studio Code as our official script IDE in Leadwerks 5, and after investigating I think it’s a great idea. The auto completion is quite good and the IDE feels more natural than anything I could come up with using a custom text editor made with Scintilla. In fact, eliminating the built-in script editor in Leadwerks 5 relieves me of a lot of uncertainty and potential issues.
Finally, VS Code has support for custom debuggers. I wrote an example command line debugger for Leadwerks and I will use this to show another programmer how to interface with Leadwerks. (I don’t plan on writing the debugger myself.)
Your feedback and ideas are shaping up to make Leadwerks 5 a huge leap forward over our previous designs. The improved simplicity of the new script system is a big cognitive relief. Having fewer things to worry about makes life better in a subtle but definite way.
There’s something else that consumes a lot of mental attention. Social media and the internet have grown and changed over the years and become more efficient at consuming our attention. (Many features of this site are designed the same way.)
The scary thing is that normal non-technical people seem to be more vulnerable than nerds. We’ll fire up the Witcher and play for an hour, but regular people are checking their phones 24/7 for feedback and validation. It’s much much worse than any accusation we got as kids of being “Nintendo zombies” because we spent an afternoon playing games instead of staring passively at broadcast TV. People who play games generally don’t care about posting photographs of their food or collecting followers.
Somewhere along the line the internet went from being a weird thing on your computer to the collective consciousness of humanity. Reality is online and the physical world around us is just a mirage, one possible instance of a million possible individual experiences. Maybe it was around the time they started using AI to optimize clickbait that things got out of hand.
Although my career and the way I live my life are only possible through the internet, I am old enough to remember life before the web, and in many ways it was better. Things were more focused. Even the early web before clickbait ads and online echo chambers was pretty nice. You could go to a record store and hang out talking to people about new music. Printed paper magazines were a thing.
I already removed the link to our Google+ page in the website footer and no one noticed. I think about deleting our Facebook and Twitter accounts, or at least not linking to them on our site. Why must every website pay homage to these monopolies? What are they doing for me, besides a limited flow of hits that pales in comparison to what my own email list brings in? I have written about this before but now that it is fashionable to criticize social media I might act on it. I don’t know, we’ll see.
Please like, share, and retweet.
- 11 comments
- 3,921 views
Leadwerks Game Engine 5 Alpha Zero Released

By Josh, January 18, 2018

I'm happy to announce the very first alpha release of Leadwerks 5 is now available.
What's New
- String commands now accept a unicode overload. Add "L" in front of a string to create a wide string in C++.
- Now using smart pointers. Simply set a variable to nullptr to delete an object. There is no Release() or AddRef() function.
- Exclusively 64-bit!
- Global states are gone. There is no "current" world or context. Instead, the object you want is passed into any function that uses it.
- We are now using constant integers like WINDOW_TITLEBAR instead of static members like Window::Titlebar.
- Now using global functions where appropriate (CreateWorld(), etc.).
- Renderer is being designed to be asynchronous so Context::Sync() is gone.
- 2D drawing is not implemented at this time.

Here's the syntax for a simple program.
    #include "Leadwerks.h"

    using namespace Leadwerks;

    int main(int argc, const char *argv[])
    {
        auto window = CreateWindow(L"My Game", 0, 0, 1024, 768, WINDOW_TITLEBAR);
        auto context = CreateContext(window);
        auto world = CreateWorld();

        auto camera = CreateCamera(world);
        camera->SetPosition(0, 0, -3);

        auto light = CreateDirectionalLight(world);
        light->Turn(45, 35, 0);

        auto model = CreateBox(world);

        while (true)
        {
            if (window->KeyHit(KEY_ESCAPE) or window->Closed()) return 0;
            if (window->KeyHit(KEY_SPACE)) model = nullptr;
            world->Update();
            world->Render(context);
        }
    }

You can get access to the Leadwerks 5 Alpha with a subscription of $4.99 a month. You will also be able to post in the Leadwerks 5 forum and give your feedback and ideas. At this time, only C++ is supported, and it will only build in debug mode. It is still very early in development, so this is really only intended for enthusiasts who want to play with the very bleeding edge of technology and support the development of Leadwerks 5.
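The smart pointer rule from the What's New notes (set a variable to nullptr to delete an object) can be tried out with the standard library alone. The sketch below is only an illustration and assumes the engine's handles behave like std::shared_ptr. Box, liveBoxes, CreateBoxSketch, and the two helper functions are made-up names for this example, not part of the Leadwerks API.

```cpp
#include <cassert>
#include <memory>

// Number of Box instances currently alive (demo bookkeeping only).
int liveBoxes = 0;

// Hypothetical stand-in for an engine object; not the Leadwerks class.
struct Box {
    Box() { ++liveBoxes; }
    ~Box() { --liveBoxes; }
};

// Hypothetical factory in the style of the global Create* functions.
std::shared_ptr<Box> CreateBoxSketch() {
    return std::make_shared<Box>();
}

// One owner: assigning nullptr destroys the object right away.
int LiveAfterSingleOwnerReset() {
    auto model = CreateBoxSketch();
    assert(liveBoxes == 1);
    model = nullptr;          // last reference gone, destructor runs here
    return liveBoxes;
}

// Two owners: the object survives until the last handle is cleared.
int LiveAfterFirstOfTwoOwnersReset() {
    auto model = CreateBoxSketch();
    auto alias = model;       // second owner of the same object
    model = nullptr;          // object is still referenced by alias
    return liveBoxes;         // alias is released when the function returns
}
```

The second helper is the case worth keeping in mind: if anything else still holds a reference, setting one variable to nullptr only drops that one reference.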
- 7 comments
- 7,837 views
More amazing things you can do with Lua in Leadwerks 5

By Josh, March 29, 2018

Our implementation of Lua in Leadwerks 5 is shaping up to be a dream come true. Below are some of the great improvements that are being made.
Access STL Containers in Lua
You can access STL containers directly from Lua:
    for n = 1, #entity.kids do
        entity.kids[n]:Move(1,0,0)
    end

    while #entity.kids > 0 do
        entity.kids[1]:SetParent(nil)
    end

In fact, verbose commands like CountChildren() and GetChild() are no longer needed at all. On the C++ side you can use this:
    for (int n=0; n<entity->kids.size(); n++)
    {
        entity->kids[n]->Move(1,0,0);
    }

    while (entity->kids.size())
    {
        entity->kids[0]->SetParent(nullptr);
    }

Note that in C++ arrays start with 0 and in Lua they start with 1.
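The while loop above only terminates if un-parenting a child also removes it from the parent's kids container. Here is a minimal sketch of that relationship, assuming shared ownership from parent to child and a weak back-reference from child to parent. Node, CreateNode, and DetachAll are hypothetical names, and this is not the engine's actual Entity implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical minimal node; not the real Entity class.
struct Node : std::enable_shared_from_this<Node> {
    std::vector<std::shared_ptr<Node>> kids; // parent owns its children
    std::weak_ptr<Node> parent;              // weak, so no ownership cycle

    void SetParent(const std::shared_ptr<Node>& newParent) {
        auto self = shared_from_this();      // keeps this alive while relinking
        if (auto old = parent.lock()) {
            auto& v = old->kids;
            v.erase(std::remove(v.begin(), v.end(), self), v.end());
        }
        parent = newParent;
        if (newParent) newParent->kids.push_back(self);
    }
};

std::shared_ptr<Node> CreateNode() { return std::make_shared<Node>(); }

// Detaches every child and returns how many were detached.
int DetachAll(const std::shared_ptr<Node>& node) {
    int count = 0;
    while (!node->kids.empty()) {
        node->kids[0]->SetParent(nullptr);   // shrinks kids by one each pass
        ++count;
    }
    return count;
}

// A child stays alive through its parent even after the local handle is cleared.
bool ParentKeepsChildAlive() {
    auto parent = CreateNode();
    auto child = CreateNode();
    child->SetParent(parent);
    std::weak_ptr<Node> watcher = child;
    child = nullptr;
    return !watcher.expired();
}
```

Because each SetParent(nullptr) call erases the child from kids, the loop index stays at 0 in C++ (and at 1 in Lua) until the container is empty.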
This also allows us to return STL containers from functions or accept them as arguments. No more ForEachEntity... callbacks are needed:
    local aabb = AABB(-10,10,0,5,-10,10)
    local entities = world:GetEntitiesInAABB(aabb)
    for n=1,#entities do
        entities[n]:AddForce(0,10,0)
    end

Super Pro User-defined Values
There will be no more self.entity or entity.script conventions in Leadwerks 5. Functions and user-defined values will be attached directly to the entity itself. The example below shows user-defined values that persist even when the entity goes out of scope of the Lua virtual machine:
    --Create child
    local a = CreateBox(world,1,1,1)

    --Set a user-defined value
    a.health = 100

    --Create parent
    local b = CreateBox(world,1,1,1)

    --Set parent
    a:SetParent(b,true)

    --Let child go out of scope (parent keeps it from being deleted in C++)
    a = nil

    --Collect garbage
    collectgarbage()

    --Get the child
    local c = b.kids[1]

    --Check the value
    print(c.health) --prints '100'

In Leadwerks 4 an entity script might look like this:
    function Script:Update()
        self.entity:Turn(self.speed,0,0)
    end

In Leadwerks 5 it is simpler because self is the actual entity:
    function Entity:Update()
        self:Turn(self.speed,0,0)
    end

In Leadwerks 4 you have to check to see if an entity has a script attached and use that to store all user-defined values:
    if world:Pick(v1,v2,pickinfo) then
        if pickinfo.entity.script~=nil then
            if type(pickinfo.entity.script.TakeDamage)=="function" then
                pickinfo.entity.script:TakeDamage(10)
            end
        end
    end

Leadwerks 5 is a lot simpler. You just check if the function exists on the entity and then call it:
    if world:Pick(v1,v2,pickinfo) then
        if type(pickinfo.entity.TakeDamage)=="function" then
            pickinfo.entity:TakeDamage(10)
        end
    end

You can even assign custom properties to entities without worrying whether they have a script attached:
    function Entity:Collision( collidedentity, position, normal, speed )
        collidedentity.lasthitobject = self --No script? No problem!
    end

In fact all a script does is attach some functions and values to an entity and then it is gone. There is no fundamental difference between a scripted and non-scripted entity.
Casting Objects
Casting objects in Leadwerks 4 uses syntax that is a little awkward. I actually had to look up the tolua.cast function just now because I couldn't remember the order of the arguments:
    local a = Model:Box()
    local b = Model:Box()
    a:SetParent(b)
    local entity = b:GetChild(0)
    local model = tolua.cast(entity,"Model")

Casting is simpler and more intuitive in Leadwerks 5:
    local a = CreateBox()
    local b = CreateBox()
    a:SetParent(b)
    local entity = b.kids[1]
    local model = Model(entity)

If the entity is not a model then the casting function will just return nil.
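On the C++ side, the closest standard-library counterpart to a cast that returns nil on failure is std::dynamic_pointer_cast, which yields an empty pointer when the object is not of the requested type. The sketch below assumes shared_ptr handles and a polymorphic base class. The Entity, Model, and Light types here are bare placeholders defined only for this example, and AsModel is a hypothetical helper, not an engine function.

```cpp
#include <cassert>
#include <memory>

// Placeholder class hierarchy for illustration only.
struct Entity { virtual ~Entity() = default; };
struct Model : Entity {};
struct Light : Entity {};

// Hypothetical helper mirroring Model(entity) in the Lua example:
// returns an empty pointer if the entity is not a Model.
std::shared_ptr<Model> AsModel(const std::shared_ptr<Entity>& entity) {
    return std::dynamic_pointer_cast<Model>(entity);
}

// Exercises both the successful and the failing cast.
void CastingDemo() {
    std::shared_ptr<Entity> box = std::make_shared<Model>();
    std::shared_ptr<Entity> lamp = std::make_shared<Light>();
    assert(AsModel(box) != nullptr);   // really a Model, cast succeeds
    assert(AsModel(lamp) == nullptr);  // a Light, cast yields an empty pointer
}
```

The returned pointer shares ownership with the original, so the cast result keeps the object alive just like any other handle.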
A big thanks goes out to the developers of sol2, an awesome modern Lua binding library with support for C++11 smart pointers.
- 6 comments
- 3,128 views
