Optimizing the Vertex Pipeline

Entry posted by Josh · August 19, 2020

1,439 views

In my work with NASA we visualize many detailed CAD models in VR. These models may consist of tens of millions of polygons and thousands of articulated sub-objects. This often results in rendering performance that is bottlenecked by the vertex rather than the fragment pipeline. I recently performed some research to determine how to maximize our rendering speed in these situations.

Leadwerks 4 used separate vertex buffers, but in Leadwerks 5 I have been working exclusively with interleaved vertex buffers. Data is interleaved and packed tightly. I always knew this could make a small improvement in speed, but I underestimated how important this is. Each byte in the data makes a huge impact. Now vertex colors and the second texture coordinate set are two vertex attributes that are almost never used. I decided to eliminate these. If required, this data can be packed into a 1D texture, applied to a material, and then read in a custom vertex shader, but I don't think the cost of keeping this data in the default vertex structure is justified. By reducing the size of the vertex structure I was able to make rendering speed in vertex-heavy scenarios about four times faster.

Our vertex structure has been cut down to a convenient 32 bytes:

struct Vertex
{
    Vec3 position;
    short texcoords[2];
    signed char normal[3];
    signed char displacement;
    signed char tangent[4];
    unsigned char boneweights[4];
    unsigned char boneindices[4];
};

I created a separate vertex buffer for rendering shadow maps, which only require position data. I decided to copy the position data into this and store it separately. This requires about 15% more vertex memory usage, but results in a much more compact vertex structure for faster shadow rendering. I may pack the vertex texture coordinates in there, since that would result in a 16-byte-aligned structure. I did not see any difference in performance on my Nvidia card and I suspect this is the same cost as a 12-byte structure on most hardware.

Using unsigned shorts instead of unsigned integers for mesh indices increases performance by 11%.

A vertex-limited scene is one in which our default setting of using an early Z-pass can be a disadvantage, so I added an option to disable this on a per-camera basis.

Finally, I found that vertex cache optimization tools can produce a significant performance increase. I implemented two different libraries. In order to do this, I added a new plugin function for filtering a mesh:

int FilterMesh(char* filtername, char* params, GMFSDK::GMFVertex*& vertices, uint32_t& vertex_count, uint32_t*& indices, uint32_t& indice_count, int polygonpoints);

This allows you to add new mesh processing routines such as flipping the indices of a mesh, calculating normals, or performing mesh modifications like bending, twisting, distorting, etc. Both libraries resulted in an additional 100% increase in framerate in vertex-limited scenes.

What will this help with? These optimizations will make a real difference when rendering CAD models and point cloud data.