Review : Kinda crappy, silly, but interesting article. Read with a grain of salt, and a glass of scotch. ----------------------------------------------------------------------------------------- A Very Flexible Render Pipeline by Charles Bloom Jan 13, 2000 ----------------------------------------------------------------------------------------- I am going to describe how all objects (TriMesh,BSP,Actor) should render. This pipeline is quite glorious (hence the name) because of its superb scalability. When I say "scalability" I mean ability to target different rendering hardware. The two extremes of hardware are very slow (old) cards (and the software driver), and very fast cards with hardware transform and lighting. At the same time the glorious pipeline also addresses the two extremes of clipping : when the object is small in model space (hence transform and clip it all in one pass) or the object is very large (hence clip first and then transform only the visible polys). First let me describe what you would do for each of these extremes, before I tell you how this pipeline addresses them all. For a very slow card, you want to do maximum culling in the CPU before sending any polys to the card. The goal is to only render things that are absolutely visible, and try to avoid overdraw as well. To do all the clipping, you want to use some heirarchical structure so that you can propagate "no need to clip" regions down a tree. To avoid overdraw, classic solutions are rendering front to back and using a span buffer or beamtree. For a very fast card, you basically want to just leave a giant array of verts on the card and let it clip, transform, light, and render for you. The only thing you want to do is very gross visibility culling, based on a PVS (potentially visible set) for example. Actors and TriMeshes will typically just be one geometry buffer which will be left on the card as long as the BBox of the whole object is visible. BSP Levels would ideally be split into Areas, and each visible Area would leave all of its geometry on the card. This gives you lots of bus bandwidth savings, and great speed as long as the card is fast enough to handle it. You want to make the size of the regions you batch adapt based on how much the card can handle, doing tighter vis for slower cards, grosser vis for faster cards. Now, for the small vs. large object compromise. If an object is small (so it is typically all on-screen or all off-screen) then the fast (CPU only) pipeline is to first gross-reject the object, then to light all the verts in world space, transform all the verts, and then dump them all on screen (typically not clipping at all). This gives you lots of memory-coherency and register continuity advantages. You can also use faster screen-space culling (after transform). If an object is very large (typically only a fraction of its polygons visible, like BSP's and Terrains, normally) then the ideal pipeline does heirarchical clipping of the polys to quickly find the few which are visible, and then light and transform and render those few polys. You gain the advantage of transforming only verts that are not culled, but must cull in world space which is slower than culling in screen space. If you use the first pipeline on a large object which is only partially visible, you get a large penalty, while if you use the second on a small object which is entirely on screen, you get a large penalty. So, how can we possibly get the best of all these worlds !? With a "Binary Polygon Array" (BPA). The BPA is a binary tree structure for polgons. Let me describe the construction of a BPA, then I will describe how it is used for rendering. A "triangle soup" list of polygons is taken as input. The entire triangle soup is put into an octtree, which recursively subdivides so that only one triangle centroid is inside each octtree node. Now, the triangles are read out of the octtree into an array. The indexing starts at I = 0, we recursively descend to a leaf, and put that triangle in the array at I, increment I and go up the tree. The result of the indexing is that the triangles contained within any octree node are within a linear range of indexes, from FirstIndex to LastIndex; these indexes are remembered in each node. For example, leaf nodes just have one triangle, so they always have FirstIndex == LastIndex; the root node will have FirstIndex == 0 and LastIndex == NumPolys -1. This octree is not the BPA, we can use it to now make the BPA. (see note 1, below) We make a binary tree from the octtree. This is very simple; starting from the leaves, whenever an octree node has more than one child, take any two children whose indexes are consecutive and make them the children of a new binary tree node. The new binary tree and the remaining children of the octree are then combined again until only two binary tree children remain. The result is a binary tree of polygons like this : [0][1][2][3] [01][23] [0-3] So that the tree is linear in index space, and also heirarchical in model space because of the octtree construction. Finally, a BBox is computed for each node. The BBoxes of the leaf nodes are just the bboxes of the polygons contained therein. The BBoxes of nodes are the union of the bboxes of its children. (see note 2, below). Each 'level 0' node (from the bottom up) is a leaf, with one polygon index (LastIndex == FirstIndex). Each level 1 has two polygons, level 2 has four, etc. The tree is balanced spatially, but may not be a balanced binary tree (see note 1). It is important to note that the polygons and their verts are actually in a linear array, only two indexes into the polygon array are in the tree. The memory usage of the tree is very low. The polygons are assumed to have indexes into the array of verts. (see note 5) Now that we have a BPA, we are ready to render. We simply start with the root node, and recursively follow this procedure : 1. Test the node BBox against the frustum, refining (reducing) the clipflags too. If the bbox is entirely culled, then return; 2. Let NumPolys = LastIndex - FirstIndex + 1 , the number of polys referred to by the node. (see note 2&3) 3. If NumPolys <= Parameter_NumRenderPolys OR the BBox is entirely included, then goto 4. Otherwise, descend to the two children. 4. For all shared verts that have not been touched this frame : Light and Transform them. (if the BBox was not totally included, the continue clipping in screen space) 5. Blast all polygons in the array from FirstIndex to LastIndex to the driver. This is a linear array of polygons, so it can simply be drawn with D3D's DrawIndexedPrimitive, for example. (see note 4). That's it! Let's examine what this has one for us by considering how the BPA performs on the cases we described above: On small objects entirely on or off screen : the BPA will do quick rejection or inclusion using heirarchical bboxes. Once a bbox is found to be all on screen, the entire linear list of polygons on screen will be shot to the card. This is ideal behaviour. We must do some extra work to check if shared verts have already been transformed, but that is minimal; most verts are lit and transformed in one big pass, for excellent cache continuity. On large objects only partially visible, we do heirarchical culling with the BPA bboxes. We then light and transform and render only visible portions. Note that the NumRenderPolys parameter lets us say "screw it" at some point and render some polys which may be off screen, doing faster non-heirarchical screen-space culling. The parameter Parameter_NumRenderPolys lets us control the fast card/slow card performance range. On slow cards (or software), you can set Parameter_NumRenderPolys to 1, so that the tree is always descended completely, and maximum culling is done. On fast cards, you can set Parameter_NumRenderPolys to be high, like 64, so that continuous poly batches can be sent, and the card can be trusted to clip them (with the remaining clip planes left over from the heirarchical culling). Hunks of 64 polys will be rendered at once, with the hunks spatially contiguous, and at least one of the them visible. The BPA structure could easily be extended to leave these polygon arrays on the card while they are still visible. In fact, in most cases the BPA gives the best of all worlds. It can always be (almost) as fast as a straight linear array of triangles (by setting NumRenderPolys very high) and it can always be as fast as a full heirarchical subdivision method (like a BSP) by descending the tree down to NumRenderPolys == 1. Also note that the BPA takes very little penalty for dynamic geometry. The tree does not need to be rebuilt, because it does not enforce any rigid subdivision scheme (eg. child nodes can intersect eachother). This means that a BPA could be made of actor polygons, and all the advantages of the BPA would remain when the actor moves around, as long as the proximity estimated for spatial coherence made in the octtree remain reasonable for all motions of the actor (eg. as long as the motions are reasonable, and the all the verts dont just fly on totally random paths and scramble the geometry). When polygons mode, the BBoxes must be percolated up the tree. Note that this is merely O(N) in the number of polygons. This updating can be further accelerated by making NumRenderPolys greater than one (see note 2), so that the tree is pruned and fewer bboxes need to updated. Finally, if this cost is still considered too great, "loose" bboxes could be used that enclose the polygons within their entire range of motion. Once the BPA is built for rendering, it is also an excellent tool for collision on objects that have no other structure (TriMesh and Actor would use it, but not BSP). ----------------------------------------------------------------------------------------- Appendix 1 : How to use the BPA with our BSP/Area system: ------------------------------------------------------------------------------------------ The BSP renderer currently has one large BSP tree, with portals separating 'Areas'. The portals are used for real-time visibility testing of entire areas. Currently, the areas are checked for visibility, and this visibility is percolated up the BSP tree from the leaves to mark all nodes which are visible with the current frame number. The nodes are then rendered top to bottom, front to back, using heirarchical frustum culling. Even though marking visibility is very fast, it requires a lot of memory IO, fetching a cache line for each node marked. In the best case, this is O(N log(N)) where N is the number of visible nodes. In the worst case, this is O(N log(N) ) where N is the total number of nodes. Careful partitioning of the tree must be done to ensure worst case behaviour is avoided (worst case must be avoided for very large scenes, eg. N approaching one million). The ideal BSP + BPA renderer would build a BPA for all the polygons in each Area. The BSP would then be used only for collision, and rendering would only use the Areas, the Portals, and the BPA's in each area. Each frame you would : 1. Find the current area. (this is O(log(N)) ) 2. If the current area has been rendered, return; 3. Render the BPA in the current Area. 4. Render all objects in the current Area (if they have not been rendered this frame) 5. Clip the frustum through all portals exiting the current area, and goto (2) for all of those areas. This process is O(N) in the number of visible areas, and O(N) in the number of visible polygons. Note that areas are (loosely) rendered front to back automatically. This improvement of worst case behaviour for very large scenes is nice, but not terribly compelling. There are two things that make the BPA + BSP scheme compelling. For one thing, all the flexibility of the BPA is inhereted, meaning that batches of polygons can be sent to the card. In the most extreme case, the BPA would be entirely visible, or NumRenderPolys would be set very high (on an HT&L card), so the entire set of polygons in an Area would be sent to the card in one batch. The other thing which makes the BPA compelling is that it allows the geometry rendered in the Area to differ from the BSP nodes. Thus, the BPA could contain arbitrary "skins" on the bsp faces, such as displacement mappings of the walls. These would be rendered in large polygon batches. (presumably high polygon skins would only be enabled for fast cards). ----------------------------------------------------------------------------------------- Appendix 2 : How to use the BPA with advanced occlusion culling ------------------------------------------------------------------------------------------ The BPA also provides structure for advanced occlusion testing. The BPA can be rendered (loosely) front to back by simply choosing the closer of the two children first at each recursion. Because this is loose, it cannot be used with pure occlusion buffers, but it can be used with Z buffers, and beamtrees or span buffers that also include Z depths. The front faces of a BPA node BBox can then be checked very quickly against a heirarchical Z pyramid, or against a beamtree. Presumably a smart algorithm would be used that would only test node BBoxes if the node contained a sufficiently interesting number of polygons ( >= 16 or so) and was also reasonable small in screen space ( <= 1024 pixels or so). This type of strong occlusion culling is only interesting for very very dense polygon models ( ~ one million polygons) or when the CPU is much faster than the rasterizer (probably rare). If hardware cards provided fast Z buffer (pyramid) tests, this would be much more interesting. Instead, it is just a curiousity for completeness. ----------------------------------------------------------------------------------------- Notes : ------------------------------------------------------------------------------------------ note 1: The octtree is not strictly necessary. Any spatial sorting of the indexes would be acceptable, and the octtree is certainly not optimal, but is very fast and easy to understand and implement. For example, a tree building technique to try to equalize the volume of each node at each level of the binary tree would be better. note 2: if Parameter_NumRenderPolys does not need to be real-time dynamic, then you can simply cull the tree so that there are no nodes with NumPolys < Parameter_NumRenderPolys, and the lowest leaves have NumPolys >= Parameter_NumRenderPolys. (this is recommended because it saves a lot of memory for the BBoxes, and in fact should be done even before the bboxes are computed). For example, if Parameter_NumRenderPolys is 8, then three levels of the tree are pruned, and only (N/8) leaves exist, where N is the number of polygons. note 3: There is an optional step here: if NumPolys <= Parameter_NumTransformPolys, then transform all the verts pointed to by the polys. To do this, we must keep the lowest and highest indeces of a vertex in the vertex array used by any poly in each node, and we must make sure those verts are roughly sorted spatially (this is easily done with an octtree just as we did the polys). Once we reach a depth where this occurs (for example, NumPolys <= 32, a depth of 5 above the leaves) then we can use screenspace clipping for all the remaining levels of the tree. Furthermore, it gives us the advantage of transforming verts in a fast array. Note that Parameter_NumTransformPolys must be >= Parameter_NumTransformPolys at all times!! note 4: You could alternatively gather triangle strips in the leaf nodes, or OpenGL DrawLists. note 5: The "polygons" should probably just be limited to triangles. ----------------------------------------------------------------------------------------- EOF ------------------------------------------------------------------------------------------