glDrawElements vs. glDrawArrays - The numbers are in!
As promised in a thread probably 2 people read, I wrote a little tester app to evaluate the comparative speed of glDrawElements (wrapped in array locks) and glDrawArrays. It tests each with three array setups: packed discrete arrays, aligned (to double-word boundaries) discrete arrays, and interleaved arrays. It also does a depth buffer clear and swap about 30 times a second (a bit less in reality). Some results:
iMac (G3/333mhz) "five flavors", Rage Pro OpenGL Engine (os9)
winner: glDrawElements (around 112kpolys/sec, aligned discrete arrays seemed to have an almost unmeasurable advantage)
iMac (G3/400mhz), Rage 128 OpenGL Engine (osX)
winner (by a nose): glDrawElements (no clear best array format) (around 125kpolys/sec)
Power Macintosh (G3/350mhz) "blue & white", Rage 128 OpenGL Engine (os9)
no clear winner. (around 150kpolys/sec)
Power Macintosh (g4/400) not sure what model, Rage 128 OpenGL Engine (os9)
no clear winner. (around 160kpolys/sec, with iTunes running)
Power Macintosh (g4/533x2) "digital audio", NVIDIA GeForce2 MX OpenGL Engine (osX)
winner: glDrawElements (no clear best array format) (around 1330kpolys/sec!!!)
This test was not designed to judge FILL RATE, and so real-world poly counts will be SIGNIFICANTLY lower.
Here (http://www.inio.org/~inio/code/SpeedTest.sit) is the app, data, and source code. I had to hack the code out of my current project, so if I missed anything just let me know.
just did a few more tests with lighting disabled and a color array.
You get much higher poly counts with lighting off. My GeForce2 MX crossed the 5Mpoly/sec mark! The Rage Pro iMac I mentioned got almost 200kpolys/sec, and the blue & white got 620kpolys/sec. The peak poly rates on all of them definitely came from double-word aligned discrete arrays drawn with glDrawElements, although all glDrawElements paths were relatively fast. In the case of the GF2MX, glDrawElements is over 4 times faster than glDrawArrays.
translation: if you can store static lighting (or quickly generate dynamic lighting) for some part of your scene, do it.
this is all with glColorPointer(4, GL_UNSIGNED_BYTE, 0, *) or glInterleavedArrays(GL_T2F_N3F_V3F, 0, *) BTW.
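For anyone unsure what the three array setups look like in memory, here is a rough sketch. The struct names are made up for illustration, "double-word" is assumed to mean 8 bytes (PowerPC terminology), and the padded layout is assumed to be the x,y,z,ignore struct rather than anything taken from the tester's source.

```cpp
#include <cstddef>
#include <cstdint>

// Packed discrete array element: three floats, no padding (12 bytes).
struct PackedVec3 { float x, y, z; };

// Aligned discrete array element: padded with an unused float (16 bytes),
// so every element starts on an 8-byte boundary if the array base does.
struct AlignedVec3 { float x, y, z, ignore; };

// Color array element, as used with glColorPointer(4, GL_UNSIGNED_BYTE, ...).
struct ColorRGBA { std::uint8_t r, g, b, a; };

// Interleaved element in the order GL_T2F_N3F_V3F expects.
struct InterleavedVertex {
    float u, v;        // T2F: texture coordinate
    float nx, ny, nz;  // N3F: normal
    float x, y, z;     // V3F: position
};

static_assert(sizeof(PackedVec3) == 12, "packed element is 3 floats");
static_assert(sizeof(AlignedVec3) == 16, "aligned element is 4 floats");
static_assert(sizeof(InterleavedVertex) == 32, "T2F_N3F_V3F is 8 floats");

// True if element `index` of an array with the given stride starts on an
// 8-byte boundary, assuming the array base itself is 8-byte aligned.
constexpr bool startsOnDoubleWord(std::size_t index, std::size_t strideBytes) {
    return (index * strideBytes) % 8 == 0;
}
```

With the packed layout every other element starts 4 bytes off an 8-byte boundary; with the padded layout none do. The price is that the gl*Pointer stride has to be 16 bytes instead of 0 (tightly packed).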
Nimrod
2002.07.06, 05:16 AM
Wow, thanks Ian, that's helped me quite a bit! Two questions:
So what is the fastest way of drawing dynamic (animated) meshes?
And have you tried using the VAR extension? (have Apple even written the drivers for it yet?)
Thanks :)
Originally posted by Nimrod
So what is the fastest way of drawing dynamic (animated) meshes?
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
if (useGLLighting) {
    glEnableClientState(GL_NORMAL_ARRAY);
    glEnable(GL_LIGHTING); // might want to switch this around to assume lighting is on.
} else {
    glEnableClientState(GL_COLOR_ARRAY); // pre-lit colors take the place of normals
}
glEnableClientState(GL_VERTEX_ARRAY);
glTexCoordPointer(2, GL_FLOAT, 0, texCoords);
// texCoords is array of struct{GLfloat u, v;};
if (useGLLighting) {
    // normals is array of struct{GLfloat x,y,z,ignore;};
    // a stride of 0 would mean tightly packed, so pass the padded element size
    glNormalPointer(GL_FLOAT, 4 * sizeof(GLfloat), normals);
} else {
    // colors is array of struct{GLubyte r,g,b,a;};
    glColorPointer(4, GL_UNSIGNED_BYTE, 0, colors);
}
// vectors is array of struct{GLfloat x,y,z,ignore;}; the stride skips the padding float
glVertexPointer(3, GL_FLOAT, 4 * sizeof(GLfloat), vectors);
glLockArraysEXT(0, numberOfVertices);
glDrawElements(GL_TRIANGLES, group->numFaces*3, GL_UNSIGNED_SHORT, indices);
// GL_UNSIGNED_INT may be faster - untested!
glUnlockArraysEXT();
glDisableClientState(GL_TEXTURE_COORD_ARRAY);
if (useGLLighting) {
    glDisableClientState(GL_NORMAL_ARRAY);
    glDisable(GL_LIGHTING); // restores lighting-off; switch this around if lighting is normally on.
} else {
    glDisableClientState(GL_COLOR_ARRAY);
}
glDisableClientState(GL_VERTEX_ARRAY);
Calculating the normal vectors on the fly is beyond the scope of this post.
Your indices should be arranged so that you're drawing triangle strips (generalized triangle strips may be good enough, not sure) to help older cards which can't cache your entire vertex set. I found a great thesis on generating optimal tristrips (http://www.ams.sunysb.edu/~xxiang/thesis.pdf) (1MB, 85pg.) that went completely over my head, so I used the actc library (http://www.plunk.org/~grantham/public/actc/) I found instead (sortaBSD license - included in the archive).
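To get a feel for why index order matters when the card can only hold a few transformed vertices, here is a small sketch that counts vertex cache misses for a given index list. The FIFO replacement policy and the cache size parameter are assumptions for illustration only; I don't know what any of the cards tested here actually do, and countCacheMisses is a made-up helper, not part of ACTC.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Counts how many indices would have to be (re)transformed if the card kept
// the last `cacheSize` distinct vertices in a simple FIFO cache.
std::size_t countCacheMisses(const std::vector<unsigned short>& indices,
                             std::size_t cacheSize) {
    std::deque<unsigned short> cache;
    std::size_t misses = 0;
    for (unsigned short idx : indices) {
        if (std::find(cache.begin(), cache.end(), idx) != cache.end()) {
            continue;  // hit: vertex is still in the cache
        }
        ++misses;
        cache.push_back(idx);
        if (cache.size() > cacheSize) {
            cache.pop_front();  // evict the oldest entry
        }
    }
    return misses;
}
```

Feeding it the same triangles in strip-like order and in scattered order shows the difference: with a 3-entry cache, {0,1,2, 2,1,6, 3,4,5} misses 7 times, while {0,1,2, 3,4,5, 2,1,6} misses on all 9 indices.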
That make sense?
And have you tried using the VAR extension? (have Apple even written the drivers for it yet?)
Which one?
Nimrod
2002.07.06, 06:39 AM
Thanks!
VAR (Vertex Array Range) is an NVIDIA-specific extension, available on GeForce cards and upwards. I hear it brings speed gains that make it worth using, even though it won't work on ATI stuff.
http://developer.nvidia.com/view.asp?IO=Using_GL_NV_fence
http://developer.nvidia.com/view.asp?IO=vardemo
These 2 links should be helpful, there might be other stuff if you poke around NVIDIA's site.
Although this thread (http://cgi.sfu.ca/~akirczen/cgi-bin/forum/ikonboard.cgi?s=3d26c60727ccffff;act=ST;f=1;t=24) would suggest there aren't mac drivers for it yet (and you might want to read this, because I notice you also use GLUT). IIRC Jaguar will bring driver support for such features.
OneSadCookie
2002.07.06, 08:24 AM
It's common knowledge (mac-opengl mailing list at any rate :)) that 10.2 brings GL_APPLE_vertex_array_range, GL_APPLE_fence and GL_APPLE_vertex_array_object, allowing AGP & VRAM storage of vertices.
It looks as if the extensions work slightly differently from their proprietary counterparts, but should provide good speedups.
Display lists are also improved by the new additions, and probably become the fast path for OSX, rather than the slow one as they currently are :)
Feanor
2002.07.06, 12:38 PM
Do you think using Compiled Vertex Arrays (on cards that support it) would make a notable difference? Many devs on mac-opengl have talked about them. The main consideration is that the arrays are limited to 2048 vertices apiece. Not a subject I've had much luck finding info on.
-- Fëanor
OK, so I did more testing with the colorized (rather than lit) geometry, including comparing optimized (run through ACTC) to unoptimized geometry. Here are the results:
G4/533x2, OSX, GeForce2MX:
optimized: 5550 kpolys/sec
unoptimized: 4580 kpolys/sec
G3/350, OS9, Rage128:
optimized: 560 kpolys/sec
unoptimized: 360 kpolys/sec
And now the reason for the subject line:
G3/333, OS9, RagePro:
optimized: 188 kpolys/sec
unoptimized: 195 kpolys/sec
This is reliable, not a freak occurrence. These numbers were repeated through several repetitions of the loop.
Check the end of this thread for a theory on why the unoptimized method is faster.
App and data. (http://www.inio.org/~inio/code/SpeedTestNolight.sit)
Nimrod
2002.07.06, 04:14 PM
I'm very happy this thread exists, 'cos I'd been meaning to ask about this. I've just been playing about with rendering a mesh with 1682 polies, 901 vertices, normals for every vertex (which I calculated when I exported the 3ds file to my own format IIRC), and a 256*256 texture, and OpenGL lighting with one light (this is on a non TnL card).
This is all done on a G4 350MHz with Rage128, OS X 10.1.5.
Up till now this was done in immediate mode, where I was getting on average 70 FPS. Now I'm using glDrawElements() and getting almost 120 FPS. I think this comes out at about 175 - 180 thousand polies / sec. This is faster than on Ian's G4 400 for some reason (OSX vs OS9?). If the thread at macscene.org (linked to above) is to be believed, perhaps it's because I'm using Carbon and not GLUT.
Cool speed boost though!
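For reference, the conversion between frame rate and poly rate for a single mesh is just a multiplication. A minimal sketch, using the mesh size and frame rates quoted above (the function names are made up):

```cpp
#include <cassert>

// Polygons submitted per second when one mesh is drawn once per frame.
double polysPerSecond(unsigned polysPerFrame, double framesPerSecond) {
    return polysPerFrame * framesPerSecond;
}

// Frame rate needed to reach a given poly rate with the same mesh.
double framesPerSecondFor(unsigned polysPerFrame, double polysPerSec) {
    assert(polysPerFrame > 0);
    return polysPerSec / polysPerFrame;
}
```

With 1682 polies per frame, 70 FPS works out to 117,740 polies/sec and a full 120 FPS to 201,840; a rate of 175 to 180 thousand corresponds to roughly 104 to 107 FPS.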
This is without triangle strips, which I intend to look into next. Part of the speedup is probably also down to me previously using wrappers to get access to the vertices, indices etc... which were non-inlined, I don't know how much of a slowdown they would be.
Ian, do you know of the STRIPE algorithm for creating tri-strips? It works best on quads by triangulating them, but it decides on the optimal way to put the diagonal. Thanks for the link to ACTC.
EDIT:
I also meant to say that I had been put off trying glDrawElements, because I use STL vectors for storing all my data. I assume it works because the internal format of the data in a vector is no different from a plain array. But can I guarantee that this will always be the case?
Originally posted by Nimrod
Ian, do you know of the STRIPE algorithm for creating tri-strips? It works best on quads by triangulating them, but it decides on the optimal way to put the diagonal. Thanks for the link to ACTC.
I found them, but ACTC works well enough, and is free. As I just stated, it appears that on the Rage Pro, which is the minimum config we plan to support, RANDOMIZING the polygons may produce the best results.
edit: BTW, I would be interested in what numbers these apps get on a Radeon rig; I have yet to find one I can test on.
OK, I re-ran the tests after randomly shuffling the triangles in the input (http://www.inio.org/~inio/testlevel.obj). As expected, the Rage128 and GeForce2 got lower poly rates for the unoptimized geometry. Surprisingly, the Rage Pro got higher poly rates with the unoptimized geometry. At the peak it hit 207kpolys/sec - with tri strip optimized geometry its max was 189kpolys/sec.
tip: Check if you're running on a Rage Pro (strcmp((const char *)glGetString(GL_RENDERER), "Rage Pro OpenGL Engine") == 0; a plain == would only compare pointers). If you are, shuffle the order of the polygons in your models around a bit.
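A rough sketch of that tip, kept free of GL calls so it can be tried on its own. isRagePro and shuffleTriangles are made-up names; the renderer string is the one reported by the Rage Pro iMac at the top of the thread, and the caller is assumed to pass in (const char *)glGetString(GL_RENDERER).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <random>
#include <utility>
#include <vector>

// Compares the renderer string by content, not by pointer.
bool isRagePro(const char* renderer) {
    return renderer != nullptr &&
           std::strcmp(renderer, "Rage Pro OpenGL Engine") == 0;
}

// Fisher-Yates shuffle over whole triangles (groups of three indices), so
// each triangle keeps its own vertices and winding; only the order changes.
void shuffleTriangles(std::vector<unsigned short>& indices, unsigned seed) {
    assert(indices.size() % 3 == 0);
    const std::size_t triCount = indices.size() / 3;
    if (triCount < 2) {
        return;
    }
    std::mt19937 rng(seed);
    for (std::size_t i = triCount - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i);
        const std::size_t j = pick(rng);
        for (std::size_t k = 0; k < 3; ++k) {
            std::swap(indices[3 * i + k], indices[3 * j + k]);
        }
    }
}
```

Doing the shuffle once at load time (only when isRagePro returns true) keeps the cost out of the draw loop.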
OneSadCookie
2002.07.06, 08:34 PM
That's really strange that randomizing the triangles' order would be consistently better... maybe you've got something else going on (if it were Radeon/GF3, I'd say Z-buffer, but that doesn't seem to make sense for Rage Pro).
-----
CVA is supported on all hardware under OSX.
It doesn't provide significant benefits unless you can interleave your CPU work with your glDrawElements calls sufficiently. Basically, it seems like VAR/Fence is at least as good in the worst case, and significantly better in the best case.
henryj
2002.07.07, 06:25 PM
Nimrod:
I also meant to say that I had been put off trying glDrawElements, because I use STL vectors for storing all my data. I assume it works because the internal format of the data in a vector is no different from a plain array. But can I guarantee that this will always be the case?
STL vectors guarantee that they can be passed to functions that expect C-style arrays, so you will be fine. On the other hand, you should take care using them, because they can allocate and deallocate at inopportune times, and when they do it's not cheap. You can avoid this if you take care, e.g. by calling reserve() up front.
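A small sketch of both points: the elements of a std::vector sit in one contiguous block (this is spelled out in the C++03 standard), so a pointer to the first element can be handed to the gl*Pointer calls, and reserving capacity up front keeps that pointer stable while you fill the vector. Vec3, buildVertices and vertexData are made-up names; data() is the C++11 spelling of &v[0].

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Fills a vector with `count` vertices after a single up-front allocation.
std::vector<Vec3> buildVertices(std::size_t count) {
    std::vector<Vec3> verts;
    verts.reserve(count);  // one allocation; push_back below never reallocates
    for (std::size_t i = 0; i < count; ++i) {
        verts.push_back(Vec3{static_cast<float>(i), 0.0f, 0.0f});
    }
    return verts;
}

// Pointer suitable for passing where a C-style array is expected.
// Only valid until the vector next reallocates or is destroyed.
const Vec3* vertexData(const std::vector<Vec3>& verts) {
    return verts.empty() ? nullptr : verts.data();
}
```

The thing to watch is holding on to the pointer across a push_back that exceeds the capacity; after that the old pointer is dangling.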
henryj
2002.07.07, 06:33 PM
A couple of things about compiled vertex arrays...
On OSX you are limited to 2048 indices. Any more than this and openGL reverts to the non compiled path.
CVA is only going to benefit you if you are touching the same geometry multiple times, e.g. doing multiple passes for lightmapping, or if you have some objects that somehow share exactly the same vertices. You should be calling glDrawElements lots between your lock calls. This...
glLockArraysEXT( 0, size);
glDrawElements( GL_TRIANGLES, indices->Size(), GL_UNSIGNED_INT, indices->Data());
glUnlockArraysEXT();
is a waste of time.
Jeff Binder
2002.07.07, 06:53 PM
It may also be worth a shot if you're using indexed vertex arrays (i.e. glDrawElements()), if you're using any vertices more than once.
Originally posted by henryj
glLockArraysEXT( 0, size);
glDrawElements( GL_TRIANGLES, indices->Size(), GL_UNSIGNED_INT, indices->Data());
glUnlockArraysEXT();
is a waste of time.
FALSE.
I just ran some tests after an inexplicable framerate drop in my app. Uncommenting the lock arrays calls (used exactly as you show) increased my max polygon rate by over 300% on my GeForce2MX, and by about 100% on the Rage128. Only the iMac/RagePro didn't care whether they were locked or not, it seems.
henryj
2002.07.08, 08:03 PM
This is all quite interesting...
I commented out the lock calls in my game and it made NO difference at all, zero, zip.
Apple engineers have said for a while that display lists are slow on OSX, but a colleague of mine has tested them against vertex arrays and they were faster.
What does this mean?
OpenGL performance varies depending on the day, weather, colour of your wall paper?
Tests designed to test performance don't reflect 'real world' conditions?
Who knows?
Best bet is to profile your own code and work from there.
I will download your test and get back to you.
henryj
2002.07.08, 08:31 PM
I've given your test prog a spin and this is what I have found...
Performance on the same run varied up to 10% on the same test.
Performance on different runs varied up to 20%
I'm getting around 1100k polys/sec on a G4 dual 500 Radeon 10.1.5. This seems quite slow. I would expect around 4 million polys/sec.
The reason why the lock calls made such a difference is because you are rendering the same geometry every frame. This is to be expected, as I said...
CVA is only going to benefit you if you are touching the same geometry multiple times, e.g. doing multiple passes for lightmapping, or if you have some objects that somehow share exactly the same vertices. You should be calling glDrawElements lots between your lock calls.
You are effectively caching your mesh on the video card and repeatedly calling drawElements, which is the ideal situation but not very representative. I don't know a lot of games that only have one object.
My game comprises over 200 different meshes and renders about 90 per frame, with about 30 different textures and about 15 different material setups. How about doing a real world test? Load 2 different meshes with different textures and render them alternately.
Good work though. It was quite interesting.
Originally posted by henryj
The reason why the lock calls made such a difference is because you are rendering the same geometry every frame. This is to be expected, as I said...
I lock and unlock the arrays immediately before and after each glDrawElements. Unless the driver is doing something really sneaky (checksumming a small sample of the array data?), the geometry is being re-submitted to the card for every draw. Check the source.
My game comprises over 200 different meshes and renders about 90 per frame, with about 30 different textures and about 15 different material setups. How about doing a real world test? Load 2 different meshes with different textures and render them alternately.
This wasn't designed to be a real world speed test. It was only designed to determine what the fastest way to submit polygons to the graphics card was.
henryj
2002.07.08, 10:08 PM
I lock and unlock the arrays immediately before and after each glDrawElements. Unless the driver is doing something really sneaky...
The driver may be caching the geometry. How else do you explain the differences we have seen? The best way to tell for sure is to add another mesh.
This wasn't designed to be a real world speed test. It was only designed to determine what the fastest way to submit polygons to the graphics card was.
Then what's the point? People are going to make decisions on their render path based on this data. If you just wanted to see the theoretical limit, just render one static triangle strip of 1 pixel triangles. This is the fastest method.
I'm not criticising what you have done, it's really good, but why not make it so the data is actually useful? It won't take much more work to add another mesh. You could make 2 TriMesh of the same data and draw them alternately. This would tell us whether the driver is doing a comparison of the pointers being passed to gl*Pointer(). Being sneaky, as you said.
I for one would be interested in the results.
Originally posted by OneSadCookie
That's really strange that randomizing the triangles' order would be consistently better... maybe you've got something else going on (if it were Radeon/GF3, I'd say Z-buffer, but that doesn't seem to make sense for Rage Pro).
Alright, blast from the past time, but I think I've figured out why this strange behavior occurs.
Let's say it takes a non-trivial amount of CPU computation to throw out a back-facing poly, and that the video card does not have a draw queue, or has a very small one. If you sort the polygons into tri-strips, it's likely that there are long groups of sequential back-facing and front-facing polygons. During the back-facing groups the graphics card is idle while the CPU throws out the polys. During the front-facing groups the CPU is idle while it waits for the graphics card to accept the next poly.
Now, if you randomize the polygons, the chance of a sequence of more than a few front- or back-facing polygons becomes negligible. This means that while the CPU is waiting to submit the next polygon there's a good chance that it could throw out a back-facing polygon. I think this slight efficiency gain might be enough to create the ~7% speed increase seen from randomizing the polygons.
As a result, if you're really targeting the Rage Pro (for which you've got to be either insane, working on a demo, or hopelessly stuck in the near past), it might be interesting to try ordering your polygons so that any two sequential polygons are as close to facing directly away from each other as possible. This would pretty much guarantee that the CPU throws out one poly while it waits for another to draw every time.
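If anyone wants to try that, here is an untested-on-hardware sketch of one way to do it: a greedy pass that always picks, as the next triangle, the remaining one whose face normal points most directly away from the previous one (smallest dot product). The Normal struct and orderByOpposingNormals are made up for illustration, the normals are assumed to be unit length, and the search is O(n^2), so it is only suitable as a load-time step on small models. Whether it actually beats a plain shuffle on a Rage Pro is pure speculation.

```cpp
#include <cstddef>
#include <vector>

struct Normal { float x, y, z; };

static float dot(const Normal& a, const Normal& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Returns a drawing order (indices into faceNormals) in which each triangle
// is followed by the unused triangle facing most directly away from it.
std::vector<std::size_t> orderByOpposingNormals(
        const std::vector<Normal>& faceNormals) {
    const std::size_t n = faceNormals.size();
    std::vector<std::size_t> order;
    if (n == 0) {
        return order;
    }
    std::vector<bool> used(n, false);
    std::size_t current = 0;  // start from the first triangle
    used[current] = true;
    order.push_back(current);
    while (order.size() < n) {
        std::size_t best = n;  // n means "nothing picked yet"
        float bestDot = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            if (used[i]) {
                continue;
            }
            const float d = dot(faceNormals[current], faceNormals[i]);
            if (best == n || d < bestDot) {
                best = i;
                bestDot = d;
            }
        }
        used[best] = true;
        order.push_back(best);
        current = best;
    }
    return order;
}
```

For four triangles with normals +z, +z, -z, -z this gives the order 0, 2, 1, 3, i.e. front- and back-facing triangles alternate when viewed along z.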
OneSadCookie
2003.07.19, 06:35 AM
Wow, this is a real blast from the past!
Sounds like a good theory. I'm just glad it's a non-issue these days :)
Completely off topic, but speaking of the past, did anyone else notice who posted to this thread? Maybe I am seeing ghosts...