#import <ppc_intrinsics.h> for Universal Binary ?
How can I replace ppc_intrinsics.h?
I only need to replace __fres and __frsqrte.
Any ideas?
You could just use 1.0f / x for __fres (reciprocal estimate) and 1.0f / sqrtf(x) for __frsqrte (reciprocal square root estimate). If you are using them and they are only in there for optimization purposes, then are you linking against the 10.4 Universal SDK rather than the 10.3.9, 10.4, or "Current OS" SDKs? If not, you must link against the 10.4 Universal SDK. (This assumes you're in Xcode; I don't know how to do that from the command line.)
sqrt() calls on Intel will translate to a single machine instruction, so you may not need this optimization any more...
completely untested:
Code:
#include <xmmintrin.h>
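/* GCC statement expressions; _mm_cvtss_f32 extracts the float result from the SSE register */
#define __frsqrte(f) ({ float _f = (f); _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(_f))); })
#define __fres(f) ({ float _f = (f); _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(_f))); })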
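For a universal binary, the same source is compiled once per architecture, so one option is to pick the implementation with preprocessor checks. The following is only a sketch: the names fast_recip and fast_rsqrt are made up for illustration, and it assumes Apple's GCC defines __ppc__ on the PowerPC slice and __SSE__ on the Intel slice.

```c
#include <math.h>

#if defined(__ppc__) || defined(__ppc64__)
/* PowerPC slice: use the real estimate intrinsics from Apple's header. */
#include <ppc_intrinsics.h>
static inline float fast_recip(float x) { return __fres(x); }
static inline float fast_rsqrt(float x) { return (float)__frsqrte(x); }
#elif defined(__SSE__)
/* Intel slice: SSE reciprocal and reciprocal square root estimates. */
#include <xmmintrin.h>
static inline float fast_recip(float x)
{
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
}
static inline float fast_rsqrt(float x)
{
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}
#else
/* Anything else: plain full-precision arithmetic. */
static inline float fast_recip(float x) { return 1.0f / x; }
static inline float fast_rsqrt(float x) { return 1.0f / sqrtf(x); }
#endif
```

Callers would then use fast_recip(x) and fast_rsqrt(x) everywhere instead of the PowerPC-only names. Keep in mind that the results are estimates and their precision differs between the two architectures.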
thanks going to try that :-)
Those are a single instruction on PPC as well, but not single-cycle.
The PowerPC has reciprocal square root estimate and reciprocal estimate, yes, but only the G5 has a full-precision non-reciprocal square root instruction. All Intel Macs have such a thing:
Code:
iMacCoreDuo:~ keith$ cat > test.c
#include <math.h>
float square_root(float f) { return sqrtf(f); }
iMacCoreDuo:~ keith$ gcc -c -O2 -arch ppc test.c
iMacCoreDuo:~ keith$ otool -tV test.o
test.o:
(__TEXT,__text) section
_square_root:
00000000 b 0x20 ; symbol stub for: _sqrtf
iMacCoreDuo:~ keith$ gcc -c -O2 -arch ppc -mcpu=G5 test.c
iMacCoreDuo:~ keith$ otool -tV test.o
test.o:
(__TEXT,__text) section
_square_root:
00000000 fsqrts f1,f1
00000004 blr
iMacCoreDuo:~ keith$ gcc -c -O2 -arch i386 test.c
iMacCoreDuo:~ keith$ otool -tV test.o
test.o:
(__TEXT,__text) section
_square_root:
00000000 pushl %ebp
00000001 movl %esp,%ebp
00000003 subl $0x04,%esp
00000006 sqrtss 0x08(%ebp),%xmm0
0000000b movss %xmm0,0xfffffffc(%ebp)
00000010 fldsl 0xfffffffc(%ebp)
00000013 leave
00000014 ret
From TN2087:
Quote:The G5 has a full-precision hardware square root implementation. If your code executes square root, check for the availability of the hardware square root in the G5 and execute code calling the instruction directly (e.g. __fsqrt()) instead of calling the sqrt() routine. (Use __fsqrts() for single-precision.) You can use the GCC compiler flags -mpowerpc-gpopt and -mpowerpc64 to transform sqrt() function calls directly into the PPC sqrt instruction.
Edit: Grr...OSC beat me to it...
For completeness' sake:
Code:
iMacCoreDuo:~ keith$ icc -c test.c -O2
iMacCoreDuo:~ keith$ otool -tV test.o
test.o:
(__TEXT,__text) section
__text:
00000000 subl $0x0c,%esp
00000003 sqrtss 0x10(%esp,1),%xmm0
00000009 movss %xmm0,(%esp,1)
0000000e fldsl (%esp,1)
00000011 addl $0x0c,%esp
00000014 ret
00000015 nop
00000016 nop
00000017 nop
Looks like GCC 4 for Intel is producing rather suboptimal code for this simple case...
OneSadCookie Wrote:the PowerPC has reciprocal square root estimate and reciprocal estimate, yes, but only the G5 has a full precision non-reciprocal square root instruction.
This is not so, what about fsqrt t,b and fsqrts t,b? Not to go on about it.
Also, don't forget that cycles count. Just because an instruction exists to do something doesn't mean it's faster than several instructions. The 5-bit frsqrte and 8-bit fres are done instantly with parallel logic. It's unlikely (though I don't know) that either the PPC or Intel does a division in a single cycle. It may take dozens.
fsqrt and fsqrts are not available on the G3 or G4. Do some research before randomly arguing with me :|
And yes, I am well aware that just because it's a single instruction doesn't mean that it's "fast". It does, however, mean that you don't pay the (significant) overhead of a function call, particularly one into a dynamic library. I don't have a G5 handy to test with, but my recollection is that a fsqrt is about as expensive as a divide (30+ cycles latency). I'd also be very surprised if frsqrte and fres are "instant"; chances are they have at least a four-cycle latency. Again, no G5 to test with. Perhaps somebody who has one can post the numbers as reported by Shark.
OneSadCookie Wrote:fsqrt and fsqrts are not available on the G3 or G4. Do some research before randomly arguing with me :|
OneSadCookie, that is certainly wrong. I have 2 data books and a pdf right in front of me, and they are all about a decade old, and they all have those instructions in them. I have been programming a wide variety of processors in assembler professionally for half my life and I don't like your tone at all. And I don't randomly argue with anyone.
I checked my FACTS in TWO PLACES before posting, please check yours!
I don't know when I've been so angry.
Quote:And yes, I am well aware that just because it's a single instruction doesn't mean that it's "fast". It does, however, mean that you don't pay the (significant) overhead of a function call, particularly one into a dynamic library. I don't have a G5 handy to test with, but my recollection is that a fsqrt is about as expensive as a divide (30+ cycles latency). I'd also be very surprised if frsqrte and fres are "instant"; chances are they have at least a four-cycle latency. Again, no G5 to test with. Perhaps somebody who has one can post the numbers as reported by Shark.
frsqrte is 5-bits. My common sense tells me it is surely single-cycle. 50 gates should be enough to budget that. fres too. For that matter I don't see how an extra cycle could be used to reduce the gate count in either case, the instructions are just too weak.
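Because the estimate instructions only deliver a few bits of precision (5 bits for frsqrte and 8 for fres, per this thread), code that uses them usually refines the result with Newton-Raphson steps, each of which roughly doubles the number of correct bits. The sketch below shows that standard technique; it is not taken from anyone's code in this thread, and the function names are hypothetical.

```c
/* One Newton-Raphson step for y ~ 1/sqrt(x), given a rough estimate y. */
static inline float refine_rsqrt(float x, float y)
{
    return y * (1.5f - 0.5f * x * y * y);
}

/* One Newton-Raphson step for y ~ 1/x, given a rough estimate y. */
static inline float refine_recip(float x, float y)
{
    return y * (2.0f - x * y);
}
```

For example, starting from an estimate of 0.48 for 1/sqrt(4), one step gives about 0.4988 and a second gives about 0.499996. Each step costs a few multiplies, which is worth weighing against the latency of a full-precision divide or square root.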
The PPC ISA contains fsqrt and fsqrts, and lists them as optional. This is straight from the Apple headers:
/*
* __fsqrt - Floating-Point Square Root (Double-Precision)
*
* WARNING: Illegal instruction for PowerPC 603, 604, 750, 7400, 7410,
* 7450, and 7455
*/
Latency for frsqrte is 3 or 4 cycles, according to my sources.
The instructions are *optional*. They're defined in the original PowerPC spec, yes, but the range of processors that implement them is limited.
PowerPC compiler-writers guide Wrote:The PowerPC architecture includes a set of optional instructions:
General-Purpose Group—fsqrt and fsqrts.
Graphics Group—stfiwx, fres, frsqrte, and fsel.
If an implementation supports any instruction in a group, it must support all of the instructions in the group. Check the documentation for a specific implementation to determine which, if any, of the groups are supported
Section 1.5.2 of this document: http://www.freescale.com/files/32bit/doc...pdf?srch=1 describes that G3s implement the "graphics group" but not the "general-purpose group"
Section 1.3.2.3 of this document: http://www.freescale.com/files/32bit/doc...7410UM.pdf describes that G4s implement the "graphics group" but not the "general-purpose group"
G5s implement both, as described in section 2.2.4 of this document: http://www-306.ibm.com/chips/techlib/tec...6FEB09.pdf
Why else would GCC only generate the fsqrt instruction when explicitly told it's generating code for G5 only?!
The documents I referenced have a detailed discussion of the latency of various instructions, but long story short, for a G3 or G4, the minimum latency of a floating-point arithmetic instruction is 3 cycles, and frsqrte executes within that. On the G5, frsqrte has a latency of 6 cycles, and fsqrt has a latency of 40.
That's nearly 30 minutes of my time wasted proving stuff that I already knew
DoG Wrote: Latency for frsqrte is 3 or 4 cycles, according to my sources.
It's hard to find timing data for the PPC, this is the closest I could get from google:
http://www.google.com/search?q=frsqrte&h...rt=10&sa=N
Quote: Optimization and Tuning on POWER4 ...
File Format: PDF/Adobe Acrobat
Single cycle fres and frsqrte. (Good for MASS intrinsics) ...
http://www.spscicomp.org/ScicomP5/Presentations/ Tutorial/Daresbury.POWER4.Tuning.tut.pdf
This link appears to go to a presentation by IBM.
frsqrte and fres are also marked as 'optional' and have been present on every processor after the 601 (they are absent on the 601 itself). The older of my databooks marks it (fsqrts) as optional and the more recent one, published by IBM Microelectronics © 1994, does not.
I seem to remember using fsqrts, gentlemen.
Make sure you read my post immediately above this one (you might have missed it, since you've only responded to DoG's above that). I'd hate to have spent all that time for nothing

