#import <ppc_intrinsics.h> for Universal Binary?


Danlab
2006.04.03, 08:52 AM
How can I replace ppc_intrinsics.h?

I only need to replace __fres and __frsqrte.

Any ideas?

akb825
2006.04.03, 12:24 PM
You could just do 1/x for __fres (reciprocal) and 1/sqrtf(x) for __frsqrte (reciprocal square root). If you are using them and it's putting them in there for optimization purposes, then are you linking to the 10.4 Universal SDK rather than the 10.3.9, 10.4, or "Current OS" SDKs? If not, you must link to the 10.4 Universal SDK. (This assumes you're in Xcode; I don't know how to do that from the command line.)

OneSadCookie
2006.04.03, 04:35 PM
sqrt() calls on Intel will translate to a single machine instruction, so you may not need this optimization any more...

completely untested:

#include <xmmintrin.h>

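/* _mm_cvtss_f32 extracts the low float from the __m128 result; the last statement needs its ';' */
#define __frsqrte(f) ({ float _f = (f); _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(_f))); })
#define __fres(f) ({ float _f = (f); _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(_f))); })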
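
One way to keep a single source file building for both halves of a Universal Binary is to pick the implementation at compile time. A rough sketch only, assuming Apple GCC's __ppc__/__ppc64__ macros and the compiler-defined __SSE__; fast_recip and fast_rsqrt are names invented here, and the last branch is an assumed fallback for anything else:

```c
#if defined(__ppc__) || defined(__ppc64__)
#include <ppc_intrinsics.h>   /* PowerPC: use the real estimate instructions */
static inline float fast_recip(float x) { return __fres(x); }
static inline float fast_rsqrt(float x) { return (float)__frsqrte(x); }
#elif defined(__SSE__)
#include <xmmintrin.h>        /* Intel: SSE estimate instructions */
static inline float fast_recip(float x)
{
    /* rcpss works on the low lane; _mm_cvtss_f32 reads it back as a float */
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
}
static inline float fast_rsqrt(float x)
{
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}
#else
#include <math.h>             /* anything else: exact fallback (assumption) */
static inline float fast_recip(float x) { return 1.0f / x; }
static inline float fast_rsqrt(float x) { return 1.0f / sqrtf(x); }
#endif
```

The estimates have different precision on the two processor families, so don't expect results to match bit for bit across architectures; compare against a tolerance instead.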

Danlab
2006.04.03, 07:49 PM
thanks going to try that :-)

Chris Ball
2006.04.15, 01:33 AM
Those are a single instruction on PPC as well, but not single-cycle.

OneSadCookie
2006.04.15, 04:00 AM
the PowerPC has reciprocal square root estimate and reciprocal estimate, yes, but only the G5 has a full precision non-reciprocal square root instruction. All Intel Macs have such a thing:

iMacCoreDuo:~ keith$ cat > test.c
#include <math.h>

float square_root(float f) { return sqrtf(f); }
iMacCoreDuo:~ keith$ gcc -c -O2 -arch ppc test.c
iMacCoreDuo:~ keith$ otool -tV test.o
test.o:
(__TEXT,__text) section
_square_root:
00000000 b 0x20 ; symbol stub for: _sqrtf
iMacCoreDuo:~ keith$ gcc -c -O2 -arch ppc -mcpu=G5 test.c
iMacCoreDuo:~ keith$ otool -tV test.o
test.o:
(__TEXT,__text) section
_square_root:
00000000 fsqrts f1,f1
00000004 blr
iMacCoreDuo:~ keith$ gcc -c -O2 -arch i386 test.c
iMacCoreDuo:~ keith$ otool -tV test.o
test.o:
(__TEXT,__text) section
_square_root:
00000000 pushl %ebp
00000001 movl %esp,%ebp
00000003 subl $0x04,%esp
00000006 sqrtss 0x08(%ebp),%xmm0
0000000b movss %xmm0,0xfffffffc(%ebp)
00000010 fldsl 0xfffffffc(%ebp)
00000013 leave
00000014 ret

PowerMacX
2006.04.15, 04:06 AM
From TN2087 (http://developer.apple.com/technotes/tn/tn2087.html):
The G5 has a full-precision hardware square root implementation. If your code executes square root, check for the availability of the hardware square root in the G5 and execute code calling the instruction directly (e.g. __fsqrt()) instead of calling the sqrt() routine. (Use __fsqrts() for single-precision.) You can use the GCC compiler flags -mpowerpc-gpopt and -mpowerpc64 to transform sqrt() function calls directly into the PPC sqrt instruction.

Edit: Grr...OSC beat me to it...

OneSadCookie
2006.04.15, 05:34 AM
For completeness' sake:

iMacCoreDuo:~ keith$ icc -c test.c -O2
iMacCoreDuo:~ keith$ otool -tV test.o
test.o:
(__TEXT,__text) section
__text:
00000000 subl $0x0c,%esp
00000003 sqrtss 0x10(%esp,1),%xmm0
00000009 movss %xmm0,(%esp,1)
0000000e fldsl (%esp,1)
00000011 addl $0x0c,%esp
00000014 ret
00000015 nop
00000016 nop
00000017 nop


Looks like GCC 4 for Intel is producing rather suboptimal code for this simple case...

Chris Ball
2006.04.15, 06:50 PM
Originally posted by OneSadCookie:
the PowerPC has reciprocal square root estimate and reciprocal estimate, yes, but only the G5 has a full precision non-reciprocal square root instruction.

This is not so, what about fsqrt t,b and fsqrts t,b? Not to go on about it.

Also, don't forget that cycles count. Just because an instruction exists to do something doesn't mean it's faster than several instructions. The 5-bit frsqrte and 8-bit fres are done instantly with parallel logic. It's unlikely (though I don't know) that either the PPC or Intel does a division in a single cycle. It may take dozens.

OneSadCookie
2006.04.16, 01:23 AM
fsqrt and fsqrts are not available on the G3 or G4. Do some research before randomly arguing with me :|

And yes, I am well aware that just because it's a single instruction doesn't mean that it's "fast". It does, however, mean that you don't pay the (significant) overhead of a function call, particularly one into a dynamic library. I don't have a G5 handy to test with, but my recollection is that a fsqrt is about as expensive as a divide (30+ cycles latency). I'd also be very surprised if frsqrte and fres are "instant"; chances are they have at least a four-cycle latency. Again, no G5 to test with. Perhaps somebody who has one can post the numbers as reported by Shark.

Chris Ball
2006.04.16, 06:54 AM
Originally posted by OneSadCookie:
fsqrt and fsqrts are not available on the G3 or G4. Do some research before randomly arguing with me :|

OneSadCookie, that is certainly wrong. I have 2 data books and a pdf right in front of me, and they are all about a decade old, and they all have those instructions in them. I have been programming a wide variety of processors in assembler professionally for half my life and I don't like your tone at all. And I don't randomly argue with anyone.

I checked my FACTS in TWO PLACES before posting, please check yours!:mad:

I don't know when I've been so angry.

Originally posted by OneSadCookie:
And yes, I am well aware that just because it's a single instruction doesn't mean that it's "fast". It does, however, mean that you don't pay the (significant) overhead of a function call, particularly one into a dynamic library. I don't have a G5 handy to test with, but my recollection is that a fsqrt is about as expensive as a divide (30+ cycles latency). I'd also be very surprised if frsqrte and fres are "instant"; chances are they have at least a four-cycle latency. Again, no G5 to test with. Perhaps somebody who has one can post the numbers as reported by Shark.

frsqrte is 5-bits. My common sense tells me it is surely single-cycle. 50 gates should be enough to budget that. fres too. For that matter I don't see how an extra cycle could be used to reduce the gate count in either case, the instructions are just too weak.

DoG
2006.04.16, 07:50 AM
The PPC ISA contains fsqrt and fsqrts, and lists them as optional. This is straight from the Apple headers:

/*
* __fsqrt - Floating-Point Square Root (Double-Precision)
*
* WARNING: Illegal instruction for PowerPC 603, 604, 750, 7400, 7410,
* 7450, and 7455
*/

Latency for frsqrte is 3 or 4 cycles, according to my sources.

OneSadCookie
2006.04.16, 08:13 AM
The instructions are *optional*. They're defined in the original PowerPC spec, yes, but the range of processors that implement them is limited.

The PowerPC architecture includes a set of optional instructions:

General-Purpose Group—fsqrt and fsqrts.
Graphics Group—stfiwx, fres, frsqrte, and fsel.
If an implementation supports any instruction in a group, it must support all of the instructions in the group. Check the documentation for a specific implementation to determine which, if any, of the groups are supported

Section 1.5.2 of this document: http://www.freescale.com/files/32bit/doc/ref_manual/MPC750UM.pdf?srch=1 describes that G3s implement the "graphics group" but not the "general-purpose group"

Section 1.3.2.3 of this document: http://www.freescale.com/files/32bit/doc/ref_manual/MPC7410UM.pdf describes that G4s implement the "graphics group" but not the "general-purpose group"

G5s implement both, as described in section 2.2.4 of this document: http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/AE818B5D1DBB02EC87256DDE00007821/$file/970FX_user_manual.v1.6.2006FEB09.pdf

Why else would GCC only generate the fsqrt instruction when explicitly told it's generating code for G5 only?!

The documents I referenced have a detailed discussion of the latency of various instructions, but long story short, for a G3 or G4, the minimum latency of a floating-point arithmetic instruction is 3 cycles, and frsqrte executes within that. On the G5, frsqrte has a latency of 6 cycles, and fsqrt has a latency of 40.
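
To make the grouping concrete, here is a small lookup-table sketch in C that records only what the three manuals cited above say. The struct, the table, and ppc_has_fsqrt are all hypothetical names for illustration, and treating unknown models as lacking fsqrt is an assumption:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Which optional instruction groups a given PowerPC model implements. */
struct ppc_optional_groups {
    const char *model;
    bool general_purpose;  /* fsqrt, fsqrts */
    bool graphics;         /* stfiwx, fres, frsqrte, fsel */
};

/* Entries taken from the manuals referenced in this post only. */
static const struct ppc_optional_groups ppc_groups[] = {
    { "750",   false, true },  /* G3 */
    { "7410",  false, true },  /* G4 */
    { "970FX", true,  true },  /* G5 */
};

/* Returns true only if the model is listed as implementing fsqrt/fsqrts. */
static bool ppc_has_fsqrt(const char *model)
{
    for (size_t i = 0; i < sizeof ppc_groups / sizeof ppc_groups[0]; i++) {
        if (strcmp(ppc_groups[i].model, model) == 0)
            return ppc_groups[i].general_purpose;
    }
    return false;  /* unknown model: assume the instruction is absent */
}
```

Because the groups are all-or-nothing, one flag per group is enough; there is no need to track individual instructions.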

That's nearly 30 minutes of my time wasted proving stuff that I already knew :mad:

Chris Ball
2006.04.16, 08:13 AM
Originally posted by DoG:
Latency for frsqrte is 3 or 4 cycles, according to my sources.

It's hard to find timing data for the PPC, this is the closest I could get from google:
http://www.google.com/search?q=frsqrte&hl=en&lr=&client=safari&rls=en-us&start=10&sa=N

Optimization and Tuning on POWER4 ...
Single cycle fres and frsqrte. (Good for MASS intrinsics) ...
www.spscicomp.org/ScicomP5/Presentations/Tutorial/Daresbury.POWER4.Tuning.tut.pdf


This link appears to go to a presentation by IBM.

frsqrte and fres are also marked as 'optional' and have been present on every model after the 601 (they are absent on the 601 itself). The older of my databooks marks fsqrts as optional, and the more recent one, published by IBM Microelectronics (C) 1994, does not.

I seem to remember using fsqrts, gentlemen.

OneSadCookie
2006.04.16, 08:25 AM
Make sure you read my post immediately above this one (you might have missed it, since you've only responded to DoG's above that). I'd hate to have spent all that time for nothing :mad:

Chris Ball
2006.04.16, 08:36 AM
OneSadCookie, I have never heard of Freescale. Is the chip in my Mac manufactured by them or by IBM? Perhaps the compilers err too far on the side of caution.

Also I would point out that if you're upset, you have only yourself to blame. Your posting was inflammatory, not mine.

OneSadCookie
2006.04.16, 08:39 AM
Your posting was incorrect. I can't let that stand.

Freescale is the new name for Motorola's PowerPC business.

No, the compiler is not overcautious.

Chris Ball
2006.04.16, 08:46 AM
Thank you for your civility. Nothing I said was incorrect, and I do not doubt that all PPCs manufactured by IBM support fsqrts and provide a single-cycle fres and frsqrte.

It may or may not be that the one in my computer (G4) does not, apparently depending on the manufacturer.

You were nevertheless out of line when you accused me of speaking without checking.

OneSadCookie
2006.04.16, 09:13 AM
It does *not* depend on the manufacturer, only the model. The fact that G3s and G4s happened to be made by Motorola and not implement fsqrt, and that G5s happen to be made by IBM and do implement fsqrt is completely irrelevant.

And when refuting my statement about G3s and G4s, you checked only that the PowerPC ISA defined those instructions -- you didn't even go far enough to discover that they were optional, let alone checking the particular models that the discussion was in relation to.

Your posts #9:


the PowerPC has reciprocal square root estimate and reciprocal estimate, yes, but only the G5 has a full precision non-reciprocal square root instruction.

This is not so, what about fsqrt t,b and fsqrts t,b? Not to go on about it.

[snip]

and #11:


fsqrt and fsqrts are not available on the G3 or G4. Do some research before randomly arguing with me :|

OneSadCookie, that is certainly wrong.

[snip]

are undeniably incorrect; you never once qualified your statement to say "only IBM PowerPCs", and you specifically refuted what I said about the G3 and G4.

Posting incorrect information is bad enough, but trying to continue to claim that you're correct when you've been conclusively proven wrong is just plain rude. I'd appreciate an apology, but I'd settle for you just dropping the subject, at this stage.

Chris Ball
2006.04.16, 09:52 AM
OneSadCookie, I conceded that I was unaware that Apple uses chips from Freescale (I don't doubt that you and DoG are correct on that). I also stated that the 'optional' qualifier was absent in my IBM documentation from 1994. According to my databook right here (right or wrong) IBM G4 does support fsqrts.

While it is apparently true that the G4s used by Apple do not support fsqrts, I insist that I was quoting accurately from the data in front of me.

I admit that the bottom line is that the instructions are absent; but I also assert that I took extraordinary care not to post incorrectly.

I have re-read the postings and I'm satisfied that you have been rude, not me. Ban me if you want to.

Taxxodium
2006.04.16, 12:46 PM
Yes I like boobies too, now please calm down or I'll calm you down ;)

ThemsAllTook
2006.04.16, 12:53 PM
No one is going to ban anyone. This argument seems to have arisen from a minor misunderstanding, and spiraled out of control from there. Instead of turning this into a who-was-rude-to-who debate, can we please drop it and stay on topic? If either of you want to continue the discussion, please do so with the possibility in mind that your information is incorrect, and don't take a challenge of its correctness as an affront to your intelligence. The forum is not a contest. Approach with a humble and open mind, and you may have an opportunity to learn new things. Approach with a closed mind, and all you're likely to get is heated argument.

DoG
2006.04.16, 01:28 PM
Originally posted by Chris Ball:
...According to my databook right here (right or wrong) IBM G4 does support fsqrts.

Aren't you misreading something in your docs there? The G4 we are talking about is the PowerPC 7400, which IBM never manufactured. You must be thinking of the S/390 series IBM mainframes, which IBM also termed G-something, which supposedly do use some PowerPC chips to handle I/O, but rely on custom big-iron processors otherwise.

So, in conclusion, your databook probably describes the IBM S/390 G4 mainframe, while the discussion was about PowerPC 603, 604 (G2) / 750 (G3) / 7400 (G4) processors.