Please help me conceptually with extended character support

Sage
Posts: 1,199
Joined: 2004.10
Post: #1
I'm writing a glyph-matrix-texture font renderer, and it's working pretty well, so long as I deal only with 7-bit ASCII.

Now, naively, I thought I'd take a stab at supporting extended characters by extending the range of glyphs drawn into the texture. E.g., instead of rendering just [32 -> 127], I did some tests on NSString to see what I got if I increased the upper bound of the loop and created a string using:

+(NSString*) stringWithCharacters: (const unichar *) chars length: (NSUInteger) count;

Sort of like:
int character;
for ( character = 32; character < 1000; character++ )
{
    unichar c = (unichar) character;
    NSString *s = [NSString stringWithCharacters: &c length: 1];
    NSLog( @"%3d : %@", character, s );
}

So, looking at the output, I found that if I iterated up to 382 I got pretty much all the European accented characters following the normal 7-bit ASCII chars. In principle, this is pretty good, since all 350 glyphs will fit on a reasonably sized 8-bit texture.

My problem is that I have *no* idea how UTF-8 or Unicode or what-have-you string encodings work, so I can't actually render any of these extra glyphs.

And to reinforce my ignorance of the topic, I cut and pasted the name of a Sigur Ros song into a printf:
Code:
    const char *s = "Glósóli";
    printf( "%s\n", s );

And this is what was printed: "Gl\227s\227li"

So I thought: OK, let's look at char 227 from the table I generated above. Well, it was *not* an 'ó'. It was an accented 'a', nowhere close numerically to the right glyph.

Basically, I'm hoping somebody here can give me some pointers on how best to approach accented European character sets. Ideally, I'd like to know:

1) If there's a way that comfortably works with 8-bit strings, e.g., ( const char * )
2) What the character table is, for #1
3) If there's no good 8-bit encoding, what does Unicode look like, both from a character-table standpoint and from a programmer's standpoint? How, for example, would you create a Unicode string literal in a text file ( or in a source code file, like a .cpp or .m )?

Anyway, I'm just hoping somebody can help. There are quite a few folks here who are not from natively English-speaking countries, so I assume somebody must have dealt with this!

P.S. I'm really only interested in the accented characters of European languages, which is to say, I'm not interested in Korean, Arabic, etc.

P.P.S. Before anybody brow-beats me with the google-stick, I am simultaneously googling, and reading articles on unicode.org. I just hope people here can give me some pointers and suggestions.

Thanks,
Moderator
Posts: 1,560
Joined: 2003.10
Post: #2
TomorrowPlusX Wrote: And this is what was printed: "Gl\227s\227li"

So I thought: OK, let's look at char 227 from the table I generated above. Well, it was *not* an 'ó'. It was an accented 'a', nowhere close numerically to the right glyph.
To answer this particular part of your post: it looks as though those are octal numbers. Octal 0227 == decimal 151, and 151 is 'ó' in the MacRoman encoding.
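A quick sanity check you can compile, if you're curious:
Code:
    #include <stdio.h>

    int main( void )
    {
        /* In a C literal, \227 is an octal escape: 2*64 + 2*8 + 7 = 151. */
        printf( "%d\n", (unsigned char) '\227' );  /* prints 151 */
        return 0;
    }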
Member
Posts: 204
Joined: 2002.09
Post: #3
TomorrowPlusX Wrote: 1) If there's a way that comfortably works with 8-bit strings, e.g., ( const char * )
2) What the character table is, for #1
3) If there's no good 8-bit encoding, what does Unicode look like, both from a character-table standpoint and from a programmer's standpoint? How, for example, would you create a Unicode string literal in a text file ( or in a source code file, like a .cpp or .m )?

1) UTF-8 is what you want. In short, it is a variable-length encoding which acts just like a C programmer would expect it to (i.e., all of the lower ASCII range is unchanged, and a null byte can still be used to represent the end of a string). UTF-8 doesn't handle ALL of the languages out there (that's what UTF-16 is for), but it does handle 95% of them. (There's a quick demo after this list.)

2) More tricky to answer. The simpler solution is to use [insert-your-font-API-here] to render your character map to a texture, and then use that. As for auto-generating the progressively increasing code points, you'll need to examine the UTF-8 spec for how exactly to do that.

3) You need to save the file as UTF-8, and then type your string in your language of choice. I've never had luck putting them directly in source files; instead, they should be put in localized .strings files and loaded using the appropriate API.
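To illustrate point 1, here's a minimal sketch; the escaped bytes below are just "café" spelled out in UTF-8:
Code:
    #include <stdio.h>
    #include <string.h>

    int main( void )
    {
        /* ASCII bytes are unchanged in UTF-8, and no multi-byte
           sequence ever contains a zero byte, so strlen() and
           friends keep working (they count bytes, not characters). */
        const char *utf8 = "caf\xC3\xA9";
        printf( "%lu bytes, 4 characters\n", (unsigned long) strlen( utf8 ) );
        return 0;
    }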
Sage
Posts: 1,199
Joined: 2004.10
Post: #4
I spent my lunch break reading a ton of resources on UTF-8 encoding. It does, in fact, seem the way to go.

My intent now is to have my font simply render a (const char *) string -- assuming it to be UTF-8, since 7-bit ASCII is a valid subset -- by examining each byte: if the high bit is set, convert it and the subsequent byte ( or more ) into a Unicode index.

I'm looking for existing code that will do this for me, since it'd probably be of higher quality. In particular, I'm looking into mbtowc() -- but I'll probably have to write my own, since I'm not certain how mbtowc() reports how many bytes were consumed to make the wide char. I'm a little scared of this, btw. There's very little sample code, and the code I've found is pretty heady stuff. All I want is to be able to iterate through the bytes of a UTF-8 string and get Unicode indices.
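For the record, here's the sort of minimal decoder I have in mind. Just a sketch -- the function name is mine, and it does no validation of malformed or overlong sequences:
Code:
    /* Decode one UTF-8 sequence starting at *s, returning the code
       point and advancing *s past the bytes consumed. No validation
       of malformed input -- a sketch, not production code. */
    unsigned int NextCodepoint( const unsigned char **s )
    {
        const unsigned char *p = *s;
        unsigned int cp;
        int i, extra;

        if ( p[0] < 0x80 )      { cp = p[0];        extra = 0; } /* 0xxxxxxx */
        else if ( p[0] < 0xE0 ) { cp = p[0] & 0x1F; extra = 1; } /* 110xxxxx */
        else if ( p[0] < 0xF0 ) { cp = p[0] & 0x0F; extra = 2; } /* 1110xxxx */
        else                    { cp = p[0] & 0x07; extra = 3; } /* 11110xxx */

        for ( i = 0; i < extra; i++ )
            cp = ( cp << 6 ) | ( p[1 + i] & 0x3F );  /* 10xxxxxx */

        *s = p + 1 + extra;
        return cp;
    }

Then rendering is just: while ( *p ) { drawGlyph( NextCodepoint( &p ) ); } or thereabouts.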

As to the question of *how* you get UTF-8 text in, well, it could be any number of ways. I realize now that I can define a wide-character string literal via the prefix "L", as in:

const wchar_t *str = L"Glósóli";

But most strings will be coming in from config files, so I'll have to make certain they're encoded properly.
Sage
Posts: 1,232
Joined: 2002.10
Post: #5
An alternative to UTF-8 that will still cover most of Europe (but not Japanese, etc.) is the MacRoman encoding.

If you want code that renders MacRoman chars 32-255 to a font texture, take a look at Untima (source available from the "more info" link). It does not deal with 2D bin packing, so it's pretty inefficient, but you can see how to handle the string encodings.
Sage
Posts: 1,232
Joined: 2002.10
Post: #6
Hm, I realized after posting that Untima was doing this in a very unsafe way; it just uses "%c", which picks up the user's default encoding (the one chosen when they installed the OS). Use [NSString stringWithCString: encoding:] to force the encoding.
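Something along these lines (the bytes here are hypothetical; 0x97 is 'ó' in MacRoman):
Code:
    /* Force MacRoman rather than trusting the user's default
       C-string encoding. */
    const char bytes[] = { 'G', 'l', (char) 0x97, 's', (char) 0x97, 'l', 'i', 0 };
    NSString *s = [NSString stringWithCString: bytes
                                     encoding: NSMacOSRomanStringEncoding];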
Luminary
Posts: 5,143
Joined: 2002.04
Post: #7
KittyMac Wrote: UTF-8 doesn't handle ALL of the languages out there (that's what UTF-16 is for), but it does handle 95% of them.

Incorrect. UTF-8 handles all of Unicode.

UTF-8 is almost certainly what you want.

Don't put non-ASCII characters into your string constants -- their value in the compiled program is at the mercy of your text editor and the compiler.

Obviously, building a texture with all the tens of thousands of Unicode characters in it is not feasible, so you'll need a more intelligent caching mechanism. FTGL has a pretty good one if you want a reference / code to steal.

http://wikipedia.org/wiki/Utf-8 describes how to process UTF-8 into something more useful (UCS-4).

In addition to simply being able to render the characters, some languages have additional layout logic required. Hebrew, for example, is right-to-left, and Arabic, Thai, etc. require *much* more complex processing.

If you're Mac-only, the best way to get complete international text support is to use NSTextView to render the text, and copy from it to a texture.

If you need cross-platform support, it's possible to get FreeType + Pango working on Mac, Linux and Windows, rendering to a texture. The Windows side of things is *very* not fun, though.
Oldtimer
Posts: 834
Joined: 2002.09
Post: #8
Quote: UTF-8 is almost certainly what you want.
Out of curiosity, where does that leave me with my nifty wchar_t STL setup? That's Unicode or UCS-4, right?
Luminary
Posts: 5,143
Joined: 2002.04
Post: #9
Careful; sizeof(wchar_t) is not particularly standard... I think you'll find MSVC uses 2 and GCC uses 4, but I could be wrong.

If it's 2, then you're in a tricky spot -- 2 bytes is not enough to fit all of Unicode, so you have the choice of dropping the characters that fall off the top, or using UTF-16.

If it's 4, then there's no problem -- it's UCS-4, 4 bytes per Unicode character, no fanciness.
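Easy enough to check on any given compiler, e.g.:
Code:
    #include <stdio.h>
    #include <wchar.h>

    int main( void )
    {
        /* Typically prints 4 with GCC, 2 with MSVC. */
        printf( "sizeof(wchar_t) = %lu\n", (unsigned long) sizeof( wchar_t ) );
        return 0;
    }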
Sage
Posts: 1,199
Joined: 2004.10
Post: #10
Hey, I just wanted to tell you all I got it working. I'm using UTF-8 ( I wrote my own parser yesterday ), since it's basically a universal standard. Admittedly, since I'm using a glyph matrix texture, I'm only supporting a small range of Unicode; I only support the glyphs in [32 - 127] + [161 - 382], but that seems to be enough for basic European characters.
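The code-point-to-cell mapping is trivial with two contiguous ranges; roughly like this (names made up, and the real code also special-cases tabs and newlines):
Code:
    /* Map a Unicode code point to a cell index in the glyph matrix,
       assuming cells are laid out as [32..127] followed by [161..382].
       Returns -1 for code points outside the supported ranges. */
    int GlyphCellForCodepoint( unsigned int cp )
    {
        if ( cp >= 32 && cp <= 127 )   return (int)( cp - 32 );
        if ( cp >= 161 && cp <= 382 )  return (int)( 96 + cp - 161 );
        return -1;
    }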

Here's a screenshot; you can see it passes the "Zapfino" test, and at the bottom the Sigur Ros album title is correctly displayed.

[Image: GlyphGen.png]

Also, I've implemented three types of bounding rect calculations, to make it easier to properly align text in boxes ( such as vertically centering text in a button using the font's cap-height, as opposed to its ascender ).

Anyway, I'm nearly done! I'll post the source for the generator app and the C++ Font class I'm using for display, if anybody's interested.

Features:
- Exports metrics alongside the texture, so you can get ascent, descent, cap height, x-height, and per-char advance
- Easy-to-use C++ Font class for rendering
- Three types of bounding rect calculations:
  - "BoundsCriteria_Default", which uses cap height and advance
  - "BoundsCriteria_Rendering", which uses the actual pixel coverage
  - "BoundsCriteria_AscentDescent", which uses ascent + descent and advance
- Supports tabs and newlines in the string, allowing multiline rendering
- Supports two-byte UTF-8; three- and four-byte sequences are parsed, but not rendered
- Uses an 8-bit alpha texture for the glyphs, so it's relatively light on memory; plus, that means you can render in any color, over any color, without ugly pre-multiplication artifacts
- Not certain, but it seems pretty fast, comparable to FTGL

To Do:
- Maybe implement a box-packing algorithm to use the space in the glyph texture more efficiently. That would be pretty hard, I suspect.
- Right now, the metrics are an ObjC class, serialized via NSKeyedArchiver. This works, and is endian-safe, but it means the C++ Font class depends on Cocoa to load. Fine with me, but obviously this would be an issue for cross-platform code.
Sage
Posts: 1,232
Joined: 2002.10
Post: #11
Sounds good.

Simple box packing.
Sage
Posts: 1,199
Joined: 2004.10
Post: #12
That's interesting. I was thinking along different lines, but that's a very interesting idea.

I've got one relatively big bug to solve before I can tackle box packing, etc. It turns out I'm not getting reliable glyph information for characters which aren't present in the font. It's peculiar: Cocoa will render a fallback font, but when I use NSLayoutManager's getGlyphs: range: method to get all my glyphs, the ones which aren't in the typeface have bad data.

I'd expect them to have no data, or to have the glyph data for the fallback font. But not *bad* data.

Grumble! I'm sure I can find a workaround.
Luminary
Posts: 5,143
Joined: 2002.04
Post: #13
I'd say it *fails* the Zapfino test... Zapfino includes lots of advanced kerning and shaping information, including alternate forms for various characters depending on context (e.g. "th", "Ti"). This is the sort of thing that Cocoa or Pango will handle seamlessly for you, but which is a *lot* of work to handle yourself...
Sage
Posts: 1,199
Joined: 2004.10
Post: #14
That's fair -- I'm explicitly not handling ligatures.

I guess I should explain why I'm moving away from FTGL. Quite simply, I don't feel comfortable distributing TrueType fonts that I didn't author, and the good ones don't have liberal licenses for redistribution. Secondly, I tried and tried, but I can't get reliable metrics from FTGL; it reports values which are consistently wrong.

So that's why I'm writing this.
Sage
Posts: 1,199
Joined: 2004.10
Post: #15
I just wanted to post an update. I integrated the new type engine into my game, and profiled the rendering performance of a text stress-test using my old FTGL code and my new API.

Earlier I had supposed the performance was good, comparable to FTGL. Well, when I actually tested and compared, the FTGL stress test got 30 fps, and my new code got 120. 120!

So, the moral of the story is that if you're using FTGL, and your game has a lot of text or a complex HUD with lots of widgets and text rendering, and you need a speedup and can live with simple text rendering, you might want to drop FTGL.

Anyway, that's all. I know OSC isn't happy that I'm not doing ligatures or supporting RTL languages ( since I'm really only rendering a tiny subset of Unicode ), but frankly, on my crapulent GF 5200 Go this has resulted in a general performance boost everywhere, since I use an in-game HUD with a fair amount of text rendering. I knew from profiling that my old FTGL text rendering was slow, but I had supposed I'd just have to deal with it...