2020-03-29

Text Encoding and Compression (featuring Lufia 2)

One of the things I have a lot of on my website is attempted translations of video game text, mostly SNES-era RPGs since that's what I grew up on and have the most familiarity with. This started out as something I did out of my own personal curiosity and for practice, before it occurred to me that I might as well go ahead and share it. Here's a little more insight into part of the process.

Text dumping overview


If you're interested in translating the text of a game, the first thing you need is, of course, that text. Naturally, you could transcribe it from a playthrough of the game, reading it off the screen and typing it manually, but that's both time-consuming and error-prone. It also has the drawback that you need to get the text to display in-game before you can do anything with it, which is problematic in scenes with multiple branches and variants, and means you may never even know about easily-missed text, let alone text that was left entirely unused.

Ideally, what you want is to pull the text directly from the game data. One of the biggest problems with that, assuming you have access to the game's raw data in the first place, is that the text is often encoded in some nonstandard fashion, and usually also compressed on top of that. This is particularly true of cartridge-era console games, which had to be developed within strict storage limits. All that often makes it difficult to even locate the text, much less extract it in an intelligible format.

Lufia II has been probably the most interesting case I've encountered so far. This game's text encoding (in the English version, at least) is based on standard ASCII, a rarity among console games of its time. That sounds like it ought to make picking out the text fairly easy. And in some places, it does! Spell, item, and monster names, for instance, are entirely readable without any additional interpretation, though there's sometimes some unreadable stuff between names. For example, part of the item list follows (these are fixed-length entries; line breaks are added here for readability):

Charred newt
Potion      
Hi-Potion   
Ex-Potion   
Magic jar   
Hi-Magic    
Ex-Magic    
Regain      
Miracle     
Antidote    

However, most of the dialog looks more like this (unprintable bytes are shown as their hex values in braces, using the convention of a dollar sign prefix):

Y{$11}3{$01}{$00}1{$04}{$04}7{$0A}fI{$8E}{$06}5 {$05}y ok{$0E}{$09}? {$05}{ {$8C}Ðt{$03} {$81}Ì·ð{$01} aI{$9E}{$06}@ {$06}{$97}{$8E}¸ {$06}è{$03} {$06}N {$0A}É"y{$84}{$06}« {$8B}{$0E}{$05}{$7F} {$06}Ì.{$01}

So what went wrong? As I mentioned above, the encoding is only based on ASCII. Quite a few byte values have special meanings (for instance, $09 in a text block stands for the protagonist's name). More importantly, the text is compressed to save space. Not only that, this game uses three different types of text compression.

Basic text compression


Let's start with the simplest one. Numerous byte values, including everything from $80 through $FF, stand for common combinations of (usually) two characters. (This only applies to the English version. The Japanese version uses single-byte codes for each of 162 kana, plus 52 upper- and lowercase English letters, 10 digits, 15 punctuation symbols, and a space, leaving it very few spare bytes for that sort of thing.)
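
To sketch how this first layer decodes, here's a minimal pass over raw script bytes. The two pair mappings are inferred from the dumps in this post ($8E appears to expand to "s " and $9E to "t "); everything else about the table is illustrative, not the game's actual data.

```python
# Hypothetical fragment of the single-byte pair table; the real game
# assigns a (usually two-character) string to most values from $80 to $FF.
PAIRS = {0x8E: "s ", 0x9E: "t "}  # inferred from the dumps; for illustration

def decode_basic(data: bytes) -> str:
    """First-pass decode: ASCII passes through, known pair codes expand,
    and everything else is shown as {$XX}, like the dumps in this post."""
    out = []
    for b in data:
        if b in PAIRS:
            out.append(PAIRS[b])        # dictionary pair code
        elif 0x20 <= b < 0x7F:
            out.append(chr(b))          # printable ASCII maps to itself
        else:
            out.append("{$%02X}" % b)   # unknown control or compression byte
    return "".join(out)
```

Running this over a snippet like `49 8E 05` would yield `Is {$05}`, matching the style of the partial dumps shown below.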

Figuring these out, along with the bytes that act as various types of control codes (newline, end message, etc.), is generally a process of trial and error. A good starting point is finding something that's more obviously text, finding where it displays in the game, and comparing the two. Editing the data to see what the game does with it is another way of getting more useful information.

This game's ASCII-based encoding made it possible to find useful text just by skimming through the data in a standard hex editor. For games that use different encodings, specialized editors intended for this sort of thing tend to have a "relative search" function that works on the assumption that 'a', 'b', 'c', and so on have consecutive byte values. Searching for "cafe" with one of these, for example, will find sequences of bytes with values {x+2}{x}{x+5}{x+4}, which is usually enough to find something useful.
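
A relative search like that takes only a few lines to express. This is a hypothetical sketch of the general idea, not code from any particular editor:

```python
def relative_search(data: bytes, word: str) -> list[int]:
    """Find offsets where `word` appears shifted by a constant, unknown
    amount, i.e. data[start+i] - ord(word[i]) is the same for every i.
    Assumes the game's letters have consecutive byte values."""
    # Per-character offsets relative to the first letter,
    # e.g. "cafe" -> [0, -2, 3, 2].
    deltas = [ord(c) - ord(word[0]) for c in word]
    hits = []
    for start in range(len(data) - len(word) + 1):
        base = data[start]  # candidate byte value for the first letter
        if all(data[start + i] == (base + d) & 0xFF
               for i, d in enumerate(deltas)):
            hits.append(start)
    return hits
```

For example, if a game happened to encode 'a' as $10, then "cafe" would be stored as `12 10 15 14`, and searching for "cafe" would report that offset regardless of what value 'a' actually maps to.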

(Finding Japanese text with unknown encoding works similarly, but can be trickier. Many games use the typical あいうえお... sequence, but others got more creative and came up with sequences like あアいイうウ... or あぁいぃうぅ... or various other orderings that, okay, sure, do make sense in their own way, but are completely nonstandard. And that's not even getting into kanji. If the game uses any English words or phrases in its text anywhere, which is more common than you might expect, it may be less trouble to start by searching for those.)

In any case, with the single-byte codes figured out, adding those to our table results in something slightly more readable:

Y{$11}3<$01=NEXT>
<$00=END>
1{$04}{$04}7{$0A}fIs {$06}5 {$05}y ok,<br>
{Maxim}? {$05}“ a lot<br>
of money.<$01=NEXT>
aIt {$06}@ {$06}wes us {$06}ta<br>
{$06}N {$0A}ma"y's {$06}s. is,<br>
{$05}{$7F} {$06}mo.<$01=NEXT>

Multi-byte compression


That's starting to look a bit like actual dialog, but there's still a lot missing. This brings us to the second type of compression: two-byte codes. In the Japanese game, these are used to represent kanji (ideograms originally adapted from Chinese). English obviously doesn't have those, so many localized games that include kanji in their Japanese version repurpose the two-byte code system for text compression in the English version.

Notice how the text above has quite a few $05 and $06 bytes? In this game, those are the first bytes of two-byte codes. The English version uses these to encode 512 different three- to fifteen-character words in only two bytes each, from $0500 for "Congratulations" to $06FF for "She". (On a side note, the Japanese version treats $07 similarly to $05 and $06, allowing for up to 255 additional kanji codes, of which 121 are used. However, $07 goes unused in the English version.)
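
Once the word list is known, decoding these is a simple table lookup. The sketch below assumes the index is (first byte - $05) * 256 + second byte; only the two endpoint words come from the actual game, and the rest of the table is a placeholder:

```python
# Hypothetical 512-entry word table for codes $0500 through $06FF.
# Only the endpoints are known here; the middle entries are placeholders.
WORDS = ["Congratulations"] + ["?"] * 510 + ["She"]

def decode_word(first: int, second: int) -> str:
    """Decode a $05xx/$06xx two-byte code into its dictionary word."""
    assert first in (0x05, 0x06), "not a two-byte word code"
    index = (first - 0x05) * 256 + second  # 0..511
    return WORDS[index]
```

So the bytes `05 00` in a text block decode to "Congratulations", and `06 FF` to "She", with the full table recovered from the game's own plaintext listing described below.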

Figuring these codes out is helped by the fact that the game needs a way of knowing what they stand for, too. Its data includes a listing, in plaintext, of all the compressed strings, one after another. (There's also a listing for the one-byte codes that comes right after, but the strings for that are so short that it would be nearly impossible to find it by searching without already knowing its contents in the first place.)

With that determined, we can add those to our encoding table and try again:

Y{$11}3<$01=NEXT>
<$00=END>
1{$04}{$04}7{$0A}fIs that really ok,<br>
{Maxim}? That's a lot<br>
of money.<$01=NEXT>
aIt just tells us how<br>
good {$0A}ma"y's work is,<br>
that's all.<$01=NEXT>

More complicated compression


We're getting close, but there are still two major issues. First, there's all the apparent garbage at the beginning, plus the seemingly extraneous 'f' and 'a' that prefix the mostly-readable text. Those are easy enough to edit out when copying, though, so we'll worry about them later.

A more pressing concern is the awkward {$0A}ma" that's still in the middle of one of the lines. There's yet another form of compression going on here! If you guessed that it's a three-byte compression code, you'd be mostly right. This one isn't nearly as straightforward as the others, though.

Three-byte codes starting with $0A form what I've previously called something of a sophisticated ditto mark. Rather than each combination of values referring to a single predetermined string of text, the two bytes after the $0A (without getting any further into the technical details) basically tell the script engine to copy X characters of text from Y bytes earlier in the script.

(The Japanese version, incidentally, applies these $0A codes far more extensively and effectively than the English version, which uses them only infrequently and never copies anything that contains two-byte codes. I have to wonder if there was an automated tool involved that wasn't available, or simply wasn't used, for the localization.)
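
A crude expander in that spirit might look like the following. The operand layout here (one distance byte followed by one length byte) is a guess for illustration; the game packs the copy length and lookback distance differently. Note that, like the crude program described below, this version blindly expands every $0A it sees:

```python
def expand_lookbacks(data: bytes) -> bytes:
    """Expand $0A lookback codes: copy `length` bytes from `dist` bytes
    earlier in the (already-expanded) stream. The two-operand layout is
    a hypothetical simplification of the game's real format."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0x0A and i + 2 < len(data):
            dist, length = data[i + 1], data[i + 2]
            for _ in range(length):
                out.append(out[-dist])  # byte-at-a-time, so copies may overlap
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

With this layout, `"abcd"` followed by `0A 04 03` expands to `"abcdabc"`: three bytes copied from four bytes back. Copying byte-at-a-time from the output buffer is what lets this kind of scheme repeat runs longer than the lookback distance, as in LZ77-family compressors.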

This isn't something a basic script dumper is capable of handling. As a result, I eventually ended up writing a simple program that would process just the $0A codes, and then ran the output from that through a regular script dumper:

Y{$11}3<$01=NEXT>
<$00=END>
1{$04}{$04}7oo going, Guy!s that really ok,<br>
{Maxim}? That's a lot<br>
of money.<$01=NEXT>
aIt just tells us how<br>
good Jaffy's work is,<br>
that's all.<$01=NEXT>

And now we've finally—wait, what happened to the beginning of the first line of previously-readable text? Although not shown here, I also found that similar garbling occurred in the middle of certain lines, seemingly at random, at least at first glance.

If you look at the more basic dumps, you'll notice that there's an $0A byte in the apparent garbage before the first readable line. Similarly, all the mid-text garbling corresponded to $050A and $060A (and $070A in Japanese) two-byte codes. In these particular cases, the $0A byte is part of a longer code, so it does not indicate a lookback and should not be handled as one. But the crude $0A decoder program has no understanding of context and no possible way of knowing that it shouldn't be doing its thing there.

This brings us back to something we've been ignoring, but has become a more important question: What exactly is all that "garbage" that so often appears before and between segments of readable text? It must have some purpose, or it wouldn't be there, right?

Scripting data


Text isn't all of the data that a game needs to create a scene. Characters arrive, leave, and move around; parts of the map may shift and change; animations and sound effects play; the background music may stop or switch tracks. Besides text, the game also needs data that tells it what to do for all of that.

Most of the games I've dealt with so far put their scripting data in one place, and keep their text data sequestered by itself in its own separate area that contains only text and nothing else. When the script needs to display text, it will have an instruction that basically says, "show the text that's stored at offset $6B39", or something along those lines. That's convenient for editing the script and the text semi-independently of each other, and is particularly handy when the text needs to be translated, for example. (And, incidentally, makes things easier on anyone trying to extract the text!)
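
In that pointer-based arrangement, extracting a string is as simple as following the offset into the text bank and reading until a terminator. A minimal sketch, assuming a $00 terminator (the $6B39 in the prose above is just an example offset, and real games often add a bank base to it):

```python
def read_text_at(rom: bytes, offset: int) -> bytes:
    """Follow a 'show the text at offset N' instruction: read from
    `offset` up to a $00 terminator. The terminator byte is an
    assumption here; games vary."""
    end = rom.index(0x00, offset)  # position of the terminator
    return rom[offset:end]
```

The point is that the script side only ever stores a pointer, so the text can grow, shrink, or move without touching the event logic at all.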

Lufia II doesn't do it that way. The script itself and the text that it displays are stored together, though they're at least grouped by map and then by NPC or event trigger. All the "junk" that we've been seeing in the dumps so far is actually scripting data! And since most bytes have one meaning as a scripting instruction and a different one as text, this also means that any fully functional parser needs to know when and how to switch between script and text modes.
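
The core of such a parser is a loop that switches modes: in script mode it consumes an opcode plus a known number of argument bytes, and certain opcodes drop it into text mode until a terminator. Here's a toy sketch; the opcode argument counts and speaker codes are taken from the parsed scene shown in this post, but the table is otherwise a stand-in, not the game's real instruction set:

```python
# Opcode -> argument byte count, per the sample scene in this post.
OPCODE_ARGS = {0x59: 1, 0x33: 2, 0x31: 2, 0x37: 1}
# Opcodes that switch into text mode (speaker labels from the same scene).
TEXT_STARTERS = {0x61: "Maxim", 0x66: "Tia"}

def parse(data: bytes):
    """Walk a script block, switching between script and text modes."""
    i = 0
    events = []
    while i < len(data):
        op = data[i]
        if op in TEXT_STARTERS:
            # Text mode: read until the $00 (END) or $01 (NEXT) terminator.
            j = i + 1
            while j < len(data) and data[j] not in (0x00, 0x01):
                j += 1
            events.append((TEXT_STARTERS[op], data[i + 1:j]))
            i = j + 1
        elif op in OPCODE_ARGS:
            # Script mode: consume the instruction plus its arguments.
            argc = OPCODE_ARGS[op]
            events.append(("op_%02X" % op, data[i + 1:i + 1 + argc]))
            i += 1 + argc
        else:
            i += 1  # unknown byte; a real parser has to know every opcode
    return events
```

That final `else` branch is exactly where things go wrong in practice: any opcode the parser doesn't know means it can lose sync and start interpreting argument bytes as instructions, or vice versa.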

So, after finding some technical documents by Relnqshd, including one that explains a significant number of the script codes, I finally decided to make an attempt at writing a proper script parser, not just something to dump text. The result for that same block of data now looks more like this:

{$59 11: Fade in from black, speed(?) 17.}
{$33 01 00: Apply movement path #1 to Actor $00 (Maxim).}
{$31 04 04: Actor $04 (Tia) faces right until next movement.}
{$37 0A: Brief delay, lasting 10 frames.}
{$66: Tia speaks:}
    Is that really ok,<br>
    {Maxim}? That's a lot<br>
    of money.<$01=NEXT>
{$61: Maxim speaks:}
    It just tells us how<br>
    good Jaffy's work is,<br>
    that's all.<$01=NEXT>

Now, it's far from perfect. My understanding of the script codes remains limited, so there are still quite a few bytes that the program doesn't know what to do with at all (except possibly to group the arguments with the instructions), along with others that are little more than placeholders. And, of course, bytes taken out of context will still be misinterpreted, just as with the crude $0A parser. Every additional value the program learns to parse has the potential both to create new ways for that to happen (by interpreting additional byte values) and to avert old ones (by "consuming" instruction arguments before the parser tries to interpret them).

It's still a big improvement, though, and it's been fascinating to discover how some of the scenes work! I'll probably explain at least a few of them in future posts.
