2020-06-04

The police are rioting

And it's happening in cities across the nation. If they could maybe, you know, not? And also refrain from killing (mostly black) people for funsies? And perhaps deign to allow some degree of accountability that it's increasingly obvious is lacking at present? And just generally stop confusing themselves with soldiers at war minus Geneva Convention rules? That'd be great.

2020-03-31

TDoV 2020

(The following was written for presentation as part of a local Unitarian Universalist service celebrating the 2020 Transgender Day of Visibility. The opening directly quotes from a previous blog entry, so should look familiar to anyone who's read that.)


It's been about two years, now, since one branch of a major Christian church justified its anti-transgender stance with an official statement bearing a name that referenced Genesis 1:27, as though this were not only an obvious and natural conclusion from the text of the verse, but the only conclusion possible.

As the verse reads in the New International Version: "So God created mankind in his own image, in the image of God he created them; male and female he created them."

"And" should be such a simple word.

Particularly for something written in a time when women were often considered little, if at all, better than property, that verse from Genesis makes a fascinating assertion. The image of God is not the sole domain of the male, nor, for that matter, of the female. Neither masculine nor feminine is better or more Godly. The man cannot say to the woman that he is more favored of God, nor the woman say to the man that nothing of God is in him. All are part of God's creation and exhibit facets of God's nature.

Created "in the image of God... male and female". Not only does this put both on the same level, it seems an unavoidable conclusion that the image of God is therefore neither purely male nor purely female, that this is a God who is both and neither. Why should those created in that image have to be purely one or the other?

But that isn't what not only religious groups, but also culture and media have for many years shown us. Schoolchildren are divided up into boys on one side, girls on the other. A popular book asserts that "Men Are from Mars, Women Are from Venus". Facilities are labeled "ladies" or "gentlemen". Paperwork often demands an "M" or an "F", even when it's completely irrelevant. Stores keep women's clothes over here and men's clothes over there, as though they were intrinsically incompatible. Seldom is there any hint that these might not be rigid, eternal, unchangeable certainties, much less that there could be anything between or beyond them.

So perhaps it's not so strange that I can remember being in high school with a fuzzy not-really-understanding that there were "normal" people over here, and "the gays" over there. I knew I liked girls too much to be one of "the gays" (which was just about all I knew about what that even meant), and even if boys sometimes sparked my interest, too, I also had a vague cultural-awareness sort of sense that "the gay" was best avoided. So I mostly tried not to think about the idea that I might not be "normal", either, because what else was there? And If I developed a certain fascination with things like unisex names, or fiction that involved people magically changing sex, or wondering how my life would be different if I had been born a girl, surely it didn't mean anything in particular, but was nothing more than idle curiosity from a mind bored with the tedium of schoolwork.

The media has generally gotten somewhat better since then, but even now, a lot of "representation" is still stuck on the idea that if you're not toeing the gender line, then you're either dead or working the street corners or trying to trick people for malicious reasons, or, at best, you're a punchline. Even attempts to be more respectful and speak to transgender people as people have an unfortunate tendency to get uncomfortably focused on the state of people's genitalia—which, to paraphrase John Oliver, is, medically speaking, none of anyone else's business. And, as with so many other topics, journalists are often so eager to avoid taking sides that they'll put verifiable facts, legitimate scientific studies, and real people's safety and wellbeing on the same footing as fanciful speculation, extremist propaganda, and baseless fearmongering.

The Internet, although it can amplify the worst voices, also allows others to be heard. For instance, I'm reading a webcomic by a trans author with characters that include a trans girl who's gone full time after moving and changing schools, a trans man who is her therapist, at least one lesbian, several people who are bisexual, a gay boy who likes to wear dresses and be pretty now and then, one girl who's pansexual, another who's described as asexual and homoromantic, someone else who's genderfluid and usually uses they/them pronouns, an intersex girl who was raised as a boy, a trans boy who's just starting to figure that out, a boy who might be gay or bi and is both curious and terrified to find out, and various authority figures and family members who run the gamut from supportive to confused to indifferent to openly hostile. Another webcomic, in a more sci-fi/fantasy genre involving transformation technology and magic, includes characters such as one with form-changing spells who strikes me as bigender and recently used the description "gender-casual", a female duplicate of an originally male character who has described herself as homoromantic and bisexual, a part-alien shapeshifter who's most likely demisexual and agender, a transformation enthusiast who has been confirmed as genderfluid, and someone who finds people of all genders attractive and appreciates sexuality between other people but is repulsed by the idea of being personally involved—and is there even a word for that? (ETA: of course there is)

But you're not likely to find nearly so much visible diversity unless you deliberately seek it out. Most of these concepts still aren't really out there in the public consciousness, much less taken for granted. And, ideally, they should be, if for no other reason than making sure that everyone knows they don't have to try to force themselves into this box or that box. There's no single standard that everyone has to live up to, and no preset path that everyone has to follow. There's just people being who they are as well as they can manage in the circumstances available to them.

2020-03-29

Text Encoding and Compression (featuring Lufia 2)

One of the things I have a lot of on my website is attempted translations of video game text, mostly SNES-era RPGs since that's what I grew up on and have the most familiarity with. This started out as something I did out of my own personal curiosity and for practice, before it occurred to me that I might as well go ahead and share it. Here's a little more insight into part of the process.

Text dumping overview


If you're interested in translating the text of a game, the first thing you need is, of course, that text. Naturally, you could transcribe it from a playthrough of the game, reading it off the screen and typing it manually, but that's both time-consuming and error-prone. It also has the drawback that you need to get the text to display in-game before you can do anything with it, which is problematic in scenes with multiple branches and variants, and means you may never even know about easily-missed text, let alone text that was left entirely unused.

Ideally, what you want is to pull the text directly from the game data. One of the biggest problems with that, assuming you have access to the game's raw data in the first place, is that the text is often encoded in some nonstandard fashion, and usually also compressed on top of that. This is particularly true of cartridge-era console games, which had to be developed within strict storage limits. All that often makes it difficult to even locate the text, much less extract it in an intelligible format.

Lufia II has been probably the most interesting case I've encountered so far. This game's text encoding (in the English version, at least) is based on standard ASCII, a rarity among console games of its time. That sounds like it ought to make picking out the text fairly easy. And in some places, it does! Spell, item, and monster names, for instance, are entirely readable without any additional interpretation, though there's sometimes some unreadable stuff between names. For instance, part of the item list follows (they're fixed-length entries; line breaks are added here for readability):

Charred newt
Potion      
Hi-Potion   
Ex-Potion   
Magic jar   
Hi-Magic    
Ex-Magic    
Regain      
Miracle     
Antidote    

However, most of the dialog looks more like this (unprintable bytes are shown as their hex values in braces, using the convention of a dollar sign prefix):

Y{$11}3{$01}{$00}1{$04}{$04}7{$0A}fI{$8E}{$06}5 {$05}y ok{$0E}{$09}? {$05}{ {$8C}Ðt{$03} {$81}Ì·ð{$01} aI{$9E}{$06}@ {$06}{$97}{$8E}¸ {$06}è{$03} {$06}N {$0A}É"y{$84}{$06}« {$8B}{$0E}{$05}{$7F} {$06}Ì.{$01}

So what went wrong? As I mentioned above, the encoding is only based on ASCII. Quite a few byte values have special meanings (for instance, $09 in a text block stands for the protagonist's name). More importantly, the text is compressed to save space. Not only that, this game uses three different types of text compression.

Basic text compression


Let's start with the simplest one. Numerous byte values, including everything from $80 through $FF, stand for common combinations of (usually) two characters. (This only applies to the English version. The Japanese version uses single-byte codes for each of 162 kana, plus 52 upper- and lowercase English letters, 10 digits, 15 punctuation symbols, and a space, leaving it very few spare bytes for that sort of thing.)

Figuring these out, along with bytes that act as various types of control codes (newline, end message, etc.) is generally a process of trial and error. A good starting point is finding something that's more obviously text, finding where it displays in the game, and comparing the two. Editing the data to see what the game does with it is another way of getting more useful information.

This game's ASCII-based encoding made it possible to find useful text just by skimming through the data in a standard hex editor. For games that use different encodings, specialized editors intended for this sort of thing tend to have a "relative search" function that work on the assumption that 'a', 'b', 'c', and so on have consecutive byte values. Searching for "cafe" with one of these, for example, will find sequences of bytes with values {x+2}{x}{x+5}{x+4}, which is usually enough to find something useful.

(Finding Japanese text with unknown encoding works similarly, but can be trickier. Many games use the typical あいうえお... sequence, but others got more creative and came up with sequences like あアいイうウ... or あぁいぃうぅ... or various other orderings that, okay, sure, do make sense in their own way, but are completely nonstandard. And that's not even getting into kanji. If the game uses any English words or phrases in its text anywhere, which is more common than you might expect, it may be less trouble to start by searching for those.)

In any case, with the single-byte codes figured out, adding those to our table results in something slightly more readable:

Y{$11}3<$01=NEXT>
<$00=END>
1{$04}{$04}7{$0A}fIs {$06}5 {$05}y ok,<br>
{Maxim}? {$05}“ a lot<br>
of money.<$01=NEXT>
aIt {$06}@ {$06}wes us {$06}ta<br>
{$06}N {$0A}ma"y's {$06}s. is,<br>
{$05}{$7F} {$06}mo.<$01=NEXT>

Multi-byte compression


That's starting to look a bit like actual dialog, but there's still a lot missing. This brings us to the second type of compression: two-byte codes. In the Japanese game, these are used to represent kanji (ideograms originally adapted from Chinese). English obviously doesn't have those, so many localized games that include kanji in their Japanese version repurpose the two-byte code system for text compression in the English version.

Notice how the text above has quite a few $05 and $06 bytes? In this game, those are the first bytes of two-byte codes. The English version encoding uses these to encode 512 different three- to fifteen- character words using only two bytes each, from $0500 for "Congratulations" to $06FF for "She". (On a side note, the Japanese version treats $07 similarly to $05 and $06, allowing for up to 255 additional kanji codes, of which 121 are used. However, $07 goes unused in the English version.)

Figuring these codes out is helped by the fact that the game needs a way of knowing what they stand for, too. Its data includes a listing, in plaintext, of all the compressed strings, one after another. (There's also a listing for the one-byte codes that comes right after, but the strings for that are so short that it would be nearly impossible to find it by searching without already knowing its contents in the first place.)

With that determined, we can add those to our encoding table and try again:

Y{$11}3<$01=NEXT>
<$00=END>
1{$04}{$04}7{$0A}fIs that really ok,<br>
{Maxim}? That's a lot<br>
of money.<$01=NEXT>
aIt just tells us how<br>
good {$0A}ma"y's work is,<br>
that's all.<$01=NEXT>

More complicated compression


We're getting close, but there are still two major issues. First, there's all the apparent garbage at the beginning, plus the apparently extraneous 'f' and 'a' that prefix the mostly-readable text. Those are easy enough to edit out when copying, though, so we'll worry about them later.

A more pressing concern is the awkward {$0A}ma" that's still in the middle of one of the lines. There's yet another form of compression going on here! If you guessed that it's a three-byte compression code, you'd be mostly right. This one isn't nearly as straightforward as the others, though.

Three-byte codes starting with $0A form what I've previously called something of a sophisticated ditto mark. Rather than each combination of values referring to a single predetermined string of text, the two bytes after the $0A (without getting any further into the technical details) basically tell the script engine to copy X characters of text from Y bytes earlier in the script.

(The Japanese version, incidentally, applies these $0A codes far more extensively and effectively than the English version, which uses them only infrequently and never copies anything that contains two-byte codes. I have to wonder if there was an automated tool involved that wasn't available, or simply wasn't used, for the localization.)

This isn't something a basic script dumper is capable of handling. As a result, I eventually ended up writing a simple program that would process just the $0A codes, and then ran the output from that through a regular script dumper:

Y{$11}3<$01=NEXT>
<$00=END>
1{$04}{$04}7oo going, Guy!s that really ok,<br>
{Maxim}? That's a lot<br>
of money.<$01=NEXT>
aIt just tells us how<br>
good Jaffy's work is,<br>
that's all.<$01=NEXT>

And now we've finally—wait, what happened to the beginning of the first line of previously-readable text? Although not shown here, I also found that similar garbling occurred in the middle of certain lines, seemingly at random, at least at first glance.

If you look at the more basic dumps, you'll notice that there's an $0A byte in the apparent garbage before the first readable line. Similarly, all the mid-text garbling corresponded to $050A and $060A (and $070A in Japanese) two-byte codes. In these particular cases, the $0A byte is part of a longer code, so does not indicate a lookback and should not be handled as one. But the crude $0A decoder program has no understanding of context and no possible way of knowing that it shouldn't be doing its thing there.

This brings us back to something we've been ignoring, but has become a more important question: What exactly is all that "garbage" that so often appears before and between segments of readable text? It must have some purpose, or it wouldn't be there, right?

Scripting data


Text isn't all of the data that a game needs to create a scene. Characters arrive, leave, and move around; parts of the map may shift and change; animations and sound effects play; the background music may stop or switch tracks. Besides text, the game also needs data that tells it what to do for all of that.

Most of the games I've dealt with so far put their scripting data in one place, and keep their text data sequestered by itself in its own separate area that contains only text and nothing else. When the script needs to display text, it will have an instruction that basically says, "show the text that's stored at offset $6B39", or something along those lines. That's convenient for editing the script and the text semi-independently of each other, and is particularly handy when the text needs to be translated, for example. (And, incidentally, makes things easier on anyone trying to extract the text!)

Lufia II doesn't do it that way. The script itself and the text that it displays are stored together, though they're at least grouped by map and then by NPC or event trigger. All the "junk" that we've been seeing in the dumps so far is actually scripting data! And since most bytes have one meaning as a scripting instruction and a different one as text, this also means that any fully functional parser needs to know when and how to switch between script and text modes.

So, after finding some technical documents by Relnqshd, including one that explains a significant number of the script codes, I finally decided to make an attempt at writing a proper script parser, not just something to dump text. The result for that same block of data now looks more like this:

{$59 11: Fade in from black, speed(?) 17.}
{$33 01 00: Apply movement path #1 to Actor $00 (Maxim).}
{$31 04 04: Actor $04 (Tia) faces right until next movement.}
{$37 0A: Brief delay, lasting 10 frames.}
{$66: Tia speaks:}
    Is that really ok,<br>
    {Maxim}? That's a lot<br>
    of money.<$01=NEXT>
{$61: Maxim speaks:}
    It just tells us how<br>
    good Jaffy's work is,<br>
    that's all.<$01=NEXT>

Now, it's far from perfect. My understanding of the script codes remains limited, so there are still quite a few bytes that the program doesn't know what to do with at all (except possibly to group the arguments with the instructions), along with others that are little more than placeholders. And, of course, bytes taken out of context will still be misinterpreted, like with the crude $0A parser, with every additional value the program tries to parse having the potential both to create new ways for that to happen (by interpreting additional byte values) and to avert old ways (by "consuming" additional instruction arguments before the parser tries to interpret them).

It's still a big improvement, though, and it's been fascinating to discover how some of the scenes work! I'll probably explain at least a few of them in future posts.