Conquest of Elysium 4

What a CoE4 modder should know
By Marlin
This guide will focus on the very fundamentals of mod making: What computer files does a mod for Conquest of Elysium 4 consist of, and how, exactly, do you create and edit such files on your computer?
__________________________________________________

The idea that there might be a need for this sort of guide first struck me when I saw lots of modders struggle with the task of creating a TGA image file for use as a mod banner. I was about to write just a simple “make your own banner” guide. Then, when I saw that CoE4 has support for UTF-8 in text files, I realized that many modders could probably do with some help and explanations on that topic too – among users of Windows and Mac, at least, where UTF-8 still isn’t as ubiquitous as it is on Linux.

Perhaps the complete guide is a bit lengthy, but the idea isn’t that everybody would need to read the whole thing from start to finish. Just pick the parts that might be of interest to you.

The guide will not teach you how to create your own new CoE4 class – nor will it go through the list of available modding commands; there will be a modding manual for that. Only the two most fundamental modding commands are covered, the ones that need to be in every mod, along with explanations on how special characters are to be used in the mod.
   
What is a mod?
A mod is a way to change, or add to, some aspects of the game. It consists of a plain text file, the name of which should end with “.c4m”, as well as a banner, in the form of a 256×64 pixel TGA image file, to represent the mod in menus. There may be additional TGA images too.

When distributed (uploaded and downloaded on the Web), the mod should be a ZIP archive file, containing the c4m file at its root, and the rest of the files in a folder (with some suitable name). This will allow a user to just unpack the ZIP file in his or her mods folder and everything will go where it should. Unlike other archive file formats, the ZIP format can be handled out of the box by all modern operating systems.
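The packaging described above can be sketched in Python with the standard zipfile module. This is just an illustration; the function name and file names are placeholders, not anything the game or its tools define.

```python
import os
import zipfile

def package_mod(zip_path, c4m_path, asset_dir):
    """Pack a mod for distribution: the .c4m file at the archive root,
    the other files kept inside their subfolder (forward slashes in
    archive paths, as ZIP requires)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as z:
        # The c4m file goes at the root of the archive.
        z.write(c4m_path, os.path.basename(c4m_path))
        # Everything else stays under its subfolder name.
        subfolder = os.path.basename(asset_dir)
        for name in sorted(os.listdir(asset_dir)):
            z.write(os.path.join(asset_dir, name), subfolder + "/" + name)
```

Unpacking such an archive in the mods folder then puts the c4m file and the image subfolder exactly where the game expects them.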


The c4m file
The plain text file with name extension “.c4m” should be placed in the mods directory. You can find this directory if you ❶ start the game, ❷ select Mods, ❸ select Open Mods Directory (at the bottom of the list).

Within the c4m file you will write the modding commands, to achieve the changes and additions that you want, each command on its own line. Two commands should always be there though: icon and description. For instance:
icon "my_new_mod/banner.tga"
description "This is my new mod.^^And this comes after two newlines."
  • The text string given with the icon command is a relative file path to the banner image for the mod. In this example, the banner is in a subdirectory to mods, called my_new_mod, in a file called banner.tga.

    Note that an ordinary slash ( / ) is used to separate names of folders and files (like in Unix and in Internet URL paths), not the backslash ( \ ) used in Windows.

    Also, consider sticking with lowercase only in your file (and folder) names, to avoid problems on Linux. On Windows, it doesn’t matter if you are inconsistent in your use of letter case – if you name a file “Banner.tga” but then refer to it as “banner.tga” in your mod, it’ll still work. But on Linux it does matter, and it won’t work. With only a banner, the risk of inconsistencies is small, but with loads of sprite files too, it’s easy to forget an uppercase letter here or there.

  • The text string given with the description command will be shown whenever you, or any other user of your mod, right click on the mod in the mods list. (Also in the list called up with F6 while playing.)

    Actual newlines cannot be used within a text string. However, the character ^ (ASCII circumflex) will be replaced by a newline when the text is displayed by the game.
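In Python terms, purely as an illustration, the substitution the game performs when displaying the text looks like this:

```python
description = "This is my new mod.^^And this comes after two newlines."
# The game replaces each ^ with a newline when showing the text:
displayed = description.replace("^", "\n")
```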

It’s important that everything outside quoted text strings (and/or comments; see below), and also any double quotation mark itself, is simple ASCII text only, or the game won’t understand the text as commands. In other words, the quotation mark should be this one: ", not the fancier “ or ”.

Within the ASCII double quotes, you can generally include non-ASCII characters. The obvious exception is that characters meant to be read by the game program, rather than by human players, should still always be ASCII. One example of this is the ^ for newline. Other examples are operator characters in the text strings given with the addstring command (in ritual definitions), or with the reclimiter command (in recruitment definitions of a class). These too must be ASCII, despite being used inside strings.

If otherwise you do use non-ASCII, there are some further considerations, which we’ll get back to when we delve deeper into the concept of plain text.

With a # character, you’ll mark the rest of a line as a comment, to be entirely ignored by the game. Even if, perhaps, everything you do within the mod is clear to you while writing it, comments can be very helpful if and when you later want to do some changes to it. They will also be helpful to anyone else reading it, of course. Example:
# This is a comment that the game will ignore when interpreting the mod.


Colorful text editing

Some text editors offer the option to highlight the syntax of various programming languages. If the editor you are using happens to be one of them, you might try marking the mod text as a Unix shell script. (Look for an option called something like “shell”, “sh” or “bash”.) A CoE4 mod is no Unix shell script, of course, but comments work the same way, and the editor might be able to color those differently from the commands, probably also color string literals in a third way.


A note on newlines – and Windows Notepad

If you have ever tried loading the example mod from Illwinter in Windows Notepad, you will have noticed an odd lack of proper newlines. This is because newlines are represented differently in Unix (including Linux and modern Mac) and in Windows. In Unix, each newline is a single ASCII character – with 10 for numeric value (also called LF or line feed) – whereas in Windows it’s a sequence of two ASCII characters – 13, 10 (where the first one is also called CR or carriage return).

The convention in Windows derives back via DOS and CP/M ultimately to the software used on DEC mainframes in the 1960s if not earlier. (The use of two characters gave more time for the print head of teleprinters to physically return from the far right to the beginning of the next line.) In Unix (first Multics), a new concept of a device driver allowed applications to ignore the gory details of the teleprinter hardware, so the simpler single-character encoding could be used for newlines. (This was all before the days of video terminals, of course, not to mention personal computers.)

It should be said that today most text editors on either kind of OS will understand either kind of newline, and, although the CoE3 manual recommends Unix newlines, it seems to me that CoE3 too, as well as CoE4, understands Windows newlines just fine. On Windows, I believe the inability of Notepad to comprehend Unix newlines is fairly unique, and it wouldn’t be the best choice for CoE4 modding on Windows anyway, for reasons that we’ll get back to. But don’t worry. There’ll be a solution in the chapter on plain text on Windows.
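If you ever do want to normalize a mod file to Unix newlines, the conversion is trivial. A minimal sketch in Python (the file name is just a placeholder):

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert Windows (CR LF) newlines to Unix (LF) ones.
    Working on raw bytes leaves any UTF-8 content untouched."""
    return data.replace(b"\r\n", b"\n")

# Usage (file name is hypothetical):
# with open("my_new_mod.c4m", "rb") as f:
#     data = to_unix_newlines(f.read())
```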


The banner
The banner is a TGA (also known as TARGA) image file* the size of which should be 256×64 pixels. It must be stored as an RGB image, with or without an alpha channel (32 bits/pixel with, 24 bits/pixel without). It can be RLE compressed.

To be precise, it seems the size can be pretty much anything and the game will scale the image to fit. However, if you want what you see to be what you get, you should at least make sure the width is four times the height. Also, although a somewhat larger size may allow for slightly more details on high resolution screens, there’s obviously a point where larger image sizes are just a waste of space.

* The game also supports SGI images, but since that’s a rarer format, I’ll here assume that only TGA images are used.
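The properties listed above can be checked directly against a TGA file’s fixed 18-byte header. A small sketch in Python (field offsets per the standard TGA format; the function name is my own):

```python
import struct

def parse_tga_header(header: bytes):
    """Decode the fixed 18-byte TGA header (little-endian fields).
    Returns (image_type, width, height, bits_per_pixel)."""
    image_type = header[2]                 # 2 = uncompressed RGB, 10 = RLE RGB
    width, height = struct.unpack_from("<HH", header, 12)
    depth = header[16]                     # 24 = RGB, 32 = RGB + alpha
    return image_type, width, height, depth
```

For a banner you would expect image type 2 or 10, a 256×64 size, and a depth of 24 (no alpha) or 32 (with alpha).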


Transparency

TGA images can be stored with an “alpha channel”, marking pixels to be more or less transparent. (An alpha value of 255 stands for fully opaque, an alpha of 0 stands for fully transparent.) The game will make use of this.

When an image is stored without an alpha channel, the game will treat any black pixels (RGB 0,0,0) as fully transparent, and any pixels in bright magenta (RGB 255,0,255) as half-transparent black (used for sprite shadows in the game). Omitting the alpha channel obviously reduces the size of the image file.

If you don’t use any transparency (or only simple transparency per the previous paragraph), don’t save the image with any alpha. First “flatten” it or remove its alpha, or you’ll just be wasting space.
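The color-key convention above can be expressed per pixel. A rough sketch in Python (my own approximation: the keyed format only distinguishes fully transparent and half-transparent black, so intermediate alpha values have to be rounded one way or the other):

```python
def key_pixel(r, g, b, a):
    """Map one RGBA pixel to the color-key convention CoE4 uses for
    images without an alpha channel."""
    if a < 128:
        return (0, 0, 0)         # (near-)fully transparent -> pure black
    if (r, g, b) == (0, 0, 0) and a < 255:
        return (255, 0, 255)     # half-transparent black -> bright magenta
    return (r, g, b)             # opaque pixels keep their color
```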


Where?

The banner is best placed in a subdirectory to the mods directory. In the example above, I have a subdirectory called my_new_mod, where I placed a file banner.tga. This becomes particularly important if and when you add more images to your mod. If every mod instead just littered the main mods directory with its images, it would make for a horrible mess, so don’t do that.


How?

There are no doubt many ways to create TGA images. If, however, you are at all asking this question, I suggest you try GIMP[www.gimp.org], which is cross-platform, as well as free and open source. In the following chapters I go through the basics of how to do things in GIMP. (If you already have another favorite image editor, you may skip that part.)


Creating a TGA image in GIMP
❶ First, from the File menu, select New...


❷ Set the image size to width=256, height=64.


❸ Fill the image with some contents. (This step is a bit briefly described, I know, but I’ll expand a little bit on the topic in a moment.)


❹ From the File menu, select Export As... (The save options are instead for saving XCF files, which are useful for image editing projects, as they can contain multiple layers, paths and other things not needed in the final image.)


❺ Navigate to your subdirectory to the mods directory (my_new_mod in my example) and enter the name banner.tga. Or you can use some other file name, as long as it ends with “.tga”. Then click on Export.


❻ The defaults for TGA export should be fine. Just click on Export.



Image editing in GIMP – the basics
Basic layout

There are, of course, way too many features in GIMP for me to go through them, not that I would see myself as any GIMP expert in the first place. I’ll just mention a few fundamental things to keep in mind.

When you start GIMP with the default layout, you should get a main window (with the menu) for the image in the middle, plus two panels in their own, secondary windows, one on either side. This layout is highly configurable – for one thing it’s possible to switch to single-window mode with an option in the Windows menu (you’ll have to restart GIMP before it goes into effect) – but I’ll assume the default layout here.

The panels can be toggled on and off with the Tab key ( ↹ ). I’ll admit it initially confused me when, once in a while, the panels disappeared (after an accidental press on Tab), but the purpose is to allow quick viewing of otherwise concealed parts of an expanded image window. If you miss a panel even after trying Tab, you should find it under the Windows menu → Recently Closed Docks.


Toolbox

The left panel, the toolbox, contains first various tools for creating a selection (limiting what part of the image is edited, copied, etc.), last tools for editing the image, and a number of other tools in between. When you select a tool, its options are shown in the lower part of the window.

Below the tool icons is a foreground/background color indicator. These colors (per default black and white) affect how some of the tools and options work, and can be changed if you click on them.

In the example here, the text tool has been selected. You’ll probably want to use it to write the title of your mod in your banner.

Note that in order to move the text, you’ll need to switch to the move tool:
   As long as the text layer is active (see below), you can move the text with the cursor keys. Alternatively, regardless what layer is active, if you click on the text (some visible part of a character), you can drag it with the mouse into the position you want. If you later select the text tool again, you can click on the text to edit it some more.


Layers

The second panel shows the layers of the image. (It also, per default, has tabs for channels, paths and undo history.) Use of the just mentioned text tool will automatically add a separate layer for the text. Images copied from elsewhere (for instance some image viewer application) can also be pasted into your image as their own layers. (If you have more than one image opened in GIMP, you can just drag and drop layers between them.) And, of course, adding a new layer is among the options given when you right click in the layers tab (there’s also a button at the bottom for this).

In the combined image, top layers will conceal the ones below them, except where transparent (or, if smaller than the image, the parts they don’t cover). The order of the layers can be changed by dragging and dropping them.

If a layer name is shown in bold text (like the Background layer in the example), that layer is without an alpha channel (i.e. it’s everywhere opaque). If the name is non-bold, the layer has an alpha channel (i.e. it can be partly or fully transparent). Any parts you erase in a layer with alpha will become transparent, whereas parts erased in a layer without will be filled with the currently selected background color.

An eye to the left indicates that a layer is visible and will be used in the combined image. This can be toggled off and on by clicking on the eye. By clicking next to the eye, you’ll get a chain icon. If two or more layers are marked with the chain icon, then certain operations (move, rotation and some more) on one of them will affect them all.

Just above the layers are lock buttons, which when enabled will prevent editing of the entire layer, or just its alpha channel, respectively. (The alpha lock will allow you to edit the colors of visible parts without messing with the boundary to what’s transparent.) Unfortunately, as this is written, there’s a small bug in the UI: the deactivation of these buttons doesn’t show until you switch from the layers dialog to the main, or some other, window. Not a big deal if you know about it, but it can be a bit confusing otherwise.

One common rookie mistake, I believe, is to fail to keep track of what layer is currently active. If some editing you make doesn’t seem to have any effect, then don’t just continue clicking like crazy “because the tool didn’t take”. Instead check that the intended layer is really active. And, if not, an undo might be in order. (Other things to check are the just mentioned lock buttons, and also that you aren’t trying to work outside a current selection. The selection can be made invisible. Toggle this visibility with Ctrl+T.)

More on layers in the manual[docs.gimp.org].


Selecting, copying & pasting in GIMP
There are true artists, of course, who can create images from scratch, but I think most of us are left with grabbing things from elsewhere as the starting point for what we compose in GIMP.

Almost any type of bitmap image can be directly opened in GIMP. It will be declared, by GIMP, as “imported” – since the native file format of GIMP is the XCF format[en.wikipedia.org] – but it can be exported back from whence it came, or to almost any other type of bitmap file.


Pasting

The other way to import stuff is, of course, to paste it into GIMP, after you copied it from some other application – or from another image opened in GIMP.

Normal pasting (Ctrl+V) in GIMP creates a temporary layer, called a floating selection, for the pasted stuff. You can move this layer around, and work on it, but you can’t switch to another layer before you do one of two things: Either make it a new permanent layer – or you can anchor it (Ctrl+H), meaning it will be merged with the previously active layer.

Obviously, before pasting and anchoring something, you should make sure you have the right layer active. Alternatively, you can always make the pasted layer a new proper layer (this will happen automatically if you give it a name), and only at some later point merge it down with a below layer.

If instead you want what you paste to become a new image in GIMP, in its own window, then use Shift+Ctrl+V.

“Paste Into” works like normal paste except, if anchored, the pasted image will be clipped according to any current selection. The default is to cancel any current selection.


Copying

The default in GIMP is to copy from just the active layer. “Copy Visible” (Shift+Ctrl+C) copies the combined image (taking current visibility settings of layers into account).


Selections in GIMP

In order to limit what parts of an image are copied – for instance, if you only want to copy the image of a monster, without any of the background in the original image – a selection can be made. Selections are good also for limiting what’s to be edited or deleted etc. As long as there is a selection, editing tools generally can’t touch anything outside it.

There is a whole slew of different selection tools, but one thing they have in common is they can operate in one of four modes: replace, add, subtract and intersection. The default is the replace mode. The others can be picked by pressing Shift and/or Ctrl before you start working with the mouse (pressing the left mouse button). A small symbol should appear next to the mouse cursor, confirming that the intended type of operation will be performed.

Mode          Keys        Symbol   Previous selection
Replace       –           –        Removed
Add           Shift       +        Added to
Subtract      Ctrl        −        Subtracted from
Intersection  Shift+Ctrl  ∩        Intersection with

As an alternative to Shift and/or Ctrl, each selection tool has mode buttons among its tool options.

Note that presses of Shift and/or Ctrl after you start working on a selection (press the left mouse button) will have other effects, varying with the selection tool.

For details on how the different selection tools work, see their documentation[docs.gimp.org]. They all have their uses (except probably the Intelligent Scissors, which even the GIMP manual seems to discourage using).

I’ll just point to one feature of the fuzzy selection tool (magic wand) – which also applies to the select by color tool. These tools have a threshold setting that determines how much colors can deviate from the color where you click and still be included in the selection. If the threshold is zero, only pixels of the exact same color as the pixel where you clicked will be included. Now, this threshold can be adjusted in the tool options dialog, but... a much neater way to increase the threshold is to drag with the mouse downwards or to the right, after your first click on your chosen reference pixel. If the selection seems to overflow what you intended, then pull back a little (move the cursor up or to the left). When it looks right, release the mouse button. (Tip by Franknfurter.)

You can have the threshold start at any value and then adjust it with mouse movements up/left as well as down/right. For myself, though, I have set the default threshold to zero for both tools. Tool options can be saved via Edit → Preferences → Tool Options → Save Tool Options Now.

Quick Mask: By adding to and subtracting from a selection, using various selection tools, you can gradually refine it. Another way of working with selections, however, is to enter the Quick Mask mode. Do this by clicking on the little square at the lower left of the image window (or by pressing Shift+Q).

While in normal mode, the selection is shown as an animated dashed line (the so called “marching ants”). In Quick Mask mode, the selection is instead just clear of the red tinted mask that will cover unselected parts of the image. (You can configure this behavior if you right click on the little square.)

The Quick Mask, for one thing, will allow you to see the true nature of a selection that contains partially selected areas (with pixels that are neither fully selected, nor entirely unselected). Such areas will be somewhat covered by the red mask, but less so than entirely unselected areas. (The “marching ants” will merely show the division line between pixels more than half selected and those less than half selected.)

In Quick Mask mode, furthermore, any of the paint tools (the ones at the end of the tools list, beginning with the bucket fill and ending with the dodge/burn tool) will affect the selection mask instead of the image itself. Whatever you paint white will become part of the selection, what you paint black will be removed from the selection, whereas painting in shades of gray will result in something in between. Primarily, you’ll probably want to work with either the pencil or the paintbrush (depending on whether you want pixel sharpness or not), but any paint tool can be used, including, for instance, the blend tool, for gradients from white to black.

Click on the small squares in the lower left of the foreground/background color indicator to reset those colors, ensuring that you are working with truly black and white. Click on the double-headed arrow to switch the colors.

Exit the Quick Mask mode by again clicking on the little square at the lower left, or by pressing Shift+Q – and you’ll have the marching ants back.

For further useful selection options, have a look in the Select menu.


Pixel Sharpness
Many tools have features to make their effects on the image smooth and free from pixel jaggedness. However, when working with small, icon type images, like sprites for this game, this may not be what you want – and even less so when you try to create areas of a single one exact color, such as the ones used by the game for transparency and shadows in images without an alpha channel (RGB 0,0,0 and 255,0,255, respectively).

Most of the just mentioned selection tools have an antialiasing option, and all have feathering. If you want pixel sharpness, you should make sure both are unchecked.

If nothing else, it seems the “Sharpen” option in the Select menu doesn’t just sharpen a selection in relative terms. It instantly makes it, as far as I can tell, completely pixel sharp (with only black and white in the selection mask, no shades of gray at all).

Among the paint tools, the pencil is effectively a pixel sharp version of the paintbrush. It will produce only pixels exactly matching the set foreground color, no other shades. With the paintbrush, on the other hand, there will always be some softening at the edges.


A few further notes on GIMP
A 256×64 pixel TGA image file should never need to be as big as 50 kilobytes – unless it has an alpha channel.

If your image isn’t really transparent anywhere, one way to ensure it isn’t saved with an alpha channel (just wasting space), is to use the option Flatten Image. This will have two effects, one is to merge all layers into a single one, the other is to remove any alpha channel (replacing any transparent parts with the current background color).

Of course, before doing this, you may want to save your project (as an XCF file), with all your layers intact, in case you want to do some further editing later. The flattening should be possible to undo before you quit, but still.

Under Image → Mode, you can make sure that you are really working with an RGB image. If you import an image from some external source, it might occasionally be an indexed image instead, in which case you should convert it to RGB. Indexed images (including any image in the old GIF format) can be of smaller size than RGB images, but ❶ some GIMP operations will only properly work in RGB mode, and ❷ RGB images are what’s required by the game anyway.

The option Canvas Size... can be used to change the image size without rescaling its contents.


More on GIMP

In this brief overview, I haven’t touched the path tool, with which you can do “fake freehand drawing”, among other things. If your mouse hand isn’t rock steady, instead make a path for what curves and shapes you want to draw. You can fine-tune that path with anchor points in various ways, and then you can stroke it. Paths can also be used as yet another way to make selections.

Neither have I mentioned layer masks – kind of an (extra) alpha channel for a layer, but one that is more easily edited and manipulated on its own – nor anything about the various color tools, transform tools and filters. But I’ll stop here.

The first source for more on how GIMP works is, of course, the manual. You should have got it with the program, available from the Help menu, but it’s also available online[docs.gimp.org].

Some tutorials can also be found on the official GIMP site[www.gimp.org]. For more, three places with GIMP forums, and (links to) tutorials are GIMP Chat[gimpchat.com], GIMP Forums[gimpforums.com] and GIMP Users[www.gimpusers.com].


Transparency and shadows in CoE4
The sprites that come with the game (in Illwinter trs files) don’t have any alpha channel. Instead, pixels to be made entirely transparent in the game are black in the sprite images, while shadows in the game (areas of half-transparent black) are bright magenta in the sprites (green at zero, red and blue levels maxed out).

The same thing will work for any images that you provide as TGA files, whether for sprites or mod banner. The prerequisite is that your image has no alpha channel! If it has, then the game will take all transparency information to be in that channel, nowhere else – and any bright magenta will be taken to be meant literally as bright magenta in the game.

Thus, if you go that path, you need to remove any alpha channel from your sprites. You can do that with the flatten image option in GIMP, as mentioned above. (Use the foreground/background color control to ensure your background color is black, before you flatten the image.) This will save some file space.

On the other hand, if you already have a sprite on an all-transparent background (after having selected and deleted any previous background), then it may be more convenient to just paint a shadow with an alpha channel – allowing you to instantly see what you get.

One way to do this is to...
❶ first add a separate, transparent layer for the shadow below the sprite layer. ❷ Then paint the shadow pitch-black in that layer (no need to worry about over-painting the sprite when you are painting in a separate layer below it). ❸ Finally, lower the general opacity for the shadow layer to 50%.


A closer look at plain text files
Again, a c4m file is just a plain text file, which can be created with basically any text editor. If you are on Linux, this should be all you need to know.

On Mac and in particular Windows, the concept of plain text may require some explanations, though – at least if you want to include any non-ASCII characters in what descriptions or names you have in your mod (in the strings you have between ASCII double quotes).


What is a plain text file? – UTF-8, the modern ASCII

There was a time when plain text was the same as ASCII – i.e. a file with nothing but the 128 ASCII characters[en.wikipedia.org] in it – including control characters for newlines and other things. Nowadays each such character (or numeric code for it) is stored in an 8-bit byte with its highest bit always cleared (=0). This still is basically all you’ll ever use in any ordinary kind of program source code (and also for writing the commands in a c4m file).

For general text (like descriptions or perhaps even names in a CoE4 mod), ASCII might sometimes be a bit scant, though. Since that highest bit in each byte was in any case not used, the obvious first thing people did was to just set it (=1), and suddenly there were 128 more characters available in that byte. So still only one byte per character. Unfortunately, people did this in many different ways (and just 128 extra characters were never going to cover the needs of the world, of course), and there was much confusion. So there were, for a time, several different code pages, and in order for a reader to see what a writer had written, they both had to use the same code page. Code pages were a hideous thing of the dark ages, before there was light ...and Unicode ...and UTF-8.

Speaking of UTF-8, it too, just like the mentioned 8-bit extensions (code pages), is identical to ASCII as long as only ASCII characters are used in a file or character stream. Every byte in a UTF-8 file or stream that looks like ASCII (has its high bit cleared), actually is ASCII. So newlines and spaces and program code (or English) are just the same in UTF-8 as they always were. In UTF-8, just like in the code pages that came before it, extensions to ASCII are implemented only with bytes where the high bit is set. Which means that lots of software that only really knows ASCII can still properly handle UTF-8 too.

The difference is that, in UTF-8, the extensions to ASCII are encoded as multi-byte sequences – 2, 3 or 4 bytes long. Which means anything in Unicode, potentially more than a million characters, can be encoded. Forget code pages. In the 21st century, the only character encoding worth caring about is UTF-8.

So to summarize, plain text = UTF-8. Needless to say, this is true for CoE4 too.
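The properties described above are easy to see for yourself. A small illustration in Python, which encodes text to UTF-8 bytes:

```python
# ASCII characters keep their single-byte encoding in UTF-8,
# with the high bit clear in every byte:
assert "icon".encode("utf-8") == b"icon"
assert all(b < 0x80 for b in "icon".encode("utf-8"))

# Non-ASCII characters become multi-byte sequences in which
# every byte has the high bit set:
assert "é".encode("utf-8") == b"\xc3\xa9"          # 2 bytes
assert "–".encode("utf-8") == b"\xe2\x80\x93"      # en dash, 3 bytes
assert all(b >= 0x80 for b in "é–".encode("utf-8"))
```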


UTF-8 on Windows and Mac

Unfortunately, UTF-8 is still not as ubiquitous on Windows and Mac as it is on Linux, and use of it on these systems requires some deliberate choices of the user. There are historical reasons for this, but I’ll save the background to this issue for a chapter at the end.

Of course, if you only ever plan on using ASCII anyway, in your descriptions and names, this may not matter to you. Otherwise, read on.


...on Mac
Although I’m no Mac expert, I’ll briefly convey what I have picked up regarding the UTF-8 situation on OS X. My understanding is it isn’t as bad as on Windows.

First, the default text editor – called TextEdit, if I’m not mistaken – should do fine, from what I hear. In particular, I don’t think it will prepend any silly BOM to your UTF-8. However, you should
  • probably turn off its fancy options[www.gottheknack.com] (though the HTML one is irrelevant here), and...
  • under Preferences → Open and Save → Plain Text File Encoding, set both options (opening and saving) to UTF-8.

UTF-8 as default for your environment

It’s apparently even possible on OS X to set UTF-8 as the default character encoding for your entire environment. (Without this, any non-ASCII in UTF-8 files might look weird in QuickLook.) It seems there’s a (hidden) little text file directly in your home directory, ~/.CFUserTextEncoding, containing a text string with two numbers separated by a colon. The number after the colon is your language setting (0 for English), whereas the first is the character encoding. By changing that first number to “0x08000100” (the “0x” prefix makes it hexadecimal), your default character encoding should be UTF‑8.

The catch[github.com], unfortunately, is that persistent bugs in software from Adobe (in particular Illustrator) apparently will make that software behave badly after such a change. So with Adobe software on your machine, you’ll probably want to refrain from it (until Adobe finally get their act together and fix this).


...on Windows
Making UTF-8 the default character encoding in your whole environment is still not an option at all on Windows, unfortunately, where, from what I understand, legacy code pages continue to be what’s mostly used even today – with different (“DOS”) ones for the console than for everything else. (There is a number 65001 defined for a UTF-8 “code page” on Windows, but its usability is very limited.)

Furthermore, the built-in Notepad, and any other MS software too, I think, will insist on prepending any UTF-8 it saves with a BOM (byte order mark), and there is no way to turn that behavior off. The BOM is an invisible Unicode character with the true and original purpose of helping Unicode-reading software determine the byte order or “endianness”[en.wikipedia.org] of multi-byte code units. (The byte order can be seen from what parts of the BOM code, U+FEFF, come first and last.) Other Unicode encodings exist, UTF-16 and UTF-32, which are based on such multi-byte units (one of the things that make them awkward for data interchange), but UTF-8 isn’t. There is no byte order in UTF-8, as each code unit is just one byte. A common rationalization for the BOMming of UTF-8 anyway – a purported need to give it a “signature” – doesn’t make much sense either, the real reason no doubt being found in the history of Unicode; see end chapter.
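The byte order detection described above is easy to demonstrate (a small Python sketch; U+FEFF and the UTF encodings are standard):

```python
# The BOM is the code point U+FEFF. In the two UTF-16 byte orders,
# its two bytes come out in opposite order - which is exactly what
# lets a reader detect the endianness of the stream.
bom = "\ufeff"

print(bom.encode("utf-16-be"))  # b'\xfe\xff'  (big-endian)
print(bom.encode("utf-16-le"))  # b'\xff\xfe'  (little-endian)

# In UTF-8 there is no byte order to detect; the same code point
# always encodes to the same three bytes (the infamous UTF-8 "BOM"):
print(bom.encode("utf-8"))      # b'\xef\xbb\xbf'
```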

You are thus much better off using something else for your UTF-8, and by far the most popular Notepad replacement for Windows is, I believe, the free and open source Notepad++[notepad-plus-plus.org]. Apart from its nice handling of UTF-8, it obviously also has no problems reading Unix newlines. There might be lighter weight alternatives that would suffice too. But if you go with Notepad++, here is an overview:


Notepad++

First, at the bottom of the Notepad++ window is a status bar where the character encoding as well as the newline format of the currently edited file can be instantly seen:



The encoding can be changed in the Encoding menu. This menu is divided in two sections, the upper with “encode in” alternatives, the lower offering options to “convert to” things. With a new, empty file, they are the same. The option you want is “Encode in UTF-8” (no BOM).



With some actual data in the file, there is a distinction between the upper and the lower set of options, in that the upper ones are meant to adjust how Notepad++ interprets existing data, without changing those data. Whereas the lower options will perform an actual conversion of data (which, of course, will only work correctly if Notepad++ understood the original data correctly). Some further reading on this in the Notepad++ wiki[docs.notepad-plus-plus.org].

If you have loaded a previous file and some characters don’t look right, you can try the upper options until (hopefully) they do. This, of course, is mainly about finding a particular legacy encoding (code page): there’s a whole slew of those found under Character sets, whereas the “ANSI” option refers to the particular one of them that happens to be set as default in your system.

Only once everything looks right should you use the lower (convert) option:



You may also, via EOL Conversion in the Edit menu, change the newline format to Unix – which is what the CoE3 manual recommends. I don’t think this is strictly necessary. (It seems to me that both CoE3 and CoE4 can handle Windows newlines just fine.) On the other hand it hardly hurts either, as the only Windows software I’m aware of that is unable to read Unix newlines is precisely Notepad.




Preferences

Instead of making these adjustments every time you start some editing, you can tell Notepad++ to use them as defaults. Go to the Settings menu and select Preferences...



To the left in the box that opens, select New Document.



For encoding, I believe UTF-8 should already be preselected – and the associated check box checked – but if not, make sure to set it that way.

The explanation to the check box, “Apply to opened ANSI files”, is slightly misleading. The actual effect is to make Notepad++ treat as UTF-8 any ASCII (not “ANSI”) files you open – i.e. files with only ASCII in them. The default in Windows is to treat ASCII files as encoded in whatever legacy code page that happens to be set as default on your system, inaccurately called “ANSI”[en.wikipedia.org]. As long as you keep the ASCII file ASCII only, it makes no difference, of course, whether it’s regarded as UTF-8 or “ANSI”, as they’re both 100% compatible with ASCII. The difference is how any non-ASCII characters added to the text will be treated. Should those be encoded in UTF-8, or should they be encoded in whatever legacy code page happens to be set as default on your particular system?
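That 100% ASCII compatibility is easy to verify (a Python sketch; cp1252 here stands in for whichever “ANSI” code page your system happens to use):

```python
text = "Pure ASCII: just letters, digits and #punctuation!"

# As long as the text is ASCII only, UTF-8 and a Windows legacy
# code page such as cp1252 produce byte-for-byte identical results.
assert text.encode("utf-8") == text.encode("cp1252") == text.encode("ascii")

# The difference appears the moment non-ASCII is added:
name = "Åsa"
print(name.encode("utf-8"))   # b'\xc3\x85sa'  - two bytes for the Å
print(name.encode("cp1252"))  # b'\xc5sa'      - one legacy-code-page byte
```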

Even with that box checked, if a non-ASCII, non-UTF-8 file is loaded, Notepad++ should recognize it as such, and display it as “ANSI” (unless it’s some kind of UTF-16 or UCS-2, of course). One oddity is that an entirely empty file (zero bytes of contents) can unfortunately also be treated as “ANSI” when loaded (not sure why), in which case you may need to explicitly select UTF-8 from the Encoding menu.

The default newline (line ending) can also be changed – to Unix – if you like, although, again, I believe this is less important.


Language

The Default language option is mostly about aesthetics. It refers to a programming language assumed for your text, and you can certainly leave it as Normal Text. In the example above, I have set it to Shell, which will make Notepad++ colorize the text as if it were a Unix shell script. A CoE4 mod is certainly no Unix shell script, but comments beginning with a ‘#’ character work the same, and Notepad++ will color those green. (The Comment/Uncomment options in the Edit menu will also work.) String literals (between ASCII double quotes) will similarly be displayed differently from commands.

You can also pick a language option from the menu:




...on Linux
Plain text files made on Linux are obviously always UTF-8 – without BOM (with one nasty exception: if ever you save plain text from LibreOffice, it will have a BOM).

Should you, for whatever reason, have the bad fortune of finding a UTF-8 file of yours tainted by a BOM, it shouldn’t be too hard to remove it, though. Simple text editors, like Gedit or Nano, will just treat it as an invisible zero-width character at the start of the text. So if, while moving the cursor around, you notice some invisible garbage there, just make sure you are at the start of the text, press the Delete key once, and save – and the BOM garbage is gone. More advanced editors, like Vim or Geany[www.geany.org] (also available for Windows and Mac), have special options to deal with the removal of BOMs.
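The operation itself is trivial – a minimal Python sketch of BOM stripping, operating on raw bytes:

```python
BOM = b"\xef\xbb\xbf"  # the UTF-8 encoding of U+FEFF

def strip_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM, if present; leave anything else alone."""
    return data[len(BOM):] if data.startswith(BOM) else data

# A BOM-tainted file's bytes, before and after:
tainted = BOM + "# a comment\n".encode("utf-8")
clean = strip_bom(tainted)
print(clean)  # b'# a comment\n'
```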


Colorful editing

Users of Gedit will find syntax highlighting in the View menu.


If you type “sh” in the search bar, to shorten the list of programming languages, and then pick the sh option, it will make Gedit display your mod as if it were a shell script. Comments and string literals will then be easily recognizable.

One little oddity is that any occurrences of the free command in the mod will stand out from other commands. This is because free is an actual Linux command (though not a command in Unix generally). Other CoE4 commands are not, and will just be shown in standard black text. This little inconsistency in the text display will hardly do any harm, though.


Entering Unicode characters
Assuming you’ve got your UTF-8 right (see previous chapters), you can now fill your descriptions and names with non-ASCII to your heart’s content. For some of it, there should be direct support in your keyboard setup (e.g. accenting keys), but there are lots of Unicode characters that aren’t that easily accessed.


Copying and pasting

One way to solve this which should always work, regardless of your applications and operating system, is to copy the characters you want from somewhere else – such as a web page, or a word processor, or from the character map utility that I believe should be available on most or all operating systems – and then paste them into your plain text editor.

Pressing Ctrl+C to copy whatever is selected in an application, and then, after switching to another application, pressing Ctrl+V to paste it there, should work universally, anywhere and everywhere. (Except that on a Mac, a Command key – ⌘ – is used instead of Ctrl.)


Direct inserting of Unicode per hexadecimal code

On most systems there are also ways to enter Unicode characters directly per their hexadecimal codes (which is how Unicode characters are usually documented). For instance U+263A for ☺ (the “263A” part being the actual code).
  • On Windows, if/when it works, ❶ press and hold Alt, ❷ press + on the keypad, ❸ enter the code, ❹ release the Alt key.

    Unfortunately, for some odd reason, this isn’t always or even usually enabled by default (it wasn’t on my Windows 8.1), and enabling it requires venturing into your registry settings. The registry key HKEY_CURRENT_USER\Control Panel\Input Method should contain a value EnableHexNumpad of type REG_SZ (string) set to "1". After a change here, you need to log out and back in before it takes effect. (With a really old Windows system – i.e. XP – a restart is needed.)

    Perhaps I should also mention a simple utility[www.fileformat.info] offering an alternative method for Unicode entry (a bit similar to the GTK+ method on Linux mentioned below). I take no responsibility for it. It’s not open-source, the ostensible reason, according to the author (by the name of Andrew Marcuse[andrew.marcuse.info], it seems), being only that it’s so simple that he didn’t think it worth the trouble to publish the source code. But it’s free of charge, and it seems to do what it’s supposed to, and nothing else, as far as I can tell.

  • On Mac, well, I’ll just refer you to the solution presented in the Apple support forum[discussions.apple.com].

  • On Linux, this should work out of the box – if your graphical environment is GTK+ based (meaning most Linux systems, but not those with KDE, which is Qt based):

    ❶ Press Shift+Ctrl+U (and release), an underlined u is shown, ❷ enter the code (the code is shown and can be edited while you enter it), ❸ finish with the Return or Enter key (or with Space or by pressing and releasing Shift+Ctrl again).

    If this doesn’t work (you don’t see any u), then try instead to hold Shift+Ctrl while entering U followed by the code, and only then release Shift+Ctrl.

  • On Linux with KDE, there is no general solution, unfortunately. Within the KDE default text editor, Kate, though, an application specific method might be[ubuntuforums.org] F7 for command line, then enter: char 0x263A (or whatever the code for the desired Unicode character is).

See further Wikipedia on this[en.wikipedia.org].
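The relation between those hexadecimal codes and the characters themselves can also be illustrated in Python:

```python
# U+263A is the hexadecimal Unicode code for the white smiling face.
smiley = chr(0x263A)     # code point number -> character
print(smiley)            # ☺
print(hex(ord("☺")))     # 0x263a - and back again

# In string literals, the same character can be written as an escape:
assert "\u263a" == smiley
```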


The fonts (and text system)
At this point, I’m afraid it may be time to put a damper on the enthusiasm. Although, in theory, with UTF-8 the whole Unicode[unicode-table.com] is your oyster – already more than 120 000 characters assigned out of 1 112 064 available code points – there is in practice one little snag: The font.

When an application is making use of system fonts, and a graphical representation (a glyph) for a Unicode code point is missing in its preferred font, the operating system may be able to supply a fallback from whatever other fonts it has installed.

However, a game such as CoE cannot reasonably rely on any OS specific font systems. It has to provide its own fonts, which is what CoE4 does.

In the data folder for the game, there are three TrueType font files:
  • guifont.ttf, with, I believe, FreeSans, from the GNU FreeFont[en.wikipedia.org] family. This is the sans-serif[en.wikipedia.org] font with the largest set of glyphs – i.e. supported characters – and it is the one used for most things in the game, including mod descriptions, class descriptions, ritual descriptions and messages.

  • guifont_texty.ttf, with Liberation Serif[en.wikipedia.org], it seems, with a somewhat smaller set of glyphs. It is used for unit descriptions.

  • guifont_fancy.ttf, apparently with Cloister Black Light, by Dieter Steffmann. It has the smallest set of glyphs and is used for titles of various kinds, including the one with (name and) unit type for unit properties.

Symbols and math stuff

In recent years, a lot of new emoticons and other fun symbols have entered the Unicode standard, but it isn’t too surprising if these very new additions aren’t in the game fonts. (To be able to see some of them on various web sites, I recently found even my preinstalled system fonts weren’t enough; I had to install some more.) More disappointingly, none of the game fonts, not even FreeSans, contains much from the Dingbats[unicode-table.com] and Miscellaneous Symbols[unicode-table.com] blocks either, even though those have been in the standard for quite some time (apart from a few recent additions).

Another font in the GNU FreeFont family, FreeSerif, does have glyphs for most of the stuff in the just mentioned two blocks, and when I tried substituting it for the guifont.ttf of the game, symbols like these ❶❷❸❹ ⚔ ☠ ⚕ ✌ ☄ ⚓ ⚒ ⚠ ⚑ ⚐ ➳ ☛ ✔ ✘ ⚅ ♒ ☼ ☀ ☁ ⚡ ❄ ☃ ♨ ☤ ⚚ showed up just fine in the game. The same with DejaVuSans, from the DejaVu fonts project[en.wikipedia.org]. Obviously, though, since neither FreeSerif nor DejaVuSans comes with the game, you can’t rely on CoE4 players in general having access to these characters. They aren’t in the fonts with the game.

What you can find, also in the game fonts (except the fancy one), are for instance ☺ ♠ ♣ ♥ ♦ ♪ ♫, the numero sign (№) and a few arrows from the arrows block[unicode-table.com] including ←↑→↓↔↕. Also, of course, apart from the ASCII set, all fonts (including the fancy one) come with the full Latin-1 Supplement[unicode-table.com]. (Some further symbols in all fonts are † ‡ • ‰ ∞ ≈ ≠ ≤ ≥.)

FreeSans has some extras, such as several more vulgar fractions and superscript digits beyond the ones in the Latin-1 Supplement (and also subscript digits), whereas the Liberation Serif font has a bit less of that.

One way to view the contents of the font files is to load them in FontForge[fontforge.github.io] – which is actually a fully featured font editor, not just a viewer. But it works.


Language support

The bulk of the characters in the Unicode set are the logograms used for East Asian writing, but I don’t think any of those are in the fonts of the game. (Had they been, the font files would have been a lot larger than they are, of course.) There doesn’t seem to be any Arabic either in those fonts, but then, the text system doesn’t seem able to handle right-to-left writing anyway.

Still, if you just want to use some European language other than English (the Greek and Cyrillic alphabets are also supported, except in the fancy font) – or you merely want to adorn your names with a few altered Latin letters – what’s in the fonts should probably suffice.



Example CoE4 map sporting names with a little bit of non-ASCII in them. (CoE4 maps – i.e. coem files – are UTF-8 text files just like the mods, of course.)


The SDL_ttf library

As it turns out, there is actually one more snag, besides the fonts.

The above mentioned DejaVuSans font, apart from good support of Dingbats and Miscellaneous Symbols, also has some of the emoticons beginning at U+1F600 – like 😀😁😂 etc. But, even after substituting that font for the one provided with the game, those emoticons still didn’t show up in the game. An experiment with the Aegyptus font by George Douros similarly failed to give any ancient Egyptian hieroglyphs[unicode-table.com].

The reason for this lies entirely in the SDL_ttf library used by the game for its text display. Although Unicode characters require 32-bit variables to hold them (or 21 bits, to be precise), they are, within that library, for some reason everywhere cut down to just 16 bits. (The exception is that characters from a UTF-8 string are correctly read as 32-bit values – which doesn’t help much, as they are afterwards cut down to 16 bits anyway.) This means that as long as only the first 63 thousand or so Unicode characters are used, everything works, but any Unicode character beyond that – in the range of over a million potential characters that won’t fit within 16 bits – will be mangled.
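The effect of that 16-bit truncation is simple to reproduce (a Python sketch of the problem, not the actual SDL_ttf code, which is C):

```python
# Unicode code points go up to 0x10FFFF and thus need more than 16 bits.
emoji = "😀"                 # U+1F600, from the emoticons block
code = ord(emoji)            # 0x1F600 - fits fine in a 32-bit variable

truncated = code & 0xFFFF    # what storing it in a 16-bit variable does
print(hex(code))             # 0x1f600
print(hex(truncated))        # 0xf600 - a completely different code point

# Characters within the first 16 bits survive unharmed:
assert ord("☺") & 0xFFFF == ord("☺")
```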

I actually downloaded the source code for the library and fixed it (which turned out to be surprisingly easy), after which I did indeed get CoE4 to display those emoticons and hieroglyphs (with the right fonts). Perhaps the fix can eventually be added to the official SDL_ttf library too (although, at this point, I haven’t yet heard anything from the SDL people).


Typographic variants of ASCII characters
The ASCII set includes a few multi-purpose characters:
  • " (U+0022): The ASCII double quote. Used as both opening and closing quotation mark, and also for double prime symbol (for seconds or inches).
  • ' (U+0027): The ASCII single quote or apostrophe. Also for a prime symbol (minutes/feet).
  • - (U+002D): The ASCII hyphen-minus. Used for hyphen or minus or en dash (or em dash).
When you type on your keyboard, the above ASCII characters are what you will primarily get. Word processors, such as LibreOffice[www.libreoffice.org], can, however, automatically replace some of them with appropriate typographic variants.

The full Unicode set defines characters for each special case, which generally could make the text look a little bit better. Moreover, since the ASCII double quote can’t be used within a string literal at all, the typographic counterparts are the only way to get quotation marks inside one, as they won’t interfere with the game’s recognition of where the string literal starts and ends:
  • “ (U+201C): Left double quotation mark
  • ” (U+201D): Right double quotation mark
  • ″ (U+2033): Double prime (seconds/inches)
  • ‘ (U+2018): Left single quotation mark
  • ’ (U+2019): Right single quotation mark – or apostrophe.
  • ′ (U+2032): Prime (minutes/feet)
  • ‐ (U+2010): Hyphen
  • − (U+2212): Minus
  • – (U+2013): En dash
  • — (U+2014): Em dash
These are found in all the game fonts, except that minus, prime and double prime aren’t in the fancy font. (In Unicode, most of them are in the General Punctuation[unicode-table.com] block. The minus sign (U+2212) is in the Mathematical Operators block, though.)
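Saving such characters correctly just means writing the file as plain UTF-8 without BOM – here sketched in Python (the quoted sentence is a made-up example, not an actual mod command):

```python
# Typographic quotes can appear inside a double-quoted string literal,
# where the ASCII " itself could not:
line = "“Begone!” said the ‘old’ hermit – and that was that.\n"

data = line.encode("utf-8")          # plain UTF-8, no BOM prepended
assert not data.startswith(b"\xef\xbb\xbf")

# Each typographic character and its code point:
for ch in "“”‘’–":
    print(ch, hex(ord(ch)))
```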


No-break space and non-breaking hyphen

Another character that can often be handy is U+00A0, no-break space: It’ll look just like an ordinary space, but will prevent line breaks from happening where that space is. For instance, if you write “№ 2”, you probably don’t want a line break to happen between № and 2, and you can ensure it never will, with a no-break space between those two characters.

There is also a non-breaking hyphen, U+2011. However, I don’t think CoE4 will ever break lines after ordinary hyphens (or ASCII hyphen-minuses) either, so this is of less use while doing CoE4 mods.

Again, note that any characters intended to be read by the game – not by players – should be ASCII.


The history of UCS, Unicode and UTF-8 – Part 1
These end chapters can be safely skipped. They merely provide some historical background to the conspicuous slowness in the adoption of UTF-8 on Windows and Mac. I actually wrote them as a single chapter, but Steam apparently doesn’t allow chapters (or sections) that long, so I had to split it in two.


The Beginning

In the 1980s, two projects were launched, initially unaware of each other, both with the aim of creating a single unified character set for the world. One of them was initiated in 1984 by ISO/IEC, leading to drafts of a new standard, ISO/IEC 10646, starting to circulate in 1989. It was published in 1990. The other, which came to be known as Unicode, began as an informal project by Xerox and Apple around 1987, grew with regular meetings and more companies involved in 1989 (Microsoft joined in 1990), and was formalized in the founding of the Unicode consortium in 1991.

The original goal of the Unicode project – into which Microsoft and Apple came to invest loads of money as well as prestige – was a fixed-width 16-bit character set. I.e. every character would invariably take up a 16-bit unit (=2 bytes). This meant it would be entirely incompatible with old ASCII files and with every other previous encoding, as well as with all previous text processing software (and it would also take up twice as much space, even for English text and program source files). But, on the plus side, thanks to still being fixed width (just using 16-bit units instead of bytes), it would be about as easy to handle for new software, as the previous single-byte encodings had been for the old one.

Meanwhile, the idea of the ISO/IEC project – the UCS, or the Universal Character Set – was a full 32‑bit set (or, to be precise, a 31-bit set), compared to the 16-bit set of Unicode, and, unlike Unicode, ISO/IEC always foresaw variable width encoding being used for it. In fact, the UTF-8[en.wikipedia.org] scheme, beautifully and cleverly composed in 1992 – which isn’t just ASCII compatible, but also self-synchronizing and (where there is non-ASCII) very easy to recognize – was designed to encode the entire 31-bit UCS of ISO/IEC 10646, not Unicode.
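The elegance of that scheme is easy to see in the bit patterns. As an illustration, here is a hand-rolled Python encoder for the original 31-bit design (one lead byte announcing the sequence length, then self-synchronizing 10xxxxxx continuation bytes), checked against the built-in encoder:

```python
def utf8_encode(cp: int) -> bytes:
    """Hand-rolled UTF-8 for the original 31-bit design (1-6 bytes)."""
    if cp < 0x80:                       # plain ASCII: one byte, high bit 0
        return bytes([cp])
    # Find how many continuation bytes (10xxxxxx) are needed:
    for n, limit in ((1, 0x800), (2, 0x10000), (3, 0x200000),
                     (4, 0x4000000), (5, 0x80000000)):
        if cp < limit:
            break
    lead = (0xFF << (7 - n)) & 0xFF     # lead byte: n+1 high bits set
    out = [lead | (cp >> (6 * n))]      # top bits of the code point
    for i in range(n - 1, -1, -1):      # continuation bytes, 6 bits each
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)

assert utf8_encode(0x41) == b"A"                          # 1 byte (ASCII)
assert utf8_encode(ord("é")) == "é".encode("utf-8")       # 2 bytes
assert utf8_encode(0x263A) == "☺".encode("utf-8")         # 3 bytes
assert utf8_encode(0x1F600) == "😀".encode("utf-8")       # 4 bytes
```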


The Merge (and UTF-16)

The two projects eventually merged (after some peace brokering[www.unicode.org]), but it happened only in stages. First, the character sets of each project were aligned with each other, without either project giving up its general goals. This effectively made Unicode a subset, UCS-2, of the UCS according to ISO/IEC 10646, one that would still only be encoded in fixed-width 16-bit units. Only then did the Unicode people realize they had to throw in the towel regarding their hotly desired fixed width encoding anyway. 16 bits, or 65 536 characters (code points*), just wasn’t enough to fit everything they felt was necessary (in particular all the Chinese logographs, of course).

At this point (Unicode 2.0, July, 1996), UTF-8 was adopted by Unicode. However, since by then several of its members (including Microsoft, Apple and Sun) had already invested heavily in software for a fixed width 16-bit character encoding model, a UTF-16 scheme was cobbled together too, to save face and investments – trying to mimic the self-synchronization of UTF-8 but without any of its elegance. The new UTF-16 scheme was initially, in the language of Unicode 2.0, indicated to be the primary encoding, with UTF-8 only a secondary option. (This has since changed, and the Unicode consortium no longer recommends one encoding over the other.) No less a variable-width encoding than UTF-8, UTF-16 could nevertheless be touted by its proponents as “almost” fixed-width, on the grounds that a second 16-bit code unit would “almost never” be required (complete nonsense from a programming point of view, of course, as the double unit case still always needs to be catered for).

Whereas the UTF-8 scheme could encode anything within the whole 31-bit UCS originally conceived by ISO/IEC (more than two thousand million code points), the UTF-16 scheme is unable to encode more than a bit over a million code points, 1 112 064 usable ones to be precise. (2048 code points were removed, reserved for the UTF-16 scheme only, while 16 more 65 536-character “planes” were added to the one Unicode had originally aimed for.) ISO/IEC eventually (in 2000) agreed to limit the code space of ISO/IEC 10646 to no more than what could be encoded with UTF-16 (reducing the UCS from 31-bit to a not fully used 21-bit set), thus making the unification of the two standards 100% complete. Today, Unicode and UCS are basically the same thing (although Unicode doubtless sounds a lot catchier than UCS or ISO/IEC 10646).**
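For the curious, the arithmetic behind those counts is easy to verify (a quick Python check):

```python
planes = 17                 # the original 16-bit plane plus the 16 added ones
per_plane = 65536           # 2**16 code points per plane
surrogates = 2048           # code points reserved for the UTF-16 scheme

total = planes * per_plane
usable = total - surrogates
print(total)    # 1114112
print(usable)   # 1112064 - the number quoted above

# The original 31-bit UCS, for comparison:
print(2**31)    # 2147483648 - more than two thousand million
```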


The UTF-8 path to Unicode

Since, in the GNU/Linux world, no internal support was ever implemented for 16-bit character encoding, the choice to go with UTF-8 for everything came easily. As late as 2002, Red Hat was the first Linux distribution to migrate to UTF-8, followed by SUSE in 2004, Ubuntu in 2005, Debian in 2007. All in all, the complete migration of Linux to Unicode went fairly quickly and painlessly.

The World Wide Web too, of course, has in the meanwhile gone mostly Unicode, again via UTF-8 (UTF-16 hardly ever even considered as an alternative). The most important reason for dismissing UTF-16 was in both cases no doubt its incompatibility with existing ASCII software and text files, but, in transmissions, the greater robustness of UTF-8[www.ibm.com] over UTF-16 is of relevance too. (Add to this generally smaller file sizes with UTF-8 than with UTF-16.)

Microsoft and Apple (among others in the consortium) were obviously not amused. They had in the early 1990s boldly and bravely taken on the future with an array of software to handle the brave new fixed-width character set, UCS-2, no doubt at considerable costs, only then (Unicode 2) to find themselves required to painfully adapt their fixed width UCS-2 systems into variable width UTF-16 – an adaptation still not everywhere completed. (Leaving them with a mess of legacy code pages, UCS-2, UTF-16 and, inescapably of course, the steadily more popular UTF-8 too, on their systems.) And now, adding insult to injury, they could only watch others cleanly and (relatively) effortlessly take the much easier 100% UTF-8 path to Unicode instead.

No one seems to have had it as hard as Microsoft, though, coming to grips with the success of UTF-8. (By comparison, the Java platform of Sun, now Oracle, also uses UTF-16 internally – a consequence of it being created in the early 1990s – but they have nevertheless, since quite a while back, been recommending UTF-8 externally.) Seeing how UTF-16LE is still in Microsoft software consistently and misleadingly called just “Unicode”[stackoverflow.com] (UTF-16BE in Microsoft speak is “Unicode big endian”), while a UTF-8 option is typically kept clinically free from any mention of Unicode at all, it’s hard to shake the feeling that Microsoft are still, as far as text handling goes, stuck in the happy days of 1993 – a blissful time when the Unicode standard was as yet untainted by any annoying UTF-8, and Microsoft was spearheading the way into the pure, new future world of fixed-width 16-bit characters everywhere that was Unicode.
The history of UCS, Unicode and UTF-8 – Part 2
Does UTF-8 need a yellow badge? – The UTF-8 “signature” myth

Which leads me to some final words on a remarkably widespread myth: that UTF-8 would be in need of some kind of extraneous “signature”, prepended to the real UTF-8 stream, for it to be possible to recognize as UTF-8.

Where this myth stems from, of course, is the fact that, despite UTF-8 being free from byte order, certain software, in particular on Windows, has been casually dropping BOMs, byte order marks, into its UTF-8 anyway – a habit that had to be rationalized, somehow[inasmuch.as].

The facts are as follows:
  • In UTF-16:

    1. If a random byte stream is interpreted as either kind of UTF-16 (BE or LE), the chance of each character passing as legal UTF-16 is 96.9%. (This does not take into account detection of currently unassigned code points and/or any analysis of textual contents.) Thus, it takes a fair number of characters to rule out, with reasonable certainty, that the stream is something other than UTF‑16.

    2. Already the byte order issue of UTF-16, however, pretty much requires it to be prepended by a BOM (byte order mark), which will be two bytes – in UTF-16BE: 0xFE, 0xFF, in UTF‑16LE: 0xFF, 0xFE. Since it’s already there, it will also serve as a UTF-16 signature.

      The chance of two random leading bytes corresponding to the UTF-16LE BOM is 0.001 526%, making it more reliable for recognizing UTF-16LE than the reading of 350 random UTF-16LE characters would be.

  • In UTF-8, by contrast:

    1. There is no byte order.

    2. Any occurrence at all of legally decodable non-ASCII characters will make it very unlikely that it’s anything other than UTF-8.

      In a random byte stream, the chance of a non-ASCII byte (high bit set) beginning a sequence that’s legally UTF-8 decodable to an actual non-ASCII character is only 6.64%.

      Just five such characters (!) will more reliably confirm a byte stream to be UTF-8 (0.000 129% chance of error), than the BOM does for UTF-16LE. (The reliability of those five UTF-8 non-ASCII characters for format recognition corresponds to the reliability of the UTF-16LE BOM plus 78 or 79 more UTF-16 characters being checked for illegal UTF‑16.)

    3. Without non-ASCII, on the other hand – if the stream is pure ASCII (high bit in each byte cleared) – the matter is already settled. ASCII by definition is also legal UTF-8. Whether you call it UTF-8, or just ASCII, or you slap some arbitrary legacy code page label on it (so called “ANSI” or whatever), does not matter. It’s all just ASCII. Except that if it’s treated as UTF-8, then any non-ASCII additions to it will be sensibly encoded.

    4. The contraption of a UTF-8 “signature”, in the form of a byte order mark, in the byte order free UTF-8 becomes three bytes (0xEF, 0xBB, 0xBF), making the chance of an error in format detection only 0.000 006%. This corresponds to the reliability of the UTF-16LE BOM plus 176 (!) more UTF-16 characters being checked for illegal UTF-16.

      On the other hand, just seven non-ASCII characters in the ordinary UTF-8 stream will be even more reliable than the BOM “signature”, with a chance of error less than a tenth of that.

  • Summarizing:

    1. Although, unlike UTF-16, UTF-8 is really, really easy to recognize in itself, and has no byte order to mark, a byte order mark anyway will make it really, really, really easy. Yay! Oh, but wait...

    2. If the probably most easily recognizable character encoding ever designed, already in itself (UTF-8), really needs this extraneous signature on top of that – shouldn’t the poorly recognizable UTF-16 be similarly “improved” with a gazillion extra BOMs just for signature?
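The percentages cited in the points above can be checked with a bit of arithmetic (Python; the per-character 96.9% and 6.64% figures are taken as givens from the text):

```python
# Chance of two random bytes matching the UTF-16LE BOM (FF FE):
p_bom16 = 1 / 2**16
print(f"{p_bom16:.6%}")          # 0.001526%

# Chance of five random bytes-with-high-bit-set each beginning a
# legally decodable non-ASCII UTF-8 character (6.64% per character):
p_utf8_5 = 0.0664**5
print(f"{p_utf8_5:.6%}")         # 0.000129% - beats the UTF-16LE BOM

# Chance of three random bytes matching the UTF-8 "BOM" (EF BB BF):
p_bom8 = 1 / 2**24
print(f"{p_bom8:.6%}")           # 0.000006%
```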

A UTF-8 BOM breaks one of the beautiful aspects of UTF-8: its 100% compatibility with ASCII, and with ASCII software. Whether you find this to be a price worth paying for the sake of raising its recognizability to ridiculous levels, and beyond, will probably very much depend on how much love you had for UTF‑8 in the first place. (In the case of Microsoft, I believe it’s safe to assume that amount to be zero.) In its role as a spanner thrown in the works of ASCII-expecting software that would otherwise work fine also with UTF‑8, the UTF-8 “signature” works most excellently. (I would have linked to some Mr Burns from Simpsons here, but it seems redundant.)

As simple and straightforward as UTF-8 recognition is, it does seem, however, that Microsoft has never bothered to implement it in their own software – making their own invention of a UTF-8 “signature” indeed the one and only way for it to detect UTF-8.

Ironically, it seems the peculiar inability of Microsoft software to recognize UTF-8 the normal way might occasionally even be a good thing. If a developer on Windows tries to use a UTF-8 source file – with the BOM that Microsoft insists on for UTF-8 – then Microsoft’s own development tools will apparently take it upon themselves to convert that UTF-8 to something else, thereby corrupting any UTF-8 string literals in that source file. The workaround: Save the UTF-8 source file without BOM anyway[utf8everywhere.org], using a non-MS tool. Without the yellow badge to mark it as UTF-8, the MS development tools will fail to notice it (regard it as “ANSI”) and leave it unharmed.

Though, again, from a purely human perspective, their obviously very strong dislike of UTF-8 isn’t that hard to understand, I guess.



Notes

* In these end chapters, I have, somewhat carelessly perhaps, used the terms character and code point interchangeably, which is also the common practice in programming discussions. However, for the purposes of text display, and from an end user perspective, a character might be made up from two or more code points, where some could for instance be combining diacritical marks. On the other hand, a ligature could be seen as two or more characters, even if represented by a single code point.

** To a developer involved in the display of Unicode text, there is still a distinction between Unicode and ISO/IEC 10646, as the Unicode standard establishes various rules for such things that aren’t in the ISO/IEC standard.

4 Comments
CornbreadChrist 16 May, 2020 @ 8:28pm 
Through your various guides on this game and previous editions, I have a much better understanding of basic modding.

On top of that, your time commitment and effort in making these guides and useful tools for people who are otherwise complete strangers to you is truly humbling. A testament to your knowledge, work ethic, and your character.

I don't know if you'll ever see this, but I wish to show my appreciation all the same.

Thank you
Marlin  [author] 19 Jun, 2016 @ 1:20pm 
@seecat46: All currently published CoE4 mods are, I believe, to be found in the single one Modding sub forum on Steam – crammed together with discussions on how to make mods. Not ideal, perhaps, but that’s how it is. There is unfortunately no CoE4 forum as great as the CoE3 one on Proboards [coe3.proboards.com] is, or was (the driving force behind it sadly passed away, from what I understand).

And no, CoE3 mods cannot be used with CoE4. Some may be simple, even trivial, to convert (as long as they don’t make use of events or rituals), but no CoE3 mod can be directly used with CoE4.
seecat46 18 Jun, 2016 @ 3:11am 
why are there no CoE 4 mods on google or can you use CoE 3 mods?
Tchey 13 Apr, 2016 @ 11:02am 
Really over complicated for beginners. Mostly all of that "knowledge" is useless to start modding. That said, if i'm not looking for help about modding CoE4, the reading is enjoyable. Thanks for the effort.