On character encodings

Background

For many years, EMG passed on both the message body and the data_coding fields with as few modifications as possible. To be able to correctly handle contents in both the Latin-1 and GSM-7 character sets, mapping files were used. This worked fine when the number of clients sending messages, and the number of operators receiving them, was limited.

However, for the SMS broker scenarios that are more common now, with a much larger number of message senders and operators, this was not enough. Different clients would send messages in different character sets to the same EMG connector, and those messages would then be routed to the same operator. Sometimes even the same client sent messages in both Latin-1 and GSM-7, resulting in one of them being shown incorrectly, even though the data_coding value was correct. It became increasingly difficult to know where to add mapping files to support new traffic without also breaking the existing traffic.

New options in EMG 7

In EMG 7.1, the configuration option DEFAULT_CHARCODE_TEXT was added, both as a global setting and as a connector option. The data flow now looks like the following.

  1. Incoming messages with data_coding set to 1 or 3, for GSM-7 and Latin-1 respectively, are tagged with the corresponding character set.
  2. Incoming messages with data_coding set to 0 get the character set from the connector option DEFAULT_CHARCODE_TEXT.
  3. If DEFAULT_CHARCODE_TEXT is also set globally, the message is then automatically converted to that character set, and retagged.
  4. If the connector option DEFAULT_CHARCODE_TEXT is set on the outgoing connector, outgoing messages in GSM-7 or Latin-1 are then converted again, and sent with data_coding set to 1 or 3. This handles the case when an operator only supports one of GSM-7 and Latin-1, and not both.

Messages in Unicode and binary data are not modified.

This makes mapping files obsolete, as all conversions are done automatically. It also means EMG will never, by default, use data_coding set to 0 for outgoing traffic.
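
The four steps can be sketched in Python as follows. This is an illustration of the flow described above, not EMG's implementation. The names Message and route, and the character set labels "GSM7" and "LATIN1", are made up for the example and are not actual DEFAULT_CHARCODE_TEXT values. The body is held as already decoded text, so the byte-level transcoding that EMG performs in steps 3 and 4 is left out.

```python
from dataclasses import dataclass, replace
from typing import Optional

# data_coding values named in the text: 1 = GSM-7, 3 = Latin-1.
# The labels "GSM7" and "LATIN1" are placeholders for this sketch.
DCS_TO_CHARSET = {1: "GSM7", 3: "LATIN1"}
CHARSET_TO_DCS = {"GSM7": 1, "LATIN1": 3}


@dataclass(frozen=True)
class Message:
    body: str                      # decoded text, to keep the sketch short
    data_coding: int
    charset: Optional[str] = None  # tag; stays None for Unicode and binary


def route(msg: Message, in_default: str,
          global_default: Optional[str] = None,
          out_default: Optional[str] = None) -> Message:
    """Apply steps 1-4; the *_default arguments stand in for
    DEFAULT_CHARCODE_TEXT on the incoming connector, globally,
    and on the outgoing connector."""
    if msg.data_coding in DCS_TO_CHARSET:    # step 1: explicit data_coding
        charset = DCS_TO_CHARSET[msg.data_coding]
    elif msg.data_coding == 0:               # step 2: connector default
        charset = in_default
    else:                                    # Unicode, binary: not modified
        return msg
    if global_default is not None:           # step 3: convert and retag
        charset = global_default
    if out_default is not None:              # step 4: convert again
        charset = out_default
    # Outgoing text is sent with data_coding 1 or 3, never 0.
    return replace(msg, data_coding=CHARSET_TO_DCS[charset], charset=charset)


sent = route(Message("hello", data_coding=0),
             in_default="LATIN1", out_default="GSM7")
print(sent.data_coding, sent.charset)  # prints: 1 GSM7
```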

With these options, new connector protocols that only use UTF-8, such as HTTP-JSON (added in EMG 7.2.7), can correctly send messages received in any character set.
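
A small Python sketch shows why the character set tag matters when a message is re-encoded as UTF-8. The function names and the "GSM7"/"LATIN1" labels are made up for the example, the GSM-7 table is deliberately partial (four code points plus letters, digits and space), and the body is assumed to hold one unpacked septet per byte.

```python
# Partial GSM-7 default alphabet: only code points that differ from ASCII
# in a way that is easy to see. A full table is needed for real traffic.
GSM7_SAMPLE = {0x00: "@", 0x02: "$", 0x24: "\u00a4", 0x40: "\u00a1"}


def decode_gsm7_sample(body: bytes) -> str:
    """Decode unpacked GSM-7 septets using the partial table above."""
    chars = []
    for b in body:
        if b in GSM7_SAMPLE:
            chars.append(GSM7_SAMPLE[b])
        elif b == 0x20 or (b < 0x80 and chr(b).isalnum()):
            chars.append(chr(b))  # letters, digits and space match ASCII
        else:
            raise ValueError(f"byte 0x{b:02x} is not in the sample table")
    return "".join(chars)


def to_utf8(body: bytes, charset: str) -> bytes:
    """Decode the body according to its tag, then encode it as UTF-8."""
    if charset == "LATIN1":
        text = body.decode("latin-1")
    elif charset == "GSM7":
        text = decode_gsm7_sample(body)
    else:
        raise ValueError(f"unsupported character set: {charset}")
    return text.encode("utf-8")


# The same byte is "@" in Latin-1 but an inverted exclamation mark in GSM-7.
print(to_utf8(b"\x40", "LATIN1"))  # b'@'
print(to_utf8(b"\x40", "GSM7"))    # b'\xc2\xa1'
```

Without the tag there is no way to tell which of the two results is the intended one, which is the problem the mapping files could not solve once traffic was mixed.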

Improving the compatibility

Some new issues appeared at this point. In particular, some operators did not support data_coding values 1 or 3 correctly, leading to various characters being shown incorrectly on the handsets.

To be able to send messages to such operators without full data_coding support, the new connector options DCS_FOR_LATIN1 and DCS_FOR_IA5 were added in EMG 7.1.10 and EMG 7.2.0. This way EMG could enforce both the character set and data_coding value for outgoing traffic, without disturbing Unicode messages.
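
As a rough Python sketch of the idea, the outgoing data_coding value can be seen as a per-character-set default that a connector option may override. The function name, the options mapping and the assumption that each option holds the numeric data_coding value to send are illustrative only. Refer to the EMG documentation for the actual syntax and accepted values.

```python
from typing import Mapping, Optional

# Defaults from the data flow above: GSM-7 is sent as 1, Latin-1 as 3.
DEFAULT_DCS = {"GSM7": 1, "LATIN1": 3}
# Which connector option overrides which character set (labels are placeholders).
OVERRIDE_OPTION = {"GSM7": "DCS_FOR_IA5", "LATIN1": "DCS_FOR_LATIN1"}


def outgoing_data_coding(charset: Optional[str], data_coding: int,
                         options: Mapping[str, int]) -> int:
    """Pick the data_coding to send on the outgoing connector."""
    if charset not in DEFAULT_DCS:
        return data_coding  # Unicode and binary messages are left alone
    # Use the override if the connector sets one, else the default value.
    return options.get(OVERRIDE_OPTION[charset], DEFAULT_DCS[charset])


# Hypothetical operator that expects Latin-1 text with data_coding 0.
connector_options = {"DCS_FOR_LATIN1": 0}
print(outgoing_data_coding("LATIN1", 3, connector_options))  # 0
print(outgoing_data_coding("GSM7", 1, connector_options))    # 1
print(outgoing_data_coding(None, 8, connector_options))      # 8
```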

Recent updates

All IA5 options got new names in EMG 7.2.8, replacing IA5 with GSM7. The IA5 names will remain for backwards compatibility, but new configurations should use the GSM7 names for clarity.

The MERGE logic was updated in EMG 7.2.9 to handle the case when different message parts use different character sets.

This logic has now been updated again, for EMG 7.2.25.