Welling Guzman

MySQL string columns key length

After switching the default charset from utf8 to utf8mb4 to support emojis on Directus, we started to receive errors that the key was too long. One my wondering how changing the charset affect the key length. Below can be see examples of the errors:

#1071 - Specified key was too long; max key length is 767 bytes
#1071 - Specified key was too long; max key length is 1000 bytes
#1071 - Specified key was too long; max key length is 3072 bytes

It can be any of previous errors depending on what is the storage engine of the table. MySIAM, InnoDb or InnoDb with innodb_large_prefix enabled have differents key length limitation.

TL;DR

The difference between utf8 and utf8mb4 charset is the bytes requires to store each characters. utf8 requires 3 bytes, while utf8mb4 requires 4 bytes. This means using utf8mb4 charset in a table with innodb engine with innodb_large_prefix disabled, at most 191 characters in a string column must be used.

191 characters × 4 bytes = 764 bytes which is less than the maximum length of 767 bytes allowed when innodb_large_prefix is disabled. Since MySQL 5.7 innodb_large_prefix is enabled by default allowing up to 3072 bytes.

String Storage

String storage size vary depends on whether the column is fixed-length or variable-length. It also depends on the charset, it takes more bytes to storage a japanese character than an ASCII/Latin letter.

As an example, CHAR is a fixed-length while VARCHAR and TEXT are variable-length.

All fixed-length data types uses all the bytes they were declared. For example CHAR(16), no matter what its value is, it's right padded with spaces to fill up to the specific length. On the other hand VARCHAR only uses 1 byte + the content size.

VARCHAR requires a prefix value of 1 byte to store the length of the string if the size is less than 256, otherwise it will uses 2 bytes.

One tip is not to use CHAR if you are not going to use all the characters almost all the time, because the size can pile up with empty strings column.

Character Set

The UTF8 character set uses a maximum of 3 bytes per character and only contains Basic Multilingual Plane (BMP) characters, which is the home of 65,536 characters (16 bits) from U+0000 to U+FFFF.

The UTF8mb4 character set uses a maximum of 4 bytes per character including all of BMP characters and Supplementary Multilingual Plane (SMP) which include another possibility of 65,536 new characters from U+10000 to U+1FFFF.

Emojis (Unicode characters)

UTF8 can support emojis, but not all of them. All of the new emojis are part of the SMP, so in order to support both basic and supplementary multilingual plane UTF8mb4 must be used.

The sparkle emoji (✨ U+2728) value is between U+0000 and U+FFFF then it can be used on utf8 charset, but the Woman Health Worker (👩 U+1F469) value which is not between U+0000 and U+FFFF, must use the utf8mb4 charset that range between U+10000 and U+1FFFF.

Index length

Now after using utf8mb4 all the characters use 4 bytes instead of 3, so all columns that has more than 191 characters now uses more than 767 bytes, because 192 x 4 bytes is 768 bytes.

Keep in mind the 768 bytes limit is only when using innodb engine and innodb_large_prefix is disabled. Since MySQL 5.7 innodb_large_prefix is enabled by default allowing up to 3072 bytes. MySIAM has a maximum length of 1000 bytes.

Engine Limit
InnodB with innodb_large_prefix disabled 768 bytes
MySAIM 1000 bytes
InnodB with innodb_large_prefix enabled 3072 bytes

Solutions

The solve this will depend on what we actually need it can be either removing the index, keep using utf8, add a length to the index key or reduce the length of the column.

Reduce length

For us removing the index wasn't a good option, neither keep using the utf8. Reducing the length was possible because the columns will probably never met the actual length which is 255 characters, reducing it to 191 was optimal and in no way impact the table.

Index length

If changing the length was not possible or desired option, changing the column index to only a chunk of n characters, is another possible option.

CREATE INDEX `index_name` ON `posts` (title(191));