After changing the default charset from
utf8mb4 to support emojis on Directus, we started to received errors that the key was too long.
How can be too long if we only change the charset? one may ask.
First these are the errors I'm talking about:
#1071 - Specified key was too long; max key length is 767 bytes #1071 - Specified key was too long; max key length is 1000 bytes #1071 - Specified key was too long; max key length is 3072 bytes
It can be any of those errors depending on whether the table's storage engine is: MySIAM, InnoDb or InnoDb with
utf8 charset requires only 3 bytes per character, while
utf8mb4 requires 4 bytes. This means when you use
utf8mb4 charset you have to use at most 191 characters in a string column.
191 characters x 4 bytes = 764 bytes which is less than the maximum length of 767 bytes.
String storage size vary depends on whether the column is fixed-length or variable-length. It also depends on the charset, it takes more bytes to storage a japanese character than an ASCII/Latin letter.
As an example,
CHAR is a fixed-length while
TEXT are variable-length.
All fixed-length data types uses all the bytes they were declared. For example
CHAR(16), no matter what's its value, it's right padded with spaces to fill up to the specific length. On the other hand
VARCHAR only uses 1 byte + the content size.
VARCHAR requires a prefix value of 1 byte to store the length of the string if the size is less than 256, otherwise it will uses 2 bytes.
One tip is not to use CHAR if you are not going to use all the characters almost all the time, because the size can pile up with empty strings column.
UTF8 character set uses a maximum of 3 bytes per character and only contains Basic Multilingual Plane (BMP) characters, which is the home of 65,536 characters (16 bits) from
UTF8mb4 character set uses a maximum of 4 bytes per character including all of BMP characters and Supplementary Multilingual Plane (SMP) which include another possibility of 65,536 new characters from
Emojis (Unicode characters)
UTF8 can support emojis, but not all of them. All of the new emojis are part of the SMP, so in other to support both basic and supplementary multilingual plane
UTF8mb4 must be used.
The sparkle emoji (✨
U+2728) value is between
U+FFFF then it can be used on
utf8 charset, but the Woman Health Worker (👩
U+1F469) value which is not between
U+FFFF, must use the
utf8mb4 charset that range between
Now all the characters use 4 bytes instead of 3, so all columns that has more than 191 characters now uses more than 767 bytes, because 192 x 4 bytes is 768 bytes.
The options are removing the index, keep using
utf8, add a length to the index key or reduce the length of the column.
Removing the index wasn't a good option, neither keep using the
utf8. Reducing the length was possible because the columns will probably never met the actual length which is 255 characters, reducing it to 191 was optimal and in no way will harm the system.
Changing all string columns with length greater than 191 characters to 191.
If changing the length was not possible or desired option, changing the column index to only a chunk of x characters.
CREATE INDEX `index_name` ON `posts` (title(191));