MySQL - Prepping MySQL Indexes for a Character Set Change

When changing a MySQL table or column to a new character set that uses more bytes per character than the old one, you first need to check whether any schema changes are needed to accommodate the increase. For example, changing from latin1 to utf8 is an increase from 1 to 3 bytes per character, and changing from utf8 to utf8mb4 is an increase from 3 to 4 bytes per character. The MySQL reference manual has a helpful page with some details on this, but I want to add some examples to show how this schema prep can be accomplished.
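
The per-character byte counts quoted above can be confirmed on any server straight from information_schema; a minimal check using the standard character_sets table:

-- Maximum bytes per character for the character sets mentioned above
select character_set_name, maxlen
from information_schema.character_sets
where character_set_name in ('latin1', 'utf8', 'utf8mb4');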

There are three different types of length limits to take into consideration:

- Index length limits
- Column length limits
- Row length limits

In this post I will focus on index length limits, and I’ll save columns and rows for future posts. Read on for details.

Index Length Limits

The specific limits depend on which MySQL version and storage engine is used, and whether innodb_large_prefix is enabled (assuming MySQL 5.5 or higher). If you want to learn more about innodb_large_prefix you may want to read this post. A common example here is an index on a varchar(255) column in the utf8 character set without innodb_large_prefix. That index uses 255 * 3 = 765 bytes plus 2 length bytes, 767 bytes in total, so it just fits the 767 byte per-column limit of an InnoDB index without innodb_large_prefix. If I change that column to utf8mb4 then the index requires 255 * 4 + 2 = 1022 bytes, which causes the DDL to fail.

Here’s an example:

mysql> create table if not exists varchar_test (
    ->   id int auto_increment primary key,
    ->   str varchar(255)
    -> ) engine = InnoDB DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.08 sec)
 
mysql> alter table varchar_test add index str_index (str);
Query OK, 0 rows affected (0.15 sec)
Records: 0  Duplicates: 0  Warnings: 0
 
mysql> alter table varchar_test DEFAULT CHARSET=utf8mb4;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0
 
mysql> alter table varchar_test modify column str varchar(255) CHARACTER SET utf8mb4;
ERROR 1709 (HY000): Index column size too large. The maximum column size is 767 bytes.
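
Whether the 767-byte or the larger 3072-byte per-column limit applies depends on innodb_large_prefix (and the table's row format), so it can help to confirm how the server is configured before auditing. A quick manual check; the audit queries below also read this setting from information_schema.global_variables:

-- Is the larger InnoDB index prefix limit available on this server?
show global variables like 'innodb_large_prefix';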

For the record, here are the index length limits I need to worry about in a character set change:

- MyISAM: 1000 bytes per index
- InnoDB without innodb_large_prefix: 767 bytes per indexed column
- InnoDB with innodb_large_prefix: 3072 bytes per indexed column

In order to identify all of the indexes that I need to change, I run the following two queries on information_schema. (I tried to use the correct size for every relevant data type, but in case I forgot one, or a new data type is added later, the queries fall back to a large default value; that returns false positives I can then go in and fix.) These queries can be easily modified for a change to another character set such as utf8, utf16, utf32, etc. (see the small example after Query #2).

Query #1:

-- Show me all index columns that will exceed the length limit if I change to utf8mb4
set @new_character_set = 'utf8mb4';
 
select table_schema, table_name, engine, index_name, column_name, column_type, 
  sub_part, index_column_length_in_bytes
from
(
select t.table_schema,t.table_name,t.engine,s.index_name,c.column_name,
c.column_type,c.data_type,c.character_maximum_length,c.character_octet_length,
s.sub_part,c.character_set_name,cs.maxlen,
coalesce(
(
  case 
  when s.sub_part is null then (cs.maxlen * c.character_maximum_length) 
  else (cs.maxlen * s.sub_part) 
  end
),
(case (c.data_type)
when 'tinyint' then 1
when 'smallint' then 2
when 'mediumint' then 3
when 'int' then 4
when 'bigint' then 8
when 'decimal' then 4
when 'float' then 4
when 'year' then 1
when 'date' then 3
when 'datetime' then 8
when 'time' then 3
when 'timestamp' then 4
else 1000000
end)) as index_column_length_in_bytes 
from information_schema.statistics s 
inner join information_schema.columns c on c.table_schema = s.table_schema 
  and c.table_name = s.table_name 
  and c.column_name = s.column_name 
inner join information_schema.tables t on t.table_schema = c.table_schema 
  and t.table_name = c.table_name
inner join information_schema.character_sets cs on cs.character_set_name = @new_character_set
where s.index_type != 'FULLTEXT'
) sub_query
where index_column_length_in_bytes > (
  select case 
  when engine = 'MyISAM' then 1000 
  when engine = 'InnoDB' and (select max(innodb_large_prefix) from (
      select variable_value='ON' as innodb_large_prefix 
      from information_schema.global_variables 
      where variable_name = 'innodb_large_prefix' 
    union 
      select 0 as innodb_large_prefix
    ) sub_query) = 1 then 3072 
  else 767 
  end
)
order by table_schema, table_name, index_name;

Query #2:

-- Show me all indexes that will exceed the length limit if I change to utf8mb4
set @new_character_set = 'utf8mb4';
 
select table_schema, table_name, engine, index_name, columns_in_index, index_length_in_bytes
from
(
select t.table_schema,t.table_name,t.engine,s.index_name,
group_concat(concat(s.column_name,
case 
  when s.sub_part is null 
  then '' 
  else concat('(',s.sub_part,')') 
  end
) order by s.seq_in_index) as columns_in_index,
sum(coalesce(
(
  case 
  when s.sub_part is null 
  then (cs.maxlen * c.character_maximum_length) 
  else (cs.maxlen * s.sub_part) 
  end
),
(
  case (c.data_type)
  when 'tinyint' then 1
  when 'smallint' then 2
  when 'mediumint' then 3
  when 'int' then 4
  when 'bigint' then 8
  when 'decimal' then 4
  when 'float' then 4
  when 'year' then 1
  when 'date' then 3
  when 'datetime' then 8
  when 'time' then 3
  when 'timestamp' then 4
  else 1000000
  end
)
)) as index_length_in_bytes 
from information_schema.statistics s 
inner join information_schema.columns c on c.table_schema = s.table_schema 
  and c.table_name = s.table_name 
  and c.column_name = s.column_name 
inner join information_schema.tables t on t.table_schema = c.table_schema 
  and t.table_name = c.table_name
inner join information_schema.character_sets cs on cs.character_set_name = @new_character_set
where s.index_type != 'FULLTEXT'
group by t.table_schema,t.table_name,t.engine,s.index_name
) sub_query
where index_length_in_bytes > (
  select case 
  when engine = 'MyISAM' then 1000 
  else 3072 
  end
)
order by table_schema, table_name, index_name;
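
As noted above, pointing either query at a different target character set only requires changing the session variable set at the top. For example, to audit for a planned move to utf16 (up to 4 bytes per character):

set @new_character_set = 'utf16';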

Once I identify the affected indexes, I need to resolve each one by shortening the sub-part (prefix length) of one or more columns in the index, removing columns from the index, decreasing the length of the column, or deciding not to change the character set of the relevant tables/columns.
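
As a minimal sketch of the first option applied to the varchar_test table from the earlier example (assuming a 191-character prefix still serves the queries that use this index; shrinking the column to varchar(191) or dropping the index would work similarly):

-- Shorten the index to a 191-character prefix: 191 * 4 = 764 bytes, under the 767-byte limit
alter table varchar_test drop index str_index, add index str_index (str(191));

-- The character set change that failed earlier now succeeds
alter table varchar_test modify column str varchar(255) CHARACTER SET utf8mb4;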

NOTE: In MySQL 5.7, innodb_large_prefix is ON by default and the default row format is DYNAMIC, so this should hopefully become a thing of the past.