Mysql convert utf8 to ascii

If you, like me, do not trust automation, this is how I have handled the problem.

First Stop digging!

Start with altering the default charset of new tables by changing the DB definition(like in all other answers):

ALTER DATABASE database_name 
  CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

Then generate sql to change the default charset for new columns of all existing tables:

SELECT concat("ALTER TABLE `",table_schema,"`.`",table_name,"` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_swedish_ci;") as _sql
  FROM information_schema.TABLES
 WHERE table_schema like "database_name" and TABLE_TYPE="BASE TABLE"
 GROUP BY table_schema, table_name ;

Now we can handle the "legacy"

List character datatypes you are using:

select distinct data_type  from information_schema.columns where table_schema = "database_name" and CHARACTER_SET_NAME is not null;

For me that list was "varchar" and "text"

List character_SETS_ in use:

select distinct character_set_name from information_schema.columns where table_schema = "database_name";

This gives me "utf8", "latin1", and "utf8mb4" which is a reason I do not trust automation, the latin1 columns risk having dirty data.

Now you can make a list of all columns you need to update with:

select table_name, column_name, data_type, character_set_name, collation_name
  from information_schema.columns 
 where table_schema = "database_name" and CHARACTER_SET_NAME is not null AND CHARACTER_SET_NAME <> "utf8mb4"
 group by table_name, data_type, character_set_name, collation_name;

Edit: Original syntax above had an error.

Tables containing only utf8 or utf8mb4 could be converted with "CONVERT TO CHARACTER SET" as Mathias and MrJingles describes above, but then you risk MySQL changing the types for you, so you may be better of running "CHANGE COLUMN" instead since that gives you control of exactly what happens.

If you have non-utf8 columns these questions may give inspiration about checking the columns data: https://stackoverflow.com/q/401771/671282 https://stackoverflow.com/q/9304485/671282

Since you probably know what you expect to have in most of the columns something like this will probably handle most of them after modifying the non-ascii chars allowed to suit your needs:

SELECT distinct section FROM table_name WHERE column_name NOT REGEXP '^([A-Za-z0-9åäöÅÄÖ&.,_ -])*$';

When the above did not fit I used the below that have a bit "fuzzier" maching:

SELECT distinct
  CONVERT(CONVERT(column_name USING BINARY) USING latin1) AS latin1,
  CONVERT(CONVERT(column_name USING BINARY) USING utf8) AS utf8
FROM table_name
WHERE CONVERT(column_name USING BINARY) RLIKE CONCAT('[', UNHEX('C0'), '-', UNHEX('F4'), '][',UNHEX('80'),'-',UNHEX('FF'),']') limit 5;

This query matches any two characters that could start an utf8-character, thus allowing you to inspect those records, it may give you a lot of false positives. The utf8 conversion fails returning null if there is any character it can not convert, so in a large field there is a good chance of it not being useful.

Mysql convert utf8 to ascii

Class on MySQL and UTF8

Today, we'll look at strings in our database that are not all-ASCII.

Goal

By the end of today, you will be able to:

  • Read database values that contain UTF8-encoded strings
  • Render a template that displays them.

Concepts

  • In Python2, there's a difference between byte strings and unicode strings. Unicode strings are representation-neutral and can handle any unicode character. Byte strings are sequences of bytes, and so require knowing the representational encoding to understand them.
  • With plain ASCII characters, they look like this:
    • str1 = 'hello'
    • str2 = u'hello'
  • With non-ASCII characters, they might look like this (representing a single Unicode code-point, namely é (latin small letter e with accent acute)
    • e1 = '\xc3\xa9'
    • e2 = u'\U00e9'
  • With extremely non-ASCII characters, they might look like this (representing a single Unicode code-point, namely 💩 (pile of poo)
    • poo1 = '\xf0\x9f\x92\xa9'
    • poo2 = u'\U0001f4a9'
  • (In Python3, there are only unicode strings; things are very different, though you still have to know about encodings.)
  • You still need to know something about representations (encodings/decodings) in order to print unicode strings.
  • Later in the course, we'll learn about Unicode and different encodings (representations) of international character sets.
  • You can use several MySQL statements to tell the database what encoding you want the result sent as.
  • You then need to use those encodings to convert the byte strings to unicode strings.
  • We can directly render unicode strings into our HTML templates and send them to the browse as UTF-8.

Today's Example

Use curl to download utf8.tar.

DBI

First, we'll enhance the connection to the database. Because it's a bit more work now, and work that we always have to do, we'll put this function in a new module that I've called dbi for DataBase Interface.

Here's the function that gets a connection:

def getConn(db):
    conn = MySQLdb.connect(host='localhost',
                           user='ubuntu',
                           passwd='',
                           db=db)
    conn.set_character_set('utf8')
    curs = conn.cursor()
    curs.execute('set names utf8;')
    curs.execute('set character set utf8;')
    curs.execute('set character_set_connection=utf8;')
    return conn

These say that when we send stuff to the client, we want it sent using UTF8, which is a particular encoding of Unicode. UTF-8 is an extremely common encoding, and one that is not going to break if someone uses 💩 (pile of poo). Latin-1 is a different encoding and can't handle pile of poo.

Unicode strings

The MySQLdb api still gives us byte strings instead of unicode strings. So, we need to convert them. Since any Python code that uses non-ASCII needs these functions, we'll add them to our dbi module.

First a generic converter. This function assumes that the byte string is represented using UTF-8.

def utf8(val):
    return unicode(val,'utf8') if type(val) is str else val

Rows

Rows can be represented as either tuples or dictionaries, so we should be able to convert either kind:

def dict2utf8(dic):
    '''Because dictionaries are mutable,
this mutates the dictionary;
it also returns it'''
    for k,v in dic.iteritems():
        dic[k] = utf8(v)
    return dic

def tuple2utf8(tup):
    '''returns a new tuple, with byte strings
converted to unicode strings'''
    return tuple(map(utf8,tup))

For convenience, let's write a generic conversion function:

def row2utf8(row):
    if type(row) is tuple:
        return tuple2utf8(row)
    elif type(row) is dict:
        return dict2utf8(row)
    else:
        raise TypeError('row is of unhandled type')

That's the meat of the dbi.py module.

  1. use diff to compare wmdb-ascii.sql with wmdb.sql.
  2. load the wmdb.sql file into MySQL
  3. read the testing code end of the dbi.py module.
  4. run the dbi.py script to see it in action.

people.py

How does our custom database interaction module have to change? First, when we get our list of people, we need to convert them all to unicode strings:

def getPeople(conn):
    '''Returns a list of rows, as dictionaries.'''
    curs = conn.cursor(MySQLdb.cursors.DictCursor)
    curs.execute('select name,birthdate from person')
    all = curs.fetchall()
    for p in all:
        dbi.row2utf8(p)
    return all

That's pretty much it.

  1. read the testing code end of the people.py module.
  2. run the people.py script to see it in action.
  3. pipe the output through grep to focus on just Chalamet:
    python people.py | grep Chalamet
    

Browser Charset

We have to tell the browser that we are using UTF-8 (as opposed to Latin1 or ASCII). We've actually been doing this all along. Look at the top of templates/people-list.html and you'll see:

<meta charset="utf-8">

See more about the meta tag.

Our Flask App

How does our app have to change? Not at all! All the ugliness has been hidden.

@app.route('/people/')
def displayPeople():
    conn = dbi.getConn('wmdb')
    # we could also write a different query
    # getting a subset of the people
    # and render it with the same template
    all = people.getPeople(conn)
    now = servertime.now()
    desc = 'All people as of {}'.format(now), 
    return render_template('people-list.html', 
                          desc=desc,
                          people=all)

  1. run the app
  2. click on the people list link
  3. search for Chalamet to see that he's handled correctly.

Summary

  • Use UTF-8 throughout, and you'll probably be safe.

Summer Example

You may find the ete.py example interesting:

ete_str = '\xc3\xa9t\xc3\xa9'
ete_utf8 = unicode(ete_str,'utf8')
ete_latin1 = ete_utf8.encode('latin1')

print 'byte string',len(ete_str),ete_str
print 'utf8',len(ete_utf8),ete_utf8.encode('utf8')
print 'latin1',len(ete_latin1),ete_latin1

  1. Run the example
  2. Consider the lengths of the strings. Is that what you expected?
  3. Consider the output. What's wrong with the latin1 string?
  4. Do
    printenv LC_ALL

    This shows that our shell is expecting UTF-8, not Latin1.

How do I change mysql from UTF

Similarly, here's the command to change character set of MySQL table from latin1 to UTF8. Replace table_name with your database table name. mysql> ALTER TABLE table_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci; Hopefully, the above tutorial will help you change database character set to utf8mb4 (UTF-8).

How to set UTF

To change the character set encoding to UTF-8 for the database itself, type the following command at the mysql> prompt. Replace dbname with the database name: Copy ALTER DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci; To exit the mysql program, type \q at the mysql> prompt.

How do I change the encoding of a column in SQL?

The Process.
Convert the column to the associated BINARY-type (ALTER TABLE MyTable MODIFY MyColumn BINARY).
Convert the column back to the original type and set the character set to UTF-8 at the same time (ALTER TABLE MyTable MODIFY MyColumn TEXT CHARACTER SET utf8 COLLATE utf8_general_ci).