If you, like me, do not trust automation, this is how I have handled the problem.

First, stop digging! Start by altering the default charset of new tables by changing the DB definition (as in all the other answers):
Then generate SQL to change the default charset for new columns of all existing tables:
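The generated statements have a simple shape. As a minimal sketch in Python, not the statements from this answer, they could be built as below. The helper names `quote_identifier` and `default_charset_statements`, the `utf8mb4` / `utf8mb4_unicode_ci` target and the schema and table names are all assumptions made for illustration.

```python
def quote_identifier(name):
    """Quote a MySQL identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def default_charset_statements(schema, tables,
                               charset="utf8mb4",
                               collation="utf8mb4_unicode_ci"):
    """Return ALTER statements that change only the *defaults*.

    Existing columns keep their current character set; only tables and
    columns created afterwards pick up the new default.
    """
    statements = [
        "ALTER DATABASE {} CHARACTER SET {} COLLATE {};".format(
            quote_identifier(schema), charset, collation)
    ]
    for table in tables:
        statements.append(
            "ALTER TABLE {}.{} DEFAULT CHARACTER SET {} COLLATE {};".format(
                quote_identifier(schema), quote_identifier(table),
                charset, collation))
    return statements


if __name__ == "__main__":
    # 'shop', 'customers' and 'orders' are made-up names for illustration.
    for statement in default_charset_statements("shop", ["customers", "orders"]):
        print(statement)
```

Printing the statements rather than executing them keeps with the spirit of not trusting automation: you can read every line before running it.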
Now we can handle the "legacy" columns.

List the character datatypes you are using:

For me that list was "varchar" and "text".

List the character sets in use:
This gives me "utf8", "latin1", and "utf8mb4", which is one reason I do not trust automation: the latin1 columns risk having dirty data.

Now you can make a list of all columns you need to update with:
Edit: Original syntax above had an error.

Tables containing only utf8 or utf8mb4 could be converted with "CONVERT TO CHARACTER SET" as Mathias and MrJingles describe above, but then you risk MySQL changing the types for you, so you may be better off running "CHANGE COLUMN" instead, since that gives you control of exactly what happens.

If you have non-utf8 columns, these questions may give inspiration about checking the columns' data:

https://stackoverflow.com/q/401771/671282
https://stackoverflow.com/q/9304485/671282

Since you probably know what you expect to have in most of the columns, something like this will probably handle most of them, after modifying the allowed non-ASCII characters to suit your needs:
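The idea of an allow-list check can also be sketched outside the database. The Python below is only an illustration under assumptions, not the query from this answer: the names `ALLOWED_NON_ASCII`, `unexpected_characters` and `suspicious_values` are made up, and the allowed characters are an arbitrary example set. It flags every value that contains a non-ASCII character you did not expect.

```python
# Non-ASCII characters we expect to see; an assumed example set, adjust to your data.
ALLOWED_NON_ASCII = set("åäöÅÄÖéÉ")


def unexpected_characters(value, allowed=ALLOWED_NON_ASCII):
    """Return the characters in value that are neither ASCII nor explicitly allowed."""
    return {ch for ch in value if ord(ch) > 127 and ch not in allowed}


def suspicious_values(values, allowed=ALLOWED_NON_ASCII):
    """Return the values that contain at least one unexpected character."""
    return [v for v in values if unexpected_characters(v, allowed)]
```

Given values fetched from a column, `suspicious_values(["Malmö", "Gothenburg", "MalmÃ¶"])` keeps only the third string, the one where UTF-8 bytes were read as latin1.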
When the above did not fit, I used the query below, which has somewhat "fuzzier" matching:

    SELECT DISTINCT
        CONVERT(CONVERT(column_name USING BINARY) USING latin1) AS latin1,
        CONVERT(CONVERT(column_name USING BINARY) USING utf8) AS utf8
    FROM table_name
    WHERE CONVERT(column_name USING BINARY)
          RLIKE CONCAT('[', UNHEX('C0'), '-', UNHEX('F4'), '][', UNHEX('80'), '-', UNHEX('FF'), ']')
    LIMIT 5;

This query matches any two-byte sequence that could start a UTF-8 character, letting you inspect those records; it may give you a lot of false positives. The utf8 conversion fails, returning NULL, if there is any character it cannot convert, so on a large field there is a good chance of it not being useful.

Class on MySQL and UTF8

Today, we'll look at strings in our database that are not all-ASCII.

Goal

By the end of today, you will be able to:
Concepts
Today's Example

Use DBI

First, we'll enhance the connection to the database. Because it's a bit more work now, and work that we always have to do, we'll put this function in a new module. Here's the function that gets a connection:

    import MySQLdb

    def getConn(db):
        conn = MySQLdb.connect(host='localhost',
                               user='ubuntu',
                               passwd='',
                               db=db)
        # ask for UTF-8 on the connection between our program and the server
        conn.set_character_set('utf8')
        curs = conn.cursor()
        curs.execute('set names utf8;')
        curs.execute('set character set utf8;')
        curs.execute('set character_set_connection=utf8;')
        return conn

These say that data exchanged between our program (the client) and the database server should be sent using UTF-8, which is a particular encoding of Unicode. UTF-8 is an extremely common encoding, and one that is not going to break if someone uses 💩 (pile of poo). Latin-1 is a different encoding and can't handle pile of poo. One caveat: MySQL's utf8 character set stores at most three bytes per character, so storing 💩 in the database itself requires utf8mb4.

Unicode strings

First, a generic converter. This function assumes that the byte string is represented using UTF-8.

    def utf8(val):
        return unicode(val,'utf8') if type(val) is str else val

Rows

Rows can be represented as either tuples or dictionaries, so we should be able to convert either kind:

    def dict2utf8(dic):
        '''Because dictionaries are mutable, this mutates the dictionary;
        it also returns it'''
        for k,v in dic.iteritems():
            dic[k] = utf8(v)
        return dic

    def tuple2utf8(tup):
        '''returns a new tuple, with byte strings converted to unicode strings'''
        return tuple(map(utf8,tup))

For convenience, let's write a generic conversion function:

    def row2utf8(row):
        if type(row) is tuple:
            return tuple2utf8(row)
        elif type(row) is dict:
            return dict2utf8(row)
        else:
            raise TypeError('row is of unhandled type')

That's the meat of the module.
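The functions above are written for Python 2, where `unicode` and `iteritems` exist. As a rough sketch of the same idea under Python 3 (an assumed port, not the course's code; the names `to_text` and `row_to_text` are made up), byte strings are `bytes`, text is `str`, and the converters might look like this:

```python
def to_text(value, encoding="utf-8"):
    """Decode bytes to str; leave every other value (int, date, None, str) alone."""
    return value.decode(encoding) if isinstance(value, bytes) else value


def row_to_text(row, encoding="utf-8"):
    """Return a new row of the same kind with all byte strings decoded."""
    if isinstance(row, dict):
        return {key: to_text(val, encoding) for key, val in row.items()}
    if isinstance(row, tuple):
        return tuple(to_text(val, encoding) for val in row)
    raise TypeError("row is of unhandled type")
```

Unlike `dict2utf8`, this sketch returns a new dictionary instead of mutating its argument, so callers have to use the return value. For example, `row_to_text((b"Zo\xc3\xab", 1988))` gives the tuple `("Zoë", 1988)`.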
people.py

How does our custom database interaction module have to change? First, when we get our list of people, we need to convert them all to unicode strings:

    import MySQLdb.cursors
    import dbi

    def getPeople(conn):
        '''Returns a list of rows, as dictionaries.'''
        curs = conn.cursor(MySQLdb.cursors.DictCursor)
        curs.execute('select name,birthdate from person')
        all = curs.fetchall()
        # each row is a dictionary, so row2utf8 converts it in place
        for p in all:
            dbi.row2utf8(p)
        return all

That's pretty much it.
Browser Charset

We have to tell the browser that we are using UTF-8 (as opposed to Latin1 or ASCII). We've actually been doing this all along. Look at the top of the template:

    <meta charset="utf-8">

See more about the meta tag.

Our Flask App

How does our app have to change? Not at all! All the ugliness has been hidden.

    @app.route('/people/')
    def displayPeople():
        conn = dbi.getConn('wmdb')
        # we could also write a different query
        # getting a subset of the people
        # and render it with the same template
        all = people.getPeople(conn)
        now = servertime.now()
        # no trailing comma here: it would turn desc into a one-element tuple
        desc = 'All people as of {}'.format(now)
        return render_template('people-list.html', desc=desc, people=all)
Summary
Summer Example

You may find the following helpful:

    ete_str = '\xc3\xa9t\xc3\xa9'
    ete_utf8 = unicode(ete_str,'utf8')
    ete_latin1 = ete_utf8.encode('latin1')
    print 'byte string',len(ete_str),ete_str
    print 'utf8',len(ete_utf8),ete_utf8.encode('utf8')
    print 'latin1',len(ete_latin1),ete_latin1
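The snippet above uses Python 2 syntax. As a rough Python 3 counterpart (the function name `encoded_lengths` is made up for this sketch), the same three measurements can be taken like this:

```python
def encoded_lengths(text):
    """Return (characters, UTF-8 bytes, Latin-1 bytes) for a text string."""
    return len(text), len(text.encode("utf-8")), len(text.encode("latin-1"))


if __name__ == "__main__":
    ete_bytes = b"\xc3\xa9t\xc3\xa9"       # the UTF-8 bytes of the French word for summer
    ete_text = ete_bytes.decode("utf-8")   # three characters of text
    print(encoded_lengths(ete_text))       # (3, 5, 3)
```

The word has three characters but five bytes in UTF-8, because each accented letter takes two bytes there and only one in Latin-1.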
How do I change mysql from UTF

Similarly, here's the command to change the character set of a MySQL table from latin1 to UTF8. Replace table_name with your table name.

    mysql> ALTER TABLE table_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Hopefully, the above will help you change the table character set to UTF-8. To get utf8mb4, which also covers four-byte characters, use utf8mb4 and a utf8mb4 collation such as utf8mb4_unicode_ci in the same command.
How to set UTF

To change the character set encoding to UTF-8 for the database itself, type the following command at the mysql> prompt. Replace dbname with the database name:

    ALTER DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci;

To exit the mysql program, type \q at the mysql> prompt.
How do I change the encoding of a column in SQL?

The process: First, convert the column to the associated binary type; for a TEXT column that is BLOB (ALTER TABLE MyTable MODIFY MyColumn BLOB). Then convert the column back to the original type and set the character set to UTF-8 at the same time (ALTER TABLE MyTable MODIFY MyColumn TEXT CHARACTER SET utf8 COLLATE utf8_general_ci).
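The detour through a binary type matters when the stored bytes are already UTF-8 but the column is declared as latin1. The Python sketch below only models the two behaviours on raw bytes; it is an illustration under that assumption, and the function names `transcode` and `relabel` are made up.

```python
def transcode(stored_bytes, declared="latin-1", target="utf-8"):
    """Model a direct conversion: decode using the declared charset, re-encode."""
    return stored_bytes.decode(declared).encode(target)


def relabel(stored_bytes, actual="utf-8"):
    """Model the binary round trip: the bytes stay as they are, only the label changes.

    Raises UnicodeDecodeError if the bytes are not valid in the actual encoding.
    """
    stored_bytes.decode(actual)  # validation only
    return stored_bytes


if __name__ == "__main__":
    stored = b"\xc3\xa9"          # UTF-8 bytes of e-acute, sitting in a latin1 column
    print(transcode(stored))      # four bytes: the character is now double-encoded
    print(relabel(stored))        # the original two bytes, unchanged
```

In this model a direct conversion turns one two-byte character into four bytes of mojibake, while the round trip leaves the data intact.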