parts/django/docs/ref/unicode.txt
changeset 69 c6bca38c1cbf
68:5ff1fc726848 69:c6bca38c1cbf
       
     1 ============
       
     2 Unicode data
       
     3 ============
       
     4 
       
     5 .. versionadded:: 1.0
       
     6 
       
     7 Django natively supports Unicode data everywhere. Provided your database can
       
     8 somehow store the data, you can safely pass around Unicode strings to
       
     9 templates, models and the database.
       
    10 
       
    11 This document tells you what you need to know if you're writing applications
       
    12 that use data or templates that are encoded in something other than ASCII.
       
    13 
       
    14 Creating the database
       
    15 =====================
       
    16 
       
    17 Make sure your database is configured to be able to store arbitrary string
       
    18 data. Normally, this means giving it an encoding of UTF-8 or UTF-16. If you use
       
    19 a more restrictive encoding -- for example, latin1 (iso8859-1) -- you won't be
       
    20 able to store certain characters in the database, and information will be lost.
       
    21 
       
    22  * MySQL users, refer to the `MySQL manual`_ (section 9.1.3.2 for MySQL 5.1)
       
    23    for details on how to set or alter the database character set encoding.
       
    24 
       
    25  * PostgreSQL users, refer to the `PostgreSQL manual`_ (section 21.2.2 in
       
    26    PostgreSQL 8) for details on creating databases with the correct encoding.
       
    27 
       
    28  * SQLite users, there is nothing you need to do. SQLite always uses UTF-8
       
    29    for internal encoding.
       
    30 
       
    31 .. _MySQL manual: http://dev.mysql.com/doc/refman/5.1/en/charset-database.html
       
    32 .. _PostgreSQL manual: http://www.postgresql.org/docs/8.2/static/multibyte.html#AEN24104
       
    33 
       
    34 All of Django's database backends automatically convert Unicode strings into
       
    35 the appropriate encoding for talking to the database. They also automatically
       
    36 convert strings retrieved from the database into Python Unicode strings. You
       
    37 don't even need to tell Django what encoding your database uses: that is
       
    38 handled transparently.
       
    39 
       
    40 For more, see the section "The database API" below.
       
    41 
       
    42 General string handling
       
    43 =======================
       
    44 
       
    45 Whenever you use strings with Django -- e.g., in database lookups, template
       
    46 rendering or anywhere else -- you have two choices for encoding those strings.
       
    47 You can use Unicode strings, or you can use normal strings (sometimes called
       
    48 "bytestrings") that are encoded using UTF-8.
       
    49 
       
    50 .. admonition:: Warning
       
    51 
       
    52     A bytestring does not carry any information with it about its encoding.
       
    53     For that reason, we have to make an assumption, and Django assumes that all
       
    54     bytestrings are in UTF-8.
       
    55 
       
    56     If you pass a string to Django that has been encoded in some other format,
       
    57     things will go wrong in interesting ways. Usually, Django will raise a
       
    58     ``UnicodeDecodeError`` at some point.
       
    59 
       
    60 If your code only uses ASCII data, it's safe to use your normal strings,
       
    61 passing them around at will, because ASCII is a subset of UTF-8.
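The warning above can be reproduced outside Django. The following is a minimal sketch in Python 3, where ``bytes`` objects stand in for the bytestrings this document describes; it uses only the standard library and none of Django's own code, and the sample text is an assumption chosen for illustration.

```python
# Python 3 sketch: bytes play the role of the "bytestrings" described above.
text = "Orléans"

utf8_bytes = text.encode("utf-8")      # two bytes for the accented letter
latin1_bytes = text.encode("latin-1")  # one byte for the accented letter

# Decoding UTF-8 bytes as UTF-8 recovers the original text.
round_trip = utf8_bytes.decode("utf-8")

# Decoding Latin-1 bytes as UTF-8 fails: the byte 0xE9 announces a
# multi-byte sequence that the following bytes do not complete.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Pure ASCII data is byte-for-byte identical in ASCII and UTF-8.
ascii_is_subset = "Paris".encode("ascii") == "Paris".encode("utf-8")
```

This is why a bytestring in any encoding other than UTF-8 tends to surface as a ``UnicodeDecodeError`` somewhere downstream, while ASCII-only data passes through unharmed.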
       
    62 
       
    63 Don't be fooled into thinking that if your :setting:`DEFAULT_CHARSET` setting is set
       
    64 to something other than ``'utf-8'`` you can use that other encoding in your
       
    65 bytestrings! :setting:`DEFAULT_CHARSET` only applies to the strings generated as
       
    66 the result of template rendering (and e-mail). Django will always assume UTF-8
       
    67 encoding for internal bytestrings. The reason for this is that the
       
    68 :setting:`DEFAULT_CHARSET` setting is not actually under your control (if you are the
       
    69 application developer). It's under the control of the person installing and
       
    70 using your application -- and if that person chooses a different setting, your
       
    71 code must still continue to work. Ergo, it cannot rely on that setting.
       
    72 
       
    73 In most cases when Django is dealing with strings, it will convert them to
       
    74 Unicode strings before doing anything else. So, as a general rule, if you pass
       
    75 in a bytestring, be prepared to receive a Unicode string back in the result.
       
    76 
       
    77 Translated strings
       
    78 ------------------
       
    79 
       
    80 Aside from Unicode strings and bytestrings, there's a third type of string-like
       
    81 object you may encounter when using Django. The framework's
       
    82 internationalization features introduce the concept of a "lazy translation" --
       
    83 a string that has been marked as translated but whose actual translation result
       
    84 isn't determined until the object is used in a string. This feature is useful
       
    85 in cases where the translation locale is unknown until the string is used, even
       
    86 though the string might have originally been created when the code was first
       
    87 imported.
       
    88 
       
    89 Normally, you won't have to worry about lazy translations. Just be aware that
       
    90 if you examine an object and it claims to be a
       
    91 ``django.utils.functional.__proxy__`` object, it is a lazy translation.
       
    92 Calling ``unicode()`` with the lazy translation as the argument will generate a
       
    93 Unicode string in the current locale.
       
    94 
       
    95 For more details about lazy translation objects, refer to the
       
    96 :doc:`internationalization </topics/i18n/index>` documentation.
       
    97 
       
    98 Useful utility functions
       
    99 ------------------------
       
   100 
       
   101 Because some string operations come up again and again, Django ships with a few
       
   102 useful functions that should make working with Unicode and bytestring objects
       
   103 a bit easier.
       
   104 
       
   105 Conversion functions
       
   106 ~~~~~~~~~~~~~~~~~~~~
       
   107 
       
   108 The ``django.utils.encoding`` module contains a few functions that are handy
       
   109 for converting back and forth between Unicode and bytestrings.
       
   110 
       
   111     * ``smart_unicode(s, encoding='utf-8', strings_only=False, errors='strict')``
       
   112       converts its input to a Unicode string. The ``encoding`` parameter
       
   113       specifies the input encoding. (For example, Django uses this internally
       
   114       when processing form input data, which might not be UTF-8 encoded.) The
       
   115       ``strings_only`` parameter, if set to True, will result in Python
       
   116       numbers, booleans and ``None`` not being converted to a string (they keep
       
   117       their original types). The ``errors`` parameter takes any of the values
       
   118       that are accepted by Python's ``unicode()`` function for its error
       
   119       handling.
       
   120 
       
   121       If you pass ``smart_unicode()`` an object that has a ``__unicode__``
       
   122       method, it will use that method to do the conversion.
       
   123 
       
   124     * ``force_unicode(s, encoding='utf-8', strings_only=False,
       
   125       errors='strict')`` is identical to ``smart_unicode()`` in almost all
       
   126       cases. The difference is when the first argument is a :ref:`lazy
       
   127       translation <lazy-translations>` instance. While ``smart_unicode()``
       
   128       preserves lazy translations, ``force_unicode()`` forces those objects to a
       
   129       Unicode string (causing the translation to occur). Normally, you'll want
       
   130       to use ``smart_unicode()``. However, ``force_unicode()`` is useful in
       
   131       template tags and filters that absolutely *must* have a string to work
       
   132       with, not just something that can be converted to a string.
       
   133 
       
   134     * ``smart_str(s, encoding='utf-8', strings_only=False, errors='strict')``
       
   135       is essentially the opposite of ``smart_unicode()``. It forces the first
       
   136       argument to a bytestring. The ``strings_only`` parameter has the same
       
   137       behavior as for ``smart_unicode()`` and ``force_unicode()``. This is
       
   138       slightly different semantics from Python's builtin ``str()`` function,
       
   139       but the difference is needed in a few places within Django's internals.
       
   140 
       
   141 Normally, you'll only need to use ``smart_unicode()``. Call it as early as
       
   142 possible on any input data that might be either Unicode or a bytestring, and
       
   143 from then on, you can treat the result as always being Unicode.
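The "convert early, then treat everything as Unicode" pattern might look like the sketch below. It is written for Python 3, and ``to_text()`` and ``greet()`` are hypothetical helpers that only mimic the spirit of ``smart_unicode()``; they are not Django's implementation and ignore lazy translations and ``strings_only``.

```python
def to_text(value, encoding="utf-8", errors="strict"):
    """Hypothetical helper: return ``value`` as text, decoding bytes if needed."""
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding, errors)
    # Anything else (numbers, arbitrary objects) goes through str().
    return str(value)


def greet(name):
    # Normalize at the boundary; the rest of the function sees only text.
    name = to_text(name)
    return "Hello, " + name


from_bytes = greet(b"Fran\xc3\xa7ois")  # UTF-8 encoded input
from_text = greet("François")           # already text
```

Both calls produce the same text value, so code after the conversion never has to ask which kind of string it received.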
       
   144 
       
   145 URI and IRI handling
       
   146 ~~~~~~~~~~~~~~~~~~~~
       
   147 
       
   148 Web frameworks have to deal with URLs (which are a type of IRI_). One
       
   149 requirement of URLs is that they are encoded using only ASCII characters.
       
   150 However, in an international environment, you might need to construct a
       
   151 URL from an IRI_ -- very loosely speaking, a URI that can contain Unicode
       
   152 characters. Quoting and converting an IRI to URI can be a little tricky, so
       
   153 Django provides some assistance.
       
   154 
       
   155     * The function ``django.utils.encoding.iri_to_uri()`` implements the
       
   156       conversion from IRI to URI as required by the specification (`RFC
       
   157       3987`_).
       
   158 
       
   159     * The functions ``django.utils.http.urlquote()`` and
       
   160       ``django.utils.http.urlquote_plus()`` are versions of Python's standard
       
   161       ``urllib.quote()`` and ``urllib.quote_plus()`` that work with non-ASCII
       
   162       characters. (The data is converted to UTF-8 prior to encoding.)
       
   163 
       
   164 These two groups of functions have slightly different purposes, and it's
       
   165 important to keep them straight. Normally, you would use ``urlquote()`` on the
       
   166 individual portions of the IRI or URI path so that any reserved characters
       
   167 such as '&' or '%' are correctly encoded. Then, you apply ``iri_to_uri()`` to
       
   168 the full IRI and it converts any non-ASCII characters to the correct encoded
       
   169 values.
       
   170 
       
   171 .. note::
       
   172     Technically, it isn't correct to say that ``iri_to_uri()`` implements the
       
   173     full algorithm in the IRI specification. It doesn't (yet) perform the
       
   174     international domain name encoding portion of the algorithm.
       
   175 
       
   176 The ``iri_to_uri()`` function will not change ASCII characters that are
       
   177 otherwise permitted in a URL. So, for example, the character '%' is not
       
   178 further encoded when passed to ``iri_to_uri()``. This means you can pass a
       
   179 full URL to this function and it will not mess up the query string or anything
       
   180 like that.
       
   181 
       
   182 An example might clarify things here::
       
   183 
       
   184     >>> urlquote(u'Paris & Orléans')
       
   185     u'Paris%20%26%20Orl%C3%A9ans'
       
   186     >>> iri_to_uri(u'/favorites/François/%s' % urlquote(u'Paris & Orléans'))
       
   187     '/favorites/Fran%C3%A7ois/Paris%20%26%20Orl%C3%A9ans'
       
   188 
       
   189 If you look carefully, you can see that the portion that was generated by
       
   190 ``urlquote()`` in the second example was not double-quoted when passed to
       
   191 ``iri_to_uri()``. This is a very important and useful feature. It means that
       
   192 you can construct your IRI without worrying about whether it contains
       
   193 non-ASCII characters and then, right at the end, call ``iri_to_uri()`` on the
       
   194 result.
       
   195 
       
   196 The ``iri_to_uri()`` function is also idempotent, which means the following is
       
   197 always true::
       
   198 
       
   199     iri_to_uri(iri_to_uri(some_string)) == iri_to_uri(some_string)
       
   200 
       
   201 So you can safely call it multiple times on the same IRI without risking
       
   202 double-quoting problems.
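The two-step quoting and the idempotence property can be approximated with Python 3's standard library. In this sketch ``iri_to_uri_sketch()`` is a hypothetical stand-in, and the set of characters it leaves untouched is an assumption chosen for illustration; it is not presented as Django's actual implementation.

```python
from urllib.parse import quote

# Assumption: ASCII characters that this sketch leaves alone. Note that
# '%' is included, so existing percent-escapes are not quoted again.
_SAFE = "/#%[]=:;$&()+,!?*@'~"


def iri_to_uri_sketch(iri):
    """Percent-encode non-ASCII characters as UTF-8, keeping URL syntax intact."""
    return quote(iri, safe=_SAFE)


# Step 1: quote an individual path segment so '&' and spaces are escaped.
segment = quote("Paris & Orléans")

# Step 2: build the full IRI, then convert it once at the end.
uri = iri_to_uri_sketch("/favorites/François/%s" % segment)

# Applying the conversion again changes nothing.
uri_twice = iri_to_uri_sketch(uri)
```

Because ``%`` is treated as safe in the second step, the escapes produced in the first step survive unchanged, which is the property the text above relies on.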
       
   203 
       
   204 .. _URI: http://www.ietf.org/rfc/rfc2396.txt
       
   205 .. _IRI: http://www.ietf.org/rfc/rfc3987.txt
       
   206 .. _RFC 3987: IRI_
       
   207 
       
   208 Models
       
   209 ======
       
   210 
       
   211 Because all strings are returned from the database as Unicode strings, model
       
   212 fields that are character based (CharField, TextField, URLField, etc.) will
       
   213 contain Unicode values when Django retrieves data from the database. This
       
   214 is *always* the case, even if the data could fit into an ASCII bytestring.
       
   215 
       
   216 You can pass in bytestrings when creating a model or populating a field, and
       
   217 Django will convert them to Unicode when it needs to.
       
   218 
       
   219 Choosing between ``__str__()`` and ``__unicode__()``
       
   220 ----------------------------------------------------
       
   221 
       
   222 One consequence of using Unicode by default is that you have to take some care
       
   223 when printing data from the model.
       
   224 
       
   225 In particular, rather than giving your model a ``__str__()`` method, we
       
   226 recommend you implement a ``__unicode__()`` method. In the ``__unicode__()``
       
   227 method, you can quite safely return the values of all your fields without
       
   228 having to worry about whether they fit into a bytestring or not. (The way
       
   229 Python works, the result of ``__str__()`` is *always* a bytestring, even if you
       
   230 accidentally try to return a Unicode object.)
       
   231 
       
   232 You can still create a ``__str__()`` method on your models if you want, of
       
   233 course, but you shouldn't need to do this unless you have a good reason.
       
   234 Django's ``Model`` base class automatically provides a ``__str__()``
       
   235 implementation that calls ``__unicode__()`` and encodes the result into UTF-8.
       
   236 This means you'll normally only need to implement a ``__unicode__()`` method
       
   237 and let Django handle the coercion to a bytestring when required.
       
   238 
       
   239 Taking care in ``get_absolute_url()``
       
   240 -------------------------------------
       
   241 
       
   242 URLs can only contain ASCII characters. If you're constructing a URL from
       
   243 pieces of data that might be non-ASCII, be careful to encode the results in a
       
   244 way that is suitable for a URL. The ``django.db.models.permalink()`` decorator
       
   245 handles this for you automatically.
       
   246 
       
   247 If you're constructing a URL manually (i.e., *not* using the ``permalink()``
       
   248 decorator), you'll need to take care of the encoding yourself. In this case,
       
   249 use the ``iri_to_uri()`` and ``urlquote()`` functions that were documented
       
   250 above_. For example::
       
   251 
       
   252     from django.utils.encoding import iri_to_uri
       
   253     from django.utils.http import urlquote
       
   254 
       
   255     def get_absolute_url(self):
       
   256         url = u'/person/%s/?x=0&y=0' % urlquote(self.location)
       
   257         return iri_to_uri(url)
       
   258 
       
   259 This function returns a correctly encoded URL even if ``self.location`` is
       
   260 something like "Jack visited Paris & Orléans". (In fact, the ``iri_to_uri()``
       
   261 call isn't strictly necessary in the above example, because all the
       
   262 non-ASCII characters would have been removed in quoting in the first line.)
       
   263 
       
   264 .. _above: `URI and IRI handling`_
       
   265 
       
   266 The database API
       
   267 ================
       
   268 
       
   269 You can pass either Unicode strings or UTF-8 bytestrings as arguments to
       
   270 ``filter()`` methods and the like in the database API. The following two
       
   271 querysets are identical::
       
   272 
       
   273     qs = People.objects.filter(name__contains=u'Å')
       
   274     qs = People.objects.filter(name__contains='\xc3\x85') # UTF-8 encoding of Å
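The equivalence of the two lookups rests on the byte sequence in the second one being the UTF-8 encoding of the character in the first. A quick check in plain Python 3 (standard library only, no Django involved) might look like this:

```python
needle = "Å"  # U+00C5, LATIN CAPITAL LETTER A WITH RING ABOVE

# Encoding to UTF-8 yields the two bytes used in the second queryset.
encoded = needle.encode("utf-8")

# Decoding those bytes as UTF-8 gives back the original character.
decoded = b"\xc3\x85".decode("utf-8")
```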
       
   275 
       
   276 Templates
       
   277 =========
       
   278 
       
   279 You can use either Unicode or bytestrings when creating templates manually::
       
   280 
       
   281     from django.template import Template
       
   282     t1 = Template('This is a bytestring template.')
       
   283     t2 = Template(u'This is a Unicode template.')
       
   284 
       
   285 But the common case is to read templates from the filesystem, and this creates
       
   286 a slight complication: not all filesystems store their data encoded as UTF-8.
       
   287 If your template files are not stored with a UTF-8 encoding, set the :setting:`FILE_CHARSET`
       
   288 setting to the encoding of the files on disk. When Django reads in a template
       
   289 file, it will convert the data from this encoding to Unicode. (:setting:`FILE_CHARSET`
       
   290 is set to ``'utf-8'`` by default.)
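The decoding step described above can be pictured with the following Python 3 sketch. The module-level ``FILE_CHARSET`` variable and the ``read_template_source()`` function are hypothetical stand-ins for illustration; Django's real template loaders are not shown here.

```python
import os
import tempfile

# Hypothetical stand-in for the setting; here the files are Latin-1 encoded.
FILE_CHARSET = "latin-1"


def read_template_source(path, charset=FILE_CHARSET):
    """Read raw bytes from disk and decode them to text using ``charset``."""
    with open(path, "rb") as f:
        return f.read().decode(charset)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hello.html")
    # Simulate a template saved by an editor that uses Latin-1.
    with open(path, "wb") as f:
        f.write("Olá, {{ name }}".encode("latin-1"))
    source = read_template_source(path)
```

If the charset did not match the bytes on disk, the decode step would either fail or silently produce the wrong characters, which is why the setting has to reflect how the files were actually saved.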
       
   291 
       
   292 The :setting:`DEFAULT_CHARSET` setting controls the encoding of rendered templates.
       
   293 This is set to UTF-8 by default.
       
   294 
       
   295 Template tags and filters
       
   296 -------------------------
       
   297 
       
   298 A couple of tips to remember when writing your own template tags and filters:
       
   299 
       
   300     * Always return Unicode strings from a template tag's ``render()`` method
       
   301       and from template filters.
       
   302 
       
   303     * Use ``force_unicode()`` in preference to ``smart_unicode()`` in these
       
   304       places. Tag rendering and filter calls occur as the template is being
       
   305       rendered, so there is no advantage to postponing the conversion of lazy
       
   306       translation objects into strings. It's easier to work solely with Unicode
       
   307       strings at that point.
       
   308 
       
   309 E-mail
       
   310 ======
       
   311 
       
   312 Django's e-mail framework (in ``django.core.mail``) supports Unicode
       
   313 transparently. You can use Unicode data in the message bodies and any headers.
       
   314 However, you're still obligated to respect the requirements of the e-mail
       
   315 specifications, so, for example, e-mail addresses should use only ASCII
       
   316 characters.
       
   317 
       
   318 The following code example demonstrates that everything except e-mail addresses
       
   319 can be non-ASCII::
       
   320 
       
   321     from django.core.mail import EmailMessage
       
   322 
       
   323     subject = u'My visit to Sør-Trøndelag'
       
   324     sender = u'Arnbjörg Ráðormsdóttir <arnbjorg@example.com>'
       
   325     recipients = ['Fred <fred@example.com>']
       
   326     body = u'...'
       
   327     EmailMessage(subject, body, sender, recipients).send()
       
   328 
       
   329 Form submission
       
   330 ===============
       
   331 
       
   332 HTML form submission is a tricky area. There's no guarantee that the
       
   333 submission will include encoding information, which means the framework might
       
   334 have to guess at the encoding of submitted data.
       
   335 
       
   336 Django adopts a "lazy" approach to decoding form data. The data in an
       
   337 ``HttpRequest`` object is only decoded when you access it. In fact, most of
       
   338 the data is not decoded at all. Only the ``HttpRequest.GET`` and
       
   339 ``HttpRequest.POST`` data structures have any decoding applied to them. Those
       
   340 two fields will return their members as Unicode data. All other attributes and
       
   341 methods of ``HttpRequest`` return data exactly as it was submitted by the
       
   342 client.
       
   343 
       
   344 By default, the :setting:`DEFAULT_CHARSET` setting is used as the assumed encoding
       
   345 for form data. If you need to change this for a particular form, you can set
       
   346 the ``encoding`` attribute on an ``HttpRequest`` instance. For example::
       
   347 
       
   348     def some_view(request):
       
   349         # We know that the data must be encoded as KOI8-R (for some reason).
       
   350         request.encoding = 'koi8-r'
       
   351         ...
       
   352 
       
   353 You can even change the encoding after having accessed ``request.GET`` or
       
   354 ``request.POST``, and all subsequent accesses will use the new encoding.
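To see why the assumed encoding matters, the sketch below decodes the same submitted query string twice using Python 3's ``urllib.parse``. It does not use ``HttpRequest`` at all; the query string and the parameter name ``q`` are assumptions made up for the example.

```python
from urllib.parse import parse_qs, quote

# Simulate a legacy client that percent-encodes its data as KOI8-R.
raw_query = "q=" + quote("привет", encoding="koi8-r")

# Decoding with the wrong assumption (UTF-8) yields replacement characters.
as_utf8 = parse_qs(raw_query, encoding="utf-8", errors="replace")["q"][0]

# Decoding with the encoding the client actually used recovers the text.
as_koi8 = parse_qs(raw_query, encoding="koi8-r")["q"][0]
```

The raw bytes are the same in both cases; only the interpretation changes, which is what setting ``request.encoding`` controls for ``GET`` and ``POST`` data.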
       
   355 
       
   356 Most developers won't need to worry about changing form encoding, but this is
       
   357 a useful feature for applications that talk to legacy systems whose encoding
       
   358 you cannot control.
       
   359 
       
   360 Django does not decode the data of file uploads, because that data is normally
       
   361 treated as collections of bytes, rather than strings. Any automatic decoding
       
   362 there would alter the meaning of the stream of bytes.