If you have used SQL databases, you are probably familiar with schema migrations (a.k.a. database migrations). They bring existing database tables in sync with your application's knowledge of them, so that queries performed by the application return correct data. In contrast, schemaless databases such as GAE's Datastore do not need schema migrations because... well, they have no schema to migrate. However, even without a schema, your data still has some structure, and if you change that structure in your application (e.g., by changing the soft schema represented by a model), you may still need to modify existing data to match the way your application works with it. This is known as 'data migration', to distinguish it from 'schema migration'.

The GAE documentation provides an in-depth treatment of the two most common scenarios when it comes to data migrations, but those are not the only scenarios you may encounter, so we will cover one more here. Namely, we will cover the case where you have changed the type of one or more properties.

For this example we will use the new NDB API. Things work more or less the same with the DB API, so using NDB is not a requirement.

The basic principle is to use an Expando model which contains all properties except those that are being altered, and use that instead of your actual model. If your actual model is using NDB's structured properties, you can still use the child models without modification. Only the main model needs to be an Expando.

Migrations are performed by implementing a protected request handler (the documentation provides more information on that) that executes the migration code. Within the migration code, you change the data stored in the altered property and save the entity. The mechanics of setting up the handler and of retrieving and saving multiple entities are not the topic of this post, since they are covered in the GAE documentation.

Let's take a look at some code. First the original model before the modification:

from google.appengine.ext import ndb

class Person(ndb.Model):
    name = ndb.StringProperty()
    age = ndb.IntegerProperty()
    gender = ndb.BooleanProperty(default=False)

We have decided that using a boolean property for gender was a bad idea, and we want to use a string instead. The updated model looks like this:

class Person(ndb.Model):
    name = ndb.StringProperty()
    age = ndb.IntegerProperty()
    gender = ndb.StringProperty(choices=['male', 'female', 'other'])

Now we need to update the existing data in a migration. We will set up an Expando model that looks like this:

class Person(ndb.Expando):
    name = ndb.StringProperty()
    age = ndb.IntegerProperty()

Note that the gender field has been completely omitted. This allows us to treat it as a dynamic property which can take any type we need. It's important to keep all other properties and also add any new ones that are not being altered between the two versions of a model. This minimizes weird issues arising from properties that have no pure-Python equivalent and therefore fail to save when the Expando entity is saved. For instance, if you get an error like this:

BadRequestError: BLOB, ENITY_PROTO or TEXT property 
revisions must be in a raw_property field

you might be missing a text or a blob property.

For each entity retrieved from the datastore, we simply change the data:

GENDERS = {True: 'male', False: 'female', None: 'other'}
person.gender = GENDERS[person.gender]

This concludes the data migration. Other properties are left as they are (provided you kept the property declaration in the Expando subclass intact).
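
To make the per-entity step a bit more concrete, it can be wrapped in a small batch helper. This is only a sketch under stated assumptions: `migrate_batch`, its `put_multi` argument and the `FakePerson` class are hypothetical names, and the entities are plain-Python stand-ins so that the snippet runs outside App Engine. In a real handler you would pass entities loaded through the Expando model and, most likely, `ndb.put_multi` as the save function.

```python
GENDERS = {True: 'male', False: 'female', None: 'other'}


class FakePerson(object):
    """Plain-Python stand-in for an Expando entity (demonstration only)."""

    def __init__(self, name, gender):
        self.name = name
        self.gender = gender


def migrate_batch(people, put_multi):
    """Convert the old boolean gender of each entity, then save the batch."""
    for person in people:
        person.gender = GENDERS[person.gender]
    # Save everything in one call rather than issuing one write per entity.
    put_multi(people)
    return len(people)


# An in-memory list plays the role of the datastore here.
people = [FakePerson('Ann', False), FakePerson('Bob', True),
          FakePerson('Sam', None)]
saved = []
migrated = migrate_batch(people, saved.extend)
```

Passing the save function in as an argument keeps the conversion logic testable without a datastore.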


There are two more things that could be useful to remember.

First off, try to keep track of what migrations have run so you don't run them more than once. I use a simple model for storing data about migrations and check that first to see if some migration has already run. This is subject to race conditions in theory, but since you usually have full control over when migrations are run, it shouldn't be an issue. The model looks like this:

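class Migration(ndb.Model):
    timestamp = ndb.DateTimeProperty(auto_now_add=True)

    @classmethod
    def has_run(cls, migration_number):
        k = ndb.Key('Migration', migration_number)
        return k.get() is not None

    @classmethod
    def create(cls, migration_number):
        # Assumed body (it is missing above): record the migration
        # under a custom id equal to its number.
        return cls(id=migration_number).put()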

We use a custom id that matches the migration number. I identify migrations by three-digit zero-padded numbers, and I use the same numbers as ids when storing data about migrations. Before the migration starts, I add code like this:


# ...

if Migration.has_run(MIGRATION):
    return 'Migration %s has already run' % MIGRATION

# migration code ...


# return response ...
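
The whole guard can be sketched end to end as follows. Everything here is illustrative: `InMemoryMigrationLog` is a hypothetical stand-in that mimics the `has_run`/`create` interface of the Migration model with a set instead of the datastore, and `migration_id` and `run_once` are made-up helper names. Recording the migration only after it has finished is an assumption; the snippet above does not show where that happens.

```python
def migration_id(number):
    """Return a three-digit, zero-padded identifier such as '007'."""
    if not 0 <= number <= 999:
        raise ValueError('migration number out of range: %r' % (number,))
    return '%03d' % number


class InMemoryMigrationLog(object):
    """Stand-in for the Migration model; keeps ids in a set."""

    def __init__(self):
        self._done = set()

    def has_run(self, migration_number):
        return migration_number in self._done

    def create(self, migration_number):
        self._done.add(migration_number)


def run_once(log, migration_number, migrate):
    """Run `migrate` unless `migration_number` is already recorded in `log`."""
    if log.has_run(migration_number):
        return 'Migration %s has already run' % migration_number
    migrate()
    # Record the migration only after it finished without raising.
    log.create(migration_number)
    return 'Migration %s finished' % migration_number


MIGRATION = migration_id(1)
log = InMemoryMigrationLog()
calls = []
first = run_once(log, MIGRATION, lambda: calls.append('ran'))
second = run_once(log, MIGRATION, lambda: calls.append('ran'))
```

The second call returns early, so the migration body runs exactly once.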

Even with this safeguard, you should still write the actual migration code in a way that prevents double migration. This is especially important if you suspect that your data is a mix of old and new formats. Try to detect whether each entity still needs converting. For instance, using our example above:

if person.gender in GENDERS:
    person.gender = GENDERS[person.gender]

Dynamic Expando properties are very flexible and they will take anything as valid data, so this prevents unnecessary data corruption or errors during migration.
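
A slightly stricter variant of the same check is sketched below. It is not part of the migration shown above: `convert_gender` and `NEW_VALUES` are hypothetical names, and raising on unknown values is a design choice made for this example. The idea is that values already in the new format pass through untouched, old values are converted, and anything else is reported instead of being silently stored.

```python
GENDERS = {True: 'male', False: 'female', None: 'other'}
NEW_VALUES = set(GENDERS.values())


def convert_gender(value):
    """Map an old boolean/None gender to a string; pass new values through."""
    # Already migrated: leave the value alone so a second run is harmless.
    if value in NEW_VALUES:
        return value
    # Old format: only genuine booleans and None are converted.
    if value is None or isinstance(value, bool):
        return GENDERS[value]
    # Anything else is unexpected data; fail loudly rather than store it.
    raise ValueError('unexpected gender value: %r' % (value,))


once = convert_gender(True)
twice = convert_gender(once)
```

Because converting twice gives the same result as converting once, the function is safe to apply to data that mixes old and new formats.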