Cells Overview

Nova cells was introduced in the Folsom/Grizzly timeframe as a way to scale Nova beyond the limits of a single database backend and message queuing system.  It is mainly geared toward larger operators who need to scale beyond hundreds of hypervisors.

At a basic level, Nova cells creates a hierarchy of Nova installations, each with its own database, message queue backend, scheduler, and compute nodes.  This keeps the scope of any one Nova cell small, while the whole environment can still be managed through a single Nova API, which runs in the top-level parent cell (aka the API cell.)
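As a rough sketch of that hierarchy (the cell names and service placement here are illustrative, not exhaustive):

    +-----------------------------+
    |      API (parent) cell      |
    |  nova-api + its own DB/MQ   |
    +------+---------------+------+
           |               |
    +------v-------+  +----v---------+
    |    cell01    |  |    cell02    |
    | own DB, MQ,  |  | own DB, MQ,  |
    | scheduler,   |  | scheduler,   |
    | computes     |  | computes     |
    +--------------+  +--------------+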

Two diagrams from http://comstud.com/cells.pdf are helpful for understanding how all these pieces fit together: one shows the message flow down from the API cell to the compute cells, and the other shows the flow back up.


Cells from Scratch

Setting up Nova cells from scratch is actually pretty straightforward.  For the most part, just follow the instructions in the configuration reference, with these caveats:

Don’t configure DEFAULT/compute_api_class in nova.conf

This setting is no longer used.  This should be fixed in the configuration reference soon (bug).

Instead, use cells/enable and cells/cell_type

This enables cells functionality and configures the cell type.

    [cells]
    enable = true
    cell_type = api|compute

The api cell type is for the API/parent cell, and compute for the compute/child cell(s).

I am not sure what the right cell_type setting is for the middle tier cells in a multi-tier setup (cells within cells).  I suspect it would be api, but that is just a guess.  I don’t know that anyone is actually running multi-tiered cells (other than maybe Rackspace.)
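For reference, here is roughly what defining the cells looks like with nova-manage (a minimal sketch; the flags follow the Juno configuration reference, and the cell names, hosts, and credentials are made-up examples).  You define each child cell in the API cell, and the parent cell in each child:

    # On the API cell, define a child cell (example values throughout):
    nova-manage cell create --name=cell01 --cell_type=child \
      --username=rmq_user --password=rmq_pass --hostname=cell01-rmq \
      --port=5672 --virtual_host=/ --woffset=1.0 --wscale=1.0

    # On the child cell, define the parent (API) cell:
    nova-manage cell create --name=api --cell_type=parent \
      --username=rmq_user --password=rmq_pass --hostname=api-rmq \
      --port=5672 --virtual_host=/ --woffset=1.0 --wscale=1.0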

For Clustered RMQ Servers

The nova-manage cell create command from the configuration guide only takes a single RMQ server hostname.  When a cell has multiple RMQ servers, you have to configure that directly in the cells table in the nova database, using this format:

  • rabbit://user:pass@host1:port[,user:pass@hostN:portN]/virtual_host

That goes in the transport_url field in the cells table.
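As a concrete sketch, assuming a MySQL nova database and a child cell named cell01 (both hypothetical), the manual update would look something like:

    UPDATE cells
    SET transport_url = 'rabbit://user:pass@host1:5672,user:pass@host2:5672/nova'
    WHERE name = 'cell01';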

Another alternative is to use a JSON file to configure the cells information, rather than storing it in the database (see “Optional cell configuration” in the Configuration Reference.)  I did not actually try this out, so I don’t know if it works any better.
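For what it’s worth, the format shown in the Configuration Reference looks roughly like the following.  You point cells/cells_config in nova.conf at the file (the cell names, URLs, and credentials below are examples):

    [cells]
    cells_config = /etc/nova/cells.json

    {
        "parent": {
            "name": "parent",
            "api_url": "http://api.example.com:8774",
            "transport_url": "rabbit://user:pass@rmq.example.com",
            "weight_offset": 0.0,
            "weight_scale": 1.0,
            "is_parent": true
        },
        "cell01": {
            "name": "cell01",
            "api_url": "http://api.example.com:8774",
            "transport_url": "rabbit://user:pass@rmq1.example.com,user:pass@rmq2.example.com",
            "weight_offset": 0.0,
            "weight_scale": 1.0,
            "is_parent": false
        }
    }

Note that the transport_url in a cell entry can carry the multi-host list directly.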

There is a bug for this, and a fix has been released that will be included in Kilo.  But for Juno, you can use one of the above workarounds.

Note:  After further investigation, I did not actually see the multiple-RMQ-host failover work correctly.  If nova-cells could not connect to the first host in the list, it never moved on to try the second one.

RMQ over SSL

To use SSL to connect to RMQ for each cell, use the global DEFAULT/rabbit_use_ssl setting in nova.conf.  This applies to the RMQ connections for cells, too.

Unfortunately, this means that RMQ SSL is all-or-nothing.  If you want to use it for one RMQ instance, you must use it for all of them.
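In nova.conf, that looks something like this (the kombu_ssl_* options are optional client-certificate settings, and the paths are examples):

    [DEFAULT]
    rabbit_use_ssl = true
    # Optional, only if RMQ requires client certs (example paths):
    kombu_ssl_ca_certs = /etc/nova/ssl/rmq-ca.crt
    kombu_ssl_certfile = /etc/nova/ssl/rmq-client.crt
    kombu_ssl_keyfile = /etc/nova/ssl/rmq-client.key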

No URL-Encodable Characters in RMQ Credentials

Any “special” characters in the username or password will be URL-encoded by the nova-manage command before being placed in the cells database table.  For example, for a username of rmq=dev=user, the actual username put in the database is rmq%3Ddev%3Duser.

The problem is that the connection string is not URL-decoded by nova-cells before connecting to RMQ.  So for the above example, nova-cells tries to log in to RMQ with the username rmq%3Ddev%3Duser instead of rmq=dev=user.

You can also get around this one by manually populating the cells information using one of the above methods; just use a URL that does not have the special characters URL-encoded.  Bug 1406598 has been opened for this problem.
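Concretely, for the hypothetical rmq=dev=user account above, the difference is:

    # What nova-manage cell create writes (nova-cells then fails to authenticate):
    rabbit://rmq%3Ddev%3Duser:pass@host1:5672/nova

    # What nova-cells actually needs in transport_url:
    rabbit://rmq=dev=user:pass@host1:5672/nova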

Circular Reference Error on nova cells-show Call

If you get a 500 error about “circular reference detected” from nova cells-show (or the equivalent API call), you’re hitting this bug.

2014-12-30 09:45:57.067 6212 ERROR nova.api.openstack [req-3178636c-0ebe-4767-a9c5-812cf20b76b3 admin demo] Caught error: Circular reference detected
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 133, in _dispatch_and_reply
    incoming.message))
  File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 67, in reply
    self._send_reply(conn, reply, failure, log_failure=log_failure)
  File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 61, in _send_reply
    conn.direct_send(self.reply_q, rpc_common.serialize_msg(msg))
  File "/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/common.py", line 462, in serialize_msg
    _MESSAGE_KEY: jsonutils.dumps(raw_msg)}
  File "/usr/lib/python2.7/site-packages/oslo/messaging/openstack/common/jsonutils.py", line 164, in dumps
    return json.dumps(value, default=default, **kwargs)
  File "/usr/lib64/python2.7/json/__init__.py", line 250, in dumps
    sort_keys=sort_keys, **kw).encode(obj)
  File "/usr/lib64/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib64/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
ValueError: Circular reference detected

The patch that’s been proposed worked for me.

What Services Run Where

One thing that was confusing to me is exactly which Nova services are supposed to run in the API cell vs. the compute cells, especially some of the lesser-known services like nova-consoleauth and nova-spicehtml5proxy.

This may not be 100% correct, but here’s how we’ve split the services in our environment, and it seems to be working well:

API Cell Services

  • nova-api
  • nova-cells
  • nova-consoleauth
  • nova-spicehtml5proxy

Compute Cell Services

  • nova-cells
  • nova-cert
  • nova-conductor
  • nova-console
  • nova-scheduler
  • nova-network*

To be honest, I am not really sure what nova-console and nova-cert do, so the compute cells may or may not be the right place for them.

* We use Neutron instead of nova-network, but if you were running nova-network, it would run in the compute cell.

Caveats

Nova clearly states that cells is experimental.  Cells v2 aims to correct a lot of the current problems (see below.)  For the time being, here are some caveats to be aware of with Cells “v1”:

Most Components Are Not Cell-Aware

No other OpenStack services are aware of Nova cells, which for the most part is not a huge problem (although see the Neutron Port Notifications section below.)

Many objects within Nova itself are not aware of cells, either.  This can make life a little difficult, but it’s doable.

Flavors, host aggregates, availability zones, and security groups are the main troublemakers.  While these objects can be created in compute cells, the API cell has no visibility or knowledge of them.

Flavors: These must be synchronized at the database level between the API cell and the compute cells.  Even the internal id field in the database table must match, because instances are created referencing the id of the flavor, not the name.  Fortunately, flavors are typically not modified very often, so it’s pretty easy to just populate that info from one Nova database to another.
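A minimal sketch of that sync, assuming a MySQL backend and the Juno-era table names (flavors live in instance_types, plus the related project-access and extra-specs tables).  Copying at the database level preserves the internal ids, which is the whole point:

    # Copy the flavor tables from the API cell database to a compute cell's
    # database (hostnames are examples):
    mysqldump nova instance_types instance_type_projects instance_type_extra_specs \
      | mysql -h cell01-db nova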

Host Aggregates and Availability Zones:  Since the API cell does not directly control any compute nodes, it’s not possible to create host aggregates there.  And, therefore, it’s also not possible to create availability zones.  (I did not do extensive investigation on this, because it wasn’t a big deal for us.  There may be others out there who have figured out how to do AZ’s in the API cell.)

We still create aggregates and zones in the compute cells, because we use them for some custom scheduler hints.  But since the API cell has no notion of these, they cannot be exposed to or used by end users.

Security Groups:  We use Neutron for networking so security groups are handled there.  As such, we don’t have to handle syncing security groups between cells.

Neutron Port Notifications

Neutron is typically configured to notify Nova about port state changes, so nova-compute will know when a new instance is plugged in to the network.  Because there are multiple instances of Nova when using cells, there is no way to configure Neutron so that the notifications get to the right Nova.  Essentially, we have to throw away this functionality.

To prevent Nova from waiting forever for a port notification that will never arrive, use these settings on nova-compute to allow things to continue working:

    [DEFAULT]
    vif_plugging_is_fatal = false
    vif_plugging_timeout = 5

This tells nova-compute to not worry if a vif plug operation appears to fail (which it will, since Nova will never be notified by Neutron), and to continue on its way after some amount of time waiting for the plug to happen.

This works pretty well, because normally the vif plug operation happens quite fast.  But it does introduce a race condition: the results are unclear if the vif is not plugged by the time vif_plugging_timeout expires.  I suspect that in this case the instance would boot and continue normally once the vif does get plugged (similar to turning on a physical server before plugging in its Ethernet cable.)  However, I didn’t test this, so I can’t rule out something more catastrophic.

The other problem, of course, is if the vif plug operation takes a long time (many seconds to minutes.)  In that case, the VM will likely fail to come up on the network, because DHCP and metadata services aren’t available to it yet.

Nova Notification Queues

If you run other tools that rely on the Nova notification queues (StackTach, for example), you’ll need to make sure those tools are configured to listen on the Nova queues in the compute cell, not the API cell.

I don’t know if StackTach can monitor multiple RMQ clusters at once; that is what it would need to do in order to be cell-aware.

Update:  Block Device Mapping

As noted by @mgagne, there is a bug around block device mapping that prevents creating an image from an instance booted from a volume.  Please see his post on Nova cells and block device mapping for details.

Cells v2

One of the Nova priorities for the Kilo cycle is to implement Cells v2.  That should address a lot of the issues with the first implementation of Cells, while still realizing the scaling benefits.

In v2, the nova-cells service is gone; instead, nova-api will talk directly to the database and RMQ endpoints for each cell.  This simplifies the code path, and has the added benefit that all Nova objects will be cell-aware.

Cells will be enabled by default.  You’ll be automatically running with a single cell if you run the default Nova config.

This architecture is much simpler and should be easier to deal with.  Migrating from a non-cells setup to one using cells should basically be seamless.  If you are thinking about moving to cells, you may want to wait for Kilo or Liberty.

But if you are doing it under Icehouse or Juno, like we are, it is possible!  Check out the Converting to OpenStack Nova Cells Without Destroying the World post about doing the migration.
