One of the things to come out of the proposed “Refresh quota usages” spec is a pair of little-known Nova quota settings, max_age and until_refresh. These options let you trigger quota usage refreshes within Nova at different intervals.
This can help with the various quota bugs, where occasionally the quota usages can get out of sync with reality. This seems to be worse when using Nova cells — if a VM deployment errors in the compute cell, the reservation was already made in the API cell, for example.
Note: Some great info from CERN about how they handle quota synchronization problems here.
With max_age, a quota usage refresh is triggered once the usage record is more than max_age seconds old. Each row in the nova.quota_usages table has an updated_at field; the age of the record is measured from that timestamp, and the refresh is triggered when the age reaches max_age.
until_refresh is a count of the number of reservations that happen on a particular user resource quota before it is refreshed. This is tracked in the until_refresh field of the nova.quota_usages table. That field is set to the configured until_refresh value when the usage record is created, and is decremented for every reservation that happens. When it reaches zero, the quota refresh is triggered and the field is reset to the configured until_refresh value.
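To make the countdown concrete, here is a minimal Python sketch of the behaviour described above. It is a toy model, not Nova's code: UsageRow and count_reservation are hypothetical names, and the handling of a NULL counter is an assumption based on the database note that follows.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageRow:
    """Hypothetical stand-in for one row of nova.quota_usages."""
    in_use: int
    until_refresh: Optional[int]  # None models a NULL column


def count_reservation(row: UsageRow, configured_until_refresh: int) -> bool:
    """Count one reservation; return True if it triggers a refresh."""
    if row.until_refresh is None:
        # Assumption: a NULL counter is never decremented, so it never fires.
        return False
    row.until_refresh -= 1
    if row.until_refresh <= 0:
        # Counter exhausted: refresh, then start the countdown again.
        row.until_refresh = configured_until_refresh
        return True
    return False


row = UsageRow(in_use=3, until_refresh=3)
triggers = [count_reservation(row, 3) for _ in range(7)]
print(triggers)  # [False, False, True, False, False, True, False]

null_row = UsageRow(in_use=3, until_refresh=None)
null_triggers = [count_reservation(null_row, 3) for _ in range(7)]
```

With a configured value of 3, every third reservation triggers a refresh, while the row with a NULL counter never does.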
Note: If you are moving from the default or no until_refresh setting (effectively 0) to setting it to a value, you’ll need to manually update that until_refresh field in the database for any quota usages that already exist:
update nova.quota_usages set until_refresh = value where deleted = 0 and until_refresh is null;
The default value for both of these options is 0, which effectively disables the quota refresh. The only other way to trigger a refresh is if the current value of the quota usage is less than 0 (see https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L3312)
How These Work Together
These settings can be used together, or separately. When used together, the quota usage will be refreshed when either method is triggered: when the max_age or the until_refresh count is reached, whichever comes first.
Note that the refresh can only be triggered when a reservation occurs. For example, if a user has not deployed any instances for 7 days, the quota usage is not updated until they actually deploy another VM, even if max_age is much less than 7 days.
You can see for yourself the logic by which this is determined at https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api.py#L3312-L3322
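As a rough illustration of how the triggers combine, the following Python sketch models the decision as described in this post. It is a simplified reading rather than a copy of the linked Nova function: refresh_needed and its parameters are hypothetical names, the dates are made up, and the ordering of the checks is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional


def refresh_needed(in_use: int,
                   remaining: Optional[int],
                   updated_at: datetime,
                   max_age: int,
                   now: datetime) -> bool:
    """Decide, at reservation time, whether to refresh a usage record.

    remaining is the until_refresh counter after this reservation has been
    counted (None if the counter is unset); max_age is in seconds, 0 = off.
    """
    if in_use < 0:
        # A negative usage is obviously wrong, so always refresh.
        return True
    if remaining is not None and remaining <= 0:
        # The until_refresh countdown has run out.
        return True
    if max_age > 0 and (now - updated_at).total_seconds() >= max_age:
        # The record is older than max_age seconds.
        return True
    return False


now = datetime(2016, 1, 8, 12, 0, 0)  # arbitrary example timestamp
week_ago = now - timedelta(days=7)
hour_ago = now - timedelta(hours=1)

stale = refresh_needed(5, 40, week_ago, 86400, now)      # max_age reached
fresh = refresh_needed(5, 40, hour_ago, 86400, now)      # nothing reached
counted = refresh_needed(5, 0, hour_ago, 86400, now)     # counter reached
negative = refresh_needed(-1, 40, hour_ago, 86400, now)  # in_use below zero
disabled = refresh_needed(5, None, week_ago, 0, now)     # both options off
print(stale, fresh, counted, negative, disabled)
```

The function is only ever evaluated when a reservation comes in, which is why the week-old record in the example is not refreshed until the user deploys again.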
The one downside is that the refresh may not be 100% transparent to the user. If the quota usage is already (incorrectly) too high and exceeds the quota limit, the reservation that triggers the refresh will still fail, because the reservation is attempted based on the quota usage value before the refresh.
However, after that initial failure the quota should be fixed and it will work again on the next reservation.
Also, refreshing the quota usage generates some additional database load. I don’t have any data on that, but I assume it refreshes based on the nova.instances or nova.reservations table. I imagine the load for a single refresh is probably comparable to that of a ‘nova list’. It would be interesting to investigate this more and understand the details here.
Personally, we’re running with max_age=86400 and until_refresh=50. The thinking being that users have a chance to get their quotas fixed up at least once a day, or if they are very active users, every 50 VM deployments.
We expect most quota issues to build up slowly over time. If we are correcting them often and automatically, they hopefully never get to the point where they’re bad enough to show up as reservation errors for the user.
Anecdotally this seems to be true, as we have relatively few reported quota issues. But, we suspect that there are a lot of minor inaccuracies that are not being noticed. With max_age/until_refresh, we should prevent those small problems from turning into larger ones.