Key Takeaway – We Need Operators’ Feedback!

A common thread throughout the entire meetup was operators providing feedback and comments on specs and reviews.  It's clear this feedback is invaluable to the developer community.  It also benefits the operator community, since our feedback helps shape the priorities of the various projects.

I think the project developers have a genuine desire to work on the fixes and features that operators want.  After all, cloud operators are the customers of the OpenStack projects, as it were.  The more we, as operators, can feed back to the projects we use, the better things will get.

To get notified of newly posted specs, edit your Gerrit project watch settings and add any “-specs” projects you’re interested in (e.g., openstack/nova-specs).  Use the project name search box to help find things.
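
If you’d rather poll for new specs than rely on Gerrit notifications, the Gerrit REST API makes that easy.  A minimal sketch (my own, not something from the meetup) that lists open changes in a -specs project:

```python
# A sketch of polling Gerrit's REST API for open changes in a -specs project.
# The query endpoint and the ")]}'" anti-XSSI prefix are standard Gerrit REST
# behavior; the project name below is just an example.
import json
import requests

GERRIT = "https://review.openstack.org"

def open_specs(project="openstack/nova-specs", limit=25):
    """Return open changes for the given -specs project."""
    resp = requests.get(
        f"{GERRIT}/changes/",
        params={"q": f"project:{project} status:open", "n": limit},
    )
    resp.raise_for_status()
    body = resp.text
    # Gerrit prepends ")]}'" to JSON responses; strip it before parsing.
    if body.startswith(")]}'"):
        body = body.split("\n", 1)[1]
    return json.loads(body)

if __name__ == "__main__":
    for change in open_specs():
        print(change["_number"], change["subject"])
```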

See also: the Superuser post summarizing day 2, and Takeaways from OpenStack’s Mid-Cycle Ops Meetup.

Rabbit HA and Queue Issues

Etherpad: https://etherpad.openstack.org/p/PHL-ops-rabbit-queue

Moderator: Mike Dorman, Go Daddy

People are running a variety of architectures that generally fall into three categories: single node, cluster behind a VIP, and cluster with client-side failover.  The etherpad has a good list of the monitoring items and metrics people are watching on RMQ.

Generally everyone is dealing with the same classes of problems: clients not reconnecting reliably and/or not recognizing that the connection has gone stale at all; orphaned queues after a failover; and issues after a master re-election.

Nobody really has elegant solutions to this stuff, other than restarting services that use Rabbit when something happens.
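
Since “watch for the symptom and restart” seems to be the state of the art, here is a rough monitoring sketch (mine, not from the session) for spotting the orphaned queues mentioned above via the RabbitMQ management API.  The host and credentials are placeholders, and the management plugin is assumed to be enabled.

```python
# Flag queues that have messages piling up but no consumers attached, a common
# symptom of the orphaned-queue problem after a failover.
import requests

RABBIT_MGMT = "http://rabbit.example.com:15672"   # placeholder
AUTH = ("monitoring", "secret")                   # hypothetical read-only user

def orphaned_queues():
    """Yield (vhost, queue, backlog) for queues with messages but no consumers."""
    resp = requests.get(f"{RABBIT_MGMT}/api/queues", auth=AUTH)
    resp.raise_for_status()
    for q in resp.json():
        if q.get("consumers", 0) == 0 and q.get("messages", 0) > 0:
            yield q["vhost"], q["name"], q["messages"]

if __name__ == "__main__":
    for vhost, name, backlog in orphaned_queues():
        print(f"possible orphan {vhost}/{name}: {backlog} messages, no consumers")
```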

There are several in-flight reviews that aim to address some of this (see the list in the etherpad), but the one with the most interest is implementing RMQ heartbeats: https://review.openstack.org/#/c/146047/  As operators, we should be commenting on the reviews/patches that are relevant and wanted by us.  This is also a good plug for providing feedback on specs, to help fill in the operational impacts.
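
For a bit of context on what that review adds: oslo.messaging is built on kombu, and at the kombu layer AMQP heartbeats look roughly like the sketch below.  The broker URL is a placeholder; once the patch lands, oslo.messaging would manage this internally rather than application code doing it.

```python
# Minimal illustration of AMQP heartbeats at the kombu layer.
import time
import kombu

conn = kombu.Connection("amqp://guest:guest@rabbit.example.com//", heartbeat=60)
conn.connect()

while True:
    # Sends/validates heartbeat frames and raises if the broker has gone silent,
    # i.e. it surfaces the "stale connection" failure mode described above
    # instead of the client hanging on a dead socket.
    conn.heartbeat_check()
    time.sleep(15)  # check a few times per heartbeat interval
```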

Nova Feedback – Cells v2, Hierarchical Project Quotas

Etherpad: https://etherpad.openstack.org/p/PHL-ops-nova-feedback

Moderator: Sean Dague

Cells v2 is being worked on and is expected to land in M, but it will still be very experimental in Liberty.  There was a request for current users of cells to provide feedback on the Cells v2 specs.  I’m not sure exactly what/where these specs are, but this wiki page seems like a good starting point: https://wiki.openstack.org/wiki/Nova-Cells-v2

There are still quite a few people running nova-network.  The Nova team would like to be able to get rid of the networking code in Nova, but there are still some gaps with Neutron, particularly around the HA/multi-host model; Neutron DVR attempts to fill this.  The simplicity of nova-network over Neutron is also a benefit for some people.

Quota sync problems are a major issue.  This is due to the internal nova quota code, which has a lot of race conditions, etc.  Nova is trying to focus on addressing some of this.  There’s a spec for providing a quota fix-up call or tool: https://review.openstack.org/#/c/161782/  It’s a band-aid, but it also fills a big operational need.
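
To make the problem concrete, here is a hedged illustration (not the proposed fix-up tool) of what quota drift looks like at the database level: nova’s quota_usages table disagreeing with the live instance count.  Table and column names reflect the Kilo-era nova schema; the DB URL is a placeholder.

```python
# Compare recorded 'instances' usage in quota_usages against actual live rows
# in the instances table, per project.
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://nova:secret@dbhost/nova")  # placeholder

USAGE_SQL = text("""
    SELECT project_id, SUM(in_use) AS recorded
    FROM quota_usages
    WHERE resource = 'instances' AND deleted = 0
    GROUP BY project_id
""")

ACTUAL_SQL = text("""
    SELECT project_id, COUNT(*) AS actual
    FROM instances
    WHERE deleted = 0
    GROUP BY project_id
""")

with engine.connect() as conn:
    recorded = {row.project_id: int(row.recorded) for row in conn.execute(USAGE_SQL)}
    actual = {row.project_id: row.actual for row in conn.execute(ACTUAL_SQL)}

for project in sorted(set(recorded) | set(actual)):
    if recorded.get(project, 0) != actual.get(project, 0):
        print(f"{project}: quota_usages says {recorded.get(project, 0)}, "
              f"but {actual.get(project, 0)} instances actually exist")
```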

There are several scheduler issues, around performance and log messages for failures.  An effort is underway to refactor the scheduler code to make it better and more self-contained.  This sets it on a path to expose a separate scheduler API and/or split it off from Nova; splitting the scheduler out into its own project is 2+ cycles away.

We touched on the discussion of moving EC2 API support out of Nova in favor of this external implementation: https://github.com/stackforge/ec2-api  The Nova team wants people who care about the EC2 API to test this under Kilo.

Hierarchical project quotas did not make it into Kilo, but hopefully they can get approved early in Liberty.  Feedback from operators is needed to set the priority.  This is the Kilo spec (although I am not quite sure how this works when it’s reopened for Liberty): https://review.openstack.org/#/c/129420/

Pretty much everybody doing live migration on libvirt is doing it with Ceph.

Network Performance Optimization

Etherpad: https://etherpad.openstack.org/p/PHL-ops-network-performance

Moderator: Edgar Magana, Workday

Ideas discussed:

  • Increased MTU
  • A lot of people not running any SDN
  • OVS vs. Linux Bridge
  • VXLAN vs. GRE for tunneling (see the MTU sketch after this list)
  • L2 or L3?
  • Security groups performance
  • Network performance testing tools
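
Since MTU and tunneling came up together, here is a small back-of-the-envelope sketch (mine, not from the session) of how much headroom each encapsulation needs.  The overhead figures are the commonly cited IPv4 numbers with the inner Ethernet header included (VXLAN ~50 bytes, GRE ~42 bytes); double-check them against your own encapsulation settings before changing MTUs.

```python
# Tenant-network MTU headroom for the common Neutron tunnel types.
ENCAP_OVERHEAD_BYTES = {
    "vxlan": 50,  # IPv4 20 + UDP 8 + VXLAN 8 + inner Ethernet 14
    "gre":   42,  # IPv4 20 + GRE 8 (with key) + inner Ethernet 14
}

def tenant_mtu(physical_mtu, encapsulation):
    """Largest MTU a tenant network can safely use over the given tunnel type."""
    return physical_mtu - ENCAP_OVERHEAD_BYTES[encapsulation]

if __name__ == "__main__":
    for phys in (1500, 9000):
        for encap in ("vxlan", "gre"):
            print(f"physical MTU {phys}, {encap}: tenant MTU {tenant_mtu(phys, encap)}")
```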

Capacity Management

Etherpad: https://etherpad.openstack.org/p/PHL-ops-capacity-mgmt

Moderator: Ben Burdick, Rackspace

Almost nobody is using the stock flavors.  There’s a pretty wide variety of CPU ratios and flavor configs in use.

The general consensus is to not overcommit memory; if you do and VMs actually use it all, things can get OOM-killed.  There’s an action item to change the default memory overcommit to 1.0 in nova, and possibly to increase the default compute node reserved RAM.
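
For anyone wanting to sanity-check the impact of those defaults, here is some rough capacity math.  It only approximates how nova’s RamFilter and resource tracker budget memory; the function names are mine, and the 512 MB reserved value mirrors nova’s default reserved_host_memory_mb.

```python
# Back-of-the-envelope RAM capacity math for a single compute node.
def schedulable_ram_mb(total_mb, allocation_ratio=1.0, reserved_mb=512):
    """Approximate MB of RAM the scheduler will hand out on one compute node."""
    return total_mb * allocation_ratio - reserved_mb

def vms_per_host(total_mb, flavor_ram_mb, allocation_ratio=1.0, reserved_mb=512):
    """How many instances of a given flavor fit on one node, RAM-wise."""
    return int(schedulable_ram_mb(total_mb, allocation_ratio, reserved_mb) // flavor_ram_mb)

if __name__ == "__main__":
    # 256 GB host, 8 GB flavor: ~31 instances with no overcommit, ~47 at 1.5x.
    print(vms_per_host(256 * 1024, 8 * 1024, allocation_ratio=1.0))
    print(vms_per_host(256 * 1024, 8 * 1024, allocation_ratio=1.5))
```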

People use various strategies for scheduling based on flavor extra specs, different weights and filters, etc., as well as stacking/depth-first vs. breadth-first (the default) scheduling.
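
As a toy illustration of the stacking-vs-spreading point: nova weighs hosts by free RAM, and flipping the sign packs VMs onto fewer hosts.  In stock nova that’s just setting ram_weight_multiplier to a negative value; the class below sketches the same idea as a custom weigher, assuming the Kilo-era nova.scheduler.weights plugin interface.

```python
# Prefer hosts with LESS free RAM, i.e. fill hosts depth-first ("stacking").
from nova.scheduler import weights


class StackingRAMWeigher(weights.BaseHostWeigher):
    """Weigh hosts so that already-busy hosts are chosen first."""

    def _weigh_object(self, host_state, weight_properties):
        # The stock RAMWeigher returns host_state.free_ram_mb, which spreads
        # instances (breadth-first); negating it packs them instead.
        return -host_state.free_ram_mb
```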

We need better scheduler support for IP capacity; otherwise instances get scheduled somewhere with no available IPs.  This is mainly an issue when using cells.  Go Daddy has a Neutron API extension to expose IP utilization info; we should look at trying to upstream this.
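
Go Daddy’s extension isn’t upstream, but a hedged approximation of per-subnet IP utilization can be derived from the standard Neutron API by counting each port’s fixed IPs against the subnet’s allocation pools.  Credentials and endpoints below are placeholders; on a large cloud you would want server-side filtering rather than listing every port.

```python
# Approximate per-subnet IP utilization from standard Neutron API data.
import ipaddress
from collections import Counter

from neutronclient.v2_0 import client

neutron = client.Client(
    username="admin", password="secret",  # placeholders
    tenant_name="admin", auth_url="http://keystone.example.com:5000/v2.0",
)

def pool_size(subnet):
    """Number of allocatable addresses across a subnet's allocation pools."""
    return sum(
        int(ipaddress.ip_address(p["end"])) - int(ipaddress.ip_address(p["start"])) + 1
        for p in subnet.get("allocation_pools") or []
    )

used = Counter()
for port in neutron.list_ports()["ports"]:
    for fixed_ip in port.get("fixed_ips") or []:
        used[fixed_ip["subnet_id"]] += 1

for subnet in neutron.list_subnets()["subnets"]:
    total = pool_size(subnet)
    if total:
        print(f"{subnet.get('name') or subnet['id']}: "
              f"{used[subnet['id']]}/{total} allocation-pool IPs in use")
```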

Working Lunch, Config Management Discussions

Puppet, Chef, and Ansible interest groups met informally during lunch.

Day 2 Breakout Sessions

Testing/Interoperability

Etherpad: https://etherpad.openstack.org/p/PHL-ops-testing-interop

(I did not attend this breakout; these are just high points from the summary presentation.)

Topics included Tempest, Rally, the testing tools people are using, and RefStack.

Packaging

Etherpad: https://etherpad.openstack.org/p/PHL-ops-packaging

Moderator: Matt Fischer, Time Warner Cable

Most people in this session are rolling their own packages (although this group is probably biased, since it is the packaging session).  Vendor packages don’t work for two main reasons: they lag behind upstream, and people need to include their own custom patches.

Common packaging tools (or at least ones somebody uses): Anvil, Giftwrap, local PyPI, Koji, stdeb.  Everybody is generally trying to solve the same problems; the trouble is that it’s very difficult to do so in a general way that still works for everybody’s specific situation: different distros, architectures, etc.

It sounds like one of the main focuses is gathering documentation of reference architectures for building packages: general approaches that people can start from and build upon.

There was discussion around the stable branch situation.  Monty made the point that in general people are unhappy with stable branch management; it’s extra work and not contributed to much.  There was a suggestion to possibly decouple the requirement of committing to master first and then backporting to the stable branches.  That workflow is sometimes difficult for people running on stable branches, because master can vary drastically from stable.

Telco

Etherpad: https://etherpad.openstack.org/p/PHL-ops-telco

(I did not attend this breakout; these are just high points from the summary presentation.)

Discussion of the process around evaluating use cases.  The main actionable take-away is to work out how to work with NFV to feed requirements in, ensuring there’s no duplication of work.

Burning Issues

Etherpad: https://etherpad.openstack.org/p/PHL-ops-burning-issues

(I did not attend this breakout; these are just high points from the summary presentation.)

ID’d and prioritized list of burning issues:

  1. nova-network to Neutron migration
  2. RabbitMQ
  3. SSL termination
  4. Federated keystone (keystone ops working group created, see etherpad to sign up)
  5. Ceilometer
  6. Billing issues
  7. Upgrade issues (better release notes and output from CI, as well as encouraging projects other than just nova to do rolling online upgrades)
  8. IPv6
  9. EC2 API
  10. Sample production configs (see https://github.com/osops/example-configs)

Architecture Show & Tell

Etherpad: https://etherpad.openstack.org/p/PHL-ops-arch-show-tell

Rackspace Hypervisor Networking

Slides: http://www.slideshare.net/andyhky/rackspace-hypervisor-networking-show-tell

Andy Hill, Rackspace

Custom Overcommit in the Scheduler

Slides: http://www.wormley.com/os/os-meetup-wormley.pdf

Steve Wormley

Rackspace Private Cloud (os-ansible-deployment)

Slides: https://slides.com/kevin-cloudnull-carter

Kevin Carter, Rackspace

Feedback and Vancouver Summit Discussion

Etherpad: https://etherpad.openstack.org/p/PHL-ops-feedback

Lots of good discussion around the goods and bads of this event.  If you attended, please add your feedback to the etherpad!