Comcast Center, Philadelphia

Here are some of my notes and main takeaways from the OpenStack Operators meetup going on today and tomorrow in Philadelphia.  (Obviously there is a lot more data and detail in the individual etherpads.)

Also, here’s the Superuser post coverage of day 1.

Introduction

Jonathan Bryce from the OpenStack Foundation and Mark Muehl of Comcast kicked off the event with short welcome comments.  There are 180+ attendees at this meetup, more than four times as many as a year ago.

Tom Fifield is coordinating the events.  There needs to be a focus on providing specific, actionable feedback to the upstream developers.  If we want to improve ops for OpenStack, we actually need to do the work ourselves.

OVS Issues/Fixes/Best Practices

Etherpad: https://etherpad.openstack.org/p/PHL-ops-OVS

Moderator: Andy Hill, Rackspace.

Most people report using OVS with Neutron ML2 because it is the default setup.  A few people are packaging their own OVS because upstream distro packages are too slow to incorporate updates and/or they have a need for custom patches.  Very few people are running an SDN controller on top.  Just a couple of folks are doing IPv6.

OVS >= 2.0.0 is much more stable.  The thinking is that the Neutron agent needs at least 2.0 at this point; >= 2.1.2 is the recommendation for best performance.

There are some OVS upgrade issues around new kernel modules, and the time and effort it takes to evacuate nodes before kernel/modules updates.

Also there are problems with state sync between Neutron and OVS causing orphan/ghost ports, etc.  Cleanup is necessary, typically via a restart of the OVS agent.  This seems to be related to the OVS agent driving the OVS CLI via screen scraping; OVS doesn’t really have a great direct API that would avoid some of this.
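
To make the ghost-port problem concrete, here is a rough sketch (mine, not from the session) of flagging br-int ports with no matching Neutron port.  It assumes the common qvo<first-11-chars-of-port-id> naming used by the ML2/OVS hybrid plug, and it should only be trusted after adjusting for your own deployment:

    #!/usr/bin/env python
    # Rough sketch: flag "ghost" ports on br-int that no longer map to a
    # Neutron port. Assumes the qvo<port-id[:11]> naming from the ML2/OVS
    # hybrid plug; run as root (or via sudo) so ovs-vsctl works.
    import subprocess

    def ovs_qvo_ports(bridge='br-int'):
        out = subprocess.check_output(['ovs-vsctl', 'list-ports', bridge])
        return [p for p in out.decode().split() if p.startswith('qvo')]

    def neutron_port_prefixes():
        # Screen-scraping the CLI here too, ironically; python-neutronclient
        # would be the nicer way to do this.
        out = subprocess.check_output(
            ['neutron', 'port-list', '-f', 'value', '-c', 'id'])
        return {port_id[:11] for port_id in out.decode().split()}

    if __name__ == '__main__':
        known = neutron_port_prefixes()
        for port in ovs_qvo_ports():
            if port[3:] not in known:
                print('possible ghost port:', port)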

Another common issue is the time it takes the OVS agent to restart.  This can affect connectivity to tenant networks while things are being restarted.  Some of this is caused by rootwrap, which spawns a new shell and goes through the sudo process for every command.
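
For a sense of that per-command overhead, a quick illustrative timing loop (the command and config path are just examples and will vary by distro; it assumes passwordless sudo):

    # Quick illustration of the per-command cost: every agent call goes
    # through sudo plus a fresh rootwrap Python process.
    import subprocess
    import time

    CMD = ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf',
           'ovs-vsctl', 'list-ports', 'br-int']

    with open('/dev/null', 'wb') as devnull:
        start = time.time()
        for _ in range(10):
            subprocess.call(CMD, stdout=devnull)
    print('average per command: %.2fs' % ((time.time() - start) / 10))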

The default ‘allow’ rules on OVS bridges can cause loops during the window when OVS is up but the OVS agent has not yet configured everything.

An upstream networking guide that Matt Kassawara and Sean Collins have been working on is coming out.  Draft at https://github.com/ionosphere80/openstack-networking-guide

A new OVN effort is underway around integrating more tightly with OpenStack:  http://networkheresy.com/2015/01/13/ovn-bringing-native-virtual-networking-to-ovs/

Monitoring: sFlow tools are challenging.  It is difficult to troubleshoot network performance issues between VMs.  Potential tools: https://github.com/stackforge/vmtp and https://github.com/CiscoSystems/avos

Asks to the community:  Help make OVS and agent restarts less impactful.  Get in-flight features/patches landed so operators can take advantage of the features.

Security at the Host Level

Etherpad: https://etherpad.openstack.org/p/PHL-ops-security

Moderator:  Curtis Collicutt

This is the first time security has been a specific topic at an Ops meetup, and it is a broad one.

Some discussion around compute/hypervisors and hardening them to make it harder to break out of a VM.  grsecurity can help limit the impact if someone is actually able to break out of a VM to the hypervisor.  Another good practice is to reduce the number of kernel modules on the hypervisor, to limit the potential attack surface.
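
A trivial sketch of the module-trimming idea; the allowlist is purely illustrative and would need to match your actual hypervisor build:

    # Flag loaded kernel modules that aren't on a site-specific allowlist,
    # as a starting point for reducing hypervisor attack surface.
    # The allowlist below is illustrative only.
    ALLOWED = {'kvm', 'kvm_intel', 'openvswitch', 'vhost_net', 'tun', 'bridge'}

    with open('/proc/modules') as modules:
        loaded = {line.split()[0] for line in modules}

    for mod in sorted(loaded - ALLOWED):
        print('review module:', mod)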

A couple of people have done PCI audits and have been successful.

Ideas for dealing with abuse:  start with a very small quota and grow it over time as customer “karma” grows.  You also need to scrub for old/orphaned VMs that are still running out there but no longer tracked by Nova (due to state losses, etc.).
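
Here is a minimal sketch of the quota-growth idea using python-novaclient; the tiers and the notion of a karma score are made up for illustration:

    # Grow a tenant's instance quota as its "karma" level increases.
    # The tier mapping and the karma lookup are illustrative, not from
    # the session.
    from novaclient import client as nova_client

    QUOTA_TIERS = {0: 2, 1: 5, 2: 20, 3: 50}   # karma level -> instance quota

    def adjust_quota(nova, tenant_id, karma_level):
        instances = QUOTA_TIERS.get(karma_level, QUOTA_TIERS[0])
        nova.quotas.update(tenant_id, instances=instances)

    # Usage (credentials elided):
    # nova = nova_client.Client('2', USER, PASSWORD, TENANT, AUTH_URL)
    # adjust_quota(nova, 'abc123...', karma_level=1)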

The ephemeral CA (Anchor project) is an interesting approach to cert management.

How to deal with compute-to-compute communication for migrations?  FreeIPA is a possibility.  Most people are just using config management to handle key management and rotation.

OpenStack security advisories: people are following these and find them useful.  Also security notes: https://wiki.openstack.org/wiki/Security_Notes

Documentation around policy.json is not great.  Ask: better documentation on this.  https://bugs.launchpad.net/openstack-manuals/+bug/1311067
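
For anyone unfamiliar with it, policy.json is just a JSON map of API actions to rule expressions.  A small sketch for eyeballing which actions are admin-restricted (the path is an example; each service ships its own file):

    import json

    # Dump the admin-restricted actions from a service's policy file.
    # /etc/nova/policy.json is illustrative.
    with open('/etc/nova/policy.json') as f:
        policy = json.load(f)

    for action, rule in sorted(policy.items()):
        if 'admin' in rule:
            print('%-50s %s' % (action, rule))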

Working Lunch – Puppet Group

Etherpad: https://etherpad.openstack.org/p/puppet-ops-discussions

I spent the lunch period with other people from the openstack-puppet community.  A lot of these ad-hoc topics will result in mailing list posts and follow-ups in future IRC meetings:

  • Master branch management
  • Some bug triage is needed
  • How to get more people involved in the project
  • Potentially becoming an official OpenStack project
  • Assigning groups of reviews to different projects (subject matter experts)
  • Templatizing config files vs the service_config resources
  • Potential to support Python venvs
  • Post/edit meeting agenda on wiki ahead of meetings
  • Moving meeting to Tuesday 1500 UTC on #openstack-meeting3 (more convenient time, not on Monday morning)

Day 1 Breakout Sessions

Application Ecosystem

Etherpad: https://etherpad.openstack.org/p/PHL-ops-app-eco-wg

(I did not attend this breakout; these are just the high points from the summary presentation.)

  • Idea: API /info call for auto-discovery of capabilities.
  • First app tutorial ideas, migrating an app from AWS to OpenStack.
  • What should an application catalog look like?

Tools/Monitoring

Etherpad: https://etherpad.openstack.org/p/PHL-ops-tools-wg

The group tried not to rehash what has been covered at previous meetups.

Monitoring: common monitoring plugins are being collected in the https://github.com/osops/tools-monitoring/tree/master/nagios-plugins repo.  There is a good list of metrics/items people typically monitor in the etherpad.
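
For flavor, here is a minimal sketch of the kind of Nagios-style check that lives in that repo (the endpoint URL and thresholds are illustrative; real plugins do more):

    #!/usr/bin/env python
    # Minimal Nagios-style check: is the Keystone API answering?
    import sys
    import requests

    OK, WARNING, CRITICAL = 0, 1, 2

    def check(url='http://127.0.0.1:5000/'):
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException as exc:
            print('CRITICAL: keystone unreachable: %s' % exc)
            return CRITICAL
        if resp.status_code >= 500:
            print('CRITICAL: keystone returned %s' % resp.status_code)
            return CRITICAL
        print('OK: keystone responded %s in %.2fs'
              % (resp.status_code, resp.elapsed.total_seconds()))
        return OK

    if __name__ == '__main__':
        sys.exit(check(*sys.argv[1:]))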

Discussion around log formats and filters; it’s still a pain.  There is an ongoing effort in oslo to make this better, which is starting to get some traction now.

The StackTach folks did a demo of StackTach, which provides tracking/correlation of events.  Mainly they want people to try it out and give feedback.  Rackspace uses it for billing, so they take catching notifications very seriously.  http://stacktach.com/ and #stacktach on Freenode.

Discussion around cleanup/removal of resources in other services when a tenant is deleted in Keystone.  There is a dichotomy between the ease of having Keystone clean things up and the complexity of Keystone having to know about everything else.  One idea is to have Keystone emit a notification event that other services could pick up to handle their own cleanup.
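
A rough sketch of how that idea could look with oslo.messaging; the event type and payload field are from memory and should be treated as assumptions:

    # Listen for Keystone project-deletion notifications and trigger
    # service-local cleanup. The transport URL, event_type, and payload
    # layout are assumptions for illustration.
    from oslo_config import cfg
    import oslo_messaging as messaging

    class CleanupEndpoint(object):
        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            if event_type == 'identity.project.deleted':
                project_id = payload.get('resource_info')
                print('would clean up resources for project', project_id)

    def main():
        transport = messaging.get_transport(
            cfg.CONF, url='rabbit://guest:guest@localhost:5672/')
        targets = [messaging.Target(topic='notifications')]
        listener = messaging.get_notification_listener(
            transport, targets, [CleanupEndpoint()])
        listener.start()
        listener.wait()

    if __name__ == '__main__':
        main()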

Large Deployments

Etherpad: https://etherpad.openstack.org/p/PHL-ops-large-deployments

(I did not attend this breakout; these are just the high points from the summary presentation.)

  • Covered several topics.
  • Identified services that don’t seem to be “production ready” and will try to feed that info back to those projects.
  • Identified pain points that should be fairly easy to fix upstream.
  • Shared some best practices.

Tags Discussion

Etherpad: https://etherpad.openstack.org/p/PHL-ops-tags

Moderator: Thierry Carrez

For background on this idea, see http://ttx.re/the-way-forward.html.

TC to define taxonomy for tags, not the tags themselves.

What info and attributes do people want from tags?

  • Fixes backported or only in master?  (Stable branch definition.)
  • Release cycle
  • Compatibility/dependency between projects
  • Package/module dependency (this is a large and important topic)

Large discussion around compatibility and dependencies and how these are defined and discovered.

General consensus on what are considered core services:

  • Keystone, Nova, Glance, Horizon (~100%)
  • Cinder, Neutron (~75%)
  • Swift (~50%)
  • Heat, Ceilometer (~30%)

Make sure to take the user survey!  http://openstack.org/user-survey

What’s the best way to expose these tags?  YAML in a repo sounds like the most popular option.  Tags need well-defined descriptions.
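
Purely as an illustration of the “YAML in a repo” option, something along these lines (tag names are invented for the example, not the taxonomy the TC will define):

    # Illustrative only: what per-project tag data in a YAML repo might
    # look like. Tag names are made up, not an agreed taxonomy.
    import yaml

    projects = {
        'nova': {'tags': ['release:managed', 'stable:maintained']},
        'ceilometer': {'tags': ['release:managed']},
    }

    print(yaml.safe_dump(projects, default_flow_style=False))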

A user committee subgroup is forming to work on some of these questions around tags.

What hardware are you running?

Etherpad: https://etherpad.openstack.org/p/PHL-ops-hardware

Moderator: Randy Perryman, Dell

How to manage BIOS settings is a challenge, but it sounds like the various IPMI and vendor-specific solutions are working.

Architecture Show & Tell

Etherpad: https://etherpad.openstack.org/p/PHL-ops-arch-show-tell

Time Warner Cable

Slides: https://docs.google.com/presentation/d/1dBl5ingzWJmvkUYW8UUHrAPwGw5gmyGK6q3QLKl3sIA/edit?usp=sharing

Matt Fischer and Clayton O’Neill.

Go Daddy

Slides: https://docs.google.com/presentation/d/1cSsERJvguJc0J9pL32Eru2myJFmK886_-dxOerc6C0k/edit?usp=sharing

Mike Dorman.

Blue Box

Slides: https://slides.com/jessekeating/blueboxcloud-snt

Jesse Keating.

Red Hat: TryStack Hardware Refresh

Slides: http://goo.gl/w2fh7o

Dan Radez.