Operation Bootstrap

Web Operations, Culture, Security & Startups.

Continuous Deployment: Are You Afraid It Might Work?


I’ve been wondering for a few years now why it’s so hard to get companies to prioritize the work that I feel is important. I mean, I’m telling you how to do it and you aren’t listening – don’t you want to build quality software?

Would you listen to that argument? I wouldn’t. Everybody has an opinion about how to do things, what makes one better than another?

I think you should listen to me, but that’s irrelevant

I’m on my 4th SaaS company at this point. I’m starting early this time and hoping to steer things in the right direction. I feel like I’ve seen the good, the bad, and the really ugly by now, and I have a pretty good idea of which patterns and anti-patterns are important. The problem hasn’t changed, though: just because I feel they are important doesn’t make them a priority for the company.

When I go to work for a new company, a very important question to answer is whether the company is ready and willing to implement all the cultural and technical requirements for Continuous Deployment. I’ve at least figured out that, from my position, it’s exceptionally hard (so far impossible for me) to convince a company they want to do this – they have to already want to. I know how to implement, I know how to enact change, but the support has to already exist.

I don’t focus on Continuous Deployment because it’s a technical solution you just plug in and go – it isn’t. I focus on it because it drives conversation around all the other areas where organizations should improve. Every improvement you make while working toward Continuous Deployment makes your development process better and your software better. These aren’t things that only pay off once you are doing Continuous Deployment – but when you’ve done them all, deploying continuously becomes a fairly easy decision.

I’m early into my latest venture but already the attention is there, the interest in doing this right. I’m being asked for my thoughts on what we need to prioritize to move toward Continuous Deployment, what do we need to focus on early so that it’s easier later on. I’m also being asked to educate folks on what is important and why, what have I watched work and what have I watched fail. What mistakes can we avoid and what mistakes are we just going to have to make on our own?

Oh, you were hoping not to make mistakes? Good luck with that. The best I can hope for is to make my own mistakes.

Dude, I would never do that…

My first flight on an airplane, in my life, was a skydiving trip. When the instructors discovered this as we ascended toward altitude they said “Well this is perfect, next time you fly you can tell the person next to you that you’ve flown before, but you’ve never landed”. My risk tolerance may differ slightly from others. I like to rock climb as well, plenty of people won’t do that. The thing is, both of these sports involve a very risky activity offset by copious amounts of safety. Still, when you watch the girl up on the cliff hanging by a limb all you can think of is “what if she falls?”. The answer is, she’ll get a little bruised up maybe, but if she’s doing things right she’ll keep right on climbing.

When you’ve been climbing a bit, when you understand the safety mechanisms, you pay much more attention to the climb, to the technique, to each movement. You know that the climber is probably safe, because you know what keeps them safe. You can take more risk with each movement knowing that any single mistake will only set you back so far.

These activities, like Continuous Deployment, look riskier from the outside than they are. If you don’t take the time to understand all the safety mechanisms, you can’t accurately evaluate the risk. For a company that pushes software every few weeks, deciding to push every commit without substantial other changes would be insane – just like I would never go rock climbing without the right equipment (I’m allergic to free soloing – sorry). The act of Continuous Deployment is the realization of a ton of other effort – and all that effort has to be prioritized before you can ever get on the rock face.

Reality check – you are deploying more often than you think

Let’s say you deploy every week – I’m being generous here, but let’s just pretend. So you deploy on Thursday during the day, because you have an awesome deploy process and you know it’s better to spot problems when everyone is in the office. You spot a problem – what do you do? I’m guessing your deploy was from a branch, so you just fix that branch & deploy. Then you merge the fix into master.

Friday comes along, and hey, there’s another critical issue. Fix, branch, deploy. Lather, rinse, repeat. Meanwhile, depending on how involved the fix is and what other stuff you have going on, you’ve got a bunch of merging to get right, and the closer you get to your next branch the more of a problem this becomes. How about a fix on Wednesday before the next deploy? I’m guessing you’ve already cut the next branch, so now you apply the fix to 3 branches (last week’s, this week’s, master).
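To make that toil concrete, here’s a rough sketch – hypothetical branch names, hypothetical commit sha, nobody’s real tooling – of what “apply the fix to 3 branches” looks like once you script the by-hand version:

```python
#!/usr/bin/env python
# A hypothetical sketch of the toil described above: one hotfix commit has to
# land on last week's release branch, this week's release branch, and master.
# Branch names, the commit sha, and the cherry-pick strategy are illustrative
# assumptions, not anyone's actual workflow.
import subprocess


def git(*args):
    """Run a git command and fail loudly if it fails."""
    subprocess.check_call(["git"] + list(args))


def apply_hotfix(commit, branches=("release-week-14", "release-week-15", "master")):
    for branch in branches:
        git("checkout", branch)
        # Every cherry-pick is a chance for a conflict, and every conflict
        # gets resolved by hand - this is exactly the un-automated risk.
        git("cherry-pick", commit)
        git("push", "origin", branch)


if __name__ == "__main__":
    apply_hotfix("abc1234")  # the sha of the fix you just made on the branch
```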

All this deploying and merging and branching is work. The problem is that it’s not automated work – it’s work asking for mistakes to be made. It’s risk. Where are your safety mechanisms? Are they your manual testers? Your automated test suite? If those automated tests aren’t good enough to test each commit before it goes to production, why are they good enough to test each week’s deploy? Because you do manual testing?
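For contrast, the commit-level gate doesn’t need to be fancy. Here’s a minimal sketch, assuming you have some test target and deploy script to call (the names here are placeholders, not a real pipeline):

```python
#!/usr/bin/env python
# A minimal per-commit gate, sketched under the assumption that "make test"
# and "./deploy.sh" exist - both are placeholder names. The point is only
# that the same suite you trust for a weekly release can run on every commit.
import subprocess
import sys


def run_tests():
    return subprocess.call(["make", "test"]) == 0


def deploy(sha):
    return subprocess.call(["./deploy.sh", sha]) == 0


if __name__ == "__main__":
    sha = sys.argv[1] if len(sys.argv) > 1 else "HEAD"
    if not run_tests():
        sys.exit("Tests failed - %s never reaches production." % sha)
    if not deploy(sha):
        sys.exit("Deploy of %s failed - stop the line and fix it." % sha)
    print("Deployed %s" % sha)
```

If that gate isn’t trustworthy per commit, it isn’t any more trustworthy per week – that’s the whole point.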

All that manual deploying, merging, and testing sounds risky to me, but for some reason it sounds less risky than Continuous Deployment to some. I think this can only be because of a lack of understanding around the safety mechanisms, the pre-requisites. The proof is in the pudding, though, and if you still produce shitty software when doing Continuous Deployment – because you write bad tests, don’t do retrospectives, and don’t prioritize the important work of making the system work right – then you’re sunk either way.

There are some companies that are probably better off deploying every 8 weeks.

Wrapping up – why so much focus on Continuous Deployment?

The practices that surround Continuous Deployment/Delivery substantially reduce risk – things like Feature Toggles, automated testing, automated deployments, deploying off master, retrospectives, monitoring, accountability, access, ownership, reduced MTTR, and the list goes on. These all add up to make a software development and deployment environment so safe that anyone can commit code – if it doesn’t work, it will not make it to production.
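Feature Toggles are a good example of how simple some of these safety mechanisms can be. Here’s a minimal sketch, assuming flags live in a local JSON file (real implementations usually sit on a database or config service and add per-user or percentage rollouts); all of the names are made up:

```python
# A minimal feature-toggle sketch. Flag storage, flag names and the auth
# functions are illustrative assumptions.
import json


class FeatureToggles:
    def __init__(self, path="toggles.json"):
        try:
            with open(path) as f:
                self.flags = json.load(f)  # e.g. {"new_auth_backend": false}
        except FileNotFoundError:
            self.flags = {}                # no file means everything stays off

    def enabled(self, name):
        return bool(self.flags.get(name, False))


def legacy_auth(request):
    return "authenticated by the existing, known-good path"


def new_auth_backend(request):
    return "authenticated by the new code path, shipped dark"


toggles = FeatureToggles()


def authenticate(request):
    # The new code is in production but does nothing until the flag flips,
    # so a bad commit behind the toggle can't take users down.
    if toggles.enabled("new_auth_backend"):
        return new_auth_backend(request)
    return legacy_auth(request)
```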

But things will still break. In my experience, code has to break in very subtle ways to make it into production, and as time goes on and you build better monitoring, even those issues should be detected fast & resolved fast.

It can take a while to reach the end goal, but you’ve got to start somewhere. However, even if you never actually practice Continuous Deployment, all of these practices will produce better software and probably happier developers.

Further Reading

Here are a few other good resources to learn about Continuous Deployment/Delivery

DevOps Is Culture - So Is a Lot of Other Stuff…


I hung out in an excellent discussion at DevopsDays driven by Spike Morelli around culture. The premise was that DevOps started as an idea about culture – about behavior patterns that lead to better software. Somewhere along the way our industry shifted this discussion into a tools discussion, and now the amount of noise out there about “DevOps tools” is orders of magnitude higher than any discussion about the real reason DevOps exists – to shift culture.

I looked up the definition of “culture” – here are a few definitions:

  1. the quality in a person or society that arises from a concern for what is regarded as excellent in arts, letters, manners, scholarly pursuits, etc.
  2. that which is excellent in the arts, manners, etc.
  3. the behaviors and beliefs characteristic of a particular social, ethnic, or age group: the youth culture; the drug culture.

Note that culture is the manifestation of intellectual achievement. It’s the evidence and result of achievement. I think the 3rd definition is most appropriate for DevOps – what are the behaviors that are characteristic of a well integrated Development & Operations organization?

The challenge, the discussion, was how we can re-balance the scales and get the word out that this is actually about culture and that tools happen as a result of culture, not the other way around. This post begins my contribution to that effort.

The question was asked – do we all agree that culture is the most important thing when it comes to creating a successful business? The short answer is “yes”. If you wanted to hear all the if/and/but/what-if/etc discussions, you should have come to Devopsdays. For the sake of this blog post – culture is the most important factor. If you want case studies and analysis that prove culture matters, read Jim Collins’ Good to Great.

My present company has a really excellent culture of Developer / Ops cooperation and collaboration. I wasn’t there when it wasn’t that way (if ever) and so I can’t tell a story about how to change your organization. What I can tell you is what a healthy and thriving Dev/Ops practicing organization looks like and what I think some of the key factors in that success are. I see this as two components – there are fundamental core values that enable and support the culture, and then there are tactical things that are done to make the culture work for us. I’d like to talk about both. The culture is the result of these actions and ideas put into practice.

Background

I work for a company with a well defined set of core values. Those values set forth parameters under which the culture exists. Here’s what they are:

These values are public and they matter – they matter a lot.

These might sound hokey to you – but every single one of them is held high at the company & strongly defended. Defending a list of values like this is hard sometimes. When someone doesn’t show respect to others, how do you uphold that core value? When someone’s idea of “work life balance” is different from another person’s, how do you support both of them? When creating your own reality means you don’t want to work for Rally anymore – what do you do?

I’m proud to say that in Rally’s case – they are generally true to the core values. Putting “Create your own reality” on a list of core values doesn’t create culture – what creates culture is having repeated examples where individuals have followed their passion & the company has supported them. This support doesn’t just mean they have permission, it means the company uses whatever resources it can to help. Sometimes this means using your resources to help someone find another job. Sometimes this means helping them get an education they can use at another company. Usually though, it means getting them into a role where they can do their best work. Whatever the case – Rally’s culture is to always be true to that core value and do whatever they can to support an employee in creating their own reality.

This is repeated for all of the core values. By being explicit & public about these values they set the stage for what an employee can expect from Rally as a workplace. But there’s more to it – you have to make sure these core values are upheld and you have to make sure they thrive – and this is where some of the tactical parts come in.

What are the tactical things?

  • Collaboration – at Rally collaboration is a requirement. Development is done almost exclusively in pairs, planning is done as groups, retrospectives are done regularly and the actions from those retrospectives are announced publicly and followed up on. Architecture decisions are reviewed by a group comprised of Developers, Operations and Product.
  • Self Formed Teams – teams are largely formed of individuals who have an interest in that team’s work. When we need a task force, an email will go out to the organization looking for people interested in participating & those teams self-form. This also gives anyone in the company the ability to participate in areas of the business they may never otherwise get exposure to.
  • Servant Leadership – Leaders at Rally often do very similar work to everyone else – they just have the additional responsibility of enabling their teams. Decisions about how to do things don’t often come from managers, they come from the teams.
  • Data Driven Decisions – Not strictly associated with a core value, I think this is one of the most important aspects of the Rally culture. There is an expectation that we establish evidence that a decision is correct. Sometimes this evidence is established before any dev work is done but sometimes this data comes from dark launching a new service or testing out some new piece of software. Either way, it’s understood that the job isn’t really done until you have data to support why a particular decision is right & have talked to the broader group about it.

There are plenty of other things here and there but you get the general idea. We talk a lot & tell each other what we’re doing, we enlist passionate individuals in areas they have interest, we embrace & seek out change and we empower individuals to drive change by working with others.

So what? What does that have to do with Devops?

Everything

2.5 years ago the company had some performance & stability problems. Technical debt had caught up with them and the only real way to fix the problem was to completely change the way the company did development & prioritized their work. The good news is that they did it, but it was made possible by the fact that individuals were empowered to drive that change. Almost overnight, two teams were formed to focus on architectural issues. A council was formed to prioritize architectural work. The things we all complain about never being able to prioritize became a priority and remain a priority to a degree I’ve never experienced before at other companies. Prioritizing this work is defended and advocated by the development teams – something only possible because of the collaborative environment in which we operate.

I have been personally involved in two services that literally started out as a skeleton of an app when they went into production. The goal was to lay the groundwork to allow fast production deployments, get monitoring in place & enable visibility while the system was being developed. This was all done because the developers understand the value of these things, but they don’t know exactly how to build it – they need Ops help. Having tight Ops and Dev collaboration on these projects has made them examples of what works in our organization. These projects become examples for other teams in the company and they push the envelope on new tech. These two projects have:

  • Implemented a completely new monitoring framework that allows developers to add any metric they want to the system
  • Implemented Continuous Deployment
  • Established an example of how and why to Dark Launch a service

I’m sure the list will continue to go on… it’s fantastic stuff.

The Rub – culture isn’t much of anything without people who embrace it.

Along with a responsibility for pushing change from the bottom up in Rally comes responsibility for defending culture – or changing it. This means that when you hire people, they have to align with your core values – they have to be willing to defend that culture or the company as a whole needs to shift culture. All those core values and tactical things will not maintain a culture that the team members do not support. Rally’s culture is what it is because everyone takes it seriously and that includes taking it seriously when there’s a problem that needs fixing.

This has happened. There are core values that used to be on that list above but they aren’t anymore. At one point or another things changed and those values began eroding other core values. This takes time to surface, it takes time to collect data to show it’s true, but when the teams start to observe this trend they have to take action. This isn’t the job of management alone – this is the job of every member of the company. When a voice begins to develop asking for change – you need a culture that allows that change to take place and for everyone to agree on the new shape things take.

That said, it also isn’t possible if management doesn’t support those same core values. Management has the same responsibility to take those core values seriously.

DevOps is our little corner of a much bigger idea

There’s a problem that we’re trying to fix – we’re trying to improve the happiness of people, the quality of software, and the general health of our industry. Our industry is totally healthy when you look at the bottom line, but we’re looking for something more. We want a happy and healthy development organization (including Ops, because Ops is part of the Development organization), but we also want our other teams to be part of that. As Ops folks and Developers, we can clean up our side of the street – we can do better. We seek to set an example for the rest of the organization.

For culture to really improve in companies it has to go beyond Dev and Ops into Executives, Product, Support, Marketing, Sales and everyone else. You ALL own quality by building a healthy substrate (culture) on top of which all else evolves.

But in the end it’s about culture. It’s really only about culture for now – because when you get culture right the other problems are easy to solve.

Congratulations to those of you who read this far – shoot me a note and let me know you read this far because you probably share the same passion about this that I do. Also – putting up blog posts from 32,000 feet is awesome – thanks Southwest.

Rate My Skills: No


I’ve been asked in the past to rate my skills when chatting with a recruiter about a position. It goes something like this:

“Can you rate your skills in the following areas on a scale of 1 to 10:
Linux
Perl
SAN Storage
Networking
etc”

My response is generally, “no and goodbye”. This isn’t out of arrogance, but there’s just no point in moving forward. When we’re talking about a job I’m interviewing your company every step of the way. One key question I’m always asking myself is “does this seem like a hiring process that would hire people I want to work with?”. I don’t get a huge chance to interview my co-workers when I’m coming into a new place so my only real guide as to what type of folks they are is… the hiring process.

If your hiring process starts by selecting people based on how they rate their own skills, think of who you are excluding. Also, think of who ends up at the top of that list – arrogant, self-important folks who think they are experts in everything. Perhaps the HR person is clever and is looking for folks who rate themselves poorly to suggest humility, but there are better ways.

I want to work on a team where you hire people because they fit well into the team, because they have passion about their job, and because they can learn. I don’t want you to hire someone who already thinks they know everything because every conversation with that person is going to be an argument about why something new isn’t as good as what they’ve already done before. People do what has worked for them in the past unless they are tinkerers who like to try new things – in which case they probably don’t become experts, they become generalists. The best folks I’ve worked with didn’t know how to do most of their day to day tasks when they started the job. They learned, they asked questions, they were humble and eager and engaging. Those are the people I want to work with.

Yes, I’m restricting my potential job opportunities by doing this – which is exactly the point.

I’m Letting My CISSP Lapse


I took a 5 day course, studied for 2 months, and spent quite a bit of time stressing about passing the test for my CISSP. It was a good experience and I’m glad I did it – but the time has come to let it go. The problem is, it hasn’t done anything for me and it requires a fair amount of ongoing effort to maintain. Doing the work to keep up on the CPEs is fine if you are in the security industry – I’m not. I was at one time, but not anymore, and it’s not where I’m interested in being. I could fire up SANS webcasts and let them run on mute in the background on my computer to rack up CPEs – but that’s just lame.

So, I’m letting it go. I’ve thought a lot about this. It would be different if I had ever gotten an opportunity because I held the certification – I haven’t. Or if the knowledge I gained from it helped me do my job – it doesn’t. Or if people even cared that I have it – they don’t (except for one guy in Korea who I had BBQ with at 4am; he seemed excited about it). And lastly, it would be different if I thought I was ever going after a security job – but I’m pretty sure I’m not.

I think certifications have their place and I applaud anyone who goes out and earns them. I just personally have better things to do with my time than maintain something that really hasn’t added any value to my career. I encourage others to seek out certifications that do add value to their career and I encourage those who help maintain certifications to keep them evolving & continue to challenge folks to get them and keep them.

Embedding Ops Members in Dev Teams - My Recent Experience


For about 2 months I sat with a dev team while we worked through how to build a new service that would be continuously deployed. I wanted to share my experiences here because I’ve read both positive and negative opinions about doing this and I’m not sure there’s a single right answer. It was certainly an interesting experiment and one I may repeat in the future, with some modifications.

This started when I began attending this team’s daily stand-up. My goal was to get more involved with a single team in dev to get a better idea of how the process worked in general. This team was one of our two Infrastructure teams, which focus on scalability, stability & performance-enhancing changes. Initially this team was creating a new Web Services API service, which I wrote a little bit about here.

Eventually that service was set aside and the team moved on to a new Authorization and Authentication service. For this new service the decision was made to use Continuous Deployment. We were already doing fully automated deploys at least once per week but there was a bit of a jump to giving the developers the tools they needed to deploy every commit including monitoring & deployment automation changes.

I had also noticed, leading up to this, that the few times I had sat over with the team I was immediately more involved in discussions – they asked me questions (because I was there) and I had the option of attending planning sessions. There was literally a 20 foot difference between my own desk & the desk I sat at “with them”, but it made a world of difference. As such, I talked to my management about sitting with that team all the time and they agreed to try it.

Now, this team is a bit unique. The team is constructed of a handful of developers working on the code but it also is the home of the Build & Release guy as well as our Sysadmin who manages the testing infrastructure. Sitting with this team gave me an opportunity to not only be involved in the development of this new service but to also become more involved in the Build & Release process, getting familiar with the day to day problems that are dealt with as well as pairing with folks to work on our puppet configurations which are shared between dev & prod. This team structure, along with me, also made them uniquely suited to tackle the Continuous Deployment problem (at least for this service) completely within a single team.

As part of the Continuous Deployment implementation we wanted to make it as easy as possible for developers to get access to the metrics they needed. We already had splunk for log access, but our monitoring system required Ops involvement to manage new metrics. So as part of this new service we also had to perform a spike on new metric collection/trending systems – we looked at Ganglia & Graphite. We weren’t trying to tackle alerting – we just made it a requirement that any system we selected be able to expose metrics to Nagios. I worked with the developers to test out a variety of ways for our application to push metrics into each of these systems while also evaluating each system for good Operational fit (ease of management, performance, scalability, etc).
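To give a flavor of what “developers add any metric they want” can look like, here’s a small sketch against Graphite’s plaintext listener (carbon accepts lines of “metric.path value timestamp” on TCP 2003 by default). The host and metric names are made up:

```python
# A sketch of pushing an application metric straight to carbon's plaintext
# listener. Host, port and the metric name are illustrative assumptions.
import socket
import time


def push_metric(path, value, host="graphite.example.com", port=2003):
    line = "%s %f %d\n" % (path, value, int(time.time()))
    sock = socket.create_connection((host, port), timeout=2)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()


# From application code, a new metric is just a new name - it shows up in
# the trending UI without an Ops ticket:
push_metric("authz.requests.latency_ms", 42.0)
```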

Throughout this process there were also a lot of questions about how to perform deployments. How many previous builds do we keep? When and how do we rollback? What is our criteria for calling a deployment successful? How do we make sure it fails in test before it fails in production? What do we have to build into the service to allow rolling deploys to not interrupt service? The list goes on – these are all things that you should think about with any service but when the Developers are building the deployment tools they become very aware of all of this – it was awesome.
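Here’s a hedged sketch of where some of those answers can land – keep the last N builds on disk, point a “current” symlink at the new one, gate on a health check, and roll back by flipping the symlink back if the check fails. The paths, service name and health URL are assumptions, not our actual deploy tooling:

```python
# A deploy-with-rollback sketch. Directory layout, service name and health
# endpoint are all assumptions for illustration.
import os
import shutil
import subprocess
import time
import urllib.request

RELEASES = "/srv/myservice/releases"   # one directory per build
CURRENT = "/srv/myservice/current"     # symlink to the live build
KEEP = 5                               # how many previous builds to retain


def healthy(url="http://localhost:8080/health", tries=10):
    for _ in range(tries):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass
        time.sleep(3)
    return False


def restart_service():
    subprocess.check_call(["sudo", "service", "myservice", "restart"])


def prune_old_releases():
    # Assumes build ids sort chronologically (e.g. timestamped directories).
    builds = sorted(os.listdir(RELEASES))
    for old in builds[:-KEEP]:
        shutil.rmtree(os.path.join(RELEASES, old))


def deploy(build_id):
    previous = os.path.realpath(CURRENT) if os.path.lexists(CURRENT) else None
    target = os.path.join(RELEASES, build_id)
    subprocess.check_call(["ln", "-sfn", target, CURRENT])
    restart_service()
    if not healthy():
        # Rollback: point "current" back at the last known-good build.
        if previous:
            subprocess.check_call(["ln", "-sfn", previous, CURRENT])
            restart_service()
        raise SystemExit("deploy of %s failed its health check" % build_id)
    prune_old_releases()
```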

After about 45 days we had the monitoring system selected & running in production and test, we had deployments going to our testing systems, and we were just starting to deploy into production. We now had to start our dark launch: sending traffic from our production system to the new service without impacting users, so we could see how this backend service performed, whether it responded correctly to production traffic, & generally get a better understanding of its behavior under prod load. Today this service is still operating dark as we tweak and tune a variety of things to make sure it’s ready for production – again, it’s awesome.
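For anyone who hasn’t seen a dark launch, the mechanics can be as simple as this sketch (assuming an HTTP backend – the endpoint and function names are illustrative): production answers the user exactly as before, while a copy of the request is fired at the new service in the background and only measured, never returned.

```python
# A dark-launch sketch: mirror production requests to the new service without
# letting it affect the response users get. Endpoint and names are assumptions.
import threading
import urllib.request

DARK_URL = "http://new-auth-service.internal/authorize"  # assumed endpoint


def legacy_authorize(payload):
    return b"OK"  # stand-in for the existing production code path


def mirror_to_dark_service(payload):
    def _fire():
        try:
            req = urllib.request.Request(DARK_URL, data=payload, method="POST")
            urllib.request.urlopen(req, timeout=1).read()
            # This is where you would record latency and compare responses.
        except Exception:
            pass  # the dark service must never be able to hurt production

    threading.Thread(target=_fire, daemon=True).start()


def handle_request(payload):
    response = legacy_authorize(payload)  # the real answer users see
    mirror_to_dark_service(payload)       # shadow traffic for comparison
    return response
```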

60 days in, things started winding down. We had been dark launched for a few weeks and the developers largely had access to everything they needed – they could look at graphs and logs, and if they needed new metrics they just added them to the code and they showed up in monitoring as soon as they deployed. We got deploy lines added onto the graphs so we could correlate deployments with trends on the graph – more awesome. However, my work was winding down; there were fewer and fewer Operational questions coming up and I was starting to move back toward working on other Ops projects.

As I looked back on the last 60 days working with this team I realized the same 20 feet that kept me from being involved with the development team had now kept me from being involved with the Ops team. I was really conflicted but it felt like the healthy thing to do would be to move back over into Ops now that the work was winding down. I immediately realized the impact it had as people made comments “wow, you’re back!”… seriously folks, I was 20 feet away! You shot me with nerf darts!

So now I’ve been back over in Ops for a few weeks and there has actually been a change – I’m still much more involved with that Dev team than I was from the start. They still include me in planning & they come to me when there are Operational questions or issues that come up around the service. However, that 20 feet is there again and I can’t hear all the conversations and I know there are questions that would get asked if someone didn’t have to stand up and walk over. Our Dev teams tend to do a lot of pairing and as a result aren’t often on IM and email responses are usually delayed – pairing certainly cuts down on the email checking.

Was I happy I did it? Absolutely. Would I do it again? I think I would – but I would constrain it and set expectations. The physical proximity to the team helped a lot to move quickly and toss ideas around while the service was being developed and decisions were being made, but it had an impact on my relationship with the Ops team that I wish I could have avoided. I think continuing to move back and forth – spending time with the Ops team – would be helpful. I actually did spend my on-call weeks (every 4th week) in Ops instead of sitting with the Dev team, but next time I would try to find some time during the 3 weeks in between to be over there too; the straight weeks away were just too much absence.

All that said, I think the company and the service are better overall for the way this turned out, and for me personally it was a super insightful experience that I wish every Ops person could try sometime.