Chef Cookbook dependency management and the environment cookbook pattern by Matt Wrock

Last week I discussed how we at CenturyLink Cloud are approaching the versioning of cookbooks, environments and data bags, focusing on our strategy of generating version numbers from git commit counts. In this post I'd like to explore one of the concrete values these version numbers provide in your build pipeline. We'll explore how these versions can eliminate surprise breakages when the cookbook dependencies in production differ from the cookbooks you used in your tests. You are testing your cookbooks, right?

A cookbook is its own code plus the sum of its dependencies

My build pipeline must be able to guarantee to the furthest extent possible that my build conditions match the same conditions of production. So if I test a cookbook with a certain set of dependencies and the tests pass, I want to have a high level of confidence that this same cookbook will converge successfully on production environments. However, if I promote the cookbook to production but my production environment has different versions of the dependent cookbooks, this confidence is lost because my tests accounted for different circumstances.

Even if these versions differ only at the patch level (the third number in the semantic versioning schema), I still consider this an important difference. I know that only major version changes should include breaking changes, but let's just see a show of hands from those who have deployed a bug fix patch that regressed elsewhere… I thought so. You can put your hands down now, everyone. Regardless, it can be quite easy for major version changes to slip in if your build process does not account for this.

Common dependency breakage scenarios

We will assume that you are using Berkshelf to manage dependencies and that you keep your Berksfile.lock files synced and version controlled. Let's explore why this is not enough to protect against dependency change creep.

Relaxed version constraints

Even if you have rigorously ensured that your metadata.rb files include explicit version constraints, there is no guarantee that every downstream community cookbook specifies constraints in its own metadata.rb file. You might think this is exactly where the Berksfile.lock saves you. A Berksfile.lock will snap the entire dependency graph to the exact versions of the last sync. If this lock file is in source control, you can be assured that fellow teammates and your build system are testing the top level cookbook with all the same dependency versions. So far so good.

Now you upload the cookbooks to the chef server, and when it's time for a node to converge against your cookbook changes, where is your Berksfile.lock now? Unless you have something in place to account for this, chef-client is simply going to get the highest available version for any cookbook dependency without constraints. If anyone at any time uploads a higher version of a community cookbook that other cookbooks had been tested against at lesser versions, a break can easily occur.
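To make this concrete, here is a minimal sketch (cookbook name and versions are invented) of how resolution behaves for an unconstrained dependency: with no constraint, the highest uploaded version wins, regardless of what your Berksfile.lock was tested against.

```ruby
require 'rubygems' # Gem::Version compares semantic versions correctly

# versions of a hypothetical "apt" cookbook sitting on the chef server
uploaded       = ['2.5.0', '2.6.0', '3.0.0']
tested_against = '2.6.0' # the version pinned in your Berksfile.lock

# with no constraint, chef-client receives the highest available version
chosen = uploaded.max_by { |v| Gem::Version.new(v) }
puts "node converges with #{chosen}, but you tested with #{tested_against}"
```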

Dependency islands in the same environment

This is related to the scenario just described above and explains how cookbook dependencies can be uploaded from one successful cookbook test run that can break sibling cookbooks in the same environment.

The Berksfile.lock file generates the initial dependency graph upon the very first berks install. Therefore you can create two cookbooks that have all of the same cookbook dependencies but if you build the Berksfile.lock file even hours apart, there is a perfectly reasonable possibility that the two cookbooks will have different versioned dependencies in their respective Berksfile.lock files.

Once both sets are uploaded to the chef server, unless an explicit version constraint is specified, the highest eligible version wins, and this may be a different version than the one some of the cookbooks using this dependency were tested against. So now you can only hope that everything works when nodes start converging.

Poorly timed application of cookbook constraints

You may be thinking the obvious remedy to all of these dependency issues is to add cookbook constraints to your environment files. I think you are definitely on the right track. This will eliminate the mystery version creep scenarios, and you can look at your environment file and know exactly which versions of which cookbooks will be used. However, in order for this to work, it has to be carefully managed. If I promote a cookbook version along with all of its dependencies by updating the version constraints in my environment, can I be guaranteed that all other cookbooks in the environment with the same downstream dependencies have been tested with the new versions being introduced?

I do believe that constraining environment versions is important. These constraints can serve the same function as your Berksfile.lock file within any chef server environment. However, unless the entire matrix of constraints matches up against what has been tested, these constraints provide inadequate safety.

Safety guidelines for cookbook constraints in an environment

A constraint for every cookbook

Your internal top level cookbooks are not the only ones that should be constrained. All dependent cookbooks should also include constraints. Any cookbook missing a constraint introduces the possibility that an untested dependency graph will be introduced.

All constraints point to exact versions (no pessimistic constraints)

I do think pessimistic constraints are fine in your Berksfile or metadata.rb files. This basically says you are OK with upgrading dependencies within dev/test scenarios, but once you need to establish a baseline of a known good set of cookbooks, you want that known group to be declared rigidly. Unless you point to precise versions, you are stating that you are OK with "in-flight" change, and that is the kind of change that can bring down your system.
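As a sketch of the difference, Ruby's Gem::Requirement follows the same constraint semantics (the cookbook versions here are invented):

```ruby
require 'rubygems'

# a pessimistic constraint, fine in a Berksfile or metadata.rb during dev
pessimistic = Gem::Requirement.new('~> 2.6.0')
# an exact pin, which is what the environment file should carry
exact = Gem::Requirement.new('= 2.6.0')

new_patch = Gem::Version.new('2.6.4') # someone uploads a patch release

pessimistic.satisfied_by?(new_patch) # => true: in-flight change allowed
exact.satisfied_by?(new_patch)       # => false: the baseline stays fixed
```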

Test all changes and all cookbooks potentially affected by changes

You will need to be able to identify which cookbooks had no direct changes applied but are part of the same graph undergoing change. In other words, if both cookbook A and B depend on C, and C gets bumped as you are developing A, you must be able to automatically identify that B is potentially affected and must be tested to validate your changes to A, even though no direct changes were made to B. Not until A, B, and C all converge successfully can you consider the changes to be included in your environment.
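The A, B, and C example above amounts to a reverse dependency walk. Here is a minimal sketch; the graph and helper are hypothetical, not part of any Chef tooling:

```ruby
# maps each cookbook to the cookbooks it depends on
DEPENDS_ON = {
  'A' => ['C'],
  'B' => ['C'],
  'C' => []
}

# returns every cookbook that must be retested when `changed` is bumped
def affected_by(changed, graph)
  dirty = [changed]
  loop do
    # pull in any cookbook that depends on something already marked dirty
    more = graph.select { |cb, deps| (deps & dirty).any? && !dirty.include?(cb) }.keys
    break if more.empty?
    dirty.concat(more)
  end
  dirty.sort
end

affected_by('C', DEPENDS_ON) # => ["A", "B", "C"]
```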

Leveraging the environment cookbook pattern to keep your tree whole

The environment cookbook pattern was introduced by Jamie Winsor in this post. The environment cookbook can be thought of as the trunk or root of your environment. You may have several top level cookbooks that have no direct dependencies on one another, and therefore you are subject to the dependency islands referred to above. However, if you have a common root declaring a dependency on all your top level cookbooks, you now have a single coherent graph that can represent all cookbooks.

The environment cookbook pattern prescribes the inclusion of an additional cookbook for every environment to which you want to apply this versioning rigor. This cookbook is called the environment cookbook and includes only four files:

README.md

Providing thorough documentation of the cookbook’s usage.

metadata.rb

Includes one dependency for each top level cookbook in your environment.

Berksfile and Berksfile.lock

These express the canonical dependency graph of your chef environment. Jamie suggests that this is the only Berksfile.lock you need to keep in source control. While I agree it's the only one that "needs" to be in source control, I do see value in keeping the others. I think that by keeping the "child" Berksfile.lock files in sync, the top level dependencies may fluctuate less often, providing a bit more stability during development.
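Putting the files above together, a minimal environment cookbook metadata.rb might look like this sketch (the dependency list is illustrative; yours would name whatever your top level cookbooks are):

```ruby
name        'platform_environment'
maintainer  'CenturyLink Cloud'
description 'Environment cookbook: roots the dependency graph'
version     '1.0.1'

# one depends line per top level cookbook in the environment
depends     'platform_haproxy'
depends     'platform_couchbase'
depends     'platform_win'
```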

Generating cookbook constraints against an environment cookbook

Some will suggest using berks apply in the environment cookbook and pointing to the environment you want to constrain. I personally do not like this method because it simply uploads the constraints to the environment on the chef server. I want to generate the constraints locally first, where I can run tests and version control the environment file before anything reaches the server.

At CenturyLink Cloud we have steps in our CI pipeline that not only add the correct constraints but also allow us to identify all cookbooks impacted by the constraints and ensure that all impacted cookbooks are then tested against the exact same set of dependencies. Here is the flow we are currently using:

Generating new cookbook versions for changed cookbooks

As included in the safety guidelines above, this not only means that cookbooks with changed code get a version bump, it also means that any cookbook that takes a dependency on one of these changed cookbooks also gets a bump. Please refer to my last post which describes the version numbering strategy. This is a three step process:

  1. Initial versioning of all cookbooks in the environment. This results in all directly changed cookbooks getting bumped.
  2. Sync all individual Berksfile.lock files. This will effectively change the Berksfile.locks of all dependent cookbooks.
  3. A second versioning pass that ensures that all cookbooks affected by the Berksfile.lock updates also get a version bump.

Generate master list of cookbook constraints against the environment cookbook

Using the Berksfile of the environment cookbook, we will apply the new cookbook versions to a test environment file:

def constrain_environment(environment_name, cookbook_name)
  dependencies = environment_dependencies(cookbook_name)
  env_file = File.join(config.environment_directory,
    "#{environment_name}.json")
  content = JSON.parse(File.read(env_file))

  # pin every cookbook in the graph to the exact locked version
  content['cookbook_versions'] = {}
  dependencies.each do |dep|
    content['cookbook_versions'][dep.name] = dep.locked_version.to_s
  end

  File.open(env_file, 'w') do |out|
    out << JSON.pretty_generate(content)
  end
  dependencies
end

def environment_dependencies(cookbook_name)
  # resolve the full graph from the environment cookbook's Berksfile
  berks_name = File.join(config.cookbook_directory,
    cookbook_name, "Berksfile")
  berksfile = Berkshelf::Berksfile.from_file(berks_name)
  berksfile.list
end

This will result in a test.json environment file getting all of the cookbook constraints for the environment. Another positive byproduct of this code is that it will force a build failure in the event of version conflicts.

It is very possible that one cookbook will declare a dependency with an explicit version while another cookbook declares the same cookbook dependency with a different version constraint. In these cases, the Berkshelf list command invoked above will fail because it cannot satisfy both constraints. It's good that it fails now so you can align the versions before the final constraints are locked, rather than hitting a version conflict during a chef client run on a node.

Run kitchen tests for impacted cookbooks against the test.json environment

How do we identify the impacted cookbooks? Well, as we saw above, every cookbook that was either directly changed or impacted via a transitive dependency got a version bump. Therefore it's a matter of comparing a cookbook's new version to the version in the last known good tested environment. I've created an is_dirty function to determine if a cookbook needs to be tested:

def is_dirty(environment_name, cookbook_name, environment_cookbook)
  dependencies = environment_dependencies(environment_cookbook)

  cb_dependency = dependencies.find { |dep| dep.name == cookbook_name }

  env_file = File.join(config.environment_directory,
    "#{environment_name}.json")

  content = JSON.parse(File.read(env_file))
  versions = content['cookbook_versions']

  # no constraints recorded at all, or this cookbook was never
  # constrained: treat it as dirty so it gets tested
  return true if versions.nil? || !versions.has_key?(cookbook_name)

  # dirty when the newly locked version differs from the last known good
  cb_dependency.locked_version.to_s != versions[cookbook_name]
end

This method takes the environment that represents my last known good environment (the one where all the tests passed), the cookbook to check for dirty status and the environment cookbook. If the cookbook is clean, it effectively passes this build step.
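The core comparison inside is_dirty can be distilled into a self-contained sketch (the file contents and versions here are invented):

```ruby
require 'json'

# the last known good environment, where every test previously passed
last_known_good = JSON.parse('{"cookbook_versions": {"apt": "2.6.0"}}')

# dirty when the newly locked version differs from, or is absent in,
# the last known good constraints
def dirty?(name, locked_version, env)
  versions = env['cookbook_versions'] || {}
  return true unless versions.key?(name) # never constrained: must test
  versions[name] != locked_version
end

dirty?('apt', '2.7.0', last_known_good)  # => true, needs a kitchen run
dirty?('apt', '2.6.0', last_known_good)  # => false, passes this step
dirty?('curl', '2.0.0', last_known_good) # => true, no prior baseline
```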

In a future post I may go into detail regarding how we utilize our build server to run all of these tests concurrently from the same git commit and aggregate the results into a single master integration result.

Create a new Last Known Good environment

If any test fails, the entire build fails and it all stops for further investigation. If all tests pass, we run through the above constrain_environment method again to produce the final cookbook constraints of our Last Known Good environment which serves as a release candidate of cookbooks that can converge our canary deployment group. The deployment process is a topic for a separate post.

The Kitchen-Environment provisioner driver

One problem we hit early on was that when test-kitchen generated the Berksfile dependencies to ship to the test instance, the versions it generated could differ from the versions in the environment file. This was because test-kitchen's chef-zero provisioner, as well as most of the other chef provisioner drivers, runs a berks vendor against the Berksfile of the individual cookbook under test. That may produce different versions than a berks vendor against the environment cookbook, which also illustrates why we are following this pattern. When this happens, the individual cookbook on its own runs with a different dependency than it would against a chef server.

What we needed was a way for the provisioner to run berks vendor against the environment cookbook. The following custom provisioner driver does just this.

require "kitchen/provisioner/chef_zero"

module Kitchen
  module Provisioner
    class Environment < ChefZero

      def create_sandbox
        super
        prepare_environment_dependencies
      end

      private

      def prepare_environment_dependencies
        tmp_env = "TMP_ENV"
        path = File.join(tmpbooks_dir, tmp_env)
        env_berksfile = File.expand_path(
          "../#{config[:environment_cookbook]}/Berksfile",
          config[:kitchen_root])

        info("Vendoring environment cookbook")
        ::Berkshelf.set_format :null

        Kitchen.mutex.synchronize do
          Berkshelf::Berksfile.from_file(env_berksfile).vendor(path)

          # we do this because the vendoring converts metadata.rb
          # to json. any subsequent berks command on the
          # vendored cookbook will fail
          FileUtils.rm_rf Dir.glob("#{path}/**/Berksfile*")

          # overwrite each sandboxed cookbook with the version
          # vendored from the environment cookbook's graph
          Dir.glob(File.join(tmpbooks_dir, "*")).each do |dir|
            cookbook = File.basename(dir)
            next if cookbook == tmp_env
            env_cookbook = File.join(path, cookbook)
            if File.exist?(env_cookbook)
              debug("copying #{env_cookbook} to #{dir}")
              FileUtils.copy_entry(env_cookbook, dir)
            end
          end
          FileUtils.rm_rf(path)
        end
      end

      def tmpbooks_dir
        File.join(sandbox_path, "cookbooks")
      end

    end
  end
end

Environments as cohesive units

This is all about treating any chef environment as a cohesive unit wherein any change introduced must be considered against all parts involved. One may find this overly rigid or change averse. One belief I have regarding continuous deployment is that in order to be fluid and nimble, you must have rigor. There is no harm in a high bar for build success as long as it is all automated, making the rigor easy to apply. Having a test framework that guides us toward success is what can separate a continuous deployment pipeline from a continuous hot fix fire drill.

Using git to version stamp chef artifacts by Matt Wrock

This post is not about using git for source control. It assumes that you are already doing that. What I am going to discuss is a version numbering strategy that leverages the git log. The benefit here is the guarantee that any change in an artifact (cookbook, environment, data bag) will result in a unique version number that will not conflict with versions produced by your fellow teammates. It ensures that deciding what version to stamp your change with is one thing you don't need to think about. I'll close the post by demonstrating how this can be automated as part of your build process.

The strategy explained

You can use the git log command to list all commits applied to a directory or individual file in your repository:

git log --pretty=oneline some_directory/

This will list all commits within the some_directory directory in a single line per commit that prints the sha1 and the commit comment. To make this a version, you would count these lines:

powershell:
(git log --pretty=oneline some_directory/).count

bash:
git log --pretty=oneline some_directory/ | wc -l

Semantic versioning

If you are using semantic versioning to express version numbers, the commit count can be used to produce the build number (the third element of a version). So what about the major and minor numbers? One argument you can pass to git log is a starting ref from which to list commits. When you decide to increment the major or minor number, you tag your repository with those numbers:

git tag 2.3

So now you want your build numbers to reset to 0 starting from the commit being tagged. You can do this by telling git's log command to list all commits from that tag forward like so:

git log 2.3.. --pretty=oneline some_directory/ 

If you were to run this just after tagging your repo with the major and minor versions, you would get 0 commits, and thus the semantic version would be 2.3.0. So you will need to give some thought to incrementing the major and minor numbers, but the final element just happens as you commit.
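Putting the tag and the count together, the full version can be assembled in a couple of lines of bash (the directory name is the same placeholder used above; the tr call just strips the padding some wc implementations emit):

```shell
# find the most recent major.minor tag, e.g. 2.3
tag=$(git describe --tags --abbrev=0)
# count commits touching this path since that tag
count=$(git log "$tag".. --pretty=oneline some_directory/ | wc -l | tr -d ' ')
echo "$tag.$count"   # e.g. 2.3.17
```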

Benefits and downsides to this strategy

Before getting into the details of applying this to chef artifacts, let's briefly examine some pros and cons of this technique.

Upsides

Any change to a versionable artifact will result in a unique build number and if two builds have the same contents, their build numbers will be the same

This is crucial especially if you need to communicate with customers or fellow team members regarding features or bugs. This can help to remove confusion and ensure you are discussing the same build. If you are using a bug tracking system, you will want to include this version in the bug report so other team members reviewing the bug can checkout that version from source control or review all changes made since that version was committed.

Builds can be produced independently of a separate build server

Especially for solo/side projects where you may not even have a build server, this can help you create deterministic build numbers. Even if your project's authoritative builds are produced by a system like Jenkins or TeamCity, individual team members can produce their own builds with the same build numbers your build server would generate (assuming the build server uses this strategy). Of course the number may vary slightly if other team members have produced commits and have not yet pushed to your shared remote, or if the build is performed without pulling the latest changes. That's why you also want to include the current sha1 somewhere in your artifact. More on that later.

Allows you to separately version different artifacts in your repository

Especially if your chef repository houses multiple cookbooks and you freeze your cookbook versions or use version constraints in your environments, this can be very important. I want to know that any change to a cookbook will increment the version and if the cookbook has remained unchanged, its version should be the same.

Downsides

There will be gaps in your build numbers

You will likely commit several times between builds. So two subsequent builds with, say, 5 commits in between will increment the build number by 5. This should not be an issue as long as your team is aware of it. However, if you consider sequential build numbers important as a customer facing means of communicating change, this could be an issue. I have used this technique on a couple of fairly popular OSS projects and have never had an issue with users or contributors stumbling over this.

Build numbers can get big

If you rarely increment the major or minor numbers, this will surely happen over time. I try to increment the minor number on any feature enhancing release, in which case this is not usually an issue.

If build agents cannot talk to git

If you are using a centralized build server (and if this is a collaborative project, you certainly should be), you definitely want the builds produced by your build server to follow this same strategy. In order to do that, you need to configure your build server to delegate the git pull to the build agents. Otherwise, the git log commands will not work: the build agent must have an actual git repo, with the .git folder available, to see the commit counts.

Applying this to chef artifacts

First, what do I mean by “chef artifacts?” Don’t I really mean cookbooks? No. While cookbooks are certainly included and are the most important artifact to version, I also want to version environment and data_bag files. If I used roles, I would version those too. Regardless of the fact that cookbooks are the only entity that has first class versioning support on a chef server, I should be able to pin these artifacts to their specific git commit. Also, I may change environment or data_bag files several times before uploading to the server and I may want to choose a specific version to upload. If you add cookbook version constraints to your environments, any dependency change will result in a version bump to your environment and your environment version may serve as a top level repository version.

Stamping the artifact

So what gets stamped where? For cookbooks this is obvious. The version string in metadata.rb will have the generated version applied. For environment and data_bag files, we create a new json element in the document:

{
  "name": "test",
  "chef_type": "environment",
  "json_class": "Chef::Environment",
  "override_attributes": {
    "environment_parent": "QA",
    "version": "1.0.24",
    "sha1": "c53bdaa92d67bea151928cdff10a8d5e634ec880"
  },
  "cookbook_versions": {
    "apt": "2.6.0",
    "build-essential": "2.0.6",
    "chef-client": "3.7.0",
    "chef_handler": "1.1.6",
    "clc_library": "1.0.20",
    "cron": "1.5.0",
    "curl": "2.0.0",
    "dmg": "2.2.0",
    "git": "4.0.2",
    "java": "1.28.0",
    "logrotate": "1.7.0",
    "ms_dotnet4": "1.0.2",
    "newrelic": "2.0.0",
    "platform_couchbase": "1.0.31",
    "platform_elasticsearch": "1.0.40",
    "platform_environment": "1.0.1",
    "platform_haproxy": "1.0.36",
    "platform_keepalived": "1.0.4",
    "platform_octopus": "1.0.13",
    "platform_rabbitmq": "1.0.33",
    "platform_win": "1.0.71",
    "provisioner": "1.0.209",
    "queryme": "1.0.2",
    "runit": "1.5.10",
    "windows": "1.34.2",
    "yum": "3.3.2",
    "yum-epel": "0.5.1"
  }
}

I add the version as an override attribute since you cannot add new top level keys to environment files. However, for data_bag files I do insert the version as a top level json key.
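For illustration, a versioned data_bag item might look like the following sketch (the id and settings are invented; the sha1 is just the commit that produced the version):

```json
{
  "id": "deploy_settings",
  "version": "1.0.12",
  "sha1": "c53bdaa92d67bea151928cdff10a8d5e634ec880",
  "canary_percent": 10
}
```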

Including the sha1

You may have noticed that the environment file displayed above has a sha1 attribute just below the version. Every commit in git is identified by a sha1 hash that uniquely identifies it. While the version number is a human readable form of expressing changes and can still be used to find the specific commit in git that produced the version, having the sha1 included with the version makes it much easier to track down the specific git commit. I can simply do a:

git checkout <sha1>

This will update my working directory to match all code exactly as it was when that version was committed. If you report problems with a cookbook and can give me this sha1, I can bring up its exact code in seconds.

As we have already seen, the sha1 is stored in a separate json attribute for environment and data_bag files. For a cookbook's metadata.rb file, I add it as a comment at the end of the file:

name        'platform_haproxy'
maintainer  'CenturyLink Cloud'
license     'All rights reserved'
description 'Installs/Configures haproxy for platform'
version     '1.0.36'

depends     'platform_keepalived'
depends     'newrelic'
#sha1 'c53bdaa92d67bea151928cdff10a8d5e634ec880'

Bringing all of this together with automation

At CenturyLink Cloud, we are using this strategy for our own chef versioning. I have been working on a separate “promote” gem that oversees our delivery pipeline of chef artifacts. This gem exposes rake tasks that handle the versioning discussed in this post as well as the process of constraining cookbook versions in various qa and production environments and uploading these artifacts to the correct chef server. The rake tasks tie in to our CI server so that the entire rollout is automated and auditable. I’ll likely share different aspects of this gem in separate posts. It is not currently open source, but I can certainly share snippets here to give you an idea of how this generally works.

Our Rakefile loads in the tasks from this gem like so:

config = Promote::Config.new({
  :repo_root => TOPDIR,
  :node_name => 'versioner',
  :client_key => File.join(TOPDIR, ENV['versioner_key']),
  :chef_server_url => ENV['server_url']
  })
Promote::RakeTasks.new(config)
task :version_chef => [
  'Promote:version_cookbooks', 
  'Promote:version_environments', 
  'Promote:version_data_bags'
]

So running rake version_chef will stamp all of the necessary artifacts with their appropriate version and sha1. The code for versioning an individual cookbook looks like this:

# current_tag, sha1 and git are provided elsewhere in the gem
def version_cookbook(cookbook_name)
  dir = File.join(config.cookbook_directory, cookbook_name)
  cookbook_name = File.basename(dir)
  version = version_number(current_tag, dir)
  metadata_file = File.join(dir, "metadata.rb")
  metadata_content = File.read(metadata_file)

  # pull the current version string out of metadata.rb
  version_line = metadata_content[/^\s*version\s.*$/]
  current_version = version_line[/('|").*("|')/].gsub(/('|")/, "")

  if current_version != version
    metadata_content = metadata_content.gsub(current_version, version)

    # replace the trailing #sha1 comment, or append one if missing
    outdata = metadata_content.gsub(/#sha1.*$/, "#sha1 '#{sha1}'")
    if outdata[/#sha1.*$/].nil?
      outdata += "#sha1 '#{sha1}'"
    end

    File.open(metadata_file, 'w') do |out|
      out << outdata
    end
    return {
      :cookbook => cookbook_name,
      :version => version,
      :sha1 => sha1 }
  end
end

def version_number(current_tag, ref)
  # count commits to this path since the last major.minor tag,
  # excluding the automated versioning commits made by CI
  all = git.log(10000).object(ref).between(current_tag.sha).size
  bumps = git.log(10000).object(ref).between(current_tag.sha).grep(
    "CI:versioning chef artifacts").size
  commit_count = all - bumps
  "#{current_tag.name}.#{commit_count}"
end

This uses the git ruby gem to interact with git and plops the version and sha1 into metadata.rb. Note that we exclude all commits labeled "CI:versioning chef artifacts." After our CI server runs this task, it commits and pushes the changes back to git, and we don't want to include that commit in our versioning. We also adjust our CI version control trigger to filter out this commit from the commits that can initiate a build; otherwise we would end up in an infinite loop of builds.

Adding a Berkshelf sync

After we generate the new versions, but before we push them back to git, we sync up our Berksfile.lock files by running this:

cookbooks = Dir.glob(File.join(config.cookbook_directory, "*"))
cookbooks.each do |cookbook|
  berks_name = File.join(
    config.cookbook_directory, 
    File.basename(cookbook), 
    "Berksfile")
  if File.exist?(berks_name)
    Berkshelf.set_format :null
    berksfile = Berkshelf::Berksfile.from_file(berks_name)
    berksfile.install
  end
end

This ensures that the CI commit includes up to date Berksfile.lock files that may very well have changed due to the version changes in cookbooks that depend on one another. This will also be necessary in generating the environment cookbook constraints but that will be covered in a future post.

Thoughts?

I realize this is not how most people version their chef artifacts, or non-chef artifacts for that matter. I know many folks use knife spork bump. You can certainly leverage spork with this strategy as well; just provide the git generated version instead of letting spork auto increment. This versioning strategy has proven itself to be very convenient for me on non-chef projects too. I'd be curious to get feedback from others on this technique. Any obvious or subtle pitfalls you see?

Hurry up and wait! Tales from the desk of an automation engineer by Matt Wrock

I have never liked the title of my blog: “Matt Wrock’s software development blog.” Boring! Sure it says what it is but that’s no fun and not really my style. So the other week I was taking a walk in my former home town of San Francisco and it suddenly dawned on me “Hurry up and wait.” The clouds opened up, doves descended and a black horse crossed my path and broke the 9th seal…then the dove pooped on my shoulder which distracted me and I went on about my day. Later I recalled the original epiphany and decided to purchase the domain which I did last night and point it to this blog. I love the phrase. It immediately strikes the incongruous tone of an oxymoron but the careful observer quickly sees that it is actually sadly true and I think this truth is particularly poignant to one who spends large amounts of time in automation like myself.

A brief note on the .io TLD. Why did I register under .io? Well, besides the fact that .com/.org were taken by domain parkers, my understanding is that .io will allow my content to be more accessible to the young and hip. Why, just look at my profile pic to the right and then switch to mattwrock.com and look again. The nose hairs are way sexier on the .io site, right?! Sorry ladies, but I'm taken.

I digress… so before the press releases, media events and other fanfare that will inevitably follow this "rebranding" of my blog, I thought I'd take some time to reflect on this phrase and why I think it resonates with my career and favorite hobby.

What do you mean “wait”? Isn’t that contrary to automation?

Technically yes, but a quote from Star Trek comes to mind here: "The needs of the many outweigh the needs of the few." It is the automation engineer who takes one for the team, sacrificing their own productivity so that others may have a better experience. Yes, always putting others before themselves is the way of the automation engineer. There is no ego here. There is no 'I' in automation. Oh wait… well… it's a big word and there is only one at the end and it's basically silent.

I could go on and on about this virtuous path (obviously), but at least in my own experience, making things faster and removing those tedious steps takes a lot of effort: trial and error, testing and retesting, reverse engineering, and quite a bit of frustration. The fact of the matter is that automation often involves getting technology to behave in a way that is contrary to the original design and intent of the systems being automated. It may also mean using applications contrary to their "supported" path. In many scenarios we are really on our own and traveling upstream.

So much time, so little code

I’ve been writing code professionally for the past fifteen years. I’ve been focusing on automation for about the last three. One thing I have noticed is that, emerging from solving a big problem, I often have much less code to show for myself than I would with more "traditional" software problems. Most of the effort involves just figuring out HOW to do something. There may be no documentation, or extremely scant documentation, covering what we are trying to accomplish. A lot of the work involves the use of packet filters and other tooling that can trace file activity, registry access or process activity, and lots and lots of desperate deep sea google diving where we come up for air empty. When all is said and done we may have just a small script or simply a set of registry keys.

Congratulations! You have automated the thing! Now can you automate it again?

This heading could also be entitled so little code, so much testing but this one’s a bit more upbeat I think.

Another area that demands a lot of time from this corner of the engineering community is testing. Much of the whole point of what we do is about taking some small bit of code and packaging it up in such a way that it is easily accessible and repeatable. This means testing it, reverting it (or not) and testing it again. The second point, reverting it (or not), is absolutely key. We need to know that it can be repeatedly successful from a known state and sometimes that it can be repeatable in the “post automation” state without failing. The fancy way to describe the latter is idempotence.
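To make the idempotence point concrete, here is a toy Ruby sketch (the directory path is just an illustration, not anything from a real pipeline):

```ruby
require "fileutils"

dir = "/tmp/automation-demo"

# Non-idempotent: Dir.mkdir raises Errno::EEXIST the second time through,
# so re-running the automation against a "post automation" machine fails.
#   Dir.mkdir(dir)

# Idempotent: mkdir_p converges to the same end state no matter how many
# times it runs, which is exactly what repeatable automation needs.
FileUtils.mkdir_p(dir)
FileUtils.mkdir_p(dir)  # safe on the second (and hundredth) run
puts Dir.exist?(dir)    # true
```

Tools like Chef bake this property into their resources; hand-rolled scripts have to earn it one guard clause at a time.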

Maybe I’m actually lucky enough to solve my problem quickly. I high five my coworker Drew Miller (@halfogre) but then he refuses to engage in the chest bumping and butt gyrating victory dance which to me seems a most natural celebration of this event. But alas…I wave goodbye to sweet productivity as I wait for my clean windows image to restore itself, test it, watch it fail due to some transient network error, add the necessary retry logic and then watch it fail again because it can’t be executed again in the state it left the machine. So there goes the rest of that day…

Why bother?

Good question. I have often asked myself the same. The obvious answer is that while automating a task may take 100x longer than the unautomated task itself, we are really saving ourselves and often many others from the death of 1000 cuts. The mathematicians out there will note the difference between the “100” and “1000” and correctly observe that one is ten times the other. Of course this ratio can fluctuate wildly and yes there are times when the effort to automate will never pay off. It is important, and sometimes very difficult, to recognize those cases especially for those ADD/OCD impaired like myself.

I have seen large teams under management that rewarded simply “getting product out the door” with the unfortunate byproduct of discouraging time devoted to engineering efficiencies. This is a very slippery slope. It starts off with a process that generates a bit of friction but through years of neglect becomes a behemoth of soul sucking drudgery inflicted on hundreds of developers as they struggle to build, run and promote their code through the development life cycle. Even sadder is that often those who have been around the longest and with the most clout have grown numb to the pain. Like a frog bathing in a pot of water slowly drawn to a boil, they fail to see their impending death, and they don’t understand outsiders who criticize their backwards processes. They explain it away as being too big or complex to be simplified. Then when it finally becomes obvious that the system must be “fixed” it is a huge undertaking involving many committees and lots of meetings. MMmmmmmmm…committees and meetings…my favorite.

The best part is its magic

But beyond the extreme negative case for automation portrayed above, there is a huge upside. I personally find it immensely rewarding to take a task that has been the source of much pain and suffering and watch the friction vanish. There is a magical sensation that comes when you press a button or type a simple command and then watch a series of operations that once took a day to set up unfold, and then unfold again. I won’t go into details on the sensation itself; this is after all a family blog.

Want to keep the best and recruit the best? Automate!

In the end, everyone wants to be productive. There is nothing worse than spending half a day or more fighting with build systems, virtualized environments and application setup as opposed to actually developing new features. Eventually teams inundated with these experiences find themselves fighting turnover and it becomes difficult to recruit quality talent. Who wants to work like this?

Wait a second…haven't I just been describing my job in a similar light? Fighting systems not meant to be automated and getting little perceived bang for my coding buck? Kind of, but these are actually two distinctly different pictures. One is trapped in an unfortunate destiny; the other assumes command over destiny at an earlier stage of suffering in an effort to banish it.

So what are you waiting for?…Hurry up and wait!

Getting Ready…Troubleshooting unattended windows installation by Matt Wrock


I install windows (and linux) A LOT in my role at CenturyLink Cloud automating our infrastructure rollout and management. Sometimes things go wrong. Usually if our provisioning code has been waiting for more than a few minutes for the machine to be reachable I know something is not right. So I might pop open a VMWare console and see this ever familiar screen. The windows installation is “Getting Ready.” That may fill one with the adrenaline of sweet anticipation but I know this only ends in disappointment. I can assure you that if windows is not ready now, it will never be ready. As in never, ever, ever ready.

In the past I have sat staring into the spinning circle of emptiness wondering what in god’s name windows is doing. There are no error messages and usually nothing helpful in the VMWare events other than telling me that the OS customization has failed. Mmm…thanks. Sometimes after 5 or 15 minutes, the OS may come to life but often not in a state that our provisioning can connect to over winrm. I’m usually caught off guard by this since I have been spending the past several minutes in a very intense Vulcan mind meld with my monitor, hoping somehow to break through. I’m just beginning to feel the silent, cold, lonely suffering of a failed domain join when suddenly I am asked to press ctrl+alt+delete. Well…ok…I will…and slowly, as if just awoken from one of those inception dreams within a dream within another dream and having aged hundreds of years, I type just that – ctrl+alt+delete.

OK. You got me. Ctrl+Alt+Del does not work in a VMWare console, but you get the idea. Anyhoo, I next run off to the event logs, reading lots and lots of events that are entirely unhelpful and provide no clues. Usually this all ends up being some stupid error like providing a faulty domain admin password to the unattend file. Not too long ago we added code to our windows provisioning that adds a second NIC, and that introduced a few issues leading to this phenomenon until I got the sequence just right of adding the NIC, disabling it, configuring it and enabling it. But a couple weeks ago I ran into a new issue that really stumped me and that I was not able to solve by looking over my provisioning code or configuration data. This prompted me to research how to get to the bottom of what's going on when Windows is “Getting Ready.” In this post I will cover what I learned and hopefully reveal clues that can help others get out of these installation hang-ups.

Overview of CenturyLink Cloud’s server provisioning sequence

It may help to point out roughly how we go about installing our windows boxes. Our methods may be different from yours but that should be irrelevant; the techniques here for troubleshooting windows installation hangs and errors should be just as applicable to any unattended windows install. Our windows servers run Server 2012 R2, so older OSs may certainly behave differently.

We have been using chef for our server automation and, in particular, Chef-Metal for our provisioning process. We have written a custom Chef-Metal Vsphere driver that leverages the RBVMOMI ruby library to interact with the VMWare VSphere API and does all the footwork of going to the right host, cloning an initial VM template, hooking up the right data stores, setting up initial networking etc. This also calls into VMWare’s guest OS customization configuration which will produce a windows unattend.xml file, also known as an answer file. The VMWare tools will inject this file into the setup, which windows will then use to drive its installation.

Our unattend file ends up being pretty simple. It performs a domain join and runs some scripts that tweak winrm so our provisioner can talk to the machine, install the Chef client and kick off the appropriate cookbooks and recipes, making the machine a “real boy” in the end. We run a mix of windows and linux and everything goes through this same sequence, though of course the linux boxes don’t have unattend.xml files generated; they have their own OS customization process that configures initial networking.

If everything goes right, this takes about 5 minutes from the initial cloning until the machine can receive network traffic and begin its convergence to whatever role that machine will fill: web server, rabbitMQ server, CouchDB server, etc. It really doesn't matter if it’s windows or linux, 5 minutes is roughly the norm. BTW: for most of our automation testing of linux machines we use Docker, which is nearly instantaneous, but we do not use that in production (yet).

Breaking through Getting Ready

So what can one do when the windows install gets “stuck” in this Getting Ready state? Shift-F10 is your friend. I don’t think it matters what hypervisor infrastructure you are using or even if this is a bare metal install. We use VMWare but this should work on Hyper-V, VirtualBox, etc. Shift-F10 will immediately open a CMD.exe as administrator when pressed during the unattended install phase.

From here you can start poring through logs and can even open regedit and other GUI-based tools if necessary, but this command prompt is usually enough to find out what is happening.

Where are the logs?

As I have stated above, I have personally not found the VMWare events or the machine event logs to be much help. Your mileage may vary but you are likely going to want to find the unattend activity log which is located, of course, in

c:\windows\panther\UnattendGC\setupact.log

I don’t know what Panther is. I like to think there was some MS windows team back in the early 90’s that called themselves the panther team pioneering the way forward in windows automation. I also like to think they used gang-like panther calls to communicate with one another when spotting each other in the cafeteria or the campus store. They may have worn special jackets with the wild face of a panther on the back and perhaps some had tattoos or some form of tribal scarification applied resembling panther like imagery. Who knows…I can only guess.

At least in my case this is where the answers were found. Certainly they will be here if the issue is related to the domain join which mine usually tend to be. If the authentication with the domain admin account is at fault, that should be clear here. For instance:

2014-09-06 22:30:10, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:30:20, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:30:30, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:30:40, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:30:51, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:31:01, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:31:11, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:31:22, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:31:32, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:31:42, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...

The key above is the hex error code. Given the famously friendly nature of the hexadecimal numeric format, the root cause is of course immediately obvious; and if not, a google search of the code usually points you to a more specific message.
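One shortcut worth knowing: `net helpmsg` can decode Win32 error codes right from that Shift-F10 prompt, but it wants the code in decimal rather than the hex the log gives you. A quick conversion (shown here as a Ruby one-liner purely for illustration; any calculator will do):

```ruby
# The NetJoinDomain status in setupact.log is hex; `net helpmsg` wants decimal.
code = "0x775".to_i(16)
puts code  # 1909

# Then, from the Shift-F10 command prompt on the stuck machine:
#   net helpmsg 1909
# which prints the human-readable text for that error code.
```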

In my recent stump scenario, the issue was that the domain controller could not be found. It turned out that although I was explicitly giving the domain controller IPs as the DNS servers to use, I was assigning the machine its IP via DHCP, and the DHCP server pointed to a different pair of DNS servers. For whatever reason, windows chose to use those servers and was therefore unable to resolve the domain name to its correct domain controllers. Many other non-domain-join details can be found here as well.

Other log locations that may be helpful

If for whatever reason, the unattend activity log does not have helpful information, there are a few more places to look. All files and subdirectories under:

c:\windows\panther
c:\windows\debug
c:\windows\temp

If you too are using the VMWare tools to drive the OS customization, you will find logs specific to VMWare’s work in c:\windows\temp. Many of the logs in the directories mentioned above may duplicate one another but some may have more granular detail than others.

I certainly hope this helps. If it does and you so happen to spot me in a crowd, let out a wild panther shriek and I promise to return with the same.

Dear VMWare, please give us nice things to automate the things by Matt Wrock

I’ve spent this week at VMWorld 2014 in San Francisco and have been exposed to a fair amount of VMWare news and products. One of my goals for the week was to talk to someone on the VMWare team about questions I have regarding their SDKs as well as to provide feedback regarding my own experiences working with their APIs. Yesterday I met with Brian Graf (@vTagion) during a “Meet the Experts” session. Brian has taken over Alan Renouf’s (@alanrenouf) previous role as technical marketing engineer focusing on automation. He was gracious enough to hear out some of my venting on this topic and asked that I follow up with an email so that he could direct these issues to the right folks in his organization. This blog post is intended to fill the role of that email in an internet indexable format.

I’m pretty new to VMWare. Having recently come from Microsoft, Hyper-V has been my primary virtualization tool. One of the attractions to my new job at CenturyLink Cloud was the opportunity to work with the VMWare APIs. I have many colleagues and acquaintances in the automation space who work almost exclusively with VMWare. VMWare has dominated the virtualization market since the beginning and I have been wanting for some time to get a better glimpse into their products and see what the hype was all about.

So for the last several months, I have been working closely with VMWare tools and APIs nearly all day every day. One of my key focus points has been developing the automation pipeline that CenturyLink Cloud uses to fire up new data centers and to bring existing datacenters under more highly automated management. This not only involves automating the build out of every server but automating the automation of those servers. That’s where VMWare and I hang out. We have been leveraging Chef and its new machine resource framework, Chef Metal. So I have been doing a fair amount of Ruby development, writing and refining a VSphere driver that serves as our bridge between VMWare VMs and Chef. This also includes code that ties into our testing framework and allows us to spin up VMs as new automation is committed to source control and then automatically tested to ensure what was meant to be automated is automated.

Not only am I new to VMWare, I’m also new to Ruby. For years and years I was largely a C# developer, working more and more with powershell over the last 5 years. So I may not be the most qualified voice to speak on behalf of the x-plat automation community, but I am a voice nonetheless and I have had the pleasure of interacting with several “Rubyists” and hearing their thoughts and observations on working with the VMWare APIs.

The VSphere SDK is fraught with friction

From what I can tell, there is absolutely no controversy here. Talk to any developer who has had to work with VMWare APIs around provisioning VMs and they are excited to not merely mention but pontificate upon the unfriendliness of these SDKs. I’m not talking about PowerCLI here (more on that later) but about nearly all of the major programming language SDKs that sit on top of the VMWare SOAP service. One nice thing is that they all look nearly identical since they all sit on the same API, so my criticism is globally applicable. I have personally worked with the C# SDK and, mostly, the Ruby-based rbvmomi library.

One of the biggest pain points is the verbosity required to wire up a command. For example, to clone a VM there are quite a few configuration classes that I have to instantiate and string together to feed into the CloneVM method. So CloneVM ends up being a very fat call, but if anything goes wrong or I have not provided just the right value in one of the configuration classes, I may or may not get an actionable error message back, and if not I may have to engage in quite a bit of trial and error to determine just where things went wrong. I think I understand the technical reasoning here and that this is attempting to keep the network chatter down, but frankly I don’t care. This is a solvable problem and it would be interesting to know if VMWare is looking to solve it.
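To give a flavor of that verbosity, here is a rough sketch of the nesting a clone call expects, shown as plain Ruby hashes standing in for rbvmomi's VMODL spec classes (the datastore, pool and VM names here are made up for illustration):

```ruby
# A clone spec wraps a relocate spec, which wraps placement details.
# With rbvmomi each hash below would be a VIM spec object instead,
# e.g. RbVmomi::VIM.VirtualMachineRelocateSpec(...).
relocate_spec = {
  datastore: "datastore1",       # hypothetical datastore for the clone's disks
  pool:      "resource-pool-1"   # hypothetical compute resource pool
}

clone_spec = {
  location: relocate_spec,
  powerOn:  true,
  template: false,
  config:   { numCPUs: 2, memoryMB: 4096 }  # reconfigure while cloning
}

# The real call fans all of this into one fat method, roughly:
#   vm.CloneVM_Task(folder: dest_folder, name: "web01", spec: clone_spec)
# Get any nested value wrong and the error that comes back may or may
# not tell you which piece was at fault.
puts clone_spec[:location][:datastore]  # datastore1
```

The nesting itself is only two levels deep here; real clone specs routinely pull in customization specs, device change lists and disk locators as well, which is where the trial and error begins.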

Open Source solutions

I am not asking that VMWare feed me a better API. Especially in the Ruby community there are many quality developers more than willing to help. In fact a look over at github will reveal that there has been significant community effort and assistance here. 

Just use Fog

One option that many pursue in the Ruby space is using an API called Fog. This is an API that abstracts several of the popular (and not so popular) cloud and hypervisor solutions into a single API. In theory this means that I can code my infrastructure against VSphere but also leverage EC2. The API aggregates many of the underlying components that one expects to find in any of these environments like Machines, Networks, Storage, etc.

Of course the reality is that simply moving from one implementation to another never “just works.” Also the more you need to leverage the specific and unique strengths of one implementation, the more likely it is that you eventually need to “go native” and abandon fog. This was my fate, and I also found the fog plugin model to be inherently flawed: when I pulled down the Fog ruby gem, I had to pull down every plugin implementation with it, making for a huge payload to download and install.

An apparent OSS cone of silence?

The core Ruby library, rbvmomi, the same library that Fog leverages, has fairly recently been transferred to VMWare ownership. I’d be inclined to say that is a good thing. However it seems that VMWare is neither engaging with the developers trying to contribute to this library nor releasing new bits to rubygems.org. Further, VMWare has been silent amidst requests as to when a release can be expected. The last release was in December, 2013.

Unfortunately one pull request merged in January (8 months ago) has still not been released to RubyGems. This particular commit fixes a breaking change related to the popular Nokogiri library, which means that many need to pin their rbvmomi version to 1.5.5, released in 2012. Remember 2012? This not only shows a lack of desire to collaborate with and support their community but has an even more damaging effect of discouraging developers from contributing. Why would you want to contribute to a dead repository?

I’m not saying that I believe VMWare has no interest in community collaboration, and I fully appreciate the herculean effort sometimes needed to get a legal department in a company the size of VMWare to authorize this kind of collaboration, but silence does not help and it is sending a bad message (well I guess no message really).

Engagement with the Ruby community is important

This community is likely small in comparison to VMWare’s bread and butter customers, but the world is changing. Ruby has an incredibly large stake in the configuration management space along with many other popular tools in the “devops” ecosystem. Puppet and Chef are both rooted in Ruby and as a Chef integrator, almost any integration between Chef and VSphere is written in Ruby. Another popular tool is Vagrant, any Vagrant plugin to support VSphere integration is going to leverage rbvmomi.

The industry is currently seeing a huge influx of involvement and interest with getting tools like these plugged into their infrastructures. I believe this will continue to become more popular and eventually those who use VMWare not because they “have to” may eventually opt for solutions that are more friendly to interact with.

Scant documentation

All of this is made worse by the fact that the documentation for the VSphere API is unacceptably sparse. VMWare does maintain a site that serves to document all of the objects, methods and properties exposed by their SDK. However these consist of one-line descriptions with no in-depth details or examples. Yes there is an active VMWare community but the resources they produce often do not suffice and may be difficult to find for more obscure issues.

PowerCLI is cool but it does not help me

The sense that I often get is that VMWare is trying to answer these shortcomings with its PowerCLI API – a powershell based API that I do think is awesome. PowerCLI has succeeded in turning many operations that take several lines of ruby or C# into one-liners, and it comes with great command line help. In stark contrast to the other language SDKs, almost everyone who is able to use PowerCLI in their automation pipeline raves over it. However, this is not an answer, especially if you run either a linux or a mixed linux/windows shop (ie most people).

At Centurylink Cloud we run both windows and linux. We find that it is easiest to run all of our automation from linux as the control center to our pipeline. It’s just not practical to provision a set of windows nodes to act as a gateway into VSphere. Therefore PowerCLI is only available to me for one-off scripts, which is exactly what we need to automate away from.

While I’m at it, one small nit on PowerCLI. Its implementation as a powershell snapin as opposed to a module can make it more difficult to plug into existing powershell infrastructure. It’s a powershell v1 technology (we are coming up on v5) and while somewhat small, it is one of those things that can give the impression of an amateur effort. That said, PowerCLI is by no means amateur.

What am I suggesting?

First, let us know that you hear this and that you have a desire to move forward to help my community integrate with your technology. Respond to people commenting on your github repository. If you lack the legal approval to release contributions, designate one or more community members and transfer ownership to them, allowing them to coordinate PRs and releases. This library does not contribute to VMWare IP; it’s just a convenience layer sitting on top of your web services, so this seems like a reasonable request. Let me also state what I don’t want: I do not want fancy GUIs or anything that waits for me to point and click. I’m working to automate every dark corner of our infrastructure so that we can survive.

Finally, I want to clarify that this post is not intended to insult anyone. My hope is that it serves as another data point to help you understand your customers and that you can consider as you plan future strategy. I’m sure there are many employees at VMWare who share my passion and want to make integration with other automation tools a better story. To those employees I say: fight, fight, fight, and do not become complacent. You are really super cool and awesome for doing that.