Orchestrating multi node Windows tests in Test-Kitchen Beta! / by Matt Wrock

Primary and backup active directory controllers in their own kitchen.local domain

This week marks an important milestone in the effort to bring Test-Kitchen to windows. All the work that has gone into this effort over the past nine months has been merged into the master branch, and prerelease gems have been cut for both test-kitchen and kitchen-vagrant. This post serves as another update on getting started with these new bits on windows (there are some changes) and expands on some of my previous posts by illustrating a multi node converge and test. The same technique can be applied to linux boxes as well, but I'm going to demonstrate it with a cookbook that builds a windows primary and backup active directory domain controller pair.

Prerequisites

In order to make the cookbook tests in this post work, the following needs to be preinstalled:

  • A version of vagrant greater than 1.6. I strongly recommend going even higher to pick up various bug fixes and enhancements around windows.
  • Either VirtualBox or Hyper-V hypervisor
  • git
  • A ruby environment with bundler. If you don't have this, I strongly suggest installing the chefdk.
  • Enough local compute resources to run 2 windows VMs. I use a first generation lenovo X1 with an i7 processor, 8GB of ram, an SSD and build 10041 of the windows 10 technical preview. I have also run this on a second generation X1 running Ubuntu 14.04 with the same specs.

The great news is that now your host can be linux, mac, or windows.

Setup

Install the vagrant-winrm plugin. Assuming vagrant is installed, this will download and install vagrant-winrm:

vagrant plugin install vagrant-winrm

Clone my windows-ad-pair cookbook:

git clone https://github.com/mwrock/windows-ad-pair.git

At the root of the repo, bundle install the necessary gems:

bundle install

This will grab the necessary prerelease test-kitchen and kitchen-vagrant gems along with their dependencies. It will also grab kitchen-nodes, a kitchen provisioner plugin I will explain later.

Using Hyper-V

I use Hyper-V when testing on my windows laptop. Vagrant will use VirtualBox by default but can use many other virtualization providers. To force it to use hyper-v you can:

  • Add the provider option to your .kitchen.yml:
  driver_config:
    box: mwrock/Windows2012R2Full
    communicator: winrm
    vm_hostname: false
    provider: hyperv
  • Add an environment variable, VAGRANT_DEFAULT_PROVIDER, and assign it the value "hyperv".

I prefer the latter option since, on a particular machine, I will always want to use the same provider, and I want to keep my .kitchen.yml portable so I can share it with others regardless of their hypervisor preferences.

A possible vagrant/hyper-v bug?

I've been seeing intermittent crashes in powershell during machine create on the box used in this post. I had to create new boxes for this cookbook. One reason is that using the same image for multiple nodes in the same security domain required the box to be sysprepped to clean all SIDs (security identifiers) from the base images. This means that when vagrant creates the machine, there is at least one extra reboot involved and I think this may be confusing the hyperv plugin.

I have not dug into this but I have found that immediately reconverging after this crash consistently succeeds.
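
Since a second converge reliably succeeds after this crash, one workaround is to retry automatically. Here is a minimal sketch, assuming a hypothetical converge_with_retry helper (not part of test-kitchen or this cookbook) that shells out to kitchen converge and tries again a limited number of times:

```ruby
# Hypothetical helper: retries `kitchen converge` for one instance.
# The runner is injectable so the logic can be exercised without kitchen.
def converge_with_retry(instance, attempts: 2, runner: ->(cmd) { system(cmd) })
  cmd = "kitchen converge #{instance}"
  attempts.times do |i|
    # Report how many attempts it took once the command succeeds.
    return i + 1 if runner.call(cmd)
    warn "converge of #{instance} failed (attempt #{i + 1} of #{attempts})"
  end
  raise "#{cmd} failed after #{attempts} attempts"
end
```

Calling converge_with_retry('primary-windows-2012R2') in place of a plain system call would absorb one intermittent crash while still failing loudly if the problem persists.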

Converge and verify

Due to the sensitive nature of standing up an Active Directory controller pair (ya know...reboots), rather than calling kitchen directly, we are going to orchestrate with rake. We'll dive deeper into this later but to kick things off run:

bundle exec rake windows_ad_pair:integration

Now go grab some coffee. 

...

No, no, no. I meant get in your car and drive to another town for coffee and then come back.

What just happened?

Test kitchen created a primary AD controller, rebooted it and then spun up a replica controller. All of this uses the winrm protocol that is native to windows so no SSH services needed to be downloaded and installed. This pair now manages a kitchen.local domain that you could actually join additional nodes to if you so choose.

Where are these windows boxes coming from?

These are evaluation copies of windows 2012R2. They will expire in a little under six months from the date of this post. They are not small and weigh in at about 5GB. I typically use smaller boxes where I strip away unused windows features but I needed several features to remain in this cookbook and it was easiest to just package new boxes without any features removed. I keep the boxes accessible on Hashicorp's Atlas site but the bytes live in Azure blob storage.

Is there an SSH server running on these instances?

No. Thanks to Salim Afiune's work, there is a new abstraction in the Test-Kitchen model, a Transport. The transport governs communication between test-kitchen on the host and the test instance. The methods defined by the transport handle authentication, transferring files, and executing commands. For those familiar with the vagrant model, the transport is the moral equivalent of the vagrant communicator. Test-kitchen 1.4 includes built-in transports for winrm and ssh. I could imagine other transports such as a vmware vmomi transport that would leverage the vmware client tools on a guest.

How does the backup controller locate the primary controller?

One of the challenges of performing multi node tests with test-kitchen, be they windows or not, is orchestrating node communication without hard coding endpoint URIs into your cookbooks or having to populate attributes with these endpoints. Ideally you want nodes to discover one another based on some metadata. At CenturyLink we use chef search to find nodes operating under a specific runlist or role. In this cookbook, the backup controller issues a chef search for a node with the primary recipe in its runlist and then grabs its IP address from the ohai data.

primary = search_for_nodes("run_list:*windows_ad_pair??primary*")
primary_ip = primary[0]['ipaddress']

The key here is the search_for_nodes method found in this cookbook's library helper:

require 'timeout'

def search_for_nodes(query, timeout = 120)
  nodes = []
  begin
    Timeout.timeout(timeout) do
      nodes = search(:node, query)
      # Poll until a matching node with an ohai ipaddress shows up.
      until nodes.count > 0 && nodes[0].has_key?('ipaddress')
        sleep 5
        nodes = search(:node, query)
      end
    end
  rescue Timeout::Error
    # Fall through so the check below reports the failed query.
  end

  if nodes.count == 0 || !nodes[0].has_key?('ipaddress')
    raise "Unable to find any nodes meeting the search criteria '#{query}'!"
  end

  nodes
end

Does Test-Kitchen host a chef-server?

Mmmmm....kind of. Test-kitchen supports both chef solo and chef zero provisioners. Chef zero supports a solo-like workflow allowing one to converge a node locally with no real chef server and also supports search functionality. This is facilitated by adding json files to a nodes directory underneath the test/integration folder.

The node files are named using the same suite and platform combination as the kitchen test instances. The contents of the node look like:

{
  "id": "backup-windows-2012R2",
  "automatic": {
    "ipaddress": "192.168.1.10"
  },
  "run_list": [
    "recipe[windows_ad_pair::backup]"
  ]
}

You can certainly create these files manually but there is no guarantee that the ip address will always be the same especially if others use this same cookbook. Wouldn't it be nice if you could dynamically create and save this data at node provisioning time? I think so.

We use this technique at CenturyLink and have wired it into some of our internal drivers. I've been working to improve on this, making it more generalized and extracting it into its own dedicated kitchen provisioner plugin, kitchen-nodes. It's included in this cookbook's Gemfile and is wired into test-kitchen in the .kitchen.yml:

provisioner:
  name: nodes

It's still a work in progress and I came across a scenario in the cookbook here where I had to add functionality to support vagrant/VirtualBox and temporarily make this plugin windows specific. I'll be changing that later. There is really not much code involved and here is the meat of it:

def create_node
  node_dir = File.join(config[:test_base_path], "nodes")
  Dir.mkdir(node_dir) unless Dir.exist?(node_dir)
  node_file = File.join(node_dir, "#{instance.name}.json")

  state = Kitchen::StateFile.new(config[:kitchen_root], instance.name).read
  ipaddress = get_reachable_guest_address(state) || state[:hostname]

  node = {
    :id => instance.name,
    :automatic => {
      :ipaddress => ipaddress
    },
    :run_list => config[:run_list]
  }

  File.open(node_file, 'w') do |out|
    out << JSON.pretty_generate(node)
  end
end

def get_reachable_guest_address(state)
  # PowerShell that prints one IPv4 address per line for every NIC on the guest.
  ips = <<-EOS
    Get-NetIPConfiguration | % { $_.ipv4address.IPAddress}
  EOS
  session = instance.transport.connection(state).node_session
  session.run_powershell_script(ips) do |address, _|
    address = address.chomp unless address.nil?
    next if address.nil? || address == "127.0.0.1"
    # Return the first address that answers a ping from the host.
    return address if Net::Ping::External.new.ping(address)
  end
  return nil
end

The class containing this derives from the ChefZero provisioner class. It reads from the state file that test-kitchen uses to get the node's ip address and then ensures that this ip is reachable externally or uses one that is from a different NIC in the node. It then adds that ip and the node's run list to the node json file.
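
The selection logic can be isolated from WinRM and ping for illustration. Here is a minimal sketch, assuming a hypothetical first_reachable_address helper where the reachability check is passed in as a block (the real plugin uses Net::Ping, as shown above):

```ruby
# Hypothetical helper mirroring the selection logic: skip blanks and
# loopback, then return the first address the supplied check accepts.
def first_reachable_address(candidates)
  candidates.each do |raw|
    address = raw.to_s.strip
    next if address.empty? || address == '127.0.0.1'
    return address if yield(address)
  end
  nil
end

# Example: 10.0.2.15 is VirtualBox's default NAT guest address, which the
# host cannot reach directly, so the check here only accepts 192.168.x.x.
addresses = ["10.0.2.15\r\n", '127.0.0.1', nil, '192.168.1.10']
puts first_reachable_address(addresses) { |ip| ip.start_with?('192.168.') }
# prints 192.168.1.10
```

Passing the check in as a block keeps the sketch runnable on any machine. It also shows why the order of NICs does not matter: unreachable candidates are simply skipped.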

Dealing with reboots

Standing up active directory controllers presents some awkward challenges to an automated workflow in that each must be rebooted before it can recognize and work with the domain it controls. Ideally we would simply be able to kick off a kitchen test which would converge each node and run their tests. However if we did this here, the results would be disappointing unless you like failure - just don't count on failing fast. So we have to "orchestrate" this. A rather fancy term for the method I'm about to describe.

I'll warn this is crude and I'm sure there are better ways to do this but this is simple and it consistently works. It includes using a custom rake task to manage the test flow. It looks like this:

desc "run integration tests"
task :integration do
 system('kitchen destroy')
 system('kitchen converge primary-windows-2012R2')
 system("kitchen exec primary-windows-2012R2 -c 'Restart-Computer -Force'")
 system('kitchen converge backup-windows-2012R2')
 system("kitchen exec backup-windows-2012R2 -c 'Restart-Computer -Force'")
 system('kitchen verify primary-windows-2012R2')
 system('kitchen verify backup-windows-2012R2')
end

This creates and converges the primary node and then uses kitchen exec to reboot that instance. While it is rebooting, the backup instance is created and converged. Of course there is no guarantee that the primary node will be available by the time the backup node tries to join the domain, but I have never seen the backup node attempt to join the domain before the primary node has completed its restart. Remember that the backup node has to go through a complete create first, which takes minutes. Then the backup node reboots after its converge and, while it's doing that, the primary node begins its verify process. The kitchen verify runs are independent of each other and do not need both instances to be up.

If this were a production cookbook running on a build agent, I'd raise an error if any of the system calls failed (returned false) and I'd include an ensure clause at the end that destroyed both instances.
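
Here is a minimal sketch of what that hardening could look like. The run_step and run_integration names are hypothetical and not part of the cookbook. The command runner is injectable so the flow can be tried without real VMs:

```ruby
# Hypothetical hardened version of the integration flow.
INSTANCES = %w[primary-windows-2012R2 backup-windows-2012R2].freeze

# Runs one command through the runner and raises if it reports failure
# (system returns false or nil when a command fails).
def run_step(cmd, runner)
  raise "Command failed: #{cmd}" unless runner.call(cmd)
end

def run_integration(runner = ->(cmd) { system(cmd) })
  run_step('kitchen destroy', runner)
  # Converge and reboot the primary first, then the backup.
  INSTANCES.each do |name|
    run_step("kitchen converge #{name}", runner)
    run_step("kitchen exec #{name} -c 'Restart-Computer -Force'", runner)
  end
  INSTANCES.each { |name| run_step("kitchen verify #{name}", runner) }
ensure
  # Always tear both instances down, even when a step above raised.
  runner.call('kitchen destroy')
end
```

In the Rakefile, the body of the integration task would then shrink to a single call to run_integration, and a failed step would make rake exit with a non-zero status.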

Vagrant networking

With the vagrant virtualbox provider, vagrant creates a NAT based network and forwards host ports for the transport protocol. This is a great implementation for single node testing but for multi node tests, you may need a real IP issued statically or via DHCP that one node can use to talk to another. This is not always the case but it is for an active directory pair, mainly because creating a domain is tied to DNS and simply forwarding ports won't suffice. In order for the backup node to "see" the domain, "kitchen.local" here, created by the primary node, we assign the primary node's IP address to the primary DNS server of the backup node's NIC.

  powershell_script 'setting primary dns server to primary ad' do
    code <<-EOS
      Get-NetIPConfiguration | ? { 
        $_.IPv4Address.IPAddress.StartsWith("#{primary_subnet}") 
        } | Set-DnsClientServerAddress -ServerAddresses '#{primary_ip}'
    EOS

    guard_interpreter :powershell_script
    not_if "(Get-DnsClientServerAddress -AddressFamily IPv4 | ? { $_.ServerAddresses -contains '#{primary_ip}' }).count -gt 0"
  end

Adding 127.0.0.1 would not work. We need a unique IP from which the domain's DNS records can be queried.
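
The recipe above interpolates primary_subnet, which is not defined in the snippets shown in this post. One plausible way to derive it, assuming both nodes share the first three octets of their addresses, is to drop the last octet of the primary's IP. The subnet_prefix helper name is hypothetical:

```ruby
# Hypothetical helper: returns everything before the last octet so that a
# StartsWith check matches NICs on the same /24-style network.
def subnet_prefix(ip)
  octets = ip.split('.')
  raise ArgumentError, "not an IPv4 address: #{ip}" unless octets.length == 4
  # Keep the trailing dot so "192.168.1." cannot match "192.168.10.x".
  octets[0..2].join('.') + '.'
end

primary_ip = '192.168.1.10'
primary_subnet = subnet_prefix(primary_ip)
puts primary_subnet # prints 192.168.1.
```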

Most non-virtualbox providers will not need special care here and certainly not hyper-v. Hyper-V uses an external or private virtual switch which maps to a real network adapter on the host. For virtualbox, you can provide network configuration to tell vagrant to create an additional NIC on the guest:

    driver:
      network:
        - ["private_network", { type: "dhcp" }]

This will simply be ignored by hyper-v.

Testing windows infrastructure - the future is now

I started using all of this last July and have been blogging about it since August. That first post was entitled Peering into the future of windows automation. Well here we are with prerelease gems available for use. It's really been exciting to see this shape up and it has been super helpful to me as I have been learning a new language, ruby, to interact with this code base. Working along with Salim Afiune and Fletcher Nichol has accelerated my learning process, which is far from over.

By the way I'll be at ChefConf nearly all week next week and carrying both of my laptops (linux and windows) with all of this code. If anyone wants a demo to see this in action or just wants to talk Test Driven Infrastructure shop, I'd love to chat!