RealityForge.org

Documenting Cookbooks

2013-04-01T00:00:00+00:00

Our infrastructure has many cookbooks that aim to be reusable, primarily through encapsulating behaviour in LWRPs. This led to an explosion of LWRPs, and sometimes the documentation didn't keep up or simply did not exist.

Chef has been evolving rapidly and many of the pain points are being addressed by Opscode or by the community at large - which is great. However, one pain point that is not getting easier is writing cookbook documentation. Several threads came together last week to motivate me to change this.

Incorrect documentation of LWRPs contributed to an outage.
Mathias Lafeldt wrote a knife plugin that generates an initial README.md from the metadata.rb file in a cookbook.
Other languages/frameworks have tools to generate documentation from annotated source code.

So I decided to try and extend Mathias's work so that I could always regenerate README.md from the cookbook source code. This resulted in the knife-cookbook-doc project.

knife cookbook doc DIR

As much as possible the plugin makes use of the same metadata as used by chef when generating the documentation. The plugin will also scan the source files for annotations present in comments. Users can also add fragments of markdown into the doc/ directory to merge into the generated README.md file.

The goal is to keep the code as the authoritative source of information. The hope is that keeping the documentation close to the code will help to maintain its currency.

Getting Started

Step 1

Populate the metadata.rb of your cookbook according to Opscode's documentation. Particular attention should be paid to documenting the recipes, attributes, platform compatibility and cookbook requirements (i.e. depends, recommends, suggests etc).

Step 2

At the top of each recipe, add a detailed documentation section such as;

=begin
#<
The recipe is awesome. It does thing 1, thing 2 and thing 3!
#>
=end

Step 3

In each LWRP, add detailed documentation such as;

=begin
#<
This creates and destroys the awesome service.

@action create  Create the awesome service.
@action destroy Destroy the awesome service.

@section Examples

# An example of my awesome service
mycookbook_awesome_service "my_service" do
  port 80
end
#>
=end

...

#<> @attribute port The port on which the HTTP service will bind.
attribute :port, :kind_of => Integer, :default => 8080

It should be noted that the documentation of the LWRP requires that the user document the actions, using @action <action> <description> and the attributes using @attribute <attribute> <description>. This allows meaningful descriptions for the actions and attributes to be added to the README.

The other text will be added at the start of the LWRP documentation except if marked with @section <heading>, in which case it will be added to the end of the LWRP documentation.
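
To make the annotation rules above concrete, here is a minimal sketch in plain Ruby of how the text between `#<` and `#>` could be split into actions, leading text and trailing sections. It is not the plugin's implementation; the `parse_lwrp_doc` method and the shape of the hash it returns are assumptions made purely for illustration.

```ruby
# Hypothetical sketch, NOT the knife-cookbook-doc implementation.
# Splits annotated comment text into actions, leading overview text
# and named sections that are appended at the end.
def parse_lwrp_doc(text)
  doc = { actions: {}, overview: [], sections: {} }
  current_section = nil
  text.each_line do |raw|
    line = raw.chomp
    if (m = line.match(/\A@action\s+(\S+)\s+(.*)\z/))
      # "@action <action> <description>"
      doc[:actions][m[1]] = m[2]
    elsif (m = line.match(/\A@section\s+(.+)\z/))
      # Everything after "@section <heading>" belongs to that section.
      current_section = m[1].strip
      doc[:sections][current_section] = []
    elsif current_section
      doc[:sections][current_section] << line
    else
      doc[:overview] << line
    end
  end
  doc
end

sample = <<~DOC
  This creates and destroys the awesome service.

  @action create  Create the awesome service.
  @action destroy Destroy the awesome service.

  @section Examples

  # An example of my awesome service
DOC

parsed = parse_lwrp_doc(sample)
puts parsed[:actions].inspect
```

The overview lines would be rendered at the start of the LWRP documentation and each section at the end, which mirrors the behaviour described above.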

Step 4

Finally the user should add some documentation fragments into the doc/ dir. Most importantly you should add doc/overview.md which will replace the first Description section of the readme. You should also add a doc/credit.md which will replace the last License and Maintainer section in the readme. The remaining fragments will be included at the end of the readme in lexicographic order of the filename.
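
The handling of the fragments can be sketched roughly as follows. This is an illustration only and is not taken from the plugin's source; `classify_fragments` and the example file names other than overview.md and credit.md are hypothetical.

```ruby
# Hypothetical helper, not taken from the plugin's source.
# overview.md and credit.md replace fixed README sections; every other
# fragment is appended in lexicographic order of its file name.
def classify_fragments(filenames)
  special = { 'overview.md' => :overview, 'credit.md' => :credit }
  result = { overview: nil, credit: nil, extra: [] }
  filenames.each do |name|
    if special.key?(name)
      result[special[name]] = name
    else
      result[:extra] << name
    end
  end
  result[:extra].sort!
  result
end

layout = classify_fragments(%w[usage.md credit.md attributes.md overview.md])
puts layout.inspect
```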

Step 5

Install the plugin and run the knife command, passing the directory of the cookbook as an argument.

gem install knife-cookbook-doc
knife cookbook doc MY_COOKBOOK_DIR

Examples

For an example of a README generated by the plugin, check out the glassfish cookbook. Unfortunately the plugin highlights the fact that so much of the cookbook is poorly documented. However there are some LWRPs such as glassfish_mq that have the beginning of useful documentation.

Final Thoughts

The plugin is raw but usable now. It needs to evolve to be a more seamless part of our workflow. It would also be nice to see it, or something more complete, adopted by the rest of the Chef community. I wonder what needs to be done to build such a tool?

References

knife-cookbook-doc - the new plugin.
knife-cookbook-readme - the original plugin.
chef-glassfish - example cookbook using the new plugin.

Updates: 3rd of April, 2013

Added References section.
Added inline links.
Added notes about the credit section.
Fixed some unclear language.

Role Cookbooks and Wrapper Cookbooks

2012-11-19T00:00:00+00:00

Roles in chef are un-versioned. Early in our adoption of chef we defined our roles as a run list and a collection of attributes. Our infrastructure is set up such that almost all of our environments share a single chef server and each node periodically checks in with the chef server and updates itself as appropriate.

We wanted a way for roles to propagate between our environments step by step, validating the role met our requirements at each step before being promoted to the next step. i.e. A role should be deployed to the 'development' environment and then 'integration', 'uat', 'staging' and finally 'production'.

Initially we broke our environments a few times when we updated our role definitions as these definitions were shared across all environments. To address this we attempted to use an ugly naming schema of roles, suffixing a version. So we ended up with a scenario where we had the 'myrole_v4' role in 'development', the 'myrole_v3' role in the 'uat' environment and the 'myrole_v1' role in the 'production' environment. This approach did not feel right. Finally we realized that Chef already has a mechanism for versioning artifacts - namely cookbooks.

Role Cookbooks

Today our roles simply include a single recipe, the role recipe. The role recipe then uses "include_recipe" to include other recipes in the required order. This allows us to rely on Chef's built-in version resolution mechanisms to version our roles. Our role and associated 'role cookbook' look something like the following;

roles/foo.rb

name "foo"
description "Foo Server"
run_list("recipe[mybiz-foo]")

cookbooks/mybiz-foo/metadata.rb

name "mybiz-foo"
description "Sets up the Foo Server"
version "0.4.2"
...
depends "ntp"
depends "git"
depends "foo"

cookbooks/mybiz-foo/recipes/default.rb

include_recipe "ntp"
include_recipe "git"
include_recipe "foo"

Using this approach allowed us to have a single role "foo". However the associated cookbook/recipe may have a different version in different environments. This allowed us to easily control the evolution of the role and the promotion of the role between different environments.
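
In Chef the per-environment pin lives in the environment definition as a cookbook version constraint. The sketch below models the effect in plain Ruby so that it can be run on its own. `Gem::Requirement` stands in for Chef's own constraint handling, and the pins, the version list and the `role_cookbook_version` method are all hypothetical.

```ruby
require 'rubygems'

# Hypothetical pins: one version constraint for mybiz-foo per environment.
ENVIRONMENT_PINS = {
  'development' => '= 0.4.2',
  'uat'         => '= 0.4.1',
  'production'  => '= 0.3.0'
}.freeze

# Hypothetical list of mybiz-foo versions uploaded to the chef server.
UPLOADED_VERSIONS = %w[0.3.0 0.4.1 0.4.2].freeze

# Returns the newest uploaded version allowed in the given environment.
def role_cookbook_version(environment)
  requirement = Gem::Requirement.new(ENVIRONMENT_PINS.fetch(environment))
  UPLOADED_VERSIONS.map { |v| Gem::Version.new(v) }
                   .select { |v| requirement.satisfied_by?(v) }
                   .max
end

puts role_cookbook_version('production')
```

The role "foo" is identical everywhere; promoting it to the next environment is just a matter of changing that environment's pin.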

Wrapper Cookbooks

Role cookbooks were great for versioning the run list of our roles but they did not solve the problem of attributes in our roles. The attribute data was what we used to customize the cookbooks for our particular business. We had struggled with the interaction between the place the attributes are specified (i.e. node, environment, role, cookbook) and the precedence levels (i.e. default, normal, override and automatic). So we decided to simplify.

For every cookbook and/or recipe we wanted to customize we created a separate 'wrapper' recipe that set the required attributes and then included the recipe from the original cookbook. This resulted in a layout that looked like the following;

cookbooks/mybiz-bar/metadata.rb

name "mybiz-bar"
description "Sets up the Bar Server"
version "0.1.3"
...
depends "bar"

cookbooks/mybiz-bar/recipes/default.rb

node.override['bar']['port'] = 80
node.override['bar']['interface'] = '0.0.0.0'
node.override['bar']['host'] = 'bar.mybiz.example.com'
node.override['bar']['max_threads'] = 250

include_recipe "bar"

We also decided on a few simple rules for setting attributes;

Wrapper cookbooks should only ever set attributes using the 'override' precedence.
Cookbooks should set attributes using the 'default' precedence if a wrapper cookbook is allowed to override the attribute.
Cookbooks may set attributes using the 'override' precedence if they are publishing attribute data for other cookbooks to use but do not expect the other cookbooks to override the attribute data.

Using this technique, there was effectively only one place to look for cookbook customisations and you did not have to think about the precedence rules. It felt much easier to manage our cookbooks.
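
The reason the rules work can be shown with a deliberately simplified model of attribute precedence. Real Chef has more levels than this and also takes into account where an attribute was set, so treat the following as a sketch; `PRECEDENCE` and `resolve_attribute` are names invented for the example.

```ruby
# Simplified model: lowest precedence first. Real Chef has more levels.
PRECEDENCE = %i[default normal override automatic].freeze

# Given the values set at each level, return the one that wins.
def resolve_attribute(values)
  level = PRECEDENCE.reverse.find { |l| values.key?(l) }
  level && values[level]
end

# The original cookbook sets a default; the wrapper sets an override.
port = resolve_attribute(default: 8080, override: 80)
puts port
```

Because the wrapper always writes at 'override' and the wrapped cookbook at 'default', the wrapper's value wins regardless of which recipe is evaluated first.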

Role or Wrapper Cookbook?

While I presented role and wrapper cookbooks as separate concepts and separate cookbooks, this need not be the case. If the recipe in a wrapper cookbook is only used within a single role cookbook and it is relatively simple then we tend to move the wrapper recipe into the role cookbook. Only when there is complex logic required or a cookbook is used by multiple roles do we bother creating a separate wrapper cookbook.

Environments

Our environments do still have some attribute data within them but the data tends to be used to drive a rules layer. For example, our environment attributes specify a data center key. Later, in a wrapper cookbook, we examine the data center key and set the appropriate name servers and ntp servers (neither of which are managed by chef).
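
A rules layer of this kind can be as small as a lookup table. The sketch below is plain Ruby; the data center keys, the addresses and the `infrastructure_for` method are all made up for illustration and do not reflect our real configuration.

```ruby
# Hypothetical rules table keyed by the data center key from the environment.
DATA_CENTER_RULES = {
  'dc1' => { name_servers: %w[10.1.0.2 10.1.0.3],
             ntp_servers: %w[ntp1.dc1.example.com] },
  'dc2' => { name_servers: %w[10.2.0.2 10.2.0.3],
             ntp_servers: %w[ntp1.dc2.example.com] }
}.freeze

# Look up the servers for a data center, failing loudly on an unknown key.
def infrastructure_for(data_center)
  DATA_CENTER_RULES.fetch(data_center) do
    raise ArgumentError, "Unknown data center: #{data_center}"
  end
end

puts infrastructure_for('dc1')[:ntp_servers].inspect
```

In a wrapper recipe the result would then be written to the node with the 'override' precedence before including the wrapped recipe.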

Closing thoughts

These cookbook patterns are not unique to our infrastructure. The wrapper cookbook is in some ways a limited form of the "application" cookbook pattern described in the latest Food Fight show hangout. Jamie Winsor from Riot Games has also mentioned role cookbooks as did Joseph Holsten on the mailing list.

One thing we have yet to try but I am really excited about is resource patching in the wrapper cookbook to modify resources defined in the original cookbook after it has been included. Bryan Berry put together the chef-rewind gem that makes it simple to patch the resources already defined. Joshua Timberman also demonstrated a slightly more raw way of modifying already defined resources. Both look like effective ways of keeping the line between vendored cookbooks and business specific cookbooks clear.

Reusable Cookbooks Revisited

2012-11-12T00:00:00+00:00

It seems reusable cookbooks are a hot topic at the moment. I recently sat in on the Reusable Cookbook Patterns hangout run by the most excellent Food Fight show where Noah Kantrowitz gave his thoughts on "Application" versus "Library" cookbooks. His approach aligned with the way we have approached cookbook reusability (see "Evolving towards cookbook reusability in Chef" for a basic overview of our view on reusability after using Chef for six months).

If I were to simplify Noah's view, I believe it would be that "library" cookbooks are a collection of LWRPs that manipulate resources. The "library" cookbook may also include a default recipe that installs the actual bits on the system. The "application" cookbooks depend on the "library" cookbook and then use the "library" cookbook's LWRPs to configure the system. (It should be noted that the term "application" cookbook seemed to identify any cookbook that uses a "library" cookbook). The way that an "application" cookbook communicates with a "library" cookbook is through what Noah describes as "data capsules", which I believe just means rich data types passed into the LWRPs.

Our basic pattern for reusable cookbooks follows a similar approach except that the way we communicate with the reusable cookbooks is to use simple types - essentially anything that can be represented in json; numbers, strings, booleans, arrays and hashes. We go one step further in that we also define a recipe that reads node attributes and interprets the attributes to invoke the required LWRPs. The motivation for this was to DRY up our cookbooks. It also makes it easy to use other cookbooks that manipulate attribute data such as Heavywater's bag_config cookbook.

An Example

To highlight this I will make use of the glassfish cookbook again. GlassFish is an application server in which you install sub-components such as web applications, libraries, database pools, message broker references etc.

Below are two ways of configuring a small, simple web application. The application uses a database and has a single configuration entry accessible via JNDI. The actual code in the two recipes is not important for the conversation but it is presented to give you a feel of the different approaches.

Using an attribute_driven recipe

node.override['glassfish']['domains']['mydomain'] =
{
  'config' =>
  {
    'max_memory' => 1548,
    'max_perm_size' => 192,
    'port' => 80,
    'admin_port' => 8085,
    'max_stack_size' => 128,
    'username' => 'admin',
    'password' => 'secret'
  },
  'deployables' =>
  {
    'somapp' =>
    {
      'url' => 'http://repo.example.com/somapp-0.17.war',
      'context_root' => '/somapp'
    }
  },
  'extra_libraries' =>
  {
    'mydatabasedriver' =>
       'http://repo.example.com/mydatabasedriver-1.2.3.jar'
  },
  'jdbc_connection_pools' =>
    {
      'SomeappSQL' =>
      {
        'config' =>
        {
          'datasourceclassname' => 'net.sourceforge.jtds.jdbcx.JtdsDataSource',
          'restype' => 'javax.sql.DataSource',
          'isconnectvalidatereq' => 'true',
          'validationmethod' => 'auto-commit',
          'ping' => 'true',
          'description' => 'SomeappSQL Connection Pool',
          'properties' =>
          {
             'Instance' => 'Instance1',
             'ServerName' => 'db.example.com',
             'User' => 'dbadmin',
             'Password' => 'dbsecret',
             'PortNumber' => '1433',
             'DatabaseName' => 'SOMEAPP'
          }
        },
        'resources' =>
        {
          'jdbc/SomeappDS' =>
            {'description' => 'SomeappSQL Connection Resource'}
        }
      }
    },
    'custom_resources' =>
    {
      'MyServiceURL' => 'http://other.example.com:1234/MyService'
    }
}

include_recipe 'glassfish::attribute_driven_domain'

Using raw LWRPs

include_recipe 'glassfish::default'

password_file = "#{node['glassfish']['domains_dir']}/mydomain_admin_passwd"
glassfish_domain 'mydomain' do
  max_memory 1548
  max_perm_size 192
  max_stack_size 128
  port 80
  admin_port 8085
  username 'admin'
  password_file password_file
  secure true
  password 'secret'
end
glassfish_library 'http://repo.example.com/mydatabasedriver-1.2.3.jar' do
  domain_name 'mydomain'
  admin_port 8085
  username 'admin'
  password_file password_file
  secure true
  library_type 'ext'
end
glassfish_jdbc_connection_pool 'SomeappSQL' do
  domain_name 'mydomain'
  admin_port 8085
  username 'admin'
  password_file password_file
  secure true
  datasourceclassname 'net.sourceforge.jtds.jdbcx.JtdsDataSource'
  restype 'javax.sql.DataSource'
  isconnectvalidatereq true
  validationmethod 'auto-commit'
  ping true
  description 'SomeappSQL Connection Pool'
  # Parentheses are required so Ruby parses the braces as a hash, not a block.
  properties({
     'Instance' => 'Instance1',
     'ServerName' => 'db.example.com',
     'User' => 'dbadmin',
     'Password' => 'dbsecret',
     'PortNumber' => '1433',
     'DatabaseName' => 'SOMEAPP'
  })
end
glassfish_jdbc_resource 'jdbc/SomeappDS' do
  domain_name 'mydomain'
  admin_port 8085
  username 'admin'
  password_file password_file
  secure true
  connectionpoolid 'SomeappSQL'
  description 'SomeappSQL Connection Resource'
end
glassfish_custom_resource 'MyServiceURL' do
  domain_name 'mydomain'
  admin_port 8085
  username 'admin'
  password_file password_file
  secure true
  value 'http://other.example.com:1234/MyService'
end
glassfish_deployable 'somapp' do
  domain_name 'mydomain'
  admin_port 8085
  username 'admin'
  password_file password_file
  secure true
  url 'http://repo.example.com/somapp-0.17.war'
  context_root '/somapp'
end

Comparison

The attribute_driven recipe is marginally smaller (56 lines versus 68 lines), mostly due to the repetition when using raw LWRPs. However, the greatest advantage that we see for the attribute_driven approach is the simpler cognitive model.

In most cases using raw LWRPs requires that the caller understands the implicit ordering requirements. i.e. Database pools and resources need to be set up before the application is deployed. The user of the raw LWRPs also needs to manually manage the removal of resources when they are no longer required. Compare this to the attribute_driven recipe approach that can automatically determine that a database pool, deployable or other component is no longer required (as it no longer appears in attribute data) and remove the component from the glassfish server.
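
The automatic removal boils down to comparing what the attribute data asks for with what is currently present on the server. A minimal sketch of that comparison in plain Ruby follows; `plan_changes` and the pool names other than SomeappSQL are hypothetical, and the actual recipe would discover the existing components by querying the glassfish server.

```ruby
# Hypothetical sketch of the reconciliation step.
# desired:  component names found in the node attribute data
# existing: component names currently present on the server
def plan_changes(desired, existing)
  {
    create: desired - existing,  # declared but not yet present
    remove: existing - desired,  # present but no longer declared
    keep:   desired & existing   # declared and already present
  }
end

plan = plan_changes(%w[SomeappSQL ReportingSQL], %w[SomeappSQL LegacySQL])
puts plan.inspect
```

With raw LWRPs the "remove" list is something the recipe author has to remember and write by hand.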

Using the attribute_driven recipe does not remove the ability to directly use the raw LWRPs when needed. However 95% of the time we can get away with working at a higher level using the attribute_driven recipe.

Our approach also makes it easy to build up configuration from multiple sources. In our environment we typically build up configuration data from data bags in the chef server, a separate configuration service, LDAP/ActiveDirectory, a rule layer as well as occasionally hard coding the configuration into a recipe. However, after we have collected the configuration from the various sources, we just need to apply it as node attribute data and include the attribute_driven recipe. Hopefully there are fewer problems resulting from transcribing the configuration from one source to the node data than there would be if we had to interpret the configuration data and invoke the LWRPs in the correct sequence.

In fact, recently we have introduced a 'search_driven' recipe that crystallizes a common approach to collecting configuration data. It searches a particular index, using a particular query, extracts data from within the index and applies the data to the node in the correct location. Essentially that means we can store all our configuration data in the data bags for a particular glassfish domain.

Using a search_driven recipe

# Specify the index to search. Usually defaults to domain name.
node.override['glassfish']['domains']['mydomain']['discover']['type'] = 'front_end'
# Specify the query to use. Defaults to '*:*'
node.override['glassfish']['domains']['mydomain']['discover']['query'] =
  "chef_environment:#{node.chef_environment}"
# Specify the key to merge into domain config. Defaults to 'config'
node.override['glassfish']['domains']['mydomain']['discover']['query'] = 'myconfig'

include_recipe 'glassfish::search_driven_domain'

When to use reusable Cookbooks?

So one question that not a lot of time was spent on during the hangout was when to use "library" cookbooks. We are strong proponents of reusable cookbooks and yet in our infrastructure, only 5 of our 70+ cookbooks fall into this category. I can envision the ratio going up to as many as 9 in ~55 cookbooks but that is still a small proportion of our cookbooks. The reusable cookbooks include core functionality such as; firewalls, monitoring, the application server, the message broker and the content management system. Our other cookbooks may be reusable to one degree or another but no other cookbook follows the "library" design pattern.

There seemed to be a strong turnout from those who have come from the developer tradition in contrast to the operations tradition which may account for the strong push towards reuse and higher level abstractions. Our LWRPs tend to be thin veneers on top of abstractions in the underlying tool and the attribute_driven recipes are thin veneers on top of the LWRPs. I can see that higher level abstractions that are widely applicable may have merit and may even drive infrastructure decisions. Rails was remarkable in the way it simplified development through a set of conventions and higher level abstractions and maybe that approach could be just as successful in Chef. However that is not something we do locally so I don't have a feeling for how good or bad it could be.

Overall I enjoyed the hangout - it is pleasing to see a lot of smart and passionate people in the chef community.

LWRP notifying on changed resources

2012-07-17T00:00:00+00:00

Opscode Chef's light-weight resource providers are awesome. They allow you to compose a more complex resource from simpler resources. The one thing that has always annoyed me is that you can no longer use notifications to indicate which resources have actually changed, or so I thought. Then along came a gist by yfeldblum that demonstrated how to do this.

Of course I wanted to simplify this as almost all of our LWRPs will need this so I converted it into a method that can be added in the library directory.

def notifying_action(key, &block)
  action key do
    # So that we can refer to these within the sub-run-context
    # block.
    cached_new_resource = new_resource
    cached_current_resource = current_resource

    # Setup a sub-run-context.
    sub_run_context = @run_context.dup
    sub_run_context.resource_collection =
       Chef::ResourceCollection.new

    # Declare sub-resources within the sub-run-context. Since they
    # are declared here, they do not pollute the parent run-context.
    begin
      original_run_context, @run_context =
          @run_context, sub_run_context
      instance_eval(&block)
    ensure
      @run_context = original_run_context
    end

    # Converge the sub-run-context inside the provider action.
    # Make sure to mark the resource as updated-by-last-action if
    # any sub-run-context resources were updated (any actual
    # actions taken against the system) during the
    # sub-run-context convergence.
    begin
      Chef::Runner.new(sub_run_context).converge
    ensure
      if sub_run_context.resource_collection.any?(&:updated?)
        new_resource.updated_by_last_action(true)
      end
    end
  end
end

Now every provider can define its actions using something like the following. This will propagate notifications if any sub-resources have changed.

notifying_action :run do
  file "/tmp/something" do
    owner "root"
    group "root"
    mode "0755"
    action :create
  end
end

Monitoring as part of the definition of done using Chef

2012-06-27T00:00:00+00:00

Monitoring needs to be part of the definition of done. In an ideal world you would not be able to deploy an application without deploying services to monitor the ongoing activity and health of the application. If you care enough to deploy an application, you should care enough to make sure it keeps working within expected parameters.

When this is not adhered to, you end up in a bad place. We have a large number of bespoke services developed over a period of ~12 years in many different technology stacks. However we have very poor visibility into their operation. In many cases it is difficult to determine whether the services are still operating within expected parameters or even whether they are still being used. In almost all of them, it is our users that alert us to performance or correctness problems in the services. This is a terrible situation and is hated by everyone; developers, operators and users alike.

Our newer applications are built with a strict regime where a service is not done until it has monitoring. We have adopted Chef as our configuration management tool of choice. We have also started using an attribute driven approach for developing our cookbooks. The rest of this post outlines some of the techniques we use to keep our systems in check.

Overview

We tend to break down monitoring into four major orientations; infrastructure, system, service and business characteristics. The infrastructure characteristics tend to be elements such as utilization of underlying network infrastructure etc but we do not have a good story for this yet. The system characteristics are the node characteristics such as CPU usage, memory usage, disk operations, bytes transmitted etc. The service characteristics include metrics such as queue sizes, throughput rates. The business characteristics include things like the number of Wildfires still going, the number of resources deployed to each emergency event, the number of incident control teams activated etc. (We write software for emergency services).

For most characteristics we monitor, we want to graph the characteristic over time. Alerts need to be generated if the metric values are outside an expected range of values or are trending towards this scenario. We also want our releases to be relatively self-contained. i.e. If we release a new application this should update all of the graphs, monitors and alert configurations for that application. The pattern we use to implement this is to have the cookbook for the application publish the probe and graph definitions as attributes on the node. The configuration is then discovered by recipes that run on the local node or remote nodes and is used to drive configuration of the monitoring tools.

Right now our tool of choice for persistence and graphing of metric data is graphite. The graphite product suite allows flexibility in how data is collected, aggregated, presented and analyzed but it has a very poor user experience. So we have adopted gdash to build our dashboard.

System Level Monitoring

We use a number of different tools to monitor the system level characteristics but the default answer on Ubuntu/Linux hosts is to use collectd. It has many plugins to monitor all sorts of characteristics but we tend to use it to measure system level characteristics such as CPU usage. The collectd agents publish directly to a graphite server using the write_graphite plugin.

We have rewritten the existing collectd cookbook to use an attribute driven approach so that each node need only define the appropriate attributes and include the appropriate collectd recipe to activate desired plugins. A typical block used to configure a collectd node would look something like the following;

node.override['collectd']['name'] = node['hostname']
node.override['collectd']['plugins'] =
  {
    'syslog' => {'config' => {"LogLevel" => "Info"}},
    'disk' => {},
    'swap' => {},
    'memory' => {},
    'cpu' => {},
    'interface' => {'config' => {"Interface" => "lo", "IgnoreSelected" => true}},
    'df' => {'config' => {"ReportReserved" => false,
                          "FSType" => ["proc", "sysfs", "fusectl", "debugfs", "devtmpfs", "devpts", "tmpfs"],
                          "IgnoreSelected" => true}},
  }
# Use a utility method to search for the graphite server
graphite_host, graphite_port = ...
if graphite_host
  node.override['collectd']['plugins']['write_graphite'] =
      {'config' => {'Host' => graphite_host,
                    'Port' => graphite_port,
                    'Prefix' => "#{node.chef_environment}.node."}}
end

include_recipe "collectd::attribute_driven"

Of course we also consider statistics about our chef execution to be important enough to monitor and we use the graphite_handler cookbook to collect these statistics.

# Use a utility method to search for the graphite server
graphite_host, graphite_port = ...
if graphite_host
  node.override['chef_client']['handler']['graphite']['host'] = graphite_host
  node.override['chef_client']['handler']['graphite']['port'] = graphite_port
end

include_recipe "graphite_handler::default"

This is enough to configure the publishing of data about the nodes to graphite but we also want to configure graphite to customize the retention policy for the data collected. This is done by attribute configuration. The graphite node discovers the configuration on next converge. Below is an example configuration for the chef statistics.

node.default['graphite']['carbon']['storage_schemas']['chef'] =
  {
    'priority' => 0,
    'aggregation_method' => 'last',
    'x_files_factor' => '0.1',
    'pattern' => '^.*\.chef\..*$',
    'retentions' => '1m:7d,10m:2y'
  }
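
On the graphite node the discovered attributes eventually need to be turned into entries in carbon's storage-schemas.conf. The sketch below renders only the pattern and retentions keys, which are the two settings I am sure belong in that file; how the remaining keys are handled is left out. `render_storage_schemas` is a hypothetical helper and is not part of our graphite cookbook.

```ruby
# Hypothetical helper: renders schema attributes as storage-schemas.conf
# sections. Only 'pattern' and 'retentions' are rendered in this sketch.
def render_storage_schemas(schemas)
  schemas.map do |name, config|
    "[#{name}]\n" \
      "pattern = #{config['pattern']}\n" \
      "retentions = #{config['retentions']}\n"
  end.join("\n")
end

schemas = {
  'chef' => {
    'pattern' => '^.*\.chef\..*$',
    'retentions' => '1m:7d,10m:2y'
  }
}

rendered = render_storage_schemas(schemas)
puts rendered
```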

So far we have collected data and published it into graphite with custom retention rules. To build up a dashboard we use a similar technique. Each application publishes attribute data defining a series of graphs and dashboard components. We use a modified gdash cookbook that uses search to discover all the published graph components and constructs a dashboard from the data. An example of such a configuration is below;

node_prefix = "#{node.chef_environment}.node.#{node['hostname']}"
node.override['gdash']['dashboards']["#{node['hostname']}-node"] =
  {
    'category' => 'nodes',
    'description' => "#{node['hostname']} Node Metrics",
    'display_name' => node['hostname'],
    'components' => {
      'cpu' => {
        'area' => 'stacked',
        'title' => 'CPU Usage',
        'vtitle' => 'percent',
        'description' => "The CPU usage",
        'fields' => {
          'iowait' => {
            'color' => 'red',
            'alias' => 'IO Wait',
            'data' => "sumSeries(#{node_prefix}.cpu-*.cpu-wait)"
          },
          'system' => {
            'color' => 'orange',
            'alias' => 'System',
            'data' => "sumSeries(#{node_prefix}.cpu-*.cpu-system)"
          },
          'user' => {
            'color' => 'yellow',
            'alias' => 'User',
            'data' => "sumSeries(#{node_prefix}.cpu-*.cpu-user)"
          }
        }
      }
    }
  }
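
The discovery side is conceptually just a merge of whatever each node has published. The following plain Ruby sketch uses an array of hashes as a stand-in for the search results; `collect_dashboards` and the node data are hypothetical, and the real cookbook works against nodes returned by a chef search.

```ruby
# Hypothetical sketch: merge the dashboards published by every node.
# Each node is modelled as a plain hash of its attributes.
def collect_dashboards(nodes)
  nodes.each_with_object({}) do |node, dashboards|
    published = node.fetch('gdash', {}).fetch('dashboards', {})
    dashboards.merge!(published)
  end
end

nodes = [
  { 'gdash' => { 'dashboards' => { 'web1-node' => { 'category' => 'nodes' } } } },
  { 'gdash' => { 'dashboards' => { 'mq1-node' => { 'category' => 'nodes' } } } },
  { 'hostname' => 'db1' } # this node publishes no dashboards
]

dashboards = collect_dashboards(nodes)
puts dashboards.keys.sort.inspect
```

Keying each dashboard by hostname, as in the configuration above, keeps the published entries from colliding when they are merged.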

Application Level Monitoring

Our applications come in all shapes and sizes but we have many java (and jruby) based applications so we collect a lot of the monitoring data via JMX. We developed a small tool (spydle) and a corresponding chef cookbook that periodically polls the applications using JMX and pushes the data to the graphite server. The cookbook uses search to discover the configurations that other nodes have published and adds that to the poller's configuration.

We use the OpenMQ message broker that is part of the GlassFish server. To collect data about its operation we use a snippet like the following to grab data out of JMX. This configuration is defined on the OpenMQ node.

node.override['spydle']['probes'] = {
  "#{node.chef_environment}_openmq" =>
    {
      'type' => 'in:jmx',
      'config' => {
        'host' => node['ipaddress'],
        'port' => node['openmq']['instances'][app_key]['jmx']['port'],
        'username' => 'spydle',
        'password' => jmx_monitors['spydle'],
        'probes' => [
          {
            'object_name' => 'com.sun.messaging.jms.server:type=Destination,subtype=Monitor,desttype=*,name=*',
            'attribute_names' =>
              [
                'NumActiveConsumers',
                'NumMsgs',
                'NumMsgsHeldInTransaction',
                'NumMsgsPendingAcks',
                'NumMsgsIn',
                'NumMsgsOut'
              ],
            'namespace' => namespace,
            'name_components' => ['type', 'desttype', 'name']
          },
          ...
        ]
      }
    }
}

Business Level Monitoring

The business level metrics are collected from all sorts of places but the two main sources of information are; JMX characteristics exposed by our applications and values in the database. The metric data collected from the database is often an aggregate SQL query against either our operational data store or our warehouse database. Spydle also supports queries against a database. As a result spydle is the tool of choice at this level.

Releasing

The way we release our applications is we promote a new application specific cookbook and then run converge on the application node. This ensures that the application is deployed and the attribute data for the monitoring system is published on the node. We then converge the monitoring nodes and they discover the new configuration for the application via search and update the graphite/gdash/spydle etc configuration as necessary. Rollback is simple as it is just another release and a re-converge on the monitor nodes and the application nodes.

Conclusions

So far, what we have works well. It is easy to monitor and graph data about a node or an application. In reality the configuration is a little more complex than is indicated above; we tend to have short retention times for data in environments other than production and we tend to limit the generation of graphs to environments we care about. We are trialing a few tools to generate alerts, mostly by querying graphite. The alerts will follow the same approach as our other monitoring infrastructure and we will publish the alerts in the application node's attributes. Once that is in place we will have much better insight into how our systems behave.

Evolving towards cookbook reusability in Chef

2012-05-12T00:00:00+00:00

A few months ago, I started to invest heavily in Chef to automate the roll out of our applications and the supporting infrastructure. So far, so good but it has not always been sunshine and puppy dogs. One of the major challenges is attempting to reuse cookbooks found on the community site, on GitHub or even within our own organization. I have found that I frequently had to customize the cookbooks heavily or rewrite the cookbooks from scratch to meet our needs.

Recently I have discovered a pattern that we use in our internal cookbooks that seems to make reuse possible, even easy. So I thought I would send it out into the world to see if it is something that others would find useful. So here is how it evolved...

Phase 1: Cookbook as a big bash script

In the beginning, our cookbooks mostly felt like big bash scripts. Conceptually they would do something along the lines of:

bash "install mypackage" do
  cwd "#{Chef::Config[:file_cache_path]}"
  code <<-EOH
wget http://example.com/mypackage-1.0.tar.gz
tar xzf mypackage-1.0.tar.gz
cd mypackage-1.0
./configure && make && make install
  EOH
  not_if { File.exist?("/usr/bin/mypackage") }
end

This was fast to write but that is the best that could be said about this technique. This approach resulted in no reusability of cookbooks unless we had the exact same requirements on a different node.

Phase 2: Attributes to customize

We quickly ran into issues when we needed to customize the application based on the environment, at which point we introduced attributes. Conceptually, our recipes started to look something like:

bash "install mypackage" do
  cwd "#{Chef::Config[:file_cache_path]}"
  code <<-EOH
wget http://example.com/mypackage-#{node[:mypackage][:version]}.tar.gz
tar xzf mypackage-#{node[:mypackage][:version]}.tar.gz
cd mypackage-#{node[:mypackage][:version]}
./configure && make && make install
  EOH
  not_if { File.exist?("/usr/bin/mypackage") }
end

template "/etc/mypackage.conf" do
  source "mypackage.conf.erb"
  mode "0644"
  variables(
      :database => node[:mypackage][:database],
      :user => node[:mypackage][:user],
      :password => node[:mypackage][:password]
    )
end
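
To make the idea concrete outside of Chef, here is a minimal plain-Ruby sketch of attribute-driven customization. A Hash stands in for the node attributes, and the `install_script` helper, the default values and the `staging` override are all hypothetical names made up for illustration.

```ruby
# Plain-Ruby stand-in for node attributes. In a cookbook these would
# normally be declared as attributes and overridden per environment.
DEFAULT_ATTRIBUTES = {
  mypackage: { version: '1.0', database: 'mypackage_db', user: 'mypackage' }
}.freeze

# Hypothetical helper: render the install script for a set of attributes.
def install_script(attributes)
  version = attributes[:mypackage][:version]
  archive = "mypackage-#{version}.tar.gz"
  <<~SCRIPT
    wget http://example.com/#{archive}
    tar xzf #{archive}
    cd mypackage-#{version}
    ./configure && make && make install
  SCRIPT
end

# An environment overrides only the values that differ from the defaults.
staging = { mypackage: DEFAULT_ATTRIBUTES[:mypackage].merge(version: '1.1') }

puts install_script(staging)
```

The recipe logic stays the same in every environment; only the data changes.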

Phase 3: Partition the recipes into units of reuse

Further down the track we found that different nodes would have different requirements. For example, one installation of mypackage would use a local database for authentication while another installation would authenticate against Active Directory. This resulted in us splitting recipes into multiple recipes based on the units of reuse. So our hypothetical "mypackage::default" recipe would be split into "mypackage::default", "mypackage::db_auth" and "mypackage::ad_auth". The role would include the particular recipes that it required.

Phase 4: Resources to the rescue

Resources (via LWRPs) were the next abstraction that we introduced. This made it easy to repeat similar sets of complex actions in many recipes with minor differences in configurations. A typical scenario involves defining multiple queues in a message broker, such as this snippet using the glassfish cookbook:

glassfish_mq_destination "WildfireStatus queue" do
  queue "Fireweb.WildfireStatus"
  config('validateXMLSchemaEnabled' => true, 'XMLSchemaURIList' => 'http://...')
  host 'localhost'
  port 7676
end

glassfish_mq_destination "PlannedBurnStatus queue" do
  queue "Fireweb.PlannedBurnStatus"
  config('maxCount' => 1000, ...)
  host 'otherhost'
  port 7676
end

It should be noted that these resources can be composed, so that low level resources can be used to build up high level resources. We actually have a glassfish_mq resource that uses the glassfish_mq_destination resource in its implementation.

glassfish_mq "MessageBroker Instance" do
  instance "MessageBroker"
  users({...})
  access_control_rules({...})
  config({...})
  queues(
    "Fireweb.WildfireStatus" => {'validateXMLSchemaEnabled' => true, 'XMLSchemaURIList' => 'http://...'},
    "Fireweb.PlannedBurnStatus" => {'maxCount' => 1000, ...}
  )
  port 7676
  admin_port 7677
  jms_port 7678
  jmx_port 8087
  stomp_port 8087
end
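
The composition idea can be sketched in plain Ruby without Chef. This is not how the glassfish cookbook is implemented; `mq`, `mq_destination` and `DECLARED` are hypothetical stand-ins that only show a high level resource expanding into several low level ones.

```ruby
# Records each low level "resource" that gets declared, standing in for
# Chef's resource collection.
DECLARED = []

# Low level resource: a single destination on a broker.
def mq_destination(queue:, config:, host:, port:)
  DECLARED << { queue: queue, config: config, host: host, port: port }
end

# High level resource: a broker instance, built from the low level resource.
def mq(instance:, queues:, port:, host: 'localhost')
  queues.each do |queue, config|
    mq_destination(queue: queue, config: config, host: host, port: port)
  end
  instance
end

mq(instance: 'MessageBroker',
   port: 7676,
   queues: {
     'Fireweb.WildfireStatus' => { 'validateXMLSchemaEnabled' => true },
     'Fireweb.PlannedBurnStatus' => { 'maxCount' => 1000 }
   })

DECLARED.each { |d| puts "#{d[:queue]} on #{d[:host]}:#{d[:port]}" }
```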

Phase 5: Data driven reuse

The use of resources allowed us to easily create customized cookbooks but authoring the cookbooks could get monotonous. There was a lot of boilerplate code in each recipe. We reacted by storing a simplified description of the resources as data, interpreting the description and invoking the resources to represent the data. Sometimes the description was stored in data bags, sometimes the description was synthesized by searching the chef server, sometimes the description was synthesized using a rule layer.

For example, we discovered the set of queues to create in our message broker by searching the chef server for nodes in the same environment that declared a requirement for message queues in the attributes (i.e. "openmq.destinations.queues"). When configuring the logging aspects of our systems, we search for a graylog2 node and ensure we get the production node in the production environment and the development node in all other environments. The .war files and their required customizations are declared in a data bag and we query the data bag when populating our application server.
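
A rough plain-Ruby sketch of the queue discovery described above follows. The `NODES` array is a hypothetical stand-in for the results of a chef server search, and `required_queues` is a made-up helper name; a real recipe would call search and invoke a resource for each entry.

```ruby
# Hypothetical stand-in for nodes returned by a search of the chef server.
NODES = [
  { name: 'web1', environment: 'production',
    openmq: { destinations: { queues: { 'Fireweb.WildfireStatus' => { 'maxCount' => 500 } } } } },
  { name: 'web2', environment: 'production',
    openmq: { destinations: { queues: { 'Fireweb.PlannedBurnStatus' => { 'maxCount' => 1000 } } } } },
  { name: 'dev1', environment: 'development',
    openmq: { destinations: { queues: { 'Scratch.Queue' => {} } } } },
  { name: 'db1', environment: 'production' }
].freeze

# Collect the queues required by every node in the given environment.
# Nodes that declare no queues contribute an empty hash.
def required_queues(nodes, environment)
  nodes
    .select { |n| n[:environment] == environment }
    .map { |n| n.dig(:openmq, :destinations, :queues) || {} }
    .reduce({}, :merge)
end

required_queues(NODES, 'production').each do |queue, config|
  # In a recipe, this is where the destination resource would be invoked.
  puts "declare #{queue} with #{config}"
end
```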

Phase 6: Policy recipe + Attribute driven recipe

The data driven approach saved us a lot of work but it limited the amount of cookbook reuse; business rules were encoded into the way we stored, synthesized and discovered the data. It also meant that some of our core cookbooks changed every time we changed the way we abstracted our application configuration data.

Our most recent approach has been to pull the business specific policy code out into a separate cookbook and then include a recipe that uses the attributes defined on the current node to drive the creation of the infrastructure.

Our policy cookbooks tend to look something like the following.

node.override[:openmq][:extra_libraries] = ["http://example.org/repo/myext.jar"]

search(:node, 'omq_dests_queues:*' + node.name) do |n|
  n.to_hash.each_pair do |key, value|
    node.override['omq']['dests']['queues'][key] = value
  end
end

include_recipe "glassfish::attribute_driven_mq"

This approach seems to have given us a way to create a reusable cookbook (glassfish in the case above) with the components that are less likely to be reused in a separate "policy" recipe. We are already using this to successfully manage an application server, a message broker, to configure monitoring and logging and to apply firewall rules. I wonder if this is an approach that others have discovered and if it could be applied to other cookbooks.
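
The split can be illustrated with a small plain-Ruby sketch in which a Hash stands in for node attributes. The function names `policy` and `attribute_driven_mq` and the sample data are hypothetical; the point is only that the reusable part reads attributes and nothing else.

```ruby
# Reusable part: knows nothing about how the attributes were produced.
def attribute_driven_mq(attributes)
  attributes.fetch('queues', {}).map do |name, config|
    "queue #{name} #{config}"
  end
end

# Business specific part: decides where the data comes from and writes
# it into the attributes before the reusable part runs.
def policy(attributes, discovered_nodes)
  attributes['extra_libraries'] = ['http://example.org/repo/myext.jar']
  attributes['queues'] ||= {}
  discovered_nodes.each do |node|
    attributes['queues'].merge!(node.fetch('queues', {}))
  end
  attributes
end

# Hypothetical search results.
discovered = [
  { 'queues' => { 'Fireweb.WildfireStatus' => { 'maxCount' => 500 } } },
  { 'queues' => { 'Fireweb.PlannedBurnStatus' => { 'maxCount' => 1000 } } }
]

attributes = policy({}, discovered)
puts attribute_driven_mq(attributes)
```

Changing how the data is discovered only touches `policy`; `attribute_driven_mq` can be shared unchanged.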

Antix - and tasks for Ant

2011-08-07T00:00:00+00:00

A long time ago I was involved with the Ant project and part of the philosophy was that Ant was not executable xml. So this meant that <if/> and <for/> tasks were out. Implementing the equivalent functionality involved complex sets of tasks and properties to be defined.

Fast-forward many years and Ant still does not provide this functionality out of the box. I rarely use Ant these days, opting instead to use Buildr or Rake depending on the project. But when I do use Ant I find myself re-implementing the same set of tasks - usually <if/> and <forEach/>. A while ago I consolidated all the different implementations under one source tree, Antix on GitHub.

Someone asked me how to use them so here is a basic description...

Setup

The simplest way to install Antix is to download the jar and add a taskdef to your build file.

Jar: http://cloud.github.com/downloads/realityforge/antix/antix-1.0.0.jar

<taskdef resource="org/realityforge/antix/antlib.xml">
  <classpath path="path/to/antix-1.0.0.jar"/>
  <!--
  This task library can also be put in the
  ${ANT_HOME}/lib directory, in which case this
  classpath node is not needed
  -->
</taskdef>

Benefits

The <if/> task

The <if/> task is simple in that it has two child elements: conditions and sequential. The sequential element contains a list of tasks to execute if all of the conditions evaluate to true.

e.g.

<if>
  <conditions>
    <equals arg1="${my.build.parameter}" arg2="true"/>
  </conditions>
  <sequential>
    <echo message="The property my.build.parameter is set to true!"/>
  </sequential>
</if>

The <forEach/> task

The <forEach/> takes a list of white space separated values and invokes a nested sequential element for each value, setting a specific parameter to the value during the invocation.

e.g.

<forEach property="day" list="Mon Tue Wed Thu Fri">
  <sequential>
    <echo message="Day = @{day}"/>
  </sequential>
</forEach>

will print ..

[echo] Day = Mon
[echo] Day = Tue
[echo] Day = Wed
[echo] Day = Thu
[echo] Day = Fri

The <property-copy/> task

Ant properties are not allowed to be nested so you need to do some hackery to get nested properties to work properly. The Antix library implements the approach recommended by the FAQ by implementing a property-copy task that will evaluate the property two layers deep and copy the value to another property.

This is typically used when you are selecting from a variety of different build configuration settings, e.g. whether to use the EJB or the web service generator.

e.g.

<property name="ejb.service.generator"
          value="com.biz.EjbGen"/>
<property name="ws.service.generator"
          value="com.biz.WebServiceGen"/>

<property name="generator-type" value="ws"/>
<property-copy name="service.generator"
               from="${generator-type}.service.generator"/>
<echo>service.generator=${service.generator}</echo>

will print ..

[echo] service.generator=com.biz.WebServiceGen
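
For readers who think more easily in Ruby than in Ant, the two-layer evaluation can be modelled roughly as follows. This is a sketch of the behaviour described above, not the Antix implementation; the `expand` and `property_copy` helpers are hypothetical.

```ruby
# Hypothetical model of Ant properties as a Hash.
properties = {
  'ejb.service.generator' => 'com.biz.EjbGen',
  'ws.service.generator'  => 'com.biz.WebServiceGen',
  'generator-type'        => 'ws'
}

# Expand ${name} references using the property table.
def expand(text, properties)
  text.gsub(/\$\{([^}]+)\}/) { properties.fetch(Regexp.last_match(1)) }
end

# First layer: expand the "from" expression to obtain a property name.
# Second layer: look that name up and copy its value to the target.
def property_copy(properties, name:, from:)
  source = expand(from, properties)
  properties[name] = properties.fetch(source)
end

property_copy(properties, name: 'service.generator',
                          from: '${generator-type}.service.generator')
puts "service.generator=#{properties['service.generator']}"
```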

The <dbgmsg/> task

The dbgmsg task will only print the specified message if the property named "debug" is set to a value. This is mostly used when debugging builds.

<dbgmsg message="My debug message"/>

The <start-phase/> and <end-phase/> tasks

The start-phase and end-phase tasks are used to print the time it takes for various build phases. Each phase has a name and a timer starts when start-phase is executed and is stopped when end-phase executes. Both tasks echo a message at warning level (if the property named timing.check is set) or at the verbose level.

e.g.

<start-phase phase="integration-tests"/>
...
<end-phase phase="integration-tests"/>

will print ..

[echo] Starting phase 'integration-tests' at 18:15:39
...
[echo] Completing phase 'integration-tests' at 18:15:39 (Duration = 48ms)

The <toAscii/> task

Copy a file while replacing non-ascii characters with the character '?'.

e.g.

<toAscii src="SomeNonAsciiFile.txt" dest="SomeAsciiFile.txt"/>

The <selectRegex/> task

The selectRegex task attempts to extract a value from a string based on a regular expression and assign that value to a property. Often I use this to extract out results from tests to do further processing.

<selectRegex property="that"
             pattern="string (.*) will"
             select="\1"
             value="My string that will attempt to be matched."/>
<echo>that=${that}</echo>

will print ..

[echo] that=that
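
The same extraction can be expressed in a few lines of Ruby, which may help when working out what a pattern will select. This is only an approximation of the task's behaviour; `select_regex` is a hypothetical helper, and it assumes that "\1" in select refers to the first capture group.

```ruby
# Match value against pattern and substitute \N references in select
# with the corresponding capture groups. Returns nil when nothing matches.
def select_regex(value, pattern, select)
  match = Regexp.new(pattern).match(value)
  return nil unless match

  select.gsub(/\\(\d)/) { match[Regexp.last_match(1).to_i] }
end

that = select_regex('My string that will attempt to be matched.',
                    'string (.*) will',
                    '\1')
puts "that=#{that}"
```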

The <timer/> task

The timer task can either be a "start" or "stop" timer. A "start" timer sets a property to now, indicating a start time. A "stop" timer sets a property to now to indicate an end time and calculates the duration from the corresponding start time. Mostly this task is not used directly but is instead used by the start-phase and end-phase tasks described above.

<timer property="mytimer" stop="false"/>
<echo>Start: ${mytimer.start}</echo>
<timer property="mytimer" stop="true"/>
<echo>Stop: ${mytimer.end}</echo>
<echo>Duration: ${mytimer.duration}</echo>

will print ..

[echo] Start: 1312706159383
[echo] Stop: 1312706159397
[echo] Duration: 14

GWT and EJB 3.1

2011-08-06T00:00:00+00:00

Recently we have been tasked with building a rich, complex web application for resource planning. Historically most of our applications have been successfully delivered using Rails. However the cost of developing rich applications has been significant and only a few developers are comfortable working with low-level javascript frameworks.

We prototyped the front-end using ExtJS and were looking at implementing the backend technology using EJB 3.1 beans (See Why EJB? for our reasoning). We investigated using JSF and GWT as the front end technology but eventually settled on GWT, took a course and went through a number of labs to develop a simple GWT application.

The course only gave us a taste of GWT and we still needed to do a bunch of investigation to get a GWT application from "toy" stage to production ready. We took the best practices MVP example and started to evolve it towards an archetypal example in our world. This involved adding an automated build system that we could run from our CI box, moving to EJBs for the service layer, JPA for the persistence layer, moving to IntelliJ IDEA for the IDE and splitting the project out into multiple components that could be worked on independently.

The code for converting the service layer to EJBs was actually quite simple as is evidenced by the commit. However this code does not work in the built-in Jetty container used as part of development mode, and the documentation on how to actually use EJBs is rather thin. In IntelliJ IDEA, getting it to work entailed setting up the IDE to build an exploded war, configuring GlassFish support and passing -noserver to the GWT plugin.

For the build system we initially trialled using Maven 3.0.3 but went back to using Apache Buildr. Maven is a project I want to like and has great ideas but even years after it was developed I keep running into stability issues; plugins don't work, dependencies are not locked by default etc. Buildr is a little rough around the edges but does not get in your way when you want to do something custom. The commit that converted the project to Buildr is a perfect example. There is very little code involved in the Buildr files buildfile and build.yaml but there is a fair amount of custom code involved in modifying the IntelliJ IDEA Buildr extension so that it generates custom metadata for the build. Admittedly these customizations will be rolled back into Buildr over time but it was simple to extend core Buildr classes to achieve our immediate needs.

Separating the different elements of the application out into separate components proved to be a minor annoyance. The code remained unchanged but the build system had to be refactored significantly, as did the Buildr customizations to generate the IDEA project files. However, you can see a snapshot of the current work in progress in the GitHub project at the tag BLOG_POST.

Footnotes

Why EJB?

We need to integrate with a thick Swing client over a custom network protocol and a web portal & BPMS using SOAP web services as well as the front-end for our new application. Candidate service layers included OSGi, Spring and EJB 3.1. Bizarrely enough we chose EJBs because it was simpler (!!!) to provide these interfaces in the straight JEE stack. (And yes we were very surprised to come to that conclusion too!).

GWT Course

The course was delivered by Adam Jenkins at Object Training and it was quite good. About the only complaint I have was that some of the architectural labs and talks were too focused on mechanical aspects and did not make the reasons for selecting a particular architecture clear. I came away thinking that the GWT MVP design pattern was developed by architecture astronauts but after reading more and watching a few YouTube clips I am sold on the approach.

Meta Data Is An Overhead

2011-06-06T00:00:00+00:00

‘Portrait of a n00b’ raises the issue of meta-data addiction and puts forth the proposition that way too many programmers are afflicted by this condition. Every time a programmer is forced to annotate their code or “computation”, this increases the cost of maintenance, evolution and change as the meta-data must be kept up to date. Sometimes the meta-data can provide some useful information (i.e. documenting the intention of code) that offsets the cost of the meta-data but this is often not the case.

A small child who is describing what they have done will often provide a list of minutiae interspersed with “… and then I …”. The sentence structure is often simple and repetitive. Often the child will explicitly explain what they mean when they encounter a subject that they feel is complex. As the language skills develop the person will be able to focus on more salient features and use more sophisticated language. The information density of the speech act increases and many subtle nuances can be combined in one speech act.

This parallels the development of a programmer. The more sophisticated the programmer the more compressed their “speech acts” aka programs will be. They start focusing on more salient features of the program and start using more sophisticated techniques (i.e. higher-order programming). The information density dramatically increases as the programmer develops.

This has an interesting implication for some common approaches used within the development process. Often adept programmers are forced to program in such a way that less adept programmers can understand. Worse yet they are forced to program in a consistent style with far less adept programmers. When the skill discrepancy is large this can effectively dilute the effectiveness of the more skilled participant. It is like forcing Shakespeare to tell his stories in baby talk: the story will inevitably lose many of the nuances and expand to massive size [1].

Comments are one form of meta-data that is common in programming. A “young” developer often writes many comments describing every step of the code. The comments can be in-band comments (i.e. javadoc and other code comments) or out-of-band comments (i.e. UML diagrams and sequence diagrams). As the developer becomes more sophisticated, the comments are often compressed and restricted to more complex aspects (i.e. more salient). The more mature developers seem to take the approach that the code is the document and attempt to restrict the code to high level descriptions or essential complexities that can not be removed.

Of course this does not apply to all programmers. As in natural language there are all sorts of reasons for using language. The above description applies to those programmers working in a small team of similarly skilled individuals who have the aim of producing a product with a balance of flexibility and robustness. Programmers working in a larger team have to worry about potential miscommunications and thus need to be far more precise in their communication (even if all members of the team are of roughly equal skill).

Comments as meta-data require some maintenance to keep correct. Incorrect documentation is often far worse than no documentation at all as it creates some confusion in the reader. Is the code wrong or the comments wrong? Have I misread the code or the comments and is it thus me that is wrong? The cognitive dissonance can be extremely disruptive to action[2].

Psychology experiments that ask a person to read a color word show that if the word is written in the same color the reaction time is fastest. The next fastest is when the word is written in a neutral color (such as black). The slowest reaction time occurs when the word is written in a non-neutral color that is different from what the word says (i.e. the word “red” written in green ink). With my pop-psychology sun glasses on I would hazard to guess that a similar impact is at work when meta-data does not line up.

So the value of meta-data is often relative to the level of compression of the meta-data (is it about a salient feature or does it “trivially” follow from the code, where trivial is relative to the programmer's skill level), the chance that the meta-data could be out of date (i.e. is it verified by a compiler or checker of sorts), the cost of maintaining the meta-data and the chance that the code, and thus the meta-data, is likely to change.

The distinction between meta-data and data is often an arbitrary point. If meta-data is verified then is it really meta-data or is it just more data? Are the types within a statically typed language really data or meta-data that the compiler checks?

The claim that static types may in fact be frivolous meta-data that people can do without raised a number of hackles, as is to be expected. The most interesting counter, ‘Concretizing static typing metadata’, made the claim that the static typing that “concretizes” the meta-data was not just for error avoidance but also aimed at actively increasing programmer productivity.

One of the more interesting points made in ‘Portrait of a n00b’ is that the failure of the semantic web is largely due to the fact that people will NOT spend considerable resources adding meta-data to their data. Systems with extreme levels of typing also tend to fail to gain traction outside of academia (where they offer many “research” opportunities in the same sense philosophy does) and the defence / aerospace industries (and the adoption of Ada may actually not be due to preference or any expectation by the individuals that it will bring better reliability but instead be due to a mandate from on high that was influenced by academic partners).

[1] It should be noted that Shakespeare often wrote for the common person and thus far less sophisticated people could understand and appreciate the work even if they missed the subtle nuances. However programming often has more in common with multiple people writing a story or perhaps multiple people creating a film. Having to talk to some of the people in baby talk is going to slow down the operation.

[2] When the code and comments line up and people expect them to line up then action may be faster but more action is required when the code needs to be altered thus potentially drying up any wins gained from faster action.

Babel

2011-06-05T00:00:00+00:00

**Babel** (noun) 1. an ancient city in the land of Shinar in which the building of a tower (Tower of Babel) intended to reach heaven was begun and the confusion of the language of the people took place. Gen. 11:4-9. 2. (usually lowercase) a confused mixture of sounds or voices. 3. (usually lowercase) a scene of noise and confusion. [Source](http://dictionary.reference.com/browse/babel)

Babel is a proposal for a multi-paradigm, low level type-safe virtual machine capable of executing several different programming languages. Different programming languages will inevitably be optimized to solve certain classes of problem in certain domains. No single programming language or programming paradigm is ideal in all scenarios.

The aim is to be able to execute virtual instruction sets from several different virtual machines such as Ruby/rubinius, Java/Java bytecode, Erlang/Beam, Scheme, Haskell, Mercury, R etc. Different types of languages that are expected to be supported should represent functional, logic or constraint-based, message-based and imperative paradigms.

In an ideal environment it would be possible to transparently combine elements written using different paradigms in the same application. The impedance mismatch at the language barriers often makes this a difficult proposition.

One possible remedy is to host each different programming language or paradigm in a software isolated process (SIP). Each SIP communicates with other SIPs through message passing. Communication between SIPs would still need to translate values from one paradigm to another and may even need to serialize, deserialize or copy values between SIPs. However translation could be skipped when processes share a representation and in some scenarios copying may be avoided if a SIP supports copy-on-write or can only transfer immutable values.
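
As a very rough illustration of the message passing style (and nothing more than that), the sketch below models each SIP as a Ruby thread with a private inbox. Real SIPs would be isolated by the VM rather than by convention, and the names `Sip` and `spawn_sip` are hypothetical.

```ruby
# Each "SIP" is modelled as a thread with its own inbox; the only way to
# interact with it is to push a message onto that inbox.
Sip = Struct.new(:inbox, :thread)

def spawn_sip(&handler)
  inbox = Queue.new
  thread = Thread.new do
    # Process messages until the :stop sentinel arrives.
    while (message = inbox.pop) != :stop
      handler.call(message)
    end
  end
  Sip.new(inbox, thread)
end

results = Queue.new

# A SIP that doubles numbers and replies on the queue named in the message.
doubler = spawn_sip do |message|
  message[:reply_to] << message[:value] * 2
end

# The message hash is (shallowly) frozen so the receiver cannot modify it;
# this is the kind of value that could be handed over without a copy.
doubler.inbox << { value: 21, reply_to: results }.freeze
answer = results.pop

doubler.inbox << :stop
doubler.thread.join
puts answer
```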

SIPs have many of the advantages of Erlang's processes; fault isolation, low overhead, easy to parallelize. If Babel is structured correctly, the execution engine for each SIP could be shared between SIPs and may be composed from elements such that a logic and functional SIP share many of the same VM components.

Babel is unlikely to be started until well after I complete my PhD and while I have experimented with several different components at times, no headway has been made. For Babel to have any chance of success it must be optimized for fun.

Instruction Set Composition

It is likely that the various programming languages that are built on the Babel VM (BVM) can share subsets of their instruction sets. Candidate instructions that immediately come to mind include arithmetic operations for 32-bit integer values and 32-bit IEEE 754 floating point values. It is also likely that there will be dependency relationships between instruction groups, e.g. the 32-bit integer vector operations rely on the presence of 32-bit integer scalar operations.
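
One way to picture the dependency relationships is as a small graph that determines the order in which instruction groups must be loaded. The sketch below is purely illustrative: the group names are invented, `load_order` is a hypothetical helper, and it assumes the dependencies contain no cycles.

```ruby
# Hypothetical instruction groups mapped to the groups they depend on.
GROUPS = {
  'int32.vector'   => ['int32.scalar'],
  'int32.scalar'   => [],
  'float32.scalar' => []
}.freeze

# Depth-first walk that lists every group after the groups it relies on.
# Assumes the dependency graph is acyclic.
def load_order(dependencies)
  ordered = []
  visit = lambda do |group|
    next if ordered.include?(group)

    dependencies.fetch(group).each { |dep| visit.call(dep) }
    ordered << group
  end
  dependencies.each_key { |group| visit.call(group) }
  ordered
end

puts load_order(GROUPS).inspect
```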

Each instruction group is likely to result in different sets of optimization passes being incorporated in the runtime compilers. These could be a new set of BURS rules or specific optimization passes (i.e. to vectorize scalar operations in loops). Thus each instruction group should be able to be bundled separately and identify associated optimization passes etc.

Layered Programming Language Features

The BVM is likely to have a family of “native” languages. The languages should be layered such that each successive language incorporates features from the lower layers. The “kernel” language is most likely going to be a stack based language such as Joy/Forth with linear types and procedures/functions that are guaranteed not to be recursive. The next higher language may support recursive functions and immutable variables (i.e. those that can be written once and read many times). A higher layer still may support mutation of variables or polymorphic invocations.

Language features such as generic types, tail calls, lazy vs strict modes, query matching vs normal execution, object manipulation etc are gradually added depending on where the language is positioned in the kernel-system-application-scripting language spectrum. Ideas should definitely be incorporated from Forth, Scheme, Haskell, Smalltalk and Mercury when developing the composable language features.