Flipper Preloading

Flipper is already pretty optimized for production usage (flipper does billions of feature checks a day at GitHub), but the latest release (0.10.2) just received a couple of new optimizations — all thanks to community contributions.

In jnunemaker/flipper#190, @mscoutermarsh said:

Hi,

Would love if there’s a way to preload enabled features for an actor. For Product Hunt, we check several features in a single request for current_user. With activerecord this adds up to quite a few queries. Would love to get it down to one.

From browsing source I don’t believe this is currently available.

I suggested using caching, as we do at GitHub, but also put some thoughts down on how preloading features could work. @gshutler saw the conversation and put pen to paper on a great pull request that made the idea concrete.

Some Background

Oftentimes, users of flipper do many feature checks per request.
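
To make the problem concrete, a request might do something like this (the feature names here are made up):

flipper[:search].enabled?(current_user)
flipper[:stats].enabled?(current_user)
flipper[:beta_ui].enabled?(current_user)

With an ActiveRecord adapter, each of those enabled? calls is its own query unless the features are preloaded or cached.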


Of Late

A lot has changed over the years. I now do a lot more than just Rails, and having railstips as my domain seems to mentally put me in a corner.

As such, I have revived johnnunemaker.com. While I may still post a rails topic here once in a while, I’ll be posting a lot more varied topics over there.

In fact, I just published my first post of any length, titled Analytics at GitHub. Head on over and give it a read.

Let Nunes Do It

In a moment of either genius or delirium I decided to name my newest project after myself. Why? Well, here is the story whether you want to know or not.

Why Nunes?

Naming is always the hardest part of a project. Originally, it was named Railsd. The idea of the gem was to automatically subscribe to all of the valuable Rails instrumentation events and send them to statsd in a sane way; thus Railsd was born.

After working on it a bit, I realized that the project was just an easy way to send Rails instrumentation events to any service that supports counters and timers. With a few tweaks, I made Railsd support InstrumentalApp, a favorite service of mine, in addition to Statsd.

Thus came the dilemma. No longer did the (already terrible) name Railsd make sense. As I sat and thought about what to name it, I remembered joking one time about naming a project after myself, so that every time anyone used it they had no choice but to think about me. Thus Nunes was born.

Lest you think that I just wanted to name it Nunes only so that you think of me, here is a bit more detail. Personally, I attempt to instrument everything I can. Be it code, the steps I take, or the calories I consume, I want to know what is going on. I have also noticed that which is automatically instrumented is the easiest to instrument.

I love tracking data so deeply that I want to instrument your code. Really, I do. I want to clone your repo, inject a whole bunch of instrumentation and deploy it to production, so you can know exactly what is going on. I want to sit over your shoulder and look at the graphs with you. Ooooooh, aren’t those some pretty graphs!

But I don’t work for you, or with you, so that would be weird.

Instead, I give you Nunes. I give you Nunes as a reminder that I want to instrument everything and you should too. I give you Nunes so that instrumenting is so easy that you will feel foolish not using it, at least as a start. Go ahead, the first metric is free! Yep, I want you to have that first hit and get addicted, like me.

Using Nunes

I love instrumenting things. Nunes loves instrumenting things. To get started, just add Nunes to your Gemfile:

# be sure to think of me when you do :)
gem "nunes"

Once you have nunes in your bundle (be sure to think of bundling me up with a big hug), you just need to tell nunes to subscribe to all the fancy events and provide him with somewhere to send all the glorious metrics:

# yep, think of me here too
require 'nunes'

# for statsd
statsd = Statsd.new(...)
Nunes.subscribe(statsd) # ooh, ooh, think of me!

# for instrumental
instrumental = Instrumental::Agent.new(...)
Nunes.subscribe(instrumental) # one moooore tiiiime!

With just those couple of lines, you get a whole lot of goodness. Out of the box, Nunes will subscribe to the following Rails instrumentation events:

  • process_action.action_controller
  • render_template.action_view
  • render_partial.action_view
  • deliver.action_mailer
  • receive.action_mailer
  • sql.active_record
  • cache_read.active_support
  • cache_generate.active_support
  • cache_fetch_hit.active_support
  • cache_write.active_support
  • cache_delete.active_support
  • cache_exist?.active_support

Thanks to all the wonderful information those events provide, you will instantly get some of these counter metrics:

  • action_controller.status.200
  • action_controller.format.html
  • action_controller.exception.RuntimeError – where RuntimeError is the class of any exceptions that occur while processing a controller’s action.
  • active_support.cache_hit
  • active_support.cache_miss

And these timer metrics:

  • action_controller.runtime
  • action_controller.view_runtime
  • action_controller.db_runtime
  • action_controller.posts.index.runtime – where posts is the controller and index is the action
  • action_view.app.views.posts.index.html.erb – where app.views.posts.index.html.erb is the path of the view file
  • action_view.app.views.posts._post.html.erb – I can even do partials! woot woot!
  • action_mailer.deliver.post_mailer – where post_mailer is the name of the mailer
  • action_mailer.receive.post_mailer – where post_mailer is the name of the mailer
  • active_record.sql
  • active_record.sql.select – also supported are insert, update, delete, transaction_begin and transaction_commit
  • active_support.cache_read
  • active_support.cache_generate
  • active_support.cache_fetch
  • active_support.cache_fetch_hit
  • active_support.cache_write
  • active_support.cache_delete
  • active_support.cache_exist

But Wait, There is More!

In addition to doing all that work for you out of the box, Nunes will also help you wrap your own code with instrumentation. I know, I know, sounds too good to be true.


class User < ActiveRecord::Base
  extend Nunes::Instrumentable # OH HAI IT IS ME, NUNES

  # wrap save and instrument the timing of it
  instrument_method_time :save
end

This will instrument the timing of the User instance method save. What that means is when you do this:

# the nerve of me to name a user nunes
user = User.new(name: "NUNES!")
user.save

An event named instrument_method_time.nunes will be generated, which in turn is subscribed to and sent to whatever you used to send instrumentation to (statsd, instrumental, etc.). The metric name will default to “class.method”. For the example above, the metric name would be user.save. No fear, you can customize this.

class User < ActiveRecord::Base
  extend Nunes::Instrumentable # never

  # wrap save and instrument the timing of it
  instrument_method_time :save, 'crazy_town.save'
end

Passing a string as the second argument sets the name of the metric. You can also customize the name using a Hash as the second argument.

class User < ActiveRecord::Base
  extend Nunes::Instrumentable # gonna

  # wrap save and instrument the timing of it
  instrument_method_time :save, name: 'crazy_town.save'
end

In addition to name, you can also pass a payload that will get sent along with the generated event.


class User < ActiveRecord::Base
  extend Nunes::Instrumentable # give nunes up

  # wrap save and instrument the timing of it
  instrument_method_time :save, payload: {pay: "loading"}
end

If you subscribe to the event on your own, say to log some things, you’ll get a key named :pay with a value of "loading" in the event’s payload. Pretty neat, eh?
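
For instance, assuming ActiveSupport::Notifications is the backend, a hand-rolled subscriber might look roughly like this sketch (the log format is made up and the payload is just dumped wholesale):

ActiveSupport::Notifications.subscribe('instrument_method_time.nunes') do |name, start, finish, id, payload|
  duration_ms = (finish - start) * 1000
  Rails.logger.info "#{name} took #{duration_ms.round(1)}ms payload=#{payload.inspect}"
end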

Conclusion

I hope you find Nunes useful and that each time you use it, you think of me and how much I want to instrument your code for you, but am not able to. Go forth and instrument!

P.S. If you have ideas for Nunes, create an issue and start some chatter. Let’s make Nunes even better!

An Instrumented Library in ~30 Lines

The Full ~30 Lines

For the first time ever, I am going to lead with the end of the story. Here is the full ~30 lines that I will break down in detail during the rest of this post.

require 'forwardable'

module Foo
  module Instrumenters
    class Noop
      def self.instrument(name, payload = {})
        yield payload if block_given?
      end
    end
  end

  class Client
    extend Forwardable

    def_delegator :@instrumenter, :instrument

    def initialize(options = {})
      # some other setup for the client ...
      @instrumenter = options[:instrumenter] || Instrumenters::Noop
    end

    def execute(args = {})
      instrument('client_execute.foo', args: args) { |payload|
        result = # do some work...
        payload[:result] = result
        result
      }
    end
  end
end

client = Foo::Client.new({
  instrumenter: ActiveSupport::Notifications,
})

client.execute(...) # I AM INSTRUMENTED!!!

The Dark Side

A while back, statsd grabbed a hold of the universe. It swept in like an elf on a unicorn and we all started keeping track of stuff that previously was a pain to keep track of.

Like any wave of awesomeness, it came with a dark side that was felt, but mostly overlooked. Dark side? Statsd? Graphite? You must be crazy! Nope, not me, definitely not crazy this one. Not. At. All.

What did we all start doing in order to inject our measuring? Yep, we started opening up classes in horrible ways and creating hooks into libraries that sometimes change rapidly. Many times, updating a library would cause a break in the stats reporting and require effort to update the hooks.

The Ideal

Now that the wild west is settling a bit, I think some have started to reflect on that wave of awesomeness and realized something.

I no longer want to inject my own instrumentation into your library. Instead, I want to tell your library where it should send the instrumentation.

The great thing is that ActiveSupport::Notifications is pretty spiffy in this regard. By simply allowing your library to talk to an “instrumenter” that responds to instrument with an event name, optional payload, and optional block, you can make all your library’s users really happy.

The great part is:

  1. You do not have to force your users to use active support. They simply need some kind of instrumenter that responds in similar fashion.
  2. They no longer have to monkey patch to get metrics.
  3. You can point them in the right direction as to what is valuable to instrument in your library, since really you know it best.

There are a few good examples of libraries (faraday, excon, etc.) doing this, but I haven’t seen a great post yet, so here is my attempt to point you in what I feel is the right direction.

The Interface

First, like I said above, we do not want to force requiring active support. Rather than require a library, it is always better to require an interface.

The interface that we will require is the one used by active support, but an adapter interface could be created for any instrumenter that we want to support. Here is what it looks like:

instrumenter.instrument(name, payload) { |payload|
  # do some code here that should be instrumented
  # we expect payload to be yielded so that additional 
  # payload entries can be included during the 
  # computation inside the block
}

Second, we have two options for wiring the instrumenter in.

  1. Make it optional: if an instrumenter was provided, call instrument on it; if not, skip instrumentation entirely.
  2. Always have an instrumenter, defaulting to one that does nothing. Aptly, I call this the noop instrumenter. This is the option I prefer.

The Implementation

Let’s pretend our library is named foo, therefore it will be namespaced with the module Foo. I typically namespace the instrumenters in a module as well. Knowing this, our noop instrumenter would look like this:

module Foo
  module Instrumenters
    class Noop
      def self.instrument(name, payload = {})
        yield payload if block_given?
      end
    end
  end
end

As you can see, all this instrumenter does is yield the payload if a block is given. As I mentioned before, we yield payload so that the computation inside the block can add entries to the payload, such as the result.
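
A quick sanity check of that contract, nothing Foo-specific here: the block runs, may mutate the payload, and its return value passes straight through.

result = Foo::Instrumenters::Noop.instrument('anything.foo', args: [1, 2]) { |payload|
  payload[:result] = 1 + 2 # entries added here are simply discarded by Noop
  payload[:result]
}
result # => 3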

Now that we have a default instrumenter, how can we use it? Well, let’s imagine that we have a Client class in foo that is the main entry point for the gem.

module Foo
  class Client
    def initialize(options = {})
      # some other setup for the client ...
      @instrumenter = options[:instrumenter] || Instrumenters::Noop
    end
  end
end

This code simply allows people to pass in the instrumenter that they would like to use through the initialization options. Also, if no instrumenter is provided, we default to our noop version, which just yields the block and moves on.

Note: the use of || instead of #fetch is intentional. It prevents a nil instrumenter from being passed in. There are other ways around this, but I have found using the noop instrumenter in place of nil better than complaining about nil.
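
For contrast, here is the fetch version this guards against; fetch only falls back when the key is entirely absent, so an explicit nil would sneak through:

# would pass nil along if someone supplied :instrumenter => nil
@instrumenter = options.fetch(:instrumenter, Instrumenters::Noop)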

Now that we have an :instrumenter option, someone can quite easily pass in the instrumenter that they would like to use.

client = Foo::Client.new({
  :instrumenter => ActiveSupport::Notifications,
})

Boom! Just like that we’ve allowed people to inject active support notifications, or whatever instrumenter they want into our library. Anyone else getting excited?

Once we have that, we can start instrumenting the valuable parts. Typically, I set up delegation of instrument to the instrumenter using Ruby's forwardable library:

require 'forwardable'

module Foo
  class Client
    extend Forwardable

    # forward instrument calls on this class to @instrumenter, for those
    # unfamiliar with forwardable
    def_delegator :@instrumenter, :instrument

    def initialize(options = {})
      # some other setup for the client ...
      @instrumenter = options[:instrumenter] || Instrumenters::Noop
    end
  end
end

Now we can use the instrument method directly anywhere in our client instance. For example, let’s say that client has a method named execute that we would like to instrument.

module Foo
  class Client
    def execute(args = {})
      instrument('client_execute.foo', args: args) { |payload|
        result = # do some work...
        payload[:result] = result
        result
      }
    end
  end
end

With just a tiny wrap of the instrument method, the users of our library can do a ridiculous amount of instrumentation. For one, note that we pass the args and the result along with the payload. This means our users can create a log subscriber and log each method call with timing, argument, and result information. Incredibly valuable!

They can also create a metrics subscriber that sends the timing information to instrumental, metriks, statsd, or whatever.
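
Such a subscriber stays just as small. Here is a hedged, statsd-flavored sketch; the statsd client and the metric name are assumptions, not part of the library:

ActiveSupport::Notifications.subscribe('client_execute.foo') do |name, start, finish, id, payload|
  # report the duration in milliseconds under an arbitrary metric name
  statsd.timing('foo.client.execute', ((finish - start) * 1000).round)
end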

The Bonus

You can even provide log subscribers and metric subscribers in your library, which means instrumentation for your users is simply a require away. For example, here is the log subscriber I added to cassanity.

require 'securerandom'
require 'active_support/notifications'
require 'active_support/log_subscriber'

module Cassanity
  module Instrumentation
    class LogSubscriber < ::ActiveSupport::LogSubscriber
      def cql(event)
        return unless logger.debug?

        name = '%s (%.1fms)' % ["CQL Query", event.duration]

        # execute arguments are always an array where the first element is the
        # cql string and the rest are the bound variables.
        cql, *args = event.payload[:execute_arguments]
        arguments = args.map { |arg| arg.inspect }.join(', ')

        query = "#{cql}"
        query += " (#{arguments})" unless arguments.empty?

        debug "  #{color(name, CYAN, true)}  [ #{query} ]"
      end
    end
  end
end

Cassanity::Instrumentation::LogSubscriber.attach_to :cassanity

All users of cassanity need to do to get logging of their CQL queries and timings is require a file (and have activesupport in their Gemfile):

require 'cassanity/instrumentation/log_subscriber'

And they get logging goodness like this in their terminal:
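
(The original post showed a screenshot at this point. The line below is an illustrative stand-in; the query is made up, but the shape follows the subscriber's format string above.)

CQL Query (1.2ms)  [ SELECT * FROM gauges WHERE id = ? ("abc123") ]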

The Accuracy

But! BUT, you say. What about the tests? Well, my friend, I have that all wrapped up for you as well. Since it is so easy to pass an instrumenter through to our library, we should probably also have an in-memory instrumenter that keeps track of the events instrumented, so you can test thoroughly and ensure you don’t hose your users with incorrect instrumentation.

The previous sentence was quite a mouthful, so my next one will be short and sweet. For testing, I created an in-memory instrumenter that simply stores each instrumented event with name, payload, and the computed block result for later comparison. Check it:

module Foo
  module Instrumenters
    class Memory
      Event = Struct.new(:name, :payload, :result)

      attr_reader :events

      def initialize
        @events = []
      end

      def instrument(name, payload = {})
        result = if block_given?
          yield payload
        else
          nil
        end

        @events << Event.new(name, payload, result)

        result
      end
    end
  end
end

Now in your tests, you can do something like this when you want to check that your library is correctly instrumenting:

instrumenter = Foo::Instrumenters::Memory.new

client = Foo::Client.new({
  instrumenter: instrumenter,
})

client.execute(...)

payload = {... something .. }
event = instrumenter.events.last

assert_not_nil event
assert_equal 'client_execute.foo', event.name
assert_equal payload, event.payload

The End Result

With two instrumenters (noop, memory) and a belief in interfaces, we have created immense value.


Fin

Go forth and instrument all the things!

Booleans are Baaaaaaaaaad

First off, did you pronounce the title of this article like a sheep? That was definitely the intent. Anyway, onward to the purpose of this here text.

One of the things I have learned the hard way is that booleans are bad. Just to be clear, I do not mean that true/false is bad, but rather that using true/false for state is bad. Rather than rant, let’s look at a concrete example.

An Example

The first example that comes to mind is the ever present user model. On signup, most apps force you to confirm your email address.

To do this, there might be a temptation to add a boolean, let’s say “active”. Active defaults to false and is changed to true upon confirmation of the email. This means your app needs to make sure you are always dealing with active users. Cool. Problem solved.

It might look something like this:

class User
  include MongoMapper::Document

  scope :active, where(:active => true)

  key :active, Boolean
end

To prevent inactive users from using the app, you add a before filter that checks if the current_user is inactive. If they are, you redirect them to a page asking them to confirm their email or resend the email confirmation. Life is grand!
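
That filter might look something like the sketch below; current_user, the helper name, and the route are all assumptions, since the post never shows them.

class ApplicationController < ActionController::Base
  before_filter :require_active_user

  private

  # hypothetical filter: anyone not yet active gets bounced to a page
  # where they can confirm or resend the confirmation email
  def require_active_user
    if current_user && !current_user.active?
      redirect_to confirm_email_path
    end
  end
end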

The Requirements Change

Then, out of nowhere comes an abusive user, let’s name him John. John is a real jerk. He starts harassing your other users by leaving mean comments about their moms.

In order to combat John, you add another boolean, let’s say “abusive”, which defaults to false. You then add code to allow marking a user as abusive. Doing so sets “abusive” to true. You then add code that disallows users who have abusive set to true from adding comments.

The Problem

You now have split state. Should an abusive user really be active? Then a new idea pops into your head. When a user is marked as abusive, let’s also set active to false, so they just can’t use the system. Oh, and when a user is marked as active, let’s make sure that abusive is set to false. Problem solved? Right? RIGHT? Wrong.

You are now maintaining one state with two switches. As requirements change, you end up with more and more situations like this and weird edge cases start to sneak in.

The Solution

How can we improve the situation? Two words: state machine. State machines are awesome. Let’s rework our user model to use the state_machine gem.

class User
  include MongoMapper::Document

  key :state, String

  state_machine :state, :initial => :inactive do
    state :inactive
    state :active
    state :abusive

    event :activate do
      transition all => :active
    end

    event :mark_abusive do
      transition all => :abusive
    end
  end
end

With just the code above, we can now do all of this:

user = User.create
user.active? # false because initial is set to inactive
user.activate!
user.active? # true because we activated
user.mark_abusive!
user.active? # false
user.inactive? # false
user.abusive? # true

User.with_state(:active) # scope to return active
User.with_state(:inactive) # another scope
User.with_state(:abusive) # driving the example home

Pretty cool, eh? You get a lot of bang for the buck. I am just showing the beginning of what you can do, head on over to the readme to see more. You can add guards and all kinds of neat things. Problem solved. Right? RIGHT? Wrong.

Requirements Change Again

Uh oh! Requirements just changed again. Mr. CEO decided that instead of calling people abusive, we want to refer to them as “douchebaggish”.

The app has been wildly successful and you now have millions of users. You have two options:

  1. Leave the code as it is and just change the language in the views. This sucks because then you are constantly translating between the two.
  2. Put up the maintenance page and accept downtime, since you have to push out new code and migrate the data. This sucks, because your app is down, simply because you did not think ahead.

A Better State Machine

Good news. With just a few tweaks, you could have built in the flexibility to handle changing your code without needing to change your data. The state machine gem supports changing the value that is stored in the database.

Instead of hardcoding strings in your database, use integers. Integers allow you to change terminology willy nilly in your app and only change app code. Let’s take a look at how it could work:

class User
  include MongoMapper::Document

  States = {
    :inactive => 0,
    :active => 1,
    :abusive => 2,
  }

  key :state, Integer

  state_machine :state, :initial => :inactive do
    # create states based on our States constant
    States.each do |name, value|
      state name, :value => value
    end

    event :activate do
      transition all => :active
    end

    event :mark_abusive do
      transition all => :abusive
    end
  end
end

With just that slight change, we now are storing state as an integer in our database. This means changing from “abusive” to “douchebaggish” is just a code change like this:

class User
  include MongoMapper::Document

  States = {
    :inactive => 0,
    :active => 1,
    :douchebaggish => 2,
  }

  key :state, Integer

  state_machine :state, :initial => :inactive do
    States.each do |name, value|
      state name, :value => value
    end

    event :activate do
      transition all => :active
    end

    event :mark_douchebaggish do
      transition all => :douchebaggish
    end
  end
end

Update the language in the views, deploy your changes and you are good to go. No downtime. No data migration. Copious amounts of flexibility for little to no more work.

Next time you reach for a boolean in your database, think again. Please! Whip out the state machine gem and wow your friends with your wisdom and foresight.

Four Guidelines That I Feel Have Improved My Code

I have been thinking a lot about isolation, dependencies and clean code of late. I know there is a lot of disagreement with people vehemently standing in both camps.

I certainly will not say either side is right or wrong, but what follows is what I feel has improved my code. I post it here to formalize some recent thoughts and, if I am lucky, get some good feedback.

Before I rush into the gory details, I feel I should mention that I went down this path, not as an architecture astronaut, but out of genuine pain in what I was working on.

My models were growing large. My tests were getting slow. Things did not feel “right”.

I started watching Gary Bernhardt’s Destroy All Software screencasts. He is a big proponent of testing in isolation. Definitely go get a subscription and take a day to get caught up.

On top of DAS, I started reading everything I could on the subject of growing software, clean code and refactoring. When I say reading, I really should say devouring.

I was literally prowling about like a lion, looking for the next book I could devour. Several times my wife asked me to get off my hands and knees and to kindly stop roaring about SRP.

Over the past few months as I have tried to write better code, I have definitely learned a lot. Learning without reflection and writing is not true learning for me.

Reflecting on why something feels better and then writing about it formalizes it in my head and has the added benefit of being available for anyone else who is struggling with the same.

Here are a few guidelines that have jumped out at me over the past few days as I reflected on what I have been practicing the past few months.

Guideline #1. One responsibility to rule them all

Single responsibility principle (SRP) is really hard. I think a lot of us are frustrated and feeling the pain of our chubby <insert your favorite ORM> classes. Something does not feel right. Working on them is hard.

The problem is context. You have to load a lot of context in your brain when you crack open that INFAMOUS user model. That context takes up the space where we would normally create and come up with new solutions.

Create More Classes

So what are we to do? Create more classes. Your models do not need to inherit from ActiveRecord::Base, or include MongoMapper::Document, or whatever.

A model is something that has business logic. Start breaking up your huge models that have persistence bolted on into plain old Ruby classes.

I am not going to lie to you. If you have not been doing this, it will not be easy. Everything will seem like it should just be tucked as another method in a model that also happens to persist data in a store.

Naming is Hard

Another pain point will be naming. Naming is fracking hard. You are welcome for the BSG reference there. I would like to take that statement a step further though.

Naming is hard because our classes and methods are doing too much. The fewer responsibilities your class has, the easier it will be to name, especially after a few months of practice.

An Example

Enough talk, lets see some code. In our track processors, which pop tracks off a queue and store reports in a database, we query for the gauge being tracked before storing reports. The purpose of this query is to ensure that the gauge is in good standing and that we should, in fact, store reports in the database for it.

A lot of people throw the tracking code on their site and never remove it or sign up for a paying account. We do this find to make sure tracking for those people noops, instead of creating tons of data that no one is paying for.

This query happens for each track and it is pulling information that rarely if ever changes. It seemed like a prime spot for a wee bit of caching.

First, I created a tiny service around the memcached client I decided to use. This only took an hour and it means that my application now has an interface for caching (get, set, delete, and fetch). I’ll talk more about this in guideline #3.

Once I had defined the interface Gauges would use for caching, I began to integrate it. After much battling and rewriting of the caching code, each piece felt like it was doing too much and things were getting messy.

I stepped back and thought through my plans. I wanted to cache only the attributes, so I threw everything away and started with that. First, I wanted to be able to read attributes from the data store.

class GaugeAttributeService
  def get(id)
    criteria = {:_id => Plucky.to_object_id(id)}
    if (attrs = gauge_collection.find_one(criteria))
      attrs.delete('_id')
      attrs
    end
  end
end

Given an id, this class returns a hash of attributes. That is pretty much one responsibility. Sweet action. Let’s move on.

Second, I knew that I wanted to add read-through caching for this. Typically read-through caching uses some sort of fetch pattern. Fetch is basically a shortcut for look first in the cache and if it is not there, compute the block, store the computed result in the cache and return the computed result.
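
The post never shows the fetch implementation itself, but based on that description it might look like this sketch:

# a sketch of read-through fetch: return the cached value if present,
# otherwise compute it via the block, store it, and return it
def fetch(key)
  value = get(key)
  if value.nil?
    value = yield
    set(key, value)
  end
  value
end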

If I would have added caching in the GaugeAttributeService class, I would have violated SRP. Describing the class would have been "checks the cache and if not there it fetches from database". Note the use of "and".

As Growing Object Oriented Software states:

Our heuristic is that we should be able to describe what an object does without using any conjunctions (“and,” “or”).

Instead, I created a new class to wrap (or decorate) my original service.

class GaugeAttributeServiceWithCaching
  def initialize(attribute_service = GaugeAttributeService.new)
    @attribute_service = attribute_service
  end

  def get(id)
    cache_service.fetch(cache_key(id)) {
      @attribute_service.get(id)
    }
  end
end

I left a few bits out of this class so we can focus on the important part, which is that all we do with this class is wrap the original one with a cache fetch.

As you can see, naming is pretty easy for this class. It is a gauge attribute service with caching and is named as such. It initializes with an object that must respond to get. Note also that it defaults to an instance of GaugeAttributeService.

Unit testing this class is easy as well. We can isolate the dependencies (attribute_service and cache_service) in the unit test and make sure that they do what we expect (fetch and get).

Note: There definitely could be a point made that "with" is the same as "and" and therefore means that we are breaking SRP. Naming is hard, really hard. Rather than get mired forever in naming, I rolled with this convention and, at this point, it does not bother me. I am definitely open to suggestions. Another name I played with was CachedGaugeAttributeService.

Below is an example setup with new dependencies injected in the test that help us verify this class’s behavior in isolation.

attributes = {'title' => 'GitHub'}

attribute_service = Class.new do
  # define_method is used so the block closes over the attributes local;
  # a plain def would raise a NameError here
  define_method(:get) { |id| attributes }
end.new

cache_service = Class.new do
  def fetch(key)
    get(key) || yield
  end

  def get(key)
  end
end.new

service = GaugeAttributeServiceWithCaching.new(attribute_service)
service.cache_service = cache_service

Above I used dynamic classes. Instead of dynamic classes, one could use stubbing or whatever. I’ll talk more about cache_service= later.

Decorating in this manner means we can easily find without caching by using GaugeAttributeService or with caching by using GaugeAttributeServiceWithCaching.

The important thing to note is that we added new functionality to our application by extending existing parts instead of changing them. I read recently, but cannot find the quote, that if you can add a new feature purely by extending existing classes and creating new classes, you are winning.

Guideline #2. Use accessors for collaborators

In the example above, you probably noticed that when testing GaugeAttributeServiceWithCaching, I changed the cache service used by assigning a new one. What I often see is others using some top-level config or, even worse, an actual $ global.

# bad
Gauges.cache = Memcached.new
class GaugeAttributeServiceWithCaching
  def get(id)
    Gauges.cache.fetch(cache_key(id)) { … }
  end
end

# worse
$cache = Memcached.new
class GaugeAttributeServiceWithCaching
  def get(id)
    $cache.fetch(cache_key(id)) { … }
  end
end

What sucks about this is you are coupling this class to a global and coupling leads to pain. Instead, what I have started doing is using accessors to setup collaborators. Here is the example from above, but now with the cache service accessors included.


class GaugeAttributeServiceWithCaching
  attr_writer :cache_service

  def cache_service
    @cache_service ||= CacheService.new
  end
end

By doing this, we get a sane, memoized default for our cache service (CacheService.new) and the ability to change that default (cache_service=), either in our application or when unit testing.
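
In use, that looks something like this; FakeCacheService is a hypothetical stand-in for a test double:

service = GaugeAttributeServiceWithCaching.new
service.cache_service # => memoized CacheService.new, the sane default

service.cache_service = FakeCacheService.new # swap the collaborator in a test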

Finding ourselves doing this quite often, we created a library, aptly named Morphine. Right now it does little more than what I just showed (memoized default and writer method to change).

As I have started to use this gem, I am getting more ideas for things that would be helpful. Here is the same code as above, but using Morphine. What I like about it, over a memoized method and an attr_writer is that it feels a little more declarative and creates a standard way of declaring collaborators for classes.


class GaugeAttributeServiceWithCaching
  include Morphine

  register :cache_service do
    CacheService.new
  end
end

Note also that I am not passing these dependencies in through initialize. At first I started with that and it looked something like this:

class GaugeAttributeServiceWithCaching
  def initialize(attribute_service = GaugeAttributeService.new, 
                 cache_service = CacheService.new)
    @attribute_service = attribute_service
    @cache_service = cache_service
  end
end

Personally, over time I found this method tedious. My general guideline is pass a dependency through initialize when you are going to decorate it, otherwise use accessors. Let’s look at the attribute service with caching again.

class GaugeAttributeServiceWithCaching
  include Morphine

  register :cache_service do
    CacheService.new
  end

  def initialize(attribute_service = GaugeAttributeService.new)
    @attribute_service = attribute_service
  end
end

Since this class is decorating an attribute service with caching, I pass in the service we want to decorate through initialize. I do not, however, pass in the cache service through initialize. Instead, the cache service uses Morphine (or accessors).

First, I think this makes the intent more obvious. The intent of this class is to wrap another object, so that object should be provided to initialize. Defaulting the service to wrap is merely a convenience.

Second, the cache service is a dependency, but not one that is being wrapped. It purely needs a sane default and a way to be replaced, therefore it uses Morphine (or accessors).

I cannot say this is a hard and fast rule that everyone should follow and that you are wrong if you do not. I can say that through trial and error, following this guideline has led to the least amount of friction while maintaining flexibility and isolation.

Guideline #3. Create real interfaces

As I mentioned above, the first thing I started with when working on the caching code was an interface for caching for the application, rather than just using a client directly. Occasionally what I see people do is create an interface, but wholesale pass arguments through to a client like so:

# bad idea
class CacheService
  def initialize(driver)
    @driver = driver
  end

  def get(*args)
    @driver.get(*args)
  end

  def set(*args)
    @driver.set(*args)
  end

  def delete(*args)
    @driver.delete(*args)
  end
end

In my opinion, this is abstracting at the wrong level. All you are doing is adding a layer of indirection on top of a driver. It makes it harder to follow and any exceptions that the driver raises will be raised in your application. Also, any parameters that the driver works with, your interface will work with. There is no point in doing this.

Instead, create a real interface. Define the methods and parameters you want your application to be able to use and make that work with whatever driver you end up choosing or changing to down the road.

Handling Exceptions

First, I created the exceptions that would be raised if anything goes wrong.

class CacheService

  class Error < StandardError
    attr_reader :original

    def initialize(original = $!)
      if original.nil?
        super
      else
        super(original.message)
      end
      @original = original
    end
  end

  class NotFound < Error; end
  class NotStored < Error; end

end

CacheService::Error is the base that all other errors inherit from. It wraps whatever the original error was, instead of discarding it, and defaults to the last exception that was raised ($!). I will show how these are used in a bit.

Portability and serialization

I knew that I wanted the cache to be portable, so instead of just defaulting to Marshal’ing, I used only raw operations and ensured that I wrapped all raw operations with serialize and deserialize, where appropriate.

In order to allow this cache service class to work with multiple serialization methods, I registered a serializer dependency, instead of just using MultiJson’s dump and load directly. I then added convenience methods (serialize and deserialize) that handle a few oddities induced by the driver I am wrapping.

class CacheService
  include Morphine

  register :serializer do
    Serializers::Json.new
  end

  private

  def serialize(value)
    serializer.serialize(value)
  end

  def deserialize(value)
    if value.is_a?(Hash) # get with multiple keys
      value.each { |k, v| value[k] = deserialize(v) }
      value
    else
      serializer.deserialize(value)
    end
  end
end

Handling exceptions (continued)

I then created a few private methods that hit the driver and wrap exceptions. These private methods are what the public methods use to ensure that exceptions are properly handled and such.

class CacheService
  private

  def driver_read(keys)
    deserialize(@driver.get(keys, false))
  rescue Memcached::NotFound
    raise NotFound
  rescue Memcached::Error
    raise Error
  end

  def driver_write(method, key, value)
    @driver.send method, key, serialize(value), DefaultTTL.call, false
  rescue Memcached::NotStored
    raise NotStored
  rescue Memcached::Error
    raise Error
  end

  def driver_delete(key)
    @driver.delete(key)
  rescue Memcached::NotFound
    raise NotFound
  end
end

At this point, no driver specific exceptions should ever bubble outside of the cache service. When using the cache service in the application, I need only worry about handling the cache service exceptions and not the specific driver exceptions.

If I change to a different driver, only this class changes. The rest of my application stays the same. Big win. How many times have you upgraded a gem and then had to update pieces all over your application because they willy-nilly changed their interface?

The public interface

All that is left is to define the public methods and parameters that can be used in the application.

class CacheService
  def get(keys)
    driver_read(keys)
  rescue NotFound
    nil
  end

  def set(key, value)
    driver_write :set, key, value
  end

  def delete(key)
    driver_delete key
  rescue NotFound
    nil
  end
end

At this point, the application has a defined interface that it can work with for caching and, for the most part, does not need to worry about exceptions, as they are wrapped and, in some cases, even handled (i.e., nil for NotFound).

Creating real interfaces ensures that expectations are set and upgrades are easy. Defined interfaces give other developers on the project confidence that if they follow the rules, things will work as expected.
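
To make the payoff concrete, here is a hypothetical caller’s-eye view; the constructor argument is an assumption, since the post never shows how CacheService receives its driver:

cache = CacheService.new(Memcached.new('localhost:11211'))
cache.set('gauge:123', 'title' => 'GitHub')
cache.get('gauge:123')   # => {"title" => "GitHub"} after the JSON round trip
cache.get('missing-key') # => nil, because #get rescues NotFound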

Guideline #4. Test the whole way through

Whatever you want to call them, you need tests that prove all your components are wired together and working as expected, in the same manner as they will be used in production.

The reason a lot of developers have felt pain with pure unit testing and isolation is because they forget to add that secondary layer of tests on top that ensure that the way things are wired together works too.

Unit tests are there to drive our design. Acceptance tests are there to make sure that things are actually working the whole way through. Each of these are essential and not to be skipped over.

If you are having problems testing, it may be your design. If you are getting burned by isolation, you are probably missing higher level tests. You should be able to kill your unit tests and still have reasonable confidence that your system is working.

Nowadays, I often start with a high level test and then work my way in unit testing the pieces as I make them. I’ve found this keeps me focused on the value I am adding and ensures that my coverage is good.
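
As a hedged sketch, a top-layer test for the caching service from earlier might look like this, with real collaborators and no doubles (gauge_id is an assumed fixture):

def test_get_returns_attributes_the_whole_way_through
  service = GaugeAttributeServiceWithCaching.new # real database, real cache
  attrs = service.get(gauge_id)                  # first call reads the database
  assert_equal attrs, service.get(gauge_id)      # second call comes from cache
end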

Conclusion

While it has definitely taken a lot of trial and error, I am starting to find the right balance between flexibility, isolation and overkill.

  1. Stick to single responsibilities.
  2. Inject decorated dependencies through initialization and use accessors for other dependencies.
  3. Create real interfaces.
  4. Test in isolation and the whole way through.

Follow these guidelines and I believe you will start to feel better about the code you are writing, as I have over the past few months.

I would love to hear what others of you are doing and see examples. Comment below with gists, github urls, and other thoughts. Thanks!

Misleading Title About Queueing

I don’t know about you, but I find it super frustrating when people blog about cool stuff at the beginning of a project, but then as it grows, they either don’t take the time to teach or they get all protective about what they are doing.

I am going to do my best to continue to discuss the strategies we are using to grow Gauges. I hope you find them useful and, by all means, if you have tips or ideas, hit me. Without any further ado…

March 1st of last year (2011), we launched Gauges. March 1st of this year (a few days ago), we finally switched to a queue for track requests. Yes, for one full year, we did all report generation in the track request.

1. In the Beginning

My goal for Gauges in the beginning was realtime. I wanted data to be so freakin’ up-to-date that it blew people’s minds. What I’ve realized over the past year of talking to customers is that sometimes Gauges is so realtime, it is too realtime.

That is definitely not to say that we are going to work on slowing Gauges down. It’s more that my priorities are shifting. As more and more websites use Gauges to track, availability moves more and more to the front of my mind.

Gut Detects Issue

A few weeks back, with much help from friends (Brandon Keepers, Jesse Newland, Kyle Banker, Eric Lindvall, and the top notch dudes at Fastest Forward), I started digging into some performance issues that were getting increasingly worse. They weren’t bad yet, but I had this gut feeling they would be soon.

My gut was right. Our disk io utilization on our primary database doubled from January to February, which was also our biggest growth in terms of number of track requests. If we doubled again from February to March, it was not going to be pretty.

Back to the Beginning

From the beginning, Gauges built all tracking reports on the fly in the track request. When a track came in, Gauges did a few queries and then performed around 5-10 updates.

When you are small, this is fine, but as growth happens, updating live during a track request can become an issue. I had no way to throttle traffic to the database. This meant if we had enough large sites start tracking at once, most likely our primary database would say uncle.

As you can guess, if your primary says uncle, you start losing tracking data. In my mind, priority number one is now to never lose tracking data. In order to do this effectively, I felt we were finally at the point where we needed to separate tracking from reporting.

2. Availability Takes Front Seat

My goal is for tracking to never be down. If, occasionally, you can’t get to your reporting data, or if, occasionally, your data gets behind for a few minutes, I will survive. If, however, tracking requests start getting tossed to the wayside while the primary screams for help, I will not.

I talked with some friends and found Kestrel to be very highly recommended, particularly by Eric (linked above). He swore by it, and was pushing it harder than we needed to, so I decided to give it a try.

A few hours later, my lacking JVM skills (Kestrel is Scala) were rearing their head big time. I still had not figured out how to build or run the darn thing. I posted to the mailing list, where someone quickly pointed out that Kestrel defaults to /var for logging, data, etc. and, unfortunately, spits out no error on startup about lacking permissions on OSX. One sudo !! later and I was in business.

3. Kestrel

Before I get too far along with this fairy tale, let’s talk about Kestrel — what is it and why did I pick it?

Kestrel is a simple, distributed message queue, based on Blaine Cook’s starling. Here are a few great paragraphs from the readme:

Each server handles a set of reliable, ordered message queues. When you put a cluster of these servers together, with no cross communication, and pick a server at random whenever you do a set or get, you end up with a reliable, loosely ordered message queue.

In many situations, loose ordering is sufficient. Dropping the requirement on cross communication makes it horizontally scale to infinity and beyond: no multicast, no clustering, no “elections”, no coordination at all. No talking! Shhh!

It features the memcached protocol, is durable (journaled), has fanout queues, item expiration, and even supports transactional reads.

My favorite thing about Kestrel? It is simple, soooo simple. Sound too good to be true? Probably is, but the honeymoon has been great so far.

Now that we’ve covered what Kestrel is and that it is amazing, let’s talk about how I rolled it out.

4. Architecture

Here is the general idea. The app writes track requests to the tracking service. Workers process off those track requests and generate the reports in the primary database.

After the primary database writes, we send the information through a pusher proxy process, which sends it off to pusher.com, the service that provides all the live web socket goodness that is in Gauges. (The original post included a sketch of the architecture here.)

That probably all makes sense, but remember that we weren’t starting from scratch. We already had servers setup that were tracking requests and I needed to ensure that was uninterrupted.

5. Rollout

Brandon and I have been on a tiny classes and services kick of late. What I am about to say may sound heretical, but we’ve felt that we need a few more layers in our apps. We’ve started using Gauges as a test bed for this stuff, while also spending a lot of time reading about clean code and design patterns.

We decided to create a tiny standardization around exposing services and choosing which one gets used in which environment. Brandon took the standardization and moved it into a gem where we could start trying stuff and share it with others. It isn’t much now, but we haven’t needed it to be.

Declaring Services

We created a Registry class for Gauges, which defined the various pieces we would use for Kestrel. It looked something like this:

class Registry
  include Morphine

  register :track_service do
    KestrelTrackService.new(kestrel_client, track_config['queue'])
  end

  register :track_processor do
    KestrelTrackProcessor.new(blocking_kestrel_client, track_config['queue'])
  end
end

We then store an instance of this register in Gauges.app. We probably should have named it Gauges.registry, but we can worry about that later.

At this point, what we did probably seems pointless. The kestrel track service and processor look something like this:

class KestrelTrackService
  def initialize(client, queue)
    @client = client
    @queue  = queue
  end

  def record(attrs)
    @client.set(@queue, MessagePack.pack(attrs))
  end
end

class KestrelTrackProcessor
  def initialize(client, queue)
    @client = client
    @queue = queue
  end

  def run
    loop { process }
  end

  def process
    record @client.get(@queue)
  end

  def record(data)
    Hit.record(MessagePack.unpack(data))
  end
end

The processor uses a blocking kestrel client, which is just a decorator of the vanilla kestrel client. As you can see, all we are doing is wrapping the kestrel-client and making it send the data to the right place.
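
The blocking decorator itself is not shown in the post, but the idea is roughly this sketch; the class name and polling strategy are assumptions, not the kestrel-client gem’s actual implementation:

class BlockingKestrelClient
  def initialize(client)
    @client = client
  end

  # turn the vanilla non-blocking get into one that waits for a message
  def get(queue)
    loop do
      message = @client.get(queue)
      return message if message
      sleep 0.25 # back off briefly while the queue is empty
    end
  end
end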

Using Services

We then used the track_service in our TrackApp like this:

class TrackApp < Sinatra::Base
  get '/track.gif' do
    # stuff
    Gauges.app.track_service.record(track_attrs)
    # more stuff
  end
end

Then, in our track_processor.rb process, we started the processor like so:

Gauges.app.track_processor.run

Like any good programmer, I knew that we couldn’t just push this to production and cross our fingers. Instead, I wanted to roll it out to work like normal, but also push track requests to kestrel. This would allow me to see kestrel receiving jobs.

On top of that, I also wanted to deploy the track processors to pop track requests off. At this point, I didn’t want them to actually process those track requests and write to the database, I just wanted to make sure the whole system was wired up correctly and stuff was flowing through it.

Another important piece was seeing how many track requests we could store in memory with Kestrel, based on our configuration, and how it performed when it used up all the allocated memory and started going to disk.

Service Magic

The extra layer around tracking and processing proved to be super helpful. Note that the above examples use the new Kestrel system, but I wanted to push this out and go through a verification process first. To do that, we created a real-time track service:

class RealtimeTrackService
  def record(attrs)
    Hit.record(attrs)
  end
end

This would allow us to change the track_service in the registry to perform as it currently was in production. Now, we have two services that know how to record track requests in a particular way. What I needed next was to use both of these services at the same time so I created a multi track service:

class MultiTrackService
  include Enumerable

  def initialize(*services)
    @services = services
  end

  def record(attrs)
    each { |service| service.record(attrs) }
  end

  def each
    @services.each do |service|
      yield service
    end
  end
end

This multi track service allowed me to record to both services for a single track request. The updated registry looked something like this:

class Registry
  include Morphine

  register :track_service do
    which = track_config.fetch(:service, :realtime)
    send("#{which}_track_service")
  end

  register :multi_track_service do
    MultiTrackService.new(realtime_track_service, kestrel_track_service)
  end

  register :realtime_track_service do
    RealtimeTrackService.new
  end

  register :kestrel_track_service do
    KestrelTrackService.new(kestrel_client, track_config['queue'])
  end
end

Note that now, track_service selects which service to use based on the config. All I had to do was update the config to use “multi” as the track service and we were performing realtime track requests while queueing them in Kestrel at the same time.

The only thing left was to beef up failure handling around the Kestrel service so that it was limited in how it could affect production. For this, I chose to catch failures, log them, and move on as if they didn’t happen.

class KestrelTrackService

  def initialize(client, queue, options={})
    @client = client
    @queue  = queue
    @logger = options.fetch(:logger, Logger.new(STDOUT))
  end

  def record(attrs)
    begin
      @client.set(@queue, MessagePack.pack(attrs))
    rescue => e
      log_failure(attrs, e)
      :error
    end
  end

  private

  def log_failure(attrs, exception)
    @logger.info "attrs: #{attrs.inspect}  exception: #{exception.inspect}"
  end
end

I also had a lot of instrumentation in the various track services, so that I could verify counts at a later point. These verification counts would prove whether or not things were working. I left that out as it doesn’t help the article, but you definitely want to verify things when you roll them out.

Now that the track service was ready to go, I needed a way to ensure that messages would flow through the track processors without actually modifying data. I used a similar technique as above. I created a new processor, aptly titled NoopTrackProcessor.

class NoopTrackProcessor < KestrelTrackProcessor
  def record(data)
    # don't actually record
    # instead, just run verification
  end
end

The noop track processor just inherits from the kestrel track processor and overrides the record method to run verification instead of generating reports.

Next, I adjusted the registry to allow flipping the processor that is used based on the config.

class Registry
  include Morphine

  register :track_processor do
    which = track_config.fetch(:processor, :noop)
    send("#{which}_track_processor")
  end

  register :kestrel_track_processor do
    KestrelTrackProcessor.new(blocking_kestrel_client, track_config['queue'])
  end

  register :noop_track_processor do
    NoopTrackProcessor.new(blocking_kestrel_client, track_config['queue'])
  end
end

With those changes in place, I could now set the track service to multi, the track processor to noop, and I was good to deploy. So I did. And it was wonderful.
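
The config itself is not shown in the post, but based on how the registry reads it, the rollout shape would be something like this (note the symbol keys for service and processor and the string key for the queue, mirroring the registry code):

def track_config
  {
    :service   => :multi, # realtime + kestrel at the same time
    :processor => :noop,  # pop messages, verify, write nothing
    'queue'    => 'tracks',
  }
end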

6. Verification

For the first few hours, I ran the multi track service and turned off the track processors. This created the effect of queueing and never dequeueing. The point was to see how many messages kestrel could hold in memory and how it performed once messages started going to disk.

I used scout realtime to watch things during the evening while enjoying some of my favorite TV shows. A few hours and almost 530k track requests later, Kestrel hit disk and hummed along like nothing had happened.

Now that I had a better handle of Kestrel, I turned the track processors back on. Within a few minutes they had popped all the messages off. Remember, at this point, I was still just noop’ing in the track processors. All reports were still being built in the track request.

I let the multi track service and noop track processors run through the night and by morning, when I checked my graphs, I felt pretty confident. I removed the error suppression from the kestrel service and flipped both track service and track processor to kestrel in the config.

One more deploy and we were queueing all track requests in Kestrel and popping them off in the track processors after which, the reports were updated in the primary database. This meant our track request now performed a single Kestrel set, instead of several queries and updates. As you would expect, response times dropped like a rock.

It is pretty obvious when Kestrel was rolled out as the graph went perfectly flat and dropped to ~4ms response times. BOOM.

You might say, yeah, your track requests are now fast, but your track processors are doing the same work that the app was doing before. You would be correct. Sometimes growing is just about moving slowness into a more manageable place, until you have time to fix it.

This change did not just move slowness to a different place though. It separated tracking and reporting. We can now turn the track processors off, make adjustments to the database, turn them back on, and instantly, they start working through the back log of track requests queued up while the database was down. No tracking data lost.

I only showed you a handful of things that we instrumented to verify things were working. Another key metric for us, since we aim to be as close to realtime as possible, is the amount of time that it takes to go from queued to processing.

Based on the numbers, it takes us around 500ms right now. I believe as long as we keep that number under a second, most people will have no clue that we aren’t doing everything live.

7. Conclusion

By no means are we where I want us to be availability-wise, but at least we are one more step in the right direction. Hopefully this article gives you a better idea how to roll things out into production safely. Layers are good. Whether you are using Rails, Sinatra, or some other language entirely, layer services so that you can easily change them.

Also, we are now a few days in and Kestrel is a beast. Much thanks to Robey for writing it and Twitter for open sourcing it!

More Tiny Classes

My last post, Keep ’Em Separated, made me realize I should start sharing more about what we are doing to make Gauges maintainable. This post is another in the same vein.

Gauges allows you to share a gauge with someone else by email. That email does not have to exist in the system before you add it, because nothing is more annoying than wanting to share something with a friend or co-worker, but first having to get them to sign up for the service.

If the email address is found, we add the user to the gauge and notify them that they have been added.

If the email address is not found, we create an invite and then send an email to notify them they should sign up, so they can see the data.

The Problem: McUggo Route

The aforementioned sharing logic isn’t difficult, but it was just enough that our share route was getting uggo. It started off looking something like this:

post('/gauges/:id/shares') do
  gauge = Gauge.get(params['id'])

  if user = User.first_by_email(params['email'])
    Stats.increment('shares.existing')
    gauge.add_user(user)
    ShareWithExistingUserMailer.new(gauge, user).deliver
    {:share => SharePresenter.new(gauge, user)}.to_json
  else
    invite = gauge.invite(params['email'])
    Stats.increment('shares.new')
    ShareWithNewUserMailer.new(gauge, invite).deliver
    {:share => SharePresenter.new(gauge, invite)}.to_json
  end
end

Let’s be honest. We’ve all seen Rails controller actions and Sinatra routes that are fantastically worse, but this was really burning my eyes, so I charged our programming butler to refactor it.

The Solution: Move Logic to Separate Class

We talked some ideas through, and once he had finished, the route looked more like this:

post('/gauges/:id/shares') do
  gauge    = Gauge.get(params['id'])
  sharer   = GaugeSharer.new(gauge, params['email'])
  receiver = sharer.perform
  {:share => SharePresenter.new(gauge, receiver)}.to_json
end

Perfect? Who cares. Waaaaaaaaay better? Yes. The concern of a user existing or not is moved away to a place where the route couldn’t care less.

Also, the bonus is that sharing a gauge can now be used without invoking a route.

So what does GaugeSharer look like?

class GaugeSharer
  def initialize(gauge, email)
    @gauge = gauge
    @email = email
  end

  def user
    @user ||= … # user from database
  end

  def existing?
    user.present?
  end

  def perform
    if existing?
      share_with_existing_user
    else
      share_with_invitee
    end
  end

  def share_with_existing_user
    # add user to gauge
    ShareWithExistingUserMailer.new(@gauge, user).deliver
    user
  end

  def share_with_invitee
    invite = ... # invite to db
    ShareWithNewUserMailer.new(@gauge, invite).deliver
    invite
  end
end

Now, instead of having several higher-level tests to check each piece of logic, we can just ensure that GaugeSharer is invoked correctly in the route test and then test the crap out of GaugeSharer with unit tests. We can also use GaugeSharer anywhere else in the application that we want to.
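
If you are curious what those unit tests might look like, here is a minimal sketch using minitest. Our actual suite differs; the fake subclass exists purely so the dispatch in perform can be tested without touching the database or the mailers.

require 'minitest/autorun'

class GaugeSharerPerformTest < Minitest::Test
  # Fake out the collaborators so we can test the branching in isolation.
  class FakeSharer < GaugeSharer
    attr_writer :existing
    attr_reader :shared_with

    def existing?
      @existing
    end

    def share_with_existing_user
      @shared_with = :user
    end

    def share_with_invitee
      @shared_with = :invite
    end
  end

  def test_shares_with_existing_user
    sharer = FakeSharer.new(:gauge, 'friend@example.com')
    sharer.existing = true
    sharer.perform
    assert_equal :user, sharer.shared_with
  end

  def test_invites_new_email
    sharer = FakeSharer.new(:gauge, 'friend@example.com')
    sharer.existing = false
    sharer.perform
    assert_equal :invite, sharer.shared_with
  end
end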

This isn’t a dramatic change in code, but it has a dramatic effect on the coder. Moving all these bits into separate classes and tiny methods improves ease of testing and, probably more importantly, ease of grokking for another developer, including yourself at a later point in time.

Keep ’Em Separated

Note: If you end up enjoying this post, you should do two things: sign up for Pusher and then subscribe to Destroy All Software screencasts. I’m not telling you to do this because I get referrals; I just really like both services.

For those that do not know, Gauges currently uses Pusher.com for flinging around all the traffic live.

Every track request to Gauges sends a request to Pusher. We do this using EventMachine in a thread, as I have previously written about.

The Problem

The downside of this is that when you get to the point we were at (thousands of requests a minute), there are so many pusher notifications to send (also thousands a minute) that the EM thread starts stealing a lot of time from the main request thread. You end up with random slow requests that have one to five seconds of “uninstrumented” time. Definitely not a happy scaler does this make.

In the past, we had talked about keeping track of which gauges were actually being watched and only sending a notification for those, but never actually did anything about it.

The Solution

Recently, Pusher added web hooks on channel occupy and channel vacate. This, combined with a growing number of slow requests, was just the motivation I needed to come up with a solution.

We (@bkeepers and I) started by mapping a simple route to a class.

class PusherApp < BaseApp
  post '/pusher/ping' do
    webhook = Pusher::WebHook.new(request)
    if webhook.valid?
      PusherPing.receive(webhook)
      'ok'
    else
      status 401
      'invalid'
    end
  end
end

Using a simple class method like this moves all logic out of the route and into a place that is easier to test. The receive method iterates the events and runs each ping individually.

class PusherPing
  def self.receive(webhook)
    webhook.events.each do |event|
      new(event, webhook.time).run
    end
  end
end

At first, we had something like this for each PusherPing instance.

class PusherPing
  def initialize(event, time)
    @event         = event || {}
    @time          = time
    @event_name    = @event['name']
    @event_channel = @event['channel']
  end

  def run
    case @event_name
    when 'channel_occupied'
      occupied
    when 'channel_vacated'
      vacated
    end
  end

  def occupied
    update(@time)
  end

  def vacated
    update(nil)
  end

  def update(value)
    # update the gauge in the
    # db with the value
  end
end

We pushed out the change so we could start marking gauges as occupied. We then forced a browser refresh, which effectively vacated and re-occupied all gauges people were watching.

Once we knew the occupied state of each gauge was correct, we added the code to only send the request to Pusher on track if a gauge was occupied.
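
The guard itself is a one-liner. Something along these lines, where occupied? and channel are hypothetical methods backed by the occupied state we had just started storing:

# Sketch: skip the Pusher request entirely for gauges no one is watching.
def notify(gauge, doc)
  return unless gauge.occupied?
  Pusher[gauge.channel].trigger_async('hit', doc)
end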

Deploy. Celebrate. Booyeah.

The New Problem

Then, less than a day later, we realized that pusher doesn’t guarantee the order of events. Imagine someone vacating and then occupying a gauge, but receiving the occupy first and then the vacate.

This situation would mean that live tracking would never turn on for the gauge. Indeed, it started happening to a few people, who quickly let us know.

The New Solution

We figured it was better to send a few extra notifications than never send any, so we decided to “occupy” gauges on our own when people loaded up the Gauges dashboard.

We started in and quickly realized the error of our ways in the pusher ping. Having the database calls directly tied to the PusherPing class meant that we had two options:

  1. Use the PusherPing class to occupy a gauge when the dashboard loads, which just felt wrong.
  2. Re-write it to separate the occupying and vacating of a gauge from the PusherPing class.

Since we are good little developers, we went with 2. We created a GaugeOccupier class that looks like this:

class GaugeOccupier
  attr_reader :ids

  def initialize(*ids)
    @ids = ids.flatten.compact.uniq
  end

  def occupy(time=Time.now.utc)
    update(time)
  end

  def vacate
    update(nil)
  end

private

  def update(value)
    return if @ids.blank?
    # do the db updates
  end
end

We tested that class on its own quite quickly and refactored the PusherPing to use it.

class PusherPing
  def run
    case @event_name
    when 'channel_occupied'
      GaugeOccupier.new(gauge_id).occupy(@time)
    when 'channel_vacated'
      GaugeOccupier.new(gauge_id).vacate
    end
  end
end

Boom. PusherPing now worked the same and we had a way to “occupy” gauges separate from the PusherPing. We added the occupy logic to the correct point in our app like so:

ids = gauges.map { |gauge| gauge.id }
GaugeOccupier.new(ids).occupy

At this point, we were now “occupied” more than “vacated”, which is good. However, you may have noticed that we still had the issue where someone loads the dashboard, we occupy the gauge, but then receive a delayed, or what I will now refer to as “stale”, hook.

To fix the stale hook issue, we simply added a bit of logic to the PusherPing class to detect staleness and ignore the ping if it is stale.

class PusherPing
  def run
    return if stale?
    # do occupy/vacate
  end

  def stale?
    return false if gauge.occupied_at.blank?
    gauge.occupied_at > @time
  end
end

Closing Thoughts

This is by no means a perfect solution. There are still other holes. For example, a gauge could be occupied by us after we receive a vacate hook from pusher and stay in an “occupied” state, sending notifications that no one is looking for.

To fix that issue, we can add a cleanup cron or something that occasionally gets all occupied channels from pusher and vacates gauges that are not in the list.
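
If we ever do, the job might look something like this sketch. It assumes the pusher gem exposes the channel index via Pusher.get('/channels'), that channels are named by gauge id, and that Gauge.occupied is a finder for gauges we think are occupied; none of that is our actual code.

# Ask Pusher which channels are really occupied and vacate the rest.
occupied_channels = Pusher.get('/channels')['channels'].keys

stale_ids = Gauge.occupied.reject { |gauge|
  occupied_channels.include?("gauge-#{gauge.id}")
}.map { |gauge| gauge.id }

GaugeOccupier.new(stale_ids).vacate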

We decided it wasn’t worth the time. We pushed out the occupy fix and are now reaping the benefits of sending about 1/6th of the pusher requests we were before. This means our EventMachine thread is doing less work, which gives our main thread more time to process requests.

You might think us crazy for sending hundreds of http requests in a thread that shares time with the main request thread, but it is actually working quite well.

We know that some day we will have to move this to a queue and an external process that processes the queue, but that day is not today. Instead, we can focus on the next round of features that will blow people’s socks off.

What a Year

The last 12 months have been nuts. My health and professional/personal life were completely at odds.

Between January and August, I had three hernia surgeries. As if that wasn’t enough for one year, the last few months of the year I’ve been plagued by a few other ailments (which are still giving me a hard time). Definitely a rough stretch. I will never take health for granted again and really look forward to getting back to “normal”.

Quite the contrary to my health, Ordered List grew from 2 to 5 people, helped Zynga launch Words with Friends on Facebook, launched Gauges and Speaker Deck while improving Harmony, and, finally, was acquired by the only other company in the world I wanted to be a part of, GitHub.

Here is to a healthy 2012.

Acquired

Several times over the past few years, I have stated that GitHub is probably the only other place I could see myself working. Today, it is official. All of Ordered List has joined GitHub.

Maybe someday I’ll write about what Ordered List has meant to me, but today I am going to fully enjoy the present, instead of rambling about the past. I have no doubt great things will come of this.

You can read more at GitHub and Ordered List.

Creating an API

A few weeks back, we publicly released the Gauges API. Despite building Gauges from the ground up as an API, it was a lot of work. You really have to cross your t’s and dot your i’s when releasing an API.

1. Document as You Build

We made the mistake of documenting after most of the build was done. The problem is documenting sucks. Leaving that pain until the end, when you are excited to release it, makes doing the work twice as hard. Thankfully, we have a closer on our team who powered through it.

2. Be Consistent

As we documented the API, we noticed a lot of inconsistencies. For example, in some places we returned a hash and in others an array. Upon realizing these issues, we started making some rules.

To solve the array/hash issue, we decided that every response should return a hash. This is the most flexible solution going forward. It allows us to inject new keys without having to convert the response or release a whole new version of the API.

Changing from an array to a hash meant that we needed to namespace the array with a key. We then noticed that some places were namespaced and others weren’t. Again, we decided on a rule. In this case, all top-level objects should be namespaced, but objects referenced from a top-level object or a collection of several objects do not require namespacing.

{users:[{user:{...}}, {user:{...}}]} // nope
{users:[{...}, {...}]} // yep
{username: 'jnunemaker'} // nope
{user: {username:'jnunemaker'}} // yep 

You get the idea. Consistency is important. It is not so much how you do it as that you always do it the same way.

3. Provide the URLs

Most of my initial open source work was wrapping APIs. The one thing that always annoyed me was having to generate URLs. Each resource should know the URLs that matter. For example, a user resource in Gauges has a few URLs that can be called to get various data:

{
  "user": {
    "name": "John Doe",
    "urls": {
      "self": "https://secure.gaug.es/me",
      "gauges": "https://secure.gaug.es/gauges",
      "clients": "https://secure.gaug.es/clients"
    },
    "id": "4e206261e5947c1d38000001",
    "last_name": "Doe",
    "email": "john@doe.com",
    "first_name": "John"
  }
}

The previous JSON is the response of the resource /me. /me returns data about the authenticated user and the URLs to update itself (self), get all gauges (/gauges), and get all API clients (/clients). Let’s say next you request /gauges. Each gauge returned has the URLs to get more data about the gauge.

{
  "gauges": [
    {
      // various attributes
      "urls": {
        "self":"https://secure.gaug.es/gauges/4ea97a8be5947ccda1000001",
        "referrers":"https://secure.gaug.es/gauges/4ea97a8be5947ccda1000001/referrers",
        "technology":"https://secure.gaug.es/gauges/4ea97a8be5947ccda1000001/technology",
        // ... etc
      }
    }
  ]
}

We thought this would prove helpful. We’ll see in the long run if it turns out to work well.

4. Present the Data

Finally, never ever use to_json and friends from a controller or Sinatra get/post/put block. As a bare minimum rule, the second you start calling to_json with :methods, :except, :only, or any of the other options, you want to move that logic to a separate class.

For Gauges, we call these classes presenters. For example, here is a simplified version of the UserPresenter.

class UserPresenter
  def initialize(user)
    @user = user
  end

  def as_json(*)
    {
      'id'          => @user.id,
      'email'       => @user.email,
      'name'        => @user.name,
      'first_name'  => @user.first_name,
      'last_name'   => @user.last_name,
      'urls'        => {
        'self'    => "#{Gauges.api_url}/me",
        'gauges'  => "#{Gauges.api_url}/gauges",
        'clients' => "#{Gauges.api_url}/clients",
      }
    }
  end
end

Nothing fancy. Just a simple Ruby class that sits in app/presenters. Here is what the /me route looks like in our Sinatra app.

get('/me') do
  content_type(:json)
  sign_in_required
  {:user => UserPresenter.new(current_user)}.to_json
end

This simple presentation layer makes it really easy to test the responses in detail using unit tests and then just have a single integration test that makes sure overall things look good. I’ve found this tiny layer a breath of fresh air.
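
For example, a detailed response test can be as small as this sketch, with OpenStruct standing in for a real user and Gauges.api_url assumed to be set in the test environment:

require 'minitest/autorun'
require 'ostruct'

class UserPresenterTest < Minitest::Test
  def test_as_json_presents_user_with_urls
    user = OpenStruct.new(:id => 1, :name => 'John Doe', :email => 'john@doe.com',
                          :first_name => 'John', :last_name => 'Doe')
    json = UserPresenter.new(user).as_json

    assert_equal 'john@doe.com', json['email']
    assert_equal %w(clients gauges self), json['urls'].keys.sort
  end
end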

I am sure that nothing above was shocking or awe-inspiring, but I hope that it saves you some time on your next public API.

Stupid Simple Debugging

There are all kinds of fancy debugging tools out there, but personally, I get the most mileage out of good old puts statements.

When I started with Ruby, several years ago, I used puts like this to debug:

puts account.inspect

The problem with this is twofold. First, if you have a few puts statements, you don’t know which output belongs to which object. This always led me to doing something like this:

puts "account: #{account.inspect}"

Second, depending on whether you are just in Ruby or running an app through a web server, puts is sometimes swallowed. This often led me to do something like this when using Rails:

Rails.logger.debug "account: #{account.inspect}"

Now, not only do I have to think about which method to use to debug something, I also have to think about where the output will be sent so I can watch for it.

Enter Log Buddy

Then, one fateful afternoon, I stumbled across log buddy (gem install log_buddy). In every project, whether it be a library, Rails app, or Sinatra app, one of the first gems I throw in my Gemfile is log_buddy.

Once you have the gem installed, you can tell log buddy where your log file is and whether or not to actually log like so:

LogBuddy.init({
  :logger   => Gauges.logger,
  :disabled => Gauges.production?,
})

Simply provide log buddy with a logger, tell it when it should stay silent (production, for example), and you get some nice bang for your buck.

One Method, One Character

First, log buddy adds a nice and short method named d. d is 4X shorter than puts, so right off the bat you get some productivity gains. The d method takes any argument and calls inspect on it. Short and sweet.

d account # will puts account.inspect
d 'Some message' # will puts "Some message"

The cool part is that on top of printing the inspected object to stdout, it also logs it to the logger provided in LogBuddy.init. No more thinking about which method to use or where output will be sent. One method, output to multiple places.

This is nice, but it won’t win you any new friends. Where log buddy gets really cool is when you pass it a block.

d { account } # puts and logs account = <Account ...>

Again, one method, output to stdout and your log file, but when you use a block, it does magic to print out the variable name and the inspected value. You can also pass in several objects, separating them with semi-colons.

d { account; account.creator; current_user }

This gives you each variable on its own line with the name and inspected value. Nothing fancy, but log buddy has saved me a lot of time over the past year. I figured it was time I sent it some love.

Counters Everywhere, Part 2

In Counters Everywhere, I talked about how to handle counting lots of things using single documents in Mongo. In this post, I am going to cover the flip side—counting things when there are an unlimited number of variations.

Force the Data into a Document Using Ranges

Recently, we added window and browser dimensions to Gaug.es. Screen width has far fewer variations, as there are only so many screens out there. However, browser width and height can vary wildly, as everyone out there has their browser open just a wee bit differently.

I knew that storing all widths or heights in a single document wouldn’t work because the number of variations was too high. That said, we pride ourselves at Ordered List on thinking through things so our users don’t have to.

Does anyone really care if someone visited their site with a browser open exactly 746 pixels wide? No. Instead, what matters is which ranges of widths are visiting their site. Knowing this, we plotted out what we considered the most important ranges of widths (320, 480, 800, 1024, 1280, 1440, 1600, > 2000) and heights (480, 600, 768, 900, > 1024).

Instead of storing each exact pixel width, we figure out which range the width is in and do an increment on that. This allows us to receive a lot of varying widths and heights, but keep them all in one single document.
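
The bucketing itself is only a couple of lines. A sketch of snapping a raw browser width to one of those ranges and incrementing it (the bx key and _id format match the example document below; collection, site_id, and month are assumed):

# Snap a raw pixel width to the closest defined bucket; anything over
# the last boundary gets lumped into the top bucket.
BUCKETS = [320, 480, 800, 1024, 1280, 1440, 1600, 2000]

def bucket_for(width)
  BUCKETS.find { |max| width <= max } || 2000
end

bucket_for(746) # => 800

# One upsert increments the right counter in the per-site document.
collection.update(
  {'_id' => "#{site_id}:#{month}"},
  {'$inc' => {"bx.#{bucket_for(746)}" => 1}},
  :upsert => true
)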

{
  "sx" => {
    "320"  => 237,
    "480"  => 367,
    "800"  => 258,
    "1024" => 2273,
    "1280" => 10885,
    "1440" => 6144
    "1600" => 13607,
    "2000" => 2154,
  },
  "bx" => {
    "320"  => 121,
    "480"  => 390,
    "800"  => 3424,
    "1024" => 9790,
    "1280" => 11125,
    "1440" => 3989
    "1600" => 6757,
    "2000" => 301,
  },
  "by" => {
    "480"  => 3940,
    "600"  => 13496,
    "768"  => 8184,
    "900"  => 6718,
    "1024" => 3516
  },
}

I would call this first method for storing a large number of variations cheating, but in this instance, cheating works great.

When You Can’t Cheat

Where the single document model falls down is when you do not know the number of variations, or at least know that it could grow past 500-1000. Seeing how efficient the single document model was, I initially tried to store content and referrers in the same way.

I created one document per day per site and it had a key for each unique piece of content or referring url with a value that was an incrementing number of how many times it was hit.

It worked great. Insanely small storage and no secondary indexes were needed, so really light on RAM. Then, a few larger sites signed up that were getting 100k views a day and had 5-10k unique pieces of content a day. This hurt for a few reasons.

First, wildly varying document sizes. Mongo pads documents a bit, so they can be modified without moving on disk. If a document grows larger than the padding, it has to be moved. Obviously, the more you hit the disk the slower things are, just as the more you go across the network the slower things are. Having some documents with 100 keys and others with 10k made it hard for Mongo to learn the correct padding size, because there was no correct size.

Second, when you have all the content for a day in one doc and have to send 10k urls plus page titles across the wire just to show the top fifteen, you end up with some slowness. One site consistently had documents that were over a MB in size. I quickly realized this was not going to work long term.

In our case, we always write data in one way and always read data in one way. This meant I needed an index I could use for writes and one that I could use for reads. I’ll get this out of the way right now: if I had it to do over again, I would definitely do it differently. I’m doing some stupid stuff, but we’ll talk more about that later.

The keys for each piece of content are the site_id (sid), path (p), views (v), date (d), title (t), and hash (h). Most of those should be obvious, save hash. Hash is a crc32 of the path. Paths vary quite a bit in length, so indexing something of consistent size is nice.

For writes, the index is [['sid', 1], ['d', -1], ['h', 1]] and for reads the index is [['sid', 1], ['d', -1], ['v', -1]]. This allows me to upsert based on site, date, and hash for writes, and then read the data by site, date, and views descending, which is exactly what it looks like when we show content to the user.
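
In code, the two paths look roughly like this sketch, where Zlib.crc32 stands in for however the hash is computed and collection is the month-scoped collection described in the next paragraph:

require 'zlib'

hash = Zlib.crc32(path)

# Write path: one upsert per hit, lining up with the [sid, d, h] index.
collection.update(
  {'sid' => site_id, 'd' => date, 'h' => hash},
  {'$set' => {'p' => path, 't' => title}, '$inc' => {'v' => 1}},
  :upsert => true
)

# Read path: top content for a site and day, lining up with [sid, d, v].
collection.find('sid' => site_id, 'd' => date).sort([['v', -1]]).limit(15)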

As mentioned in the previous post, I do a bit of range based partitioning as well, keeping a collection per month. Overall, this is working great for content, referrers and search terms on Gaug.es.

Learning from Mistakes

So what would I do differently if given a clean slate? Each piece of content and referring URL has an _id key that I did not mention. It is never used in any way, but _id is automatically indexed. Having millions of documents each month, each with an _id that is never used, starts to add up. Obviously, it isn’t really hurting us now, but I see it as wasteful.

Also, each document has a date. Remember that the collection is already partitioned by month (i.e. c.2011.7 for July), yet hilariously, I store the full date with each document like so: yyyy-mm-dd. 90% of that string is completely useless. I could more easily store the day as an integer and ignore the year and month.

Having learned my lesson on content and referrers, I switched things up a bit for search terms. Search terms are stored per month, which means we don’t need the day. Instead of having a shorter but meaningless _id, I opted to use something that I knew would be unique, even though it was a bit longer.

The _id I chose was “site_id:hash” where hash is a crc32 of the search term. This is conveniently the same as the fields that are upserted on, which combined with the fact that _id is always indexed means that we no longer need a secondary index for writes.

I still store the site_id in the document so that I can have a compound secondary index on site_id (sid) and views (v) for reads. Remember that the collection is scoped by month, and that we always show the user search terms for a given month, so all we really need is which terms were viewed the most for the given site, thus the index is [['sid', 1], ['v', -1]].

Hope that all makes sense. The gist is rather than have an _id that is never used, I moved the write index to _id, since it will always be unique anyway, which means one less secondary index and no wasted RAM.
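
Concretely, the search term write might look like this sketch (field names assumed to mirror the content example, with crc32 again standing in for the hash):

require 'zlib'

# _id doubles as the write index: unique per site and term, always indexed.
# sid is duplicated inside the document for the [sid, v] read index.
collection.update(
  {'_id' => "#{site_id}:#{Zlib.crc32(term)}"},
  {'$set' => {'sid' => site_id, 't' => term}, '$inc' => {'v' => 1}},
  :upsert => true
)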

Interesting Finding

The only other interesting thing about all this is our memory usage. Our index size is now ~1.6GB, but the server is only using around ~120MB of RAM. How can that be you ask? We’ve all heard that you need to have at least as much RAM as your index size, right?

The cool thing is you don’t. You only need as much RAM as your active set of data. Gaug.es is very write heavy, but people pretty much only care about recent data. Very rarely do they page back in time.

What this means is that our active set is what is currently being written and read, which in our case is almost the exact same thing. The really fun part is that I can actually get this number to go up and down just by adjusting the number of results we show per page for content, referrers and search terms.

If we show 100 per page, we use more memory than 50 per page. The reason is that people click on top content often to see what is doing well, which continually loads in the top 100 or 50, but they rarely click back in time. This means that the active set is the first 100 or 50, depending on what the per page is. Those documents stay in RAM, but older pages get pushed out for new writes and are never really requested again.

I literally have a graph that shows our memory usage drop in half when we moved pagination from the client-side to the server-side. I thought it was interesting, so figured I would mention it.

As always, if you aren’t using Gaug.es yet, be sure to give the free trial a spin!

Counters Everywhere

Last week, coming off hernia surgery number two of the year (and hopefully the last for a while), I eased back into development by working on Gaug.es.

In three days, I cranked out tracking of three new features. The only reason this was possible is because I have tried, failed, and succeeded on repeat at storing various stats efficiently in Mongo.

While I will be using Mongo as the examples for this article, most of it could very easily be applied to any data store that supports incrementing numbers.

How are you going to use the data?

The great thing about the boom of new data stores is the flexibility that most provide regarding storage models. Whereas SQL is about normalizing the storage of data and then flexibly querying it, NoSQL is about thinking how you will query data and then flexibly storing it.

This flexibility is great, but it means if you do not fully understand how you will be accessing data, you can really muck things up. If, on the other hand, you do understand your data and how it is accessed, you can do some really fun stuff.

So how do we access data on Gaug.es? Depends on the feature (views, browsers, platforms, screen resolutions, content, referrers, etc.), but it can mostly be broken down into these points:

  • Time frame resolution. What resolution is needed? To the month? Day? Hour? Which piece of content was viewed the most matters on a per day basis, but which browser is winning the war only matters per month, or maybe even over several months.
  • Number of variations. Browsers have a finite number of variations (Chrome, Firefox, Safari, IE, Opera, Other). Content is completely the opposite, as it varies drastically from website to website.

Knowing that resolution and variation drive how we need to present data is really important.

One document to rule them all

Due to the amount of data a hosted stats service has to deal with, most store each hit and then process them into reports on intervals. This leads to delays between something happening on your site and you finding out, as reports can be hours or even a day behind. This always bothered me and is why I am working really hard at making Gaug.es completely live.

Ideally, you should be able to check stats anytime and know exactly what just happened. Email newsletter? Watch the traffic pour in a few minutes after you hit send. Post to your blog? See how quickly people pick it up on Twitter and in feed readers.

In order to provide access to data in real-time, we have to store and retrieve our data differently. Instead of storing every hit and all the details and then processing those hits, we make decisions and build reports as each hit comes in.

Resolution and Variations

What kind of decisions? Exactly what I mentioned above.

First, we determine what resolution a feature needs. Top content and referrers need to be stored per day for at least a month. After that, month is probably a good enough resolution.

Browsers and screen sizes are far less interesting on a per day basis. Typically, these are only used a few times a year to make decisions such as dropping IE 6 support or deciding to target 1024×768 instead of 800×600 (remember that back in the day?).

Second, we determine the variations. Content and referrers vary greatly on a per site basis, but we can choose the browsers and screen dimensions to track. For example, with browsers, we picked Chrome, Safari, Firefox, Opera, IE, and then we lump the rest of the browsers into Other. Do I really care how many people visit RailsTips in Konqueror? Nope, so why even show it.

The same goes for platforms. We track Mac, Windows, Linux, iPhone, iPad, iPod, Android, Blackberry, and Other.

Document Model

Knowing that we only have 6 variations of browsers and 9 variations of platforms to track, and that the list is not likely to grow much, I store all of them in one document per month per site. This means showing someone browser and/or platform data for an entire month is one query for a very tiny document that looks like this:

{
  '_id' => 'site_id:month',
  'browsers' => {
    'safari' => {
      '5-0' => 5,
      '4-1' => 2,
    },
    'ie' => {
      '9-0' => 5,
      '8-0' => 2,
      '7-0' => 1,
      '6-0' => 1,
    }
  },
  'platforms' => {
    'macintosh' => 10,
    'windows'   => 5,
    'linux'     => 2,
  },
}

When a track request comes in, I parse the user agent to get the browser, version, and platform. We only store the major and minor parts of the version. Who cares about 12.0.1.2? What matters is 12.0. This means we end up with 5-10 versions per month per browser instead of 50 or 100. Also, note that Mongo does not allow dots in key names, so I store the dot as a hyphen, thus 12-0.
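
The version normalization itself is a one-liner. A sketch:

# Keep major and minor only, and swap the dot for a hyphen since
# Mongo key names cannot contain dots.
def version_key(raw_version)
  raw_version.split('.').first(2).join('-')
end

version_key('12.0.1.2') # => "12-0"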

I then do a single query on that document to increment the platform and browser/version.

query  = {'_id' => "#{hit.site_id}:#{hit.month}"}
update = {'$inc' => {
  "b.#{browser_name}.#{browser_version}" => 1,
  "p.#{platform}" => 1,
}}
collection(hit.created_on).update(query, update, :upsert => true)

b and p are short for browser and platform. No need to waste space. (The example document above spells out the keys for readability.) The dot syntax in the strings in the update hash tells Mongo to reach into the document and increment a value for a key inside of a hash.

Also, the _id (or primary key) of the document is the site id and the month since the two together are always unique. There is no need to store a BSON ObjectId or incrementing number, as the data is always accessed for a given site and month. _id is automatically indexed in Mongo and it is the only thing that we query on, so there is no need for secondary indexes.

Range based partitioning

I also do a bit of range based partitioning at the collection level (i.e. technology.2011, technology.2012). That is why I pass the date of the hit to the collection method. The collection that stores the browser and platform information is split by year. Maybe unnecessary looking back at it, but it hurts nothing. It means that a given collection stores number of sites * 12 documents at a maximum.
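
The collection method behind this is tiny. Roughly, with db being the Mongo database handle:

# Route each hit's stats to the yearly collection based on its date.
def collection(date)
  db["technology.#{date.year}"]
end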

Mongo creates collections on the fly, so when a new year comes along, the new collection will be created automatically. As years go by, we can create smaller summary documents and drop the old collections or move them to another physical server (which is often easier and more performant than removing old data from an active collection).

Because I know that the number of variations is small (< 100-ish), I know that the overall document size is not really going to grow and that it will always efficiently fly across the wire. When you have relatively controllable data like browsers/platforms, storing it all in one document works great.

Closing Thoughts

As I said before, this article is using Mongo as an example. If you wanted to use Redis, Membase or something else with atomic incrementing, you could just have one key per month per site per browser.

Building reports on the fly through incrementing counters means:

  • less storage, as you do not need the raw data
  • less RAM, as there are fewer secondary indexes
  • real-time querying is no problem, as you do not need to generate reports, the data is the report

It definitely involves more thought up front, but several areas of Gaug.es use this pattern and it is working great. I should also note that it increases the number of writes. Creating the reports on the fly means 7 or 8 writes for each “view” instead of 1.

The trade off is that reading the data is faster and avoids the lag caused by having to post-process it. I can see a day in the future where having all these writes will force me to find a different solution, but that is a ways off.

What do you do when you cannot limit the number of variations? I’ll leave that for next time.

Oh, and if you have not signed up for Gaug.es yet, what are you waiting on? Do it!

EventMachine and Passenger

In order to fully explain this post, we first need to cover some back story. Originally, Gaug.es was hosted on Heroku. Recently, we moved Gaug.es to RailsMachine (before the great AWS outage luckily), where we are already happily hosting Harmony.

At Heroku, we were running on 1.9.2 and Thin. The most common RailsMachine stack is REE 1.8 and Passenger. Sticking with the common stack meant it would be a far easier and faster transition to RailsMachine, so we tweaked a few things and switched.

Heroku, Thin, and EventMachine

While at Heroku, we had been testing PusherApp for live updating of analytics as they occurred. The pusher gem has two ways to trigger notifications: trigger (net/http) and trigger_async (em-http-request).

Since Heroku runs on Thin, we used trigger_async. This meant that sending the PusherApp notifications in the request cycle was fine, as they did not block.

One of the changes when moving to RailsMachine was switching from trigger_async to trigger. Obviously, having an external HTTP request in your request path is less than ideal, but backgrounding it seemed to go against the whole idea of “live”.

Our response times for Gaug.es average around 5ms, so even with 75-100ms for each PusherApp request, we were still in a normally acceptable response time range (not ok with me, but ok for now).

Pusher Conversation

I contacted the fine folks at Pusher and asked if they had any suggestions. One suggestion Martyn mentioned was Thread.new { EM.run }.

Given my lack of experience with threads and EventMachine, at first this struck me as dirty/scary and I was not sure if he was serious.

I did a bit of research and discovered he was not only serious, but that people were doing it. The AMQP gem even recommends it in the README.

Hmmm, This Might Actually Work

After a bit of googling and scouring code on GitHub, I found a few different solutions. I started hacking and got something that was “working” pretty quickly. Quite intrigued, I decided to hit up someone smarter than I, Aman Gupta, who maintains the EventMachine and AMQP gems.

He confirmed that it would work and recommended a few tweaks. Yesterday, I pushed it to production and thus far it is working great. Below is the code needed to make the magic happen.

module GaugesEM
  def self.start
    if defined?(PhusionPassenger)
      PhusionPassenger.on_event(:starting_worker_process) do |forked|
        if forked && EM.reactor_running?
          EM.stop
        end
        Thread.new { EM.run }
        die_gracefully_on_signal
      end
    end
  end

  def self.die_gracefully_on_signal
    Signal.trap("INT")  { EM.stop }
    Signal.trap("TERM") { EM.stop }
  end
end

GaugesEM.start

Gaug.es is 100% Sinatra, so I just put this in the file in Gaug.es that works like environment.rb or an initializer would in Rails.

There are two key parts. First, if we are running on Passenger and using smart spawning, we need to stop the EventMachine reactor if it is already running in the forked process. Second, we create a new thread and start the EventMachine loop.

Now, in the Notification class that we have in Gaug.es, I can do the following to make the Pusher request not block the main request.

EM.next_tick {
  Pusher[channel].trigger_async('hit', doc)
}

The main request carries on as usual and does not wait for the Pusher request to finish. In the background, EventMachine is sending all these notifications. Once again, even on Passenger, we now have non-blocking Pusher notifications.

Hmm, This Does Work

Since it took me a bit to figure it out, I thought I would post it here for everyone to benefit from and maybe to start some discussion. If you have suggestions or see glaring issues, please let me know.

I have no assumptions that I am wise or that this is perfect, but thus far it is getting the job done with no adverse effects.

Misleading Graph of Proof

Below is a graph of response times for Gaug.es thanks to New Relic. Seriously, where would we be without New Relic! Green is the time spent in external requests. I am sure you can tell at which point I pushed out the EventMachine integration.

That said, don’t think that all that time is instantly gone. It is still happening, just in a thread in the background without much effect, if any, on our normal response times.

Demo, Plz!

If you are curious about what the live updating looks like currently in Gaug.es, I posted a short video a few weeks back.

Also, if you too are addicted to analytics, you should definitely sign up and try it out. Lots of good stuff coming down the pipe!

SSH Tunneling in Ruby

The other day I wanted to do some queries in production, but our servers are pretty locked down to the outside world. I was well aware that I could just make an ssh tunnel to connect to the database server, but I decided I wanted to do it in Ruby.

I am not the brightest of crayons in the box, so it took me a bit. Since I struggled with it for a few, I figured others probably will someday as well and decided to post my solution here.

Obviously, replace the <…> strings with your own information and change the host and port information in the gateway.open call.

require 'net/ssh/gateway'
require 'mongo' # Mongo::Connection below comes from the mongo gem

gateway = Net::SSH::Gateway.new('<myremotehostorip.com>', '<remote_user>')

# Open port 27018 to forward to 127.0.0.1:27017
# on the remote host provided above
gateway.open('127.0.0.1', 27017, 27018)

# Connect to local port set in previous statement
conn = Mongo::Connection.new('127.0.0.1', 27018)

# Just printing out stats to show that it works
puts conn.db('<database_name>').stats.inspect

gateway.shutdown!

With just a few lines of Ruby, I can make scripts that use my local ssh key to talk to production. Thanks go to Jamis Buck for all the heavy lifting of writing net-ssh and company.