up and running with cassandra

Cassandra is a hybrid non-relational database in the same class as Google's BigTable. It is more featureful
than a key/value store like Riak, but supports fewer query types than a document store like MongoDB.

Cassandra was started by Facebook and later transferred to the open-source community. It is an ideal runtime database for web-scale domains like social networks.

This post is both a tutorial and a "getting started" overview. You will learn about Cassandra's features, data model, API, and operational requirements—everything you need to know to deploy a Cassandra-backed service.

May 11, 2010: post updated for Cassandra gem 0.8 and Cassandra version 0.6.

features

There are a number of reasons to choose Cassandra for your website. Compared to other databases, three big features stand out:

  • Flexible schema: with Cassandra, like a document store, you don't have to decide what fields you need in your records ahead of time. You can add and remove arbitrary fields on the fly. This is an incredible productivity boost, especially in large deployments.
  • True scalability: Cassandra scales horizontally in the purest sense. To add more capacity to a cluster, turn on another machine. You don't have to restart any processes, change your application queries, or manually relocate any data.
  • Multi-datacenter awareness: you can adjust your node layout to ensure that if one datacenter burns in a fire, an alternative datacenter will have at least one full copy of every record.

Some other features that help put Cassandra above the competition:

  • Range queries: unlike most key/value stores, you can query for ordered ranges of keys.
  • List datastructures: super columns add a 5th dimension to the hybrid model, turning columns into lists. This is very handy for things like per-user indexes.
  • Distributed writes: you can read and write any data to anywhere in the cluster at any time. There is never any single point of failure.

installation

You need a Unix system. If you are using Mac OS 10.5, all you need is Git. Otherwise, you need to install Java 1.6, Git 1.6, Ruby, and Rubygems in some reasonable way.

Start a terminal and run:

sudo gem install cassandra

If you are using Mac OS, you need to export the following environment variables:

export JAVA_HOME="/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home"
export PATH="/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/bin:$PATH"

Now you can build and start a test server with cassandra_helper:

cassandra_helper cassandra

It runs!

live demo

The above script boots the server with a schema that we can interact with. Open another terminal window and start irb, the Ruby shell:

irb

In the irb prompt, require the library:

require 'rubygems'
require 'cassandra'
include SimpleUUID

Now instantiate a client object:

twitter = Cassandra.new('Twitter')

Let's insert a few things:

user = {'screen_name' => 'buttonscat'}
twitter.insert(:Users, '5', user)  

tweet1 = {'text' => 'Nom nom nom nom nom.', 'user_id' => '5'}
twitter.insert(:Statuses, '1', tweet1)

tweet2 = {'text' => '@evan Zzzz....', 'user_id' => '5', 'reply_to_id' => '8'}
twitter.insert(:Statuses, '2', tweet2)

Notice that the two status records do not have all the same columns. Let's go ahead and connect them to our user record:

twitter.insert(:UserRelationships, '5', {'user_timeline' => {UUID.new => '1'}})
twitter.insert(:UserRelationships, '5', {'user_timeline' => {UUID.new => '2'}})

The UUID.new call creates a collation key based on the current time; our tweet ids are stored in the values.

Now we can query our user's tweets:

timeline = twitter.get(:UserRelationships, '5', 'user_timeline', :reversed => true)
timeline.map { |time, id| twitter.get(:Statuses, id, 'text') }
# => ["@evan Zzzz....", "Nom nom nom nom nom."]

Two tweet bodies, returned in recency order—not bad at all. In a similar fashion, each time a user tweets, we could loop through their followers and insert the status key into their follower's home_timeline relationship, for handling general status delivery.
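That fan-out can be sketched in plain Ruby, with hashes standing in for the UserRelationships column family (the follower list, the `deliver` helper, and the `home_timeline` name are all hypothetical, following the pattern above):

```ruby
followers = { '5' => ['7', '9'] }           # author_id => follower ids (made up)
timelines = Hash.new { |h, k| h[k] = {} }   # follower_id => {sort_key => status_id}

def deliver(status_id, author_id, followers, timelines)
  followers.fetch(author_id, []).each do |follower_id|
    # With the cassandra gem, this step would be roughly:
    #   twitter.insert(:UserRelationships, follower_id,
    #     {'home_timeline' => {UUID.new => status_id}})
    timelines[follower_id][Time.now.to_f] = status_id
  end
end

deliver('1', '5', followers, timelines)
timelines['7'].values  # => ["1"]
```

Each follower pays one cheap write at tweet time, so reading a home timeline later is a single-row query.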

the data model

Cassandra is best thought of as a 4 or 5 dimensional hash. The usual way to refer to a piece of data is as follows: a keyspace, a column family, a key, an optional super column, and a column. At the end of that chain lies a single, lonely value.
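The same chain can be written out as a nested Ruby hash (purely illustrative; this is the shape of the data model, not a real API):

```ruby
data = {
  'Twitter' => {                          # keyspace
    'Users' => {                          # column family
      '5' => {                            # key
        'screen_name' => 'buttonscat'     # column name => value
      }
    },
    'UserRelationships' => {              # super column family
      '5' => {                            # key
        'user_timeline' => {              # super column (the optional 5th level)
          'uuid-1' => '1'                 # column name => value
        }
      }
    }
  }
}

data['Twitter']['Users']['5']['screen_name']                          # => "buttonscat"
data['Twitter']['UserRelationships']['5']['user_timeline']['uuid-1']  # => "1"
```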

Let's break down what these layers mean.

  • Keyspace (also confusingly called "table"): the outer-most level of organization. This is usually the name of the application. For example, 'Twitter' and 'Wordpress' are both good keyspaces. Keyspaces must be defined at startup in the storage-conf.xml file.

  • Column family: a slice of data corresponding to a particular key. Each column family is stored in a separate file on disk, so it can be useful to put frequently accessed data in one column family, and rarely accessed data in another. Some good column family names might be :Posts, :Users and :UserAudits. Column families must be defined at startup.

  • Key: the permanent name of the record. You can query over ranges of keys in a column family, like :start => '10050', :finish => '10070'—this is the only index Cassandra provides for free. Keys are defined on the fly.

After the column family level, the organization can diverge—this is a feature unique to Cassandra. You can choose either:

  • A column: this is a tuple with a name and a value. Good columns might be 'screen_name' => 'lisa4718' or 'Google' => 'http://google.com'.

    It is common to not specify a particular column name when requesting a key; the response will then be an ordered hash of all columns. For example, querying for (:Users, '174927') might return:

    {'name' => 'Lisa Jones', 
     'gender' => 'f', 
     'screen_name' => 'lisa4718'}
    

    In this case, name, gender, and screen_name are all column names. Columns are defined on the fly, and different records can have different sets of column names, even in the same keyspace and column family. This lets you use the column name itself as either structure or data. Columns can be stored in recency order, or alphabetical by name, and all columns keep a timestamp.

  • A super column: this is a named list. It contains standard columns, stored in recency order.

    Say Lisa Jones has bookmarks in several categories. Querying (:UserBookmarks, '174927') might return:

    {'work' => {
        'Google' => 'http://google.com', 
        'IBM' => 'http://ibm.com'}, 
     'todo' => {...}, 
     'cooking' => {...}}
    

    Here, work, todo, and cooking are all super column names. They are defined on the fly, and there can be any number of them per row. :UserBookmarks is the name of the super column family. Super columns are stored in alphabetical order, with their sub columns physically adjacent on the disk.

Super columns and standard columns cannot be mixed at the same (4th) level of dimensionality. You must define at startup which column families contain standard columns, and which contain super columns with standard columns inside them.

Super columns are a great way to store one-to-many indexes to other records: make the sub column names TimeUUIDs (or whatever you'd like to use to sort the index), and have the values be the foreign key. We saw an example of this strategy in the demo, above.

If this is confusing, don't worry. We'll now look at two example schemas in depth.

twitter schema

Here is the schema definition we used for the demo, above. It is based on Eric Florenzano's Twissandra:

<Keyspace Name="Twitter">
  <ColumnFamily CompareWith="UTF8Type" Name="Statuses" />
  <ColumnFamily CompareWith="UTF8Type" Name="StatusAudits" />
  <ColumnFamily CompareWith="UTF8Type" Name="StatusRelationships"
    CompareSubcolumnsWith="TimeUUIDType" ColumnType="Super" />  
  <ColumnFamily CompareWith="UTF8Type" Name="Users" />
  <ColumnFamily CompareWith="UTF8Type" Name="UserRelationships"
    CompareSubcolumnsWith="TimeUUIDType" ColumnType="Super" />
</Keyspace>

What could be in StatusRelationships? Maybe a list of users who favorited the tweet? Having a super column family for both record types lets us index each direction of whatever many-to-many relationships we come up with.

Here's how the data is organized:

[diagram: layout of the Twitter keyspace]

Cassandra lets you distribute the keys across the cluster either randomly, or in order, via the Partitioner option in the storage-conf.xml file.

For the Twitter application, if we were using the order-preserving partitioner, all recent statuses would be stored on the same node. This would cause hotspots. Instead, we should use the random partitioner.

Alternatively, we could preface the status keys with the user key, which has less temporal locality. If we used user_id:status_id as the status key, we could do range queries on the user fragment to get tweets-by-user, avoiding the need for a user_timeline super column.
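A sorted array can stand in for the order-preserving partitioner to show why this works (the keys and range bounds here are hypothetical):

```ruby
# Composite keys of the form "user_id:status_id", kept in key order.
keys = ['4:9', '5:1', '5:2', '6:1'].sort

# A range scan over the user fragment recovers one user's tweets.
user_5_tweets = keys.select { |k| k.start_with?('5:') }
# With the cassandra gem, the equivalent would be roughly:
#   twitter.get_range(:Statuses, :start => '5:', :finish => '5:~')
user_5_tweets  # => ["5:1", "5:2"]
```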

multi-blog schema

Here's another schema, suggested to me by Jonathan Ellis, the primary Cassandra maintainer. It's for a multi-tenancy blog platform:

<Keyspace Name="Multiblog">      
  <ColumnFamily CompareWith="TimeUUIDType" Name="Blogs" />
  <ColumnFamily CompareWith="TimeUUIDType" Name="Comments"/>
</Keyspace>

Imagine we have a blog named 'The Cutest Kittens'. We will insert a row when the first post is made as follows:

require 'rubygems'
require 'cassandra'
include SimpleUUID

multiblog = Cassandra.new('Multiblog')

multiblog.insert(:Blogs, 'The Cutest Kittens',
  { UUID.new => 
    '{"title":"Say Hello to Buttons Cat","body":"Buttons is a cute cat."}' })

UUID.new generates a unique, sortable column name, and the JSON hash contains the post details. Let's insert another:

multiblog.insert(:Blogs, 'The Cutest Kittens',
  { UUID.new => 
    '{"title":"Introducing Commie Cat","body":"Commie is also a cute cat"}' })

Now we can find the latest post with the following query:

post = multiblog.get(:Blogs, 'The Cutest Kittens', :reversed => true).to_a.first

On our website, we can build links based on the readable representation of the UUID:

guid = post.first.to_guid
# => "b06e80b0-8c61-11de-8287-c1fa647fd821"

If the user clicks this string in a permalink, our app can find the post directly via:

multiblog.get(:Blogs, 'The Cutest Kittens', :start => UUID.new(guid), :count => 1)

For comments, we'll use the post UUID as the outermost key:

multiblog.insert(:Comments, guid,
  {UUID.new => 'I like this cat. - Evan'})
multiblog.insert(:Comments, guid, 
  {UUID.new => 'I am cuter. - Buttons'})

Now we can get all comments (oldest first) for a post by calling:

multiblog.get(:Comments, guid)

We could paginate them by passing :start with a UUID. See this presentation to learn more about token-based pagination.
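Token-based pagination reduces to a simple idea: fetch a page, then pass the last token you saw as the next :start. A sketch with a sorted array standing in for the time-ordered comment columns (the `page` helper is hypothetical):

```ruby
comments = %w[u1 u2 u3 u4 u5]  # stand-ins for time-sorted UUID column names

def page(comments, count, start = nil)
  # Skip everything up to and including the last-seen token, then take a page.
  rest = start ? comments.drop_while { |c| c <= start } : comments
  rest.first(count)
end

page(comments, 2)        # => ["u1", "u2"]
page(comments, 2, 'u2')  # => ["u3", "u4"]
```

Because the tokens are the sort order, pagination stays stable even as new comments arrive.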

We have sidestepped two problems with this data model: we don't have to maintain separate indexes for any lookups, and the posts and comments are stored in separate files, where they don't cause as much write contention. Note that we didn't need to use any super columns, either.

storage layout and api comparison

The storage strategy for Cassandra's standard model is the same as BigTable's. Here's a comparison chart:

                                multi-file   per-file              intra-file
Relational                      server       database              table*                primary key   column value
BigTable                        cluster      table                 column family         key           column name        column value
Cassandra, standard model       cluster      keyspace              column family         key           column name        column value
Cassandra, super column model   cluster      keyspace              super column family   key           super column name  column name   column value

* With fixed column names.

Column families are stored in column-major order, which is why people call BigTable a column-oriented database. This is not the same as a
column-oriented OLAP database like Sybase IQ—it depends on whether your data model considers keys to span column families or not.

[diagram: row-oriented layout]

In row-orientation, the column names are the structure, and you think of the column families as containing keys. This is the convention in relational databases.

[diagram: column-oriented layout]

In column-orientation, the column names are the data, and the column families are the structure. You think of the key as containing the column family, which is the convention in BigTable. (In Cassandra, super columns are also stored in column-major order—all the sub columns are together.)

In Cassandra's Ruby API, parameters are expressed in storage order, for clarity:

Relational                      SELECT `column` FROM `database`.`table` WHERE `id` = key;
BigTable                        table.get(key, "column_family:column")
Cassandra, standard model       keyspace.get("column_family", key, "column")
Cassandra, super column model   keyspace.get("super_column_family", key, "super_column", "column")

Note that Cassandra's internal Thrift interface mimics BigTable in some ways, but this is being changed.

going to production

Cassandra is an alpha product and could, theoretically, lose your data. In particular, if you change the schema specified in the storage-conf.xml file, you must follow these instructions carefully, or corruption will occur (this is going to be fixed). Also, the on-disk storage format is subject to change, making upgrading a bit difficult.

The biggest deployment is at Facebook, where hundreds of terabytes of token indexes are kept in about a hundred Cassandra nodes. However, their use case allows the data to be rebuilt if something goes wrong. Proceed carefully, keep a backup in an unrelated storage engine…and submit patches if things go wrong. (Some other production deployments are listed here.)

That aside, here is a guide for deploying a production cluster:

  • Hardware: get a handful of commodity Linux servers. 16GB memory is good; Cassandra likes a big filesystem buffer. You don't need RAID. If you put the commitlog file and the data files on separate physical disks, things will go faster. Don't use EC2 or friends without being aware that the virtualized I/O can be slow, especially on the small instances.

  • Configuration: in the storage-conf.xml schema file, set the replication factor to 3. List the IP address of one of the nodes as the seed. Set the listen address to the empty string, so the hosts will resolve their own IPs. Now, adjust the contents of cassandra.in.sh for your various paths and JVM options—for a 16GB node, set the JVM heap to 4GB.

  • Deployment: build a package of Cassandra itself and your configuration files, and deliver it to all your servers (I use Capistrano for this). Start the servers by setting CASSANDRA_INCLUDE in the environment to point to your cassandra.in.sh file, and run bin/cassandra. At this point, you should see join notices in the Cassandra logs:

    Cassandra starting up...
    Node 10.224.17.13:7001 has now joined.
    Node 10.224.17.14:7001 has now joined.
    

    Congratulations! You have a cluster. Don't forget to turn off debug logging in the log4j.properties file.

  • Visibility: you can get a little more information about your cluster via the included bin/nodeprobe tool:

    $ bin/nodeprobe --host 10.224.17.13 ring
    Token(124007023942663924846758258675932114665)  3 10.224.17.13  |<--|
    Token(106858063638814585506848525974047690568)  3 10.224.17.19  |   ^
    Token(141130545721235451315477340120224986045)  3 10.224.17.14  |-->|
    

    Cassandra also exposes various statistics over JMX.

Note that your client machines (not servers!) must have accurate clocks for Cassandra to resolve write conflicts properly. Use NTP.
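The reason is that conflict resolution is last-write-wins on the client-supplied timestamp, which a few lines of Ruby can illustrate (a sketch of the idea, not Cassandra's actual code):

```ruby
# Per column, the write with the higher timestamp wins.
def resolve(a, b)
  a[:timestamp] >= b[:timestamp] ? a : b
end

winner = resolve({ value: 'old', timestamp: 100 },
                 { value: 'new', timestamp: 200 })
winner[:value]  # => "new"
```

If a client's clock lags, its genuinely newer write can lose to a stale one, so keep the clocks synchronized.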

conclusion

There is a misperception that if someone advocates a non-relational database, they either don't understand SQL optimization, or they are generally a hater. This is not the case.

It is reasonable to seek a new tool for a new problem, and database problems have changed with the rise of web-scale distributed systems. This does not mean that SQL as a general-purpose runtime and reporting tool is going away. However, at web-scale, it is more flexible to separate the concerns. Runtime object lookups can be handled by a low-latency, strict, self-managed system like Cassandra. Asynchronous analytics and reporting can be handled by a high-latency, flexible, un-managed system like Hadoop. And in neither case does SQL lend itself to sharding.

I think that Cassandra is the most promising current implementation of a runtime distributed database, but much work remains to be done. We're beginning to use Cassandra at Twitter, and here's what I would like to happen real-soon-now:

  • Interface cleanup: the Thrift API for Cassandra is incomplete and inconsistent, which makes writing clients very irritating. Done!
  • Online migrations: restarting the cluster 3 times to add a column family is silly.
  • ActiveModel or DataMapper adapter: for interaction with business objects in Ruby. Done! Michael Koziarski on the Rails core team wrote an ActiveModel adapter.
  • Scala client: for interoperability with JVM middleware.
  • Scala client: for interoperability with JVM middleware.

Go ahead and jump on any of those projects—it's a chance to get in on the ground floor.

Cassandra has excellent performance. There are some benchmark results for version 0.5 at the end of the Yahoo performance study.

further resources

distributed systems primer, updated

Well, it's been a long time. But! I have four papers to add to my original distributed systems primer:

coordination

CRDTs: Consistency Without Concurrency Control, Mihai Letia, Nuno Preguiça, and Marc Shapiro, 2009.

Guaranteeing eventual consistency by constraining your data structure, rather than adding heavyweight distributed algorithms. FlockDB works this way.

partitioning

Scaling Online Social Networks Without Pains, Josep M. Pujol, Georgos Siganos, Vijay Erramilli, and Pablo Rodriguez, 2010.

Optimally partitioning overlapping graphs through lazy replication. Think of applying this technique at a cluster level, not just a server level.

There's a better version of this paper titled "The Little Engines That Could"; I'll update the post when it's generally available.

Feeding Frenzy: Selectively Materializing Users' Event Feeds, Adam Silberstein, Jeff Terrace, Brian F. Cooper, and Raghu Ramakrishnan, 2010.

Judicious session management and application of domain knowledge allow for optimal high-velocity mailbox updates in a memory grid. Twitter's timeline system works this way.

systems integration

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag, 2010.

Add a transaction-tracking, sampling profiler to a reusable RPC framework and get full stack visibility without performance degradation.

Happy scaling. Make sure to read the original post if you haven't.

object allocations on the web

How many objects does a Rails request allocate? Here are Twitter's numbers:

  • API: 22,700 objects per request
  • Website: 67,500 objects per request
  • Daemons: 27,900 objects per action

I want them to be lower. Overall, we burn 20% of our front-end CPU on garbage collection, which seems high. Each process handles ~29,000 requests before getting killed by the memory limit, and the GC is triggered about every 30 requests.

In memory-managed languages, you pay a performance penalty at object allocation time and also at collection time. Since Ruby lacks a generational GC (although there are patches available), the collection penalty is linear with the number of objects on the heap.
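For a rough sense of allocation counts without REE's GC statistics, stock MRI's ObjectSpace can approximate them by diffing live heap slots (the `allocations` helper is mine, and the numbers are approximate because measurement itself allocates):

```ruby
def allocations
  GC.disable
  counts = ObjectSpace.count_objects
  before = counts[:TOTAL] - counts[:FREE]   # live slots before the block
  yield
  counts = ObjectSpace.count_objects
  after = counts[:TOTAL] - counts[:FREE]    # live slots after the block
  after - before
ensure
  GC.enable
end

allocations { Array.new(10) { Object.new } }  # => at least 11 (array + 10 objects)
```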

a note about structs and immediates

In Ruby 1.8, Struct instances use fewer bytes and allocate fewer objects than Hash and friends. This can be an optimization opportunity in circumstances where the Struct class is reusable.

A little bit of code shows the difference (you need REE or Sylvain Joyeux's patch to track allocations):

GC.enable_stats
def sizeof(obj)
  GC.clear_stats
  obj.clone
  puts "#{GC.num_allocations} allocations"
  GC.clear_stats
  obj.clone
  puts "#{GC.allocated_size} bytes"
end

Let's try it:

>> Struct.new("Test", :a, :b, :c)
>> struct = Struct::Test.new(1,2,3)
=> #<struct Struct::Test a=1, b=2, c=3>
>> sizeof(struct)
1 allocations
24 bytes

>> hash = {:a => 1, :b => 2, :c => 3}
>> sizeof(hash)
5 allocations
208 bytes

Watch out, though. The Struct class itself is expensive:

>> sizeof(Struct::Test)
29 allocations
1216 bytes

In my understanding, each key in a Hash is a VALUE pointer to another object, while each slot in a Struct is merely a named position.

Immediate types (Fixnum, nil, true, and false) don't allocate at all. Symbols don't hit the normal heap either, but for a different reason: Symbol is interned and keeps its string representations on a special heap that is never garbage-collected.

your turn

If you have allocation counts from a production web application, I would be delighted to know them. I am especially interested in Python, PHP, and Java.

Python should be about the same as Ruby. PHP, though, discards the entire heap per-request in some configurations, so collection can be
dramatically cheaper. And I would expect Java to allocate fewer objects and have a more efficient collection cycle.

scribe client

I've released Scribe 0.1, a Ruby client for the Scribe remote log server.

sudo gem install scribe

Usage is simple:

client = Scribe.new
client.log("I'm lonely in a crowded room.", "Rails")

Documentation is here.

about scribe

The primary benefit of Scribe over something like syslog-ng is increased scalability, because of Scribe's fundamentally distributed architecture. Scribe also does away with the legacy syslog alert levels, and lets you define more application-appropriate categories on the fly instead.

Dmytro Shteflyuk has a good article about installing the Scribe server itself on OS X. It would be nice if someone would put it in MacPorts, but it may be blocked on the release of Thrift.

ree

We recently migrated Twitter from a custom Ruby 1.8.6 build to a Ruby Enterprise Edition release candidate, courtesy of Phusion. Our primary motivation was the integration of Brent's MBARI patches, which increase memory stability.

Some features of REE have no effect on our codebase, but we definitely benefit from the MBARI patchset, the Railsbench tunable GC, and the various leak fixes in 1.8.7p174. These are difficult to integrate and Phusion has done a fine job.

testing notes

I ran into an interesting issue. Ruby is faster if compiled with -Os (optimize for size) than with -O2 or -O3 (optimize for speed). Hongli pointed out that Ruby has poor instruction locality and benefits most from squeezing tightly into the instruction cache. This is an unusual phenomenon, although probably more common in interpreters and virtual machines than in "standard" C programs.

I also tested a build that included Joe Damato's heaped thread frames, but it would hang Mongrel in rb_thread_schedule() after the first GC run, which is not exactly what we want. Hopefully this can be integrated later.

benchmarks

I ran a suite of benchmarks via Autobench/httperf and plotted them with Plot. The hardware was a 4-core Xeon machine with RHEL5, running 8 Mongrels balanced behind Apache 2.2. I made a typical API request that is answered primarily from composed caches.

As usual, we see that tuning the GC parameters has the greatest impact on throughput, but there is a definite gain from switching to the REE bundle. It's also interesting how much the standard deviation is improved by the GC settings. (Some data points are skipped due to errors at high concurrency.)

upgrading

Moving from 1.8.6 to REE 1.8.7 was trivial, but moving to 1.9 will be more of an ordeal. It will be interesting to see what patches are still necessary on 1.9. Many of them are getting upstreamed, but some things (such as tcmalloc) will probably remain only available from 3rd parties.

All in all, good times in MRI land.

memcached gem release

One of the hardest gems to install is now one of the easiest.

Memcached 0.15 features:

  • Update to libmemcached 0.31.1
  • Bundle libmemcached itself with the gem (antifuchs)
  • UDP connection support
  • Unix domain socket support (hellvinz)
  • AUTO_EJECT_HOSTS bugfixes (mattknox)

Install with gem install memcached. Since libmemcached is bundled in, there are no longer any dependencies.

on coordination

Andreas Fuchs suggested several months ago that I include libmemcached itself in the gem, but at the time I resisted. I was wrong.

My opposition was based on the idea that libmemcached itself would be an integration point, so running multiple versions on a system would be bad.

In real life, the hash algorithm became the integration point, not the library itself. And since the library's ABI kept changing, the gem always required a very specific custom build. This annoyed the public and caused extra work for my operations team, who had to make sure to upgrade both the library and the gem at the same time.

Updates can come thick and fast now because I don't have to worry about publishing custom builds or waiting for the libmemcached developers to merge my patches.

In retrospect it seems obvious—it's always a win to remove coordination from a system.

linker woes

Unfortunately, it was easier to make that decision than it was to implement it. Linux and OS X link libraries differently, and I had a lot of trouble making sure that no system-installed version of libmemcached would get linked, instead of the custom one built during gem install.

When you link a shared object, OS X seems to maintain a reference to the original .dylib. Linux does not, and depends on ldconfig and LD_LIBRARY_PATH to find the object at runtime. Since you can't modify the shell environment from within a running process, there's no way to override LD_LIBRARY_PATH, so I needed to statically link libmemcached into the gem's own .so or .bundle.

The only way I could do this on both systems was to configure libmemcached with CFLAGS=-fPIC --disable-shared, rename the libmemcached.* static object files to libmemcached_gem.*, and pass -lmemcached_gem to the linker rather than -lmemcached. Otherwise the linker would prefer the system-installed dynamic objects, even with the correct paths and the -static option set.

Note that you can check what objects a binary is linked against via otool -L on OS X, and ldd on Linux.

Feel free to look at the extconf.rb source and let me know if there's a better way to do this.

distributed systems primer

I've been reading a bunch of papers about distributed systems recently, in order to help systematize for myself the thing that we built over the last year. Many of them were originally passed to me by Toby DiPasquale. Here is an annotated list so everyone can benefit.

It helps if you have some algorithms literacy, or have built a system at scale, but don't let that stop you.

prelude

The Death of Architecture, Julian Browne, 2007.

First, a reminder of what it means to build a system that solves a business problem. Browne built the space-based billing system at Virgin Mobile, one of the most well-known SBAs outside the financial and research industries.

That lovely diagram showing clean service-oriented interfaces, between unified systems of record, holding clean data, performing well-defined business processes is never going to be….You have to roll up your sleeves, talk to the business analysts, developers, operations and make a contribution that makes those boxes and arrows real.

System failures in the web world are most often due to inflated technical requirements and integration risks, not an incorrect decision to skip two-phase commit.

constraints

The Case for Shared Nothing, Michael Stonebraker, 1985.

The source of the shared-nothing paradigm, and importantly, its alternatives. Shared-nothing is a nice hammer, but not every problem is a nail.

Harvest, Yield and Scalable Tolerant systems, Eric Brewer, 1999.

The CAP theorem. Sometimes you just can't get what you want. (This is related to the Dynamo work, below.)

Distributed Computing Economics, Jim Gray, 2003.

How to predict the cost of the thing you want to build. Via some napkin math, Gray shows why making the cloud cost-efficient for current problems continues to be a struggle.

coordination

Guardians and Actions: Linguistic Support for Robust, Distributed Programs, Barbara Liskov and Robert Scheifler, 1983.

Two-phase commit. Making sure what you plan to do will get done.

Time, Clocks and the Ordering of Events in a Distributed System, Leslie Lamport, 1978.

Distributed systems are inherently relative; there is no privileged position that can describe all events exactly. Lamport clocks (and the closely related vector clocks) let participants agree on the order of events in the world, if you need to care about that.
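The core of a Lamport clock fits in a few lines of Ruby (a minimal sketch of the algorithm, not any particular library): increment on local events, and on message receipt take the max of the local and sender clocks, plus one.

```ruby
class LamportClock
  attr_reader :time

  def initialize
    @time = 0
  end

  def tick                 # a local event
    @time += 1
  end

  def send_event           # returns the timestamp to attach to the message
    tick
  end

  def receive(sender_time) # merge the sender's clock on receipt
    @time = [@time, sender_time].max + 1
  end
end

a = LamportClock.new
b = LamportClock.new
msg = a.send_event  # a's clock: 1
b.receive(msg)      # b's clock: 2, so the receive is ordered after the send
```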

Paxos Made Simple, Leslie Lamport, 2001.

The consensus problem: how can potentially faulty processes agree about an element of global state? The Paxos algorithm guarantees correctness during a failure of a minority of nodes. This paper is difficult, but important for the subtleties it reveals.

Also see Paxos Made Live for a discussion of the implementation in Google's Chubby v2 coordination server. Real life introduces many unfortunate kinks.

encapsulation

Generative Communication in Linda, David Gelernter, 1985.

Tuple spaces, a.k.a. the blackboard pattern, a.k.a. space-based architecture. Coordinating a system through the data, not through the behaviors. I will be writing a lot more about this in the future.

partitioning

Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, Ion Stoica, et al., 2001.

The original distributed hash table paper. Introduced consistent hashing for robust pool rebalancing.

Frangipani: A Scalable Distributed File System, Chandramohan A. Thekkath, Timothy Mann and Edward K. Lee, 1998.

Classic paper regarding modern-style distributed filesystems.

systems integration

Dynamo: Amazon’s Highly Available Key-Value Store, Giuseppe DeCandia, et al., 2007.

A key-value store that spawned numerous clones, it integrated many fundamental ideas from the above works into an actual running system. Cassandra is the most featureful open-source version.

(Also see Tokyo Cabinet. Not a Dynamo clone, per se, but it's the next most practical alternative aside from MySQL. Tokyo is a networked BerkeleyDB replacement, so the domain code must handle the distributed aspects.)

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services, Matt Welsh, David Culler, and Eric Brewer, 2001.

Great paper on managing the interactions between individual components, and ways to build a well-conditioned service under unpredictable load.

conclusion

That's all really. Focus on the ideas, not the implementations. Try to see the patterns in existing systems, rather than imagining how they "should" work if everything was perfect. Then you'll be able to scale to the moon.

The moon is above the cloud and therefore obviously more scalable.

peeping into memcached

Memcached is generally treated as a black box. But what if you really need to know what’s in there? Not for runtime purposes, but for optimization and capacity planning?

demo

$ sudo peep --pretty 2479

time | exptime | nbytes | clsid |                 key | exprd | flushd
8658 |  613458 |    272 |     5 |      "c3RhdH:171:5" | false |  false
8658 |       0 |      6 |     1 |  "current_c3RhdH:3" | false |  false
8658 |  613458 |    281 |     5 |     "c3RhdH:171:26" | false |  false
8678 |   95078 |      6 |     1 |  "User:1:auth:m4Uq" | false |  false
8658 |       0 |      8 |     2 |   "user_dGltZWxp:4" | false |  false
8686 |  613486 |   1278 |     9 |          "User:1:6" | false |  false
8658 |  613458 |   1286 |     9 |          "User:1:4" | false |  false
...
8658 |  613458 |    283 |     5 |     "c3RhdH:171:28" | false |  false
8658 |  613458 |    277 |     5 |     "c3RhdH:171:30" | false |  false

Peep uses ptrace to freeze a running memcached server, dump the internal key metadata, and return the server to a running state. If you have a good host ejection mechanism in your client, such as in the Twitter libmemcached builds, you won’t even have to change the production server pool. The instance is not restarted, and no data is lost.

Installation instructions are here. Requires Linux.

a note about heap management

Memcached has two separate memory management strategies:

  • On read, if a key is past its expiry time, return NOT FOUND.
  • On write, choose an appropriate slab class for the value; if it’s full, replace the oldest-used (read or written) key with the new one.

Note that the second strategy, LRU eviction, does not check the expiry time at all. This makes eviction an O(1) operation, because the tail of the LRU linked list is immediately available; checking within that list for evicted keys first would raise the order to O(N). This optimization has subtle implications.
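A minimal Ruby sketch of the two strategies (illustrative only; memcached itself is C and far more involved). Reads check expiry lazily, while the write path evicts the LRU entry without ever consulting its expiry time:

```ruby
# Illustrative sketch with invented names, not memcached's actual code.
class MiniSlab
  Entry = Struct.new(:key, :value, :expires_at)

  def initialize(capacity)
    @capacity = capacity
    @entries  = {}   # Ruby hashes preserve order; used here as the LRU list
  end

  # Strategy 1 -- on read: if the key is past its expiry time, NOT FOUND.
  def get(key, now = Time.now.to_i)
    entry = @entries[key]
    return nil unless entry
    return nil if entry.expires_at != 0 && now >= entry.expires_at
    @entries.delete(key)     # touch: move to the most-recently-used end
    @entries[key] = entry
    entry.value
  end

  # Strategy 2 -- on write: if the slab is full, replace the
  # least-recently-used entry, expired or not.
  def set(key, value, expires_at = 0)
    if !@entries.key?(key) && @entries.size >= @capacity
      @entries.delete(@entries.each_key.first)   # O(1): head of LRU order
    end
    @entries.delete(key)
    @entries[key] = Entry.new(key, value, expires_at)
  end
end
```

Note how set never looks at the victim's expires_at: that is exactly how a flood of expired-but-recently-written keys can push out fresh ones.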

the problem

In the bad old days, we added a page cache to Twitter via memcached. We used a generational key scheme: a piece of user metadata was incremented on every change, and used to construct dependent keys. This eliminated direct invalidation concerns.
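A hypothetical sketch of such a scheme (the key names and the Hash-as-cache stand-in are invented here for illustration, not Twitter's actual code):

```ruby
# Generational keys: bump a per-user version counter on every change,
# and embed the version in every dependent key.
class GenerationalKeys
  def initialize(cache)
    @cache = cache   # any memcached-like get/set store; a Hash works here
  end

  # Nothing is ever deleted: dependent keys from older generations
  # simply stop being read.
  def bump!(user_id)
    key = "current_version:#{user_id}"
    @cache[key] = (@cache[key] || 0) + 1
  end

  # A bump "invalidates" all dependent keys at once -- at the cost of
  # abandoning the old entries in the cache.
  def page_key(user_id, page)
    version = @cache["current_version:#{user_id}"] || 0
    "timeline:#{user_id}:#{version}:#{page}"
  end
end
```

Those abandoned generations are exactly the expired dead weight that peep later revealed.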

However, because of the orthogonal expiry and LRU strategies, we suspected that the generational keys were affecting the eviction of direct keys. The continual onslaught of new generational keys would sweep out older direct keys, even when the generational keys were expired but the direct keys were still fresh.

But how could we prove it was happening? Eventually we ran peep on a representative server. Aggregate results showed that 75% of our cache pool was spent on expired page caches, and our cache horizon was only 5 hours. Moving the generational keys to a separate pool raised our cache horizon to two days, a 10x improvement.

aggregating results

The best way to aggregate peep results is to use MySQL’s excellent grouping abilities. Run peep with the --ugly (unformatted) switch, and pipe it to a file. Then you can load it into an appropriate MySQL schema:

CREATE TABLE entries (
  lru_time int(11) default NULL,
  expires_at_time int(11) default NULL,
  value_size int(11) default NULL,
  suffix_size int(11) default NULL,
  it_flag varchar(255) default NULL,
  class_id int(11) default NULL,
  key_size int(11) default NULL,
  key_name varchar(255) default NULL,
  is_expired varchar(255) default NULL,
  is_flushed varchar(255) default NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

LOAD DATA LOCAL INFILE '/tmp/peep.txt' INTO TABLE entries 
  FIELDS TERMINATED BY ' | ' LINES TERMINATED BY '\n';

This will load very fast, even for millions of keys on a laptop; LOAD DATA LOCAL INFILE is an optimized pipeline.

Here are some handy queries for revealing cache properties:

fresh vs. expired values, by count and size

SELECT 
  is_expired, 
  COUNT(*), 
  SUM(value_size) 
FROM entries 
GROUP BY is_expired;

+------------+----------+-----------------+
| is_expired | COUNT(*) | SUM(value_size) |
+------------+----------+-----------------+
|  false     |   688883 |       759401296 |
|  true      |   472755 |      2430092660 |
+------------+----------+-----------------+
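As a sanity check, the expired share of cached bytes falls straight out of that result (the numbers below are copied from the table above):

```ruby
# Expired entries' share of total cached bytes, from the GROUP BY result.
fresh_bytes   =   759_401_296
expired_bytes = 2_430_092_660
share = expired_bytes.to_f / (fresh_bytes + expired_bytes)
# share comes out around 0.76 -- the "75% of our cache pool" figure
```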

histogram of values by size

SELECT 
  ROUND(ROUND(LOG10(value_size) * 5, 0) / 5, 1) AS log, 
  ROUND(AVG(value_size), 0) AS size, 
  COUNT(*), 
  RPAD('', COUNT(*) / 6500, '*') AS bar 
FROM entries 
GROUP BY log 
ORDER BY log DESC;

+------+--------+----------+----------------------------------------+
| log  | size   | COUNT(*) | bar                                    |
+------+--------+----------+----------------------------------------+
|  5.6 | 422238 |        3 |                                        | 
...
|  1.6 |     39 |     3652 | *                                      | 
|  1.4 |     23 |   103434 | ************************************** | 
|  1.2 |     14 |     6369 | *                                      | 
|  1.0 |     11 |    36588 | ******                                 | 
|  0.8 |      6 |    55795 | *********                              | 
|  0.6 |      4 |    90332 | **************                         | 
|  0.4 |      2 |      239 |                                        | 
+------+--------+----------+----------------------------------------+
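For reference, the ROUND(ROUND(LOG10(value_size) * 5, 0) / 5, 1) expression translates directly to Ruby; it buckets each size to the nearest fifth of a decade in log10 space:

```ruby
# Same bucketing as the SQL "log" column above.
def log_bucket(value_size)
  ((Math.log10(value_size) * 5).round / 5.0).round(1)
end
```

For example, a 23-byte value lands in the 1.4 bucket and a 39-byte value in 1.6, matching the rows above.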

histogram of values by LRU age

SELECT 
  ROUND(ROUND(LOG10(
    (SELECT MAX(lru_time) FROM entries)
    - lru_time) * 5, 0) / 5, 1) AS log, 
  ROUND(AVG(
    (SELECT MAX(lru_time) FROM entries) 
    - lru_time), -2) AS lru_age, 
  COUNT(*), 
  RPAD('', COUNT(*) / 6500, '*') AS bar 
FROM entries 
GROUP BY log 
ORDER BY log DESC;

+------+-----------+----------+-------------------------------------+
| log  |   lru_age | COUNT(*) | bar                                 |
+------+-----------+----------+-------------------------------------+
|  4.6 |     34800 |    18064 | ***                                 | 
|  4.4 |     24200 |    96739 | ***************                     | 
|  4.2 |     15700 |   212865 | *********************************   | 
|  4.0 |     10200 |   224703 | *********************************** | 
|  3.8 |      6500 |   158067 | ************************            | 
|  3.6 |      4100 |   108034 | *****************                   | 
...
|  0.4 |         0 |      672 |                                     | 
|  0.0 |         0 |      319 |                                     | 
| NULL |         0 |    13400 | **                                  | 
+------+-----------+----------+-------------------------------------+

histogram of values by size for a namespace

SELECT 
  key_name, 
  ROUND(ROUND(LOG10(value_size) * 5, 0) / 5, 1) AS log, 
  ROUND(AVG(value_size), 0) AS size, 
  COUNT(*), 
  RPAD('', COUNT(*) / 10000, '*') AS bar 
FROM entries 
WHERE key_name LIKE '%chunk:%' 
GROUP BY log 
ORDER BY log DESC;

+----------------------+------+-------+----------+----------------------+
| key_name             | log  | size  | COUNT(*) | bar                  |
+----------------------+------+-------+----------+----------------------+
|  reply_chunk:69913   |  3.4 |  2511 |     1032 | **                   | 
|  user_chunk:1868495  |  3.2 |  1530 |      972 | **                   | 
...
|  user_chunk:1405137  |  1.6 |    40 |     2822 | ******               | 
|  reply_chunk:2084579 |  1.4 |    24 |     4361 | *********            | 
|  user_chunk:2455162  |  1.2 |    14 |     6141 | ************         | 
|  user_chunk:5989656  |  1.0 |    12 |     2477 | *****                | 
|  user_chunk:2268781  |  0.8 |     6 |    16527 | ******************** | 
+----------------------+------+-------+----------+----------------------+

slab allocation counts by class

SELECT 
  class_id AS slab_class, 
  MAX(value_size) AS slab_size, 
  CEIL(COUNT(*) / ((1024 * 1024) / MAX(value_size))) AS allocated, 
  (
    (SELECT COUNT(*) FROM entries 
     WHERE class_id = slab_class AND is_expired = "true")
   * 100
   / 
    (SELECT COUNT(*) 
     FROM entries 
     WHERE class_id = slab_class) 
  ) AS `%_expired`
FROM entries 
GROUP BY class_id 
ORDER BY class_id;

+------------+-----------+-----------+-----------+
| slab_class | slab_size | allocated | %_expired |
+------------+-----------+-----------+-----------+
|          1 |         8 |         1 |   99.8938 | 
|          2 |        32 |         5 |    6.0320 | 
|          3 |        71 |         9 |   36.4859 | 
...
|         34 |    187783 |        28 |   61.2903 | 
|         35 |    234205 |       130 |    1.8998 | 
|         36 |    286110 |         7 |    0.0000 | 
|         37 |    309832 |         2 |   75.0000 | 
|         38 |    397500 |         1 |  100.0000 | 
|         39 |    496014 |         1 |  100.0000 | 
+------------+-----------+-----------+-----------+

This is comparable to the memcached stats slabs command, but lets you investigate things like expired ratio by slab, etc.

Play around and see what else you can come up with.

conclusion

Generational keys are often presented as the easy way to cache, but consider their implications. How else are you abusing your cache? Some suggested things to check:

  • Keys without expiry times (which will bite you when you alter the server pool)
  • Excessive use of very large slabs
  • Uneven LRU age by slab (indicating poor slab allocation)
  • Unexpected namespaces (revealing invalidation bugs)

The only unfortunate thing about peep right now is that it blocks the memcached server. I could do an incremental version, but it would be quite slow. The current version is already CPU-intensive, taking 15 minutes to dump a 4GB memcached on modern hardware (turn off your client-facing processes if you run mixed front-ends!). A version in C would be much quicker, if someone is up for it, but for now the Ruby version is sufficient.

Happy peeping.

ruby gc tuning

I’d like to call out something important from my QCon slides: the Railsbench GC settings.

quick study

In my experience, a typical production Rails app on Ruby 1.8 can recover 20% to 40% of user CPU by applying Stefan Kaes’s Railsbench GC patch to the Ruby binary, and using the following environment variables:

RUBY_HEAP_MIN_SLOTS=500000 
RUBY_HEAP_SLOTS_INCREMENT=250000 
RUBY_HEAP_SLOTS_GROWTH_FACTOR=1 
RUBY_GC_MALLOC_LIMIT=50000000 

In return, you get a minor increase in peak memory consumption. Not too shabby.

gc behavior

Ruby’s GC is non-generational, which means that the GC looks at every object every time it runs. Java, OCaml, and other static languages have a generational GC, which means that only recent allocations are checked frequently (the older an object is, the less likely it is to be freeable). Ruby pays a high cost on each GC run because it also scans every long-lived object; for example, the Rails codebase itself, which is stored in memory as an AST.

Writing a generational GC is quite difficult, but there is one thing we can do: run the GC less frequently. This is what Stefan’s patches allow.

how to tune your app

If you install this patch from Sylvain Joyeux, you can add object allocation deltas to your Rails log, as well as use Stefan’s GC tracking methods. This gives you visibility into exactly when the GC runs and how much useful work it does. Spin up ab and script a representative page. Also start top in another shell to watch total memory in the Rails process.

Now, do a simulated annealing-like search through the available environment settings.

example

Here is one of our test runs with Ruby’s default GC settings (we use a custom logger at Twitter, but you should be able to arrive at similar output):

tcpu:0.154 alloc:63926 delta:-66290 gccpu:0.078
tcpu:0.067 alloc:63640 delta: 63645 gccpu:0.000
tcpu:0.146 alloc:63788 delta:-68896 gccpu:0.078
tcpu:0.060 alloc:63779 delta: 63784 gccpu:0.000
tcpu:0.138 alloc:63787 delta:-75152 gccpu:0.072
tcpu:0.059 alloc:63779 delta: 63784 gccpu:0.000
tcpu:0.138 alloc:63787 delta:-77591 gccpu:0.072
tcpu:0.062 alloc:63779 delta: 63784 gccpu:0.000

With the conservative Ruby defaults, the GC runs every other request, and takes 0.075 seconds, giving us a per-request GC cost of 0.038 seconds—40% of the entire request time.

This is excessive. But if you explore a bit, you will quickly arrive at something like RUBY_HEAP_MIN_SLOTS=500000 RUBY_HEAP_SLOTS_INCREMENT=250000 RUBY_HEAP_SLOTS_GROWTH_FACTOR=1 RUBY_GC_MALLOC_LIMIT=50000000, which is what we use at Twitter.

These particular settings mean:

  • Start with enough memory to hold Rails (Ruby’s default is practically nothing)
  • Increase it linearly if you need more (Ruby’s default is exponential increase)
  • Only garbage-collect after every 50 million bytes of malloc’d memory (Ruby’s default is 6x smaller)

Here are the GC timings for the same ab run with these settings applied:

tcpu:0.181 alloc:63829 delta:-763708 gccpu:0.118
tcpu:0.067 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63777 delta: 63782 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.058 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.063 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.058 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.063 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.059 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.185 alloc:63828 delta:-761854 gccpu:0.119
tcpu:0.069 alloc:63777 delta: 63782 gccpu:0.000
tcpu:0.065 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.058 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.061 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.059 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63775 delta: 63780 gccpu:0.000

Woah! Now the GC runs only every 13 requests, at a slightly higher cost, for a per-request cost of 0.009 seconds. This translates to a general speedup of 34%. The frequency of GC calls corresponds quite directly to the change in RUBY_GC_MALLOC_LIMIT, but if we increase it much more the memory usage balloons.
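The per-request numbers are simple arithmetic on the two logs (the costs and periods below are read off the gccpu columns above):

```ruby
# Per-request GC cost, before and after tuning.
default_cost = 0.075 / 2    # GC every other request, ~0.075s per run
tuned_cost   = 0.119 / 13   # GC every 13th request, ~0.119s per run
# roughly 0.0375 versus 0.009 seconds of GC per request
```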

further reading

Unfortunately, there’s not much clear information on the Ruby garbage collector. But here are a few resources:

We are now experimenting with Brent’s MBARI branch of Ruby 1.8.6, kindly sponsored by EngineYard. So far it looks excellent, especially in combination with Stefan’s patches. I will publish some results soon.

qcon presentation

My QCon presentation is available.

Improving Running Components at Twitter


Some choice Tweets:

  • philwills: Evan Weaver on scaling twitter at #qcon was full of
    interesting stuff and good questions from audience.
  • markhneedham: fascinating reading these stats about #twitter
    from Evan Weaver’s talk #qcon
  • jurgenonland: sitting at a presentation from Evan Weaver @
    #qcon, wow he must be verry unhappy at his work
  • szegedi: Listening to Evan Weaver talking about Twitter system
    architecture & tuning. Getting to learn from these experiences is priceless.
  • rbanks54: This could be a great presentation, but Evan is
    presenting it monotone/bored & skipping important links 🙁 #qcon
  • oudenampsen: Was just by Evan Weaver of twitter. Gave the impression that any time he could commit suicide. However interesting.

My presentation abilities have gone from "bad" to "tolerable", so I’m relatively satisfied with the situation. Clearly I need to be more engaging.

secret codes

Here are some secret codes I am involved with. They are some of the best codes recently coded.

kestrel

A replacement for Starling, the distributed message queue. Written on the JVM (Scala) because of the mature garbage collector. Has a constant performance profile regardless of the size of the queue.

memcached/libmemcached pre-builds

The C library and the Ruby client as a matched pair. Much improved failure handling over the public builds, and Ketama hashing is built-in as always. Switching from ruby-memcache to memcached at Twitter effectively halved our cluster CPU load. (These changes are getting upstreamed, so you can also just wait.)

cache-money

A Rails object cache layer for ActiveRecord. It can do primary key lookups and single index queries solely out of cache. Especially nice if you have replication lag.

peep

A heap inspector for live memcached server instances. They said it couldn’t be done.

thinking sphinx

Pat Allan has been working hard on his Sphinx plugin. He’s adding the missing enterprisey features from Ultrasphinx, so that it can finish its years sleeping gently in legacy apps. (Facets, and non-invasive delta indexing.)

bybusyness

Apache 2.2.10 got a new fair balancing algorithm. Smooths out latencies introduced by single-threaded backends with high standard deviation, such as Mongrel/Rails.

postscripts

Speaking of, I’m making sporadic progress on Mongrel 2, but don’t hold your breath. The main improvement will be Rack support.

A downside of my position at Twitter is meager open-source progress. But hey, we’re hiring (must be in SF or willing to move).

My RSI is slowly going away. The only real solution is to type less. And make sure to keep totally blasting your pecs.