For a recent ongoing project, I realized that I had to begin offloading a sub-system from mySQL to a key/value store for efficient retrieval. We all know that mySQL heavily depends on power to read as opposed to write. Therefore, when dealing with joins, and sometimes IN’s on non trivial queries, it can become fairly expensive. Not to mention my budget is finite, so I can’t (and wouldn’t, even if I could) just throw metal at the DB.
The sub-system in question is a user’s activity stream. A user has friend’s those friends perform activities such as post photos, comments, be-friend other people etc. Naturally to display all this data from mySQL, lots of joins are required from various tables Users,Friends,Photos,Comments. An alternate solution is to employ a mySQL strategy which keeps track of redundant fields in a separate table that is populated asynchronously. With this method, all data can be pulled in a simple SELECT, again this is nothing more than just a patch, and not really a viable long term solution for scaling.
The real way to alleviate this, is to replace the way activities are displayed, and persisted with an alternate key/value store that is quick to serve data from even when sometimes retrieving non-cached content. In addition, it has to be an augmented key/value store, so that it somewhat facilitates basic relationships.
So now which one do I choose? There are a couple of factors that lead to my decision:
1. Something with very low latency on reads.
2. Something that plays nice with THRIFT or HTTP.
3. A system that is truly distributed and fault tolerant.
4. Eventually consistent system, since availability is more important than inconsistent reads.
5. Decent developer community contributing to the project.
After looking at a couple of options, I narrowed it down to two; Cassandra, or Riak. Both great distributed systems, one system battle tested (Cassandra), the other not so much, but both very much still experimental. I am familiar with Cassandra from a previous endeavor, so I know what it can do. It is also proven to handle well in production (Facebook,Twitter,Digg), so it’s a solid contender, but I’m not going to go with it until I experiment with Riak a bit more. Right off the bat, I was impressed by some of RIAK’s core features, such as virtual node replication and linking between documents, just to mention the attractive ones. The only drawback so far is the lack of documentation and community, since it is so new.
I will have to make a decision soon, and when I do, a post will follow with some more technical details about the move and which system I eventually chose. Until then I’ll be hacking away at Riak to see if it’s a viable option for putting something like this in production, if not then good ol’ Cassandra it is :-)