Saturday, October 16, 2010


I'm evaluating a couple of products for a larger project at work. At some point the project will need some kind of distributed storage, a role that may well be filled by the "ceph" project. Although it is experimental, a good number of features already work. For these evaluations, I've looked at other work that has been done in this area, most notably Google's (GFS, Chubby, Percolator and BigTable). Some of the concepts described there are also implemented in Ceph, most specifically the Paxos algorithm for reliable replication.

Distributed storage is great for anything you'd like to query and that can tolerate being available 3-4 seconds after the data is "created" (although it's very likely created earlier). It is not, however, a good basis for broadcasting/multicasting/unicasting data as fast as possible after it is produced, because clients end up polling the file system for changes. For efficiency's sake and your own sanity, it is better to use some kind of message queue there. There are two main architectures: "bus" and "broker". The bus is basically a local subnet broadcast, where everything attached to the bus has to parse each message and discard it if there is no interest. The broker(s) can be configured to receive messages from producers and route them through some broker network, reducing the number of times messages are copied and thus re-transmitted over a given link. Buses are useful only if you want all servers, or the majority of them, to know about the data. Brokers are useful if you have geographically dispersed locations, want full control over how data is retransmitted, etc... Some brokers can also act more like "hubs"; see the "pubsubhubbub" project.
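To make the bus/broker distinction concrete, here's a toy sketch in Python (no real networking, and nothing to do with any particular product; the `Broker` class and its methods are my own made-up names). The point is the routing decision: a broker delivers a message only to parties that registered interest, whereas on a bus every node would see the message and have to filter it itself.

```python
from collections import defaultdict

class Broker:
    """Toy in-memory broker: producers publish to a topic, and only
    subscribers of that topic receive the message. On a bus, by
    contrast, every attached node sees all traffic and must discard
    what it doesn't care about."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Route only to interested parties; a bus would deliver everywhere.
        for callback in self._subscribers[topic]:
            callback(message)

received = []
broker = Broker()
broker.subscribe("weather", received.append)
broker.publish("weather", "10001 85 24")   # delivered to our subscriber
broker.publish("stocks", "AAPL 300")       # dropped: nobody subscribed
```

In a real deployment the callbacks would be network connections, and brokers would forward to each other, which is where the savings in copies and re-transmissions come from.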

Sidenote: if you can avoid 'realtime' and message queues altogether and do it through some file system, I'd recommend it. The concepts are simpler, it's easier to deal with faults, and should you need any group locks, the file system, even the distributed one called Ceph, can provide them for you. A message queue puts the onus on the client for data caching, storing, reordering and all that. So file-based applications are always easier to write (at least from the client perspective :).

A quite complete evaluation of message queues was made by SecondLife. Note what is said there about "AMQP": I believe that standards should evolve from things that work, not the other way around.

I've downloaded 0MQ and decided to give it a spin. The first thing I did was visit the IRC channel on a Saturday, and there were 50 people in there. That's usually a very good sign that it's quite popular, and there's a good chance that 0MQ will get some serious attention soon. It may not be the absolute top performer, but it may be the best choice if you're looking for flexible solutions, prototyping and experimentation. Having said that, unless you're running a super-duper high-performance site that needs to squeeze all the juice out of your CPUs and NICs, ZeroMQ will do just fine for you, no problems whatsoever.

To set the expectations right... it's more of a library for internetworking your applications using multicast/tcp/ipc protocols, but that's already a big gain if a library does that for you. This is typically a layer where many developers think they need something special, but it ends up being the same requirement as everybody else's. Relying on a 3rd-party lib that's widely used everywhere, and thus well tested, is a very good decision. The only thing you end up writing is the application- or service-specific implementation and the message marshalling; the library does the rest.
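The marshalling part you do write yourself is usually small. A minimal sketch in Python, using the stdlib `struct` module and a wire layout I invented for illustration (a weather update, loosely inspired by the kind of data the weather example sends; the field choices are entirely my assumption):

```python
import struct

# Hypothetical fixed wire format for a weather update, in network
# byte order: zip code (unsigned 32-bit int), temperature (signed
# 16-bit int), humidity (unsigned 16-bit int) -- 8 bytes total.
WEATHER_FORMAT = "!IhH"

def marshal(zipcode, temperature, humidity):
    """Pack one update into bytes ready to hand to the transport."""
    return struct.pack(WEATHER_FORMAT, zipcode, temperature, humidity)

def unmarshal(payload):
    """Inverse of marshal(): bytes back into a tuple of fields."""
    return struct.unpack(WEATHER_FORMAT, payload)

wire = marshal(10001, -5, 60)
```

The transport library neither knows nor cares what's inside the payload; it just moves opaque bytes, which is exactly the separation of concerns argued for above.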

For anyone doing enterprise messaging, 0MQ won't give you much yet. There are no tools to replay messages, retrieve them from error logs and what have you (nor development tools). You may need something like OpenAMQ or RabbitMQ instead. To be honest, if you develop everything in-house, you do not necessarily need anything more complicated than 0MQ.

Many services nowadays use XML/XSD to specify message formats. My main gripe with XML is that for too many people it's about dealing with format instead of function. The issues with namespaces and all the other machinery make XML a really big beast to handle. Read up on _message matching_ to get an idea of where this goes wrong. Slowed-down brokers eventually slow down your entire business: they become unreliable, delay transactions that need to go fast, or negatively impact other processes running on the same machine, requiring you to set up two brokers with double the probability that something goes wrong... etc... Better to get rid of it, move on, and separate 'meaning' from 'transmission', verifying things later. There is no need to encapsulate each message with the entire ontology every time it is sent. I have a couple of interesting ideas in this respect that I'll perhaps share later as a research publication.

Update: 0MQ isn't without 'issues', but as far as I could determine so far there are no serious bugs in the code like memory leaks or reconnect problems. I found two issues so far:
  • Messages are not throttled at the server side. If you look at the weather server example, which sends a lot of messages as fast as possible, under the hood these messages actually pile up in a list for the I/O subsystem to transfer asynchronously. If the I/O subsystem simply cannot keep up, you'll deplete your memory very quickly. The solution is to add high water marks or swaps, or use other methods to throttle message throughput.
  • zmq_poll works, but because it assumes you're listening on multiple sockets, the idea is to visit each socket pool as fast as possible to collect any new messages that may have arrived. You then get one message per call and some kind of interleaving between the socket pools (if they're both really fast). Although this sounds nice, the underlying code assumes that you do not want any kind of "sleep" to give CPU time back to other processes. So when you start 10 clients as in the examples implementing zmq_poll, no client yields and they all try to grab 100% CPU. If your client app can deal with it, the workaround is to embed a simple 'x' ms sleep in there. I recommend sleeping only when the last cycle didn't actually read any messages (no event was set).
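The first issue above is the classic unbounded-producer problem, and a high water mark is just a bound on the send buffer. A stdlib-only sketch (no 0MQ involved; `HIGH_WATER_MARK` and the drop policy are my own illustrative choices, and a real implementation might block or swap to disk instead of dropping):

```python
import queue

# A bounded queue standing in for the I/O subsystem's send buffer.
HIGH_WATER_MARK = 3
outbox = queue.Queue(maxsize=HIGH_WATER_MARK)

dropped = 0
for i in range(10):               # producer runs far ahead of I/O
    try:
        outbox.put_nowait(f"update {i}")
    except queue.Full:
        dropped += 1              # at the mark: drop (or block/swap)
                                  # instead of growing without bound
```

Memory use now stays bounded at the high water mark no matter how fast the producer runs; without the bound, the backlog (and memory) grows until the process dies.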
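The second issue's workaround (sleep only after an idle cycle) can be sketched like this. This is plain Python, not zmq code: `poll_once` is a hypothetical callable that drains whatever is ready and returns a possibly-empty list of messages, standing in for a zmq_poll pass over your sockets.

```python
import time

def poll_loop(poll_once, idle_sleep=0.005, cycles=100):
    """Drain messages as fast as possible, but yield the CPU only
    when the previous cycle saw no traffic. poll_once() is assumed
    to return a (possibly empty) list of ready messages."""
    handled = 0
    for _ in range(cycles):
        messages = poll_once()
        if messages:
            handled += len(messages)  # busy: keep spinning, stay low-latency
        else:
            time.sleep(idle_sleep)    # idle: back off so 10 clients
                                      # don't all burn 100% CPU
    return handled
```

Sleeping unconditionally every cycle would cost you latency under load; sleeping only on empty cycles keeps the hot path hot while letting idle clients yield.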
