Tuesday, September 29, 2009

FSM and ragel

The diagram on the right depicts a state machine for parsing command line arguments. I've been looking at ragel lately, because its architecture and design are genuinely compelling. The philosophy behind it is not necessarily limited to lexing input or protocols (although that is basically what ragel does). I'm looking at this from the perspective of applied research into knowledge-base sharing and upgrading between intelligent agents. One idea I've been having: is it possible to develop a common knowledge base between two computer processes that is not static ('pre-defined'), but that has dynamic properties, so that it can reason with its internal state and knowledge base to resolve specific dead-ends and so on?

( btw, just in between: for an explanation of how I post code on Blogger without using a syntax highlighter, see http://kevin-berridge.blogspot.com/2007/08/posting-code-on-blogger.html ).

The above diagram was generated by specifying a sort of language for the command line arguments that the application understands. Language is to be interpreted in the broadest sense of the word. Think of it as any stream of input characters in which you can convey ideas or specifications of actions to undertake.

In the above diagram, a state is reached when the state machine can successfully pick up the next character from the stream. The machine moves to a different state if the next character in the stream matches that transition's symbol; it acts a bit like a filter. Some states have multiple exit points (they can go over a number of transitions), which is fine. The interesting characteristic of ragel, in comparison with a lexer, is that you're string matching and executing code at the same time. So when using ragel, you may start executing actions for input that later turns out to be incompletable, because the final part of the input is missing. It takes a bit of programming to either discard that state or use it anyway, and it isn't applicable in every context. I can imagine that if you work on transaction-based systems, you just panicked reading that :). There, you typically wait until the full request is in, generate a response and wait for the client to actually tell you to commit before doing it for real.

Another interesting part of ragel is that it doesn't need to lean on glibc or other heavier library functions much. In the above example, you'd typically use some str* function from glibc to find out what the user supplied, and you need to make sure your buffers are correctly set up so you don't run past them (I always use the strn* variants just in case I get caught out). ragel, on the other hand, works on your supplied buffers directly and uses pointer arithmetic. There are two output modes: table-based, where the transition from one state to another is more of a path description, and goto-based.
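
To give an idea, here's a minimal sketch of what such a machine could look like. The flag names, actions and the parse_args wrapper are all made up for the example; the p/pe pointers are how generated ragel code walks your buffer:

%%{
    machine args;

    action set_verbose { verbose = 1; }
    action set_help    { help = 1; }

    flag = ( "-v" @set_verbose | "-h" @set_help );
    main := flag ( ' '+ flag )*;
}%%

#include <string.h>

%% write data;

int parse_args(const char *input, int *verbose_out, int *help_out)
{
    int verbose = 0, help = 0;
    const char *p = input;                  /* ragel scans from p... */
    const char *pe = input + strlen(input); /* ...up to pe */
    int cs;                                 /* current state */

    %% write init;
    %% write exec;

    *verbose_out = verbose;
    *help_out = help;
    return cs >= args_first_final;          /* ended in an accepting state? */
}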

Gotos should probably be considered evil, but with state machines I'm starting to think that machine intelligence could greatly benefit from execution contexts that can switch very quickly from one state to another. Earlier posts from 2007 already rambled on about stackless Python and so on; having a stack that grows indefinitely doesn't help much.

Now, in the philosophy of ragel, would it be possible to develop an agent language that runs in some kind of engine, where the agent continuously instructs the engine what to execute next? Maybe a thread-based context of instructions could help make this multi-processing.

In that line of thought, consider that a state is basically any identifiable situation: a state of mind, or a state in a computer. Having an apple in your hand could be considered a state. A large problem in AI is how you make computers reason their way from one state to another. Generally this is done with a pre-defined knowledge base that fixes all the rules before anything is executed. Such machines or robots become applicable to one thing only (whatever is in the knowledge base) and not much else.

Now, a start state is known: maybe it's idle, maybe it's being hungry, or some pro-active state where the AI is trying to achieve something (possibly governed by some emotion engine?). The interaction of several state machines together would be really interesting here. The idea is to get from the start state to a goal state. If the computer could simulate in its engine how to get from start to goal by going over the transitions, it might find different ways of achieving its objective. And if transitions have costs associated with them, the AI could reason about the cheapest way to achieve the objective.

Taking a transition also means using internal resources; it isn't necessarily a trivial task. A robot could be in a start state somewhere identifiable in the current space and decide that it needs to move to another location, thus another state. The focus then becomes how to get from start -> location. The transition to do that is movement, and movement is concerned with finding a path from start -> location, possibly via intermediate states. If the transition finds through observation that everything is blocked, it may decide to panic, or find other (more costly?) ways to attempt.

What is described here is a design for a flexible reasoning engine that depends fully on (a combination of) state machines, which execute snippets of code in between its reasoning processes. Combine this with a language shareable between other robots and human beings (an interactive terminal?) and the computer could start asking questions...:
  1. Q: "how start->goal?"
  2. A: "apply movement transition"
  3. ( robot downloads movement knowledgebase and code? )
A basic scenario involves a monkey, a box, a cage with three prescribed locations and a banana. The objective for the monkey is to grab the banana, which it can only do if the box is in the middle of the room, the monkey is standing on the box, and it reaches out to grab the banana. This is a reasoning problem: the details and specifics of actually executing those actions belong to a different domain. Actually, the interesting part would be to communicate to other modules of an AI that something is an objective, and leave it to sensors and other machinery to carry out the specific task. When those modules all agree the task has been executed, they communicate this back to the reasoning module, which is then confirmed in the new state.

Ragel doesn't just apply actions when taking a transition; it can also do so when entering and leaving states. This allows for some more flavours of interestingness. The idea is that an AI should be able to dynamically extend its knowledge base (which a couple of implementations do), ideally through communication in a simple, non-ambiguous language for exchanging those knowledge gems.
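
In ragel notation those hooks are separate operators: > fires on entering a machine, $ on every transition within it, and % on leaving it. A tiny sketch (the action bodies are made up):

action enter_word { printf("entering a word\n"); }
action each_char  { printf("char: %c\n", fc); }
action leave_word { printf("leaving the word\n"); }

word = [a-z]+;
main := word >enter_word $each_char %leave_word;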

In the example of the monkey above, a goal state could be to "have the banana". The computer then doesn't know how to get into that state, so it needs to understand the differences between the following:
  1. how to grab a banana
  2. how to reach out for a banana
  3. how to climb on a box (and whether it is strong enough to support the monkey)
  4. whether the monkey robot is tall enough to reach the banana without the box
  5. how to move the box around the room (and that the monkey cannot be on the box to do this).
  6. whether the box is in the middle of the room
Using these states, you can draw a state diagram of actions to be executed in a certain order. Eventually, if you leave the reasoning to the computer, it should arrive at the sequence of actions that is least costly to execute (the cheapest way to get there), and that is what it should try. A sketch of that search follows below.
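
As a toy illustration, here's the kind of search I mean, in C. The states, transitions and costs are invented for the example; the search itself is just Dijkstra's algorithm over the state graph:

#include <stdio.h>
#include <limits.h>

/* Invented states of the monkey/box/banana world. */
enum { START, AT_BOX, BOX_AT_CENTER, ON_BOX, HAS_BANANA, NSTATES };
static const char *names[] =
    { "start", "at-box", "box-at-center", "on-box", "has-banana" };

/* cost[i][j] > 0 means there is a transition i -> j with that cost. */
static const int cost[NSTATES][NSTATES] = {
    [START]         = { [AT_BOX] = 2 },        /* walk to the box    */
    [AT_BOX]        = { [BOX_AT_CENTER] = 3 }, /* push box to center */
    [BOX_AT_CENTER] = { [ON_BOX] = 1 },        /* climb on the box   */
    [ON_BOX]        = { [HAS_BANANA] = 1 },    /* reach out and grab */
};

int main(void)
{
    int dist[NSTATES], prev[NSTATES], done[NSTATES] = { 0 };
    for (int i = 0; i < NSTATES; i++) { dist[i] = INT_MAX; prev[i] = -1; }
    dist[START] = 0;

    for (;;) {                /* Dijkstra: pick the cheapest unvisited state */
        int u = -1;
        for (int i = 0; i < NSTATES; i++)
            if (!done[i] && dist[i] != INT_MAX && (u < 0 || dist[i] < dist[u]))
                u = i;
        if (u < 0)
            break;
        done[u] = 1;
        for (int v = 0; v < NSTATES; v++)
            if (cost[u][v] > 0 && dist[u] + cost[u][v] < dist[v]) {
                dist[v] = dist[u] + cost[u][v];
                prev[v] = u;
            }
    }

    /* Walk back from the goal to print the cheapest plan. */
    printf("cost %d, plan:", dist[HAS_BANANA]);
    int path[NSTATES], n = 0;
    for (int s = HAS_BANANA; s >= 0; s = prev[s])
        path[n++] = s;
    while (n--)
        printf(" %s", names[path[n]]);
    printf("\n");
    return 0;
}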

Monday, September 28, 2009

Writing fast protocol-compliant programs quickly

I've used bison and flex for some text parsing. Another project of mine was done with ANTLR (see xssprotect): basically an HTML tag filter that allows well-known tags and attributes and removes all others. Flex and bison work together, but you need to keep on your toes to understand which does what. The combination isn't suitable for all kinds of parsing. One reason why programming languages are non-ambiguous by design is so that they compile into machine code or interpreter code exactly as the programmer intended, without unwanted side effects. Parsing regular files with flex and bison becomes more difficult once you don't have control over the format of the input file, that is, if you don't control what it should look like. This is because bison is dependent on the lexer, and the lexer has to output the correct tokens for bison to apply them properly. You could say that the lexer is more about syntax, while bison is more about semantics and ordering.

So flex feeds bison, and bison lets you execute actions once things are recognized. The general tutorials all over the internet always show the same example: a calculator. Some flex code looks like this:


%{
#include <stdio.h>
%}
%s BRACKET
%%
<INITIAL>[A-Za-z0-9]+   { printf("Symbol: %s\n", yytext); }
"("                     { BEGIN BRACKET; }
")"                     { BEGIN INITIAL; /* switch back */ }
<BRACKET>[A-Za-z \t]+   { printf("BRACKET: %s\n", yytext); }
.|\n                    { }
%%
int yywrap(void) { return 1; }

int main(int argc, char *argv[])
{
    yylex();
    printf("Bye...Bye...\n");
    return 0;
}

This is input to flex. It prints symbols outside any context, and when it encounters a bracket ( or ), it switches state, so that certain tokens can be disregarded, or you can start regarding them. The code between the curly brackets is plain C. More advanced parsers typically use these actions to return a token value (an integer) so that bison can use it; bison would then attach it to a certain context.

Hence it becomes clear why programming languages have so many magic tokens, like { ( [ ] ) } " ' 0x * & ^ % $ # @ \ | ; : and so on. Most of these tokens are delimiters that trigger some kind of action. The { } tokens are probably the most interesting, as many mature languages use them to control the scope of variables and instructions.

Bison scripts look like the following:


input:    /* empty */
        | input line
;

line:     '\n'
        | exp '\n'          { printf ("\t%.10g\n", $1); }
        | error '\n'        { yyerrok; }
;

exp:      NUM               { $$ = $1; }
        | VAR               { $$ = $1->value.var; }
        | VAR '=' exp       { $$ = $3; $1->value.var = $3; }
        | FNCT '(' exp ')'  { $$ = (*($1->value.fnctptr))($3); }
        | exp '+' exp       { $$ = $1 + $3; }
        | exp '-' exp       { $$ = $1 - $3; }
        | exp '*' exp       { $$ = $1 * $3; }
        | exp '/' exp       { $$ = $1 / $3; }
        | '-' exp %prec NEG { $$ = -$2; }
        | exp '^' exp       { $$ = pow ($1, $3); }
        | '(' exp ')'       { $$ = $2; }
;
/* End of grammar */
%%
You should notice how C instructions, in this case too, permeate the general working of the parser. The parser basically keeps a stack of the tokens processed before, whether these are simple tokens or expressions (a "1" is a token and an expression; "1+1" is an expression composed of two sub-expressions and an operator). The calculator above shows how bison works iteratively through your sentence and builds a sort of tree structure. This tree structure can be influenced to take operator precedence into account, and you could also choose to make it contextual here.

The difference between flex and bison should become somewhat clear from these examples. The lexer doesn't know anything about the meaning of what it is processing, just that it complies with the lexing rules. An integer is a sequence of digits 0-9 without any dots or commas. A float is a sequence of digits with a dot or comma somewhere in the middle (followed by more digits). A word token is a run of alphabetic characters. You could easily construct your own language by prefixing or postfixing these, making sure the rules are processed in the right order.
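
In flex rules, those token classes could look something like this (the TOKEN_* values are placeholders you'd declare in the bison grammar):

[0-9]+              { return TOKEN_INT; }
[0-9]+[.,][0-9]+    { return TOKEN_FLOAT; }
[A-Za-z]+           { return TOKEN_WORD; }

Note that rule order barely matters here: flex prefers the longest match, so "3.14" becomes one TOKEN_FLOAT rather than an int, a dot and another int.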

You could also make the parser interactive, so it becomes a command line application. Rather than verifying that an exact thing was entered, you could attempt to handle whatever was typed. So basically you could create some kind of adventure game with the above. In such a game, you'd probably recognize verbs specifically, but use a more dynamic method for dealing with nouns (objects and rooms differ from game to game). That way, the engine is reusable across entirely different contexts.

Now that you know about bison, you may find ragel and lemon of even better use. These tools have been used to parse SQL, and geographic data (libgeom) for calculating paths and distances and so on. Without too much effort, you could script together a method for doing geometric calculations that way, so you can invoke them from the command line.

Notice how the parser does force you to think in tree structures though. Check out this link for why state machines are quite interesting from a protocol design/implementation perspective. Often, when you parse code yourself, you find yourself knee-deep in invalid states, states that should not be reachable from certain contexts, and so on. Why program that yourself? :)

State machines are also often used in artificial intelligence, so in certain places it makes sense to mix these things up with artificial reasoning. A problem with state machines is that it's possible to have missed some dead-end; in that case, the machine may get stuck forever, unable to find a next state to go to. Ragel, though, has a nice way of showing its rules in a directed graph. That should help to make things clearer and spot those difficult places.

Sunday, September 20, 2009

Linux: epoll performance and gotchas

This is a graph of a very small test I ran to compare the performance of synchronous vs. asynchronous I/O. The previous post already showed how lighttpd uses the epoll interface; I wanted to do some testing with my own implementation.
In this test, I'm using a single dual-core machine (E6850) with 4GB memory on a 1333 MHz bus with CL9. The forking server is very crude: it simply forks after accepting a connection. This could certainly be done more efficiently by pooling threads and so on, but that is what the socket queue already simulates. The idea is to show the effect of creating new processes to handle new traffic, and how this affects overall machine performance.

The forking server shows varying performance. At some point, the process's socket queue in the kernel fills up and the client waits 3, 9 or 12 seconds to obtain a connection, probably because the kernel stalls connections (SYN flood protection?) or needs to clean up resources before it continues accepting new ones. Actually, I didn't wait for this process to finish; adding up the time, it ran for about 1 minute and did 5800 connections. So that performance is very poor. Programming-wise, though, things are very easy, especially using standard C available on every decent UNIX machine, since it's just an accept and a fork. (The gotcha is that the parent needs to close its copy of the connected descriptor and waitpid on child processes, otherwise it holds on to resources.)
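
For reference, a minimal sketch of that accept-and-fork loop, including both gotchas (here children are reaped by ignoring SIGCHLD, a common shortcut for waitpid; handle() stands in for whatever the server does):

#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

void handle(int fd);            /* assumed: serve one connection */

void serve_forking(int listen_fd)
{
    signal(SIGCHLD, SIG_IGN);   /* auto-reap children instead of waitpid() */
    for (;;) {
        int fd = accept(listen_fd, NULL, NULL);
        if (fd < 0)
            continue;
        if (fork() == 0) {      /* child: handle one connection and exit */
            close(listen_fd);
            handle(fd);
            close(fd);
            _exit(0);
        }
        close(fd);              /* parent: close its copy or descriptors leak */
    }
}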

All processes had their file descriptor limit increased to 65536 (ulimit -n 65536) and processes were run as root to prevent having to build in some setrlimit call and other complexities.

The socket queue with 8 threads is basically a round-robin, fire-and-forget mechanism. It is implemented with glib and, like the fork version, isn't that difficult to implement. It has one thread that continuously accepts new connections as fast as it can, then hands each socket descriptor to one of the queues in round-robin fashion. The threads at the other end consume from their queue when a descriptor becomes available and process the socket until it's empty. Since this all runs locally, there is virtually zero latency and each client connects and writes as fast as possible. Since the server doesn't create new threads or sockets, this is actually quite efficient. Problems may arise when sockets actually start blocking; then you may need more threads to handle them.
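
A sketch of that design using glib (the port and the read loop are invented for the example, and error checking is left out):

#include <glib.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#define NWORKERS 8
static GAsyncQueue *queues[NWORKERS];

/* Each worker blocks on its own queue and handles one socket at a time. */
static gpointer worker(gpointer arg)
{
    GAsyncQueue *q = arg;
    for (;;) {
        int fd = GPOINTER_TO_INT(g_async_queue_pop(q)); /* blocks until work */
        char buf[4096];
        while (read(fd, buf, sizeof buf) > 0)
            ;                                           /* drain the client */
        close(fd);
    }
    return NULL;
}

int main(void)
{
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);                        /* example port */
    bind(listen_fd, (struct sockaddr *)&addr, sizeof addr);
    listen(listen_fd, 128);

    for (int i = 0; i < NWORKERS; i++) {
        queues[i] = g_async_queue_new();
        g_thread_new("worker", worker, queues[i]);
    }

    /* Acceptor: hand each new connection to the next queue, round-robin. */
    for (int i = 0;; i = (i + 1) % NWORKERS) {
        int fd = accept(listen_fd, NULL, NULL);
        if (fd >= 0)
            g_async_queue_push(queues[i], GINT_TO_POINTER(fd));
    }
}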

The epoll() method with 8 threads is the fanciest one, and actually skips a couple of other possibilities in between. It uses strictly asynchronous I/O, which is also the most difficult to implement. The advantage is that none of the threads ever block, so they are always looking for something useful to do. In this case, I configured eight worker threads as well. The performance is very similar to the queue version, but in different circumstances the epoll() method will certainly outperform the others, because the queue version blocks waiting for something to read. An alternative to epoll() is poll() with a self-maintained array of sockets. Eventually those lists need maintenance, so each thread has to walk them linearly at some point. That is O(n) time, so it doesn't scale as well as epoll(), which has O(1) performance, plus there's the array maintenance that needs to be coded.

The simple client used in these tests is an absolute psychopath. It attempts to connect up to 125 sockets at the same time and maintain that number (that means a barrage of 125 SYN packets in very quick succession from one thread at the start, and as many as it can, up to 125, whenever connections get processed). This client also uses asynchronous sockets. The technique is similar to non-blocking connect() calls with select() or poll(), but in this case you feed the descriptors to epoll() and let the kernel figure everything out. The gotcha here is that the data member of the epoll_event structure is not a struct of four attributes but a union: you can set only one of the fields, not two. For example, writing the fd field to store the socket descriptor overwrites ptr, and vice versa.

Thus, the implementation of the client goes like this (a C sketch of these steps follows the two lists below):
  1. open socket
  2. set socket to non-blocking
  3. call connect(); this doesn't wait around for completion, it almost always fails immediately with EINPROGRESS.
  4. store the socket + connected info in a specialized structure in epoll through the data.ptr attribute.
  5. open another socket.
After 125 sockets are open at the same time:
  1. see if any sockets have completed already, write a little bit of data and close the socket.
  2. this may free up a couple of places, after which new SYN's are sent.
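A minimal C sketch of steps 1-4 plus the completion handling; struct conn, the payload and the surrounding details are invented for the example:

#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <fcntl.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

struct conn { int fd; };                 /* what data.ptr points at */

static void start_connection(int epfd, struct sockaddr_in *srv)
{
    struct conn *c = malloc(sizeof *c);
    c->fd = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(c->fd, F_SETFL, O_NONBLOCK);   /* step 2: non-blocking */

    /* step 3: returns immediately, almost always with EINPROGRESS */
    if (connect(c->fd, (struct sockaddr *)srv, sizeof *srv) < 0 &&
        errno != EINPROGRESS) {
        close(c->fd);
        free(c);
        return;
    }

    /* step 4: data is a union, so we use only data.ptr (never data.fd too) */
    struct epoll_event ev = { .events = EPOLLOUT, .data.ptr = c };
    epoll_ctl(epfd, EPOLL_CTL_ADD, c->fd, &ev);
}

/* Completion: EPOLLOUT means the connect finished; write a bit and close. */
static void handle_writable(int epfd, struct epoll_event *ev)
{
    struct conn *c = ev->data.ptr;
    write(c->fd, "ping\n", 5);
    epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, NULL);
    close(c->fd);
    free(c);                             /* frees a slot, so a new SYN goes out */
}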
The quicker the server accepts new sockets, the faster this can go, especially when clients suffer from a bit of latency. With the epoll and queued-socket servers above, I cycled through 30,000 connection requests in 4-5 seconds. That's about 6,000 processed connections per second, in an environment where client and server run on the same machine. The problem this load runs into is that the machine runs out of sockets: due to the speed, a lot of sockets sit in the ESTABLISHED state (that's how quickly connections are set up and how relatively slowly they are fully torn down :). Sockets in the ESTABLISHED state consume file descriptors, and you can only have so many open before you run out. Modern servers should be able to handle 10,000 simultaneous connections, however. Running this on two different computers should give twice the number of sockets available.

The idea of the above is to develop software where the system becomes network-constrained. A lot of software, due to its architecture, doesn't use all of the network's bandwidth, but does use all of the CPU or memory. Threads and processes require stack space to operate, because they call functions and leave other data on the stack for reference. The 90's way of handling traffic was to add threads and processes, which increases the demand for memory and CPU: context switching and process management become a bigger bottleneck (since that's what the kernel does). The Linux kernel now ships a very nice scheduler by default, which also has O(1) performance. So you can have 10,000 processes hanging around quite easily, but at some point you do noticeably see performance degradation, especially once the kernel needs to traverse the process lists every x (here 1) ms. That is why having one process/thread per socket is a bad idea.

Increasing throughput on LANs involves some more considerations and options for tweaking. Most Linux distributions are pre-configured for internet use, which means their window sizes differ from what suits a LAN. The reasoning: if the round-trip time is high and you send a large packet that somehow gets lost, the cost of resending that packet is high, so a smaller packet would do better; but if you only send 1 byte per packet, you're not getting anywhere either. Linux actually seems to auto-tune this in the kernel now. High-performance servers sometimes need a few more local ports to deal with lots of short-lived connections (each client-side connection consumes an ephemeral port). This can be modified non-permanently in /proc/sys/net/ipv4/ip_local_port_range.
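
For example, widening the ephemeral port range from a small program (the values are just an example; this needs root and doesn't survive a reboot):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/ip_local_port_range", "w");
    if (!f)
        return 1;                   /* not root, or no procfs */
    fprintf(f, "1024 65535\n");     /* example range: min max */
    fclose(f);
    return 0;
}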

Saturday, September 19, 2009

Supercharging reliable packet transfer capacity

Lighttpd is a very light, high-performance web server that uses a different architecture for handling connections. It is used by meebo and youtube, probably with modifications. Although Apache is the best-known webserver around and has been a very reliable workhorse for years, the rise in network communications capacity means the software itself is now more likely to be the bottleneck than the hardware. Apache, with its threaded architecture, doesn't scale as well or use hardware resources as well as lighttpd can. The reason is related to the use of operating system resources: the number of file handles, threads, and so on. Two very important issues emerge. The first is copying data from the socket into kernel memory and then again into user memory; this is one copy too many, and people are researching how to prevent that first copy, so that data arriving at the network card can be copied straight into user memory when available. The second is the overhead of thread/process context switches and the repercussions a high number of threads and processes have on overall system performance. In effect, the kernel spends more time figuring out which thread/process to run, and housekeeping efforts become larger.

In communications where sockets are associated with threads (let's assume for the remainder of this text that threads and processes are interchangeable), the threads typically block on particular operations within the communication cycle. Writing to a socket may succeed immediately, or it may block the sender until either the socket is disconnected or the remote end has received the data, so that the client's send buffer frees up. If the thread blocks on send or receive, it is put into a blocking state, signalling the kernel that it only needs to be woken up when a buffer empties or data arrives at the socket. Having, say, 10,000 threads around then sounds like a lot, but in the end only a couple of those threads are actually runnable on the CPU. The advantage of this approach is that programming is straightforward and the program is easy to understand. The disadvantage is that this model does not scale towards the actual hardware capacity installed.

Looking at some specific problems I'm facing, I also noticed that the model above doesn't scale at all for clients that send out notification packets. Those are characterized by a client sending a bit of data and closing the connection immediately; it can then pick up another notification, send it immediately and cycle on that. There may also be a large number of clients sending data.

For these situations, TCP state transitions look like this:

client         send ->     <-  send     server
ESTABLISHED     FIN           --        ESTABLISHED
FIN_WAIT1       --            ACK       CLOSE_WAIT
FIN_WAIT2       --            --        CLOSE_WAIT

...... server processes .......

FIN_WAIT2       --            FIN       LAST_ACK
TIME_WAIT       ACK           --        CLOSED

The "CLOSE_WAIT" state is maintained until the server has accepted the socket (which is already in CLOSE_WAIT now), reads the data, then closes the socket itself. A more serious problem arises when the number of sockets in CLOSE_WAIT equals the size of the backlog queue of the listening socket. Let's say this is 128, which is a reasonable number for Linux. The number of sockets in CLOSE_WAIT becomes 128, the server doesn't accept any more connections and it only allows more connections when the server picks up one of the pending connections by calling accept(), handles it and closes it. If the number of clients is relatively high, this server becomes a bottle-neck for any communications up-stream.

Lighttpd supports a newer architecture, available in newer kernels. Actually, it supports a couple of constructs for handling TCP traffic: kqueue, epoll, select, poll and so on. The idea behind these interfaces is that a thread should not be created just to wait for data to come in; instead, a thread selects which sockets are in a state where a successful read or write can take place, and initiates those actions. You use fewer threads to do the same kind of work with better use of resources, become more efficient overall, and can therefore handle more connections and more communication, increasing the throughput the server can handle.
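
The heart of such a server is a small event loop. A minimal sketch (error handling omitted; handle_readable is an assumed callback that reads until EAGAIN):

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void handle_readable(int fd);   /* assumed: read until EAGAIN, maybe close */

void event_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        struct epoll_event events[64];
        int n = epoll_wait(epfd, events, 64, -1);   /* block for readiness */
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                /* listening socket readable: a new client is waiting */
                int fd = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = fd };
                epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &cev);
            } else {
                handle_readable(events[i].data.fd);
            }
        }
    }
}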

A very good discussion on performance is here. It also links to a very good presentation.

Sunday, September 13, 2009

Grozzr

If you want a geeky way of doing groceries, try out http://grozzr.appspot.com. It's a minimalist application for writing shopping lists online on Google. Because it's hosted on AppEngine, you can return to the list later, add stuff, remove stuff, and so forth. The cool thing is that it integrates with Android: if you have an Android phone, you can download an app that syncs the online list with one of the shopping list applications you may have installed (like OI shopping list or Trolly). See the QR code below for where you can download the app:

I've written this to make things slightly easier: no pen and paper, and you always have access to the shopping list. I'm trying to add as few features as possible and do the one thing really well, which is writing shopping lists. One thing I might add is the ability to send the list to the email address of another user with an Android phone, who could then use a temporary key to access the list and sync without having to log in. But nothing much beyond that.

Try it out and let me know what you think!

Wednesday, September 09, 2009

Tried out a new distribution: Fedora

I've now used Fedora for the first time. Much like the relationship between Ubuntu and Debian, Fedora is a distribution that started after RedHat release 9, aimed at home users. RedHat sponsors Fedora heavily but positions RedHat Enterprise Linux downstream of Fedora's developments. So basically, the community is expected to innovate on the platform, and RedHat sits around these developments, sponsoring Fedora where required and streaming the results back into the enterprise product.

I'm now using a couple of Linux distros, actually. Some of the servers run CentOS, which is basically the open-source rebuild of RHEL: RedHat is required to release the software under the GPL license, so in a sense, CentOS == RHEL without the support contracts.

It's also the first time I've used SELinux. I can't say I'm truly happy about it; it's a bit invisible to the first-time user. I've managed to get things started OK, and some nice things are coming out of Fedora, but I do think Ubuntu's support for many desktop tasks is slightly better, probably because of the widely visited Ubuntu forums. All in all, Ubuntu is slightly friendlier to use, but Fedora has some great and innovative features, and probably contains things I haven't even discovered yet.

From the software perspective, they seem to have more or less the same availability. The difference is in package management: RedHat systems use yum, while Ubuntu (and Debian) use apt. Personally, I still slightly prefer apt. Install-wise, things are pretty easy nowadays; it all installs without a hitch.

Monday, September 07, 2009

Truth about Amsterdam

Apparently some guy called Bill O'Reilly from Fox News has painted a very untrue and wacko picture of Amsterdam. Now some people from Amsterdam have set up a website to show what it is really like. I worked right in the center of Amsterdam for the last year, my uncle and aunt lived there their entire lives, I studied there for five years, and it's really just great. What drugs? What violence? What weapons? What crime?

Come to Amsterdam and see for yourself! I promise you'll have a great time!

Saturday, September 05, 2009

Distributed Hash Tables

I've been looking briefly at DHTs, also called Distributed Hash Tables. A DHT is not only a way to store data in some cloud of computers; it can also serve as a resilient, robust cloud of storage. It's basically a technology with a very simple manipulation interface (get, set, remove). The differences between specific DHTs relate to the geography and orientation of nodes, data versioning, hash generation, language of implementation and so on. If you want a clean version, look for projects like bamboo, openchord, opendht and the like. Other projects, like Project Voldemort, already put in some more functionality.

There are some disadvantages to the use of DHT:
  • It doesn't provide absolute guarantees on data consistency and integrity (but it doesn't necessarily make a total mess either; check Amazon's paper on "Dynamo").
  • It's not very useful for "group" queries, range queries or other kinds of data lookups.
  • It doesn't natively support events or triggers very well.
  • There is no authority in the network, so nodes have to cooperate among themselves when certain decisions need to be made.
  • Lookup of data is O(log(n)) and may take 2 or more seconds, depending on the real location, how many nodes there are and the individual latencies between nodes.
There are clear advantages too:
  • It's highly resilient to network leavers/joiners or other big changes around the network. Most implementations handle node changes very well.
  • Data is automatically distributed, according to specific configuration options. How data is eventually organized depends on the strategy chosen in the design; for example, the way nodes are organized together (e.g. in a ring) is very important.
  • Data is replicated across nodes, so it's difficult to lose it.
  • The removal of any single node doesn't impact the network overall in any way.
Now, if you read the wikipedia page, you'll also see products that use this technology. BitTorrent is a protocol that is very effective at distributing file parts, but before you can start sharing a file, you need a way to find it. Finding files works by creating hash keys for them, which act as an alternate key. By setting up a server where you can browse contents and which has an index of the files in the network, you can find the content you're looking for; then, by downloading a .torrent file, you get the hash key used to find peers in the network that can serve that content. From that point onwards, you can communicate with the individual peers to start downloading.

Because keys allow multiple values, you can add yourself as a peer sharing the file as well, so that others can negotiate with you over the file parts you may have that they don't. So DHT in this context is a very effective mechanism for finding peers to communicate with.

There are other uses of DHT once you start building some logic on top of this simple interface. You could create a virtual file system, for example. The key "/" can be loaded with a set of values, which are the 'subdirectories' valid under /; then you just keep going until there are no more files. To get a file directly, you look up a key like "/work/myfiles/important-document" to get information about the location of that file. A toy sketch of this idea follows below.
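
Here's a toy sketch of that scheme in C. The dht_put/list_tree calls run against a flat in-memory stand-in; a real DHT would hash each key and route the calls to the responsible nodes:

#include <stdio.h>
#include <string.h>

/* Toy stand-in for a DHT: a flat list of (key, value) pairs. */
#define MAX 64
static struct { char key[128]; char val[128]; } store[MAX];
static int nstored;

static void dht_put(const char *key, const char *val)
{
    snprintf(store[nstored].key, sizeof store[nstored].key, "%s", key);
    snprintf(store[nstored].val, sizeof store[nstored].val, "%s", val);
    nstored++;
}

/* List every value stored under a key: here, the 'subdirectories'. */
static void list_tree(const char *path, int depth)
{
    for (int i = 0; i < nstored; i++)
        if (strcmp(store[i].key, path) == 0) {
            char child[256];
            printf("%*s%s\n", depth * 2, "", store[i].val);
            snprintf(child, sizeof child, "%s/%s", path, store[i].val);
            list_tree(child, depth + 1);    /* recurse into the 'directory' */
        }
}

int main(void)
{
    dht_put("/", "work");
    dht_put("/work", "myfiles");
    dht_put("/work/myfiles", "important-document");
    list_tree("/", 0);
    return 0;
}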

Because projects like Hadoop replicate large files differently (in blocks of X MB) across a huge cluster, a DHT layer underneath such a system could store the locations of each replicated block. Nodes could then manage their own files this way. This could possibly remove the need for the authoritative, centralized HDFS file directory server and make the network more robust and resilient overall (the central HDFS server is a single point of failure).

It's also possible to store files in two formats: the /x/y/z way, meant for finding information, and the /nodeX/files way, which indicates the files located on a single machine. If each node manages and maintains that information itself, the rest of the network can react to node crashes by simply looking up this information somewhere and acting on it.

The above method for storing information doesn't deviate much from Prefix Hash Trees. A complication in a DHT is that you cannot access data by part of a key (a search), because keys are hashed and unreadable; you can only find data when you have the key in its entirety. You could create a database in this network as a kind of directory, but that introduces a single point of failure again. The schemes above make this slightly more accessible. This is also why DHTs are not the right solution for every problem: they work when you access lots of data by the same primary key. Facebook uses this for showing entire user profiles, slashdot for storing and retrieving rendered comments. The only thing you need next is a method for expiring entries.

DHTs are generally stored in memory, making them really fast for lookups. Most of the time this is acceptable, but sometimes you want a bit more persistence by having individual nodes store their bits of the network on disk. The biggest problems appear once a node goes offline for a longer period and then reconnects: the data it stores may now be partly expired, and you don't want those bits of information to get back onto the network. It is also possible that certain updates have changed the information to something newer, which the old keys do not contain. Amazon's Dynamo has more or less solved this with the use of "vector clocks"; at some point, you need reconciliation.

You could store virtually anything in this network, but of course you need to know what something is as you retrieve it. If you're only looking at non-binary data, an efficient format for storing the data could be JSON. I doubt you'd need "protocol buffers", as they're designed for serializing large amounts of structured data, but possibly you could use the project's compiler anyway to store things in the described format. That at least covers the storage part, so that later you can create apps on top of the DHT in python, C or Java.

What is so nice about DHT? The design lets you work on top of a distributed system, if designed right of course, without worrying about fault tolerance within the application. Basically, once you are connected, you just read and write to your heart's content and it should work.