Thoughts on DHTs and p2p networking

October 8th, 2024 by Justin Evankovich

Motivation

The idea of constructing an overlay network on top of the Internet that would allow any two computers (and more crucially their users) to communicate is compelling. I often wish for an alternative to Discord that wouldn't require standing up a complicated application server to host my friends' conversations. Beyond the internet, I imagine being in a crowded place where the cell towers are overloaded, but I want to send a message to my friend to let them know where I am. It's silly that to do so I need to open a tracker-laden app, make a tenuous connection to a server owned by {Google, Meta, Apple, etc}, and play a lossy relay race; especially when my friend is probably less than a mile away, and my phone and the phones of everyone around me are virtually bristling with antennas that can efficiently transmit on a huge portion of the radio spectrum. Why can't all of our devices relay messages to my friend and back?

There are plenty of p2p networking protocols. Most of them designed for use over the open internet are structured as DHTs (Distributed Hash Tables). A DHT, as the name implies, is a data structure like a classic hash table, that provides efficient lookup of a value given a key when the mapping is distributed over many computers. Several names pop up often when researching DHTs: Kademlia, Pastry, Tapestry. Kademlia is probably the most widely-implemented of these, famously being used as the basis for the BitTorrent DHT. For aesthetic reasons I can't articulate, Pastry is the most appealing to me. All three operate similarly: each peer assigns itself a random identity, and this identity exists in the same space as the hash table keys. A distance metric in the key-space is used to select peers to connect to and keep in the routing table. To GET or PUT some data at a key, successively nearer peers (in the key-space) are selected until the peer closest to the key is found. Often, there's some method of replicating the key-value pair at multiple peers near the destination, in case some peers go off-line. The target for DHTs like this, is that to fully route a message requires O(log n) "hops" from originator to destination, and that each node only maintains O(log n) state about the network. That is, in broad strokes, how it works — there's of course some variation between each method, but it hardly matters at this stage of discussion.

One difficulty when designing and implementing a DHT like the three above, is that not every peer can directly connect to every other peer. That is, the adjacency matrix is not complete. This could be due to NAT, where a peer can make out-going connections but other peers can't easily make connections to it, or it could be due to a BGP error, or link degradation within an ISP's network. There exists a DHT, most similar to Kademlia, used by the GNUnet project that attempts to handle this so-called "Restricted Routing" problem. They call it R⁵N.

We can also draw inspiration from mesh-routing networks designed for use over wireless links. In a wireless mesh, the routes available to each peer are severely restricted! Each peer can reach a small handful of peers that are within range, and there's no guarantee that if peer A can send a message to peer B, that peer B could do the same for A — a strong interferer near A could prevent any of B's messages from getting through, while A's come through clearly for B. Protocol's designed for this environment often restrict themselves to small numbers of peers (small meaning hundreds and maybe low-thousands) and take the option of flooding any given message through the network. This works, but makes inefficient use of the wireless spectrum. Other protocols require each peer to have global knowledge of the network in order to effectively route a message, again limiting the number of possible peers. A protocol that doesn't resort to either approach is B.A.T.M.A.N.. B.A.T.M.A.N. effectively constructs a spanning tree for each peer, such that each peer knows the best "next-hop" to route to any given peer. In this scheme, each node doesn't require global knowledge, but they are required to keep O(n) state (though perhaps if it's represented as "this next-hop is best for this range of addresses" it's less than O(n)...).

A brief aside on SCRIBE

The Pastry project includes a component called SCRIBE. SCRIBE allows peers to subscribe to "topics". Then, other peers can send a message to the topic and it will reach all subscribed peers. Effectively, this is a form of application-level multicast — quite convenient, since multicast doesn't really function over the open internet. SCRIBE works by, effectively, constructing a spanning tree of peers for each topic, rooted at the peers nearest the key used to address the topic, where the leaves are the members of the topic. When a message is to be routed to the topic, it begins its journey by following the typical Pastry routing algorithm, getting forwarded to nearer-and-nearer peers. Somewhere along that path, however, the message will be received by a node that is a member of the spanning tree, possibly the root, but likely one of the inner nodes as well. At this point, the message will continue being forwarded in the normal way, but it will also be forwarded toward the leaves of the spanning tree (the members!). An upshot of this construction is that each branch of the spanning tree is a link that is already proven to be reachable, meaning that as long as one of the inner nodes of the tree is reachable from the originator of the message so is the member of the topic.

A new approach

My idea, is to use the ideas in B.A.T.M.A.N. (constructing a spanning tree for each node's next-hop reachability) and in SCRIBE (doing this on top of a more classic DHT, and also not flooding the routes but using the DHT routing to converge toward the routing tree). On top of the normal routing information present in Kademlia or Pastry or any other DHT, each node would need to keep track of the next-hop for any peer it's on the spanning tree of. In order to furnish this information within the network, each peer would need to send extra book-keeping messages to its neighbors. What I'm imagining is that it would route (with replication) a UNICAST_JOIN message to its own address. It would need to perform this periodically, and I believe it would help to choose different neighbors to begin the routing process each time. As the message traverses the network, each peer would modify the header to indicate how many hops the message has taken, as well as recording a tuple composed of (the ID of the originator, the number of hops, the next hop (i.e. the node the message was received from)). From that point on whenever that peer receives a message addressed to the originator of the UNICAST_JOIN, it would simply forward the message toward the recorded next hop.

Of course, maintaining this state in a consistent way in the face of peers joining and exiting the network will be a challenge. If the next hop disappears, then the destination could appear unreachable for some time. I'm gonna hand-wave those issues away for now, and assume they can be overcome.

In terms of state within the network, O(log n) state would be stored due to each UNICAST_JOIN probe. If every peer participates in this process, then O(n log n) state would be stored due to this process across the entire network or O(log n) extra state per peer. This is on the same order as the state maintained for the general DHT routing information, so I expect some version of this idea will be feasible in large networks.

Robust routes and simpler replication

One way of improving reliability in the face of node- or link-failures, is by replicating messages so they take multiple routes through the network and reach multiple destinations. An idea for implementing this, would be to include a "route offset" field in the header of some (all?) messages. When forwarding a message with a non-zero route offset, the forwarding peer would select the next hop as normal, but after selecting it choose the nth successor or predecessor of that peer that is still a neighbor. I expect, but haven't proved or demonstrated, that this would eventually reach the peer that is the nth successor or predecessor of the destination. Crucially, not only would the end-point be diverse, the entire route would be (with high probability) distinct!

In the context of our unicast spanning tree constructed from a subset of the peers in the network, a peer could send its probes with different values of route offset in order to achieve diversity of peers that are holding its routing information and make the resulting routing tree more robust to peers exiting the network.

Conclusion

I'm not sure what to say. I really only have a rough idea of what this would look like, and I think I need to go through the exercise of implementing it on top of a network simulator, and testing some variations. If it works, it could make for a very robust overlay network, even in the face of poorly behaved networks. If it doesn't work... well, I guess that would be fine. I might even learn something >:)