Whitepaper In Four Minutes - Dat

Introduction

Dat is a protocol designed for versioning and securely sharing large datasets in a peer to peer fashion. It uses public key cryptography to encrypt network traffic, which protects the privacy of both the writer and the readers of a Dat. Dat clients connect to each other to form a public or private decentralized network over which they exchange data.

A Better HTTP

HTTP is the most widely used way to transfer files and data across the internet. It was designed to handle the transfer of small files, which suited the bandwidth and requirements of its time. Nowadays, datasets that are petabytes in size have to be shared over the internet. Moreover, new file distribution techniques have been developed since the first HTTP spec. Because these techniques cannot be applied without breaking the current model of HTTP and the web, new protocols are being designed to handle this issue of scale and size.

Dat is designed as a peer to peer protocol suited to exchanging pieces of a dataset among a swarm of peers. It does so by storing data as chunks in the leaf nodes of a Merkle Tree. This allows a peer to request only some of the chunks. In parallel, the peer can share the chunks that it has already downloaded.

Content Addressable and Dat links

In a distributed system, it is important to verify that the data received is exactly the data that was expected. In Dat, files are referred to by the hash of their own content. The hashes are arranged in a Merkle Tree where each non-leaf node is the hash of all its child nodes. Leaf nodes contain pieces of the dataset, while the other nodes are used for integrity checks. If two trees have matching root hashes, then all other nodes in the trees must match as well. The hashes allow the data to be version controlled, help in the replication process and, most importantly, can be used to verify the integrity of the data received.

How hashes are kept in a Merkle Tree
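
The idea of comparing root hashes can be sketched in a few lines of Python. This is a simplified illustration and not Dat's actual tree layout: it assumes fixed-size chunks, a plain binary tree, and BLAKE2b with one-byte prefixes to separate leaves from parents. The names `split_chunks`, `leaf_hash`, `parent_hash` and `merkle_root` are hypothetical.

```python
import hashlib


def split_chunks(data: bytes, size: int) -> list[bytes]:
    """Cut the data into fixed-size chunks (an assumption made for this sketch)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def leaf_hash(chunk: bytes) -> bytes:
    # The 0x00 prefix keeps a leaf hash from being confused with a parent hash.
    return hashlib.blake2b(b"\x00" + chunk, digest_size=32).digest()


def parent_hash(left: bytes, right: bytes) -> bytes:
    # A parent is the hash of its two children.
    return hashlib.blake2b(b"\x01" + left + right, digest_size=32).digest()


def merkle_root(chunks: list[bytes]) -> bytes:
    """Hash the chunks pairwise, level by level, until one root remains."""
    if not chunks:
        raise ValueError("need at least one chunk")
    level = [leaf_hash(c) for c in chunks]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(parent_hash(level[i], level[i + 1]))
            else:
                # An unpaired node is carried up unchanged.
                next_level.append(level[i])
        level = next_level
    return level[0]


original = merkle_root(split_chunks(b"abcdefghij", 4))
tampered = merkle_root(split_chunks(b"abcdefghiX", 4))
print(original == tampered)
```

Changing a single byte in any chunk changes that leaf's hash and every hash above it, so two peers only need to compare roots to know whether their copies are identical.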

Each Dat filesystem has a public and private key pair associated with it. The public key has a length of 32 bytes (64 characters when hex encoded) and is used to access the filesystem.

  • A Dat public key looks like : 8e1c71b894ec2bbec3423eb44a9...
  • A Dat protocol url looks like : dat://8e1c71b894ec2bbec3423eb44a9...
  • Dat as part of an HTTP URL could look like : https://datproject.org/8e1c71b894ec2bbec3423eb44a9...
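
As a rough sketch of the relationship between the 32 byte key and its 64 character link form, the helpers below convert between the two. The function names `to_dat_url` and `parse_dat_url` are made up for this example, and the key is a random stand-in, not a real Dat key pair.

```python
import os

KEY_BYTES = 32  # length of a Dat public key, as stated above


def to_dat_url(public_key: bytes) -> str:
    """Hex encode a 32 byte key and prefix it with the dat:// scheme."""
    if len(public_key) != KEY_BYTES:
        raise ValueError("public key must be 32 bytes")
    return "dat://" + public_key.hex()


def parse_dat_url(url: str) -> bytes:
    """Recover the raw key bytes from a dat:// link."""
    prefix = "dat://"
    if not url.startswith(prefix):
        raise ValueError("not a dat:// link")
    key = bytes.fromhex(url[len(prefix):])  # raises ValueError on bad hex
    if len(key) != KEY_BYTES:
        raise ValueError("public key must be 32 bytes")
    return key


example_key = os.urandom(KEY_BYTES)  # random stand-in for a real public key
url = to_dat_url(example_key)
print(len(url) - len("dat://"))  # number of hex characters in the link
```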

All Dat messages are encrypted using the public key during transport. This means that only the peers who know the public key will be able to decrypt the messages or communicate with the swarm for that Dat.

The Dat repository has a corresponding private key, which is used to sign messages and files so that peers can verify that they were created by the owner. Dat never exposes the public or private key over the network. Instead, the BLAKE2b hash of the public key is used for discovery. Thus, only peers who know the public key will be able to access a Dat repository and the files in it.
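
A minimal sketch of deriving a discovery identifier from the public key is shown below. It uses a plain 32 byte BLAKE2b digest of the key, as described above; the exact construction in the reference implementation may differ (for example, it may use a keyed hash), and the function name `discovery_key` is an assumption for this example.

```python
import hashlib


def discovery_key(public_key: bytes) -> str:
    """Hash the public key so that peers can look it up without revealing it."""
    return hashlib.blake2b(public_key, digest_size=32).hexdigest()


public_key = bytes(32)  # placeholder key of 32 zero bytes, for illustration only
print(discovery_key(public_key) != public_key.hex())
```

Because the hash cannot be reversed, someone watching the discovery network learns which identifier is being looked up but not the public key needed to read the data.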

Peer Discovery

In the Dat implementation, discovery can happen over the following three types of discovery networks:

  • Multicast UDP over a LAN
  • DNS name servers
  • Kademlia Distributed Hash Table - to avoid relying on a single central point of failure.

Dat is very flexible because, in theory, discovery can happen over any network as long as the following actions can be modeled:

  • join(key, [port]) - perform lookup for a key
  • leave(key, [port]) - stop looking for the key
  • foundpeer(key, ip, port) - called when a peer is found
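
The three actions above can be modeled with a toy in-memory network. This is only a sketch of the interface shape: `InMemoryNetwork` and `Discovery` are hypothetical classes, the shared dictionary stands in for a LAN, DNS server or DHT, and the IP addresses and ports are arbitrary example values.

```python
from collections import defaultdict
from typing import Callable, Optional

FoundPeer = Callable[[str, str, int], None]


class InMemoryNetwork:
    """Shared registry standing in for a real discovery network."""

    def __init__(self) -> None:
        self.announced = defaultdict(set)   # key -> {(ip, port)}
        self.listeners = defaultdict(list)  # key -> [Discovery]


class Discovery:
    def __init__(self, network: InMemoryNetwork, ip: str, foundpeer: FoundPeer) -> None:
        self.network = network
        self.ip = ip
        self.foundpeer = foundpeer

    def join(self, key: str, port: Optional[int] = None) -> None:
        """Look up a key; also announce ourselves if a port is given."""
        for ip, peer_port in sorted(self.network.announced[key]):
            if ip != self.ip:
                self.foundpeer(key, ip, peer_port)
        if self not in self.network.listeners[key]:
            self.network.listeners[key].append(self)
        if port is not None:
            self.network.announced[key].add((self.ip, port))
            for other in self.network.listeners[key]:
                if other is not self:
                    other.foundpeer(key, self.ip, port)

    def leave(self, key: str, port: Optional[int] = None) -> None:
        """Stop looking for a key; withdraw our announcement if a port is given."""
        if self in self.network.listeners[key]:
            self.network.listeners[key].remove(self)
        if port is not None:
            self.network.announced[key].discard((self.ip, port))


net = InMemoryNetwork()
seen = []
writer = Discovery(net, "10.0.0.1", lambda key, ip, port: None)
reader = Discovery(net, "10.0.0.2", lambda key, ip, port: seen.append((ip, port)))
writer.join("some-discovery-key", 3282)  # announce on an example port
reader.join("some-discovery-key")        # lookup only, no announcement
print(seen)
```

Any transport that can offer these three operations, whether multicast, DNS or a DHT, could be swapped in behind the same interface.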

After discovery, Dat contacts the peers that have the required keys and communicates with them over TCP, HTTP or UTP.

Versioning

Dat splits each file into its content and its metadata. The content is stored in the content register and the metadata is stored in the metadata register. Dat can therefore version the metadata and content separately. The metadata register is used to keep a historical record of the filesystem.
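
To make the split concrete, here is a toy model with two append-only lists, one for content and one for metadata. It is an illustration under simplifying assumptions and not Dat's on-disk format: the `Archive` class and its methods are hypothetical, and this sketch keeps every past version of the content.

```python
from typing import Optional


class Archive:
    def __init__(self) -> None:
        self.content: list[bytes] = []  # content register: file bodies
        self.metadata: list[tuple] = []  # metadata register: (op, path, content index)

    def write(self, path: str, data: bytes) -> int:
        """Append the body and a metadata entry; return the new version number."""
        self.content.append(data)
        self.metadata.append(("put", path, len(self.content) - 1))
        return len(self.metadata)

    def delete(self, path: str) -> int:
        """Record a deletion in the metadata register only."""
        self.metadata.append(("del", path, None))
        return len(self.metadata)

    def listing(self, version: Optional[int] = None) -> dict[str, int]:
        """Replay the metadata log up to a version to rebuild the file listing."""
        files: dict[str, int] = {}
        for op, path, index in self.metadata[:version]:
            if op == "put":
                files[path] = index
            else:
                files.pop(path, None)
        return files

    def read(self, path: str, version: Optional[int] = None) -> bytes:
        """Fetch a file as it existed at the given version (latest if omitted)."""
        return self.content[self.listing(version)[path]]


archive = Archive()
v1 = archive.write("data.csv", b"id,value\n1,10\n")
v2 = archive.write("data.csv", b"id,value\n1,11\n")
print(archive.read("data.csv", v1) != archive.read("data.csv", v2))
```

Because the metadata log is small compared to the content, a peer could keep the full history of the filesystem while holding only the file bodies it actually needs.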

When it comes to content versioning, Dat is a little different from regular version control systems like Git. Git stores all the past content of a repository locally, which works well for a source code repository. Since Dat is designed to be used as a repository for huge datasets, it is often not feasible to store all previous versions locally. Dat therefore does not have to keep every past version of the content, although it can be configured to store all past versions of a dataset.

Conclusion

If BitTorrent and Git were to have a magical child for large datasets, it would be Dat. Today, there is a reference implementation of Dat written in Node.js. There is also a Dat-first browser, Beaker Browser (https://beakerbrowser.com), which can be used to access websites on the Dat network. Because the browser is itself a Dat peer, it also allows users to host their own websites on the network. This is an amazing time for peer to peer technologies, and the decent(ralized) web is slowly becoming a reality.

References:

  1. Dat Whitepaper
  2. Dat FAQ
  3. Verified, shared, modular research communication with the Dat protocol
