How could you safely store files “in the cloud”? How can a “cloud storage” system work? The idea is that you (“the speaker”) have a large file and you would like storage servers (“providers”) to be incentivized to store it for you so you can retrieve it in the future. Assume for the time being that the file is private, not public.
(This is basically the problem that MojoNation, MNet, Filecoin, etc., attempt to solve, and it’s possible that their solutions are better than what I propose here. IIRC Tahoe didn’t try to solve the incentive problem, punting it to social “friendnet” considerations, which ultimately failed in almost all cases.)
Of course, you could just send a hosting service some Mastercard payments or bitcoin every month, as long as the file remains accessible. This has a few limitations:
You can solve problem #1 by encrypting the file and storing the keys locally, and problem #2 by computing a secure hash of the file which you store locally. If you want to be able to retrieve small pieces of the file, instead of just storing a single secure hash, you need to store the hash of a Merkle tree of the file, as some modern hashes like BLAKE3 do implicitly.
A particularly simple way to build such a Merkle tree is by dividing the file into pieces of, say, 64KiB, and concatenating the hashes of the pieces to form a piece table file. Files up to 64KiB will have a single-entry table (32 bytes with SHA-256), which can be stored directly; larger files would require a piece table to be stored somewhere, such as in the cloud storage system itself. With the given example parameters, a single-piece piece table would handle files up to 128 MiB, while you’d also need to store a piece table for your piece table for files up to 256 GiB, while three levels are needed for files up to 512 TiB, etc.
Solving problem #3 is a little more difficult; it requires redundancy among different hosting providers who don’t know about each other, which is potentially expensive because you need to store M copies of the file, which costs M times as much. N-of-M Shamir secret sharing gives you a less expensive way to do this and also solves problem #1, again, as long as there’s no way for an attacker to find out which hosting providers they are. Then you need to store a Merkle tree of each of the M shares so they can be verified independently. And you need to periodically test some randomly chosen pieces of each share to ensure it hasn’t been lost (“auditing”).
So far, as I understand it, this is more or less how modern cloud backup systems like Borg, Restic, kup, rdedup, and perkeep work.
If a malicious provider wants to get paid but not perform the service, they can cheat by the following approach: when asked for piece #42 of a file, retrieve piece #42 from N other providers, compute piece #42 of the original file, and then recompute piece #42 of the share they were asked to store. This is not a problem if they have no way to find out about the other providers, but that would impede extending the scheme to public files, because if a file is public then anyone can find out what its pieces are and retrieve them. A different defense against this attack is to encrypt each share before sending it to the provider; then they cannot recompute their piece without the encryption key. By using public-key cryptography, even clients who can decrypt the piece using the public key will be unable to recompute it, which would require the private key.
Extending the system to public files, however, makes it easier to censor: a would-be censor can repeatedly retrieve a public file while observing the network in various ways to find out who the providers of the file are, then attack those providers, perhaps presenting them with ultimatums to delete the file to end the attack. In a simple example where the providers are directly contacted over IP, they can observe which IP addresses they are talking to and traceroute those addresses, an attack which can be stymied by putting the providers on Tor onion services. But even passive traffic analysis (in the time domain, the frequency domain, a wavelet domain, etc.) is probably sufficient to unmask Tor onion services, particularly in the presence of observable network outages, which will tend to increase the latency of particular providers or even knock them offline. Active traffic analysis, in which streams of artificial traffic are used to artificially alter network latency and loss, is even more powerful. High-latency networks like uucp, Fidonet, and the old cypherpunks remailers, particularly with randomized rendezvous times, are one defense that reduces the information leakage to traffic-analysis attackers; constant-bandwidth mixnets like the old ISDN mix proposal are another.
Sybil problems are also important for #3: if all your providers are in Amazon’s us-east-1 data center, that’s really only a single provider with uncountable faces, because when that data center goes offline, all the providers disappear at once. I don’t know how to really solve that problem; the best attack on it so far seems to be Satoshi’s proof of work, and that’s turned out not to be as watertight as Satoshi had hoped. (Centralized identity systems like government IDs are often suggested to solve Sybil problems, but those don’t help with the problem of many distinct real persons hosting their storage server in us-east-1.) Spreading across many providers will tend to diminish it, which requires automation, bringing us to problem #4.
Problem #4 is partly a political problem rather than a technical one: censorship policies, especially policies for liability for distributing forbidden information, tend to disincentivize automated data preservation systems. Similarly, repudiable payment systems like Paypal and Mastercard impose counterparty risk on hosting providers: a customer can revoke payment for service they’ve already received, a problem solved by Bitcoin and similar cryptocurrencies.
Problem #5 is the trickiest: it’s an incentive-design problem. You need to delegate the auditing function to some sort of third party that will probably survive your death. Such systems are rife with principal–agent problems:
If the auditor can be trusted by the provider as well as the speaker, a system becomes possible where the provider posts bonds that are forfeit if it fails auditing. This way, speakers can choose to use providers who stand to lose more if they fail an audit, perhaps long after the speaker’s death.
Two ways of cross-checking auditors are to have multiple auditors performing the same audit, who ought to get the same result, or to include “dummy providers” who will intentionally return no data or bad data for certain nonexistent files. If the auditor cannot determine whether a provider is a real provider or a dummy provider, false auditing results can be detected.
Who audits the auditors? Szabo’s “property club” approach suggests using a quorum of auditors who can vote dishonest auditors off the island. Alternatively, for public databases (which are the case we most care about after the speaker’s death), micropayments can flow from readers of a database to the auditors and the providers; if a database stops working reliably, the auditors can anticipate that readers will stop paying.