Internals

This section provides insight into the IPFS Cluster internals. It explains how the cluster works on the inside, what happens when a pin request is received, and how the project code is organized.

The consensus algorithm

IPFS Cluster was designed with the idea that it should eventually support different consensus algorithm implementations. The consensus layer takes care of two things: maintaining a consistent view of the shared state among all peers, and keeping track of the set of peers that form the cluster (the peerset).

Regardless of the considerations above, we leave the definition of what a consistent view of the state is quite open, as different consensus layer implementations may respond to different needs for consistency, or provide different approaches towards it. Some consensus approaches may also not worry about keeping a peerset as others do.
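As an illustration of these two responsibilities, here is a rough Go sketch of what a pluggable consensus component could look like. The interface, the method names and the string-typed identifiers are made up for this example and do not match ipfs-cluster's actual code.

```go
package consensus

import "context"

// Component is an illustrative interface capturing the two responsibilities
// described above: agreeing on the shared state and (optionally) on the
// peerset. CIDs and peer IDs are plain strings here for brevity.
type Component interface {
	// Shared state: agree on pin and unpin operations so that every peer
	// eventually applies them to its copy of the state.
	LogPin(ctx context.Context, cid string) error
	LogUnpin(ctx context.Context, cid string) error

	// Peerset: some consensus approaches may not need to track this.
	AddPeer(ctx context.Context, peerID string) error
	RmPeer(ctx context.Context, peerID string) error
	Peers(ctx context.Context) ([]string, error)
}
```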

Raft

The Raft consensus implementation was chosen as the default consensus layer for IPFS Cluster for several reasons, among them the relative simplicity of the algorithm and the availability of a mature, well-tested Go implementation.

Raft works by committing log entries to a “distributed log” which every peer follows. In IPFS Cluster, every “Pin” and “Unpin” request is a log entry in that log. When a peer receives a committed “Pin” operation, it updates its local copy of the shared state to indicate that the CID is now pinned.
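As a minimal sketch of this mechanism (with made-up types, and CIDs as plain strings), applying a committed entry to a peer's copy of the shared state could look like this:

```go
package state

// OpType distinguishes the two kinds of log entries described above.
type OpType int

const (
	OpPin OpType = iota
	OpUnpin
)

// LogOp is a committed “Pin” or “Unpin” entry from the distributed log.
type LogOp struct {
	Type OpType
	CID  string
}

// State is a peer's local copy of the shared pinset.
type State struct {
	pins map[string]struct{}
}

// New returns an empty copy of the shared state.
func New() *State { return &State{pins: make(map[string]struct{})} }

// Apply updates the local copy of the shared state with a committed entry.
func (s *State) Apply(op LogOp) {
	switch op.Type {
	case OpPin:
		s.pins[op.CID] = struct{}{}
	case OpUnpin:
		delete(s.pins, op.CID)
	}
}
```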

In order to work, Raft elects a cluster “Leader”, which is the only peer allowed to commit entries to the log. A Leader election can only succeed if a majority of the peers is online (for example, at least 3 out of 5). Thus, log commits, and the other parts of cluster functionality that depend on them, can only happen when a Leader exists.

For example, a commit operation to the log is triggered with ipfs-cluster-ctl pin add <cid>. This will use the peer’s API to send a Pin request. The peer’s Raft consensus layer will in turn forward the request to the cluster’s Leader, which will perform the commit of the operation and inform all the peers in the peerset about it. This is explained in more detail in the “Pinning an item” section below.
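The following Go sketch illustrates that forwarding logic with made-up helper types (raftLog, rpcClient and the CommitOnLeader call are hypothetical, not ipfs-cluster's actual RPC layer):

```go
package consensus

import (
	"context"
	"errors"
)

// LogOp is a “Pin” or “Unpin” entry to be committed to the log.
type LogOp struct{ Type, CID string }

// raftLog and rpcClient stand in for the real Raft library and the
// cluster-internal RPC layer.
type raftLog interface{ ApplyLog(op LogOp) error }

type rpcClient interface {
	CommitOnLeader(ctx context.Context, leader string, op LogOp) error
}

// Consensus holds just enough to illustrate the flow described above.
type Consensus struct {
	self   string // this peer's ID
	leader string // the current Leader's ID ("" when there is none)
	raft   raftLog
	rpc    rpcClient
}

// Commit sends the operation to the Leader: only the Leader commits entries
// to the log; any other peer forwards the request to it.
func (c *Consensus) Commit(ctx context.Context, op LogOp) error {
	if c.leader == "" {
		return errors.New("no Leader: consensus is not healthy")
	}
	if c.leader != c.self {
		return c.rpc.CommitOnLeader(ctx, c.leader, op)
	}
	return c.raft.ApplyLog(op) // Raft replicates the committed entry to the followers
}
```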

The “peer add” and “peer remove” operations also trigger log entries (internal to Raft) and likewise depend on a healthy consensus status. Modifying the cluster peers is a tricky operation because it requires informing every peer of the new peer's multiaddresses. If a peer is down during this operation, the operation will fail, because otherwise that peer would not know how to contact the new member. Thus, it is recommended to remove and re-bootstrap any peer that is offline before making changes to the peerset.

By default, the consensus data is stored in the raft subfolder, next to the main configuration file. This folder stores two types of information: the boltDB database holding the Raft log, and the state snapshots. Snapshots of the log are taken regularly when the log grows too large (see the raft configuration section for options). When a peer is far behind in catching up with the log, Raft may opt to send a snapshot directly, rather than sending every log entry that makes up the state individually. This data is initialized on the first start of a cluster peer and maintained throughout its life. Removing or renaming the raft folder effectively resets the peer to a clean state. Only peers with a clean state should bootstrap to already running clusters.
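Purely as an illustration of the kind of snapshot-related knobs involved, the options for this layer can be pictured along these lines in Go. The field names below are made up; the authoritative option names live in the raft configuration section.

```go
package raftcfg

import "time"

// Config is illustrative only and does not match ipfs-cluster's actual
// configuration keys.
type Config struct {
	DataFolder        string        // folder holding the boltDB log store and the snapshots
	SnapshotInterval  time.Duration // how often Raft checks whether a snapshot is due
	SnapshotThreshold uint64        // how many new log entries trigger a snapshot
}
```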

When running a cluster peer, it is very important that the consensus data folder does not contain any data from a different cluster setup, or data from diverging logs. What this essentially means is that different Raft logs should not be mixed. On clean shutdowns, ipfs-cluster peers will also create a Raft snapshot. This snapshot is the state copy that can be used for exporting or upgrading the state format.

The shared state, the local state and the ipfs state

It is important to understand that ipfs-cluster deals with three types of states, regardless of the consensus implementation used:

- The shared state, which is maintained by the consensus layer and replicated among all peers. It lists which CIDs are cluster-pinned, along with their allocations and replication factors.
- The local state, which each peer keeps for itself. It tracks the status of the pins that concern that peer (pinned, pinning, error and so on).
- The ipfs state, which is the actual pinset of the IPFS daemon managed by the peer.

In normal operation, all three states are in sync, as updates to the shared state cascade to the local and the ipfs states. Additionally, syncing operations are regularly triggered by ipfs-cluster. Unpinning cluster-pinned items directly from ipfs will, for example, cause a mismatch between the local and the ipfs state. Luckily, there are ways to inspect each state: ipfs-cluster-ctl pin ls shows the shared state, ipfs-cluster-ctl status shows the local state as tracked by each peer, and ipfs pin ls shows the pinset of the ipfs daemon itself.
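A sync operation essentially compares two of these views and flags disagreements. A toy sketch of that comparison (with CIDs as plain strings and a made-up function name):

```go
package statesync

// Diff returns the CIDs that the local state expects to be pinned but that
// the ipfs daemon no longer reports (for example, after a manual
// “ipfs pin rm”).
func Diff(localState, ipfsPinset map[string]bool) []string {
	var missing []string
	for c, shouldBePinned := range localState {
		if shouldBePinned && !ipfsPinset[c] {
			missing = append(missing, c)
		}
	}
	return missing
}
```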

As a final note, the local state may show items in error. This happens when an item takes too long to pin/unpin, or when the ipfs daemon becomes unavailable. ipfs-cluster-ctl recover <cid> can be used to rescue these items. See the “Pinning an item” section below for more information.

Pinning an item

ipfs-cluster-ctl pin add <cid> will tell ipfs-cluster to pin (or re-pin) a CID.

When using the Raft consensus implementation, this involves two parts:

- Committing the operation: the peer receiving the request hands it to its consensus layer, which forwards it to the Leader; the Leader commits a “Pin” entry to the log and informs every peer in the peerset.
- Carrying it out: every peer updates its copy of the shared state, and the peers allocated to the CID queue a pin request to their ipfs daemons, tracking its progress in their local state.

Errors in the first part of the process (before the entry is committed) will be returned to the user and the whole operation is aborted. Errors in the second part of the process will result in pins with a status of PIN_ERROR.

Deciding where a CID will be pinned (that is, which IPFS daemons will store it and thus receive the allocation) is a complex process. In order to decide, all available peers (those reporting valid, non-expired metrics) are sorted by the allocator component according to the value of their metrics. These values are provided by the configured informer. If a CID is already allocated to some peers (as in a re-pinning operation), those allocations are kept.
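A condensed Go sketch of that sorting step, with made-up types (the real allocator and informer components are pluggable and more involved):

```go
package allocator

import "sort"

// PeerMetric is a simplified metric as provided by the configured informer.
type PeerMetric struct {
	Peer  string  // peer ID, as a plain string for brevity
	Value float64 // the metric value (for example, free storage space)
	Valid bool    // false when the metric is missing or has expired
}

// Candidates drops peers without valid metrics and sorts the rest so that
// the most suitable peer (here, the one with the highest value) comes first.
func Candidates(metrics []PeerMetric) []string {
	var valid []PeerMetric
	for _, m := range metrics {
		if m.Valid {
			valid = append(valid, m)
		}
	}
	sort.Slice(valid, func(i, j int) bool { return valid[i].Value > valid[j].Value })
	peers := make([]string, 0, len(valid))
	for _, m := range valid {
		peers = append(peers, m.Peer)
	}
	return peers
}
```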

New allocations are only provided when the allocation factor (the number of healthy peers holding the CID) is below the replication_factor_min threshold. In those cases, the new allocations (along with the existing valid ones) will attempt to bring the total up to replication_factor_max. When the allocation factor of a CID is within the margins indicated by the replication factors, no action is taken. The value “-1” in replication_factor_min and replication_factor_max indicates a “replicate everywhere” mode, in which every peer pins the CID.
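The decision can be summarised with a small sketch. The function below is illustrative and assumes simplified inputs (the current number of valid allocations, the replication factors, and how many additional peers are available):

```go
package allocator

import "errors"

// NeededAllocations returns how many new peers should be allocated for a
// CID, or 0 when the current allocation factor is already within margins.
func NeededAllocations(current, min, max, available int) (int, error) {
	if min == -1 && max == -1 {
		return available, nil // “replicate everywhere”: use every available peer
	}
	if current >= min {
		return 0, nil // allocation factor within the margins: no action taken
	}
	// Try to reach replication_factor_max with the existing valid
	// allocations plus as many new peers as possible.
	want := max - current
	if want > available {
		want = available
	}
	if current+want < min {
		return 0, errors.New("not enough available peers to reach replication_factor_min")
	}
	return want, nil
}
```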

Default replication factors are specified in the configuration, but every Pin object carries its own replication factors in its entry in the shared state. Changing the replication factor of existing pins requires re-pinning them (it does not suffice to change the configuration). You can always check the details of a pin, including its replication factors, using ipfs-cluster-ctl pin ls <cid>. You can use ipfs-cluster-ctl pin add <cid> to re-pin at any time with different replication factors. Note, however, that the new pin will only be committed if it differs from the existing one in the way described in the paragraph above.
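For illustration, an entry in the shared state can be pictured along these lines (the field names are approximate; the exact Go types used by ipfs-cluster differ):

```go
package state

// Pin sketches the information that each entry in the shared state carries.
type Pin struct {
	CID                  string   // the pinned content identifier
	Allocations          []string // IDs of the peers allocated to pin this CID
	ReplicationFactorMin int      // -1 means “replicate everywhere”
	ReplicationFactorMax int      // -1 means “replicate everywhere”
}
```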

In order to check the status of a pin, use ipfs-cluster-ctl status <cid>. Retries for pins in error state can be triggered with ipfs-cluster-ctl recover <cid>.

Pin (and unpin) requests are queued so as not to overwhelm ipfs with too many simultaneous requests (for example, when ingesting many pins at once).
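A toy sketch of this queueing idea, with a made-up pinToIPFS placeholder standing in for the actual call to the ipfs daemon:

```go
package pinqueue

import "sync"

// pinToIPFS is a placeholder for the request that asks the local ipfs
// daemon to pin a CID.
func pinToIPFS(cid string) { /* e.g. POST /api/v0/pin/add?arg=<cid> */ }

// Process drains a queue of pin requests using at most `workers` concurrent
// calls, so that ingesting many pins at once does not flood the ipfs daemon.
func Process(queue <-chan string, workers int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for cid := range queue {
				pinToIPFS(cid)
			}
		}()
	}
	wg.Wait()
}
```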

Unpinning an item

ipfs-cluster-ctl pin rm <cid> will tell ipfs-cluster to unpin a CID.

The process is very similar to the one described in “Pinning an item” above. Removed pins are wiped from the shared and local states. When requesting the local status for a given CID, it will show as UNPINNED. Errors will be reflected as UNPIN_ERROR in the pin's local status.

Next steps: Composite Clusters