In this post, I am going to cover how the IPFS Cluster’s PinTracker component used to work, what some of the issues with that implementation were, how we fixed them, and where to go next.
First, the purpose of the pintracker: the pintracker serves the role of ferrying the appropriate state from IPFS Cluster’s shared state to a peer’s ipfs daemon state.
How this occurs is as follows:
- IPFS Cluster receives a request to Pin a particular Cid.
- This request is routed to the consensus component, where it is stored in the distributed log, and then an RPC call to Track the Cid is made to the pintracker.
- The pintracker creates and stores the PinInfo in an internal map, before making a Pin request to the IPFS node via an RPC call to the IPFSConnector component.
- The IPFSConnector component is what finally requests the ipfs daemon to pin the Cid.
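The flow above can be sketched in Go. The types and function names here are illustrative simplifications, not the real ipfs-cluster APIs:

```go
package main

import (
	"fmt"
	"strings"
)

// steps records each hop of the pin request so the flow is visible.
var steps []string

// LogPin: the consensus component stores the pin in the distributed
// log and then makes the Track RPC call to the pintracker.
func LogPin(cid string) {
	steps = append(steps, "consensus: logged pin "+cid)
	Track(cid)
}

// Track: the pintracker stores the PinInfo in its internal map and
// then makes a Pin request to the IPFSConnector component.
func Track(cid string) {
	steps = append(steps, "pintracker: tracking "+cid)
	ConnectorPin(cid)
}

// ConnectorPin: the IPFSConnector finally asks the ipfs daemon to pin.
func ConnectorPin(cid string) {
	steps = append(steps, "ipfsconnector: pin request sent to ipfs daemon for "+cid)
}

func main() {
	LogPin("QmExample")
	fmt.Println(strings.Join(steps, "\n"))
}
```

Each arrow in the chain is an RPC call in the real system; the sketch only shows the ordering of the hops.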
The issues are separated into those which were due to how we initially implemented the MapPinTracker and then those that were/are inherent in any implementation of the pintracker that uses a map internally to store the status of the pins.
Issues with the implementation:
Issues with a MapPinTracker:
To tackle the issues with the current implementation of the MapPinTracker, we did the following things.
We moved to a model where we track in-flight operations: instead of a map that stores the status of a Cid, i.e. map[Cid]PinInfo, we now store an operation for a Cid, i.e. map[Cid]Operation. An Operation contains not only the type of operation being performed for a Cid (Pin or Unpin) and the phase of the operation (Queued, In Progress, Done, or Error), but also a cancelable context.
With the addition of context propagation through RPC calls to the IPFSConnector component, having a context available in every Operation gives us the ability to cancel an operation at any point.
Also, upon receiving opposing operations for the same Cid, we can cancel the in-flight operation automatically, possibly even before that operation has started to be processed, depending on the timing.
With the increased visibility into the queue of requested operations and the ability to cancel operations, the potential for the local state to get out of sync has greatly decreased. This means that cluster.StateSync no longer needs to be called every 60 seconds to guarantee consistency. Also, Recover is now async, as the queue of operations is no longer a black box.
Currently in PR, there is a stateless implementation of the pintracker interface. This implementation removes the duplication of state and potential for stale PinInfos in the pintracker itself. The stateless pintracker relies directly on the shared state provided by the consensus component and the state provided by the ipfs node. The main benefit is for clusters with a very large number of pins, as the status of all those pins will not be held in memory.
Last week we released IPFS Cluster v0.4.0.
We bumped the minor version number to make explicit not only that we brought on a lot of new things, but that this release also included a number of breaking changes, mostly in regards to the configuration file.
The changelog gives an overview of what’s new (a lot of things!), but we will also take time to explore and explain the changes in more detail right in this news section, in separate upcoming entries.
For the moment, this release is a solid milestone and provides essential features and fixes for production deployments. It is so far running very smoothly in our storage cluster. This website and the documentation pages, on which we have put significant efforts, should help users install, configure, deploy and tune their Clusters. If you encounter a problem or need help, just reach out!
In the last two months many things have happened in the IPFS Cluster project.
First, we have welcomed a new team member: @lanzafame has started contributing and resolved a few issues that were included in the last release.
Secondly, we have been working very hard on implementing the “sharding RFC” that I mentioned in my last update. @zenground0 has made very significant progress on this front. Sharding will be a unique feature of IPFS Cluster and will help drive the adoption of ipfs by making it possible to support huge datasets distributed among different nodes. We hope that the first “sharding” prototype will be ready in the upcoming weeks.
Thirdly, we have made 3 releases (the latest being
0.3.5) which bring a diverse set of features and some bugfixes. Some of the major ones are these:
ipfs-cluster-ctl health graph generates a .dot file which allows you to quickly get an overview of connectivity among the peers in the cluster.
The refs pinning method allows downloading dags in parallel and pinning only when the content is already on disk.
We have also started working on the IPFS Cluster website, which we will use to provide a central and well organized place for documentation, roadmaps and other information related to the project.
We are about to tag the
0.3.2 release and it comes with two nice features.
On one side, @zenground0 has been focused on implementing offline state export and import capabilities, a complement to the state upgrades added in the last release. They allow taking the shared state from an offline cluster (in a human-readable format) and placing it somewhere else, or back in the same place. This feature might save the day in situations where the quorum of a cluster is completely lost and peers cannot be started anymore due to the lack of a leader.
Additionally, I have been putting some time into a new approach to replication factors. Instead of forcing cluster to store a pin a specific number of times, we now support lower and upper bounds on the replication factor (in the form of replication_factor_min and replication_factor_max). This feature (and great idea) was originally proposed by @segator.
Having this margin means that cluster will attempt its best when pinning an item (reach the max factor), but it won’t error if it cannot find enough available peers, as long as it finds more than the minimum replication factor.
In the same way, a peer going offline will no longer trigger a re-allocation of the CID, as it did before, as long as the replication factor is still within the margin. This allows, for example, taking a peer offline for maintenance without having cluster vacate all the pins associated with it (and then coming back up empty).
Of course, the previous behaviour can still be obtained by setting both the max and the min to the same values.
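The margin logic can be illustrated with a small sketch. The function and variable names here are hypothetical, not the real implementation:

```go
package main

import "fmt"

// needsRepin sketches the margin check: re-allocation is only
// triggered when the number of healthy allocations drops below
// the minimum replication factor.
func needsRepin(healthyAllocations, min int) bool {
	return healthyAllocations < min
}

func main() {
	min, max := 2, 4 // replication_factor_min and replication_factor_max
	fmt.Println("target copies when pinning:", max)
	// An item pinned on 3 peers loses one peer: still >= min, no repin.
	fmt.Println("repin after one of 3 peers goes down:", needsRepin(2, min))
	// A second peer goes down: below min, a re-allocation is triggered.
	fmt.Println("repin after two of 3 peers go down:", needsRepin(1, min))
}
```

Setting min and max to the same value collapses the margin and recovers the previous strict behaviour.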
Finally, it is very important to remark that we recently finished the Sharding RFC draft. This document outlines how we are going to approach the implementation of one of the most difficult but important features upcoming in cluster: the ability to distribute a single CID (tree) among several nodes. This will allow using cluster to store files or archives too big for a single ipfs node. Input from the community on this draft can be provided at https://github.com/ipfs/notes/issues/278.
During the last weeks we’ve been working hard on making the first “live” deployment of IPFS Cluster. I am happy to announce that a 10-peer cluster runs on ipfs-gateway nodes, maintaining a >2000-length pinset.
The nodes are distributed and run a vanilla IPFS Cluster docker container mounting a volume with a customized cluster configuration, which uses higher-than-default timeouts and intervals. The injection of the pin-set took a while, but eventually every pin in every node became PINNED. On one occasion, a single IPFS node hung while pinning. After re-starting the IPFS node in question, all pins in the queue became PIN_ERRORs, but they could easily be fixed with a recover operation.
Additionally, the IPFS IRC Pinbot now supports cluster-pinning, by using the IPFS Cluster proxy to ipfs, which intercepts pin requests and performs them in cluster. This allowed us to re-use the
go-ipfs-api library to interact with cluster.
The first live setup has nevertheless shown that some things were missing. For example, we added --local flags to the Sync, Status and Recover operations (and allowed a local RecoverAll). They are handy when a single node is at fault and you want to fix the pins on that specific node. We will also work on a go-ipfs-cluster-api library, a REST API client which allows interacting with cluster programmatically more easily.
Parallel to all this, @zenground0 has been working on state migrations. The cluster’s consensus state is stored on disk via snapshots in a certain format. This format might evolve in the future, and we need a way to migrate between versions without losing all the state data. In the new approach, we are able to extract the state from Raft snapshots, migrate it, and create a new snapshot in the new format so that the next time cluster starts, everything works. This has been a complex feature but a very important step towards providing a production-grade release of IPFS Cluster.
Last but not least, the next release will include useful things like pin names (a string associated with every pin) and peer names. These will allow pins and peers to be easily identified by something other than their multihash. They have been contributed by @te0d, who is working on https://github.com/te0d/js-ipfs-cluster-api, a JS client for our REST API, and https://github.com/te0d/bunker, a web interface to manage IPFS Cluster.
This update comes as our
0.3.0 release is about to be published. This release includes quite a few bug fixes, but the main change is the upgrade of the underlying Raft libraries to a recently published version.
Raft 1.0.0 hardens the management of peersets and makes it more difficult to arrive at situations in which cluster peers have different, inconsistent states. These issues are usually very confusing for new users, as they manifest themselves as lots of error messages with apparently cryptic meanings coming from Raft and libp2p. We have embraced the new safeguards and made documentation and code changes to stress the workflows that should be followed when altering the cluster peerset. These can be summarized as:
--bootstrap is the method to add a peer to a running cluster, as it ensures that no diverging state exists during first boot.
The ipfs-cluster-data folder is renamed whenever a peer leaves the cluster, resulting in a clean state for the next start. Peers with a dirty state will not be able to join a cluster.
Once the ipfs-cluster-data folder has been initialized, cluster.peers should match the internal peerset from the data, or the node will not start.
In the documentation, we have stressed the importance of the consensus data and described the workflows for starting peers and leaving the cluster in more detail.
I’m also happy to announce that we now build and publish “snaps”. Snaps are “universal Linux packages designed to be secure, sandboxed, containerised applications isolated from the underlying system and from other applications”. We are still testing them. For the moment we publish a new snap on every master build.
You are welcome to check the changelog for a detailed list of other new features and bugfixes.
Our upcoming work will be focused on setting up a live IPFS Cluster and running it in a “production” fashion, as well as adding more capabilities to manage the internal cluster state while offline (migrate, export, import, etc.).
We have now started the final quarter of 2017 with renewed energy and plans for IPFS Cluster. The team has grown and come up with a set of priorities for the next weeks and months. The gist of these is:
v0.2.0 marks the start of this cycle. Check the changelog for a list of features and bugfixes. Among them, the new configuration options in the consensus component will allow our users to experiment in environments with larger latencies than usual.
Finally, coming up in the pipeline we have:
Unfortunately, I have not thought of updating the Captain’s log for some months. The Coinlist effort has had me very busy, which means that my time and mind were not fully focused on cluster as before. That said, there has been significant progress during this period. Much of that progress has happened thanks to @Zenground0 and @dgrisham, who have been working on cluster for most of Q2 making valuable contributions (many of them on the testing front).
As a summary, since my last update, we have:
ipfs-cluster-service are tested and not broken in obvious ways at least, and complement our testing pipeline.
cluster_secret. This brings a significant reduction in the security pitfalls of running IPFS Cluster: the default setup no longer allows remote control of a cluster peer. More information on security can be found in the guide.
All the above changes are about to crystallize in the
v0.1.0 release, which we’ll publish in the next days.
The last weeks were spent on improving go-ipfs/libp2p/multiformats documentation as part of the documentation sprint mentioned earlier.
That said, a few changes have made it to IPFS Cluster:
type=recursive in IPFS API queries, so they return much faster.
swarm connect operations for each ipfs node associated with a cluster peer, both at startup and upon operations like peer add. This should ensure that the ipfs nodes in the cluster know each other.
disk informer. The default allocation strategy is now based on how big the IPFS repository is. Pins will be allocated to peers with smaller repository sizes.
I will be releasing new builds/releases for IPFS Cluster in the following days.
This week has been mostly spent on making IPFS Cluster easy to install, writing end-to-end tests as part of the Test Lab Sprint and bugfixing:
Next week will probably focus on the Delightful documentation sprint. I’ll try to throw in some more tests for
ipfs-cluster-ctl and will send the call for early testers that I was talking about in the last update, now that we have new multiple install options.
IPFS Cluster now has basic peer monitoring and re-pinning support when a cluster peer goes down.
This is done by broadcasting a “ping” from each peer to the monitor component. When it detects that no pings are arriving from a current cluster member, it triggers an alert, which makes cluster trigger re-pins for all the CIDs associated with that peer.
The next days will be spent fixing small things and figuring out how to get better tests as part of the Test Lab Sprint. I also plan to make a call for early testers, to see if we can get some people on board to try IPFS Cluster out.
A global replication factor is now supported! A new configuration file option, replication_factor, allows specifying how many peers should be allocated to pin a CID. -1 means “Pin everywhere” and maintains compatibility with the previous behaviour. A pin request with a replication factor >= 1 is subject to a number of requirements:
Deciding how content is allocated to peers has been most of the work in this feature. We have three components for doing so:
Informer component. The Informer is used to fetch some metric (agnostic to Cluster). The metric has a Time-to-Live and is pushed at TTL/2 intervals to the Cluster leader.
Allocator component. The allocator provides an Allocate() method which, given current allocations, candidate peers and the last valid metrics pushed from the Informers, decides which peers should perform the pinning. For example, a metric could be the used disk space in a cluster peer, and the allocation algorithm would sort candidate peers according to that metric. The first peers in the list are the ones with the least disk used, and they would then be chosen to perform the pin. An Allocator could also work by receiving a location metric and making sure that the most preferential location is different from the already existing ones, etc.
PeerMonitor component, which is in charge of logging metrics and providing the last valid ones. It will be extended in the future to detect peer failures and trigger alerts.
The current allocation strategy is a simple one called numpin, which just distributes the pins according to the number of CIDs peers are already pinning. More useful strategies should come in the future (help wanted!).
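For illustration, a numpin-style Allocate could look like this. The signature is simplified (the real interface works with metric objects, not plain ints):

```go
package main

import (
	"fmt"
	"sort"
)

// Allocate sketches a numpin-style strategy: candidates are sorted by
// how many pins they already hold, and the least-loaded peers are
// chosen, up to the replication factor.
func Allocate(numPins map[string]int, replFactor int) []string {
	peers := make([]string, 0, len(numPins))
	for p := range numPins {
		peers = append(peers, p)
	}
	sort.Slice(peers, func(i, j int) bool {
		if numPins[peers[i]] != numPins[peers[j]] {
			return numPins[peers[i]] < numPins[peers[j]]
		}
		return peers[i] < peers[j] // deterministic tie-break
	})
	if replFactor < len(peers) {
		peers = peers[:replFactor]
	}
	return peers
}

func main() {
	metrics := map[string]int{"peerA": 10, "peerB": 3, "peerC": 7}
	fmt.Println(Allocate(metrics, 2)) // → [peerB peerC]
}
```

Swapping the metric (free disk space, repository size, location) while keeping the same sorting shape is what makes the Allocator pluggable.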
The next steps in Cluster will be wrapping up this milestone with failure detection and re-balancing.
So much for commitments… I missed last friday’s log entry. The reason is that I was busy with the implementation of dynamic membership for IPFS Cluster.
What seemed a rather simple task turned into a not-so-simple endeavour, because modifying the peer set of Raft has a lot of pitfalls, especially when it happens during boot (in order to bootstrap). A peer add operation implies making everyone aware of a new peer. In Raft this is achieved by committing a special log entry. However, there is no way to be notified of such an event on a receiver, and such an entry only carries the peer ID, not the full multiaddress of the new peer (needed so that other nodes can talk to it).
Therefore, whoever adds the node must additionally broadcast the new node and also send back the full list of cluster peers to it. After three implementation attempts (each working, but each improving on the previous), we now perform this broadcast by logging our own PeerAdd operation in Raft, with the multiaddress. This proved nicer and simpler than broadcasting to all the nodes (mostly when dealing with failures and errors - what to do when a node has missed the broadcast). If the operation makes it to the log, then everyone should get it; if it does not, the failure does not involve un-doing the operation in every node with another broadcast. The whole thing is still tricky when joining peers which have disjoint Raft states, so it is best used with clean, just-started peers.
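The idea of logging the peer addition as a regular consensus operation can be sketched as follows (the types and field names are invented for illustration; the real log operations carry more data):

```go
package main

import "fmt"

// LogOp sketches a consensus log entry. Unlike Raft's built-in
// peerset entry, a peerAdd op carries the full multiaddress.
type LogOp struct {
	Type string // "pin", "unpin", "peerAdd", ...
	Arg  string // for peerAdd: the new peer's multiaddress
}

// apply is what every peer runs when a committed entry reaches it,
// so all members learn the new peer's address without a separate
// broadcast.
func apply(op LogOp, knownPeers map[string]bool) {
	if op.Type == "peerAdd" {
		knownPeers[op.Arg] = true
	}
}

func main() {
	knownPeers := map[string]bool{}
	entry := LogOp{Type: "peerAdd", Arg: "/ip4/10.0.0.2/tcp/9096/ipfs/QmNewPeer"}
	apply(entry, knownPeers) // committed via Raft, applied on every peer
	fmt.Println("known peers:", len(knownPeers))
}
```

Because delivery rides on Raft's own commit guarantees, a failed operation simply never appears in the log, and no compensating un-do broadcast is needed.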
In addition to peer add, there is a join operation which facilitates bootstrapping a node and having it directly join a cluster. On shutdown, each node saves the current cluster peers in the configuration for future use. A join operation can be triggered with the --bootstrap flag in ipfs-cluster-service or with the bootstrap option in the configuration, and works best with clean nodes.
The next days will be spent on implementing replication factors, which implies the addition of new components to the mix.
Friday is from now on the Captain Log entry day.
Last week was the first of the three weeks in the current IPFS Cluster sprintino (https://github.com/ipfs/pm/issues/353). The work has focused on addressing “rough edges”, most of which came from @jbenet’s feedback (#14). The result has been significant changes and improvements to IPFS Cluster:
go-libp2p-gorpc and the whole dependency tree are now Gx’ed.
ipfs-cluster-ctl as names for the cluster tools.
urfave/cli, which means better help, clearer commands and more consistency.
Sync() operations, which update the Cluster pin states from the IPFS state, have been rewritten.
Recover() has been promoted to its own endpoint.
A new ID() endpoint provides information about the Cluster peer (ID, addresses) and about the IPFS daemon it’s connected to. The Peers() endpoint retrieves this information from all peers, so it is easy to get a general overview of the Cluster.
pin ls and performing Cluster pinning operations instead. This not only allows replacing an IPFS daemon with a Cluster peer, but also enables composing cluster peers with other clusters (pointing ipfs_node_multiaddress to a different Cluster proxy endpoint).
The changes above include a large number of API renamings, re-writings and re-organization of the code, but IPFS Cluster has grown more solid as a result.
Next week, the work will focus on making it easy to add and remove peers from a running cluster.
I have just merged the initial cluster version into master. There are many rough edges to address, and significant changes to namings/APIs will happen during the next few days and weeks.
The rest of the quarter will be focused on 4 main issues:
These endeavours will be reflected in the ROADMAP.