Add pull to Diaspora's push model in federation
I had some ideas a while ago about improving communication between pods in instances where it currently falls down, but didn't know enough about how federation works to be able to flesh them out. Now that Fla's blog post about federation has helped me understand more about how it works, I've refined those ideas.
Just for clarity, this is only a speculative concept. I understand the technical issues only poorly, and so my suggestions as I've presented them may not be workable. However, I hope that, even if this proves to be the case, my suggestions will spark ideas in those of you who understand the technical side of Diaspora which might help to improve Diaspora's federation protocols.
At the moment Diaspora relies solely, or almost solely, on pushing data from one pod to another. This means that if a pod does not receive data when it is pushed, there is no way for that pod to retrieve these data at a later time. I suggest that if we're going to keep Diaspora working on a push model, we supplement this by enabling pods to pull data under certain circumstances.
Pods only receive data from pods with which they have an established connection. Currently, these connections are only created when users connect with users on other pods, and this takes time. I suggest putting in place a means of making connections with other pods automatically, as soon as the pod goes online, so that when users start using the pod, these connections with other pods are already in place.
I suggest putting in place a sort of 'handshake' system.
The process would work something like this:
- Podmin sets up Pod Z, and puts it online. Pod Z knows about Pod A.
- Pod Z contacts Pod A, and says 'Hi, which pods do you know about?'
- Pod A gives Pod Z a list of pods it knows about.
- Pod Z adds each of these pods to its knowledge base.
- Pod Z contacts each of these pods and asks the same question as in step 2.
- This process is repeated until Pod Z is not finding out about any more new pods.
This way the new pod would very quickly build connections with the whole network.
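The handshake steps above amount to a breadth-first crawl of the pod network, which could be sketched roughly as follows. This is only an illustration under assumptions: `discover_pods`, `ask_known_pods` and the `.example` host names are made up for the sketch, and the real question 'which pods do you know about?' would be a network request that Diaspora does not currently offer.

```python
from collections import deque
from typing import Callable, Iterable, Set


def discover_pods(seed: str,
                  ask_known_pods: Callable[[str], Iterable[str]],
                  own_host: str) -> Set[str]:
    """Crawl outward from one known pod until no new pods turn up.

    ask_known_pods is a hypothetical callable standing in for the
    'which pods do you know about?' request.
    """
    known = {seed}            # Pod Z starts out knowing only Pod A
    to_ask = deque([seed])    # pods we have not yet questioned
    while to_ask:
        pod = to_ask.popleft()
        try:
            peers = ask_known_pods(pod)
        except OSError:
            continue          # pod unreachable: remember it, but skip asking
        for peer in peers:
            if peer != own_host and peer not in known:
                known.add(peer)
                to_ask.append(peer)
    return known


# Toy network standing in for real pods (assumed data, for illustration only).
NETWORK = {
    "pod-a.example": ["pod-b.example", "pod-c.example"],
    "pod-b.example": ["pod-a.example", "pod-d.example"],
    "pod-c.example": ["pod-z.example"],
    "pod-d.example": [],
}


def fake_ask(pod: str) -> Iterable[str]:
    return NETWORK.get(pod, [])


pods = discover_pods("pod-a.example", fake_ask, own_host="pod-z.example")
```

Because every pod is questioned at most once, the crawl stops by itself as soon as a round of questions produces no new pods.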
Of course, there needs to be some means of establishing the first pod to contact (Pod A). This could be prompted by going to the pod of whichever account new accounts are set to auto-follow on that pod (currently the Diaspora HQ account, which is located on joindiaspora.com). Alternatively a list of a few key pods could be kept on diasporafoundation.org (not as a web page visible to visitors, but somewhere from which pods can FTP the data), or the pod could get the information from a site such as podupti.me, which is frequently updated.
One possible way of doing this would be to automatically create 'bot' accounts on each pod which communicate with each other via the above protocol. I'm calling them 'pod-spiders'. If Pod Z knows about Pod A, pod-spider@PodZ.org adds pod-spider@PodA.com to its aspects in order to contact it, and so on. I'm sure the inter-pod communication could be done without setting up bot accounts, and that might be a better way to do it. As much as anything, the 'pod-spider' concept is a visual aid.
As tags are not federated, you could also have each pod-spider account follow all the tags that users on its pod follow or search for. (This could involve only tags that have been searched more than 5 times or are followed by more than 5 people, to eradicate spelling mistakes.) When Pod Z goes online, pod-spider@PodZ.org can also ask each pod it contacts 'which tags do you know about?' and can then follow those tags itself. In this way, it might be possible to populate tag searches from the time the pod goes online.
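The 'more than 5' filter could look something like the sketch below. The function name and the two dictionaries of counts are assumptions made for illustration; how a pod actually records searches and tag followers is not something I know.

```python
MIN_COUNT = 5  # "more than 5" searches or followers, as suggested above


def tags_for_spider(search_counts, follower_counts, threshold=MIN_COUNT):
    """Return the tags the pod-spider account should follow.

    Both arguments are hypothetical dicts mapping a tag to a count.
    A tag qualifies if either count is strictly above the threshold.
    """
    tags = set()
    for counts in (search_counts, follower_counts):
        for tag, count in counts.items():
            if count > threshold:
                tags.add(tag)
    return tags


# Assumed example data: the misspelt tags fall below the threshold.
searches = {"diaspora": 12, "diasproa": 1, "linux": 6}
followers = {"linux": 2, "photography": 9, "fotography": 5}
spider_tags = tags_for_spider(searches, followers)
```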
Alternatively, when a user searches for a tag which is not currently in that pod's database, the pod can pull the data on that tag from all the pods it is connected to. That way, the first time a tag search is done on that pod, it is done by a pull, which would take longer but at least would get the data. After that, data relating to that tag can be pushed to the pod in the usual way.
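As a rough sketch of this 'pull the first time, push afterwards' idea: the class below pulls from every connected pod only when a tag is unknown locally, and otherwise answers from its own store. `TagIndex`, `pull_tag` and `receive_push` are invented names, and the pull request itself is assumed, since no such request exists in federation at the moment.

```python
class TagIndex:
    """Local store of post ids per tag, filled by one pull and later pushes."""

    def __init__(self, connected_pods, pull_tag):
        self.connected_pods = list(connected_pods)
        # Hypothetical callable(pod, tag) -> iterable of post ids.
        self.pull_tag = pull_tag
        self.posts_by_tag = {}

    def search(self, tag):
        if tag not in self.posts_by_tag:
            # First search for this tag on this pod: slow path, ask everyone.
            posts = set()
            for pod in self.connected_pods:
                try:
                    posts.update(self.pull_tag(pod, tag))
                except OSError:
                    continue  # skip pods that cannot be reached right now
            self.posts_by_tag[tag] = posts
        return self.posts_by_tag[tag]

    def receive_push(self, tag, post_id):
        # Later posts arrive by push, but only for tags the pod already tracks.
        if tag in self.posts_by_tag:
            self.posts_by_tag[tag].add(post_id)
```

The first search is slower because it waits on every connected pod; later searches for the same tag never leave the pod.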
There are also some circumstances in which an established pod doesn't receive data that are pushed – for example, if a pod goes offline for a while or is temporarily over capacity. In these circumstances, it would be helpful if the pod can pull data when it goes back online.
At the moment, when Pod A can't push data to another pod (Pod B), it puts the data back into its send queue and retries a number of times at intervals. After the last of these retries, Pod A stops trying, whether or not it has been successful. If that last attempt fails, there is no possibility of the data getting from Pod A to Pod B.
For my suggestion to work, at the end of this process of retries, if the data still cannot be pushed, Pod A should write all data destined for Pod B to a log rather than placing them back in its queue. Pod B is placed on a list of 'pods incommunicado, do not attempt to communicate'. From then on, when there are new data destined for Pod B, Pod A adds them to this log instead of attempting to push them to Pod B, which would save network resources. (Pod A could perhaps continue to attempt communication with Pod B, say once a day, and if successful could then push the logged data.)
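On Pod A's side, the retry-then-log behaviour might be sketched like this. Everything here is an assumption for illustration: `Outbox`, `send` and `max_retries` are invented names, the waiting between retries is left out, and the once-a-day probe is not shown.

```python
class Outbox:
    """Pushes data to other pods; after the last failed retry it logs instead."""

    def __init__(self, send, max_retries=3):
        # Hypothetical callable(pod, item) -> True if the push succeeded.
        self.send = send
        self.max_retries = max_retries
        self.incommunicado = set()   # pods we have stopped trying to reach
        self.logs = {}               # pod -> list of undelivered items

    def deliver(self, pod, item):
        if pod in self.incommunicado:
            # No network traffic at all: just append to that pod's log.
            self.logs[pod].append(item)
            return False
        for _ in range(self.max_retries):
            # A real pod would wait between attempts; omitted in this sketch.
            if self.send(pod, item):
                return True
        # Retries exhausted: mark the pod and keep the data for later.
        self.incommunicado.add(pod)
        self.logs.setdefault(pod, []).append(item)
        return False
```

Once a pod is on the incommunicado list, `deliver` never calls `send` for it, which is where the saving in network resources would come from.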
When Pod B is back online, it immediately communicates with all pods known to it and says: 'I'm back. What have I missed?' When Pod A receives this communication, it refers to its log for Pod B, retrieves the data and sends them to Pod B, and once it receives confirmation that this transfer has been successful, deletes the log and removes Pod B from the 'pods incommunicado' list.
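Pod A's handling of the 'What have I missed?' message might be sketched as follows. `MissedDataLog`, `record_failure`, `handle_catch_up` and `transfer` are all invented for this illustration; the important detail is that the log is only deleted after the transfer is confirmed.

```python
class MissedDataLog:
    """Pod A's record of data it could not push to unavailable pods."""

    def __init__(self):
        self.incommunicado = set()   # pods not to be contacted
        self.logs = {}               # pod -> list of undelivered items

    def record_failure(self, pod, item):
        self.incommunicado.add(pod)
        self.logs.setdefault(pod, []).append(item)

    def handle_catch_up(self, pod, transfer):
        """Answer "I'm back. What have I missed?" from the given pod.

        transfer is a hypothetical callable(pod, items) -> True once the
        other pod confirms it received everything.
        Returns the number of items handed over.
        """
        items = self.logs.get(pod)
        if not items:
            self.incommunicado.discard(pod)  # nothing logged; pod is back
            return 0
        if not transfer(pod, list(items)):
            return 0                         # no confirmation: keep the log
        del self.logs[pod]
        self.incommunicado.discard(pod)
        return len(items)
```

If the confirmation never arrives, the log and the list entry stay as they were, so Pod B can simply ask again.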
This should (a) allow pods to receive data pushed when they were unavailable, and (b) save network resources currently wasted by pods trying to communicate many times with pods which are unavailable.
There may be other circumstances in which it would be good for a pod to be able to do a pull request, perhaps if it hadn't heard from another pod for a set period of time. However, this would involve pods keeping logs of data destined for other pods even when they haven't detected a communication problem, so it may be a waste of resources.