Changes

myeisha · baad0e57
--- a/Reliable-state-machines-(concept).md
+++ b/Reliable-state-machines-(concept).md
+## Step 1: basic reliable, ordered datagrams
+
+The current Hexabus system suffers from a relatively high degree of packet loss, and as such, the distributed state machine running on devices can easily enter inconsistent states. Furthermore, packet loss from sporadic sensors (such as buttons used for lighting control) is a big problem - losing a "turn on the lights" packet on the way to any actor will not satisfy anybody. A simple state machine for lighting control might not be able to turn of all lights afterwards without resetting the entire distributed state machine, which is also not a good solution. Thus, a multicast packet that can trigger state changes anywhere in the system must be received by all nodes, or by none at all.
+
+Ordering of packets is also important: a multicast packet that trigger state changes must be received by all nodes in the system at approximately the same time, and reordering of such packets must be avoided at all costs. A simple retransmit protocol cannot guarantee consistent ordering of packets for all devices, which can also lead to inconsistent states.
+
+To guarantee reliable transmission, we use a retransmit protocol. To guarantee ordering of multicast packets that may alter the system state, we allow only one node in the system to send such packets. In such a system, individual nodes or groups of nodes may fail (by power failure, temporary loss of network connectivity, or other modes), which must be accounted for. The protocol described here borrows many ideas from [Gbcast](http://en.wikipedia.org/wiki/Gbcast), but is much simpler. For our purposes, two-phase commit is adequate, and nodes sending state-changing packets need not be notified of the result of their actions.
+
+### Changes to the packet format
+
+To allow for retransmission and acknowledgement of packets and for detection of retransmission, a sequence number field is added to the general hexabus packet header. To allow for ordering of packets, this sequence number must be monotonic for any given tuple of (source IP, source port, destination IP, destination port).
+
+The basic hexabus packet header after this addition then looks like this:
+```
+-----------------------------------------
+| 0..3 | 4    | 5     | 6..7            |
+|------|------|-------|-----------------|
+| HX0D | type | flags | sequence number |
+-----------------------------------------
+```
+The `HX0D` prefix field, the `type` field and the `flags` field retain their original meaning. While the original packet format did not define any flags, the new format defines one flag:
+```
+WANT_ACK (0x01): sending node requires acknowledgement from recipients
+```
+
+The packet footer remains unmodified and contains only a two-byte CRC of the packet, calculated with the CRC field set to `0`.
+
+The other packet types are described by their content and value of the `type` flag in the packet header, omitting the rest of the header and the footer.
+
+#### ERROR (0x00)
+
+Additionally to the error code as present now, the `ERROR` packet also contains the sequence number of the packet that caused the error on the device:
+```
+--------------------------------------
+| 0          | 1..2                  |
+|------------|-----------------------|
+| error code | cause sequence number |
+--------------------------------------
+```
+
+When sent to a node in response to a unicast packet that caused an error on the recipient node, the `cause sequence number` field is filled with the sequence number of the packet that caused the error. If a node spontaneously reports an error, the `error code` must indicate that the `cause sequence number` field is not valid, and the `cause sequence number` field must be set to `0`.
+
+#### QUERY (0x02), EPQUERY (0x0A)
+
+These two packets retain their original shape:
+```
+--------
+| 0..3 |
+|------|
+| eid  |
+--------
+```
+
+#### INFO (0x01), WRITE (0x04), EPINFO (0x09)
+
+Also retain their original shape:
+```
+--------------------------
+| 0..3 | 4        | 5..n |
+|------|----------|------|
+| eid  | datatype | data |
+--------------------------
+```
+The `INFO` and `EPINFO` packets must not be sent in response to `QUERY`/`EPQUERY` packets, since this would discard ordering information. Two new packet types are introduced for replies to `QUERY` and `EPQUERY` packets.
+
+#### REPORT (0x03), EPREPORT (0x0B)
+
+Newly introduced to allow ordering of `QUERY`/`EPQUERY` queries and responses. These packets contain all information contained in `INFO`/`EPINFO` packets, and additionally contain the sequence number of the query packet that caused these packets to be sent.
+```
+--------------------------------------
+| 0..1                  | 2..n       |
+|-----------------------|------------|
+| cause sequence number | <see INFO> |
+--------------------------------------
+```
+
+#### PINFO (0x11)
+
+This packet is used to multicast information that may change the state of the distributed state machine. Additionally to the contents of a normal `INFO` packet, it also contains the IP address of the node that caused the `PINFO` packet to be sent. See the description of the retransmission and ordering protocol below for more information.
+```
+--------------------------------------
+| 0..15                 | 16..n      |
+|-----------------------|------------|
+| originator IP address | <see INFO> |
+--------------------------------------
+```
+
+#### ACK (0x10)
+
+This new packet is introduced to allow acknowledgements of packets for which no implicit acknowledgement is possible. For `WRITE`, `QUERY` and `EPQUERY` packets an implicit acknowledgement can be sent - reception of the corresponding report packets is an implicit acknowledgement of the write or query command in question. For `INFO`, `EPINFO` and `PINFO` packets, no such thing is possible, thus the need for an explicit `ACK` packet.
+
+The `ACK` packet contains only the sequence number of the packet that required acknowledgement:
+```
+-------------------------
+| 0..1                  |
+|-----------------------|
+| cause sequence number |
+-------------------------
+```
+
+### Retransmission and ordering protocol
+
+Most operations in a hexabus network that are not related to the operation of the state machine have much less stringent reliability requirements than operations pertaining to the state machine. For queries and writes to single devices, the requesting node may simply retry the operation if an acknowledgement was requested but not received. Writes should never be done to more than one device at once and not at all by the user, and queries to many nodes at once will be implicitly acknowledged by answers to the queries.
+
+Only `INFO` packets that may change the state of the distributed state machine must be delivered to all devices in the network reliably and in some specific order that is the same for every device in the network. To ensure this, packets that may change the state of the distributed state machine may no longer be sent by any node in the system, but only by one node: the _master_. Hexabus device continue to send `INFO` packets as they do now, but other device must not change their state upon reception of `INFO` packets. A new packet type, `PINFO` is introduced for information that may change state, and devices must act upon `PINFO` packets only if they were sent by the master.
+
+It is then the responsibility of the master node to ensure that the distributed state machine is in a consistent state, that packets affecting the state machine are totally ordered, and that all devices have either acknowledged a packet affecting the state machine or have failed in some manner.
+
+#### Protocol state machine on the device
+
+To each packet sent to a destination, the device assigns a monotonically increasing sequence number. If the packet to be sent requires acknowledgement, the device waits for this acknowledgement and retransmits the packet as appropriate, until an ACK has been received or the maximum number of retransmission is exceeded, in which case the device considers itself failed. While one packet is being transmitted reliably, no other packet with reliability requirements may be transmitted.
+
+If a device considers itself failed and receives any packet from the master, it will initiate recovery procedures. Until recovery is complete, the state machine is stopped. If it considers itself failed, it will not dequeue and send any packets, and no packets may be enqueued. Successful recovery includes a flush of the entire send and receive queues of the device.
+
+Since packets without reliability constraints may be sent at any time, protocol state machines for the reliability mechanism need not concern themselves with such packets: instead of queueing them for transmission, just send them directly. Pictured below are the protocol state machines for one device and one remote endpoint. Since devices will generally communicate with only the base station, for most of the operational lifetime of a device only one state machine of each type will run. Send and receive state machines run in parallel and communicate via shared variables.
+
+![Protocol state machine (send)](images/device-send.png)
+![Protocol state machine (receive)](images/device-receive.png)
+
+`enqueue_recv` is used to trigger appropriate handling for the received packet. If the sender of a packet requests an acknowledgement from the device and implicit ACKs are possible (such as sending an `REPORT` packet in response to a `QUERY` packet), `enqueue_recv` must at some point in the future send a packet that constitutes an implicit acknowledgement.
+
+Note that the sending state machine specifies a recovery procedure. In the simplest case, recovery will consist of only a global, reliable reset of the distributed state machine. More advanced procedures may include state machine action replay on the master, which will allow recovery of failed nodes by giving them their state explicitly after a failure event.
+
+#### Protocol state machine on the master
+
+On the master, the same state machines run in parallel for each device known to the state machine. Since the master will often send packets via multicast and, for a given packet, all state machine timeouts that do expire will expire at the same time, the master may send a multicast packet only once instead of once for each known destination. Additionally, when the master receives a reset command for the distributed state machine, all send/receive state machines are reset, all queues are cleared and reinitialized with the reset command packet for the distributed state machine.