BALL

SACSP
1252 is a > Consensus Protocol
1254 title > Simple Askemos Consensus and Synchronization Protocol
1257 description >
(draft text)

Intro

The protocol is a simplified Byzantine Paxos. Compare with the requirements to highlight the differences.

Should this include references to other alternatives?

The consensus protocol used in BALL deviates slightly from most alternatives found in the scientific literature, for historic reasons. Any protocol implementation that guarantees the required properties may replace this layer of an Askemos implementation.

Of concern here is the actual, simplistic implementation.

Protocol-State Machine-Entanglement

Citing Wikipedia: A typical deployment of Paxos requires a continuous stream of agreed values acting as commands to a distributed state machine. That's the case here too.

However, with "normal processing" it is hard not to accidentally include non-deterministic input from the local machine (e.g., a time value that has not been part of a prior agreement must never be read). A double check is advisable.

The Process executed by the State Machine is implemented by two "callback" functions:

  1. propose: (optional), returns two values:

    1. The state mutations and actions to be taken, encoded in some language accepted by accept (#2 below).
    2. A checksum Hs (see below) to be compared during the agreement protocol.

    If omitted, the request itself and a hash value derived from the request are used as defaults.

  2. accept: Interprets the input and applies it to the state machine, exactly as in typical Paxos deployments.
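As a rough sketch of these two callbacks — note that the function names, signatures, and the SHA-256 default are illustrative assumptions, not BALL's actual API:

```python
import hashlib

def propose(request: bytes):
    """Optional callback: turn a request into (actions, Hs).

    Mirrors the default described above: the request itself plus a
    hash value derived from it (SHA-256 assumed here).
    """
    actions = request                           # encoded for `accept` below
    hs = hashlib.sha256(request).hexdigest()    # checksum compared during agreement
    return actions, hs

def accept(state: dict, actions: bytes) -> dict:
    """Apply the agreed actions to the state machine.

    As a stand-in for a real interpreter, this simply appends the
    action to a log.
    """
    return {"log": state["log"] + [actions]}
```

The important property is that accept is deterministic: given the same state and the same agreed actions, every replica computes the same next state.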

Protocol Description

The protocol is a Byzantine agreement. Each place (state machine or process) has a set of notary peers commissioned to support its execution.

Noteworthy here (though not relevant to the following description of the protocol): the assignment of notaries is a matter of user decision. Faulty replicas are never automatically removed from or replaced in the set. Any two state machines may be executed on the same, partially overlapping, or distinct sets of peers.

The following description of the protocol, handling a single operation of a single place, uses these abbreviations:

  • P: the place (object, state machine in question)
  • R: the input request
  • OID: the identifier of P (constant since we look at a single object only)
  • N: the set of replicas commissioned to P
  • n: number of replicas in N
  • f: maximum number of faulty replicas (with n>2f+1)
  • s: sequence number of P
  • Hs: digest (hash) of transaction s
  • t: request time
  • h: request digest (hash)

Agreement Protocol

Request Injection

For security reasons Askemos treats the client (the software sending a request using the authority of a user) and the entity broadcasting this request to all replicas of a service as a single entity: the representative, or trusted system, of the user.

A conceivable alternative: send the request to one replica (often termed the primary or leader) and have it forward the request, including the original signature, to the other peers (notaries, or in Paxos terms, acceptors). For simplicity, BALL currently requires the source to send signed messages to all receivers; it never forwards signed messages.

The representative determines the replica set N (the notaries) of the receiving place P and broadcasts a signed message {R, t}.

Echo Phase

Each notary sends a message

{echo, OID, t, h, s, Hs}

to all other notaries. t is the wall-clock time received from the original representative (if it is within reasonable bounds) and h a cryptographic digest of the request content. s and Hs indicate the detailed execution context: the version number s of the state machine and its digest Hs, agreed upon in the last round of the protocol.

The request is then logged for processing. (Depending on local configuration, the request might be tentatively processed by the peer right away, though without taking actual effect until the next phase commences, or processing might start only then.)

Caveat: this log is currently held in volatile memory only. This necessarily results in inconsistent states when too many peers fail at once.

A peer may send any number of echo messages for different requests (h) in the same version s.
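The echo step might be sketched as follows; the function name, the clock-skew bound, and the SHA-256 digest are illustrative assumptions, not a definitive implementation:

```python
import hashlib
import time

def make_echo(oid: str, request: bytes, t: float, s: int, hs: str,
              max_skew: float = 60.0) -> dict:
    """Build the {echo, OID, t, h, s, Hs} message a notary broadcasts.

    t is accepted from the representative only if it lies within a
    reasonable bound of the local clock (max_skew is an assumed
    default); otherwise the local time is used instead.
    """
    now = time.time()
    if abs(now - t) > max_skew:                  # sanity bound on the claimed time
        t = now
    h = hashlib.sha256(request).hexdigest()      # digest of the request content
    return {"type": "echo", "oid": oid, "t": t, "h": h, "s": s, "Hs": hs}
```

In the real protocol the resulting message would also be signed and sent to all other notaries.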

Ready Phase

After receiving 2f+1 echo messages or f+1 ready messages the peer broadcasts a signed message:

{ready, OID, t, h, s, Hs+1}

(Whereby Hs+1 is the hash code derived from executing the request locally.)

A non-faulty peer MUST NOT send more than a single ready message. (That is, more than one combination of values in a ready message; identical messages may be repeated.)

After receiving f+1 such ready messages, the new state is committed to stable storage and resulting requests and/or replies for the client are sent to the representative.
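The 2f+1 and f+1 thresholds above can be sketched as plain counting logic (hypothetical helper names; no networking, signatures, or message matching):

```python
def ready_to_send(n_echo: int, n_ready: int, f: int) -> bool:
    """A peer broadcasts its single ready message after receiving
    2f+1 echo messages or f+1 ready messages for the same
    (OID, t, h, s) combination."""
    return n_echo >= 2 * f + 1 or n_ready >= f + 1

def may_commit(n_ready: int, f: int) -> bool:
    """After f+1 matching ready messages the new state is committed
    to stable storage and replies are sent to the representative."""
    return n_ready >= f + 1
```

The f+1 ready path lets a peer that missed some echoes still join the round: at least one of those f+1 senders must be non-faulty.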

Combining Messages

All available messages are typically sent in a single batch (HTTP request).

Answering the Client

If the representative is also a member of N and did not fail in this round of the protocol, it simply sends its result to the client. Otherwise it requests the last result from the quorum and awaits f+1 equivalent replies before sending this reply to the client. Failing that, it sends an error code (usually a service timeout such as HTTP 503).

State Re-Synchronization Protocol

In addition to the acronyms used in the former section, the sync protocol description uses these:

  • M: system metadata about P
  • m: digest (hash) of M
  • bh: hash of binary state data

If a peer has reason to believe that

  • it is supposed to support P
  • the local sequence number s is smaller than the s upheld by the quorum

it requests the current state and updates itself to the majority's value, as follows:

Request Current State Digest

To request information about the current state of P the peer sends a message

{digest, OID, s, Hs}

to all notary peers N. Then it awaits (n/2)+1 replies with the same value m.

If the result differs from the local value of m, the peer retrieves the actual state of P from any peer that replied with the majority value. The received value of M is checked to hash to m; if so, the result is committed as the local state of P.
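The wait for (n/2)+1 matching digests might look like the following hypothetical helper, assuming the replies have already been collected:

```python
from collections import Counter

def majority_digest(replies: list, n: int):
    """Return the metadata digest m reported by at least (n/2)+1
    of the n notaries, or None if no such majority exists yet."""
    need = n // 2 + 1
    if not replies:
        return None
    digest, freq = Counter(replies).most_common(1)[0]
    return digest if freq >= need else None
```

Only after a majority digest is found does the peer fetch the full state, and only from a peer that reported that majority value.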

Request Current State (Meta)Data

To request the full state of P the peer sends a message

{get-place, OID }

The reply must contain the metadata M (in RDF/XML format). It may be a multipart message containing the binary state of P itself too – if that is small enough.

Request Current Binary State

In cases where the reply to get-place does not include the binary state data, M still contains the SHA256 hashcodes of the body and/or the sqlite part of the place. To request those the peer sends a message

{get-body, OID, bh }

to any of the available peers. The received binary value must be checked to hash to bh. If so, it is added to the local state.

(Note that large binary state may be split into multiple parts. E.g., sqlite3 data is kept in chunks of 32kbyte by default. If so, the first block will contain further hash codes to be retrieved in subsequent rounds of get-body requests. See also BHOB handling.)
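The hash check before accepting a body can be sketched as follows (the helper name is hypothetical; SHA256 is per the metadata described above):

```python
import hashlib

def verify_body(body: bytes, bh: str) -> bytes:
    """Check that a binary value received in reply to
    {get-body, OID, bh} actually hashes to the requested bh
    before it is added to the local state."""
    if hashlib.sha256(body).hexdigest() != bh:
        raise ValueError("received body does not match requested hash")
    return body
```

Because the content is verified against a hash the peer already trusts, the body may be fetched from any single peer, faulty or not.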

1255 derived from > Paxos
1288 derived from > Sintra

replicates (BALL Core) software utile