LogCabin

usage, operation, and internals

Preface: Diego Ongaro hastily prepared and presented this talk at Scale Computing's Engineering Week in July 2015. Though its organization is poor, it contains some good content, some of which cannot be found elsewhere.
This talk has roughly two axes: (1) usage, operations, and internals, and (2) aspects of LogCabin such as writes, reads, membership changes, and compaction. Right now they are all mushed together. In the next revision, it would be better to give a conceptual overview, then organize by (2), interspersing the (1)s.

Copyright 2015 Diego Ongaro.
Source code available at https://github.com/logcabin/logcabin-talk.
This work is licensed under the Creative Commons Attribution 4.0 International License.

Command-line client

$ logcabin -h
Run various operations on a LogCabin replicated state machine.

Usage: logcabin [options] <command> [<args>]

Commands:
  mkdir <path>    If no directory exists at <path>, create it.
  list <path>     List keys within directory at <path>.
  dump [<path>]   Recursively print keys and values within directory at <path>.
                  Defaults to printing all keys and values from root of tree.
  rmdir <path>    Recursively remove directory at <path>, if any.
  write <path>    Set/create value of file at <path> to stdin.
  read <path>     Print value of file at <path>.
  remove <path>   Remove file at <path>, if any.

Options:
  -c <addresses>, --cluster=<addresses>  Network addresses of the LogCabin
                                         servers, comma-separated
                                         [default: logcabin:5254]
  -d <path>, --dir=<path>        Set working directory [default: /]
  -p <pred>, --condition=<pred>  Set predicate on the operation of the
                                 form <path>:<value>, indicating that the key
                                 at <path> must have the given value.
  -t <time>, --timeout=<time>    Set timeout for the operation
                                 (0 means wait forever) [default: 0s]

Linearizability (schedule A)

An operation is linearizable if it appears to occur instantaneously and exactly once at some point in time between its invocation and its response.

Linearizability (schedule B)

An operation is linearizable if it appears to occur instantaneously and exactly once at some point in time between its invocation and its response.

For more detail, see "Linearizability: A correctness condition for concurrent objects" by Maurice P. Herlihy and Jeannette M. Wing, 1990, and "Linearizability versus Serializability", a blog post by Peter Bailis, 2014.

Replicated state machines

LogCabin is one of these.

Replicated log ⇒ replicated state machine

All servers execute same commands in same order

Consensus module ensures proper log replication

System makes progress as long as any majority of servers up

Failure model: fail-stop (not Byzantine), delayed/lost msgs
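
The points above can be illustrated with a minimal sketch. This is not LogCabin's code: Command, KeyValueStateMachine, applyLog, and majority are hypothetical names chosen for illustration. It assumes a deterministic state machine, so that any two servers applying the same log in the same order end up in the same state.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical command type for illustration: set `key` to `value`.
struct Command {
    std::string key;
    std::string value;
};

// A deterministic state machine: its state depends only on the
// sequence of commands applied so far.
class KeyValueStateMachine {
  public:
    void apply(const Command& command) {
        state[command.key] = command.value;
        ++lastApplied;
    }
    std::map<std::string, std::string> state;
    std::size_t lastApplied = 0;
};

// Apply every entry of the (already replicated) log, in log order.
void applyLog(KeyValueStateMachine& machine,
              const std::vector<Command>& log) {
    for (const Command& command : log)
        machine.apply(command);
}

// Smallest number of servers that forms a majority of a cluster.
std::size_t majority(std::size_t clusterSize) {
    assert(clusterSize > 0);
    return clusterSize / 2 + 1;
}
```

Under these assumptions, a 3-server cluster keeps making progress with 2 servers up, and a 5-server cluster with 3.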

Bootstrapping first server

Very first server needs to be bootstrapped

Initializes the very first server's log with a configuration entry made up of just itself.

Server1$ cat logcabin.conf
serverId = 1
listenAddresses = 192.168.5.12
storagePath = storage

Server1$ logcabind --config=logcabin.conf --bootstrap
...
Server/RaftConsensus.cc:572 in setConfiguration(): Activating configuration 1:
prev_configuration {
  servers {
    server_id: 1
    addresses: "192.168.5.12"
  }
}

Server/Main.cc:346 in main(): Done bootstrapping configuration. Exiting.
...

Bootstrapping effects

The directory structure shows a single log segment with a single entry, and no snapshot.

Server1$ tree -sh --du -F storage
storage
└── [ 16K]  server1/
    ├── [   0]  lock
    ├── [8.1K]  log/
    │   └── [4.1K]  Segmented-Binary/
    │       ├── [  51]  00000000000000000001-00000000000000000001
    │       ├── [  35]  metadata1
    │       └── [  35]  metadata2
    └── [4.0K]  snapshot/

Bootstrapped log contents

logcabin-storage dumps out the log.

Server1$ logcabin-storage --config=logcabin.conf
...
Storage/Tool.cc:255 in main(): Log contents start
Log:
metadata start: 
current_term: 1
voted_for: 0
end of metadata
startIndex: 1

Entry 1 start:
term: 1
type: CONFIGURATION
configuration {
  prev_configuration {
    servers {
      server_id: 1
      addresses: "192.168.5.12"
    }
  }
}
index: 1
cluster_time: 0
end of entry 1
Storage/Tool.cc:257 in main(): Log contents end
...
Storage/Tool.cc:260 in main(): Reading snapshot at storage/server1
Storage/Tool.cc:153 in readSnapshot(): Snapshot file not found in storage/server1/snapshot

Adding servers: starting them

First, let's start up the servers.

Server2$ cat logcabin.conf
serverId = 2
listenAddresses = 192.168.5.13
storagePath = storage

Server3$ cat logcabin.conf
serverId = 3
listenAddresses = 192.168.5.14
storagePath = storage

Server1$ logcabind --config=logcabin.conf --daemon --log=server1.log
Server2$ logcabind --config=logcabin.conf --daemon --log=server2.log
Server3$ logcabind --config=logcabin.conf --daemon --log=server3.log

Adding servers: first becomes leader

The first server (bootstrapped) gets to become leader of itself.

$ logcabinctl --server=$SERVER1IP stats get
...
server_id: 1
addresses: "192.168.5.12"
...
raft {
  current_term: 2
  state: LEADER
  commit_index: 6
  last_log_index: 6
  leader_id: 1
  voted_for: 1
  ...
  last_snapshot_index: 0
  last_snapshot_bytes: 0
  log_start_index: 1
  log_bytes: 679
  ...
  peer {
    server_id: 1
    addresses: "192.168.5.12"
    old_member: true
    new_member: false
    staging_member: false
    last_synced_index: 6
  }
}
...

Adding servers: others idle

Other servers just sit idle, since they don't have a configuration.

$ logcabinctl --server=$SERVER2IP stats get
...
server_id: 2
addresses: "192.168.5.13"
...
raft {
  current_term: 0
  state: FOLLOWER
  commit_index: 0
  last_log_index: 0
  leader_id: 0
  voted_for: 0
  ...
  last_snapshot_index: 0
  last_snapshot_bytes: 0
  log_start_index: 1
  log_bytes: 1
  ...
  peer {
    server_id: 2
    addresses: ""
    old_member: false
    new_member: false
    staging_member: false
  }
}
...

Adding servers: reconfigure

We can ask the leader to add the other two.

$ export CLUSTER=$SERVER1IP,$SERVER2IP,$SERVER3IP
$ logcabin-reconfigure --cluster=$CLUSTER set $SERVER1IP $SERVER2IP $SERVER3IP
Current configuration:
Configuration 1:
- 1: 192.168.5.12

Attempting to change cluster membership to the following:
1: 192.168.5.12 (given as 192.168.5.12)
2: 192.168.5.13 (given as 192.168.5.13)
3: 192.168.5.14 (given as 192.168.5.14)

Membership change result: OK

Current configuration:
Configuration 11:
- 1: 192.168.5.12
- 2: 192.168.5.13
- 3: 192.168.5.14

Adding servers: done

Now servers 2 and 3 are part of the cluster, have the latest configuration, and are proper followers.

$ logcabinctl --server=$SERVER2IP stats get
...
raft {
  current_term: 17
  state: FOLLOWER
  commit_index: 1045
  last_log_index: 1045
  leader_id: 1
  voted_for: 0
  ...
  peer {
    server_id: 1
    addresses: "192.168.5.12"
    old_member: true
    new_member: false
    staging_member: false
  }
  peer {
    server_id: 2
    addresses: "192.168.5.13"
    old_member: true
    new_member: false
    staging_member: false
  }
  peer {
    server_id: 3
    addresses: "192.168.5.14"
    old_member: true
    new_member: false
    staging_member: false
  }
}

Joint consensus: intro

LogCabin uses the joint consensus approach to Raft membership changes.

Older form of membership changes, prior to single-server approach.

Just for historical reasons
Somewhat more complex
More flexible

Allows transitioning from one cluster to another arbitrarily.

Can add/remove multiple servers at once
No need for any overlap

Cluster remains available throughout the change

Exception: if the leader is not in the new cluster, there is a gap while a new leader is elected

More details in Membership Changes chapter of Raft dissertation.

Joint consensus: procedure

Leader catches up new servers with the latest snapshot and (most) log entries.
Leader appends a transitional configuration entry to its log. Under this configuration, becoming leader and committing entries requires both:

a majority of the old configuration, and
a majority of the new configuration.

Leader commits transitional configuration entry.
Leader appends new configuration entry to its log.
Leader commits new configuration entry.
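
The quorum rule for the transitional configuration can be sketched as follows. This is an illustration under assumptions, not LogCabin's implementation: ServerIds, Configuration, hasMajority, and hasQuorum are hypothetical names, and an empty newServers set is assumed to mean the configuration is not transitional.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

typedef std::set<uint64_t> ServerIds;

// Hypothetical representation: oldServers mirrors prev_configuration,
// newServers mirrors next_configuration (empty unless transitional).
struct Configuration {
    ServerIds oldServers;
    ServerIds newServers;
};

// True if `acks` contains a strict majority of the servers in `servers`.
bool hasMajority(const ServerIds& servers, const ServerIds& acks) {
    std::size_t count = 0;
    for (uint64_t id : servers) {
        if (acks.count(id) > 0)
            ++count;
    }
    return count >= servers.size() / 2 + 1;
}

// Under a transitional configuration, both majorities are required.
bool hasQuorum(const Configuration& config, const ServerIds& acks) {
    assert(!config.oldServers.empty());
    if (config.newServers.empty())
        return hasMajority(config.oldServers, acks);
    return hasMajority(config.oldServers, acks) &&
           hasMajority(config.newServers, acks);
}
```

For a transition from {1} to {1, 2, 3}, server 1 alone is not a quorum (it lacks a majority of the new configuration), and servers 2 and 3 together are not a quorum either (they lack a majority of the old one).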

More details in Membership Changes chapter of Raft dissertation.

Joint consensus: debug log

Log from adding servers after bootstrapping

Server1$ cat server1.log
...
Server/RaftConsensus.cc:1595 in setConfiguration(): Attempting to change the configuration from 1
Server/RaftConsensus.cc:1603 in setConfiguration(): Adding server 1 at 192.168.5.12 to staging servers
Server/RaftConsensus.cc:1603 in setConfiguration(): Adding server 2 at 192.168.5.13 to staging servers
Server/RaftConsensus.cc:1603 in setConfiguration(): Adding server 3 at 192.168.5.14 to staging servers
...
Server/RaftConsensus.cc:1625 in setConfiguration(): Done catching up servers
Server/RaftConsensus.cc:1650 in setConfiguration(): Writing transitional configuration entry
...
Server/RaftConsensus.cc:572 in setConfiguration(): Activating configuration 10:
prev_configuration {
  servers { server_id: 1, addresses: "192.168.5.12" }
}
next_configuration {
  servers { server_id: 1, addresses: "192.168.5.12" },
  servers { server_id: 2, addresses: "192.168.5.13" },
  servers { server_id: 3, addresses: "192.168.5.14" }
}
Server/RaftConsensus.cc:572 in setConfiguration(): Activating configuration 11:
prev_configuration {
  servers { server_id: 1, addresses: "192.168.5.12" },
  servers { server_id: 2, addresses: "192.168.5.13" },
  servers { server_id: 3, addresses: "192.168.5.14" }
}
...

Joint consensus: arbitrary cluster changes

$ logcabin-reconfigure --cluster=$CLUSTER set $SERVER1IP
Current configuration:
Configuration 11:
- 1: 192.168.5.12
- 2: 192.168.5.13
- 3: 192.168.5.14

Attempting to change cluster membership to the following:
1: 192.168.5.12 (given as 192.168.5.12)

Membership change result: OK

Current configuration:
Configuration 2357:
- 1: 192.168.5.12

$ logcabin-reconfigure --cluster=$CLUSTER set $SERVER2IP $SERVER3IP
Current configuration:
Configuration 2357:
- 1: 192.168.5.12

Attempting to change cluster membership to the following:
2: 192.168.5.13 (given as 192.168.5.13)
3: 192.168.5.14 (given as 192.168.5.14)

[...wait for leader election...]
Membership change result: OK

Current configuration:
Configuration 2378:
- 2: 192.168.5.13
- 3: 192.168.5.14

Intro to snapshotting

Snapshotting is discussed later, but you'll probably need some concept of it to understand half of logcabinctl.

Goal: reclaim log space

logcabinctl

$ logcabinctl --help
Inspect or modify the state of a single LogCabin server.
...
Commands:
  info get                     Print server ID and addresses.
  debug filename get           Print the server's debug log filename.
  debug filename set <path>    Change the server's debug log filename.
  debug policy get             Print the server's debug log policy.
  debug policy set <value>     Change the server's debug log policy.
  debug rotate                 Rotate the server's debug log file.
  snapshot inhibit get         Print the remaining time for which the server
                               was prevented from taking snapshots.
  snapshot inhibit set [<time>]  Abort the server's current snapshot if one is
                                 in progress, and disallow the server from
                                 starting automated snapshots for the given
                                 duration [default: 1week].
  snapshot inhibit clear       Allow the server to take snapshots normally.
  snapshot start               Begin taking a snapshot if none is in progress.
  snapshot stop                Abort the current snapshot if one is in
                               progress.
  snapshot restart             Abort the current snapshot if one is in
                               progress, then begin taking a new snapshot.
  stats get                    Print detailed server metrics.
  stats dump                   Write detailed server metrics to server's debug
                               log.

Forcing a snapshot

$ logcabinctl --server=$SERVER1IP stats get | grep bytes
  last_snapshot_bytes: 0
  log_bytes: 744921
  open_segment_bytes: 744867

$ logcabinctl --server=$SERVER1IP snapshot start

...wait a second...

$ logcabinctl --server=$SERVER1IP stats get | grep bytes
  last_snapshot_bytes: 2785
  log_bytes: 1863
  open_segment_bytes: 1863

Inhibiting snapshots

scqad (runs Scale's distributed tests) prevents automatic snapshots for one week after a test failure.

$ logcabinctl --server=$SERVER1IP snapshot inhibit set 1week
$ logcabinctl --server=$SERVER1IP snapshot inhibit get
604795.121117807 s

This lets you see more Raft log history with logcabin-storage, since no snapshot reclaims log space in the meantime.

Undo:

$ logcabinctl --server=$SERVER1IP snapshot inhibit clear
$ logcabinctl --server=$SERVER1IP snapshot inhibit get
0 ns

Changing debug log verbosity

Change the server's verbosity at runtime

$ logcabinctl --server=$SERVER1IP debug policy get
NOTICE
$ logcabinctl --server=$SERVER1IP debug policy set \
  Server/RaftConsensus.cc@VERBOSE,Storage@WARNING,NOTICE

Server1$ tail server1.log
Server/RaftConsensus.cc:1393 in handleAppendEntries() VERBOSE: New commitIndex: 4997
Server/RaftConsensus.cc:2804 in setElectionTimer() VERBOSE: Will become candidate in 581 ms
...

Clients have similar control with:


LogCabin::Client::Debug::setLogPolicy(
  LogCabin::Client::Debug::logPolicyFromString(
    "Client@VERBOSE,NOTICE"));

Set server's default verbosity in config file
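
To make the policy string format more concrete, here is a rough sketch of how a string such as the one above could be interpreted. It is not LogCabin's parser: Policy, parsePolicy, and levelFor are hypothetical names, and the matching rule (first pattern that is a prefix of the filename wins, an item without "@" applies to every file, and NOTICE is the fallback) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Each rule pairs a filename pattern with a level name. An empty
// pattern stands for a default that applies to every file.
typedef std::vector<std::pair<std::string, std::string>> Policy;

// Split "pattern@LEVEL,pattern@LEVEL,LEVEL" into rules, in order.
Policy parsePolicy(const std::string& text) {
    Policy policy;
    std::size_t start = 0;
    while (start <= text.size()) {
        std::size_t comma = text.find(',', start);
        if (comma == std::string::npos)
            comma = text.size();
        std::string item = text.substr(start, comma - start);
        std::size_t at = item.find('@');
        if (item.empty()) {
            // Skip empty items such as those from a trailing comma.
        } else if (at == std::string::npos) {
            policy.emplace_back(std::string(), item);
        } else {
            policy.emplace_back(item.substr(0, at), item.substr(at + 1));
        }
        start = comma + 1;
    }
    return policy;
}

// Assumed rule: the first pattern that is a prefix of the filename wins.
std::string levelFor(const Policy& policy, const std::string& filename) {
    for (const auto& rule : policy) {
        if (filename.compare(0, rule.first.size(), rule.first) == 0)
            return rule.second;
    }
    return "NOTICE"; // assumed fallback when no rule matches
}
```

With the policy string from the example above, this sketch yields VERBOSE for Server/RaftConsensus.cc, WARNING for files under Storage, and NOTICE for everything else.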

Code walkthrough

Core/

Random
Time (C++11)
Mutex, condition variable
STL and string utilities
Checksumming (Crypto++)
Debug logging
Config file
Buffer (ptr, len)

Event/

Event loop (epoll)
Signals
Timers
File descriptors
Event::Loop::Lock:
- block event loop thread in user-space outside of any handler
- used when removing monitored files

RPC/

Low-level framing protocol
Application-level connection initiation timeout and heartbeats
Higher-level RPC protocol
Address: DNS resolution
Service: RPC endpoint
Thread dispatch for most services

Protocol/

Mostly Protocol Buffer definitions for individual RPC types

Server/

Raft implementation
State machine

Client sessions
Forking

Daemon startup (Globals)

Storage/

In-memory and on-disk log
Opens/closes snapshot files
Filesystem layout
Filesystem utilities

Tree/

Core data structure for clients
ProtoBuf-to-method call layer

Client/

Implementation of client library API
LeaderRPC: connect to leader
MockClientImpl: in-memory Tree used for testing applications

Examples/

logcabin-reconfigure
logcabin CLI tool
Hello world
Benchmark

Monitor testing example

Code


/// Return once the state machine has
/// applied at least the given entry.
void
StateMachine::wait(uint64_t index) const
{
    std::unique_lock<Mutex> lockGuard(mutex);
    while (lastApplied < index)
        entriesApplied.wait(lockGuard);
}

Unit test


struct WaitHelper {
    explicit WaitHelper(StateMachine& stateMachine)
        : stateMachine(stateMachine)
        , iter(0) {
    }
    void operator()() {
        ++iter;
        if (iter == 1) {
            EXPECT_EQ(0U, stateMachine.lastApplied);
            stateMachine.lastApplied = 2;
        } else if (iter == 2) {
            EXPECT_EQ(2U, stateMachine.lastApplied);
            stateMachine.lastApplied = 3;
        }
    }
    StateMachine& stateMachine;
    uint64_t iter;
};
TEST_F(ServerStateMachineTest, wait)
{
    WaitHelper helper(*stateMachine);
    stateMachine->entriesApplied.callback =
        std::ref(helper);
    stateMachine->wait(3);
    EXPECT_EQ(2U, helper.iter);
}
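
The test above relies on a condition variable whose wait() can be replaced by a callback, so a single-threaded test can step through the loop deterministically. A minimal sketch of that idea follows. HookedConditionVariable and Applier are hypothetical stand-ins, and releasing the lock around the callback is an assumption about how such a hook might behave, not a statement about LogCabin's own class.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>

// Hypothetical condition variable with a test hook.
class HookedConditionVariable {
  public:
    void notify_all() { real.notify_all(); }
    void wait(std::unique_lock<std::mutex>& lockGuard) {
        assert(lockGuard.owns_lock());
        if (callback) {
            // Test mode: instead of blocking, run the hook with the
            // lock released, as if another thread had made progress.
            lockGuard.unlock();
            callback();
            lockGuard.lock();
        } else {
            real.wait(lockGuard);
        }
    }
    std::function<void()> callback; // unset in production use
  private:
    std::condition_variable real;
};

// Hypothetical stand-in with the same wait loop as shown above.
struct Applier {
    void wait(uint64_t index) {
        std::unique_lock<std::mutex> lockGuard(mutex);
        while (lastApplied < index)
            entriesApplied.wait(lockGuard);
    }
    std::mutex mutex;
    HookedConditionVariable entriesApplied;
    uint64_t lastApplied = 0;
};
```

Setting the callback to a function that advances lastApplied lets a test count exactly how many times the loop waited, without any threads or sleeps.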

SegmentedStorage internals

Segment: about 8MB file where consecutive log entries are written
About 3 open segments pre-allocated by background thread
New log entries appended to head segment
Closed segments are immutable

Operations:

Read log entry
Append batch of log entries
Truncate suffix of log
Truncate prefix of log
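
The four operations can be illustrated with a small in-memory sketch. It ignores segments, files, and durability entirely. MemoryLog and its method names are hypothetical. closedSegmentName assumes closed segments are named by 20-digit zero-padded first and last indexes, which is inferred from the directory listing shown earlier rather than taken from the source code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <deque>
#include <string>
#include <vector>

// Assumed naming scheme for a closed segment: "<first>-<last>".
std::string closedSegmentName(uint64_t firstIndex, uint64_t lastIndex) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%020llu-%020llu",
                  static_cast<unsigned long long>(firstIndex),
                  static_cast<unsigned long long>(lastIndex));
    return buf;
}

// Hypothetical in-memory log; entries are opaque strings here.
class MemoryLog {
  public:
    uint64_t getLogStartIndex() const { return startIndex; }
    // One less than the start index when the log is empty.
    uint64_t getLastLogIndex() const {
        return startIndex + entries.size() - 1;
    }
    const std::string& getEntry(uint64_t index) const {
        assert(index >= startIndex && index <= getLastLogIndex());
        return entries[index - startIndex];
    }
    void append(const std::vector<std::string>& batch) {
        entries.insert(entries.end(), batch.begin(), batch.end());
    }
    // Discard all entries after lastIndex.
    void truncateSuffix(uint64_t lastIndex) {
        while (!entries.empty() && getLastLogIndex() > lastIndex)
            entries.pop_back();
    }
    // Discard all entries before firstIndex.
    void truncatePrefix(uint64_t firstIndex) {
        while (!entries.empty() && startIndex < firstIndex) {
            entries.pop_front();
            ++startIndex;
        }
        if (startIndex < firstIndex)
            startIndex = firstIndex;
    }
  private:
    uint64_t startIndex = 1;
    std::deque<std::string> entries;
};
```

In this sketch, truncating the suffix models discarding entries that conflict with the leader's log, and truncating the prefix models discarding entries already covered by a snapshot.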

Snapshotting

Challenge: writing consistent snapshot while taking requests

Writes out Raft header
Forks child
- Child writes Tree data
- Child exits 0
Parent closes/fsyncs file
Watchdog thread in parent in case child stalls
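
The steps above can be sketched with POSIX fork(). This is a simplified illustration, not LogCabin's code: writeSnapshot is a hypothetical function, the header and tree data are plain strings, the parent simply blocks in waitpid() rather than continuing to serve requests, and the watchdog is omitted. The key property is that the child works on a copy-on-write image of memory as of the fork, so the parent could keep modifying its own state without affecting what the child writes.

```cpp
#include <cassert>
#include <stdio.h>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns true if the header and tree data were written and synced.
bool writeSnapshot(const std::string& path,
                   const std::string& header,
                   const std::string& treeData) {
    FILE* file = fopen(path.c_str(), "w");
    if (file == nullptr)
        return false;
    // Parent writes the header and flushes it, so that the child does
    // not inherit (and later duplicate) buffered bytes.
    bool ok = fputs(header.c_str(), file) >= 0 && fflush(file) == 0;
    if (!ok) {
        fclose(file);
        return false;
    }
    pid_t pid = fork();
    if (pid == -1) {
        fclose(file);
        return false;
    }
    if (pid == 0) {
        // Child: sees memory as it was at the time of the fork and
        // shares the file offset with the parent.
        bool wrote = fputs(treeData.c_str(), file) >= 0 &&
                     fflush(file) == 0;
        _exit(wrote ? 0 : 1); // skip atexit handlers and stdio cleanup
    }
    // Parent: wait for the child to exit 0 (a real server would keep
    // serving requests and run a watchdog instead of blocking here).
    int status = 0;
    ok = waitpid(pid, &status, 0) == pid &&
         WIFEXITED(status) && WEXITSTATUS(status) == 0;
    // Parent makes the file durable and closes it.
    ok = fsync(fileno(file)) == 0 && ok;
    ok = fclose(file) == 0 && ok;
    return ok;
}
```

A caller would treat a false return as a failed snapshot and keep the log intact.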

More details in Log Compaction chapter of Raft dissertation.