LogCabin

usage, operation, and internals

Preface: Diego Ongaro hastily prepared and presented this talk at Scale Computing's Engineering Week in July 2015. Though its organization is poor, it contains some good content, some of which cannot be found elsewhere.
There are sort of two axes to this talk: (1) usage, operation, and internals, and (2) aspects of LogCabin such as writes, reads, membership changes, and compaction. Right now it's all mushed together. On the next revision, it'd be better to do a conceptual overview, then organize by (2), interspersing the (1)s.

Copyright 2015 Diego Ongaro.
Source code available at https://github.com/logcabin/logcabin-talk.
This work is licensed under the Creative Commons Attribution 4.0 International License.

Command-line client

$ logcabin -h
Run various operations on a LogCabin replicated state machine.

Usage: logcabin [options] <command> [<args>]

Commands:
  mkdir <path>    If no directory exists at <path>, create it.
  list <path>     List keys within directory at <path>.
  dump [<path>]   Recursively print keys and values within directory at <path>.
                  Defaults to printing all keys and values from root of tree.
  rmdir <path>    Recursively remove directory at <path>, if any.
  write <path>    Set/create value of file at <path> to stdin.
  read <path>     Print value of file at <path>.
  remove <path>   Remove file at <path>, if any.

Options:
  -c <addresses>, --cluster=<addresses>  Network addresses of the LogCabin
                                         servers, comma-separated
                                         [default: logcabin:5254]
  -d <path>, --dir=<path>        Set working directory [default: /]
  -p <pred>, --condition=<pred>  Set predicate on the operation of the
                                 form <path>:<value>, indicating that the key
                                 at <path> must have the given value.
  -t <time>, --timeout=<time>    Set timeout for the operation
                                 (0 means wait forever) [default: 0s]
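
The same operations are available programmatically through LogCabin's C++ client library (the Tree API covered under Client/ and Tree/ in the code walkthrough below). A minimal sketch along the lines of the Hello world example; the method names used here (makeDirectoryEx, writeEx, readEx) are assumed from that example:

    #include <cassert>
    #include <string>
    #include <LogCabin/Client.h>

    int main()
    {
        // Connect to the cluster; same addresses as the -c/--cluster option.
        LogCabin::Client::Cluster cluster("logcabin:5254");
        LogCabin::Client::Tree tree = cluster.getTree();

        // Roughly: logcabin mkdir /etc ; echo -n ha | logcabin write /etc/passwd
        tree.makeDirectoryEx("/etc");
        tree.writeEx("/etc/passwd", "ha");

        // Roughly: logcabin read /etc/passwd
        std::string contents = tree.readEx("/etc/passwd");
        assert(contents == "ha");
        return 0;
    }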
    

Linearizability (schedules A and B)

An operation is linearizable if it appears to occur instantaneously and exactly once at some point in time between its invocation and its response. For example, once a write has returned to its client, every operation that starts afterwards must observe that write's effect (or the effect of some later operation).

For more detail, see Linearizability: A correctness condition for concurrent objects, Maurice P. Herlihy and Jeannette M. Wing, 1990; and Linearizability versus Serializability, a blog post by Peter Bailis, 2014.
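
Note the "exactly once" part: if a client retransmits a command after a timeout, the state machine must not apply it twice. LogCabin's state machine tracks client sessions for this (see Server/ in the code walkthrough). A minimal sketch of the idea, with hypothetical types and names, not LogCabin's actual session code:

    #include <cstdint>
    #include <functional>
    #include <map>
    #include <string>

    // Per-client session: the last request number applied and its cached response.
    struct Session {
        uint64_t lastRequestApplied = 0;
        std::string lastResponse;
    };

    // Apply a possibly retransmitted command exactly once: a duplicate returns
    // the cached response instead of being re-executed.
    std::string
    applyOnce(std::map<uint64_t, Session>& sessions,
              uint64_t clientId,
              uint64_t requestNumber,
              const std::string& command,
              const std::function<std::string(const std::string&)>& execute)
    {
        Session& session = sessions[clientId];
        if (requestNumber <= session.lastRequestApplied)
            return session.lastResponse;
        session.lastResponse = execute(command);
        session.lastRequestApplied = requestNumber;
        return session.lastResponse;
    }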

Replicated state machines

LogCabin is one of these.

  • Replicated log ⇒ replicated state machine
  • All servers execute same commands in same order
  • Consensus module ensures proper log replication
  • System makes progress as long as any majority of servers up
  • Failure model: fail-stop (not Byzantine), delayed/lost msgs
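
Conceptually, each replica is just a deterministic apply loop over committed log entries: because every server applies the same commands in the same order, every server ends up in the same state. A minimal sketch of that idea (hypothetical types, not LogCabin's actual classes):

    #include <cstdint>
    #include <map>
    #include <string>

    // A committed log entry carrying a command for the state machine.
    struct Entry {
        uint64_t index;
        std::string key;
        std::string value;
    };

    // Every replica runs the same apply() over the same committed entries,
    // so all replicas converge to the same key-value state.
    class KeyValueStateMachine {
      public:
        void apply(const Entry& entry) {
            data[entry.key] = entry.value;
            lastApplied = entry.index;
        }
      private:
        std::map<std::string, std::string> data;
        uint64_t lastApplied = 0;
    };
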
    Bootstrapping first server

    The very first server needs to be bootstrapped.

    Bootstrapping initializes that server's log with a configuration entry containing just itself.

    Server1$ cat logcabin.conf
    serverId = 1
    listenAddresses = 192.168.5.12
    storagePath = storage
        
    Server1$ logcabind --config=logcabin.conf --bootstrap
    ...
    Server/RaftConsensus.cc:572 in setConfiguration(): Activating configuration 1:
    prev_configuration {
      servers {
        server_id: 1
        addresses: "192.168.5.12"
      }
    }
    
    Server/Main.cc:346 in main(): Done bootstrapping configuration. Exiting.
    ...
        

    Bootstrapping effects

    The directory structure shows a single log segment with a single entry, and no snapshot.

    Server1$ tree -sh --du -F storage
    storage
    └── [ 16K]  server1/
        ├── [   0]  lock
        ├── [8.1K]  log/
        │   └── [4.1K]  Segmented-Binary/
        │       ├── [  51]  00000000000000000001-00000000000000000001
        │       ├── [  35]  metadata1
        │       └── [  35]  metadata2
        └── [4.0K]  snapshot/
        

    Bootstrapped log contents

    logcabin-storage dumps out the log.

    Server1$ logcabin-storage --config=logcabin.conf
    ...
    Storage/Tool.cc:255 in main(): Log contents start
    Log:
    metadata start: 
    current_term: 1
    voted_for: 0
    end of metadata
    startIndex: 1
    
    Entry 1 start:
    term: 1
    type: CONFIGURATION
    configuration {
      prev_configuration {
        servers {
          server_id: 1
          addresses: "192.168.5.12"
        }
      }
    }
    index: 1
    cluster_time: 0
    end of entry 1
    Storage/Tool.cc:257 in main(): Log contents end
    ...
    Storage/Tool.cc:260 in main(): Reading snapshot at storage/server1
    Storage/Tool.cc:153 in readSnapshot(): Snapshot file not found in storage/server1/snapshot
        

    Adding servers: starting them

    First, let's start up the servers.

    Server2$ cat logcabin.conf
    serverId = 2
    listenAddresses = 192.168.5.13
    storagePath = storage
        
    Server3$ cat logcabin.conf
    serverId = 3
    listenAddresses = 192.168.5.14
    storagePath = storage
        
    Server1$ logcabind --config=logcabin.conf --daemon --log=server1.log
    Server2$ logcabind --config=logcabin.conf --daemon --log=server2.log
    Server3$ logcabind --config=logcabin.conf --daemon --log=server3.log
        

    Adding servers: first becomes leader

    The first server, since it was bootstrapped with a configuration, becomes leader of its single-server cluster.

    $ logcabinctl --server=$SERVER1IP stats get
    ...
    server_id: 1
    addresses: "192.168.5.12"
    ...
    raft {
      current_term: 2
      state: LEADER
      commit_index: 6
      last_log_index: 6
      leader_id: 1
      voted_for: 1
      ...
      last_snapshot_index: 0
      last_snapshot_bytes: 0
      log_start_index: 1
      log_bytes: 679
      ...
      peer {
        server_id: 1
        addresses: "192.168.5.12"
        old_member: true
        new_member: false
        staging_member: false
        last_synced_index: 6
      }
    }
    ...
        

    Adding servers: others idle

    Other servers just sit idle, since they don't have a configuration.

    $ logcabinctl --server=$SERVER2IP stats get
    ...
    server_id: 2
    addresses: "192.168.5.13"
    ...
    raft {
      current_term: 0
      state: FOLLOWER
      commit_index: 0
      last_log_index: 0
      leader_id: 0
      voted_for: 0
      ...
      last_snapshot_index: 0
      last_snapshot_bytes: 0
      log_start_index: 1
      log_bytes: 1
      ...
      peer {
        server_id: 2
        addresses: ""
        old_member: false
        new_member: false
        staging_member: false
      }
    }
    ...
        

    Adding servers: reconfigure

    We can ask the leader to add the other two.

    $ export CLUSTER=$SERVER1IP,$SERVER2IP,$SERVER3IP
    $ logcabin-reconfigure --cluster=$CLUSTER set $SERVER1IP $SERVER2IP $SERVER3IP
    Current configuration:
    Configuration 1:
    - 1: 192.168.5.12
    
    Attempting to change cluster membership to the following:
    1: 192.168.5.12 (given as 192.168.5.12)
    2: 192.168.5.13 (given as 192.168.5.13)
    3: 192.168.5.14 (given as 192.168.5.14)
    
    Membership change result: OK
    
    Current configuration:
    Configuration 11:
    - 1: 192.168.5.12
    - 2: 192.168.5.13
    - 3: 192.168.5.14
        

    Adding servers: done

    Now servers 2 and 3 are part of the cluster, have the latest configuration, and are proper followers.

    $ logcabinctl --server=$SERVER2IP stats get
    ...
    raft {
      current_term: 17
      state: FOLLOWER
      commit_index: 1045
      last_log_index: 1045
      leader_id: 1
      voted_for: 0
      ...
      peer {
        server_id: 1
        addresses: "192.168.5.12"
        old_member: true
        new_member: false
        staging_member: false
      }
      peer {
        server_id: 2
        addresses: "192.168.5.13"
        old_member: true
        new_member: false
        staging_member: false
      }
      peer {
        server_id: 3
        addresses: "192.168.5.14"
        old_member: true
        new_member: false
        staging_member: false
      }
    }
        

    Joint consensus: intro

    LogCabin uses the joint consensus approach to Raft membership changes.

    • Older form of membership changes, prior to single-server approach.
      • Just for historical reasons
      • Somewhat more complex
      • More flexible
    • Allows transitioning from one cluster to another arbitrarily.
      • Can add/remove multiple servers at once
      • No need for any overlap
    • Cluster remains available throughout the change
      • Except for a brief leader election gap if the leader is not in the new cluster

    More details in Membership Changes chapter of Raft dissertation.

    Joint consensus: procedure

    1. Leader catches up new servers with the latest snapshot and (most) log entries.
    2. Leader appends a transitional configuration entry to its log. Under this configuration, becoming leader and committing entries requires both:
      • a majority of the old configuration, and
      • a majority of the new configuration.
    3. Leader commits transitional configuration entry.
    4. Leader appends new configuration entry to its log.
    5. Leader commits new configuration entry.

    More details in Membership Changes chapter of Raft dissertation.
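
    The steps above amount to a leader-side sequence, sketched below with hypothetical helpers (Configuration, catchUp, and appendAndCommit are stand-ins, not LogCabin's actual functions):

    #include <cstdint>
    #include <vector>

    // Hypothetical, simplified configuration: a majority is required from
    // oldServers and, while newServers is non-empty, from newServers too.
    struct Configuration {
        std::vector<uint64_t> oldServers;
        std::vector<uint64_t> newServers;
    };

    // Stand-ins for the real leader operations.
    void catchUp(const std::vector<uint64_t>&) { /* send snapshot + log entries */ }
    void appendAndCommit(const Configuration&) { /* append entry, replicate, wait for commit */ }

    // Joint-consensus membership change, as driven by the leader.
    void changeConfiguration(const std::vector<uint64_t>& oldServers,
                             const std::vector<uint64_t>& newServers)
    {
        // 1. Catch up the new servers before they count toward majorities.
        catchUp(newServers);
        // 2-3. Append and commit the transitional configuration C_old,new:
        //      elections and commitment now require a majority of BOTH sets.
        appendAndCommit(Configuration{oldServers, newServers});
        // 4-5. Append and commit the final configuration C_new.
        appendAndCommit(Configuration{newServers, {}});
    }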

    Joint consensus: debug log

    Log from adding servers after bootstrapping

    Server1$ cat server1.log
    ...
    Server/RaftConsensus.cc:1595 in setConfiguration(): Attempting to change the configuration from 1
    Server/RaftConsensus.cc:1603 in setConfiguration(): Adding server 1 at 192.168.5.12 to staging servers
    Server/RaftConsensus.cc:1603 in setConfiguration(): Adding server 2 at 192.168.5.13 to staging servers
    Server/RaftConsensus.cc:1603 in setConfiguration(): Adding server 3 at 192.168.5.14 to staging servers
    ...
    Server/RaftConsensus.cc:1625 in setConfiguration(): Done catching up servers
    Server/RaftConsensus.cc:1650 in setConfiguration(): Writing transitional configuration entry
    ...
    Server/RaftConsensus.cc:572 in setConfiguration(): Activating configuration 10:
    prev_configuration {
      servers { server_id: 1, addresses: "192.168.5.12" }
    }
    next_configuration {
      servers { server_id: 1, addresses: "192.168.5.12" },
      servers { server_id: 2, addresses: "192.168.5.13" },
      servers { server_id: 3, addresses: "192.168.5.14" }
    }
    Server/RaftConsensus.cc:572 in setConfiguration(): Activating configuration 11:
    prev_configuration {
      servers { server_id: 1, addresses: "192.168.5.12" },
      servers { server_id: 2, addresses: "192.168.5.13" },
      servers { server_id: 3, addresses: "192.168.5.14" }
    }
    ...
        

    Joint consensus: arbitrary cluster changes

    $ logcabin-reconfigure --cluster=$CLUSTER set $SERVER1IP
    Current configuration:
    Configuration 11:
    - 1: 192.168.5.12
    - 2: 192.168.5.13
    - 3: 192.168.5.14
    
    Attempting to change cluster membership to the following:
    1: 192.168.5.12 (given as 192.168.5.12)
    
    Membership change result: OK
    
    Current configuration:
    Configuration 2357:
    - 1: 192.168.5.12
    
    $ logcabin-reconfigure --cluster=$CLUSTER set $SERVER2IP $SERVER3IP
    Current configuration:
    Configuration 2357:
    - 1: 192.168.5.12
    
    Attempting to change cluster membership to the following:
    2: 192.168.5.13 (given as 192.168.5.13)
    3: 192.168.5.14 (given as 192.168.5.14)
    
    [...wait for leader election...]
    Membership change result: OK
    
    Current configuration:
    Configuration 2378:
    - 2: 192.168.5.13
    - 3: 192.168.5.14
        

    Intro snapshotting

    Snapshotting is discussed later, but you'll probably need some concept of it to understand half of logcabinctl.

    Goal: reclaim log space

    logcabinctl

    $ logcabinctl --help
    Inspect or modify the state of a single LogCabin server.
    ...
    Commands:
      info get                     Print server ID and addresses.
      debug filename get           Print the server's debug log filename.
      debug filename set <path>    Change the server's debug log filename.
      debug policy get             Print the server's debug log policy.
      debug policy set <value>     Change the server's debug log policy.
      debug rotate                 Rotate the server's debug log file.
      snapshot inhibit get         Print the remaining time for which the server
                                   was prevented from taking snapshots.
      snapshot inhibit set [<time>]  Abort the server's current snapshot if one is
                                     in progress, and disallow the server from
                                     starting automated snapshots for the given
                                     duration [default: 1week].
      snapshot inhibit clear       Allow the server to take snapshots normally.
      snapshot start               Begin taking a snapshot if none is in progress.
      snapshot stop                Abort the current snapshot if one is in
                                   progress.
      snapshot restart             Abort the current snapshot if one is in
                                   progress, then begin taking a new snapshot.
      stats get                    Print detailed server metrics.
      stats dump                   Write detailed server metrics to server's debug
                                   log.
        

    Forcing a snapshot

    $ logcabinctl --server=$SERVER1IP stats get | grep bytes
      last_snapshot_bytes: 0
      log_bytes: 744921
      open_segment_bytes: 744867
        
    $ logcabinctl --server=$SERVER1IP snapshot start
        
    ...wait a second...
    $ logcabinctl --server=$SERVER1IP stats get | grep bytes
      last_snapshot_bytes: 2785
      log_bytes: 1863
      open_segment_bytes: 1863
        

    Inhibiting snapshots

    scqad (runs Scale's distributed tests) prevents automatic snapshots for one week after a test failure.

    $ logcabinctl --server=$SERVER1IP snapshot inhibit set 1week
    $ logcabinctl --server=$SERVER1IP snapshot inhibit get
    604795.121117807 s
        

    Inhibiting snapshots lets you see more Raft log history with logcabin-storage.

    Undo:

    $ logcabinctl --server=$SERVER1IP snapshot inhibit clear
    $ logcabinctl --server=$SERVER1IP snapshot inhibit get
    0 ns
        

    Changing debug log verbosity

    Change the server's verbosity at runtime

    $ logcabinctl --server=$SERVER1IP debug policy get
    NOTICE
    $ logcabinctl --server=$SERVER1IP debug policy set \
      Server/RaftConsensus.cc@VERBOSE,Storage@WARNING,NOTICE
    
    Server1$ tail server1.log
    Server/RaftConsensus.cc:1393 in handleAppendEntries() VERBOSE: New commitIndex: 4997
    Server/RaftConsensus.cc:2804 in setElectionTimer() VERBOSE: Will become candidate in 581 ms
    ...
    

    Clients have similar control with:

    
    LogCabin::Client::Debug::setLogPolicy(
      LogCabin::Client::Debug::logPolicyFromString(
        "Client@VERBOSE,NOTICE"));
        

    Set server's default verbosity in config file

    Code walkthrough

    Core/

    • Random
    • Time (C++11)
    • Mutex, condition variable
    • STL and string utilities
    • Checksumming (Crypto++)
    • Debug logging
    • Config file
    • Buffer (ptr, len)

    Event/

    • Event loop (epoll)
    • Signals
    • Timers
    • File descriptors
    • Event::Loop::Lock:
      • block event loop thread in user-space outside of any handler
      • used when removing monitored files

    RPC/

    • Low-level framing protocol
    • Application-level connection initiation timeout and heartbeats
    • Higher-level RPC protocol
    • Address: DNS resolution
    • Service: RPC endpoint
    • Thread dispatch for most services

    Protocol/

    • Mostly Protocol Buffer definitions for individual RPC types

    Server/

    • Raft implementation
    • State machine
      • Client sessions
      • Forking
    • Daemon startup (Globals)

    Storage/

    • In-memory and on-disk log
    • Opens/closes snapshot files
    • Filesystem layout
    • Filesystem utilities

    Tree/

    • Core data structure for clients
    • ProtoBuf-to-method call layer

    Client/

    • Implementation of client library API
    • LeaderRPC: connect to leader
    • MockClientImpl: in-memory Tree used for testing applications

    Examples/

    • logcabin-reconfigure
    • logcabin CLI tool
    • Hello world
    • Benchmark

    Monitor testing example

    LogCabin's condition variables expose a callback hook for unit tests: when a test sets the callback, each wait() call invokes it instead of blocking, so a single-threaded test can deterministically drive the state the waiter is waiting on.

    Code
    
    /// Return once the state machine has
    /// applied at least the given entry.
    void
    StateMachine::wait(uint64_t index) const
    {
        std::unique_lock<Mutex> lockGuard(mutex);
        while (lastApplied < index)
            entriesApplied.wait(lockGuard);
    }
        
    Unit test
    
    struct WaitHelper {
        explicit WaitHelper(StateMachine& stateMachine)
            : stateMachine(stateMachine)
            , iter(0) {
        }
        void operator()() {
            ++iter;
            if (iter == 1) {
                EXPECT_EQ(0U, stateMachine.lastApplied);
                stateMachine.lastApplied = 2;
            } else if (iter == 2) {
                EXPECT_EQ(2U, stateMachine.lastApplied);
                stateMachine.lastApplied = 3;
            }
        }
        StateMachine& stateMachine;
        uint64_t iter;
    };
    TEST_F(ServerStateMachineTest, wait)
    {
        WaitHelper helper(*stateMachine);
        stateMachine->entriesApplied.callback =
            std::ref(helper);
        stateMachine->wait(3);
        EXPECT_EQ(2U, helper.iter);
    }
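
    The trick the test relies on is a condition-variable wrapper with a test hook. A simplified sketch (hypothetical; not LogCabin's actual Core::ConditionVariable):

    #include <condition_variable>
    #include <functional>
    #include <mutex>

    class TestableConditionVariable {
      public:
        // Unit tests set this; production code leaves it empty.
        std::function<void()> callback;

        void wait(std::unique_lock<std::mutex>& lockGuard) {
            if (callback) {
                // Under test: invoke the callback instead of blocking, so a
                // single-threaded test (like WaitHelper above) can advance
                // the state being waited on.
                callback();
            } else {
                cv.wait(lockGuard);
            }
        }

        void notify_all() { cv.notify_all(); }

      private:
        std::condition_variable cv;
    };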
        

    SegmentedStorage internals

    • Segment: about 8MB file where consecutive log entries are written
    • About 3 open segments pre-allocated by background thread
    • New log entries appended to head segment
    • Closed segments are immutable

    Operations:

    • Read log entry
    • Append batch of log entries
    • Truncate suffix of log
    • Truncate prefix of log
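
    These operations map onto a log interface roughly like the following sketch (simplified and hypothetical, not LogCabin's actual Storage::Log class):

    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    // Simplified view of the on-disk log's operations.
    class Log {
      public:
        virtual ~Log() {}
        // Read the entry stored at the given index.
        virtual std::string getEntry(uint64_t index) const = 0;
        // Append a batch of entries after the current last index;
        // returns the (first, last) indexes assigned to the batch.
        virtual std::pair<uint64_t, uint64_t>
        append(const std::vector<std::string>& entries) = 0;
        // Truncate suffix: discard entries after lastIndex (used when a
        // follower's log conflicts with the leader's).
        virtual void truncateSuffix(uint64_t lastIndex) = 0;
        // Truncate prefix: discard entries before firstIndex (used once
        // those entries are covered by a snapshot).
        virtual void truncatePrefix(uint64_t firstIndex) = 0;
    };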

    Snapshotting

    Challenge: writing a consistent snapshot while continuing to serve requests

    • Writes out Raft header
    • Forks child
      • Child writes Tree data
      • Child exits 0
    • Parent closes/fsyncs file
    • Watchdog thread in parent in case child stalls

    More details in Log Compaction chapter of Raft dissertation.
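
    A rough sketch of the fork-based approach (hypothetical helper functions; also simplified to wait for the child synchronously, whereas LogCabin's parent keeps serving requests while a watchdog thread supervises the child):

    #include <cstdio>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    // Stand-ins for the real serialization code.
    void writeRaftHeader(FILE*) { /* last included index/term, configuration, ... */ }
    void writeTreeData(FILE*)   { /* serialize the Tree state machine */ }

    // Write a consistent snapshot while the parent process keeps mutating the Tree.
    bool takeSnapshot(FILE* snapshotFile)
    {
        writeRaftHeader(snapshotFile);       // parent writes the Raft header first
        fflush(snapshotFile);

        pid_t child = fork();                // child gets a copy-on-write view of memory
        if (child == 0) {
            writeTreeData(snapshotFile);     // child serializes the frozen Tree state
            fflush(snapshotFile);
            _exit(0);                        // child exits 0 on success
        }

        int status = 0;
        waitpid(child, &status, 0);
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            return false;                    // child did not exit cleanly

        fsync(fileno(snapshotFile));         // parent fsyncs and closes the finished file
        fclose(snapshotFile);
        return true;
    }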