# etcd Architecture Deep Dive

## Overall Architecture

```mermaid
graph TB
    subgraph "Client Layer"
        KUBECTL[kubectl]
        API[kube-apiserver]
        APPS[Applications]
    end
    subgraph "etcd Cluster"
        subgraph "Node 1 (Leader)"
            HTTP1[HTTP API]
            GRPC1[gRPC API]
            RAFT1[Raft Module]
            STORAGE1[Storage Engine]
            WAL1[WAL]
            SNAP1[Snapshot]
        end
        subgraph "Node 2 (Follower)"
            HTTP2[HTTP API]
            GRPC2[gRPC API]
            RAFT2[Raft Module]
            STORAGE2[Storage Engine]
            WAL2[WAL]
            SNAP2[Snapshot]
        end
        subgraph "Node 3 (Follower)"
            HTTP3[HTTP API]
            GRPC3[gRPC API]
            RAFT3[Raft Module]
            STORAGE3[Storage Engine]
            WAL3[WAL]
            SNAP3[Snapshot]
        end
    end
    KUBECTL --> HTTP1
    API --> GRPC1
    APPS --> GRPC2
    RAFT1 <-.-> RAFT2
    RAFT2 <-.-> RAFT3
    RAFT3 <-.-> RAFT1
    RAFT1 --> STORAGE1
    RAFT2 --> STORAGE2
    RAFT3 --> STORAGE3
    STORAGE1 --> WAL1
    STORAGE2 --> WAL2
    STORAGE3 --> WAL3
```
## Core Modules

### 1. API Layer

#### gRPC services

```protobuf
service KV {
  rpc Range(RangeRequest) returns (RangeResponse);
  rpc Put(PutRequest) returns (PutResponse);
  rpc DeleteRange(DeleteRangeRequest) returns (DeleteRangeResponse);
  rpc Txn(TxnRequest) returns (TxnResponse);
  rpc Compact(CompactionRequest) returns (CompactionResponse);
}

service Watch {
  rpc Watch(stream WatchRequest) returns (stream WatchResponse);
}

service Lease {
  rpc LeaseGrant(LeaseGrantRequest) returns (LeaseGrantResponse);
  rpc LeaseRevoke(LeaseRevokeRequest) returns (LeaseRevokeResponse);
  rpc LeaseKeepAlive(stream LeaseKeepAliveRequest) returns (stream LeaseKeepAliveResponse);
  rpc LeaseTimeToLive(LeaseTimeToLiveRequest) returns (LeaseTimeToLiveResponse);
}

service Cluster {
  rpc MemberAdd(MemberAddRequest) returns (MemberAddResponse);
  rpc MemberRemove(MemberRemoveRequest) returns (MemberRemoveResponse);
  rpc MemberUpdate(MemberUpdateRequest) returns (MemberUpdateResponse);
  rpc MemberList(MemberListRequest) returns (MemberListResponse);
}
```
#### HTTP RESTful API

etcd v3 exposes these services over HTTP through the gRPC gateway; all gateway endpoints use POST:

```text
POST /v3/kv/range
POST /v3/kv/put
POST /v3/kv/deleterange
POST /v3/kv/txn
POST /v3/watch
POST /v3/lease/grant
POST /v3/lease/revoke
POST /v3/cluster/member/add
POST /v3/cluster/member/remove
POST /v3/cluster/member/list
```
### 2. Raft Consensus Layer

#### Raft state machine

```mermaid
stateDiagram-v2
    [*] --> Follower
    Follower --> Candidate : Election timeout
    Candidate --> Leader : Receives majority votes
    Candidate --> Follower : Discovers higher term
    Leader --> Follower : Discovers higher term
    Follower --> Follower : Receives heartbeat
    Candidate --> Candidate : Split vote, restart election
```
#### Core Raft types

```go
type NodeState int

const (
	StateFollower NodeState = iota
	StateCandidate
	StateLeader
)

// Entry is a single Raft log entry.
type Entry struct {
	Term  uint64
	Index uint64
	Type  EntryType
	Data  []byte
}

// Config carries the tunables of a Raft node.
type Config struct {
	ID              uint64
	ElectionTick    int
	HeartbeatTick   int
	Storage         Storage
	Applied         uint64
	MaxSizePerMsg   uint64
	MaxInflightMsgs int
	CheckQuorum     bool
	PreVote         bool
}
```
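`ElectionTick` and `HeartbeatTick` above drive election timing. To avoid repeated split votes, each node waits a *randomized* number of ticks in `[electionTick, 2*electionTick)` before starting an election, so nodes rarely time out simultaneously. A minimal sketch of that strategy (the function name is illustrative, not etcd's actual code):

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomizedElectionTimeout returns a random timeout in
// [electionTick, 2*electionTick) ticks, mirroring etcd/raft's approach
// to de-synchronizing election attempts across nodes.
func randomizedElectionTimeout(electionTick int) int {
	return electionTick + rand.Intn(electionTick)
}

func main() {
	const electionTick = 10 // e.g. 10 ticks of 100ms = 1s base timeout
	for i := 0; i < 3; i++ {
		fmt.Println(randomizedElectionTimeout(electionTick))
	}
}
```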
#### Leader election flow

```mermaid
sequenceDiagram
    participant F1 as Follower 1
    participant F2 as Follower 2
    participant F3 as Follower 3
    Note over F1,F3: Initial state: all followers
    F1->>F1: Election timeout
    F1->>F1: Become candidate
    F1->>F2: RequestVote (term=2)
    F1->>F3: RequestVote (term=2)
    F2->>F1: VoteGranted
    F3->>F1: VoteGranted
    F1->>F1: Majority votes received
    F1->>F1: Become leader
    F1->>F2: Heartbeat (term=2)
    F1->>F3: Heartbeat (term=2)
    F2->>F1: HeartbeatResponse
    F3->>F1: HeartbeatResponse
```
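The election above succeeds because F1 collects votes from a majority. Quorum size for an n-member cluster is ⌊n/2⌋+1, which is also why etcd clusters use odd member counts: a 4-node cluster tolerates no more failures than a 3-node one. A minimal illustration:

```go
package main

import "fmt"

// quorum returns the minimum number of votes needed to win an election
// (or acknowledgements needed to commit a log entry) in an n-member cluster.
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	for _, n := range []int{1, 3, 4, 5} {
		fmt.Printf("cluster=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), n-quorum(n))
	}
}
```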
### 3. Storage Engine Layer

#### BoltDB storage architecture

etcd's backend is built on BoltDB (today the maintained fork, bbolt), a memory-mapped B+Tree key-value store:

```mermaid
graph TB
    subgraph "BoltDB Architecture"
        META[Meta Page]
        FREELIST[Freelist Page]
        BRANCH["Branch Pages (B+Tree internal nodes)"]
        LEAF["Leaf Pages (B+Tree leaf nodes)"]
        META --> FREELIST
        META --> BRANCH
        BRANCH --> LEAF
    end
    subgraph "etcd Storage Abstraction"
        MVCC[MVCC Layer]
        BACKEND[Backend Interface]
        BOLT[BoltDB Implementation]
        MVCC --> BACKEND
        BACKEND --> BOLT
        BOLT --> BRANCH
    end
```
#### MVCC data model

```go
// KeyValue mirrors etcd's mvccpb.KeyValue.
type KeyValue struct {
	Key            []byte
	CreateRevision int64 // store revision at which the key was created
	ModRevision    int64 // store revision of the last write to the key
	Version        int64 // number of writes to the key since creation
	Value          []byte
	Lease          int64
}

type KVStore interface {
	Range(key, end []byte, limit, rangeRev int64) (*RangeResult, error)
	Put(key, value []byte, lease lease.LeaseID) int64
	DeleteRange(key, end []byte) (n, rev int64)

	TxnBegin() TxnID
	TxnEnd(txnID TxnID) error
	TxnRange(txnID TxnID, key, end []byte, limit, rangeRev int64) (*RangeResult, error)
	TxnPut(txnID TxnID, key, value []byte, lease lease.LeaseID) (rev int64, err error)
	TxnDeleteRange(txnID TxnID, key, end []byte) (n, rev int64)

	Compact(rev int64) (<-chan struct{}, error)
}
```
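The revision semantics above (CreateRevision, ModRevision, Version, and a single store-wide revision counter) can be made concrete with a toy in-memory MVCC store. This is a simplification for intuition only, not etcd's actual treeIndex/keyIndex implementation:

```go
package main

import "fmt"

type KeyValue struct {
	Key            string
	CreateRevision int64 // store revision when the key was first created
	ModRevision    int64 // store revision of the last write to the key
	Version        int64 // number of writes to the key since creation
	Value          string
}

// store is a toy MVCC store: every Put bumps one global revision, and all
// versions of every key are retained until a (not implemented) compaction.
type store struct {
	rev     int64
	history map[string][]KeyValue
}

func newStore() *store { return &store{history: map[string][]KeyValue{}} }

func (s *store) Put(key, value string) int64 {
	s.rev++
	prev := s.history[key]
	kv := KeyValue{Key: key, Value: value, ModRevision: s.rev}
	if len(prev) == 0 {
		kv.CreateRevision, kv.Version = s.rev, 1
	} else {
		last := prev[len(prev)-1]
		kv.CreateRevision, kv.Version = last.CreateRevision, last.Version+1
	}
	s.history[key] = append(prev, kv)
	return s.rev
}

// Get returns the newest version of key visible at revision rev,
// which is exactly what a ranged read at a historical revision does.
func (s *store) Get(key string, rev int64) (KeyValue, bool) {
	var out KeyValue
	found := false
	for _, kv := range s.history[key] {
		if kv.ModRevision <= rev {
			out, found = kv, true
		}
	}
	return out, found
}

func main() {
	s := newStore()
	s.Put("foo", "a") // rev 1
	s.Put("foo", "b") // rev 2
	s.Put("bar", "x") // rev 3
	kv, _ := s.Get("foo", 3)
	fmt.Printf("%+v\n", kv) // latest foo: CreateRevision=1, ModRevision=2, Version=2
	old, _ := s.Get("foo", 1)
	fmt.Println(old.Value) // a historical read at rev 1 still sees "a"
}
```

The same mechanism powers `Compact`: dropping all versions with `ModRevision` below the compaction revision frees space without touching current data.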
#### On-disk layout

```text
/var/lib/etcd/
├── member/
│   ├── snap/
│   │   ├── 0000000000000001-0000000000000001.snap
│   │   └── db
│   └── wal/
│       ├── 0000000000000000-0000000000000000.wal
│       └── 0.tmp
└── proxy/
```
### 4. WAL Log System

#### WAL record format

```go
// Record is a single write-ahead-log record; Crc guards against
// torn or corrupted writes.
type Record struct {
	Type EntryType
	Crc  uint32
	Data []byte
}

// WalHeader identifies the WAL stream.
type WalHeader struct {
	Magic   uint64
	Version uint64
	ID      uint64
	NodeID  uint64
}

type EntryType uint8

const (
	EntryNormal     EntryType = 0
	EntryConfChange EntryType = 1
)
```
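The `Crc` field is what makes replay safe: on startup, a record whose stored checksum does not match the recomputed one marks the end of valid data (e.g. a crash mid-write), and replay stops there. A sketch of the idea (etcd actually uses a rolling CRC seeded from the previous record; this uses a plain per-record CRC-32 for simplicity):

```go
package main

import (
	"fmt"
	"hash/crc32"
)

type Record struct {
	Type uint8
	Crc  uint32
	Data []byte
}

// seal computes the checksum over the payload before the record is
// appended to the WAL file.
func seal(typ uint8, data []byte) Record {
	return Record{Type: typ, Crc: crc32.ChecksumIEEE(data), Data: data}
}

// valid re-verifies the checksum on replay; a mismatch means the tail of
// the file is corrupt and replay must stop at the previous record.
func valid(r Record) bool {
	return crc32.ChecksumIEEE(r.Data) == r.Crc
}

func main() {
	r := seal(0, []byte("put foo=bar"))
	fmt.Println(valid(r)) // true

	r.Data[0] ^= 0xFF // simulate on-disk corruption
	fmt.Println(valid(r)) // false
}
```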
#### WAL write flow

```mermaid
sequenceDiagram
    participant C as Client
    participant R as Raft
    participant W as WAL
    participant S as Storage
    C->>R: Put Request
    R->>R: Create Log Entry
    R->>W: Write to WAL
    W->>W: Sync to Disk
    R->>R: Append to Log
    R->>S: Apply to State Machine
    S->>S: Update MVCC
    R->>C: Response
```
### 5. Snapshot Mechanism

#### Snapshot data structures

```go
// Snapshot captures the state machine up to (Term, Index).
type Snapshot struct {
	Index     uint64
	Term      uint64
	ConfState ConfState
	Data      []byte
}

type Snapshotter struct {
	dir string
}

// SaveSnap writes the snapshot under a name derived from its term and index.
func (s *Snapshotter) SaveSnap(snapshot raftpb.Snapshot) error {
	fname := fmt.Sprintf("%016x-%016x.snap",
		snapshot.Metadata.Term, snapshot.Metadata.Index)
	return s.save(fname, snapshot)
}

// Load returns the newest available snapshot.
func (s *Snapshotter) Load() (*raftpb.Snapshot, error) {
	return s.loadMatching(func(fileName string) bool { return true })
}
```
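`SaveSnap` names each file after the Raft term and index it covers (`%016x-%016x.snap`); because the hex fields are zero-padded, lexicographic filename order equals (term, index) order, so the loader can pick the newest snapshot by name alone. A roundtrip sketch (the parsing helper is our own, not etcd's API):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// snapName formats a snapshot filename the way etcd does:
// zero-padded hex term and index.
func snapName(term, index uint64) string {
	return fmt.Sprintf("%016x-%016x.snap", term, index)
}

// parseSnapName recovers term and index from a snapshot filename.
func parseSnapName(name string) (term, index uint64, err error) {
	base := strings.TrimSuffix(name, ".snap")
	parts := strings.Split(base, "-")
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("bad snapshot name %q", name)
	}
	if term, err = strconv.ParseUint(parts[0], 16, 64); err != nil {
		return 0, 0, err
	}
	index, err = strconv.ParseUint(parts[1], 16, 64)
	return term, index, err
}

func main() {
	name := snapName(2, 10)
	fmt.Println(name) // 0000000000000002-000000000000000a.snap
	term, index, err := parseSnapName(name)
	fmt.Println(term, index, err)
}
```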
#### Snapshot creation flow

```mermaid
graph TB
    TRIGGER[Snapshot triggered] --> CHECK{Conditions met?}
    CHECK -->|Yes| CREATE[Create snapshot]
    CHECK -->|No| WAIT[Wait for next check]
    CREATE --> COPY[Copy state data]
    COPY --> WRITE[Write snapshot file]
    WRITE --> CLEANUP[Compact old log entries]
    CLEANUP --> NOTIFY[Notify completion]
    subgraph "Trigger conditions"
        ENTRIES[Log entry count]
        TIME[Time interval]
        SIZE[File size]
    end
```
### 6. Lease System

#### Lease management architecture

```mermaid
graph TB
    subgraph "Lease Manager"
        LESSOR[Lessor]
        HEAP["MinHeap (ordered by expiry)"]
        MAP["LeaseMap (ID to lease)"]
    end
    subgraph "Lease Operations"
        GRANT[Grant Lease]
        RENEW[Renew Lease]
        REVOKE[Revoke Lease]
        EXPIRE[Expire Check]
    end
    GRANT --> LESSOR
    RENEW --> LESSOR
    REVOKE --> LESSOR
    EXPIRE --> LESSOR
    LESSOR --> HEAP
    LESSOR --> MAP
```
#### Lease data structures

```go
// Lease ties a TTL to the set of keys attached to it.
type Lease struct {
	ID           LeaseID
	TTL          int64
	RemainingTTL int64
	Keys         map[string]struct{}
	Expiry       time.Time
}

// lessor implements the Lessor interface below. (The implementation struct
// is unexported, as in etcd itself; it cannot share the interface's name.)
type lessor struct {
	mu        sync.RWMutex
	leaseMap  map[LeaseID]*Lease
	leaseHeap LeaseQueue
	backend   backend.Backend
}

type Lessor interface {
	Grant(id LeaseID, ttl int64) (*Lease, error)
	Revoke(id LeaseID) error
	Renew(id LeaseID) (int64, error)
	Lookup(id LeaseID) *Lease
	ExpiredLeasesC() <-chan []*Lease
}
```
## Data Flow Analysis

### Write path

```mermaid
sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower1
    participant F2 as Follower2
    participant S as Storage
    C->>L: Put Request
    L->>L: Create Raft Entry
    L->>L: Write to WAL
    par Log Replication
        L->>F1: AppendEntries RPC
        L->>F2: AppendEntries RPC
    end
    F1->>L: Success Response
    F2->>L: Success Response
    L->>L: Commit Entry
    L->>S: Apply to MVCC
    S->>S: Update BoltDB
    L->>C: Put Response
```
### Read path

```mermaid
graph TB
    CLIENT[Client Request] --> LEADER{Linearizable read?}
    LEADER -->|Yes| QUORUM[Quorum read via ReadIndex]
    LEADER -->|No| LOCAL[Serializable local read]
    QUORUM --> HEARTBEAT[Leader confirms leadership with a heartbeat round]
    HEARTBEAT --> READ_LATEST[Read latest applied data]
    LOCAL --> READ_LOCAL[Read from local store]
    READ_LATEST --> RESPONSE[Return result]
    READ_LOCAL --> RESPONSE
```
### Watch event stream

```mermaid
sequenceDiagram
    participant C as Client
    participant W as Watchable Store
    participant M as MVCC
    participant S as Syncer
    C->>W: Watch Request
    W->>W: Create Watcher
    W->>C: Watch ID
    Note over M: Data change occurs
    M->>W: MVCC Event
    W->>S: Queue Event
    S->>S: Batch Events
    S->>C: Watch Response
    C->>W: Cancel Watch
    W->>W: Remove Watcher
```
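The register/notify/cancel cycle above maps naturally onto Go channels. A toy watchable store that fans every write out to all registered watchers (single-threaded; etcd's real `watchableStore` adds synced/unsynced watcher groups, key-range filtering, and locking):

```go
package main

import "fmt"

type event struct {
	Key, Value string
	Rev        int64
}

// watchStore delivers every write to all registered watchers,
// the core idea behind etcd's watchable store.
type watchStore struct {
	rev      int64
	watchers map[int]chan event
	nextID   int
}

func newWatchStore() *watchStore {
	return &watchStore{watchers: map[int]chan event{}}
}

// Watch registers a watcher and returns its ID plus its event channel.
func (s *watchStore) Watch() (int, <-chan event) {
	id := s.nextID
	s.nextID++
	ch := make(chan event, 16) // buffered, like etcd's per-watcher queue
	s.watchers[id] = ch
	return id, ch
}

// Cancel removes a watcher and closes its channel.
func (s *watchStore) Cancel(id int) {
	if ch, ok := s.watchers[id]; ok {
		close(ch)
		delete(s.watchers, id)
	}
}

// Put bumps the revision and notifies every watcher.
func (s *watchStore) Put(key, value string) {
	s.rev++
	ev := event{Key: key, Value: value, Rev: s.rev}
	for _, ch := range s.watchers {
		ch <- ev
	}
}

func main() {
	s := newWatchStore()
	id, ch := s.Watch()
	s.Put("foo", "bar")
	fmt.Println(<-ch)
	s.Cancel(id)
}
```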
## Cluster Management

### Member management

```go
// Member describes one etcd cluster member.
type Member struct {
	ID         uint64   `json:"id"`
	Name       string   `json:"name"`
	PeerURLs   []string `json:"peerURLs"`
	ClientURLs []string `json:"clientURLs"`
}

type Cluster struct {
	id      uint64
	token   string
	members map[uint64]*Member
	version int
	index   uint64
}

// ConfChange is the Raft-level membership change proposal.
type ConfChange struct {
	Type    ConfChangeType
	NodeID  uint64
	Context []byte
}
```
### Dynamic reconfiguration flow

```mermaid
graph TB
    REQUEST[Membership change request] --> VALIDATE[Validate the change]
    VALIDATE --> PROPOSE[Propose config change]
    PROPOSE --> REPLICATE[Replicate to a majority]
    REPLICATE --> COMMIT[Commit config change]
    COMMIT --> APPLY[Apply new configuration]
    APPLY --> NOTIFY[Notify cluster members]
```
## Network Communication

### gRPC stack

```mermaid
graph TB
    subgraph "gRPC Layer"
        GRPC[gRPC Server/Client]
        PROTO[Protocol Buffers]
        HTTP2[HTTP/2 Transport]
    end
    subgraph "Network Layer"
        TLS[TLS Encryption]
        TCP[TCP Connection]
        NET[Network Interface]
    end
    GRPC --> PROTO
    PROTO --> HTTP2
    HTTP2 --> TLS
    TLS --> TCP
    TCP --> NET
```
### Client connection management

```go
// ClientConfig mirrors the dial and keepalive knobs of the etcd v3 client.
type ClientConfig struct {
	DialTimeout          time.Duration
	DialKeepAliveTime    time.Duration
	DialKeepAliveTimeout time.Duration
	MaxCallSendMsgSize   int
	MaxCallRecvMsgSize   int
	TLS                  *tls.Config
	Username             string
	Password             string
}

// Balancer spreads requests across healthy cluster endpoints.
type Balancer interface {
	Pick() (Endpoint, error)
	UpdateAddrs(addrs []string) error
	MarkUnhealthy(addr string)
}
```
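The `Pick` side of the `Balancer` interface can be as simple as round-robin. A minimal lock-free sketch (illustrative only; etcd's actual client balancer is built on gRPC's resolver/balancer APIs, and the endpoint URLs below are made up):

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// rrBalancer cycles through endpoints so client load spreads across all
// cluster members instead of pinning to one.
type rrBalancer struct {
	endpoints []string
	next      atomic.Uint64
}

// Pick returns the next endpoint in round-robin order.
func (b *rrBalancer) Pick() (string, error) {
	if len(b.endpoints) == 0 {
		return "", errors.New("no endpoints")
	}
	n := b.next.Add(1) - 1
	return b.endpoints[n%uint64(len(b.endpoints))], nil
}

func main() {
	b := &rrBalancer{endpoints: []string{
		"https://10.0.0.1:2379",
		"https://10.0.0.2:2379",
		"https://10.0.0.3:2379",
	}}
	for i := 0; i < 4; i++ {
		ep, _ := b.Pick()
		fmt.Println(ep) // cycles back to the first endpoint on the 4th pick
	}
}
```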
## Performance Optimizations

### Batch operations

```go
// RangeRequest supports paging, sorting, and keys/count-only projections,
// so one RPC can replace many point reads.
type RangeRequest struct {
	Key          []byte
	RangeEnd     []byte
	Limit        int64
	SortOrder    RangeRequest_SortOrder
	SortTarget   RangeRequest_SortTarget
	Serializable bool
	KeysOnly     bool
	CountOnly    bool
}

// TxnRequest batches guarded operations into one atomic round trip.
type TxnRequest struct {
	Compare []Compare
	Success []RequestOp
	Failure []RequestOp
}
```
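`TxnRequest` is etcd's if/then/else primitive: all `Compare` guards are evaluated atomically, then either the `Success` or the `Failure` ops are applied. A toy version over a plain map shows why this enables lock-free patterns like leader election (a simplification: real compares target revisions and versions, not just values):

```go
package main

import "fmt"

type op struct {
	key, value string
}

// txn models etcd's Txn: if every compare holds, apply the success ops,
// otherwise apply the failure ops. Everything happens atomically against kv.
func txn(kv map[string]string, compares map[string]string, success, failure []op) bool {
	ok := true
	for k, want := range compares {
		if kv[k] != want {
			ok = false
			break
		}
	}
	ops := success
	if !ok {
		ops = failure
	}
	for _, o := range ops {
		kv[o.key] = o.value
	}
	return ok
}

func main() {
	kv := map[string]string{"lock": ""}

	// Acquire the lock only if nobody holds it (value still "").
	acquired := txn(kv, map[string]string{"lock": ""}, []op{{"lock", "client-a"}}, nil)
	fmt.Println(acquired, kv["lock"]) // first acquire succeeds

	// A second acquire fails atomically: the guard no longer holds.
	acquired = txn(kv, map[string]string{"lock": ""}, []op{{"lock", "client-b"}}, nil)
	fmt.Println(acquired, kv["lock"]) // lock still held by client-a
}
```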
### Caching

```mermaid
graph TB
    subgraph "Read Path Optimization"
        REQUEST[Read Request]
        CACHE[Read Cache]
        BACKEND[Backend Storage]
        REQUEST --> CACHE
        CACHE -->|Hit| RESPONSE[Response]
        CACHE -->|Miss| BACKEND
        BACKEND --> RESPONSE
    end
    subgraph "Write Path Optimization"
        WRITE[Write Request]
        BATCH[Batch Writer]
        WAL[WAL Log]
        STORAGE[Storage Engine]
        WRITE --> BATCH
        BATCH --> WAL
        WAL --> STORAGE
    end
```
## Core Design Principles

### 1. Strong consistency

- **Raft consensus**: keeps replicas consistent across the distributed cluster
- **WAL**: guarantees durability and crash recovery
- **MVCC**: provides snapshot isolation and multi-version history

### 2. High performance

- **gRPC**: efficient binary protocol over HTTP/2
- **Batch operations**: amortize network round trips
- **Caching**: speeds up the read path

### 3. High availability

- **Clustered deployment**: tolerates minority node failures
- **Automatic failover**: leader election replaces a failed leader
- **Data replication**: multiple replicas keep data safe

This concludes the architectural deep dive into etcd and the core design ideas behind a consistent distributed key-value store. Upcoming posts will cover implementation details and operational best practices.