Apache RocketMQ 源码解析 —— Controller 高可用切换架构
一、原理及核心概念浅述
1.1 核心架构

1.2 核心概念
- controller:负责管理broker间的主备关系,可以挂在namesrv中,不影响namesrv能力,支持独立部署。
- master/slave:主备身份。
- syncStateSet:字面意思为“同步状态集合”。当备节点能够及时跟上主节点,则会纳入syncStateSet。
- epoch:用于记录每一次主备切换时的状态,避免切换后产生数据丢失或者不一致的情况。
为方便理解,在某些过程中可以把controller当作班主任,master作为小组长,slave作为小组成员。同步过程是各位同学向小组长抄作业的过程,位于syncStateSet中的是优秀作业。
二、相关代码文件及说明
核心是“controller+broker+复制过程”,因此分三块进行叙述。
2.1 Controller
该部分代码主要集中在rocketmq-controller模块下,主要有如下代码文件:
- **ControllerManager: **负责管理controller,其中存储了许多controller相关配置,并负责了心跳管理等核心功能。(班主任管理条例)
- DLederController: Controller的DLedger实现。包含了controller的基本功能。在其中实现了副本信息管理、broker存活情况探测、选举Master等核心功能。(某种班主任)
- **DefaultBrokerHeartbeatManager: **负责管理broker心跳,其中包含了broker存活情况表,以及在broker下线时的listeners,当副本掉线时,触发重新选举。(点名册)
- ReplicasInfoManager: 负责controller中事件的处理。即各种选举事件、更换SyncStateSet事件等等。(小组登记册)
- **ControllerRequestProcessor: **处理向controller发送的requests,例如让controller选举、向controller注册broker、心跳、更换SyncStateSet等等。(班主任信箱)
- **DefaultElectPolicy: **选举Master的策略。可以选择从sync状态的副本中选,也可以支持从所有副本中(无论是否同步)的unclean选举。(班规)
- …
2.2 Broker
该部分代码主要集中在rocketmq-broker模块中,可进入org/apache/rocketmq/broker/controller进行查看:
- ReplicasManager: 完成自己作为一个replica的使命——找controller,角色管理,Master更新(Expand/Shrink)SyncStateSet等等。
2.3 复制模块
该部分代码主要集中在rocketmq-store模块中的ha文件夹下:
- **HAService: **每个Replica必备的的service,负责管理作为主、备的同步任务。
- **HAClient: **每个Slave 的HAService中必备的client,负责管理同步任务中的读、写操作。
- HAConnection: 代表在Master中的HA连接,每个connection理论上对应一个slave。在该connection类中存储了传输过程中的诸多内容,包括channel、传输状态、当前传输位点等等信息。
三、核心流程
3.1 心跳
核心CODE:BROKER_HEARTBEAT
Broker端:
该部分较简单,带上code向controller发request,不再赘述:
BrokerController.sendHeartbeat() -> brokerOuterAPI.sendHeartbeat()
Controller端:
�1. 首先由**ControllerRequestProcessor**接收到code,进入处理逻辑:
private RemotingCommand handleBrokerHeartbeat(ChannelHandlerContext ctx, RemotingCommand request) throws Exception {    final BrokerHeartbeatRequestHeader requestHeader = (BrokerHeartbeatRequestHeader) request.decodeCommandCustomHeader(BrokerHeartbeatRequestHeader.class);    if (requestHeader.getBrokerId() == null) {        return RemotingCommand.createResponseCommand(ResponseCode.CONTROLLER_INVALID_REQUEST, "Heart beat with empty brokerId");    }    this.heartbeatManager.onBrokerHeartbeat(requestHeader.getClusterName(), requestHeader.getBrokerName(), requestHeader.getBrokerAddr(), requestHeader.getBrokerId(),                                            requestHeader.getHeartbeatTimeoutMills(), ctx.channel(), requestHeader.getEpoch(), requestHeader.getMaxOffset(), requestHeader.getConfirmOffset(), requestHeader.getElectionPriority());    return RemotingCommand.createResponseCommand(ResponseCode.SUCCESS, "Heart beat success");}- 之后在onBrokerHeartbeat()中,主要更新controller brokerHeartbeatManager中的**brokerLiveTable**:
public void onBrokerHeartbeat(String clusterName, String brokerName, String brokerAddr, Long brokerId,                              Long timeoutMillis, Channel channel, Integer epoch, Long maxOffset, Long confirmOffset, Integer electionPriority) {    BrokerIdentityInfo brokerIdentityInfo = new BrokerIdentityInfo(clusterName, brokerName, brokerId);    BrokerLiveInfo prev = this.brokerLiveTable.get(brokerIdentityInfo);    ......    if (null == prev) {        this.brokerLiveTable.put(...);        log.info("new broker registered, {}, brokerId:{}", brokerIdentityInfo, realBrokerId);    } else {        prev.setXXX(......)    }}�
3.2 选举
相关CODE: CONTROLLER_ELECT_MASTER
�
有如下几种情形可能触发选举:
- controller主动发起,通过triggerElectMaster():
- HeartbeatManager监听到有broker心跳失效。 (班主任发现有小组同学退学了)
- Controller检测到有一组Replica Set不存在master。(班主任发现有组长虽然在名册里,但是挂了)
 
- broker发起将自己选为master,通过ReplicaManager.brokerElect():
- Broker向controller查metadata时,没找到master信息。(同学定期检查小组情况,问班主任为啥没小组长)
- Broker向controller注册完后,仍未从controller获取到master信息。(同学报道后发现没小组长,汇报)
 
- 通过tools发起:
- 通过选举命令ReElectMasterSubCommand发起。(校长直接任命)
 
上述所有过程,最终均触发:
controller.electMaster() -> replicasInfoManager.electMaster()
// 即,所有小组长必须通过班主任任命
public ControllerResult<ElectMasterResponseHeader> electMaster(final ElectMasterRequestHeader request,                                                               final ElectPolicy electPolicy) {    ...    // 从request中取信息    ...    if (syncStateInfo.isFirstTimeForElect()) {        // 从未注册,直接任命        newMaster = brokerId;    }
    // 按选举政策选主    if (newMaster == null) {        // we should assign this assignedBrokerId when the brokerAddress need to be elected by force        Long assignedBrokerId = request.getDesignateElect() ? brokerId : null;        newMaster = electPolicy.elect(brokerReplicaInfo.getClusterName(), brokerReplicaInfo.getBrokerName(), syncStateSet, allReplicaBrokers, oldMaster, assignedBrokerId);    }
    if (newMaster != null && newMaster.equals(oldMaster)) {        // 老主 == 新主        // old master still valid, change nothing        String err = String.format("The old master %s is still alive, not need to elect new master for broker %s", oldMaster, brokerReplicaInfo.getBrokerName());        LOGGER.warn("{}", err);        // the master still exist        response.setXXX()
        result.setBody(new ElectMasterResponseBody(syncStateSet).encode());        result.setCodeAndRemark(ResponseCode.CONTROLLER_MASTER_STILL_EXIST, err);        return result;    }
    // a new master is elected    if (newMaster != null) {        // 出现不一样的新主        final int masterEpoch = syncStateInfo.getMasterEpoch();        final int syncStateSetEpoch = syncStateInfo.getSyncStateSetEpoch();        final HashSet<Long> newSyncStateSet = new HashSet<>();        //设置新的syncStateSet        newSyncStateSet.add(newMaster);        response.setXXX()...        ElectMasterResponseBody responseBody =            new ElectMasterResponseBody(newSyncStateSet);    }
        result.setBody(responseBody.encode());        final ElectMasterEvent event = new ElectMasterEvent(brokerName, newMaster);        result.addEvent(event);        return result;    }
    // 走到这里,说明没有主,选举失败    // If elect failed and the electMaster is triggered by controller (we can figure it out by brokerAddress),    // we still need to apply an ElectMasterEvent to tell the statemachine    // that the master was shutdown and no new master was elected.    if (request.getBrokerId() == null) {        final ElectMasterEvent event = new ElectMasterEvent(false, brokerName);        result.addEvent(event);        result.setCodeAndRemark(ResponseCode.CONTROLLER_MASTER_NOT_AVAILABLE, "Old master has down and failed to elect a new broker master");    } else {        result.setCodeAndRemark(ResponseCode.CONTROLLER_ELECT_MASTER_FAILED, "Failed to elect a new master");    }        return result;    }3.3 更新SyncStateSet
核心CODE: CONTROLLER_ALTER_SYNC_STATE_SET
- �由master发起,主动向controller更换syncStateSet(等价于小组长汇报优秀作业)
- controllerRequestProcessor接收更换syncStateSet的请求,进入handleAlterSyncStateSet()方法:
private RemotingCommand handleAlterSyncStateSet(ChannelHandlerContext ctx,                                                RemotingCommand request) throws Exception {    final AlterSyncStateSetRequestHeader controllerRequest = (AlterSyncStateSetRequestHeader) request.decodeCommandCustomHeader(AlterSyncStateSetRequestHeader.class);    final SyncStateSet syncStateSet = RemotingSerializable.decode(request.getBody(), SyncStateSet.class);    final CompletableFuture<RemotingCommand> future = this.controllerManager.getController().alterSyncStateSet(controllerRequest, syncStateSet);    if (future != null) {        return future.get(WAIT_TIMEOUT_OUT, TimeUnit.SECONDS);    }    return RemotingCommand.createResponseCommand(null);}- 之后进入Controller.alterSyncStateSet() -> replicasInfoManager.alterSyncStateSet()方法:
public ControllerResult<AlterSyncStateSetResponseHeader> alterSyncStateSet(    final AlterSyncStateSetRequestHeader request, final SyncStateSet syncStateSet,    final BrokerValidPredicate brokerAlivePredicate) {    final String brokerName = request.getBrokerName();  ...    final Set<Long> newSyncStateSet = syncStateSet.getSyncStateSet();    final SyncStateInfo syncStateInfo = this.syncStateSetInfoTable.get(brokerName);    final BrokerReplicaInfo brokerReplicaInfo = this.replicaInfoTable.get(brokerName);
    // 检查syncStateSet是否有变化    final Set<Long> oldSyncStateSet = syncStateInfo.getSyncStateSet();    if (oldSyncStateSet.size() == newSyncStateSet.size() && oldSyncStateSet.containsAll(newSyncStateSet)) {        String err = "The newSyncStateSet is equal with oldSyncStateSet, no needed to update syncStateSet";        ...    }
    // 检查是否是master发起的    if (!syncStateInfo.getMasterBrokerId().equals(request.getMasterBrokerId())) {        String err = String.format("Rejecting alter syncStateSet request because the current leader is:{%s}, not {%s}",                                   syncStateInfo.getMasterBrokerId(), request.getMasterBrokerId());        ...    }
    // 检查master的任期epoch是否一致    if (request.getMasterEpoch() != syncStateInfo.getMasterEpoch()) {        String err = String.format("Rejecting alter syncStateSet request because the current master epoch is:{%d}, not {%d}",                                   syncStateInfo.getMasterEpoch(), request.getMasterEpoch());        ...    }
    // 检查syncStateSet的epoch    if (syncStateSet.getSyncStateSetEpoch() != syncStateInfo.getSyncStateSetEpoch()) {        String err = String.format("Rejecting alter syncStateSet request because the current syncStateSet epoch is:{%d}, not {%d}",                                   syncStateInfo.getSyncStateSetEpoch(), syncStateSet.getSyncStateSetEpoch());        ...    }
    // 检查新的syncStateSet的合理性    for (Long replica : newSyncStateSet) {        // 检查replica是否存在        if (!brokerReplicaInfo.isBrokerExist(replica)) {            String err = String.format("Rejecting alter syncStateSet request because the replicas {%s} don't exist", replica);            ...        }        // 检查broker是否存活        if (!brokerAlivePredicate.check(brokerReplicaInfo.getClusterName(), brokerReplicaInfo.getBrokerName(), replica)) {            String err = String.format("Rejecting alter syncStateSet request because the replicas {%s} don't alive", replica);            ...        }    }
    // 检查是否包含master    if (!newSyncStateSet.contains(syncStateInfo.getMasterBrokerId())) {        String err = String.format("Rejecting alter syncStateSet request because the newSyncStateSet don't contains origin leader {%s}", syncStateInfo.getMasterBrokerId());        ...    }
    // 更新epoch    int epoch = syncStateInfo.getSyncStateSetEpoch() + 1;    ...    // 生成事件,替换syncStateSet    final AlterSyncStateSetEvent event = new AlterSyncStateSetEvent(brokerName, newSyncStateSet);    ...}- 最后通过syncStateInfo.updateSyncStateSetInfo(),更新syncStateSetInfoTable.get(brokerName)得到的syncStateInfo信息(该过程可以理解为班主任在班级分组册上找到了组长的名字,拿出组员名单,更新)。
3.4 复制
该部分较复杂,其中HAService/HAClient/HAConnection以及其中的各种Service/Reader/Writer容易产生混淆,对阅读造成阻碍。因此绘制本图帮助理解(可在粗读源码后回头理解):

下面对HA复制过程作拆解,分别讲解:
- 在各个replica的DefaultMessageStore中均注册了HAService,负责管理HA的复制。
- 在Master的 HAService中有一个**AcceptSocketService**, 负责自动接收各个slave的连接:
protected abstract class AcceptSocketService extends ServiceThread {    ...
    /**     * Starts listening to slave connections.     *     * @throws Exception If fails.     */    public void beginAccept() throws Exception {        ...    }
    @Override    public void shutdown(final boolean interrupt) {        ...    }
    @Override    public void run() {        log.info(this.getServiceName() + " service started");
        while (!this.isStopped()) {            try {                this.selector.select(1000);                Set<SelectionKey> selected = this.selector.selectedKeys();
                if (selected != null) {                    for (SelectionKey k : selected) {                        if (k.isAcceptable()) {                            SocketChannel sc = ((ServerSocketChannel) k.channel()).accept();
                            if (sc != null) {                                DefaultHAService.log.info("HAService receive new connection, "                                                          + sc.socket().getRemoteSocketAddress());                                try {                                    HAConnection conn = createConnection(sc);                                    conn.start();                                    DefaultHAService.this.addConnection(conn);                                } catch (Exception e) {                                    log.error("new HAConnection exception", e);                                    sc.close();                                }                            }                        }                        ...    }}�
- 在各个Slave 的HAService中存在一个**HAClient**,负责向master发起连接、传输请求。
public class AutoSwitchHAClient extends ServiceThread implements HAClient {    ...}
public interface HAClient {    void start();
    void shutdown();
    void wakeup();
    void updateMasterAddress(String newAddress);
    void updateHaMasterAddress(String newAddress);
    String getMasterAddress();
    String getHaMasterAddress();
    long getLastReadTimestamp();
    long getLastWriteTimestamp();
    HAConnectionState getCurrentState();
    void changeCurrentState(HAConnectionState haConnectionState);
    void closeMaster();
    long getTransferredByteInSecond();}- 当master收到slave的连接请求后,将会创建一个**HAConnection**,负责收发内容。
public interface HAConnection {    void start();
    void shutdown();
    void close();
    SocketChannel getSocketChannel();
    HAConnectionState getCurrentState();
    String getClientAddress();
    long getTransferredByteInSecond();
    long getTransferFromWhere();
    long getSlaveAckOffset();}- Master的HAConnection会与Slave的HAClient建立连接,二者均通过HAWriter(较简单,不解读,位于HAWriter类)往socket中写内容,再通过HAReader读取socket中的内容。只不过一个是HAServerReader,一个是HAClientReader:
public abstract class AbstractHAReader {    private static final Logger LOGGER = LoggerFactory.getLogger(LoggerName.STORE_LOGGER_NAME);    protected final List<HAReadHook> readHookList = new ArrayList<>();
    public boolean read(SocketChannel socketChannel, ByteBuffer byteBufferRead) {        int readSizeZeroTimes = 0;        while (byteBufferRead.hasRemaining()) {            ...            boolean result = processReadResult(byteBufferRead);            ...        }
    }
  ...
    protected abstract boolean processReadResult(ByteBuffer byteBufferRead);}- 两种HAReader均实现了processReadResult()方法,负责处理从socket中得到的数据。client需要详细阐述该方法,因为涉及到如何将读进来的数据写入commitlog,client的processReadResult():
@Overrideprotected boolean processReadResult(ByteBuffer byteBufferRead) {int readSocketPos = byteBufferRead.position();try {    while (true) {        ...            switch (AutoSwitchHAClient.this.currentState) {                case HANDSHAKE: {                    ...                    // 握手阶段,先检查commitlog完整性,截断                }                    break;                case TRANSFER: {                    // 传输阶段,将body写入commitlog                    ...                    byte[] bodyData = new byte[bodySize];                    ...
                    if (bodySize > 0) {                        // 传输阶段,将body写入commitlog                        AutoSwitchHAClient.this.messageStore.appendToCommitLog(masterOffset, bodyData, 0, bodyData.length);                    }                    haService.updateConfirmOffset(Math.min(confirmOffset, messageStore.getMaxPhyOffset()));                    ...                    break;                }                default:                    break;            }            if (isComplete) {                continue;            }
        }      // 检查buffer中是否还有数据, 如果有, compact()        ...        break;    }}...}- server的processReadResult()主要用于接收client的握手等请求,较简单。更需要解释其WriteSocketService如何向socket中调用HAwriter去写数据:
abstract class AbstractWriteSocketService extends ServiceThread {    ...
    private void transferToSlave() throws Exception {        ...        int size = this.getNextTransferDataSize();        if (size > 0) {            ...            buildTransferHeaderBuffer(this.transferOffset, size);            this.lastWriteOver = this.transferData(size);        } else {            // 无需传输,直接更新caught up的时间            AutoSwitchHAConnection.this.haService.updateConnectionLastCaughtUpTime(AutoSwitchHAConnection.this.slaveId, System.currentTimeMillis());            haService.getWaitNotifyObject().allWaitForRunning(100);        }    }
    @Override    public void run() {        AutoSwitchHAConnection.LOGGER.info(this.getServiceName() + " service started");
        while (!this.isStopped()) {            try {                this.selector.select(1000);
                switch (currentState) {                    case HANDSHAKE:                        // Wait until the slave send it handshake msg to master.                        // 等待slave的握手请求,并进行回复                        break;                    case TRANSFER:                        ...                        transferToSlave();                        break;                    default: ...                }            } catch (Exception e) {                ...            }        }    ...        // 在service结束后的一些事情    }    ...}此处同样附上server实现processReadResult(),读socket中数据的代码:
@Overrideprotected boolean processReadResult(ByteBuffer byteBufferRead) {while (true) {    ...    HAConnectionState slaveState = HAConnectionState.values()[byteBufferRead.getInt(readPosition)];    switch (slaveState) {        case HANDSHAKE:            // 收到了client的握手            ...            LOGGER.info("Receive slave handshake, slaveBrokerId:{}, isSyncFromLastFile:{}, isAsyncLearner:{}",                        AutoSwitchHAConnection.this.slaveId, AutoSwitchHAConnection.this.isSyncFromLastFile, AutoSwitchHAConnection.this.isAsyncLearner);            break;        case TRANSFER:            // 收到了client的transfer状态            ...            // 更新client状态信息            break;        default:            ...    }...}3.5 Active Controller的选举
该选举主要通过DLedger实现,在DLedgerController中通过RoleChangeHandler.handle()更新自身身份:
class RoleChangeHandler implements DLedgerLeaderElector.RoleChangeHandler {
    private final String selfId;    private final ExecutorService executorService = Executors.newSingleThreadExecutor(new ThreadFactoryImpl("DLedgerControllerRoleChangeHandler_"));    private volatile MemberState.Role currentRole = MemberState.Role.FOLLOWER;
    public RoleChangeHandler(final String selfId) {        this.selfId = selfId;    }
    @Override    public void handle(long term, MemberState.Role role) {        Runnable runnable = () -> {            switch (role) {                case CANDIDATE:                    this.currentRole = MemberState.Role.CANDIDATE;                    // 停止扫描inactive broker任务                    ...                case FOLLOWER:                    this.currentRole = MemberState.Role.FOLLOWER;                    // 停止扫描inactive broker任务                    ...                case LEADER: {                    log.info("Controller {} change role to leader, try process a initial proposal", this.selfId);                    int tryTimes = 0;                    while (true) {                        // 将会开始扫描inactive brokers                        ...                    break;                }            }        };        this.executorService.submit(runnable);    }  ...}