Friday, November 27, 2009

Node failure handling - take 2

Jonas improved the node failure handling in MySQL Cluster 7.0.9 (and 6.3.29) so here comes a re-run of the a previous blog post. And the node failure handling time has improved a lot - see below.

I created in total 11345 tables, each with 128 columns, and then hit this bug.

When all tables were created, I stopped one node and measured how long time it takes for the other node to perform the node failure handling. Here is what was written into the cluster log (look at the bold lines):

2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 4: Node 3 Disconnected

2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 4: Network partitioning - arbitration required

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: President restarts arbitration thread [state=7]

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: Communication to Node 3 closed

2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected

2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 4: Arbitration won - positive reply from node 1

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: GCP Take over started

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: Node 4 taking over as DICT master

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: GCP Take over completed

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: kk: 9670/4 0 0

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: LCP Take over started

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: ParticipatingDIH = 0000000000000010

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: ParticipatingLQH = 0000000000000010

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=1 0000000000000010]

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: LCP Take over completed (state = 5)

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: ParticipatingDIH = 0000000000000010

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: ParticipatingLQH = 0000000000000010

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=1 0000000000000010]

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=1 0000000000000010]

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=1 0000000000000010]

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0

2009-11-27 13:39:21 [MgmtSrvr] INFO -- Node 3: Node shutdown completed.

2009-11-27 13:39:21 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected

2009-11-27 13:39:22 [MgmtSrvr] INFO -- Node 4: Started arbitrator node 1 [ticket=2a74000295047d39]

2009-11-27 13:40:20 [MgmtSrvr] WARNING -- Node 4: Failure handling of node 3 has not completed in 1 min. - state = 6

2009-11-27 13:41:03 [MgmtSrvr] INFO -- Node 4: Communication to Node 3 opened

Now, the node failure handling was completed in 1 minute 42 seconds! This is quite some improvement from the 17 minutes (although measured with 16000 tables and I will comeback when the bug mentioned above is fixed) measured before Jonas made his magic.

What does it mean? It means that we can start to recover a failed data nodes much earlier now!

No comments: