Wednesday, September 16, 2009

Node failure handling

When a data node fails, the other data nodes must perform clean-up of the failed node.

For example, they must abort transactions in the PREPARE phase (Cluster uses a variant of the two-phase commit protocol) that were started on the failed data node.
If the failed data node was the master, various internal protocols must also fail over to a new master.

These things are normally carried out quite swiftly (configurable, but a couple of seconds by default; it also depends on how many transactions in the prepare phase need to be aborted).
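The prepared-transaction clean-up can be sketched roughly like this. This is a minimal Python sketch of the idea, not NDB's actual implementation; all names here are invented for illustration:

```python
# Sketch: when a participating node fails, every transaction still in
# the PREPARE phase that involved it must be aborted. Committed
# transactions are left alone. (Illustrative only, not NDB internals.)

PREPARED, COMMITTED, ABORTED = "prepared", "committed", "aborted"

class Transaction:
    def __init__(self, txid, participants):
        self.txid = txid
        self.participants = set(participants)  # data node ids involved
        self.state = PREPARED

def handle_node_failure(transactions, failed_node):
    """Abort every PREPARE-phase transaction that involved failed_node."""
    aborted = []
    for tx in transactions:
        if tx.state == PREPARED and failed_node in tx.participants:
            tx.state = ABORTED
            aborted.append(tx.txid)
    return aborted

txs = [Transaction(1, [3, 4]), Transaction(2, [4]), Transaction(3, [3, 4])]
txs[2].state = COMMITTED
print(handle_node_failure(txs, failed_node=3))  # -> [1]
```

The surviving node only has to touch transactions the failed node participated in; everything else keeps running, which is why the cluster stays online during failure handling.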

I wanted to check how long node failure handling takes with many tables (16000+) in the cluster, because in a "normal" database with a couple of hundred tables, the time spent on node failure handling is not a factor.

In this case I have two data nodes:
  • Node 3 (master)
  • Node 4
I have created 16000+ tables each with 128 columns just to see what happens with failure handling and recovery.
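A test setup like this could be generated with a small script along the following lines. This is a hypothetical sketch: the table and column names (`t00001`, `c000`, ...) and the use of plain `INT` columns are my own assumptions, not the actual DDL used in the test:

```python
# Generate DDL for N tables, each with 128 columns, on the ndbcluster
# storage engine. Names and column types are invented for illustration.

def create_table_sql(i, ncols=128):
    cols = ", ".join(f"c{j:03d} INT" for j in range(ncols))
    return f"CREATE TABLE t{i:05d} ({cols}) ENGINE=ndbcluster;"

def generate_ddl(ntables=16000, ncols=128):
    for i in range(ntables):
        yield create_table_sql(i, ncols)

# Example: dump the statements and pipe the file to the mysql client.
# for stmt in generate_ddl():
#     print(stmt)
```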

Node 3 was killed, and Node 4 then has to do the node failure handling:
2009-09-16 14:23:41 [MgmSrvr] WARNING  -- Node 4: Failure handling of node 3 has not completed in 15 min. - state = 6

2009-09-16 14:24:41 [MgmSrvr] WARNING -- Node 4: Failure handling of node 3 has not completed in 16 min. - state = 6

2009-09-16 14:25:41 [MgmSrvr] WARNING -- Node 4: Failure handling of node 3 has not completed in 17 min. - state = 6

Eventually, we get the message:
2009-09-16 14:26:13 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 opened
We cannot start recovering Node 3 until this message has been written to the cluster log!

If you try to connect the data node before this, the management server will say something like "No free slot found for node 3", which in this case basically means that node failure handling is still ongoing.
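Rather than retrying the data node blindly, you can wait for the "Communication ... opened" line to appear in the cluster log before restarting. A rough sketch, assuming the management server writes its cluster log to a plain text file (the path and polling interval are placeholders you would adapt to your setup):

```python
# Poll the cluster log until it reports that communication to the
# failed node has been reopened, then it is safe to restart the node.
# The log path is a hypothetical example, not a fixed NDB location.

import re
import time

def wait_for_reopen(logpath, node_id, poll_seconds=10):
    """Block until 'Communication to Node <node_id> opened' appears."""
    pattern = re.compile(rf"Communication to Node {node_id} opened")
    while True:
        with open(logpath) as f:
            if any(pattern.search(line) for line in f):
                return
        time.sleep(poll_seconds)

# wait_for_reopen("/var/lib/mysql-cluster/ndb_1_cluster.log", 3)
# ... then restart data node 3
```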

Now that we have the message, we can start data node 3 again and it will recover.

Why does it take so long?
  • The surviving data nodes that handle the failure must record, for each table, that node 3 has crashed, and this record is written to disk. This is what takes time when you have many tables: about 17 minutes with >16000 tables, i.e., roughly 1000 tables per minute. So in most databases it does not take that much time. Thanks Jonas for your explanation.
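The observed rate gives a simple back-of-the-envelope estimate (the ~1000 tables/minute figure is just the rate measured above, not a guaranteed constant):

```python
# Estimate node failure handling time from the observed per-table
# write rate of roughly 1000 tables per minute.

def failure_handling_minutes(ntables, tables_per_minute=1000):
    return ntables / tables_per_minute

print(failure_handling_minutes(16000))  # -> 16.0, close to the ~17 min observed
print(failure_handling_minutes(300))    # -> 0.3, negligible for a "normal" database
```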
However, it is important for you to understand that during this time it is STILL POSSIBLE to WRITE / READ to the Cluster, i.e., the database is online.
