Friday, April 27, 2012

MySQL Cluster: Disk Data Config



If you are using the Disk Data feature of MySQL Cluster you may find this useful. As you may know, the default way of storing data in MySQL Cluster is in-memory. However, if you have tables that don't require real-time access and/or are growing larger than your available memory, then you can use disk data tables. I have written a bunch of posts about disk data tables before.

However, it is quite hard to configure Disk Data tables correctly. There are some parameters to keep an eye on:
  • UNDO LOG - MySQL Cluster undo logs every write access to the disk data tables. You can grow this resource online (ALTER LOGFILE GROUP ... ADD UNDOFILE ...).
  • REDO LOG - as for in-memory tables, every write is REDO logged. You can grow the REDO log with a rolling restart (NoOfFragmentLogFiles). The REDO LOG is shared between the disk data tables and the in-memory tables.
  • REDO BUFFER - specified as RedoBuffer=32M or higher in config.ini. The REDO BUFFER is shared between the disk data tables and the in-memory tables.
  • UNDO BUFFER - created from the SharedGlobalMemory resource and set at LOGFILE GROUP creation (see the SQL sketch after this list). WARNING!! It cannot be changed at runtime! The LOGFILE GROUP must be recreated, and that requires dropping the tablespace (and all its data files) first.
  • TABLESPACE - can be extended at runtime (ALTER TABLESPACE ... ADD DATAFILE ...).
  • DiskPageBufferMemory - like the innodb_buffer_pool. The bigger the better. Can be tuned with a rolling restart.
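
To make this concrete, here is a minimal SQL sketch of how these pieces fit together. The names (lg_1, ts_1, the file names) and sizes are placeholders, not recommendations:

 CREATE LOGFILE GROUP lg_1
   ADD UNDOFILE 'undo_1.dat'
   INITIAL_SIZE = 1024M
   UNDO_BUFFER_SIZE = 32M    -- fixed at creation time, cannot be changed later
   ENGINE = NDBCLUSTER;

 CREATE TABLESPACE ts_1
   ADD DATAFILE 'data_1.dat'
   USE LOGFILE GROUP lg_1
   INITIAL_SIZE = 1024M
   ENGINE = NDBCLUSTER;

 -- Both the UNDO LOG and the TABLESPACE can be grown online:
 ALTER LOGFILE GROUP lg_1 ADD UNDOFILE 'undo_2.dat' INITIAL_SIZE = 1024M ENGINE = NDBCLUSTER;
 ALTER TABLESPACE ts_1 ADD DATAFILE 'data_2.dat' INITIAL_SIZE = 1024M ENGINE = NDBCLUSTER;

 -- A table is then placed on disk like this (non-indexed columns go to the tablespace):
 CREATE TABLE t1 (a INT PRIMARY KEY, b VARCHAR(255))
   TABLESPACE ts_1 STORAGE DISK ENGINE = NDBCLUSTER;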

For stability, the most common mistakes are setting the UNDO LOG and UNDO BUFFER too small.
What can happen then?

Stability and Recovery

If the UNDO BUFFER is too small, the symptoms are:
  • Failed writes to the Disk Data tables. 
  • In worst case you can hit crashing bugs.

If the UNDO LOG is too small, the symptoms are (a query for monitoring UNDO usage follows this list):
  • Failed writes to the Disk Data tables. In the worst case you can hit crashing bugs.
  • Node recovery may not be possible - the recovering node will crash.
    Severalnines has assisted a number of clients with recoveries when this has been the case.
  • If there is a Cluster crash due to a too small UNDO LOG, you will most likely have to restore from backup.
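
To catch an UNDO LOG or UNDO BUFFER that is filling up before it causes failed writes or a crash, you can watch the usage counters in the ndbinfo database (available in MySQL Cluster 7.1 and later). A sketch:

 SELECT node_id, log_type, total, used
 FROM ndbinfo.logspaces
 WHERE log_type = 'DD-UNDO';

The same query against ndbinfo.logbuffers (also with log_type = 'DD-UNDO') shows the UNDO BUFFER usage. If used creeps close to total, it is time to add undo files or plan for a bigger buffer.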

How can MySQL Cluster fail to recover a node when it was possible to load that node with data in the first place?

The problem is that the write pattern when writing data into MySQL Cluster at runtime is different from the write pattern at recovery.
At recovery, the recovering data node will copy data (over the network) from the other node in the node group. It will do this as fast as it can, table by table.

Thus, a lot of data comes into the recovering data node (like one big bulk write), and in this case the UNDO LOG may be too small to handle the write load (you can argue about why UNDO and REDO logging is done during recovery at all, but that is how it works). If the UNDO LOG is overrun, the node will crash.

Configuration

As usual this depends a lot on the access patterns you have, so it is hard to give exact numbers, but if you rely on disk data tables you should have the following (a sample config.ini sketch follows this list):
  • [NDBD DEFAULT] SharedGlobalMemory=1024M
    Disk operation records and other metadata are allocated from this buffer. The more records you have in the disk data tables, the bigger this buffer must be. 1024M is a good start.
  • [NDBD DEFAULT] DiskPageBufferMemory=4096M
    As large as you can afford (remember that in-memory tables, and the indexed attributes of disk data tables, reside in DataMemory, so you need to have enough DataMemory too).
    The important thing when tuning DiskPageBufferMemory is to get a good hit ratio. If you use ClusterControl you will see the hit ratio (picture coming soon).
    4096M is a good start.
  • UNDO_BUFFER_SIZE=32M or 64M (the same size as you have for the RedoBuffer).
    This is specified in the CREATE LOGFILE GROUP statement.
  • UNDO LOG space should be 0.5 x NoOfFragmentLogFiles x 4 x FragmentLogFileSize, i.e., half the total REDO log size (this may take it to the extreme, but better safe than sorry).
  • [NDBD DEFAULT] DiskIoThreadPool=8 (2 is the default)
  • UndoIndexBuffer/UndoDataBuffer - these parameters are deprecated but still in the manual (I have grepped the 7.2.4 source and found no trace of them). It was a long time ago that they were used, back when LCPs were done differently in Cluster. So there is no need to set these.
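
Putting the numbers together: with, for example, NoOfFragmentLogFiles=16 and FragmentLogFileSize=256M (hypothetical values, not recommendations), the REDO log is 16 x 4 x 256M = 16GB, so the UNDO LOG space by the rule above would be 8GB. A minimal config.ini sketch collecting the settings from this list could look like this:

 [NDBD DEFAULT]
 SharedGlobalMemory=1024M
 DiskPageBufferMemory=4096M
 DiskIoThreadPool=8
 RedoBuffer=32M
 # example REDO sizing - adjust to your write load:
 NoOfFragmentLogFiles=16
 FragmentLogFileSize=256M

For the DiskPageBufferMemory hit ratio, one way to approximate it (assuming a version where the ndbinfo.diskpagebuffer table is available) is from the page request counters:

 SELECT node_id,
        100 * page_requests_direct_return /
          (page_requests_direct_return
           + page_requests_wait_queue
           + page_requests_wait_io) AS hit_ratio_pct
 FROM ndbinfo.diskpagebuffer;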

Please note that you can always add more UNDO LOG files online by issuing:
 ALTER LOGFILE GROUP lg_1 ADD UNDOFILE 'undo_X.dat' INITIAL_SIZE=1024M ENGINE=NDBCLUSTER;
(where lg_1 is the name of your logfile group).
However, if the Cluster crashes, then it is too late.

Disk Layout
  • Put the REDO log + LCP on separate disk spindles (FileSystemPath) (A)
  • Put the UNDO log on separate disk spindles (FileSystemPathUndoFiles) (B)
  • Put the data files (tablespace) on separate disk spindles (FileSystemPathDataFiles) (C)

In many cases you can combine B and C on the same disks, and if you have a good disk array you can of course combine A, B and C. A config.ini sketch of the corresponding paths follows.
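
As a sketch, with hypothetical mount points /data0, /data1 and /data2 for the three spindle sets:

 [NDBD DEFAULT]
 FileSystemPath=/data0/mysqlcluster            # REDO log + LCP (A)
 FileSystemPathUndoFiles=/data1/mysqlcluster   # UNDO log (B)
 FileSystemPathDataFiles=/data2/mysqlcluster   # tablespace data files (C)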

Bottom line is: watch your disks and how utilized they are. That will determine whether you have to make disk reconfigurations. Also, be prepared for some trial and error before you get DiskPageBufferMemory and SharedGlobalMemory right.

5 comments:

Parry said...

Hi Johan,

I am pretty new to MySQL Cluster. Our company is planning to use MySQL cluster for its new product.

I have gone through the backup and recovery options but could not find "crash recovery" anywhere, i.e., recovering the cluster db from a non-consistent state.

I also read that binary logs are not transaction logs and can't be used for crash recovery.

Is crash recovery possible in MySQL cluster and if it is possible, can you pls suggest how to do it or let me know a resource where I can study it?

I'll appreciate your reply. Have a great day..!!

Johan Andersson said...

Hi Parry,

Crash recovery is automatic.
Basically tables are checkpointed to disk (or table space for disk data tables), and there is a redo log written too.

Those components are used during both node recovery (one or more nodes down, but the cluster is still up) and system recovery (all nodes are dead, cluster dead), as well as for automatically copying changes from a node with more data to the recovering node.

Then you can use binary logs to do point-in-time recovery. Since the data nodes sync the redo log to disk every TimeBetweenGlobalCheckpoints, there is a chance of losing that amount of data IF ALL nodes in the cluster crash.

But during normal operation if you have a node failure, you will not lose any data.

Hope this helps somewhat,
Johan

Jay Ward said...

Hey Johan,

I ran into a weird problem in 7.2.7 that I was curious if you had seen. For testing, I installed on two Intel E5-1650s, and started it up. With only 12 cores, I gave it 4 LQHs, 1 send, 1 recv, 1 rep, 2 TCs, and had the main and IO running together; leaving two cores for the OS. When I tried to do a backup, after writing the schemas to the CTL file, the first LQH thread tried to send a GSN_SCAN_FRAGREQ (353) signal to the second LQH thread. Since they aren't supposed to communicate, no JB exists and so the data nodes receive a SIGSEGV from the OS.

Did I miss some critical bit of info in implementing ThreadConfig?

Johan Andersson said...

Hi,
I have never seen or heard about this. From what I can tell you have not missed anything; I think you are just unlucky and have hit a bug.

Have you opened a bug report?

-J

Burak said...

Hi,

I have 2 hard drives on each data node (1TB HDD and 480GB SSD). How should I combine (A), (B) and (C)? What data should be stored on the SSD so that I use it effectively? And on which drive should I install the system and the cluster?
'.dat' => 225GB
'.log' => 84GB
'LCP and D1 ..' => 182GB
