Friday, November 21, 2008

Disk data counters

I have measured the time it takes to do some disk-related operations with Cluster (6.3.19, which is about to be released very soon - you should upgrade), such as starting it, creating tablespaces, etc.

You can find the setup I used below the results.

Initial start of Cluster (40GB of redo log files):
6.3.19: 3min 27sec
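(The 40GB of redo log comes from the settings in the setup below: NoOfFragmentLogFiles x 4 x FragmentLogFileSize = 40 x 4 x 256MB = 40GB per data node, since each fragment log file is a set of four files.)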

Create a 20GB undo file:
6.3.19: 6min 17sec

Create a 128MB data file for the tablespace:
6.3.19: ~3 sec
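
For reference, the DDL for the two steps above looks roughly like this; the logfile group name lg_1 and the file names are made up, ts_1 is the tablespace used further down, and the sizes match the setup below (additional 128MB data files are added with ALTER TABLESPACE):

  -- logfile group with a 20GB undo file and a 128MB undo buffer
  CREATE LOGFILE GROUP lg_1
    ADD UNDOFILE 'undo_1.dat'
    INITIAL_SIZE 20G
    UNDO_BUFFER_SIZE 128M
    ENGINE NDB;

  -- the tablespace and its first 128MB data file
  CREATE TABLESPACE ts_1
    ADD DATAFILE 'data_1.dat'
    USE LOGFILE GROUP lg_1
    INITIAL_SIZE 128M
    ENGINE NDB;

  -- the remaining 128MB data files are added one by one
  ALTER TABLESPACE ts_1
    ADD DATAFILE 'data_2.dat'
    INITIAL_SIZE 128M
    ENGINE NDB;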

Insert 1M records of 4096B each (1 thread, batches of five):
6.3.19: 286 sec (3721.12 QPS)
(we can probably provision faster with bigger batches or more threads)
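
If you want to try something similar from plain SQL, multi-row INSERTs are an easy way to get batching; a sketch of one batch of five (the id/ts values here are made up):

  -- one statement/round trip inserts a batch of five 4096B records
  INSERT INTO dd (id, ts, data) VALUES
    (1, 1, REPEAT('x', 4096)),
    (2, 1, REPEAT('x', 4096)),
    (3, 1, REPEAT('x', 4096)),
    (4, 1, REPEAT('x', 4096)),
    (5, 1, REPEAT('x', 4096));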

I then provisioned another 4M records (total of 5M records in DB).

Evil test: 100K random reads (read 4096B each) (5M records in DB):
6.3.19: 1290.42 QPS (20 threads, io util is ~90%, so we are almost completely io bound). This result is in line with what we would expect when being io bound, especially since I have used four data nodes, each having one disk.
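
Each read fetches the 4096B data column for one record, presumably via a primary key lookup like this (4711 stands in for a random id between 1 and 5M, picked on the client side):

  SELECT data FROM dd WHERE id = 4711;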

Setup:
  • Total of 4 data nodes on 4 computers.
  • Another computer was used to generate the inserts/reads
  • 8GB of RAM
  • Gig-E
  • 1 * 146GB SAS, 10KRPM
  • 128MB IndexMemory
  • 1024MB DataMemory
  • ODirect=1
  • SharedGlobalMemory=384M
  • NoOfFragmentLogFiles=40
  • FragmentLogFileSize=256M
  • DiskPageBufferMemory=3072M
  • Table space (one ts) with 100 data files of 128MB each (best practice is to use many small data files instead of one big one. This will be changed in 6.4 so that you can use one big data file).
    The point here is that you should have a few data files: one data file is bad, and more than 128 is overkill since the data node won't keep more than that many data files open at once anyway. This affects how many data files the data node can write to in "parallel".
  • Extent size=1MB (which is quite ok)
  • Logfile group: one 20GB undo file and a 128MB undo buffer
    The undo file was a bit too big (not that it matters, I had the disk space): I used 5366054928 out of 21474836480 extents, so only ~25% was used (a query for checking this is sketched after this list).
  • There is also a new configuration option in 6.3.19 that lets you create the data files.
  • The disk data table looks like (data column will be stored on disk):
    create table dd (
    id integer primary key,
    ts integer,
    data varbinary(4096),
    index(ts)) engine=ndb TABLESPACE ts_1 storage disk;
  • Fedora core 9 ( uname -r --> 2.6.26.6-79.fc9.x86_64 )
(the config files were generated by severalnines.com/config )
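
To see how full the data files are, you can query INFORMATION_SCHEMA.FILES; a sketch, assuming the tablespace is named ts_1 as in the table definition above (the undo log files show up in the same table with FILE_TYPE = 'UNDO LOG'):

  -- free vs. total extents per data file in the tablespace
  SELECT FILE_NAME, FREE_EXTENTS, TOTAL_EXTENTS, EXTENT_SIZE
    FROM INFORMATION_SCHEMA.FILES
   WHERE TABLESPACE_NAME = 'ts_1' AND FILE_TYPE = 'DATAFILE';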

Now, especially for inserts/writes there are quite a few things competing for the single disk:
  • REDO log
  • LCP
  • and the disk data itself (UNDO LOG + DATA FILES)
However, if you have two disks, the best setup is:
  • Disk 1: REDO + LCP
  • Disk 2: UNDO LOG + DATA FILES
Three disks:
  • Disk 1: REDO + LCP
  • Disk 2: UNDO LOG
  • Disk 3: DATA FILES
If you have more than that then you can put DATA FILES and UNDO LOG on a RAID.

IMPORTANT: When you have done an --initial start, the files for the UNDO LOG and the DATA FILES are NOT removed. You have to remove them by hand. Otherwise you will get an error if you try to CREATE LOGFILE GROUP.
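
As an alternative to removing them by hand, a sketch (assuming the example names ts_1/lg_1 from above, and that the disk data tables have already been dropped): drop the disk data objects from SQL before the --initial restart, which also removes the files on disk.

  -- drop each data file (repeat for every data file in the tablespace)
  ALTER TABLESPACE ts_1
    DROP DATAFILE 'data_1.dat'
    ENGINE NDB;

  -- then drop the (now empty) tablespace and the logfile group
  DROP TABLESPACE ts_1 ENGINE NDB;
  DROP LOGFILE GROUP lg_1 ENGINE NDB;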

5 comments:

Mikiya Okuno said...

Hi!

To split I/O load amongst several disks, I wonder whether the following configuration is good or not:

o 3 disks total.
o 1 disk for LCP/GCP
o 2 disks for data files and undo log files, where each disk has 1 data file and 1 undo file.

Should I use separate disks for data files and undo files? Is the following configuration better than the above?

o 3 disks total.
o 1 disk for LCP/GCP.
o 1 disk for data files.
o 1 disk for undo files.

I guess the former can split I/O load better.

Johan Andersson said...

I have changed the text so that the English is more clear on this.
The last option you have is the best.

Michael Senizaiz said...

I see you are using 128M tablespace files. Is it recommended to not use files larger than this? We tried adding our third 2G tablespace, and one of the nodes crashed and restarted a few times, then both crashed and we had to revert to our backup.

What is the best practice for sizing the files? We want to make it about 80G total.

The trace file shows the same block IDs being read back and forth between DBTC and DBDIH for about 10,000 lines before a startphase 4 crash.

Johan Andersson said...

Hi trellph,
I have updated the post with

"The point here is that you should have a few data files: one data file is bad, and more than 128 is overkill since the data node won't keep more than that many data files open at once anyway. This affects how many data files the data node can write to in "parallel"."

What you describe sounds like a bug.
Can you please file a bug report and put an excerpt of the trace lines in there? That would be great. Which version are you on?

Michael Senizaiz said...

Bug #40993 has been updated!