16.3.4.5 Defining MySQL Cluster Storage Nodes
.............................................
The `[DB]' section (or its alias `[NDBD]') is used to configure the
behavior of the storage nodes. Many parameters are available to control
buffer sizes, pool sizes, timeouts, and so forth. The only mandatory
parameters are either `ExecuteOnComputer' or `HostName', and
`NoOfReplicas', which must be defined in the `[DB DEFAULT]' section.
Most parameters should be set in the `[DB DEFAULT]' section. Only
parameters explicitly stated as able to take local values may be
changed in the `[DB]' section. `HostName', `Id', and
`ExecuteOnComputer' must be defined in the local `[DB]' section.
The `Id' value (that is, the identification of the storage node) can be
allocated when the node is started. It is also possible to assign a
node ID explicitly in the configuration file.
For each parameter it is possible to use k, M, or G as a suffix to
indicate units of 1024, 1024*1024, or 1024*1024*1024. For example, 100k
means 102400. Parameters and values are case sensitive.
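As a rough illustration of the suffix rule above, the following Python
sketch converts a configuration value to a plain number. The function
name `parse_size' is hypothetical and is not part of MySQL Cluster.
Rejecting any other suffix (such as an uppercase `K') is an assumption
based on the statement that values are case sensitive.

```python
# Multipliers for the k, M, and G suffixes described above.
SUFFIXES = {"k": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(value: str) -> int:
    """Convert a value such as '100k' or '80M' to a plain integer."""
    value = value.strip()
    if value and value[-1] in SUFFIXES:
        return int(value[:-1]) * SUFFIXES[value[-1]]
    # No recognized suffix: the value must be a plain integer.
    # Assumption: anything else (for example '100K') is rejected by int().
    return int(value)


print(parse_size("100k"))  # 102400
```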
`[DB]Id'
This identity is the node ID used as the address of the node in all
cluster internal messages. This is an integer between 1 and 63.
Each node in the cluster has a unique identity.
`[DB]ExecuteOnComputer'
This refers to one of the computers defined in the computer
section.
`[DB]HostName'
This parameter is similar to specifying a computer to execute on.
It defines the host name of the computer the storage node is to
reside on. Either this parameter or `ExecuteOnComputer' is
required.
`[DB]ServerPort'
Each node in the cluster uses one port to which other nodes connect
when setting up transporters between them. This port is also used
for non-TCP transporters during the connection setup phase. The
default port is calculated to ensure that no two nodes on the
same computer receive the same port number.
`[DB]NoOfReplicas'
This parameter can be set only in the `[DB DEFAULT]' section
because it is a global parameter. It defines the number of
replicas for each table stored in the cluster. This parameter also
specifies the size of node groups. A node group is a set of nodes
that all store the same information.
Node groups are formed implicitly. The first node group is formed
by the storage nodes with the lowest node identities, the next
by the next lowest node identities, and so on. As an example, assume
we have 4 storage nodes and `NoOfReplicas' is set to 2. The four
storage nodes have node IDs 2, 3, 4 and 5. Then the first node group
is formed by node 2 and node 3, and the second node group by node 4
and node 5. It is important to configure the cluster so that nodes
in the same node group are not placed on the same computer.
Otherwise a single hardware failure would cause a cluster crash.
If no node identities are provided then the order of the storage
nodes will be the determining factor for the node group. The
actual node group assigned will be printed by the `SHOW' command
in the management client.
There is no default value and the maximum number is 4.
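The grouping rule described above can be sketched in Python as
follows. This is only an illustration, not the cluster's actual
algorithm. The function names and the host names are hypothetical, and
the requirement that the number of storage nodes be a multiple of
`NoOfReplicas' is an assumption made to keep the sketch simple.

```python
def form_node_groups(node_ids, no_of_replicas):
    """Group storage node IDs in ascending order, NoOfReplicas per group."""
    ordered = sorted(node_ids)
    if len(ordered) % no_of_replicas != 0:
        # Assumption of this sketch: only complete node groups are allowed.
        raise ValueError("node count must be a multiple of NoOfReplicas")
    return [ordered[i:i + no_of_replicas]
            for i in range(0, len(ordered), no_of_replicas)]


def groups_sharing_a_host(groups, host_of):
    """Return the node groups in which two or more nodes run on one host."""
    return [g for g in groups if len({host_of[n] for n in g}) < len(g)]


groups = form_node_groups([2, 3, 4, 5], 2)
print(groups)  # [[2, 3], [4, 5]]

# Hypothetical placement: nodes 2 and 3 share a computer, which is unsafe.
hosts = {2: "host-a", 3: "host-a", 4: "host-b", 5: "host-c"}
print(groups_sharing_a_host(groups, hosts))  # [[2, 3]]
```

A placement check of this kind can be run against a planned
configuration before the cluster is started.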
`[DB]DataDir'
This parameter specifies the directory where trace files, log
files, pid files and error logs are placed.
`[DB]FileSystemPath'
This parameter specifies the directory where all files created for
metadata, REDO logs, UNDO logs and data files are placed. The
default value is to use the same directory as the `DataDir'. The
directory must be created before starting the `ndbd' process.
If you use the recommended directory hierarchy, you will use the
directory `/var/lib/mysql-cluster'. Under this directory a
directory `ndb_2_fs' is created (if the node ID is 2), which
is the file system for that node.
`[DB]BackupDataDir'
It is also possible to specify the directory where backups will be
placed. By default, the directory `FileSystemPath/'`BACKUP' will
be chosen.
`DataMemory' and `IndexMemory' are the parameters that specify the size
of memory segments used to store the actual records and their indexes.
To set these parameters properly, it is important to understand how
`DataMemory' and `IndexMemory' are used. For most uses, they need to
be adjusted to reflect the usage of the cluster.
`[DB]DataMemory'
This parameter is one of the most important parameters because it
defines the space available to store the actual records in the
database. The entire `DataMemory' will be allocated in memory so
it is important that the machine contains enough memory to handle
the `DataMemory' size.
`DataMemory' is used to store two things. First, it stores the
actual records. Each record is currently of fixed size, so `VARCHAR'
columns are stored as fixed-size columns. There is normally an
overhead of 16 bytes on each record. Additionally, each record is
stored in a 32KB page with a 128-byte page overhead. There is
also a small amount of waste for each page because each record is
stored in only one page. The maximum record size for the columns
is currently 8052 bytes.
Second, `DataMemory' is used to store ordered indexes. Ordered
indexes use about 10 bytes per record. Each record in the table
is always represented in the ordered index.
`DataMemory' consists of 32KB pages. These pages are allocated
to partitions of the tables. Each table is normally partitioned
into as many partitions as there are storage nodes in
the cluster. Thus each node holds as many partitions (fragments)
of a table as `NoOfReplicas' is set to. Once a
page has been allocated to a partition, it is currently not
possible to return it to the pool of free pages. The way to
restore pages to the pool is to delete the table. Performing a
node recovery also compresses the partition because all records
are inserted into an empty partition from another live node.
Another important aspect is that `DataMemory' also contains
UNDO information for records. For each update of a record, a copy
record is allocated in `DataMemory'. Each copy record
also has an instance in the ordered indexes of the table.
Unique hash indexes are updated only when the unique index columns
are updated. In that case a new entry is inserted in the index
table, and the old entry is deleted at commit. Thus it is
also necessary to allocate enough memory to handle the largest
transactions that are performed in the cluster.
Performing large transactions has no advantage in MySQL Cluster
other than the consistency that transactions provide, which is the
whole idea of transactions. A large transaction is not faster and
consumes large amounts of memory.
The default `DataMemory' size is 80MB. The minimum size is 1MB.
There is no maximum size, but in practice the size must be
chosen so that the process does not start swapping when the
configured memory is fully used.
`[DB]IndexMemory'
The `IndexMemory' is the parameter that controls the amount of
storage used for hash indexes in MySQL Cluster. Hash indexes are
always used for primary key indexes, unique indexes, and unique
constraints. In fact, when a primary key or a unique index is
defined, two indexes are created in MySQL Cluster. One of these
is a hash index, which is used for all tuple accesses and
also for lock handling. It is also used to enforce unique
constraints.
The size of the hash index is 25 bytes plus the size of the
primary key. For primary keys larger than 32 bytes another 8
bytes are added for some internal references.
Thus for a table defined as
     CREATE TABLE example
     (
       a INT NOT NULL,
       b INT NOT NULL,
       c INT NOT NULL,
       PRIMARY KEY(a),
       UNIQUE(b)
     ) ENGINE=NDBCLUSTER;
We will have 12 bytes overhead (having no nullable columns saves 4
bytes of overhead) plus 12 bytes of data per record. In addition
we will have two ordered indexes on a and b consuming about 10
bytes each per record. We will also have a primary key hash index
in the base table with roughly 29 bytes per record. The unique
constraint is implemented by a separate table with b as primary
key and a as a column. This table will consume another 29 bytes of
index memory per record in the table and also 12 bytes of overhead
plus 8 bytes of data in the record part.
Thus for one million records, we will need 58MB of index memory to
handle the hash indexes for the primary key and the unique
constraint. For the `DataMemory' part we will need 64MB of memory
to handle the records of the base table and the unique index table
plus the two ordered index tables.
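The arithmetic above can be reproduced with a short Python sketch. It
uses only the per-record figures given in the text for the `example'
table and ignores page overhead and per-page waste, so the result is
an approximation. The function name is hypothetical.

```python
def example_table_memory(records):
    """Return (data_bytes, index_bytes) for the `example' table above."""
    base_row = 12 + 12          # record overhead + three INT columns
    ordered_indexes = 2 * 10    # ordered indexes on a and b
    unique_table_row = 12 + 8   # overhead + data in the unique index table
    hash_indexes = 29 + 29      # primary key hash + unique index table hash

    data_per_record = base_row + ordered_indexes + unique_table_row
    return records * data_per_record, records * hash_indexes


data_bytes, index_bytes = example_table_memory(1_000_000)
print(data_bytes, index_bytes)  # 64000000 58000000
```

The two results correspond to the 64MB of `DataMemory' and 58MB of
`IndexMemory' quoted above for one million records.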
The conclusion is that hash indexes take up a fair amount of
memory space but in return they provide very fast access to the
data. They are also used in MySQL Cluster to handle uniqueness
constraints.
Currently the only partitioning algorithm is hashing, and the
ordered indexes are local to each node and thus cannot be used to
handle uniqueness constraints in the general case.
An important point for both `IndexMemory' and `DataMemory' is that
the total database size is the sum of the `DataMemory' and
`IndexMemory' of each node group. Each node group stores
replicated information, so if there are four nodes with 2 replicas,
there are two node groups and the total `DataMemory'
available is 2 times the `DataMemory' of a single node.
Another important point concerns changes to `DataMemory' and
`IndexMemory'. First of all, it is highly recommended to have the
same amount of `DataMemory' and `IndexMemory' in all nodes. Since
data is distributed evenly over all nodes in the cluster, the space
available is no greater than that of the smallest node in the
cluster times the number of node groups.
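A minimal Python sketch of this rule follows. The function name is
hypothetical, and the sketch assumes that the number of storage nodes
is a multiple of `NoOfReplicas'.

```python
MB = 1024 * 1024


def usable_memory(per_node_memory, no_of_replicas):
    """Usable DataMemory (or IndexMemory) in bytes for the whole cluster.

    per_node_memory holds one configured value per storage node.
    """
    node_groups = len(per_node_memory) // no_of_replicas
    # The smallest node limits every node, since data is spread evenly.
    return min(per_node_memory) * node_groups


# Four nodes with 80MB each and 2 replicas: two node groups.
print(usable_memory([80 * MB] * 4, 2) // MB)  # 160

# One undersized node limits the whole cluster.
print(usable_memory([80 * MB, 80 * MB, 80 * MB, 40 * MB], 2) // MB)  # 80
```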
`DataMemory' and `IndexMemory' can be changed, but it is dangerous
to decrease them because that can easily lead to a node that is
not able to restart, or even a cluster that is not able to restart,
because there is not enough memory space for the tables that need
to be restored into the starting node. Increasing them should be
safe, but it is recommended that such changes be performed in
the same manner as a software upgrade: first the
configuration file is updated, then the management server is
restarted, and then the storage nodes are restarted one at a time
by command.
Updates do not cause more `IndexMemory' to be used. Inserts,
however, take effect immediately, whereas deleted entries are not
removed until the transaction is committed.
The default `IndexMemory' size is 18MB. The minimum size is 1MB.
The next three parameters are important because they affect the number
of parallel transactions and the sizes of transactions that can be
handled by the system. `MaxNoOfConcurrentTransactions' sets the number
of parallel transactions possible in a node and
`MaxNoOfConcurrentOperations' sets the number of records that can be in
update phase or locked simultaneously.
Both of these parameters, and particularly `MaxNoOfConcurrentOperations',
are likely candidates for being set explicitly by users rather than
left at the default value. The default value is intended for systems
using small transactions and ensures that not too much memory is used
in the default case.
`[DB]MaxNoOfConcurrentTransactions'
For each active transaction in the cluster there must be a
transaction record in one of the nodes in the cluster. The role of
transaction coordination is spread among the nodes, and thus the
total number of transaction records in the cluster is the number
in one node times the number of nodes in the cluster.
Transaction records are allocated to MySQL servers. Normally
there is at least one transaction record allocated in the
cluster per connection that uses or has used a table in the
cluster. Thus one should ensure that there are more transaction
records in the cluster than there are concurrent connections to
all MySQL servers in the cluster.
This parameter must be the same in all nodes in the cluster.
Changing this parameter is never safe and can cause a cluster
crash. When a node crashes, one of the nodes (actually the oldest
surviving node) builds up the transaction state of all
transactions ongoing in the crashed node at the time of the crash.
It is thus important that this node has as many transaction
records as the failed node.
The default value for this parameter is 4096.
`[DB]MaxNoOfConcurrentOperations'
Users are likely to need to change this parameter. Users
performing only short, small transactions do not need to set this
parameter very high. Applications that need to perform
rather large transactions involving many records need to set this
parameter higher.
For each transaction that updates data in the cluster it is
required to have operation records. There are operation records
both in the transaction coordinator and in the nodes where the
actual updates are performed.
The operation records contain state information needed to be able
to find UNDO records for rollback, lock queues, and a great deal of
other state information.
To dimension the cluster to handle transactions where one million
records are updated simultaneously, one should set this parameter
to one million divided by the number of nodes. Thus for a cluster
with four storage nodes one should set this parameter to 250000.
Read queries that set locks also use operation records. Some
extra space is allocated in the local nodes to cater for cases
where the distribution is not perfect over the nodes.
When queries translate into using the unique hash index there will
actually be two operation records used per record in the
transaction. The first one represents the read in the index table
and the second handles the operation on the base table.
The default value for this parameter is 32768.
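The sizing rule above can be sketched in Python as follows. The
function name and the `via_unique_hash_index' flag are hypothetical.
Doubling the count for access through a unique hash index follows the
statement that two operation records are then used per record, and
rounding up is an assumption of this sketch.

```python
def concurrent_operations_needed(records, storage_nodes,
                                 via_unique_hash_index=False):
    """Estimate MaxNoOfConcurrentOperations for one large transaction."""
    # Two operation records per row when a unique hash index is used:
    # one for the index table read, one for the base table operation.
    operation_records = records * (2 if via_unique_hash_index else 1)
    # Integer ceiling division, so the estimate is never too small.
    return -(-operation_records // storage_nodes)


print(concurrent_operations_needed(1_000_000, 4))  # 250000
```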
This parameter actually covers two parts that can be configured
separately. The first part specifies how many operation records
are to be placed in the transaction coordinator part. The second
part specifies how many operation records are to be used in
the local database part.
If a very big transaction is performed on an 8-node cluster, it
needs as many operation records in the transaction
coordinator as there are reads, updates, and deletes involved in
the transaction. However, the operation records for the actual
reads, updates, and inserts are spread over all eight
nodes. Thus if it is necessary to configure the system for one
very big transaction, it is a good idea to configure the two parts
separately. `MaxNoOfConcurrentOperations' is always used to
calculate the number of operation records in the transaction
coordinator part of the node.
It is also important to have an idea of the memory requirements
for those operation records. In MySQL 4.1.5, operation records
consume about 1KB per record. This figure will shrink in future
5.x versions.
`[DB]MaxNoOfLocalOperations'
By default this parameter is calculated as 1.1 *
`MaxNoOfConcurrentOperations' which fits systems with many
simultaneous, not very large transactions. If the configuration
needs to handle one very large transaction at a time and there are
many nodes then it is a good idea to configure this separately.
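As a small illustration of this default, the following Python sketch
computes 1.1 times `MaxNoOfConcurrentOperations' using integer
arithmetic. The function name is hypothetical, and rounding down is an
assumption. The text does not say how the cluster rounds the result.

```python
def default_local_operations(max_concurrent_operations):
    """Default MaxNoOfLocalOperations: 1.1 * MaxNoOfConcurrentOperations."""
    # Integer arithmetic avoids floating-point error; rounds down.
    return max_concurrent_operations * 11 // 10


print(default_local_operations(32768))  # 36044
```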
The next set of parameters is used for temporary storage while a part
of a query is being executed in the cluster. All of these records
are released once the query part has completed and is waiting
for the commit or rollback.
Most of the defaults for these parameters will be okay for most users.
Some high-end users might want to increase them to enable more
parallelism in the system, and some low-end users might want to decrease
them to save memory.
`[DB]MaxNoOfConcurrentIndexOperations'
For queries using a unique hash index, another set of operation
records is temporarily used in the execution phase of the query.
This parameter sets the size of this pool. Such a record is
allocated only while a part of a query is executing. As soon as
this part has been executed, the record is released. The state
needed to handle aborts and commits is handled by the normal
operation records, whose pool size is set by the parameter
`MaxNoOfConcurrentOperations'.
The default value of this parameter is 8192. Only in rare cases of
extremely high parallelism using unique hash indexes should it be
necessary to increase this parameter. It can be decreased to save
memory if the DBA is certain that such high parallelism does not
occur in the cluster.
`[DB]MaxNoOfFiredTriggers'
The default value of `MaxNoOfFiredTriggers' is 4000. Normally this
value should be sufficient for most systems. In some cases it
could be decreased if the DBA feels certain the parallelism in the
cluster is not so high.
This record is used when an operation is performed that affects a
unique hash index. Updating a column that is part of a unique hash
index or inserting/deleting a record in a table with unique hash
indexes fires an insert or delete in the index table. This
record is used to represent this index table operation while it is
waiting for the original operation that fired it to complete.
Thus it is short-lived, but the pool can still need a fair number
of records for temporary situations with many parallel write
operations on a base table containing a set of unique hash indexes.
`[DB]TransactionBufferMemory'
This parameter is also used for keeping fired operations to update
index tables. This part keeps the key and column information for
the fired operations. It should be very rare that this parameter
needs to be updated.
Normal read and write operations also use a similar buffer, whose
usage is even more short-term. This is a compile-time parameter
set to 4000*128 bytes (500KB). The parameter is
`ZATTRBUF_FILESIZE' in DBTC.HPP. A similar buffer for key info
exists which contains 4000*16 bytes (62.5KB) of buffer space. The
parameter in this case is `ZDATABUF_FILESIZE' in DBTC.HPP. `Dbtc'
is the module for handling the transaction coordination.
Similar parameters exist in the `Dblqh' module, which takes care of
the reads and updates where the data is located. In `Dblqh.hpp',
`ZATTRINBUF_FILESIZE' is set to 10000*128 bytes (1250KB) and
`ZDATABUF_FILE_SIZE' is set to 10000*16 bytes (roughly 156KB) of
buffer space. So far, no case in which any of these compile-time
limits was too small has been reported or discovered by any of our
extensive test suites.
The default size of the `TransactionBufferMemory' is 1MB.
`[DB]MaxNoOfConcurrentScans'
This parameter is used to control the number of parallel scans
that can be performed in the cluster. Each transaction
coordinator can handle the number of parallel scans defined by
this parameter. Each scan query is performed by scanning all
partitions in parallel. Each partition scan uses a scan record
in the node where the partition is located. The number of those
records is the value of this parameter times the number of nodes,
so that the cluster can sustain the maximum number of scans
in parallel from all nodes in the cluster.
Scans are performed in two cases. The first case is when no hash
or ordered index exists to handle the query. In this case the
query is executed by performing a full table scan. The second case
is when there is no hash index to support the query but there is
an ordered index. Using the ordered index means executing a
parallel range scan. Since the order is kept only on the local
partitions, it is necessary to perform the index scan on all
partitions.
The default value of `MaxNoOfConcurrentScans' is 256. The maximum
value is 500.
This parameter will always specify the number of scans possible in
the transaction coordinator. If the number of local scan records
is not provided it is calculated as the product of
`MaxNoOfConcurrentScans' and the number of storage nodes in the
system.
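The default calculation described above can be written as a short
Python sketch. The function name is hypothetical, and the range check
simply reflects the maximum value of 500 stated in the text.

```python
MAX_CONCURRENT_SCANS = 500  # maximum value stated above


def default_local_scan_records(max_concurrent_scans, storage_nodes):
    """Local scan records when MaxNoOfLocalScans is not provided."""
    if not 0 < max_concurrent_scans <= MAX_CONCURRENT_SCANS:
        raise ValueError("MaxNoOfConcurrentScans out of range")
    return max_concurrent_scans * storage_nodes


# Default of 256 scans in a cluster with four storage nodes.
print(default_local_scan_records(256, 4))  # 1024
```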
`[DB]MaxNoOfLocalScans'
This parameter makes it possible to specify the number of local
scan records if many scans are not fully parallelized.
`[DB]BatchSizePerLocalScan'
This parameter is used to calculate the number of lock records
needed to handle many concurrent scan operations.
The default value is 64, and this value has a strong connection to
the `ScanBatchSize' defined in the API nodes.
`[DB]LongMessageBuffer'
This is an internal buffer used for message passing internally in
the node and for messages between nodes in the system. It is
highly unlikely that anybody would need to change this parameter
but it is configurable. By default it is set to 1MB.
`[DB]NoOfFragmentLogFiles'
This is an important parameter that states the size of the REDO
log files in the node. REDO log files are organized in a ring, so
it is important that the tail and the head do not meet.
When the tail and head come too close to each other, the node
starts aborting all updating transactions because there is no
room for the log records.
REDO log records are not removed until three local checkpoints have
completed since the log record was inserted. The speed of
checkpointing is controlled by a set of other parameters, so these
parameters are all closely connected.
The default value is 8, which means 8 sets of four 16MB
files, for a total of 512MB. In other words, the unit of this
parameter is 64MB of REDO log space. In high-update scenarios this
parameter needs to be set very high. Test cases have been run in
which it was necessary to set it to over 300.
If the checkpointing is slow and there are so many writes to the
database that the log files are full and the log tail cannot be
cut for recovery reasons then all updating transactions will be
aborted with internal error code 410 which will be translated to
`Out of log file space temporarily'. This condition will prevail
until a checkpoint has completed and the log tail can be moved
forward.
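To make the sizing concrete, the following Python sketch computes the
REDO log space from the figures given above (sets of four 16MB files).
The function and constant names are hypothetical.

```python
FILES_PER_SET = 4   # each unit is a set of four files
FILE_SIZE_MB = 16   # each file is 16MB


def redo_log_size_mb(no_of_fragment_log_files):
    """Total REDO log space in MB for a given NoOfFragmentLogFiles."""
    return no_of_fragment_log_files * FILES_PER_SET * FILE_SIZE_MB


print(redo_log_size_mb(8))    # 512
print(redo_log_size_mb(300))  # 19200
```

A setting of 300, as mentioned for high-update test cases, thus
corresponds to roughly 19GB of REDO log space per node.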
`[DB]MaxNoOfSavedMessages'
This parameter sets the maximum number of trace files that will be
kept before overwriting old trace files. Trace files are generated
when the node crashes for some reason.
The default is 25 trace files.
The next set of parameters defines the pool sizes for metadata objects.
It is necessary to define the maximum number of attributes, tables,
indexes, and trigger objects used by indexes, events and replication
between clusters.
`[DB]MaxNoOfAttributes'
This parameter defines the number of attributes that can be
defined in the cluster.
The default value of this parameter is 1000. The minimum value is
32 and there is no maximum. Each attribute consumes around 200
bytes of storage in each node because metadata is fully replicated
in the servers.
`[DB]MaxNoOfTables'
A table object is allocated for each table, for each unique hash
index, and for each ordered index. This parameter sets the
maximum number of table objects in the cluster.
For each attribute that has a `BLOB' data type an extra table is
used to store most of the `BLOB' data. These tables also must be
taken into account when defining the number of tables.
The default value of this parameter is 128. The minimum is 8 and
the maximum is 1600. Each table object consumes around 20KB in
each node.
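The counting rule above can be sketched in Python as follows. The
function and constant names are hypothetical, and the memory figure
uses the approximate 20KB per table object stated in the text. The
example counts the objects for a single table with a primary key and
one unique index, each of which also has an ordered index.

```python
TABLE_OBJECT_BYTES = 20 * 1024  # "around 20KB" per table object per node


def table_objects_needed(tables, unique_hash_indexes, ordered_indexes,
                         blob_attributes=0):
    """Table objects: one per table, per index, and per BLOB attribute."""
    return tables + unique_hash_indexes + ordered_indexes + blob_attributes


# One table, one unique hash index, two ordered indexes, no BLOB columns.
needed = table_objects_needed(1, 1, 2)
print(needed, needed * TABLE_OBJECT_BYTES)  # 4 81920
```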
`[DB]MaxNoOfOrderedIndexes'
For each ordered index in the cluster, objects are allocated to
describe what it is indexing and its storage parts. By default
each index defined will have an ordered index also defined. Unique
indexes and primary key indexes have both an ordered index and a
hash index.
The default value of this parameter is 128. Each object consumes
around 10KB of data per node.
`[DB]MaxNoOfUniqueHashIndexes'
For each unique index (not for primary keys) a special table is
allocated that maps the unique key to the primary key of the
indexed table. By default there will be an ordered index also
defined for each unique index. To avoid this, you must use the
`USING HASH' option in the unique index definition.
The default value is 64. Each index will consume around 15KB per
node.
`[DB]MaxNoOfTriggers'
For each unique hash index, an internal update, insert, and delete
trigger is allocated, for a total of three triggers per unique
hash index. Ordered indexes use only one trigger object. Backups
also use three trigger objects for each normal table in the
cluster. When replication between clusters is supported, it will
also use internal triggers.
This parameter sets the maximum number of trigger objects in the
cluster.
The default value of this parameter is 768.
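A rough Python sketch of the trigger count follows, using only the
figures given above. The function name is hypothetical, and triggers
for replication between clusters are not included because the text
gives no figure for them.

```python
def triggers_needed(unique_hash_indexes, ordered_indexes, tables,
                    with_backup=True):
    """Estimate the number of trigger objects needed in the cluster."""
    # Three triggers per unique hash index, one per ordered index.
    count = 3 * unique_hash_indexes + ordered_indexes
    if with_backup:
        # Backups use three trigger objects per normal table.
        count += 3 * tables
    return count


# One table with one unique hash index and two ordered indexes.
print(triggers_needed(1, 2, 1))  # 8
```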
`[DB]MaxNoOfIndexes'
This parameter was deprecated in MySQL 4.1.5. You should use
`MaxNoOfOrderedIndexes' and `MaxNoOfUniqueHashIndexes' instead.
This parameter is only used by unique hash indexes. There needs to
be one record in this pool for each unique hash index defined in
the cluster.
The default value of this parameter is 128.
There is a set of boolean parameters affecting the behavior of storage
nodes. A boolean parameter can be set to true by specifying Y
or 1, and to false by specifying N or 0.
`[DB]LockPagesInMainMemory'
For a number of operating systems such as Solaris and Linux it is
possible to lock a process into memory and avoid all swapping
problems. This is an important feature to provide real-time
characteristics of the cluster.
The default is that this feature is not enabled.
`[DB]StopOnError'
This parameter states whether the process is to exit on an error
condition or whether it is to perform an automatic restart.
The default is that this feature is enabled.
`[DB]Diskless'
In the internal interfaces it is possible to set tables as
diskless tables, meaning that the tables are not checkpointed to
disk and no logging occurs. They exist only in main memory. The
tables still exist after a crash, but the records in the
tables do not.
This feature makes the entire cluster `Diskless'. In this case even
the tables no longer exist after a crash. This feature is enabled
by setting it to either Y or 1.
When this feature is enabled, backups are performed but are
not stored because there is no "disk". In a future release,
diskless backup is likely to become a separate configurable
parameter.
The default is that this feature is not enabled.
`[DB]RestartOnErrorInsert'
This feature is only accessible when building the debug version
where it is possible to insert errors in the execution of various
code parts to test failure cases.
The default is that this feature is not enabled.
There are quite a few parameters specifying timeouts and time intervals
between various actions in the storage nodes. Most of the timeouts are
specified in milliseconds with a few exceptions which will be mentioned
below.
`[DB]TimeBetweenWatchDogCheck'
To ensure that the main thread does not get stuck in an endless loop
somewhere, there is a watchdog thread which checks the main
thread. This parameter states the number of milliseconds between
each check. If the main thread is still in the same state after
three checks, the process is stopped by the watchdog thread.
This parameter can easily be changed and can differ between
nodes, although there seems to be little reason for such a
difference.
The default timeout is 4000 milliseconds (4 seconds).
`[DB]StartPartialTimeout'
This parameter specifies the time that the cluster will wait for
all storage nodes to come up before the algorithm to start the
cluster is invoked. This timeout is used to avoid starting only a
partial cluster if possible.
The default value is 30000 milliseconds (30 seconds). 0 means
an infinite timeout; that is, the cluster starts only if all nodes
are available.
`[DB]StartPartitionedTimeout'
If the cluster is ready to start after waiting `StartPartialTimeout'
but is still in a possibly partitioned state, it also waits until
this timeout has passed.
The default timeout is 60000 milliseconds (60 seconds).
`[DB]StartFailureTimeout'
If the start is not completed within the time specified by this
parameter the node start will fail. Setting this parameter to 0
means no time out is applied on the time to start the cluster.
The default value is 60000 milliseconds (60 seconds). For storage
nodes containing large data sets this parameter needs to be
increased because it could very well take 10-15 minutes to perform
a node restart of a storage node with a few gigabytes of data.
`[DB]HeartbeatIntervalDbDb'
One of the main methods of discovering failed nodes is by
heartbeats. This parameter states how often heartbeat signals are
sent and how often to expect to receive them. After missing three
heartbeat intervals in a row, the node is declared dead. Thus the
maximum time of discovering a failure through the heartbeat
mechanism is four times the heartbeat interval.
The default heartbeat interval is 1500 milliseconds (1.5 seconds).
This parameter must not be changed drastically. If one node uses
5000 milliseconds and the node watching it uses 1000 milliseconds
then obviously the node will be declared dead very quickly. So
this parameter can be changed in small steps during an online
software upgrade but not in large steps.
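The worst-case detection time stated above (four times the heartbeat
interval) can be expressed as a small Python sketch. The function name
is hypothetical.

```python
def max_failure_detection_ms(heartbeat_interval_ms, missed_intervals=3):
    """Worst-case time to detect a failed node through heartbeats."""
    # A node is declared dead after three missed intervals in a row; the
    # failure may occur just after a heartbeat, adding one more interval.
    return (missed_intervals + 1) * heartbeat_interval_ms


print(max_failure_detection_ms(1500))  # 6000
```

With the default interval of 1500 milliseconds, a failure is therefore
detected within at most 6 seconds by this mechanism.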
`[DB]HeartbeatIntervalDbApi'
In a similar manner, each storage node sends heartbeats to each of
the connected MySQL servers to ensure that they behave properly.
If a MySQL server does not send a heartbeat in time (using the same
algorithm as for storage nodes, where three missed heartbeats cause
failure), it is declared down, all its ongoing transactions are
completed, and all resources are released. The MySQL server cannot
reconnect until all activities started by the previous MySQL
instance have been completed.
The default interval is 1500 milliseconds. This interval can
differ between storage nodes because each storage node
watches the MySQL servers connected to it independently of all
other storage nodes.
`[DB]TimeBetweenLocalCheckpoints'
This parameter is an exception in that it does not specify a time
to wait before starting a new local checkpoint. Instead, it is
used to ensure that local checkpoints are not performed in a
cluster where few updates are taking place. In most
clusters with high update rates it is likely that a new local
checkpoint is started immediately after the previous one completes.
The size of all write operations executed since the start of the
previous local checkpoint is added up. This parameter is specified
as the base-2 logarithm of the number of words. So the default
value 20 means 4MB of write operations, 21 means 8MB, and so forth
up to the maximum value 31, which means 8GB of write operations.
All the write operations in the cluster are added together.
Setting it to 6 or lower means that local checkpoints execute
continuously without any wait between them, independent of the
workload in the cluster.
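The relationship between the parameter value and the amount of write
activity can be sketched in Python as follows. The function name is
hypothetical. The 4-byte word size is not stated explicitly in the
text but is implied by its figures (a value of 20 corresponds to 4MB).

```python
WORD_SIZE = 4  # bytes per word, implied by "20 means 4MB"


def lcp_threshold_bytes(value):
    """Bytes of write operations corresponding to the parameter value."""
    return WORD_SIZE * 2 ** value


MB = 1024 * 1024
print(lcp_threshold_bytes(20) // MB)  # 4
print(lcp_threshold_bytes(21) // MB)  # 8
print(lcp_threshold_bytes(31) // MB)  # 8192
```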
`[DB]TimeBetweenGlobalCheckpoints'
When a transaction is committed, it is committed in main memory in
all nodes where mirrors of the data exist. However, the log records
of the transaction are not forced to disk as part of the commit.
The reasoning here is that having the transaction safely
committed in at least two independent computers should meet
standards of durability.
At the same time it is also important to ensure that even the
worst case, in which the cluster crashes completely, is handled
properly. To ensure this, all transactions in a certain interval
are put into a global checkpoint. A global checkpoint is very
similar to a grouped commit of transactions: an entire group of
transactions is sent to disk. Thus, as part of the commit, the
transaction is put into a global checkpoint group. Later, this
group's log records are forced to disk, and then the entire group
of transactions is safely committed on the disk storage of all
computers as well.
This parameter states the interval between global checkpoints. The
default time is 2000 milliseconds.
`[DB]TimeBetweenInactiveTransactionAbortCheck'
Timeout handling is performed by checking the timer on each
transaction once per interval, as specified by this
parameter. Thus if this parameter is set to 1000 milliseconds,
every transaction is checked for timeout once every
second.
The default for this parameter is 1000 milliseconds (1 second).
`[DB]TransactionInactiveTimeout'
If the transaction is currently not performing any queries but is
waiting for further user input, this parameter states the maximum
time that the user can wait before the transaction is aborted.
The default for this parameter is no timeout. For a real-time
database that must ensure that no transaction keeps locks for too
long, this parameter should be set to a much smaller value. The
unit is milliseconds.
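Because timers are only inspected periodically, an inactive transaction may be aborted somewhat later than its timeout. The Python sketch below estimates when the abort would be noticed. It assumes that checks happen at exact multiples of the check interval counted from time 0, which the text does not state; the function names are hypothetical.

```python
def first_check_at_or_after(deadline_ms: int, check_interval_ms: int) -> int:
    """Time of the first periodic check at or after the deadline."""
    # Integer ceiling division, then back to milliseconds.
    return -(-deadline_ms // check_interval_ms) * check_interval_ms


def abort_time_ms(last_activity_ms: int,
                  inactive_timeout_ms: int,
                  check_interval_ms: int = 1000) -> int:
    """Estimate when an idle transaction would be aborted."""
    deadline = last_activity_ms + inactive_timeout_ms
    return first_check_at_or_after(deadline, check_interval_ms)


# Idle since t=250 ms with a 3000 ms timeout: deadline is 3250 ms.
print(abort_time_ms(250, 3000))
```

Under these assumptions the abort in the example is noticed at 4000 milliseconds, so the extra delay is always less than one check interval.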
`[DB]TransactionDeadlockDetectionTimeout'
When a transaction is involved in executing a query it waits for
other nodes. If the other nodes do not respond, there are three
possible causes. First, the node could be dead. Second, the
operation could have entered a lock queue. Third, the node
requested to perform the action could be heavily overloaded. This
timeout parameter states how long the transaction coordinator
waits for another node to execute a query before it aborts the
transaction.
Thus this parameter is important both for node failure handling
and for deadlock detection. Setting it too high causes undesirable
behavior at deadlocks and node failures.
The default timeout is 1200 milliseconds (1.2 seconds).
`[DB]NoOfDiskPagesToDiskAfterRestartTUP'
When executing a local checkpoint the algorithm sends all data
pages to disk. Simply sending them there as quickly as possible
would cause unnecessary load on processors, networks, and disks.
Thus to control the write speed this parameter specifies how many
pages per 100 milliseconds are to be written. A page is here
defined as 8KB. The unit this parameter is specified in is thus
80KB per second. So setting it to 20 means writing 1.6MB of data
pages to disk per second during a local checkpoint. Writing of
UNDO log records for data pages is also part of this sum. Writing
of index pages (see `IndexMemory' to understand what index pages
are used for) and their UNDO log records is handled by the
parameter `NoOfDiskPagesToDiskAfterRestartACC'. This parameter
limits the writes from `DataMemory'.
So this parameter specifies how quickly local checkpoints are
executed. It is important in connection with
`NoOfFragmentLogFiles', `DataMemory', and `IndexMemory'.
The default value is 40 (3.2MB of data pages per second).
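The unit conversion described above can be written out as a small Python sketch. The page size and the 100 millisecond period come from the text; the function name is made up for illustration.

```python
PAGE_SIZE_BYTES = 8 * 1024   # "A page is here defined as 8KB"
PERIODS_PER_SECOND = 10      # the parameter counts pages per 100 ms


def checkpoint_write_rate(pages_per_100ms: int) -> int:
    """Bytes written to disk per second for a given parameter value."""
    return pages_per_100ms * PAGE_SIZE_BYTES * PERIODS_PER_SECOND


for setting in (20, 40):
    print(setting, checkpoint_write_rate(setting) // 1024, "KB/s")
```

Settings of 20 and 40 give 1600KB and 3200KB per second, which the text rounds to 1.6MB and 3.2MB per second.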
`[DB]NoOfDiskPagesToDiskAfterRestartACC'
This parameter has the same unit as
`NoOfDiskPagesToDiskAfterRestartTUP' but limits the speed of
writing index pages from `IndexMemory'.
The default value of this parameter is 20 (1.6MB per second).
`[DB]NoOfDiskPagesToDiskDuringRestartTUP'
This parameter specifies the same thing as
`NoOfDiskPagesToDiskAfterRestartTUP' and
`NoOfDiskPagesToDiskAfterRestartACC', but it applies to the local
checkpoint executed in the node while the node is restarting. A
local checkpoint is always performed as part of every node
restart. During a node restart it is possible to write to disk at
a higher speed than at other times, because fewer activities are
performed in the node.
This parameter handles the `DataMemory' part.
The default value is 40 (3.2MB per second).
`[DB]NoOfDiskPagesToDiskDuringRestartACC'
This parameter limits the write speed during a node restart for
the `IndexMemory' part of the local checkpoint.
The default value is 20 (1.6MB per second).
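The four write speed parameters determine roughly how long a local checkpoint takes. The Python sketch below gives a lower bound on that time for a given amount of memory, assuming the whole memory area is written once and ignoring the UNDO log records that also count against the limit. The 800MB figure is a hypothetical example, not a default.

```python
PAGE_SIZE_BYTES = 8 * 1024   # page size given in the text
MB = 1024 * 1024


def seconds_to_write(memory_bytes: int, pages_per_100ms: int) -> float:
    """Lower bound on the time needed to write a memory area to disk."""
    bytes_per_second = pages_per_100ms * PAGE_SIZE_BYTES * 10
    return memory_bytes / bytes_per_second


data_memory = 800 * MB  # hypothetical DataMemory size
print(seconds_to_write(data_memory, 40))  # at 40 pages per 100 ms
print(seconds_to_write(data_memory, 20))  # at 20 pages per 100 ms
```

In this example, writing 800MB takes at least 256 seconds at a setting of 40 and at least 512 seconds at a setting of 20. Such an estimate can help when sizing `NoOfFragmentLogFiles' against `DataMemory' and `IndexMemory'.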
`[DB]ArbitrationTimeout'
This parameter specifies the time that the storage node will wait
for a response from the arbitrator when sending an arbitration
message in the case of a split network.
The default value is 1000 milliseconds (1 second).
A number of new configuration parameters were introduced in MySQL 4.1.5.
These correspond to values that previously were compile time
parameters. The main reason for this is to give the advanced user
more control over the size of the process and to allow various buffer
sizes to be adjusted as needed.
All of these buffers are used as front-ends to the file system when
writing log records of various kinds to disk. If the node runs with
Diskless, then these parameters can safely be set to their minimum
values because all disk writes are faked as successful by the file
system abstraction layer in the `NDB' storage engine.
`[DB]UndoIndexBuffer'
This buffer is used during local checkpoints. The `NDB' storage
engine uses a recovery scheme based on a consistent checkpoint
together with an operational REDO log. In order to produce a
consistent checkpoint without blocking the entire system for
writes, UNDO logging is done while performing the local
checkpoint. The UNDO logging is only activated on one fragment of
one table at a time. This optimization is possible because tables
are entirely stored in main memory.
This buffer is used for the updates on the primary key hash index.
Inserts and deletes rearrange the hash index and the `NDB' storage
engine writes UNDO log records that map all physical changes to an
index page such that they can be undone at a system restart. It
also logs all active insert operations at the start of a local
checkpoint for the fragment.
Reads and updates only set lock bits and update a header in the
hash index entry. These changes are handled by the page write
algorithm to ensure that these operations need no UNDO logging.
This buffer is 2MB by default. The minimum value is 1MB. For most
applications this is good enough. Applications doing extremely
heavy inserts and deletes together with large transactions using
large primary keys might need to extend this buffer.
If this buffer is too small, the `NDB' storage engine issues the
internal error code 677 which will be translated into "Index UNDO
buffers overloaded".
`[DB]UndoDataBuffer'
This buffer has exactly the same role as the `UndoIndexBuffer' but
is used for the data part. This buffer is used during local
checkpoint of a fragment and inserts, deletes, and updates use the
buffer.
Since these UNDO log entries tend to be bigger and more things are
logged, the buffer is also bigger by default. It is set to 16MB by
default. For some applications this might be too conservative and
they might want to decrease this size. The minimum size is 1MB. It
should be rare that applications need to increase this buffer
size. If there is a need for this, it is a good idea to check
whether the disks can actually handle the load that the update
activity in the database causes. If they cannot, then no size of
this buffer will be big enough.
If this buffer is too small and gets congested, the `NDB' storage
engine issues the internal error code 891 which will be translated
to "Data UNDO buffers overloaded".
`[DB]RedoBuffer'
All update activities also need to be logged. This enables a
replay of these updates at system restart. The recovery algorithm
uses a consistent checkpoint produced by a "fuzzy" checkpoint of
the data together with UNDO logging of the pages. Then it applies
the REDO log to play back all changes up to the point in time
that is to be restored by the system restart.
This buffer is 8MB by default. The minimum value is 1MB.
If this buffer is too small, the `NDB' storage engine issues the
internal error code 1221 which will be translated into "REDO log
buffers overloaded".
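The defaults, minimum values, and overload error codes of the three log buffers can be collected in a small lookup table. The Python sketch below uses only the values stated above; the helper functions are hypothetical and are not part of any MySQL tool.

```python
MB = 1024 * 1024

# name: (default, minimum, overload error code, error message)
LOG_BUFFERS = {
    "UndoIndexBuffer": (2 * MB, 1 * MB, 677, "Index UNDO buffers overloaded"),
    "UndoDataBuffer": (16 * MB, 1 * MB, 891, "Data UNDO buffers overloaded"),
    "RedoBuffer": (8 * MB, 1 * MB, 1221, "REDO log buffers overloaded"),
}


def check_buffer_sizes(settings):
    """Return a list of problems for a dict of parameter name -> bytes."""
    problems = []
    for name, size in settings.items():
        if name not in LOG_BUFFERS:
            problems.append(f"{name}: unknown parameter")
            continue
        minimum = LOG_BUFFERS[name][1]
        if size < minimum:
            problems.append(f"{name}: {size} is below the minimum {minimum}")
    return problems


def buffer_for_error(code):
    """Return (parameter name, message) for an overload error code."""
    for name, (_default, _minimum, err, message) in LOG_BUFFERS.items():
        if err == code:
            return name, message
    return None


print(check_buffer_sizes({"RedoBuffer": 512 * 1024}))
print(buffer_for_error(891))
```

The first call reports that a 512KB `RedoBuffer' is below the 1MB minimum, and the second maps error 891 back to `UndoDataBuffer'.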
For cluster management, it is important to be able to control the
number of log messages sent to stdout for various event types. The
possible events will be listed in this manual soon. There are 16
possible levels, from level 0 to level 15. Setting event reporting to
level 15 means receiving all event reports of that category and setting
it to 0 means getting no event reports in that category.
Most defaults are set to 0, and thus cause no output to stdout,
because the same message is sent to the cluster log in the management
server. Only the startup message is by default generated to stdout.
A similar set of levels can be set in the management client to define
what levels to record in the cluster log.
`[DB]LogLevelStartup'
Events generated during startup of the process.
The default level is 1.
`[DB]LogLevelShutdown'
Events generated as part of graceful shutdown of a node.
The default level is 0.
`[DB]LogLevelStatistic'
Statistical events such as the number of primary key reads,
updates, and inserts, as well as other statistical information
such as buffer usage.
The default level is 0.
`[DB]LogLevelCheckpoint'
Events generated by local and global checkpoints.
The default level is 0.
`[DB]LogLevelNodeRestart'
Events generated during node restart.
The default level is 0.
`[DB]LogLevelConnection'
Events generated by connections between nodes in the cluster.
The default level is 0.
`[DB]LogLevelError'
Events generated by errors and warnings in the cluster. These are
errors not causing a node failure but still considered worth
reporting.
The default level is 0.
`[DB]LogLevelInfo'
Events generated for information about the state of the cluster
and so forth.
The default level is 0.
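The defaults listed above can be summarized as a table, together with a sketch of how a threshold of this kind typically works. The Python code below assumes that each event carries its own level between 1 and 15 and is reported when that level does not exceed the configured value. That rule is an assumption consistent with "15 means all, 0 means none", not a description of the actual implementation, and the function name is hypothetical.

```python
DEFAULT_LOG_LEVELS = {
    "LogLevelStartup": 1,
    "LogLevelShutdown": 0,
    "LogLevelStatistic": 0,
    "LogLevelCheckpoint": 0,
    "LogLevelNodeRestart": 0,
    "LogLevelConnection": 0,
    "LogLevelError": 0,
    "LogLevelInfo": 0,
}


def is_reported(category, event_level, configured=None):
    """Decide whether an event of the given level is sent to stdout."""
    levels = dict(DEFAULT_LOG_LEVELS)
    levels.update(configured or {})
    # Assumption: events have levels 1..15 and pass if <= the threshold.
    return 1 <= event_level <= levels[category]


print(is_reported("LogLevelStartup", 1))
print(is_reported("LogLevelCheckpoint", 1))
print(is_reported("LogLevelCheckpoint", 8, {"LogLevelCheckpoint": 15}))
```

With the defaults, only level 1 startup events pass. Raising `LogLevelCheckpoint' to 15 lets every checkpoint event through.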
There is a set of parameters defining memory buffers that are set aside
for online backup execution.
`[DB]BackupDataBufferSize'
When executing a backup there are two buffers used for sending
data to the disk. This buffer is filled with data recorded by
scanning the tables in the node. When the buffer has been filled
to a certain level, the pages are sent to disk. This level is
specified by the `BackupWriteSize' parameter. While data is being
sent to the disk, the backup can continue filling this buffer
until it runs out of buffer space. When it runs out of buffer
space, it simply stops the scan and waits until some disk writes
return and thus free up memory buffers for further scanning.
The default value is 2MB.
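The interaction between the scan, the buffer, and the disk can be pictured with a very coarse simulation. The Python sketch below is an assumed model for illustration, not the actual backup algorithm: time advances in ticks, the scan adds a fixed amount per tick, and the disk drains whole `BackupWriteSize' messages up to a fixed amount per tick. All rates are hypothetical.

```python
KB = 1024
MB = 1024 * KB


def simulate_backup_scan(scan_per_tick, disk_per_tick, ticks,
                         buffer_size=2 * MB, write_size=32 * KB):
    """Count the ticks in which the scan must wait for free buffer space."""
    buffered = 0
    stalled_ticks = 0
    for _ in range(ticks):
        if buffer_size - buffered >= scan_per_tick:
            buffered += scan_per_tick   # scan keeps filling the buffer
        else:
            stalled_ticks += 1          # buffer full: scan waits
        # The disk only accepts whole messages of write_size bytes.
        writable = min(buffered, disk_per_tick) // write_size * write_size
        buffered -= writable
    return stalled_ticks


print(simulate_backup_scan(32 * KB, 64 * KB, ticks=100))  # disk keeps up
print(simulate_backup_scan(64 * KB, 32 * KB, ticks=100))  # disk too slow
```

When the disk is at least as fast as the scan, the scan never stalls. When the disk is slower, the buffer eventually fills and the scan starts waiting, which slows the backup but does not make it fail.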
`[DB]BackupLogBufferSize'
This parameter has a similar role but is instead used for writing
a log of all writes to the tables during execution of the backup.
The same principles apply for writing those pages as for
`BackupDataBufferSize' except that when this part runs out of
buffer space, it causes the backup to fail due to lack of backup
buffers. Thus the size of this buffer must be big enough to handle
the load caused by write activities during the backup execution.
The default parameter should be big enough. Actually it is more
likely that a backup failure is caused by a disk not able to write
as quickly as it should. If the disk subsystem is not dimensioned
for the write load caused by the applications, the cluster will
have great difficulty performing the desired actions.
It is important to dimension the nodes in such a manner that the
processors become the bottleneck rather than the disks or the
network connections.
The default value is 2MB.
`[DB]BackupMemory'
This parameter is simply the sum of the two previous parameters,
`BackupDataBufferSize' and `BackupLogBufferSize'.
The default value is 4MB.
`[DB]BackupWriteSize'
This parameter specifies the size of the write messages to disk
for the log and data buffer used for backups.
The default value is 32KB.
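The relationship between the four backup parameters can be checked with a few lines of Python. The default values are taken from the text; the function names are hypothetical.

```python
KB = 1024
MB = 1024 * KB


def backup_memory(data_buffer=2 * MB, log_buffer=2 * MB):
    """BackupMemory as the sum of the data and log buffer sizes."""
    return data_buffer + log_buffer


def writes_per_full_buffer(buffer_size=2 * MB, write_size=32 * KB):
    """Number of BackupWriteSize messages needed to drain a full buffer."""
    return buffer_size // write_size


print(backup_memory() // MB, "MB")
print(writes_per_full_buffer())
```

With the defaults, `BackupMemory' comes to 4MB and a full 2MB buffer corresponds to 64 write messages of 32KB each.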