
Apologies for the silly title, but the issue is a real one, even though it is not a new thing. Schema upgrades are not an ordinary operation in Galera. For the subject at hand, the bottom line is: under the default Total Order Isolation (TOI) method, “the cluster replicates the schema change query as a statement before its execution.” What this means in practice is that a DDL issued in one node is replicated to other nodes in the cluster before it is even executed in the source node, let alone completed successfully.
As a result, a DDL may fail on one node and succeed on another, without raising loud alerts or stopping nodes to protect against data inconsistency. This is not a bug in itself but rather a design compromise. With the recent support for atomic DDLs added in MySQL 8.0, we may see improvements in this area in Galera over time as well. For now, the warning stands: DDLs are not atomic in Galera, even in 8.0.
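For context, the method used for online schema upgrades is controlled by the wsrep_OSU_method variable, with TOI as the default. As a quick sketch (not a recommendation; RSU comes with caveats of its own), this is how one might inspect the current setting or switch a single session to the Rolling Schema Upgrade (RSU) method:

-- Check which online schema upgrade method is in effect (TOI is the default)
SHOW GLOBAL VARIABLES LIKE 'wsrep_OSU_method';
-- Apply subsequent DDLs in this session locally only, outside of total order (use with care)
SET SESSION wsrep_OSU_method = 'RSU';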
How Big of an Issue Is This?
To be fair, it’s rare – we do not see this happening often.
How Does the Problem Manifest Itself in Practice? Give Us an Example!
We recently investigated a case where a DBA issued a TRUNCATE TABLE statement on one node, and it was taking some time to complete. His reaction was to hit Ctrl+C in the session where the DDL had been issued. That session seemed to hang, so he ended up opening a new session and issuing a full DELETE on that same table. It also took time to complete; this time, however, he was more patient. The problem was that once the statement completed, he noticed the two other nodes had disappeared from the cluster. Upon closer inspection of the logs, both nodes had reported the following error before being evicted from the cluster:
[ERROR] Slave SQL: Could not execute Delete_rows event on table mydb.target; Can't find record in 'target', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 1073, Error_code: 1032
The DELETE was successful on the source node but failed to complete on the other two nodes. Why? Because the target table was already empty: the previous TRUNCATE TABLE had effectively been executed on those nodes.
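To make the sequence of events easier to follow, here is a rough sketch of what was run on the source node, using mydb.target as a stand-in for the actual table (the exact statements are our reconstruction of the case):

-- Session 1: the DDL that hung and was interrupted with Ctrl+C
TRUNCATE TABLE mydb.target;
-- Session 2: the DML issued afterwards, which completed locally
-- but replicated as Delete_rows events the other nodes could not apply
DELETE FROM mydb.target;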
Wasn’t the TRUNCATE TABLE also successful on the source node? It turned out it wasn’t. In fact, the TRUNCATE was recorded in the slow query log of the source node like this:
# Time: (...)
# User (...)
# Schema: building  Last_errno: 1317  Killed: 0
# Query_time: 145.753247  Lock_time: 0.000000  Rows_sent: 0  Rows_examined: 0  Rows_affected: 0
# Bytes_sent: 44  Tmp_tables: 0  Tmp_disk_tables: 0  Tmp_table_sizes: 0
# QC_Hit: No  Full_scan: No  Full_join: No  Tmp_table: No  Tmp_table_on_disk: No
# Filesort: No  Filesort_on_disk: No  Merge_passes: 0
# No InnoDB statistics available for this query
SET timestamp=(...)
truncate target;
Note the Last_errno indicated above:
$ perror 1317
MySQL error code 1317 (ER_QUERY_INTERRUPTED): Query execution was interrupted
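If you suspect a similar interrupted statement on one of your nodes, the slow query log is a good place to look. Assuming the log sits at /var/lib/mysql/slow.log (adjust the path to your configuration), something like the following will surface such entries:

$ grep -n 'Last_errno: 1317' /var/lib/mysql/slow.log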
Our colleague Przemek reproduced a similar problem, which he described in bug PXC-3536: Modify processing to not allow threads/queries to be killed if the thread is in TOI. The Percona Engineering team has since fixed this aborting behavior, leaving the following note on the bug tracker:
Fixed in 5.7[.33-31.49] and 8.0[.22-13.1]. Any attempt to kill the query or connection while in TOI will fail. Note that due to timing, the “kill query” may sometimes return success, but the operation will still have succeeded.
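To check whether your own installation already includes this fix, comparing the server version against the releases quoted above is enough; for example:

mysql> SELECT @@version, @@version_comment;
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';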
In the case we investigated, the MySQL error log of the source node had the answer; there was some other higher-priority transaction already in place on that node that prevented it from executing the TRUNCATE:
[Note] WSREP: MDL conflict db=mydb table=target ticket=MDL_SHARED_WRITE solved by abort
To clarify, this cluster was receiving writes from multiple nodes, and the higher-priority transaction was applied by a replication thread, which usually has precedence over local events when a conflict is detected.
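If you run a similar multi-writer setup and want to gauge how often local transactions are aborted in favor of replicated, higher-priority ones, the wsrep_local_bf_aborts status counter is worth watching; for example:

mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts';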
As you may have suspected, the TRUNCATE was not found in the binary log of the source node; since the statement did not complete, it was not logged. Only the subsequent DELETE statement was there.
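This is something you can verify directly with mysqlbinlog; the file name below is only a placeholder for the binary log that covers the incident:

$ mysqlbinlog --verbose mysql-bin.000123 | grep -iE 'truncate|Delete_rows'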
And as for the binary logs of the other two nodes, they included both the TRUNCATE as well as the DELETEs covering all rows in the target table. Those rows no longer existed on these nodes, so the DELETE events could not be applied there, and that is what caused the data inconsistency that followed.
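A quick, if blunt, sanity check after an incident like this is to compare the row counts of the affected table on every node; a more thorough option is a checksum tool such as pt-table-checksum from Percona Toolkit:

mysql> SELECT COUNT(*) FROM mydb.target;  -- run on each node and compare the results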