The last few days were very busy dealing with a problem on a customer’s MySQL server. My company offers fully managed hosting services, so it was up to us to investigate. I’ll try to explain some of the checks I did; maybe this will give you some ideas when you have to deal with MySQL troubleshooting yourself.
Last Friday, our internal monitoring tool reported a significant rise in the MySQL status variable ‘aborted_connects’. This was confirmed at the same time by the customer, who said he was getting many errors about the « max_user_connections » limit being reached. His application is written in PHP (so it produces many connections to MySQL, each doing little work) and is used by hundreds of thousands of people every day.
In such a case, the first thought is « something is blocking and the processlist is growing », but we were really surprised to see that the processlist was almost empty: fewer than 10 threads working and around 50 sessions open, which is normal behavior. On this server, the limits are the following:
max_user_connections = 200
max_connections = 510
So, to sum up the information so far:
- aborted_connects is increasing
- on the client side, there are errors related to « max_user_connections » …
- … but processlist is empty (or seems to be).
We now check two things.
First: can any MySQL variable explain this? We built this configuration ourselves from scratch, so the variables should normally already be at their optimal values. But you can always forget something, so let’s check:
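One quick way to pull the usual suspects in a single shot is mysqladmin (a sketch; the exact list of variables to eyeball is up to you):
mysqladmin variables | grep -wE 'wait_timeout|connect_timeout|max_connections|max_user_connections'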
wait_timeout 600
Quite high, isn’t it? Maybe we have many sleeping connections at the same time? Let’s double-check:
while sleep 0.1 ; do mysqladmin pr | fgrep Sleep | wc -l ; done
15
21
23
20
30
35
20
[...]
Well, nothing weird in here. Moreover, there are no sleeping connections with Time > 2 seconds. We deliberately set the sleep to 100 ms (and tested smaller values too) to catch transient spikes: the customer reports « the processlist is empty », but maybe it is empty 99% of the time and floods with queries for just a few milliseconds? But no, wrong assumption: the processlist is as empty as the Monaco soccer team’s stadium (sorry for the reference).
It was time to gather more data, efficiently. For this purpose, we used pt-summary and pt-stalk from Percona Toolkit (formerly known as Maatkit).
The first script gives you an overview of the server: hardware, software, mounted filesystems, etc. I already knew 90% of it because I was the one who installed this server, but maybe it would show something useful.
For example, there is a table of connections per remote IP address. This MySQL server is used by approximately 30 web servers on a local network, and the table did not show anything weird: all remote servers had the same number of connections, and the total was fine. There is room for some minor tweaks, but nothing related to the issue we are facing.
Then we tried pt-stalk. This tool gathers a lot of MySQL-related data: mutexes, processlist, InnoDB status, etc. Simple but efficient (you do not have to grab each piece of data by hand; a single command line does it all). To start using it:
wget percona.com/get/pt-stalk
chmod u+x pt-stalk
/tmp/pt-stalk --threshold 2 --cycles 1 --dest=/tmp/ptinfos
Well, not much information here, for a good reason: no suspicious mutexes, everything seems normal… and the processlist is empty.
Then, we checked in the documentation when aborted_connects actually gets incremented. The relevant paragraph follows:
If a client is unable even to connect, the server increments the Aborted_connects status variable. Unsuccessful connection attempts can occur for the following reasons:
- A client does not have privileges to connect to a database.
- A client uses an incorrect password.
- A connection packet does not contain the right information.
- It takes more than connect_timeout seconds to get a connect packet. See Section 5.1.3, “Server System Variables”.
Let’s check this out:
connect_timeout = 30
The value is quite high, but I really doubt a connect packet needs more than half a minute to arrive.
But this gives an idea: can it be hardware related?
Let’s check the network first!
RX packets:2378856458 errors:12561 dropped:36335 overruns:0 frame:12561
We have some errors, but let’s not jump to conclusions. The uptime of this server is quite high, and we had run some tcpdump captures before (which I know from experience can trigger such errors). What matters is not the raw number, but how it changes over time. I wrote a small script with ‘grep’ and ‘awk’ to watch the variations but, sorry, I did not keep it; maybe I’ll update this post with it in the future 😉 (a rough equivalent is sketched below). The conclusion was: only 1 error and 2 dropped packets in 120 seconds. That cannot explain hundreds of refused connections every second! So this is not network related; let’s move on.
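For reference, here is a minimal sketch of such a script. It is not the original one: it simply samples the ifconfig error/drop counters and prints the deltas (the interface name and the interval are assumptions to adapt):
#!/bin/bash
# Print how the eth0 RX error/drop counters move every 10 seconds.
# A counter that keeps growing points to an ongoing network problem;
# a static one only reflects old history (high uptime, past tcpdump runs, ...).
IFACE=eth0
INTERVAL=10
get_counter() {   # $1 = field name: errors or dropped
    ifconfig "$IFACE" | grep 'RX packets' | awk -F"$1:" '{print $2}' | awk '{print $1}'
}
prev_err=$(get_counter errors)
prev_drop=$(get_counter dropped)
while sleep "$INTERVAL" ; do
    err=$(get_counter errors)
    drop=$(get_counter dropped)
    echo "$(date '+%H:%M:%S')  +$((err - prev_err)) errors  +$((drop - prev_drop)) dropped"
    prev_err=$err
    prev_drop=$drop
done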
We then focus on this max_user_connections error: can it be thrown even if there are not really that many connections? Let’s read the documentation:
For the MAX_USER_CONNECTIONS limit, an edge case can occur if the account currently has open the maximum number of connections permitted to it: A disconnect followed quickly by a connect can result in an error (ER_TOO_MANY_USER_CONNECTIONS or ER_USER_LIMIT_REACHED) if the server has not fully processed the disconnect by the time the connect occurs. When the server finishes disconnect processing, another connection will once more be permitted.
OK, we have something here: maybe we open too many connections at a time, disconnect quickly, and reconnect right after. As the documentation says, MySQL may not recycle the connection quickly enough, and the limit is hit. That’s something! How can we check?
First, let’s compare the network traffic with a week ago, when everything was working fine (insert your favorite monitoring tool here). No luck: traffic was the same. We compared with data from a month ago too, with the same result. Another angle is to measure the connection rate on the MySQL side itself, as sketched below.
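If your monitoring does not graph it already, the Connections status counter in relative mode gives the number of connection attempts per interval (after the first, absolute, line every value is a one-second delta):
mysqladmin -r -i 1 extended-status | grep -w Connections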
In the meantime, this gives us an idea: if we are hitting the limit, can we simply use a higher value for max_user_connections?
A quick aside here:
In the past, whenever a customer hit max_user_connections, it was always because something was blocking: an implicit or explicit lock, storage too slow, etc. Each time, the customer asked me « Can we raise the limit? » and my reply was always the same: if the cause is a lock and the problem happened with max_user_connections=100, it will happen again with max_user_connections=200, and so on. Increasing the value won’t solve anything; we need to fix the underlying problem.
But here, this is a special case: the processlist is empty and CPU usage is low… So let’s try max_user_connections = 4000. To easily check the impact on aborted_connects, just use pt-mext with the following line:
./pt-mext -r -- "mysqladmin ext -c 8 -i 1" |fgrep Aborted
Aborted_clients 214 0 0 0 0 0 0
Aborted_connects 3571112 0 0 0 0 0 0
Bingo: refused connections dropped to 0.
This can be explained: the limit is now much higher, so the edge case described earlier has plenty of headroom and no longer triggers.
So we moved on: the server was now usable by the customer, but I do not like having such a high limit for no good reason.
One thing still bothered me: how can this limit of 4000 be reached (I was still seeing a few aborted_connects) when max_connections is only 510? There is definitely something wrong here. Let’s keep this in mind.
So, what will be our next move? Our current suspicion is that something is connecting far too often to the server. We don’t see any rise in traffic, but opening a connection is only a few bytes, so maybe we simply cannot see it.
For this we used an internal tool that can profile the execution of the customer’s scripts without touching the source code itself. I’m very proud of this tool: it is roughly based on a PHP plugin that gathers data and a remote application that analyzes it.
What did we get? Not much. The application opens only one connection per page view, which is the expected behavior with PHP, and there is an explicit call to mysql_close() in the source code. Nothing weird here. I know the customer told me « I did not change anything », but as Dr House would say: « everybody lies » 🙂 I would add that customers are generally acting in good faith (they think minor updates cannot have big consequences; from my point of view, I know they can!). So the application seems to be out of the scope of investigation. What next?
There is definitely a problem at connection time. We need to dig further by analyzing the network traffic. For this specific task, tcpdump and Wireshark are your best friends.
I used the following line:
tcpdump -n -i eth0 -w capture.pcap -s 16384 port 3306
I gathered data for approximately 30 seconds, while monitoring the evolution of aborted_connects at the same time to make sure I would capture some failures.
Then, on to Wireshark to analyze it.
A normal session looks like this:
CLIENT > SERVER (TCP) SYN
SERVER > CLIENT (TCP) SYN, ACK
CLIENT > SERVER (TCP) ACK
SERVER > CLIENT Server greeting
CLIENT > SERVER Login request
SERVER > CLIENT Response OK
CLIENT > SERVER Request use database
SERVER > CLIENT Response OK
CLIENT > SERVER Request Query (SELECT ...)
SERVER > CLIENT Response
CLIENT > SERVER Request Quit
CLIENT > SERVER (TCP) FIN, ACK
SERVER > CLIENT (TCP) FIN, ACK
CLIENT > SERVER (TCP) ACK
I’m not sure this needs any comment: the usual handshake at the TCP level, some dialogue in the MySQL protocol, then a « bye bye ». Simple.
Now let’s take a look at both captured cases: a successful flow and a refused connection. In Wireshark, you can follow a specific stream with the following filter:
tcp.stream eq 6
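(If you prefer to stay on the command line, the same display filter works with tshark; on older builds the option is -R instead of -Y:)
tshark -r capture.pcap -Y "tcp.stream eq 6"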
Successful flow :
CLIENT > SERVER (TCP) SYN
SERVER > CLIENT (TCP) SYN, ACK
CLIENT > SERVER (TCP) ACK
SERVER > CLIENT Server greeting
CLIENT > SERVER Login request
SERVER > CLIENT Response OK
CLIENT > SERVER Request SLEEP
SERVER > CLIENT Response OK
CLIENT > SERVER (TCP) many TCP segments
CLIENT > SERVER (TCP) FIN, ACK
SERVER > CLIENT (TCP) FIN, ACK
CLIENT > SERVER (TCP) ACK
We need to take a break here: why, oh why, is there a « Request SLEEP »? Looking at the packet’s contents, I cannot see anything related to this command (which is command ID 0, by the way); there is no useful data in there… And why are these TCP segments not decoded as MySQL data? Last but not least, I could not see any « Request Quit », and I know my issue is related to connect/disconnect…
Let’s now look at a failing flow; maybe this will give us some ideas:
CLIENT > SERVER (TCP) SYN
SERVER > CLIENT (TCP) SYN, ACK
CLIENT > SERVER (TCP) ACK
SERVER > CLIENT Server greeting
CLIENT > SERVER Login request
SERVER > CLIENT Response Error 1226 - SQL State 42000 - Error message User 'XXX' has exceeded the 'max_user_connections' resource (current value: 4000)
SERVER > CLIENT (TCP) FIN, ACK
CLIENT > SERVER (TCP) FIN, ACK
SERVER > CLIENT (TCP) ACK
CLIENT > SERVER (TCP) ACK
There is a useful piece of information here: the exact error code MySQL is throwing (1226), because there are several different cases where MySQL talks about max_user_connections or max_connections. Error 1226 is internally ER_USER_LIMIT_REACHED. Reading the source code again, this error is raised when we hit the per-user limit set in the mysql.user privilege table. If I were hitting the global max_user_connections, the error would have been ER_TOO_MANY_USER_CONNECTIONS. The per-user limit is checked first; if it is 0, the global limit is checked instead.
In our case, we do have a per-user limit. Still the same question: how can we hit a limit set to 4000 when the server has max_connections = 510? (Checking which limit applies is easy, see below.)
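To see which limit actually applies, compare the per-user value stored in the mysql.user privilege table with the global variables (this assumes the mysql client has enough privileges to read mysql.user):
mysql -e "SELECT user, host, max_user_connections FROM mysql.user WHERE max_user_connections > 0"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'max%connections'"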
I was particularly puzzled by the missing close. I checked with the customer, and he swore there was a mysql_close() call. After reading a few lines of his code, I fully agreed with him. Could it be a bug in PHP? But then what about the profiling report I made earlier, which claims there _is_ a call to mysql_close()?
I checked with a few lines of code: even if we don’t call mysql_close() explicitly, PHP closes the connection itself when the script ends (a quick test is sketched below).
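Something along these lines is enough to see it: capture the loopback traffic while running a tiny script that never calls mysql_close() (host and credentials are placeholders), and the trace still ends with a clean disconnection.
tcpdump -n -i lo port 3306 &
php -r 'mysql_connect("127.0.0.1", "someuser", "somepass"); mysql_query("SELECT 1");'
kill %1    # stop the capture; the connection is closed at script end even without mysql_close()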
Then, I saw the light. Maybe the connection is opened with the compression flag (MYSQL_COMPRESS), and the data is therefore not decoded correctly by Wireshark?
Bingo, that’s it! One hour later, I had a correctly decoded capture in Wireshark, showing me… nothing more than the expected behavior. Back to square one.
Let’s crawl the Internet a bit more. Maybe someone else has already encountered this?
You cannot imagine how hard it is to search Google for « max_user_connections ». Soooo many people with misconfigured MySQL servers or crappy code doing extreme things…
Then… Well, I had already seen the light a few hours earlier, but this was a real flashbang grenade:
This guy seems to have the exact same symptoms: everything working great, then max_user_connections reached while the processlist is empty.
There is even a patch already written! But this bug impacts stock MySQL too, right? Yet no bug report was filed 🙁
Then it’s time to apply the patch to the source code and recompile. This step is fast: the build system detects that only one file has changed and rebuilds only what is necessary. We only need to copy the « mysqld » binary, because nothing else changed. I checked the md5 checksum of the binary, just to make sure it was not the same as the old one 😉
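Roughly, the sequence looks like this (the source-tree location, the patch file name and the install path are placeholders; the freshly built binary ends up in sql/mysqld on the builds I am used to):
cd /usr/src/mysql-5.x                     # the same source tree the running binary was built from (placeholder path)
patch -p0 < max_user_connections.patch    # hypothetical patch file name; adjust the -p level to how it was generated
make                                      # only the modified file is recompiled, then mysqld is relinked
md5sum sql/mysqld /usr/sbin/mysqld        # the two checksums must differ: we really built something new
cp sql/mysqld /usr/sbin/mysqld.new        # staged copy, to be swapped in during the maintenance window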
We decided with the customer to shut the server down one morning, when traffic was very low. To speed up recovery (buffer pool warm-up), we of course have the LRU dump/restore functionality ON 😉
Conclusions
In a blog post, everything seems easy. But this bug kept me busy for several days. The server was in production, with very important services on it: I could not simply stop it, and I really needed to make it work, at any cost.
Along the way, I was sometimes advised by the folks at Percona. They gave me a fresh look at my ideas, validated my thinking, and reviewed parts of the relevant MySQL source code.
This case perfectly illustrates what open source software is and isn’t. Open source does not mean bug-free. Open source does not mean ‘no problems’. And more importantly:
Consulting experts is not free: they do not live on love and fresh water alone. If you do not have the skills to diagnose a problem yourself, you either need a highly skilled hosting provider, or a support contract (from Oracle or any company providing such services), or you must be prepared to face very long downtime. Sorry, but praying has never solved a MySQL issue so far…
I’m not objective on this matter, but when you choose fully managed hosting, make sure the company you pick is competent in the technology you use. I cannot even imagine how long it would have taken to find the answer to this issue if I were not used to MySQL internals. And as with every business on the Internet, time is money: downtime means losses. Discount hosting has a downside, and now you know what it is.
Hi Olivier,
I don’t usually comment just to say thanks, but this post surely deserves it. I was in exactly the same spot as you. I followed your directions and concluded that we hit the bug. Fortunately, in our case it doesn’t affect production websites directly, only our monitoring system, but it’s a pain anyway.
You may have saved me lots of time and frustration.
Thanks!
Marc