Last week my team pushed our monthly release, and everything looked good after the launch. Then at about 10PM our site monitoring alerts started to fire as our page load times soared, and this continued until early morning. There were no unusual errors in our logs and our traffic load was normal. What was clear was that our database had been running quite hot on CPU.
As we continued to investigate the following day, a Friday, all systems looked healthy and we were quite perplexed. There was no recurrence of the behavior on Friday or Saturday night, but on Sunday it happened again: a very abrupt rise in page load times, followed by an equally sudden drop at around 9AM. All of this was happening during what we thought was our lull in traffic.
Interesting to us, though not so topical to this post: after studying our logs going back several weeks, we learned that 10PM to 9AM actually is our traffic peak, so it began to make sense why this was occurring during that window. Knowing that the problem was manifesting itself as a database CPU issue, we ran the following query:
select * from master..sysprocesses sp cross apply fn_get_sql(sql_handle) order by cpu desc
This query returns a row for every process, in descending order of CPU load, along with the exact SQL statement it is executing. It will also tell you whether a process is blocked by another process and, if so, which process and what resource it is waiting on. Having run this query many times under this load, we kept seeing a pattern in the results, like this:
spid | kpid | blocked | waittype | waittime | lastwaittype | waitresource | dbid
254 | 9680 | 0 | 0x0034 | 2 | PAGELATCH_EX | 2:1:103 | 9
128 | 10288 | 163 | 0x0032 | 11 | PAGELATCH_SH | 2:1:103 | 9
129 | 0 | 0 | 0x0000 | 0 | MISCELLANEOUS | | 9
116 | 10152 | 0 | 0x0000 | 0 | SOS_SCHEDULER_YIELD | | 9
169 | 0 | 0 | 0x0000 | 0 | MISCELLANEOUS | | 1
52 | 0 | 0 | 0x0000 | 0 | MISCELLANEOUS | | 4
148 | 9468 | 163 | 0x0034 | 9 | PAGELATCH_EX | 2:1:103 | 11
90 | 0 | 0 | 0x0000 | 0 | MISCELLANEOUS | | 9
274 | 6612 | 0 | 0x0034 | 15 | PAGELATCH_EX | 2:1:103 | 9
131 | 8568 | 163 | 0x0032 | 4 | PAGELATCH_SH | 2:1:103 | 9
120 | 0 | 0 | 0x0000 | 0 | MISCELLANEOUS | | 11
105 | 8728 | 163 | 0x0034 | 11 | PAGELATCH_EX | 2:1:103 | 9
247 | 5660 | 0 | 0x0000 | 0 | PAGELATCH_SH | 2:1:103 | 9
216 | 0 | 0 | 0x0000 | 0 | MISCELLANEOUS | | 1
163 | 8108 | 0 | 0x0000 | 0 | PAGELATCH_SH | 2:1:103 | 9
201 | 7004 | 0 | 0x0034 | 1 | PAGELATCH_EX | 2:1:103 | 9
267 | 0 | 0 | 0x0000 | 0 | MISCELLANEOUS | | 6
117 | 11124 | 163 | 0x0034 | 10 | PAGELATCH_EX | 2:1:103 | 11
205 | 10016 | 0 | 0x0034 | 1 | PAGELATCH_EX | 2:1:103 | 9
210 | 10108 | 0 | 0x0034 | 14 | PAGELATCH_EX | 2:1:103 | 9
226 | 0 | 0 | 0x0000 | 0 | MISCELLANEOUS | | 10
164 | 0 | 0 | 0x0000 | 0 | MISCELLANEOUS | | 1
175 | 0 | 0 | 0x0000 | 0 | MISCELLANEOUS | | 11
98 | 0 | 0 | 0x0000 | 0 | MISCELLANEOUS | | 7
What this tells me is that there are several queries lined up behind the query executing under spid 163, waiting for 163 to release its shared latch on the page so that they can acquire their own latches, most of them exclusive. To quote the SQLCAT team, “Latches are lightweight synchronization primitives that are used by the SQL Server engine to guarantee consistency of in-memory structures including; index, data pages and internal structures such as non-leaf pages in a B-Tree.” See this white paper for an in-depth discussion of latch contention. So the next question is: what is 2:1:103, the resource everyone is fighting for?
The first number is the database id, the second is the file id within that database, and the third is the page number. Database 2 is tempdb. If you don’t believe me, just execute:
select db_name(2)
It says “tempdb”, doesn’t it? I just knew that it would.
To track down the actual object that owns that page, execute:
DBCC TRACEON(3604)
DBCC PAGE (2, 1, 103)
The key item of interest in the results is the Metadata: ObjectId with a value of 75. Then execute:
select object_name(75)
and the object is sysmultiobjrefs.
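One caveat: OBJECT_NAME resolves the id against whatever database you are currently connected to, so if you aren’t already in the database named by the wait resource (tempdb here), switch to it first:
use tempdb
select object_name(75)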
So what the heck am I supposed to do with this? If this was blocking on a user table that was part of my application, it seems I would have something to work with. But tempdb? Really? Ugh.
After recovering from a spiral of primal self-loathing and despair, I turn to Google. I’m sure lots of people have seen contention on 2:1:103, and it will be no time at all before a clear explanation and remedy are revealed. Well, yes, there are several people struggling with contention on this object, but alas, there are no clear explanations or answers. I find several posts talking about 2:1:1 and 2:2:3 (see this KB article for those). But those struggling with 2:1:103 seem to be told to turn on trace flag -T1118, only to complain later that the flag is not doing anything. Or they are told to add more files, to which they reply that they already have plenty of files. And by the way, -T1118 was already turned on in our database, and we have plenty of files too.
Well, finally, I come across this post from someone with this exact issue. The poster states that they contacted Microsoft, and the explanation Microsoft gave actually makes sense: the issue comes from contention caused by DDL operations in tempdb, in other words, the creating and dropping of temp tables. Well, we just so happen to do a lot of that, and we added more of it in this last release.
I then came upon this white paper, and while it does not specifically refer to 2:1:103, it provides an in-depth description of the inner workings of tempdb and how to identify and troubleshoot the different forms of contention there. Of particular interest was its explanation of contention in DML and DDL operations.
Contention in DML (SELECT, UPDATE, INSERT) operations shows up as contention on 2:1:1 or 2:1:3. It can be caused either by “internal” objects, such as the objects the SQL query engine creates when executing certain hash joins, sorts and groupings, or by user objects like temp tables and temp table variables. At the root of the issue is SQL Server’s struggle to allocate and deallocate pages and to update the allocation metadata that tracks those pages. In addition to suggesting that those who suffer from this issue analyze their query plans to make more efficient use of tempdb, the paper suggests turning on -T1118 and adding data files to tempdb.
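As an aside, on SQL Server 2005 and later you can also watch this kind of contention live through sys.dm_os_waiting_tasks. Here is a minimal sketch that lists every task currently waiting on a page latch in tempdb (database id 2):
select session_id, blocking_session_id, wait_type, wait_duration_ms, resource_description
from sys.dm_os_waiting_tasks
where wait_type like 'PAGELATCH%'
and resource_description like '2:%'
order by wait_duration_ms desc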
My team’s plight seemed to fall under DDL contention. You can determine this by checking whether the object under contention is a system catalog table listed in sysobjects. Well, according to:
select * from sys.objects where object_id=75
sysmultiobjrefs is a system table. DDL contention is not caused by internal objects; it deals strictly with the creation and destruction of user objects, namely temp tables and temp table variables.
In addition to checking the wait resource, there are two performance counters you can watch to track the real-time creation and destruction of temp tables:
- Temp Tables Creation Rate
- Temp Tables For Destruction
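Both counters live under the “General Statistics” performance object. If you would rather read them from T-SQL than from Performance Monitor, a quick sketch against sys.dm_os_performance_counters:
select counter_name, cntr_value
from sys.dm_os_performance_counters
where object_name like '%General Statistics%'
and counter_name in ('Temp Tables Creation Rate', 'Temp Tables For Destruction')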
You can turn these counters on and run your queries to determine how many temporary objects they create. We were able to identify that 70% of our traffic was issuing a query that now created four temp tables. The advice given here is to:
- Use fewer temp tables and temp table variables.
- Check to see if your use of temp tables takes advantage of temp table caching, which reduces contention. SQL Server will automatically cache created temp tables as long as:
  - Named constraints are not created.
  - Data Definition Language (DDL) statements that affect the table are not run after the temp table has been created, such as the CREATE INDEX or CREATE STATISTICS statements.
  - The temp object is not created by using dynamic SQL, such as: sp_executesql N'create table #t(a int)'.
  - The temp object is created inside another object, such as a stored procedure, trigger, or user-defined function, or is the return table of a user-defined, table-valued function.
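To make those rules concrete, here is a small hypothetical illustration (the procedure and table names are made up). The first pattern qualifies for caching, the second does not:
-- Qualifies for caching: the temp table is created inside a stored procedure,
-- has no named constraints, and no DDL is run against it after creation.
create procedure dbo.usp_CachedTempDemo
as
begin
    create table #work (Id int, Val varchar(20))
    insert into #work (Id, Val) values (1, 'cached')
    select Id, Val from #work
end
go
-- Does not qualify: the temp table is created through dynamic SQL, so its
-- tempdb metadata entries are created and dropped on every single execution.
exec sp_executesql N'create table #t (a int); insert #t values (1); select a from #t'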
Well, we use dynamic SQL, so none of our temp tables get cached, which probably means 2:1:103 is being updated on every creation.
Our “quick fix” was to take the query that 70% of our traffic hits, remove its use of temp tables, and trim it down to something highly optimized for exactly what that traffic was doing. Fortunately, that turned out to be very simple. Moving forward we will need to seriously curb our use of temp tables, or wrap them in SQL functions or stored procedures so that they can be cached.
I am intentionally putting 2:1:103 in the title of this post in the hope that someone will find it more quickly than it took me to track all of this down.
UPDATE: Michael J. Swart (@MJSwart) wrote a great post on this exact same problem here. I strongly encourage you to read about his run-in with this awful, awful problem.