Corruption Bug: Common Criteria and SA.

When I talk about recovering from database corruption, I mention that occasionally corruption can be caused by bugs in SQL Server. Such things are very rare, but they do happen. Case in point – MS just recently updated an article stating such. KB2888996 details a specific set of circumstances where actually running CHECKDB can cause corruption.

Scary, huh?

I thought it would be an interesting exercise to reproduce those results, and then test the patch for myself. Feel free to follow along.

WARNING: Once again, we are causing intentional damage, here. Do this only on a test system where you can afford to simply reinstall SQL Server if everything goes to hell. And before you do anything else, take a backup of your master database. Do it now. I’ll wait. <muzak plays…>

Ready? Ok, here we go…

  • Make sure you’re running an affected version of SQL Server. It should be Enterprise, Standard or Developer edition of SQL Server 2014, down to 2008 R2. See the KB article mentioned above for the specifics.  Note that you’ll need to be running a version earlier than the hotfixes or cumulative updates that they mention in the fix.
  • Enable common compliance criteria on your SQL Server instance using:
EXEC sp_configure 'common criteria compliance enabled',1;
GO
RECONFIGURE WITH OVERRIDE;
GO
  • Install the auditing trace procedures, available on Microsoft’s SQL deployment page.
  • Restart the SQL Server service. Common Criteria is one of those rare cases where just running RECONFIGURE WITH OVERRIDE won’t actually cause the change to take effect.
  • Now log into the server as SA, and run:
DBCC CHECKDB(master) WITH NO_INFOMSGS, DATA_PURITY;

Remember that out of the box, data purity checks on master and model are not enabled. That’s why I’m including it in the command above, even though it’s a default from 2005 going forward.

In my case, when I ran CHECKDB, I got the following:

Msg 2570, Level 16, State 2, Line 1
Page (1:85), slot 26 in object ID 58, index ID 1, partition ID 281474980511744, alloc unit ID 72057594037993472 (type "In-row data"). Column "name" value is out of range for data type "nvarchar".  Update column to a legal value.
CHECKDB found 0 allocation errors and 1 consistency errors in table 'sys.sysbinobjs' (object ID 58).
CHECKDB found 0 allocation errors and 1 consistency errors in database 'master'.

So there is our data purity error. These are kind of a mixed bag. You can’t run a repair statement to fix them, but then again, you don’t risk losing any more data than you already have.  What it means is that SQL Server cannot interpret the data within a column, because the data isn’t valid or within the correct range for the data type of that column. In this case, we have an nvarchar column with a single byte of data in it.

Now you’ve probably guessed by this point, and if you haven’t then I’ll make it clear: the corruption in this case was NOT caused by CHECKDB. It has to do with some peculiarity in the security setup here. Somehow something is being logged in such a way that is bypassing the data type check, and inserting a record into this table that doesn’t fit the data type. Let’s take a look at this using DBCC PAGE.  If you’re not familiar with DBCC PAGE, please see my previous post on it, DBCC PAGE to the Rescue.

DBCC TRACEON(3604); /* Enables output to console */
GO
DBCC PAGE (1,1,85,3); /* dbid 1, file 1, page 85, style 3 (show me everything) */
GO

After bringing up the page, we do a search for slot 26, and find:

Slot 26 Column 4 Offset 0x2f Length 1 Length (physical) 1
name = INVALID COLUMN VALUE

Just for comparison, here’s what a normal, single-character value of that type would look like:

Slot 0 Column 1 Offset 0xb Length 2 Length (physical) 2
col1 = a

See the problem? Because this is an nvarchar type, the minimum length of a value is going to be 2 bytes. One for the data, and one for the unicode data that goes with it. (See here for the details on what I mean regarding the data types.) So, the value stored is too small for SQL Server to attempt to interpret as a valid nvarchar.

In the demos I do, I fix a data purity error with an update statement. Unfortunately, you can’t directly access this particular table. It’s a hidden system table, so there’s no way to even select from it, much less update it.

So what *can* we do? Restore master from a backup. Well, technically we could just ignore the error by running CHECKDB against master with PHYSICAL_ONLY, which would just avoid data purity checking. You can also disable data purity checking on master by enabling trace flag 2566.  Keep in mind, though, that disabling data purity checks means that if any new data purity issues crop up, you’re not going to know. I’m not in favor of this option, so let’s go to that backup we took at the beginning of this exercise.

Restoring master is understandably different than restoring any other database, and here’s how you have to do it. You need to stop the SQL Server service, then start it from the command prompt in single user mode, using the -m switch. In my case, I have multiple instances of SQL running on this particular machine, so I also need to specify which instance using the -s switch. Like so:

X:\<path to sqlservr.exe>\sqlservr.exe -m -s mssqlserver

Note that specifying “mssqlserver” will start the default instance, which is where we’re testing this. In a separate command window, I logged into my instance using sqlcmd.exe and restored master with a plain old restore command:

RESTORE DATABASE [master] FROM DISK = 'X:\<path to file>\master_full.bak' WITH REPLACE;

When the restore of master is complete, SQL Server will automatically shut down. I was already at the command line, so I restarted both SQL and the SQL Server Agent with net start…

Net start mssqlserver
Net start sqlserveragent

Logging in with my user account, (not sa), and running CHECKDB as above, shows that the database is now clean.

So to avoid this, should you fit the criteria at the beginning of this article and on Microsoft’s KB article, please be sure you download and install the appropriate update for your instance of SQL Server.

Keep checking those databases for corruption, folks.

Thanks for reading.

-David.

Insidious Corruption

WARNING: The following post involves intentionally doing damage to a *system* database. The usual caveats about having good backups, not attempting to do this on an important system, etc… apply here. You may very well need to remove and reinstall SQL Server if you attempt this. In short – “We’re professionals. Do not try this at home.” 

Today’s lesson is about equality. Treat all databases with equal kindness. For example, I see a lot of people not doing basic maintenance like regular backups and CHECKDB on their system databases. Maybe they think that since those databases don’t change much, there’s no reason to protect them the way that they protect user data.

I am here today to dispel that notion.  If you’ve attended one of my sessions on database corruption, you’ll remember that I state that database corruption is just a fact of life. It happens, and it can happen when and where you least expect it. Example: in the [model] database.

The [model] database is essentially SQL Server’s template. When a new database is created, first, a copy of [model] is made. Then, any further configurations or tweaks that the user specified in either the database creation dialog boxes, or the CREATE DATABASE command, are applied to the new database. The advantage of this is that anything you want to apply to any newly created databases, you can apply to the [model] database. Any objects you want in all new databases, any default file settings like size and growth, or something SIMPLE like the database’s recovery model. (See what I did there? :-))

Let me repeat that real quick – *anything* that exists in [model] will be copied to new databases.

Including corrupted pages.

*shiver*

Sooo… Let’s do one! :-)

First, take a backup of [model] and set it aside, just in case. Then we create an object we can copy to new databases.

/* Create a table that can get propagated to new databases. */
CREATE TABLE Trash (datacol CHAR(8000) NOT NULL); /* if you're gonna go, go big. */
GO
/* Drop in a row. */
INSERT INTO Trash(datacol)
SELECT REPLICATE('a',7500);
GO

Now that we have an object, we need to corrupt it. I chose to use the method described here.  You’ll need the ID of the page you want to corrupt, which you can get with DBCC IND:

/* Get the page ID of the one lone page in the Trash table. */
DBCC IND('model','Trash',0);
GO

Now, let’s assume we’re not checking [model] for corruption, and so it goes undetected.  What happens when I create a new database?

/* Create new database and check for corruption. */
CREATE DATABASE TestMe;
GO
DBCC CHECKDB(TestMe) WITH NO_INFOMSGS;
GO

Survey says…!

Msg 8939, Level 16, State 98, Line 103
Table error: Object ID 581577110, index ID 0, partition ID 72057594039107584, alloc unit ID 72057594043695104 (type In-row data), page (1:312). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 133129 and -4.
Msg 8928, Level 16, State 1, Line 103
Object ID 581577110, index ID 0, partition ID 72057594039107584, alloc unit ID 72057594043695104 (type In-row data): Page (1:312) could not be processed. See other errors for details.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'Trash' (object ID 581577110).
CHECKDB found 0 allocation errors and 2 consistency errors in database 'TestMe'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (TestMe).

So now I have a corruption that will appear in every database I create. Keep in mind that this need not show up in a user created object. If any part of the [model] database becomes corrupt, and we’re not checking [model] with CHECKDB, then every database created will also be corrupt, and maybe unusable.

While we’re on the subject, here’s something even worse – while playing around with this concept, I noticed that in SQL Server 2014, DATA_PURITY checks are still *off* for [master] and [model] by default. So I created another test object in [model] and caused a data purity error in it. When I ran CHECKDB on [model], without specifically adding DATA_PURITY as an option, it came back clean. When I created a new database, I ran CHECKDB on it, and lo and behold – it threw a data purity error. So a corruption that was in model was not detected, and still propagated to a newly created database.

Ouch.

Have you hugged your [msdb], [model] and [master] databases today? If not, at least make sure you’re properly checking it for corruption using CHECKDB with DATA_PURITY. Your databases will thank you.

Thanks for reading.

-David.

DBA 911 – Follow up Q & A

Thanks again to everyone who attended my web session, DBA 911 – Database Corruption, on January 7th. I had a blast spending time with you all. We must do that again, soon. :-)

In this post, I’m going to tackle some of the follow-up questions that we weren’t able to get to in the session. Some questions will require more explanation than others, so I’m going to handle them in separate posts.

First, let’s handle two of the most frequently asked questions:

Where can I get a script to automate running CHECKDB on my databases?

The best scripts for that, bar none, are Ola Hallengren’s. Check out http://ola.hallengren.com

Where can I get the slides and demo code for your presentation?

The slides and demo code are on my resource page, above. (https://dmmaxwell.wordpress.com/resources/)

Now, let’s get into the feedback questions – I’m going to just copy them out of the feedback form and paste them here. I’m going to make an assumption or two here and there, but if I don’t completely answer your question, please feel free to leave a comment below. Some of the questions answered in the presentation will be left alone, but some answers bear repeating, so I will do that here. Also, there are some questions that I want to go into a LOT more detail on, so I’m going to save a few for future posts, where I can include some demo code and explain more.

Here we go! :-)

Is the space taken in tempdb by the dbcc checks cleared after the command is run or does the dba have to manually clear the data?

Space used within TempDB for CHECKDB is cleared automatically by SQL Server. The DBA shouldn’t usually have to manually clear objects out of TempDB. Now, if the problem is that running CHECKDB causes you to run out of space for TempDB, then you need more disk space for TempDB, period. I also recommend pre-sizing TempDB, if at all possible, to the max size you need.

Can we add log file with suspect database if it is suspected due to log file full.

A full log file, meaning a log file that cannot grow, does not put a database into SUSPECT mode. SUSPECT means that SQL Server is not able to read one or more database files to begin with at all, so you wouldn’t be able to add a log file anyway. A suspect database must either be put into EMERGENCY mode in order to correct anything, or restored from backup.

A full log file is a completely different problem. If your log file fills up on an online database, then yes you can add another log file for some temporary space. I would recommend finding out what caused the log file to grow, and either fixing that process or sizing the log file appropriately, then deleting the additional log file.  There’s no good reason to maintain a database with more than one log file.

monitoring the severity alerts > 20, we are monitoring these alerts: 823,824,825 and 832?

For 823, 824 and 832, yes, since those are high-severity alerts. However, 825 is logged quietly as a Severity 10 “Informational” message, so you won’t catch it if you are only monitoring Severity 21 and up. One other reason to monitor specific error numbers outside of just monitoring the severity levels is that you can use custom notification text for each error. For the 823 alert on my servers, I include something like, “This indicates that SQL Server was unable to read an 8k page off of the disk, and that the database is damaged. Please contact the DBA on call to complete a consistency check of this database immediately.”  That way, if I’m not around, someone will see that and know it’s critical – and get a hold of the on call DBA.

how much extra overhead is there with having checksum on for a database

Little to almost none. There is a slight CPU overhead to calculating the checksum. Performance wise, however, most servers are IO-constrained, rather than CPU-constrained. For probably 99.9999% of systems out there, enabling page checksums just isn’t going to be noticed.

I may have missed this but when should you run the datapurity command?

The DATA_PURITY option should be specified when the data purity flag is off for that database.  An example would be a database that is created in SQL Server 2000, then upgraded to 2005 or a later version. In SQL Server 2000, there were no data purity checks. In SQL 2005 and up, the data purity flag was introduced.  To find the data purity flag state, you can use the DBCC DBINFO command, which is included in my demo scripts. Once you have specified DATA_PURITY for a database, that flag will be turned on, so you will get data purity checks automatically going forward, without having to specify that option.

Now, if you’re trying to save time or TempDB space by turning DATA_PURITY off, then you want to specify PHYSICAL_ONLY. Be aware, when you do that, you’re deliberately choosing not to check for logical corruption, but physical readability only. It’s the old speed vs. accuracy trade-off.  I’m not saying one or the other is better, just that you need to understand your options and have reasons for choosing them.

What time will drinks be served?

May I see some ID, please?

What is the licensing implication when you run DBCC on non production environment?

None – you run it wherever you want to. You can copy a production database to a Developer edition server and run CHECKDB on it with no licensing impact.

in What situations is used DBCC UPDATEUSAGE ?

UPDATEUSAGE is a command that verifies allocated space in the database is reported correctly. This has to do with the database catalog views, rather than with corruption or with CHECKDB, in specific. If you’re seeing odd values from sp_spaceused or if  database objects are frequently changed with CREATE, ALTER or DROP statements, then you may want to run this. This does not, however, indicate a corruption issue. I recommend reading the BOL entry for UPDATEUSAGE at http://technet.microsoft.com/en-us/library/ms188414.aspx

Do we need to run CHECKDB after a restore or repair?

If you have the time to run it after a restore, then yes, I would definitely do so. And you should definitely run CHECKDB after doing a repair, especially if you have run with REPAIR_ALLOW_DATA_LOSS. You never want to say to your boss, “Well, I think the database is OK, now.”  You want to be sure. If you have downloaded my demo scripts, you’ll see that I take additional backups both before and after every repair, and run CHECKDB again after every repair. I show this because I think it’s a good habit to have. Your situation may vary, but I would consider this a best practice.  Not only that, but sometimes you need to run multiple repairs, and making it a habit to run CHECKDB after doing one, will allow you to catch anything else that has gone wrong. For an example, see my post on Emergency Mode Repair.

That’s all for now. :-) If I have not answered your question, remember that I’m going to be handling a few other questions separately, in separate posts.  Once I’m done, if I still haven’t answered your question or you have additional questions, feel free to leave a comment, email me, or send me a question on Twitter.   I’m always happy to help.

Thanks for reading.

-David.

Quickie: Thanks and See You Soon

Massive thanks to Mike Brumley (t) and the SQL PASS Fundamentals VC for allowing me to present my session, DBA 911 – Database Corruption. I had a ton of fun, and the Q&A was awesome. I’m very pleased to see the level of thought being put into the questions – you’re all thinking hard about this stuff and it’s gratifying to know I’m in the same profession as you.

I have my list of follow up questions, both from emails and from the session yesterday. I will be posting the answers here, and emailing everyone who asked a question to let you know the post is live.

Once again, thank you. I’m humbled by having so many people attend my session. I look forward to meeting you at a SQL Saturday or other event, soon.

Thanks for reading.

-David.

SQL Saturday #242 Follow Ups and Feedback Review

SQL Saturday #242 is over, and it was one of the best times I’ve had in a while. The sessions I attended were great, and I got to spend time with some really cool people. This community is one I’m very proud to be a part of.

As far as my own session went, feedback was overwhelmingly positive. Thank you! One non-positive comment I received was in regards to putting more information on the slides. I try to keep the slides clean, but I put much more info in the notes. So if you have downloaded the slide deck, you should have the notes as well, which should give you the information you need. (Be glad it wasn’t one of my Takahashi / Lessig style presentations, where there’s no text on the slides at all. Just images.)

Also, I feel like the point of attending a session is to listen to the presenter, rather than read what’s on the slides. It’s a deliberate choice on my part to keep the slides simple, but to embellish on the topic by going into detail verbally, then inviting discussion. I assume people are attending sessions because they want to focus on the speaker, so that’s how I build my presentations. Otherwise, why not just download the slides and skip the session? :-)

I got a couple of questions I didn’t have immediate answers for during the presentation, so here’s my follow up on those.

1. If you run CHECKDB with TABLERESULTS and ALL_ERRORMSGS, do you get more than the first 1000 error messages in SSMS?

According to Books Online, no. SSMS will still only return the first 1000 error messages. However, I ran a quick test on it myself, and I’m not so sure. I’m going to ask a couple of people to look at my POC code just to make sure I didn’t do something odd, and let you know once I get an answer.

2. Does dbi_DBCCLastKnownGood get updated if you run CHECKDB with PHYSICAL_ONLY?

Yes. The only options I could run CHECKDB with and not update LastKnownGood was ESTIMATEONLY, which makes perfect sense. However, I’ll also direct you to this article by Erin Stellato, who goes into more detail on what checks will update LastKnownGood.

And that’s a wrap on SQL Saturday #242, easily one of my favorite SQL Saturdays to date. I’m now looking forward to SQL Saturday #250 in Pittsburgh PA. I hope to see you there.

Thanks for reading.

-David.

Emergency Mode Repair Playtime

I have a little database that I don’t back up, nor do I do any maintenance on it. I don’t even check it for corruption. This database is what I would consider a “throw-away” database. It would be faster to have the application recreate the DB from scratch than it would be to restore it from backup, and it doesn’t hold any permanent information. In fact, the only time it is of any use to me is when I’m actively using the application, and information is only retained for an hour. So it has really limited usage.

With that in mind, imagine my delight when this database became corrupted. “Yay! A corrupt DB to play with!” SSMS showed the database status as SUSPECT. Like any good DBA, the first thing I did was take a backup. You know how to do that, right?

BACKUP DATABASE igrep TO 'F:\igrep_corrupt.bak' WITH CONTINUE_AFTER_ERROR;
GO

Having saved off a copy of that DB, I placed the database in emergency mode, so that I could run a full consistency check with this:

ALTER DATABASE igrep
SET EMERGENCY;
GO

DBCC CHECKDB(igrep)
WITH NO_INFOMSGS;
GO

The final tally was 5 allocation errors, and 39 consistency errors. In other words, this DB was a mess. Since there was no useful data to lose at this point, I decided to go ahead and run a repair to see what it would do.

DBCC CHECKDB(igrep, repair_allow_data_loss)
WITH NO_INFOMSGS;
GO
Msg 7919, Level 16, State 3, Line 1
Repair statement not processed. Database needs to be in single user mode.

Oops. Forgot that when in EMERGENCY mode, you’re in restricted to syadmins only, but not in SINGLE_USER.

ALTER DATABASE igrep
SET SINGLE_USER;
GO

Hmmm… no error messages on that one. Just for grins and giggles…

ALTER DATABASE igrep
SET ONLINE;
GO
Msg 926, Level 14, State 1, Line 1
Database 'igrep' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.
Msg 5069, Level 16, State 1, Line 1
ALTER DATABASE statement failed.
Msg 3414, Level 21, State 2, Line 1
An error occurred during recovery, preventing the database 'igrep' (database ID 10) from restarting.
Diagnose the recovery errors and fix them, or restore from a known good backup. If errors are not corrected or expected, contact Technical Support.

Ok, wasn’t expecting success anyway. :-) Back to emergency mode. This time, we’ll run the repair as well.

ALTER DATABASE igrep
SET EMERGENCY;
GO

DBCC CHECKDB(igrep, repair_allow_data_loss)
WITH NO_INFOMSGS;
GO

Tick tock, tick tock, tick tock, DING! About 10 minutes later… here’s what DBCC reported back:

CHECKDB found 5 allocation errors and 70 consistency errors in database 'igrep'.
CHECKDB fixed 2 allocation errors and 70 consistency errors in database 'igrep'.

Woot! We’re complete! In order to fix the consistency errors, the following things happened:
* Some pages were deallocated (buh-bye, data)
* Some extents were marked as allocated to specific objects.
* Specific clustered and non-clustered indexes were rebuilt.

The other interesting thing is, note that the number of consistency errors went *up* when the allocation errors were fixed. It’s likely that…

Hold up… All the consistency errors were fixed, but only two of five allocation errors? Let’s scroll up a bit. Near the top, we find the following:

Msg 8905, Level 16, State 1, Line 2
Extent (1:682840) in database ID 10 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
Could not repair this error.
Msg 8905, Level 16, State 1, Line 2
Extent (1:682880) in database ID 10 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
Repairing this error requires other errors to be corrected first.
Msg 8905, Level 16, State 1, Line 2
Extent (1:682952) in database ID 10 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
Repairing this error requires other errors to be corrected first.

Interestingly enough, SQL has marked the database as ONLINE. I thought that would denote a successful repair, but DBCC says once again that repair_allow_data_loss is the minimum repair level for this database. Running CHECKDB again, we get three of these:

Msg 8905, Level 16, State 1, Line 1
Extent (1:682952) in database ID 10 is marked allocated in the GAM, but no SGAM or IAM has allocated it.

Let’s run the same repair again. This time, we get this:

CHECKDB found 3 allocation errors and 20 consistency errors in database 'igrep'.
CHECKDB fixed 3 allocation errors and 20 consistency errors in database 'igrep'.

So fixing the allocation errors caused more consistency errors, again. Not surprising, when you think about it. Allocating and deallocating pages, especially when foreign keys are in play, is bound to cause some structural inconsistency. That’s why it’s paramount that after running a repair of any kind, you run another CHECKDB to see if the repair caused any other damage. Speaking of which – after this last repair, CHECKDB finished with no errors, and the application’s service started fine. The applicatoin itself started working again, and all was well. Keep in mind that this was the *least* likely outcome. I didn’t do a good job, or even do the right things, here – I got *extremely* lucky.

So what are the big ideas we take away from all this?

  1. Backups are your friend. Take them, test them, and store them safely.
  2. By default, try a restore before you try repair. The former is clean, the latter can get *really* messy.
  3. If you don’t have a backup, take one with CONTINUE_AFTER_ERROR before doing *anything* else.
  4. Emergency mode can bring up otherwise inaccessible databases, sometimes. Remember that to do a repair, though, you’ll still have to set it to SINGLE_USER mode
  5. Just because repair finished, doesn’t mean you’re good to go. Check and re-check your database, both before and after repair.

Anything else I missed?

Thanks,
-David.