Corruption Bug: Common Criteria and SA

When I talk about recovering from database corruption, I mention that occasionally corruption can be caused by bugs in SQL Server. Such things are very rare, but they do happen. Case in point – Microsoft just recently updated an article stating as much. KB2888996 details a specific set of circumstances where running CHECKDB seemingly causes corruption.

Scary, huh?

I thought it would be an interesting exercise to reproduce those results, and then test the patch for myself. Feel free to follow along.

WARNING: Once again, we are causing intentional damage, here. Do this only on a test system where you can afford to simply reinstall SQL Server if everything goes to hell. And before you do anything else, take a backup of your master database. Do it now. I’ll wait. <muzak plays…>

Ready? Ok, here we go…

  • Make sure you’re running an affected version of SQL Server. It should be the Enterprise, Standard or Developer edition of SQL Server 2008 R2 through 2014; see the KB article mentioned above for the specifics. Note that you’ll need to be running a build earlier than the hotfixes or cumulative updates mentioned in the fix.
  • Enable Common Criteria compliance on your SQL Server instance using:
EXEC sp_configure 'common criteria compliance enabled',1;
GO
RECONFIGURE WITH OVERRIDE;
GO
  • Install the auditing trace procedures, available on Microsoft’s SQL deployment page.
  • Restart the SQL Server service. Common Criteria is one of those rare cases where just running RECONFIGURE WITH OVERRIDE won’t actually cause the change to take effect.
  • Now log into the server as SA, and run:
DBCC CHECKDB(master) WITH NO_INFOMSGS, DATA_PURITY;

Remember that out of the box, data purity checks are not enabled on master and model. That’s why I’m including the DATA_PURITY option in the command above, even though data purity checking is on by default for databases created in SQL Server 2005 and later.

In my case, when I ran CHECKDB, I got the following:

Msg 2570, Level 16, State 2, Line 1
Page (1:85), slot 26 in object ID 58, index ID 1, partition ID 281474980511744, alloc unit ID 72057594037993472 (type "In-row data"). Column "name" value is out of range for data type "nvarchar".  Update column to a legal value.
CHECKDB found 0 allocation errors and 1 consistency errors in table 'sys.sysbinobjs' (object ID 58).
CHECKDB found 0 allocation errors and 1 consistency errors in database 'master'.

So there is our data purity error. These are kind of a mixed bag. You can’t run a repair statement to fix them, but then again, you don’t risk losing any more data than you already have.  What it means is that SQL Server cannot interpret the data within a column, because the data isn’t valid or within the correct range for the data type of that column. In this case, we have an nvarchar column with a single byte of data in it.

Now you’ve probably guessed by this point, and if you haven’t then I’ll make it clear: the corruption in this case was NOT caused by CHECKDB. It has to do with some peculiarity in the security setup here. Somehow, something is being logged in a way that bypasses the data type check, inserting a record into this table that doesn’t fit the data type. Let’s take a look at this using DBCC PAGE. If you’re not familiar with DBCC PAGE, please see my previous post on it, DBCC PAGE to the Rescue.

DBCC TRACEON(3604); /* Enables output to console */
GO
DBCC PAGE (1,1,85,3); /* dbid 1, file 1, page 85, style 3 (show me everything) */
GO

After bringing up the page, we do a search for slot 26, and find:

Slot 26 Column 4 Offset 0x2f Length 1 Length (physical) 1
name = INVALID COLUMN VALUE

Just for comparison, here’s what a normal, single-character value of that type would look like:

Slot 0 Column 1 Offset 0xb Length 2 Length (physical) 2
col1 = a

See the problem? Because this is an nvarchar type, the data is stored as Unicode, at two bytes per character, so the minimum length of a non-empty value is 2 bytes. (See here for the details on what I mean regarding the data types.) So the value stored is too small for SQL Server to interpret it as a valid nvarchar.
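
You can see the two-bytes-per-character storage for yourself with DATALENGTH:

/* nvarchar (Unicode) stores two bytes per character; varchar stores one. */
SELECT DATALENGTH(N'a') AS nvarchar_bytes,  /* returns 2 */
       DATALENGTH('a')  AS varchar_bytes;   /* returns 1 */
GO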

In the demos I do, I fix a data purity error with an update statement. Unfortunately, you can’t directly access this particular table. It’s a hidden system table, so there’s no way to even select from it, much less update it.

So what *can* we do? Restore master from a backup. Well, technically we could just ignore the error by running CHECKDB against master with PHYSICAL_ONLY, which skips the logical checks, data purity included. You can also disable data purity checking on master by enabling trace flag 2566. Keep in mind, though, that disabling data purity checks means that if any new data purity issues crop up, you’re not going to know about them. I’m not in favor of either option, so let’s go to that backup we took at the beginning of this exercise.
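
For the record, those two ignore-it options would look something like this. I’m not recommending either one; the trace flag behavior is as described in the KB:

/* Option 1: physical structures only, which skips logical checks, data purity included. */
DBCC CHECKDB(master) WITH NO_INFOMSGS, PHYSICAL_ONLY;
GO
/* Option 2: trace flag 2566 skips data purity checks unless DATA_PURITY is specified explicitly. */
DBCC TRACEON(2566, -1);
GO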

Restoring master is understandably different than restoring any other database, and here’s how you have to do it. You need to stop the SQL Server service, then start it from the command prompt in single user mode, using the -m switch. In my case, I have multiple instances of SQL running on this particular machine, so I also need to specify which instance using the -s switch. Like so:

X:\<path to sqlservr.exe>\sqlservr.exe -m -s mssqlserver

Note that specifying “mssqlserver” will start the default instance, which is where we’re testing this. In a separate command window, I logged into my instance using sqlcmd.exe and restored master with a plain old restore command:

RESTORE DATABASE [master] FROM DISK = 'X:\<path to file>\master_full.bak' WITH REPLACE;

When the restore of master is complete, SQL Server will automatically shut down. I was already at the command line, so I restarted both SQL and the SQL Server Agent with net start…

net start mssqlserver
net start sqlserveragent

Logging in with my user account (not sa) and running CHECKDB as above shows that the database is now clean.

So, to avoid this: if you fit the criteria described at the beginning of this article and in Microsoft’s KB article, please be sure to download and install the appropriate update for your instance of SQL Server.

Keep checking those databases for corruption, folks.

Thanks for reading.

-David.

Insidious Corruption

WARNING: The following post involves intentionally doing damage to a *system* database. The usual caveats about having good backups, not attempting to do this on an important system, etc… apply here. You may very well need to remove and reinstall SQL Server if you attempt this. In short – “We’re professionals. Do not try this at home.” 

Today’s lesson is about equality. Treat all databases with equal kindness. For example, I see a lot of people not doing basic maintenance like regular backups and CHECKDB on their system databases. Maybe they think that since those databases don’t change much, there’s no reason to protect them the way that they protect user data.

I am here today to dispel that notion.  If you’ve attended one of my sessions on database corruption, you’ll remember that I state that database corruption is just a fact of life. It happens, and it can happen when and where you least expect it. Example: in the [model] database.

The [model] database is essentially SQL Server’s template. When a new database is created, first, a copy of [model] is made. Then, any further configurations or tweaks that the user specified in either the database creation dialog boxes, or the CREATE DATABASE command, are applied to the new database. The advantage of this is that anything you want to apply to any newly created databases, you can apply to the [model] database. Any objects you want in all new databases, any default file settings like size and growth, or something SIMPLE like the database’s recovery model. (See what I did there? :-))

Let me repeat that real quick – *anything* that exists in [model] will be copied to new databases.

Including corrupted pages.

*shiver*

Sooo… Let’s do one! :-)

First, take a backup of [model] and set it aside, just in case. Then we create an object we can copy to new databases.

/* Create a table in [model] that can get propagated to new databases. */
USE [model];
GO
CREATE TABLE Trash (datacol CHAR(8000) NOT NULL); /* if you're gonna go, go big. */
GO
/* Drop in a row. */
INSERT INTO Trash(datacol)
SELECT REPLICATE('a',7500);
GO

Now that we have an object, we need to corrupt it. I chose to use the method described here.  You’ll need the ID of the page you want to corrupt, which you can get with DBCC IND:

/* Get the page ID of the one lone page in the Trash table. */
DBCC IND('model','Trash',0);
GO
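
I won’t reproduce that method step by step here, but for the record, one commonly described approach uses the undocumented, unsupported DBCC WRITEPAGE command to stamp garbage bytes onto the page. The sketch below assumes the page from the DBCC IND output turned out to be (1:312), and is strictly for a disposable test instance:

/* DANGER: undocumented and unsupported. Test systems only. */
ALTER DATABASE model SET SINGLE_USER;
GO
/* Overwrite two bytes just past the 96-byte page header of page (1:312). The final
   parameter (1) bypasses the buffer pool, so the page checksum is left invalid on disk. */
DBCC WRITEPAGE('model', 1, 312, 100, 2, 0x0000, 1);
GO
ALTER DATABASE model SET MULTI_USER;
GO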

Now, let’s assume we’re not checking [model] for corruption, and so it goes undetected.  What happens when I create a new database?

/* Create new database and check for corruption. */
CREATE DATABASE TestMe;
GO
DBCC CHECKDB(TestMe) WITH NO_INFOMSGS;
GO

Survey says…!

Msg 8939, Level 16, State 98, Line 103
Table error: Object ID 581577110, index ID 0, partition ID 72057594039107584, alloc unit ID 72057594043695104 (type In-row data), page (1:312). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 133129 and -4.
Msg 8928, Level 16, State 1, Line 103
Object ID 581577110, index ID 0, partition ID 72057594039107584, alloc unit ID 72057594043695104 (type In-row data): Page (1:312) could not be processed. See other errors for details.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'Trash' (object ID 581577110).
CHECKDB found 0 allocation errors and 2 consistency errors in database 'TestMe'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (TestMe).

So now I have a corruption that will appear in every database I create. Keep in mind that this need not show up in a user created object. If any part of the [model] database becomes corrupt, and we’re not checking [model] with CHECKDB, then every database created will also be corrupt, and maybe unusable.

While we’re on the subject, here’s something even worse – while playing around with this concept, I noticed that in SQL Server 2014, DATA_PURITY checks are still *off* for [master] and [model] by default. So I created another test object in [model] and caused a data purity error in it. When I ran CHECKDB on [model], without specifically adding DATA_PURITY as an option, it came back clean. When I created a new database, I ran CHECKDB on it, and lo and behold – it threw a data purity error. So a corruption that was in model was not detected, and still propagated to a newly created database.

Ouch.

Have you hugged your [msdb], [model] and [master] databases today? If not, at least make sure you’re properly checking them for corruption using CHECKDB with DATA_PURITY. Your databases will thank you.

Thanks for reading.

-David.

DBA 911 – Follow up Q & A

Thanks again to everyone who attended my web session, DBA 911 – Database Corruption, on January 7th. I had a blast spending time with you all. We must do that again, soon. :-)

In this post, I’m going to tackle some of the follow-up questions that we weren’t able to get to in the session. Some questions will require more explanation than others, so I’m going to handle them in separate posts.

First, let’s handle two of the most frequently asked questions:

Where can I get a script to automate running CHECKDB on my databases?

The best scripts for that, bar none, are Ola Hallengren’s. Check out http://ola.hallengren.com

Where can I get the slides and demo code for your presentation?

The slides and demo code are on my resource page, above. (https://dmmaxwell.wordpress.com/resources/)

Now, let’s get into the feedback questions – I’m going to just copy them out of the feedback form and paste them here. I’m going to make an assumption or two here and there, but if I don’t completely answer your question, please feel free to leave a comment below. Some of the questions answered in the presentation will be left alone, but some answers bear repeating, so I will do that here. Also, there are some questions that I want to go into a LOT more detail on, so I’m going to save a few for future posts, where I can include some demo code and explain more.

Here we go! :-)

Is the space taken in tempdb by the dbcc checks cleared after the command is run or does the dba have to manually clear the data?

Space used within TempDB for CHECKDB is cleared automatically by SQL Server. The DBA shouldn’t usually have to manually clear objects out of TempDB. Now, if the problem is that running CHECKDB causes you to run out of space for TempDB, then you need more disk space for TempDB, period. I also recommend pre-sizing TempDB, if at all possible, to the max size you need.
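
Pre-sizing is just a matter of bumping the file sizes up front. Here’s a sketch, assuming the default logical file names tempdev and templog and an 8 GB data file target:

USE [master];
GO
/* Pre-size TempDB so it doesn't have to grow in the middle of a CHECKDB run. */
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 8GB);
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, SIZE = 2GB);
GO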

Can we add log file with suspect database if it is suspected due to log file full.

A full log file, meaning a log file that cannot grow, does not put a database into SUSPECT mode. SUSPECT means that SQL Server cannot read one or more of the database files at all, so you wouldn’t be able to add a log file anyway. A suspect database must either be put into EMERGENCY mode in order to correct anything, or restored from backup.

A full log file is a completely different problem. If your log file fills up on an online database, then yes you can add another log file for some temporary space. I would recommend finding out what caused the log file to grow, and either fixing that process or sizing the log file appropriately, then deleting the additional log file.  There’s no good reason to maintain a database with more than one log file.
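
If you do end up in that emergency, the add-and-later-remove sequence looks roughly like this. The database name, file name, and path are placeholders:

/* Temporary relief: add a second log file on a volume that still has space. */
ALTER DATABASE StuckDB
ADD LOG FILE (NAME = StuckDB_log2, FILENAME = 'X:\SQLLogs\StuckDB_log2.ldf', SIZE = 4GB);
GO
/* Once the original log has been backed up and cleared, remove the (now empty) extra file. */
ALTER DATABASE StuckDB REMOVE FILE StuckDB_log2;
GO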

monitoring the severity alerts > 20, we are monitoring these alerts: 823,824,825 and 832?

For 823, 824 and 832, yes, since those are high-severity alerts. However, 825 is logged quietly as a Severity 10 “Informational” message, so you won’t catch it if you are only monitoring Severity 21 and up. One other reason to monitor specific error numbers outside of just monitoring the severity levels is that you can use custom notification text for each error. For the 823 alert on my servers, I include something like, “This indicates that SQL Server was unable to read an 8k page off of the disk, and that the database is damaged. Please contact the DBA on call to complete a consistency check of this database immediately.”  That way, if I’m not around, someone will see that and know it’s critical – and get a hold of the on call DBA.
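
Here’s a sketch of that setup using the msdb alert procedures. The alert name, notification text, and operator name below are placeholders:

USE [msdb];
GO
/* Alert on error 825 (read-retry), which is only Severity 10 and easy to miss. */
EXEC msdb.dbo.sp_add_alert
    @name = N'Error 825 - Read-Retry Required',
    @message_id = 825,
    @severity = 0,
    @include_event_description_in = 1,
    @notification_message = N'SQL Server had to retry a read. Run a consistency check on this database and contact the on-call DBA.';
GO
EXEC msdb.dbo.sp_add_notification
    @alert_name = N'Error 825 - Read-Retry Required',
    @operator_name = N'DBA Team',
    @notification_method = 1; /* 1 = email */
GO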

how much extra overhead is there with having checksum on for a database

Little to almost none. There is a slight CPU overhead to calculating the checksum. Performance wise, however, most servers are IO-constrained, rather than CPU-constrained. For probably 99.9999% of systems out there, enabling page checksums just isn’t going to be noticed.
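
Page checksums are the default PAGE_VERIFY setting for databases created on SQL Server 2005 and later, but databases upgraded from older versions may still be set to TORN_PAGE_DETECTION. Turning checksums on is a one-liner (the database name is a placeholder):

/* Enable page checksums; pages get a checksum the next time they're written to disk. */
ALTER DATABASE YourDatabase SET PAGE_VERIFY CHECKSUM;
GO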

I may have missed this but when should you run the datapurity command?

The DATA_PURITY option should be specified when the data purity flag is off for that database.  An example would be a database that is created in SQL Server 2000, then upgraded to 2005 or a later version. In SQL Server 2000, there were no data purity checks. In SQL 2005 and up, the data purity flag was introduced.  To find the data purity flag state, you can use the DBCC DBINFO command, which is included in my demo scripts. Once you have specified DATA_PURITY for a database, that flag will be turned on, so you will get data purity checks automatically going forward, without having to specify that option.

Now, if you’re trying to save time or TempDB space, you can specify PHYSICAL_ONLY, which skips the logical checks, data purity included. Be aware that when you do that, you’re deliberately choosing not to check for logical corruption, but physical readability only. It’s the old speed vs. accuracy trade-off. I’m not saying one or the other is better, just that you need to understand your options and have reasons for choosing them.

What time will drinks be served?

May I see some ID, please?

What is the licensing implication when you run DBCC on non production environment?

None – you run it wherever you want to. You can copy a production database to a Developer edition server and run CHECKDB on it with no licensing impact.

in What situations is used DBCC UPDATEUSAGE ?

UPDATEUSAGE is a command that verifies and corrects the allocated-space figures reported for a database. This has to do with the database catalog views, rather than with corruption or with CHECKDB specifically. If you’re seeing odd values from sp_spaceused, or if database objects are frequently changed with CREATE, ALTER or DROP statements, then you may want to run it. It does not, however, indicate a corruption issue. I recommend reading the BOL entry for UPDATEUSAGE at http://technet.microsoft.com/en-us/library/ms188414.aspx
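
Basic usage looks like this. Run it against the current database, then compare what sp_spaceused reports:

/* Correct page and row count metadata for the current database. */
DBCC UPDATEUSAGE(0) WITH NO_INFOMSGS;
GO
/* sp_spaceused can trigger the same correction with @updateusage. */
EXEC sp_spaceused @updateusage = N'TRUE';
GO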

Do we need to run CHECKDB after a restore or repair?

If you have the time to run it after a restore, then yes, I would definitely do so. And you should definitely run CHECKDB after doing a repair, especially if you have run with REPAIR_ALLOW_DATA_LOSS. You never want to say to your boss, "Well, I think the database is OK now." You want to be sure. If you have downloaded my demo scripts, you’ll see that I take additional backups both before and after every repair, and run CHECKDB again after every repair. I show this because I think it’s a good habit to have. Your situation may vary, but I would consider this a best practice. Not only that, but sometimes you need to run multiple repairs, and making it a habit to run CHECKDB after each one will allow you to catch anything else that has gone wrong. For an example, see my post on Emergency Mode Repair.
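
To make that habit concrete, here’s a minimal sketch of the sequence. The database name and backup paths are placeholders, and REPAIR_ALLOW_DATA_LOSS is shown only because that’s the scenario in question:

/* 1. Back up before the repair. */
BACKUP DATABASE BrokenDB TO DISK = N'X:\Backups\BrokenDB_prerepair.bak' WITH INIT;
GO
/* 2. Repair requires single-user mode. */
ALTER DATABASE BrokenDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
GO
DBCC CHECKDB(BrokenDB, REPAIR_ALLOW_DATA_LOSS);
GO
ALTER DATABASE BrokenDB SET MULTI_USER;
GO
/* 3. Check again (a repair can expose further problems), then back up the repaired database. */
DBCC CHECKDB(BrokenDB) WITH NO_INFOMSGS;
GO
BACKUP DATABASE BrokenDB TO DISK = N'X:\Backups\BrokenDB_postrepair.bak' WITH INIT;
GO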

That’s all for now. :-) If I have not answered your question, remember that I’m going to be handling a few other questions separately, in separate posts.  Once I’m done, if I still haven’t answered your question or you have additional questions, feel free to leave a comment, email me, or send me a question on Twitter.   I’m always happy to help.

Thanks for reading.

-David.

DBCC DBINFO

Often, I like to poke around inside databases just for fun. Just to look at the structures. It’s kind of like wandering around unexplored territory, and recognizing familiar architectures – appealing to my sense of curiosity. My first tool for the task is usually the undocumented DBCC DBINFO(). While I normally don’t encourage the use of undocumented commands, this one is safe, and doesn’t appear to be going anywhere.

Let’s look at the results for a sample database, and discuss the output. Run the following code on the database of your choice. The DB results shown here are from a demonstration database from my corruption presentation:

USE [master];
GO
/* Send output to console, required by DBINFO(). */
DBCC TRACEON(3604);
GO
/* Using TABLERESULTS for nicely formatted output. */
DBCC DBINFO(DemoRestoreOrRepair)
WITH TABLERESULTS;
GO
DBINFO Results (screenshot)

We get a lot of information out of this. Much of it can also be obtained by querying sys.databases, but some cannot. The most interesting bits to me are:

  • dbi_Version and dbi_CreateVersion – Tell you what version of SQL Server the database is at currently (dbi_Version), and what version it was created in (dbi_CreateVersion). A mismatch here indicates the DB has been upgraded.
  • dbi_DBCCFlags – A value of 2 tells you that data purity checks are being run by default when CHECKDB runs on this database. 0 means they are not, and you’ll have to specify that as an option for the CHECKDB command next time you run it. (If you *want* data purity checks, that is.)
  • dbi_DBCCLastKnownGood – The last time DBCC CHECKDB was run successfully on this database.
  • dbi_LastLogBackupTime – Just what it says on the label, the last time the log was backed up.

There are also a lot of entries about the various log sequence numbers that the database was on at specific points in time. Such as:

  • dbi_dbbackupLSN for the last backup.
  • dbi_DifferentialBaseLSN for the last full backup, which would be required to start the restore sequence.
  • dbi_checkptLSN for the last checkpoint.
  • dbi_DirtyPageLSN for the last time a transaction was committed.
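
If you’d rather query specific fields than eyeball the whole result set, you can capture the TABLERESULTS output into a temp table. A quick sketch against the same demo database:

/* Capture DBINFO output so individual fields can be queried. */
CREATE TABLE #dbinfo (ParentObject VARCHAR(255), [Object] VARCHAR(255), Field VARCHAR(255), [Value] VARCHAR(255));
INSERT INTO #dbinfo
EXEC ('DBCC DBINFO(DemoRestoreOrRepair) WITH TABLERESULTS;');
SELECT Field, [Value]
FROM #dbinfo
WHERE Field IN ('dbi_DBCCFlags','dbi_dbccLastKnownGood','dbi_LastLogBackupTime');
DROP TABLE #dbinfo;
GO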

Interesting stuff, I think.  What about you?

Thanks for reading.

-David.

Shrink Is A Four Letter Word

Recently, I’ve been noticing some large-scale log growth on some of my developer SQL servers, which has been causing them to run out of disk space. Though shrinking files is almost universally a bad idea, there is a time and a place for doing so, provided the ramifications of shrinking are understood. Needing to shrink a log file indicates that either we’re operating on more data than we expected, or operating inefficiently. Either way, it’s almost always the result of unplanned circumstances, and under those circumstances you can’t always use the option you’d prefer. Sometimes you have to do something regrettable.

When *NOT* to shrink:

  • When the growth happens in a Data file. Shrinking is going to rob you of free space required for index maintenance, not to mention fragmenting the hell out of your existing indexes, making them near useless.
  • When the growth is in TempDB. Shrinking TempDB while it’s in use can cause corruption. Either add TempDB files, or kill the offending transaction and restart SQL Server to recreate TempDB at its configured size. Killing a transaction that big may cause a very long rollback. You’ve been warned.
  • As part of regular maintenance.  If your log files are growing repeatedly, then they’re trying to get to the size they need to be in order to handle your workload. Why are you shrinking them? What do you have against them? They have dreams, too…  Fix the workload, not the log files.

Alternatives to shrinking are covered further down: size the files properly to begin with, and batch your large transactions (see the sketch below).

When to shrink? I can only think of two situations off the top of my head:

  • Under an emergency where other workloads on the same disk are affected and there’s no additional disk elsewhere for another log file.
  • When a database is being archived / retired, and will be set to read-only. Even in that case, you wouldn’t just shrink the DB. See Retiring a Database.

Benefits to shrinking:

  • Free disk space.
  • That’s about it.

Caveats to shrinking:

  • Massive performance hit during re-growth, as all transactions must be suspended.
  • Massive performance hit due to file fragmentation on re-growth.
  • Massive performance hit due to internal fragmentation on shrink.
  • Massive perfor…. You get the idea.

So really, we want to avoid shrinking log files wherever possible. The best way to avoid shrinking is to properly size files to begin with.   Make sure you have enough room for your transactions to run. If you have large transactions or volumes of data to work with, one good way to avoid file growth is to batch large transactions into smaller groups. i.e. Instead of inserting 5M rows, insert 100k rows at a time, 50 times.  This can actually be faster than a single large insert, as you’re not stuck waiting for the log file to grow to accommodate 5M records.
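
Here’s a minimal sketch of that batching pattern. The source and target tables, and the ID column, are hypothetical:

DECLARE @BatchSize INT = 100000,
        @RowsMoved INT = 1;

WHILE @RowsMoved > 0
BEGIN
    /* Copy only rows not yet moved; each batch is its own transaction. */
    INSERT INTO dbo.TargetTable (ID, Payload)
    SELECT TOP (@BatchSize) s.ID, s.Payload
    FROM dbo.SourceTable AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.TargetTable AS t WHERE t.ID = s.ID);

    SET @RowsMoved = @@ROWCOUNT;

    /* In SIMPLE recovery, a checkpoint lets the log space be reused between batches;
       in FULL recovery, frequent log backups do the same job. */
    CHECKPOINT;
END;
GO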

Some growth may still occur since, let’s face it, we have a lot of data to work with. What we want to avoid is large, recurring growth. Remember that when the log file must grow, all transactions are suspended.  If you are only growing slightly, that may not be a problem, but when you have 100GB of growth, that’s going to stop you in your tracks for the duration of the growth.

If you must shrink a log file due to uncontrolled or unexpected growth, you can use the SHRINKFILE command, but with one caveat – I would recommend shrinking the file by half instead of to its smallest size.  This way, future growth may be avoided while queries are tuned to use less log space.   For example, if a log were to grow to 40GB, after averaging a size of 2GB, I would shrink the log to 20GB, like this:

USE <DatabaseName>;
DBCC SHRINKFILE(<file ID>,<target size>);

Example:

USE BigLogDB;
DBCC SHRINKFILE(2,20000);

2 is the ID of the log file for most databases.  You can use sp_helpfile on a database to check that.  20000 is the target size, which is always in MB. Shrinking requires sysadmin or diskadmin permissions, which some developers do have.  I can also assist you with these tasks as needed.  Another advantage to this is that it gives us an idea of how large the production log files will need to be to accommodate new SQL code.
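
If you’d rather not guess at the file ID, sys.database_files will confirm it, along with the current size:

USE BigLogDB;
GO
SELECT file_id, name, type_desc, size / 128 AS size_mb  /* size is stored in 8 KB pages */
FROM sys.database_files;
GO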

Also, I would like to stress that shrinking should be treated as a temporary fix, not a permanent workaround, and *never* as part of regular maintenance. If you have a job on a server that is regularly shrinking log files, then you either need more disk space or smaller transactions. Regular shrinking is a sign that something bad is going on. And while shrinking can be considered for log files, it is almost never a good idea to shrink a data file. Avoid using SHRINKDATABASE.

Thanks for reading.
-David.

Retiring a Database

Every dog has its day, and every database has its time. Maybe a “shelf life” would be more appropriate. But with data retention times getting longer and longer, some databases must be kept around long after their applications have been retired. So what’s a good way to retire a database?

I have a set procedure that I use for retiring a database, once its application is no longer in use. There is some ongoing maintenance, but that’s the price you pay for data retention. If you need to put a DB in cold storage, here’s one way.

Backup – Set the recovery model of the database to SIMPLE and take a backup of the existing database so you have a starting point to go back to, should anything go wrong with this procedure.  I’ve never had to use that backup in my time, but you never know. It pays to be cautious.  We’ll need the DB to be in SIMPLE mode since the next thing we’re going to do is…

Shrink – Yes, I know that’s considered a dirty word by many data professionals, but if we understand what shrink does, and are ready to mitigate any of the problems it presents, then we can shrink a database safely.  In this case, we’re taking the database out of production and storing it.  Disk space just became far more important than performance. So, we can go ahead and shrink the database as small as we can get it.

Optimize – Now that we’ve shrunk the database, we’re going to have loads of internal fragmentation going on.  At this point, we’re going to re-index everything and update all statistics, so that when we do restore this database, it will be ready to read, quickly. Hey – I said that performance wasn’t as important as disk space. I didn’t say it wasn’t important at all. :-)

Truncate – You’ll notice (hopefully) that after the re-index / update, our DB size has grown again. We still want to save some disk space, so let’s shrink the data file again, but this time, using TRUNCATE_ONLY.  All we want to do is remove the empty space at the end of the file. This may not net us much additional space, if any, but when it comes to storage, every byte counts. For the log file, we’ll go ahead and shrink it as far down as it will go.  Since it’s not storing data, we don’t need it.

SET to READ_ONLY – When and if we do restore this database, we’re probably going to want to avoid accidentally updating the data, since it’s being held for archival purposes. If we really do need to deliberately update the data in this database, we can set it to READ_WRITE after we restore it.  If required, we can then follow this procedure again to archive that copy of the database.

Backup Again – This is the most critical part of the process. Now that we have a good copy of the database, we want to back it up to the most up-to-date version of SQL Server in our shop, as well as storing it on the most recent hardware we have, regardless of the media type. This may involve taking backups, restoring them to other, more updated servers, and backing up again. 

Restore – TEST THAT BACKUP. The only way you truly know if a backup is good is to restore it. Also, as part of our routine DR exercises, it’s a good idea to update that backup to the most recent version of SQL we have, and the most recent storage hardware as well. With that, you should be able to keep your data around indefinitely. 
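
Putting the steps together, here’s a rough sketch of the whole sequence for a hypothetical database named OldAppDB. The index rebuild step is left to whatever maintenance script you normally use:

USE [master];
GO
/* 1. Simple recovery, and a safety-net backup to fall back on. */
ALTER DATABASE OldAppDB SET RECOVERY SIMPLE;
BACKUP DATABASE OldAppDB TO DISK = N'X:\Backups\OldAppDB_preretire.bak' WITH INIT, COMPRESSION;
GO
/* 2. Shrink as small as possible; acceptable only because this database is leaving production. */
DBCC SHRINKDATABASE(OldAppDB);
GO
/* 3. Rebuild indexes and update statistics here, using your index maintenance script of choice. */

/* 4. Reclaim the trailing free space left behind by the rebuilds, and shrink the log right down. */
USE OldAppDB;
GO
DBCC SHRINKFILE(1, TRUNCATEONLY);  /* data file */
DBCC SHRINKFILE(2, 1);             /* log file */
GO
/* 5. Set read-only, then take the final backup that actually gets archived and tested. */
USE [master];
GO
ALTER DATABASE OldAppDB SET READ_ONLY WITH ROLLBACK IMMEDIATE;
BACKUP DATABASE OldAppDB TO DISK = N'X:\Backups\OldAppDB_final.bak' WITH INIT, COMPRESSION;
GO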

Why not just take a backup, put it on a removable disk and be done? Imagine it’s 10 years later. The database we need is backed up to media we can’t read because the hardware died 4 years ago, and “nobody makes those drives anymore”. Don’t laugh. I have seen this very failure. I’ve also seen situations where we can read the media, but our backup format is so old that our newer software can’t read it. We have to go scrambling for a (sometimes unlicensed) copy of the older software. Don’t let this happen to you. Having redundant copies, i.e., a backup of the backup, is a great idea as well. Avoid single points of failure. Oh, and we are using backup compression, right? :-)

One final caveat to this: if the DB we are restoring uses features that get deprecated, we’ll want to know, so a good suggestion is to use the TRY/CATCH restore method described by Shaun Stuart.  If we do have deprecated features in use, we’ll have to decide how to address those when we find them. They may or may not matter.

Thanks for reading.

-David.