Corruption Bug: Common Criteria and SA.

When I talk about recovering from database corruption, I mention that occasionally corruption can be caused by bugs in SQL Server. Such things are very rare, but they do happen. Case in point: Microsoft recently updated an article acknowledging exactly that. KB2888996 details a specific set of circumstances where running CHECKDB against the master database turns up corruption.

Scary, huh?

I thought it would be an interesting exercise to reproduce those results, and then test the patch for myself. Feel free to follow along.

WARNING: Once again, we are causing intentional damage, here. Do this only on a test system where you can afford to simply reinstall SQL Server if everything goes to hell. And before you do anything else, take a backup of your master database. Do it now. I’ll wait. <muzak plays…>

Ready? Ok, here we go…

  • Make sure you’re running an affected version of SQL Server: Enterprise, Standard, or Developer edition of SQL Server 2008 R2 through 2014. See the KB article mentioned above for the specifics. Note that you’ll need to be on a build earlier than the hotfixes or cumulative updates listed in the fix.
  • Enable common criteria compliance on your SQL Server instance using:
EXEC sp_configure 'common criteria compliance enabled',1;
GO
RECONFIGURE WITH OVERRIDE;
GO
  • Install the auditing trace procedures, available on Microsoft’s SQL deployment page.
  • Restart the SQL Server service. Common Criteria is one of those rare cases where just running RECONFIGURE WITH OVERRIDE won’t actually cause the change to take effect.
  • Now log into the server as SA, and run:
DBCC CHECKDB(master) WITH NO_INFOMSGS, DATA_PURITY;

Remember that, out of the box, data purity checks are not run against master and model. That’s why I’m including DATA_PURITY in the command above, even though purity checking is on by default for user databases created on 2005 and later.

In my case, when I ran CHECKDB, I got the following:

Msg 2570, Level 16, State 2, Line 1
Page (1:85), slot 26 in object ID 58, index ID 1, partition ID 281474980511744, alloc unit ID 72057594037993472 (type "In-row data"). Column "name" value is out of range for data type "nvarchar".  Update column to a legal value.
CHECKDB found 0 allocation errors and 1 consistency errors in table 'sys.sysbinobjs' (object ID 58).
CHECKDB found 0 allocation errors and 1 consistency errors in database 'master'.

So there is our data purity error. These are kind of a mixed bag. You can’t run a repair statement to fix them, but then again, you don’t risk losing any more data than you already have.  What it means is that SQL Server cannot interpret the data within a column, because the data isn’t valid or within the correct range for the data type of that column. In this case, we have an nvarchar column with a single byte of data in it.

Now you’ve probably guessed by this point, and if you haven’t then I’ll make it clear: the corruption in this case was NOT caused by CHECKDB. It has to do with some peculiarity in the security setup here. Somehow, something is being logged in a way that bypasses the data type check, inserting a record into this table that doesn’t fit the column’s data type. Let’s take a look at this using DBCC PAGE. If you’re not familiar with DBCC PAGE, please see my previous post on it, DBCC PAGE to the Rescue.

DBCC TRACEON(3604); /* Enables output to console */
GO
DBCC PAGE (1,1,85,3); /* dbid 1, file 1, page 85, style 3 (show me everything) */
GO

After bringing up the page, we do a search for slot 26, and find:

Slot 26 Column 4 Offset 0x2f Length 1 Length (physical) 1
name = INVALID COLUMN VALUE

Just for comparison, here’s what a normal, single-character value of that type would look like:

Slot 0 Column 1 Offset 0xb Length 2 Length (physical) 2
col1 = a

See the problem? Because this is an nvarchar column, the data is stored as Unicode, two bytes per character, so the minimum length of a non-empty value is 2 bytes. (See here for the details on what I mean regarding the data types.) The value stored here is only a single byte, which is too small for SQL Server to interpret as a valid nvarchar.
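
If you want to see that two-byte minimum for yourself, a quick DATALENGTH check makes the point. This is just an illustration, separate from the repro:

SELECT DATALENGTH(N'a') AS NVarcharBytes, /* 2 bytes: nvarchar stores two bytes per character */
       DATALENGTH('a')  AS VarcharBytes;  /* 1 byte: varchar stores one byte per character */
GO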

In the demos I do, I fix a data purity error with an update statement. Unfortunately, you can’t directly access this particular table. It’s a hidden system table, so there’s no way to even select from it, much less update it.

So what *can* we do? Restore master from a backup. Technically, we could just ignore the error by running CHECKDB against master WITH PHYSICAL_ONLY, which skips data purity checking entirely, or we could disable data purity checks on master by enabling trace flag 2566. Keep in mind, though, that disabling data purity checks means that if any new data purity issues crop up, you won’t know about them. I’m not in favor of that option, so let’s go to the backup we took at the beginning of this exercise.
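
For reference, here’s roughly what those two “ignore it” options look like. I’m showing them only so you know what to avoid:

DBCC CHECKDB(master) WITH NO_INFOMSGS, PHYSICAL_ONLY; /* physical checks only, no data purity checking */
GO
DBCC TRACEON(2566, -1); /* skip data purity checks instance-wide until restart; add -T2566 as a startup parameter to make it stick */
GO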

Restoring master is understandably different than restoring any other database, and here’s how you have to do it. You need to stop the SQL Server service, then start it from the command prompt in single user mode, using the -m switch. In my case, I have multiple instances of SQL running on this particular machine, so I also need to specify which instance using the -s switch. Like so:

X:\<path to sqlservr.exe>\sqlservr.exe -m -s mssqlserver

Note that specifying “mssqlserver” will start the default instance, which is where we’re testing this. In a separate command window, I logged into my instance using sqlcmd.exe and restored master with a plain old restore command:

RESTORE DATABASE [master] FROM DISK = 'X:\<path to file>\master_full.bak' WITH REPLACE;

When the restore of master is complete, SQL Server will automatically shut down. I was already at the command line, so I restarted both SQL and the SQL Server Agent with net start…

net start mssqlserver
net start sqlserveragent

Logging in with my user account (not sa) and running CHECKDB as above shows that the database is now clean.

So, to avoid this: if you fit the criteria described at the beginning of this article and in Microsoft’s KB article, please be sure to download and install the appropriate update for your instance of SQL Server.

Keep checking those databases for corruption, folks.

Thanks for reading.

-David.


To Backup or not to Backup. In Development, is the question.

Recently an old debate has resurfaced for me where the central question is: Do we, or do we not back up the development environment? Say it with me, now:

“It depends.”

I have seen a trend towards reducing or even eliminating backups in development or test environments, since they don’t get as high a priority as the production environment. Development isn’t a client- or public-facing area of the business, so it gets lower priority when it’s down. Dev and Test *are* the production environment for your developers, though. That’s where they work, day in and day out. How comfortable would you feel telling a developer that they’ll have to re-do the last couple of hours’ worth of work? How about the last day or so? More?

Another common argument that I hear against backing up the development environment is the storage cost of what could be considered duplicate data, since Dev and Test are usually copies of Production. Fair enough, but if development is not as high a priority to the business, wouldn’t we be able to save costs on backup storage using things like de-dupe, snapshots, and compression?  In addition to storage cost, how much are you paying your developers? I assume they’re still drawing a salary while the environment is unavailable, or while they are re-doing work that was lost due to a database issue. Is that not also a cost?

I’m heavily skewed in favor of backing up the development environment. As the DBA, I’m very nervous about having anything without a backup. But I have to admit that there are some situations where backups just aren’t as useful as they should be, and it’s easier to just recreate instead of restore. The key question is this: What is your RPO/RTO for your developers, and how will you ensure that? Most of the time, backups are part of the procedure. Sometimes they aren’t.

One situation I have seen more than a few times is where production databases are regularly copied to the development or test environments, and the latest version of database changes are applied. Not just to make sure that the databases are code-current, but this has the additional benefit of testing the deployment scripts as well. In this case, is it worth taking a backup? Again, it depends. What is your SLA for development, and is it faster to restore the development database, or to re-do all the changes?  If it takes an hour to restore, but an additional hour to re-apply the changes, is that extra hour worth saving a little bit on storage cost for backups?

There’s no clear-cut answer, but I know that restores save careers. I will almost always advocate for backing up all my environments.

What about you?

Thanks for reading.

-David.

Shrink Is A Four Letter Word

Recently, I’ve been noticing some large-scale log growth on some of my developer SQL servers, which has been causing them to run out of disk space. Though shrinking files is almost universally a bad idea, there is a time and a place for doing so provided the ramifications of shrinking are understood. Needing to shrink a log file indicates that either we’re operating on more data than we expected, or operating inefficiently. Either way, it’s almost always the result of unplanned circumstances.  Under such circumstances, you can’t always use the option you want to. Sometimes you have to do something regrettable.

When *NOT* to shrink:

  • When the growth happens in a Data file. Shrinking is going to rob you of free space required for index maintenance, not to mention fragmenting the hell out of your existing indexes, making them near useless.
  • When the growth is in TempDB. Shrinking tempdb while it’s in use can cause corruption. Either add TempDB files (a sketch follows this list), or kill the offending transaction and restart SQL Server so tempdb is recreated at its configured size. Killing a transaction that big may cause a very long rollback. You’ve been warned.
  • As part of regular maintenance.  If your log files are growing repeatedly, then they’re trying to get to the size they need to be in order to handle your workload. Why are you shrinking them? What do you have against them? They have dreams, too…  Fix the workload, not the log files.
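
Here’s a quick sketch of adding a tempdb file on another drive; the logical name, path, and sizes are placeholders you’d adjust for your environment:

ALTER DATABASE tempdb
ADD FILE (
    NAME = tempdev2, /* hypothetical logical name */
    FILENAME = 'X:\<path to tempdb files>\tempdev2.ndf',
    SIZE = 4096MB,
    FILEGROWTH = 512MB
);
GO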

Alternatives to shrinking: add another log file on a disk that has free space, break large transactions into smaller batches (more on that below), or fix the workload so the log doesn’t need that much space in the first place.

When to shrink: I can only think of two situations off the top of my head:

  • Under an emergency where other workloads on the same disk are affected and there’s no additional disk elsewhere for another log file.
  • When a database is being archived / retired, and will be set to read-only. Even in that case, you wouldn’t just shrink the DB. See Retiring a Database.

Benefits to shrinking:

  • Free disk space.
  • That’s about it.

Caveats to shrinking:

  • Massive performance hit during re-growth, as all transactions must be suspended.
  • Massive performance hit due to file fragmentation on re-growth.
  • Massive performance hit due to internal fragmentation on shrink.
  • Massive perfor…. You get the idea.

So really, we want to avoid shrinking log files wherever possible. The best way to avoid shrinking is to size files properly to begin with. Make sure you have enough room for your transactions to run. If you work with large transactions or large volumes of data, one good way to avoid file growth is to break big transactions into smaller batches, i.e. instead of inserting 5M rows at once, insert 100k rows at a time, 50 times. This can actually be faster than a single large insert, because you’re not stuck waiting for the log file to grow to accommodate all 5M records.
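
A minimal sketch of that batching pattern, using hypothetical dbo.SourceRows and dbo.TargetRows tables keyed on an Id column; in FULL recovery you’d also want log backups running between batches so the log space can be reused:

DECLARE @BatchSize int = 100000,
        @RowsMoved int = 1;

WHILE @RowsMoved > 0
BEGIN
    /* Copy one batch at a time so each transaction stays small */
    INSERT INTO dbo.TargetRows (Id, Col1)
    SELECT TOP (@BatchSize) s.Id, s.Col1
    FROM dbo.SourceRows AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.TargetRows AS t WHERE t.Id = s.Id)
    ORDER BY s.Id;

    SET @RowsMoved = @@ROWCOUNT;
END;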

Some growth may still occur since, let’s face it, we have a lot of data to work with. What we want to avoid is large, recurring growth. Remember that when the log file must grow, all transactions are suspended.  If you are only growing slightly, that may not be a problem, but when you have 100GB of growth, that’s going to stop you in your tracks for the duration of the growth.

If you must shrink a log file due to uncontrolled or unexpected growth, you can use the SHRINKFILE command, but with one caveat – I would recommend shrinking the file by half instead of to its smallest size.  This way, future growth may be avoided while queries are tuned to use less log space.   For example, if a log were to grow to 40GB, after averaging a size of 2GB, I would shrink the log to 20GB, like this:

USE <DatabaseName>;
DBCC SHRINKFILE(<file ID>,<target size>);

Example:

USE BigLogDB;
DBCC SHRINKFILE(2,20000);

2 is the file ID of the log file for most databases. You can use sp_helpfile in the database to check that. 20000 is the target size, which is always in MB. Shrinking requires membership in the sysadmin server role or the db_owner database role, which some developers do have. I can also assist with these tasks as needed. Another advantage to this approach is that it gives us an idea of how large the production log files will need to be to accommodate new SQL code.
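
If you’d rather not guess at the file ID, sys.database_files will show it (sp_helpfile returns the same information):

USE BigLogDB;
GO
SELECT file_id, name, size / 128 AS CurrentSizeMB /* size is in 8KB pages; 128 pages = 1MB */
FROM sys.database_files
WHERE type_desc = 'LOG';
GO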

Also, I would like to stress that shrinking is a temporary fix for an exceptional situation, not a routine workaround, and *never* part of regular maintenance. If you have a job on a server that is regularly shrinking log files, then you either need more disk space or smaller transactions. Regular shrinking is a sign that something bad is going on. And while shrinking can be considered for log files, it is almost never a good idea to shrink a data file. Avoid using SHRINKDATABASE.

Thanks for reading.
-David.

Retiring a Database

Every dog has its day, and every database has its time. Maybe a “shelf life” would be more appropriate. But with data retention times getting longer and longer, some databases must be kept around long after their applications have been retired. So what’s a good way to retire a database?

I have a set procedure that I use for retiring a database once its application is no longer in use. There is some ongoing maintenance, but that’s the price you pay for data retention. If you need to put a DB in cold storage, here’s one way.

Backup – Set the recovery model of the database to SIMPLE and take a backup of the existing database so you have a starting point to go back to, should anything go wrong with this procedure.  I’ve never had to use that backup in my time, but you never know. It pays to be cautious.  We’ll need the DB to be in SIMPLE mode since the next thing we’re going to do is…
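
In code, that first step looks something like this; the database name and backup path are placeholders:

ALTER DATABASE ArchiveMeDB SET RECOVERY SIMPLE;
GO
BACKUP DATABASE ArchiveMeDB
TO DISK = 'X:\<path to backups>\ArchiveMeDB_PreRetirement.bak'
WITH CHECKSUM, COMPRESSION, STATS = 10;
GO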

Shrink – Yes, I know that’s considered a dirty word by many data professionals, but if we understand what shrink does, and are ready to mitigate any of the problems it presents, then we can shrink a database safely.  In this case, we’re taking the database out of production and storing it.  Disk space just became far more important than performance. So, we can go ahead and shrink the database as small as we can get it.

Optimize – Now that we’ve shrunk the database, we’re going to have loads of internal fragmentation going on.  At this point, we’re going to re-index everything and update all statistics, so that when we do restore this database, it will be ready to read, quickly. Hey – I said that performance wasn’t as important as disk space. I didn’t say it wasn’t important at all. :-)
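
A minimal sketch of that step against the same placeholder database, rebuilding every index and then updating statistics; a proper maintenance script would give you more control, but this gets the idea across:

USE ArchiveMeDB;
GO
DECLARE @sql nvarchar(max) = N'';

SELECT @sql += N'ALTER INDEX ALL ON ' + QUOTENAME(s.name) + N'.' + QUOTENAME(t.name) + N' REBUILD;' + CHAR(13)
FROM sys.tables AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
WHERE EXISTS (SELECT 1 FROM sys.indexes AS i
              WHERE i.object_id = t.object_id AND i.index_id > 0); /* skip heaps that have no indexes */

EXEC sys.sp_executesql @sql;
GO
EXEC sys.sp_updatestats; /* update all statistics in the database */
GO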

Truncate – You’ll notice (hopefully) that after the re-index / update, our DB size has grown again. We still want to save some disk space, so let’s shrink the data file again, but this time using TRUNCATEONLY. All we want to do is remove the empty space at the end of the file. This may not net us much additional space, if any, but when it comes to storage, every byte counts. For the log file, we’ll go ahead and shrink it as far down as it will go. Since it’s not storing data, we don’t need it.
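
Roughly, assuming the typical file IDs of 1 for the data file and 2 for the log file:

USE ArchiveMeDB;
GO
DBCC SHRINKFILE(1, TRUNCATEONLY); /* data file: just release the free space at the end of the file */
GO
DBCC SHRINKFILE(2, 1);            /* log file: shrink it as far down as it will go */
GO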

SET to READ_ONLY – When and if we do restore this database, we’re probably going to want to avoid accidentally updating the data, since it’s being held for archival purposes. If we really do need to deliberately update the data in this database, we can set it to READ_WRITE after we restore it.  If required, we can then follow this procedure again to archive that copy of the database.
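
One statement does it; WITH ROLLBACK IMMEDIATE clears out any lingering connections first:

ALTER DATABASE ArchiveMeDB SET READ_ONLY WITH ROLLBACK IMMEDIATE;
GO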

Backup Again – This is the most critical part of the process. Now that we have a good copy of the database, we want to back it up to the most up-to-date version of SQL Server in our shop, as well as storing it on the most recent hardware we have, regardless of the media type. This may involve taking backups, restoring them to other, more updated servers, and backing up again. 
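
The final backup, again with placeholder paths; CHECKSUM and COMPRESSION are cheap insurance for a backup that’s going to sit on a shelf:

BACKUP DATABASE ArchiveMeDB
TO DISK = 'X:\<path to archive backups>\ArchiveMeDB_Final.bak'
WITH CHECKSUM, COMPRESSION, INIT, STATS = 10;
GO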

Restore – TEST THAT BACKUP. The only way you truly know if a backup is good is to restore it. Also, as part of our routine DR exercises, it’s a good idea to update that backup to the most recent version of SQL we have, and the most recent storage hardware as well. With that, you should be able to keep your data around indefinitely. 
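
A sketch of the verification step. The logical file names in the MOVE clauses are hypothetical, so check yours with RESTORE FILELISTONLY first; VERIFYONLY is a sanity check, not a substitute for an actual restore:

RESTORE FILELISTONLY FROM DISK = 'X:\<path to archive backups>\ArchiveMeDB_Final.bak';
GO
RESTORE VERIFYONLY FROM DISK = 'X:\<path to archive backups>\ArchiveMeDB_Final.bak' WITH CHECKSUM;
GO
RESTORE DATABASE ArchiveMeDB_Test
FROM DISK = 'X:\<path to archive backups>\ArchiveMeDB_Final.bak'
WITH MOVE 'ArchiveMeDB'     TO 'X:\<path to data files>\ArchiveMeDB_Test.mdf',  /* logical names are placeholders */
     MOVE 'ArchiveMeDB_log' TO 'X:\<path to log files>\ArchiveMeDB_Test_log.ldf',
     STATS = 10;
GO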

Why not just take a backup, put it on a removable disk and be done? Imagine it’s 10 years later. The database we need is backed up to media we can’t read because the hardware died 4 years ago, and “nobody makes those drives anymore”. Don’t laugh. I have seen this very failure. I’ve also seen situations where we can read the media, but our backup format is so old that our newer software can’t read it. We have to go scrambling for a (sometimes unlicensed) copy of the older software. Don’t let this happen to you. Having redundant copies, i.e., a backup of the backup, is a great idea as well. Avoid single points of failure. Oh, and we are using backup compression, right? :-)

One final caveat to this: if the DB we are restoring uses features that get deprecated, we’ll want to know, so a good suggestion is to use the TRY/CATCH restore method described by Shaun Stuart.  If we do have deprecated features in use, we’ll have to decide how to address those when we find them. They may or may not matter.

Thanks for reading.

-David.

Dealing With Disk Space Issues.

On a typical day a few months ago, I received an automated warning in my inbox that a particular server was low on disk space. This came along with the requisite “panic” emails I usually get from other areas who are also on that notification list.  I responded that I was looking into it in order to keep the email volume down. :-)

I mapped the volume in question to my local machine and fired up SpaceSniffer to get a quick read on what was taking up all the space. It was a dedicated SQL Server data file drive and the usual suspects were taking up the lion’s share of the space. (SpaceSniffer is a fantastic tool, by the way. You should check it out, if you haven’t already.)

Now, when it comes to disk space issues, there are three basic options:

1. Do nothing.  This should always be considered as an approach, though you may rarely use this as a solution. Sometimes you really don’t need to do anything.

2. Reduce the space required. You can relocate files, clean up un-needed data, etc. You can even shrink files if you understand what the consequences would be and have a plan for dealing with that.

3. Increase the space available. In this case, we were on a SAN drive that had available space left, so we could grow the drive to make more space available.

Another quick check showed that most of the data files were over 90% used, and none of them would be able to grow unless more space was made available. That eliminated option 1. There were no other files on that drive to clean up or delete, i.e., no databases to drop, so we couldn’t reduce the amount of space required for option 2. That left us with option 3: Grow the drive.  This was an online operation handled off-hours by our SAN admin, and was successful.

After resolving any issue, it’s a good idea to go back and do some root-cause analysis. What caused the disk space issue? One of the first things I do when taking on a new client or job is to set up a monitoring job on every server that keeps a few months’ worth of history on the database files. Using the sys.dm_io_virtual_file_stats DMV, I take a snapshot of the disk activity for each data file, as well as its size on disk, every 15 minutes. This is then logged to a table in my administrative database on each server.
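
The actual table and procedure are listed below, but the collection query is essentially along these lines; my real FileIOStatsRaw table has a few more columns than this sketch:

INSERT INTO dbo.FileIOStatsRaw
    (CaptureTime, DBName, FileID, FileType, SizeOnDiskBytes, NumOfReads, NumOfWrites)
SELECT GETDATE(),
       DB_NAME(vfs.database_id),
       vfs.file_id,
       mf.type_desc,               /* 'ROWS' for data files, 'LOG' for log files */
       vfs.size_on_disk_bytes,
       vfs.num_of_reads,
       vfs.num_of_writes
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
   AND mf.file_id = vfs.file_id;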

With the stats in place, I was able to put together a quick query to see which database had grown the most over the last few months. It turned out that one of the smaller databases had now become the second biggest database, all within the space of the last 8 weeks. Checking with the owner of the application, I found out a lot of new data had recently been imported, causing the database to grow by over 600%.  After that, it didn’t take much more growth on the part of the other data files to push that disk over its threshold, and send an alert.

Here are the scripts I use, in case you’re interested.

  • DDL for FileIOStatsRaw table and vw_FileIOByHour view. – Created in the administrative DB. (In my case, [SQLAdmin].)
  • GetFileIOStats stored procedure – Run this as frequently as you want and output to the FileIOStatsRaw table. – Also created in the admin DB. Will keep 90 days by default, but call it how you want in an agent job.
  • Trend Query:
SELECT 
 DBName, 
 Yr, 
 Mo, 
 CalculatedGrowthInMBPerMonth = SUM(FileGrowthBytes) / 1024.0 / 1024.0 /* bytes to MB */
FROM dbo.vw_FileIOByHour 
WHERE FileType = 'ROWS' 
GROUP BY DBName, Yr, Mo 
ORDER BY DBName, Yr, Mo;
GO

How do you monitor, measure and alert on disk space usage?

Thanks for reading.

-David.

Be A Master of Disaster

Disaster recovery plans are a must have. If you’re not motivated enough by recent natural disasters, or data breaches, or anything else that’s been wreaking digital havoc in this world lately, I’m not sure what will do the trick. If you haven’t made a DR plan, I would take a few minutes to just sketch out a quick, paper napkin-style plan.  What would you do first, second, third, etc… It doesn’t have to be complete, it just has to be a plan.  Go ahead. I’ll wait…

Got it?  Good.  Now. How do you know that will work?

Plans are an important, and great first step. Much like restores, however, they don’t do you any good unless you test them.  It’s not always easy, or even practical to do so. Sometimes even testing DR plans can cause outages.

Further, we may be in the habit of constantly testing our backups, but how often do we test DR plans? Quarterly? Yearly? I’ve been in more than one DR rehearsal where we found that major changes in other parts of the infrastructure had catastrophic impacts on our DR processes. How often do we do documentation in the first place, let alone update all the related DR documentation when a major system changes?

In my current position, if disaster struck right now, I’m confident we would be fine. But only because we recently went through several very painful DR tests that did not go as planned. Just about everything that could go wrong, did.  However, I don’t see that as a bad thing. In fact, I see it as a very positive thing. We tested, failed, adjusted the plan, tested again, repeated that process a few more times, and off we went. Now we know we have a solid DR plan.

I think that’s worth a few failures. :-)

How do you test your DR plans?

Thanks for reading.

-David.