Corruption Bug: Common Criteria and SA.

When I talk about recovering from database corruption, I mention that occasionally corruption can be caused by bugs in SQL Server. Such things are very rare, but they do happen. Case in point: Microsoft recently updated an article saying exactly that. KB2888996 details a specific set of circumstances where, seemingly, just running CHECKDB can cause corruption.

Scary, huh?

I thought it would be an interesting exercise to reproduce those results, and then test the patch for myself. Feel free to follow along.

WARNING: Once again, we are causing intentional damage, here. Do this only on a test system where you can afford to simply reinstall SQL Server if everything goes to hell. And before you do anything else, take a backup of your master database. Do it now. I’ll wait. <muzak plays…>

Ready? Ok, here we go…

  • Make sure you’re running an affected version of SQL Server. It should be Enterprise, Standard or Developer edition of SQL Server 2014, down to 2008 R2. See the KB article mentioned above for the specifics.  Note that you’ll need to be running a version earlier than the hotfixes or cumulative updates that they mention in the fix.
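If you're not sure which build you're on, a quick check like this will tell you (compare the result against the builds listed in the KB):
SELECT SERVERPROPERTY('ProductVersion') AS Build,
       SERVERPROPERTY('Edition')        AS Edition;
GO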
  • Enable Common Criteria compliance on your SQL Server instance using:
EXEC sp_configure 'common criteria compliance enabled',1;
GO
RECONFIGURE WITH OVERRIDE;
GO
  • Install the auditing trace procedures, available on Microsoft’s SQL deployment page.
  • Restart the SQL Server service. Common Criteria is one of those rare cases where just running RECONFIGURE WITH OVERRIDE won’t actually cause the change to take effect.
  • Now log into the server as SA, and run:
DBCC CHECKDB(master) WITH NO_INFOMSGS, DATA_PURITY;

Remember that out of the box, data purity checks are not enabled on master and model. That’s why I’m including the DATA_PURITY option in the command above, even though these checks run by default for databases created on SQL Server 2005 and later.

In my case, when I ran CHECKDB, I got the following:

Msg 2570, Level 16, State 2, Line 1
Page (1:85), slot 26 in object ID 58, index ID 1, partition ID 281474980511744, alloc unit ID 72057594037993472 (type "In-row data"). Column "name" value is out of range for data type "nvarchar".  Update column to a legal value.
CHECKDB found 0 allocation errors and 1 consistency errors in table 'sys.sysbinobjs' (object ID 58).
CHECKDB found 0 allocation errors and 1 consistency errors in database 'master'.

So there is our data purity error. These are kind of a mixed bag. You can’t run a repair statement to fix them, but then again, you don’t risk losing any more data than you already have.  What it means is that SQL Server cannot interpret the data within a column, because the data isn’t valid or within the correct range for the data type of that column. In this case, we have an nvarchar column with a single byte of data in it.

Now you’ve probably guessed by this point, and if you haven’t, I’ll make it clear: the corruption in this case was NOT caused by CHECKDB. It has to do with a peculiarity in the security setup here. Something is being logged in a way that bypasses the data type check, inserting a record into this table that doesn’t fit the column’s data type. Let’s take a look at this using DBCC PAGE. If you’re not familiar with DBCC PAGE, please see my previous post on it, DBCC PAGE to the Rescue.

DBCC TRACEON(3604); /* Enables output to console */
GO
DBCC PAGE (1,1,85,3); /* dbid 1, file 1, page 85, style 3 (show me everything) */
GO

After bringing up the page, we do a search for slot 26, and find:

Slot 26 Column 4 Offset 0x2f Length 1 Length (physical) 1
name = INVALID COLUMN VALUE

Just for comparison, here’s what a normal, single-character value of that type would look like:

Slot 0 Column 1 Offset 0xb Length 2 Length (physical) 2
col1 = a

See the problem? Because this is an nvarchar (Unicode) type, every character is stored as two bytes, so the minimum length of a non-empty value is going to be 2 bytes. (See here for the details on what I mean regarding the data types.) The value stored here is only a single byte, which is too small for SQL Server to even attempt to interpret as a valid nvarchar.
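
If you want to see that two-byte behavior for yourself, DATALENGTH makes the point nicely:

SELECT DATALENGTH(N'a') AS NvarcharBytes,  /* returns 2 */
       DATALENGTH('a')  AS VarcharBytes;   /* returns 1 */
GO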

In the demos I do, I fix a data purity error with an update statement. Unfortunately, you can’t directly access this particular table. It’s a hidden system table, so there’s no way to even select from it, much less update it.
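
For a normal user table (unlike this hidden one), the fix would look something like this; the table, column, and key value below are made up for illustration, and the point is just the shape of the statement:

UPDATE dbo.SomeTable
SET SomeColumn = 0            /* any value that's legal for the column's data type */
WHERE SomeKeyColumn = 12345;  /* the row identified from the CHECKDB 2570 message / DBCC PAGE */
GO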

So what *can* we do? Restore master from a backup. Well, technically we could just ignore the error by running CHECKDB against master with PHYSICAL_ONLY, which simply skips the data purity checking. You could also disable data purity checking on master by enabling trace flag 2566. Keep in mind, though, that disabling data purity checks means that if any new data purity issues crop up, you’re not going to know. I’m not in favor of either option, so let’s go to that backup we took at the beginning of this exercise.
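
For completeness, if you did want to go one of those routes (again, not my preference), the commands look something like this:

DBCC CHECKDB (master) WITH NO_INFOMSGS, PHYSICAL_ONLY; /* skips the logical checks, including data purity */
GO
DBCC TRACEON (2566, -1); /* disables data purity checks until restart; use -T2566 as a startup parameter to make it stick */
GO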

Restoring master is, understandably, different from restoring any other database, and here’s how you have to do it. You need to stop the SQL Server service, then start it from the command prompt in single-user mode, using the -m switch. In my case, I have multiple instances of SQL Server running on this particular machine, so I also need to specify which instance using the -s switch. Like so:

X:\<path to sqlservr.exe>\sqlservr.exe -m -s mssqlserver

Note that specifying “mssqlserver” will start the default instance, which is where we’re testing this. In a separate command window, I logged into my instance using sqlcmd.exe and restored master with a plain old restore command:

RESTORE DATABASE [master] FROM DISK = 'X:\<path to file>\master_full.bak' WITH REPLACE;

When the restore of master is complete, SQL Server will automatically shut down. I was already at the command line, so I restarted both SQL and the SQL Server Agent with net start…

net start mssqlserver
net start sqlserveragent

Logging in with my user account (not sa) and running CHECKDB as above shows that the database is now clean.

So, to avoid this: if you fit the criteria described at the beginning of this article and in Microsoft’s KB article, please be sure you download and install the appropriate update for your instance of SQL Server.

Keep checking those databases for corruption, folks.

Thanks for reading.

-David.

To Backup or not to Backup. In Development, is the question.

Recently an old debate has resurfaced for me where the central question is: Do we, or do we not back up the development environment? Say it with me, now:

“It depends.”

I have seen a trend towards reducing or even eliminating the backups in development or test environments, since they don’t get as high a priority as the production environment. Development isn’t a client- or public-facing area of the business, so it gets lower priority when it’s down. Dev and Test *are* the production environment for your developers, though. That’s where they work, day in and day out. How comfortable would you feel telling a developer that they’ll have to re-do the last couple of hours’ worth of work? How about the last day or so? More?

Another common argument that I hear against backing up the development environment is the storage cost of what could be considered duplicate data, since Dev and Test are usually copies of Production. Fair enough, but if development is not as high a priority to the business, wouldn’t we be able to save costs on backup storage using things like de-dupe, snapshots, and compression?  In addition to storage cost, how much are you paying your developers? I assume they’re still drawing a salary while the environment is unavailable, or while they are re-doing work that was lost due to a database issue. Is that not also a cost?

I’m heavily skewed in favor of backing up the development environment. As the DBA, I’m very nervous about having anything without a backup. But I have to admit that there are some situations where backups just aren’t as useful as they should be, and it’s easier to just recreate instead of restore. The key question is this: What is your RPO/RTO for your developers, and how will you ensure that? Most of the time, backups are part of the procedure. Sometimes they aren’t.

One situation I have seen more than a few times is where production databases are regularly copied to the development or test environments, and the latest version of the database changes is applied. This not only makes sure the databases are code-current; it has the additional benefit of testing the deployment scripts as well. In this case, is it worth taking a backup? Again, it depends. What is your SLA for development, and is it faster to restore the development database, or to re-do all the changes? If restoring takes an hour, but re-copying and re-applying the changes takes an extra hour on top of that, is that extra hour worth the little bit you save on backup storage?
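
As a rough sketch of that refresh pattern (the database name, logical file names, and paths here are placeholders, not anyone's real environment):

RESTORE DATABASE DevCopy
FROM DISK = 'X:\Backups\Prod_full.bak'
WITH MOVE 'Prod_Data' TO 'D:\SQLData\DevCopy.mdf',
     MOVE 'Prod_Log'  TO 'L:\SQLLogs\DevCopy_log.ldf',
     REPLACE, STATS = 10;
GO
-- ...then run the current deployment scripts against DevCopy to bring it code-current.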

There’s no clear-cut answer, but I know that restores save careers. I will almost always advocate for backing up all of my environments.

What about you?

Thanks for reading.

-David.

Shrink Is A Four Letter Word

Recently, I’ve been noticing some large-scale log growth on some of my developer SQL Servers, which has been causing them to run out of disk space. Though shrinking files is almost universally a bad idea, there is a time and a place for doing so, provided the ramifications of shrinking are understood. Needing to shrink a log file indicates that either we’re operating on more data than we expected, or operating inefficiently. Either way, it’s almost always the result of unplanned circumstances, and in those situations you can’t always use the option you want to. Sometimes you have to do something regrettable.

When *NOT* to shrink:

  • When the growth happens in a Data file. Shrinking is going to rob you of free space required for index maintenance, not to mention fragmenting the hell out of your existing indexes, making them near useless.
  • When the growth is in TempDB. Shrinking TempDB while it’s in use can cause corruption. Either add TempDB files (see the sketch just after this list), or kill the offending transaction and recreate TempDB. You’ll need to restart SQL Server for that, and killing a transaction that big may cause a very long rollback. You’ve been warned.
  • As part of regular maintenance.  If your log files are growing repeatedly, then they’re trying to get to the size they need to be in order to handle your workload. Why are you shrinking them? What do you have against them? They have dreams, too…  Fix the workload, not the log files.
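
For the TempDB case above, adding a file looks something like this (the path and sizes are placeholders for whatever fits your environment):

ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2,
          FILENAME = 'T:\TempDB\tempdev2.ndf',
          SIZE = 4GB,
          FILEGROWTH = 512MB);
GO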

Alternatives to shrinking:

  • Size your files properly to begin with, and batch large transactions so the log doesn’t have to grow in the first place. (More on both of these below.)

When to shrink? I can only think of two situations off the top of my head:

  • In an emergency, where other workloads on the same disk are affected and there’s no additional disk elsewhere for another log file.
  • When a database is being archived / retired, and will be set to read-only. Even in that case, you wouldn’t just shrink the DB. See Retiring a Database.

Benefits to shrinking:

  • Free disk space.
  • That’s about it.

Caveats to shrinking:

  • Massive performance hit during re-growth, as all transactions must be suspended.
  • Massive performance hit due to file fragmentation on re-growth.
  • Massive performance hit due to internal fragmentation on shrink.
  • Massive perfor…. You get the idea.

So really, we want to avoid shrinking log files wherever possible. The best way to avoid shrinking is to properly size files to begin with. Make sure you have enough room for your transactions to run. If you have large transactions or volumes of data to work with, one good way to avoid file growth is to batch large transactions into smaller groups: instead of inserting 5M rows, insert 100k rows at a time, 50 times. This can actually be faster than a single large insert, as you’re not stuck waiting for the log file to grow to accommodate 5M records.
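
Here’s a minimal sketch of that batching pattern (the table and column names are made up; adjust the batch size to what your log can comfortably handle):

DECLARE @BatchSize int = 100000,
        @RowsMoved int = 1;

WHILE @RowsMoved > 0
BEGIN
    -- Copy the next batch of rows that haven't been moved yet, so each batch
    -- commits and its log space can be reused before the next one starts.
    INSERT INTO dbo.TargetTable (Id, Col1, Col2)
    SELECT TOP (@BatchSize) s.Id, s.Col1, s.Col2
    FROM dbo.SourceTable AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.TargetTable AS t WHERE t.Id = s.Id);

    SET @RowsMoved = @@ROWCOUNT;
END;
GO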

Some growth may still occur since, let’s face it, we have a lot of data to work with. What we want to avoid is large, recurring growth. Remember that when the log file must grow, all transactions are suspended.  If you are only growing slightly, that may not be a problem, but when you have 100GB of growth, that’s going to stop you in your tracks for the duration of the growth.

If you must shrink a log file due to uncontrolled or unexpected growth, you can use the SHRINKFILE command, but with one caveat: I would recommend shrinking the file by half instead of to its smallest possible size. This way, future growth may be avoided while queries are tuned to use less log space. For example, if a log that had been averaging 2GB grew to 40GB, I would shrink it to 20GB, like this:

USE <DatabaseName>;
DBCC SHRINKFILE(<file ID>,<target size>);

Example:

USE BigLogDB;
DBCC SHRINKFILE(2,20000);

2 is the ID of the log file for most databases. You can use sp_helpfile in the database to check that. 20000 is the target size, which is always in MB. Shrinking requires sysadmin or db_owner permissions, which some developers do have. I can also assist you with these tasks as needed. Another advantage to this approach is that it gives us an idea of how large the production log files will need to be to accommodate new SQL code.
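
To double-check the file ID before shrinking (using the example database above):

USE BigLogDB;
GO
EXEC sp_helpfile;  -- the log file is the row whose usage column says 'log only'
GO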

Also, I would like to stress that shrinking is a temporary fix, not a long-term workaround, and *never* part of regular maintenance. If you have a job on a server that is regularly shrinking log files, then you either need more disk space or smaller transactions. Regular shrinking is a sign that something bad is going on. And while shrinking can occasionally be justified for log files, it is almost never a good idea to shrink a data file. Avoid using SHRINKDATABASE.

Thanks for reading.
-David.

Retiring a Database

Every dog has its day, and every database has its time. Maybe “shelf life” would be more appropriate. But with data retention times getting longer and longer, some databases must be kept around long after their applications have been retired. So what’s a good way to retire a database?

I have a set procedure that I use for retiring a database once its application is no longer in use. There is some ongoing maintenance, but that’s the price you pay for data retention. If you need to put a DB in cold storage, here’s one way.

Backup – Set the recovery model of the database to SIMPLE and take a backup of the existing database so you have a starting point to go back to, should anything go wrong with this procedure.  I’ve never had to use that backup in my time, but you never know. It pays to be cautious.  We’ll need the DB to be in SIMPLE mode since the next thing we’re going to do is…

Shrink – Yes, I know that’s considered a dirty word by many data professionals, but if we understand what shrink does, and are ready to mitigate any of the problems it presents, then we can shrink a database safely.  In this case, we’re taking the database out of production and storing it.  Disk space just became far more important than performance. So, we can go ahead and shrink the database as small as we can get it.

Optimize – Now that we’ve shrunk the database, we’re going to have loads of internal fragmentation going on.  At this point, we’re going to re-index everything and update all statistics, so that when we do restore this database, it will be ready to read, quickly. Hey – I said that performance wasn’t as important as disk space. I didn’t say it wasn’t important at all. :-)

Truncate – You’ll notice (hopefully) that after the re-index / update, our DB size has grown again. We still want to save some disk space, so let’s shrink the data file again, but this time using TRUNCATEONLY. All we want to do is remove the empty space at the end of the file. This may not net us much additional space, if any, but when it comes to storage, every byte counts. For the log file, we’ll go ahead and shrink it as far down as it will go. Since it’s not storing data, we don’t need it.

SET to READ_ONLY – When and if we do restore this database, we’re probably going to want to avoid accidentally updating the data, since it’s being held for archival purposes. If we really do need to deliberately update the data in this database, we can set it to READ_WRITE after we restore it.  If required, we can then follow this procedure again to archive that copy of the database.

Backup Again – This is the most critical part of the process. Now that we have a good copy of the database, we want to back it up to the most up-to-date version of SQL Server in our shop, as well as storing it on the most recent hardware we have, regardless of the media type. This may involve taking backups, restoring them to other, more updated servers, and backing up again. 
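
Put together, the steps above look roughly like this. The database name, logical file names, and backup paths are placeholders, and your own re-index/statistics routine goes where indicated:

-- 1. Simple recovery and a safety backup
ALTER DATABASE ArchiveDB SET RECOVERY SIMPLE;
BACKUP DATABASE ArchiveDB TO DISK = 'X:\Backups\ArchiveDB_pre_retire.bak' WITH COMPRESSION;

-- 2. Shrink the data file as small as it will go
DBCC SHRINKFILE (ArchiveDB_Data, 1);

-- 3. Re-index and update statistics (insert your favorite maintenance routine here)

-- 4. Release the space freed by the re-index, and shrink the log all the way down
DBCC SHRINKFILE (ArchiveDB_Data, TRUNCATEONLY);
DBCC SHRINKFILE (ArchiveDB_Log, 1);

-- 5. Read-only for archival
ALTER DATABASE ArchiveDB SET READ_ONLY WITH NO_WAIT;

-- 6. The backup you'll actually keep (and test)
BACKUP DATABASE ArchiveDB TO DISK = 'X:\Backups\ArchiveDB_final.bak' WITH COMPRESSION, CHECKSUM;
GO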

Restore – TEST THAT BACKUP. The only way you truly know if a backup is good is to restore it. Also, as part of our routine DR exercises, it’s a good idea to update that backup to the most recent version of SQL we have, and the most recent storage hardware as well. With that, you should be able to keep your data around indefinitely. 

Why not just take a backup, put it on a removable disk and be done? Imagine it’s 10 years later. The database we need is backed up to media we can’t read because the hardware died 4 years ago, and “nobody makes those drives anymore”. Don’t laugh. I have seen this very failure. I’ve also seen situations where we can read the media, but our backup format is so old that our newer software can’t read it. We have to go scrambling for a (sometimes unlicensed) copy of the older software. Don’t let this happen to you. Having redundant copies, i.e., a backup of the backup, is a great idea as well. Avoid single points of failure. Oh, and we are using backup compression, right? :-)

One final caveat to this: if the DB we are restoring uses features that get deprecated, we’ll want to know, so a good suggestion is to use the TRY/CATCH restore method described by Shaun Stuart.  If we do have deprecated features in use, we’ll have to decide how to address those when we find them. They may or may not matter.

Thanks for reading.

-David.

Dealing With Disk Space Issues.

On a typical day a few months ago, I received an automated warning in my inbox that a particular server was low on disk space. This came along with the requisite “panic” emails I usually get from other areas who are also on that notification list.  I responded that I was looking into it in order to keep the email volume down. :-)

I mapped the volume in question to my local machine and fired up SpaceSniffer to get a quick read on what was taking up all the space. It was a dedicated SQL Server data file drive and the usual suspects were taking up the lion’s share of the space. (SpaceSniffer is a fantastic tool, by the way. You should check it out, if you haven’t already.)

Now, when it comes to disk space issues, there are three basic options:

1. Do nothing.  This should always be considered as an approach, though you may rarely use this as a solution. Sometimes you really don’t need to do anything.

2. Reduce the space required. You can relocate files, clean up un-needed data, etc. You can even shrink files if you understand what the consequences would be and have a plan for dealing with that.

3. Increase the space available. In this case, we were on a SAN drive that had available space left, so we could grow the drive to make more space available.

Another quick check showed that most of the data files were over 90% used, and none of them would be able to grow unless more space was made available. That eliminated option 1. There were no other files on that drive to clean up or delete, i.e., no databases to drop, so we couldn’t reduce the amount of space required for option 2. That left us with option 3: Grow the drive.  This was an online operation handled off-hours by our SAN admin, and was successful.

After resolving any issue, it’s a good idea to go back and do some root-cause analysis. What caused the disk space issue? One of the first things I do when taking on a new client or job is to set up a monitoring job on every server that keeps a few months’ worth of history on the database files. Using the sys.dm_io_virtual_file_stats DMV, I take a snapshot of the disk activity for each data file, as well as its size on disk, every 15 minutes. This is then logged to a table in my administrative database on each server.
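
The snapshot step itself is short. It looks something along these lines (the FileIOStatsRaw column list here is illustrative; the downloadable scripts below have the real schema):

INSERT INTO dbo.FileIOStatsRaw
    (CaptureTime, DBName, FileID, FileType,
     NumReads, NumWrites, BytesRead, BytesWritten, SizeOnDiskBytes)
SELECT
    SYSDATETIME(),
    DB_NAME(vfs.database_id),
    vfs.file_id,
    mf.type_desc,                -- ROWS or LOG
    vfs.num_of_reads,
    vfs.num_of_writes,
    vfs.num_of_bytes_read,
    vfs.num_of_bytes_written,
    vfs.size_on_disk_bytes
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
   AND mf.file_id = vfs.file_id;
GO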

With the stats in place, I was able to put together a quick query to see which database had grown the most over the last few months. It turned out that one of the smaller databases had now become the second biggest database, all within the space of the last 8 weeks. Checking with the owner of the application, I found out a lot of new data had recently been imported, causing the database to grow by over 600%.  After that, it didn’t take much more growth on the part of the other data files to push that disk over its threshold, and send an alert.

Here are the scripts I use, in case you’re interested.

  • DDL for FileIOStatsRaw table and vw_FileIOByHour view. – Created in the administrative DB. (In my case, [SQLAdmin].)
  • GetFileIOStats stored procedure – Run this as frequently as you want and output to the FileIOStatsRaw table. – Also created in the admin DB. Will keep 90 days by default, but call it how you want in an agent job.
  • Trend Query:
SELECT 
 DBName, 
 Yr, 
 Mo, 
 CalculatedGrowthInMBPerMonth = SUM(FileGrowthBytes) / 1024.0 / 1024.0
FROM dbo.vw_FileIOByHour 
WHERE FileType = 'ROWS' 
GROUP BY DBName, Yr, Mo 
ORDER BY DBName, Yr, Mo;
GO

How do you monitor, measure and alert on disk space usage?

Thanks for reading.

-David.

Be A Master of Disaster

Disaster recovery plans are a must have. If you’re not motivated enough by recent natural disasters, or data breaches, or anything else that’s been wreaking digital havoc in this world lately, I’m not sure what will do the trick. If you haven’t made a DR plan, I would take a few minutes to just sketch out a quick, paper napkin-style plan.  What would you do first, second, third, etc… It doesn’t have to be complete, it just has to be a plan.  Go ahead. I’ll wait…

Got it?  Good.  Now. How do you know that will work?

Plans are an important, and great first step. Much like restores, however, they don’t do you any good unless you test them.  It’s not always easy, or even practical to do so. Sometimes even testing DR plans can cause outages.

Further, we may be in the habit of constantly testing our backups, but how often do we test our DR plans? Quarterly? Yearly? I’ve been in more than one DR rehearsal where we found that major changes in other parts of the infrastructure had catastrophic impacts on our DR processes. How often do we write documentation in the first place, let alone update all the related DR documentation when a major system changes?

In my current position, if disaster struck right now, I’m confident we would be fine. But only because we recently went through several very painful DR tests that did not go as planned. Just about everything that could go wrong, did.  However, I don’t see that as a bad thing. In fact, I see it as a very positive thing. We tested, failed, adjusted the plan, tested again, repeated that process a few more times, and off we went. Now we know we have a solid DR plan.

I think that’s worth a few failures. :-)

How do you test your DR plans?

Thanks for reading.

-David.

Quick Tip: SSMS Pseudo-Full Screen View

I have a number of nervous and idle habits that may potentially annoy my cube neighbors. They haven’t complained, yet. One of my habits is that when I’m working on a particularly interesting or annoying script, I will randomly click around the query window. Click, click, click. I was doing this a couple of days ago, and for some reason, I was clicking on the tab at the top of the query window, where the name of the script is.

[Image: the normal tabbed query window in SSMS.]

Then, I double clicked on it.

[Image: the query window floating free of SSMS. Ooh, that’s neat.]

Ooh, that’s neat, I thought. Then I clicked on the maximize button, and this is what I got.

[Image: almost full screen. Full screen, but I still have my taskbar.]

Awesome. Finally, I restored the window and dragged it back into SSMS to put it back into tabbed view.

[Image: dragging the floating window back to re-dock it in tabbed view.]

Now, let me explain: Another habit of mine is sliding the mouse to the left or right of the screen when I’m coding. Sharp-eyed readers will notice that that’s where my side panes are minimized. So when I’m trying to focus on the code, I inevitably cause the side panes to slide out in front of where I’m either reading or typing. Now, I could just put SSMS into full screen to hide the panes, but I’m not a fan of that either, because I have status icons in my taskbar I want to see. This is a perfect hybrid for me, and it even gets the SSMS toolbars out of the way. Bonus! I figured if it was useful to me, it might be useful to someone else as well. So here you go.

Thanks for reading.
-David.

T-SQL Tuesday #36: What Community Means To Me

First off, thanks to Chris Yates (blog | twitter) for hosting this edition of T-SQL Tuesday. For those of you who don’t know, T-SQL Tuesday is a recurring blog party started by Adam Machanic (blog | twitter) that happens on the second Tuesday of each month. The host for that month picks a theme, and off we go. This is my first time participating, but I guarantee it won’t be my last. :-)

This month’s theme is: What does the community mean to you? I’ve been thinking about this a lot lately, and I keep coming back to a few thoughts each time.

Community means being able to explain what you do for 50+ hours per week with a single sentence. You have no idea how many times I tell people what I do, and it takes two paragraphs or more to explain it. I don’t have to do that when I’m around you folks, and I love that.

Community means there’s always help available. It may not be fast, but it’s there, and it’s good – every time. When I show DB2 or Oracle people what we do with #sqlhelp, their jaws hit the floor every time. When I tell them about SQL Saturday, the rest of them hits the floor as well. Recently, I even heard a DBA from another product say, “Wow – I should have learned SQL, instead.” I replied with, “Never too late to start, we’ll help you.” And I know we will.

Community means spending an entire Saturday learning. Even if it’s a gorgeous day outside. It says something about a community when a couple hundred people take a weekend to train themselves, get better, and help others get better as well. I just don’t see that in a lot of other groups. At least not on the scale that we do it. It shows our professionalism and dedication, and it makes a difference.

Community means giving, just as much as receiving. Yes, I was a little slow to start, but the primary reason I started this blog, and started speaking and writing about SQL Server, is that I get so much from the community that I just had to start giving something back. And on top of that, it feels good. Whether it’s answering a question on #sqlhelp, speaking at a SQL Saturday, or hearing that someone read something I wrote, it feels good to know that I helped someone. (Even if it was by being a bad example. ;-) )

Community means being welcome. I think this applies to just about any community, but I really feel it in this one, more so than in many others I’ve participated in. This is one of the most welcoming groups of people I’ve known. Maybe it’s just a function of common experience, but I feel like I can just walk up to anyone, and start a pleasant conversation. You don’t get that everywhere, and I’m glad to have it here. No wonder many of us call it #SQLFamily. It feels like home.

Thanks for reading.

-David.

Emergency Mode Repair Playtime

I have a little database that I don’t back up, nor do I do any maintenance on it. I don’t even check it for corruption. This database is what I would consider a “throw-away” database. It would be faster to have the application recreate the DB from scratch than it would be to restore it from backup, and it doesn’t hold any permanent information. In fact, the only time it is of any use to me is when I’m actively using the application, and information is only retained for an hour. So it has really limited usage.

With that in mind, imagine my delight when this database became corrupted. “Yay! A corrupt DB to play with!” SSMS showed the database status as SUSPECT. Like any good DBA, the first thing I did was take a backup. You know how to do that, right?

BACKUP DATABASE igrep TO DISK = 'F:\igrep_corrupt.bak' WITH CONTINUE_AFTER_ERROR;
GO

Having saved off a copy of that DB, I placed the database in emergency mode, so that I could run a full consistency check with this:

ALTER DATABASE igrep
SET EMERGENCY;
GO

DBCC CHECKDB(igrep)
WITH NO_INFOMSGS;
GO

The final tally was 5 allocation errors, and 39 consistency errors. In other words, this DB was a mess. Since there was no useful data to lose at this point, I decided to go ahead and run a repair to see what it would do.

DBCC CHECKDB(igrep, repair_allow_data_loss)
WITH NO_INFOMSGS;
GO
Msg 7919, Level 16, State 3, Line 1
Repair statement not processed. Database needs to be in single user mode.

Oops. Forgot that when a database is in EMERGENCY mode, it’s restricted to sysadmins only, but it’s not in SINGLE_USER.

ALTER DATABASE igrep
SET SINGLE_USER;
GO

Hmmm… no error messages on that one. Just for grins and giggles…

ALTER DATABASE igrep
SET ONLINE;
GO
Msg 926, Level 14, State 1, Line 1
Database 'igrep' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.
Msg 5069, Level 16, State 1, Line 1
ALTER DATABASE statement failed.
Msg 3414, Level 21, State 2, Line 1
An error occurred during recovery, preventing the database 'igrep' (database ID 10) from restarting.
Diagnose the recovery errors and fix them, or restore from a known good backup. If errors are not corrected or expected, contact Technical Support.

Ok, wasn’t expecting success anyway. :-) Back to emergency mode. This time, we’ll run the repair as well.

ALTER DATABASE igrep
SET EMERGENCY;
GO

DBCC CHECKDB(igrep, repair_allow_data_loss)
WITH NO_INFOMSGS;
GO

Tick tock, tick tock, tick tock, DING! About 10 minutes later… here’s what DBCC reported back:

CHECKDB found 5 allocation errors and 70 consistency errors in database 'igrep'.
CHECKDB fixed 2 allocation errors and 70 consistency errors in database 'igrep'.

Woot! We’re complete! In order to fix the consistency errors, the following things happened:
  • Some pages were deallocated (buh-bye, data).
  • Some extents were marked as allocated to specific objects.
  • Specific clustered and non-clustered indexes were rebuilt.

The other interesting thing to note is that the number of consistency errors went *up* when the allocation errors were fixed. It’s likely that…

Hold up… All the consistency errors were fixed, but only two of five allocation errors? Let’s scroll up a bit. Near the top, we find the following:

Msg 8905, Level 16, State 1, Line 2
Extent (1:682840) in database ID 10 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
Could not repair this error.
Msg 8905, Level 16, State 1, Line 2
Extent (1:682880) in database ID 10 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
Repairing this error requires other errors to be corrected first.
Msg 8905, Level 16, State 1, Line 2
Extent (1:682952) in database ID 10 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
Repairing this error requires other errors to be corrected first.

Interestingly enough, SQL has marked the database as ONLINE. I thought that would denote a successful repair, but DBCC says once again that repair_allow_data_loss is the minimum repair level for this database. Running CHECKDB again, we get three of these:

Msg 8905, Level 16, State 1, Line 1
Extent (1:682952) in database ID 10 is marked allocated in the GAM, but no SGAM or IAM has allocated it.

Let’s run the same repair again. This time, we get this:

CHECKDB found 3 allocation errors and 20 consistency errors in database 'igrep'.
CHECKDB fixed 3 allocation errors and 20 consistency errors in database 'igrep'.

So fixing the allocation errors caused more consistency errors, again. Not surprising, when you think about it. Allocating and deallocating pages, especially when foreign keys are in play, is bound to cause some structural inconsistency. That’s why it’s paramount that after running a repair of any kind, you run another CHECKDB to see if the repair caused any other damage. Speaking of which – after this last repair, CHECKDB finished with no errors, and the application’s service started fine. The application itself started working again, and all was well. Keep in mind that this was the *least* likely outcome. I didn’t do a good job, or even do the right things, here – I got *extremely* lucky.

So what are the big ideas we take away from all this?

  1. Backups are your friend. Take them, test them, and store them safely.
  2. As a rule, try a restore before you try a repair. The former is clean; the latter can get *really* messy.
  3. If you don’t have a backup, take one with CONTINUE_AFTER_ERROR before doing *anything* else.
  4. Emergency mode can sometimes bring up otherwise inaccessible databases. Remember that to run a repair, though, you’ll still have to set the database to SINGLE_USER mode.
  5. Just because repair finished, doesn’t mean you’re good to go. Check and re-check your database, both before and after repair.

Anything else I missed?

Thanks,
-David.