To Back Up or Not to Back Up: In Development, That Is the Question

Recently an old debate has resurfaced for me, and the central question is: Do we, or do we not, back up the development environment? Say it with me, now:

“It depends.”

I have seen a trend towards reducing, or even eliminating, backups in development and test environments, since they don’t get as high a priority as production. Development isn’t a client- or public-facing area of the business, so it gets lower priority when it’s down. Dev and Test *are* the production environment for your developers, though. That’s where they work, day in and day out. How comfortable would you feel telling a developer that they’ll have to re-do the last couple of hours’ worth of work? How about the last day or so? More?

Another common argument that I hear against backing up the development environment is the storage cost of what could be considered duplicate data, since Dev and Test are usually copies of Production. Fair enough, but if development is not as high a priority to the business, wouldn’t we be able to save costs on backup storage using things like de-dupe, snapshots, and compression?  In addition to storage cost, how much are you paying your developers? I assume they’re still drawing a salary while the environment is unavailable, or while they are re-doing work that was lost due to a database issue. Is that not also a cost?

I’m heavily skewed in favor of backing up the development environment. As the DBA, I’m very nervous about having anything without a backup. But I have to admit that there are some situations where backups just aren’t as useful as they should be, and it’s easier to just recreate instead of restore. The key question is this: What is your RPO/RTO for your developers, and how will you ensure that? Most of the time, backups are part of the procedure. Sometimes they aren’t.

One situation I have seen more than a few times is where production databases are regularly copied to the development or test environments, and the latest database changes are applied. This not only keeps the databases code-current, it has the added benefit of testing the deployment scripts as well. In this case, is it worth taking a backup? Again, it depends. What is your SLA for development, and is it faster to restore the development database or to re-do all the changes? If it takes an hour to restore, but an additional hour to re-apply the changes, is that extra hour worth saving a little bit on backup storage cost?

There’s no clear-cut answer, but I know that restores save careers. I will almost always advocate for backing up all my environments.

What about you?

Thanks for reading.

-David.

SPROC: Backup History Report

A while back, I wrote a script to show backup throughput over time. It was intended to diagnose deteriorating disk IO performance for the backup destination. I figured it could be useful for other things as well, so I combined it with my backup history report into a single stored procedure. I hope you find this as useful as I do. Please let me know if you have any questions on it, or suggestions for improvements.

Thanks for reading,
-David.

USE [SQLAdmin];
GO

IF OBJECT_ID('dbo.ShowCurrentBackupHistory') IS NULL
BEGIN
    EXEC ('CREATE PROCEDURE dbo.ShowCurrentBackupHistory
           AS
           PRINT ''STUB VERSION - Replace with actual procedure.'';');
END
GO

ALTER PROCEDURE dbo.ShowCurrentBackupHistory
/*******************************************************************************
AUTHOR:
  David M Maxwell
  
DATE:
  Nov 6 2013

DESCRIPTION:
  
  Shows the last @days of backup history for all databases.

  Shows name, current recovery model, backup type (full, diff or log), finish time,
  duration in seconds, backup size in MB, throughput in MB per second, user name
  that took the backup, and the backup file destination.

DEPENDENCIES:
  None.
  
TEST EXEC:

EXEC dbo.ShowCurrentBackupHistory
   @days = 30 -- Last 30 days of history
GO

TO DO LIST:
  1.

NOTES:  
  Please send any comments or suggestions to dmmaxwell@gmail.com.

CHANGE HISTORY:
Date - Author - Change
Nov 6 2013 - David M Maxwell - Initial version. Pending feedback.

*******************************************************************************/
  
@days int = 30
AS

SET NOCOUNT ON;

/* Convert @days to a datetime to keep things simple. */
DECLARE @dt datetime;
SELECT @dt = DATEADD(dd, -@days, GETDATE());

SELECT
   DBName = bs.database_name
  ,CurrentRecModel = db.recovery_model_desc /* IF NULL, then DB no longer exists. */
  ,BackupType =
    CASE bs.type
      WHEN 'D' THEN 'Full'
      WHEN 'I' THEN 'Diff'
      WHEN 'L' THEN 'TLog'
      ELSE 'Unknown'
    END
  ,DateCompleted = bs.backup_finish_date
  ,Duration = DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date)
  ,DataSizeMB = CAST(ROUND(bs.backup_size / 1048576.0, 2) AS numeric(10,2))
  ,[MB/sec] = CAST(ROUND((bs.backup_size / 1048576.0) /
      CASE
        WHEN DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date) > 0
        THEN DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date)
        ELSE 1
      END, 2) AS numeric(10,2))
  ,UserName = bs.[user_name]
  ,BackupFile = bmf.physical_device_name
FROM msdb.dbo.backupset AS bs
INNER JOIN msdb.dbo.backupmediafamily AS bmf
  ON bs.media_set_id = bmf.media_set_id
LEFT OUTER JOIN master.sys.databases AS db
  ON bs.database_name = db.name
WHERE bs.backup_finish_date > @dt
ORDER BY bs.backup_finish_date DESC;
GO

Script: Backup Throughput

Recently on one particular server, I noticed that backups were taking much longer than they should have. The databases and backups weren’t getting significantly larger, so I wanted to know whether there was a problem with the backup storage. To find out, I wanted to look at backup throughput over time. Here’s the query I wrote to do that.

The query makes use of two tables in MSDB, namely [backupset] and [backupmediafamily]. From [backupset], we can get the database name, start and finish times, and the size of the backup. To get the file name, we need to go to [backupmediafamily], and join on the media_set_id.  By calculating the difference in seconds between start and finish times, and converting the backup size from bytes to MB, we can get the backup throughput in MB/s.

Also, in this case, I’m filtering for backups that are either full or differential, and take longer than 10 minutes. This eliminates backups that only take a few seconds, since those will give artificially low throughput numbers. (If I only backed up 1MB, and it took 1s, then my throughput is 1MB/s even if I could have done more.)

SELECT
   bs.database_name AS DBName
  ,bs.backup_start_date AS DateStarted
  ,bs.backup_finish_date AS DateCompleted
  ,Duration = DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date)
  ,bs.backup_size / 1048576.0 AS DataSizeMB
  ,[MB/sec] = (bs.backup_size / 1048576.0) /
    CASE
      WHEN DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date) > 0
      THEN DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date)
      ELSE 1
    END
  ,bmf.physical_device_name AS BackupFile
FROM msdb.dbo.backupset AS bs
INNER JOIN msdb.dbo.backupmediafamily AS bmf
  ON bs.media_set_id = bmf.media_set_id
WHERE bs.type != 'L'
  AND DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date) > 600;

That will give me the backup throughput for each Full or Differential backup running longer than 10 minutes that remains in the backup history. If you want to look at averages for all databases over time, you can group by parts of the datetimes. One way to do that would be like this:

WITH BackupThroughput AS (
  SELECT
     bs.backup_finish_date AS DateCompleted
    ,[MB/sec] = (bs.backup_size / 1048576.0) /
      CASE
        WHEN DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date) > 0
        THEN DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date)
        ELSE 1
      END
  FROM msdb.dbo.backupset AS bs
  INNER JOIN msdb.dbo.backupmediafamily AS bmf
    ON bs.media_set_id = bmf.media_set_id
  WHERE bs.type != 'L'
    AND DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date) > 600
)
SELECT
   BackupDate = MIN(DateCompleted)
  ,AverageThroughput = AVG([MB/sec])
FROM BackupThroughput
GROUP BY
   DATEPART(YEAR, DateCompleted)
  ,DATEPART(MONTH, DateCompleted)
  ,DATEPART(DAY, DateCompleted)
ORDER BY
   DATEPART(YEAR, DateCompleted)
  ,DATEPART(MONTH, DateCompleted)
  ,DATEPART(DAY, DateCompleted);
With that information in hand, I can check to see if there was an obvious drop in backup throughput, and then investigate that.

Hope that helps.
-David.

Retiring a Database

Every dog has its day, and every database has its time. Maybe a “shelf life” would be more appropriate. But with data retention times getting longer and longer, some databases must be kept around long after their applications have been retired. So what’s a good way to retire a database?

I have a set procedure that I use for retiring a database once its application is no longer in use. There is some ongoing maintenance, but that’s the price you pay for data retention. If you need to put a DB in cold storage, here’s one way.

Backup – Set the recovery model of the database to SIMPLE and take a backup of the existing database so you have a starting point to go back to, should anything go wrong with this procedure.  I’ve never had to use that backup in my time, but you never know. It pays to be cautious.  We’ll need the DB to be in SIMPLE mode since the next thing we’re going to do is…
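
A minimal sketch of this first step, assuming a hypothetical database named ArchiveDB and a local backup path (both are placeholders):

ALTER DATABASE ArchiveDB SET RECOVERY SIMPLE;
GO

BACKUP DATABASE ArchiveDB
TO DISK = 'F:\Backups\ArchiveDB_pre_retire.bak'
WITH CHECKSUM, STATS = 10;  -- the fallback copy, should anything go wrong later
GO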

Shrink – Yes, I know that’s considered a dirty word by many data professionals, but if we understand what shrink does, and are ready to mitigate any of the problems it presents, then we can shrink a database safely.  In this case, we’re taking the database out of production and storing it.  Disk space just became far more important than performance. So, we can go ahead and shrink the database as small as we can get it.
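
Something like the following would do it, again using the hypothetical ArchiveDB; the second argument is the percent of free space to leave behind:

-- Shrink as far as possible. Fragmentation doesn't matter here,
-- since disk space now outranks performance.
DBCC SHRINKDATABASE (ArchiveDB, 1);
GO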

Optimize – Now that we’ve shrunk the database, we’re going to have loads of internal fragmentation going on.  At this point, we’re going to re-index everything and update all statistics, so that when we do restore this database, it will be ready to read, quickly. Hey – I said that performance wasn’t as important as disk space. I didn’t say it wasn’t important at all. :-)
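
One way to do the re-index and statistics pass, sketched against the hypothetical ArchiveDB. The dynamic SQL simply builds one ALTER INDEX ... REBUILD per table that actually has indexes, and sp_updatestats then refreshes statistics across the database:

USE ArchiveDB;
GO

DECLARE @sql nvarchar(max);
SET @sql = N'';

-- Build a rebuild statement for every table that has at least one index.
SELECT @sql = @sql + N'ALTER INDEX ALL ON '
    + QUOTENAME(s.name) + N'.' + QUOTENAME(t.name) + N' REBUILD;' + NCHAR(13)
FROM sys.tables AS t
INNER JOIN sys.schemas AS s
  ON t.schema_id = s.schema_id
WHERE EXISTS (SELECT 1 FROM sys.indexes AS i
              WHERE i.object_id = t.object_id AND i.index_id > 0);

EXEC sp_executesql @sql;
GO

EXEC sp_updatestats;  -- refresh statistics database-wide
GO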

Truncate – You’ll notice (hopefully) that after the re-index / update, our DB size has grown again. We still want to save some disk space, so let’s shrink the data file again, but this time using TRUNCATEONLY. All we want to do is remove the empty space at the end of the file. This may not net us much additional space, if any, but when it comes to storage, every byte counts. For the log file, we’ll go ahead and shrink it as far down as it will go. Since it’s not storing data, we don’t need it.
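
In code, with placeholder logical file names (check sys.database_files for the real ones in your database):

USE ArchiveDB;
GO

DBCC SHRINKFILE (ArchiveDB_Data, TRUNCATEONLY);  -- release only the empty space at the end of the data file
DBCC SHRINKFILE (ArchiveDB_Log, 1);              -- shrink the log as small as it will go (target size in MB)
GO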

SET to READ_ONLY – When and if we do restore this database, we’re probably going to want to avoid accidentally updating the data, since it’s being held for archival purposes. If we really do need to deliberately update the data in this database, we can set it to READ_WRITE after we restore it.  If required, we can then follow this procedure again to archive that copy of the database.
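
Setting the flag is a one-liner, still using the hypothetical ArchiveDB:

ALTER DATABASE ArchiveDB SET READ_ONLY WITH NO_WAIT;
GO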

Backup Again – This is the most critical part of the process. Now that we have a good copy of the database, we want to back it up to the most up-to-date version of SQL Server in our shop, as well as storing it on the most recent hardware we have, regardless of the media type. This may involve taking backups, restoring them to other, more updated servers, and backing up again. 
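
The final archival backup might look like the sketch below; COMPRESSION assumes an edition that supports backup compression, and the path is a placeholder:

BACKUP DATABASE ArchiveDB
TO DISK = 'F:\Backups\ArchiveDB_final_archive.bak'
WITH COMPRESSION, CHECKSUM, STATS = 10;
GO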

Restore – TEST THAT BACKUP. The only way you truly know if a backup is good is to restore it. Also, as part of our routine DR exercises, it’s a good idea to update that backup to the most recent version of SQL we have, and the most recent storage hardware as well. With that, you should be able to keep your data around indefinitely. 
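
A quick verification pass followed by a full test restore under a throwaway name is one way to prove the backup. The logical and physical file names below are assumptions; adjust them to whatever RESTORE FILELISTONLY reports:

RESTORE VERIFYONLY
FROM DISK = 'F:\Backups\ArchiveDB_final_archive.bak'
WITH CHECKSUM;
GO

RESTORE DATABASE ArchiveDB_RestoreTest
FROM DISK = 'F:\Backups\ArchiveDB_final_archive.bak'
WITH MOVE 'ArchiveDB_Data' TO 'F:\Data\ArchiveDB_RestoreTest.mdf',
     MOVE 'ArchiveDB_Log'  TO 'F:\Data\ArchiveDB_RestoreTest.ldf',
     STATS = 10;
GO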

Why not just take a backup, put it on a removable disk and be done? Imagine it’s 10 years later. The database we need is backed up to media we can’t read because the hardware died 4 years ago, and “nobody makes those drives anymore”. Don’t laugh. I have seen this very failure. I’ve also seen situations where we can read the media, but our backup format is so old that our newer software can’t read it. We have to go scrambling for a (sometimes unlicensed) copy of the older software. Don’t let this happen to you. Having redundant copies, i.e., a backup of the backup, is a great idea as well. Avoid single points of failure. Oh, and we are using backup compression, right? :-)

One final caveat to this: if the DB we are restoring uses features that get deprecated, we’ll want to know, so a good suggestion is to use the TRY/CATCH restore method described by Shaun Stuart.  If we do have deprecated features in use, we’ll have to decide how to address those when we find them. They may or may not matter.

Thanks for reading.

-David.

Emergency Mode Repair Playtime

I have a little database that I don’t back up, nor do I do any maintenance on it. I don’t even check it for corruption. This database is what I would consider a “throw-away” database. It would be faster to have the application recreate the DB from scratch than it would be to restore it from backup, and it doesn’t hold any permanent information. In fact, the only time it is of any use to me is when I’m actively using the application, and information is only retained for an hour. So it has really limited usage.

With that in mind, imagine my delight when this database became corrupted. “Yay! A corrupt DB to play with!” SSMS showed the database status as SUSPECT. Like any good DBA, the first thing I did was take a backup. You know how to do that, right?

BACKUP DATABASE igrep TO DISK = 'F:\igrep_corrupt.bak' WITH CONTINUE_AFTER_ERROR;
GO

Having saved off a copy of that DB, I placed the database in emergency mode, so that I could run a full consistency check with this:

ALTER DATABASE igrep
SET EMERGENCY;
GO

DBCC CHECKDB(igrep)
WITH NO_INFOMSGS;
GO

The final tally was 5 allocation errors, and 39 consistency errors. In other words, this DB was a mess. Since there was no useful data to lose at this point, I decided to go ahead and run a repair to see what it would do.

DBCC CHECKDB(igrep, repair_allow_data_loss)
WITH NO_INFOMSGS;
GO
Msg 7919, Level 16, State 3, Line 1
Repair statement not processed. Database needs to be in single user mode.

Oops. Forgot that when a database is in EMERGENCY mode, access is restricted to sysadmins only, but it isn’t in SINGLE_USER.

ALTER DATABASE igrep
SET SINGLE_USER;
GO

Hmmm… no error messages on that one. Just for grins and giggles…

ALTER DATABASE igrep
SET ONLINE;
GO
Msg 926, Level 14, State 1, Line 1
Database 'igrep' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.
Msg 5069, Level 16, State 1, Line 1
ALTER DATABASE statement failed.
Msg 3414, Level 21, State 2, Line 1
An error occurred during recovery, preventing the database 'igrep' (database ID 10) from restarting.
Diagnose the recovery errors and fix them, or restore from a known good backup. If errors are not corrected or expected, contact Technical Support.

Ok, wasn’t expecting success anyway. :-) Back to emergency mode. This time, we’ll run the repair as well.

ALTER DATABASE igrep
SET EMERGENCY;
GO

DBCC CHECKDB(igrep, repair_allow_data_loss)
WITH NO_INFOMSGS;
GO

Tick tock, tick tock, tick tock, DING! About 10 minutes later… here’s what DBCC reported back:

CHECKDB found 5 allocation errors and 70 consistency errors in database 'igrep'.
CHECKDB fixed 2 allocation errors and 70 consistency errors in database 'igrep'.

Woot! We’re complete! In order to fix the consistency errors, the following things happened:
* Some pages were deallocated (buh-bye, data)
* Some extents were marked as allocated to specific objects.
* Specific clustered and non-clustered indexes were rebuilt.

The other interesting thing: note that the number of consistency errors went *up* when the allocation errors were fixed. It’s likely that…

Hold up… All the consistency errors were fixed, but only two of five allocation errors? Let’s scroll up a bit. Near the top, we find the following:

Msg 8905, Level 16, State 1, Line 2
Extent (1:682840) in database ID 10 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
Could not repair this error.
Msg 8905, Level 16, State 1, Line 2
Extent (1:682880) in database ID 10 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
Repairing this error requires other errors to be corrected first.
Msg 8905, Level 16, State 1, Line 2
Extent (1:682952) in database ID 10 is marked allocated in the GAM, but no SGAM or IAM has allocated it.
Repairing this error requires other errors to be corrected first.

Interestingly enough, SQL has marked the database as ONLINE. I thought that would denote a successful repair, but DBCC says once again that repair_allow_data_loss is the minimum repair level for this database. Running CHECKDB again, we get three of these:

Msg 8905, Level 16, State 1, Line 1
Extent (1:682952) in database ID 10 is marked allocated in the GAM, but no SGAM or IAM has allocated it.

Let’s run the same repair again. This time, we get this:

CHECKDB found 3 allocation errors and 20 consistency errors in database 'igrep'.
CHECKDB fixed 3 allocation errors and 20 consistency errors in database 'igrep'.

So fixing the allocation errors caused more consistency errors, again. Not surprising, when you think about it. Allocating and deallocating pages, especially when foreign keys are in play, is bound to cause some structural inconsistency. That’s why it’s paramount that after running a repair of any kind, you run another CHECKDB to see if the repair caused any other damage. Speaking of which – after this last repair, CHECKDB finished with no errors, and the application’s service started fine. The application itself started working again, and all was well. Keep in mind that this was the *least* likely outcome. I didn’t do a good job, or even do the right things, here – I got *extremely* lucky.

So what are the big ideas we take away from all this?

  1. Backups are your friend. Take them, test them, and store them safely.
  2. By default, try a restore before you try repair. The former is clean, the latter can get *really* messy.
  3. If you don’t have a backup, take one with CONTINUE_AFTER_ERROR before doing *anything* else.
  4. Emergency mode can sometimes bring up otherwise inaccessible databases. Remember that to run a repair, though, you’ll still have to set the database to SINGLE_USER mode.
  5. Just because repair finished, doesn’t mean you’re good to go. Check and re-check your database, both before and after repair.

Anything else I missed?

Thanks,
-David.

T-SQL Toolbox – 30 Day Backup History

I keep a little toolbox of T-SQL scripts handy in Simplenote. There are a couple of plug-ins for my favorite browsers, or for smartphones if you have one. I don’t, but that’s a whole other post for a whole other blog. With that, I’m never far from a useful script or two. One that I like to have around, especially when I’m in a new environment, is my “30 Days of Backup History” scriptlet. It does just what it says on the label – shows you the backup history for every backed-up database on the server for the last 30 days. So here it is. Feel free to take it, modify it, rearrange it, chuck it in a stored procedure, etc. If you repost it somewhere else, I just ask for a token acknowledgement.

/* ADMIN: Backup history for the last 30 days. */
DECLARE @dt datetime = dateadd(dd,-30,getdate());

SELECT
  DBName = bs.database_name
 ,BackupType =
  CASE bs.type
    WHEN 'D' THEN 'Full'
    WHEN 'I' THEN 'Diff'
    WHEN 'L' THEN 'TLog'
  ELSE 'Unknown'
  END
 ,DateCompleted = bs.backup_finish_date
 ,DataSize = bs.backup_size
 ,UserName = bs.[user_name]
 ,BackupFile = bmf.physical_device_name
FROM msdb.dbo.backupset AS bs
INNER JOIN msdb.dbo.backupmediafamily AS bmf
  ON bs.media_set_id = bmf.media_set_id
WHERE bs.backup_finish_date > @dt
ORDER BY bs.backup_finish_date DESC;
GO

Thanks for reading,

-David.

Database Recovery Models – The Basics

If you’re like me, your first experience with SQL Server database recovery models was when you tried to take a log backup on a database in SIMPLE recovery mode, and got an error for your troubles. In fairness, back in SQL Server 2000, we didn’t have the handy “Ignore database in SIMPLE recovery mode” checkbox, so we’ve got an excuse there… So what do these recovery models really mean, and what can we do or not do with them?

In short, recovery models have as much to do with how the transaction log is used as with how we can back up the database. The choice is driven both by how space in the transaction log gets marked for re-use, and by how we want to be able to recover the database. So let’s look at it from both of those perspectives.

Transaction Log Re-use

First, let’s look at exactly what populates the transaction log, and how we use that space. When you modify something in a database, that data modification happens in memory. First, a record of the details of that modification is written to the transaction log, so that we can recover the transaction if the system crashes. Once a CHECKPOINT occurs, those modified or “dirty” pages in the buffer are written to the data file. After that, we only need to keep the copy in the transaction log if we want to be able to recover the database to a specific point in time, or to roll a transaction back (or forward). Otherwise, that space in the transaction log can be cleared to keep the log size down.

In SIMPLE recovery mode, when a CHECKPOINT occurs, space in the transaction log that isn’t being used for an active transaction is marked for reuse, or “truncated”. The space is then available for use by other transactions.

In FULL or BULK_LOGGED recovery mode, CHECKPOINT still writes dirty pages to the disk files, but does not mark that space in the transaction log as free.  That space is still considered in use.
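
A quick way to see how this plays out on a live instance: sys.databases exposes both the current recovery model and the reason log space can’t be reused yet. This is just a sketch; the values in the comment are examples:

SELECT name
      ,recovery_model_desc
      ,log_reuse_wait_desc  -- e.g. NOTHING, CHECKPOINT, LOG_BACKUP, ACTIVE_TRANSACTION
FROM sys.databases
ORDER BY name;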

Backups

As far as the transaction log is concerned, there are two types of backups: full or differential backups, and transaction log backups.

When a FULL or DIFFERENTIAL backup occurs, a CHECKPOINT is issued, and the data file or files are backed up, along with just enough transaction log to be able to go through the recovery phase of restore. However, the transaction log is not truncated, meaning the space holding the transaction information is not marked for reuse.

When a LOG backup runs, the log records generated since the last log backup are backed up. Log backups ignore the data files. Once the log backup is complete, the space in the log file that is not being used for active transactions is then marked for reuse.
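
For reference, a routine log backup is just this (database name and path are placeholders), and it is only allowed when the database is in FULL or BULK_LOGGED recovery:

BACKUP LOG SalesDB
TO DISK = 'F:\Backups\SalesDB_log.trn'
WITH CHECKSUM, STATS = 10;
GO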

Recovery Models

With that in mind, let’s look at what the different recovery models offer.

Simple
In SIMPLE recovery mode, you can take FULL or DIFFERENTIAL backups, but not log backups. Log space is automatically marked for reuse by CHECKPOINT, so the size of the transaction log is kept under control without log backups. The trade-off is that any changes made since the last FULL or DIFFERENTIAL backup are lost in the event of a disaster. You’re only as good as your last backup.

Full
In the FULL recovery model, you can take FULL and DIFFERENTIAL backups, and you must also take LOG backups. LOG backups do two things for you: first, they mark transaction log space as available for reuse, keeping the log size down. Second, they allow you to recover to a specific point in time between backups, so as long as the log itself is not damaged, no work is lost. The only downside is that very large operations can still grow the log very quickly.

Bulk Logged
BULK_LOGGED recovery splits the difference between SIMPLE and FULL. Certain large operations are only minimally logged, including bulk inserts, SELECT INTO queries, and some index maintenance operations. You can take FULL or DIFFERENTIAL backups, and must also take LOG backups to mark the log space for reuse. The upside is that minimally logging those bulk operations helps keep the log size down. The trade-off is that you cannot recover to a point in time within a log backup that contains minimally logged operations. You can still take a final log backup and recover to the most recent backup point, but you lose the recover-to-any-moment flexibility of the FULL recovery model.

Considerations

Backup strategy is driven first by the business recovery requirements, and second by the necessity to manage the transaction log size.

Scenario 1 – The few users that use this data only make minimal modifications to the database throughout the day, and none of those change much data. The work can easily be re-done. This is a good situation for using the SIMPLE recovery model, since the amount of transaction log use is low, and there is no major loss of time should the data need to be re-entered.

Scenario 2 – Business users make many modifications to the database throughout the day. There is a legal requirement to be able to view the state of the database at any point in time. Due to the volume of users, the transaction log can grow rapidly if left unchecked. This is a good situation for the FULL recovery model, mostly due to the requirement to be able to restore the database to any point in time. Frequent log backups will also keep the transaction log size down.

Scenario 3 – A small development group uses this database for testing queries on large amounts of data which is loaded into the database via BULK INSERT. The group would like to lose minimal amounts of data in the case of a disaster, but can accept about 4 hours worth of re-work. This is a great case for using the BULK_LOGGED recovery model, since point-in-time recovery is not required, and the database is a frequent target of large operations.

Another way I have seen recovery models used is to switch the database into BULK_LOGGED recovery mode for index maintenance, then put it back in FULL once maintenance is complete. In fact, this can also be good if you have a nightly process that does bulk inserts into the database. You get to run the large inserts without bloating the log, and still get point in time recovery for daily user operations.
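
A sketch of that switch-for-maintenance pattern, with placeholder names; taking a log backup right after switching back to FULL shortens the window where point-in-time recovery isn’t available:

ALTER DATABASE SalesDB SET RECOVERY BULK_LOGGED;
GO

-- ... run the index rebuilds or the nightly BULK INSERT here ...

ALTER DATABASE SalesDB SET RECOVERY FULL;
GO

BACKUP LOG SalesDB
TO DISK = 'F:\Backups\SalesDB_post_maintenance.trn'
WITH CHECKSUM;
GO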

One last thing, now that I have mentioned switching recovery models. Note that switching from the FULL or BULK_LOGGED recovery model to the SIMPLE recovery model will break the log backup chain. This is fine if you no longer need to recover from log backups, but if you do, then once you switch back to FULL or BULK_LOGGED, you will need to take a FULL backup to get back to a state that will allow recovery from log backups.

Thanks for reading.
-David.

__________________

Resources

SQL Server Books Online

SQL Skills – In Recovery (Paul Randal)