1553 lines
		
	
	
		
			76 KiB
		
	
	
	
		
			HTML
		
	
	
	
	
	
			
		
		
	
	
			1553 lines
		
	
	
		
			76 KiB
		
	
	
	
		
			HTML
		
	
	
	
	
	
<!DOCTYPE html>
 | 
						|
<html><head>
 | 
						|
<meta name="viewport" content="width=device-width, initial-scale=1.0">
 | 
						|
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
 | 
						|
<link href="sqlite.css" rel="stylesheet">
 | 
						|
<title>Atomic Commit In SQLite</title>
 | 
						|
<!-- path= -->
 | 
						|
</head>
 | 
						|
<body>
 | 
						|
<div class=nosearch>
 | 
						|
<a href="index.html">
 | 
						|
<img class="logo" src="images/sqlite370_banner.gif" alt="SQLite" border="0">
 | 
						|
</a>
 | 
						|
<div><!-- IE hack to prevent disappearing logo --></div>
 | 
						|
<div class="tagline desktoponly">
 | 
						|
Small. Fast. Reliable.<br>Choose any three.
 | 
						|
</div>
 | 
						|
<div class="menu mainmenu">
 | 
						|
<ul>
 | 
						|
<li><a href="index.html">Home</a>
 | 
						|
<li class='mobileonly'><a href="javascript:void(0)" onclick='toggle_div("submenu")'>Menu</a>
 | 
						|
<li class='wideonly'><a href='about.html'>About</a>
 | 
						|
<li class='desktoponly'><a href="docs.html">Documentation</a>
 | 
						|
<li class='desktoponly'><a href="download.html">Download</a>
 | 
						|
<li class='wideonly'><a href='copyright.html'>License</a>
 | 
						|
<li class='desktoponly'><a href="support.html">Support</a>
 | 
						|
<li class='desktoponly'><a href="prosupport.html">Purchase</a>
 | 
						|
<li class='search' id='search_menubutton'>
 | 
						|
<a href="javascript:void(0)" onclick='toggle_search()'>Search</a>
 | 
						|
</ul>
 | 
						|
</div>
 | 
						|
<div class="menu submenu" id="submenu">
 | 
						|
<ul>
 | 
						|
<li><a href='about.html'>About</a>
 | 
						|
<li><a href='docs.html'>Documentation</a>
 | 
						|
<li><a href='download.html'>Download</a>
 | 
						|
<li><a href='support.html'>Support</a>
 | 
						|
<li><a href='prosupport.html'>Purchase</a>
 | 
						|
</ul>
 | 
						|
</div>
 | 
						|
<div class="searchmenu" id="searchmenu">
 | 
						|
<form method="GET" action="search">
 | 
						|
<select name="s" id="searchtype">
 | 
						|
<option value="d">Search Documentation</option>
 | 
						|
<option value="c">Search Changelog</option>
 | 
						|
</select>
 | 
						|
<input type="text" name="q" id="searchbox" value="">
 | 
						|
<input type="submit" value="Go">
 | 
						|
</form>
 | 
						|
</div>
 | 
						|
</div>
 | 
						|
<script>
 | 
						|
function toggle_div(nm) {
 | 
						|
var w = document.getElementById(nm);
 | 
						|
if( w.style.display=="block" ){
 | 
						|
w.style.display = "none";
 | 
						|
}else{
 | 
						|
w.style.display = "block";
 | 
						|
}
 | 
						|
}
 | 
						|
function toggle_search() {
 | 
						|
var w = document.getElementById("searchmenu");
 | 
						|
if( w.style.display=="block" ){
 | 
						|
w.style.display = "none";
 | 
						|
} else {
 | 
						|
w.style.display = "block";
 | 
						|
setTimeout(function(){
 | 
						|
document.getElementById("searchbox").focus()
 | 
						|
}, 30);
 | 
						|
}
 | 
						|
}
 | 
						|
function div_off(nm){document.getElementById(nm).style.display="none";}
 | 
						|
window.onbeforeunload = function(e){div_off("submenu");}
 | 
						|
/* Disable the Search feature if we are not operating from CGI, since */
 | 
						|
/* Search is accomplished using CGI and will not work without it. */
 | 
						|
if( !location.origin || !location.origin.match || !location.origin.match(/http/) ){
 | 
						|
document.getElementById("search_menubutton").style.display = "none";
 | 
						|
}
 | 
						|
/* Used by the Hide/Show button beside syntax diagrams, to toggle the */
 | 
						|
function hideorshow(btn,obj){
 | 
						|
var x = document.getElementById(obj);
 | 
						|
var b = document.getElementById(btn);
 | 
						|
if( x.style.display!='none' ){
 | 
						|
x.style.display = 'none';
 | 
						|
b.innerHTML='show';
 | 
						|
}else{
 | 
						|
x.style.display = '';
 | 
						|
b.innerHTML='hide';
 | 
						|
}
 | 
						|
return false;
 | 
						|
}
 | 
						|
</script>
 | 
						|
</div>
 | 
						|
<div class=fancy>
 | 
						|
<div class=nosearch>
 | 
						|
<div class="fancy_title">
 | 
						|
Atomic Commit In SQLite
 | 
						|
</div>
 | 
						|
<div class="fancy_toc">
 | 
						|
<a onclick="toggle_toc()">
 | 
						|
<span class="fancy_toc_mark" id="toc_mk">►</span>
 | 
						|
Table Of Contents
 | 
						|
</a>
 | 
						|
<div id="toc_sub"><div class="fancy-toc1"><a href="#_introduction">1.  Introduction</a></div>
 | 
						|
<div class="fancy-toc1"><a href="#_hardware_assumptions">2.  Hardware Assumptions</a></div>
 | 
						|
<div class="fancy-toc1"><a href="#_single_file_commit">3.  Single File Commit</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_initial_state">3.1.  Initial State</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_acquiring_a_read_lock">3.2.  Acquiring A Read Lock</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_reading_information_out_of_the_database">3.3.  Reading Information Out Of The Database</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_obtaining_a_reserved_lock">3.4.  Obtaining A Reserved Lock</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_creating_a_rollback_journal_file">3.5.  Creating A Rollback Journal File</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_changing_database_pages_in_user_space">3.6.  Changing Database Pages In User Space</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_flushing_the_rollback_journal_file_to_mass_storage">3.7.  Flushing The Rollback Journal File To Mass Storage</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_obtaining_an_exclusive_lock">3.8.  Obtaining An Exclusive Lock</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_writing_changes_to_the_database_file">3.9.  Writing Changes To The Database File</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#0_flushing_changes_to_mass_storage">3.10. 0 Flushing Changes To Mass Storage</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#1_deleting_the_rollback_journal">3.11. 1 Deleting The Rollback Journal</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#2_releasing_the_lock">3.12. 2 Releasing The Lock</a></div>
 | 
						|
<div class="fancy-toc1"><a href="#_rollback">4.  Rollback</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_when_something_goes_wrong_">4.1.  When Something Goes Wrong...</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_hot_rollback_journals">4.2.  Hot Rollback Journals</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_obtaining_an_exclusive_lock_on_the_database">4.3.  Obtaining An Exclusive Lock On The Database</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_rolling_back_incomplete_changes">4.4.  Rolling Back Incomplete Changes</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_deleting_the_hot_journal">4.5.  Deleting The Hot Journal</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_continue_as_if_the_uncompleted_writes_had_never_happened">4.6.  Continue As If The Uncompleted Writes Had Never Happened</a></div>
 | 
						|
<div class="fancy-toc1"><a href="#_multi_file_commit">5.  Multi-file Commit</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_separate_rollback_journals_for_each_database">5.1.  Separate Rollback Journals For Each Database</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_the_super_journal_file">5.2.  The Super-Journal File</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_updating_rollback_journal_headers">5.3.  Updating Rollback Journal Headers</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_updating_the_database_files">5.4.  Updating The Database Files</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_delete_the_super_journal_file">5.5.  Delete The Super-Journal File</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_clean_up_the_rollback_journals">5.6.  Clean Up The Rollback Journals</a></div>
 | 
						|
<div class="fancy-toc1"><a href="#_additional_details_of_the_commit_process">6.  Additional Details Of The Commit Process</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_always_journal_complete_sectors">6.1.  Always Journal Complete Sectors</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_dealing_with_garbage_written_into_journal_files">6.2.  Dealing With Garbage Written Into Journal Files</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_cache_spill_prior_to_commit">6.3.  Cache Spill Prior To Commit</a></div>
 | 
						|
<div class="fancy-toc1"><a href="#_optimizations">7.  Optimizations</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_cache_retained_between_transactions">7.1.  Cache Retained Between Transactions</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_exclusive_access_mode">7.2.  Exclusive Access Mode</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_do_not_journal_freelist_pages">7.3.  Do Not Journal Freelist Pages</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_single_page_updates_and_atomic_sector_writes">7.4.  Single Page Updates And Atomic Sector Writes</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_filesystems_with_safe_append_semantics">7.5.  Filesystems With Safe Append Semantics</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_persistent_rollback_journals">7.6.  Persistent Rollback Journals</a></div>
 | 
						|
<div class="fancy-toc1"><a href="#_testing_atomic_commit_behavior">8.  Testing Atomic Commit Behavior</a></div>
 | 
						|
<div class="fancy-toc1"><a href="#_things_that_can_go_wrong">9.  Things That Can Go Wrong</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_broken_locking_implementations">9.1.  Broken Locking Implementations</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_incomplete_disk_flushes">9.2.  Incomplete Disk Flushes</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_partial_file_deletions">9.3.  Partial File Deletions</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_garbage_written_into_files">9.4.  Garbage Written Into Files</a></div>
 | 
						|
<div class="fancy-toc2"><a href="#_deleting_or_renaming_a_hot_journal">9.5.  Deleting Or Renaming A Hot Journal</a></div>
 | 
						|
<div class="fancy-toc1"><a href="#10_0_future_directions_and_conclusion">10. 10.0 Future Directions And Conclusion</a></div>
 | 
						|
</div>
 | 
						|
</div>
 | 
						|
<script>
 | 
						|
function toggle_toc(){
 | 
						|
var sub = document.getElementById("toc_sub")
 | 
						|
var mk = document.getElementById("toc_mk")
 | 
						|
if( sub.style.display!="block" ){
 | 
						|
sub.style.display = "block";
 | 
						|
mk.innerHTML = "▼";
 | 
						|
} else {
 | 
						|
sub.style.display = "none";
 | 
						|
mk.innerHTML = "►";
 | 
						|
}
 | 
						|
}
 | 
						|
</script>
 | 
						|
</div>
 | 
						|
 | 
						|
 | 
						|
 | 
						|
 | 
						|
<h1 id="_introduction"><span>1. </span> Introduction</h1>
 | 
						|
 | 
						|
<p>An important feature of transactional databases like SQLite
 | 
						|
is "atomic commit".  
 | 
						|
Atomic commit means that either all database changes within a single 
 | 
						|
transaction occur or none of them occur.  With atomic commit, it
 | 
						|
is as if many different writes to different sections of the database
 | 
						|
file occur instantaneously and simultaneously.
 | 
						|
Real hardware serializes writes to mass storage, and writing
 | 
						|
a single sector takes a finite amount of time.
 | 
						|
So it is impossible to truly write many different sectors of a 
 | 
						|
database file simultaneously and/or instantaneously.
 | 
						|
But the atomic commit logic within
 | 
						|
SQLite makes it appear as if the changes for a transaction
 | 
						|
are all written instantaneously and simultaneously.</p>
 | 
						|
 | 
						|
<p>SQLite has the important property that transactions appear
 | 
						|
to be atomic even if the transaction is interrupted by an
 | 
						|
operating system crash or power failure.</p>
 | 
						|
 | 
						|
<p>This article describes the techniques used by SQLite to create the
 | 
						|
illusion of atomic commit.</p>
 | 
						|
 | 
						|
<p>The information in this article applies only when SQLite is operating
 | 
						|
in "rollback mode", or in other words when SQLite is not 
 | 
						|
using a <a href="wal.html">write-ahead log</a>.  SQLite still supports atomic commit when
 | 
						|
write-ahead logging is enabled, but it accomplishes atomic commit by
 | 
						|
a different mechanism from the one described in this article.  See
 | 
						|
the <a href="wal.html">write-ahead log documentation</a> for additional information on how
 | 
						|
SQLite supports atomic commit in that context.</p>
 | 
						|
 | 
						|
<a name="hardware"></a>
 | 
						|
 | 
						|
<h1 id="_hardware_assumptions"><span>2. </span> Hardware Assumptions</h1>
 | 
						|
 | 
						|
<p>Throughout this article, we will call the mass storage device "disk"
 | 
						|
even though the mass storage device might really be flash memory.</p>
 | 
						|
 | 
						|
<p>We assume that disk is written in chunks which we call a "sector".
 | 
						|
It is not possible to modify any part of the disk smaller than a sector.
 | 
						|
To change a part of the disk smaller than a sector, you have to read in
 | 
						|
the full sector that contains the part you want to change, make the
 | 
						|
change, then write back out the complete sector.</p>
 | 
						|
 | 
						|
<p>On a traditional spinning disk, a sector is the minimum unit of transfer
 | 
						|
in both directions, both reading and writing.  On flash memory, however,
 | 
						|
the minimum size of a read is typically much smaller than a minimum write.
 | 
						|
SQLite is only concerned with the minimum write amount and so for the
 | 
						|
purposes of this article, when we say "sector" we mean the minimum amount
 | 
						|
of data that can be written to mass storage in a single go.</p>
 | 
						|
 | 
						|
<p>
 | 
						|
  Prior to SQLite version 3.3.14, a sector size of 512 bytes was
 | 
						|
  assumed in all cases.  There was a compile-time option to change
 | 
						|
  this but the code had never been tested with a larger value.  The
 | 
						|
  512 byte sector assumption seemed reasonable since until very recently
 | 
						|
  all disk drives used a 512 byte sector internally.  However, there
 | 
						|
  has recently been a push to increase the sector size of disks to
 | 
						|
  4096 bytes.  Also the sector size
 | 
						|
  for flash memory is usually larger than 512 bytes.  For these reasons,
 | 
						|
  versions of SQLite beginning with 3.3.14 have a method in the OS
 | 
						|
  interface layer that interrogates the underlying filesystem to find
 | 
						|
  the true sector size.  As currently implemented (version 3.5.0) this
 | 
						|
  method still returns a hard-coded value of 512 bytes, since there
 | 
						|
  is no standard way of discovering the true sector size on either
 | 
						|
  Unix or Windows.  But the method is available for embedded device
 | 
						|
  manufacturers to tweak according to their own needs.  And we have
 | 
						|
  left open the possibility of filling in a more meaningful implementation
 | 
						|
  on Unix and Windows in the future.</p>
 | 
						|
 | 
						|
<p>SQLite has traditionally assumed that a sector write is <u>not</u> atomic.
 | 
						|
However, SQLite does always assume that a sector write is linear.  By "linear"
 | 
						|
we mean that SQLite assumes that when writing a sector, the hardware begins
 | 
						|
at one end of the data and writes byte by byte until it gets to
 | 
						|
the other end.  The write might go from beginning to end or from
 | 
						|
end to beginning.  If a power failure occurs in the middle of a
 | 
						|
sector write it might be that part of the sector was modified
 | 
						|
and another part was left unchanged.  The key assumption by SQLite
 | 
						|
is that if any part of the sector gets changed, then either the
 | 
						|
first or the last bytes will be changed.  So the hardware will
 | 
						|
never start writing a sector in the middle and work towards the
 | 
						|
ends.  We do not know if this assumption is always true but it
 | 
						|
seems reasonable.</p>
 | 
						|
 | 
						|
<p>The previous paragraph states that SQLite does not assume that
 | 
						|
sector writes are atomic.  This is true by default.  But as of
 | 
						|
SQLite version 3.5.0, there is a new interface called the
 | 
						|
Virtual File System (<a href="vfs.html">VFS</a>) interface.  The <a href="vfs.html">VFS</a> is the only means
 | 
						|
by which SQLite communicates to the underlying filesystem.  The
 | 
						|
code comes with default VFS implementations for Unix and Windows
 | 
						|
and there is a mechanism for creating new custom VFS implementations
 | 
						|
at runtime.  In this new VFS interface there is a method called
 | 
						|
xDeviceCharacteristics.  This method interrogates the underlying
 | 
						|
filesystem to discover various properties and behaviors that the
 | 
						|
filesystem may or may not exhibit.  The xDeviceCharacteristics
 | 
						|
method might indicate that sector writes are atomic, and if it does
 | 
						|
so indicate, SQLite will try to take advantage of that fact.  But
 | 
						|
the default xDeviceCharacteristics method for both Unix and Windows
 | 
						|
does not indicate atomic sector writes and so these optimizations
 | 
						|
are normally omitted.</p>
 | 
						|
 | 
						|
<p>SQLite assumes that the operating system will buffer writes and
 | 
						|
that a write request will return before data has actually been stored
 | 
						|
in the mass storage device.
 | 
						|
SQLite further assumes that write operations will be reordered by
 | 
						|
the operating system.
 | 
						|
For this reason, SQLite does a "flush" or "fsync" operation at key
 | 
						|
points.  SQLite assumes that the flush or fsync will not return until
 | 
						|
all pending write operations for the file that is being flushed have
 | 
						|
completed.  We are told that the flush and fsync primitives
 | 
						|
are broken on some versions of Windows and Linux.  This is unfortunate.
 | 
						|
It opens SQLite up to the possibility of database corruption following
 | 
						|
a power loss in the middle of a commit.  However, there is nothing 
 | 
						|
that SQLite can do to test for or remedy the situation.  SQLite
 | 
						|
assumes that the operating system that it is running on works as
 | 
						|
advertised.  If that is not quite the case, well then hopefully you
 | 
						|
will not lose power too often.</p>
 | 
						|
 | 
						|
<p>SQLite assumes that when a file grows in length that the new
 | 
						|
file space originally contains garbage and then later is filled in
 | 
						|
with the data actually written.  In other words, SQLite assumes that
 | 
						|
the file size is updated before the file content.  This is a 
 | 
						|
pessimistic assumption and SQLite has to do some extra work to make
 | 
						|
sure that it does not cause database corruption if power is lost
 | 
						|
between the time when the file size is increased and when the
 | 
						|
new content is written.  The xDeviceCharacteristics method of
 | 
						|
the <a href="vfs.html">VFS</a> might indicate that the filesystem will always write the
 | 
						|
data before updating the file size.  (This is the 
 | 
						|
SQLITE_IOCAP_SAFE_APPEND property for those readers who are looking
 | 
						|
at the code.)  When the xDeviceCharacteristics method indicates
 | 
						|
that files content is written before the file size is increased,
 | 
						|
SQLite can forego some of its pedantic database protection steps
 | 
						|
and thereby decrease the amount of disk I/O needed to perform a
 | 
						|
commit.  The current implementation, however, makes no such assumptions
 | 
						|
for the default VFSes for Windows and Unix.</p>
 | 
						|
 | 
						|
<p>SQLite assumes that a file deletion is atomic from the
 | 
						|
point of view of a user process.  By this we mean that if SQLite
 | 
						|
requests that a file be deleted and the power is lost during the
 | 
						|
delete operation, once power is restored either the file will
 | 
						|
exist completely with all if its original content unaltered, or
 | 
						|
else the file will not be seen in the filesystem at all.  If
 | 
						|
after power is restored the file is only partially deleted,
 | 
						|
if some of its data has been altered or erased,
 | 
						|
or the file has been truncated but not completely removed, then
 | 
						|
database corruption will likely result.</p>
 | 
						|
 | 
						|
<p>SQLite assumes that the detection and/or correction of 
 | 
						|
bit errors caused by cosmic rays, thermal noise, quantum
 | 
						|
fluctuations, device driver bugs, or other mechanisms, is the 
 | 
						|
responsibility of the underlying hardware and operating system.  
 | 
						|
SQLite does not add any redundancy to the database file for
 | 
						|
the purpose of detecting corruption or I/O errors.
 | 
						|
SQLite assumes that the data it reads is exactly the same data 
 | 
						|
that it previously wrote.</p>
 | 
						|
 | 
						|
<p>By default, SQLite assumes that an operating system call to write
 | 
						|
a range of bytes will not damage or alter any bytes outside of that range
 | 
						|
even if a power loss or OS crash occurs during that write.  We
 | 
						|
call this the "<a href="psow.html">powersafe overwrite</a>" property.  
 | 
						|
Prior to <a href="releaselog/3_7_9.html">version 3.7.9</a> (2011-11-01),
 | 
						|
SQLite did not assume powersafe overwrite.  But with the standard
 | 
						|
sector size increasing from 512 to 4096 bytes on most disk drives, it
 | 
						|
has become necessary to assume powersafe overwrite in order to maintain
 | 
						|
historical performance levels and so powersafe overwrite is assumed by
 | 
						|
default in recent versions of SQLite.  The assumption of powersafe 
 | 
						|
overwrite property can be disabled at compile-time or a run-time if
 | 
						|
desired.  See the <a href="psow.html">powersafe overwrite documentation</a> for further
 | 
						|
details.
 | 
						|
 | 
						|
 | 
						|
<a name="section_3_0"></a>
 | 
						|
</p><h1 id="_single_file_commit"><span>3. </span> Single File Commit</h1>
 | 
						|
 | 
						|
<p>We begin with an overview of the steps SQLite takes in order to
 | 
						|
perform an atomic commit of a transaction against a single database
 | 
						|
file.  The details of file formats used to guard against damage from
 | 
						|
power failures and techniques for performing an atomic commit across
 | 
						|
multiple databases are discussed in later sections.</p>
 | 
						|
 | 
						|
<a name="initstate"></a>
 | 
						|
 | 
						|
<h2 id="_initial_state"><span>3.1. </span> Initial State</h2>
 | 
						|
 | 
						|
<img src="images/ac/commit-0.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>The state of the computer when a database connection is
 | 
						|
first opened is shown conceptually by the diagram at the
 | 
						|
right.
 | 
						|
The area of the diagram on the extreme right (labeled "Disk") represents
 | 
						|
information stored on the mass storage device.  Each rectangle is
 | 
						|
a sector.  The blue color represents that the sectors contain
 | 
						|
original data.
 | 
						|
The middle area is the operating systems disk cache.  At the
 | 
						|
onset of our example, the cache is cold and this is represented
 | 
						|
by leaving the rectangles of the disk cache empty.
 | 
						|
The left area of the diagram shows the content of memory for
 | 
						|
the process that is using SQLite.  The database connection has
 | 
						|
just been opened and no information has been read yet, so the
 | 
						|
user space is empty.
 | 
						|
</p>
 | 
						|
<br clear="both">
 | 
						|
 | 
						|
<a name="rdlck"></a>
 | 
						|
 | 
						|
<h2 id="_acquiring_a_read_lock"><span>3.2. </span> Acquiring A Read Lock</h2>
 | 
						|
 | 
						|
<img src="images/ac/commit-1.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>Before SQLite can write to a database, it must first read
 | 
						|
the database to see what is there already.  Even if it is just
 | 
						|
appending new data, SQLite still has to read in the database
 | 
						|
schema from the "<a href="schematab.html">sqlite_schema</a>" table so that it can know
 | 
						|
how to parse the INSERT statements and discover where in the
 | 
						|
database file the new information should be stored.</p>
 | 
						|
 | 
						|
<p>The first step toward reading from the database file
 | 
						|
is obtaining a shared lock on the database file.  A "shared"
 | 
						|
lock allows two or more database connections to read from the
 | 
						|
database file at the same time.  But a shared lock prevents
 | 
						|
another database connection from writing to the database file
 | 
						|
while we are reading it.  This is necessary because if another
 | 
						|
database connection were writing to the database file at the
 | 
						|
same time we are reading from the database file, we might read
 | 
						|
some data before the change and other data after the change.
 | 
						|
This would make it appear as if the change made by the other
 | 
						|
process is not atomic.</p>
 | 
						|
 | 
						|
<p>Notice that the shared lock is on the operating system
 | 
						|
disk cache, not on the disk itself.  File locks
 | 
						|
really are just flags within the operating system kernel,
 | 
						|
usually.  (The details depend on the specific OS layer
 | 
						|
interface.)  Hence, the lock will instantly vanish if the
 | 
						|
operating system crashes or if there is a power loss.  It
 | 
						|
is usually also the case that the lock will vanish if the
 | 
						|
process that created the lock exits.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
 | 
						|
<a name="section_3_3"></a>
 | 
						|
<h2 id="_reading_information_out_of_the_database"><span>3.3. </span> Reading Information Out Of The Database</h2>
 | 
						|
 | 
						|
<img src="images/ac/commit-2.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>After the shared lock is acquired, we can begin reading
 | 
						|
information from the database file.  In this scenario, we
 | 
						|
are assuming a cold cache, so information must first be
 | 
						|
read from mass storage into the operating system cache then
 | 
						|
transferred from operating system cache into user space.
 | 
						|
On subsequent reads, some or all of the information might
 | 
						|
already be found in the operating system cache and so only
 | 
						|
the transfer to user space would be required.</p>
 | 
						|
 | 
						|
<p>Usually only a subset of the pages in the database file
 | 
						|
are read.  In this example we are showing three
 | 
						|
pages out of eight being read.  In a typical application, a
 | 
						|
database will have thousands of pages and a query will normally
 | 
						|
only touch a small percentage of those pages.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
 | 
						|
<a name="rsvdlock"></a>
 | 
						|
 | 
						|
<h2 id="_obtaining_a_reserved_lock"><span>3.4. </span> Obtaining A Reserved Lock</h2>
 | 
						|
 | 
						|
<img src="images/ac/commit-3.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>Before making changes to the database, SQLite first
 | 
						|
obtains a "reserved" lock on the database file.  A reserved
 | 
						|
lock is similar to a shared lock in that both a reserved lock
 | 
						|
and shared lock allow other processes to read from the database
 | 
						|
file.  A single reserve lock can coexist with multiple shared
 | 
						|
locks from other processes.  However, there can only be a
 | 
						|
single reserved lock on the database file.  Hence only a
 | 
						|
single process can be attempting to write to the database
 | 
						|
at one time.</p>
 | 
						|
 | 
						|
<p>The idea behind a reserved lock is that it signals that
 | 
						|
a process intends to modify the database file in the near
 | 
						|
future but has not yet started to make the modifications.
 | 
						|
And because the modifications have not yet started, other
 | 
						|
processes can continue to read from the database.  However,
 | 
						|
no other process should also begin trying to write to the
 | 
						|
database.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="section_3_5"></a>
 | 
						|
<h2 id="_creating_a_rollback_journal_file"><span>3.5. </span> Creating A Rollback Journal File</h2>
 | 
						|
<img src="images/ac/commit-4.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>Prior to making any changes to the database file, SQLite first
 | 
						|
creates a separate rollback journal file and writes into the 
 | 
						|
rollback journal the original
 | 
						|
content of the database pages that are to be altered.
 | 
						|
The idea behind the rollback journal is that it contains
 | 
						|
all information needed to restore the database back to 
 | 
						|
its original state.</p>
 | 
						|
 | 
						|
<p>The rollback journal contains a small header (shown in green
 | 
						|
in the diagram) that records the original size of the database
 | 
						|
file.  So if a change causes the database file to grow, we
 | 
						|
will still know the original size of the database.  The page
 | 
						|
number is stored together with each database page that is 
 | 
						|
written into the rollback journal.</p>
 | 
						|
 | 
						|
<p>
 | 
						|
  When a new file is created, most desktop operating systems
 | 
						|
  (Windows, Linux, Mac OS X) will not actually write anything to
 | 
						|
  disk.  The new file is created in the operating systems disk
 | 
						|
  cache only.  The file is not created on mass storage until sometime
 | 
						|
  later, when the operating system has a spare moment.  This creates
 | 
						|
  the impression to users that I/O is happening much faster than
 | 
						|
  is possible when doing real disk I/O.  We illustrate this idea in
 | 
						|
  the diagram to the right by showing that the new rollback journal
 | 
						|
  appears in the operating system disk cache only and not on the
 | 
						|
  disk itself.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="section_3_6"></a>
 | 
						|
<h2 id="_changing_database_pages_in_user_space"><span>3.6. </span> Changing Database Pages In User Space</h2>
 | 
						|
<img src="images/ac/commit-5.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>After the original page content has been saved in the rollback
 | 
						|
journal, the pages can be modified in user memory.  Each database
 | 
						|
connection has its own private copy of user space, so the changes
 | 
						|
that are made in user space are only visible to the database connection
 | 
						|
that is making the changes.  Other database connections still see
 | 
						|
the information in operating system disk cache buffers which have
 | 
						|
not yet been changed.  And so even though one process is busy
 | 
						|
modifying the database, other processes can continue to read their
 | 
						|
own copies of the original database content.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="section_3_7"></a>
 | 
						|
<h2 id="_flushing_the_rollback_journal_file_to_mass_storage"><span>3.7. </span> Flushing The Rollback Journal File To Mass Storage</h2>
 | 
						|
<img src="images/ac/commit-6.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>The next step is to flush the content of the rollback journal
 | 
						|
file to nonvolatile storage.
 | 
						|
As we will see later, 
 | 
						|
this is a critical step in insuring that the database can survive
 | 
						|
an unexpected power loss.
 | 
						|
This step also takes a lot of time, since writing to nonvolatile
 | 
						|
storage is normally a slow operation.</p>
 | 
						|
 | 
						|
<p>This step is usually more complicated than simply flushing
 | 
						|
the rollback journal to the disk.  On most platforms two separate
 | 
						|
flush (or fsync()) operations are required.  The first flush writes
 | 
						|
out the base rollback journal content.  Then the header of the
 | 
						|
rollback journal is modified to show the number of pages in the 
 | 
						|
rollback journal.  Then the header is flushed to disk.  The details
 | 
						|
on why we do this header modification and extra flush are provided
 | 
						|
in a later section of this paper.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="section_3_8"></a>
 | 
						|
<h2 id="_obtaining_an_exclusive_lock"><span>3.8. </span> Obtaining An Exclusive Lock</h2>
 | 
						|
<img src="images/ac/commit-7.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>Prior to making changes to the database file itself, we must
 | 
						|
obtain an exclusive lock on the database file.  Obtaining an
 | 
						|
exclusive lock is really a two-step process.  First SQLite obtains
 | 
						|
a "pending" lock.  Then it escalates the pending lock to an
 | 
						|
exclusive lock.</p>
 | 
						|
 | 
						|
<p>A pending lock allows other processes that already have a
 | 
						|
shared lock to continue reading the database file.  But it
 | 
						|
prevents new shared locks from being established.  The idea
 | 
						|
behind a pending lock is to prevent writer starvation caused
 | 
						|
by a large pool of readers.  There might be dozens, even hundreds,
 | 
						|
of other processes trying to read the database file.  Each process
 | 
						|
acquires a shared lock before it starts reading, reads what it
 | 
						|
needs, then releases the shared lock.  If, however, there are
 | 
						|
many different processes all reading from the same database, it
 | 
						|
might happen that a new process always acquires its shared lock before
 | 
						|
the previous process releases its shared lock.  And so there is
 | 
						|
never an instant when there are no shared locks on the database
 | 
						|
file and hence there is never an opportunity for the writer to
 | 
						|
seize the exclusive lock.  A pending lock is designed to prevent
 | 
						|
that cycle by allowing existing shared locks to proceed but
 | 
						|
blocking new shared locks from being established.  Eventually
 | 
						|
all shared locks will clear and the pending lock will then be
 | 
						|
able to escalate into an exclusive lock.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="section_3_9"></a>
 | 
						|
<h2 id="_writing_changes_to_the_database_file"><span>3.9. </span> Writing Changes To The Database File</h2>
 | 
						|
<img src="images/ac/commit-8.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>Once an exclusive lock is held, we know that no other
 | 
						|
processes are reading from the database file and it is
 | 
						|
safe to write changes into the database file.  Usually
 | 
						|
those changes only go as far as the operating systems disk
 | 
						|
cache and do not make it all the way to mass storage.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="section_3_10"></a>
 | 
						|
<h2 id="0_flushing_changes_to_mass_storage"><span>3.10. </span>0 Flushing Changes To Mass Storage</h2>
 | 
						|
<img src="images/ac/commit-9.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>Another flush must occur to make sure that all the
 | 
						|
database changes are written into nonvolatile storage.
 | 
						|
This is a critical step to ensure that the database will
 | 
						|
survive a power loss without damage.  However, because
 | 
						|
of the inherent slowness of writing to disk or flash memory, 
 | 
						|
this step together with the rollback journal file flush in section
 | 
						|
3.7 above takes up most of the time required to complete a
 | 
						|
transaction commit in SQLite.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="section_3_11"></a>
 | 
						|
<h2 id="1_deleting_the_rollback_journal"><span>3.11. </span>1 Deleting The Rollback Journal</h2>
 | 
						|
<img src="images/ac/commit-A.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>After the database changes are all safely on the mass
 | 
						|
storage device, the rollback journal file is deleted.
 | 
						|
This is the instant where the transaction commits.
 | 
						|
If a power failure or system crash occurs prior to this
 | 
						|
point, then recovery processes to be described later make
 | 
						|
it appear as if no changes were ever made to the database
 | 
						|
file.  If a power failure or system crash occurs after
 | 
						|
the rollback journal is deleted, then it appears as if
 | 
						|
all changes have been written to disk.  Thus, SQLite gives
 | 
						|
the appearance of having made no changes to the database
 | 
						|
file or having made the complete set of changes to the
 | 
						|
database file depending on whether or not the rollback
 | 
						|
journal file exists.</p>
 | 
						|
 | 
						|
<p>Deleting a file is not really an atomic operation, but
 | 
						|
it appears to be from the point of view of a user process.
 | 
						|
A process is always able to ask the operating system "does
 | 
						|
this file exist?" and the process will get back a yes or no
 | 
						|
answer.  After a power failure that occurs during a 
 | 
						|
transaction commit, SQLite will ask the operating system
 | 
						|
whether or not the rollback journal file exists.  If the
 | 
						|
answer is "yes" then the transaction is incomplete and is
 | 
						|
rolled back.  If the answer is "no" then it means the transaction
 | 
						|
did commit.</p>
 | 
						|
 | 
						|
<p>The existence of a transaction depends on whether or
 | 
						|
not the rollback journal file exists and the deletion
 | 
						|
of a file appears to be an atomic operation from the point of
 | 
						|
view of a user-space process.  Therefore, 
 | 
						|
a transaction appears to be an atomic operation.</p>
 | 
						|
 | 
						|
<p>The act of deleting a file is expensive on many systems.
 | 
						|
As an optimization, SQLite can be configured to truncate
 | 
						|
the journal file to zero bytes in length
 | 
						|
or overwrite the journal file header with zeros.  In either
 | 
						|
case, the resulting journal file is no longer capable of rolling
 | 
						|
back and so the transaction still commits.  Truncating a file
 | 
						|
to zero length, like deleting a file, is assumed to be an atomic
 | 
						|
operation from the point of view of a user process.  Overwriting
 | 
						|
the header of the journal with zeros is not atomic, but if any
 | 
						|
part of the header is malformed the journal will not roll back.
 | 
						|
Hence, one can say that the commit occurs as soon as the header
 | 
						|
is sufficiently changed to make it invalid.  Typically this happens
 | 
						|
as soon as the first byte of the header is zeroed.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="section_3_12"></a>
 | 
						|
<h2 id="2_releasing_the_lock"><span>3.12. </span>2 Releasing The Lock</h2>
 | 
						|
<img src="images/ac/commit-B.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>The last step in the commit process is to release the
 | 
						|
exclusive lock so that other processes can once again
 | 
						|
start accessing the database file.</p>
 | 
						|
 | 
						|
<p>In the diagram at the right, we show that the information
 | 
						|
that was held in user space is cleared when the lock is released.
 | 
						|
This used to be literally true for older versions of SQLite.  But
 | 
						|
more recent versions of SQLite keep the user space information
 | 
						|
in memory in case it might be needed again at the start of the
 | 
						|
next transaction.  It is cheaper to reuse information that is
 | 
						|
already in local memory than to transfer the information back
 | 
						|
from the operating system disk cache or to read it off of the
 | 
						|
disk drive again.  Prior to reusing the information in user space,
 | 
						|
we must first reacquire the shared lock and then we have to check
 | 
						|
to make sure that no other process modified the database file while
 | 
						|
we were not holding a lock.  There is a counter in the first page
 | 
						|
of the database that is incremented every time the database file
 | 
						|
is modified.  We can find out if another process has modified the
 | 
						|
database by checking that counter.  If the database was modified,
 | 
						|
then the user space cache must be cleared and reread.  But it is
 | 
						|
commonly the case that no changes have been made and the user
 | 
						|
space cache can be reused for a significant performance savings.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="rollback"></a>
 | 
						|
 | 
						|
<h1 id="_rollback"><span>4. </span> Rollback</h1>
 | 
						|
 | 
						|
<p>An atomic commit is supposed to happen instantaneously.  But the processing
 | 
						|
described above clearly takes a finite amount of time.
 | 
						|
Suppose the power to the computer were cut
 | 
						|
part way through the commit operation described above.  In order
 | 
						|
to maintain the illusion that the changes were instantaneous, we
 | 
						|
have to "rollback" any partial changes and restore the database to
 | 
						|
the state it was in prior to the beginning of the transaction.</p>
 | 
						|
 | 
						|
<a name="crisis"></a>
 | 
						|
 | 
						|
<h2 id="_when_something_goes_wrong_"><span>4.1. </span> When Something Goes Wrong...</h2>
 | 
						|
<img src="images/ac/rollback-0.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>Suppose the power loss occurred
 | 
						|
during <a href="#section_3_10">step 3.10</a> above,
 | 
						|
while the database changes were being written to disk.
 | 
						|
After power is restored, the situation might be something
 | 
						|
like what is shown to the right.  We were trying to change
 | 
						|
three pages of the database file but only one page was
 | 
						|
successfully written.  Another page was partially written
 | 
						|
and a third page was not written at all.</p>
 | 
						|
 | 
						|
<p>The rollback journal is complete and intact on disk when
 | 
						|
the power is restored.  This is a key point.  The reason for
 | 
						|
the flush operation in <a href="#section_3_7">step 3.7</a>
 | 
						|
is to make absolutely sure that
 | 
						|
all of the rollback journal is safely on nonvolatile storage
 | 
						|
prior to making any changes to the database file itself.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="section_4_2"></a>
 | 
						|
<h2 id="_hot_rollback_journals"><span>4.2. </span> Hot Rollback Journals</h2>
 | 
						|
<img src="images/ac/rollback-1.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>The first time that any SQLite process attempts to access
 | 
						|
the database file, it obtains a shared lock as described in
 | 
						|
<a href="section_3_2">section 3.2</a> above.
 | 
						|
But then it notices that there is a 
 | 
						|
rollback journal file present.  SQLite then checks to see if
 | 
						|
the rollback journal is a "hot journal".   A hot journal is
 | 
						|
a rollback journal that needs to be played back in order to
 | 
						|
restore the database to a sane state.  A hot journal only
 | 
						|
exists when an earlier process was in the middle of committing
 | 
						|
a transaction when it crashed or lost power.</p>
 | 
						|
 | 
						|
<p>A rollback journal is a "hot" journal if all of the following
 | 
						|
are true:</p>
 | 
						|
 | 
						|
<ul>
 | 
						|
<li>The rollback journal exists.
 | 
						|
</li><li>The rollback journal is not an empty file.
 | 
						|
</li><li>There is no reserved lock on the main database file.
 | 
						|
</li><li>The header of the rollback journal is well-formed and in particular
 | 
						|
    has not been zeroed out.
 | 
						|
</li><li>The rollback journal does not
 | 
						|
contain the name of a super-journal file (see
 | 
						|
<a href="#section_5_5">section 5.5</a> below) or if does
 | 
						|
contain the name of a super-journal, then that super-journal
 | 
						|
file exists.
 | 
						|
</li></ul>
 | 
						|
 | 
						|
<p>The presence of a hot journal is our indication
 | 
						|
that a previous process was trying to commit a transaction but
 | 
						|
it aborted for some reason prior to the completion of the
 | 
						|
commit.  A hot journal means that
 | 
						|
the database file is in an inconsistent state and needs to
 | 
						|
be repaired (by rollback) prior to being used.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="exlock"></a>
 | 
						|
 | 
						|
<h2 id="_obtaining_an_exclusive_lock_on_the_database"><span>4.3. </span> Obtaining An Exclusive Lock On The Database</h2>
 | 
						|
<img src="images/ac/rollback-2.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>The first step toward dealing with a hot journal is to
 | 
						|
obtain an exclusive lock on the database file.  This prevents two
 | 
						|
or more processes from trying to rollback the same hot journal
 | 
						|
at the same time.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="section_4_4"></a>
 | 
						|
<h2 id="_rolling_back_incomplete_changes"><span>4.4. </span> Rolling Back Incomplete Changes</h2>
 | 
						|
<img src="images/ac/rollback-3.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>Once a process obtains an exclusive lock, it is permitted
 | 
						|
to write to the database file.  It then proceeds to read the
 | 
						|
original content of pages out of the rollback journal and write
 | 
						|
that content back to where it came from in the database file.
 | 
						|
Recall that the header of the rollback journal records the original
 | 
						|
size of the database file prior to the start of the aborted
 | 
						|
transaction.  SQLite uses this information to truncate the
 | 
						|
database file back to its original size in cases where the
 | 
						|
incomplete transaction caused the database to grow.  At the
 | 
						|
end of this step, the database should be the same size and
 | 
						|
contain the same information as it did before the start of
 | 
						|
the aborted transaction.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="delhotjrnl"></a>
 | 
						|
 | 
						|
<h2 id="_deleting_the_hot_journal"><span>4.5. </span> Deleting The Hot Journal</h2>
 | 
						|
<img src="images/ac/rollback-4.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>After all information in the rollback journal has been
 | 
						|
played back into the database file (and flushed to disk in case
 | 
						|
we encounter yet another power failure), the hot rollback journal
 | 
						|
can be deleted.</p>
 | 
						|
 | 
						|
<p>As in <a href="#section_3_11">section 3.11</a>, the journal
 | 
						|
file might be truncated to zero length or its header might
 | 
						|
be overwritten with zeros as an optimization on systems where
 | 
						|
deleting a file is expensive.  Either way, the journal is no 
 | 
						|
longer hot after this step.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="cont"></a>
 | 
						|
 | 
						|
<h2 id="_continue_as_if_the_uncompleted_writes_had_never_happened"><span>4.6. </span> Continue As If The Uncompleted Writes Had Never Happened</h2>
 | 
						|
<img src="images/ac/rollback-5.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>The final recovery step is to reduce the exclusive lock back
 | 
						|
to a shared lock.  Once this happens, the database is back in the
 | 
						|
state that it would have been if the aborted transaction had never
 | 
						|
started.  Since all of this recovery activity happens completely
 | 
						|
automatically and transparently, it appears to the program using
 | 
						|
SQLite as if the aborted transaction had never begun.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="multicommit"></a>
 | 
						|
 | 
						|
<h1 id="_multi_file_commit"><span>5. </span> Multi-file Commit</h1>
 | 
						|
 | 
						|
<p>SQLite allows a single 
 | 
						|
<a href="c3ref/sqlite3.html">database connection</a> to talk to
 | 
						|
two or more database files simultaneously through the use of
 | 
						|
the <a href="lang_attach.html">ATTACH DATABASE</a> command.
 | 
						|
When multiple database files are modified within a single
 | 
						|
transaction, all files are updated atomically.  
 | 
						|
In other words, either all of the database files are updated or
 | 
						|
else none of them are.
 | 
						|
Achieving an atomic commit across multiple database files is
 | 
						|
more complex that doing so for a single file.  This section
 | 
						|
describes how SQLite works that bit of magic.</p>
 | 
						|
 | 
						|
<a name="multijrnl"></a>
 | 
						|
 | 
						|
<h2 id="_separate_rollback_journals_for_each_database"><span>5.1. </span> Separate Rollback Journals For Each Database</h2>
 | 
						|
<img src="images/ac/multi-0.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>When multiple database files are involved in a transaction,
 | 
						|
each database has its own rollback journal and each database
 | 
						|
is locked separately.  The diagram at the right shows a scenario
 | 
						|
where three different database files have been modified within
 | 
						|
one transaction.  The situation at this step is analogous to 
 | 
						|
the single-file transaction scenario at 
 | 
						|
<a href="#section_3_6">step 3.6</a>.  Each database file has
 | 
						|
a reserved lock.  For each database, the original content of pages 
 | 
						|
that are being changed have been written into the rollback journal
 | 
						|
for that database, but the content of the journals have not yet
 | 
						|
been flushed to disk.  No changes have been made to the database
 | 
						|
file itself yet, though presumably there are changes being held
 | 
						|
in user memory.</p>
 | 
						|
 | 
						|
<p>For brevity, the diagrams in this section are simplified from
 | 
						|
those that came before.  Blue color still signifies original content
 | 
						|
and pink still signifies new content.  But the individual pages
 | 
						|
in the rollback journal and the database file are not shown and
 | 
						|
we are not making the distinction between information in the
 | 
						|
operating system cache and information that is on disk.  All of
 | 
						|
these factors still apply in a multi-file commit scenario.  They
 | 
						|
just take up a lot of space in the diagrams and they do not add
 | 
						|
any new information, so they are omitted here.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="sprjrnl"></a>
 | 
						|
 | 
						|
<h2 id="_the_super_journal_file"><span>5.2. </span> The Super-Journal File</h2>
 | 
						|
<img src="images/ac/multi-1.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>The next step in a multi-file commit is the creation of a
 | 
						|
"super-journal" file.  The name of the super-journal file is
 | 
						|
the same name as the original database filename (the database
 | 
						|
that was opened using the 
 | 
						|
<a href="c3ref/open.html">sqlite3_open()</a> interface,
 | 
						|
not one of the <a href="lang_attach.html">ATTACHed</a> auxiliary
 | 
						|
databases) with the text "<b>-mj</b><i>HHHHHHHH</i>" appended where
 | 
						|
<i>HHHHHHHH</i> is a random 32-bit hexadecimal number.  The
 | 
						|
random <i>HHHHHHHH</i> suffix changes for every new super-journal.</p>
 | 
						|
 | 
						|
<p><i>(Nota bene: The formula for computing the super-journal filename
 | 
						|
given in the previous paragraph corresponds to the implementation as
 | 
						|
of SQLite version 3.5.0.  But this formula is not part of the SQLite
 | 
						|
specification and is subject to change in future releases.)</i></p>
 | 
						|
 | 
						|
<p>Unlike the rollback journals, the super-journal does not contain
 | 
						|
any original database page content.  Instead, the super-journal contains
 | 
						|
the full pathnames for rollback journals for every database that is
 | 
						|
participating in the transaction.</p>
 | 
						|
 | 
						|
<p>After the super-journal is constructed, its content is flushed
 | 
						|
to disk before any further actions are taken.  On Unix, the directory
 | 
						|
that contains the super-journal is also synced in order to make sure
 | 
						|
the super-journal file will appear in the directory following a
 | 
						|
power failure.</p>
 | 
						|
 | 
						|
<p>The purpose of the super-journal is to ensure that multi-file
 | 
						|
transactions are atomic across a power-loss.  But if the database files
 | 
						|
have other settings that compromise integrity across a power-loss event
 | 
						|
(such as <a href="pragma.html#pragma_synchronous">PRAGMA synchronous=OFF</a> or <a href="pragma.html#pragma_journal_mode">PRAGMA journal_mode=MEMORY</a>) then
 | 
						|
the creation of the super-journal is omitted, as an optimization.
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="multijrnlupdate"></a>
 | 
						|
 | 
						|
</p><h2 id="_updating_rollback_journal_headers"><span>5.3. </span> Updating Rollback Journal Headers</h2>
 | 
						|
<img src="images/ac/multi-2.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>The next step is to record the full pathname of the super-journal file
 | 
						|
in the header of every rollback journal.  Space to hold the 
 | 
						|
super-journal filename was reserved at the beginning of each rollback journal
 | 
						|
as the rollback journals were created.</p>
 | 
						|
 | 
						|
<p>The content of each rollback journal is flushed to disk both before
 | 
						|
and after the super-journal filename is written into the rollback
 | 
						|
journal header.  It is important to do both of these flushes.  Fortunately,
 | 
						|
the second flush is usually inexpensive since typically only a single
 | 
						|
page of the journal file (the first page) has changed.</p>
 | 
						|
 | 
						|
<p>This step is analogous to 
 | 
						|
<a href="#section_3_7">step 3.7</a> in the single-file commit
 | 
						|
scenario described above.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="multidbupdate"></a>
 | 
						|
 | 
						|
<h2 id="_updating_the_database_files"><span>5.4. </span> Updating The Database Files</h2>
 | 
						|
<img src="images/ac/multi-3.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>Once all rollback journal files have been flushed to disk, it
 | 
						|
is safe to begin updating database files.  We have to obtain an
 | 
						|
exclusive lock on all database files before writing the changes.
 | 
						|
After all the changes are written, it is important to flush the
 | 
						|
changes to disk so that they will be preserved in the event of
 | 
						|
a power failure or operating system crash.</p>
 | 
						|
 | 
						|
<p>This step corresponds to steps
 | 
						|
<a href="#section_3_8">3.8</a>,
 | 
						|
<a href="#section_3_9">3.9</a>, and
 | 
						|
<a href="#section_3_10">3.10</a> in the single-file commit
 | 
						|
scenario described previously.</p>
 | 
						|
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="section_5_5"></a>
 | 
						|
<h2 id="_delete_the_super_journal_file"><span>5.5. </span> Delete The Super-Journal File</h2>
 | 
						|
<img src="images/ac/multi-4.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>The next step is to delete the super-journal file.
 | 
						|
This is the point where the multi-file transaction commits.
 | 
						|
This step corresponds to 
 | 
						|
<a href="#section_3_11">step 3.11</a> in the single-file
 | 
						|
commit scenario where the rollback journal is deleted.</p>
 | 
						|
 | 
						|
<p>If a power failure or operating system crash occurs at this
 | 
						|
point, the transaction will not rollback when the system reboots
 | 
						|
even though there are rollback journals present.  The
 | 
						|
difference is the super-journal pathname in the header of the
 | 
						|
rollback journal.  Upon restart, SQLite only considers a journal
 | 
						|
to be hot and will only playback the journal if there is no
 | 
						|
super-journal filename in the header (which is the case for
 | 
						|
a single-file commit) or if the super-journal file still
 | 
						|
exists on disk.</p>
 | 
						|
 | 
						|
<br clear="both">
 | 
						|
<a name="cleanup"></a>
 | 
						|
 | 
						|
<h2 id="_clean_up_the_rollback_journals"><span>5.6. </span> Clean Up The Rollback Journals</h2>
 | 
						|
<img src="images/ac/multi-5.gif" align="right" hspace="15">
 | 
						|
 | 
						|
<p>The final step in a multi-file commit is to delete the
 | 
						|
individual rollback journals and drop the exclusive locks on
 | 
						|
the database files so that other processes can see the changes.
 | 
						|
This corresponds to 
 | 
						|
<a href="#section_3_12">step 3.12</a> in the single-file
 | 
						|
commit sequence.</p>
 | 
						|
 | 
						|
<p>The transaction has already committed at this point so timing
 | 
						|
is not critical in the deletion of the rollback journals.
 | 
						|
The current implementation deletes a single rollback journal
 | 
						|
then unlocks the corresponding database file before proceeding
 | 
						|
to the next rollback journal.  But in the future we might change
 | 
						|
this so that all rollback journals are deleted before any database
 | 
						|
files are unlocked.  As long as the rollback journal is deleted before
 | 
						|
its corresponding database file is unlocked it does not matter in what
 | 
						|
order the rollback journals are deleted or the database files are
 | 
						|
unlocked.</p>
 | 
						|
 | 
						|
<a name="moredetail"></a>
 | 
						|
 | 
						|
<h1 id="_additional_details_of_the_commit_process"><span>6. </span> Additional Details Of The Commit Process</h1>
 | 
						|
 | 
						|
<p><a href="#section_3_0">Section 3.0</a> above provides an overview of
 | 
						|
how atomic commit works in SQLite.  But it glosses over a number of
 | 
						|
important details.  The following subsections will attempt to fill
 | 
						|
in the gaps.</p>
 | 
						|
 | 
						|
<a name="completesectors"></a>
 | 
						|
 | 
						|
<h2 id="_always_journal_complete_sectors"><span>6.1. </span> Always Journal Complete Sectors</h2>
 | 
						|
 | 
						|
<p>When the original content of a database page is written into
 | 
						|
the rollback journal (as shown in <a href="#section_3_5">section 3.5</a>),
 | 
						|
SQLite always writes a complete sector of data, even if the
 | 
						|
page size of the database is smaller than the sector size.  
 | 
						|
Historically, the sector size in SQLite has been hard coded to 512
 | 
						|
bytes and since the minimum page size is also 512 bytes, this has never
 | 
						|
been an issue.  But beginning with SQLite version 3.3.14, it is possible
 | 
						|
for SQLite to use mass storage devices with a sector size larger than 512
 | 
						|
bytes.  So, beginning with version 3.3.14, whenever any page within a
 | 
						|
sector is written into the journal file, all pages in that same sector
 | 
						|
are stored with it.</p>
 | 
						|
 | 
						|
<p>It is important to store all pages of a sector in the rollback
 | 
						|
journal in order to prevent database corruption following a power
 | 
						|
loss while writing the sector.  Suppose that pages 1, 2, 3, and 4 are
 | 
						|
all stored in sector 1 and that page 2 is modified.  In order to write
 | 
						|
the changes to page 2, the underlying hardware must also rewrite the
 | 
						|
content of pages 1, 3, and 4 since the hardware must write the complete
 | 
						|
sector.  If this write operation is interrupted by a power outage,
 | 
						|
one or more of the pages 1, 3, or 4 might be left with incorrect data.
 | 
						|
Hence, to avoid lasting corruption to the database, the original content
 | 
						|
of all of those pages must be contained in the rollback journal.</p>
 | 
						|
 | 
						|
<a name="journalgarbage"></a>
 | 
						|
 | 
						|
<h2 id="_dealing_with_garbage_written_into_journal_files"><span>6.2. </span> Dealing With Garbage Written Into Journal Files</h2>
 | 
						|
 | 
						|
<p>When data is appended to the end of the rollback journal,
 | 
						|
SQLite normally makes the pessimistic assumption that the file
 | 
						|
is first extended with invalid "garbage" data and that afterwards
 | 
						|
the correct data replaces the garbage.  In other words, SQLite assumes
 | 
						|
that the file size is increased first and then afterwards the content
 | 
						|
is written into the file.  If a power failure occurs after the file
 | 
						|
size has been increased but before the file content has been written,
 | 
						|
the rollback journal can be left containing garbage data.  If after
 | 
						|
power is restored, another SQLite process sees the rollback journal
 | 
						|
containing the garbage data and tries to roll it back into the original
 | 
						|
database file, it might copy some of the garbage into the database file
 | 
						|
and thus corrupt the database file.</p>
 | 
						|
 | 
						|
<p>SQLite uses two defenses against this problem.  In the first place,
 | 
						|
SQLite records the number of pages in the rollback journal in the header
 | 
						|
of the rollback journal.  This number is initially zero.  So during an
 | 
						|
attempt to rollback an incomplete (and possibly corrupt) rollback
 | 
						|
journal, the process doing the rollback will see that the journal
 | 
						|
contains zero pages and will thus make no changes to the database.  Prior
 | 
						|
to a commit, the rollback journal is flushed to disk to ensure that
 | 
						|
all content has been synced to disk and there is no "garbage" left
 | 
						|
in the file, and only then is the page count in the header changed from
 | 
						|
zero to true number of pages in the rollback journal.  The rollback journal
 | 
						|
header is always kept in a separate sector from any page data so that
 | 
						|
it can be overwritten and flushed without risking damage to a data
 | 
						|
page if a power outage occurs.  Notice that the rollback journal
 | 
						|
is flushed to disk twice: once to write the page content and a second
 | 
						|
time to write the page count in the header.</p>
 | 
						|
 | 
						|
<p>The previous paragraph describes what happens when the
 | 
						|
synchronous pragma setting is "full".</p>
 | 
						|
 | 
						|
<blockquote>
 | 
						|
PRAGMA synchronous=FULL;
 | 
						|
</blockquote>
 | 
						|
 | 
						|
<p>The default synchronous setting is full so the above is what usually
 | 
						|
happens.  However, if the synchronous setting is lowered to "normal",
 | 
						|
SQLite only flushes the rollback journal once, after the page count has
 | 
						|
been written.
 | 
						|
This carries a risk of corruption because it might happen that the 
 | 
						|
modified (non-zero) page count reaches the disk surface before all
 | 
						|
of the data does.  The data will have been written first, but SQLite
 | 
						|
assumes that the underlying filesystem can reorder write requests and
 | 
						|
that the page count can be burned into oxide first even though its
 | 
						|
write request occurred last.  So as a second line of defense, SQLite
 | 
						|
also uses a 32-bit checksum on every page of data in the rollback
 | 
						|
journal.  This checksum is evaluated for each page during rollback
 | 
						|
while rolling back a journal as described in 
 | 
						|
<a href="#section_4_4">section 4.4</a>.  If an incorrect checksum
 | 
						|
is seen, the rollback is abandoned.  Note that the checksum does
 | 
						|
not guarantee that the page data is correct since there is a small
 | 
						|
but finite probability that the checksum might be right even if the data is
 | 
						|
corrupt.  But the checksum does at least make such an error unlikely.
 | 
						|
</p>
 | 
						|
 | 
						|
<p>Note that the checksums in the rollback journal are not necessary
 | 
						|
if the synchronous setting is FULL.  We only depend on the checksums
 | 
						|
when synchronous is lowered to NORMAL.  Nevertheless, the checksums
 | 
						|
never hurt and so they are included in the rollback journal regardless
 | 
						|
of the synchronous setting.</p>
 | 
						|
 | 
						|
<a name="cachespill"></a>
 | 
						|
 | 
						|
<h2 id="_cache_spill_prior_to_commit"><span>6.3. </span> Cache Spill Prior To Commit</h2>
 | 
						|
 | 
						|
<p>The commit process shown in <a href="#section_3_0">section 3.0</a>
 | 
						|
assumes that all database changes fit in memory until it is time to
 | 
						|
commit.  This is the common case.  But sometimes a larger change will
 | 
						|
overflow the user-space cache prior to transaction commit.  In those
 | 
						|
cases, the cache must spill to the database before the transaction
 | 
						|
is complete.</p>
 | 
						|
 | 
						|
<p>At the beginning of a cache spill, the status of the database
 | 
						|
connection is as shown in <a href="#section_3_6">step 3.6</a>.
 | 
						|
Original page content has been saved in the rollback journal and
 | 
						|
modifications of the pages exist in user memory.  To spill the cache,
 | 
						|
SQLite executes steps <a href="#section_3_7">3.7</a> through
 | 
						|
<a href="#section_3_9">3.9</a>.  In other words, the rollback journal
 | 
						|
is flushed to disk, an exclusive lock is acquired, and changes are
 | 
						|
written into the database.  But the remaining steps are deferred
 | 
						|
until the transaction really commits.  A new journal header is
 | 
						|
appended to the end of the rollback journal (in its own sector)
 | 
						|
and the exclusive database lock is retained, but otherwise processing
 | 
						|
returns to <a href="#section_3_6">step 3.6</a>.  When the transaction
 | 
						|
commits, or if another cache spill occurs, steps
 | 
						|
<a href="#section_3_7">3.7</a> and <a href="#section_3_9">3.9</a> are
 | 
						|
repeated.  (Step <a href="#section_3_8">3.8</a> is omitted on second
 | 
						|
and subsequent passes since an exclusive database lock is already held
 | 
						|
due to the first pass.)</p>
 | 
						|
 | 
						|
<p>A cache spill causes the lock on the database file to
 | 
						|
escalate from reserved to exclusive.  This reduces concurrency.
 | 
						|
A cache spill also causes extra disk flush or fsync operations to
 | 
						|
occur and these operations are slow, hence a cache spill can
 | 
						|
seriously reduce performance.
 | 
						|
For these reasons a cache spill is avoided whenever possible.</p>
 | 
						|
 | 
						|
<a name="opts"></a>
 | 
						|
 | 
						|
<h1 id="_optimizations"><span>7. </span> Optimizations</h1>
 | 
						|
 | 
						|
<p>Profiling indicates that for most systems and in most circumstances
 | 
						|
SQLite spends most of its time doing disk I/O.  It follows then that
 | 
						|
anything we can do to reduce the amount of disk I/O will likely have a
 | 
						|
large positive impact on the performance of SQLite.  This section
 | 
						|
describes some of the techniques used by SQLite to try to reduce the
 | 
						|
amount of disk I/O to a minimum while still preserving atomic commit.</p>
 | 
						|
 | 
						|
<a name="keepcache"></a>
 | 
						|
 | 
						|
<h2 id="_cache_retained_between_transactions"><span>7.1. </span> Cache Retained Between Transactions</h2>
 | 
						|
 | 
						|
<p><a href="#section_3_12">Step 3.12</a> of the commit process shows
 | 
						|
that once the shared lock has been released, all user-space cache
 | 
						|
images of database content must be discarded.  This is done because
 | 
						|
without a shared lock, other processes are free to modify the database
 | 
						|
file content and so any user-space image of that content might become
 | 
						|
obsolete.  Consequently, each new transaction would begin by rereading
 | 
						|
data which had previously been read.  This is not as bad as it sounds
 | 
						|
at first since the data being read is still likely in the operating
 | 
						|
systems file cache.  So the "read" is really just a copy of data
 | 
						|
from kernel space into user space.  But even so, it still takes time.</p>
 | 
						|
 | 
						|
<p>Beginning with SQLite version 3.3.14 a mechanism has been added
 | 
						|
to try to reduce the needless rereading of data.  In newer versions
 | 
						|
of SQLite, the data in the user-space pager cache is retained when
 | 
						|
the lock on the database file is released.  Later, after the
 | 
						|
shared lock is acquired at the beginning of the next transaction,
 | 
						|
SQLite checks to see if any other process has modified the database
 | 
						|
file.  If the database has been changed in any way since the lock
 | 
						|
was last released, the user-space cache is erased at that point.
 | 
						|
But commonly the database file is unchanged and the user-space cache
 | 
						|
can be retained, and some unnecessary read operations can be avoided.</p>
 | 
						|
 | 
						|
<p>In order to determine whether or not the database file has changed,
 | 
						|
SQLite uses a counter in the database header (in bytes 24 through 27)
 | 
						|
which is incremented during every change operation.  SQLite saves a copy
 | 
						|
of this counter prior to releasing its database lock.  Then after
 | 
						|
acquiring the next database lock it compares the saved counter value
 | 
						|
against the current counter value and erases the cache if the values
 | 
						|
are different, or reuses the cache if they are the same.</p>
 | 
						|
 | 
						|
<a name="section_7_2"></a>
 | 
						|
<h2 id="_exclusive_access_mode"><span>7.2. </span> Exclusive Access Mode</h2>
 | 
						|
 | 
						|
<p>SQLite version 3.3.14 adds the concept of "Exclusive Access Mode".
 | 
						|
In exclusive access mode, SQLite retains the exclusive
 | 
						|
database lock at the conclusion of each transaction.  This prevents
 | 
						|
other processes from accessing the database, but in many deployments
 | 
						|
only a single process is using a database so this is not a
 | 
						|
serious problem.  The advantage of exclusive access mode is that
 | 
						|
disk I/O can be reduced in three ways:</p>
 | 
						|
 | 
						|
<ol>
 | 
						|
<li><p>It is not necessary to increment the change counter in the
 | 
						|
database header for transactions after the first transaction.  This
 | 
						|
will often save a write of page one to both the rollback
 | 
						|
journal and the main database file.</p></li>
 | 
						|
 | 
						|
<li><p>No other processes can change the database so there is never
 | 
						|
a need to check the change counter and clear the user-space cache
 | 
						|
at the beginning of a transaction.</p></li>
 | 
						|
 | 
						|
<li><p>Each transaction can be committed by overwriting the rollback
 | 
						|
journal header with zeros rather than deleting the journal file.
 | 
						|
This avoids having to modify the directory entry for the journal file
 | 
						|
and it avoids having to deallocate disk sectors associated with the 
 | 
						|
journal.  Furthermore, the next transaction will overwrite existing
 | 
						|
journal file content rather than append new content and on most systems
 | 
						|
overwriting is much faster than appending.</p></li>
 | 
						|
</ol>
 | 
						|
 | 
						|
<p>The third optimization, zeroing the journal file header rather than
 | 
						|
deleting the rollback journal file,
 | 
						|
does not depend on holding an exclusive lock at all times.
 | 
						|
This optimization can be set independently of exclusive lock mode
 | 
						|
using the <a href="pragma.html#pragma_journal_mode">journal_mode pragma</a>
 | 
						|
as described in <a href="#section_7_6">section 7.6</a> below.</p>
 | 
						|
 | 
						|
<a name="freelistjrnl"></a>
 | 
						|
 | 
						|
<h2 id="_do_not_journal_freelist_pages"><span>7.3. </span> Do Not Journal Freelist Pages</h2>
 | 
						|
 | 
						|
<p>When information is deleted from an SQLite database, the pages used
 | 
						|
to hold the deleted information are added to a "<a href="fileformat2.html#freelist">freelist</a>".  Subsequent
 | 
						|
inserts will draw pages off of this freelist rather than expanding the
 | 
						|
database file.</p>
 | 
						|
 | 
						|
<p>Some freelist pages contain critical data; specifically the locations
 | 
						|
of other freelist pages.  But most freelist pages contain nothing useful.
 | 
						|
These latter freelist pages are called "leaf" pages.  We are free to
 | 
						|
modify the content of a leaf freelist page in the database without
 | 
						|
changing the meaning of the database in any way.</p>
 | 
						|
 | 
						|
<p>Because the content of leaf freelist pages is unimportant, SQLite
 | 
						|
avoids storing leaf freelist page content in the rollback journal
 | 
						|
in <a href="#section_3_5">step 3.5</a> of the commit process.
 | 
						|
If a leaf freelist page is changed and that change does not get rolled back
 | 
						|
during a transaction recovery, the database is not harmed by the omission.
 | 
						|
Similarly, the content of a new freelist page is never written back
 | 
						|
into the database at <a href="#section_3_9">step 3.9</a> nor
 | 
						|
read from the database at <a href="#section_3_3">step 3.3</a>.
 | 
						|
These optimizations can greatly reduce the amount of I/O that occurs
 | 
						|
when making changes to a database file that contains free space.</p>
 | 
						|
 | 
						|
<a name="atomicsector"></a>
 | 
						|
 | 
						|
<h2 id="_single_page_updates_and_atomic_sector_writes"><span>7.4. </span> Single Page Updates And Atomic Sector Writes</h2>
 | 
						|
 | 
						|
<p>Beginning in SQLite version 3.5.0, the new Virtual File System (VFS)
 | 
						|
interface contains a method named xDeviceCharacteristics which reports
 | 
						|
on special properties that the underlying mass storage device
 | 
						|
might have.  Among the special properties that
 | 
						|
xDeviceCharacteristics might report is the ability of to do an
 | 
						|
atomic sector write.</p>
 | 
						|
 | 
						|
<p>Recall that by default SQLite assumes that sector writes are
 | 
						|
linear but not atomic.  A linear write starts at one end of the
 | 
						|
sector and changes information byte by byte until it gets to the
 | 
						|
other end of the sector.  If a power loss occurs in the middle of
 | 
						|
a linear write then part of the sector might be modified while the
 | 
						|
other end is unchanged.  In an atomic sector write, either the entire
 | 
						|
sector is overwritten or else nothing in the sector is changed.</p>
 | 
						|
 | 
						|
<p>We believe that most modern disk drives implement atomic sector
 | 
						|
writes.  When power is lost, the drive uses energy stored in capacitors
 | 
						|
and/or the angular momentum of the disk platter to provide power to 
 | 
						|
complete any operation in progress.  Nevertheless, there are so many
 | 
						|
layers in between the write system call and the on-board disk drive
 | 
						|
electronics that we take the safe approach in both Unix and w32 VFS
 | 
						|
implementations and assume that sector writes are not atomic.  On the
 | 
						|
other hand, device
 | 
						|
manufacturers with more control over their filesystems might want
 | 
						|
to consider enabling the atomic write property of xDeviceCharacteristics
 | 
						|
if their hardware really does do atomic writes.</p>
 | 
						|
 | 
						|
<p>When sector writes are atomic and the page size of a database is
 | 
						|
the same as a sector size, and when there is a database change that
 | 
						|
only touches a single database page, then SQLite skips the whole
 | 
						|
journaling and syncing process and simply writes the modified page
 | 
						|
directly into the database file.  The change counter in the first
 | 
						|
page of the database file is modified separately since no harm is
 | 
						|
done if power is lost before the change counter can be updated.</p>
 | 
						|
 | 
						|
<a name="safeappend"></a>
 | 
						|
 | 
						|
<h2 id="_filesystems_with_safe_append_semantics"><span>7.5. </span> Filesystems With Safe Append Semantics</h2>
 | 
						|
 | 
						|
<p>Another optimization introduced in SQLite version 3.5.0 makes
 | 
						|
use of "safe append" behavior of the underlying disk.
 | 
						|
Recall that SQLite assumes that when data is appended to a file
 | 
						|
(specifically to the rollback journal) that the size of the file
 | 
						|
is increased first and that the content is written second.  So
 | 
						|
if power is lost after the file size is increased but before the
 | 
						|
content is written, the file is left containing invalid "garbage"
 | 
						|
data.  The xDeviceCharacteristics method of the VFS might, however,
 | 
						|
indicate that the filesystem implements "safe append" semantics.
 | 
						|
This means that the content is written before the file size is
 | 
						|
increased so that it is impossible for garbage to be introduced
 | 
						|
into the rollback journal by a power loss or system crash.</p>
 | 
						|
 | 
						|
<p>When safe append semantics are indicated for a filesystem,
 | 
						|
SQLite always stores the special value of -1 for the page count
 | 
						|
in the header of the rollback journal.  The -1 page count value
 | 
						|
tells any process attempting to rollback the journal that the
 | 
						|
number of pages in the journal should be computed from the journal
 | 
						|
size.  This -1 value is never changed.  So that when a commit
 | 
						|
occurs, we save a single flush operation and a sector write of
 | 
						|
the first page of the journal file.  Furthermore, when a cache
 | 
						|
spill occurs we no longer need to append a new journal header
 | 
						|
to the end of the journal; we can simply continue appending
 | 
						|
new pages to the end of the existing journal.</p>
 | 
						|
 | 
						|
<a name="section_7_6"></a>
 | 
						|
<h2 id="_persistent_rollback_journals"><span>7.6. </span> Persistent Rollback Journals</h2>
 | 
						|
 | 
						|
<p>Deleting a file is an expensive operation on many systems.
 | 
						|
So as an optimization, SQLite can be configured to avoid the
 | 
						|
delete operation of <a href="#section_3_11">section 3.11</a>.
 | 
						|
Instead of deleting the journal file in order to commit a transaction,
 | 
						|
the file is either truncated to zero bytes in length or its
 | 
						|
header is overwritten with zeros.  Truncating the file to zero
 | 
						|
length saves having to make modifications to the directory containing
 | 
						|
the file since the file is not removed from the directory. 
 | 
						|
Overwriting the header has the additional savings of not having
 | 
						|
to update the length of the file (in the "inode" on many systems)
 | 
						|
and not having to deal with newly freed disk sectors.  Furthermore,
 | 
						|
at the next transaction the journal will be created by overwriting
 | 
						|
existing content rather than appending new content onto the end
 | 
						|
of a file, and overwriting is often much faster than appending.</p>
 | 
						|
 | 
						|
<p>SQLite can be configured to commit transactions by overwriting
 | 
						|
the journal header with zeros instead of deleting the journal file
 | 
						|
by setting the "PERSIST" journaling mode using the 
 | 
						|
<a href="pragma.html#pragma_journal_mode">journal_mode</a> PRAGMA.
 | 
						|
For example:</p>
 | 
						|
 | 
						|
<blockquote><pre>
 | 
						|
PRAGMA journal_mode=PERSIST;
 | 
						|
</pre></blockquote>
 | 
						|
 | 
						|
<p>The use of persistent journal mode provides a noticeable performance
 | 
						|
improvement on many systems.  Of course, the drawback is that the 
 | 
						|
journal files remain on the disk, using disk space and cluttering
 | 
						|
directories, long after the transaction commits.  The only safe way
 | 
						|
to delete a persistent journal file is to commit a transaction
 | 
						|
with journaling mode set to DELETE:</p>
 | 
						|
 | 
						|
<blockquote><pre>
 | 
						|
PRAGMA journal_mode=DELETE;
 | 
						|
BEGIN EXCLUSIVE;
 | 
						|
COMMIT;
 | 
						|
</pre></blockquote>
 | 
						|
 | 
						|
<p>Beware of deleting persistent journal files by any other means
 | 
						|
since the journal file might be hot, in which case deleting it will
 | 
						|
corrupt the corresponding database file.</p>
 | 
						|
 | 
						|
<p>Beginning in SQLite <a href="releaselog/3_6_4.html">version 3.6.4</a> (2008-10-15), 
 | 
						|
the TRUNCATE journal mode is
 | 
						|
also supported:</p>
 | 
						|
 | 
						|
<blockquote><pre>
 | 
						|
PRAGMA journal_mode=TRUNCATE;
 | 
						|
</pre></blockquote>
 | 
						|
 | 
						|
<p>In truncate journal mode, the transaction is committed by truncating
 | 
						|
the journal file to zero length rather than deleting the journal file
 | 
						|
(as in DELETE mode) or by zeroing the header (as in PERSIST mode).
 | 
						|
TRUNCATE mode shares the advantage of PERSIST mode that the directory
 | 
						|
that contains the journal file and database does not need to be updated.
 | 
						|
Hence truncating a file is often faster than deleting it.  TRUNCATE has
 | 
						|
the additional advantage that it is not followed by a
 | 
						|
system call (ex: fsync()) to synchronize the change to disk.  It might
 | 
						|
be safer if it did.
 | 
						|
But on many modern filesystems, a truncate is an atomic and
 | 
						|
synchronous operation and so we think that TRUNCATE will usually be safe
 | 
						|
in the face of power failures.  If you are uncertain about whether or
 | 
						|
not TRUNCATE will be synchronous and atomic on your filesystem and it is
 | 
						|
important to you that your database survive a power loss or operating
 | 
						|
system crash that occurs during the truncation operation, then you might
 | 
						|
consider using a different journaling mode.</p>
 | 
						|
 | 
						|
<p>On embedded systems with synchronous filesystems, TRUNCATE results
 | 
						|
in slower behavior than PERSIST.  The commit operation is the same speed.
 | 
						|
But subsequent transactions are slower following a TRUNCATE because it is
 | 
						|
faster to overwrite existing content than to append to the end of a file.
 | 
						|
New journal file entries will always be appended following a TRUNCATE but
 | 
						|
will usually overwrite with PERSIST.</p>
 | 
						|
 | 
						|
<a name="testing"></a>
 | 
						|
 | 
						|
<h1 id="_testing_atomic_commit_behavior"><span>8. </span> Testing Atomic Commit Behavior</h1>
 | 
						|
 | 
						|
<p>The developers of SQLite are confident that it is robust
 | 
						|
in the face of power failures and system crashes because the
 | 
						|
automatic test procedures do extensive checks on
 | 
						|
the ability of SQLite to recover from simulated power loss.
 | 
						|
We call these the "crash tests".</p>
 | 
						|
 | 
						|
<p>Crash tests in SQLite use a modified VFS that can simulate
 | 
						|
the kinds of filesystem damage that occur during a power
 | 
						|
loss or operating system crash.  The crash-test VFS can simulate
 | 
						|
incomplete sector writes, pages filled with garbage data because
 | 
						|
a write has not completed, and out of order writes, all occurring
 | 
						|
at varying points during a test scenario.  Crash tests execute
 | 
						|
transactions over and over, varying the time at which a simulated
 | 
						|
power loss occurs and the properties of the damage inflicted.
 | 
						|
Each test then reopens the database after the simulated crash and
 | 
						|
verifies that the transaction either occurred completely
 | 
						|
or not at all and that the database is in a completely
 | 
						|
consistent state.</p>
 | 
						|
 | 
						|
<p>The crash tests in SQLite have discovered a number of very
 | 
						|
subtle bugs (now fixed) in the recovery mechanism.  Some of 
 | 
						|
these bugs were very obscure and unlikely to have been found
 | 
						|
using only code inspection and analysis techniques.  From this
 | 
						|
experience, the developers of SQLite feel confident that any other
 | 
						|
database system that does not use a similar crash test system
 | 
						|
likely contains undetected bugs that will lead to database
 | 
						|
corruption following a system crash or power failure.</p>
 | 
						|
 | 
						|
<a name="sect_9_0"></a>
 | 
						|
 | 
						|
<h1 id="_things_that_can_go_wrong"><span>9. </span> Things That Can Go Wrong</h1>
 | 
						|
 | 
						|
<p>The atomic commit mechanism in SQLite has proven to be robust,
 | 
						|
but it can be circumvented by a sufficiently creative
 | 
						|
adversary or a sufficiently broken operating system implementation.
 | 
						|
This section describes a few of the ways in which an SQLite database
 | 
						|
might be corrupted by a power failure or system crash.
 | 
						|
(See also: <a href="howtocorrupt.html">How To Corrupt Your Database Files</a>.)</p>
 | 
						|
 | 
						|
<a name="brokenlocks"></a>
 | 
						|
 | 
						|
<h2 id="_broken_locking_implementations"><span>9.1. </span> Broken Locking Implementations</h2>
 | 
						|
 | 
						|
<p>SQLite uses filesystem locks to make sure that only one
 | 
						|
process and database connection is trying to modify the database
 | 
						|
at a time.  The filesystem locking mechanism is implemented
 | 
						|
in the VFS layer and is different for every operating system.
 | 
						|
SQLite depends on this implementation being correct.  If something
 | 
						|
goes wrong and two or more processes are able to write the same
 | 
						|
database file at the same time, severe damage can result.</p>
 | 
						|
 | 
						|
<p>We have received reports of implementations of both
 | 
						|
Windows network filesystems and NFS in which locking was
 | 
						|
subtly broken.  We can not verify these reports, but as
 | 
						|
locking is difficult to get right on a network filesystem
 | 
						|
we have no reason to doubt them.  You are advised to 
 | 
						|
avoid using SQLite on a network filesystem in the first place,
 | 
						|
since performance will be slow.  But if you must use a 
 | 
						|
network filesystem to store SQLite database files, consider
 | 
						|
using a secondary locking mechanism to prevent simultaneous
 | 
						|
writes to the same database even if the native filesystem
 | 
						|
locking mechanism malfunctions.</p>
 | 
						|
 | 
						|
<p>The versions of SQLite that come preinstalled on Apple
 | 
						|
Mac OS X computers contain a version of SQLite that has been
 | 
						|
extended to use alternative locking strategies that work on
 | 
						|
all network filesystems that Apple supports.  These extensions
 | 
						|
used by Apple work great as long as all processes are accessing
 | 
						|
the database file in the same way.  Unfortunately, the locking
 | 
						|
mechanisms do not exclude one another, so if one process is
 | 
						|
accessing a file using (for example) AFP locking and another
 | 
						|
process (perhaps on a different machine) is using dot-file locks,
 | 
						|
the two processes might collide because AFP locks do not exclude
 | 
						|
dot-file locks or vice versa.</p>
 | 
						|
 | 
						|
<a name="fsync"></a>
 | 
						|
 | 
						|
<h2 id="_incomplete_disk_flushes"><span>9.2. </span> Incomplete Disk Flushes</h2>
 | 
						|
 | 
						|
<p>SQLite uses the fsync() system call on Unix and the FlushFileBuffers()
 | 
						|
system call on w32 in order to sync the file system buffers onto disk
 | 
						|
oxide as shown in <a href="#section_3_7">step 3.7</a> and
 | 
						|
<a href="#section_3_10">step 3.10</a>.  Unfortunately, we have received
 | 
						|
reports that neither of these interfaces works as advertised on many
 | 
						|
systems.  We hear that FlushFileBuffers() can be completely disabled
 | 
						|
using registry settings on some Windows versions.  Some historical
 | 
						|
versions of Linux contain versions of fsync() which are no-ops on
 | 
						|
some filesystems, we are told.  Even on systems where 
 | 
						|
FlushFileBuffers() and fsync() are said to be working, often
 | 
						|
the IDE disk control lies and says that data has reached oxide
 | 
						|
while it is still held only in the volatile control cache.</p>
 | 
						|
 | 
						|
<p>On the Mac, you can set this pragma:</p>
 | 
						|
 | 
						|
<blockquote>
 | 
						|
PRAGMA fullfsync=ON;
 | 
						|
</blockquote>
 | 
						|
 | 
						|
<p>Setting fullfsync on a Mac will guarantee that data really does
 | 
						|
get pushed out to the disk platter on a flush.  But the implementation
 | 
						|
of fullfsync involves resetting the disk controller.  And so not only
 | 
						|
is it profoundly slow, it also slows down other unrelated disk I/O.
 | 
						|
So its use is not recommended.</p>
 | 
						|
 | 
						|
<a name="filedel"></a>
 | 
						|
 | 
						|
<h2 id="_partial_file_deletions"><span>9.3. </span> Partial File Deletions</h2>
 | 
						|
 | 
						|
<p>SQLite assumes that file deletion is an atomic operation from the
 | 
						|
point of view of a user process.  If power fails in the middle of
 | 
						|
a file deletion, then after power is restored SQLite expects to see
 | 
						|
either the entire file with all of its original data intact, or it
 | 
						|
expects not to find the file at all.  Transactions may not be atomic
 | 
						|
on systems that do not work this way.</p>
 | 
						|
 | 
						|
<a name="filegarbage"></a>
 | 
						|
 | 
						|
<h2 id="_garbage_written_into_files"><span>9.4. </span> Garbage Written Into Files</h2>
 | 
						|
 | 
						|
<p>SQLite database files are ordinary disk files that can be
 | 
						|
opened and written by ordinary user processes.  A rogue process
 | 
						|
can open an SQLite database and fill it with corrupt data.  
 | 
						|
Corrupt data might also be introduced into an SQLite database
 | 
						|
by bugs in the operating system or disk controller; especially
 | 
						|
bugs triggered by a power failure.  There is nothing SQLite can
 | 
						|
do to defend against these kinds of problems.</p>
 | 
						|
 | 
						|
<a name="mvhotjrnl"></a>
 | 
						|
 | 
						|
<h2 id="_deleting_or_renaming_a_hot_journal"><span>9.5. </span> Deleting Or Renaming A Hot Journal</h2>
 | 
						|
 | 
						|
<p>If a crash or power loss does occur and a hot journal is left on
 | 
						|
the disk, it is essential that the original database file and the hot
 | 
						|
journal remain on disk with their original names until the database
 | 
						|
file is opened by another SQLite process and rolled back.  
 | 
						|
During recovery at <a href="#section_4_2">step 4.2</a> SQLite locates
 | 
						|
the hot journal by looking for a file in the same directory as the
 | 
						|
database being opened and whose name is derived from the name of the
 | 
						|
file being opened.  If either the original database file or the
 | 
						|
hot journal have been moved or renamed, then the hot journal will
 | 
						|
not be seen and the database will not be rolled back.</p>
 | 
						|
 | 
						|
<p>We suspect that a common failure mode for SQLite recovery happens
 | 
						|
like this:  A power failure occurs.  After power is restored, a well-meaning
 | 
						|
user or system administrator begins looking around on the disk for
 | 
						|
damage.  They see their database file named "important.data".  This file
 | 
						|
is perhaps familiar to them.  But after the crash, there is also a
 | 
						|
hot journal named "important.data-journal".  The user then deletes
 | 
						|
the hot journal, thinking that they are helping to cleanup the system.
 | 
						|
We know of no way to prevent this other than user education.</p>
 | 
						|
 | 
						|
<p>If there are multiple (hard or symbolic) links to a database file,
 | 
						|
the journal will be created using the name of the link through which
 | 
						|
the file was opened.  If a crash occurs and the database is opened again
 | 
						|
using a different link, the hot journal will not be located and no
 | 
						|
rollback will occur.</p>
 | 
						|
 | 
						|
<p>Sometimes a power failure will cause a filesystem to be corrupted
 | 
						|
such that recently changed filenames are forgotten and the file is
 | 
						|
moved into a "/lost+found" directory.  When that happens, the hot
 | 
						|
journal will not be found and recovery will not occur.
 | 
						|
SQLite tries to prevent this
 | 
						|
by opening and syncing the directory containing the rollback journal
 | 
						|
at the same time it syncs the journal file itself.  However, the
 | 
						|
movement of files into /lost+found can be caused by unrelated processes
 | 
						|
creating unrelated files in the same directory as the main database file.
 | 
						|
And since this is out from under the control of SQLite, there is nothing
 | 
						|
that SQLite can do to prevent it.  If you are running on a system that
 | 
						|
is vulnerable to this kind of filesystem namespace corruption (most
 | 
						|
modern journalling filesystems are immune, we believe) then you might
 | 
						|
want to consider putting each SQLite database file in its own private
 | 
						|
subdirectory.</p>
 | 
						|
 | 
						|
<a name="future"></a>
 | 
						|
 | 
						|
<h1 id="10_0_future_directions_and_conclusion"><span>10. </span>10.0 Future Directions And Conclusion</h1>
 | 
						|
 | 
						|
<p>Every now and then someone discovers a new failure mode for
 | 
						|
the atomic commit mechanism in SQLite and the developers have to
 | 
						|
put in a patch.  This is happening less and less and the
 | 
						|
failure modes are becoming more and more obscure.  But it would
 | 
						|
still be foolish to suppose that the atomic commit logic of
 | 
						|
SQLite is entirely bug-free.  The developers are committed to fixing
 | 
						|
these bugs as quickly as they might be found.</p>
 | 
						|
 | 
						|
<p>
 | 
						|
The developers are also on the lookout for new ways to
 | 
						|
optimize the commit mechanism.  The current VFS implementations
 | 
						|
for Unix (Linux and Mac OS X) and Windows make pessimistic assumptions about
 | 
						|
the behavior of those systems.  After consultation with experts
 | 
						|
on how these systems work, we might be able to relax some of the
 | 
						|
assumptions on these systems and allow them to run faster.  In
 | 
						|
particular, we suspect that most modern filesystems exhibit the
 | 
						|
safe append property and that many of them might support atomic
 | 
						|
sector writes.  But until this is known for certain, SQLite will
 | 
						|
take the conservative approach and assume the worst.</p>
 | 
						|
 |