1981 lines
109 KiB
HTML
1981 lines
109 KiB
HTML
<!DOCTYPE html>
|
|
<html><head>
|
|
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
|
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
|
|
<link href="sqlite.css" rel="stylesheet">
|
|
<title>No Title</title>
|
|
<!-- path= -->
|
|
</head>
|
|
<body>
|
|
<div class=nosearch>
|
|
<a href="index.html">
|
|
<img class="logo" src="images/sqlite370_banner.gif" alt="SQLite" border="0">
|
|
</a>
|
|
<div><!-- IE hack to prevent disappearing logo --></div>
|
|
<div class="tagline desktoponly">
|
|
Small. Fast. Reliable.<br>Choose any three.
|
|
</div>
|
|
<div class="menu mainmenu">
|
|
<ul>
|
|
<li><a href="index.html">Home</a>
|
|
<li class='mobileonly'><a href="javascript:void(0)" onclick='toggle_div("submenu")'>Menu</a>
|
|
<li class='wideonly'><a href='about.html'>About</a>
|
|
<li class='desktoponly'><a href="docs.html">Documentation</a>
|
|
<li class='desktoponly'><a href="download.html">Download</a>
|
|
<li class='wideonly'><a href='copyright.html'>License</a>
|
|
<li class='desktoponly'><a href="support.html">Support</a>
|
|
<li class='desktoponly'><a href="prosupport.html">Purchase</a>
|
|
<li class='search' id='search_menubutton'>
|
|
<a href="javascript:void(0)" onclick='toggle_search()'>Search</a>
|
|
</ul>
|
|
</div>
|
|
<div class="menu submenu" id="submenu">
|
|
<ul>
|
|
<li><a href='about.html'>About</a>
|
|
<li><a href='docs.html'>Documentation</a>
|
|
<li><a href='download.html'>Download</a>
|
|
<li><a href='support.html'>Support</a>
|
|
<li><a href='prosupport.html'>Purchase</a>
|
|
</ul>
|
|
</div>
|
|
<div class="searchmenu" id="searchmenu">
|
|
<form method="GET" action="search">
|
|
<select name="s" id="searchtype">
|
|
<option value="d">Search Documentation</option>
|
|
<option value="c">Search Changelog</option>
|
|
</select>
|
|
<input type="text" name="q" id="searchbox" value="">
|
|
<input type="submit" value="Go">
|
|
</form>
|
|
</div>
|
|
</div>
|
|
<script>
|
|
function toggle_div(nm) {
|
|
var w = document.getElementById(nm);
|
|
if( w.style.display=="block" ){
|
|
w.style.display = "none";
|
|
}else{
|
|
w.style.display = "block";
|
|
}
|
|
}
|
|
function toggle_search() {
|
|
var w = document.getElementById("searchmenu");
|
|
if( w.style.display=="block" ){
|
|
w.style.display = "none";
|
|
} else {
|
|
w.style.display = "block";
|
|
setTimeout(function(){
|
|
document.getElementById("searchbox").focus()
|
|
}, 30);
|
|
}
|
|
}
|
|
function div_off(nm){document.getElementById(nm).style.display="none";}
|
|
window.onbeforeunload = function(e){div_off("submenu");}
|
|
/* Disable the Search feature if we are not operating from CGI, since */
|
|
/* Search is accomplished using CGI and will not work without it. */
|
|
if( !location.origin || !location.origin.match || !location.origin.match(/http/) ){
|
|
document.getElementById("search_menubutton").style.display = "none";
|
|
}
|
|
/* Used by the Hide/Show button beside syntax diagrams, to toggle the */
|
|
function hideorshow(btn,obj){
|
|
var x = document.getElementById(obj);
|
|
var b = document.getElementById(btn);
|
|
if( x.style.display!='none' ){
|
|
x.style.display = 'none';
|
|
b.innerHTML='show';
|
|
}else{
|
|
x.style.display = '';
|
|
b.innerHTML='hide';
|
|
}
|
|
return false;
|
|
}
|
|
</script>
|
|
</div>
|
|
|
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
|
|
<html>
|
|
<head>
|
|
<link type="text/css" rel="stylesheet" href="images/fileformat/rtdocs.css">
|
|
<script type="text/javascript" src=images/fileformat/rtdocs.js></script>
|
|
</head>
|
|
<div id=document_title>SQLite File IO Specification</div>
|
|
<div id=toc_header>Table Of Contents</div>
|
|
<div id=toc>
|
|
<b>Javascript is required for some features of this document, including
|
|
table of contents, figure numbering and internal references (section
|
|
numbers and hyper-links.
|
|
</b>
|
|
</div>
|
|
<!-- End of standard rt docs header -->
|
|
<h1 id=overview>Overview</h1>
|
|
<p>
|
|
SQLite stores an entire database within a single file, the format of
|
|
which is described in the <i>SQLite File Database File Format</i>
|
|
document <cite>ff_sqlitert_requirements</cite>. Each database file is
|
|
stored within a file system, presumably provided by the host operating
|
|
system. Instead of interfacing with the operating system directly,
|
|
the host application is required to supply an adaptor component that
|
|
implements the <i>SQLite Virtual File System</i> interface
|
|
(described in <cite>capi_sqlitert_requirements</cite>). The adaptor
|
|
component is responsible for translating the calls made by SQLite to
|
|
the <i>VFS</i> interface into calls to the file-system interface
|
|
provided by the operating system. This arrangement is depicted in figure
|
|
<cite>figure_vfs_role</cite>.
|
|
<center><img src="images/fileformat/vfs_role.gif">
|
|
<p><i>Figure <span class=fig id=figure_vfs_role></span> - Virtual File System (VFS) Adaptor</i>
|
|
</center>
|
|
<p>
|
|
Although it would be easy to design a system that uses the <i>VFS</i>
|
|
interface to read and update the content of a database file stored
|
|
within a file-system, there are several complicated issues that need
|
|
to be addressed by such a system:
|
|
<ol>
|
|
<li><p>SQLite is required to <b>implement atomic and durable
|
|
transactions</b> (the 'A' and 'D' from the ACID acronym), even if an
|
|
application, operating system or power failure occurs midway through or
|
|
shortly after updating a database file.
|
|
<p>To implement atomic transactions in the face of potential
|
|
application, operating system or power failures, database writers write
|
|
a copy of those portions of the database file that they are going to
|
|
modify into a second file, the <i>journal file</i>, before writing
|
|
to the database file. If a failure does occur while modifying the
|
|
database file, SQLite can reconstruct the original database
|
|
(before the modifications were attempted) based on the contents of
|
|
the <i>journal file</i>.
|
|
<li><p>SQLite is required to <b>implement isolated transactions</b> (the 'I'
|
|
from the ACID acronym).
|
|
<p>This is done by using the file locking facilities provided by the
|
|
VFS adaptor to serialize writers (write transactions) and preventing
|
|
readers (read transactions) from accessing database files while writers
|
|
are midway through updating them.
|
|
<li><p>For performance reasons, it is advantageous to <b>minimize the
|
|
quantity of data read and written</b> to and from the file-system.
|
|
<p>As one might expect, the amount of data read from the database
|
|
file is minimized by caching portions of the database file in main
|
|
memory. Additionally, multiple updates to the database file that
|
|
are part of the same <i>write transaction</i> may be cached in
|
|
main memory and written to the file together, allowing for
|
|
more efficient IO patterns and eliminating the redundant write
|
|
operations that could take place if part of the database file is
|
|
modified more than once within a single <i>write transaction</i>.
|
|
</ol>
|
|
<p class=todo>
|
|
System requirement references for the above points.
|
|
<p>
|
|
This document describes in detail the way that SQLite uses the API
|
|
provided by the VFS adaptor component to solve the problems and implement
|
|
the strategies enumerated above. It also specifies the assumptions made
|
|
about the properties of the system that the VFS adaptor provides
|
|
access to. For example, specific assumptions about the extent of
|
|
data corruption that may occur if a power failure occurs while a
|
|
database file is being updated are presented in section
|
|
<cite>fs_characteristics</cite>.
|
|
<p>
|
|
This document does not specify the details of the interface that must
|
|
be implemented by the VFS adaptor component, that is left to
|
|
<cite>capi_sqlitert_requirements</cite>.
|
|
<h2>Relationship to Other Documents</h2>
|
|
<p>
|
|
Related to C-API requirements:
|
|
<ol>
|
|
<li>Opening a connection.
|
|
<li>Closing a connection.
|
|
</ol>
|
|
<p>
|
|
Related to SQL requirements:
|
|
<ol>
|
|
<li value=3>Opening a read-only transaction.
|
|
<li>Terminating a read-only transaction.
|
|
<li>Opening a read-write transaction.
|
|
<li>Committing a read-write transaction.
|
|
<li>Rolling back a read-write transaction.
|
|
<li>Opening a statement transaction.
|
|
<li>Committing a statement transaction.
|
|
<li>Rolling back a statement transaction.
|
|
<li>Committing a multi-file transaction.
|
|
</ol>
|
|
<p>
|
|
Related to file-format requirements:
|
|
<ol>
|
|
<li value=12>Pinning (reading) a database page.
|
|
<li>Unpinning a database page.
|
|
<li>Modifying the contents of a database page.
|
|
<li>Appending a new page to the database file.
|
|
<li>Truncating a page from the end of the database file.
|
|
</ol>
|
|
<h2>Document Structure</h2>
|
|
<p>
|
|
Section <cite>vfs_assumptions</cite> of this document describes the
|
|
various assumptions made about the system to which the VFS adaptor
|
|
component provides access. The basic capabilities and functions
|
|
required from the VFS implementation are presented along with the
|
|
description of the VFS interface in
|
|
<cite>capi_sqlitert_requirements</cite>. Section
|
|
<cite>vfs_assumptions</cite> complements this by describing in more
|
|
detail the assumptions made about VFS implementations on which the
|
|
algorithms presented in this document depend. Some of these assumptions
|
|
relate to performance issues, but most concern the expected state of
|
|
the file-system following a failure that occurs midway through
|
|
modifying a database file.
|
|
<p>
|
|
Section <cite>database_connections</cite> introduces the concept of
|
|
a <i>database connection</i>, a combination of a file-handle and
|
|
in-memory cache used to access a database file. It also describes the
|
|
VFS operations required when a new <i>database connection</i> is
|
|
created (opened), and when one is destroyed (closed).
|
|
<p>
|
|
Section <cite>reading_data</cite> describes the steps required to
|
|
open a <i>read transaction</i> and read data from a database file.
|
|
<p>
|
|
Section <cite>writing_data</cite> describes the steps required to
|
|
open a <i>write transaction </i> and write data to a database file.
|
|
<p>
|
|
Section <cite>rollback</cite> describes the way in which aborted
|
|
<i>write transactions</i> may be rolled back (reverted), either as
|
|
a result of an explicit user directive or because an application,
|
|
operating system or power failure occurred while SQLite was midway
|
|
through updating a database file.
|
|
<p>
|
|
Section <cite>page_cache_algorithms</cite> describes some of the
|
|
algorithms used to determine exactly which portions of the database
|
|
file are cached by a <i>page cache</i>, and the effect that they
|
|
have on the quantity and nature of the required VFS operations.
|
|
It may at first seem odd to include the <i>page cache</i>, which is
|
|
primarily an implementation detail, in this document. However, it is
|
|
necessary to acknowledge and describe the <i>page cache</i> in order to
|
|
provide a more complete explanation of the nature and quantity of IO
|
|
performed by SQLite.
|
|
<h2>Glossary</h2>
|
|
<p class=todo>
|
|
After this document is ready, make the vocabulary consistent and
|
|
then add a glossary here.
|
|
<h1 id=vfs_assumptions>VFS Adaptor Related Assumptions</h1>
|
|
<p>
|
|
This section documents those assumptions made about the system that
|
|
the VFS adaptor provides access to. The assumptions noted in section
|
|
<cite>fs_characteristics</cite> are particularly important. If these
|
|
assumptions are not true, then a power or operating system failure
|
|
may cause SQLite databases to become corrupted.
|
|
<h2 id=fs_performance>Performance Related Assumptions</h2>
|
|
<p>
|
|
SQLite uses the assumptions in this section to try to speed up
|
|
reading from and writing to the database file.
|
|
<p class=req id=A21010>
|
|
It is assumed that writing a series of sequential blocks of data to
|
|
a file in order is faster than writing the same blocks in an arbitrary
|
|
order.
|
|
</p> <h2 id=fs_characteristics>System Failure Related Assumptions</h2>
|
|
<p>
|
|
In the event of an operating system or power failure, the various
|
|
combinations of file-system software and storage hardware available
|
|
provide varying levels of guarantee as to the integrity of the data
|
|
written to the file system just before or during the failure. The exact
|
|
combination of IO operations that SQLite is required to perform in
|
|
order to safely modify a database file depend on the exact
|
|
characteristics of the target platform.
|
|
<p>
|
|
This section describes the assumptions that SQLite makes about the
|
|
content of a file-system following a power or system failure. In
|
|
other words, it describes the extent of file and file-system corruption
|
|
that such an event may cause.
|
|
<p>
|
|
SQLite queries an implementation for file-system characteristics
|
|
using the xDeviceCharacteristics() and xSectorSize() methods of the
|
|
database file file-handle. These two methods are only ever called
|
|
on file-handles open on database files. They are not called for
|
|
<i>journal files</i>, <i>master-journal files</i> or
|
|
<i>temporary database files</i>.
|
|
<p>
|
|
The file-system <i>sector size</i> value determined by calling the
|
|
xSectorSize() method is a power of 2 value between 512 and 32768,
|
|
inclusive <span class=todo>reference to exactly how this is
|
|
determined</span>. SQLite assumes that the underlying storage
|
|
device stores data in blocks of <i>sector-size</i> bytes each,
|
|
sectors. It is also assumed that each aligned block of
|
|
<i>sector-size</i> bytes of each file is stored in a single device
|
|
sector. If the file is not an exact multiple of <i>sector-size</i>
|
|
bytes in size, then the final device sector is partially empty.
|
|
<p>
|
|
Normally, SQLite assumes that if a power failure occurs while
|
|
updating any portion of a sector then the contents of the entire
|
|
device sector is suspect following recovery. After writing to
|
|
any part of a sector within a file, it is assumed that the modified
|
|
sector contents are held in a volatile buffer somewhere within
|
|
the system (main memory, disk cache etc.). SQLite does not assume
|
|
that the updated data has reached the persistent storage media, until
|
|
after it has successfully <i>synced</i> the corresponding file by
|
|
invoking the VFS xSync() method. <i>Syncing</i> a file causes all
|
|
modifications to the file up until that point to be committed to
|
|
persistent storage.
|
|
<p>
|
|
Based on the above, SQLite is designed around a model of the
|
|
file-system whereby any sector of a file written to is considered to be
|
|
in a transient state until after the file has been successfully
|
|
<i>synced</i>. Should a power or system failure occur while a sector
|
|
is in a transient state, it is impossible to predict its contents
|
|
following recovery. It may be written correctly, not written at all,
|
|
overwritten with random data, or any combination thereof.
|
|
<p>
|
|
For example, if the <i>sector-size</i> of a given file-system is
|
|
2048 bytes, and SQLite opens a file and writes a 1024 byte block
|
|
of data to offset 3072 of the file, then according to the model
|
|
the second sector of the file is in the transient state. If a
|
|
power failure or operating system crash occurs before or during
|
|
the next call to xSync() on the file handle, then following system
|
|
recovery SQLite assumes that all file data between byte offsets 2048
|
|
and 4095, inclusive, is invalid. It also assumes that since the first
|
|
sector of the file, containing the data from byte offset 0 to 2047
|
|
inclusive, is valid, since it was not in a transient state when the
|
|
crash occurred.
|
|
<p>
|
|
Assuming that any and all sectors in the transient state may be
|
|
corrupted following a power or system failure is a very pessimistic
|
|
approach. Some modern systems provide more sophisticated guarantees
|
|
than this. SQLite allows the VFS implementation to specify at runtime
|
|
that the current platform supports zero or more of the following
|
|
properties:
|
|
<ul>
|
|
<li><p>The <b>safe-append</b> property. If a system supports the
|
|
<i>safe-append</i> property, it means that when a file is extended
|
|
the new data is written to the persistent media before the size
|
|
of the file itself is updated. This guarantees that if a failure
|
|
occurs after a file has been extended, following recovery
|
|
the write operations that extended the file will appear to have
|
|
succeeded or not occurred at all. It is not possible for invalid
|
|
or garbage data to appear in the extended region of the file.
|
|
<li><p>The <b>atomic-write</b> property. A system that supports this
|
|
property also specifies the size or sizes of the blocks that it
|
|
is capable of writing. Valid sizes are powers of two greater than
|
|
512. If a write operation modifies a block of <i>n</i> bytes,
|
|
where <i>n</i> is one of the block sizes for which <i>atomic-write</i>
|
|
is supported, then it is impossible for an aligned write of <i>n</i>
|
|
bytes to cause data corruption. If a failure occurs after such
|
|
a write operation and before the applicable file handle is
|
|
<i>synced</i>, then following recovery it will appear as if the
|
|
write operation succeeded or did not take place at all. It is not
|
|
possible that only part of the data specified by the write operation
|
|
was written to persistent media, nor is it possible for any content
|
|
of the sectors spanned by the write operation to be replaced with
|
|
garbage data, as it is normally assumed to be.
|
|
<li><p>The <b>sequential-write</b> property. A system that supports the
|
|
<i>sequential-write</i> property guarantees that the various write
|
|
operations on files within the same file-system are written to the
|
|
persistent media in the same order that they are performed by the
|
|
application and that each operation is concluded before the next
|
|
is begun. If a system supports the <i>sequential-write</i>
|
|
property, then the model used to determine the possible states of
|
|
the file-system following a failure is different.
|
|
<p>If a system supports <i>sequential-write</i> it is assumed that
|
|
<i>syncing</i> any file within the file system flushes all write
|
|
operations on all files (not just the <i>synced</i> file) to
|
|
the persistent media. If a failure does occur, it is not known
|
|
whether or not any of the write operations performed by SQLite
|
|
since the last time a file was <i>synced</i>. SQLite is able to
|
|
assume that if the write operations of unknown status are arranged
|
|
in the order that they occurred:
|
|
<ol>
|
|
<li> the first <i>n</i> operations will have been executed
|
|
successfully,
|
|
<li> the next operation puts all device sectors that it modifies
|
|
into the transient state, so that following recovery each
|
|
sector may be partially written, completely written, not
|
|
written at all or populated with garbage data,
|
|
<li> the remaining operations will not have had any effect on
|
|
the contents of the file-system.
|
|
</ol>
|
|
</ul>
|
|
<h3 id=fs_assumption_details>Failure Related Assumption Details</h3>
|
|
<p>
|
|
This section describes how the assumptions presented in the parent
|
|
section apply to the individual API functions and operations provided
|
|
by the VFS to SQLite for the purposes of modifying the contents of the
|
|
file-system.
|
|
<p>
|
|
SQLite manipulates the contents of the file-system using a combination
|
|
of the following four types of operation:
|
|
<ul>
|
|
<li> <b>Create file</b> operations. SQLite may create new files
|
|
within the file-system by invoking the xOpen() method of
|
|
the sqlite3_io_methods object.
|
|
<li> <b>Delete file</b> operations. SQLite may remove files from the
|
|
file system by calling the xDelete() method of the
|
|
sqlite3_io_methods object.
|
|
<li> <b>Truncate file</b> operations. SQLite may truncate existing
|
|
files by invoking the xTruncate() method of the sqlite3_file
|
|
object.
|
|
<li> <b>Write file</b> operations. SQLite may modify the contents
|
|
and increase the size of a file by files by invoking the xWrite()
|
|
method of the sqlite3_file object.
|
|
</ul>
|
|
<p>
|
|
Additionally, all VFS implementations are required to provide the
|
|
<i>sync file</i> operation, accessed via the xSync() method of the
|
|
sqlite3_file object, used to flush create, write and truncate operations
|
|
on a file to the persistent storage medium.
|
|
<p>
|
|
The formalized assumptions in this section refer to <i>system failure</i>
|
|
events. In this context, this should be interpreted as any failure that
|
|
causes the system to stop operating. For example a power failure or
|
|
operating system crash.
|
|
<p>
|
|
SQLite does not assume that a <b>create file</b> operation has actually
|
|
modified the file-system records within persistent storage until
|
|
after the file has been successfully <i>synced</i>.
|
|
<p class=req id=A21001>
|
|
If a system failure occurs during or after a "create file"
|
|
operation, but before the created file has been <i>synced</i>, then
|
|
SQLite assumes that it is possible that the created file may not
|
|
exist following system recovery.
|
|
</p> <p>
|
|
Of course, it is also possible that it does exist following system
|
|
recovery.
|
|
<p class=req id=A21002>
|
|
If a "create file" operation is executed by SQLite, and then the
|
|
created file <i>synced</i>, then SQLite assumes that the file-system
|
|
modifications corresponding to the "create file" operation have been
|
|
committed to persistent media. It is assumed that if a system
|
|
failure occurs any time after the file has been successfully
|
|
<i>synced</i>, then the file is guaranteed to appear in the file-system
|
|
following system recovery.
|
|
</p> <p>
|
|
A <b>delete file</b> operation (invoked by a call to the VFS xDelete()
|
|
method) is assumed to be an atomic and durable operation.
|
|
</p>
|
|
<p class=req id=A21003>
|
|
If a system failure occurs at any time after a "delete file"
|
|
operation (call to the VFS xDelete() method) returns successfully, it is
|
|
assumed that the file-system will not contain the deleted file following
|
|
system recovery.
|
|
</p><p class=req id=A21004>
|
|
If a system failure occurs during a "delete file" operation,
|
|
it is assumed that following system recovery the file-system will
|
|
either contain the file being deleted in the state it was in before
|
|
the operation was attempted, or not contain the file at all. It is
|
|
assumed that it is not possible for the file to have become corrupted
|
|
purely as a result of a failure occurring during a "delete file"
|
|
operation.
|
|
</p> <p>
|
|
The effects of a <b>truncate file</b> operation are not assumed to
|
|
be made persistent until after the corresponding file has been
|
|
<i>synced</i>.
|
|
<p class=req id=A21005>
|
|
If a system failure occurs during or after a "truncate file"
|
|
operation, but before the truncated file has been <i>synced</i>, then
|
|
SQLite assumes that the size of the truncated file is either as large
|
|
or larger than the size that it was to be truncated to.
|
|
</p><p class=req id=A21006>
|
|
If a system failure occurs during or after a "truncate file"
|
|
operation, but before the truncated file has been <i>synced</i>, then
|
|
it is assumed that the contents of the file up to the size that the
|
|
file was to be truncated to are not corrupted.
|
|
</p> <p>
|
|
The above two assumptions may be interpreted to mean that if a
|
|
system failure occurs after file truncation but before the truncated
|
|
file is <i>synced</i>, the contents of the file following the point
|
|
at which it was to be truncated may not be trusted. They may contain
|
|
the original file data, or may contain garbage.
|
|
<p class=req id=A21007>
|
|
If a "truncate file" operation is executed by SQLite, and then the
|
|
truncated file <i>synced</i>, then SQLite assumes that the file-system
|
|
modifications corresponding to the "truncate file" operation have been
|
|
committed to persistent media. It is assumed that if a system
|
|
failure occurs any time after the file has been successfully
|
|
<i>synced</i>, then the effects of the file truncation are guaranteed
|
|
to appear in the file system following recovery.
|
|
</p> <p>
|
|
A <b>write file</b> operation modifies the contents of an existing file
|
|
within the file-system. It may also increase the size of the file.
|
|
The effects of a <i>write file</i> operation are not assumed to
|
|
be made persistent until after the corresponding file has been
|
|
<i>synced</i>.
|
|
<p class=req id=A21008>
|
|
If a system failure occurs during or after a "write file"
|
|
operation, but before the corresponding file has been <i>synced</i>,
|
|
then it is assumed that the content of all sectors spanned by the
|
|
<i>write file</i> operation are untrustworthy following system
|
|
recovery. This includes regions of the sectors that were not
|
|
actually modified by the write file operation.
|
|
</p><p class=req id=A21011>
|
|
If a system failure occurs on a system that supports the
|
|
<i>atomic-write</i> property for blocks of size <i>N</i> bytes
|
|
following an aligned write of <i>N</i>
|
|
bytes to a file but before the file has been successfully <i>synced</i>,
|
|
then is assumed following recovery that all sectors spanned by the
|
|
write operation were correctly updated, or that none of the sectors were
|
|
modified at all.
|
|
</p><p class=req id=A21012>
|
|
If a system failure occurs on a system that supports the
|
|
<i>safe-append</i> following a write operation that appends data
|
|
to the end of the file without modifying any of the existing file
|
|
content but before the file has been successfully <i>synced</i>,
|
|
then is assumed following recovery that either the data was
|
|
correctly appended to the file, or that the file size remains
|
|
unchanged. It is assumed that it is impossible that the file be
|
|
extended but populated with incorrect data.
|
|
</p><p class=req id=A21013>
|
|
Following a system recovery, if a device sector is deemed to be
|
|
untrustworthy as defined by A21008 and neither A21011 or A21012
|
|
apply to the range of bytes written, then no assumption can be
|
|
made about the content of the sector following recovery. It is
|
|
assumed that it is possible for such a sector to be written
|
|
correctly, not written at all, populated with garbage data or any
|
|
combination thereof.
|
|
</p><p class=req id=A21009>
|
|
If a system failure occurs during or after a "write file"
|
|
operation that causes the file to grow, but before the corresponding
|
|
file has been <i>synced</i>, then it is assumed that the size of
|
|
the file following recovery is as large or larger than it was when
|
|
it was most recently <i>synced</i>.
|
|
</p> <p>
|
|
If a system supports the <i>sequential-write</i> property, then further
|
|
assumptions may be made with respect to the state of the file-system
|
|
following recovery from a <i>system failure</i>. Specifically, it is
|
|
assumed that create, truncate, delete and write file operations are
|
|
applied to the persistent representation in the same order as they
|
|
are performed by SQLite. Furthermore, it is assumed that the
|
|
file-system waits until one operation is safely written to the
|
|
persistent media before the next is attempted, just as if the relevant
|
|
file were <i>synced</i> following each operation.
|
|
<p class=req id=A21014>
|
|
If a system failure occurs on a system that supports the
|
|
<i>sequential-write</i> property, then it is assumed that all
|
|
operations completed before the last time any file was <i>synced</i>
|
|
have been successfully committed to persistent media.
|
|
</p><p class=req id=A21015>
|
|
If a system failure occurs on a system that supports the
|
|
<i>sequential-write</i> property, then it is assumed that the set
|
|
of possible states that the file-system may be in following recovery
|
|
is the same as if each of the write operations performed since the most
|
|
recent time a file was <i>synced</i> was itself followed by a <i>sync
|
|
file</i> operation, and that the system failure may have occurred during
|
|
any of the write or <i>sync file</i> operations.
|
|
</p><!--
|
|
<p>
|
|
The return value of the xSectorSize() method, the <i>sector-size</i>, is
|
|
expected by SQLite to be a power of 2 value greater than or equal to 512.
|
|
<p class=todo>
|
|
What does it do if this is not the case? If the sector size is less
|
|
than 512 then 512 is used instead. How about a non power-of-two value?
|
|
UPDATE: How this situation is handled should be described in the API
|
|
requirements. Here we can just refer to the other document.
|
|
<p>
|
|
SQLite assumes that files are stored and written to within the
|
|
file-system as a collection of blocks (hereafter sectors) of data, each
|
|
<i>sector-size</i> bytes in size. This model is used to derive
|
|
the following assumptions related to the expected state of the
|
|
file-system following a power failure or operating system crash.
|
|
<ul>
|
|
<li>
|
|
After part or all of a file sector has been modified
|
|
using the xWrite() method of an open file-handle, the sector
|
|
is said to be in a transient state, where the operating system
|
|
makes no guarantees about the actual content of the sector on the
|
|
persistent media. The sector remains in the transient state until
|
|
the next successful call to xSync() on the same file-handle
|
|
returns. If a power failure or operating system crash occurs, then
|
|
part or all of all sectors in the transient state when the crash
|
|
occurred may contain invalid data following system recovery.
|
|
<li>
|
|
Following a power failure or operating system crash, the content
|
|
of all sectors that were not in a transient state when the crash
|
|
occurred may be trusted.
|
|
</ul>
|
|
<p class=todo>
|
|
What do we assume about the other three file-system write
|
|
operations - xTruncate(), xDelete() and "create file"?
|
|
<p>
|
|
The xDeviceCharacteristics() method returns a set of flags,
|
|
indicating which of the following properties (if any) the
|
|
file-system provides:
|
|
<ul>
|
|
<li>The <b><i>sequential IO</i></b> property. If a file-system has this
|
|
property, then in the event of a crash at most a single sector
|
|
may contain invalid data. The file-system guarantees
|
|
<li>The <b><i>safe-append</i></b> property.
|
|
<li>The <b><i>atomic write</i></b> property.
|
|
</ul>
|
|
<p class=todo>
|
|
Write an explanation as to how the file-system properties influence
|
|
the model used to predict file damage after a catastrophe.
|
|
-->
|
|
<h1 id=database_connections>Database Connections</h1>
|
|
<p>
|
|
Within this document, the term <i>database connection</i> has a slightly
|
|
different meaning from that which one might assume. The handles returned
|
|
by the <code>sqlite3_open()</code> and <code>sqlite3_open16()</code>
|
|
APIs (<span class=todo>reference</span>) are referred to as <i>database
|
|
handles</i>. A <i>database connection</i> is a connection to a single
|
|
database file using a single file-handle, which is held open for the
|
|
lifetime of the connection. Using the SQL ATTACH syntax, multiple
|
|
<i>database connections</i> may be accessed via a single <i>database
|
|
handle</i>. Or, using SQLite's <i>shared-cache mode</i> feature, multiple
|
|
<i>database handles</i> may access a single <i>database connection</i>.
|
|
<p>
|
|
Usually, a new <i>database connection</i> is opened whenever the user opens
|
|
new <i>database handle</i> on a real database file (not an in-memory
|
|
database) or when a database file is attached to an existing <i>database
|
|
connection</i> using the SQL ATTACH syntax. However if the <i>shared-cache
|
|
mode</i> feature is enabled, then the database file may be accessed through
|
|
an existing <i>database connection</i>. For more information on
|
|
<i>shared-cache mode</i>, refer to <span class=todo>Reference</span>. The
|
|
various IO operations required to open a new connection are detailed in
|
|
section <cite>open_new_connection</cite> of this document.
|
|
<p>
|
|
Similarly, a <i>database connection</i> is usually closed when the user
|
|
closes a <i>database handle</i> that is open on a real database file or
|
|
has had one or more real database files attached to it using the ATTACH
|
|
mechanism, or when a real database file is detached from a <i>database
|
|
connection</i> using the DETACH syntax. Again, the exception is if
|
|
<i>shared-cache mode</i> is enabled. In this case, a <i>database
|
|
connection</i> is not closed until its number of users reaches zero.
|
|
The IO related steps required to close a <i>database connection</i> are
|
|
described in section <cite>closing_database_connection</cite>.
|
|
<p class=todo>
|
|
After sections 4 and 5 are finished, come back here and see if we can add a
|
|
list of state items associated with each database connection to make things
|
|
easier to understand. i.e each database connection has a file handle, a set
|
|
of entries in the page cache, an expected page size etc.
|
|
<h2 id=open_new_connection>Opening a New Connection</h2>
|
|
<p>
|
|
This section describes the VFS operations that take place when a
|
|
new database connection is created.
|
|
<p>
|
|
Opening a new database connection is a two-step process:
|
|
<ol>
|
|
<li> A file-handle is opened on the database file.
|
|
<li> If step 1 was successful, an attempt is made to read the
|
|
<i>database file header</i> from the database file using the
|
|
new file-handle.
|
|
</ol>
|
|
<p>
|
|
In step 2 of the procedure above, the database file is not locked
|
|
before it is read from. This is the only exception to the locking
|
|
rules described in section <cite>reading_data</cite>.
|
|
<p>
|
|
The reason for attempting to read the <i>database file header</i>
|
|
is to determine the <i>page-size</i> used by the database file.
|
|
Because it is not possible to be certain as to the <i>page-size</i>
|
|
without holding at least a <i>shared lock</i> on the database file
|
|
(because some other <i>database connection</i> might have changed it
|
|
since the <i>database file header</i> was read), the value read from the
|
|
<i>database file header</i> is known as the <i>expected page size</i>.
|
|
<p class=req id=H35060>
|
|
When a new <i>database connection</i> is required, SQLite shall attempt
|
|
to open a file-handle on the database file. If the attempt fails, then
|
|
no new <i>database connection</i> is created and an error returned.
|
|
<p class=req id=H35070>
|
|
When a new <i>database connection</i> is required, after opening the
|
|
new file-handle, SQLite shall attempt to read the first 100 bytes
|
|
of the database file. If the attempt fails for any other reason than
|
|
that the opened file is less than 100 bytes in size, then
|
|
the file-handle is closed, no new <i>database connection</i> is created
|
|
and an error returned instead.
|
|
<p class=req id=H35080>
|
|
If the <i>database file header</i> is successfully read from a newly
|
|
opened database file, the connections <i>expected page-size</i> shall
|
|
be set to the value stored in the <i>page-size field</i> of the
|
|
database header.
|
|
<p class=req id=H35090>
|
|
If the <i>database file header</i> cannot be read from a newly opened
|
|
database file (because the file is less than 100 bytes in size), the
|
|
connections <i>expected page-size</i> shall be set to the compile time
|
|
value of the SQLITE_DEFAULT_PAGESIZE option.
|
|
<h2 id=closing_database_connection>Closing a Connection</h2>
|
|
<p>
|
|
This section describes the VFS operations that take place when an
|
|
existing database connection is closed (destroyed).
|
|
<p>
|
|
Closing a database connection is a simple matter. The open VFS
|
|
file-handle is closed and in-memory <i>page cache</i> related resources
|
|
are released.
|
|
<p class=req id=H35400>
|
|
When a <i>database connection</i> is closed, SQLite shall close the
|
|
associated file handle at the VFS level.
|
|
<p class=req id=H35430>
|
|
When a <i>database connection</i> is closed, all associated <i>page
|
|
cache</i> entries shall be discarded.
|
|
<h1 id=page_cache>The Page Cache</h1>
|
|
<p>
|
|
The contents of an SQLite database file are formatted as a set of
|
|
fixed size pages. See <cite>ff_sqlitert_requirements</cite> for a
|
|
complete description of the format used. The <i>page size</i> used
|
|
for a particular database is stored as part of the database file
|
|
header at a well-known offset within the first 100 bytes of the
|
|
file. Almost all read and write operations performed by SQLite on
|
|
database files are done on blocks of data <i>page-size</i> bytes
|
|
in size.
|
|
<p>
|
|
All SQLite database connections running within a single process share
|
|
a single <i>page cache</i>. The <i>page cache</i> caches data read from
|
|
database files in main-memory on a per-page basis. When SQLite requires
|
|
data from a database file to satisfy a database query, it checks the <i>
|
|
page cache</i> for usable cached versions of the required database
|
|
pages before loading it from the database file. If no usable cache
|
|
entry can be found and the database page data is loaded from the database
|
|
file, it is cached in the <i>page cache</i> in case the same data is
|
|
needed again later. Because reading from the database file is assumed to
|
|
be an order of magnitude slower than reading from main-memory, caching
|
|
database page content in the <i>page cache</i> to minimize the number
|
|
of read operations performed on the database file is a significant
|
|
performance enhancement.
|
|
<p>
|
|
The <i>page cache</i> is also used to buffer database write operations.
|
|
When SQLite is required to modify one of more of the <i>database pages</i>
|
|
that make up a database file, it first modifies the cached version of
|
|
the page in the <i>page cache</i>. At that point the page is considered
|
|
a "dirty" page. At some point later on, the new content of the "dirty"
|
|
page is copied from the <i>page cache</i> into the database file via
|
|
the VFS interface. Buffering writes in the <i>page cache</i> can reduce
|
|
the number of write operations required on the database file (in cases
|
|
where the same page is updated twice) and allows optimizations based
|
|
on the assumptions outlined in section <cite>fs_performance</cite>.
|
|
<p>
|
|
Database read and write operations, and the way in which they interact
|
|
with and use the <i>page cache</i>, are described in detail in sections
|
|
<cite>reading_data</cite> and <cite>writing_data</cite> of this document,
|
|
respectively.
|
|
<p>
|
|
At any one time, the <i>page cache</i> contains zero or more <i>page cache
|
|
entries</i>, each of which has the following data associated with it:
|
|
<ul>
|
|
<li><p>
|
|
A reference to <b>the associated <i>database connection</i></b>. Each
|
|
entry in the <i>page cache</i> is associated with a single <i>database
|
|
connection</i>; the <i>database connection</i> that created the entry.
|
|
A <i>page cache entry</i> is only ever used by the <i>database
|
|
connection</i> that created it. Page cache entries are not shared between
|
|
<i>database connections</i>.
|
|
<li><p>
|
|
The <b><i>page number</i></b> of the cached page. Pages are sequentially
|
|
numbered within a database file starting from page 1 (page 1 begins at
|
|
byte offset 0). Refer to <cite>ff_sqlitert_requirements</cite> for
|
|
details.
|
|
<li><p>
|
|
The <b>cached data</b>; a blob of data <i>page-size</i> bytes in size.
|
|
</ul>
|
|
<p>
|
|
The first two elements in the list above, the associated <i>database
|
|
connection</i> and the <i>page number</i>, uniquely identify the
|
|
<i>page cache entry</i>. At no time may the <i>page cache</i> contain two
|
|
entries for which both the <i>database connection</i> and <i>page
|
|
number</i> are identical. Or, put another way, a single <i>database
|
|
connection</i> never caches more than one copy of a database page
|
|
within the <i>page cache</i>.
|
|
<p>
|
|
At any one time, each <i>page cache entry</i> may be said to be a <i>clean
|
|
page</i>, a <i>non-writable dirty page</i> or a <i>writable dirty page</i>,
|
|
according to the following definitions:
|
|
<ul>
|
|
<li> <p>A <b><i>clean page</i></b> is one for which the cached data
|
|
currently matches the contents of the corresponding page of
|
|
the database file. The page has not been modified since it was
|
|
loaded from the file.
|
|
<li> <p>A <b><i>dirty page</i></b> is a <i>page cache entry</i> for which
|
|
the cached data has been modified since it was loaded from the database
|
|
file, and so no longer matches the current contents of the
|
|
corresponding database file page. A <i>dirty page</i> is one that is
|
|
currently buffering a modification made to the database file as part
|
|
of a <i>write transaction</i>.
|
|
<li> <p>Within this document, the term <b><i>non-writable dirty
|
|
page</i></b> is used specifically to refer to a <i>page cache
|
|
entry</i> with modified content for which it is not yet safe to update
|
|
the database file with. It is not safe to update a database file with
|
|
a buffered write if a power or system failure that occurs during or
|
|
soon after the update may cause the database to become corrupt
|
|
following system recovery, according to the assumptions made in
|
|
section <cite>fs_assumption_details</cite>.
|
|
<li> <p>A <i>dirty page</i> for which it would be safe to update the
|
|
corresponding database file page with the modified contents of
|
|
without risking database corruption is known as a
|
|
<b><i>writable dirty page</i></b>.
|
|
</ul>
|
|
<p>
|
|
The exact logic used to determine if a <i>page cache entry</i> with
|
|
modified content is a <i>dirty page</i> or <i>writable page</i> is
|
|
presented in section <cite>page_cache_algorithms</cite>.
|
|
<p>
|
|
Because main-memory is a limited resource, the <i>page cache</i> cannot
|
|
be allowed to grow indefinitely. As a result, unless all database files
|
|
opened by database connections within the process are quite small,
|
|
sometimes data must be discarded from the <i>page cache</i>. In practice
|
|
this means <i>page cache entries</i> must be purged to make room
|
|
for new ones. If a <i>page cache entry</i> being removed from the <i>page
|
|
cache</i> to free main-memory is a <i>dirty page</i>, then its contents
|
|
must be saved into the database file before it can be discarded without
|
|
data loss. The following two sub-sections describe the algorithms used by
|
|
the <i>page cache</i> to determine exactly when existing <i>page cache
|
|
entries</i> are purged (discarded).
|
|
<h2>Page Cache Configuration</h2>
|
|
<p class=todo>
|
|
Describe the parameters set to configure the page cache limits.
|
|
<h2 id=page_cache_algorithms>Page Cache Algorithms</h2>
|
|
<p class=todo>
|
|
Requirements describing the way in which the configuration parameters
|
|
are used. About LRU etc.
|
|
<h1 id=reading_data>Reading Data</h1>
|
|
<p>
|
|
In order to return data from the database to the user, for example as
|
|
the results of a SELECT query, SQLite must at some point read data
|
|
from the database file. Usually, data is read from the database file in
|
|
aligned blocks of <i>page-size</i> bytes. The exception is when the
|
|
database file header fields are being inspected, before the
|
|
<i>page-size</i> used by the database can be known.
|
|
<p>
|
|
With two exceptions, a <i>database connection</i> must have an open
|
|
transaction (either a <i>read-only transaction</i> or a
|
|
<i>read/write transaction</i>) on the database before data may be
|
|
read from the database file.
|
|
<p>
|
|
The two exceptions are:
|
|
<ul>
|
|
<li> When an attempt is made to read the 100 byte <i>database file
|
|
header</i> immediately after opening the <i>database connection</i>
|
|
(see section <cite>open_new_connection</cite>). When this occurs
|
|
no lock is held on the database file.
|
|
<li> Data read while in the process of opening a read-only transaction
|
|
(see section <cite>open_read_only_trans</cite>). These read
|
|
operations occur after a <i>shared lock</i> is held on the database
|
|
file.
|
|
</ul>
|
|
<p>
|
|
Once a transaction has been opened, reading data from a database
|
|
connection is a simple operation. Using the xRead() method of the
|
|
file-handle open on the database file, the required database file
|
|
pages are read one at a time. SQLite never reads partial pages and
|
|
always uses a single call to xRead() for each required page.
|
|
<p>
|
|
After reading the data for a database page, SQLite stores the raw
|
|
page of data in the <i>page cache</i>. Each time a page of data is
|
|
required by the upper layers, the <i>page cache</i> is queried
|
|
to see if it contains a copy of the required page stored by
|
|
the current <i>database connection</i>. If such an entry can be
|
|
found, then the required data is read from the <i>page cache</i> instead
|
|
of the database file. Only a connection with an open transaction
|
|
transaction (either a <i>read-only transaction</i> or a
|
|
<i>read/write transaction</i>) on the database may read data from the
|
|
<i>page cache</i>. In this sense reading from the <i>page cache</i> is no
|
|
different to reading from the <i>database file</i>.
|
|
<p>
|
|
Refer to section <cite>page_cache_algorithms</cite> for a description
|
|
of exactly how and for how long page data is stored in the
|
|
<i>page cache</i>.
|
|
<p class=req id=H35010>
|
|
Except for the read operation required by H35070 and those reads made
|
|
as part of opening a read-only transaction, SQLite shall ensure that
|
|
a <i>database connection</i> has an open read-only or read/write
|
|
transaction when any data is read from the <i>database file</i>.
|
|
<p class=req id=H35020>
|
|
Aside from those read operations described by H35070 and H21XXX, SQLite
|
|
shall read data from the database file in aligned blocks of
|
|
<i>page-size</i> bytes, where <i>page-size</i> is the database page size
|
|
used by the database file.
|
|
<p class=req id=H35420>
|
|
SQLite shall ensure that a <i>database connection</i> has an open
|
|
read-only or read/write transaction before using data stored in the <i>page
|
|
cache</i> to satisfy user queries.
|
|
<h2 id=open_read_only_trans>Opening a Read-Only Transaction</h2>
|
|
<p>
|
|
Before data may be read from a <i>database file</i> or queried from
|
|
the <i>page cache</i>, a <i>read-only transaction</i> must be
|
|
successfully opened by the associated database connection (this is true
|
|
even if the connection will eventually write to the database, as a
|
|
<i>read/write transaction</i> may only be opened by upgrading from a
|
|
<i>read-only transaction</i>). This section describes the procedure
|
|
for opening a <i>read-only transaction</i>.
|
|
<p>
|
|
The key element of a <i>read-only transaction</i> is that the
|
|
file-handle open on the database file obtains and holds a
|
|
<i>shared-lock</i> on the database file. Because a connection requires
|
|
an <i>exclusive-lock</i> before it may actually modify the contents
|
|
of the database file, and by definition while one connection is holding
|
|
a <i>shared-lock</i> no other connection may hold an
|
|
<i>exclusive-lock</i>, holding a <i>shared-lock</i> guarantees that
|
|
no other process may modify the database file while the <i>read-only
|
|
transaction</i> remains open. This ensures that <i>read-only
|
|
transactions</i> are sufficiently isolated from the transactions of
|
|
other database users (see section <cite>overview</cite>).
|
|
<p>Obtaining the <i>shared lock</i> itself on the database file is quite
|
|
simple, SQLite just calls the xLock() method of the database file
|
|
handle. Some of the other processes that take place as part of
|
|
opening the <i>read-only transaction</i> are quite complex. The
|
|
steps that SQLite is required to take to open a <i>read-only
|
|
transaction</i>, in the order in which they must occur, is as follows:
|
|
<ol>
|
|
<li>A <i>shared-lock</i> is obtained on the database file.
|
|
<li>The connection checks if a <i>hot journal file</i> exists in the
|
|
file-system. If one does, then it is rolled back before continuing.
|
|
<li>The connection checks if the data in the <i>page cache</i> may
|
|
still be trusted. If not, all page cache data is discarded.
|
|
<li>If the file-size is not zero bytes and the page cache does not
|
|
contain valid data for the first page of the database, then the
|
|
data for the first page must be read from the database.
|
|
</ol>
|
|
<p>
|
|
Of course, an error may occur while attempting any of the 4 steps
|
|
enumerated above. If this happens, then the <i>shared-lock</i> is
|
|
released (if it was obtained) and an error returned to the user.
|
|
Step 2 of the procedure above is described in more detail in section
|
|
<cite>hot_journal_detection</cite>. Section <cite>cache_validation</cite>
|
|
describes the process identified by step 3 above. Further detail
|
|
on step 4 may be found in section <cite>read_page_one</cite>.
|
|
<p class=req id=H35100>
|
|
When required to open a <i>read-only transaction</i> using a
|
|
<i>database connection</i>, SQLite shall first attempt to obtain
|
|
a <i>shared-lock</i> on the file-handle open on the database file.
|
|
<p class=req id=H35110>
|
|
If, while opening a <i>read-only transaction</i>, SQLite fails to obtain
|
|
the <i>shared-lock</i> on the database file, then the process is
|
|
abandoned, no transaction is opened and an error returned to the user.
|
|
<p>
|
|
The most common reason an attempt to obtain a <i>shared-lock</i> may
|
|
fail is that some other connection is holding an <i>exclusive</i> or
|
|
<i>pending lock</i>. However it may also fail because some other
|
|
error (e.g. an IO or comms related error) occurs within the call to the
|
|
xLock() method.
|
|
<p class=req id=H35030>
|
|
While opening a <i>read-only transaction</i>, after successfully
|
|
obtaining a <i>shared lock</i> on the database file, SQLite shall
|
|
attempt to detect and roll back a <i>hot journal file</i> associated
|
|
with the same database file.
|
|
<p class=req id=H35120>
|
|
If, while opening a <i>read-only transaction</i>, SQLite encounters
|
|
an error while attempting to detect or roll back a <i>hot journal
|
|
file</i>, then the <i>shared-lock</i> on the database file is released,
|
|
no transaction is opened and an error returned to the user.
|
|
<p>
|
|
Section <cite>hot_journal_detection</cite> contains a description of
|
|
and requirements governing the detection of a hot-journal file referred
|
|
to in the above requirements.
|
|
<p class=req id=H35040>
|
|
Assuming no errors have occurred, then after attempting to detect and
|
|
roll back a <i>hot journal file</i>, if the <i>page cache</i> contains
|
|
any entries associated with the current <i>database connection</i>,
|
|
then SQLite shall validate the contents of the <i>page cache</i> by
|
|
testing the <i>file change counter</i>. This procedure is known as
|
|
<i>cache validation</i>.
|
|
<p>
|
|
The <i>cache validation</i> process is described in detail in section
|
|
<cite>cache_validation</cite>
|
|
<p class=req id=H35050>
|
|
If the cache validate procedure prescribed by H35040 is required and
|
|
does not prove that the <i>page cache</i> entries associated with the
|
|
current <i>database connection</i> are valid, then SQLite shall discard
|
|
all entries associated with the current <i>database connection</i> from
|
|
the <i>page cache</i>.
|
|
<p>
|
|
The numbered list above notes that the data for the first page of the
|
|
database file, if it exists and is not already loaded into the <i>page
|
|
cache</i>, must be read from the database file before the <i>read-only
|
|
transaction</i> may be considered opened. This is handled by
|
|
requirement H35240.
|
|
<h3 id=hot_journal_detection>Hot Journal Detection</h3>
|
|
<p>
|
|
This section describes the procedure that SQLite uses to detect a
|
|
<i>hot journal file</i>. If a <i>hot journal file</i> is detected,
|
|
this indicates that at some point the process of writing a
|
|
transaction to the database was interrupted and a recovery operation
|
|
(<i>hot journal rollback</i>) needs to take place. This section does
|
|
not describe the process of <i>hot journal rollback</i> (see section
|
|
<cite>hot_journal_rollback</cite>) or the processes by which a
|
|
<i>hot journal file</i> may be created (see section
|
|
<cite>writing_data</cite>).
|
|
<p>
|
|
The procedure used to detect a <i>hot-journal file</i> is quite
|
|
complex. The following steps take place:
|
|
<ol class=spacedlist>
|
|
<li>Using the VFS xAccess() method, SQLite queries the file-system
|
|
to see if the journal file associated with the database exists.
|
|
If it does not, then there is no hot-journal file.
|
|
<li>By invoking the xCheckReservedLock() method of the file-handle
|
|
opened on the database file, SQLite checks if some other connection
|
|
holds a <i>reserved lock</i> or greater. If some other connection
|
|
does hold a <i>reserved lock</i>, this indicates that the other
|
|
connection is midway through a <i>read/write transaction</i> (see
|
|
section <cite>writing_data</cite>). In this case the
|
|
<i>journal file</i> is not a <i>hot-journal</i> and must not be
|
|
rolled back.
|
|
<li>Using the xFileSize() method of the file-handle opened
|
|
on the database file, SQLite checks if the database file is
|
|
0 bytes in size. If it is, the journal file is not considered
|
|
to be a <i>hot journal</i> file. Instead of rolling back the
|
|
journal file, in this case it is deleted from the file-system
|
|
by calling the VFS xDelete() method. <span class=todo>Technically,
|
|
there is a race condition here. This step should be moved to
|
|
after the exclusive lock is held.</span>
|
|
<li>An attempt is made to upgrade to an <i>exclusive lock</i> on the
|
|
database file. If the attempt fails, then all locks, including
|
|
the recently obtained <i>shared lock</i> are dropped. The attempt
|
|
to open a <i>read-only transaction</i> has failed. This occurs
|
|
when some other connection is also attempting to open a
|
|
<i>read-only transaction</i> and the attempt to gain the
|
|
<i>exclusive lock</i> fails because the other connection is also
|
|
holding a <i>shared lock</i>. It is left to the other connection
|
|
to roll back the <i>hot journal</i>.
|
|
<div style="margin-top:0.5em"></div>
|
|
It is important that the file-handle lock is upgraded
|
|
directly from <i>shared</i> to <i>exclusive</i> in this case,
|
|
instead of first upgrading to <i>reserved</i> or </i>pending</i>
|
|
locks as is required when obtaining an <i>exclusive lock</i> to
|
|
write to the database file (section <cite>writing_data</cite>).
|
|
If SQLite were to first upgrade to a <i>reserved</i> or
|
|
<i>pending</i> lock in this scenario, then a second process also
|
|
trying to open a <i>read-transaction</i> on the database file might
|
|
detect the <i>reserved</i> lock in step 2 of this process,
|
|
conclude that there was no <i>hot journal</i>, and commence
|
|
reading data from the <i>database file</i>.
|
|
<li>The xAccess() method is invoked again to detect if the journal
|
|
file is still in the file system. If it is, then it is a
|
|
hot-journal file and SQLite tries to roll it back (see section
|
|
<cite>rollback</cite>).
|
|
</ol>
|
|
<p class=todo> Master journal file pointers?
|
|
<p>
|
|
The following requirements describe step 1 of the above procedure in
|
|
more detail.
|
|
<p class=req id=H35140>
|
|
When required to attempt to detect a <i>hot-journal file</i>, SQLite
|
|
shall first use the xAccess() method of the VFS layer to check if a
|
|
journal file exists in the file-system.
|
|
<p class=req id=H35510>
|
|
If the call to xAccess() required by H35140 fails (due to an IO error or
|
|
similar), then SQLite shall abandon the attempt to open a <i>read-only
|
|
transaction</i>, relinquish the <i>shared lock</i> held on the database
|
|
file and return an error to the user.
|
|
<p class=req id=H35150>
|
|
When required to attempt to detect a <i>hot-journal file</i>, if the
|
|
call to xAccess() required by H35140 indicates that a journal file does
|
|
not exist, then SQLite shall conclude that there is no <i>hot-journal
|
|
file</i> in the file system and therefore that no <i>hot journal
|
|
rollback</i> is required.
|
|
<p>
|
|
The following requirements describe step 2 of the above procedure in
|
|
more detail.
|
|
<p class=req id=H35160>
|
|
When required to attempt to detect a <i>hot-journal file</i>, if the
|
|
call to xAccess() required by H35140 indicates that a journal file
|
|
is present, then the xCheckReservedLock() method of the database file
|
|
file-handle is invoked to determine whether or not some other
|
|
process is holding a <i>reserved</i> or greater lock on the database
|
|
file.
|
|
<p class=req id=H35520>
|
|
If the call to xCheckReservedLock() required by H35160 fails (due to an
|
|
IO or other internal VFS error), then SQLite shall abandon the attempt
|
|
to open a <i>read-only transaction</i>, relinquish the <i>shared lock</i>
|
|
held on the database file and return an error to the user.
|
|
<p class=req id=H35170>
|
|
If the call to xCheckReservedLock() required by H35160 indicates that
|
|
some other <i>database connection</i> is holding a <i>reserved</i>
|
|
or greater lock on the database file, then SQLite shall conclude that
|
|
there is no <i>hot journal file</i>. In this case the attempt to detect
|
|
a <i>hot journal file</i> is concluded.
|
|
<p>
|
|
The following requirements describe step 3 of the above procedure in
|
|
more detail.
|
|
<p class=req id=H35440>
|
|
If while attempting to detect a <i>hot-journal file</i> the call to
|
|
xCheckReservedLock() indicates that no process holds a <i>reserved</i>
|
|
or greater lock on the <i>database file</i>, then SQLite shall open
|
|
a file handle on the potentially hot journal file using the VFS xOpen()
|
|
method.
|
|
<p class=req id=H35530>
|
|
If the call to xOpen() required by H35440 fails (due to an IO or other
|
|
internal VFS error), then SQLite shall abandon the attempt to open a
|
|
<i>read-only transaction</i>, relinquish the <i>shared lock</i> held on
|
|
the database file and return an error to the user.
|
|
<p class=req id=H35450>
|
|
After successfully opening a file-handle on a potentially hot journal
|
|
file, SQLite shall query the file for its size in bytes using the
|
|
xFileSize() method of the open file handle.
|
|
<p class=req id=H35540>
|
|
If the call to xFileSize() required by H35450 fails (due to an IO or
|
|
other internal VFS error), then SQLite shall abandon the attempt to open
|
|
a <i>read-only transaction</i>, relinquish the <i>shared lock</i> held on
|
|
the database file, close the file handle opened on the journal file and
|
|
return an error to the user.
|
|
<p class=req id=H35460>
|
|
If the size of a potentially hot journal file is revealed to be zero
|
|
bytes by a query required by H35450, then SQLite shall close the
|
|
file handle opened on the journal file and delete the journal file using
|
|
a call to the VFS xDelete() method. In this case SQLite shall conclude
|
|
that there is no <i>hot journal file</i>.
|
|
<p class=req id=H35550>
|
|
If the call to xDelete() required by H35450 fails (due to an IO or
|
|
other internal VFS error), then SQLite shall abandon the attempt to open
|
|
a <i>read-only transaction</i>, relinquish the <i>shared lock</i> held on
|
|
the database file and return an error to the user.
|
|
<p>
|
|
The following requirements describe step 4 of the above procedure in
|
|
more detail.
|
|
<p class=req id=H35470>
|
|
If the size of a potentially hot journal file is revealed to be greater
|
|
than zero bytes by a query required by H35450, then SQLite shall attempt
|
|
to upgrade the <i>shared lock</i> held by the <i>database connection</i>
|
|
on the <i>database file</i> directly to an <i>exclusive lock</i>.
|
|
<p class=req id=H35480>
|
|
If an attempt to upgrade to an <i>exclusive lock</i> prescribed by
|
|
H35470 fails for any reason, then SQLite shall release all locks held by
|
|
the <i>database connection</i> and close the file handle opened on the
|
|
<i>journal file</i>. The attempt to open a <i>read-only transaction</i>
|
|
shall be deemed to have failed and an error returned to the user.
|
|
<p>
|
|
Finally, the following requirements describe step 5 of the above
|
|
procedure in more detail.
|
|
<p class=req id=H35490>
|
|
If, as part of the <i>hot journal file</i> detection process, the
|
|
attempt to upgrade to an <i>exclusive lock</i> mandated by H35470 is
|
|
successful, then SQLite shall query the file-system using the xAccess()
|
|
method of the VFS implementation to test whether or not the journal
|
|
file is still present in the file-system.
|
|
<p class=req id=H35560>
|
|
If the call to xAccess() required by H35490 fails (due to an IO or
|
|
other internal VFS error), then SQLite shall abandon the attempt to open
|
|
a <i>read-only transaction</i>, relinquish the lock held on the
|
|
database file, close the file handle opened on the journal file and
|
|
return an error to the user.
|
|
<p class=req id=H35570>
|
|
If the call to xAccess() required by H35490 reveals that the journal
|
|
file is no longer present in the file system, then SQLite shall abandon
|
|
the attempt to open a <i>read-only transaction</i>, relinquish the
|
|
lock held on the database file, close the file handle opened on the
|
|
journal file and return an SQLITE_BUSY error to the user.
|
|
<p class=req id=H35500>
|
|
If the xAccess() query required by H35490 reveals that the journal
|
|
file is still present in the file system, then SQLite shall conclude
|
|
that the journal file is a <i>hot journal file</i> that needs to
|
|
be rolled back. SQLite shall immediately begin <i>hot journal
|
|
rollback</i>.
|
|
<h3 id=cache_validation>Cache Validation</h3>
|
|
<p>
|
|
When a <i>database connection</i> opens a <i>read transaction</i>, the
|
|
<i>page cache</i> may already contain data associated with the
|
|
<i>database connection</i>. However, if another process has modified
|
|
the database file since the cached pages were loaded it is possible that
|
|
the cached data is invalid.
|
|
<p>
|
|
SQLite determines whether or not the <i>page cache</i> entries belonging
|
|
to the <i>database connection</i> are valid or not using the <i>file
|
|
change counter</i>, a field in the <i>database file header</i>. The
|
|
<i>file change counter</i> is a 4-byte big-endian integer field stored
|
|
starting at byte offset 24 of the <i>database file header</i>. Before the
|
|
conclusion of a <i>read/write transaction</i> that modifies the contents
|
|
of the database file in any way (see section <cite>writing_data</cite>),
|
|
the value stored in the <i>file change counter</i> is incremented. When
|
|
a <i>database connection</i> unlocks the database file, it stores the
|
|
current value of the <i>file change counter</i>. Later, while opening a
|
|
new <i>read-only transaction</i>, SQLite checks the value of the <i>file
|
|
change counter</i> stored in the database file. If the value has not
|
|
changed since the database file was unlocked, then the <i>page cache</i>
|
|
entries can be trusted. If the value has changed, then the <i>page
|
|
cache</i> entries cannot be trusted and all entries associated with
|
|
the current <i>database connection</i> are discarded.
|
|
<p class=req id=H35180>
|
|
When a file-handle open on a database file is unlocked, if the
|
|
<i>page cache</i> contains one or more entries belonging to the
|
|
associated <i>database connection</i>, SQLite shall store the value
|
|
of the <i>file change counter</i> internally.
|
|
<p class=req id=H35190>
|
|
When required to perform <i>cache validation</i> as part of opening
|
|
a <i>read transaction</i>, SQLite shall read a 16 byte block
|
|
starting at byte offset 24 of the <i>database file</i> using the xRead()
|
|
method of the <i>database connections</i> file handle.
|
|
<p class=todo>
|
|
Why a 16 byte block? Why not 4? (something to do with encrypted
|
|
databases).
|
|
<p class=req id=H35200>
|
|
While performing <i>cache validation</i>, after loading the 16 byte
|
|
block as required by H35190, SQLite shall compare the 32-bit big-endian
|
|
integer stored in the first 4 bytes of the block to the most
|
|
recently stored value of the <i>file change counter</i> (see H35180).
|
|
If the values are not the same, then SQLite shall conclude that
|
|
the contents of the cache are invalid.
|
|
<p>
|
|
Requirement H35050 (section <cite>open_read_only_trans</cite>)
|
|
specifies the action SQLite is required to take upon determining that
|
|
the cache contents are invalid.
|
|
<h3 id=read_page_one>Page 1 and the Expected Page Size</h3>
|
|
<p>
|
|
As the last step in opening a <i>read transaction</i> on a database
|
|
file that is more than 0 bytes in size, SQLite is required to load
|
|
data for page 1 of the database into the <i>page cache</i>, if it is
|
|
not already there. This is slightly more complicated than it seems,
|
|
as the database <i>page-size</i> is no known at this point.
|
|
<p>
|
|
Even though the database <i>page-size</i> cannot be known for sure,
|
|
SQLite is usually able to guess correctly by assuming it to be equal to
|
|
the connections <i>expected page size</i>. The <i>expected page size</i>
|
|
is the value of the <i>page-size</i> field read from the
|
|
<i>database file header</i> while opening the database connection
|
|
(see section <cite>open_new_connection</cite>), or the <i>page-size</i>
|
|
of the database file when the most <i>read transaction</i> was concluded.
|
|
<p class=req id=H35210>
|
|
During the conclusion of a <i>read transaction</i>, before unlocking
|
|
the database file, SQLite shall set the connections
|
|
<i>expected page size</i> to the current database <i>page-size</i>.
|
|
<p class=req id=H35220>
|
|
As part of opening a new <i>read transaction</i>, immediately after
|
|
performing <i>cache validation</i>, if there is no data for database
|
|
page 1 in the <i>page cache</i>, SQLite shall read <i>N</i> bytes from
|
|
the start of the database file using the xRead() method of the
|
|
connections file handle, where <i>N</i> is the connections current
|
|
<i>expected page size</i> value.
|
|
<p class=req id=H35230>
|
|
If page 1 data is read as required by H35230, then the value of the
|
|
<i>page-size</i> field that appears in the database file header that
|
|
consumes the first 100 bytes of the read block is not the same as the
|
|
connections current <i>expected page size</i>, then the
|
|
<i>expected page size</i> is set to this value, the database file is
|
|
unlocked and the entire procedure to open a <i>read transaction</i>
|
|
is repeated.
|
|
<p class=req id=H35240>
|
|
If page 1 data is read as required by H35230, then the value of the
|
|
<i>page-size</i> field that appears in the database file header that
|
|
consumes the first 100 bytes of the read block is the same as the
|
|
connections current <i>expected page size</i>, then the block of data
|
|
read is stored in the <i>page cache</i> as page 1.
|
|
<h2>Reading Database Data</h2>
|
|
<p class=todo>
|
|
Add something about checking the page-cache first etc.
|
|
<h2>Ending a Read-only Transaction</h2>
|
|
<p>
|
|
To end a <i>read-only transaction</i>, SQLite simply relinquishes the
|
|
<i>shared lock</i> on the file-handle open on the database file. No
|
|
other action is required.
|
|
<p class=req id=H35130>
|
|
When required to end a <i>read-only transaction</i>, SQLite shall
|
|
relinquish the <i>shared lock</i> held on the database file by
|
|
calling the xUnlock() method of the file-handle.
|
|
<p>
|
|
See also requirements H35180 and H35210 above.
|
|
<h1 id=writing_data>Writing Data</h1>
|
|
<p>
|
|
Using DDL or DML SQL statements, SQLite users may modify the contents and
|
|
size of a database file. Exactly how changes to the logical database are
|
|
translated to modifications to the database file is described in
|
|
<cite>ff_sqlitert_requirements</cite>. From the point of view of the
|
|
sub-systems described in this document, each DDL or DML statement executed
|
|
results in the contents of zero or more database file pages being
|
|
overwritten with new data. A DDL or DML statement may also append or
|
|
truncate one or more pages to or from the end of the database file. One
|
|
or more DDL and/or DML statements are grouped together to make up a
|
|
single <i>write transaction</i>. A <i>write transaction</i> is required
|
|
to have the special properties described in section <cite>overview</cite>;
|
|
a <i>write transaction</i> must be isolated, durable and atomic.
|
|
<p>
|
|
SQLite accomplishes these goals using the following techniques:
|
|
<ul>
|
|
<li><p>
|
|
To ensure that <i>write transactions</i> are <b>isolated</b>, before
|
|
beginning to modify the contents of the database file to reflect the
|
|
results of a <i>write transaction</i>, SQLite obtains an <i>exclusive
|
|
lock</i> on the <i>database file</i>. The lock is not relinquished
|
|
until the <i>write transaction</i> is concluded. Because reading from
|
|
the <i>database file</i> requires a <i>shared lock</i> (see section
|
|
<cite>reading_data</cite>) and holding an <i>exclusive
|
|
lock</i> guarantees that no other <i>database connection</i> is holding
|
|
or can obtain a <i>shared lock</i>, this ensures that no other
|
|
connection may read data from the <i>database file</i> at a point when
|
|
a <i>write transaction</i> has been partially applied.
|
|
<li><p>Ensuring that <i>write transactions</i> are <b>atomic</b> is the most
|
|
complex task required of the system. In this case, <i>atomic</i> means
|
|
that even if a system failure occurs, an attempt to commit a <i>write
|
|
transaction</i> to the database file either results in all changes
|
|
that are a part of the transaction being successfully applied to the
|
|
database file, or none of the changes are successfully applied. There
|
|
is no chance that a subset of the changes only are applied. Hence from
|
|
the point of view of an external observer, the <i>write transaction</i>
|
|
appears to be an atomic event.
|
|
<p>
|
|
Of course, it is usually not possible to atomically apply all the
|
|
changes required by a <i>write transaction</i> to a database file
|
|
within the file-system. For example, if a <i>write transaction</i>
|
|
requires ten pages of a database file to be modified, and a power
|
|
outage causes a system failure after sqlite has modified only five
|
|
pages, then the database file will almost certainly be in an
|
|
inconsistent state following system recovery.
|
|
<p>
|
|
SQLite solves this problem by using a <i>journal file</i>. In almost
|
|
all cases, before the <i>database file</i> is modified in any way,
|
|
SQLite stores sufficient information in the <i>journal file</i> to
|
|
allow the original the database file to be reconstructed if a system
|
|
failure occurs while the database file is being updated to reflect
|
|
the modifications made by the <i>write transaction</i>. Each time
|
|
SQLite opens a database file, it checks if such a system failure has
|
|
occurred and, if so,
|
|
reconstructs the database file based on the contents
|
|
of the journal file. The procedure used to detect whether or not this
|
|
process, coined <i>hot journal rollback</i>, is required is described
|
|
in section <cite>hot_journal_detection</cite>. <i>Hot journal rollback
|
|
</i> itself is described in section <cite>hot_journal_rollback</cite>.
|
|
<p>
|
|
The same technique ensures that an SQLite database file cannot be
|
|
corrupted by a system failure that occurs at an inopportune moment.
|
|
If a system failure does occur before SQLite has had a chance to
|
|
execute sufficient <i>sync file</i> operations to ensure that the
|
|
changes that make up a <i>write transaction</i> have made it safely
|
|
to persistent storage, then the <i>journal file</i> will be used
|
|
to restore the database to a known good state following system
|
|
recovery.
|
|
<li><p>
|
|
So that <i>write transactions</i> are <b>durable</b> in the face of
|
|
a system failure, SQLite executes a <i>sync file</i> operation on the
|
|
database file before concluding the <i>write transaction</i>
|
|
</ul>
|
|
<p>
|
|
The <i>page cache</i> is used to buffer modifications to the database
|
|
file image before they are written to the <i>database file</i>. When
|
|
the contents of a page is required to be modified as the results of
|
|
an operation within a <i>write transaction</i>, the modified copy is
|
|
stored in the <i>page cache</i>. Similarly, if new pages are appended
|
|
to the end of a database file, they are added to the <i>page cache</i>
|
|
instead of being immediately written to the database file within the
|
|
file-system.
|
|
<p>
|
|
Ideally, all changes for an entire write transaction are buffered in
|
|
the page cache until the end of the transaction. When the user commits
|
|
the transaction, all changes are applied to the database file in the
|
|
most efficient way possible, taking into account the assumptions
|
|
enumerated in section <cite>fs_performance</cite>. Unfortunately, since
|
|
main-memory is a limited resource, this is not always possible for
|
|
large transactions. In this case changes are buffered in the <i>page
|
|
cache</i> until some internal condition or limit is reached,
|
|
then written out to the database file in order to free resources
|
|
as they are required. Section <cite>page_cache_algorithms</cite>
|
|
describes the circumstances under which changes are flushed through
|
|
to the database file mid-transaction to free <i>page cache</i> resources.
|
|
<p>
|
|
Even if an application or system failure does not occur while a
|
|
<i>write transaction</i> is in progress, a rollback operation to restore
|
|
the database file and <i>page cache</i> to the state that it was in before
|
|
the transaction started may be required. This may occur if the user
|
|
explicitly requests transaction rollback (by issuing a "ROLLBACK" command),
|
|
or automatically, as a result of encountering an SQL constraint (see
|
|
<cite>sql_sqlitert_requirements</cite>). For this reason, the original page
|
|
content is stored in the <i>journal file</i> before the page is even
|
|
modified within the <i>page cache</i>.
|
|
<p class=todo>
|
|
Introduce the following sub-sections.
|
|
<h2 id=journal_file_format>Journal File Format</h2>
|
|
<p>
|
|
This section describes the format used by an SQLite <i>journal file</i>.
|
|
<p>
|
|
A journal file consists of one or more <i>journal headers</i>, zero
|
|
or more <i>journal records</i> and optionally a <i>master journal
|
|
pointer</i>. Each journal file always begins with a
|
|
<i>journal header</i>, followed by zero or more <i>journal records</i>.
|
|
Following this may be a second <i>journal header</i> followed by a
|
|
second set of zero or more <i>journal records</i> and so on. There
|
|
is no limit to the number of <i>journal headers</i> a journal file
|
|
may contain. Following the <i>journal headers</i> and their accompanying
|
|
sets of <i>journal records</i> may be the optional <i>master journal
|
|
pointer</i>. Or, the file may simply end following the final <i>journal
|
|
record</i>.
|
|
<p>
|
|
This section only describes the format of the journal file and the
|
|
various objects that make it up. But because a journal file may be
|
|
read by an SQLite process following recovery from a system failure
|
|
(<i>hot journal rollback</i>, see section
|
|
<cite>hot_journal_rollback</cite>) it is also important to describe
|
|
the way the file is created and populated within the file-system
|
|
using a combination of <i>write file</i>, <i>sync file</i> and
|
|
<i>truncate file</i> operations. These are described in section
|
|
<cite>write_transactions</cite>.
|
|
<h3 id=journal_header_format>Journal Header Format</h3>
|
|
<p>
|
|
A <i>journal header</i> is <i>sector-size</i> bytes in size, where <i>
|
|
sector-size</i> is the value returned by the xSectorSize method of
|
|
the file handle opened on the database file. Only the first 28 bytes
|
|
of the <i>journal header</i> are used, the remainder may contain garbage
|
|
data. The first 28 bytes of each <i>journal header</i> consists of an
|
|
eight byte block set to a well-known value, followed by five big-endian
|
|
32-bit unsigned integer fields.
|
|
<center><img src="images/fileformat/journal_header.gif">
|
|
<p><i>Figure <span class=fig id=figure_journal_header></span> - Journal Header Format</i>
|
|
</center>
|
|
<p>
|
|
Figure <cite>figure_journal_header</cite> graphically depicts the layout
|
|
of a <i>journal header</i>. The individual fields are described in
|
|
the following table. The offsets in the 'byte offset' column of the
|
|
table are relative to the start of the <i>journal header</i>.
|
|
<table class=striped>
|
|
<tr><th>Byte offset<th>Size in bytes<th width=100%>Description
|
|
<tr><td>0<td>8<td>The <b>journal magic</b> field always contains a
|
|
well-known 8-byte string value used to identify SQLite
|
|
journal files. The well-known sequence of byte values
|
|
is:
|
|
<pre>0xd9 0xd5 0x05 0xf9 0x20 0xa1 0x63 0xd7</pre>
|
|
<tr><td>8<td>4<td>This field, the <b>record count</b>, is set to the
|
|
number of <i>journal records</i> that follow this
|
|
<i>journal header</i> in the <i>journal file</i>.
|
|
<tr><td>12<td>4<td>The <b>checksum initializer</b> field is set to a
|
|
pseudo-random value. It is used as part of the
|
|
algorithm to calculate the checksum for all <i>journal
|
|
records</i> that follow this <i>journal header</i>.
|
|
<tr><td>16<td>4<td>This field, the <b>database page count</b>, is set
|
|
to the number of pages that the <i>database file</i>
|
|
contained before any modifications associated with
|
|
<i>write transaction</i> are applied.
|
|
<tr><td>20<td>4<td>This field, the <b>sector size</b>, is set to the
|
|
<i>sector size</i> of the device on which the
|
|
<i>journal file</i> was created, in bytes. This value
|
|
is required when reading the journal file to determine
|
|
the size of each <i>journal header</i>.
|
|
<tr><td>24<td>4<td>The <b>page size</b> field contains the database page
|
|
size used by the corresponding <i>database file</i>
|
|
when the <i>journal file</i> was created, in bytes.
|
|
</table>
|
|
<p>
|
|
All <i>journal headers</i> are positioned in the file so that they
|
|
start at a <i>sector size</i> aligned offset. To achieve this, unused
|
|
space may be left between the start of the second and subsequent
|
|
<i>journal headers</i> and the end of the <i>journal records</i>
|
|
associated with the previous header.
|
|
<h3 id=journal_record_format>Journal Record Format</h3>
|
|
<p>
|
|
Each <i>journal record</i> contains the original data for a database page
|
|
modified by the <i>write transaction</i>. If a rollback is required, then
|
|
this data may be used to restore the contents of the database page to the
|
|
state it was in before the <i>write transaction</i> was started.
|
|
<center><img src="images/fileformat/journal_record.gif">
|
|
<p><i>Figure <span class=fig id=figure_journal_record></span> - Journal Record Format</i>
|
|
</center>
|
|
<p>
|
|
A <i>journal record</i>, depicted graphically by figure
|
|
<cite>figure_journal_record</cite>, contains three fields, as described
|
|
in the following table. Byte offsets are relative to the start of the
|
|
<i>journal record</i>.
|
|
<table class=striped>
|
|
<tr><th>Byte offset<th>Size in bytes<th width=100%>Description
|
|
<tr><td>0<td>4<td>The page number of the database page associated with
|
|
this <i>journal record</i>, stored as a 4 byte
|
|
big-endian unsigned integer.
|
|
<tr><td>4<td><i>page-size<td>
|
|
This field contains the original data for the page,
|
|
exactly as it appeared in the database file before the
|
|
<i>write transaction</i> began.
|
|
<tr><td style="white-space: nowrap">4 + <i>page-size</i><td>4<td>
|
|
This field contains a checksum value, calculated based
|
|
on the contents of the journaled database page data
|
|
(the previous field) and the values stored in the
|
|
<i>checksum initializer</i> field of the preceding
|
|
<i>journal header</i>.
|
|
</table>
|
|
<p>
|
|
The set of <i>journal records</i> that follow a <i>journal header</i>
|
|
in a <i>journal file</i> are packed tightly together. There are no
|
|
alignment requirements for <i>journal records</i> as there are for
|
|
<i>journal headers</i>.
|
|
<h3>Master Journal Pointer</h3>
|
|
<p>
|
|
To support <i>atomic</i> transactions that modify more than one
|
|
database file, SQLite sometimes includes a <i>master journal pointer</i>
|
|
record in a <i>journal file</i>. Multiple file transactions are
|
|
described in section <cite>multifile_transactions</cite>. A <i>
|
|
master journal pointer</i> contains the name of a <i>master journal-file
|
|
</i> along with a check-sum and some well known values that allow
|
|
the <i>master journal pointer</i> to be recognized as such when
|
|
the <i>journal file</i> is read during a rollback operation (section
|
|
<cite>rollback</cite>).
|
|
<p>
|
|
As is the case for a <i>journal header</i>, the start of a <i>master
|
|
journal pointer</i> is always positioned at a <i>sector size</i>
|
|
aligned offset. If the <i>journal record</i> or <i>journal header</i>
|
|
that appears immediately before the <i>master journal pointer</i> does
|
|
not end at an aligned offset, then unused space is left between the
|
|
end of the <i>journal record</i> or <i>journal header</i> and the start
|
|
of the <i>master journal pointer</i>.
|
|
<center><img src="images/fileformat/master_journal_ptr.gif">
|
|
<p><i>Figure <span class=fig id=figure_master_journal_ptr></span> - Master Journal Pointer Format</i>
|
|
</center>
|
|
<p>
|
|
A <i>master journal pointer</i>, depicted graphically by figure
|
|
<cite>figure_master_journal_ptr</cite>, contains five fields, as
|
|
described in the following table. Byte offsets are relative to the
|
|
start of the <i>master journal pointer</i>.
|
|
<table class=striped>
|
|
<tr><th>Byte offset<th>Size in bytes<th width=100%>Description
|
|
<tr><td>0<td>4<td>This field, the <b>locking page number</b>, is always
|
|
set to the page number of the database <i>locking page</i>
|
|
stored as a 4-byte big-endian integer. The <i>locking page</i>
|
|
is the page that begins at byte offset 2<super>30</super> of the
|
|
database file. Even if the database file is large enough to
|
|
contain the <i>locking page</i>, the <i>locking page</i> is
|
|
never used to store any data and so the first four bytes of of a
|
|
valid <i>journal record</i> will never contain this value. For
|
|
further description of the <i>locking page</i>, refer to
|
|
<cite>ff_sqlitert_requirements</cite>.
|
|
<tr><td>4<td><i>name-length</i><td>
|
|
The <b>master journal name</b> field contains the name of the
|
|
master journal file, encoded as a utf-8 string. There is no
|
|
nul-terminator appended to the string.
|
|
<tr><td>4 + <i>name-length</i><td><i>4<td>
|
|
The <b>name-length</b> field contains the length of the
|
|
previous field in bytes, formatted as a 4-byte big-endian
|
|
unsigned integer.
|
|
<tr><td>8 + <i>name-length</i><td><i>4<td>
|
|
The <b>checksum</b> field contains a checksum value stored as
|
|
a 4-byte big-endian signed integer. The checksum value is
|
|
calculated as the sum of the bytes that make up the <i>
|
|
master journal name</i> field, interpreting each byte as
|
|
an 8-bit signed integer.
|
|
<tr><td style="white-space: nowrap">12 + <i>name-length</i><td><i>8<td>
|
|
Finally, the <b>journal magic</b> field always contains a
|
|
well-known 8-byte string value; the same value stored in the
|
|
first 8 bytes of a <i>journal header</i>. The well-known
|
|
sequence of bytes is:
|
|
<pre>0xd9 0xd5 0x05 0xf9 0x20 0xa1 0x63 0xd7</pre>
|
|
</table>
|
|
<h2 id=write_transactions>Write Transactions</h2>
|
|
<p>
|
|
This section describes the progression of an SQLite <i>write
|
|
transaction</i>. From the point of view of the systems described in
|
|
this document, most <i>write transactions</i> consist of three steps:
|
|
<ol>
|
|
<li><p>The <i>write transaction</i> is opened. This process is described
|
|
in section <cite>opening_a_write_transaction</cite>.
|
|
<li><p>The end-user executes DML or DDL SQL statements that require the
|
|
structure of the database file of the database file to be modified.
|
|
These modifications may be any combination of operations to
|
|
<ul><li>modify the content of an existing database page,
|
|
<li>append a new database page to the database file image, or
|
|
<li>truncate (discard) a database page from the end of the
|
|
database file.
|
|
</ul>
|
|
These operations are described in detail in section
|
|
<cite>modifying_appending_truncating</cite>. How user DDL or DML
|
|
SQL statements are mapped to combinations of these three operations
|
|
is described in <cite>ff_sqlitert_requirements</cite>.
|
|
<li><p>The <i>write transaction</i> is concluded and the changes made
|
|
permanently committed to the database. The process required to
|
|
commit a transaction is described in section
|
|
<cite>committing_a_transaction</cite>.
|
|
</ol>
|
|
<p>
|
|
As an alternative to step 3 above, the transaction may be rolled back.
|
|
Transaction rollback is described in section <cite>rollback</cite>.
|
|
Finally, it is also important to remember that a <i>write transaction</i>
|
|
may be interrupted by a <i>system failure</i> at any point. In this
|
|
case, the contents of the file-system (the <i>database file</i> and
|
|
<i>journal file</i>) must be left in such a state so as to enable
|
|
the <i>database file</i> to be restored to the state it was in before
|
|
the interrupted <i>write transaction</i> was started. This is known
|
|
as <i>hot journal rollback</i>, and is described in section
|
|
<cite>hot_journal_rollback</cite>. Section
|
|
<cite>fs_assumption_details</cite> describes the assumptions made
|
|
regarding the effects of a <i>system failure</i> on the file-system
|
|
contents following recovery.
|
|
<h3 id=opening_a_write_transaction>Beginning a Write Transaction</h3>
|
|
<p>
|
|
Before any database pages may be modified within the <i>page cache</i>,
|
|
the <i>database connection</i> must open a <i>write transaction</i>.
|
|
Opening a <i>write transaction</i> requires that the <i>database
|
|
connection</i> obtains a <i>reserved lock</i> (or greater) on the
|
|
<i>database file</i>. Because obtaining a <i>reserved lock</i> on
|
|
a <i>database file</i> guarantees that no other <i>database
|
|
connection</i> may hold or obtain a <i>reserved lock</i> or greater,
|
|
it follows that no other <i>database connection</i> may have an
|
|
open <i>write transaction</i>.
|
|
<p>
|
|
A <i>reserved lock</i> on the <i>database file</i> may be thought of
|
|
as an exclusive lock on the <i>journal file</i>. No
|
|
<i>database connection</i> may read from or write to a <i>journal
|
|
file</i> without a <i>reserved</i> or greater lock on the corresponding
|
|
<i>database file</i>.
|
|
<p>
|
|
Before opening a <i>write transaction</i>, a <i>database connection</i>
|
|
must have an open <i>read transaction</i>, opened via the procedure
|
|
described in section <cite>open_read_only_trans</cite>. This ensures
|
|
that there is no <i>hot-journal file</i> that needs to be rolled back
|
|
and that any data stored in the <i>page cache</i> can be trusted.
|
|
<p>
|
|
Once a <i>read transaction</i> has been opened, upgrading to a
|
|
<i>write transaction</i> is a two step process, as follows:
|
|
<ol>
|
|
<li>A <i>reserved lock</i> is obtained on the <i>database file</i>.
|
|
<li>The <i>journal file</i> is opened and created if necessary (using
|
|
the VFS xOpen method), and a <i>journal file header</i> written
|
|
to the start of it using a single call to the file handles xWrite
|
|
method.
|
|
</ol>
|
|
<p>
|
|
Requirements describing step 1 of the above procedure in detail:
|
|
<p class=req id=H35350>
|
|
When required to open a <i>write transaction</i> on the database,
|
|
SQLite shall first open a <i>read transaction</i>, if the <i>database
|
|
connection</i> in question has not already opened one.
|
|
<p class=req id=H35360>
|
|
When required to open a <i>write transaction</i> on the database, after
|
|
ensuring a <i>read transaction</i> has already been opened, SQLite
|
|
shall obtain a <i>reserved lock</i> on the database file by calling
|
|
the xLock method of the file-handle open on the database file.
|
|
<p class=req id=H35580>
|
|
If an attempt to acquire a <i>reserved lock</i> prescribed by
|
|
requirement H35360 fails, then SQLite shall deem the attempt to
|
|
open a <i>write transaction</i> to have failed and return an error
|
|
to the user.
|
|
<p>
|
|
Requirements describing step 2 of the above procedure in detail:
|
|
<p class=req id=H35370>
|
|
When required to open a <i>write transaction</i> on the database, after
|
|
obtaining a <i>reserved lock</i> on the database file, SQLite shall
|
|
open a read/write file-handle on the corresponding <i>journal file</i>.
|
|
<p class=req id=H35380>
|
|
When required to open a <i>write transaction</i> on the database, after
|
|
opening a file-handle on the <i>journal file</i>, SQLite shall append
|
|
a <i>journal header</i> to the (currently empty) <i>journal file</i>.
|
|
<h4 id=writing_journal_header>Writing a Journal Header</h4>
|
|
<p>
|
|
Requirements describing how a <i>journal header</i> is appended to
|
|
a journal file:
|
|
<p class=req id=H35680>
|
|
When required to append a <i>journal header</i> to the <i>journal
|
|
file</i>, SQLite shall do so by writing a block of <i>sector-size</i>
|
|
bytes using a single call to the xWrite method of the file-handle
|
|
open on the <i>journal file</i>. The block of data written shall begin
|
|
at the smallest sector-size aligned offset at or following the current
|
|
end of the <i>journal file</i>.
|
|
<p class=req id=H35690>
|
|
The first 8 bytes of the <i>journal header</i> required to be written
|
|
by H35680 shall contain the following values, in order from byte offset 0
|
|
to 7: 0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63 and 0xd7.
|
|
<p class=req id=H35700>
|
|
Bytes 8-11 of the <i>journal header</i> required to be written by
|
|
H35680 shall contain 0x00.
|
|
<p class=req id=H35710>
|
|
Bytes 12-15 of the <i>journal header</i> required to be written by
|
|
H35680 shall contain the number of pages that the database file
|
|
contained when the current <i>write-transaction</i> was started,
|
|
formatted as a 4-byte big-endian unsigned integer.
|
|
<p class=req id=H35720>
|
|
Bytes 16-19 of the <i>journal header</i> required to be written by
|
|
H35680 shall contain pseudo-randomly generated values.
|
|
<p class=req id=H35730>
|
|
Bytes 20-23 of the <i>journal header</i> required to be written by
|
|
H35680 shall contain the <i>sector size</i> used by the VFS layer,
|
|
formatted as a 4-byte big-endian unsigned integer.
|
|
<p class=req id=H35740>
|
|
Bytes 24-27 of the <i>journal header</i> required to be written by
|
|
H35680 shall contain the <i>page size</i> used by the database at
|
|
the start of the <i>write transaction</i>, formatted as a 4-byte
|
|
big-endian unsigned integer.
|
|
<h3 id=modifying_appending_truncating>
|
|
Modifying, Adding or Truncating a Database Page
|
|
</h3>
|
|
<p>
|
|
When the end-user executes a DML or DDL SQL statement to modify the
|
|
database schema or content, SQLite is required to update the database
|
|
file image to reflect the new database state. This involves modifying
|
|
the content of, appending or truncating one of more database file
|
|
pages. Instead of modifying the database file directly using the VFS
|
|
interface, changes are first buffered within the <i>page cache</i>.
|
|
<p>
|
|
Before modifying a database page within the <i>page cache</i> that
|
|
may need to be restored by a rollback operation, the page must be
|
|
<i>journalled</i>. <i>Journalling a page</i> is the process of copying
|
|
that pages original data into the journal file so that it can be
|
|
recovered if the <i>write transaction</i> is rolled back. The process
|
|
of journalling a page is described in section
|
|
<cite>journalling_a_page</cite>.
|
|
<p class=req id=H35590>
|
|
When required to modify the contents of an existing database page that
|
|
existed and was not a <i>free-list leaf page</i> when the <i>write
|
|
transaction</i> was opened, SQLite shall journal the page if it has not
|
|
already been journalled within the current <i>write transaction</i>.
|
|
<p class=req id=H35600>
|
|
When required to modify the contents of an existing database page,
|
|
SQLite shall update the cached version of the database page content
|
|
stored as part of the <i>page cache entry</i> associated with the page.
|
|
<p>
|
|
When a new database page is appended to a database file, there is
|
|
no requirement to add a record to the <i>journal file</i>. If a
|
|
rollback is required the database file will simply be truncated back
|
|
to its original size based on the value stored at byte offset 12
|
|
of the <i>journal file</i>.
|
|
<p class=req id=H35610>
|
|
When required to append a new database page to the database file,
|
|
SQLite shall create a new <i>page cache entry</i> corresponding to
|
|
the new page and insert it into the <i>page cache</i>. The <i>dirty
|
|
flag</i> of the new <i>page cache entry</i> shall be set.
|
|
<p>
|
|
If required to truncate a database page from the end of the database
|
|
file, the associated <i>page cache entry</i> is discarded. The adjusted
|
|
size of the database file is stored internally. The database file
|
|
is not actually truncated until the current <i>write transaction</i>
|
|
is committed (see section <cite>committing_a_transaction</cite>).
|
|
<p class=req id=H35620>
|
|
When required to truncate (remove) a database page that existed and was
|
|
not a <i>free-list leaf page</i> when the <i>write transaction</i> was
|
|
opened from the end of a database file, SQLite shall journal the page if
|
|
it has not already been journalled within the current <i>write
|
|
transaction</i>.
|
|
<p class=req id=H35630>
|
|
When required to truncate a database page from the end of the database
|
|
file, SQLite shall discard the associated <i>page cache entry</i>
|
|
from the page cache.
|
|
<h4 id=journalling_a_page>Journalling a Database Page</h4>
|
|
<p>
|
|
A page is journalled by adding a <i>journal record</i> to the <i>
|
|
journal file</i>. The format of a <i>journal record</i> is described
|
|
in section <cite>journal_record_format</cite>.
|
|
<p class=req id=H35270>
|
|
When required to <i>journal a database page</i>, SQLite shall first
|
|
append the <i>page number</i> of the page being journalled to the
|
|
<i>journal file</i>, formatted as a 4-byte big-endian unsigned integer,
|
|
using a single call to the xWrite method of the file-handle opened
|
|
on the journal file.
|
|
<p class=req id=H35280>
|
|
When required to <i>journal a database page</i>, if the attempt to
|
|
append the <i>page number</i> to the journal file is successful,
|
|
then the current page data (<i>page-size</i> bytes) shall be appended
|
|
to the journal file, using a single call to the xWrite method of the
|
|
file-handle opened on the journal file.
|
|
<p class=req id=H35290>
|
|
When required to <i>journal a database page</i>, if the attempt to
|
|
append the current page data to the journal file is successful,
|
|
then SQLite shall append a 4-byte big-endian integer checksum value
|
|
to the to the journal file, using a single call to the xWrite method
|
|
of the file-handle opened on the journal file.
|
|
<p>
|
|
The checksum value written to the <i>journal file</i> immediately after
|
|
the page data (requirement H35290), is a function of both the page
|
|
data and the <i>checksum initializer</i> field stored in the
|
|
<i>journal header</i> (see section <cite>journal_header_format</cite>).
|
|
Specifically, it is the sum of the <i>checksum initializer</i> and
|
|
the value of every 200th byte of page data interpreted as an 8-bit
|
|
unsigned integer, starting with the (<i>page-size</i> % 200)'th
|
|
byte of page data. For example, if the <i>page-size</i> is 1024 bytes,
|
|
then a checksum is calculated by adding the values of the bytes at
|
|
offsets 23, 223, 423, 623, 823 and 1023 (the last byte of the page)
|
|
together with the value of the <i>checksum initializer</i>.
|
|
<p class=req id=H35300>
|
|
The checksum value written to the <i>journal file</i> by the write
|
|
required by H35290 shall be equal to the sum of the <i>checksum
|
|
initializer</i> field stored in the <i>journal header</i> (H35700) and
|
|
every 200th byte of the page data, beginning with the
|
|
(<i>page-size</i> % 200)th byte.
|
|
<p>
|
|
The '%' character is used in requirement H35300 to represent the
|
|
modulo operator, just as it is in programming languages such as C, Java
|
|
and Javascript.
|
|
<h3 id=syncing_journal_file>Syncing the Journal File</h3>
|
|
<p>
|
|
Even after the original data of a database page has been written into
|
|
the journal file using calls to the journal file file-handle xWrite
|
|
method (section <cite>journalling_a_page</cite>), it is still not
|
|
safe to write to the page within the database file. This is because
|
|
in the event of a system failure the data written to the journal file
|
|
may still be corrupted (see section <cite>fs_characteristics</cite>).
|
|
Before the page can be updated within the database itself, the
|
|
following procedure takes place:
|
|
<ol>
|
|
<li> The xSync method of the file-handle opened on the journal file
|
|
is called. This operation ensures that all <i>journal records</i>
|
|
in the journal file have been written to persistent storage, and
|
|
that they will not become corrupted as a result of a subsequent
|
|
system failure.
|
|
<li> The <i>journal record count</i> field (see section
|
|
<cite>journal_header_format</cite>) of the most recently written
|
|
journal header in the journal file is updated to contain the
|
|
number of <i>journal records</i> added to the journal file since
|
|
the header was written.
|
|
<li> The xSync method is called again, to ensure that the update to
|
|
the <i>journal record count</i> has been committed to persistent
|
|
storage.
|
|
</ol>
|
|
<p>
|
|
If all three of the steps enumerated above are executed successfully,
|
|
then it is safe to modify the content of the <i>journalled</i>
|
|
database pages within the database file itself. The combination of
|
|
the three steps above is referred to as <i>syncing the journal file</i>.
|
|
<p class=req id=H35750>
|
|
When required to <i>sync the journal file</i>, SQLite shall invoke the
|
|
xSync method of the file handle open on the <i>journal file</i>.
|
|
<p class=req id=H35760>
|
|
When required to <i>sync the journal file</i>, after invoking the
|
|
xSync method as required by H35750, SQLite shall update the <i>record
|
|
count</i> of the <i>journal header</i> most recently written to the
|
|
<i>journal file</i>. The 4-byte field shall be updated to contain
|
|
the number of <i>journal records</i> that have been written to the
|
|
<i>journal file</i> since the <i>journal header</i> was written,
|
|
formatted as a 4-byte big-endian unsigned integer.
|
|
<p class=req id=H35770>
|
|
When required to <i>sync the journal file</i>, after updating the
|
|
<i>record count</i> field of a <i>journal header</i> as required by
|
|
H35760, SQLite shall invoke the xSync method of the file handle open
|
|
on the <i>journal file</i>.
|
|
<h3 id=upgrading_to_exclusive_lock>Upgrading to an Exclusive Lock</h3>
|
|
<p>
|
|
Before the content of a page modified within the <i>page cache</i> may
|
|
be written to the database file, an <i>exclusive lock</i> must be held
|
|
on the database file. The purpose of this lock is to prevent another
|
|
connection from reading from the database file while the first
|
|
connection is midway through writing to it. Whether the reason for
|
|
writing to the database file is because a transaction is being committed,
|
|
or to free up space within the <i>page cache</i>, upgrading to an
|
|
<i>exclusive lock</i> always occurs immediately after
|
|
<i>syncing the journal file</i>.
|
|
<p class=req id=H35780>
|
|
When required to upgrade to an <i>exclusive lock</i> as part of a write
|
|
transaction, SQLite shall first attempt to obtain a <i>pending lock</i>
|
|
on the database file if one is not already held by invoking the xLock
|
|
method of the file handle opened on the <i>database file</i>.
|
|
<p class=req id=H35790>
|
|
When required to upgrade to an <i>exclusive lock</i> as part of a write
|
|
transaction, after successfully obtaining a <i>pending lock</i> SQLite
|
|
shall attempt to obtain an <i>exclusive lock</i> by invoking the
|
|
xLock method of the file handle opened on the <i>database file</i>.
|
|
<p class=todo>
|
|
What happens if the exclusive lock cannot be obtained? It is not
|
|
possible for the attempt to upgrade from a reserved to a pending
|
|
lock to fail.
|
|
<h3 id=committing_a_transaction>Committing a Transaction</h3>
|
|
<p>
|
|
Committing a <i>write transaction</i> is the final step in updating the
|
|
database file. Committing a transaction is a seven step process,
|
|
summarized as follows:
|
|
<ol>
|
|
<li><p>
|
|
The database file header <i>change counter</i> field is incremented.
|
|
The <i>change counter</i>, described in
|
|
<cite>ff_sqlitert_requirements</cite>, is used by the <i>cache
|
|
validation</i> procedure described in section
|
|
<cite>cache_validation</cite>.
|
|
<li><p>
|
|
The <i>journal file</i> is synced. The steps required to <i>sync the
|
|
journal file</i> are described in section
|
|
<cite>syncing_journal_file</cite>.
|
|
<li><p>
|
|
Upgrade to an <i>exclusive lock</i> on the database file, if an
|
|
<i>exclusive lock</i> is not already held. Upgrading to an
|
|
<i>exclusive lock</i> is described in section
|
|
<cite>upgrading_to_exclusive_lock</cite>.
|
|
<li><p>
|
|
Copy the contents of all <i>dirty pages</i> stored in the <i>page
|
|
cache</i> into the database file. The set of dirty pages are written
|
|
to the database file in page number order in order to improve
|
|
performance (see the assumptions in section <cite>fs_performance</cite>
|
|
for details).
|
|
<li><p>
|
|
The database file is synced to ensure that all updates are stored
|
|
safely on the persistent media.
|
|
<li><p>
|
|
The file-handle open on the <i>journal file</i> is closed and the
|
|
journal file itself deleted. At this point the <i>write transaction</i>
|
|
transaction has been irrevocably committed.
|
|
<li><p>
|
|
The database file is unlocked.
|
|
</ol>
|
|
<p class=todo>
|
|
Expand on and explain the above a bit.
|
|
<p>
|
|
The following requirements describe the steps enumerated above in more
|
|
detail.
|
|
<p class=req id=H35800>
|
|
When required to <i>commit a write-transaction</i>, SQLite shall
|
|
modify page 1 to increment the value stored in the <i>change counter</i>
|
|
field of the <i>database file header</i>.
|
|
<p>
|
|
The <i>change counter</i> is a 4-byte big-endian integer field stored
|
|
at byte offset 24 of the <i>database file</i>. The modification to page 1
|
|
required by H35800 is made using the process described in section
|
|
<cite>modifying_appending_truncating</cite>. If page 1 has not already
|
|
been journalled as a part of the current write-transaction, then
|
|
incrementing the <i>change counter</i> may require that page 1 be
|
|
journalled. In all cases the <i>page cache entry</i> corresponding to
|
|
page 1 becomes a <i>dirty page</i> as part of incrementing the <i>change
|
|
counter</i> value.
|
|
<p class=req id=H35810>
|
|
When required to <i>commit a write-transaction</i>, after incrementing
|
|
the <i>change counter</i> field, SQLite shall <i>sync the journal
|
|
file</i>.
|
|
<p class=req id=H35820>
|
|
When required to <i>commit a write-transaction</i>, after <i>syncing
|
|
the journal file</i> as required by H35810, if an <i>exclusive lock</i>
|
|
on the database file is not already held, SQLite shall attempt to
|
|
<i>upgrade to an exclusive lock</i>.
|
|
<p class=req id=H35830>
|
|
When required to <i>commit a write-transaction</i>, after <i>syncing
|
|
the journal file</i> as required by H35810 and ensuring that an
|
|
<i>exclusive lock</i> is held on the database file as required by
|
|
H35830, SQLite shall copy the contents of all <i>dirty page</i>
|
|
stored in the <i>page cache</i> into the <i>database file</i> using
|
|
calls to the xWrite method of the <i>database connection</i> file
|
|
handle. Each call to xWrite shall write the contents of a single
|
|
<i>dirty page</i> (<i>page-size</i> bytes of data) to the database
|
|
file. Dirty pages shall be written in order of <i>page number</i>,
|
|
from lowest to highest.
|
|
<p class=req id=H35840>
|
|
When required to <i>commit a write-transaction</i>, after copying the
|
|
contents of any <i>dirty pages</i> to the database file as required
|
|
by H35830, SQLite shall sync the database file by invoking the xSync
|
|
method of the <i>database connection</i> file handle.
|
|
<p class=req id=H35850>
|
|
When required to <i>commit a write-transaction</i>, after syncing
|
|
the database file as required by H35840, SQLite shall close the
|
|
file-handle opened on the <i>journal file</i> and delete the
|
|
<i>journal file</i> from the file system via a call to the VFS
|
|
xDelete method.
|
|
<p class=req id=H35860>
|
|
When required to <i>commit a write-transaction</i>, after deleting
|
|
the <i>journal file</i> as required by H35850, SQLite shall relinquish
|
|
all locks held on the <i>database file</i> by invoking the xUnlock
|
|
method of the <i>database connection</i> file handle.
|
|
<p class=todo>
|
|
Is the shared lock held after committing a <i>write transaction</i>?
|
|
<h3>Purging a Dirty Page</h3>
|
|
<p>
|
|
Usually, no data is actually written to the database file until the
|
|
user commits the active <i>write transaction</i>. The exception is
|
|
if a single <i>write transaction</i> contains too many modifications
|
|
to be stored in the <i>page cache</i>. In this case, some of the
|
|
database file modifications stored in the <i>page cache</i> must be
|
|
applied to the database file before the transaction is committed so
|
|
that the associated <i>page cache entries</i> can be purged from the
|
|
page cache to free memory. Exactly when this condition is reached and
|
|
dirty pages must be purged is described in section
|
|
<cite>page_cache_algorithms</cite>.
|
|
<p>
|
|
Before the contents of the <i>page cache entry</i> can be written into
|
|
the database file, the <i>page cache entry</i> must meet the criteria
|
|
for a <i>writable dirty page</i>, as defined in section
|
|
<cite>page_cache_algorithms</cite>. If the dirty page selected by the
|
|
algorithms in section <cite>page_cache_algorithms</cite> for purging,
|
|
SQLite is required to <i>sync the journal file</i>. Immediately after
|
|
the journal file is synced, all dirty pages associated with the
|
|
<i>database connection</i> are classified as <i>writable dirty pages</i>.
|
|
<p class=req id=H35640>
|
|
When required to purge a <i>non-writable dirty page</i> from the
|
|
<i>page cache</i>, SQLite shall <i>sync the journal file</i> before
|
|
proceeding with the write operation required by H35670.
|
|
<p class=req id=H35660>
|
|
After <i>syncing the journal file</i> as required by H35640, SQLite
|
|
shall append a new <i>journal header</i> to the <i>journal file</i>
|
|
before proceeding with the write operation required by H35670.
|
|
<p>
|
|
Appending a new <i>journal header</i> to the journal file is described
|
|
in section <cite>writing_journal_header</cite>.
|
|
<p>
|
|
Once the dirty page being purged is writable, it is simply written
|
|
into the database file.
|
|
<p class=req id=H35670>
|
|
When required to purge a <i>page cache entry</i> that is a
|
|
<i>dirty page</i> SQLite shall write the page data into the database
|
|
file, using a single call to the xWrite method of the <i>database
|
|
connection</i> file handle.
|
|
<h2 id="multifile_transactions">Multi-File Transactions</h2>
|
|
<h2 id="statement_transactions">Statement Transactions</h2>
|
|
<h1 id=rollback>Rollback</h1>
|
|
<h2 id=hot_journal_rollback>Hot Journal Rollback</h2>
|
|
<h2>Transaction Rollback</h2>
|
|
<h2>Statement Rollback</h2>
|
|
<h1>References</h1>
|
|
<table id="refs" style="width:auto; margin: 1em 5ex">
|
|
<tr><td style="width:5ex" id="capi_sqlitert_requirements">[1]<td>
|
|
C API Requirements Document.
|
|
<tr><td style="width:5ex" id="sql_sqlitert_requirements">[2]<td>
|
|
SQL Requirements Document.
|
|
<tr><td style="width:5ex" id="ff_sqlitert_requirements">[3]<td>
|
|
File Format Requirements Document.
|
|
</table>
|
|
|
|
|