-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path02-mv-data.tex
1400 lines (1291 loc) · 66.2 KB
/
02-mv-data.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
%%=====================================================================
%% Transactions on Multiversion Data
%%=====================================================================
\chapter{Transactions on Multiversion Data}
\label{chapter:mv-data}
% Explanation of the details of transactions on multiversion data.
% The transaction model.
% No implementation (index structure) details.
The theory of general transaction processing in traditional
databases is well-defined and mature, and the basic principles are
well presented in many
textbooks~\cite{bernstein:1987:cc-n-r,gray:1993:transactionprocessing,papadimitriou:1986:cc-theory}.
In this chapter, we concentrate on the theory of multiversion
transaction processing, and highlight the differences to classical
transaction theory.
The traditional read-write model assumes that transactions are sequences
of reads and writes on data items, without distinguishing item deletions and
insertions from
updates~\cite{bernstein:1987:cc-n-r,papadimitriou:1986:cc-theory}.
The theory of transaction processing in this dissertation is based on
the recoverable transaction model presented by Sippu and
Soisalon-Soininen~\cite{sippu:2001:theory}, which in turn is based on the
model proposed by C.~Mohan~\cite{mohan:1990:aries-kvl,mohan:1992:aries-im}.
In this model, data-item insertions and deletions are made explicit,
and structure-modification operations are included in the model.
We assume the partially persistent transaction-time model, as
described in the introduction.
We will use the terminology presented in the consensus glossary of
temporal database concepts by Böhlen et~al.~\cite{bohlen:1998:consensus}, for
the relevant parts.
The chapter begins with a short review of the fundamental concepts of
transaction management in traditional databases, and continues to
describe the most important aspects of multiversion database theory.
%, highlighting the differences to classical database theory.
%% Fundamentals of Snapshot Database Theory
%%---------------------------------------------------------------------
\section{Fundamentals of Snapshot Database Theory}
\label{sec:mv-data:snapshot}
%In traditional snapshot-database transaction
%theory~\cite{bernstein:1987:cc-n-r,papadimitriou:1986:cc-theory},
%database indexes are collections
We assume our logical database consists of data items of the form $(k,
w)$, where $k$~is the key and $w$~is the value of
the data item~\cite{sippu:2001:theory}.
%The value~$w$ can contain the actual data stored with the key (the
%row in the relation), in the case of a \emph{sparse} or
%\emph{primary index}; or a pointer to some storage location, in the
%case of a \emph{dense} or \emph{secondary index}.
The logical model thus has no knowledge of any time dimension.
Databases that follow this data model are called
\emph{snapshot databases}\phantomsection\label{def:snapshot-database}.
As \figref{fig:tx-snapshot} shows, any change to a snapshot database
overwrites the information about previous states.
The \emph{schedule} of a transaction, as shown in the figure, is a list of
actions issued by the transaction.
The format of the schedule is explained in the following paragraphs.
\begin{figure}[htb]
\centering
\input{images/tx-snapshot.tex}
\figcaption{Transaction in a snapshot database}{The transaction has
executed the schedule $b w[1,\alpha,\alpha'] d[2] w[5,\epsilon] c$.}
\label{fig:tx-snapshot}
\end{figure}
In the traditional transaction model, transactions in snapshot
databases consist of user actions that either \emph{write} keys to or
\emph{read} keys from the database.
In addition, a transaction must issue either a \emph{commit} or an
\emph{abort action} to indicate that the transaction is finished.
After an abort action, the read and write actions are undone and the
transaction commits.
A read action of a key $k$ by a transaction $T_i$ is denoted by
$r_i[k]$, and a write action by $w_i[k]$.
The commit and abort actions are denoted by $c_i$ and $a_i$, respectively.
This classical transaction model is inadequate for
modelling key-range queries and next-key queries, and it does not
differentiate between key insertions, updates, and deletes.
Mohan has proposed a more general transaction model alongside his
ARIES recovery algorithm~\cite{mohan:1990:aries-kvl,mohan:1992:aries-im},
which Sippu and Soisalon-Soininen have
further refined~\cite{sippu:2001:theory}.
In the model of Sippu and Soisalon-Soininen, transactions may \emph{retrieve}
the first matching data item $(k, w)$ for which $k~\theta~x$, where $\theta$
is either $>$ or $\geq$ and $x$ is a given search key;
\emph{insert} a new data item $(k, w)$; or
\emph{delete} the data item that has the supplied key $k$.
In this model, key-range queries are performed by consecutive key
retrievals with $\theta$ set to $>$ and $x_{i+1} = k_i$.
Retrieval of a data item $(k, w)$ with $k~\theta~x$ by transaction
$T_i$ is denoted by $r_i[k, \theta x, w]$;
insertion of data item $(k, w)$ by $n_i[k, w]$;
and deletion of data item $(k, w)$ by $d_i[k, w]$.
Transactions must also issue an explicit \emph{begin action} $b_i$
before performing any other operations.
%The \emph{commit action} is denoted by $c_i$.
% For convenience, we will use shorter forms $r_i[k,\theta x]$,
% $n_i[k]$, and $d_i[k]$ for the user actions when such forms
% are sufficient.
Updating an existing data item is often not defined as a separate
action, but rather implemented by first deleting the existing data
item and then inserting the new data item to replace the old one.
For convenience and simplicity, we define a \emph{write action}
to either insert a new data item, if no data item with
the same key already exists; or to first logically delete an existing data
item and then replace the deleted data item by a new one.
A write action of transaction $T_i$ inserting a data item
$(k, w)$ is denoted by $w_i[k,w]$, and a write action replacing an existing
data item $(k, w')$ by $(k, w)$ is denoted by $w_i[k,w',w]$.
We do not equate the term ``update'' with the write action, because a key
deletion is also an update action.
An update action can thus refer to any data-item modifying action.
% This notation can be shorten to $w_i[k]$, when such a form is
% sufficient.
We will follow this combined model with read, write, and delete
actions in this dissertation.
When sufficient, we will use shortened forms $r_i[k, \theta x]$ (or $r_i[k]$
as a shorthand for $r_i[k, =$$k]$) to denote item retrieval, $w_i[k]$ to
denote the write action, and $d_i[k]$ to denote the delete action; possibly
omitting the indices if discussing only a single transaction.
Transactions that only consist of data-item retrieval actions are called
\emph{read-only transactions}, and transactions that include
write and delete actions (as described above) are called \emph{updating
transactions}.
Updating transactions modify the same data set, in the order the
update actions are issued.
Concurrency-control and recovery algorithms are used to maintain the
\abbr{ACID}\phantomsection\label{def:acid} properties of transactions:
atomicity, consistency, isolation, and
durability~\cite{gray:1993:transactionprocessing}.
If a database management system supports the serializable isolation
level, the actions of the transactions will be executed in such an
order that the outcome is the same as if the transactions had
executed in a serial order, each transaction in its entirety in
isolation from the others.
Transaction isolation in databases is achieved by using
a concurrency-control algorithm, such as key-value locking (also known
as key-range
locking~\cite{gray:1993:transactionprocessing,mohan:1990:aries-kvl}).
%two-phase locking, and strict two-phase
%locking~\cite{gray:1993:transactionprocessing,bernstein:1987:cc-n-r}.
The \abbr{ANSI/ISO} \mbox{SQL-92} specification~\cite{ansi:1992:sql}
defines four isolation levels based on the anomalies that can occur
during the execution of concurrent transactions.
The isolation levels are named
(1)~read uncommitted,
(2)~read committed,
(3)~repeatable read, and
(4)~serializable.
The initial definition of the phenomena, however, fails to properly
classify the different isolation levels.
Berenson et~al.\ propose an updated model for the isolation
levels~\cite{berenson:1995:sql-critique}.
The anomalies, or phenomena, that cannot occur in each of the levels
in the updated model are
(1)~dirty writes,
(2)~dirty reads,
(3)~fuzzy reads, and
(4)~phantoms.
The phenomena exclusion is additive, so that transactions operating
on the repeatable read level must not encounter dirty writes, dirty
reads, or fuzzy reads; but may encounter phantoms.
Only the serializable isolation level avoids all the
anomalies, and thus guarantees transaction isolation.
The serializable isolation level is implemented by
key-value locking, for example.
In fact, phantom avoidance cannot be achieved by locking schemes that
only lock those key values that have been read or written.
The key-value locking approach avoids phantoms by locking both the
accessed key and the next key found in
the database~\cite{mohan:1990:aries-kvl,mohan:1992:aries-im}, thus in
effect locking the range of keys from the accessed key to the next
existing key.
Another approach for avoiding phantoms is to use predicate locking,
but this is often unpractical as predicate satisfiability in general
is a known \NP-complete problem and thus the approach can be highly
inefficient~\cite{gray:1993:transactionprocessing}.
A drawback of the serializable concurrency-control algorithms is that
they hinder the overall performance of the database system by forcing
other transactions to wait for access to locked keys.
Database management systems thus allow the user to set some
transactions to run on a lower isolation level to enhance the
performance of the system.
Recovery is another important aspect of database systems.
Applications rely on databases to maintain their data and to make sure
that the data is available and consistent even after a system crash.
The standard recovery algorithm is the ARIES algorithm (\emph{Algorithm
for Recovery and Isolation Exploiting
Semantics}~\cite{mohan:1990:aries-kvl,mohan:1992:aries-im,mohan:1992:aries,mohan:1999:aries-history}),
which was developed by C.~Mohan at IBM\@.
The ARIES algorithm is used in many commercial database systems, such as
\abbr{IBM DB2} and Microsoft SQL Server~\cite{mohan:1999:aries-history}; and
it is taught in database courses in many universities, including ours.
The algorithm maintains data integrity even in the presence of system crashes,
and guarantees that the data set contains all the updates of all committed
transactions, and none of the updates of transactions that were aborted or had
not committed before the system crashed.
In our discussion, we will assume the standard
write-ahead-logging~(\abbr{WAL}\phantomsection\label{def:wal}~\cite{mohan:1992:aries})
and steal-and-no-force page-buffering
policy~\cite{gray:1993:transactionprocessing}.
Under these policies, the buffer manager may steal unfixed dirty
pages from the page buffer and flush them to disk, possibly
causing updates of uncommitted transactions to be written to disk
(\emph{steal}).
When a transaction commits, the log is forced to the disk, but the dirty
pages are not (\emph{no-force}).
Whenever writing a data page to the disk, the log file must first be
written so that all the updates in the disk versions of the data
pages are also present in the disk version of the log, as dictated by
the write-ahead-logging policy.
Furthermore, the log file must be flushed to disk whenever committing
a transaction, at least up to and including the log entry for the
commit action.
These requirements are necessary to ensure that database recovery can
bring the database to a consistent state after a system crash.
%% Different Concepts of Time and History
%%---------------------------------------------------------------------
\section{Different Concepts of Time and History}
\label{sec:mv-data:time}
% Valid time
The concept of time in connection with database theory has multiple
definitions.
The latest consensus glossary of temporal database
concepts~\cite{bohlen:1998:consensus} presents three different
definitions for time: valid time, transaction time, and user-defined
time.
The first of these, \emph{valid time}\phantomsection\label{def:valid-time},
is the time when a recorded fact (data item) is true in the modelled
reality.
Valid time is usually provided by the user, and it may be changed later if
the corresponding real-life fact changes.
For example, events in a calendar database have starting times and
ending times.
These times may naturally be changed if the event is rescheduled to
another time, and the time of the scheduled event is thus an example of valid
time in a database.
It should be noted that the valid time may also be a single time
instant, instead of a time range.
Any change in a valid-time database overwrites the information about the
earlier state of the database, in the same way as in snapshot
databases.
An example of a transaction operating in a valid-time database is shown in
\figref{fig:tx-valid}.
\begin{figure}[htb]
\centering
\input{images/tx-valid.tex}
\figcaption{Transaction in a valid-time database}{The transaction has
executed the schedule $b w[1,\alpha,\alpha'] d[2] w[5,\epsilon] c$.}
\label{fig:tx-valid}
\end{figure}
% Transaction time
The second time concept, \emph{transaction
time}\phantomsection\label{def:tx-time}, is used to record the time when a
data item is current in the database and may be retrieved.
The transaction time of a data item is always an interval which
begins when the item is stored in the database.
Initially, the end of the transaction-time interval is undefined, and
the item is retrievable in all transaction-time instants starting
from the instant the item was inserted into the database.
Transaction time is always provided by the database management
system, and it must be an increasing value so that the ordering
of the transaction-time instants is consistent.
The transaction-time interval of a data item is called a life span, and we
give a formal definition below:
% If the data item is deleted at transaction-time instant $v$, the
% transaction-time interval end-time is set to $v$.
% Thus, the transaction-time interval of a data item is a closed-open
% interval of the form $[v_1, v_2)$, where $v_1$ is the
% transaction-time instant when the item was inserted in the database,
% and $v_2$ is the transaction-time instant when the item was deleted:
\thmskip
\begin{definition}
\label{def:lifespan}
The \emph{life span} of a data item is a closed-open transaction\hyp{}time
interval $\vrmark{v} = [v_1, v_2)$.
The time instant $v_1$ is called the \emph{creation time} of the data item,
and it specifies the transaction-time instant when the item was inserted
to the database.
Similarly, $v_2$ is called the \emph{deletion time} of the data
item, and it specifies the transaction-time instant when the item was
deleted.
If the data item has not been deleted, $v_2 = \infty$.
\end{definition}
\thmskip
In transaction-time databases, all the data items have an associated life
span, and we call them multiversion data items to distinguish
them from the data items of the logical database.
The formal definition of a multiversion data item is given later (in
\defref{def:mv-item}), but from now on, we will use the term \emph{data item}
to mean a multiversion data item.
When we wish to refer to the data items of a snapshot database (i.e., data
items without associated life spans), we will use the term
\emph{snapshot data item}.
In contrast to snapshot and valid-time databases, any change in a
transaction-time database creates a new state and thus does not
overwrite any previous states.
An example of this is shown in \figref{fig:tx-transaction}.
\begin{figure}[htb]
\centering
\input{images/tx-transaction.tex}
\figcaption{Transaction in a transaction-time database}{The
transaction has executed the schedule $b w[1,\alpha,\alpha'] d[2]
w[5,\epsilon] c$.
The arrow on the top shows the latest committed version~\comver.}
\label{fig:tx-transaction}
\end{figure}
% Terminology: versions and histories
The transaction-time instants are often called \emph{database versions}, as
the contents of the database item set may only change between each
transaction-time instant.
Because this term is used by many authors, we will use the term
\emph{version} from now on to mean transaction-time instants.
Moreover, we will use the term \emph{alive} to mean that a data item is part
of the current data set.
This is defined more formally for different versions below:
\thmskip
\begin{definition}
\label{def:alive-at-version}
A data item is \emph{alive at version}~$v$, if the life span of the data
item covers $v$; that is, if $v_1 \leq v < v_2$, where $[v_1, v_2)$ is the
life span of the data item.
Queries that target a version~$v$ only return items that are alive at
version~$v$.
\end{definition}
\thmskip
If $v < v_1$ or $v \geq v_2$, the data item is not part of database
version~$v$.
We call the most recent committed database version the \emph{current version}
of the database.
This version is denoted by \comver\phantomsection\label{def:maxcomver}.
\thmskip
\begin{definition}
\label{def:alive}
A data item is \emph{alive} if it is alive at version~\comver.
\end{definition}
\thmskip
The data items that are alive at version~$v$ form a \emph{database state}
$s_v$.
A transaction-time database thus consists of a series of consecutive states
\smash{$s_0, s_1, \ldots, s_{\comver}$}.
Data items that belong to database state $s_v$ automatically belong
to the next state $s_{v+1}$, unless they are explicitly deleted at
version~$v+1$.
The state \smash{$s_{\comver}$}, identified by the current database
version~\comver, is called the \emph{current state}, and all the states $s_v
: v < \comver$ preceding the current state are called
\emph{historical states}.
% User-defined time, more terminology
Efficient queries within the transaction-time domain or the
valid-time domain require database index structures that
are specifically designed for such queries.
In contrast, the third time concept, the \emph{user-defined time},
is an uninterpreted attribute used to refer to any
application-specific time value.
The user-defined time has no special meaning in the database
management system.
Database systems are thus categorized by their ability to index
valid time and transaction time.
Valid-time databases can be used to index events that occur at some
time in the modelled reality.
Database systems that can be used to access historical states are
called transaction-time databases.
Database systems that encompass both of these time dimensions are
called \emph{bitemporal databases}.
This taxonomy of the time dimension was first presented by Snodgrass
and Ahn~\cite{snodgrass:1985:taxonomy}.
The terms transaction time, valid time, and user-defined time were
introduced to replace the previous, vague terms \emph{physical} and
\emph{logical time}.
Finally, we will use the term \emph{real time} to mean the time of
the real world (often called user time or wall-clock time).
In practice, valid time is often the same as real time, but it can
also be used to model some other time domain.
For example, consider a valid-time database used in a
science-fiction game---the time in the modelled virtual universe
would be the valid-time domain in this case.
% Example contents of a bitemporal database
\begin{figure}[tb]
\begin{center}
\begin{tabular}{ccc@{\hspace{1.5ex}}cc@{\hspace{1.5ex}}c}
\themph{Name} & \themph{Rank} & \multicolumn{2}{c}{\themph{Valid
time}} & \multicolumn{2}{c}{\themph{Transaction time}}\\
& & From & To & From & To\\
\toprule
Merrie & Associate & \isodate\dbdate{1.1.2001} & $\infty$ &
\dbdate{20.12.2000} & \dbdate{5.5.2003}\\
Merrie & Associate & \dbdate{1.1.2001} & \dbdate{1.5.2003} &
\dbdate{5.5.2003} & $\infty$\\
Merrie & Full & \dbdate{1.5.2003} & $\infty$ & \dbdate{5.5.2003} &
$\infty$\\
\midrule
Tom & Full & \dbdate{10.3.2003} & $\infty$ & \dbdate{7.3.2003} &
\dbdate{13.3.2003}\\
Tom & Associate & \dbdate{10.3.2003} & $\infty$ & \dbdate{13.3.2003}
& $\infty$\\
\midrule
Mike & Assistant & \dbdate{22.3.2004} & $\infty$ & \dbdate{24.3.2004}
& \dbdate{17.5.2005}\\
Mike & Assistant & \dbdate{22.3.2004} & \dbdate{16.5.2005} &
\dbdate{17.5.2005} & $\infty$\\
\bottomrule
\end{tabular}
\caption[Valid time and transaction time in a bitemporal database]%
{Examples of the use of valid time and transaction time in a
bitemporal database.}
\label{fig:time-concept-table}
\end{center}
\end{figure}
% Example
An example of the possible contents of a bitemporal database, shown
in \figref{fig:time-concept-table}, clarifies the difference
between the time concepts.
This example has been adapted from the
taxonomy of time by Snodgrass and Ahn~\cite{snodgrass:1985:taxonomy}.
Reading from the contents of the example database, Merrie was
appointed as an associate professor \printdate{1.1.2001}, but this
fact was entered into the database earlier on, on the 20th
December.
Merrie was promoted to full professorship
\printdate{1.5.2003}, the fact of which was retroactively inserted
into the database on the 5th.
Associate professor Tom's employment begun \printdate{10.3.2003},
which was tentatively recorded to the database on
the 7th of March.
On the 13th, it was however noted that he had been accidentally
recorded as being a full professor, and the fact was then corrected.
Mike left the faculty \printdate{16.5.2005}, which was recorded to
the database on the 17th.
Note that only the bitemporal database model records all these
aspects of the evolution of the employment data.
% Data persistence
Yet another aspect of temporal databases is data persistence.
In contrast to other database types, transaction-time databases
and bitemporal databases offer more in terms of
traceability and accountability, because the entire history of database
states is stored and available for querying.
By the terminology of Driscoll et~al.~\cite{driscoll:1989:persistent},
transaction-time databases and bitemporal databases are called \emph{persistent
databases}, whereas snapshot databases and valid-time databases are called
\emph{ephemeral databases}.
Intuitively, when using a transaction-time database to model
changes in reality, a linear version history is often appropriate.
In this model, a new database version is always based on the current
version of the database:
\thmskip
\begin{definition}
\label{def:partially-persistent}
A \emph{partially persistent} transaction-time database incorporates a linear
history, in which new versions of data items are created based on the most
recent version.
\end{definition}
\thmskip
%Database systems that incorporate a linear version history are called
%partially persistent database systems.
%Fully persistent databases, on the other hand, allow new versions to be
%created based on any earlier version, thus creating branching and diverging
%version histories.
Many engineering applications, such as
%This model is required by many engineering applications, such as
version control systems and engineering design databases,
require that new versions can be based on any earlier version:
\thmskip
\begin{definition}
\label{def:fully-persistent}
A \emph{fully persistent} transaction-time database system allows new versions
to be created based on any earlier version, thus enabling the creation of
branching and diverging histories.
\end{definition}
\thmskip
Note that an implication of this definition is that every fully persistent
database system is necessarily also a partially persistent database system.
% Multiversion, links to fully persistent models
Most of the temporal database systems are designed on the partially
persistent transaction-time model.
These database systems are called \emph{multiversion databases} by many
authors~\cite{becker:1996:mvbt,bercken:1996:multiversion,jouini:2007:multiversion,lomet:1989:tsb,varman:1997:multiversion}.
From now on, we will be discussing multiversion databases based on
this definition.
Fully persistent database systems are discussed by other
authors~\cite{landau:1995:historical,salzberg:1995:branched}.
In fully persistent databases, a central challenge is finding out
whether two different versions are in the same version branch in the
version history.
Salzberg and Lomet~\cite{salzberg:1995:branched} suggest using
sequence numbers to identify data-item versions.
Their index structure assumes a small number of different branches.
Each data item is assigned a version and a branch identifier,
and the branch identifier is used to check whether a data item is an
ancestor of another data item.
This check requires at most \Oh{n} operations, where $n$ is the number
of branches in the system.
Landau et~al.~\cite{landau:1995:historical} use a history tree with arrays
that store change information in each node of the history tree.
Searches along a history branch first locate the queried version, and then
reconstruct the database state by combining the change sets of the
located node and its ancestor nodes.
Landau et~al.\ note that an index structure such as the snapshot index
of Tsotras and Kangelaris~\cite{tsotras:1995:snapshot-index}~(see
\secpageref{def:snapshot-index}) could be used to make the state
reconstruction more efficient.
%% Query Types
%%---------------------------------------------------------------------
\section{Query Types}
\label{sec:mv-data:queries}
% Different query types
An important property of temporal databases is the different types
of queries that can be run on them.
Ideally, the user should be able to define any bounds for the
key and the transaction time of the sought data item in a
multiversion database.
The corresponding ideal model allows the user to specify the key, the
transaction time, and the valid time in a bitemporal database.
All of these specified properties can be either intervals or
points in the corresponding dimension.
Most temporal databases, however, restrict the query types that are
possible for the database.
Tsotras et~al.~\cite{tsotras:1998:notation} define a notation for
specifying the query types that can be run on different kinds
of databases.
We use the notation for temporal databases that was adapted by
Salzberg and Tsotras in their temporal database
comparison~\cite{salzberg:1999:comparison}.
The notation is \qtype{key/valid/transaction}, where \qtype{key} means
the key dimension, \qtype{valid} means the valid-time dimension, and
\qtype{transaction} means the transaction-time dimension.
Each part of the query type may be either \qtype{point}, \qtype{range},
$*$, or $-$.
Here, \qtype{point} means that a single point in the corresponding
dimension is queried, \qtype{range} means that a range of the
appropriate dimension is queried, $*$~means that all values in
the dimension are to be included in the result, and $-$~means
that the dimension does not apply to this database.
Conceptually, $*$~is the same as defining a \qtype{range} that
covers all the values in the corresponding dimension.
% Allowed query types for multiversion database
On an ideal multiversion database, the database user should be able to run any
queries of the form \qtype{$x$/$-$/$x'$}, where $x$~and~$x'$ are both either
\qtype{point}, \qtype{range}, or~$*$.
Most multiversion databases, however, restrict the query type to
\qtype{$x$/$-$/point}~\cite{becker:1996:mvbt,kollios:2002:hashing,lomet:2005:immortaldb,lomet:1989:tsb}.
This means that queries may target data items with different keys, but only
within a single version of the database at a time.
In the terms of the consensus glossary~\cite{bohlen:1998:consensus},
the multiversion databases only allow queries that apply a
\emph{transaction-timeslice operator} with an \emph{instant-type} argument.
In our dissertation, we will assume the query type
\qtype{$x$/$-$/point} as the basis of our multiversion data model.
In this model, the user always supplies a single version when
querying on the database.
For discussion on the \qtype{$x$/$-$/$x'$} query type for the MVBT
index structure~(see \secref{sec:tsbmvbt:mvbt}), we
direct the reader to the article on query processing techniques by
van~den Bercken and Seeger~\cite{bercken:1996:multiversion}.
%% Representing Multiversion Data
%%---------------------------------------------------------------------
\section{Representing Multiversion Data}
\label{sec:mv-data:mv-data}
At this point, we can formally define a multiversion database:
\thmskip
\begin{definition}
\label{def:mv-database}
A \emph{multiversion database} is a transaction-time database that is
partially persistent and enables efficient \qtype{$x$/$-$/point} queries
on data items, where $x$ is either \qtype{point}, \qtype{range}, or $*$.
The transaction-time instants of the data items in a multiversion
database are called \emph{database versions}.
The versions of the multiversion database are ordered based on the
commit-time ordering of the transactions.
\end{definition}
\thmskip
% Mapping from transaction-time to real time
The versions of a multiversion database do not necessarily directly map to
any real-time instants.
In fact, using consecutive integer values for database versions has
the advantage that the next version can always be easily determined.
Users of the system, however, often wish to query past states based
on real-time instants.
There are two intuitive approaches for overcoming this discrepancy.
In the first approach, database versions are increasing integer
values, and a separate mapping between versions and real
time is maintained in the database system.
In the other approach, database versions are timestamps that are based
on real time.
In this approach, given a version~$v_1$ that was used by
transaction~$T_{v_1}$, it is not possible to directly determine the next
version~$v_2$ used by the next transaction~$T_{v_2}$, unless all the
different versions are separately stored.
Salzberg and Lomet~\cite{salzberg:1995:branched} suggest using the
former approach.
They use an array that maps the sequence numbers into real-time
instants.
Converting from database versions to real time is
straightforward, as the array can be directly accessed from the right
offset.
To convert back from real-time instants to database versions,
a binary search can be used, as the real-time instants are also
sorted in increasing order.
We will assume that this approach is used in the algorithms presented
in this dissertation.
% Transaction identifier
The ordering of the database versions must be based on the commit-time
ordering of transactions to ensure consistency of the data items between
different versions.
An ordering that is based on the starting time of transactions, for
example, guarantees consistency only if the transaction commit order is
forced to be the same as the starting order.
Because we do not wish to impose such a requirement, the data items created
by a transaction must receive the commit-time version of the transaction,
which may differ from the starting time of the transaction.
Because the commit-time version is not known at the beginning of the
transaction, active transactions must use a separate identifier to identify
their updates.
Following Lomet et~al.~\cite{lomet:2006:transactiontime}, we assume
that a transaction~$T$ is assigned a \emph{transaction identifier},
denoted by \txid{T}\phantomsection\label{def:txid}, when it is
created.
When the transaction~$T$ commits, a \emph{commit-time version},
denoted by \commitver{T}\phantomsection\label{def:commitver}, is assigned to
the transaction.
The commit-time version is the version used by the database to order the data
items, and users of the database system use it to query previous database
states.
% We will use the term commit-time version when we wish to emphasize the
% distinction between versions and transaction identifiers.
In contrast, the transaction identifiers are internal to the
database system and not seen by the users.
% Lazy timestamping and eager timestamping
The discrepancy between the transaction identifiers and the
commit-time versions is a common challenge in all multiversion
databases.
Often the entries that represent the data items need to be initially
stored with the transaction identifier, and later on revisited to
change the transaction identifier into the commit-time version.
A \emph{lazy
timestamping}\phantomsection\label{def:lazy-ts}
scheme~\cite{lomet:2006:transactiontime} initially uses the transaction
identifier to store the versions and lets the transaction commit.
Later, when the entries are accessed, the transaction identifiers are changed
to the commit-time version of the transaction.
In contrast, in an \emph{eager
timestamping}\phantomsection\label{def:eager-ts}
scheme~\cite{lomet:2006:transactiontime}, the temporary identifiers are
changed as soon as the transaction commits.
% Data-update model (single-action tx)
The data-update model adopted by most earlier proposals for indexing
versioned data, including the model adopted for the multiversion
\Btree~(MVBT) of Becker et~al.~\cite{becker:1996:mvbt} and the multiversion
access structure~(MVAS) of Varman and Verma~\cite{varman:1997:multiversion},
assumes that each update action creates a new version of the database, so
that the versions are unique across all versions of all data items.
This model is not adequate for modern transactional applications, because
data items within a single transaction should be assigned the
same version.
Following Salzberg et~al.~\cite{salzberg:2004:framework} and
Lomet et~al.~\cite{lomet:2006:transactiontime}, we assume a
multi-action-transaction approach in which all data-item versions
created by a transaction~$T$ get the same commit-time version
\commitver{T}, unless the transaction updates a data item more than
once, in which case only the final data-item version gets the
commit-time version and remains in the database.
% % Data items and data-item identification
We will now formalize the notion of data items in a multiversion
database.
Remember from \secref{sec:mv-data:snapshot} that the logical
database consists of tuples of the form $(k, w)$, where $k$ is the key of the
data item, and $w$ the value associated with the item.
Because the data items need an additional version attribute---the life
span~\vrmark{v} (\defref{def:lifespan})---the data item model used in
snapshot databases is not sufficient for multiversion databases.
% % The data items in a multiversion database have an associated life
% % span~\vrmark{v} (\defref{def:lifespan}).
% The data items in a multiversion database are thus tuples
% of the form $(k, \vrmark{v}, w)$, where $\vrmark{v} = [v_1, v_2)$ is
% the life span of the data item.
\thmskip
\begin{definition}
\label{def:mv-item}
A \emph{multiversion data item} in a multiversion database is a tuple of the
form $(k, \vrmark{v}, w)$, where $k$ and $w$ correspond to the key and value
of the snapshot data items of the logical database, and $\vrmark{v}$ is the
life span of the multiversion data item.
The life span $\vrmark{v}$ is either $[v_1, \infty)$, if the
data item has not been deleted; or $[v_1, v_2)$, if the data item has
been deleted.
The version~$v_1$ is the commit-time version of the transaction that inserted
the data item, and version~$v_2$ is the commit-time version of the
transaction that deleted the data item.
\end{definition}
\thmskip
\thmskip
\begin{definition}
\label{def:pending-update}
The updates performed by an active updating transaction~$T$ are called
\emph{pending updates}, and they are represented by tuples of the form
$(k, \txid{T}, \updatemark)$, where $k$ is the key, \txid{T} is the
transaction identifier, and $\updatemark$ tells whether the update is
an insertion or a deletion.
In the case of an insertion, $\updatemark = w$, the value of the data item;
and in the case of a deletion, $\updatemark =
\deletemark$\phantomsection\label{def:deletemark}, a special marker value used
to denote item deletion.
\end{definition}
\thmskip
In the logical database model, the pending updates created by a
transaction~$T$ are incorporated into the database immediately when $T$
commits, either by inserting new data items or by updating the existing
data items.
Depending on the implementation of the multiversion database, the
entries that are used to store the pending updates may exist for some time
after the transaction has committed.
In this situation, the data-item entries in the database may differ from the
data items of the logical database, and the database management system must
be prepared to apply the effects of the pending updates on the existing data
items during the execution of database queries.
With this definition, the logical data items can be identified by the
key-life-span pair $(k, \vrmark{v})$, or $(k, [v_1, v_2))$.
However, we can also uniquely identify a data item using only the
creation time of the data item, that is, the version~$v_1$.
In this dissertation, we will use the pair $(k, v_1)$ to uniquely
identify a data item $(k, [v_1, v_2), w)$.
This is possible, because the life spans of data items with the
same key~$k$ cannot overlap.
We also define all other pairs $(k, v)$, with version~$v$ in
the life span of the data item, $v_1 < v < v_2$, to refer to the
same data item.
With this definition, users of the database system may query a data
item using any version that is covered by the life span of the data
item.
%% Read-Only Transactions
%%---------------------------------------------------------------------
\section{Read-Only Transactions}
\label{sec:mv-data:read-only-tx}
% Defining snapshot version
A fully concurrent multiversion database management system allows any
number of read-only and updating transactions to operate
on the database concurrently.
As explained in \secref{sec:mv-data:queries}, the read-only
transactions must explicitly specify which version they want to read;
however, they are only allowed to read versions that were already
committed when the read-only transaction began.
In our model, database users must supply the queried version
at the beginning of the read-only transaction.
All the queries of the transaction then target the same version of
the database.
We call this version, associated with transaction~$T$, the \emph{snapshot
time} of $T$, and denote it by \snapver{T}\phantomsection\label{def:snapver}.
% Definition of read-only transactions on multiversion databases
% Definition of transactions in general is assumed to be known
A \emph{read-only transaction}~$T$ operating on a multiversion
database may contain the following actions:
\begin{itemize}
\setlength{\itemsep}{0pt}
% Begin read-only
\item \action{begin-read-only}$(\text{version}~v)$: begins a new read-only
transaction; this action records the value $\snapver{T}
\leftarrow v$ for the transaction.
At this point, it is also checked if $v \leq \comver$.
If $v > \comver$, the transaction is either aborted or blocked until
$v \leq \comver$.
% Single-key query
\item \action{query}$(\text{key}~k) \to \returnmark$: retrieves
the data item $(k,[v_1,v_2),w)$ that covers version \snapver{T},
that is, $v_1 \leq \snapver{T} < v_2$.
Returns $\returnmark = w$ if such a data item exists, otherwise
returns $\returnmark = \nullmark$.
% Range query
\item \action{range-query}$(\text{range}~[k_1, k_2)) \to
\returnset$: this action retrieves the set of data items
$(k_i,[v_i^1,v_i^2),w_i)$ with $k_1 \leq k_i < k_2$ and $v_i^1 \leq
\snapver{T} < v_i^2$.
Returns the set \returnset\ of snapshot data items $(k_i, w_i)$
representing the multiversion data items alive at the queried version
\snapver{T}.
% Commit read-only
\item \action{commit-read-only}: commits the transaction~$T$ by removing
it from the system.
\end{itemize}
A read-only transaction that is aborted by the system may directly be removed
from the database system; no undo actions are necessary.
Read-only transactions in general do not involve any logging, because they
perform no updates to the database.
\begin{figure}
\begin{center}
% \begin{enumerate}
% \setlength{\itemsep}{0pt}
% \item $T_1$ inserts entries $(1, w_1)$, $(2, w_2)$, $(3,
% w_3)$; commits with version $v_1$.
% \item $T_2$ inserts entries $(3, w_3')$, $(4, w_4)$; commits
% with version $v_2$.
% \item $T_3$ inserts entries $(1, w_1')$, $(5, w_5)$; commits
% with version $v_3$.
% \end{enumerate}
\input{images/transaction-db-example}
\figcaption{An example history of a multiversion database}%
{The figure shows the contents of a database created by three transactions
with commit-time versions $v_1$, $v_2$, and $v_3$.
The arrow on the top shows the latest committed version $\comver =
v_3$.}
\label{fig:mv-tx-workload}
\end{center}
\end{figure}
An example of a read-only transaction, operating on the example
multiversion database shown in \figref{fig:mv-tx-workload}, is given below:
\begin{enumerate}
\setlength{\itemsep}{0pt}
\item \action{begin-read-only}$(v_2)$
\item \action{query}$(1) \to w_1$
\item \action{query}$(3) \to w_3'$
\item \action{query}$(5) \to \nullmark$
\item \action{range-query}$([0,10)) \to \{ (1,w_1), (2,w_2), (3,w_3'),
(4,w_4) \}$
\item \action{commit-read-only}
\end{enumerate}
%% Updating Transactions
%%---------------------------------------------------------------------
\section{Updating Transactions}
\label{sec:mv-data:updating-tx}
All updating transactions operating on multiversion databases must always
operate on live data items.
Standard multiversion concurrency-control algorithms are used to
maintain data consistency, so that concurrent updating transactions
do not update the same data items.
As discussed in \secref{sec:mv-data:mv-data}, data items created by and
existing data items deleted by an updating transaction~$T$ must eventually
receive the commit-time version \commitver{T} as the version associated with
the data item.
For a write action, the new data item must have \commitver{T} as its
creation time, and for a delete action, the deleted data
item must have \commitver{T} as its deletion time (see
Definitions~\ref{def:lifespan} and~\ref{def:mv-item}).
Because the commit-time version of a transaction~$T$ is not known during the
execution of~$T$, the updates of $T$ are represented by pending updates (see
\defref{def:pending-update}) that use the transaction identifier~\txid{T} to
separate the updates of different transaction.
When the transaction~$T$ commits, the database management system
performs a \emph{release-version} action that incorporates the
pending updates into the database, using the commit-time version
\commitver{T}.
The committed version \commitver{T} is visible to other transactions after
the release-version operation has completed.
An \emph{updating transaction}~$T$ operating on a multiversion
database may contain the following actions:
\begin{itemize}
\setlength{\itemsep}{0pt}
% Begin updating transaction
\item \action{begin-update}: begins a new updating transaction; this
action records the snapshot version~$\snapver{T} \leftarrow
\comver$, and creates the transaction identifier~$\txid{T}
\leftarrow$ \emph{new identifier}.
% Single-key query
\item \action{query}$($key~$k) \to \returnmark$: if the
transaction~$T$ has already performed an update action on the key~$k$, so
that a pending update $(k, \txid{T}, \updatemark)$ exists, this action
returns either $\returnmark = \updatemark$ if the pending update was a
write action, or $\returnmark = \nullmark$ if the pending update was a
deletion (i.e., $\updatemark = \deletemark$).
If such a pending update does not exist, the action retrieves
the data item $(k,[v_1,v_2),w)$ such that $v_1 \leq \snapver{T}
< v_2$.
If such an item is found, the action returns $\returnmark = w$;
otherwise the action returns $\returnmark = \nullmark$.
% Range-query
\item \action{range-query}$($range~$[k_1, k_2)) \to
\returnset$: this action retrieves all the pending updates $(k_i,
\txid{T},\updatemark_i)$ and all data items
$(k_i,[v_i^1,v_i^2),w_i)$ such that $k_1 \leq k_i < k_2$ and $v_i^1 \leq
\snapver{T} < v_i^2$.
For each key~$k_i$, if a pending update for that key is found,
the return set \returnset\ contains the data item $(k_i, \updatemark_i$),
if $\updatemark_i \neq \deletemark$.
For each key $k_i$, such that there is no pending update for that
key but there is a data item with key $k_i$, the return
set \returnset\ contains the snapshot data item $(k_i, w_i)$.
A pending update that is a deletion thus prevents the deleted key from
appearing in the result set.
% Write action
\item \action{write}$($key~$k,$ data~$w)$: a forward-rolling action
that either inserts a pending update $(k, \txid{T}, w)$ into
the database, if no earlier pending update of the form $(k, \txid{T},
\updatemark)$ exists; or replaces the existing pending update $(k,
\txid{T}, \updatemark)$ with $(k, \txid{T}, w)$.
This action writes a redo-undo log record that contains sufficient
information for redoing or undoing the write action.
% Delete action
\item \action{delete}$($key~$k)$: this action logically deletes an existing
data item.
If the multiversion database contains a
pending update of the form $(k,\txid{T},\updatemark)$, $\updatemark
\neq \deletemark$, this action replaces it with
$(k,\txid{T},\deletemark)$.
If the database does not contain any pending update
$(k,\txid{T},\updatemark)$, then this action is legal if the
multiversion database contains a live data item $(k, [v_1,\infty),
w)$.
In this case, the action inserts a pending update $(k,
\txid{T}, \deletemark)$.
The action finishes by writing a redo-undo log record that contains
sufficient information for redoing or undoing the deletion.
% Set savepoint
\item \action{set-savepoint} $\to p$: creates a savepoint
by generating a savepoint identifier~$p$ and by returning it to the
transaction.
% Rollback to savepoint
\item \action{rollback-to-savepoint}$($savepoint~$p)$: rolls the transaction
back to a preset savepoint~$p$.
This action is followed by the \action{undo-write} and
\action{undo-delete} actions for the \action{write} and
\action{delete} actions done after setting savepoint~$p$, executed
in the reverse order.
% Commit
\item \action{commit-update}: commits an active updating transaction.
This action generates a commit-time version and assigns it to the
transaction by setting $\commitver{T} \leftarrow$ \emph{new commit-time
version};
updates the current-version counter $\comver \leftarrow
\commitver{T}$; and