The Checkpoint Mechanism

This note documents the design for the checkpoint data structures and algorithms.

1. Background

The CapROS checkpoint design is derived from the EROS checkpoint design. The EROS design differs from the KeyKOS, but many of the ideas of that design still pertain. The KeyKOS design is described in The Checkpoint Mechanism in KeyKOS. The EROS design is described in The Design Evolution of the EROS Single-Level Store. The differences between CapROS and the other systems are summarized below. You can skip this section if you're not interested in the history.

2. Definitions

Demarcation event
A point in time, as of which the state of all the objects in the system will be preserved. When the system restarts, it is initialized to the state at a demarcation event.
Generation
The set of objects that have been changed in the interval from one demarcation event to the next.
Stabilization event
After a demarcation event is declared, stabilization occurs when all the state of that generation (objects, processes, directories, generation header, and checkpoint header) has been saved to the log. Writing the checkpoint header is the last act before stabilization.
Restart generation
The most recent stabilized generation. This generation will be selected for a restart.
Working generation
The generation after the restart generation. It is not yet stabilized.
Active checkpoint
From the time of a demarcation event until the next stabilization event has occurred, we say that a checkpoint is active. During this time the system is working to stabilize the working generation.
Next generation
While a checkpoint is active, any objects that are changed belong to the next generation (the one after the working generation). The next generation exists only in RAM, not on disk.
Unmigrated generation
A generation that has not been fully migrated. It may be partially migrated.
Retired generation
A generation that is identified as migrated in the most recent checkpoint header. (Additional generations may have been migrated since that header was written; they are migrated but not retired.) The system keeps a directory of objects in the retired generation(s), so that it can fetch objects from there, taking advantage of the disk locality of the log. On a restart, retired generations must not be used because they may have been overwritten with unstabilized data.
Log
The area of disk reserved for saving generations. The log is divided into the header area and the main log.
Header Area
The first two frames of the log, used to hold checkpoint headers.
Main log
The log, excluding the header area. Generation headers and other generation data are written to the main log.
Main log cursor
Generation data are written sequentially in a circle. The main log cursor is the point in the main log where data will next be written.
Limit percent
The maximum amount of the main log which one generation may use.

3. On-Disk Log Area

The CapROS log area is made up of ranges of disk page frames that are sequentially numbered beginning at zero. These are referred to as log ranges. Just as objects in object ranges are referred to with object identifiers (oids), objects in the log are referred to with log identifiers (lids). A lid is constructed by taking the 56 bit frame number and concatenating an 8-bit object index. (8 is equal to log2(EROS_OBJECTS_PER_FRAME).)

Log ranges may reside on multiple disks, but the implementation may require that sequential numbering be preserved. This requirement presents some difficulties for log area rearrangement which may, in due course, be solved with suitable user-level applications.

A generation that has been completely written to the disk is said to be stabilized. A stabilized generation has an associated on-disk directory that describes where all of the objects in that generation reside. The last step in stabilizing a generation is writing the associated checkpoint header.

3.1 The Checkpoint Header

Frames zero and one of the log are the header area. They hold the two most recent checkpoint headers. There are two headers so that if there is a disk error writing a header, there will still be another valid header that can be used for restart.

The checkpoint header contains an identifying header, containing the generation number of the restart generation. (This structure is called CkptRoot in the code.) The header with the highest generation number is the most recent header. Only the most recent valid header is used for restart.

The checkpoint header also contains the number of unmigrated generations, and an array of LIDs of the generation headers of all the unmigrated generations. There is a limit, currently 20, on the number of unmigrated generations.

The last item in the generation header is a field used to validate the correctness of the previous data. This field is included to detect frames which were only partially written due to hardware failure. We needed this field on the S/370 because certain write failures (e.g. channel failure) caused the disk controller to fill out the block with the last byte transferred from the channel, and then write a correct hardware checking block.

3.2 The Generation Header

The root of a generation directory is the generation header. The generation header frame begins with a header structure describing the generation.

The header structure is:

The object directory needs to be kept entirely in memory even past when a checkpoint is stabilized. It makes sense to write the entire object directory into sequential log locations at the end of the checkpoint.

The process directory must be saved as of the time of the demarcation event. It makes sense to write it into contigous log locations immediately after the pages and log pots that were written before the demarcation event.

Contiguous log locations allow easy scheduling of writes and reads during checkpoint and restart, as all the locations of the directories are known initially.

Implementor note: The old 32 bit lid was referred to as a LogLoc in the code.

Each of the fields is explained below.

versionNumber

The version number of the generation header.

sequenceNumber

The generation sequence number allows CapROS to determine which generation is most recent.

firstLogLoc, lastLogLoc

The firstLogLoc and lastLogLoc fields allow the restart code to determine what part of the main log contains this generation. Note that firstLogLoc will be higher than lastLogLoc when this generation wraps from the end of the main log to the beginning.

The restart code checks each lid when rebuilding the in-core object and process directories. If the most recent version of all the unmigrated objects, the restart generation process directory frames, and the object directory frames for all the unmigrated generations are not all mounted, the system will not restart until they are mounted.

firstProcessDirFrame, nProcessDirFrames

nProcessDirFrames is the number of process directory frames. If the entire process directory is in the checkpoint header, it will be zero. The firstProcessDirFrame is the LID of the first frame of the process directory.

firstObjectDirFrame, nObjectDirFrames

nObjectDirFrames is the number of object directory frames. If the entire object directory is in the checkpoint header, it will be zero. The firstObjectDirFrame is the LID of the first frame of the object directory.

nProcessDescriptors

Part of the process directory may be held in the generation header frame. nProcessDescriptors is the number of process descriptors in that list. They immediately follow the header.

nObjectDescriptors

Similarly, part of the object directory may be held in the generation header frame. nObjectDescriptors is the number of object descriptors in the header. They immediately follow any process descriptors that may be in that header.

3.3 The Process Directory

A process directory frame consists of a single word describing the number of following entries, and then some number of process descriptors. With 8 byte OIDs and 4 byte ObCounts, a 4K process directory frame can describe 314 processes.

An entry in the process directory corresponds to a process that was running at the time of the demarcation event. The process entries from the restart generation are reloaded as part of the system startup procedure.

The oid field identifies the root node of the process. actHazard identifies some situations requiring special handling:

  • When a process that invoked the Sleep capability is awakened due to expiration of the time, but the process is not in memory at that time, the value of actHaz_WakeOK in the actHazard field indicates that we need to give it the return code RC_OK.
  • Any processes that were sleeping on the Sleep capability at the time of checkpoint will be awakened on restart with the return code RC_capros_key_Restart, because we do not store the wakeup time in the process directory. The value of actHaz_WakeRestart in the actHazard field indicates this situation.
  • On restart, we will restart any persistent processes for which there was a Resume capability in a non-persistent node. (See details.) The process directory contains entries for these processes too. We store in callCount the call count from the Resume capability, because it might be stale, and we are not in a position to check that when we take a checkpoint. The value of actHaz_WakeResume in the actHazard field indicates this situation.

    Note that on restart CapROS loads the process list from the most recent generation header only.

    3.4 The Object Directory

    The object directory frames hold ObjectDescriptors. Each object descriptor holds an object identifier, the object's allocation and call counts, the type of the object, and the lid at which the object can be found. With 8 byte OIDs, 4 byte ObCounts, 8 byte LID, and 1 byte types, each 4kB frame can hold a description of 163 objects (assuming unaligned/packed storage and allowing for a count of entries at the start of the frame).

    For nodes, the callCount field contains the call count. For other objects, the callCount field is unused.

    The allocCountUsed and callCountUsed bits are not stored on disk. They are assumed to be one for objects on disk.

    If the lid field of the directory is zero, then the corresponding object is null: either a zero-filled page or a node filled with void keys and zero nodeData. Which one can be determined by the value of the type field.

    If the lid field is nonzero, the corresponding log frame contains either page data or a "log pot." A log pot is simply a page-sized cluster of nodes in the main log. This is why the lid value is concatenated with an object index: the index indicates which entry in the log pot contains the relevant object. (The object index of a LID, like that of an OID, is 8 bits to allow for future types that may be small enough to fit 256 in one frame.)

    3.5 Flash Memory

    CapROS will support solid-state disks that use flash memory. These devices have a finite number of erase-write cycles (up to 1,000,000 for NAND flash). When the device is written, a block of memory (as large as 256KB) is first erased, then written.

    The header area is a "hot spot" on the disk because it is written once on every checkpoint. If we assume (pessimistically) one checkpoint every ten seconds, and (optimistically) 1,000,000 write cycles, the header will wear out in only 115 days. An earlier design had a large header area in an attempt to spread out writes, but the large erase block size makes that approach impractical.

    Flash memory devices intended to replace disks generally have some mechanism for wear leveling. We will rely on that mechanism to handle the header area "hot spot". The migrator will be written to avoid making node pots and tag pots "hot". The main log should be sized to hold several generations to avoid a hot spot there.

    4. Theory of Operation

    In overview, checkpointing operates as follows:

    1. Following the restart generation in the log is the working generation. Objects that are cleaned from RAM are written sequentially to the working generation area.
    2. After a while, a demarcation event occurs. Reasons for declaring a demarcation event are discussed below. The system will preserve the state of all dirty objects as of this time.
    3. Those dirty objects are cleaned into the working area, and the directories are written.
    4. The generation header of the working generation is written into the working area.
    5. A checkpoint header is written to the header area. The successful completion of this write is a stabilization event, and the generation is said to be stabilized.
    6. The working generation becomes the restart generation, and a new working generation comes into existence. The cycle then repeats.

    The diagram above shows the layout of the main log. The main log cursor (the next location to be written) is at the end of the bar. Note that the log is circular, so the left end and the right end are the same location. The retired generations remain in the main log until their storage is reused. There may be free space if the directory entries that described that space have been freed. The reserved space may be bigger or smaller than the free space.

    Before a persistent object in RAM is dirtied, space is reserved for it in the main log. See below for details of the reservation algorithm.

    If the checkpoint interval (typically 5 minutes) has passed since the last demarcation event, a new demarcation event is declared. This ensures that the system can be restarted with data that is no more than 5 minutes old. See below for other reasons to declare a demarcation event.

    At the demarcation event, dirty objects in RAM are marked "copy on write", the Activity structure is saved as a process directory, the next generation is started, and execution resumes. At this time we say a checkpoint is active. Then, the process directory is written to the log, and then all of the dirty objects in the working generation are written to the log.

    After the last dirty object in the working generation has been written to disk, the object directory is written. Then a new generation header is written; that generation will become the restart generation.

    Then a checkpoint header is written to the header area. Alternate checkpoint headers are written to alternate locations in the header area, so the older header is overwritten, and there is always a good header.

    5. Migration

    Migration is the process of copying objects from the log to their home locations. Migration runs whenever there are any unmigrated stabilized generations, migrating the oldest such generation first. The priority of migration is adjusted to balance the desire to avoid contending with needed disk I/O, with the need to progress fast enough to avoid having to stall processes (as described below).

    Conceptually, during the migration phase, all the objects in a stable generation are copied from the main log to their home locations on the disk. As an important optimization, objects that are not current need not be migrated, as described in The Checkpoint Mechanism in KeyKOS.

    Migration is optimized by a non-persistent process external to the kernel. This process may chose to optimize disk arm motion in the home ranges if the disk is on rotating media; or it may minimize rewrites if the disk is on flash memory. Because it is non-persistent, the external migrator will never be stalled attempting to dirty an object.

    As the log cursor and the reserved space advance around the circular main log, it may happen that all the space in retired generations is reserved. In that case, if an attempt is made to dirty another object and increase the reservation, the dirtying process must be stalled until space is freed up. The priority of migration will be adjusted to reduce the likelihood of this happening.

    6. Log Management

    The main log is a circular buffer of frames, each of which is one page in size. A frame may hold a data page, a log pot of nodes, a process directory frame, an object directory frame, or a generation header.

    6.1 Frame Reservation

    Before any persistent object can be dirtied, space for it must be reserved in the main log. The system keeps track of the number of dirty persistent objects of each type. The object reservation is the most amount of space in the main log it could take to save those objects, taking into account:

    • Multiple nodes can fit in one log pot.
    • There may be a partially assembled log pot that needs to be saved.
    • Space is needed for object directories.
    • Space is needed for process directories. The number of processes is capped by the number of Activity structures allocated in the kernel, so we will reserve space for the maximum number of processes.
    • One frame is needed for the generation header.

    The reservation is important because before dirtying an object we need to make sure we will be able to clean it later. When we clean an object, we can't overwrite any unretired generations, because that data will be needed if the system should restart.

    We do not actually decide where to store the newly dirtied object when space for it is reserved; we decide that when the object is actually written. Indeed, the object might be null (zero) at the time it is written, and not take any space in the log at all, so the reservation is an upper bound on the space needed.

    While a checkpoint is active, there are some dirty objects in the working generation, and other dirty objects in the next generation. We track the reservation for each generation separately. We shall see below how these are used.

    6.2 Frame Allocation

    At some later time, we actually write the dirty object to the disk. At this point we actually allocate the log frame that the object or log pot will go to. At the same time, the object becomes clean, so we reduce the count of dirty objects, which may reduce the reservation. This is how reserved space is consumed.

    An object can be dirtied (reserving space), written out (consuming space), and subsequently redirtied. When this occurs, there are two possible design options:

    • Rewrite the object to its previously assigned location.
    • Assign a new location to the object.

    The first option has several disadvantages:

    • In the case of a node, it requires two I/O operations: reading the log pot, modifying it, and then rewriting it.
    • On rotating media, it may require moving the head away from the current main log cursor.
    • On flash memory, it could create extra wear on that log frame.

    Therefore we adopt the second policy. Even if a frame in the log does not contain an active entry, we do not reuse it. As a result, we can use a simple cursor (the main log cursor) to allocate space in the log.

    A great many of the objects written to the log will prove to be null objects. For our purposes, a null object is any of the following:

    • A zero-filled data page.
    • A node filled with void capabilities and zero nodeData.

    The reason these occur with such frequency is that returning storage to the space bank causes that storage to be zeroed. The reason to do this is that objects returned to the space bank are generally dirty, and handling them this way obviates the need to rewrite now-dead data back to the log. By storing them as null objects, the write bandwidth requirements can be reduced significantly.

    The advantage to handling null objects specially is that they do not occupy any object storage in the log. The directory entry for a null object suffices to indicate that the object is null.

    6.3 Restart Sequence

    When CapROS is booted, it normally looks for a checkpoint on disk. (See Booting for other boot scenarios.) CapROS disk volumes (partitions) are identified with an IPLSysId that is unique for each system. This allows one CapROS system to format disks for another CapROS system without confusing them.

    The restart sequence finds the configured disks, using a combination of configuration data and probing. It mounts all divisions of all volumes with the correct IPLSysId. Then it reads the checkpoint headers at LIDs 0 and 1 and determines which one is the most recent valid checkpoint header (the one with the highest generation number). This header identifies the restart generation and all the unmigrated generations.

    If a drive failure occurred while writing a header (detected by bad CRC while reading the corresponding frame or by the checksum at the end of the header failing to verify), that header is discarded.

    Then all the unmigrated generation headers are read. The number of unmigrated generations is given in the checkpoint header. Those generation headers are used to reload the object directory. Unmigrated generations now need to be migrated (perhaps for the second time). All the older generations are both migrated and retired and are available for allocation and reservation. They must not be migrated or used, because they may have been overwritten with newer, unstabilized data.

    The process directory is reloaded from the restart generation. The main log cursor is initialized to the frame following the last frame of the restart generation.

    For simplicity, we consider that no part of an unmigrated generation is available for allocation, even though some frames might be migrated or might not contain current data.

    To avoid deadlock, we must make sure that the disk handler and migrator can always make progress. The design of the disk handler and perhaps the migrator are such that they allocate some storage dynamically. When the system is booted, we must wait until these components have signaled that they have finished allocating dynamic storage, before allowing any persistent objects to be dirtied.

    6.4 Algorithms

    Here are some algorithms that might not be obvious.

    6.4.1 Space Reservation

    Just before a persistent object is dirtied, the storage reservation algorithm is executed. The algorithm goes like this:

      If migrator and disk handler are not initialized:
          Process stalls waiting for those to be initialized.
      If a checkpoint is not active:
        Calculate tentative working reservation assuming object is dirtied.
        If size of working area + tentative reservation > size of main log * Limit Percent
           OR tentative reservation > size of retired generations:
          Declare a demarcation event (and a checkpoint is now active).
      If a checkpoint is active:
        Calculate tentative reservation for next generation assuming object is dirtied.
        If size of working area + working reservation + tentative next reservation
             > size of migrated generations(retired or not)
           OR the number of unmigrated generations (including the working generation)
              >= the maximum number of generations permitted in the directory
          Process cannot dirty now. Process stalls waiting for migration to progress.
        Else dirty the object and adjust next reservation.
      Else dirty the object and adjust working reservation.
      	      

    6.4.2 Object Write

    When we write an object to the disk, we do the following:

      If object is null:
          update in-core directory entry accordingly
      Else
          write object to disk and update in-core directory accordingly
      Mark object clean and adjust the corresponding reservation downwards.
      	      

    6.4.3 Directory Write

    During checkpoint stabilization, we must write the process directory and the object directory. The space allocated to each in the header is variable. Since the process directory is written before the dirty objects, we treat it first.

    Let S be the amount of space in the generation header frame after the header.

    First, if the total number of descriptors in the process directory fits in S, then just put them directly in the header frame and reduce S by the space used.

    Otherwise, put the all the descriptors that will fit in the header frame and set S to zero.

    After all the objects have been written, put as many object descriptors as will fit in the remaining space in the header frame, and write the rest into object directory frames at the end of the checkpoint.

    7. The In-Core Directory

    The in-core directory is the data structure that records the location of objects in the log. The in-core directory is simply called the directory when context makes it clear. It is used for three purposes:

    • When an object must be paged in from disk, the directory tells whether the current version is in the log and if so where.
    • When a generation is migrated, the directory tells the migrator what objects are in that generation and where they are.
    • When a page is journaled, the directory tells the pager the location of the page that would be current if a restart occurred (the restart version), so that version can be updated.

    The migrator only needs to migrate the current version of an object, so the directory only needs to keep information about the current version of objects and (for pages) the restart version.

    Note that the migrator will have to be able to find entries for the generation being migrated, so a compressed generation number will need to be saved for each directory entry or they can be linked in a list headed in the core generation structure.

    A number of data structures are possible for the directory. Here is one. Each directory entry is a node in a red-black tree with the oid as the key. Each entry has information about the object: the allocation and call counts, the alloc and call count used bits, and the type. Each entry also has the LID and generation number of the current version. To find entries by generation, the entry is linked on a doubly-linked chain for its generation. If the object is a page, the entry may also have the LID and generation number of the restart version. (We do not need to link into the chain for the generation of the restart version.)

    When a checkpoint is stabilized, the current version becomes also the restart version; the old restart version is no longer the restart version. Rather than update affected directory entries at this time, we just recognize this situation from the generation numbers, and update the entry whenever we happen to visit it.

    The following is another, older design:

      struct CoreDirectory {
        GenNum migratorSequenceNumber;
        CoreDirent *oidTree;
      
      };
      	      
    migratorSequenceNumber

    The last completely migrated generation sequence number.

    oidTree

    The top of the red-black tree that makes up the in-core directory.

    7.1 Directory Management

    CapROS allocates a fixed, tunable amount of space for directory entries at boot time. The restart code loads the directory with information for the restart generation and all the other unmigrated generations. (It must not load directory information from migrated generations, because those generations may have been freed and overwritten after the restart checkpoint was stabilized.)

    A directory entry is allocated when a dirty persistent object is cleaned, that is, written to the log. (If that object was already in the log in the same generation, we will get its old entry back, but we do not count on this in managing directory entry allocation.) Thus the number of dirty objects is an upper bound on the number of directory entries we will need to take a checkpoint.

    Before we dirty an object, we check that the number of free directory entries, plus the number of directory entries used by retired generations (these entries could easily be freed), is greater than or equal to the soft limit. The soft limit is a tunable parameter that may be the number of objects that can fit in memory, or may be larger. If we aren't going to exceed the soft limit, we go ahead and dirty the object.

    If we would exceed the soft limit, and a checkpoint isn't in progress, we declare a demarcation event. If the number of free directory entries plus the number that could be freed in retired generations is greater than or equal to the number of dirty objects (including the object about to be dirtied), then we go ahead and dirty the object. Otherwise, the dirtying of the object must wait until this comparison can succeed.

    At the time an object is cleaned, if there are no free directory entries, we free the oldest retired generation, freeing all its directory entries if any, until there is a free directory entry. (The above algorithm ensures that we can always get a free entry.)

    8. The Core Generation Structure

    I (wsf) think this structure is overkill, but it is useful documentation of what the coder will possibily need. This whole section is old and should be updated or deleted when the code is written.

    Each generation has an associated in-core structure. This structure maintains book-keeping information concerning the generation and contains the root pointer for the in-core generation directory.

      struct CoreGeneration {
        bool canReclaim;
        CoreDirent *oidTree;
        uint32_t nCoreDirent;
        uint32_t nDirent;
        uint32_t nReservedFrames;
        uint32_t nAllocatedFrames;
      
        uint32_t nDirPage;
        uint32_t nLogPot;
      
        uint32_t nZeroPage;
        uint32_t nZeroNode;
        uint32_t nPage;
        uint32_t nNode;
        uint32_t nProcess;
      
        // Record of storage we have released:
        uint32_t nReleasedNode;
        
      #ifdef PARANOID_CKPT
        uint32_t nAllocDirPage;
        uint32_t nAllocPage;
        uint32_t nAllocLogPot;
      #endif
        
        curLogPot;
        uint32_t   nNodesInLogPot;
      };
      	      

    For each type of object (page, zero page, node, zero node, process), the core generation structure records how many of these objects have been allocated in this generation. An object is allocated the first time it is written to the checkpoint area. Nodes are

    The various fields are:

    canReclaim

    Indicates whether the objects in this checkpoint generation have been migrated, and can therefore be reclaimed.

    oidTree

    Root pointer to a red-black tree of directory entries for each object in this generation.

    nCoreDirent

    Number of entries in the core directory tree for this generation.

    nDirent

    Number of on-disk directory entries that have been reserved for this generation. Should always be equal to nCoreDirent.

    nReservedFrames

    Number of disk frames that have been reserved for this generation.

    nAllocatedFrames

    Number of disk frames that have been allocated for this generation. Should always be less than or equal to nReservedFrames.

    nDirPage, nPage, nLogPot,

    Number of directory pages, pages, and log pots (respectively) that have been reserved in this generation. These should sum to nReservedFrames.

      struct CoreDirent {
        CoreDirent *left;
        CoreDirent *right;
        CoreDirent *parent;
      
        OID         oid;
        ObCount     count;
        LogLoc      logLoc : 24;
        uint8_t        type;
      
      
        enum { red = 1, black = 0 };
        
        uint8_t        color;
      };
      	      

    Appendix A - Another approach to flash memory support

    The design selected for CapROS above tries to avoid "hot spots" on disk, but still has "warm spots" in the checkpoint headers and possibly in the home locations. By having a large number of checkpoint headers, the heat in that part of the flash memory is reduced to a level which should meet the device life requirements.

    The following design is based on the goal of never re-writing a location in the log until all other locations have been written (exactly once). It was rejected because:

    1. It uses cryptography. Using cryptography introduces the need for a good source of random numbers, and builds a dependence on the quality of cryptographic algorithms which is not present in CapROS for any other reason.
    2. It is vulnerable to any kind write error occuring when, after the log has wraped back to the beginning, the first checkpoint header is rewritten. An error rewriting the first checkpoint header will prevent restart from any checkpoint (because you won't be able to find them). This issue could be addressed by replicating the first checkpoint header (in lid=0 and lid=1).

    The design

    The log range page with LogDI (lid) 0 is the first checkpoint header. When the disk is formated, it is initialized to all zeroes. LogID 0 is always a checkpoint header, and acts as the root of all the checkpoint headers on the disk. All the log ranges on the system make up a single logical checkpoint area which is written sequentially in order of increasing LogIDs. When a checkpoint reaches the end of the available log ranges, it wraps around to the beginning, starting with lid=1.

    Each checkpoint header holds pointers to information about the checkpoint, more-or-less following those described in in the chosen design. In addition they hold forward and back pointers to checkpoint headers for previous and subsequent checkpoints as follows:

      HMACValue checksum;
      LID next;
      LID previous;
      LID migratedSeqNum;
      HMACKey nextMac;
      HMACKey previousMac;
      64bitint serialNumber;
        

    The last operation in a checkpoint is writing the checkpoint header which is usually the first block of the checkpoint. (If the checkpoint wraps the log ranges, lid=0 will be the header for that checkpoint, and the first block of the checkpoint will either contain checkpoint data or remain unused.) Note that while we can calculate the lid of the next checkpoint, and generate a HMACKey for it, the header for that checkpoint will probably not contain a checkpoint header. It will probably be the contents of some old page, a log pot of nodes or checkpoint directory data. We use the HMAC to separate a checkpoint header from the other cases.

    The restart strategy is:

      Read lid=0
      If that frame is all zeroes, there are no checkpoints. Stop.
      Do
        Read the next checkpoint header.
        If the HMAC does not check or the serialNumber is lower than
                the previous serialNumber break;
      Enddo
        

    We how have the most recent checkpoint, and can restart from it. We can use the back pointers to find the checkpoint(s) that need migration.

    Additional Thought

    We don't actually need "home ranges" at all. KeyKOS had the concept of "limbo", where, if when it came time to migrate a page or node, its home range wasn't mounted, the kernel simply marked it "dirty" and let it be written into the next checkpoint. This kind of logic could be used for all pages and nodes in the system, eliminating the home locations as hot spots.


    CapROS home
    SourceForge.net Logo Copyright 1998 by Jonathan Shapiro. Copyright 2008, 2009 by Strawberry Development Group. All rights reserved. For terms of redistribution, see the GNU General Public License This material is based upon work supported by the US Defense Advanced Research Projects Agency under Contract No. W31P4Q-07-C-0070. Approved for public release, distribution unlimited.