Disk Formatting and Layout

This note describes how the disk formatting logic works in the CapROS system, and how CapROS volumes are structured. Both disk partitioning and range creation/deletion are covered.

Note that the current implementation does not provide for live creation of new volumes, so the discussion of partition creation should be taken with suitable skepticism; it snapshots what we intend to do when the online disk partitioning/formatting utility is written.

1 Disks and Partitions

Disks can be divided into multiple, independent chunks known as partitions. Each partition constitutes a logical disk volume. CapROS, for the most part, deals in terms of disk volumes. CapROS volumes, in turn, are divided into ranges, described below.

Over the life of a disk, it may prove convenient to add or remove partitions/volumes, and to adjust the content of existing volumes.

1.1 Adding New Partitions

Adding new volumes (paritions) must be done carefully for the sake of consistency, but it isn't especially difficult to do. The principle challenge is to defend against disk failure during format. The creation of new CapROS partitions proceeds in steps:

  1. Create a partition of non-CapROS type. This ensures that if a failure occurs while partition format is in progress the partition will not be seen as a CapROS partition.

  2. Obtain a raw disk range key to the new partition. Use it to write a proper CapROS-formatted volume image to the new partition. At a minimum, this would include a boot block and an empty pair of range tables.

  3. Delete the partition, and create a new partition covering the same range of sectors that is really a CapROS partition (i.e. retype the partition).

  4. Tell CapROS to try and mount the newly created partition.

1.2 Deleting Existing Partitions

There are two cases to consider when deleting CapROS partitions. CapROS treats the case of an unplanned deletion the same as if the disk became unreachable due to removal or failure, which of course means that any data that existed solely on that disk is lost. This can cause critical software components to fail. Another problem is that data in the checkpoint log destined for the removed partition will be unmigrateable.

Planned deletion of a partition involves migrating all the data on it elsewhere. The Object IDentifiers of the data do not change. Most likely this will use mirroring/duplexing as follows:

  1. Create and mount a new partition with ranges that duplicate the ranges in the partition to be deleted. This may be on a different disk.
  2. Dirty all the objects in the partition.
  3. Wait for those dirty objects to be checkpointed and migrated to both partitions.
  4. Dismount the partition to be deleted.

The question of how to actually reduce the amount of data in the system, so disk space can be permanently freed, is beyond the scope of the current design of the system.

2 Ranges

CapROS views each logical volume as further divided into ranges. This section describes the range types, and the formatting of each range type as seen by the kernel. All ranges are an exact multiple of the architecture page size; extra trailing sectors, if any, are unused.

In the kernel code, ranges are called Divisions; should we change this documentation to match?

2.1 Range Types

Ranges come in a variety of types.

No provision is made for allocating spare space for use in the event of disk failure, because both IDE and SCSI provide transparent sparing.

There is no boot range and no kernel range. It is presumed that the disk contains another partition which contains the GRUB boot loader, the kernel, and the preloaded range(s). This partition is formatted with a boot sector and a traditional file system understood by GRUB. We also need to document somewhere how systems will be booted from ROM.

3 Object Ranges

Object ranges are made up of a sequence of clusters. Every cluster begins with a disk frame known as a tag pot, followed by a fixed number of object frames. The size of a cluster is determined by the number of Object Frames a Tag Pot can keep track of. (The last cluster in a range may have as few as zero object frames.) All frames are the size of the architecture's page. So the Object Range looks like:

----------------------------------------------------------------------
| TagPot | ObjFrm |...Object Frames...| ObjFrm | TagPot | ObjFrm | ...
----------------------------------------------------------------------
 \_____________________   _____________________/
                       \ /
                     cluster

Each Object Frame contains one or more objects of a single type (node or page). The type is specified in the tag pot. The type of the frame determines how the Frame is laid out, as follows:

Allocation Count

If the allocation count of an object is N, that means that there are no applicable capabilities containing an allocation count M > N, and there are no rescinded applicable capabilities containing an allocation count M = N. (There may exist rescinded applicable capabilities containing an allocation count M < N.) For nodes, applicable capabilities are all capabilities containing the node's OID. other than resume capabilities. For all other objects, applicable capabilities are all capabilities containing the object's OID.

In the simple case, rescinding capabilities to an object requires incrementing the allocation count of the object.

When the allocation count reaches its maximum possible value, the object is "worn out" and cannot be reused, because we can no longer rescind capabilities to it.

To reduce the "wear" on the allocation count, we keep track of some in-memory capabilities (they are linked to the object) and we can invalidate these capabilities without incrementing the allocation count. A bit called allocCountUsed indicates whether there may be unrescinded capabilities (other than resume capabilities) that contain the current value of the allocation count. If the bit indicates that there are no such capabilities, rescinding can be done without incrementing the allocation count. With this optimization, it would take an extremely long time for a 32-bit allocation count to wear out.

The allocCountUsed bit is not stored on the disk. When an object is fetched from disk, its allocCountUsed bit is set to one. (If we did store the bit on the disk, then we would have to consider the object dirty whenever the bit is set (or cleared). The bit is set when an in-memory capability is converted to the out-of-memory form, and for technical reasons it's not possible to dirty the target object at that time.)

Call Count

Nodes contain a call count that is similar to the allocation count, but for resume capabilities. If the call count of a node is N, that means that there are no resume capabilities containing the node's OID and a call count M > N, and there are no rescinded resume capabilities containing the node's OID and a call count M = N. (There may exist rescinded resume capabilities containing the node's OID and a call count M < N.) Similarly, there is a callCountUsed bit, and it's not stored on disk either.

Tag Pots

A tag pot has an entry for each frame in its cluster. Each entry has the following fields:

type
The type says whether the frame contains a page or nodes. In the future other types may be implemented (process is high on our list).

A virgin frame should be typed as a page, since it is easier to retype a page to a different type than vice versa.

allocationCount
If the frame holds a page, this is its allocation count. If there were other types with only one object per frame, this field might hold the allocation count for that object too. Otherwise, this field is unused.

To avoid wasting space on alignment, a tag pot is arranged as two arrays:

struct TagPot {
  ObCount allocationCount[];
  uint8_t type[];
};

Using a 32-bit count, a tag pot can describe 819 object frames.

Retyping

It is possible to change the base type of a frame, for example from page to node pot or vice versa. By "base type" we mean the type of the container of an object. For example, forwarders and GPTs are different types, but, as an implementation detail, they both occupy a node frame; they both are of base type node.

The space bank internally allocates a whole frame at a time to a given base type. The space bank does not always know the base type of free frames, and allocating the first object in a frame may require changing its type. (As an optimization, the space bank preferentially allocates free frames that were previously of the same base type, to reduce the amount of retyping needed.)

Retyping a frame must not cause any rescinded capabilities to become unrescinded. (When a frame is retyped, there are no allocated objects in the frame, and therefore no unrescinded capabilities to objects in the frame, so we don't need to worry about unrescinded capabilities becoming rescinded.) This requires that allocation counts and call counts never decrease. Here's how we ensure that:

Retyping a frame involves the following steps. These steps ensure that the retyping operation does not change the state of the last committed checkpoint. The steps are described in the context of a persistent (checkpointed) system, but they apply equally well to a non-persistent (embedded) system.

  1. The space bank invokes a Range capability to request a capability to an object in the frame. The kernel attempts to find the requested object, and discovers that the frame is of the wrong type. The Range operation returns an exception that tells the space bank to retype the frame, and returns the current type of the frame.

    Note that the kernel needs to know the current type of the frame.

  2. When retyping, there are no allocated objects in the frame. Therefore there should be no unrescinded capabilities to objects in the frame. Therefore the allocCountUsed and callCountUsed bits of those objects should be zero.
  3. Knowing the type of the frame, the space bank now knows the number of objects in the frame. It determines the maximum allocation count (and, for nodes, call count) by iterating over all the objects in the frame, invoking a Range capability operation to get the current counts. Note, the allocation and call counts should not be changing, because the space bank has not yet allocated anything from the frame.

    As with obtaining the type, the counts are obtained without requiring the contents of the object. Specifically, the kernel gets the counts (1) if the object is in memory, from the object header; (2) if the object is in the in-core log directory, from the directory entry; or (3) otherwise, from the home range tag pot (for a page) or home range node pot (for nodes).

    There is a separate operation for each object, because each object might be in a different node pot in the log, and we don't want to require all the pots to be in memory at once. (This function could be optimized later to return counts for multiple objects in one operation, as long as that can be done without doing multiple fetches of pots on any one operation.)

    In the case of nodes, the EROS design called for keeping in the tag pot the maximum of the allocation counts of the nodes in that home node pot. Note that this maximum only applies to the nodes whose current version is in the home node pot; some nodes from that pot might have their current version in log pots, and we still need to read those. This design could save reading the home node pot when the frame is retyped, but may require updating and writing the tag pot more frequently. Since retyping should be rare, we choose to not keep node counts in the tag pot.

  4. The space bank then invokes an operation on the Range capability to retype the frame. This operation proceeds in three phases.
    1. Ensure we have enough resources to complete the operation.
    2. Destroy vestiges of the old objects. If the object is in memory, we free it. Also, if the object is in the in-core log directory (for any generation), we free that directory entry.
    3. Create nascent objects of the new type, for all objects in the frame. For persistent objects, these are created by making entries in the in-core directory (for the working generation) indicating that the objects are null. For non-persistent objects, the objects are created as null in memory (non-persistent objects never have directory entries). The allocation counts (and, for nodes, call counts) are specified by the space bank based on its earlier determination of the maximum counts. (It is likely that the space bank will soon allocate at least some of these objects.)
  5. The space bank can then once again request a capability to the desired object. The request will now succeed.
  6. Eventually, one or more of the new objects, if persistent, will be migrated. The migrator notices the change of type in the tag pot in the home range. If the frame is now a node pot, the migrator formats the frame as such before migrating to it. The contents and counts of the nodes formatted in this frame do not actually matter, because all the nodes in it were dirtied and will be migrated to it either now or later.

SourceForge.net Logo Copyright 1998 by Jonathan Shapiro. Copyright 2008 by Strawberry Development Group. All rights reserved. For terms of redistribution, see the GNU General Public License This material is based upon work supported by the US Defense Advanced Research Projects Agency under Contract No. W31P4Q-07-C-0070. Approved for public release, distribution unlimited.