IA-32 Emulation

This note describes some early thoughts on how to do complete IA-32 emulation to support foreign operating systems running as user-level applications. The approach requires kernel support, but allows the majority of the emulation to occur in user mode. Most of the mechanisms needed were already contemplated at one point or another. The major new introduction is support for segment tables and the impact of this on the assumptions of the underlying EROS kernel implementation.

This note does not describe a general-purpose solution. Emulating an IA-32 machine on an arbitrary host clearly requires dynamic compilation. The objective here is to emulate an IA-32 machine on a host that directly supports the IA-32 user-mode instruction set.

Some of what follows was crystallized by a long weekend session with Kevin Lawton (see: plex86). I had initially hoped to borrow heavily in our implementation from plex86. Borrowing is certainly still possible, but it is now clear that the two pieces most likely to survive a port to EROS -- the interpreter and the JIT compiler -- would require significant modification. While there is much in common between the strategy outlined here and the one taken by plex86, the details are quite different. While I wrote this without reference to Kevin's emulation writeup, his writeup of possible emulation techniques is excellent and strongly recommended. At the moment I am unable to find a working online link to it. If you have one, please let me know.

1. General Approach

The first thing to say about running x86 code is that the hardware is good at it and software isn't. Emulating the behavior of segmentation and paging with a pure software solution carries considerable overhead: 30 to 40 instructions of JIT-generated code per memory-mode instruction. Of these instructions, most go to simulating the behavior of the page translation and segmentation logic. The good news is that we have an engine ready to hand that already knows how to do this: the IA-32 (a.k.a. x86) chip. This machine has been emulated commercially (VM/386 at IBM, VMware, Connectix). The bad news is that this chip really doesn't want to be virtualized in any simple way.

Section 3 of the paper below considers the problems that arise when taking a virtual machine monitor approach to IA-32 virtualization (running supervisor code directly in user mode). All of these issues reappear when taking a hybrid approach (interpreting supervisor code). It's worth a read as background to this note.

John Scott Robin and Cynthia Irvine, ``Analysis of the Intel Pentium's Ability to Support a Secure Virtual Machine Monitor,'' Proceedings of the 9th USENIX Security Symposium, Denver, CO, August 2000, pp. 129--144. At the time of this writing, a copy of this paper could be found on the USENIX web site.

2. Emulating on EROS vs. Emulating on Linux

Emulating an IA-32 machine on EROS is a very different problem than emulating on Linux. Broadly, there are two challenges in using the hardware for user mode code:
In plex86, which emulates on a Linux host, paging and segmentation are handled by providing a kernel subsystem that builds a real page table (and, I am guessing, a shadow segment table) on the side and ``warping'' between the guest application and the host operating system. Plex86 then implements non-application code using a variety of interpretation techniques ranging from single-instruction interpretation to (eventually) JIT compilation. That is, plex86 implements a microkernel within a kernel. Unfortunately, the amount of code that plex86 places in the kernel to support this is considerable (including a JIT compiler), and would probably preclude assurance evaluation for EROS if we were to do that. I don't think we will need to, because EROS has some advantages over Linux where emulation is concerned:
If we have to, we can consider moving this code into the kernel after it is debugged. To make a long story short, it is possible (with considerable effort) to run emulated application (ring 3) code in an EROS process using the native hardware under a modest set of assumptions. This emulation is not exact, but it can be made good enough to fool most of the operating systems out there -- most notably Windows, Linux, and EROS. The key kernel requirements to support such execution are:
The balance of this note describes what problems we need to deal with, what the ring-3 visible semantics of segmentation actually are, how we can use the EROS paging logic to simulate the native paging logic, and (in abstract) how we will execute guest kernel code. It also discusses some miscellaneous virtualization issues surrounding visible system registers. The actual implementation of the guest code executive is left for another design note, as it is a topic unto itself.

3. Privileged, Sensitive, and Revealing Instructions (Ring 3)

In a hybrid design, we begin with the assumption that all code runs in ring 3. That is, all instructions are executed with non-supervisor access modes. As we will interpret supervisor instructions, we will consider those problems separately.

Privileged instructions are those that modify the security state of the hardware. The IA-32 does not permit ring-3 applications to perform privileged instructions directly. We can therefore discard this concern, though we will need to pay attention below to correct handling of simulated privilege level transitions.

Sensitive instructions are those that reveal the security state of the machine. For example, the IA-32 EFLAGS register contains the ``I/O privilege level.'' Every instruction that sources this field is therefore sensitive. These include PUSHF, POPF, and a horde of others. Fortunately, the answers are the same in ring 3 pretty much regardless of operating system. While these instructions do reveal the state of the emulated machine, they reveal state that never changes in a fashion that is visible in ring 3.

Revealing instructions are those that might disclose the fact of emulation to a kernel that is looking for it. There are quite a number of these, and they center primarily around segmentation and critical system tables. I believe that with careful management of a shadow segmentation scheme, these can be plugged well enough to fool the majority of operating systems out there. Section 4 discusses the revealing instructions and various strategies for minimizing revelation while preserving system-wide protection.

4. Segment Semantics Visible in Ring 3

In the following discussion, our goal is to deal with the revealing instructions. To determine what is needed to preserve the desired illusion, we first need to enumerate what a ring 3 application can learn and which of these things are important. For a more detailed explanation of issues, see section 3.1 of the Robin and Irvine paper. A key issue in the following discussion is which things allow detection of emulation (which we can tolerate) vs. which cause emulation to break.

4.1 Location of System Tables

Ring-3 code can use the SGDT, SLDT, and SIDT instructions to learn the virtual memory location of the global descriptor table, the local descriptor table, and the interrupt dispatch table, respectively. They also reveal the size of these tables. This is essentially useless information, and I don't know any application that has a reason to use these instructions from ring 3 unless it is checking to see if emulation is going on. I do not see emulation detection as a big issue unless it breaks something. Actually, it's a lousy strategy for detection, because the virtual addresses of all tables are likely to change from kernel version to kernel version as a result of recompilation. The only way I can see that these values can be problematic is if the guest application later passes the discovered location back to the guest OS and a comparison of locations is made.
This is a problem because we will almost certainly need to implement a shadow LDT/GDT, so the reported location will not match the location expected by the guest OS. Regrettably, these instructions directly reveal the content of protected system registers. There appears to be no straightforward way to prevent this revelation. Fortunately, applications don't actually execute these instructions. If this becomes an impediment, we can probably arrange to place the shadow GDT, LDT, and IDT at the same virtual address where the guest OS placed the original. I propose we defer this until it proves to be a problem.

4.2 Exposure of Segment Table Content

Four instructions partially expose the values of a segment table entry. None of these instructions has security implications per se (i.e. it's safe to run them), but each reveals something to ring 3 code about the content of the segment table that an emulator might want to change:
As discussed in Robin and Irvine, these instructions present various problems for execution of supervisor code. However, that isn't the problem we are trying to solve, and for ring 3 code things are not so bad:

4.2.1 VERR, VERW

In ring 3, the VERR, VERW instructions reveal information only about segments that are accessible to ring 3 code anyway. That is, they are not sensitive when invoked from ring 3. A correct emulation must necessarily ensure that the verify instructions generate the same results as they do on the real machine. It is very inconvenient that these instructions do not trap when applied to invalid entries. The fact that they do not is a design failing in the IA-32 family that could be easily and compatibly corrected. Because they do not trap, it is necessary for the permissions fields and limits of the corresponding shadow segment table entries to match the entries in the original segment table. Fortunately, these instructions do not reveal the distinction between descriptor entries that are not present (beyond the descriptor table limit) and those that are not readable. VERR, VERW therefore cannot be used to detect the additional entries in the shadow descriptor table used for the EROS kernel descriptors.

4.2.2 LSL

There is a general issue with TSS, call gate, and task gate segments that needs to be addressed below. Here we consider only the implications of the LSL instruction in the unlikely case that a TSS segment is created with DPL=3 (no current operating systems do so). The LSL instruction, when executed from ring 3, reveals the length of accessible segments plus TSS segments whose DPL value is 3. It is not our responsibility to stop the guest OS from revealing stupidity to the guest application. We need only be concerned about revealing information about the host OS TSS segments (if any). The EROS kernel uses a singleton master TSS with a DPL of 0. Even if it used a DPL of 3, revealing the length of a statically created kernel structure does not create either a significant disclosure or a channel of communication. In effect, this behavior reveals a non-sensitive constant to the guest application. There is one potential revelation concerning the TSS limit: the guest OS may make use of the permissions bitmask, and the difference between the size of the guest OS TSS and the size of the EROS TSS might reveal the fact of emulation hosting if the DPL of the guest OS TSS is set to 3 (i.e. if the guest OS author was a complete idiot). Revealing the fact of emulation may be a foregone conclusion in any case, but we do not need to reveal it here. Alternatively, note that the LSL instruction does not reveal the linear base address of the task segment. Therefore, the EROS kernel could resolve the problem by maintaining a dummy TSS region and using false TSS entries in the shadow descriptor table that point to this dummy TSS and reflect appropriate sizes. This is the preferred resolution for reasons discussed below. Finally, note that all of this silliness is required only to support idiot operating systems that set the DPL value to 3. We'll do it. Someday. A long time from now.

4.2.3 LAR

The LAR instruction raises many of the same issues as the LSL instruction. As with LSL it reveals information about code/data segments accessible from ring 3, but this is not sensitive. Like LSL, it reveals potentially sensitive information about TSS segments to ring 3 code. It also reveals information about call gates and task gates.
As before, these are an issue only when the segment entry's DPL value is 3. The statements about TSS segments made under the discussion of LSL apply equally well to the LAR instruction. LAR reveals that a TSS segment exists and what access rights exist to it, but does not reveal anything about the nature of the process that will be invoked.

4.2.4 STR

The STR instruction reveals the identity of the descriptor table entry from which the current task was loaded. This instruction is not used by applications in most systems. The primary requirement to simulate this instruction's behavior correctly for ring 3 code is to ensure that any TSS entry in the shadow descriptor tables appears at the same location as the corresponding entry in the original descriptor table. This can be done without actually implementing multiple TSS segments in the operating system.

4.3 TSS, Task Gates, and Call Gates

For performance reasons, current IA-32 operating systems generally use a single, supervisor-only TSS and do not use task gates or call gates. In a nutshell, it's faster to simulate this behavior in software than to let this sorry excuse for a processor do the work. In such systems, no segment of these types will exist with DPL=3. Since we are only doing native execution of ring 3 code, the virtualization issues associated with simulating the behavior of these misfeatures disappear.

4.3.1 Call Gates

Call gates are nastily complicated, but not really that bad to manage. The ``solution'' is for the EROS kernel to provide a set of call-gate entry points in the kernel that accept zero arguments (and therefore construct a uniform stack frame). Each call gate is directed to a unique kernel entry point that records the identity of the descriptor table selector used in the code. This selector is passed to the keeper of the guest application, which is the program performing the supervisor-mode emulation. Given access to the selector invoked, the emulator can use the original (non-shadow) descriptor table to work out what should be done. If it is absolutely essential to do so, the EROS kernel could also arrange to record the argument words and encapsulate these upward into the keeper invocation. This would penalize the normal capability invocation path, and I am therefore somewhat reluctant to do it. Efficient emulation is important, and this decision should therefore be dictated by performance measurement.

4.3.2 TSS, Task Gates

Fortunately, transfers via a jump or call to a TSS or task gate segment do not make provision for passing arguments or specifying an entry point. Further, while the privilege level necessary to access the task is revealed by various instructions, the privilege level at which the destination task actually executes thankfully is not. This means that the ``honey pot'' solution works: create a dedicated singleton TSS whose sole purpose is to be the destination of all emulated TSS and task gate transfers, and which traps immediately. The honey pot TSS is configured to proceed executing EROS kernel code. It immediately unwinds the task linkages (in order to become available for next time), switches back to the expected kernel TSS using the LTR instruction, marks the guest application as having trapped to the emulator, and resumes it, causing a fault into the keeper.

5. Shadow Paging

Because we want to make emulated programs persistent, and also because we want to minimize kernel impact, the EROS IA-32 emulator needs to use shadow paging techniques.
These techniques are discussed extensively in the Karger paper. The best hardware support for this is the ``fault on first reference'' support in the Alpha; we will recreate a similar mechanism in software here. The guest OS is maintaining a set of tables that it believes are the real mapping tables. It informs the emulator about what mapping tables to use via the MOV %CR3 instruction. The emulator provides a simulated physical address space that is implemented as an EROS address space. Initially, the guest OS runs from this space directly. Once protected mode is entered, the emulator switches the guest into a new, empty EROS address space. The emulator remembers the relationship between every guest address space root pointer and its associated emulator address space (we will refine this below, but stick with this for now). As the protected-mode guest executes, it page faults in the new EROS space. As each page fault is incurred, the emulator proceeds as follows:
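To make the fault path concrete, here is a minimal sketch of what such a handler might look like for ordinary 4 KByte, non-PAE guest mappings. The helper names (guest_phys_read32, shadow_map, reflect_guest_fault, and so on) are hypothetical stand-ins for the emulator's access to the simulated physical space and to the EROS space it manages; only the table walk itself is dictated by the IA-32 paging structures.

    /* Hedged sketch of the shadow-paging fault path for 4 KByte, non-PAE
     * guest mappings.  All helper names are hypothetical. */

    #include <stdint.h>
    #include <stdbool.h>

    #define PTE_P 0x001u   /* present */
    #define PTE_W 0x002u   /* writable */
    #define PTE_U 0x004u   /* user-accessible */
    #define PTE_A 0x020u   /* accessed */
    #define PTE_D 0x040u   /* dirty */

    extern uint32_t guest_phys_read32(uint32_t gpa);
    extern void     guest_phys_write32(uint32_t gpa, uint32_t val);
    extern void     shadow_map(uint32_t va, uint32_t gpa, bool writable, bool user);
    extern void     reflect_guest_fault(uint32_t va, uint32_t errcode);

    void emulate_page_fault(uint32_t guest_cr3, uint32_t va, bool is_write, bool is_user)
    {
      uint32_t pde_gpa = (guest_cr3 & ~0xFFFu) + ((va >> 22) & 0x3FFu) * 4;
      uint32_t pde = guest_phys_read32(pde_gpa);
      if (!(pde & PTE_P)) {
        reflect_guest_fault(va, (is_write ? 2u : 0u) | (is_user ? 4u : 0u));
        return;
      }

      uint32_t pte_gpa = (pde & ~0xFFFu) + ((va >> 12) & 0x3FFu) * 4;
      uint32_t pte = guest_phys_read32(pte_gpa);

      /* Effective rights are the AND of the PDE and PTE rights. */
      bool writable = (pde & pte & PTE_W) != 0;
      bool user     = (pde & pte & PTE_U) != 0;

      if (!(pte & PTE_P) || (is_write && !writable) || (is_user && !user)) {
        reflect_guest_fault(va, ((pte & PTE_P) ? 1u : 0u)
                                | (is_write ? 2u : 0u) | (is_user ? 4u : 0u));
        return;
      }

      /* Maintain guest accessed/dirty bits: the real MMU never sees the
       * guest tables, so the emulator must set these itself. */
      guest_phys_write32(pde_gpa, pde | PTE_A);
      guest_phys_write32(pte_gpa, pte | PTE_A | (is_write ? PTE_D : 0u));

      /* Install the mapping in the shadow (EROS) space.  Write access is
       * granted only on a write fault, so that the first write to a clean
       * page is observed and the dirty bit above stays honest. */
      shadow_map(va & ~0xFFFu, pte & ~0xFFFu, is_write, user);
    }

Note that write permission is granted only when a write actually faults; this is the lazy ``authority upgrade'' path mentioned below.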
This mechanism is regrettably complicated, and some tricks will be needed to avoid unnecessary invalidations (the majority of page table modifications are authority upgrades, which can usually be processed lazily). Still, the EROS mapping mechanism works plenty fast enough for EROS, and with these tricks performance should come closer to the purely in-memory form.

6. Use of Segmentation in the Emulator

As previously mentioned, the hardest parts of guest OS simulation are paging and segmentation. The shadow paging mechanism described above can be used to provide the desired paging simulation. Here we describe how to use the new segmentation mechanisms to support guest emulation. Since the guest OS is interpreted, we need to be concerned with four segments in any given instruction:
While the latter two need to enforce permissions as well, all of these checks can be performed using ring-3 segments, significantly reducing the amount of code that the interpreter must generate.

8. Kernel Support in EROS

Supporting the IA-32 emulation described above requires several pieces of kernel support.

8.1 Relocatable Kernel

To fully support IA-32 emulation, the EROS kernel needs to be able to get out of the way. The problem is that the guest OS may require the ability to map things into the memory region that the EROS kernel thinks it owns. When this happens, EROS needs to move. In the current (3/16/2002) implementation, the virtual map as seen by the kernel looks like:

    0G                  3G       3.25G  4G
    +-------------------+--------+--------+
    |   (large) user    | small  | kernel |
    |      space        | spaces | space  |
    +-------------------+--------+--------+

Conceptually, the kernel is link-edited to start at address 0. In practice, the kernel is link-edited to start at 0x101000. The primary kernel page directory starts at 1 Mbyte (0x100000) and the kernel loads above that. The relocation of the kernel to 3.25 Gbytes is accomplished by altering the segment base address of the kernel segments. The kernel runs in a wrapping 4G segment, and therefore sees user addresses in the current address space starting at 1 Gbyte. Within the kernel data structures, the kernel's current base address is known (sometimes implicitly) in several places:
8.1.1 Step 1: Rotate Small Spaces to the End

Our first change will be to rotate small spaces to the end of the map and shrink the total amount of space allocated to the kernel. The kernel will now see user addresses starting at 0.5 Gbytes:

    0G                  3.5G     3.75G  4G
    +-------------------+--------+--------+
    |   (large) user    | kernel | small  |
    |      space        | space  | spaces |
    +-------------------+--------+--------+

While small space segments will continue to need to be relocated whenever the kernel is relocated, their mappings are managed as kernel mappings. From this point forward we will treat the kernel region as a single unit in our discussions.

8.1.2 Step 2: Placement-Neutral Kernel Map

We will plan to be able to relocate the kernel to any of 8 positions in the virtual map. I will refer to these as kernel mapping zones (KMZs). The general idea is that the zone currently owned by the kernel is protected by setting the supervisor bit in all corresponding mapping table entries. When the application faults in this region, the kernel will respond by stealing some other window from the application and rebuilding application mapping directories accordingly. Before the various segment registers and machine registers can be reloaded, the kernel must switch to a placement-neutral mapping. That is, a mapping in which both old and new locations are valid for kernel references. The EROS kernel already maintains a singleton kernel mapping table. The only change required is to duplicate the kernel mappings 8 times. I am tempted to refer to this as the demilitarized mapping, but let's leave well enough alone. The first step in switching zones is to change mapping tables to the KMZ-neutral table. In this table, all zones are valid.

8.1.3 Step 3: Zone Change

Once running from the neutral mapping table, the kernel rewrites the appropriate segment table entries in the GDT, disables interrupts (!), and reloads the GDTR, IDTR, and (if needed) LDTR. It also rewrites the user base address offset pointer to an appropriate new offset. The kernel now reloads CS, DS, ES, SS from the global descriptor table (the kernel does not use other segment registers). The kernel is now running from the new window. Interrupts are then re-enabled, and execution resumes with the current process.

8.1.4 Step 4: Tagged Page Directories

EROS page directories are already tagged according to whether they represent read-only or read-write directories. We will add to each directory a single-byte ``zone'' field describing the zone constraint under which it was constructed. Before executing any process, we need to check if its current zone (which will be recorded in the context structure) matches the current kernel zone. If not, we will reset its current zone and force it to attempt to run out of the universal kernel page table. Running out of the kernel table (which has no valid user-mode mappings) is how processes normally bootstrap their mappings. Having arranged that the process will now attempt to re-locate a valid page directory, we will modify the ``find page directory'' logic. The directory frame locator will locate the appropriate directory page just as it does now. If it finds a directory page whose zone field is non-current, it makes the following modifications to the directory:
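As an illustration, here is a rough sketch of the dispatch-time zone check and the adjusted directory locator. All structure and function names are invented for the sketch, and the per-entry scrubbing performed on a stale directory is hidden behind a hypothetical rezone_directory() helper.

    /* Hedged sketch of the zone check and adjusted directory locator. */

    #include <stdint.h>

    struct Context;
    struct PageDirectory {
      uint8_t   zone;      /* KMZ under which this directory was built */
      uint32_t *pde;       /* the 1024 hardware page directory entries */
    };

    extern uint8_t   currentKernelZone;
    extern uint32_t *kernelOnlyPageTable;   /* no user mappings; bootstrap map */
    extern uint8_t   context_zone(struct Context *ctx);
    extern void      context_set_zone(struct Context *ctx, uint8_t zone);
    extern void      load_cr3(uint32_t *pageDirectory);
    extern struct PageDirectory *locate_directory_frame(struct Context *ctx);
    extern void      rezone_directory(struct PageDirectory *dir, uint8_t newZone);

    /* Called before dispatching a process. */
    void check_process_zone(struct Context *ctx)
    {
      if (context_zone(ctx) != currentKernelZone) {
        context_set_zone(ctx, currentKernelZone);
        /* No valid user mappings here: the process will fault and rebuild. */
        load_cr3(kernelOnlyPageTable);
      }
    }

    /* The adjusted directory locator. */
    struct PageDirectory *find_page_directory(struct Context *ctx)
    {
      struct PageDirectory *dir = locate_directory_frame(ctx);  /* existing logic */
      if (dir && dir->zone != currentKernelZone) {
        /* Stale directory: scrub the entries affected by the old and new
         * kernel zones, then retag it for the current zone. */
        rezone_directory(dir, currentKernelZone);
        dir->zone = currentKernelZone;
      }
      return dir;
    }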
Alternatively, we can keep multiple versions of directories and let the ager kill them over time. My feeling is that it will be more efficient to kill the minimal number of mappings and let the EROS kernel rebuild the holes -- the kernel is already very good at this. We will also need to update the mapping invalidation (depend) logic so that it will not stomp on supervisor mappings.

8.1.5 Trailing Thoughts

Having written all of this up, I find that this really isn't as difficult as I expected. Perhaps we should reconsider the decision to leave it out and just make this a part of the normal kernel specification for IA-32. The kernel will move occasionally, but it will tend to stabilize in one location for long periods of time. More importantly, the behavior of small spaces will not be unduly disturbed.

8.2 Optional Small Spaces

Since emulated IA-32 environments potentially demand full access to the address space, it is not always possible to allow small address spaces within an emulated address space. Protection of small spaces relies on segmentation, which may conflict with guest usage. Small spaces therefore need to be optional. To support this, we need to add a mode bit in the context cache indicating whether a given context can support small spaces. This mode bit is set or cleared according to the descriptor table cache management logic -- it is not part of the persistent per-process state. Processes that do not permit small spaces always take the slower (full switch) context switch path, and require page directories to be tagged as having/not-having small space support. As a practical matter, there is no need for an additional tag. If the process is not using the standard GDT, it is unlikely to support small spaces anyway. I believe we should simplify things by simply refusing to put small spaces in non-native machines.

8.3 Local Descriptor Tables

The EROS kernel presently provides no support for manipulation of descriptor tables, and does not expose a local descriptor table for use by user-mode code. At present, all processes load their descriptors using well-known selectors in the singleton global descriptor table. To support emulation, we regrettably need to support the local descriptor table. The first step is to reserve a capability slot in the IA-32 process root specification to contain the local descriptor table annex capability. As a descriptor table can be up to 64 Kilobytes in length, the capability in this slot should be a node capability to a node containing page capabilities. If any of these constraints fails, the process will execute with a null local descriptor table. I am still thinking about where the descriptor table size should be recorded. The EROS kernel will keep a cache of local descriptor tables that is loaded from the per-process table if a per-process local descriptor table is in use. It will treat the local descriptor mechanism as a functional unit. This is conceptually similar to the way that the floating point unit is currently handled. Whenever an attempt is made to schedule a process that is using a local descriptor table, the EROS kernel will check to see if (a) there exists a local descriptor cache entry, and (b) it is up to date. If necessary, a descriptor cache entry will be allocated. If it is not up to date, entries will be copied from the per-process table. Descriptor copy can be defined to generate faults or to downgrade in place. I believe I prefer the downgrade-in-place design.
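A minimal sketch of that schedule-time check follows. The types and helpers are hypothetical (in particular, the version field is just one plausible way to decide whether a cache entry is up to date), and the downgrade rules themselves are hidden behind downgrade_descriptor().

    /* Hedged sketch of the schedule-time LDT cache check. */

    #include <stdint.h>
    #include <stdbool.h>

    struct Process;                                   /* opaque here */
    struct SegDescriptor { uint32_t lo, hi; };        /* 8-byte IA-32 descriptor */

    struct LdtCacheEntry {
      struct Process      *owner;      /* process whose LDT is cached here */
      uint32_t             version;    /* matches the per-process table when current */
      uint16_t             count;      /* number of descriptors */
      struct SegDescriptor desc[8192]; /* up to 64 KBytes of descriptors */
    };

    extern bool     process_uses_ldt(struct Process *p);
    extern uint32_t process_ldt_version(struct Process *p);
    extern uint16_t process_ldt_count(struct Process *p);
    extern struct SegDescriptor process_ldt_fetch(struct Process *p, unsigned i);
    extern struct SegDescriptor downgrade_descriptor(struct SegDescriptor d);
    extern struct LdtCacheEntry *ldt_cache_lookup(struct Process *p);
    extern struct LdtCacheEntry *ldt_cache_alloc(struct Process *p);
    extern void     lldt_load(struct LdtCacheEntry *e);  /* builds GDT entry, does LLDT */

    void prepare_ldt(struct Process *p)
    {
      if (!process_uses_ldt(p)) {
        lldt_load(0);                          /* null local descriptor table */
        return;
      }

      struct LdtCacheEntry *e = ldt_cache_lookup(p);
      if (!e)
        e = ldt_cache_alloc(p);

      if (e->version != process_ldt_version(p)) {
        /* Copy from the per-process table, downgrading in place. */
        e->count = process_ldt_count(p);
        for (unsigned i = 0; i < e->count; i++)
          e->desc[i] = downgrade_descriptor(process_ldt_fetch(p, i));
        e->version = process_ldt_version(p);
      }
      lldt_load(e);
    }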
The descriptor copy activity will downgrade all copied descriptors as follows:
There are two feasible designs for all this:
The two strategies are compatible, and could be implemented simultaneously. As the pages containing the per-process descriptor table might be aged out, we need to be able to deal with descriptor cache invalidation anyway.

8.4 Global Descriptor Table

Providing emulation support in the global descriptor table (GDT) is both necessary and tricky. Most modern operating systems run applications in flat mode, but place the historically required descriptor entries into the global descriptor table. While every operating system has entries that it uses for the kernel, there is no universal convention concerning what these entries are. While the GDT is normally thought of as ``one table, one machine,'' this is incorrect. It is actually ``one table, one operating system.'' In principle even this could change, though I don't know of any operating systems that reload this table once it has been loaded. Of course, EROS is no exception. The EROS kernel firmly believes that it owns the GDT, and relies on being able to use entries in it. These locations are in turn recorded in the interrupt descriptor table (IDT), the task state segment (TSS), and sometimes the ``fast syscall'' segment register. We can probably compile the kernel to choose values that are unlikely to get stepped on, but we need to be prepared to deal with the possibility that this might occur.

8.4.1 What ``Stepped On'' Means

The EROS kernel runs in ring 0, and all of the emulated code we want to run natively runs in ring 3. For purposes of finding a GDT slot for EROS kernel segments, this means that a GDT slot desired by some hosted operating system collides with an EROS slot only if it has DPL=3 and occupies the same slot index. The actual in-kernel descriptor table is only a shadow table, and the EROS kernel is free to reuse (dynamically) any slot with DPL<3 provided that it does not get caught doing so. Unfortunately, the EROS kernel cannot reuse a guest-controlled GDT to execute native EROS applications. Doing so would expose segment values to native applications that they should not see. This means that when switching applications, the kernel must compare the inbound and outbound descriptor table cache indexes and conditionally reload the descriptor table. The best solution, if possible, would be to place the EROS descriptors above the guest descriptors in all tables, thereby preserving a common segment selector value for kernel segments across all tables. This will not avoid the need to switch tables, but it will minimize the frequency with which other system-critical tables (such as the interrupt dispatch table) need to be rewritten to reflect new selectors.

8.4.2 Switching Kernel Selectors

When circumstances force us to switch kernel segment selectors, as when a user application demands a particular GDT slot, the kernel must move out of the way delicately:
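The test that decides whether such a move is needed at all is simple. Here is a hedged sketch of the collision check of section 8.4.1, using hypothetical names for the kernel's list of reserved slots; the DPL and present bits are taken from where the IA-32 descriptor format places them.

    /* Hedged sketch of the GDT slot collision test (section 8.4.1). */

    #include <stdint.h>
    #include <stdbool.h>

    struct SegDescriptor { uint32_t lo, hi; };

    #define DESC_PRESENT(d) (((d).hi >> 15) & 1u)
    #define DESC_DPL(d)     (((d).hi >> 13) & 3u)

    extern const unsigned erosKernelSlots[];   /* GDT indexes EROS currently uses */
    extern const unsigned erosKernelSlotCount;

    bool guest_gdt_forces_move(const struct SegDescriptor *guestGdt,
                               unsigned guestEntries)
    {
      for (unsigned i = 0; i < erosKernelSlotCount; i++) {
        unsigned slot = erosKernelSlots[i];
        if (slot < guestEntries
            && DESC_PRESENT(guestGdt[slot])
            && DESC_DPL(guestGdt[slot]) == 3)
          return true;      /* the guest wants a ring-3 usable entry here */
      }
      return false;
    }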
Note that if the EROS kernel cannot identify a common set of selector locations that work across all possible GDTs, it must potentially perform this switch every time it switches into and out of an emulated process. One begins to anticipate that the context switch path is becoming both complicated and expensive.

8.5 Interrupt Descriptors

The interrupt descriptor table contains entries that are used in most modern operating systems to handle system calls. These descriptors must appear at the locations expected by the applications. In some cases this implies a need to reprogram the offboard interrupt controller hardware to move the actual hardware interrupts out of the way. This is a complete mess, and it is fraught with peril in an SMP implementation as different processors come to disagree about the state of the interrupt machine. There is one pragmatic saving grace in the interrupt descriptor table, which is that the table is universally kept small by all operating systems. It contains entries for interrupts, exceptions, and one or two system call entry points. As a result, it is possible in practice for the emulator to store the real interrupt entry points in locations that do not conflict with the system call entry points. This largely resolves the SMP issue, though moving the descriptors is more than a little delicate.

8.6 Conclusions

Having enumerated some of the issues in an EROS-integrated design, I conclude that this design is infeasibly complex. While I believe that managing the descriptor cache in the fashion described is feasible, my guess is that doing what amounts to global register allocation across slots in these tables is far, far too complicated to get right. A different approach needs to be considered.

9. The M-Kernel Approach

The M-Kernel design is an alternative intended to reduce all of the preceding to a manageable amount of complexity. It is conceptually very similar to the technique proposed for plex86. In the M-kernel design, we introduce a first-class notion of hybrid virtual machine into the kernel. A hybrid virtual machine provides a full user-mode environment, including all necessary segment entries at the expected selector offsets in the usual tables. The job of the M-kernel is to switch between these machines and encapsulate faults.

9.1 Content of an M-Machine

An M-Machine context consists of:
That is, each M-machine contains all of the necessary state to run ring-3 code completely transparently, to capture interrupts from the hardware (but not to handle them), and to load a memory map (but not to manage that map). Note that by giving each M-machine its own GDT and IDT, it ceases to be necessary to have a common kernel segment selector across the machines.

9.2 Functions of the M-kernel

The M-kernel serves as a machine monitor between M-machines. We assume that M-machine zero (M0) is the ``controlling'' machine, and that for all other machines the behavior on fault, trap, or interrupt is to switch to M0, causing M0 to ``handle'' the fault in whatever fashion is appropriate. In effect, M0 runs the ``machine monitor'' and the M-kernel is a pico-kernel that handles the machine switch portion of the task. The code running in M0 can be a complete operating system, but it is assumed that this code is aware of the existence of the M-kernel. The M-kernel provides two functions: fault handling and machine dispatch.

9.2.1 Fault Handling

When an M-machine faults, traps, or interrupts, the M-kernel proceeds as follows:
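A rough sketch of this path follows. The real thing would be a few dozen assembly instructions living on the shared M-page; the field and helper names here are hypothetical, and the steps are only what sections 9.2 through 9.4 imply: save the interrupted ring-3 state, switch to M0's tables if necessary, and enter M0 with the faulting machine identified.

    /* Hedged sketch of the M-kernel fault path. */

    #include <stdint.h>

    struct SaveArea {
      uint32_t regs[8];
      uint32_t eip, eflags, esp, ss, cs;
      uint32_t vector, errcode;
    };

    struct MMachine {
      unsigned         id;             /* logical machine ID (section 9.4.2) */
      void            *gdt, *idt, *ldt;
      uint32_t        *pageDirectory;
      struct SaveArea  save;           /* lives in this machine's M-page */
    };

    extern struct MMachine *currentMachine;
    extern struct MMachine *m0;
    extern void save_ring3_state(struct SaveArea *sa, unsigned vector, uint32_t errcode);
    extern void load_machine_tables(struct MMachine *m);   /* LGDT/LIDT/LLDT, %cr3 */
    extern void enter_m0_fault_handler(struct MMachine *faulted); /* does not return */

    void mkernel_fault(unsigned vector, uint32_t errcode)
    {
      struct MMachine *faulted = currentMachine;

      save_ring3_state(&faulted->save, vector, errcode);

      /* Skip the table reloads when the logical machine IDs already match
       * (the single-branch bypass of section 9.4.2). */
      if (faulted->id != m0->id)
        load_machine_tables(m0);
      currentMachine = m0;

      enter_m0_fault_handler(faulted);  /* M0 decides what to do about it */
    }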
It is the responsibility of the M0 machine to decide what to do about the fault. The above description is about as minimal as such a mechanism can get.

9.2.2 Machine Dispatch

Machine dispatch is essentially the inverse of fault handling. In M0, the machine dispatch entry point is an entry point into the supervisor-mode machine monitor. In all other machines it is an entry point that returns control to the user-mode application then resident in the relevant M-machine. This is exactly the code the EROS kernel already uses.

9.3 Feasibility

The M-kernel can be built using only one page of state. This is desirable because it maximizes the likelihood that a shared page can be found accessible in all currently active M-machine address spaces. If the address of this single page can be shared across all M-machines, then the TSS can likewise be shared, which considerably simplifies the M-machine implementation. I shall call this common page address Mva. There are only two commonly used kernel mapping locations within the x86 virtual map. This leaves a large portion of the virtual address space for user-accessible virtual mappings, and the intersection of user-accessible mappings across all spaces is unlikely to be empty. The M-kernel page address can be stolen from any single page address within this intersection, and can be relocated as needed. While the resulting choice of Mva is likely to bounce around quite a bit at first, it will soon stabilize in a virtual page that is not used by any emulated system for long periods of time. The single page requirement is unquestionably feasible from a code size perspective. For calibration purposes, all of these functions currently exist within the EROS trap handler assembly code. That code also holds the capability invocation path (the M-kernel doesn't need this) and a boatload of trap debugging support. The whole fits in 2400 bytes, leaving plenty of room for a TSS structure (104 bytes). In fact, we can likely shrink that code by 800 bytes or so. Indeed, the resulting design is remarkably similar to the current EROS trap handler design. At the risk of inventing new terms, the M-kernel may be viewed as the context-switching ``femtokernel'' that lives at the core of the EROS kernel, and also (for that matter) at the core of the L4 nucleus. To implement this, most of what will be needed is to remove code from the existing EROS unified trap/interrupt handler.

9.4 Fal-Tor-Pan (The Refusion)

So now we have an M-kernel and all of the mechanism to perform an Mswitch (a context switch between machines). This mechanism is very similar to a conventional context switch. How shall we integrate it back into a conventional operating system? One approach, which would be best for a non-persistent design, would be to run only a virtual machine monitor in M0 and run all operating systems in other machines. I have not looked, but this is probably what the VMware GSX product does. While the generic M-kernel could be used in this fashion, and somebody will probably do so someday, it isn't the best approach for EROS. For EROS, we would like to run the EROS kernel in M0 and thereby preserve all of the persistence management that EROS provides. Ideally, in doing this we would like to avoid any need to recopy the context structure that the M-machine has so carefully saved for us. Eliminating that copy is one of the basic wins in the current EROS implementation. That is, we would like to re-fuse the Mswitch logic described above with the context switch (Cswitch) logic of the M0 operating system. This is possible.
9.4.1 Using the Context Cache In-Place

Today, EROS maintains an array that serves as a ``context cache.'' This array is simply an array of augmented save areas. If the revised EROS were to run only a small number of M-machine processes, we could play games with the M0 TSS stack pointer to use the existing context cache directly whenever running in M0. We would then pay a save area copyin/copyout penalty for other machines' M-contexts. This solution involves minimal change to the existing kernel, and is therefore potentially a good one. I am concerned that it creates a somewhat irregular data structure management discipline (M0 behaves differently from M>0), and may therefore be prone to maintenance failure.

9.4.2 Redesign Context Cache as M-Cache

If we can get the M-kernel code small enough (which we probably can), a cleaner solution would be to restructure the existing context cache as an array of M-machine pages. Each M-machine page now contains a copy of the M-kernel code and a TSS at the front, followed by as many context cache entries (save areas) as will fit. Every application address space maps its containing M-machine page with supervisor permissions at virtual address Mva. We augment every M-machine (conceptually) with a logical machine ID, and skip switching the GDT, IDT, LDT if the M-machine ID being initiated is the same as the current one. This bypass can be done in a single branch instruction, and results in a clean unification between M-machines and the contexts of the native operating system. We will need an M-cache for non-EROS processes anyway. In effect, what I am proposing is that every EROS process logically run in its own hybrid virtual machine, and that we optimize the usual case in which that virtual machine is in fact M0 and the kernel is the M0 supervisor. The amortized cost of this restructuring will probably work out to be a marginal 300 bytes per context cache entry. If we conclude this is excessive, we can revert to the two-cache design strategy. I just hate to carry the extra complexity if we can avoid it.

9.5 Managing the Various Hardware Tables

In the proposed M-kernel design, managing these tables falls to M0; the next section takes up each of them in turn.

10. Responsibilities of M0

At this point we have the core context switcher of a hybrid VMM, but not the supporting code to manage the various tables that the x86 requires. In the following discussion I talk about each of these tables and its associated management requirements.

10.1 Global and Local Descriptor Tables

The root of all evil on the x86 is the Global Descriptor Table. In order for the selector values selected by different emulated operating systems to look right, it is necessary that each M-machine have a separate global descriptor table. This table need not -- indeed must not -- be shared across M-machines, but certain of its entries must satisfy constraints. It is the responsibility of M0 to ensure that at all times only legal entries appear in the GDT associated with Mx. This can be accomplished through page fault handlers or through an explicit M0 kernel interface that permits manipulation of simulated GDTs. The global descriptor table for each machine Mx must be accessible to the M0 machine so the M0 machine can implement descriptor table updates on behalf of the supervisor machine simulator. This table is mapped into the M0 kernel virtual address space. It must also be mapped into the Mx address space in supervisor mode. The Mx address for this mapping can be specified by the machine simulator.
The M0 kernel will map the table into both the Mx space and somewhere in the M0 kernel virtual space. Reloading the GDT on the way into an M-machine (we shouldn't ever need to save it) costs approximately 6 cycles. The local descriptor table is managed in essentially the same way and under the same constraints. While the processor has two different master registers for these two tables, the two tables have identical security behavior for our purposes.

10.1.1 GDT Safety Constraints

It is the responsibility of M0 to ensure that whenever a machine Mx executes, its associated GDT meets the following constraints:
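As an illustration of where such enforcement could be centralized, here is a hedged sketch of the explicit update interface mentioned in section 10.1; the legality test itself is hidden behind a hypothetical descriptor_is_safe() helper, and all other names are likewise invented.

    /* Hedged sketch of an M0 interface for updating a simulated GDT. */

    #include <stdint.h>
    #include <stdbool.h>
    #include <errno.h>

    struct MMachine;   /* opaque */
    struct SegDescriptor { uint32_t lo, hi; };

    extern struct SegDescriptor *mx_gdt_base(struct MMachine *mx); /* M0's mapping of Mx's GDT */
    extern unsigned              mx_gdt_entries(struct MMachine *mx);
    extern bool descriptor_is_safe(struct MMachine *mx, unsigned index,
                                   struct SegDescriptor d);

    int mx_gdt_set_entry(struct MMachine *mx, unsigned index, struct SegDescriptor d)
    {
      if (index >= mx_gdt_entries(mx))
        return -EINVAL;
      if (!descriptor_is_safe(mx, index, d))
        return -EPERM;               /* would violate the Mx GDT constraints */

      /* The table is mapped into M0's kernel space, so this is an ordinary
       * store; the same frame is visible through the supervisor-only mapping
       * in the Mx address space. */
      mx_gdt_base(mx)[index] = d;
      return 0;
    }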
10.1.2 GDTR Fidelity

For the moment, I propose to try to get away without it, but if necessary the Mx descriptor table can be mapped in such a way that:
This would eliminate any possibility that the guest application will learn of emulation by examining the GDTR value.

10.2 Interrupt Dispatch Table

The interrupt descriptor table (IDT) needs to be kept on a per-machine basis, but thankfully doesn't need to be updated in a complex way to deal with per-machine interrupt handling policy. In fact, the only reason that it needs to be per-machine is that the code segment selector stored in each interrupt descriptor is a selector into the GDT, and the GDT is specific to each M-machine (what a nuisance). Aside from dealing with the GDT selector problem, the content of the interrupt dispatch table should not depend on which M-machine we are running. Here is why I think we will get away with this:
As with most of the supporting data structures, the IA-32 keeps the virtual address of this table stored in the IDTR register. The table itself can be switched around underneath the hardware so long as interrupts are disabled at the time and the virtual address and length remain unchanged. If we find ourselves required to faithfully simulate the user-mode SIDT instruction (which reveals the limit and linear address of the table), we can handle that similarly to the handling of SGDT (see the GDTR fidelity section).

10.3 Task Structure

A key supporting structure in the x86 firmament is the task structure, which is named by the Task Structure Register (TSR). The LTR instruction loads an entry from the global descriptor table, but once loaded the TSR value only changes as the result of a task switch. We will address that momentarily. Note that the hardware recalls the location of the task structure, but does not cache information from the structure. The location of the structure is a linear (virtual) address, and as a result the actual structure can safely be swapped out from under the hardware without reloading LTR as long as two conditions are met:
The easiest way to ensure this is to place the task structure (a) on the same page as the M-kernel code, or (b) on a similarly universally mapped page.

10.4 Mapping Tables

M0 is of course responsible for all page resource allocation associated with emulation. Beyond this, M0 is responsible for keeping track of the Mva address and ensuring that (a) it does not collide with any M0 kernel address, and (b) it does not collide with any user-mode accessible virtual address. The latter is the main subject of discussion here. If we choose to use a separate TSS page, identical constraints apply to that page. To identify a suitable Mva, the M0 kernel first picks a suitable virtual page address within the M0 user address region. It then probes every known address space tree, clobbering the associated PTE in each space with a supervisor-only PTE naming the M-kernel page. It also records the selected Mva in order to cope with later page faults. When a user-mode page fault occurs that references the Mva address, the M0 kernel must choose a new Mva location before servicing the page fault. Whenever a new Mva address is chosen, the TSR and IDTR registers must be reloaded, the ring 0 stack pointer in every M-kernel page must be adjusted, and the GDT code and data segment bases corresponding to the M-kernel code page for each M-machine must be revised. Because each M-machine uses a different GDT and a different selector for its kernel code, this update may occur at a distinct location in each M-machine. The selector identity should appear at a well-known location within the associated M-machine's descriptive page to simplify updates. Much of this can be done lazily in the EROS implementation by the following scheme:
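For concreteness, here is a hedged sketch of the relocation step itself, with the per-machine fixups shown eagerly; a lazy variant would instead tag each machine and perform its fixup at that machine's next dispatch. All helper names are hypothetical.

    /* Hedged sketch of Mva relocation. */

    #include <stdint.h>

    struct AddressSpace;
    struct MMachine;

    extern uint32_t currentMva;

    extern uint32_t pick_free_user_page(void);   /* not user-mapped in any active space */
    extern struct AddressSpace *first_space(void);
    extern struct AddressSpace *next_space(struct AddressSpace *s);
    extern void install_mkernel_pte(struct AddressSpace *s, uint32_t mva);

    extern struct MMachine *first_machine(void);
    extern struct MMachine *next_machine(struct MMachine *m);
    /* Adjusts the ring-0 stack pointer, the GDT code/data segment bases for
     * the M-page, and the well-known selector slot in that machine's M-page. */
    extern void fixup_machine_for_mva(struct MMachine *m, uint32_t mva);

    extern void reload_tr_and_idtr(uint32_t mva);  /* TSR and IDTR point into the M-page */

    void relocate_mva(void)
    {
      uint32_t mva = pick_free_user_page();

      /* Clobber the chosen page in every known address space tree with a
       * supervisor-only PTE naming the M-kernel page. */
      for (struct AddressSpace *s = first_space(); s; s = next_space(s))
        install_mkernel_pte(s, mva);

      for (struct MMachine *m = first_machine(); m; m = next_machine(m))
        fixup_machine_for_mva(m, mva);

      reload_tr_and_idtr(mva);
      currentMva = mva;   /* recorded so later faults on this page are recognized */
    }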
10.5 Summarizing the M-Machine State

Under the preceding discussion, the per-machine M-kernel page must hold:
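As a summary illustration, one possible layout of that page is sketched below; the sizes and field names are assumptions, with only the 104-byte TSS and the one-page budget taken from the text above.

    /* Hedged sketch of a possible M-page layout. */

    #include <stdint.h>

    #define MPAGE_SIZE        4096
    #define MKERNEL_CODE_MAX  1600    /* assumed budget for the trap/dispatch code */

    struct Tss      { uint8_t  bytes[104]; };   /* hardware task state segment */
    struct SaveArea { uint32_t regs[8], eip, eflags, esp, ss, cs, vector, errcode; };

    struct MPage {
      uint8_t         mkernelCode[MKERNEL_CODE_MAX];  /* Mswitch / trap trampoline */
      struct Tss      tss;             /* shared across machines if Mva is common */
      uint32_t        ring0StackTop;   /* adjusted whenever Mva moves */
      uint16_t        kernelCodeSel;   /* well-known slots (section 10.4) */
      uint16_t        kernelDataSel;
      uint32_t        machineId;       /* logical machine ID (section 9.4.2) */
      struct SaveArea contexts[];      /* as many save areas as will fit */
    };

    _Static_assert(sizeof(struct MPage) <= MPAGE_SIZE,
                   "fixed part of the M-page must fit in one page");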
Separate from the M-page, the M0 kernel must maintain a descriptor table cache for GDT and LDT descriptor tables, and arrange for entries to be mapped in the supervisor region of the target M-machine's address space.

Copyright 2002 by Jonathan Shapiro. All rights reserved. For terms of redistribution, see the GNU General Public License.