Wednesday, July 17, 2013

A simple linux driver for vmlaunch

The idea behind the driver is to demonstrate a real example of how to initialize the Virtual Machine Control Structure (VMCS) and use Intel VT instructions to launch a virtual machine. The driver launches a guest (virtual machine) with vmlaunch, executes one instruction (which causes a vmexit) and then returns to the host. For the vmlaunch instruction to execute successfully, a lot of cpu state (host and guest state) needs to be initialized, all of which is done by this driver. The driver also takes a simple approach to setting up the guest state by making it mirror the host state. This makes the design much simpler: for instance, the guest does not need its own CR3 because it shares the host's. Inline assembly is used generously throughout the driver.

The driver source code (64bit)  is located here:

The sequence leading to the launch of a virtual machine is as follows:

1. Check to make sure the cpu supports VMX.
2. Check to see if the bios has enabled vmxon in the FEATURE_CONTROL_MSR (msr 0x3A).
3. vmxon.
4. vmptrld
5. Initialize guest vmcs
6. vmlaunch
7. Guest code executes, causes a vmexit
8. Back to the host.

Below is some discussion of the code - I have provided some code snippets for clarity.

The starting point is the function vmxon_init( ) :

1. First execute cpuid (leaf 1). Bit 5 of value returned in ecx indicates support for vmx. If cpuid indicates support for vmx then the code continues with normal execution. If vmx support is not indicated, the code exits.

asm volatile("cpuid\n\t"
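The same check can be tried from user space, since cpuid is not a privileged instruction. The following is a minimal sketch, not the driver's code: it assumes a GCC or Clang toolchain that ships the <cpuid.h> helper header, and the names ecx_reports_vmx and cpu_has_vmx are made up for illustration.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header (assumed available) */
#endif

#define CPUID_1_ECX_VMX (1u << 5)   /* CPUID leaf 1, ECX bit 5 reports VMX */

/* Pure helper (hypothetical name): does this leaf-1 ECX value report VMX? */
static int ecx_reports_vmx(uint32_t ecx)
{
    return (ecx & CPUID_1_ECX_VMX) != 0;
}

/* Returns 1 if the running cpu reports VMX support, 0 otherwise. */
static int cpu_has_vmx(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                   /* leaf 1 is not available */
    return ecx_reports_vmx(ecx);
#else
    return 0;                       /* not an x86 cpu */
#endif
}
```

Calling cpu_has_vmx() before loading the module is a cheap way to find out whether step 1 will pass on a given machine.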

2. Read the feature_control_msr and check the lock bit (bit 0) and the vmxon bit (bit 2). Both bits must be set; if not, the code exits.

asm volatile("rdmsr\n"
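The bit test itself is easy to isolate from the rdmsr. Here is a small sketch of the decision, assuming the 64-bit MSR value has already been read (rdmsr is privileged, so it cannot be run from user space). The macro and function names are illustrative and do not come from the driver.

```c
#include <stdint.h>

#define FEATURE_CONTROL_LOCK  (1ull << 0)   /* bit 0: lock bit */
#define FEATURE_CONTROL_VMXON (1ull << 2)   /* bit 2: vmxon enabled */

/* Returns 1 only when both the lock bit and the vmxon bit are set. */
static int feature_control_allows_vmxon(uint64_t msr_value)
{
    uint64_t required = FEATURE_CONTROL_LOCK | FEATURE_CONTROL_VMXON;

    return (msr_value & required) == required;
}
```

A value of 0x5 passes the check, while 0x1 (locked with vmxon disabled) and 0x4 (enabled but not locked) both fail it.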

3. Call allocate_vmxon_region(). This allocates a 4k region for vmxon.

static void allocate_vmxon_region(void) {
   vmxon_region = kmalloc(MYPAGE_SIZE,GFP_KERNEL);

4. Set up the revision id in the vmxon region. The call to vmxon_setup_revid() reads the revision id with a rdmsr of VMX_BASIC_MSR (msr 0x480). The revision id is then copied into the first 4 bytes of the vmxon region.

memcpy(vmxon_region, &vmx_rev_id, 4); //copy revision id to vmxon region

static void vmxon_setup_revid(void){
   asm volatile ("mov %0, %%rcx\n"
       :
       : "m"(vmx_msr_addr)
       : "memory");
   asm volatile("rdmsr\n");
   asm volatile ("mov %%rax, %0\n"
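To make the layout of the MSR value concrete, here is a sketch that pulls the revision id (bits 30:0) and the region size (bits 44:32) out of a 64-bit VMX_BASIC value and stamps the id into a buffer. The helper names are hypothetical and the sample value in the usage note is made up; a real value has to come from rdmsr in the kernel.

```c
#include <stdint.h>
#include <string.h>

/* Bits 30:0 of the VMX_BASIC msr hold the revision id. */
static uint32_t vmx_basic_revision_id(uint64_t vmx_basic)
{
    return (uint32_t)(vmx_basic & 0x7fffffffu);
}

/* Bits 44:32 hold the number of bytes to allocate for vmxon/vmcs regions. */
static uint32_t vmx_basic_region_size(uint64_t vmx_basic)
{
    return (uint32_t)((vmx_basic >> 32) & 0x1fff);
}

/* Copy the revision id into the first 4 bytes of a region. */
static void stamp_revision_id(void *region, uint64_t vmx_basic)
{
    uint32_t rev = vmx_basic_revision_id(vmx_basic);

    memcpy(region, &rev, sizeof rev);
}
```

For an illustrative value such as 0x00da040000000012, the helpers return revision id 0x12 and a region size of 0x400 bytes.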

5. Turn on the VMXE bit in CR4 (bit 13). This enables the virtual machine extensions. Executing vmxon with cr4.vmxe clear raises #UD. The function turn_on_vmxe() accomplishes this.

static void turn_on_vmxe(void) {
   asm volatile("movq %cr4, %rax\n"
           "bts $13, %rax\n"
           "movq %rax, %cr4\n"

6. Now execute vmxon by calling the function do_vmxon(). If vmxon fails for any reason, CF or ZF in rflags will be set. The code checks for this case and captures rflags for debugging (the pushfq/popq pair below).

static void do_vmxon(void) {
   asm volatile (MY_VMX_VMXON_RAX
                 : : "a"(&vmxon_phy_region), "m"(vmxon_phy_region)
                 : "memory", "cc");
   asm volatile("jbe vmxon_fail\n");
     vmxon_success = 1;
     asm volatile("jmp vmxon_finish\n"
     asm volatile ("popq %0\n"
     vmxon_success = 0;
      asm volatile("vmxon_finish:\n");
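The jbe above works because jbe is taken when CF=1 or ZF=1, which are exactly the two failure indications of a VMX instruction: CF set means VMfailInvalid and ZF set means VMfailValid. A small sketch of decoding a saved rflags value follows. The enum and function names are illustrative only.

```c
#include <stdint.h>

#define RFLAGS_CF (1ull << 0)   /* carry flag */
#define RFLAGS_ZF (1ull << 6)   /* zero flag */

enum vmx_status {
    VMX_SUCCEED = 0,
    VMX_FAIL_INVALID = 1,       /* CF=1: no usable current vmcs */
    VMX_FAIL_VALID = 2          /* ZF=1: error number is in the current vmcs */
};

/* Classify the outcome of a VMX instruction from the rflags it left behind. */
static enum vmx_status vmx_status_from_rflags(uint64_t rflags)
{
    if (rflags & RFLAGS_CF)
        return VMX_FAIL_INVALID;
    if (rflags & RFLAGS_ZF)
        return VMX_FAIL_VALID;
    return VMX_SUCCEED;
}
```

Feeding it the value popped after pushfq tells the two failure modes apart, which the single jbe cannot do.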

If vmxon executes successfully, the cpu is now in vmx root operation. In vmx root operation INIT# and A20M# are blocked, CR4.VMXE cannot be cleared, and CR0.PG/PE are fixed to 1. Note: cr4.vmxe can be cleared again after executing vmxoff, which takes the machine out of vmx root operation.

7. Next allocate the vmcs region for the guest and other data structures that are used in vmx non-root operation (iobitmaps, msrbitmaps etc). This is done by allocate_vmcs_region().

static void allocate_vmcs_region(void) {
   vmcs_guest_region  =  kmalloc(MYPAGE_SIZE,GFP_KERNEL);
   io_bitmap_a_region =  kmalloc(MYPAGE_SIZE,GFP_KERNEL);
   io_bitmap_b_region =  kmalloc(MYPAGE_SIZE,GFP_KERNEL);
   msr_bitmap_region  =  kmalloc(MYPAGE_SIZE,GFP_KERNEL);
   virtual_apic_page  =  kmalloc(MYPAGE_SIZE,GFP_KERNEL);

   //Initialize data structures
   memset(vmcs_guest_region, 0, MYPAGE_SIZE);
   memset(io_bitmap_a_region, 0, MYPAGE_SIZE);
   memset(io_bitmap_b_region, 0, MYPAGE_SIZE);
   memset(msr_bitmap_region, 0, MYPAGE_SIZE);
   memset(virtual_apic_page, 0, MYPAGE_SIZE);
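One detail worth checking here is alignment: the vmxon region, the vmcs region, the I/O and MSR bitmaps and the virtual-APIC page must all start on a 4KB boundary. The driver relies on kmalloc(MYPAGE_SIZE, ...) returning such memory. The sketch below is a user-space stand-in, not kernel code. It uses C11 aligned_alloc, with made-up helper names, to show the allocate, zero and verify pattern. In the kernel, a page allocator call such as get_zeroed_page() would be the closest analogue; that is a suggestion rather than what the driver does.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_BYTES 4096u

/* 4KB aligned means the low 12 bits of the address are zero. */
static int is_page_aligned(uintptr_t addr)
{
    return (addr & (PAGE_BYTES - 1)) == 0;
}

/* User-space stand-in: one zeroed, page-aligned 4KB block (NULL on failure). */
static void *alloc_zeroed_page(void)
{
    void *p = aligned_alloc(PAGE_BYTES, PAGE_BYTES);

    if (p)
        memset(p, 0, PAGE_BYTES);
    return p;
}
```

In the driver, the equivalent sanity check would be made on the physical address that is handed to vmxon or vmptrld.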


8. Populate the guest vmcs region with the same revision id as the one used for vmxon region.

memcpy(vmcs_guest_region, &vmx_rev_id, 4); //copy revision id to vmcs region

9. Execute vmptrld. The code checks rflags (ZF and CF) to make sure vmptrld executed successfully.

static void do_vmptrld(void) {
    asm volatile (MY_VMX_VMPTRLD_RAX
                : : "a"(&vmcs_phy_region), "m"(vmcs_phy_region)
                        : "cc", "memory");
     asm volatile("jbe vmptrld_fail\n");
     vmptrld_success = 1;
     asm volatile("jmp vmptrld_finish\n"
     asm volatile ("popq %0\n"
     vmptrld_success = 0;
     asm volatile("vmptrld_finish:\n");

10. Now it's time to initialize the guest vmcs. This is accomplished by initialize_guest_vmcs(). It is advisable to initialize the guest vmcs fields in the same order as the field encodings in Appendix B, Vol 3c. This keeps important fields from being skipped and saves a great deal of headache when debugging a vmlaunch failure due to invalid guest state.

static void initialize_guest_vmcs(void){

1. The latest Intel manuals define newer fields. This code initializes only the fields supported by the earliest processors with VT. For example, VPID is not in the initialization section because not all processors support it.

2. All fields are expanded and written individually rather than in a loop, for ease of debugging. Intel's vmentry checks are detailed, and any initialization issue that causes a vmentry failure is less painful to track down with this layout.
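The grouping used by the initialize_* functions below (16 bit, 64 bit, 32 bit, natural width; control, guest, host) is visible in the field encodings themselves: bits 14:13 of an encoding give the width, bits 11:10 the field type and bits 9:1 the index. As a sketch, with illustrative enum and function names, an encoding can be decoded like this:

```c
#include <stdint.h>

enum vmcs_width {
    VMCS_WIDTH_16 = 0,
    VMCS_WIDTH_64 = 1,
    VMCS_WIDTH_32 = 2,
    VMCS_WIDTH_NATURAL = 3
};

enum vmcs_type {
    VMCS_TYPE_CONTROL = 0,
    VMCS_TYPE_EXIT_INFO = 1,    /* read-only vmexit information */
    VMCS_TYPE_GUEST = 2,
    VMCS_TYPE_HOST = 3
};

/* Bits 14:13 of the encoding select the field width. */
static enum vmcs_width vmcs_field_width(uint32_t enc)
{
    return (enum vmcs_width)((enc >> 13) & 0x3);
}

/* Bits 11:10 select control / exit info / guest state / host state. */
static enum vmcs_type vmcs_field_type(uint32_t enc)
{
    return (enum vmcs_type)((enc >> 10) & 0x3);
}

/* Bits 9:1 are the index of the field within its group. */
static uint32_t vmcs_field_index(uint32_t enc)
{
    return (enc >> 1) & 0x1ff;
}
```

Decoding 0x6c16 and 0x681e, the two encodings this driver writes with raw constants, gives a natural-width host field and a natural-width guest field respectively.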

initialize_16bit_host_guest_state( ):
This function takes care of initializing the 16 bit guest and host states.

A sample initialization of the host and guest ES selector is given below:

  field = VMX_HOST_ES_SEL;
  field1 = VMX_GUEST_ES_SEL;
  asm ("movw %%es, %%ax\n"

A sample initialization of the host and guest TR selector is given below:

   field = VMX_HOST_TR_SEL;
   field1 = VMX_GUEST_TR_SEL;
   asm("str %%ax\n" : "=a"(value));

initialize_64bit_control( ):   
This function takes care of initializing the 64 bit controls.

A sample initialization of the IO bitmaps is given below:

   field = VMX_IO_BITMAP_A_FULL;
   io_bitmap_a_phy_region = __pa(io_bitmap_a_region);
   value = io_bitmap_a_phy_region;

   field = VMX_IO_BITMAP_B_FULL;
   io_bitmap_b_phy_region = __pa(io_bitmap_b_region);
   value = io_bitmap_b_phy_region;

 initialize_64bit_host_guest_state( ):
This function takes care of initializing the 64 bit host/guest state.

 value = 0xffffffffffffffffull;
 value = 0;

initialize_32bit_control( ):
32 bit controls are initialized here:

   value = 0x1f ;

Ideally this code should read the pin_based_ctl msr (msr 0x481) to find the allowed 0's and allowed 1's and then initialize this field. The earliest cpus that supported vmx supported external interrupt exiting (bit 0) and nmi exiting (bit 3). The other bits set to 1 in this field are the must-be-1 bits. Hence the author sets the value to 0x1f.
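The "ideal" approach can be sketched in a few lines. In the capability msr, the low 32 bits report the allowed-0 settings (a 1 there means the control must be 1) and the high 32 bits report the allowed-1 settings (a 0 there means the control must be 0). The helper name below is hypothetical and the msr value in the usage note is only a sample; the real one has to be read with rdmsr.

```c
#include <stdint.h>

/* Fold the desired control bits into what the capability msr permits. */
static uint32_t adjust_vmx_controls(uint32_t desired, uint64_t ctl_msr)
{
    uint32_t must_be_one = (uint32_t)ctl_msr;          /* bits 31:0  */
    uint32_t may_be_one  = (uint32_t)(ctl_msr >> 32);  /* bits 63:32 */

    return (desired | must_be_one) & may_be_one;
}
```

With a sample msr value of 0x0000007f00000016 and desired bits 0 and 3 (0x9), the result is 0x1f, the same constant the driver hard-codes.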

initialize_naturalwidth_control( ):

The CR0 and CR4 guest host mask are initialized below:

   field = VMX_CR0_MASK;
   value = 0;
   field = VMX_CR4_MASK;
   value = 0;
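A mask of 0 means the guest owns every bit of CR0 and CR4. For any bit that is set in the guest/host mask, a guest read returns the corresponding read shadow bit instead of the real one. The sketch below models only that read path; the function name is made up for illustration.

```c
#include <stdint.h>

/* Value the guest sees when it reads CR0 or CR4. */
static uint64_t guest_visible_cr(uint64_t guest_cr, uint64_t mask,
                                 uint64_t read_shadow)
{
    /* Host-owned bits come from the shadow, the rest from the real value. */
    return (guest_cr & ~mask) | (read_shadow & mask);
}
```

With mask = 0, as in this driver, the function returns guest_cr unchanged, so the read shadow contents do not matter.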

initialize_32bit_host_guest_state( ):

It initializes  32 bit guest/host state and a few of the natural width fields. Here are a few examples:

Initializing the AR bytes is a 2-step process. First find the access rights of the segment using the lar instruction. Then rearrange the access rights to match the guest access rights format described in the chapter Virtual Machine Control Structures, Vol 3c (Chapter 24, Table 24.2 in the June 2013 manual).

  asm ("movw %%cs, %%ax\n"
         : "=a"(sel_value));
   asm("lar %%eax,%%eax\n" :"=a"(usable_ar) :"a"(sel_value));
   usable_ar = usable_ar>>8;
   usable_ar &= 0xf0ff; //clear bits 11:8

   field = VMX_GUEST_CS_ATTR;
   value = do_vmread(field);
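The shift and mask can be checked in isolation. lar leaves type, S, DPL and P in bits 15:8 and AVL, L, D/B and G in bits 23:20 of its result, while the vmcs format wants them in bits 7:0 and 15:12 with bits 11:8 reserved. Here is a sketch with a hypothetical helper name and a sample lar result for a 64-bit code segment:

```c
#include <stdint.h>

/* Convert a 32-bit lar result into the vmcs access-rights layout. */
static uint32_t lar_to_vmcs_ar(uint32_t lar_result)
{
    uint32_t ar = lar_result >> 8;  /* type/S/DPL/P move down to bits 7:0 */

    ar &= 0xf0ff;                   /* clear reserved bits 11:8 */
    return ar;
}
```

For a sample lar result of 0x00a09b00 the helper returns 0xa09b, and whatever lar leaves in bits 19:16 is dropped by the mask.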

This code also initializes the GDT base (a natural width field) along with its limit (a 32 bit field) for convenience. The same process is repeated for IDTR and TR.

  asm("sgdt %0\n" : "=m"(gdtb)); //sgdt writes to memory, so gdtb is an output
   value = gdtb&0x0ffff;
   gdtb = gdtb>>16; //base

     gdtb |= 0xffff000000000000ull;
   field = VMX_HOST_GDTR_BASE;
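In 64-bit mode sgdt stores 10 bytes: a 16-bit limit followed by a 64-bit base. The snippet above keeps only the first 8 bytes (the limit plus the low 48 bits of the base) and then ORs in the top 16 bits, which recovers a canonical kernel address. A sketch that parses the full 10-byte image and also shows the sign-extension step is given below. The struct and function names are illustrative, and the bytes in the usage note are made up.

```c
#include <stdint.h>

struct dt_register {
    uint16_t limit;
    uint64_t base;
};

/* Parse the 10-byte memory image written by sgdt/sidt in 64-bit mode. */
static struct dt_register parse_dt_register(const uint8_t raw[10])
{
    struct dt_register r;
    int i;

    r.limit = (uint16_t)(raw[0] | (raw[1] << 8));
    r.base = 0;
    for (i = 0; i < 8; i++)
        r.base |= (uint64_t)raw[2 + i] << (8 * i);
    return r;
}

/* Rebuild a canonical 64-bit address from its low 48 bits. */
static uint64_t sign_extend_48(uint64_t low48)
{
    low48 &= 0x0000ffffffffffffull;
    if (low48 & (1ull << 47))
        low48 |= 0xffff000000000000ull;
    return low48;
}
```

For the bytes 7f 00 00 90 00 00 00 88 ff ff this gives limit 0x7f and base 0xffff880000009000, and sign_extend_48(0x880000009000) returns the same base.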

initialize_naturalwidth_host_guest_state( ):

Initializes the natural width guest and host states.

As the snippets below show, the host and guest cr0, cr3 and cr4 are identical.

   field =  VMX_HOST_CR0;
   field1 = VMX_GUEST_CR0;
   asm ("movq %%cr0, %%rax\n"

   field =  VMX_HOST_CR3;
   field1 = VMX_GUEST_CR3;
   asm ("movq %%cr3, %%rax\n"

   field =  VMX_HOST_CR4;
   field1 = VMX_GUEST_CR4;
   asm ("movq %%cr4, %%rax\n"

11. The last piece of initialization is the guest and host rip:

    After a vmexit the cpu transfers control to the host at the label 'vmexit_handler'.
   //host rip
   asm ("movq $0x6c16, %rdx");
   asm ("movq $vmexit_handler, %rax");
   asm ("vmwrite %rax, %rdx");

   After a vmentry the cpu transfers control to the guest at the label 'guest_entry_point'.
   //guest rip
   asm ("movq $0x681e, %rdx");
   asm ("movq $guest_entry_point, %rax");
   asm ("vmwrite %rax, %rdx");

12. Finally the vmlaunch:

   asm volatile (MY_VMX_VMLAUNCH);
   asm volatile("jbe vmexit_handler\n");
   asm volatile("nop\n"); //will never get here

   asm volatile("guest_entry_point:");  //after vmlaunch the code gets here
   asm volatile(MY_VMX_VMCALL); //vmcall causes a vmexit with exit reason=0x12
   asm volatile("ud2\n"); //will never get here
   asm volatile("vmexit_handler:\n");   //after vmexit code starts executing here

13. If the launch completes successfully, then the guest executes vmcall and vmexits.

14. After the vmexit, the vmexit_handler takes over. It reads the exit_reason and prints a message.

   asm volatile("vmexit_handler:\n");
   field_1 = VMX_EXIT_REASON;
   value_1 = do_vmread(field_1);
   asm volatile("sti\n");
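The value read from the exit reason field is more than a plain number: bits 15:0 hold the basic exit reason and bit 31 is set when the exit is really a failed vmentry. Here is a sketch of splitting the two, with illustrative names; the constant 18 (0x12) is the vmcall exit reason mentioned above.

```c
#include <stdint.h>

#define EXIT_REASON_VMCALL 18u      /* 0x12 */

/* Bits 15:0 of the exit reason field: the basic exit reason. */
static uint16_t exit_reason_basic(uint32_t exit_reason)
{
    return (uint16_t)(exit_reason & 0xffff);
}

/* Bit 31: set when the vmexit reports a vmentry failure. */
static int exit_reason_is_entry_failure(uint32_t exit_reason)
{
    return (int)((exit_reason >> 31) & 1);
}
```

A successful run of this driver should report a basic reason equal to EXIT_REASON_VMCALL with the entry-failure bit clear. A value such as 0x80000021 (basic reason 33 with bit 31 set) points at invalid guest state instead.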

15. When the driver module is removed (rmmod .ko), the vmxon_exit( ) function is called. It does the following: (a) executes vmxoff, (b) turns off CR4.VMXE, (c) deallocates the vmcs region and other data structures, (d) deallocates the vmxon region.

static void vmxon_exit(void) {
   if(vmxon_success==1) {
     vmxon_success = 0;

The dealloc just frees all the allocated memory regions:

static void deallocate_vmxon_region(void) {