Friday, January 29, 2016

Initialization of Kernel Code

I was wondering what happens when you initialize your module, that is, when you type insmod module_name.ko.

Most code paths in the kernel run through function pointers. You will be tracing some source code, find no direct function call being made, and yet execution somehow comes back to the same place. This is the magic that happens when function pointers are used.

So you may ask: what happens when I insert my module? How does the kernel keep track of my custom functions (open, close, read, write, ioctl, etc.)?

The fact is that we mere mortals are not doing anything extraordinary when we write module operation functions. We are simply attaching our functions (read, write, ...) to a template of module operations.

This kind of framework is essential for the kernel to manage each module operation successfully.
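
To make the "template of operations" idea concrete, here is a minimal user-space sketch. The struct demo_ops and every function name in it are hypothetical and much simpler than anything in the kernel (where, for character devices, struct file_operations plays this role). It only shows the pattern: the driver fills in a table of function pointers, and the caller dispatches through the table without knowing the function names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, much-simplified operations template (not a kernel type). */
struct demo_ops {
    int  (*open)(void);
    long (*read)(char *buf, size_t len);
};

/* "Driver" side: ordinary functions with matching signatures. */
static int my_open(void)
{
    return 0; /* 0 means success */
}

static long my_read(char *buf, size_t len)
{
    static const char msg[] = "hello";
    size_t n = sizeof(msg) - 1;

    if (n > len)
        n = len;
    memcpy(buf, msg, n);
    return (long)n; /* number of bytes copied */
}

/* Attach our functions to the template. */
static const struct demo_ops my_ops = {
    .open = my_open,
    .read = my_read,
};

/* "Kernel" side: knows only the table, never my_open or my_read by name. */
long demo_dispatch_read(const struct demo_ops *ops, char *buf, size_t len)
{
    if (ops == NULL || ops->read == NULL)
        return -1; /* operation not provided */
    if (ops->open != NULL && ops->open() != 0)
        return -1; /* open failed */
    return ops->read(buf, len);
}
```

Calling demo_dispatch_read(&my_ops, buf, sizeof(buf)) ends up in my_read() even though no direct call to my_read() appears anywhere, which is exactly the effect you see when tracing kernel code.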

#ifndef MODULE
 /**
  * module_init() - driver initialization entry point
  * @x: function to be run at kernel boot time or module insertion
  *
  * module_init() will either be called during do_initcalls() (if
  * builtin) or at module insertion time (if a module).  There can only
  * be one per module.
  */
 #define module_init(x)  __initcall(x);

As the comment above says, for built-in code the function passed to module_init() is called during do_initcalls(). What does that mean? Essentially, module_init() registers our init function in an array of function pointers, and do_initcalls() walks that array and calls the functions one by one, level by level and, within a level, in link order.

 #define __initcall(fn) device_initcall(fn)

#define device_initcall(fn)             __define_initcall(fn, 6)

/* initcalls are now grouped by functionality into separate
 * subsections. Ordering inside the subsections is determined
 * by link order.
 * For backwards compatibility, initcall() puts the call in
 * the device init subsection.
 *
 * The `id' arg to __define_initcall() is needed so that multiple initcalls
 * can point at the same handler without causing duplicate-symbol build errors.
 */
 #define __define_initcall(fn, id) \
         static initcall_t __initcall_##fn##id __used \
         __attribute__((__section__(".initcall" #id ".init"))) = fn; \
         LTO_REFERENCE_INITCALL(__initcall_##fn##id)

Ignoring the trailing LTO_REFERENCE_INITCALL() line, the macro body is one statement:

static initcall_t __initcall_##fn##id __used __attribute__((__section__(".initcall" #id ".init"))) = fn;

typedef int (*initcall_t)(void);

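As you can see, initcall_t is a function pointer type, so each __initcall_* variable holds the address of an init function and is placed in the .initcall<id>.init section.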
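
The same trick can be tried in user space. The sketch below is an assumption-laden imitation, not kernel code: my_initcall(), the section name my_initcalls, and run_initcalls() are hypothetical names, and it relies on GCC (or Clang) with GNU ld on an ELF system, where the linker defines __start_<section> and __stop_<section> symbols for any section whose name is a valid C identifier.

```c
#include <stddef.h>

typedef int (*initcall_t)(void);

/* Hypothetical user-space stand-in for __define_initcall(): store the
 * function's address in a pointer variable placed in a dedicated section. */
#define my_initcall(fn) \
    static initcall_t __initcall_##fn \
    __attribute__((used, section("my_initcalls"))) = fn

/* Provided by GNU ld: the bounds of the my_initcalls section. */
extern initcall_t __start_my_initcalls[];
extern initcall_t __stop_my_initcalls[];

int calls_made; /* counts how many init functions have run */

static int first_init(void)  { calls_made++; return 0; }
static int second_init(void) { calls_made++; return 0; }

my_initcall(first_init);
my_initcall(second_init);

/* Stand-in for do_initcalls(): treat the section as an array of function
 * pointers and call each entry. Returns the number of failing initcalls. */
int run_initcalls(void)
{
    int failures = 0;

    for (initcall_t *fn = __start_my_initcalls; fn < __stop_my_initcalls; fn++)
        if ((*fn)() != 0)
            failures++;
    return failures;
}
```

Nothing calls first_init() or second_init() by name. They run only because the linker gathered their addresses into one section, which is the idea behind the .initcall sections above.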

Who calls main(), and can we do more before calling main()?

When you write your application in C, your code isn't the only thing that gets programmed. Before your application can perform its first action, the C Runtime Environment startup code must configure the device to run code produced by a C compiler.


There are several things the C Runtime Environment startup code must do before your application's code can run.
  • Allocate space for a software stack and initialize the stack pointer
    On 8-bit devices that have a hardware based return address stack, the software stack is mostly used for parameter passing to and from functions. On 16- and 32-bit devices the software stack also stores the return address for each function call and interrupt.
  • Allocate space for a heap (if used)
    A heap is a block of RAM that has been set aside as a sort of scratchpad for your application. C can allocate memory dynamically at run time (for example with malloc()), and that memory comes from the heap.
  • Copy values from Flash into variables declared with initial values
    Variables declared with initial values (e.g. int x=10;) must have those initial values loaded into memory before the program can use them. The initial values are stored in flash program memory (so they will be available after the device is power cycled) and are copied into each RAM location allocated to an initialized variable for its storage.
  • Clear uninitialized RAM
    Any RAM (file register) not allocated to a specific purpose (variable storage, stack, heap, etc.) is cleared so that it will be in a known state.
  • Disable all interrupts
  • Call main(), where your application code starts.
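
Two of the steps above, copying initial values and clearing RAM, followed by the final call into main(), can be sketched in plain C. This is only an illustration under stated assumptions: struct crt_image, crt_startup(), and the small arrays standing in for flash and RAM are all hypothetical. Real startup code is usually assembly that gets its region addresses from the linker script, and setting the stack pointer or disabling interrupts cannot be expressed in portable C at all.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of the regions a startup routine handles. */
struct crt_image {
    const uint8_t *data_init; /* initial values stored in flash */
    uint8_t *data;            /* RAM for initialized variables */
    size_t data_len;
    uint8_t *bss;             /* RAM that must start out cleared */
    size_t bss_len;
};

/* Copy initial values, clear the rest, then hand over to the entry point. */
int crt_startup(const struct crt_image *img, int (*entry)(void))
{
    memcpy(img->data, img->data_init, img->data_len);
    memset(img->bss, 0, img->bss_len);
    return entry();
}

/* Arrays standing in for flash and RAM in this demo. */
static const uint8_t flash[2] = { 10, 20 };
static uint8_t ram_data[2];
static uint8_t ram_bss[4] = { 0xAA, 0xAA, 0xAA, 0xAA }; /* power-on garbage */

/* Stand-in for the application's main(). */
static int fake_main(void)
{
    return ram_data[0] + ram_data[1] + ram_bss[0];
}

int crt_demo(void)
{
    struct crt_image img = {
        .data_init = flash,
        .data = ram_data,
        .data_len = sizeof(ram_data),
        .bss = ram_bss,
        .bss_len = sizeof(ram_bss),
    };
    return crt_startup(&img, fake_main);
}
```

By the time fake_main() runs, ram_data holds 10 and 20 and the garbage in ram_bss has been cleared, so crt_demo() returns 30.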

The runtime environment setup code is automatically linked into your application. It usually comes from a file with a name like crt0.s (assembly source) or crt0.o (object code).

The runtime startup code can be modified if necessary. In fact, the source file provides hooks for "user initialization" where you can run code that must execute before the main application begins, such as initializing some external hardware immediately after power is applied. Details on runtime startup code modification will be covered in the compiler specific classes.

crt0.o is an object file with code that is prepended to object code supplied by the user to make an executable. It initializes variables and the stack, and starts the user's program, among other things.

The simplest C runtime code would be
.text    // Select .text section
 b main  // Branch to main() in C source

C runtime

  • crt1.o
    This object file defines the _start symbol. The manner in which this code handles program bootstrap is highly dependent on the particular C library implementation. Some systems use crt0.o while others may even specify crt2.o or higher. Ultimately, whatever gcc has encoded should correspond to the C library in use.
  • crti.o and crtn.o
    crti.o defines the _init and _fini function prologs for the .init and .fini sections, respectively. crtn.o defines the corresponding function epilogs. When the static linker eventually merges all .init and .fini sections of its input object files, the DT_INIT and DT_FINI tags in the dynamic section of its output object file will correspond to the addresses of the complete _init and _fini symbols, respectively.
    During run-time, _start sets up some way for the _init and _fini symbols to be invoked, e.g. via the __libc_csu_init and __libc_csu_fini symbols, respectively, of the C library.
  • crtbegin.o and crtend.o
    The details of the symbols and sections defined in these files vary among architectures. With the Ubuntu 12.04 AMD64 toolchain, these include legacy code that GCC used to find the constructors and destructors i.e. __do_global_dtors_aux and __do_global_ctors_aux.

Some more general information
Some definitions:
PIC - position independent code (-fPIC)
PIE - position independent executable (-fPIE -pie)
crt - C runtime

crt0.o crt1.o etc...
  Some systems use crt0.o, while some use crt1.o (and a few even use crt2.o
  or higher).  Most likely due to a transitionary phase that some targets
  went through.  The specific number is otherwise entirely arbitrary -- look
  at the internal gcc port code to figure out what your target expects.  All
  that matters is that whatever gcc has encoded, your C library better use
  the same name.

  This object is expected to contain the _start symbol which takes care of
  bootstrapping the initial execution of the program.  What exactly that
  entails is highly libc dependent and as such, the object is provided by
  the C library and cannot be mixed with other ones.

  On uClibc/glibc systems, this object initializes very early ABI requirements
  (like the stack or frame pointer), setting up the argc/argv/env values, and
  then passing pointers to the init/fini/main funcs to the internal libc main
  which in turn does more general bootstrapping before finally calling the real
  main function.


  glibc ports call this file 'start.S' while uClibc ports call this crt0.S or
  crt1.S (depending on what their gcc expects).

crti.o
  Defines the function prologs for the .init and .fini sections (with the _init
  and _fini symbols respectively).  This way they can be called directly.  These
  symbols also trigger the linker to generate DT_INIT/DT_FINI dynamic ELF tags.

  These are to support the old style constructor/destructor system where all
  .init/.fini sections get concatenated at link time.  Not to be confused with
  newer prioritized constructor/destructor .init_array/.fini_array sections and
  DT_INIT_ARRAY/DT_FINI_ARRAY ELF tags.

  glibc ports used to call this 'initfini.c', but now use 'crti.S'.  uClibc
  also uses 'crti.S'.

crtn.o
  Defines the function epilogs for the .init/.fini sections.  See crti.o.

  glibc ports used to call this 'initfini.c', but now use 'crtn.S'.  uClibc
  also uses 'crtn.S'.

For statically linked applications, the load process only requires the kernel to make the binary available at its fixed load address before initializing the Program Counter (PC) for the process with the address of the _start symbol. On the other hand, for dynamically linked applications, the kernel first transfers control to the dynamic linker. In turn, the dynamic linker loads the required shared object dependencies and performs any immediate relocations (by default, lazy relocations for function references are performed later on, when the symbols are actually referenced). It then methodically runs the initialization code for the loaded shared objects before handing control over to the executable's _start.
Entering the executable's _start concludes the application's load process and control proceeds to the executable's C run-time code before eventually reaching main.
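
So, can we do more before main() is called? With GCC or Clang, one way is the constructor function attribute, which places a pointer to the function in the prioritized .init_array section mentioned earlier. The sketch below assumes such a toolchain. The hook names and the order/count variables are hypothetical and exist only to record what ran.

```c
/* Record of which hooks ran, in the order they ran. */
int order[2];
int count;

/* Lower priority numbers run first; priorities 0 to 100 are reserved
 * for the implementation, so user code starts at 101. */
__attribute__((constructor(101)))
static void early_hook(void)
{
    order[count++] = 101;
}

__attribute__((constructor(102)))
static void late_hook(void)
{
    order[count++] = 102;
}

/* Returns 1 if both hooks have already run, early_hook first. */
int hooks_ran_in_order(void)
{
    return count == 2 && order[0] == 101 && order[1] == 102;
}
```

If main() calls hooks_ran_in_order() as its very first statement, both hooks have already run, because the C run-time code executes the .init_array entries before it calls main().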

How to enter into Kernel Mode

The only way a user-space application can explicitly initiate a switch to kernel mode during normal operation is by making a system call such as open, read, or write.
Whenever a user application calls one of these system call APIs with appropriate parameters, a software interrupt/exception (SWI) is triggered.

More generally, a user process ends up in kernel mode in one of two ways:
  • it makes a system call, i.e. explicitly requests service from the kernel
  • it traps into the kernel because of either:
    • an error (segmentation violation, invalid instruction, etc.), which is fatal,
    • or a page fault, i.e. an access to a mapped but not resident memory page.
A kernel code snippet is run on request of a user process. This code runs in ring 0 (with current privilege level -CPL- 0), which is the highest level of privilege in x86 architecture. All user processes run in ring 3 (CPL 3). So, to implement system call mechanism, what we need is 1) a way to call ring 0 code from ring 3 and 2) some kernel code to service the request.
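
From user space, the explicit route looks like an ordinary function call. The sketch below assumes Linux with glibc and its default feature-test macros. raw_getpid() is a hypothetical helper name, while syscall() and SYS_getpid are the real generic entry point and system call number provided by the C library.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Ask the kernel for our process ID through the generic syscall() entry
 * point instead of the getpid() wrapper. The C library loads the system
 * call number and executes the instruction that switches to ring 0. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

raw_getpid() returns the same value as getpid(). The wrapper only hides the system call number and the mode switch.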

This software interrupt method (int 0x80 on x86 Linux) turned out to be much slower on Pentium IV processors. To solve this issue, Linus implemented an alternative system call mechanism that takes advantage of the SYSENTER/SYSEXIT instructions provided by all Pentium II+ processors. Before going further with this new way of doing it, let's get more familiar with these instructions.

The SYSENTER instruction is part of the "Fast System Call" facility introduced on the Pentium® II processor. The SYSENTER instruction is optimized to provide the maximum performance for transitions to protection ring 0 (CPL = 0). The SYSENTER instruction sets the following registers according to values specified by the operating system in certain model-specific registers.
  • CS register set to the value of (SYSENTER_CS_MSR)
  • EIP register set to the value of (SYSENTER_EIP_MSR)
  • SS register set to the sum of (8 plus the value in SYSENTER_CS_MSR)
  • ESP register set to the value of (SYSENTER_ESP_MSR)
It looks like the processor is trying to help us. Let's also take a quick look at SYSEXIT:
The SYSEXIT instruction is part of the "Fast System Call" facility introduced on the Pentium® II processor. The SYSEXIT instruction is optimized to provide the maximum performance for transitions to protection ring 3 (CPL = 3) from protection ring 0 (CPL = 0). The SYSEXIT instruction sets the following registers according to values specified by the operating system in certain model-specific or general purpose registers.
  • CS register set to the sum of (16 plus the value in SYSENTER_CS_MSR)
  • EIP register set to the value contained in the EDX register
  • SS register set to the sum of (24 plus the value in SYSENTER_CS_MSR)
  • ESP register set to the value contained in the ECX register
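
The register arithmetic in the two lists above can be written out as a small model. This is plain C that only mirrors the sums described in the text. It does not execute SYSENTER or SYSEXIT, and the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container for the MSR values programmed by the OS. */
struct sysenter_msrs {
    uint32_t cs;  /* SYSENTER_CS_MSR */
    uint32_t eip; /* SYSENTER_EIP_MSR */
    uint32_t esp; /* SYSENTER_ESP_MSR */
};

/* The four registers both instructions load. */
struct cpu_regs {
    uint32_t cs, eip, ss, esp;
};

/* Register values after SYSENTER (transition to ring 0). */
struct cpu_regs after_sysenter(const struct sysenter_msrs *m)
{
    struct cpu_regs r = {
        .cs = m->cs,
        .eip = m->eip,
        .ss = m->cs + 8,
        .esp = m->esp,
    };
    return r;
}

/* Register values after SYSEXIT (transition back to ring 3).
 * EDX and ECX are supplied by the kernel before it executes SYSEXIT. */
struct cpu_regs after_sysexit(const struct sysenter_msrs *m,
                              uint32_t edx, uint32_t ecx)
{
    struct cpu_regs r = {
        .cs = m->cs + 16,
        .eip = edx,
        .ss = m->cs + 24,
        .esp = ecx,
    };
    return r;
}
```

With an example SYSENTER_CS_MSR value of 0x60 (chosen here purely for illustration), SYSENTER yields CS = 0x60 and SS = 0x68, and SYSEXIT yields CS = 0x70 and SS = 0x78. Since each segment descriptor is 8 bytes, the fixed offsets of 8, 16, and 24 mean the kernel code, kernel stack, user code, and user stack descriptors have to sit in four consecutive descriptor table slots.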