Monday, November 30, 2015

What happens when you sleep

Once again a joy ride of a roller coaster that is the Linux kernel.

What happens when you sleep in the kernel. You here is the process context code that has either voluntarily relinquished the CPU or because it was doing a blocking call.

/**
 75  * mutex_lock - acquire the mutex
 76  * @lock: the mutex to be acquired
 77  *
 78  * Lock the mutex exclusively for this task. If the mutex is not
 79  * available right now, it will sleep until it can get it.
 80  *
 81  * The mutex must later on be released by the same task that
 82  * acquired it. Recursive locking is not allowed. The task
 83  * may not exit without first unlocking the mutex. Also, kernel
 84  * memory where the mutex resides must not be freed with
 85  * the mutex still locked. The mutex must first be initialized
 86  * (or statically defined) before it can be locked. memset()-ing
 87  * the mutex to 0 is not allowed.
 88  *
 89  * ( The CONFIG_DEBUG_MUTEXES .config option turns on debugging
 90  *   checks that will enforce the restrictions and will also do
 91  *   deadlock debugging. )
 92  *
 93  * This function is similar to (but not equivalent to) down().
 94  */
 95 void __sched mutex_lock(struct mutex *lock)
 96 {
 97       might_sleep();  //---> Just for Debugging.
 98       /*
 99        * The locking fastpath is the 1->0 transition from
100        * 'unlocked' into 'locked' state.
101        */
102      __mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
103       mutex_set_owner(lock);
104 }
105 
106 EXPORT_SYMBOL(mutex_lock);
Ingo Molnar's function header sums it up. 



Sorting method in Linux Kernel

I came across this source code lying in the kernel SRC, about a NON recursive sorting algorithm which does a pretty decent job at worst and average case scenarios.
The source file is /lib/sort.c
Prototype:void sort(void *base, size_t num, size_t size, int (*cmp_func)(const void *, const void *),          void (*swap_func)(void *, void *, int size))
This function does a heapsort on the given array. You may provide a
* swap_func function optimized to your element type.
* Sorting time is O(n log n) both on average and worst-case. While  qsort is about 20% faster on average, it suffers from exploitable O(n*n) worst-case behavior and extra memory requirements that make it less suitable for kernel use.
Heap sort is simple to implement, performs an O(n·lg(n)) in-place sort, but is not stable.
So our 1st task is to understand the HEAP DS.
A heap is a kind of binary tree having this property: every parent value is greater than (or equal to) its children's.
 The root always holds the max value
If the Parent value is less than its children, then the value is sifted down repeatedly until it proper heap is formed.
Suppose we have the following, arbitrarily assigned, array:assigned array
The right half side (the blue marked side) is already arranged as the bottom row of a heap, because its items have no children: consider, for instance the first 'blue' item, having value 24. Its index is 7, hence the heap properties for such item (h7>= h15, h7>= h16) are automatically satisfied, because there are no items with index 15 or 16 in the array. So the 'blue side' is our partial heap and we can proceed with heap construction by repeatedly sifting down the items of the 'red side'.
3 - sifting down the first 'red' (left side) item of the array
The sift function would be implemented like this:
// The sift function:
// 'sifts down' the a[l] item until heap properties are restored
// ARGS:
//  a[] (INOUT) the array  (where a[l+1], a[l+2] .. a[size-1] is a partial heap)
//  size (IN) number of items of a
//  l (IN) index of the item inserted into the heap

void sift(int a[], int size, int l)
{
  int i,j, x;
  i = l;
  j = 2*i+1;
  x = a[i];
  while (j < size)
  {
    if ( j <size - 1)
      if ( a[j] < a[j+1])
        ++j;
    if (x >= a[j])
      break;
    a[i] = a[j];
    i = j;
    j = 2*i + 1; // sift
  }
  a[i] = x;
}
The creation of heap than becomes like this:
/ makes the heap using the R.W.Floyd's algorithm
// ARGS:
//  a[] (INOUT) the array wherein the heap is created
//  size (IN) the size of the array
 
void make_heap(int a[], int size)
{
  int l = size / 2;
  while (l)
  {
    --l;
    sift(a, size, l);
  }
} 
After all the operations are finished, We will get a heap array with
First element in array as Maximum value element in the unsorted 
array.
The 1st Fig: Unsorted Array. Fig2: Heaped array---> (29 is the max value).
so that the maximum value (29) is now the first item of the array (or, in other words, is on the top of the heap).

We can remove such maximum value (storing it in a safe place) and rearrange the remaining items in order to regain a heap (to obtain another 'maximum', that is the successive value of the sorted sequence).

It turns out the safe place for storing the maximum value is the last item of 
the array. As matter of fact, we swap the first and last item.
Now we have:

The maximum value in its own safe place (grayed, on the right side of the array)
A partial heap to fix (the blue items)
An item to sift down (the read one)
So We implement this heap_sort for arranging in descending order is:
void heapsort(int a[], int size)
{
  int l = 0, r = size;
  make_heap(a, size);
 
  while ( r > 1)
  {
    int tmp = a[0];
    --r;
    a[0] = a[r];
    a[r] = tmp;
    sift(a, r,0);
  }
}
NOTE: All credit for the diagrams and Source code for Heap sort goes to this guy:http://www.codeproject.com/Tips/732196/Heap-Data-Structure-and-Heap-Sort 
Now coming back to our kernel code for sort.c
void sort(void *base, size_t num, size_t size,
            int (*cmp_func)(const void *, const void *),
            void (*swap_func)(void *, void *, int size))
  {
          /* pre-scale counters for performance */
          int i = (num/2 - 1) * size, n = num * size, c, r;
  
          if (!swap_func) {
                  if (size == 4 && alignment_ok(base, 4))
                          swap_func = u32_swap;
                  else if (size == 8 && alignment_ok(base, 8))
                          swap_func = u64_swap;
                  else
                          swap_func = generic_swap;
          }
  
          /* heapify */
          for ( ; i >= 0; i -= size) {
                  for (r = i; r * 2 + size < n; r  = c) {
                          c = r * 2 + size;
                          if (c < n - size &&
                                          cmp_func(base + c, base + c + size) < 0)
                                  c += size;
                          if (cmp_func(base + r, base + c) >= 0)
                                  break;
                          swap_func(base + r, base + c, size);
                  }
          }
  
          /* sort */
          for (i = n - size; i > 0; i -= size) {
                  swap_func(base, base + i, size);
                  for (r = 0; r * 2 + size < i; r = c) {
                          c = r * 2 + size;
                          if (c < i - size &&
                                          cmp_func(base + c, base + c + size) < 0)
                                  c += size;
                          if (cmp_func(base + r, base + c) >= 0)
                                  break;
                          swap_func(base + r, base + c, size);
                  }
         }
 }

Friday, October 23, 2015

The enigma of Spinlocks...

The SMP and locking has always been a tricky affair to handle in latest embedded systems. Nothing can be understood unless one delves into source code of the implementations.

So take for instance:

spin_lock_irqsave(lock, flags)

spin_lock_irqsave is basically used to save the interrupt state before taking the spin lock, this is because spin lock disables the interrupt, when the lock is taken in interrupt context, and re-enables it when while unlocking. The interrupt state is saved so that it should reinstate the interrupts again.

Thats just english ;)... What it actually does is a do..while(0) loop.

#define spin_lock_irqsave(lock, flags)                          \
do 
{                                                            \
         raw_spin_lock_irqsave(spinlock_check(lock), flags);     \
} while (0)

/*
* Map the spin_lock functions to the raw variants for PREEMPT_RT=n
 */
 static inline raw_spinlock_t *spinlock_check(spinlock_t *lock)
 {
         return &lock->rlock;
 }

#define raw_spin_lock_irqsave(lock, flags)                      \
do  \
{                                            \
                typecheck(unsigned long, flags);        \
                flags = _raw_spin_lock_irqsave(lock);   \
 } while (0)

The _raw_spin_lock_irqsave has 2 Variants: 1st A UP one and 2nd a SMP.

Defined as a preprocessor macro in:
              include/linux/spinlock_api_up.h, line 69
              include/linux/spinlock_api_smp.h, line 61

Defined as a function in:
kernel/locking/spinlock.c, line 147

We check the Uni-processor Variant First:
As it is written: 
/*
 * In the UP-nondebug case there's no real locking going on, so the
   * only thing we have to do is to keep the preempt counts and irq
   * flags straight, to suppress compiler warnings of unused lock
   * variables, and to add the proper checker annotations:  
*/

#define _raw_spin_lock_irqsave(lock, flags)     __LOCK_IRQSAVE(lock, flags)

#define __LOCK_IRQSAVE(lock, flags)
do { local_irq_save(flags); __LOCK(lock); } while (0)

#define local_irq_save(flags) ((flags) = 0)
OR
#define local_irq_save(flags)
do {raw_local_irq_save(flags);} while (0)

#define raw_local_irq_save(flags)                       \
         do {                                            \
                 typecheck(unsigned long, flags);        \
                 flags = arch_local_irq_save();          \
         } while (0)

From here it goes into architecture specific code.

#define arch_local_irq_save arch_local_irq_save
 static inline unsigned long arch_local_irq_save(void)
  {
          unsigned long flags, temp;
          asm volatile(
                  "       mrs     %0, cpsr        @ arch_local_irq_save\n"
                  "       orr     %1, %0, #128\n"
                  "       msr     cpsr_c, %1"
                  : "=r" (flags), "=r" (temp)
                  :
                  : "memory", "cc");
          return flags;
  }

What do these Assembly instructions mean:
1st:    mrs    flags, cpsr        @ arch_local_irq_save\n"
                    - MRS{cond} Rd,  cpsr ---> So basically we are copying the contents of cpsr into flags

So what does this CPSR do:

The Current Program Status Register is a 32-bit wide register used in the ARM architecture to record various pieces of information regarding the state of the program being executed by the processor and the state of the processor. This information is recorded by setting or clearing specific bits in the register.

ARM CPSR format

For our spinlock discussion, lets just talk about the I and F bits which determine whether interrupts (such as requests for input/output) are enabled or disabled.

So when in our assembly code:

orr     %1, %0, #128\n" ---> We do Logical OR of flags and #128 -> 0x00000080 (last 2 nibbles 10000000).

So if the interrupts are enabled the ORing at the 7th bit will disable the interrupts.

And then we copy the value of the ORing into temp. Finally we copy temp into CPSR to disable the interrupts.

Thats all the spin_lock_irqsave(flags) does. Disable the interrupts if they are already enabled, also save the CPSR into flags.

So when we do spin_unlock_irqrestore(flags). It is going to do the opposite.

But Now all this code digging is going for a toss when I say that the do while which I showed is actually not going to execute because its while(0). BUMP !!!. Essentially the point of spinlocks to do nothing in a uni-processor environment.

Now, lets talk about the SMP spin_lock case.
#ifdef CONFIG_INLINE_SPIN_LOCK_IRQSAVE
#define _raw_spin_lock_irqsave(lock) __raw_spin_lock_irqsave(lock)

#endif

/*
* If lockdep is enabled then we use the non-preemption spin-ops
  * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
  * not re-enabled during lock-acquire (which the preempt-spin-ops do):
  */
 #if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)

 static inline unsigned long __raw_spin_lock_irqsave(raw_spinlock_t *lock)
 {
         unsigned long flags;

         local_irq_save(flags);
         preempt_disable();
         spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
         /*
          * On lockdep we dont want the hand-coded irq-enable of
          * do_raw_spin_lock_flags() code, because lockdep assumes
          * that interrupts are not re-enabled during lock-acquire:
          */
 #ifdef CONFIG_LOCKDEP
         LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
 #else
         do_raw_spin_lock_flags(lock, &flags);
 #endif
         return flags;
 }

#define local_irq_save(flags)                                   \
        do {                                                    \
                raw_local_irq_save(flags);                      \
        } while (0)

#define raw_local_irq_save(flags)                       \
         do {                                            \
                 typecheck(unsigned long, flags);        \
                 flags = arch_local_irq_save();          \
         } while (0)

Then again the architecture specific stuff.

 #define arch_local_irq_save arch_local_irq_save
 static inline unsigned long arch_local_irq_save(void)
  {
          unsigned long flags, temp;

          asm volatile(
                  "       mrs     %0, cpsr        @ arch_local_irq_save\n"
                  "       orr     %1, %0, #128\n"
                  "       msr     cpsr_c, %1"
                  : "=r" (flags), "=r" (temp)
                  :
                  : "memory", "cc");
          return flags;
  }

 #define preempt_disable() \
 do { \
         preempt_count_inc(); \
         barrier(); \
 } while (0)

#define preempt_count_inc() preempt_count_add(1)

#define preempt_count_add(val)  __preempt_count_add(val)

static __always_inline void __preempt_count_add(int val)
 {
         *preempt_count_ptr() += val;
 }

static inline void barrier(void)
 {
         asm volatile("" : : : "memory");
 }

I hope you enjoyed the ride into the kernel source for understanding spinlocks.

LUCI LUA MVC

Recently I had the opportunity to work on OpenWRT for a Network Security Product. The hardest part for getting the product finished and delivered was customizing the UI.

The default UI of OpenWRT uses LUCI Model-View-Controller Framework. In a traditionally know language like php, Java, javascript, working on MVC is always a pleasure for UI developers. I am by choice a Kernel programmer, and even I would, for once in a while case, touch upon and write some UI components. However, working on LUA based LUCI was a night mare, not only for a non professional UI guy like me, but even for veterans in UI in my team.

The primary reasons why this was the case with LUCI-LUA.
1. No documentation. ( I mean even the kernel has better documentation than LUCI-LUA)
2. The code is not written in a generally accepted and used language (Its written in Lua Language).
3. No one else has written a memoir about there tryst with LUCI-LUA.

So I decided that since recently I just finished adding a dozen custom changes to it, I should share them with readers who might stumble upon it.

The first feature which is not supported by LUCI-LUA is that it is mostly hard-coded with 1 admin user "root". Inherently it does not have multi-user support. But there are hacks lying around to make LUCI-LUA pseudo multi-user.


The reason it is pseudo multi-user is because, even though a new user "Admin" logs into the UI, the underlying user in the linux system is still "root", because the http server which serves LUCI pages is itself running as root.

Suppose you need to add a custom page for a new feature. As an example we add a new Tab under System tab which is to add External Admins configured in a RADIUS Server.

In that case take an existing Template ( like we all do while developing drivers....look into an existing similar code). Add your entry into the System.lua file and then a blank page with necessary CSS and theme automatically pops up with your Menu entry.

Then add your own lua script which does handling of all the logic and UI handling.

Code Snippets:

FIle: system.lua
entry({"admin", "system", "externaladmin"}, cbi("admin_system/externaladmin"), _("External Admin"), 97)

File: externaladmin.lua.

adminform = SimpleForm("userform1", translate("External Admin"))
adminform.reset  = false
username = adminform:field(Value, "uname",translate("Username"))
serverip = adminform:field(Value, "serverip",translate("Server IP"))
key = adminform:field(Value, "key",translate("Key"), translate("Table1: External Admins maintained
in External Radius Server. Table2: Radius Server with Server IPs"))

function adminform.handle(self, state, data)

if state == FORM_VALID then

local file = io.open("/tmp/myuser.log","w+")
local usernametxt = data.uname
local serveriptxt = data.serverip
local keytxt = data.key
local serverfile

if serveriptxt and keytxt and #serveriptxt > 0 and #keytxt > 0 then
command = "sh /root/radius_adt.sh "..serveriptxt.." "..keytxt
os.execute(command)
luci.http.redirect(luci.dispatcher.build_url("admin/system/externaladmin"))
end

if usernametxt and #usernametxt > 0 then
command = "sh /root/sysauth_adt.sh add external "..usernametxt.." NULL"
--os.execute(command)
output = luci.util.exec(command)
if output ~= "" then
adminform.errmessage=output
else
luci.http.redirect(luci.dispatcher.build_url("admin/system/externaladmin"))
end
end

file:close()

-- luci.http.redirect(luci.dispatcher.build_url("admin/system/externaladmin"))

end

return true
end

--[[ This section adds table to display External Admins ]]--

tbluser = adminform:section(Table, userdata)

tblcoluser = tbluser:option(DummyValue, "coluser", translate("User Name"))
tblcoldel = tbluser:option(Button, "deluser", translate("Delete"))

tblcoldel.render = function(self, section, scope)
if userdata[section].enabled then
self.title = translate("Delete")
self.inputstyle = "delete"
end
Button.render(self, section, scope)
end

tblcoldel.write = function(self, section)
local userdel = userdata[section].coluser
command = "sh /root/sysauth_adt.sh remove external "..userdel.." NULL"
--command = "deluser "..userdel
os.execute(command)
luci.http.redirect(luci.dispatcher.build_url("admin/system/externaladmin"))
end

Rest of the template remains same. Much of my LUCI LUA experience has been constantly looking at existing code and putting debug prints to understand the flow.

If you want to add a Start up script UI element which allows you to DO a ENABLE/START/STOP/RESTART of your script, then just write a init script.

#!/bin/sh /etc/rc.common
# script to Run DHCP snoop module using previous state
# Copyright (C) 2015 Nevis Networks
# Author: Gadre Nayan Anand Version: 1.0
# This is a init script to enable, start, stop, restart DHCP snoop Feature.
# For More details on commands involved read User Manual.

START=58
STOP=58

FILE1=/etc/config/dhcp_sis
FILE2=/etc/config/trusted_interfaces

start()
{
        local p_m
        local u_b
        while IFS="=" read -r key value;
        do
                case "$key" in
                        "preventive_mode") p_m=$value ;;
                        "use_bridge") u_b=$value;;
                esac
        done < "$FILE1"
        echo preventive_mode = $p_m use_bridge = $u_b

        echo "starting DHCP SNOOP IP SPOOF"
        /usr/sbin/insmod  /root/dhcp_snoop_IP_spoof.ko preventive_mode=$p_m use_bridge=$u_b

while read -r key;
do
echo "a-$key" > /sys/kernel/dhcp/trusted_interfaces
done < "$FILE2"
}

stop()
{
        echo "stopping DHCP SNOOP IP SPOOF"
        /usr/sbin/rmmod dhcp_snoop_IP_spoof
}



And LUCI framework  is really pathetically rigid, especially with the GUI perspective. CSS and other aspects are very hard to change....atleast this is what I felt.

Sunday, October 11, 2015

Pulling up oneself by one's own bootstraps!!!

This has been explained by many on their respective blog posts around the world...so eh...why not another one!!

I went through some Wikipedia entries as well to understand how the world wide community explains this stuff...and i got some pretty new buzz words to write here...the Hen and the egg problem for example...explains initramfs quite straight forward manner.

So lets start and push your power button.

This probably causes the x86 microprocessor to reset, and hence loads it's program counter (PC)
with a fixed address near the top of the one-megabyte address space assignable in real mode.

Did I write real mode...Yes!! to all those undergraduates who have read that there are two or modes supported by the x86 processors Real, protected, etc....the only code that actually uses Real Mode 16 bit addressing scheme (No paging) etc...is the Boot-loader. Yes!! the boot-loader is the place in today's era that uses 16 bit addressable memory and instructions ( o_O again!!)...this is kept for backward compatibility.

A full power-on self-test (POST) is run (although you can stop POST from running evertime you boot -cold boot by pressing the 3 finger salute ctrl+alt+del,This saves the time otherwise used to detect and test all memory. ).  The POST checks, identifies, and initializes system devices such as the CPU, RAM, interrupt and DMA controllers and other parts of the chipset, video display card, keyboard, hard disk drive, optical disc drive and other basic hardware.

On a more detailed Note:

The BIOS does several things. This is its usual sequence:
Check the CMOS Setup for custom settings
Load the interrupt handlers and device drivers
Initialize registers and power management
Perform the power-on self-test (POST)
Display system settings
Determine which devices are bootable

Initiate the bootstrap sequence

Interrupt handlers are small pieces of software that act as translators between the hardware components and the operating system. For example, when you press a key on your keyboard, the signal is sent to the keyboard interrupt handler, which tells the CPU what it is and passes it on to the operating system. The device drivers are other pieces of software that identify the base hardware components such as keyboard, mouse, hard drive and floppy drive. Since the BIOS is constantly intercepting signals to and from the hardware, it is usually copied, or shadowed, into RAM to run faster.

Now the BIOS is just going to look at a bootlable device in the list of drives attached, be it Floppy or HD or CD or USB. Then jumps to the first partition at that drive and loads just 1 sector (512 bytes) at again a predetermined location. The 1st sector contains MBR (Master Boot Record ).

The MBR contains assembly code ( Not C code) since at that so very nascent stage of machine, the C run time environment it self is not set. Now the job of the MBR is to load the Bootloader (GRUB in Linux), bootmgr ( In windows).

The boot loader then loads the Kernel and initramfs. After the kernel is loaded, it unpacks the initramfs (initial RAM filesystem), which becomes the initial root filesystem. The kernel then executes /init as the first process. The early userspace starts.

The purpose of the initramfs is to bootstrap the system to the point where it can access the root filesystem (see FHS for details). This means that any modules that are required for devices like IDE, SCSI, SATA, USB/FW (if booting from an external drive) must be loadable from the initramfs if not built into the kernel; once the proper modules are loaded (either explicitly via a program or script, or implicitly via udev), the boot process continues. For this reason, the initramfs only needs to contain the modules necessary to access the root filesystem; it does not need to contain every module one would ever want to use. The majority of modules will be loaded later on by udev, during the init process.

At the final stage of early userspace, the real root is mounted, and then replaces the initial root filesystem. /sbin/init is executed, replacing the /init process. Arch uses systemd as the default init.

After the Display manager comes up then the Login screen pops up and you can then LOGIN!!!