Introduction to Memory Forensics: Linux Process Data Structures




Memory analysis has become critical in digital forensics because it provides insight into the state of a system that cannot be recovered through traditional media analysis. By analyzing a memory image, we can obtain details of volatile data, such as running processes, loaded modules, network connections, open files, and so on. Fundamentally, memory analysis is concerned with interpreting the seemingly unstructured raw memory collected from a live system as meaningful, actionable information. Gathering and analyzing the contents of a physical memory dump requires the following:


  • Kernel version: specific knowledge of the exact kernel version the memory image was acquired from.
  • Struct members and memory layout: a complete understanding of the data structures and algorithms used by the original operating system (OS). This includes the memory offsets of struct members and their types; a memory analysis framework must have the same layout information to interpret the contents of the image (see the sketch after this list). For example, Windows uses the EPROCESS struct to track per-process information. Similarly, the Linux kernel's process descriptor, struct task_struct, contains all the information that links a process to its open files, memory mappings, signal handlers, network activity, and more.
  • Global variables: almost every OS kernel exposes important global symbols required for analysis. For example, in Linux kernels, init_task marks the beginning of the process linked list and is required for listing processes by walking that list. In Windows systems, PsActiveProcessHead is the head of the active process list.
  • Function addresses: function addresses can be used to uncover the semantics of kernel objects that are important during memory analysis.
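To make the layout requirement concrete, here is a toy sketch of how an analysis tool reads a single struct member out of a raw image. The offset below is invented for illustration; a real framework derives it from debug symbols for the exact kernel build being analyzed.


#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout information for one specific kernel build. */
#define TASK_COMM_OFFSET 0x4b0   /* made-up offset of comm in task_struct */
#define TASK_COMM_LEN    16

/* Copy a task's name out of a raw memory dump, given the offset of
 * its task_struct within the dump buffer. */
static void read_comm(const uint8_t *dump, size_t task_off,
                      char out[TASK_COMM_LEN])
{
    memcpy(out, dump + task_off + TASK_COMM_OFFSET, TASK_COMM_LEN);
    out[TASK_COMM_LEN - 1] = '\0';
}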


Linux Process Data Structures

Every Linux process is represented by a task_struct structure in kernel memory. This structure holds all the information necessary to link a process with its open file descriptors, memory maps, authentication credentials, and more.


task_struct

struct task_struct {
     volatile long state;     /* -1 unrunnable, 0 runnable, >0 stopped */
     void *stack;
     atomic_t usage;
     unsigned int flags;     /* per process flags, defined below */
     unsigned int ptrace;

     int lock_depth;          /* BKL lock depth */

#ifdef CONFIG_SMP
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
     int oncpu;
#endif
#endif

     int prio, static_prio, normal_prio;
     unsigned int rt_priority;
     const struct sched_class *sched_class;
     struct sched_entity se;
     struct sched_rt_entity rt;

#ifdef CONFIG_PREEMPT_NOTIFIERS
     /* list of struct preempt_notifier: */
     struct hlist_head preempt_notifiers;
#endif

     /*
      * fpu_counter contains the number of consecutive context switches
      * that the FPU is used. If this is over a threshold, the lazy fpu
      * saving becomes unlazy to save the trap. This is an unsigned char
      * so that after 256 times the counter wraps and the behavior turns
      * lazy again; this to deal with bursty apps that only use FPU for
      * a short time
      */
     unsigned char fpu_counter;
#ifdef CONFIG_BLK_DEV_IO_TRACE
     unsigned int btrace_seq;
#endif

     unsigned int policy;
     cpumask_t cpus_allowed;

#ifdef CONFIG_PREEMPT_RCU
     int rcu_read_lock_nesting;
     char rcu_read_unlock_special;
     struct list_head rcu_node_entry;
#endif /* #ifdef CONFIG_PREEMPT_RCU */
#ifdef CONFIG_TREE_PREEMPT_RCU
     struct rcu_node *rcu_blocked_node;
#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
#ifdef CONFIG_RCU_BOOST
     struct rt_mutex *rcu_boost_mutex;
#endif /* #ifdef CONFIG_RCU_BOOST */

#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
     struct sched_info sched_info;
#endif

     struct list_head tasks;
#ifdef CONFIG_SMP
     struct plist_node pushable_tasks;
#endif

     struct mm_struct *mm, *active_mm;
#ifdef CONFIG_COMPAT_BRK
     unsigned brk_randomized:1;
#endif
#if defined(SPLIT_RSS_COUNTING)
     struct task_rss_stat     rss_stat;
#endif
/* task state */
     int exit_state;
     int exit_code, exit_signal;
     int pdeath_signal;  /*  The signal sent when the parent dies  */
     /* ??? */
     unsigned int personality;
     unsigned did_exec:1;
     unsigned in_execve:1;     /* Tell the LSMs that the process is doing an
                     * execve */
     unsigned in_iowait:1;


     /* Revert to default priority/policy when forking */
     unsigned sched_reset_on_fork:1;

     pid_t pid;
     pid_t tgid;

#ifdef CONFIG_CC_STACKPROTECTOR
     /* Canary value for the -fstack-protector gcc feature */
     unsigned long stack_canary;
#endif

     /*
      * pointers to (original) parent process, youngest child, younger sibling,
      * older sibling, respectively.  (p->father can be replaced with
      * p->real_parent->pid)
      */
     struct task_struct *real_parent; /* real parent process */
     struct task_struct *parent; /* recipient of SIGCHLD, wait4() reports */
     /*
      * children/sibling forms the list of my natural children
      */
     struct list_head children;     /* list of my children */
     struct list_head sibling;     /* linkage in my parent's children list */
     struct task_struct *group_leader;     /* threadgroup leader */

     /*
      * ptraced is the list of tasks this task is using ptrace on.
      * This includes both natural children and PTRACE_ATTACH targets.
      * p->ptrace_entry is p's link on the p->parent->ptraced list.
      */
     struct list_head ptraced;
     struct list_head ptrace_entry;

     /* PID/PID hash table linkage. */
     struct pid_link pids[PIDTYPE_MAX];
     struct list_head thread_group;

     struct completion *vfork_done;          /* for vfork() */
     int __user *set_child_tid;          /* CLONE_CHILD_SETTID */
     int __user *clear_child_tid;          /* CLONE_CHILD_CLEARTID */

     cputime_t utime, stime, utimescaled, stimescaled;
     cputime_t gtime;
#ifndef CONFIG_VIRT_CPU_ACCOUNTING
     cputime_t prev_utime, prev_stime;
#endif
     unsigned long nvcsw, nivcsw; /* context switch counts */
     struct timespec start_time;           /* monotonic time */
     struct timespec real_start_time;     /* boot based time */
/* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
     unsigned long min_flt, maj_flt;

     struct task_cputime cputime_expires;
     struct list_head cpu_timers[3];

/* process credentials */
     const struct cred __rcu *real_cred; /* objective and real subjective task
                          * credentials (COW) */
     const struct cred __rcu *cred;     /* effective (overridable) subjective task
                          * credentials (COW) */
     struct cred *replacement_session_keyring; /* for KEYCTL_SESSION_TO_PARENT */

     char comm[TASK_COMM_LEN]; /* executable name excluding path
                         - access with [gs]et_task_comm (which lock
                           it with task_lock())
                         - initialized normally by setup_new_exec */
/* file system info */
     int link_count, total_link_count;
#ifdef CONFIG_SYSVIPC
/* ipc stuff */
     struct sysv_sem sysvsem;
#endif
#ifdef CONFIG_DETECT_HUNG_TASK
/* hung task detection */
     unsigned long last_switch_count;
#endif
/* CPU-specific state of this task */
     struct thread_struct thread;
/* filesystem information */
     struct fs_struct *fs;
/* open file information */
     struct files_struct *files;
/* namespaces */
     struct nsproxy *nsproxy;
/* signal handlers */
     struct signal_struct *signal;
     struct sighand_struct *sighand;

     sigset_t blocked, real_blocked;
     sigset_t saved_sigmask;     /* restored if set_restore_sigmask() was used */
     struct sigpending pending;

     unsigned long sas_ss_sp;
     size_t sas_ss_size;
     int (*notifier)(void *priv);
     void *notifier_data;
     sigset_t *notifier_mask;
     struct audit_context *audit_context;
#ifdef CONFIG_AUDITSYSCALL
     uid_t loginuid;
     unsigned int sessionid;
#endif
     seccomp_t seccomp;

/* Thread group tracking */
        u32 parent_exec_id;
        u32 self_exec_id;
/* Protection of (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed,
 * mempolicy */
     spinlock_t alloc_lock;

#ifdef CONFIG_GENERIC_HARDIRQS
     /* IRQ handler threads */
     struct irqaction *irqaction;
#endif

     /* Protection of the PI data structures: */
     raw_spinlock_t pi_lock;

#ifdef CONFIG_RT_MUTEXES
     /* PI waiters blocked on a rt_mutex held by this task */
     struct plist_head pi_waiters;
     /* Deadlock detection and priority inheritance handling */
     struct rt_mutex_waiter *pi_blocked_on;
#endif

#ifdef CONFIG_DEBUG_MUTEXES
     /* mutex deadlock detection */
     struct mutex_waiter *blocked_on;
#endif
#ifdef CONFIG_TRACE_IRQFLAGS
     unsigned int irq_events;
     unsigned long hardirq_enable_ip;
     unsigned long hardirq_disable_ip;
     unsigned int hardirq_enable_event;
     unsigned int hardirq_disable_event;
     int hardirqs_enabled;
     int hardirq_context;
     unsigned long softirq_disable_ip;
     unsigned long softirq_enable_ip;
     unsigned int softirq_disable_event;
     unsigned int softirq_enable_event;
     int softirqs_enabled;
     int softirq_context;
#endif
#ifdef CONFIG_LOCKDEP
# define MAX_LOCK_DEPTH 48UL
     u64 curr_chain_key;
     int lockdep_depth;
     unsigned int lockdep_recursion;
     struct held_lock held_locks[MAX_LOCK_DEPTH];
     gfp_t lockdep_reclaim_gfp;
#endif

/* journalling filesystem info */
     void *journal_info;

/* stacked block device info */
     struct bio_list *bio_list;

#ifdef CONFIG_BLOCK
/* stack plugging */
     struct blk_plug *plug;
#endif

/* VM state */
     struct reclaim_state *reclaim_state;

     struct backing_dev_info *backing_dev_info;

     struct io_context *io_context;

     unsigned long ptrace_message;
     siginfo_t *last_siginfo; /* For ptrace use.  */
     struct task_io_accounting ioac;
#if defined(CONFIG_TASK_XACCT)
     u64 acct_rss_mem1;     /* accumulated rss usage */
     u64 acct_vm_mem1;     /* accumulated virtual memory usage */
     cputime_t acct_timexpd;     /* stime + utime since last update */
#endif
#ifdef CONFIG_CPUSETS
     nodemask_t mems_allowed;     /* Protected by alloc_lock */
     int mems_allowed_change_disable;
     int cpuset_mem_spread_rotor;
     int cpuset_slab_spread_rotor;
#endif
#ifdef CONFIG_CGROUPS
     /* Control Group info protected by css_set_lock */
     struct css_set __rcu *cgroups;
     /* cg_list protected by css_set_lock and tsk->alloc_lock */
     struct list_head cg_list;
#endif
#ifdef CONFIG_FUTEX
     struct robust_list_head __user *robust_list;
#ifdef CONFIG_COMPAT
     struct compat_robust_list_head __user *compat_robust_list;
#endif
     struct list_head pi_state_list;
     struct futex_pi_state *pi_state_cache;
#endif
#ifdef CONFIG_PERF_EVENTS
     struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
     struct mutex perf_event_mutex;
     struct list_head perf_event_list;
#endif
#ifdef CONFIG_NUMA
     struct mempolicy *mempolicy;     /* Protected by alloc_lock */
     short il_next;
     short pref_node_fork;
#endif
     atomic_t fs_excl;     /* holding fs exclusive resources */
     struct rcu_head rcu;

     /*
      * cache last used pipe for splice
      */
     struct pipe_inode_info *splice_pipe;
#ifdef     CONFIG_TASK_DELAY_ACCT
     struct task_delay_info *delays;
#endif
#ifdef CONFIG_FAULT_INJECTION
     int make_it_fail;
#endif
     struct prop_local_single dirties;
#ifdef CONFIG_LATENCYTOP
     int latency_record_count;
     struct latency_record latency_record[LT_SAVECOUNT];
#endif
     /*
      * time slack values; these are used to round up poll() and
      * select() etc timeout values. These are in nanoseconds.
      */
     unsigned long timer_slack_ns;
     unsigned long default_timer_slack_ns;

     struct list_head     *scm_work_list;
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
     /* Index of current stored address in ret_stack */
     int curr_ret_stack;
     /* Stack of return addresses for return function tracing */
     struct ftrace_ret_stack     *ret_stack;
     /* time stamp for last schedule */
     unsigned long long ftrace_timestamp;
     /*
      * Number of functions that haven't been traced
      * because of depth overrun.
      */
     atomic_t trace_overrun;
     /* Pause for the tracing */
     atomic_t tracing_graph_pause;
#endif
#ifdef CONFIG_TRACING
     /* state flags for use by tracers */
     unsigned long trace;
     /* bitmask of trace recursion */
     unsigned long trace_recursion;
#endif /* CONFIG_TRACING */
#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
     struct memcg_batch_info {
          int do_batch;     /* incremented when batch uncharge started */
          struct mem_cgroup *memcg; /* target memcg of uncharge */
          unsigned long nr_pages;     /* uncharged usage */
          unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
     } memcg_batch;
#endif
#ifdef CONFIG_HAVE_HW_BREAKPOINT
     atomic_t ptrace_bp_refcnt;
#endif
};


task_struct structs are allocated by the slab allocator for efficient object reuse. Before the 2.6 kernel series, the task_struct was stored at the end of each process's kernel stack, which made it possible to locate it directly through the stack pointer. Now that the slab allocator manages the task_struct, a small thread_info struct performs that role instead: it is stored at the bottom of the stack for stacks that grow down (or the top for stacks that grow up) and holds a pointer to the task's task_struct.
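For reference, the x86 thread_info of roughly the same kernel era looks like this (abridged; exact members vary by version):


struct thread_info {
     struct task_struct   *task;          /* main task structure */
     struct exec_domain   *exec_domain;   /* execution domain */
     __u32                flags;          /* low-level flags */
     __u32                status;         /* thread synchronous flags */
     __u32                cpu;            /* current CPU */
     int                  preempt_count;  /* 0 => preemptable, <0 => BUG */
     mm_segment_t         addr_limit;     /* thread address space limit */
     struct restart_block restart_block;
};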


The Process Descriptor

Each process is given a unique process identifier, or PID. A PID is a numerical value of type pid_t (usually an int). The default maximum value is 32,768 for compatibility with older Unix and Linux systems, but it can be raised as high as 4,194,304 by writing to /proc/sys/kernel/pid_max. The maximum value is important because it bounds the number of processes that can exist at once.


Process State

The state variable is a set of bits that indicate the state of the task. The state field is declared as follows:


volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */

#define TASK_RUNNING			0x00000000
#define TASK_INTERRUPTIBLE		0x00000001
#define TASK_UNINTERRUPTIBLE		0x00000002
#define __TASK_STOPPED			0x00000004
#define __TASK_TRACED			0x00000008
/* Used in tsk->exit_state: */
#define EXIT_DEAD			0x00000010
#define EXIT_ZOMBIE			0x00000020
#define EXIT_TRACE			(EXIT_ZOMBIE | EXIT_DEAD)
/* Used in tsk->state again: */
#define TASK_PARKED			0x00000040
#define TASK_DEAD			0x00000080
#define TASK_WAKEKILL			0x00000100
#define TASK_WAKING			0x00000200
#define TASK_NOLOAD			0x00000400
#define TASK_NEW			0x00000800
#define TASK_RTLOCK_WAIT		0x00001000
#define TASK_FREEZABLE			0x00002000
#define __TASK_FREEZABLE_UNSAFE	       (0x00004000 * IS_ENABLED(CONFIG_LOCKDEP))
#define TASK_FROZEN			0x00008000
#define TASK_STATE_MAX			0x00010000

#define TASK_ANY			(TASK_STATE_MAX-1)


TASK_RUNNING

This means the task is "supposed to be" on the run queue. The reason it may not yet be on the runqueue is that marking a task as TASK_RUNNING and placing it on the runqueue is not atomic. You need to hold the runqueue_lock read-write spinlock for read in order to look at the runqueue. If you do so, you will then see that every task on the runqueue is in TASK_RUNNING state. However, the converse is not true for the reason explained above. Similarly, drivers can mark themselves (or rather the process context they run in) as TASK_INTERRUPTIBLE (or TASK_UNINTERRUPTIBLE) and then call schedule(), which will then remove it from the runqueue (unless there is a pending signal, in which case it is left on the runqueue).


TASK_INTERRUPTIBLE

The process is sleeping (that is, it is blocked), waiting for some condition to exist. When the condition comes to exist, the kernel sets the process's state to TASK_RUNNING. The process also awakens prematurely and becomes runnable if it receives a signal.


TASK_UNINTERRUPTIBLE

This state is identical to TASK_INTERRUPTIBLE except that it does not wake up and become runnable if it receives a signal. This is used in situations where the process must wait without interruption or when the event is expected to occur quite quickly. Because the task does not respond to signals in this state, TASK_UNINTERRUPTIBLE is less often used than TASK_INTERRUPTIBLE.


TASK_ZOMBIE

The task has terminated, but its parent has not yet issued a wait4() system call. The task's process descriptor must remain in case the parent wants to access it. Once the parent calls wait4(), the process descriptor is deallocated. (In current kernels this state lives in tsk->exit_state as EXIT_ZOMBIE, as shown in the list above.)


TASK_STOPPED

Process execution has stopped; the task is not running, nor is it eligible to run. This occurs if the task receives the SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU signal, or if it receives any signal while it is being debugged.


The kernel can change a process’s state with the set_task_state() function:


set_task_state(task, state);


The kernel can change the current process state with the set_current_state() function:


set_current_state(state)
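A minimal sketch of the classic sleep pattern these calls support (assumes a wait queue q and a wakeup condition defined elsewhere):


DEFINE_WAIT(wait);

add_wait_queue(&q, &wait);
while (!condition) {
     /* mark ourselves sleeping before re-checking the condition */
     prepare_to_wait(&q, &wait, TASK_INTERRUPTIBLE);
     if (signal_pending(current))
          break;               /* woken early by a signal */
     schedule();               /* sleep until woken */
}
finish_wait(&q, &wait);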


The flags field contains information about process states that are not mutually exclusive. A complete list of these flags is available in include/linux/sched.h.


unsigned long flags; /* per process flags, defined below */
/*
 * Per process flags
 */
int errno; 
int debugreg[8]; 
#define PF_ALIGNWARN 0x00000001 /* Print alignment warning msgs */
 /* Not implemented yet, only for 486*/
#define PF_STARTING 0x00000002 /* being created */
#define PF_EXITING 0x00000004 /* getting shut down */
#define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */
#define PF_SUPERPRIV 0x00000100 /* used super-user privileges */
#define PF_DUMPCORE 0x00000200 /* dumped core */
#define PF_SIGNALED 0x00000400 /* killed by a signal */
#define PF_MEMALLOC 0x00000800 /* Allocating memory */
#define PF_VFORK 0x00001000 /* Wake up parent in mm_release */
#define PF_USEDFPU 0x00100000 /* task used FPU this quantum (SMP) */


PF_STARTING and PF_EXITING indicate that the process is just being created or torn down. There are more flags (defined in include/linux/sched.h), but those are used only for process accounting. The errno variable holds the error code of the last failed system call; on return from the system call, it is copied into the global variable errno. The debugreg array contains the 80x86 debugging registers, which at present are used only by the ptrace system call.


Processes List

Before we look at how tasks/processes (we will use the two words as synonyms) are stored by the kernel, we need to understand how the kernel implements circular linked lists. The implementation that follows is the standard used across all the kernel sources. The linked list is declared in linux/list.h and the data structure is simple:


struct list_head {
     struct list_head *next, *prev;
};


The figure below shows how the simple struct list_head is used to maintain a list of data structures.




This file also defines several ready-made macros and functions which you can use to manipulate linked lists. This standardizes the linked list implementation to prevent people "reinventing the wheel" and introducing new bugs. 
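A minimal sketch of the embedding idiom these macros assume (struct and variable names here are hypothetical): the list_head lives inside your own structure, and the list macros recover the containing structure from it:


struct my_event {
     int id;
     struct list_head list;       /* linkage embedded in our struct */
};

static LIST_HEAD(event_list);     /* declare an empty list */

/* After list_add(&e->list, &event_list) for each event e,
 * we can walk the list and recover our structs: */
struct my_event *e;
list_for_each_entry(e, &event_list, list)
     printk(KERN_INFO "event %d\n", e->id);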


Let us now see how the Linux kernel uses circular doubly-linked lists to store the records of processes.


The Task List

Searching for struct list_head inside the definition of struct task_struct gives us:


struct list_head tasks;


This line shows us that the kernel is using a circular linked list to store the tasks. This means we can use the standard kernel linked list macros and functions to traverse through the complete task list.


init is the "mother of all processes" on a Linux system. The init process is started by the kernel as the last step in the boot process. Thus, it is represented at the beginning of the list, although strictly speaking there is no head since this is a circular list. The init task's process descriptor is statically allocated:


extern struct task_struct init_task;


The following shows the linked list representation of processes in memory:



Several other macros and functions are available to help us traverse this list:


for_each_process() is a macro which iterates over the entire task list, although it’s an expensive O(n) operation. It is defined as follows in linux/sched.h:

 

#define for_each_process(p) \
        for (p = &init_task ; (p = next_task(p)) != &init_task ; )


next_task() is a macro defined in linux/sched.h which returns the next task in the list:


#define next_task(p)    list_entry((p)->tasks.next, struct task_struct, tasks)


list_entry() itself is a macro defined in linux/list.h:


/*
 * list_entry - get the struct for this entry
 * @ptr:        the &struct list_head pointer.
 * @type:       the type of the struct this is embedded in.
 * @member:     the name of the list_struct within the struct.
 */
#define list_entry(ptr, type, member) \
        container_of(ptr, type, member)


When a task is created, it is added to the task list. You can access the previous and next items on the task list with list_entry():


list_entry(task->tasks.next, struct task_struct, tasks);
list_entry(task->tasks.prev, struct task_struct, tasks);


The macro container_of(), which recovers the address of an enclosing structure from a pointer to one of its members by subtracting the member's offset (offsetof()), is defined as follows:


#define container_of(ptr, type, member) ({                      \
        const typeof( ((type *)0)->member ) *__mptr = (ptr);    \
        (type *)( (char *)__mptr - offsetof(type,member) );})


Thus, by traversing the entire task list, we can enumerate all the processes running on the system. This is done with the macro for_each_process(task), where task is a pointer of struct task_struct type.
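Putting this together, a minimal sketch (kernel context assumed) that prints the name and PID of every process:


struct task_struct *task;

for_each_process(task) {
     /* task points at each task_struct in turn */
     printk(KERN_INFO "%s [%d]\n", task->comm, task->pid);
}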


Most kernel code deals with the task_struct directly, so it is important to be able to access the current process quickly. This is done with the current macro. current is architecture-specific, because different architectures store the thread_info structure at different locations.


Some architectures store a pointer to the current thread_info structure in a register, so the current macro on those architectures simply reads the register. Register-constrained architectures like x86 instead keep thread_info on the kernel stack. On x86, it is located by masking out the 13 least significant bits of the stack pointer, assuming an 8KB stack:


movl $-8192, %eax
andl %esp,%eax
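The same calculation in C, roughly as the x86 current_thread_info() of this era implemented it:


/* THREAD_SIZE is 8192 here, so ~(THREAD_SIZE - 1) clears the low
 * 13 bits, yielding the stack base where thread_info lives. */
static inline struct thread_info *current_thread_info(void)
{
     struct thread_info *ti;
     __asm__("andl %%esp, %0" : "=r" (ti) : "0" (~(THREAD_SIZE - 1)));
     return ti;
}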


current then dereferences the task member of thread_info to return a pointer to the current task_struct:


current_thread_info()->task;


Using the current macro and init_task, we can write a kernel module that traces from the current process back up to init.


#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/sched.h>

MODULE_LICENSE("GPL");

int init_module(void)
{
     struct task_struct *task;

     /* current points at the running task; follow parent links upward */
     for (task = current; task != &init_task; task = task->parent)
          printk(KERN_INFO "%s [%d]\n", task->comm, task->pid);

     return 0;
}

void cleanup_module(void)
{
     printk(KERN_INFO "Cleaning up.\n");
}


Process Family Tree

All processes are descendants of the init process. The init process is started by the kernel as the last step in the boot process. Every process has one parent, and every process has zero or more children. Processes that are children of the same parent are called sibling processes. Each task_struct has a parent field with a reference to the parent, and a children field which is a linked list that contains the children.
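For example, a minimal sketch that walks the current task's children using the standard list macros:


struct task_struct *task;
struct list_head *list;

/* Each child is linked into the parent's children list
 * through its own sibling member. */
list_for_each(list, &current->children) {
     task = list_entry(list, struct task_struct, sibling);
     printk(KERN_INFO "child: %s [%d]\n", task->comm, task->pid);
}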


Process Creation

You create a task using fork() and the exec() family of calls. fork() clones the current task, updating the PID and some other values; exec() loads a new executable into the address space and begins executing it.


Linux follows copy-on-write, where copying of the address space happens only when the data is written. Linux marks the shared pages so that when a page is written to, it is copied and each process receives its own page containing the data. This makes process creation fast, because the kernel has to change only a few values rather than copy a potentially large address space.


fork() is implemented using the clone() system call. clone() accepts flags that tell it which resources should be shared between the parent and child processes. clone() then calls do_fork(), and do_fork() calls copy_process(), which does most of the work. The fork-family wrappers differ only in the flags they pass to do_fork(), as sketched below.
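For instance, the x86 wrappers of this era passed roughly the following flag sets to do_fork() (an abridged sketch; exact signatures vary by kernel version):


/* fork(): copy everything; deliver SIGCHLD to the parent on exit */
do_fork(SIGCHLD, regs->sp, regs, 0, NULL, NULL);

/* vfork(): share the address space (CLONE_VM) and suspend the parent
 * (CLONE_VFORK) until the child calls exec() or exits */
do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs->sp, regs, 0, NULL, NULL);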


copy_process() does the following:

  • Calls dup_task_struct() to create a new kernel stack, thread_info structure, and task_struct for the new process.
  • Ensures the child process does not exceed the limit on the number of processes for the current user.
  • Clears or resets the task_struct fields that are unique to a process; most values in the task_struct are unchanged.
  • Sets the new process's task_struct state field to TASK_UNINTERRUPTIBLE to ensure it does not run yet.
  • Calls copy_flags() to update the flags member of the new task_struct.
  • Assigns a new PID with alloc_pid().
  • Depending on the flags passed to clone(), either duplicates or shares resources such as open file descriptors.
  • Returns a pointer to the new task_struct.
  • Back in do_fork(), the child process is then run before the parent. In the common case the child calls exec() immediately, so running the child first eliminates any copy-on-write overhead that would occur if the parent ran first.




The figure below shows how do_fork() forks a child process.


The Process Address Space

The process address space is the virtual memory addressable by a process. 


Each process is given a flat 32- or 64-bit address space. Normally the address space is unique to each process, although it can be shared between processes (e.g., between threads). A process does not have permission to access all memory; the areas of legal addresses are called memory areas. A process can dynamically add and remove memory areas via the kernel (see the mmap() sketch below).


Memory areas have permissions associated with them, such as readable, writable, and executable. If a process accesses memory that it does not have permission to access, the kernel terminates it with a segmentation fault (SIGSEGV). Memory areas contain:

  • A memory map of the executable file’s code (the text section).
  • A memory map of the executable file’s initial global variables (the data section).
  • A memory map of the zero page, containing uninitialized variables (the bss section).
  • A memory map of the zero-page used for the process’s user space stack.
  • A text, data, and bss section for each shared library.
  • Any memory mapped files.
  • Any shared memory segments.
  • Any anonymous memory mappings, like those associated with malloc().

Memory areas do not overlap.
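As noted above, a process can add and remove memory areas at runtime. A minimal userspace sketch using mmap() and munmap():


#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
     /* Ask the kernel to add a one-page anonymous memory area,
      * readable and writable, to our address space. */
     size_t len = 4096;
     void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
     if (p == MAP_FAILED) {
          perror("mmap");
          return 1;
     }
     printf("new memory area at %p\n", p);
     munmap(p, len);   /* remove the area again */
     return 0;
}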


    The Memory Descriptor

    A process's address space is represented by a memory descriptor: the mm_struct struct, defined in <linux/sched.h>:


    struct mm_struct {
      struct vm_area_struct *mmap; /* list of memory areas */
      struct rb_root mm_rb; /* red-black tree of VMAs */
      struct vm_area_struct *mmap_cache; /* last used memory area */
      unsigned long free_area_cache; /* 1st address space hole */
      pgd_t *pgd; /* page global directory */
      atomic_t mm_users; /* address space users */
      atomic_t mm_count; /* primary usage counter */
      int map_count; /* number of memory areas */
      struct rw_semaphore mmap_sem; /* memory area semaphore */
      spinlock_t page_table_lock; /* page table lock */
      struct list_head mmlist; /* list of all mm_structs */
      unsigned long start_code; /* start address of code */
      unsigned long end_code; /* final address of code */
      unsigned long start_data; /* start address of data */
      unsigned long end_data; /* final address of data */
      unsigned long start_brk; /* start address of heap */
      unsigned long brk; /* final address of heap */
      unsigned long start_stack; /* start address of stack */
      unsigned long arg_start; /* start of arguments */
      unsigned long arg_end; /* end of arguments */
      unsigned long env_start; /* start of environment */
      unsigned long env_end; /* end of environment */
      unsigned long rss; /* pages allocated */
      unsigned long total_vm; /* total number of pages */
      unsigned long locked_vm; /* number of locked pages */
      unsigned long saved_auxv[AT_VECTOR_SIZE]; /* saved auxv */
      cpumask_t cpu_vm_mask; /* lazy TLB switch mask */
      mm_context_t context; /* arch-specific data */
      unsigned long flags; /* status flags */
      int core_waiters; /* thread core dump waiters */
      struct core_state *core_state; /* core dump support */
      spinlock_t ioctx_lock; /* AIO I/O list lock */
      struct hlist_head ioctx_list; /* AIO I/O list */
    };
    



    All mm_struct objects are linked together in a linked list through the mmlist field. The initial list element is the init_mm memory descriptor, which describes the init process's address space.


    Processes can share their address spaces with their children by passing the CLONE_VM flag to clone(). If CLONE_VM is set, allocate_mm() is not called and the process's mm field points to its parent's memory descriptor. You can see this in copy_mm():


    
    if (clone_flags & CLONE_VM) {
      /*
       * current is the parent process and
       * tsk is the child process during a fork()
       */
      atomic_inc(&current->mm->mm_users);
      tsk->mm = current->mm;
    }


    When a process associated with an address space exits, exit_mm() is called. exit_mm() calls mmput(), which decrements the mm_struct's mm_users count. When mm_users reaches zero, mmdrop() is called to decrement the mm_count counter. When mm_count in turn reaches zero, the free_mm() macro is invoked to return the mm_struct to the mm_cachep slab cache.


    Kernel threads do not have a process address space, so their mm value is NULL. Therefore, when a kernel thread is scheduled, the kernel notices that mm is NULL and keeps the previous process's address space loaded. The kernel then updates the active_mm field of the kernel thread's process descriptor to refer to the previous process's memory descriptor. The kernel thread can then use the previous process's page tables as needed. Because kernel threads do not access user-space memory, they make use of only the information in the address space pertaining to kernel memory, which is the same for all processes.


    Memory Regions (Virtual Memory Areas)

    The kernel represents intervals of linear addresses by means of resources called memory regions (virtual memory areas), which are characterized by an initial linear address, a length, and some access rights. VMAs (Virtual Memory Areas) are represented with the vm_area_struct struct. The struct describes a single memory area that covers a contiguous interval in a given address space. Each memory area has certain properties, like permissions, and a set of associated operations.


    struct vm_area_struct {
      struct mm_struct *vm_mm; /* associated mm_struct */
      unsigned long vm_start; /* VMA start, inclusive */
      unsigned long vm_end; /* VMA end, exclusive */
      struct vm_area_struct *vm_next; /* list of VMAs */
      pgprot_t vm_page_prot; /* access permissions */
      unsigned long vm_flags; /* flags */
      struct rb_node vm_rb; /* this VMA's node in the tree */
      union { /* links to address_space->i_mmap or i_mmap_nonlinear */
        struct {
          struct list_head list;
          void *parent;
          struct vm_area_struct *head;
        } vm_set;
        struct prio_tree_node prio_tree_node;
      } shared;
      struct list_head anon_vma_node; /* anon_vma entry */
      struct anon_vma *anon_vma; /* anonymous VMA object */
      struct vm_operations_struct *vm_ops; /* associated ops */
      unsigned long vm_pgoff; /* offset within file */
      struct file *vm_file; /* mapped file, if any */
      void *vm_private_data; /* private data */
    };


    vm_start is the lowest memory address of the area; vm_end is the first byte after the highest memory address of the area. vm_end - vm_start therefore gives the size of the memory area in bytes.


    The vm_flags field contains bit flags. Unlike the permissions associated with a physical page which the hardware is responsible for, the VMA flags specify behaviour that the kernel is responsible for maintaining. You can see a full list of the flags in the following table:


    Flag            Effect on the VMA and Its Pages
    VM_READ         Pages can be read from.
    VM_WRITE        Pages can be written to.
    VM_EXEC         Pages can be executed.
    VM_SHARED       Pages are shared.
    VM_MAYREAD      The VM_READ flag can be set.
    VM_MAYWRITE     The VM_WRITE flag can be set.
    VM_MAYEXEC      The VM_EXEC flag can be set.
    VM_MAYSHARE     The VM_SHARED flag can be set.
    VM_GROWSDOWN    The area can grow downward.
    VM_GROWSUP      The area can grow upward.
    VM_SHM          The area is used for shared memory.
    VM_DENYWRITE    The area maps an unwritable file.
    VM_EXECUTABLE   The area maps an executable file.
    VM_LOCKED       The pages in this area are locked.
    VM_IO           The area maps a device's I/O space.
    VM_SEQ_READ     The pages seem to be accessed sequentially.
    VM_RAND_READ    The pages seem to be accessed randomly.
    VM_DONTCOPY     This area must not be copied on fork().
    VM_DONTEXPAND   This area cannot grow via mremap().
    VM_RESERVED     This area must not be swapped out.
    VM_ACCOUNT      This area is an accounted VM object.
    VM_HUGETLB      This area uses hugetlb pages.
    VM_NONLINEAR    This area is a nonlinear mapping.


    The vm_ops field points to the object of operations associated with a given memory area. The operations object is different for different types of memory area. The methods object is represented by vm_operations_struct:


    struct vm_operations_struct {
      void (*open) (struct vm_area_struct *);
      void (*close) (struct vm_area_struct *);
      int (*fault) (struct vm_area_struct *, struct vm_fault *);
      int (*page_mkwrite) (struct vm_area_struct *vma, struct vm_fault *vmf);
      int (*access) (struct vm_area_struct *, unsigned long, void *, int, int);
    };
    


    • open is invoked when the memory area is added to an address space.
    • close is invoked when the memory area is removed from an address space.
    • fault is invoked by the page fault handler when a page not present in physical memory is accessed.
    • page_mkwrite is “invoked by the page fault handler when a page that was read-only is made writable”.
    • access “is invoked by access_process_vm when get_user_pages fails”.

    The mmap field of the memory descriptor links together all of the address space's memory area objects in a singly linked list: each vm_area_struct is linked in via its vm_next field, and the areas are sorted by ascending address.
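    A minimal sketch of the classic driver pattern that wires these operations up (the my_* names are hypothetical): the driver's mmap file operation points the new VMA at the driver's own operations object.


    static void my_vma_open(struct vm_area_struct *vma)
    {
      printk(KERN_INFO "VMA open: %lx-%lx\n", vma->vm_start, vma->vm_end);
    }

    static void my_vma_close(struct vm_area_struct *vma)
    {
      printk(KERN_INFO "VMA close\n");
    }

    static struct vm_operations_struct my_vm_ops = {
      .open  = my_vma_open,
      .close = my_vma_close,
    };

    /* The driver's mmap file operation: associate our ops with the VMA.
     * open() is not invoked automatically for the initial mapping. */
    static int my_mmap(struct file *filp, struct vm_area_struct *vma)
    {
      vma->vm_ops = &my_vm_ops;
      my_vma_open(vma);
      return 0;
    }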

