Multiprocessing basic concepts

Multiprocessing basic concepts

In this article I'm going to explain basic principles standing behind  programming using processes in python.


Level 1 - computer and CPU

What is CPU

CPU consumes a set of instructions and produces some output

To get familiar with multiprocessing we have to start from computer and CPU definition.
Computer is a device that computes instructions, saves results and interacts with other devices. The core component of computing power in computers is CPU.
CPU stands for central processing unit. Main purpose of the CPU is to execute instructions. Each CPU has its own set of instructions defined by manufacturer. Transferring a sequence of zeroes and ones as input to this device makes CPU to do some work and produce output as other sequence of zeroes and ones. This is what we call programming . On the low level CPU understands only electronic signals such as 1 (voltage passed) and 0 (no voltage).

Executing instructions

CPU fetches operation codes from memory(RAM) and executes one by one.

0:00
/
CPU executing commands

Registers memory

CPU has special cells where you can store temporary data. They are called registers . Register is a special memory cell located directly inside CPU allowing to store and fetch some data very fast. Programs running on a CPU always use registers to deal with some data. For example, register called Command Pointer or CP stores pointer to currently running command. Other register called Accumulator stores result of mathematical computations. There are also other kinds of registers presented in the CPU.

0:00
/
CPU executing commands and storing something inside registers

I/O devices

Usually you can attach various devices to the CPU in order to increase your capabilities:
* keyboard - allows to write data in human convenient way without need to turn on/off voltage individually on each CPU/Computer line;
* monitor - allows to see output;
* RAM(Random Access Memory) - allows to store some data in runtime and access it fast. Doesn't save data after turning off the power;
* hard drive disk – allows to store data that will be preserved after turning off the power;

You can attach different I/O devices to the CPU

When you turn on your computer, the CPU starts to read commands from the first address available, for example RAM(random access memory).

To run a program we have to load CPU specific commands into RAM(random access memory). CPU fetches commands from RAM one by one and executes them. It's also possible to interact with the devices. For example, you can store data inside HDD(Hard Drive Disk) or show some graphics on a monitor. All such devices have their own interfaces(communication rules). Knowing device interfaces is required to write software interacting with these devices.

Level 2 - operating system

Complexity of writing software

Soon enough you will find that writing a program from scratch to execute on the CPU requires you to write a lot of repeatable and hard code dealing with the devices and resources management.  

Programming computer is too complex because of the details you need to know and manage

Hiding complexity under OS

To make development easier you need some kind of a library which can encapsulate the complexity of the computer. It should "operate" our system. It's called Operating System or OS . It's loaded when you start your computer. CPU starts reading something RAM, where BIOS already loaded your OS from HDD.

OS provides services which can be used by developers. For example, you don't need to remember how to interact with a specific HDD. Instead the code interacting with HDD is already written in the OS. Operating systems usually provide simple interface for complex things using libraries. For example, to create a file you can call single function defined in the OS provided library(C libraries) which will do everything instead of you. This kind of operations called system calls . Using system calls you can tell the OS to do specific things like creating a file or printing something to terminal.

OS provides services and incapsulates devices interaction

Different operating systems have different purposes and features. Key OS you might be familiar with:

  • unix based(Linux and MacOS);
  • windows;

Unix based operation systems have some agreement on how to provide specific interfaces. That's why MacOS and Ubuntu have some common parts and are able to execute similar programs. And that's why you might hear about linux "distributions" which have one common OS core(linux, unix based), but different set of preinstalled software(ubuntu, arch, gentoo, etc).  However you can't run unix program inside windows because windows has different API to run programs. Unix program simply has different structure/definition of a program.

What is process

Operating system defines its own language, abstractions and rules which you should know. The first thing all operating systems provide is an abstraction called process .

Process is an abstraction around currently running program. The abstraction is required to handle the complexity of software development(tracking programm state). To handle the processes and hardware complexity in general operating systems were born.
Process
Process as an abstraction

Process data structure is defined and managed by  operating system. For example, linux kernel is written in C and processes are defined as C structs. The most common attribute you will see while using the OS is PID(process id). It allows us to separate different processes and identify them using integers. For example, OS starts init process with PID=1 . This is the parent of all other processes. Running pycharm program also has its own PID. Knowing PID it's possible to close the process and do various operations with it.

You can see that activity monitor in MacOS can show PID attribute(process id)

Process memory layout

In RAM process represented as a set of special segments.
Usually process is defined using following segments:

  • Text segment. Holds actual code(operations);
  • Data segment. Holds various variables which size is defined statically.
  • Context(registers). Holds information about registers state.
  • Heap. Area of memory which allows to dynamically add new memory data(for example using malloc in C or creating a class instance in Java, C#, etc);
  • Stack. Area of memory where we store temporary variables like function arguments and return values.
OS process memory layout

Child processes

Process can create child process to do some work. Let's imagine you're opening a new tab in chrome browser. Under the hood the browser creates a new child process which starts to render the page. Each child process has his parent's PID.

Spawning child processes

Quasi-parallelism

At the same time CPU executes other processes. But how it's possible? Old computers were able to execute only one process at a time, users felt that software was running in parallel. That's called  quasi-parallel . The main idea of this is the OS switches each process very fast. Thus, you feel like everything runs in parallel and nothing blocks viewing of your chrome page.

0:00
/
Quasy-parallel execution

Parallel execution with multiple cores

One CPU core can execute 1 process at a time. Each process has its own memory variables(context) which are stored inside core registers. Nowadays we have CPUs with multiple cores which means we can execute multiple processes at a time. Each core has its own set of registers and is able to run tasks in parallel with other core.

0:00
/
Multiple cores execution

Creating a child process in Unix and Windows

Different operating systems have different systems calls to create a process. Because process management is handled by OS, you have to use libraries provied by OS. Usually they are implemanted in C. Functions from such libraries called systems calls. Some of them can create child process.

Unix Windows
fork() CreateProcess()
clone()

We will take a closer look at using fork system call later. For now keep in mind that creating a child process in different operating systems requires you to call different functions/methods due to differences in API. These system calls have differences in how these processes spawned.

Level 3 - Python

Unix. Fork a process using python

On different operating systems creating a process is done via using system calls to OS core.

On Unix based operation system programmers have access to a system call "fork" which allows to make a copy of currently running process. New process can be picked up by OS and run in parallel. System call usually refers to a function provided by OS. Child process starts execution from the line where we called fork() . It has the same code and variables. For performance reason fork uses copy on write mechanism. It means that data is located inside the same region of memory, but if you change something, the process copies that data into a new region.

We can fork a process using such code in python:

import os


def calculate_fib(goal_index: int) -> int:
	# Function to be run in paralllel process
    print('Calculating fib sequence: ', goal_index)
    pid = os.getpid()
    print('Running in process with ID: ', pid)

    if goal_index <= 2:
        return 1

    previous, current = 1, 1
    for _ in range(2, goal_index):
        previous, current = current, previous + current

    print(f"Process {pid} completed with result: {current}")
    return current

# make system call to create a copy of processs
new_pid = os.fork()
# Inside child process OS sets new_pid to 0
if new_pid == 0:
    calculate_fib(500)
# Inside parent process new_pid equals to actual child pid 
else:
    current_pid = os.getpid()
    child_pid = new_pid
    print(f"current_pid: {current_pid}, child: {child_pid}")

What happens here? Look at the video below.

0:00
/

Calling os.fork creates a separate process with almost exact variables existied in previous process. They share the same memory till we change something. Inside child process our OS returns new_pid as 0, which means we are in the child process. Inside parent process new_pid equals to child's actual pid. Execution of both processes continues. The first process completes execution and goes away. The second process is still executing. Now process 2 doesn't have link to parent process. OS decides to reassign this link to init process with PID 1. Process with PID 1 is a process create by OS when you start  your computer(unix based OS).

What if we want to wait till the second process finishes and then do some work in the first process? To do that you can use os.waitpid



# make system call to create a copy of processs
new_pid = os.fork()
# Inside child process OS sets new_pid to 0
if new_pid == 0:
    calculate_fib(500)
# Inside parent process new_pid equals to actual child pid 
else:
    current_pid = os.getpid()
    child_pid = new_pid
    print(f"current_pid: {current_pid}, child: {child_pid}")
    # Sleep untill child process is finished
    os.waitpid(new_pid,  0)
    # Do other work with results here
0:00
/
Forking and waiting for child process

Forking is used in a big variety of web/application servers. You will see that many servers  have option to use pre-fork model which means they create processes  via fork before processing web requests.

Gunicorn uses pre-fork to spawn workers processing requests
gunicorn/arbiter.py at 053e15c05c6f8a90c1850cc8cc2bad1bd2f3536a · benoitc/gunicorn
gunicorn ‘Green Unicorn’ is a WSGI HTTP Server for UNIX, fast clients and sleepy applications. - gunicorn/arbiter.py at 053e15c05c6f8a90c1850cc8cc2bad1bd2f3536a · benoitc/gunicorn

Prefork is also used in celery as option to run worker.

celery/prefork.py at master · celery/celery
Distributed Task Queue (development branch). Contribute to celery/celery development by creating an account on GitHub.

Windows. Using CreateProcess() system call

Forking a process works only on unix based OS and has limitations. It doesn't work on windows. Python doesn't provide direct interface to call CreateProcess , but it calls it under the hood in C code.

You should understand this concept of forking a process, but usually you will be spawning new processes using Process class. Python hides all complexity of running processes on different platforms. Under the hood your python interpreter will choose a strategy to create a process based on your platform.

Python checks current OS platform.
For windows python uses  WINAPI CreateProcess function to create a process.

For unix it uses fork

If you want to write platform independent code, you should create processes using Process class. Thus, your code will be working on both windows and unix based OS.

import os
from multiprocessing import Process


def calculate_fib(goal_index: int) -> int:
    print('Calculating fib sequence: ', goal_index)
    pid = os.getpid()
    print('Running in process with ID: ', pid)

    if goal_index <= 2:
        return 1

    previous, current = 1, 1
    for _ in range(2, goal_index):
        previous, current = current, previous + current

    print(f"Process {pid} completed with result: {current}")
    return current


if __name__ == '__main__':
	# Create python class instance representing process
    child_process = Process(name='calculate_fib', target=calculate_fib, args=(500,))
    # Start the process. From this moment 
    # you can see a new process running in OS.
    # For Windows use WINAPI CreateProcess;
	# For Unix use fork;
    child_process.start()
    # Main process will wait for execution of the process spawned
    child_process.join()

Python can use several methods to spawn a process called context or start_method:

However the bottomline is that in unix OS python uses fork, on windows CreatProcess system calls.

Python. Different methods to start a process

You can change start method using set_start_method function in multiprocessing module.

import multiprocessing as mp

def foo(q):
    q.put('hello')

if __name__ == '__main__':
    mp.set_start_method('spawn')
    q = mp.Queue()
    p = mp.Process(target=foo, args=(q,))
    p.start()
    print(q.get())
    p.join()
Code example from docs

Daemonic process and join method

You can pass daemon=True argument when starting a process. This will force the child process to be killed by parent when parent completes execution.

Daemonic processes are killed by parent 

Join method to wait for children

Using join method you can wait for child processes to exit. By default for non-daemonic processes python calls .join automatically.

join() method forces parent to wait for child processes

Inter Process Communication(IPC)

To utilize all power of processes your OS provides several tools how two different processes can communicate with each other. This set of tools is called IPC. Python uses IPC to give a good API  for communicating with processes:

  • communication using messages(queues);
  • communication using pipes;
  • communication using shared memory;

Let's take a closer look at this.

Messages(Queues) in python

Queue is a simple FIFO data structure shared between processes. These processes can read from queue and write into the queue. This object is thread and process safe.

When two processes/threads access shared resource you might face a race condition. It means that two processes can corrupt an object by doing work in parallel. In the resulting object you will see that access order was not predictable overwriting some results of previous process/thread. Queue is thread/process safe
Process/Thread safety
IPC using queue

Pipes  in python

Pipe represents a bidirectional channel between two processes. Each part of the pipe can be used to receive or retrieve data.

If duplex is True (the default) then the pipe is bidirectional.  If duplex is False then the pipe is unidirectional: conn1 can only be used for receiving messages and conn2 can only be used for sending messages.

IPC using pipes

Shared Memory  in python

Processes communication can be established using shared memory model. To do that these processes have to agree on what memory segment to share. Usually OS doesn't allow separate processes to enter other process memory.

IPC using shared memory

In terms of memory two processes map some region of their memory to a shared block of memory. This is done using OS.

Shared memory model

Shared memory manager server

Python can create a separate server process which will manage shared memory. This is more flexible way to manage shared memory. This server process can be running on a separate machine. Thus, you can achieve process communication between processes located on different machines.

Using server process to manage shared memory

Race conditions

Race condition is a situation when two concurrent processes are trying to access shared data and the data is get corrupted due to the concurrent nature of the processes.

Let's take a look at code below:

from multiprocessing import Process, Value


def worker_1(a):
    for i in range(10_000):
        a.value += 1


def worker_2(a):
    for i in range(10_000):
        a.value += 1


if __name__ == '__main__':
    a = Value('i', 0, lock=False)

    process_1 = Process(target=worker_1, args=(a,))
    process_2 = Process(target=worker_2, args=(a,))

    process_1.start()
    process_2.start()
    # With shared memory you call .join explicitly
    process_1.join()
    process_2.join()
    print(a)

At the code above we have shared integer a with initial value 0 . Two processes are trying to increment this value 10 000 times each. What will happen? The variable should equal 20 000

As you can see on the gif above, our code sometimes produce incorrect value. But why? Because += 1 is not atomic operation.

Python disassembles += 1 into smaller steps such as: load 1, load a, add 1 to a, set a to result.

def worker_1(a):                              
    a.value += 1
    
    

print(dis.dis(worker_1))
#   6           0 LOAD_FAST                0 (a)
#              2 DUP_TOP
#              4 LOAD_ATTR                0 (value)
#              6 LOAD_CONST               1 (1)
#              8 INPLACE_ADD
#             10 ROT_TWO
#             12 STORE_ATTR               0 (value)
#             14 LOAD_CONST               0 (None)
#             16 RETURN_VALUE

    
    
    
    
    
Bytecode of increment

Now let's imagine that after bytecode LOAD_CONST (1) our process starts sleeping due to OS scheduler. Another process has already incremented this integer. Then our first process wakes up and sums up old values, then overrides result value in shared memory. Thus, we have corrupted data.

0:00
/
Race condition with non atomic increment

How to solve this? Use locks. For Value you can pass lock=True , which will lock this variable and will not allow another process to increment the variable at the same time. However be aware that this increase time required to finish execution.

a = Value('i', 0, lock=True) # lock=True passed by default

Process and thread safe objects

Threads are not covered in this article. However in simple words they can be described as a unit of execution inside process. One process has at least one thread. Each thread can be running in parallel(not in python though). That's another topic for discussion. For now consider threads as smaller units of execution in a process.
Thread

When two concurrently running processes are trying to use shared object it might lead to undefined behaviour. Objects that have some kind of mechanism to prevent undefined behaviour called process/thread safe.

For example, in django each process/thread processing a request  has db connection object. Inspecting transaction source code shows a comment describing that transaction is thread-safe . It means that it can't a reason for race condition.

https://docs.djangoproject.com/en/4.1/_modules/django/db/transaction/

So, in simple words process/thread safety refers to a shared object which can cause race condition.

Pool of processes

Pool is very common design pattern in concurrent world. Parent process creates several processes called workers. These workers are listening for a message inside queue. When one of the workers get the message it starts executing it.

Pool allows easily to distribute work across multiple units.

0:00
/
Process pool
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))
        
Code example from python docs

This is also called producer-consumer model(pattern) . The model(pattern) itself is being used in a big variety of tools such as:

  • celery;
  • AWS Lambda;
  • AWS SNS;

Sequential and asynchronous models of programming

Sequential model

In regular python programming you think using sequential model. It means that every operation you've written is likely executing sequentially. Let's take an example. Consider that you have a function run_timer which prints something after several second when timer expires.

import time


def run_timer(seconds: int) -> None:
    time.sleep(seconds)
    print(f'Timer triggered after {seconds} seconds!')


def calculate_something():
    print('calculating something')


if __name__ == '__main__':
    run_timer(5)  # <- this operation blocks main process
    calculate_something()

When you run this command you will see that run_timer  function blocks your execution of the script for 5 second.

Sequential execution

Asynchronous model

In asynchronous model run_timer will not block the execution. In nodejs you will face asynchronous model of execution, however in python python interpreter uses sequential model.

asynchronous model can be achieved using a separate process. When you run the code below you will see that running timer don't block main flow.  This is what we call asynchronous model.

import time
from multiprocessing import Process


def run_timer(seconds: int) -> None:
    time.sleep(seconds)
    print(f'Timer triggered after {seconds} seconds!')


def calculate_something():
    print('calculating something')


if __name__ == '__main__':
    run_timer_5_process = Process(target=run_timer, args=(5,))
    run_timer_10_process = Process(target=run_timer, args=(10,))

    print('calling run_timer with 5 seconds')
    run_timer_5_process.start()  # <- this operation doesn't block main process

    print('calling run_timer with 10 seconds')
    run_timer_10_process.start()  # <- this operation doesn't block main process

    print('we are not blocked, timer is ticking concurrently')
    calculate_something()
Async model of execution

After starting timer function our main flow is not blocked. That's why we call it asynchronous model.

You can notice a lot of methods inside multiprocessing module like:

  • pool.apply_async ;
  • queue.get_nowait ;
  • queue.put_nowait ;

They have the same meaning – they are not blocking the main flow of the program(asynchronous).

Further reading and watching