Shared-memory programming with OpenMP
Note
Starting at l7Oa.pdf, slide 45.
Private variables
int x = 5;
#pragma omp parallel num_threads(thread_count) private(x)
{
    int my_rank = omp_get_thread_num();
    printf("Thread %d > before initialization, x = %d\n", my_rank, x);
    x = 2 * my_rank + 2;
    printf("Thread %d > after initialization, x = %d\n", my_rank, x);
}
printf("After parallel block, x = %d\n", x);
What is the output of this code?
Solution
Trick question, we cannot tell. The value of `x` is unspecified:
- at the beginning of the parallel (for) block
- after the completion of the parallel block
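If defined values are needed, the `firstprivate` clause initializes each thread's private copy with the variable's prior value. A minimal sketch, not from the slides (the `thread_count` of 4 is an assumption for illustration):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int thread_count = 4;  /* assumed value, as in the slide examples */
    int x = 5;
    /* firstprivate: each thread's private copy of x starts at 5 */
    #pragma omp parallel num_threads(thread_count) firstprivate(x)
    {
        int my_rank = omp_get_thread_num();
        x = 2 * my_rank + 2;
        printf("Thread %d > after initialization, x = %d\n", my_rank, x);
    }
    /* the original x is unchanged by the parallel block */
    printf("After parallel block, x = %d\n", x);  /* prints 5 */
    return 0;
}
```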
The `default` clause
- allows the programmer to specify the scope of each variable of a block
default(none)
- it is good practice to specify the scope of each variable in a block explicitly
- with this clause, the compiler requires us to specify the scope of every variable used in the block that was declared outside the block
double sum = 0.0;
int k;
double factor;
/* the series 1 - 1/3 + 1/5 - ... converges to pi/4 */
#pragma omp parallel for num_threads(thread_count) \
    default(none) reduction(+:sum) private(k, factor) shared(n)
for (k = 0; k < n; k++) {
    if (k % 2 == 0)
        factor = 1.0;
    else
        factor = -1.0;
    sum += factor / (2 * k + 1);
}
Note
End of l7Oa.pdf, moving on to l7Ob.pdf.
Bubble sort
for (list_length = n; list_length >= 2; list_length--) {
    for (i = 0; i < list_length - 1; i++) {
        if (a[i] > a[i + 1]) {
            tmp = a[i];
            a[i] = a[i + 1];
            a[i + 1] = tmp;
        }
    }
}
Odd-even transposition sort
- basically a parallel bubble sort (the compare-swaps within a bubble sort pass depend on each other, so plain bubble sort is hard to parallelize)
- repeatedly compares and swaps adjacent elements in two alternating phases: an even phase and an odd phase
- the process repeats until no more swaps are needed → the list is fully sorted
// Serial odd-even transposition sort skeleton
for (phase = 0; phase < n; phase++) {
    if ((phase % 2) == 0) {
        // even phase: compare-swap the pairs (a[0],a[1]), (a[2],a[3]), ...
        for (i = 1; i < n; i += 2)
            if (a[i - 1] > a[i]) Swap(&a[i - 1], &a[i]);
    } else {
        // odd phase: compare-swap the pairs (a[1],a[2]), (a[3],a[4]), ...
        for (i = 1; i < n - 1; i += 2)
            if (a[i] > a[i + 1]) Swap(&a[i], &a[i + 1]);
    }
}
Thread pooling
- in OpenMP: depends on the implementation
- Intel's OpenMP runtime creates a pool of threads from the beginning
- threads are forked once and kept in a pool, so they can be reused instead of being re-created for every parallel region
- with a single parallel directive: tells OpenMP to reuse its team of threads for all the for loops inside the block
OpenMP odd-even sort
// Odd-even transposition sort (parallel with OpenMP)
for (int phase = 0; phase < n; phase++) {
    int i, tmp;
    if ((phase % 2) == 0) {
        #pragma omp parallel for num_threads(thread_count) \
            default(none) shared(a, n) private(i, tmp) schedule(static)
        for (i = 1; i < n; i += 2) {
            if (a[i - 1] > a[i]) {
                tmp = a[i - 1];
                a[i - 1] = a[i];
                a[i] = tmp;
            }
        }
    } else {
        #pragma omp parallel for num_threads(thread_count) \
            default(none) shared(a, n) private(i, tmp) schedule(static)
        for (i = 1; i < n - 1; i += 2) {
            if (a[i] > a[i + 1]) {
                tmp = a[i + 1];
                a[i + 1] = a[i];
                a[i] = tmp;
            }
        }
    }
}
with thread pooling:
int i, tmp, phase;  // declared outside the block so they can appear in the private clause
#pragma omp parallel num_threads(thread_count) \
    default(none) shared(a, n) private(i, tmp, phase)
{
    for (phase = 0; phase < n; phase++) {
        if ((phase % 2) == 0) {
            #pragma omp for schedule(static)
            for (i = 1; i < n; i += 2) {
                if (a[i - 1] > a[i]) {
                    tmp = a[i - 1];
                    a[i - 1] = a[i];
                    a[i] = tmp;
                }
            }
        } else {
            #pragma omp for schedule(static)
            for (i = 1; i < n - 1; i += 2) {
                if (a[i] > a[i + 1]) {
                    tmp = a[i + 1];
                    a[i + 1] = a[i];
                    a[i] = tmp;
                }
            }
        }
        // implicit barrier at the end of each omp for ensures
        // threads sync before the next phase
    }
}
Performance
- odd-even sort with two parallel for directives vs. two for directives inside a single parallel block
- times are in seconds

thread_count | 1 | 2 | 3 | 4 |
---|---|---|---|---|
two parallel for directives | 0.770 | 0.453 | 0.358 | 0.305 |
two for directives | 0.732 | 0.376 | 0.294 | 0.239 |

- the two for directives version is faster: the team of threads is forked once and reused across phases, instead of being forked and joined in every phase
Scheduling loops
- in a parallel for directive: the exact assignment of loop iterations to threads is system dependent
- most OpenMP implementations default to roughly a block partitioning (see the sketch below):
  - if the serial loop has n iterations, the first n/thread_count iterations of the parallel loop are assigned to thread 0, the next n/thread_count to thread 1, and so on
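A quick way to observe the assignment on a given system; a minimal sketch, not from the slides (compile with `gcc -fopenmp`, and note the chosen `thread_count` and `n` are arbitrary):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int thread_count = 3;
    int n = 12;
    /* print which thread runs each iteration; with the default
       schedule, most implementations hand out contiguous blocks,
       e.g. thread 0 gets 0..3, thread 1 gets 4..7, thread 2 gets 8..11 */
    #pragma omp parallel for num_threads(thread_count)
    for (int i = 0; i < n; i++)
        printf("iteration %2d -> thread %d\n", i, omp_get_thread_num());
    return 0;
}
```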
Optimal split?
No! A block partitioning is uneven when later iterations cost more (as in the example below): thread `thread_count - 1` will do more work than thread 0.
Better (more evenly distributed) assignment? → assign iterations in a round-robin fashion
Yes! A cyclic partitioning of iterations.
To parallelize this loop:
sum = 0.0;
for (i = 0; i <= n; i++)
    sum += f(i);
double f(int i) {
    int j, start = i * (i + 1) / 2, finish = start + i;
    double return_val = 0.0;
    for (j = start; j <= finish; j++) {
        return_val += sin(j);
    }
    return return_val;
} /* f */
Assignment of work using cyclic partitioning
Thread | Iterations |
---|---|
0 | 0, thread_count, 2 · thread_count, … |
1 | 1, thread_count + 1, 2 · thread_count + 1, … |
Results
- `f(i)` calls the `sin` function i times
- assume the time to execute `f(2i)` requires approx. twice as much time as the time to execute `f(i)`
n | thread # | assignment type | run-time | speedup |
---|---|---|---|---|
10000 | 1 | | 3.67s | |
10000 | 2 | default | 2.76s | 1.33 |
10000 | 2 | cyclic | 1.84s | 1.99 |
- a good assignment of iterations to threads → significant effect on performance
The schedule clause
default schedule
sum = 0.0;
#pragma omp parallel for num_threads(thread_count) \
    reduction(+:sum)
for (i = 0; i <= n; i++)
    sum += f(i);
cyclic schedule
sum = 0.0;
#pragma omp parallel for num_threads(thread_count) \
    reduction(+:sum) schedule(static, 1)
for (i = 0; i <= n; i++)
    sum += f(i);
schedule(type, chunksize)
- type can be:
- static: iterations can be assigned to the threads before the loop is executed (the examples below assume 12 iterations and 3 threads)
schedule(static, 1)
- thread 0: 0,3,6,9
- thread 1: 1,4,7,10
- thread 2: 2,5,8,11
schedule(static, 2)
- thread 0: 0,1,6,7
- thread 1: 2,3,8,9
- thread 2: 4,5,10,11
schedule(static, 4)
- thread 0: 0,1,2,3
- thread 1: 4,5,6,7
- thread 2: 8,9,10,11
- dynamic/guided: iterations are assigned to the threads while the loop is executing (see the sketch after this list)
- dynamic: when a thread finishes a chunk, it requests another one from the run-time system
- guided: same as above, but as chunks are completed, the size of the new chunks decreases
- auto: compiler and/or the run-time system determines the schedule
- runtime: schedule is determined at run time
  - uses the environment variable `OMP_SCHEDULE` to determine at run time how to schedule the loop
  - `OMP_SCHEDULE` can take on any of the values that can be used for a static, dynamic, or guided schedule
- chunksize is a positive integer
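For comparison, a minimal sketch (not from the slides) of the non-static schedule kinds applied to the same loop, assuming `thread_count`, `n`, and `f` are defined as above:

```c
double sum;
int i;

/* dynamic: threads grab chunks of 2 iterations as they become free */
sum = 0.0;
#pragma omp parallel for num_threads(thread_count) \
    reduction(+:sum) schedule(dynamic, 2)
for (i = 0; i <= n; i++)
    sum += f(i);

/* guided: chunks start large and shrink as the loop progresses */
sum = 0.0;
#pragma omp parallel for num_threads(thread_count) \
    reduction(+:sum) schedule(guided)
for (i = 0; i <= n; i++)
    sum += f(i);

/* runtime: pick the schedule at launch time, e.g.
   export OMP_SCHEDULE="dynamic,2" before running the program */
sum = 0.0;
#pragma omp parallel for num_threads(thread_count) \
    reduction(+:sum) schedule(runtime)
for (i = 0; i <= n; i++)
    sum += f(i);
```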