Concurrency

Table of contents

  1. Overview
  2. future-promise
  3. std::packaged_task
  4. async()
  5. Threads
  6. Concurrency problems
  7. Mutexes and locks
  8. Condition variables
  9. Atomics

1. Overview ↑top

The C++11 standard library provides several mechanisms to support concurrency. The first is std::thread, which together with synchronization objects (std::mutex, std::lock_guard, std::condition_variable, etc.) offers a thread-based approach to concurrency.

However, working at the level of threads and locks can be quite tricky, so C++11 also supports a higher level of abstraction, task-based concurrency, in the form of promises and futures. std::promise<T> and std::future<T> work in pairs to separate the act of calling a function from the act of waiting for the call's result.

The class std::packaged_task<T> makes code more readable; it is a container for a task and its promise. The template type is the type of the task function, and it automatically creates and manages a std::promise<T> for us.

Things become much simpler if we use the std::async() function, which hides all the implementation- and platform-specific details. It takes a callable object as input and returns a future that will eventually contain the return value.

2. future-promise ↑top

(i) overview

Fig.1 - Information flow of task-based concurrency.

A std::promise<T> object represents a result on the callee side of the asynchronous call, and it is the channel for passing the result asynchronously to the caller. When the task completes, it puts its result into the promise object by calling promise::set_value.

When the caller finally needs to access the result, it calls the blocking future::get to retrieve it. If the task has already completed, the result is immediately available; otherwise, the caller thread suspends until the result becomes available.

The promise's shared state can be associated with a future object by calling its member get_future. After the call, both objects share the same shared state:
- The promise object is the asynchronous provider and is expected to set a value for the shared state at some point.
- The future object is an asynchronous return object that can retrieve the value of the shared state, waiting for it to be ready, if necessary.

#include <vector>
#include <thread>
#include <future>
#include <numeric>
#include <iostream>
 
void accumulate(std::vector<int>::iterator first,
                std::vector<int>::iterator last,
                std::promise<int> accumulate_promise) {
    int sum = std::accumulate(first, last, 0);
    accumulate_promise.set_value(sum);  // Notify future
}
 
int main() {
    std::vector<int> numbers = { 1, 2, 3, 4, 5, 6 };
    //create promise
    std::promise<int> accumulate_promise;
    //engagement with future
    std::future<int> accumulate_future = accumulate_promise.get_future();
    
    std::thread work_thread(accumulate, numbers.begin(), 
                            numbers.end(), std::move(accumulate_promise));
                            
    accumulate_future.wait();           // wait for result
    std::cout << "result=" << accumulate_future.get() << '\n';
    work_thread.join();                 // wait for thread completion
}

(ii) std::promise<T>

The class template std::promise provides a facility to store a value or an exception that is later acquired asynchronously via a std::future object created by the std::promise object. Each promise is associated with a shared state, which contains some state information and a result, which may be not yet evaluated, evaluated to a value, or evaluated to an exception.

The promise is the "push" end of the promise-future communication channel: the operation that stores a value in the shared state synchronizes-with the successful return from any function that is waiting on the shared state (e.g., std::future::get).
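Since a promise can hold an exception as well as a value, a failing task can forward its error to the caller; future::get() then rethrows it. A minimal sketch (the task and its error are made up for illustration):

#include <future>
#include <iostream>
#include <stdexcept>
#include <thread>

void risky_task(std::promise<int> p) {
    try {
        throw std::runtime_error("computation failed");     //simulate an error
    } catch (...) {
        p.set_exception(std::current_exception());  //store the exception in the shared state
    }
}

int main() {
    std::promise<int> p;
    std::future<int> f = p.get_future();
    std::thread t(risky_task, std::move(p));
    try {
        std::cout << f.get() << '\n';               //get() rethrows the stored exception
    } catch (const std::exception& e) {
        std::cout << "caught: " << e.what() << '\n';
    }
    t.join();
}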

(iii) std::future<T>

The class template std::future provides a mechanism to access the result of asynchronous operations. Besides the blocking get(), it offers wait(), as well as wait_for() and wait_until() for waiting with a timeout.
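For example, a caller can poll a future with wait_for() and inspect the returned std::future_status (a minimal sketch; the durations are arbitrary):

#include <chrono>
#include <future>
#include <iostream>
#include <thread>

int main() {
    std::future<int> f = std::async(std::launch::async, []{
        std::this_thread::sleep_for(std::chrono::milliseconds(300));
        return 7;
    });

    //poll every 100ms until the result is ready
    while (f.wait_for(std::chrono::milliseconds(100)) != std::future_status::ready) {
        std::cout << "still waiting...\n";
    }
    std::cout << "result=" << f.get() << '\n';
}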

3. std::packaged_task ↑top

std::packaged_task wraps any callable (function, lambda, bind expression, or another function object) so that it can be invoked asynchronously. Its return value or the exception thrown is stored in a shared state which can be accessed through std::future objects.

#include <cmath>
#include <functional>
#include <future>
#include <iostream>
#include <thread>

// unique function to avoid disambiguating the std::pow overload set
int f(int x, int y) { return std::pow(x, y); }
 
void task_lambda() {
    std::packaged_task<int(int,int)> task([](int a, int b) {
        return std::pow(a, b); });
    std::future<int> result = task.get_future();
 
    task(2, 9);
    std::cout << "task_lambda:\t" << result.get() << '\n';
}
 
void task_bind() {
    std::packaged_task<int()> task(std::bind(f, 2, 11));
    std::future<int> result = task.get_future();
 
    task();
    std::cout << "task_bind:\t" << result.get() << '\n';
}
 
void task_thread() {
    std::packaged_task<int(int,int)> task(f);
    std::future<int> result = task.get_future();
 
    std::thread task_td(std::move(task), 2, 10);
    task_td.join();
 
    std::cout << "task_thread:\t" << result.get() << '\n';
}
 
int main() {
    task_lambda();
    task_bind();
    task_thread();
}

4. async() ↑top

With packaged_task, we still have to create the threads manually and decide on which thread the task will run. Things become much simpler with the high-level std::async() interface.

(i) example

#include <future>
#include <iostream>

int doSomething (char c) {
    //...
    return c;
}
int func1 () {
    return doSomething('.');
}
int func2 () {
    return doSomething('+');
}
int main() {
    //start func1 asynchronously (now or later or never)
    std::future<int> result1(std::async(func1));
    //call func2 synchronously (here and now)
    int result2 = func2();
    //print result (wait for func1() to finish and add its result to result2)
    int result = result1.get() + result2;
    
    std::cout << result << std::endl;
}

Instead of calling int result = func1() + func2();, we write:

std::future<int> result1(std::async(func1));
int result2 = func2();
int result = result1.get() + result2;

std::async tries to start func1() in the background, and we assign the result to an object of class std::future.

With the call of get(), one of three things might happen:
- if func1() was started in a separate thread and has already finished, we immediately get its result;
- if it was started but has not finished yet, get() blocks and waits for its end;
- if it was not started yet, it is forced to start now, and get() blocks until it finishes.

Without calling get(), there is no guarantee that func1() will ever be called. We have to ensure that we ask for the result of a functionality started with async() no earlier than necessary:

std::future<int> result1(std::async(func1));
//might call func2() after func1() ends
int result = func2() + result1.get();

To get the best effect, in general we should maximize the distance between calling async() and calling get(), i.e., call early and return late.

The object passed to async() may be any type of callable object: a function, a member function, a function object, or a lambda (std::async([]{ ... })).
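A sketch of the different flavors (the class X and its member function are made up for illustration):

#include <future>
#include <iostream>

int freeFunc() { return 1; }

struct X {
    int memberFunc(int v) const { return v * 2; }
};

int main() {
    X x;
    auto f1 = std::async(freeFunc);                 //function
    auto f2 = std::async(&X::memberFunc, &x, 21);   //member function, called on x
    auto f3 = std::async([]{ return 3; });          //lambda
    std::cout << f1.get() + f2.get() + f3.get() << '\n';
}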

(ii) launch policies

The exact behavior of async() is complex and highly depends on the launch policy, which can be passed as the first, optional argument.

async (std::launch policy, Fn&& fn, Args&&... args)

#include <chrono>
#include <future>
#include <iostream>
#include <thread>

void print_ten (char c, int ms) {
  for (int i=0; i<10; ++i) {
    std::this_thread::sleep_for (std::chrono::milliseconds(ms));
    std::cout << c;
  }
}

int main ()
{
  std::cout << "with launch::async: ";
  std::future<void> foo = std::async (std::launch::async,print_ten,'*',100);
  std::future<void> bar = std::async (std::launch::async,print_ten,'@',200);
  // async "get" (wait for foo and bar to be ready):
  foo.get(); bar.get();
  std::cout << "\n\n";

  std::cout << "with launch::deferred: ";
  foo = std::async (std::launch::deferred,print_ten,'*',100);
  bar = std::async (std::launch::deferred,print_ten,'@',200);
  // deferred "get" (perform the actual calls):
  foo.get(); bar.get();
  std::cout << '\n';

  return 0;
}

possible output:
with launch::async: **@**@**@*@**@*@@@@@

with launch::deferred: **********@@@@@@@@@@

5. Threads ↑top

(i) class std::thread

The class thread represents a single thread of execution. Threads allow multiple pieces of code to run asynchronously and simultaneously.

Constructors:

#include <functional>
#include <thread>

void f1(int n) {
    //... works with its own copy of n
}
void f2(int& n) {
    //... works with the caller's n via reference
}

int main() {
    int n = 0;
    std::thread t1;     //t1 is not a thread
    std::thread t2(f1, n+1);    //pass by value
    std::thread t3(f2, std::ref(n));    //pass by ref
    std::thread t4(std::move(t3));  //t4 is now running f2()
                                    //t3 is no longer a thread
    t2.join();
    t4.join();
}

Observers: joinable(), get_id(), native_handle(), and the static hardware_concurrency().

Operations: join(), detach(), and swap().

(ii) namespace this_thread

For any thread, including the main thread, <thread> declares the namespace std::this_thread, which provides the thread-specific global functions get_id(), yield(), sleep_for(), and sleep_until().
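A minimal sketch exercising these functions:

#include <chrono>
#include <iostream>
#include <thread>

int main() {
    std::cout << "main thread ID: " << std::this_thread::get_id() << '\n';
    std::this_thread::yield();      //hint: reschedule, let other threads run
    std::this_thread::sleep_for(std::chrono::milliseconds(100));    //block for a duration
    std::this_thread::sleep_until(                                  //block until a time point
        std::chrono::steady_clock::now() + std::chrono::milliseconds(100));
}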

(iii) basic usage

To start a thread, we simply declare an object of class std::thread, pass the desired task as the initial argument, and then either wait for it to end or detach it:

void doSomething();

std::thread t(doSomething); //start doSomething() in the background
...
t.join();                   //wait for t to finish (block until doSomething() ends)

As with async(), we can pass anything that is a callable object (function, member function, function object, lambda) together with possible additional arguments. Unless you really know what you are doing, you should pass all objects necessary to process the passed functionality by value, so that the thread uses only local copies.

void doSomething(int num, char c) {
    //any uncaught exception would cause the program to terminate
    try {
        ...
    }
    //make sure no exception leaves the thread and terminates the program
    catch(const exception& e) {
        cerr << "thread-exception (thread " 
             << this_thread::get_id() << "): " << e.what() << endl;
    }
    catch(...) {
        cerr << "thread-exception (thread "
             << this_thread::get_id() << ")" << endl;
    }
}

int main() {
    //creating a thread might throw a std::system_error
    try{
        thread t1(doSomething, 5, '.'); //print 5 dots in separate thread
        cout << "- started fg thread " << t1.get_id() << endl;
        
        //print other chars in other bg threads
        for(int i=0; i<5; ++i) {
            thread t(doSomething, 10, 'a'+i); //print 10 chars in separate thread
            cout << "- detach started bg thread " << t.get_id() << endl;
            t.detach();                 //detach thread into the bg
        }
        
        cin.get();                      //wait for any input (return)
        cout << "- join fg thread " << t1.get_id() << endl;
        t1.join();                      //wait for t1 to finish
    }
    catch (const exception& e) {
        cerr << "exception: " << e.what() << endl;
    }
}

Detached threads can easily become a problem if they use nonlocal resources. Passing variables and objects to a thread by reference is always a risk; passing by value is strongly recommended.

The lifetime problem also applies to global and static objects: when the program exits, a detached thread might still be running, which means it might access global or static objects that are already destroyed or still under construction. We should therefore ensure that these global/static objects are not destroyed before all detached threads accessing them have finished, for example by having the detached threads signal their end (via a condition variable or an atomic flag) and waiting for that signal before leaving main().

Because std::cin, std::cout, std::cerr, and the other global stream objects are not destroyed during program execution, accessing these objects in detached threads should introduce no undefined behavior. However, other problems, such as interleaved characters, might still occur.

The only safe way to know that a detached thread has truly finished is to have it signal its end with one of the "..._at_thread_exit()" functions; these perform their signaling only after the thread has finished and released its thread-local objects, so the main thread can safely wait on that signal.
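A minimal sketch using std::notify_all_at_thread_exit(), which releases the passed lock and notifies the condition variable only after the detached thread has truly finished (the names detachedWork and done are made up for illustration):

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool done = false;

void detachedWork() {
    //... do the actual work
    std::unique_lock<std::mutex> lk(m);
    done = true;
    std::notify_all_at_thread_exit(cv, std::move(lk));  //unlock and notify at thread exit
}

int main() {
    std::thread(detachedWork).detach();
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, []{ return done; });    //safe: the thread has finished when this returns
    std::cout << "detached thread finished\n";
}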

thread IDs
Each thread has an ID of the special type std::thread::id, which is guaranteed to be unique for each running thread. Thread IDs can be obtained from the thread object or inside a thread via namespace this_thread.

std::thread t1(doSomething, 5, '.');
std::thread t2(doSomething, 5, '+');
std::thread t3(doSomething, 5, '*');
std::cout << "t3 ID:    " << t3.get_id()    << endl;
std::cout << "main ID:  " << std::this_thread::get_id() << endl;
std::cout << "nothread ID: " << std::thread::id() << endl;

The only operations allowed on thread IDs are comparisons and output to a stream. We cannot make any further assumptions, such as that "no thread" has ID 0 or that the main thread has ID 1.

std::thread::id masterThreadID;
void doSomething(){
    if(std::this_thread::get_id() == masterThreadID) {
        ...
    }
}

int main() {
    std::thread master(doSomething);
    masterThreadID = master.get_id();
    ...
    std::thread slave(doSomething);
    ...
}

6. Concurrency problems ↑top

Each compiler can optimize code as long as the behavior of the program visible from the outside stays the same (the as-if rule). Hence, both compiler and hardware vendors can reorder code to speed up the program, as long as the observable behavior remains stable. E.g., compilers might unroll loops, reorder statements, eliminate dead code, or prefetch data, and on modern architectures, hardware buffers might reorder loads or stores.

(i) problems

To give compilers and hardware enough freedom to optimize code, C++ does NOT in general give a couple of guarantees, because they might cost too much in performance. As a consequence, concurrent data access might suffer from the following problems:
- unsynchronized data access: two threads reading and writing the same data in parallel, with an undefined outcome;
- half-written data: a reader seeing a value that another thread has only partially updated;
- reordered statements: statements and operations becoming visible in a different order than written.

(ii) Features to solve the problems

To solve these three major problems of concurrent data access, we need the following concepts:
- atomicity: reading or writing a variable or a sequence of statements without interruption, so that no half-written data can be seen;
- order: guarantees about the order in which operations become visible to other threads.

The C++ standard library provides different ways (from high-level to low-level) to deal with these concepts: futures and promises, mutexes and locks, condition variables, atomics with the default (sequentially consistent) memory order, and the low-level atomic interface with relaxed memory orders.

7. Mutexes and locks ↑top

(i) std::mutex

A mutex (mutual exclusion) is a synchronization primitive that can be used to protect shared data from being simultaneously accessed by multiple threads. std::mutex offers exclusive, non-recursive ownership semantics.

#include <iostream>
#include <chrono>
#include <thread>
#include <mutex>
 
int g_num = 0;  // protected by g_num_mutex
std::mutex g_num_mutex;
 
void slow_increment(int id) {
    for (int i = 0; i < 3; ++i) {
        g_num_mutex.lock();
        ++g_num;
        std::cout << id << " => " << g_num << '\n';
        g_num_mutex.unlock();
 
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}
 
int main() {
    std::thread t1(slow_increment, 0);
    std::thread t2(slow_increment, 1);
    t1.join(); t2.join();
}

This simple lock-unlock approach can, however, become pretty complicated. E.g., we should ensure that an exception, which ends an exclusive access, also unlocks the corresponding mutex; otherwise, a resource might become locked forever. Deadlock scenarios are also possible, with two threads each waiting for the other's lock before freeing their own.

To deal with exceptions (to guarantee exception safety), we should not lock and unlock ourselves; instead, we should use the RAII principle (Resource Acquisition Is Initialization), whereby the constructor acquires a resource and the destructor, which is always called even when an exception ends the lifetime, releases the resource automatically.

(ii) std::lock_guard

std::mutex is usually not accessed directly: std::unique_lock and std::lock_guard are used to manage locking in an exception-safe manner. Note that locks should be held for the shortest period possible, because they block other code from running in parallel.

#include <chrono>
#include <iostream>
#include <map>
#include <mutex>
#include <string>
#include <thread>

std::map<std::string, std::string> g_pages;
std::mutex g_pages_mutex;
 
void save_page(const std::string &url) {
    // simulate a long page fetch
    std::this_thread::sleep_for(std::chrono::seconds(2));
    std::string result = "fake content";
 
    std::lock_guard<std::mutex> guard(g_pages_mutex);
    g_pages[url] = result;
}//lock released here
 
int main() {
    std::thread t1(save_page, "http://foo");
    std::thread t2(save_page, "http://bar");
    t1.join(); t2.join();
 
    // safe to access g_pages without lock now, as the threads are joined
    for (const auto &pair : g_pages) {
        std::cout << pair.first << " => " << pair.second << '\n';
    }
}

recursive locks

Sometimes, the ability to lock recursively is required. Typical examples are active objects or monitors, which contain a mutex and acquire a lock in every public member function so that data races cannot corrupt the internal state of the object. E.g., a database interface might look as follows:

class DatabaseAccess {
 private:
    std::mutex dbMutex;
    ... //state of database access
 public:
    void createTable (...) {
        std::lock_guard<std::mutex> lg(dbMutex);
        ...
    }
    void insertData (...) {
        std::lock_guard<std::mutex> lg(dbMutex);
        ...
    }
    ...
};

When we introduce a public member function that might call other public member functions, this can become complicated:

void createTableAndInsertData (...) {
    std::lock_guard<std::mutex> lg(dbMutex);
    ...
    createTable(...);   //ERROR: deadlock because dbMutex is locked again
}

Calling createTableAndInsertData() results in a deadlock: after locking dbMutex, the call of createTable() tries to lock dbMutex again. That second lock blocks until dbMutex becomes available, which never happens, because createTableAndInsertData() holds the lock until createTable() is done.

A recursive mutex is a lockable object, just like mutex, but it allows the same thread to acquire multiple levels of ownership over the mutex object. The lock is released when the last corresponding unlock() is called.

class DatabaseAccess {
 private:
    std::recursive_mutex dbMutex;
    ... //state of database access
 public:
    void createTable (...) {
        std::lock_guard<std::recursive_mutex> lg(dbMutex);
        ...
    }
    void insertData (...) {
        std::lock_guard<std::recursive_mutex> lg(dbMutex);
        ...
    }
    void createTableAndInsertData (...) {
        std::lock_guard<std::recursive_mutex> lg(dbMutex);
        ...
        createTable(...);   //OK: no deadlock
    }
    ...
};

tried and timed locks

Sometimes a program wants to acquire a lock but should not block (forever) if that is not possible. For this case, mutexes provide a try_lock() member function that tries to acquire a lock without blocking.

std::mutex m;

//try to acquire a lock and do other stuff if not psbl
while (m.try_lock() == false) {
    doSomeOtherStuff();
}
std::lock_guard<std::mutex> lg(m, std::adopt_lock); //adopt the lock we just acquired
...

To wait only for a particular amount of time, we can use a timed mutex: try_lock_for() and try_lock_until() are provided by std::timed_mutex and std::recursive_timed_mutex.

std::timed_mutex m;

//try for 1sec to acquire a lock
if (m.try_lock_for(std::chrono::seconds(1))) {
    std::lock_guard<std::timed_mutex> lg(m, std::adopt_lock);
    ...
} else {
    couldNotGetTheLock();
}

mutex constants

Mutex constants are used as tag arguments for unique_lock to select a specific constructor:
- std::adopt_lock: assume the calling thread already owns the mutex (also supported by lock_guard)
- std::defer_lock: do not acquire ownership of the mutex yet
- std::try_to_lock: try to acquire ownership without blocking

dealing with multiple locks

C++ enables us to lock multiple mutexes while avoiding deadlock: std::lock() locks all mutexes passed as arguments, blocking until all mutexes are locked or an exception is thrown; in the latter case, it unlocks the mutexes it had already locked. Its sibling std::try_lock() tries to lock all mutexes without blocking, returning -1 on success or the zero-based index of the first mutex it could not lock (see the second sketch below).
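A sketch of std::lock(), handing the already locked mutexes over to lock_guards with std::adopt_lock:

std::mutex m1, m2;

std::lock(m1, m2);  //lock both mutexes deadlock-free (or neither, on exception)
std::lock_guard<std::mutex> lockM1(m1, std::adopt_lock);
std::lock_guard<std::mutex> lockM2(m2, std::adopt_lock);
... //both mutexes unlocked at end of scope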

std::mutex m1, m2;

int idx = std::try_lock(m1, m2);    //try to lock both mutexes
if (idx < 0) {      //-1 means both locks succeeded
    std::lock_guard<std::mutex> lockM1(m1, std::adopt_lock);
    std::lock_guard<std::mutex> lockM2(m2, std::adopt_lock);
    ...
} /*auto unlock all mutexes*/ else {
    //idx has zero-based index of first failed lock
    cerr << "could not lock mutex m" << idx+1 << endl;
}

(iii) class unique_lock

Besides class lock_guard<>, the C++ standard library provides class unique_lock<>, which is a lot more flexible when dealing with mutex locks. It allows deferred locking, time-constrained attempts at locking, recursive locking, transfer of lock ownership, and use with condition variables.

std::mutex Mutex;

std::unique_lock<std::mutex> Foo() {
    std::unique_lock<std::mutex> lock(Mutex);
    return lock;
    // mutex isn't unlocked here!
}

void Bar() {
    auto lock = Foo();
}   // mutex is unlocked when lock goes out of scope

Member functions: lock(), try_lock(), try_lock_for(), try_lock_until(), unlock(), release(), owns_lock(), operator bool(), and mutex().
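For example, deferred locking lets us construct the locks first and then acquire both mutexes in one deadlock-free step:

std::mutex m1, m2;

std::unique_lock<std::mutex> ul1(m1, std::defer_lock);  //associate, but don't lock yet
std::unique_lock<std::mutex> ul2(m2, std::defer_lock);
std::lock(ul1, ul2);    //lock both mutexes deadlock-free
... //both locks released at end of scope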

(iv) calling once for multiple threads

Sometimes multiple threads share functionality that should be processed exactly once, when the first thread needs it. A typical example is lazy initialization: the first time any of the threads needs something that has to be computed, we compute it (but not before, to save the cost of computing it when it is never needed).

For a single-threaded environment:

static std::vector<std::string> staticData;

void foo() {
    if (staticData.empty()) {
        staticData = initializeStaticData();
    }
    ...
}

Such code doesn't work in a multithreaded context because of the data race on the check. Instead of using a mutex, we can use the C++ standard library's std::once_flag and std::call_once():

std::once_flag oc;              //global flag
...
std::call_once(oc, initialize); //initialize if not initialized yet

static std::vector<std::string> staticData;

void foo() {
    static std::once_flag oc;
    std::call_once(oc, []{
        staticData = initializeStaticData();
    });
    ...
}

The first argument passed to call_once() must be the corresponding once_flag; further arguments are the usual arguments for callable objects (function, member function, function object, or lambda), plus optional arguments for the function called. Thus, lazy initialization of an object used in multiple threads might look as follows:

class X {
 private:
    mutable std::once_flag initDataFlag;
    void initData() const;
 public:
    data getData() const {
        std::call_once(initDataFlag, &X::initData, this);
        ...
    }
};

8. Condition variables ↑top

Sometimes, tasks performed by different threads have to wait for each other, so we have to synchronize concurrent operations for reasons other than access to the same data.

Condition variables can be used to synchronize logical dependencies in the data flow between threads. A condition variable is a variable by which a thread can wake up one or more other waiting threads.

(i) steps

In principle, a condition variable works as follows:
- The waiting thread locks a mutex (with a unique_lock), checks the condition, and calls wait() on the condition variable, which atomically releases the mutex and suspends the thread.
- The notifying thread locks the same mutex, updates whatever the condition depends on, unlocks the mutex, and calls notify_one() or notify_all().

Thus, the thread providing or preparing something simply calls notify_one() or notify_all() on the condition variable, which is the moment for one or all of the waiting threads to wake up.

Condition variables in general might have so-called spurious wakeups, i.e., a wait on a condition variable may return even if the condition variable has not been notified. Thus, a wakeup does not necessarily mean that the required condition now holds. Rather, after a wakeup we still need code to verify that the condition in fact holds.

#include <condition_variable>
#include <future>
#include <iostream>
#include <mutex>

bool readyFlag = false;                 //a flag signaling that the condition is satisfied
std::mutex readyMutex;                  //a mutex to protect the flag
std::condition_variable readyCondVar;   //a condition variable to signal the change

//locks the mutex, updates the condition, unlocks the mutex, and notifies the condition variable
void thread1() {
    //do sth thread2 needs as preparation
    std::cout << "<return>" << std::endl;
    std::cin.get();
    
    //signal that thread1 has prepared a cond
    {
        std::lock_guard<std::mutex> lg(readyMutex);
        readyFlag = true;
    }//release lock
    readyCondVar.notify_one();
}

void thread2() {
    //wait until thread1 is ready (readyFlag is true)
    {
        std::unique_lock<std::mutex> ul(readyMutex);
        readyCondVar.wait(ul, []{ return readyFlag; });
    }//release lock
    
    //do whatever shall happen after thread1 has prepared things
    std::cout << "done" << std::endl'
}

int main() {
    auto f1 = std::async(std::launch::async, thread1);
    auto f2 = std::async(std::launch::async, thread2);

}

The waiting thread locks the mutex with a unique_lock, waits for the notification while checking the condition, and releases the lock:

{
    std::unique_lock<std::mutex> ul(readyMutex);
    readyCondVar.wait(ul, []{ return readyFlag; });
}//release lock

Here, the wait() member of the condition variable is used as follows: we pass the lock ul for the mutex readyMutex as the first argument and, as the second argument, a lambda as a callable object that double-checks the condition. The effect is that wait() internally loops until the passed callable returns true. Thus, the code has the same effect as the following:

{
    std::unique_lock<std::mutex> ul(readyMutex);
    while (!readyFlag) {
        readyCondVar.wait(ul);
    }
}//release lock

(ii) example of a multi-thread queue

Three threads push values into a queue that two other threads read and process:

#include <chrono>
#include <condition_variable>
#include <future>
#include <iostream>
#include <mutex>
#include <queue>

std::queue<int> queue;
std::mutex queueMutex;
std::condition_variable queueCondVar;

void provider (int val) {
    //push different vals
    for (int i=0; i<6; ++i) {
        {
            std::lock_guard<std::mutex> lg(queueMutex);
            queue.push(val+i);
        }//release lock
        queueCondVar.notify_one();
        
        std::this_thread::sleep_for(
                    std::chrono::milliseconds(val));
    }
}

void consumer (int num) {
    //pop vals if available (num identifies the consumer)
    while(true) {
        int val;
        {
            std::unique_lock<std::mutex> ul(queueMutex);
            queueCondVar.wait(ul, []{ return !queue.empty(); });
            val = queue.front();
            queue.pop();
        }//release lock
        std::cout << "consumer " << num << ": " << val << endl;
    }
}

int main() {
    //start three providers for values 100+, 300+, 500+
    auto p1 = std::async(std::launch::async, provider, 100);
    auto p2 = std::async(std::launch::async, provider, 300);
    auto p3 = std::async(std::launch::async, provider, 500);

    //start two consumers printing the vals
    auto c1 = std::async(std::launch::async, consumer, 1);
    auto c2 = std::async(std::launch::async, consumer, 2);
}

9. Atomics ↑top

Once a std::atomic<T> object has been constructed, operations on it behave as if they were inside a mutex-protected critical section, but the operations are generally implemented using special machine instructions that are more efficient than would be the case if a mutex were employed.

std::atomic<int> ai(0);     //initialize ai to 0
ai = 10;                    //atomically set ai to 10
std::cout << ai;            //atomically read ai's value
++ai;                       //atomically increment ai to 11
--ai;                       //atomically decrement ai to 10

During execution of these statements, other threads reading ai may see only the values 0, 10, or 11. No other values are possible.

(i) examples of using atomics

using lock:

#include <mutex>
...
   
bool readyFlag;
std::mutex readyFlagMutex;

void thread1() {
    // do something thread2 needs as preparation
    ...
    std::lock_guard<std::mutex> lg(readyFlagMutex); 
    readyFlag = true;
}

void thread2() {
    // wait until readyFlag is true (thread1 is done) 
    {
        std::unique_lock<std::mutex> ul(readyFlagMutex);
        while (!readyFlag) {
            ul.unlock();
            std::this_thread::yield(); // hint to reschedule to the next thread 
            std::this_thread::sleep_for(std::chrono::milliseconds(100)); 
            ul.lock();
        }
    } // release lock

    // do whatever shall happen after thread1 has prepared things
    ...
}

using atomic:

#include <atomic> // for atomic types ...

std::atomic<bool> readyFlag(false);

void thread1() {
    // do something thread2 needs as preparation ...
    readyFlag.store(true);
}

void thread2() {
    // wait until readyFlag is true (thread1 is done) 
    while (!readyFlag.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }

    // do whatever shall happen after thread1 has prepared things
    ...
}

(ii) operations

long data;
std::atomic<bool> readyFlag(false);

void provider() {
    //after reading a char
    std::cout << "<return>" << std::endl;
    std::cin.get();
    
    //provide some data
    data = 42;
    //and signal readiness
    readyFlag.store(true);
}

void consumer() {
    //wait for readiness and do sth else
    while (!readyFlag.load()) {
        std::cout.put('.').flush();
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
    }
    
    //and process provided data
    std::cout << "\nvalue : " << data << std::endl;
}

Because the setting of data happens before provider() stores true in readyFlag, and the processing of data happens after consumer() has loaded true from readyFlag, the processing of data is guaranteed to happen after the data was provided.

This guarantee exists because all atomic operations use by default a memory order named memory_order_seq_cst (sequentially consistent memory order).
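To request an ordering explicitly, the same pattern can pass a memory_order argument to store() and load(); release/acquire is sufficient for this data hand-off (a sketch based on the example above):

//provider:
data = 42;
readyFlag.store(true, std::memory_order_release);    //all writes above become visible

//consumer:
while (!readyFlag.load(std::memory_order_acquire)) { //synchronizes with the store
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
}
std::cout << "\nvalue : " << data << std::endl;      //guaranteed to print 42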

std::atomic_flag

std::atomic_flag is a really simple Boolean flag, and operations on this type are required to be lock-free (it is the only type with this guarantee); unlike std::atomic<bool>, std::atomic_flag does not provide load or store operations.

Once we have a simple lock-free Boolean flag, we can use it to implement a simple lock and thus implement all the other atomic types.

std::atomic_flag is "really simple" on the following aspects:

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic_flag lock = ATOMIC_FLAG_INIT;
 
void f(int n) {
    for (int cnt = 0; cnt < 100; ++cnt) {
        while (lock.test_and_set(std::memory_order_acquire))  // acquire lock
             ; // spin
        std::cout << "Output from thread " << n << '\n';
        lock.clear(std::memory_order_release);               // release lock
    }
}
 
int main() {
    std::vector<std::thread> v;
    for (int n = 0; n < 10; ++n) {
        v.emplace_back(f, n);
    }
    for (auto& t : v) {
        t.join();
    }
}

other atomic types

The remaining atomic types are all accessed through specializations of the std::atomic<> class template; they are a bit more full-featured but may not be lock-free. On most popular platforms, the atomic variants of all the built-in types (e.g., std::atomic<int> and std::atomic<void*>) are expected to be lock-free, but this is not required.

The standard atomic types are not copyable or copy-assignable, and thus have no copy constructors or copy assignment operators. They do, however, support assignment from and implicit conversion to the corresponding built-in types, as well as direct load()/store(), exchange(), compare_exchange_weak(), and compare_exchange_strong(). They also support the compound assignment operators (+=, -=, *=, |=, etc.) where appropriate.

Unlike most assignment operators, the assignment operators for atomic types do not return a reference to their left-hand arguments. They return a copy of the stored value instead.
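A small sketch of this difference:

std::atomic<int> ai(0);
int x = (ai = 10);  //assignment returns the stored value (10), not a reference to ai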

std::atomic<bool>
std::atomic<bool> is the most basic of the atomic integral types.

storing a new value (or not) depending on the current value:
This operation is called compare/exchange: it compares the value of the atomic variable with a supplied expected value and stores the supplied desired value if they are equal; if they are not equal, the expected value is updated with the actual value of the atomic variable. The return type is bool: true if the store was performed, false otherwise.

std::atomic<T*>
The interface of std::atomic<T*> is essentially the same as that of std::atomic<bool>. The new operations are the pointer-arithmetic ones: fetch_add() and fetch_sub(), plus += and -= and ++/--, all of which move the pointer in units of sizeof(T).
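For example, a sketch of atomic pointer arithmetic:

int arr[5] = {0, 1, 2, 3, 4};
std::atomic<int*> p(arr);

int* old = p.fetch_add(2);  //atomically advance by 2 elements; returns the old value
//old == arr, p.load() == arr + 2
p -= 1;                     //p.load() == arr + 1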

std::atomic<T>
The remaining basic atomic types are essentially all the same: they are all atomic integral types with the same interface as each other, except that the associated built-in type is different.

As well as the usual set of operations (load(), store(), exchange(), compare_exchange_weak(), and compare_exchange_strong()), the atomic integral types such as std::atomic<int> and std::atomic<unsigned long long> have quite a comprehensive set of operations available: fetch_add(), fetch_sub(), fetch_and(), fetch_or(), fetch_xor(), the compound-assignment forms of these operations (+=, -=, &=, |=, and ^=), and ++/--. Only division, multiplication, and the shift operators are missing. Because atomic integral values are typically used either as counters or as bitmasks, this isn't a particularly noticeable loss; additional operations can easily be built with compare_exchange_weak() in a loop, if required, as sketched below.
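For instance, an atomic multiply is not provided, but it can be built with compare_exchange_weak() in a loop (a minimal sketch; atomic_multiply is a made-up helper name):

#include <atomic>

void atomic_multiply(std::atomic<int>& a, int factor) {
    int old = a.load();
    //retry until no other thread modified a between our load and our store
    while (!a.compare_exchange_weak(old, old * factor)) {
        //on failure, old was updated to a's current value; just loop
    }
}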

the low-level interface of atomics

The low-level interface of atomics means using the atomic operations without the guarantee of sequential consistency. Thus, compilers and hardware might (partially) reorder accesses to atomics.

std::memory_order

std::memory_order specifies how regular, non-atomic memory accesses are to be ordered around an atomic operation. Absent any constraints on a multi-core system, when multiple threads simultaneously read and write several variables, one thread can observe the values change in an order different from the order in which another thread wrote them. Indeed, the apparent order of changes can even differ among multiple reader threads. Similar effects can occur even on uniprocessor systems, due to compiler transformations allowed by the memory model.

The default behavior of all atomic operations in the library provides sequentially consistent ordering. This default can hurt performance, but the library's atomic operations can be given an additional std::memory_order argument to specify the exact constraints, beyond atomicity, that the compiler and processor must enforce for that operation.

relaxed ordering

 //x and y initially zero
 
 // Thread 1:
 r1 = y.load(memory_order_relaxed); // A
 x.store(r1, memory_order_relaxed); // B
 // Thread 2:
 r2 = x.load(memory_order_relaxed); // C
 y.store(42, memory_order_relaxed); // D

is allowed to produce r1 == r2 == 42 (i.e., the order D-A-B-C) because, although A is sequenced-before B within thread 1 and C is sequenced-before D within thread 2, nothing prevents D from appearing before A in the modification order of y, or B from appearing before C in the modification order of x. The side effect of D on y could be visible to the load A in thread 1, while the side effect of B on x could be visible to the load C in thread 2.
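Relaxed ordering is nevertheless useful when only atomicity matters and no ordering is required, e.g., for simple event counters (a minimal sketch):

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<int> counter(0);

int main() {
    std::vector<std::thread> v;
    for (int i = 0; i < 4; ++i) {
        v.emplace_back([]{
            for (int j = 0; j < 1000; ++j)
                counter.fetch_add(1, std::memory_order_relaxed); //atomic, but unordered
        });
    }
    for (auto& t : v) t.join();
    std::cout << counter.load() << '\n';    //always 4000
}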

release-acquire ordering
If an atomic store in thread A is tagged memory_order_release and an atomic load in thread B from the same variable is tagged memory_order_acquire, all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A become visible side effects in thread B. That is, once the atomic load has completed, thread B is guaranteed to see everything thread A wrote to memory before the store.
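A classic sketch of this guarantee, with a producer publishing data through a release store and a consumer synchronizing with an acquire load:

#include <atomic>
#include <cassert>
#include <string>
#include <thread>

std::string data;
std::atomic<bool> ready(false);

void producer() {
    data = "hello";                                 //A: non-atomic write
    ready.store(true, std::memory_order_release);   //B: release store
}

void consumer() {
    while (!ready.load(std::memory_order_acquire))  //C: acquire load
        ;                                           //spin until the store is visible
    assert(data == "hello");    //guaranteed: A is visible once C sees true
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
}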