Discussion:
Question regarding pthread_cond_wait/pthread_cond_signal latencies
Pedro Gonnet
2011-05-20 10:08:36 UTC
Hi guys,

I'm currently working on a shared-memory parallel Molecular Dynamics
simulation library (http://mdcore.sourceforge.net/) geared towards
multi-core systems.

The library uses pthreads (plus some OpenMP for some simple loops) and
uses pthread_cond_wait and pthread_cond_signal to coordinate a group of
worker threads.

I've been profiling the library on different machines and kernels and
have noticed that in many cases there are significant (several ms,
measured with Intel's VTune) lags between calls to
pthread_cond_signal and the waiting thread actually getting back to
work.

I've tried the Ubuntu -rt and -preempt kernels, and the whole simulation
runs twice as slowly, despite following the advice given here:

https://rt.wiki.kernel.org/index.php/HOWTO:_Build_an_RT-application

My question is the following: which kernel (or set of configuration
options) will minimize these latencies? And if linux-rt is the answer,
in what ways do I have to be careful when porting the simulation for
this kernel?

Cheers and thanks,
Pedro



Armin Steinhoff
2011-05-22 14:53:17 UTC
Post by Pedro Gonnet
I've been profiling the library on different machines and kernels and
have noticed that in many cases there are significant (several ms,
measured with Intel's VTune) lags between calls to
pthread_cond_signal and the waiting thread actually getting back to
work.
At what priority are the worker threads running?

In order for these threads to be handled by the kernel's real-time
scheduling class (rt_sched_class) rather than by CFS, they should run
at a priority of at least 20.

Regards

--Armin
Armin Steinhoff
2011-05-22 15:18:33 UTC
Post by Armin Steinhoff
At what priority are the worker threads running?
In order for these threads to be handled by the kernel's real-time
scheduling class (rt_sched_class) rather than by CFS, they should run
at a priority of at least 20.
... and you have to use SCHED_RR or SCHED_FIFO as the scheduling policy.
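
A minimal sketch (hypothetical priority value; needs root or
CAP_SYS_NICE) of moving the calling thread into SCHED_FIFO:

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Put the calling thread into SCHED_FIFO at the given priority.
 * Illustrative only; requires CAP_SYS_NICE or root. */
static void make_rt(int prio)
{
        struct sched_param sp;

        memset(&sp, 0, sizeof(sp));
        sp.sched_priority = prio;   /* 1..99 for SCHED_FIFO/SCHED_RR */

        int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
        if (err != 0)
                fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
}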

--Armin
Mark Hounschell
2011-05-21 20:48:20 UTC
Post by Pedro Gonnet
I've been profiling the library on different machines and kernels and
have noticed that in many cases there are significant (several ms,
measured with Intel's VTune) lags between calls to
pthread_cond_signal and the waiting thread actually getting back to
work.
Are you saying several ms latency from pthread_cond_signal to waking up a
thread in pthread_cond_wait?

Mark

Pedro Gonnet
2011-05-21 20:51:42 UTC
Post by Mark Hounschell
Are you saying several ms latency from pthread_cond_signal to waking up a
thread in pthread_cond_wait?
Yes. Or at least, this is what VTune says. It could also be an artifact
of VTune itself, but I would still be interested in knowing which kernel
or which kernel options can make these operations as fast as possible.

Cheers, Pedro





Rolando Martins
2011-05-21 21:40:27 UTC
Hi guys,
I am building a real-time middleware and have also struggled with
similar behavior; at this point I have reduced the use of condition
variables to the bare minimum.
(using 33-rt)

Rolando
Peter W. Morreale
2011-05-22 00:44:19 UTC
Post by Pedro Gonnet
Yes. Or at least, this is what VTune says. It could also be an artifact
of VTune itself, but I would still be interested in knowing which kernel
or which kernel options can make these operations as fast as possible.
Do you use any pthread* primitives involving scheduling?

How do you start your process? How many threads? What else is on the
machine?

IOW, can you describe your environment more completely?


Best
-PWM
Pedro Gonnet
2011-05-22 11:34:22 UTC
Post by Peter W. Morreale
Do you use any pthread* primitives involving scheduling?
I'm not quite sure what you mean by scheduling functions... I only use
the basic pthread_mutex_* and pthread_cond_* functions.
Post by Peter W. Morreale
How do you start your process? How many threads? What else is on the
machine?
The main thread starts several threads with pthread_create. I have a
barrier which uses pthread_mutexes and pthread_conds to synchronize the
threads. This is where the delays happen.
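
For reference, the barrier is essentially the usual counting pattern
built on one mutex and one condition variable (a simplified sketch of
the general shape, not the actual mdcore code):

#include <pthread.h>

struct barrier {
        pthread_mutex_t mutex;
        pthread_cond_t cond;
        int count;      /* threads still expected at the barrier */
        int nthreads;   /* total number of worker threads */
        int epoch;      /* generation counter, guards spurious wakeups */
};

void barrier_wait(struct barrier *b)
{
        pthread_mutex_lock(&b->mutex);
        int my_epoch = b->epoch;
        if (--b->count == 0) {
                /* last thread in: reset the count and release everybody */
                b->count = b->nthreads;
                b->epoch++;
                pthread_cond_broadcast(&b->cond);
        } else {
                /* loop until the epoch changes, to survive spurious wakeups */
                while (my_epoch == b->epoch)
                        pthread_cond_wait(&b->cond, &b->mutex);
        }
        pthread_mutex_unlock(&b->mutex);
}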

I observed these latencies both on my own laptop (loads of stuff running
in the background) and on multi-core servers on which I was alone.

I should probably note that I also use OpenMP for some simple
parallelization. E.g., after releasing the threads and waiting for
them all to return to the barrier, some things are computed with OpenMP
(OMP_WAIT_POLICY=PASSIVE).

The kernels on which I have seen this are the Ubuntu -generic kernels
2.6.31 through 2.6.35. I have also tried running the simulations on an
Ubuntu 2.6.31-11-rt kernel. This, however, caused the whole simulation
to run twice as slowly, even when using a single thread (on a 6-core
machine).

Please do let me know if you need any more specific information!

Cheers, Pedro



Peter W. Morreale
2011-05-22 13:51:28 UTC
Post by Pedro Gonnet
Post by Peter W. Morreale
Do you use any pthread* primitives involving scheduling?
I'm not quite sure what you mean by scheduling functions... I only use
the basic pthread_mutex_* and pthread_cond_* functions.
So you are not defining scheduling parameters with calls like
pthread_attr_setschedpolicy() and pthread_attr_setschedparam().

All this means is that your threads are inheriting their scheduling
attributes from the main thread.

You might use the above calls if you had differing priorities between
your threads and wanted to enforce particular scheduling policies.
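
For example, a minimal sketch of setting these at thread-creation time
(hypothetical priority; note that PTHREAD_EXPLICIT_SCHED is required,
otherwise the attribute settings are ignored and the thread simply
inherits):

#include <pthread.h>
#include <sched.h>

extern void *worker(void *arg);   /* your worker function */

int start_rt_worker(pthread_t *tid)
{
        pthread_attr_t attr;
        struct sched_param sp = { .sched_priority = 20 };
        int err;

        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
        pthread_attr_setschedparam(&attr, &sp);

        err = pthread_create(tid, &attr, worker, NULL);
        pthread_attr_destroy(&attr);
        return err;
}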
Post by Pedro Gonnet
Post by Peter W. Morreale
How do you start your process? How many threads? What else is on the
machine?
The main thread starts several threads with pthread_create. I have a
barrier which uses pthread_mutexes and pthread_conds to synchronize the
threads. This is where the delays happen.
Try starting your application like this:

% chrt -f 20 <name-of-app>

This starts your application in the SCHED_FIFO class with a priority of
20 and all your threads will inherit this class and priority.

You can choose any priority you like; however, if you are dependent upon
external (to your app) daemons and/or kernel tasks (think networking,
for example) and choose a priority higher than theirs, you can
potentially hang your system. The default priority is (IIRC) 50, so
choosing any value lower than that will be safe.

Note that choosing a priority of 25 over 20 makes no difference unless
there are other threads you are competing with. Doesn't sound like it
from your description, so whether you choose a priority of 1 or 49 will
not make a difference for your app. Just get it in SCHED_FIFO.

Currently you are running in SCHED_OTHER, which has a timeslice
associated with it. This means your tasks will give up the CPU
periodically.
Post by Pedro Gonnet
I observed these latencies both on my own laptop (loads of stuff running
in the background) and on multi-core servers on which I was alone.
I should probably note that I also use OpenMP for some simple
parallelization as well. Eg. after releasing the threads and waiting for
them all to return to the barrier, some things are computed with OpenMP
(OMP_WAIT_POLICY=PASSIVE).
Hmmm, I'm not completely familiar with OpenMP. Are there OpenMP daemons
that your threads contact for data exchange? If so, ensure you modify
their startup scripts to start those daemons in SCHED_FIFO at a similar
priority, just like above. If not, no worries; OpenMP probably has no
effect.

The next steps would be to partition the CPUs of your multi-core machine
into sets of CPUs. The idea here is to move (almost) all system tasks
to a root set of CPUs, and have a set of CPUs dedicated for your
threads.

This is easier than it sounds if you use the cset utility. I'm unclear
whether it is available via Ubuntu distribution channels, but you can
get a copy of this python script from the RT wiki:

https://rt.wiki.kernel.org/index.php/Cpuset_management_utility

Read through that page. To create a set of shielded CPUs, and migrate
existing tasks to the root set, do something like this:

% cset shield --cpu 1-3 --kthread on

(assuming a 4-way box)

The above creates two CPU sets, CPU 0 and CPUs 1-3. In addition, the
cset command will migrate virtually all currently running tasks to CPU 0.
The caveat is that tasks that already have a CPU affinity set are not
migrated by cset. Likely none of those will hurt your performance too
much...

To start your application (in SCHED_FIFO as above) within the shielded
set:

% cset shield --exec chrt -f 20 <name-of-your-app>

Now all of the threads within your app will only run on CPUs 1-3, and
(virtually) all system tasks will run on CPU 0.

Bear in mind that the shielding created by cset is not persistent: if
you reboot, you have to re-create the shielding.
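
If you would rather pin threads from inside the application instead of
(or in addition to) using cset, the GNU affinity call does the
equivalent for a single thread -- a sketch, assuming the same CPUs 1-3
as above:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Restrict the calling thread to CPUs 1-3 (GNU extension). */
static int pin_to_shielded_cpus(void)
{
        cpu_set_t set;
        int cpu;

        CPU_ZERO(&set);
        for (cpu = 1; cpu <= 3; cpu++)
                CPU_SET(cpu, &set);

        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}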

This is only the tip of what you can do to tune the system for your
application. The basic idea here is to start thinking about the system
as a whole, and tune the system as well as your app for best
performance. Think in terms of:

1 - running your application in the RT sched class - SCHED_FIFO

2 - Partition your multi-core machine to get dedicated CPUs for your
app.

I'd be surprised if you do not see a significant improvement in
latencies.

Even if your box has only two cores, you may see an improvement using
cset. Whether or not you do comes down to that age-old computing
adage:

"Try it."

Best,
-PWM
Robert Schwebel
2011-05-22 18:37:23 UTC
Pedro,
Post by Pedro Gonnet
The library uses pthreads (plus some OpenMP for some simple loops) and
uses pthread_cond_wait and pthread_cond_signal to coordinate a group
of worker threads.
Are you aware of this:

http://sourceware.org/bugzilla/show_bug.cgi?id=11588
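
(For context: that report concerns pthread condition variables not
being priority-inheritance aware. Making the user-side mutex PI-aware
is a related, though only partial, mitigation, since it does not fix
the condvar's internal lock -- a minimal sketch, assuming glibc exposes
PTHREAD_PRIO_INHERIT:)

#include <pthread.h>

/* Initialize a mutex with the priority-inheritance protocol.
 * Illustrative only; check _POSIX_THREAD_PRIO_INHERIT on your libc. */
static int init_pi_mutex(pthread_mutex_t *m)
{
        pthread_mutexattr_t attr;
        int err;

        pthread_mutexattr_init(&attr);
        err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        if (err == 0)
                err = pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);
        return err;
}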

rsc
--
Pengutronix e.K. | |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |