Setting limits in sub-process #176

Open
@orenbenkiki

Description

Calling threadpool_limits in a sub-process works on some of my servers but fails (hangs) on ones with a specific OS version:

$ hostnamectl
   Static hostname: n86.my.domain
         Icon name: computer-server
           Chassis: server
        Machine ID: 196b497eccff4526a8e34834c95e3de5
           Boot ID: b8318d26cd394a85b706beb2d7324f73
  Operating System: AlmaLinux 8.9 (Midnight Oncilla)
       CPE OS Name: cpe:/o:almalinux:almalinux:8::baseos
            Kernel: Linux 4.18.0-513.18.2.el8_9.x86_64
      Architecture: x86-64

The code is:

import os
import sys
from threadpoolctl import threadpool_limits
from multiprocessing import get_context

def eprintln(text):
    print(text, file=sys.stderr, flush=True)

DID_THREADCTL_FOR_PID = None

def invocation(index: int) -> int:
    global DID_THREADCTL_FOR_PID
    if os.getpid() != DID_THREADCTL_FOR_PID:
        DID_THREADCTL_FOR_PID = os.getpid()
        eprintln(f"PID: {os.getpid()} invocation index: {index} Do threadpool_limits...")
        threadpool_limits(limits=1)
        eprintln(f"PID: {os.getpid()} invocation index: {index} Did threadpool_limits.")
    else:
        eprintln(f"PID: {os.getpid()} invocation index: {index} Old threadpool_limits.")
    return index

invocations = 4
processes = 2
threadpool_limits(limits=processes)

results = [None] * invocations
eprintln(f"PID: {os.getpid()} Do imap...")
with get_context("fork").Pool(2) as pool:
    for index in pool.imap_unordered(invocation, range(invocations)):
        results[index] = index
        eprintln(f"PID: {os.getpid()} - Did imap index: {index}")

eprintln(f"PID: {os.getpid()} Did imap results: {results}")
assert results == list(range(len(results)))

When I run it on the above OS, in Python 3.12.3, threadpoolctl version 3.4.0, I get:

$ python3 bug.py 
PID: 1576849 Do imap...
PID: 1576852 invocation index: 0 Do threadpool_limits...
PID: 1576853 invocation index: 1 Do threadpool_limits...
PID: 1576853 invocation index: 1 Did threadpool_limits.
PID: 1576853 invocation index: 2 Old threadpool_limits.
PID: 1576853 invocation index: 3 Old threadpool_limits.
PID: 1576849 - Did imap index: 1
PID: 1576849 - Did imap index: 2
PID: 1576849 - Did imap index: 3

And the process hangs. Poking around, it seems that libc.dl_iterate_phdr never returns (even though each match_library_callback call does return). I am using a Python 3.12.3 that was compiled from source on this OS, followed by a pip installation of numpy, pandas, scipy, etc.

This same thing works fine in older versions of the OS. E.g., in:

$ hostnamectl
   Static hostname: n97.my.domain
         Icon name: computer-server
           Chassis: server
        Machine ID: 5e543d50691943628e8e20441f502406
           Boot ID: 0d876250c0ec4a149e8bdb12c99c20eb
  Operating System: CentOS Linux 7 (Core)
       CPE OS Name: cpe:/o:centos:centos:7
            Kernel: Linux 3.10.0-1160.15.2.el7.x86_64
      Architecture: x86-64

With Python version 3.12.2, again with threadpoolctl version 3.4.0, I get the expected output:

$ python3 bug.py 
PID: 32872 Do imap...
PID: 32874 invocation index: 0 Do threadpool_limits...
PID: 32875 invocation index: 1 Do threadpool_limits...
PID: 32874 invocation index: 0 Did threadpool_limits.
PID: 32875 invocation index: 1 Did threadpool_limits.
PID: 32874 invocation index: 2 Old threadpool_limits.
PID: 32872 - Did imap index: 0
PID: 32875 invocation index: 3 Old threadpool_limits.
PID: 32872 - Did imap index: 1
PID: 32872 - Did imap index: 2
PID: 32872 - Did imap index: 3
PID: 32872 Did imap results: [0, 1, 2, 3]

Any ideas on what I can do to fix this?

Activity

ogrisel commented on Mar 11, 2025

Contributor

I cannot reproduce on macOS.

Have you tried with other start methods (e.g. "forkserver" or "spawn" instead of "fork")?

I would also be curious to see if you can reproduce with loky.get_reusable_executor() instead of a multiprocessing Pool instance.

ogrisel commented on Mar 11, 2025

BTW, after a fork() in a multi-threaded process, POSIX only guarantees that async-signal-safe functions are safe to call in the child until it calls exec, so it's expected that this can deadlock. I would therefore recommend not using the "fork" start method and using one of the alternatives suggested above instead.
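
For illustration, here is a minimal sketch of the reproducer's pool switched away from "fork". The helper names pick_start_method, limit_worker_threads and run_pool are hypothetical, not part of threadpoolctl or multiprocessing, and the sketch assumes that setting the limit once per worker in a Pool initializer is acceptable in place of the PID check used in the original script.

```python
import multiprocessing as mp


def pick_start_method(preferred=("forkserver", "spawn")):
    """Return the first preferred start method available on this platform."""
    available = mp.get_all_start_methods()
    for method in preferred:
        if method in available:
            return method
    raise RuntimeError(f"none of {preferred} is available (got {available})")


def limit_worker_threads():
    # Runs once in each worker process. threadpoolctl is imported lazily so
    # that each worker loads it (and inspects its own libraries) itself.
    from threadpoolctl import threadpool_limits

    threadpool_limits(limits=1)


def invocation(index):
    return index


def run_pool(invocations=4, processes=2):
    # Hypothetical driver: same shape as the original script, but without
    # relying on fork() to create the workers.
    ctx = mp.get_context(pick_start_method())
    with ctx.Pool(processes, initializer=limit_worker_threads) as pool:
        return sorted(pool.imap_unordered(invocation, range(invocations)))
```

With "spawn" or "forkserver" the script has to be importable, so run_pool() should be called under an `if __name__ == "__main__":` guard.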

ogrisel commented on Mar 11, 2025

If you want to try to debug the root cause, you might want to enable faulthandler in your workers:

import faulthandler

...

def invocation(index: int) -> int:
    faulthandler.dump_traceback_later(10, exit=True)  # 10s should be more than enough
    global DID_THREADCTL_FOR_PID
    if os.getpid() != DID_THREADCTL_FOR_PID:
        DID_THREADCTL_FOR_PID = os.getpid()
        eprintln(f"PID: {os.getpid()} invocation index: {index} Do threadpool_limits...")
        threadpool_limits(limits=1)
        eprintln(f"PID: {os.getpid()} invocation index: {index} Did threadpool_limits.")
    else:
        eprintln(f"PID: {os.getpid()} invocation index: {index} Old threadpool_limits.")
    faulthandler.cancel_dump_traceback_later()
    return index

...
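
One possible variation on the snippet above, as a sketch only: wrapping the watchdog in a context manager makes sure the pending dump is cancelled even if the body raises. The name hang_watchdog and its parameters are hypothetical; faulthandler.dump_traceback_later and faulthandler.cancel_dump_traceback_later are the standard library calls already used above.

```python
import faulthandler
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s=10.0, exit_on_timeout=True):
    """Dump all thread tracebacks to stderr if the body runs longer than timeout_s."""
    faulthandler.dump_traceback_later(timeout_s, exit=exit_on_timeout)
    try:
        yield
    finally:
        # Always cancel, so a worker that finishes (or fails) quickly is not
        # killed later by a stale timer.
        faulthandler.cancel_dump_traceback_later()
```

In the worker this would be used as `with hang_watchdog(10): threadpool_limits(limits=1)`.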

ogrisel commented on Mar 11, 2025

But it's very likely that the deadlock happens in the threadpool management code of one of your native libraries, in which case gdb or similar will be required to dig out where the deadlock happens.
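
Before reaching for gdb, it may help to list which native thread pools are actually loaded in the process, since that narrows down the candidates. A small sketch, assuming threadpoolctl's threadpool_info() and its usual dictionary keys (user_api, internal_api, num_threads, filepath); describe_threadpools and report_threadpools are hypothetical helper names.

```python
def describe_threadpools(info):
    """Format one line per library entry, as returned by threadpool_info()."""
    lines = []
    for lib in info:
        lines.append(
            f"{lib.get('user_api')}/{lib.get('internal_api')} "
            f"threads={lib.get('num_threads')} path={lib.get('filepath')}"
        )
    return lines


def report_threadpools():
    # Imported lazily; call this in the parent before creating the pool.
    from threadpoolctl import threadpool_info

    return describe_threadpools(threadpool_info())
```

Comparing the output on the machine that hangs and on the machine that works might show a library (or library version) that is only present on one of them.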

orenbenkiki commented on Mar 11, 2025

Author

What eventually solved this for me was the realization by one of our team that the call to threadpool_limits isn't thread-safe. It took "a certain kind of mind" to even consider that a function with "thread" in its name isn't thread safe :-) Wrapping it in a global mutex solved the problem. The internal race condition seems to be a hit-or-miss thing depending on the specifics of the OS, versions of libraries, and whether Mercury is in retrograde, but once we added the global mutex wrapper we haven't seen any more crashes.

A fix would be to incorporate such a global mutex at the very start of the function - any reason not to?

          Setting limits in sub-process · Issue #176 · joblib/threadpoolctl