Configure Celery for reliable delivery

  • Backend development
  • Data engineering
Wed Aug 24 2022

Celery is probably the most used task runner in the Python ecosystem. Yet, I found out today that configuring it to support a basic scenario, for such a critical piece of a software architecture, is surprisingly difficult and poorly documented.

So, what's the problem? Let's say we have a task taking several minutes to run. Imagine now that we have a server failure, completely independent of our source code: a server restart, a power failure, etc. The task was therefore stopped in the middle of its execution and never finished. Fine! Things like that happen often with computers 😏 All we want now is for Celery to execute the task again after the server restarts!

Well, actually, by default: it won't 🥹 This is an intended default of Celery. By the way, this is one of the main pain points raised by the author of Dramatiq, a competing library.

Fortunately, there is a way to do it! And this blog post will save you hours of research 🙃

TL;DR

Here is the configuration you need to apply for this to work:
from celery import Celery

app = Celery(
    "tasks",
    broker="YOUR_BROKER",
    backend="YOUR_BACKEND",
)
app.conf.update(
    # Acknowledge tasks after they complete, not when they are received
    task_acks_late=True,
    # Re-queue tasks whose worker crashed or was shut down mid-execution
    task_reject_on_worker_lost=True,
    # Persist worker state to disk so it survives a restart
    worker_state_db="./celery-state.db",
)
Don't go now! I highly recommend you read the rationale below 👇

Rationale

So, what do those options do? Basically, we need to tell Celery:
  • To acknowledge tasks after their completion, not right when they are received (task_acks_late).
  • To reject (and thus re-queue) a task if its worker is lost or shut down (task_reject_on_worker_lost).
  • By default, task rejections are persisted only in memory and thus lost if the server is stopped, so we need to set up a state file (worker_state_db); a minimal setup to try this out is sketched below.
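To make this concrete, here is a minimal sketch of a complete module you can use to reproduce the crash-and-restart scenario. The Redis URLs, the tasks.py module name, and the long_task function are assumptions made for the example; adapt them to your own broker and project.
# tasks.py: a minimal sketch to reproduce the crash-and-restart scenario.
# Assumes a local Redis instance as both broker and backend; long_task is a
# made-up task simulating a job that runs for several minutes.
import time

from celery import Celery

app = Celery(
    "tasks",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)
app.conf.update(
    task_acks_late=True,
    task_reject_on_worker_lost=True,
    worker_state_db="./celery-state.db",
)

@app.task
def long_task(duration: int = 300) -> str:
    # Simulate a long-running job
    time.sleep(duration)
    return "done"
Start a worker with celery -A tasks worker, queue the task with long_task.delay(), and kill the worker process while the task is running before restarting it. The task should eventually be re-delivered, although maybe not as soon as you'd expect, as we'll see below.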

The visibility timeout quirk

Still, if you manually kill the worker while a long task is running and then restart it, you'll see that the task looks stuck and is never re-executed.

Actually, this is normal behavior. Why? Because tasks are not re-delivered until they reach the visibility timeout. And by default, this timeout is 1 hour: your pending task will only be executed again in 1 hour. If you set this value to 30 seconds, you'll see that your task gets executed again within 30 seconds.

It might be tempting to keep a very low visibility timeout, but bear in mind that it applies to every task, not only the crashed ones. If you have a task that takes longer than the visibility timeout to execute, it'll be automatically rescheduled on another worker even though nothing failed. That's why it's important to keep this value quite high, especially if we have long tasks to run.
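With a Redis broker, the visibility timeout is set through the broker transport options. Here is a minimal sketch; the 43200-second value (12 hours) is only an example I picked, choose something comfortably longer than your longest task:
app.conf.broker_transport_options = {
    # Value in seconds: 43200 = 12 hours, an arbitrary example value.
    # Use something longer than the duration of your longest task.
    "visibility_timeout": 43200,
}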

References

  • https://docs.celeryq.dev/en/stable/faq.html#faq-acks-late-vs-retry
  • https://docs.celeryq.dev/en/stable/userguide/configuration.html#task-acks-late
  • https://docs.celeryq.dev/en/stable/userguide/configuration.html#task-reject-on-worker-lost
  • https://docs.celeryq.dev/en/stable/userguide/workers.html#persistent-revokes
  • https://docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/redis.html#id1