Understanding the Resque internals - Resque::DirtyExit unveiled


If your Resque jobs fail because of a mysterious Resque::DirtyExit, a quick tour of the Resque internals will help you fix the issue and get back up and running in a matter of minutes.

The Problem

TL;DR: Reduce the number of workers. Make smaller jobs.

Recently I had to import some data from a MySQL database into a Redis store for further processing. The MySQL database contains time-series data, so the natural approach was to split the time window into smaller chunks, with each chunk handled by one Resque job.
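A minimal sketch of that setup, assuming a hypothetical TimeSeriesImportJob class (the class name, queue, and dates are illustrative, not the original code). It needs the resque gem and a running Redis:

```ruby
require 'resque'

# Hypothetical job: imports one time window of MySQL rows into Redis.
class TimeSeriesImportJob
  @queue = :import

  def self.perform(chunk_start, chunk_end)
    # read the MySQL rows between chunk_start and chunk_end
    # and write them to Redis (details omitted)
  end
end

window_start = Time.utc(2017, 1, 1)
window_end   = Time.utc(2017, 1, 8)
chunk        = 60 * 60 # one hour of data per job, as described below

(window_start.to_i...window_end.to_i).step(chunk) do |t|
  Resque.enqueue(TimeSeriesImportJob, t, t + chunk)
end
```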

Each job processes 1 hour of data and there are 12 workers picking them up. After a day I opened the Resque console and saw that more than 50% of my jobs had failed, and the error was Resque::DirtyExit.

A puzzling detail was that my logger never caught the error, even though the entire job was wrapped in a begin/rescue block, like this:
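The job and logger names below are placeholders reconstructed for illustration, not the original code; the point is only the shape of the wrapping:

```ruby
require 'logger'

LOGGER = Logger.new('import.log') # placeholder logger

class TimeSeriesImportJob
  @queue = :import

  def self.perform(chunk_start, chunk_end)
    begin
      import(chunk_start, chunk_end)
    rescue => e
      # This rescue never fired for the failing jobs:
      # nothing ever showed up in the log.
      LOGGER.error("Import failed: #{e.class} - #{e.message}")
      raise
    end
  end

  def self.import(chunk_start, chunk_end)
    # the actual MySQL -> Redis work would go here
  end
end
```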

The Resque internals

To understand what was going on I decided to open the Resque code. In the Resque::ChildProcess class you can see the following:
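The exact file and method names vary between Resque versions, so the snippet below is a simplified paraphrase of that logic rather than the verbatim source:

```ruby
# Simplified paraphrase of Resque's fork-and-wait logic, not the actual source.
def work_one_job(job)
  if (child_pid = fork)
    # Parent: wait for the forked child to finish the job.
    Process.waitpid(child_pid)
    # If the child was terminated by a signal rather than exiting normally,
    # record the job as failed with Resque::DirtyExit.
    job.fail(Resque::DirtyExit.new($?.to_s)) if $?.signaled?
  else
    # Child: run the job, then exit immediately.
    job.perform
    exit!
  end
end
```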

When the job is picked up, a new process is created with fork. The child process goes ahead and executes the job, while the parent simply waits for the child process to finish.

This is where the DirtyExit error comes into play. First of all, let's have a closer look at $?.signaled?. From the Ruby core documentation we have:
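Paraphrasing the documentation for Process::Status#signaled? (my wording, not a verbatim quote): it returns true if the process terminated because of an uncaught signal, i.e. the child did not exit on its own.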

The child process in charge of running the job was killed by someone who was not Resque itself. Maybe the Linux kernel? Maybe a bad sysadmin? I didn't know.

The Solution

At this point it was clear that:
* Resque was simply noticing that my job processes were being killed, and was kind enough to record that for me in Redis (the Resque backend).
* My job processes were being killed with SIGKILL. With that signal the receiving process cannot perform any clean-up before quitting, not even the error logging from my earlier example.

Pretty sad eh?
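A tiny standalone script (my own illustration, not from the original post) makes the point: when the child receives SIGKILL, its rescue and ensure blocks never run, and the parent only learns about it through $?:

```ruby
pid = fork do
  begin
    sleep # stand-in for a long-running job
  rescue Exception => e
    puts "rescue reached: #{e.class}" # never printed
  ensure
    puts "ensure reached"             # never printed either
  end
end

sleep 0.1                  # give the child a moment to start
Process.kill("KILL", pid)  # SIGKILL cannot be trapped or rescued
Process.waitpid(pid)

puts $?.signaled? # => true
puts $?.termsig   # => 9 (SIGKILL)
```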

At this point my only remaining suspect was resources: maybe the operating system was killing my workers because the machine was running out of them. I decided to validate this assumption by making smaller things (to borrow from Sandi Metz).

Here is what I did:
* Reduced the import window from 1 hour to 15 minutes (see the sketch after this list).
* Reduced the number of workers from 12 to 4.
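Reusing the (illustrative) enqueue loop from the sketch earlier in the post, the first change is just a smaller chunk size; how the number of workers is set depends on how they are started in your setup, so that change is not shown here:

```ruby
chunk = 15 * 60 # 15 minutes of data per job instead of one hour

(window_start.to_i...window_end.to_i).step(chunk) do |t|
  Resque.enqueue(TimeSeriesImportJob, t, t + chunk)
end
```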

The result was surprisingly good: I went from a failure rate of more than 50% to zero.

Conclusion

Your jobs may be failing because of (i) software errors or (ii) infrastructure problems. We usually focus on the former, but never forget to check that your hardware is doing fine.

Also, opening up the Resque gem helped me find the information I needed pretty quickly. I'll do it more often; it is also a chance to learn something new. Simply googling for the error gave me no extra insight in this case.

Looking forward to hearing from you in the comments. If you like the blog, please consider adding it to your feed or following me on Twitter.



2 thoughts on “Understanding the Resque internals - Resque::DirtyExit unveiled”

  1. Usually, a process getting SIGKILL is an indication of the Linux kernel OOM killer reaping your process because you ran out of memory.

    You should check the output of dmesg.

    1. Thank you Micheal, you were right. I can see a bunch of these in dmesg:
      "Out of memory: Kill process 23774 (ruby) score 129 or sacrifice child"
