Understanding the Resque internals – Resque::DirtyExit unveiled

If your Resque jobs fail because of a mysterious Resque::DirtyExit a quick tour of the Resque internals will help you fix the issue and be back up and running in a matter of minutes.

The Problem

TL;DR: Reduce the number of workers. Make smaller jobs.

Recently I had to import some data from a MySQL database into a Redis store for further processing. The MySQL database contains time-series data and the natural approach to run the import was to split the time window into smaller chunks and each chunk is handled by one Resque job.

Each job processes 1 hour of data and there are 12 workers picking them. After 1 day I opened the Resque console and I saw that more than 50% of my jobs failed:

[1] pry(main)> Resque.info
=> {:pending=>292, :processed=>619, :queues=>1, :workers=>12, :working=>12, :failed=>344, :servers=>["redis://127.0.0.1:6379/0"], :environment=>"staging8"}

1 2	[1] pry(main)> Resque.info => {:pending=>292, :processed=>619, :queues=>1, :workers=>12, :working=>12, :failed=>344, :servers=>["redis://127.0.0.1:6379/0"], :environment=>"staging8"}

and this is the error:

[3] pry(main)> Resque::Failure.all(1,1)
=> {"failed_at"=>"2015/03/28 10:24:18 UTC",
 "payload"=>{"class"=>"Jobs::ImportTimeRange", "args"=>[{"time_range"=>"2015-02-22T01:00:00+00:00..2015-02-22T02:00:00+00:00"}]},
 "exception"=>"Resque::DirtyExit",
 "error"=>"pid 18535 SIGKILL (signal 9)",
 "backtrace"=>[],
 "worker"=>"staging8:12851:*",
 "queue"=>"my_import"}

[3] pry(main)> Resque::Failure.all(1,1)

=> {"failed_at"=>"2015/03/28 10:24:18 UTC",

"payload"=>{"class"=>"Jobs::ImportTimeRange", "args"=>[{"time_range"=>"2015-02-22T01:00:00+00:00..2015-02-22T02:00:00+00:00"}]},

"exception"=>"Resque::DirtyExit",

"error"=>"pid 18535 SIGKILL (signal 9)",

"backtrace"=>[],

"worker"=>"staging8:12851:*",

"queue"=>"my_import"}

A puzzling piece was that my logger was not able to catch the error even if the entire job was wrapped in a begin/rescue block, like this:

module Jobs
  class ImportTimeRange
    @queue = :my_import
    
    class << self       
      def perform(opts)
        @time_range = opts.fetch('time_range')
        logger.info("[Starting] ImportTimeRange #{opts.to_s}")           
        # Do the job
        logger.info("[Completed] ImportTimeRange #{opts.to_s}")         
        
        rescue Exception => e
          logger.error("[Failed] ImportTimeRange #{opts.to_s}: #{e.to_s}")
          raise e
      end
      
      private
      
      def logger
        @logger ||= # give me a logger
      end
    end
  end
end

module Jobs

class ImportTimeRange

@queue = :my_import

class << self

def perform(opts)

@time_range = opts.fetch('time_range')

logger.info("[Starting] ImportTimeRange #{opts.to_s}")

# Do the job

logger.info("[Completed] ImportTimeRange #{opts.to_s}")

rescue Exception => e

logger.error("[Failed] ImportTimeRange #{opts.to_s}: #{e.to_s}")

raise e

end

private

def logger

@logger ||= # give me a logger

end

The Resque internals

To understand what was going on I decided to open the Resque code. In the Resque::ChildProcess class you can see the following:

# @param job [Resque::Job]
# @return [void]
# @yieldparam (see Resque::Worker#perform)
# @yieldreturn (see Resque::Worker#perform)
def fork_and_perform(job, &block)
  @pid = fork(job) do
    child_job(job, &block)
  end

  if @pid
    wait
    job.fail(DirtyExit.new($?.to_s)) if $?.signaled?
  else
    worker.perform(job, &block)
  end
end

# @param job [Resque::Job]

# @return [void]

# @yieldparam (see Resque::Worker#perform)

# @yieldreturn (see Resque::Worker#perform)

def fork_and_perform(job, &block)

@pid = fork(job) do

child_job(job, &block)

end

if @pid

wait

job.fail(DirtyExit.new($?.to_s)) if $?.signaled?

else

worker.perform(job, &block)

end

When the job is picked up a new process is created with fork. The child process goes ahead and executes the job. The parent simply waits for the child process to finish.

This is where the DirtyExit error comes into play. First of all let’s have a look in detail at $?.signaled?. From the Ruby core documentation we have:

# Status of last executed child process.
# This variable is thread local and read-only.
$? = Process::Status.new #value is unknown, used for indexing.

# Status of last executed child process.

# This variable is thread local and read-only.

$? = Process::Status.new #value is unknown, used for indexing.

# stat.signaled?   -> true or false
#  
# Returns +true+ if _stat_ terminated because of
# an uncaught signal.
def signaled?()
  #This is a stub, used for indexing
end

# stat.signaled? -> true or false

# Returns +true+ if _stat_ terminated because of

# an uncaught signal.

def signaled?()

#This is a stub, used for indexing

end

The child process in charge of processing the job was killed by someone who is not Resque itself. Maybe the Linux Kernel? Maybe a bad sysadmin? I don’t know.

The Solution

At this point it was clear that:
* Resque was simply noticing that my Job processes were being killed and was kind enough to write it for me on Redis (the Resque backend).
* My job processes were being killed with SIGKILL. With such signal the receiving process cannot perform any clean-up before quitting, not even the error logging of my initial example.

Pretty sad eh?

At this point my only suspect left was resources, and maybe the operating system killing my workers because of that. I decided to go ahead and validate this assumption by making smaller things (citing S. Metz).

Here is what I did:
* Reduce the import window from 1 hour to 15 minutes.
* Reduce the workers from 12 to 4.

The result is surprisingly good. I went from a 50% failure rate, to zero:

[2] pry(main)> Resque.info
=> {:pending=>3352, :processed=>196, :queues=>1, :workers=>4, :working=>4, :failed=>0, :servers=>["redis://127.0.0.1:6379/0"], :environment=>"staging8"}

1 2	[2] pry(main)> Resque.info => {:pending=>3352, :processed=>196, :queues=>1, :workers=>4, :working=>4, :failed=>0, :servers=>["redis://127.0.0.1:6379/0"], :environment=>"staging8"}

Conclusion

Your jobs may be failing because of (i) software errors or (ii) infrastructure problems. We usually focus on the former, but never forget about checking that your hardware is doing fine.

Also, opening the Resque gem helped me find the information I needed pretty quickly. I’ll do it more often, it is also a chance to learn something new. The quick solution of googling for the error gave me no extra insights in this case.

Looking forward to hearing from you in the comments and if you like the blog please consider adding it to your feed or follow me on Twitter.

2 thoughts on “Understanding the Resque internals – Resque::DirtyExit unveiled”

Michael J. Cohen says:

March 31, 2015 at 4:28 pm

Usually processes getting SIGKILL is an indication of the linux kernel OOM killer reaping your process because you ran out of memory.

You should check the output of dmesg.

1. mottalrd says:
  
  March 31, 2015 at 5:25 pm
  
  Thank you Micheal, you were right, I can see a bunch of
  “Out of memory: Kill process 23774 (ruby) score 129 or sacrifice child”

The Problem

The Resque internals

The Solution

Conclusion

2 thoughts on “Understanding the Resque internals – Resque::DirtyExit unveiled”

Leave a Reply to mottalrd Cancel reply