Slowdown and memory leak in ConcurrentLinkedQueue

by Dreis · 9. October 2015

Yesterday, I stumpled upon a rather interesting memory leak in Java’s ConcurrentLinkedQueue, I’d like to share with you.

What is this all about

The ConcurrentLinkedQueue leaks memory and slows down when removing the last element of a non-empty queue.

Reproducing the problem

So to reproduce the leak, we must constantly add and try to remove the last item of an initially filled queue. The following code is showing the problem quite well as it prints the duration it takes for every 10.000 iterations of adding and removing an object to the queue.

import java.util.concurrent.ConcurrentLinkedQueue;
 
public class TestLeak
{
    public static void main(String[] args)
    {
        ConcurrentLinkedQueue<Object> queue = new ConcurrentLinkedQueue();
        queue.add(new Object()); //Required for the leak to appear.
        Object object = new Object();
 
        int loops = 0;
        long last=System.currentTimeMillis();
        while(true)
        {
            if(loops%10000==0)
            {
                long now = System.currentTimeMillis();
                long duration = now - last;
                last=now;
                System.err.printf("loops=%d duration=%d%n",loops, duration);
            }
            queue.add(object);
            queue.remove(object);
            ++loops;
        }
    }
}

When running the code snippet it shows that after only 60.000 iterations it already takes one second to add and remove the object 10.000 times.

Total iterations	Duration (in ms)
0	0
10000	125
20000	289
30000	478
40000	666
50000	849
60000	1056

Okay, that’s the slowdown. Up next is the memory leak we’re going to prove. To show you this, I did a 30 minutes flight recording via the Oracle Java Mission Control. There is actually a good tutorial by Oracle how to do that, so I won’t go into detail on this one. When doing this flight recording, you can enable some heap (graphic below) statistics that will show you how much space the Old Generation is taking inside the heap at the start and the end of the recording. The old generation is used to store long surviving objects. So the old generation is also the place, where leaks manifest most of the time when the (expensive and stop-the-world) Garbage-Collector isn’t able to clean things up in this space.

When looking at the first and last garbage collection inside the old generation of my flight recording, I got the following results, which clearly indicate a memory-leak:

	Start	End
Heap-Usage after GC	1,54 MB	28,60 MB

Explanation of the problem

To explain the problem I want to start of with a little reminder on the basics of the ConcurrentLinkedQueue.

Taken from the official docs

An unbounded thread-safe queue based on linked nodes. This queue orders elements FIFO (first-in-first-out). The head of the queue is that element that has been on the queue the longest time. The tail of the queue is that element that has been on the queue the shortest time. New elements are inserted at the tail of the queue, and the queue retrieval operations obtain elements at the head of the queue…

What it essentially is

The queue is a list of nodes. Each node holds its object plus a reference to the next node, that can be accessed via .next(). If it’s the last node, this will be null though. The queue itself holds a reference to the first node that can be accessed via first().

As usual with the official docs, the real meat is a bit hidden between the lines. If you want to remove an object from the queue the links between the nodes need to be adjusted, of course. So when removing a node, it needs to start from the head to the tail for this kind of operation as you see below in the official code.

public boolean remove(Object o) {
    if (o == null) return false;
    Node pred = null;
    for (Node p = first(); p != null; p = succ(p)) {
        E item = p.item;
        if (item != null
            && o.equals(item)
            && p.casItem(item, null)) { // p.casItem() will try to null the item
            Node next = succ(p); //Get next-pointer of current node
            if (pred != null && next != null)
                pred.casNext(p, next); // Try to delink the next-pointer of the previous node
            return true;
        }
        pred = p;
    }
    return false;
}

I commented the code a bit to hopefully make it more clear what actually happens. The remove code nulls the item value via p.casItem(item, null) and then should attempt to delink the predecessor’s next-pointer to bring things back to order. However, if the current node next-pointer is null (which it is for the last item), then the predecessor node’s next-pointer is not nulled. Thus the last node instance is never really removed and the queue grows until eventually there isn’t any memory left. As more and more nodes are appended, each further add+remove operation will have to navigate the linked list until the end, which results in a higher CPU consumption and a slowdown of performance.A fixed version is already committed, but I don’t know in which JVM versions this will be available:

public boolean remove(Object o) {
   if (o != null) {
      Node next, pred = null;
      for (Node p = first(); p != null; pred = p, p = next) {
         boolean removed = false;
         E item = p.item;
 
         if (item != null) {
            if (!o.equals(item)) {
               next = succ(p);
               continue;
            }
            removed = casItem(p, item, null);
         }
 
         next = succ(p);
         if (pred != null && next != null) // unlink
            casNext(pred, p, next);
         if (removed)
            return true;
      }
   }
   return false;
}

Closing thoughts

Long story, short. The reason I’m telling you this is the fact that some of us are using Jetty as their webserver. Jetty is using the ConcurrentLinkedQueue in its QueuedThreadPool that could potentially lead to this leak, but more importantly to the slowdown. In general this thread-pool tries to reuse its threads, but it can happen that a thread is dying and needs to be respawned. So depending on the pools thread death-rate (say this 3 times in a row), you will be more or less affected by the bug above. Their workaround for now is to use their ConcurrentHashSet class. To give you some insights on what this potentially means in terms of performance, here are some numbers for 100 mio. of iterations:

	ConcurrentHashSet	ConcurrentLinkedQueue
Time taken per 100 mio. iterations	5 seconds	Aborted after 30 minutes without completion

Note: A quick load test in our app showed that we’re not affected by this bug as an upgrade to Jetty http://mvnrepository.com/artifact/org.eclipse.jetty/jetty-server/9.3.4.RC1 didn’t had any benefit. Somehow expected because we would have seen this earlier. But to be honest I hoped a bit that we were affected .

Relevant links

https://bugs.openjdk.java.net/browse/JDK-8137184

https://bugs.eclipse.org/bugs/show_bug.cgi?id=477817

https://github.com/eclipse/jetty.project/commit/3c4eb5c4f6432ce508aad00a82c1cf62d3ae035d

Comments2
Pingbacks0

Chupacabras says:

29. May 2016 at 01:30

We monitored system for approx one month and the memory footprint is consistent. Thanks the8472 for guiding me in right direction
RMAU says:

5. June 2016 at 07:17

Our sysadmin has been doing drain just before restarting, but it is not periodic. The only periodic crontab command is a weekly repair of each node, done in a rolling fashion. We looked for correlation with this memory leak problem and found none. Is there something else that would cause this drain-like behavior?

Slowdown and memory leak in ConcurrentLinkedQueue

2 Responses

Random topics

Recent Posts

Blogs I follow

Slowdown and memory leak in ConcurrentLinkedQueue

What is this all about

Reproducing the problem

Explanation of the problem

Taken from the official docs

What it essentially is

Closing thoughts

Relevant links

2 Responses

Random topics

Recent Posts

Tags

Blogs I follow