Some explanations from me (I am copy/pasting an email I recently sent to a coworker):
To find where the problem comes from, I first need to explain how the notifications feature works.
1. The browser needs to know how many unread notifications there are for the current user. If there are more than 20 notifications, it does not need to know the exact number. Instead, it will display "20+".
2. An AJAX query is sent for this purpose.
3. An SQL query is generated that takes into account all enabled filters (don't show the user's own events, for example) and watched pages. To avoid using too many database resources, we limit this query to the first 40 results.
4. For each result, some checks are performed:
4.1. First of all, if the event concerns a document that the user is not allowed to view, then the event is discarded. There is no way to improve the SQL query to take rights into account, so this check cannot be avoided.
4.2. Post-filters are executed. Like the rights check, these filters check what cannot be expressed in an SQL query.
4.3. The event is then compared to all the events that we have already accepted, in order to group similar notifications into a single "folded" one that we call a CompositeEvent. The idea is to avoid having multiple notifications in the UI that concern the same document, with the same kind of event, but with different dates (like when you click "save & continue" on a document multiple times during a work session).
5. After the results have been checked and grouped into CompositeEvents, we count how many of them we have accepted. If we have fewer than 20 CompositeEvents, we go back to step 3 until we have at least 20 CompositeEvents, or until there are no more events in the database.
As you can see, steps 3-4-5 can be executed many times under bad conditions. It is currently implemented as a recursive algorithm, which could theoretically lead to a stack overflow (see: https://jira.xwiki.org/browse/XWIKI-15927).
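To make steps 3 to 5 more concrete, here is a minimal sketch in Java. It is not the actual XWiki code: `Event`, `EventSource`, `Result` and `countUnread` are hypothetical names, the grouping key (document plus event type) is a simplification of what a CompositeEvent really compares, and I wrote it as a loop rather than a recursion so that the stack does not grow with the number of queries.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

class NotificationCountSketch {
    static final int BATCH_SIZE = 40; // step 3: maximum rows per SQL query
    static final int TARGET = 20;     // step 5: number of composite events we want

    /** Hypothetical, minimal event; the real one carries much more data. */
    static final class Event {
        final String document;
        final String type;
        final long date;

        Event(String document, String type, long date) {
            this.document = document;
            this.type = type;
            this.date = date;
        }
    }

    /** Hypothetical stand-in for the generated SQL query of step 3. */
    interface EventSource {
        List<Event> fetch(int offset, int limit);
    }

    /** How many composite events were accepted, and how many queries it took. */
    static final class Result {
        final int compositeCount;
        final int queries;

        Result(int compositeCount, int queries) {
            this.compositeCount = compositeCount;
            this.queries = queries;
        }
    }

    static Result countUnread(EventSource source, Predicate<Event> canView,
                              List<Predicate<Event>> postFilters) {
        // Step 4.3, simplified: one key per (document, type) pair stands for one
        // composite event. We only count here, so the events are not stored.
        Set<String> composites = new HashSet<>();
        int offset = 0;
        int queries = 0;
        while (composites.size() < TARGET) {            // step 5
            List<Event> batch = source.fetch(offset, BATCH_SIZE); // step 3
            queries++;
            if (batch.isEmpty()) {
                break; // no more events in the database
            }
            offset += batch.size();
            for (Event e : batch) {
                if (!canView.test(e)) {
                    continue; // step 4.1: rights check
                }
                if (!postFilters.stream().allMatch(f -> f.test(e))) {
                    continue; // step 4.2: post-filters
                }
                composites.add(e.document + "|" + e.type); // step 4.3: fold similar events
            }
        }
        return new Result(composites.size(), queries);
    }
}
```

With this sketch, 100 events that all fold into the same composite event cost 4 queries (three non-empty batches, then an empty one) and still yield a count of 1, which is the kind of bad condition described above.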
When I look at your stack traces, I can see a lot of repetition:
So this is exactly what is going on. It means the SQL queries return a lot of events, but almost all of them are filtered out by post-filters or are so similar that they are grouped into a few CompositeEvents.
Some scenarios I can see (in descending order of probability):
A. There are a lot of events in documents that the user is not allowed to see. Adding a filter on the restricted space in the user's profile could solve the issue.
B. There is a bug in a post-filter and we need to identify which one and why.
C. There are a lot of "personal messages" (using the Message Sender Gadget) that are filtered only by post-filters (I don't remember why it cannot be expressed with SQL, but I had a good reason).
D. The same event is stored multiple times in the database, so it continuously fills the same CompositeEvent.
E. There is a bug in the recursion, so the database always returns the same results (but it would mean we have an infinite loop, so it would crash).
I recommend checking for each of these scenarios.
As a mitigation for this problem, we could also add a timeout to the algorithm that looks for events. But it means the user could miss some notifications (we can't be sure that there is no interesting event after 100 queries...).
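Here is a rough sketch of what such a bounded search could look like. Everything in it is an assumption for illustration: `BoundedEventSearch`, `Outcome`, `fetchBatch` and the `maxQueries` parameter are hypothetical names, and for brevity it counts accepted events directly instead of grouping them into CompositeEvents. The `complete` flag is there so that the caller can tell when the budget ran out before the search finished, which is exactly the case where notifications may be missed.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.IntFunction;
import java.util.function.Predicate;

class BoundedEventSearch {
    /** Outcome of a bounded search; 'complete' is false when the budget ran out first. */
    static final class Outcome {
        final int accepted;
        final int queries;
        final boolean complete;

        Outcome(int accepted, int queries, boolean complete) {
            this.accepted = accepted;
            this.queries = queries;
            this.complete = complete;
        }
    }

    /**
     * @param fetchBatch hypothetical stand-in for the SQL query: given the batch
     *                   index, returns the next events (empty when exhausted)
     * @param accept     stands for the rights check and the post-filters
     * @param wanted     how many accepted events we are looking for (e.g. 20)
     * @param maxQueries hard limit on the number of queries (e.g. 100)
     * @param timeout    time budget for the whole search
     */
    static <E> Outcome search(IntFunction<List<E>> fetchBatch, Predicate<E> accept,
                              int wanted, int maxQueries, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        int accepted = 0;
        int queries = 0;
        while (accepted < wanted) {
            // Give up, and say so, when either budget is exhausted.
            if (queries >= maxQueries || System.nanoTime() - deadline >= 0) {
                return new Outcome(accepted, queries, false);
            }
            List<E> batch = fetchBatch.apply(queries);
            queries++;
            if (batch.isEmpty()) {
                break; // nothing left in the database: the result is exact
            }
            for (E e : batch) {
                if (accept.test(e)) {
                    accepted++;
                }
            }
        }
        return new Outcome(accepted, queries, true);
    }
}
```

The deadline is only checked between queries, so a single slow query can still overrun the timeout; the query limit bounds the work even when each query is fast.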
I hope this helps, and that we can find a solution together.