  1. XWiki Platform
  2. XWIKI-4723

Could not restart only one cluster instance in cluster mode because of stuck JVM

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.3
    • Fix Version/s: 2.3 M1, 2.2.4
    • Component/s: Observation
    • Labels:
      None
    • Environment:
      Tomcat 5.5.23
    • keywords:
      cluster, start, TCP
    • Difficulty:
      Medium

      Description

      Hi,

      We have a problem when restarting a single XWiki instance in cluster mode.

      The problem seems to come from the cache synchronisation mechanism (we use TCP to communicate between the two cluster instances).
      If we stop and then start one instance, its JVM hangs while establishing the TCP connection and the startup never completes.

      We can easily reproduce the problem. Here are the logs:
      At the beginning, both instances start correctly:

      • On TLSDLAS1 :
        [root@tlsdlas1 conf]# lsof -i | grep 3746
        java 3746 iwiki 10u IPv4 188182389 TCP *:26030 (LISTEN)
        java 3746 iwiki 12u IPv4 188182390 TCP *:26031 (LISTEN)
        java 3746 iwiki 17u IPv4 188182615 TCP *:26038 (LISTEN)
        java 3746 iwiki 21u IPv4 188182435 TCP *:7800 (LISTEN)
        java 3746 iwiki 22u IPv4 188182436 UDP *:7500
        java 3746 iwiki 23u IPv4 188182629 TCP localhost.localdomain:26039 (LISTEN)
        java 3746 iwiki 24u IPv4 188182438 TCP tlsdlas1.france.airfrance.fr:7800->tlsdlas2.france.airfrance.fr:52116 (ESTABLISHED)
      • On TLSDLAS2 :
        [root@tlsdlas2 ~]# lsof -i | grep 12531
        java 12531 iwiki 10u IPv4 253276341 TCP *:26030 (LISTEN)
        java 12531 iwiki 12u IPv4 253276342 TCP *:26031 (LISTEN)
        java 12531 iwiki 17u IPv4 253276538 TCP *:26038 (LISTEN)
        java 12531 iwiki 21u IPv4 253276386 TCP *:7800 (LISTEN)
        java 12531 iwiki 22u IPv4 253276387 UDP *:7500
        java 12531 iwiki 23u IPv4 253276390 TCP tlsdlas2.france.airfrance.fr:52116->tlsdlas1.france.airfrance.fr:7800 (ESTABLISHED)
        java 12531 iwiki 24u IPv4 253276649 TCP localhost.localdomain:26039 (LISTEN)
      • We stop TLSDLAS2 :
        [root@tlsdlas2 ~]# lsof -i | grep 12531
        [root@tlsdlas2 ~]# lsof -i | grep 52116
        [root@tlsdlas2 ~]# lsof -i | grep 7800

      The JVM stops correctly and the ports are released.

      • We start the JVM on TLSDLAS2 again:
        [root@tlsdlas2 ~]# lsof -i | grep 16874
        java 16874 iwiki 10u IPv4 253285025 TCP *:26030 (LISTEN)
        java 16874 iwiki 12u IPv4 253285026 TCP *:26031 (LISTEN)
        java 16874 iwiki 21u IPv4 253285106 TCP *:7800 (LISTEN)
        java 16874 iwiki 22u IPv4 253285107 UDP *:7500
        java 16874 iwiki 23u IPv4 253285113 TCP tlsdlas2.france.airfrance.fr:54098->tlsdlas1.france.airfrance.fr:7800 (ESTABLISHED)
      • The JVM hangs while opening the communication with the other instance. Here are the two threads concerned:
        "Connection.Sender [10.70.88.60:54098 - 10.70.88.26:7800],event,10.70.88.60:7800" prio=1 tid=0x0000002aef078060 nid=0x4202 waiting on condition [0x0000000041564000..0x0000000041564e30]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:118)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1767)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:359)
        at org.jgroups.blocks.TCPConnectionMap$TCPConnection$Sender.run(TCPConnectionMap.java:594)
        at java.lang.Thread.run(Thread.java:595)

        "Connection.Receiver [10.70.88.60:54098 - 10.70.88.26:7800],event,10.70.88.60:7800" prio=1 tid=0x0000002aed81e630 nid=0x4201 runnable [0x0000000041463000..0x0000000041463eb0]
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
        - locked <0x0000002ad6fb8e78> (a java.io.BufferedInputStream)
        at java.io.DataInputStream.readInt(DataInputStream.java:353)
        at org.jgroups.blocks.TCPConnectionMap$TCPConnection$ConnectionPeerReceiver.run(TCPConnectionMap.java:548)
        at java.lang.Thread.run(Thread.java:595)
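
      Comparing the two lsof listings for TLSDLAS2 shows which listeners did not come back after the restart, which is one way to see how far the startup got before it stopped. The following is a minimal sketch, assuming lsof -i output in the format shown in this report; the class and method names are hypothetical and are not part of XWiki or JGroups.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LsofListenDiff {
    // Matches the trailing part of a TCP listener line, e.g. "TCP *:26030 (LISTEN)".
    private static final Pattern LISTEN = Pattern.compile("TCP\\s+(\\S+)\\s+\\(LISTEN\\)");

    /** Returns the listening endpoints (as printed by lsof) found in the given lines. */
    static Set<String> listeners(List<String> lsofLines) {
        Set<String> result = new TreeSet<>();
        for (String line : lsofLines) {
            Matcher m = LISTEN.matcher(line);
            if (m.find()) {
                result.add(m.group(1));
            }
        }
        return result;
    }

    /** Returns the endpoints that were listening before the restart but not after it. */
    static Set<String> missingAfterRestart(List<String> before, List<String> after) {
        Set<String> missing = new TreeSet<>(listeners(before));
        missing.removeAll(listeners(after));
        return missing;
    }

    public static void main(String[] args) {
        // TLSDLAS2 before the stop (pid 12531), copied from the report.
        List<String> before = Arrays.asList(
            "java 12531 iwiki 10u IPv4 253276341 TCP *:26030 (LISTEN)",
            "java 12531 iwiki 12u IPv4 253276342 TCP *:26031 (LISTEN)",
            "java 12531 iwiki 17u IPv4 253276538 TCP *:26038 (LISTEN)",
            "java 12531 iwiki 21u IPv4 253276386 TCP *:7800 (LISTEN)",
            "java 12531 iwiki 22u IPv4 253276387 UDP *:7500",
            "java 12531 iwiki 23u IPv4 253276390 TCP tlsdlas2.france.airfrance.fr:52116->tlsdlas1.france.airfrance.fr:7800 (ESTABLISHED)",
            "java 12531 iwiki 24u IPv4 253276649 TCP localhost.localdomain:26039 (LISTEN)");
        // TLSDLAS2 after the restart (pid 16874), copied from the report.
        List<String> after = Arrays.asList(
            "java 16874 iwiki 10u IPv4 253285025 TCP *:26030 (LISTEN)",
            "java 16874 iwiki 12u IPv4 253285026 TCP *:26031 (LISTEN)",
            "java 16874 iwiki 21u IPv4 253285106 TCP *:7800 (LISTEN)",
            "java 16874 iwiki 22u IPv4 253285107 UDP *:7500",
            "java 16874 iwiki 23u IPv4 253285113 TCP tlsdlas2.france.airfrance.fr:54098->tlsdlas1.france.airfrance.fr:7800 (ESTABLISHED)");
        // prints [*:26038, localhost.localdomain:26039]
        System.out.println(missingAfterRestart(before, after));
    }
}
```

      With the listings from this report, the listeners on 26038 and localhost.localdomain:26039 are the ones missing after the restart, while the JGroups port 7800 is open and connected.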

      In this case, if we now stop the first instance, the JVM is released correctly and the second instance finishes starting.
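
      The Receiver thread in the dump is inside DataInputStream.readInt() on the socket to the other instance, and the second instance finishes starting once the first one is stopped. The sketch below reproduces that general behaviour with plain java.net sockets: a read on a connected socket whose peer stays silent returns only when the peer closes the connection, or when a read timeout (SO_TIMEOUT) is set and expires. It is an illustration only, not JGroups code and not the fix applied for this issue; the class and method names are hypothetical.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class BlockingReadDemo {
    /**
     * Connects to a local peer and calls readInt() on the socket.
     * Returns "timeout" if SO_TIMEOUT expired, "eof" if the peer closed the connection.
     */
    static String readOnce(int port, int soTimeoutMillis) throws IOException {
        try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), port)) {
            socket.setSoTimeout(soTimeoutMillis); // 0 means wait indefinitely
            DataInputStream in = new DataInputStream(socket.getInputStream());
            try {
                in.readInt();
                return "data";
            } catch (SocketTimeoutException e) {
                return "timeout";
            } catch (EOFException e) {
                return "eof";
            }
        }
    }

    /** Starts a peer that accepts one connection, sends nothing, and closes it after holdMillis. */
    static String demo(int soTimeoutMillis, long holdMillis) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) { // 0 = any free port
            Thread peer = new Thread(() -> {
                try (Socket accepted = server.accept()) {
                    Thread.sleep(holdMillis); // stay silent, then close on exit
                } catch (IOException | InterruptedException ignored) {
                    // nothing to clean up in this sketch
                }
            });
            peer.start();
            String outcome = readOnce(server.getLocalPort(), soTimeoutMillis);
            peer.join();
            return outcome;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo(200, 1000)); // prints timeout: the read gives up after 200 ms
        System.out.println(demo(0, 200));    // prints eof: the read returns only when the peer closes
    }
}
```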

      Thanks,


              People

              • Assignee:
                tmortagne Thomas Mortagne
                Reporter:
                jurevert Julien Revert