Problems bringing up two Cascade servers after cold database backup

bmaupin's Avatar

bmaupin

10 Jun, 2010 03:49 PM

We are running a cluster of 2 Cascade 6.4.0.1 servers on Red Hat Enterprise Linux 5.5 with a separate database server running on RHEL 5.5 with Oracle 10.2.0. We run a level 1 hot backup every day of the week, except for Sunday when we shut down the app servers, run a cold backup, a level 0 hot backup, and then bring the app servers back up.

We have this scripted, and for some reason every Sunday, only one of the app servers comes back up, and it's never the same one. I looked at the logs, and it looks like the app servers try to acquire a database lock a few seconds into the startup process. It looks like whichever server gets the lock is the one that successfully comes up. It then gives up the lock almost exactly 5 minutes later. The other server will try to get the lock, and appear to wait 5 minutes before giving up. Here's what it looks like in the logs:

server 1 (the one that successfully comes up)
cascade is shut down for backup:
2010-06-06 02:03:06,072 INFO [SessionFactoryImpl] closing

cascade is brought back up after backup:

2010-06-06 02:19:34,119 INFO  [[/]] Initializing Spring root WebApplicationContext
...
2010-06-06 02:19:39,106 INFO  [liquibase] Could not set remarks reporting on OracleDatabase: org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.setRemarksReporting(boolean)
2010-06-06 02:19:39,566 INFO  [liquibase] Lock Database
2010-06-06 02:19:39,627 INFO  [liquibase] Successfully acquired change log lock
2010-06-06 02:19:41,145 INFO  [liquibase] Reading from DATABASECHANGELOG
2010-06-06 02:19:41,527 INFO  [liquibase] Release Database Lock
2010-06-06 02:24:39,850 INFO  [liquibase] Successfully released change log lock

(more detailed log attached as cascade.log.2010-06-06-server1)

server 2 (the one that doesn't come up)
cascade is shut down for backup:

2010-06-06 02:03:11,695 INFO [SessionFactoryImpl] closing

cascade is brought back up after backup:

2010-06-06 02:19:34,304 INFO  [[/]] Initializing Spring root WebApplicationContext
...
2010-06-06 02:19:39,102 INFO  [liquibase] Could not set remarks reporting on OracleDatabase: org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.setRemarksReporting(boolean)
2010-06-06 02:24:39,847 INFO  [DefaultListableBeanFactory] Destroying singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@497062...
2010-06-06 02:24:39,849 ERROR [ContextLoader] Context initialization failed...nested exception is liquibase.exception.LockException: Could not acquire change log lock.  Currently locked by fe80:0:0:0:21d:9ff:fe64:efa4%2 (fe80:0:0:0:21d:9ff:fe64:efa4%2) since 6/6/10 2:19 AM
...
2010-06-06 02:24:39,880 INFO  [[/]] Closing Spring root WebApplicationContext
2010-06-06 08:16:25,781 INFO  [[/]] Initializing Spring root WebApplicationContext

(more detailed log attached as cascade.log.2010-06-06-server2)

What's strange to me is that server 1 releases the database lock within a few seconds, but doesn't release the change log lock for 5 minutes. However, when the cascade service is restarted during times other than after database backups, it releases both locks within a few seconds.

grep "\[liquibase\]" cascade.log.* | less
cascade.log.2010-01-12:2010-01-12 07:52:52,260 INFO  [liquibase] Could not set remarks reporting on OracleDatabase: org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.setRemarksReporting(boolean)
cascade.log.2010-01-12:2010-01-12 07:52:52,368 INFO  [liquibase] Lock Database
cascade.log.2010-01-12:2010-01-12 07:52:52,378 INFO  [liquibase] Successfully acquired change log lock
cascade.log.2010-01-12:2010-01-12 07:52:54,003 INFO  [liquibase] Reading from DATABASECHANGELOG
cascade.log.2010-01-12:2010-01-12 07:52:54,284 INFO  [liquibase] Release Database Lock
cascade.log.2010-01-12:2010-01-12 07:52:54,290 INFO  [liquibase] Successfully released change log lock
cascade.log.2010-02-15:2010-02-15 09:12:14,543 INFO  [liquibase] Could not set remarks reporting on OracleDatabase: org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.setRemarksReporting(boolean)
cascade.log.2010-02-15:2010-02-15 09:12:14,678 INFO  [liquibase] Lock Database
cascade.log.2010-02-15:2010-02-15 09:12:14,685 INFO  [liquibase] Successfully acquired change log lock
cascade.log.2010-02-15:2010-02-15 09:12:15,939 INFO  [liquibase] Reading from DATABASECHANGELOG
cascade.log.2010-02-15:2010-02-15 09:12:16,234 INFO  [liquibase] Release Database Lock
cascade.log.2010-02-15:2010-02-15 09:12:16,236 INFO  [liquibase] Successfully released change log lock

(you can see more examples in cascade-[liquibase].log. Notice how change log lock is released normally after a few seconds, but on Sundays at 2am for the past month or so (when the database backups occur) the change log lock is only released after 5 minutes)

I'm also wondering if the notification I see every time before it locks is an issue:

cascade.log.2010-06-06:2010-06-06 02:19:39,106 INFO  [liquibase] Could not set remarks reporting on OracleDatabase: 
org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.setRemarksReporting(boolean)

Thanks!

  1. 1 Posted by bmaupin on 10 Jun, 2010 03:54 PM

    bmaupin's Avatar

    I should note that after first looking at the issue, I suspected that the problem might be related to the fact that the app servers are brought back up less than a second from each other, and so they try to acquire the database and change log locks at nearly the same time. I've since modified the script that brings them back up to wait 30 seconds in between bringing up the next server, but I'll have to wait until our next backup on Sunday to see if it actually worked.

    Even if it does, I'm still wondering why it takes 5 minutes to release the change log lock after a database backup, but only a couple of seconds otherwise...

  2. 2 Posted by Bradley Wagner on 10 Jun, 2010 10:15 PM

    Bradley Wagner's Avatar

    We're still looking into this problem. Please let us know what you find when starting the instances some time apart. I'd actually bump that to about 2m if possible to give the first instance plenty of time to start.

    With respect to the lock release, are you able to confirm that it takes 5m to relinquish the lock even when one server is started by itself? I'm guessing this is related to the contention you're seeing though I've not been able to reproduce this locally.

  3. 3 Posted by bmaupin on 11 Jun, 2010 02:15 PM

    bmaupin's Avatar

    When one server is started by itself, it only seems to take a few seconds to release the lock:

    2010-06-11 09:09:16,346 INFO  [liquibase] Could not set remarks reporting on OracleDatabase: org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.setRemarksReporting(boolean)
    2010-06-11 09:09:16,462 INFO  [liquibase] Lock Database
    2010-06-11 09:09:16,475 INFO  [liquibase] Successfully acquired change log lock
    2010-06-11 09:09:18,036 INFO  [liquibase] Reading from DATABASECHANGELOG
    2010-06-11 09:09:18,330 INFO  [liquibase] Release Database Lock
    2010-06-11 09:09:18,333 INFO  [liquibase] Successfully released change log lock
    

    The same seems to even be true when we're restarting both of the servers by hand and a database backup hasn't occurred. But even then we'll start them both within 10 seconds of each other and everything's fine. My guess is either the script is starting them too close together or there's something different going on after we've done a cold database backup. I'm going to try to shut them down and bring them back up using a script after business hours and see if I can narrow it down.

  4. 4 Posted by Bradley Wagner on 11 Jun, 2010 04:48 PM

    Bradley Wagner's Avatar

    Thanks for the update. We've identified a resource contention issue with our framework that performs database updates when two servers are started at the same time.

    Starting them up a few minutes apart should solve it in the mean time while we try to fix the underlying issue.

  5. 5 Posted by bmaupin on 11 Jun, 2010 11:09 PM

    bmaupin's Avatar

    I went ahead and ran a script that simply brought the app servers down, waited a minute or so, and then brought them both back up at once. Just like before, it took 5 minutes for one server to release the change log lock, and the other timed out and never came back up.

    I ran that same script, waiting 30 seconds in between bringing up the first and second servers, and they both came up just fine. They both released the change log and database locks within seconds.

    So I can confirm that this issue has nothing to do with our database backup, and is simply caused by bringing the servers up at the same time, which is what I'd suspected.

  6. 6 Posted by Bradley Wagner on 14 Jun, 2010 12:53 PM

    Bradley Wagner's Avatar

    Thanks for the update. I'm going to go ahead and resolve this for now as we look into the issue causing this.

  7. Bradley Wagner closed this discussion on 14 Jun, 2010 12:53 PM.

Comments are currently closed for this discussion. You can start a new one.

Keyboard shortcuts

Generic

? Show this help
ESC Blurs the current field

Comment Form

r Focus the comment reply box
^ + ↩ Submit the comment

You can use Command ⌘ instead of Control ^ on Mac