Cascade CMS: Discussion

Problems bringing up two Cascade servers after cold database backup

2010-06-10T15:54:17Z

I should note that after first looking at the issue, I suspected the problem might be related to the fact that the app servers are brought back up less than a second apart, so they try to acquire the database and change log locks at nearly the same time. I've since modified the script that brings them back up to wait 30 seconds before bringing up the next server, but I'll have to wait until our next backup on Sunday to see whether it actually worked.

Even if it does, I'm still wondering why it takes 5 minutes to release the change log lock after a database backup, but only a couple of seconds otherwise...

2010-06-10T22:15:25Z

We're still looking into this problem. Please let us know what you find when starting the instances some time apart. I'd actually bump that to about 2 minutes if possible to give the first instance plenty of time to start.

With respect to the lock release, are you able to confirm that it takes 5 minutes to relinquish the lock even when one server is started by itself? I'm guessing this is related to the contention you're seeing, though I've not been able to reproduce it locally.

2010-06-11T14:15:25Z

When one server is started by itself, it only seems to take a few seconds to release the lock:

2010-06-11 09:09:16,346 INFO  [liquibase] Could not set remarks reporting on OracleDatabase: org.apache.tomcat.dbcp.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.setRemarksReporting(boolean)
2010-06-11 09:09:16,462 INFO  [liquibase] Lock Database
2010-06-11 09:09:16,475 INFO  [liquibase] Successfully acquired change log lock
2010-06-11 09:09:18,036 INFO  [liquibase] Reading from DATABASECHANGELOG
2010-06-11 09:09:18,330 INFO  [liquibase] Release Database Lock
2010-06-11 09:09:18,333 INFO  [liquibase] Successfully released change log lock
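
As a rough way to compare lock hold times between runs, the gap between the acquire and release lines can be computed from the log. The following is a minimal Python sketch, assuming the timestamp layout shown in the excerpt above; the function name `lock_hold_seconds` is hypothetical and not part of Cascade or Liquibase.

```python
from datetime import datetime

# Message fragments taken from the log excerpt above.
ACQUIRED = "Successfully acquired change log lock"
RELEASED = "Successfully released change log lock"

# Assumed timestamp layout: "2010-06-11 09:09:16,475" (first 23 characters).
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def lock_hold_seconds(log_lines):
    """Return the seconds between the first acquire line and the next
    release line, or None if no complete acquire/release pair is found."""
    acquired_at = None
    for line in log_lines:
        if ACQUIRED in line and acquired_at is None:
            acquired_at = datetime.strptime(line[:23], TIMESTAMP_FORMAT)
        elif RELEASED in line and acquired_at is not None:
            released_at = datetime.strptime(line[:23], TIMESTAMP_FORMAT)
            return (released_at - acquired_at).total_seconds()
    return None
```

On the excerpt above this gives a hold time of just under two seconds.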

The same even seems to be true when we restart both servers by hand and no database backup has occurred. Even then, we start them both within 10 seconds of each other and everything's fine. My guess is that either the script is starting them too close together or something different is going on after we've done a cold database backup. I'm going to try shutting them down and bringing them back up with a script after business hours to see if I can narrow it down.

2010-06-11T16:48:21Z

Thanks for the update. We've identified a resource contention issue with our framework that performs database updates when two servers are started at the same time.

Starting them up a few minutes apart should solve it in the meantime while we try to fix the underlying issue.

2010-06-11T23:09:35Z

I went ahead and ran a script that simply brought the app servers down, waited a minute or so, and then brought them both back up at once. Just like before, it took 5 minutes for one server to release the change log lock, and the other timed out and never came back up.

I then ran that same script again, this time waiting 30 seconds between bringing up the first and second servers, and they both came up just fine. They both released the change log and database locks within seconds.

So I can confirm that this issue has nothing to do with our database backup, and is simply caused by bringing the servers up at the same time, which is what I'd suspected.
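
For reference, the staggered restart described above can be sketched roughly as follows. This is only an illustration in Python, not the actual script used here: the function name `start_staggered` and the example paths are hypothetical, and it assumes each start command returns once its server has been launched.

```python
import subprocess
import time


def start_staggered(start_commands, delay_seconds=30,
                    run=subprocess.check_call, sleep=time.sleep):
    """Run each start command in order, pausing before every server
    after the first so they do not compete for the change log lock."""
    for index, command in enumerate(start_commands):
        if index > 0:
            sleep(delay_seconds)
        # check_call raises CalledProcessError if the command exits non-zero.
        run(command)


# Example call (hypothetical paths; not executed here):
# start_staggered(
#     [["/path/to/app1/bin/startup.sh"], ["/path/to/app2/bin/startup.sh"]],
#     delay_seconds=120,
# )
```

The `run` and `sleep` parameters exist only so the ordering can be checked without actually starting anything. A delay of 30 seconds matches what worked in this test; the 2 minutes suggested earlier in the thread would be the more cautious setting.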

2010-06-14T12:53:45Z

Thanks for the update. I'm going to go ahead and resolve this for now as we look into the issue causing this.