1 person found this helpful
Hi Michael Oryl,
Looking at your error indicates datastore is corrupted Or missing the binaries. I suspect the instance from where you took the backup the datastore is configured outside the sling home. So when you took the backup the datastore is missing. So you need to restore the backup of datastore also.
In order to confirm my suspect could you please restore the backup in another envirnoment and start the instance as stand alone and run datastore consistence check and let me know if it reports any error.
Sham, our system uses local TarPersistence datastores.
Either way, data corruption appears to have been the problem. The entire system died 14 hours after my initial post when the Tar Optimization process started at 2 a.m.
Millions of the following pair of lines were repeated in the error.log for our server's publish instance. It quickly filled up the disk and took down all processes with it. Why it felt the need to add millions of lines to the error log and not just give up on this one file is beyond me, though. In less than a minute and a half, each of the following lines was repeated nearly 500k times:
01.12.2012 02:00:01.407 *ERROR* [Tar PM Optimization] com.day.crx.persistence.tar.file.TarFile Failed to read, file /opt/cms/5.5.x/crx-publish-1/./repository/version/data_00197.tar id:197 length:38000128 append:38130176 1988663922 pos: 38042112 length: 51501.12.2012 02:00:01.407 *WARN* [Tar PM Optimization] com.day.crx.persistence.tar.file.TarFile Error reading entry /opt/cms/5.5.x/crx-publish-1/./repository/version/data_00197.tar id:197 length:38000128 append:38130176 1988663922
I have a backup from Friday morning that I will restore and run the consistency check against. I'll let you know what I find.
Turns out that I cannot run a consistency check. I can't get the instance up at all in order to run that test.
I would suggest to get in touch with Daycare to resolve this issue. Can you try this and see if it works (Assuming that you dont have any folder called crx.XXX or else procedure will be different) ?
"Start your instane as stand alone instance"
1) Stop your instance
2) Remove or rename /crx-quickstart/repository/cluster*
3) Remove or rename /crx-quickstart/repository/tarJournal
4) Remove or rename /crx-quickstart/repository/revision.log
Rename your logs folder to get fresh logs.
restart your instance.
For more detail about cluster and these files please read http://crxcluster.wemblog.com/
The author instance still won't start. It never gets past this point in the error.log:
04.12.2012 04:54:34.217 *INFO* [FelixStartLevel] org.apache.sling.servlets.post Service [org.apache.sling.servlets.post.impl.SlingPostServlet,170] ServiceEvent REGISTERED
04.12.2012 04:54:34.227 *INFO* [FelixDispatchQueue] org.apache.sling.bundleresource.impl Service [Provider of bundle based resources,171] ServiceEvent REGISTERED
04.12.2012 04:54:34.228 *INFO* [FelixDispatchQueue] org.apache.sling.servlets.post BundleEvent STARTED
04.12.2012 04:54:34.232 *INFO* [FelixDispatchQueue] org.apache.sling.servlets.resolver BundleEvent RESOLVED
04.12.2012 04:54:34.240 *INFO* [FelixStartLevel] org.apache.sling.servlets.resolver Service [org.apache.sling.servlets.resolver.internal.defaults.DefaultErrorHandlerServlet,172] ServiceEvent REGISTERED
04.12.2012 04:54:34.241 *INFO* [FelixDispatchQueue] org.apache.sling.servlets.resolver BundleEvent STARTED
04.12.2012 04:54:34.244 *INFO* [FelixDispatchQueue] com.day.commons.osgi.wrapper.svnkit BundleEvent RESOLVED
04.12.2012 04:54:34.244 *INFO* [FelixDispatchQueue] com.day.commons.osgi.wrapper.svnkit BundleEvent STARTED
04.12.2012 04:54:34.248 *INFO* [FelixDispatchQueue] org.apache.felix.framework BundleEvent STARTED
04.12.2012 04:54:34.248 *INFO* [FelixDispatchQueue] org.apache.felix.framework FrameworkEvent STARTED
04.12.2012 05:05:36.212 *ERROR* [127.0.0.1  GET / HTTP/1.0] org.apache.sling.engine.impl.SlingHttpContext handleSecurity: AuthenticationSupport service missing. Cannot authenticate request.
The following is what I see from the command line where I start the instance:
04.12.2012 04:54:26.684 *INFO * [main] Starting Apache Sling in /opt/cms/5.5.x/crx-author_1/.
04.12.2012 04:54:26.693 *INFO * [Apache Sling Control Listener@pennmwebdr/188.8.131.52:0] Apache Sling Control Server started at pennmwebdr/184.108.40.206:60547
04.12.2012 04:54:26.697 *INFO * [main] Checking launcher JAR in folder /opt/cms/5.5.x/crx-author_1/./launchpad
04.12.2012 04:54:26.713 *INFO * [main] Existing launcher is up to date, using it: 2.3.1.R1242720-B006 (org.apache.sling.launchpad.base.jar)
04.12.2012 04:54:26.716 *INFO * [main] Loading launcher class org.apache.sling.launchpad.base.app.MainDelegate from org.apache.sling.launchpad.base.jar
04.12.2012 04:54:26.736 *INFO* [main] Setting sling.launchpad=/opt/cms/5.5.x/crx-author_1/./launchpad
04.12.2012 04:54:26.736 *INFO* [main] Setting sling.home=.
04.12.2012 04:54:26.737 *INFO* [main] Setting sling.run.modes=author
04.12.2012 04:54:26.737 *INFO* [main] Setting org.osgi.service.http.port=4502
04.12.2012 04:54:26.737 *INFO * [main] Starting launcher ...
04.12.2012 04:54:26.740 *INFO* [main] HTTP server port: 4502
04.12.2012 04:54:27.524 *INFO* [OsgiInstallerImpl] org.apache.sling.installer.core.impl.OsgiInstallerImpl Apache Sling OSGi Installer Service started.
04.12.2012 04:54:34.245 *INFO * [main] Startup completed
You can see that it never really finishes startup. It's been sitting at this point for at least 15 minutes, so I don't really see it going any further.
I do not see any error in error log. Are you able to access atleast CRX or felix console ?
Sorry, Yogesh. I had to scrap the clustering attempt. It was taking too much time and we had other projects to work on.
For what it's worth, I tried to add a new slave to the same cluster from a different machine/network and had the same results, even though it was a virgin quickstart install and I had used the GUI to add the instance to the existing single-node cluster.
Great. It make sense. I would suggest to send your logs for analysis to daycare. In case there is an obvious error that can cause this issue again in future.