8 Replies Latest reply on Dec 6, 2012 7:36 AM by Yogesh Upadhyay

    "Unable to store record" error on newly cloned cluster member

    MichaelOryl

      Hi.  We're running CQ5 5.5 SP1 and I'm trying to setup an author instance on a machine at a disaster recovery site. I want the DR author instance to be clustered with our existing non-clustered author node on our staging server.  If I get this to work, I'll redo it with the production author server, which will leave us with a hot backup server in the case that the primary data center goes dark.

       

      I cloned the "master" author instance on our staging server using the CQ backup facility.

       

      I restored that backup onto the DR server and proceeded to follow the directions found in the "Manual Slave Cloning" section of http://dev.day.com/docs/en/crx/current/administering/cluster.html

       

      I then restarted the newly cloned slave isntance and saw that that it had automatically been added to the repository/cluster.properties file found on the staging master server.  This implied that the two servers were now linked.  You can also see this in the new slave's error.log file:

       

      30.11.2012 08:06:56.801 *INFO* [FelixStartLevel] com.day.crx.core.cluster.ClusterController Trying to connect to a master, as the file clustered.txt exists.

      30.11.2012 08:06:56.987 *INFO* [FelixStartLevel] com.day.crx.core.cluster.ClusterController Node author-dr started as: slave, connected to address: /XXX.XXX.XXX.237:8088

       

      The problem is, the cloned slave author instance will not come up properly.  I see the following error in the logs/error.log file:

       

      30.11.2012 07:54:24.876 *ERROR* [FelixStartLevel] com.day.crx.core.data.ClusterDataStore Unable to download data with identifier: Unable to store record.

      30.11.2012 07:56:08.669 *ERROR* [FelixStartLevel] com.day.crx.core.data.ClusterDataStore Unable to download data with identifier: Unable to store record.

       

      They seem to continue as long as the instance is not shut down.  While this is happening, attempting to connect to the CQ admin at http://dr-hostname:4502/ results in a simple "Not Found" error in the browser, which is what you see when you attempt to connect to the admin when the instance is first starting up.

       

      This implies that perhaps the instance is never starting due to this problem.

       

      Can anybody suggest what the issue might be?  I had no success with creating a new quickstart instance and using the GUI to join the existing author in a cluster.  The "Loading..." message showed for a long time before disappearing with nothing appearing to have worked and no error message.

        • 1. Re: "Unable to store record" error on newly cloned cluster member
          Sham HC Level 7

          Hi Michael Oryl,

           

              Looking at your error indicates datastore is corrupted Or missing the binaries.   I suspect the instance from where you took the backup the datastore is configured  outside the sling home.  So when you took the backup the datastore is missing.  So you need to restore the backup of datastore also.

           

              In order to confirm my suspect could you please restore the backup in another envirnoment and start the instance as stand alone and run datastore consistence check and let me know if it reports any error.

           

          Thanks,

          Sham

          1 person found this helpful
          • 2. Re: "Unable to store record" error on newly cloned cluster member
            MichaelOryl Level 1

            Sham, our system uses local TarPersistence datastores.

             

            Either way, data corruption appears to have been the problem.  The entire system died 14 hours after my initial post when the Tar Optimization process started at 2 a.m.

             

            Millions of the following pair of lines were repeated in the error.log for our server's publish instance.  It quickly filled up the disk and took down all processes with it.  Why it felt the need to add millions of lines to the error log and not just give up on this one file is beyond me, though.  In less than a minute and a half, each of the following lines was repeated nearly 500k times:

             

            01.12.2012 02:00:01.407 *ERROR* [Tar PM Optimization] com.day.crx.persistence.tar.file.TarFile Failed to read, file /opt/cms/5.5.x/crx-publish-1/./repository/version/data_00197.tar id:197 length:38000128 append:38130176 1988663922 pos: 38042112 length: 515


            01.12.2012 02:00:01.407 *WARN* [Tar PM Optimization] com.day.crx.persistence.tar.file.TarFile Error reading entry /opt/cms/5.5.x/crx-publish-1/./repository/version/data_00197.tar id:197 length:38000128 append:38130176 1988663922

            I have a backup from Friday morning that I will restore and run the consistency check against.  I'll let you know what I find.

            • 3. Re: "Unable to store record" error on newly cloned cluster member
              MichaelOryl Level 1

              Turns out that I cannot run a consistency check.  I can't get the instance up at all in order to run that test.

              • 4. Re: "Unable to store record" error on newly cloned cluster member
                Yogesh Upadhyay Level 4

                Michael,

                 

                I would suggest to get in touch with Daycare to resolve this issue. Can you try this and see if it works (Assuming that you dont have any folder called crx.XXX or else procedure will be different) ?

                 

                "Start your instane as stand alone instance"

                 

                1) Stop your instance

                2) Remove or rename /crx-quickstart/repository/cluster*

                3) Remove or rename /crx-quickstart/repository/tarJournal

                4) Remove or rename /crx-quickstart/repository/revision.log

                 

                Rename your logs folder to get fresh logs.

                 

                restart your instance.

                 

                For more detail about cluster and these files please read http://crxcluster.wemblog.com/

                 

                Yogesh

                www.wemblog.com

                • 5. Re: "Unable to store record" error on newly cloned cluster member
                  MichaelOryl Level 1

                  Hi Yogesh,

                   

                  The author instance still won't start.  It never gets past this point in the error.log:

                   

                  04.12.2012 04:54:34.217 *INFO* [FelixStartLevel] org.apache.sling.servlets.post Service [org.apache.sling.servlets.post.impl.SlingPostServlet,170] ServiceEvent REGISTERED

                  04.12.2012 04:54:34.227 *INFO* [FelixDispatchQueue] org.apache.sling.bundleresource.impl Service [Provider of bundle based resources,171] ServiceEvent REGISTERED

                  04.12.2012 04:54:34.228 *INFO* [FelixDispatchQueue] org.apache.sling.servlets.post BundleEvent STARTED

                  04.12.2012 04:54:34.232 *INFO* [FelixDispatchQueue] org.apache.sling.servlets.resolver BundleEvent RESOLVED

                  04.12.2012 04:54:34.240 *INFO* [FelixStartLevel] org.apache.sling.servlets.resolver Service [org.apache.sling.servlets.resolver.internal.defaults.DefaultErrorHandlerServlet,172] ServiceEvent REGISTERED

                  04.12.2012 04:54:34.241 *INFO* [FelixDispatchQueue] org.apache.sling.servlets.resolver BundleEvent STARTED

                  04.12.2012 04:54:34.244 *INFO* [FelixDispatchQueue] com.day.commons.osgi.wrapper.svnkit BundleEvent RESOLVED

                  04.12.2012 04:54:34.244 *INFO* [FelixDispatchQueue] com.day.commons.osgi.wrapper.svnkit BundleEvent STARTED

                  04.12.2012 04:54:34.248 *INFO* [FelixDispatchQueue] org.apache.felix.framework BundleEvent STARTED

                  04.12.2012 04:54:34.248 *INFO* [FelixDispatchQueue] org.apache.felix.framework FrameworkEvent STARTED

                  04.12.2012 05:05:36.212 *ERROR* [127.0.0.1 [1354626336202] GET / HTTP/1.0] org.apache.sling.engine.impl.SlingHttpContext handleSecurity: AuthenticationSupport service missing. Cannot authenticate request.

                   

                  The following is what I see from the command line where I start the instance:

                   

                  04.12.2012 04:54:26.684 *INFO * [main] Starting Apache Sling in /opt/cms/5.5.x/crx-author_1/.

                  04.12.2012 04:54:26.693 *INFO * [Apache Sling Control Listener@pennmwebdr/207.211.75.234:0] Apache Sling Control Server started at pennmwebdr/207.211.75.234:60547

                  04.12.2012 04:54:26.697 *INFO * [main] Checking launcher JAR in folder /opt/cms/5.5.x/crx-author_1/./launchpad

                  04.12.2012 04:54:26.713 *INFO * [main] Existing launcher is up to date, using it: 2.3.1.R1242720-B006 (org.apache.sling.launchpad.base.jar)

                  04.12.2012 04:54:26.716 *INFO * [main] Loading launcher class org.apache.sling.launchpad.base.app.MainDelegate from org.apache.sling.launchpad.base.jar

                  04.12.2012 04:54:26.736 *INFO* [main] Setting sling.launchpad=/opt/cms/5.5.x/crx-author_1/./launchpad

                  04.12.2012 04:54:26.736 *INFO* [main] Setting sling.home=.

                  04.12.2012 04:54:26.737 *INFO* [main] Setting sling.run.modes=author

                  04.12.2012 04:54:26.737 *INFO* [main] Setting org.osgi.service.http.port=4502

                  04.12.2012 04:54:26.737 *INFO * [main] Starting launcher ...

                  04.12.2012 04:54:26.740 *INFO* [main] HTTP server port: 4502

                  04.12.2012 04:54:27.524 *INFO* [OsgiInstallerImpl] org.apache.sling.installer.core.impl.OsgiInstallerImpl Apache Sling OSGi Installer Service started.

                  04.12.2012 04:54:34.245 *INFO * [main] Startup completed

                   

                  You can see that it never really finishes startup.  It's been sitting at this point for at least 15 minutes, so I don't really see it going any further.

                  • 6. Re: "Unable to store record" error on newly cloned cluster member
                    Yogesh Upadhyay Level 4

                    I do not see any error in error log. Are you able to access atleast CRX or felix console ?

                     

                    Yogesh

                    • 7. Re: "Unable to store record" error on newly cloned cluster member
                      MichaelOryl Level 1

                      Sorry, Yogesh.  I had to scrap the clustering attempt.  It was taking too much time and we had other projects to work on.

                       

                      For what it's worth, I tried to add a new slave to the same cluster from a different machine/network and had the same results, even though it was a virgin quickstart install and I had used the GUI to add the instance to the existing single-node cluster.

                      • 8. Re: "Unable to store record" error on newly cloned cluster member
                        Yogesh Upadhyay Level 4

                        Great. It make sense. I would suggest to send your logs for analysis to daycare. In case there is an obvious error that can cause this issue again in future.

                         

                        Yogesh