CPD bloat

Report · Sep 11, 2007

We have a large X5 webhelp project (5000+ topics) with a problem: the cpd file bloats up to an unmanageable size just within one day's time. Starting with a clean cpd of about 3-4 MB, rebuilt from scratch in the morning (which takes quite a while!), the cpd quickly grows with each save until by the end of the day it is too big to work properly. It easily grows to 25 MB, and yesterday the writer reported that it went to over 40 MB!

The writer is doing a lot of structural work like adding/deleting/renaming/moving files and folders, so this is probably a worst-case scenario for impacting the cpd. But it's work that needs to be done, so we're stuck for it.

Aside from the obvious solution of breaking the project down into subprojects, is there any other way to deal with this problem? I've just tried to "compact and repair" the cpd file in Access, but haven't had a chance to test and see if that works any differently than simply deleting the cpd file and letting it rebuild. Also, we are moving to RH6; is that likely to help this problem at all?

Thanks for any suggestions...

G

Report · Sep 11, 2007

Hi,

You say,
The writer is doing a lot of structural work like adding/deleting/renaming/moving files and folders,

Probably it goes without saying, but I'm a little compulsive about asking:

All this is done via RH project operations, not in Windows Explorer.

Right?

Harvey

Report · Sep 11, 2007

Yes, that is correct.

A bit more data. I've just been playing around in a copy of the project and have seen it in action. I've added some topics to the project, and with each one, I've done a Save All and then checked the cpd file size. The size increment in the cpd is pretty consistent and depends on the template used to create the topic. At the low end, if I create a topic using "none" (the default template), the cpd grows by about 150 KB. At the larger end, if I create a topic using one of our templates with more content, the cpd grows by 500 KB. I can now see how it gets so big so fast! Making it stop...not so clear.
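To put rough numbers on those increments, here is a back-of-the-envelope sketch. It assumes growth is linear and comes only from creating topics, which is a simplification since other saves also add to the file. The function name is made up for illustration.

```python
def topics_until(start_mb, limit_mb, growth_kb_per_topic):
    """Estimate how many new topics take the cpd from start_mb to limit_mb.

    Assumes every new topic adds the same number of KB (a simplification).
    """
    return (limit_mb - start_mb) * 1024 / growth_kb_per_topic


# Increments observed in the test above: 150 KB (default template)
# and 500 KB (a content-heavy template).
for growth_kb in (150, 500):
    count = topics_until(4, 40, growth_kb)
    print(f"{growth_kb} KB per topic: roughly {count:.0f} topics from 4 MB to 40 MB")
```

On those assumptions, a few hundred new topics at most would be enough to account for a day's growth.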

G

Report · Sep 11, 2007

You obviously appreciate the CPD is an Access database. Personally I think compacting is safer as deletion requires a rebuild with the attendant risk of loss of information. Compacting should avoid that. Neither can be guaranteed so save a copy somewhere safe.

You also appreciate why it is growing but I must admit that whilst I have no yardstick, that does seem to be high.

I use merged webhelp and tend to regard 4,000 to 5,000 files as being the limit. There is actually no real limit as much depends on content but that's a reasonable guideline.

I don't think there is much else you can do, except as you say break it down into smaller projects. Presumably the massive restructure is a passing thing so perhaps easier to just live with it?

Help others by clicking Correct Answer if the question is answered. Found the answer elsewhere? Share it here. "Upvote" is for useful posts.

Report · Sep 11, 2007

Thanks for the input, Peter. While the structural rework will end, there's enough activity in the project that our problems won't end there. I think I'll have to look for ways to decrease the size of the project and cpd, whether it be through merging or by some other means. I might be able to thin out some types of links (ones that the end user doesn't see anyway), and maybe some cleaning up of that ilk would help the cpd size.

Thank you for your quick response and insight in this and so many other topics in the forum. I've learned a lot, and I really, really appreciate the amount of time and care you spend here!

Cheers

G

Report · Sep 11, 2007

It's my pleasure when people appreciate it. Just gets annoying when others take it for granted.

If you do go the merged help route, there's some stuff in my merged webhelp topic about splitting a project.

Still finding that amount of bloat hard to figure.

Report · Sep 11, 2007

I'll be sure to check your site for tips on blowing apart projects.

Yes, it does seem like a lot of bloat, doesn't it? I need to look into it some more. In particular, I intend to compare this project with other analogous projects. That should be interesting.

G

Report · Sep 11, 2007

Hmm. Not the phrase I would have used but when I once complimented someone on a bodge that got a car going, they said it was not a bodge but "site engineering" so what do I know?

Report · Sep 11, 2007

Hi all

Just a possible bit of insight here. The actual size of the bloat may vary, depending on your hard drive. This has to do with the size of the clusters that comprise the actual storage space used. For example, one may have a file that is only a few bytes in size. But in order to save that file on the hard drive, a complete cluster is required. Granted, I'm no maven of cluster knowledge, but let's say a cluster is 32,000 bytes in size. That small file that is maybe only 3 bytes long would report as consuming 32,000 bytes, as that is the smallest cluster size. So if you increased the file to perhaps 32,001 bytes, it would then report as consuming 64,000 bytes.
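The rounding described above can be sketched as follows. This is only an illustration of the arithmetic, not a way to read real disk usage. The 32,000-byte cluster is the made-up figure from the example, and the 4,096-byte cluster in the last line is an assumed smaller size for comparison.

```python
def size_on_disk(file_bytes, cluster_bytes):
    """Round a file size up to a whole number of clusters."""
    if file_bytes == 0:
        return 0
    clusters = -(-file_bytes // cluster_bytes)  # integer ceiling division
    return clusters * cluster_bytes


print(size_on_disk(3, 32000))       # 32000
print(size_on_disk(32001, 32000))   # 64000

# With a smaller, assumed cluster size the rounding slack is much less:
grown = 150 * 1024                  # a 150 KB increment
print(size_on_disk(grown, 4096) - grown)  # 2048 bytes of slack
```

The slack is always less than one cluster per file, so cluster rounding alone would only explain growth on the order of the cluster size.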

I'm thinking that maybe this may account for it?

Just thinking out loud... Rick

Report · Sep 11, 2007

Cool thought. Wouldn't whether the disk is formatted FAT or NTFS be an issue in that case?

Report · Sep 12, 2007

I agree. Rick's examples show how FAT / NTFS is a major concern, as Peter says.

More random thoughts:

Have you checked your disk fragmentation lately and de-fragged if it was recommended?

Do you have a lot of file names that are extremely long? Do you make heavy use of the Properties dialog -- To Do list, Comments, etc.? If you use build tags heavily, do they have short names?

I'm not an expert on MS Access, but the data tables are fairly simple. Most are cross-references in two columns. Some are short lists.

The longest individual record is in the Topics table, which has columns for file name (twice) and title, build tags applied (topic or text level), plus flags and entries in the Properties dialog, such as priority, comments, to do list checkoffs, and so on. Multiplied by 5,000-plus records the table could need a lot of disk memory. And I don't know how efficient Access is when a table has 75,000 to 100,000 cells (even a "null" is a piece of data).

Compacting the database in MS Access should reduce any bloat that RH may have created in its own housekeeping.

But first priority should be to follow Peter's thought.

Harvey

Report · Sep 13, 2007

Thanks for the additional thoughts and suggestions. I'm still chipping away at this. Some additional factoids, in no particular order:
1. We're configured as NTFS, with default cluster size. (According to MS, that would be 4K.)
2. Going to RH6 didn't seem to affect the problem, although the writer reported that rebuilding the cpd took twice as long (somewhere in the neighbourhood of 4 hours!).
3. Preliminary comparison of this project with a much smaller sibling project shows a striking difference in cpd growth pattern. When I add a topic to the smaller project, I see a cpd size increment that is *one tenth* that of the problem project! This is not precisely an "apples to apples" comparison, but it is definitely pertinent.
4. We do use build tags rather heavily, with names running from 4 characters to as much as 30. Most are about 10-15 characters. We also have long filenames, along with several layers of folder structure. Not much in the way of comments or to-do lists. There are lots of topic-to-topic links.
5. The writer tells me it's "been awhile" since she defragged, so that was a great suggestion. In fact, I think I'll do so myself!

As you can see, I have more checking to do on this. I appreciate the extra paths you've given me to explore.

G

Report · Sep 13, 2007

Have you tried to "Jetcomp" the CPD? Jetcomp is a free tool from Microsoft to repair Access databases, which also cleans up a lot of mess and compacts it.

It works very well on database files which have grown a lot while working on them. The Access engine is not exactly known for generating clean, slim files.

After I tried my luck on a RSC database, thus reducing the file from 24 MB to 2.5 MB, I regularly treat all my MDBs and CPDs.

To use it on a CPD, you'll have to rename the CPD into a MDB.
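A minimal sketch of that rename step, done on copies so the original CPD is never touched. The function name and the `.bak` naming are my own assumptions, and the example file name is hypothetical. It only prepares the files; Jetcomp itself still has to be run by hand on the resulting MDB, presumably with the project closed.

```python
import shutil
from pathlib import Path


def make_mdb_copy(cpd_path):
    """Back up a .cpd file and create a .mdb copy for a compaction tool.

    Returns the path of the .mdb copy. The original .cpd is left as-is.
    """
    cpd = Path(cpd_path)
    backup = cpd.with_name(cpd.name + ".bak")   # e.g. MyProject.cpd.bak
    mdb = cpd.with_suffix(".mdb")               # e.g. MyProject.mdb
    shutil.copy2(cpd, backup)                   # safety copy first
    shutil.copy2(cpd, mdb)                      # copy for the tool to open
    return mdb


# Hypothetical usage:
# mdb = make_mdb_copy(r"C:\Projects\MyProject\MyProject.cpd")
```

Once the compacted MDB has been produced, it would be renamed back to `.cpd` in place of the original, with the `.bak` copy kept until the project opens cleanly.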

Hope this helps.

---Dirk Bock

Report · Sep 14, 2007

Nice one Dirk. I have just created Snippet 78 on my site so that information does not get lost. There's also a link to the download page.

Report · Sep 14, 2007

Cool! Thanks very much for the pointer, Dirk. I'll definitely try that out.

When I use Access to compact this cpd file, it seems like RH just goes ahead and rebuilds it anyway when I open the project.

Cheers

G

Report · Sep 17, 2007

G --

I wasn't aware of RH rebuilding the cpd file on opening the project. Yes, it should open the file and revise it while you're working in the project, but rebuilding it from scratch (and, I'm guessing from your comment, inflating it to the pre-compacted size) is something new for me.

Peter, Dirk, Rick -- Any insights here?

G --
You confirm that you have long filenames and build tags. Are you replicating that in your test project?

H

Report · Sep 17, 2007

quote:

Originally posted by: HKabaker
I wasn't aware of RH rebuilding the cpd file on opening the project. Yes, it should open the file and revise it while you're working in the project, but rebuilding it from scratch (and, I'm guessing from your comment, inflating it to the pre-compacted size) is something new for me.

No, RH doesn't rebuild the CPD as long as it doesn't have to, i.e., as long as there is a fitting CPD available. In theory, you could decide to always delete the CPD after you close the project so as to force RH to rebuild it. The CPD will be less convoluted than usual, but the rebuild takes quite some time.

---Dirk Bock

Report · Sep 17, 2007

Another thought --

When compacting and repairing the .cpd file in Access, did you see an option to upgrade it for compatibility with the latest release of MS Access on your computer? (See my new question in this forum about Access 2007.)

H

Report · Sep 17, 2007

quote:

Originally posted by: HKabaker
When compacting and repairing the .cpd file in Access, did you see an option to upgrade it for compatibility with the latest release of MS Access on your computer? (See my new question in this forum about Access 2007.)

As Peter said: don't upgrade. RH uses, at least up to version X5, the old Jet engine and will thus not be able to open a converted CPD.

---Dirk Bock

Report · Sep 19, 2007

====================
4. We do use build tags rather heavily, with names running from 4 characters to as much as 30. Most are about 10-15 characters. We also have long filenames, along with several layers of folder structure. Not much in the way of comments or to-do lists. There are lots of topic-to-topic links.
=====================

No time to search right now, but isn't there a limit to the number of characters allowed in a conditional statement? Compound "build tags...with characters...as much as 30" with "long filenames, along with several layers of folder structure," and you're probably straining the limits over and over.

Good luck,
Leon

Report · Sep 19, 2007

Thx, Colum!

Report · Sep 17, 2007

Don't upgrade the CPD. Keep it 97 compatible. It might not do any harm but why risk it?

Report · Sep 19, 2007

The limit is 255 characters.

Report · Sep 19, 2007

The database table for hyperlinks is a cross-reference by topic ID numbers, so its columns contain only a few digits. Same for links to graphics, pictures, image maps and so on.

The Topic List table has a lot of bloat potential. The more I get into it, the more questions I have.

1. Each topic record has a column for Build Tags and another for Context Build tags. Each contains a string of build tag names, separated by commas, presumably the tags that you have applied to the topic and/or text in it. And, presumably, the tags still apply to the topic and/or text.

So a string of very long tag names uses more memory than a string of short tag names. This can explain only a small amount of bloat, however, because a string of text doesn't use much space.

2. When you un-apply a build tag, presumably the Meta tag is deleted from the source code, and RH edits the database table. But RH slips up on housekeeping here.

I'm not certain why. But I found a lot of spurious entries for build tags in several topics.

Perhaps they were tags that had been applied and unapplied. But the table isn't updated. Call this a Type A problem.

Some of these topics were imported from earlier projects, where the tags were applied. In the new project however, the build tag didn't exist when I imported the topic. Maybe RH made note of it somewhere. It didn't automatically create a tag in the new project, and there's no way to see it unless you happen to see the old Meta tag in the source file. Call this a Type B problem.

So a long string of spurious entries of long tag names will use some undetermined amount of storage space.

3. A further complication: The Topic List table contained some spurious entries for topics where the tags were absent from the project as well as source code. In other words, there is no old build tag code in the topic, and the project doesn't have this specific build tag, but the topic record in the table has multiple entries in the build tag columns. This is a special case of Type A.

4. OK. Let's throw away the .cpd and let RH rebuild it. That should clean it up, right?

No. Go figure.

In the new .cpd file some records were cleaned up, probably Type A detritus. But some were not, maybe because the old Meta tags were still in the source, the Type B case.

I'm not certain because I haven't tracked down each and every difference between the old .cpd file and the new one, against the topic source code.

5. At first I thought the folder structure wasn't a problem, because each topic folder has an ID, and each topic record refers to the folder code, one or more digits. (By the way, in rebuilding the .cpd file, RH condenses the ID numbers and assigns new ones for topics, project folders and build tags, for example.)

However, there is a column headed TopicStringID that shows the file/path string needed, I think, for "breadcrumbs." It reflects the multilevel folder structure. So whether you use it or not, the breadcrumbs feature adds something to the .cpd file size.

Probably there are other factors in running a 5 MB .cpd file up to 20+ MB.
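As a rough check on point 1, the sketch below estimates how much the build tag strings alone could add. The figure of 10 tags per topic is purely an assumption for illustration, as is one byte per character; the 15 and 30 character tag lengths are the typical and maximum lengths reported earlier in the thread.

```python
def tag_column_bytes(topics, tags_per_topic, avg_tag_len):
    """Estimate bytes used by one comma-separated build tag column.

    Assumes one byte per character and one comma between adjacent tags.
    """
    commas = max(tags_per_topic - 1, 0)
    per_topic = tags_per_topic * avg_tag_len + commas
    return topics * per_topic


for tag_len in (15, 30):
    total = tag_column_bytes(5000, 10, tag_len)
    print(f"{tag_len}-character tags: about {total / (1024 * 1024):.2f} MB")
```

Even with every tag at the maximum length, this comes to about a megabyte or two per column on these assumptions, which fits the view that tag names explain only a small part of the bloat.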

But this leads me back to my favorite rant:

Be stingy with file names. Yes, we've come a long way from the FORTRAN and DOS constraints of 4, 6 or 8-character labels. But that doesn't negate the value of economy in naming conventions.

Same goes for any other set of labels, like build tags.

Keep the project file structure shallow to avoid long strings in hyperlinks. Remember, the output TOC can have as many levels as you think necessary to organize the material, while the project directories can be different. Yes, it's hard to keep all topics in the top level folder, but a two- or three-deep structure should be enough for managing the project, even though you may have a five-level TOC in the output.

Did I mention you should use short folder names, too?

I welcome comments and rebuttal.

Harvey

Report · Sep 20, 2007

You guys are awesome. I'm going to have to reread and digest your comments some more, but here's some additional info:
- I always leave the cpd in Access 97 - didn't even consider converting it.
- The build expression is 99 characters total.
- Project folder structure is three levels: root, first subfolder level, and one more sublevel after that.

Ruminating...
G

Adobe Community
