The database table for hyperlinks is a cross-reference by
topic ID numbers, so its columns contain only a few digits. Same
for links to graphics, pictures, image maps and so on.
The Topic List table has a lot of bloat potential. The more I
get into it, the more questions I have.
1. Each topic record has a column for Build Tags and another
for Context Build tags. Each contains a string of build tag names,
separated by commas, presumably the tags that you have applied to
the topic and/or text in it. And, presumably, the tags still apply
to the topic and/or text.
So a string of very long tag names uses more memory than a
string of short tag names. This can explain only a small amount of
bloat, however, because a string of text doesn't use much space.
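A rough back-of-envelope check of that claim, as a sketch only. The
topic count, tags per topic, tag lengths and the 2 bytes per
character are all made-up assumptions for illustration, not figures
read from a real .cpd file:

```python
def tag_column_bytes(topics, tags_per_topic, avg_tag_len, bytes_per_char=2):
    """Estimate the storage used by one comma-separated tag column.

    bytes_per_char=2 is an assumption (two-byte text storage); the real
    figure depends on how the database stores strings.
    """
    # Each tag is followed by a comma except the last one.
    chars_per_topic = tags_per_topic * avg_tag_len + max(tags_per_topic - 1, 0)
    return topics * chars_per_topic * bytes_per_char


# Hypothetical project: 2000 topics with 5 tags each.
short_names = tag_column_bytes(topics=2000, tags_per_topic=5, avg_tag_len=6)
long_names = tag_column_bytes(topics=2000, tags_per_topic=5, avg_tag_len=30)
print(short_names, long_names)
```

Under those assumptions even the long names come to well under
1 MB per column, which is consistent with tag strings being only a
small part of the bloat.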
2. When you un-apply a build tag, presumably the Meta tag is
deleted from the source code, and RH edits the database table. But
RH slips up on housekeeping here.
I'm not certain why. But I found a lot of spurious entries
for build tags in several topics.
Perhaps they were tags that had been applied and then unapplied,
but the table was never updated. Call this a Type A problem.
Some of these topics were imported from earlier projects,
where the tags were applied. In the new project, however, the build
tag didn't exist when I imported the topic. Maybe RH made note of
it somewhere. It didn't automatically create a tag in the new
project, and there's no way to see it unless you happen to see the
old Meta tag in the source file. Call this a Type B problem.
So a long string of spurious entries of long tag names will
use some undetermined amount of storage space.
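The check I did by eye can be sketched in a few lines. This assumes
you have already pulled each topic's build tag column out of the
table as a comma-separated string and that you know the set of tags
the project actually defines. The topic names, tag names and the
function name below are hypothetical:

```python
def spurious_tags(tag_column, project_tags):
    """Return the tag names in a comma-separated column that the
    project does not define, in their original order."""
    names = [t.strip() for t in tag_column.split(",") if t.strip()]
    return [t for t in names if t not in project_tags]


# Hypothetical data, for illustration only.
project_tags = {"Online", "Print"}
records = {
    "intro.htm": "Online, Print",
    "setup.htm": "Online, LegacyCustomerEdition, Beta",
}

# Keep only the topics that carry at least one undefined tag.
report = {}
for topic, column in records.items():
    extra = spurious_tags(column, project_tags)
    if extra:
        report[topic] = extra

print(report)
```

This does not tell Type A from Type B. For that you would also have
to look for the old Meta tag in each topic's source.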
3. A further complication: The Topic List table contained
some spurious entries for topics where the tags were absent from
the project as well as source code. In other words, there is no old
build tag code in the topic, and the project doesn't have this
specific build tag, but the topic record in the table has multiple
entries in the build tag columns. This is a special case of Type A.
4. OK. Let's throw away the .cpd and let RH rebuild it. That
should clean it up, right?
No. Go figure.
In the new .cpd file some records were cleaned up, probably
Type A detritus. But some were not, maybe because the old Meta tags
were still in the source, the Type B case.
I'm not certain because I haven't tracked down each and every
difference between the old .cpd file and the new one, against the
topic source code.
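Tracking down the differences could be scripted rather than done by
hand. A minimal sketch, assuming the build tag column of the old and
new tables has been exported as topic-name/tag-string pairs. It keys
on the topic file name rather than the topic ID, because the rebuild
assigns new IDs. All names and data here are hypothetical:

```python
def tag_set(column):
    """Split a comma-separated tag column into a set of clean names."""
    return {t.strip() for t in column.split(",") if t.strip()}


def compare_rebuild(old, new):
    """For each topic in the old table, list the tags the rebuild
    dropped and the tags that survived it."""
    result = {}
    for topic, old_column in old.items():
        before = tag_set(old_column)
        after = tag_set(new.get(topic, ""))
        result[topic] = {
            "dropped": sorted(before - after),
            "kept": sorted(before & after),
        }
    return result


# Hypothetical exports of the old and the rebuilt table.
old_cpd = {"intro.htm": "Online, Draft", "setup.htm": "Online, LegacyEdition"}
new_cpd = {"intro.htm": "Online", "setup.htm": "Online, LegacyEdition"}

diff = compare_rebuild(old_cpd, new_cpd)
for topic, change in diff.items():
    print(topic, change)
```

Any spurious tag that shows up under "kept" is a candidate for the
Type B case and worth checking against the topic source.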
5. At first I thought the folder structure wasn't a problem,
because each topic folder has an ID, and each topic record refers
to the folder code, one or more digits. (By the way, in rebuilding
the .cpd file, RH condenses the ID numbers and assigns new ones
to topics, project folders and build tags, for example.)
However, there is a column headed TopicStringID that shows
the file/path string needed, I think, for "breadcrumbs." It
reflects the multilevel folder structure. So whether you use it or
not, the breadcrumbs feature adds something to the .cpd file size.
Probably there are other factors in running a 5 MB .cpd file
up to 20+ MB.
But this leads me back to my favorite rant:
Be stingy with file names. Yes, we've come a long way from
the FORTRAN and DOS constraints of 4, 6 or 8-character labels. But
that doesn't negate the value of economy in naming conventions.
Same goes for any other set of labels, like build tags.
Keep the project file structure shallow to avoid long strings
in hyperlinks. Remember, the output TOC can have as many levels as
you think necessary to organize the material, while the project
directories can be different. Yes, it's hard to keep all topics in
the top level folder, but a two- or three-deep structure should be
enough for managing the project, even though you may have a
five-level TOC in the output.
Did I mention you should use short folder names, too?
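To see where a project stands, something like the following could
flag the offenders. It is only a sketch: it takes a list of
project-relative paths written with forward slashes, and the limits
of 3 folder levels and 40 characters are arbitrary values picked for
the example, not recommendations from RH:

```python
from pathlib import PurePosixPath


def path_stats(relative_paths):
    """Return (path, folder depth, string length) for each path."""
    stats = []
    for p in relative_paths:
        depth = len(PurePosixPath(p).parts) - 1  # folders above the file
        stats.append((p, depth, len(p)))
    return stats


def flag_paths(relative_paths, max_depth=3, max_length=40):
    """Return the paths nested deeper or spelled longer than the limits."""
    return [
        p
        for p, depth, length in path_stats(relative_paths)
        if depth > max_depth or length > max_length
    ]


# Hypothetical topic paths, for illustration only.
topics = [
    "intro.htm",
    "admin/setup/install.htm",
    "admin/setup/windows/server/advanced/tuning.htm",
    "reference_material_for_administrators/overview.htm",
]
flagged = flag_paths(topics)
print(flagged)
```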
I welcome comments and rebuttal.
Harvey