Link Management Algorithms

A listing of the operations that a system must carry out to ensure referential integrity, so that links remain valid as files and folders are moved or renamed.

About Link Management

The DreamWeaver HTML program serves as a proof of concept for a system that manages links in a flat-file system. (It thereby demonstrates that a database is not a fundamental requirement for a CMS.) When files and directories are moved or renamed, DreamWeaver finds all links to them and can change those links automatically.

In essence, the data structures consist of a list of file paths and a list of links for each file (using hash maps to improve performance). When a file is moved or renamed, any links the file contains need to be adjusted, as well as any links pointing to that file from elsewhere. The link lists make it possible to rapidly identify the changes that need to be made. Transforms can then be generated and applied to the target files to make the changes.
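As a rough sketch of those data structures, the Python below keeps one hash map from each file to the links it contains. The reverse map (from a file to the files that link to it) is an assumption added here so that lookups do not require scanning every list. The class and method names are hypothetical.

```python
from collections import defaultdict


class LinkIndex:
    """Forward and reverse link lists, keyed by site-relative file path."""

    def __init__(self):
        self.outgoing = defaultdict(set)  # file -> targets it links to
        self.incoming = defaultdict(set)  # file -> files that link to it (assumed extra map)

    def set_links(self, source, targets):
        """Replace the link list for `source`, keeping the reverse map in sync."""
        for old_target in self.outgoing.pop(source, set()):
            self.incoming[old_target].discard(source)
        for target in targets:
            self.outgoing[source].add(target)
            self.incoming[target].add(source)

    def affected_by_rename(self, path):
        """Files whose links must be rewritten when `path` is renamed or moved."""
        return sorted(self.incoming.get(path, set()))


index = LinkIndex()
index.set_links("guide/intro.html", ["guide/setup.html", "img/logo.png"])
index.set_links("index.html", ["guide/setup.html"])
print(index.affected_by_rename("guide/setup.html"))
```

Without the reverse map, the same answer can be found by examining every file's list, which is simpler to maintain but slower on a large site.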

Enabling Smart Builds

The data structures used for link management can also be used to manage build dependencies, so a DITA document-production system can do smart builds that regenerate a file only when the sources it depends on have been modified. That is something that is not currently possible using the DITA Open Toolkit.
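A minimal sketch of the smart-build check, assuming each output file's link list is treated as its dependency list and that modification times are a good enough signal of change. The function names and the `dependencies` mapping are hypothetical.

```python
import os


def file_mtime(path):
    """Modification time of `path`, or None if the file does not exist."""
    try:
        return os.path.getmtime(path)
    except OSError:
        return None


def needs_rebuild(output, sources, mtime=file_mtime):
    """True if `output` is missing or older than any source it depends on."""
    built = mtime(output)
    if built is None:
        return True  # never built
    for source in sources:
        changed = mtime(source)
        if changed is None or changed > built:
            return True  # source is missing or newer than the output
    return False


def stale_outputs(dependencies, mtime=file_mtime):
    """`dependencies` maps each output file to the sources it is built from."""
    return sorted(out for out, sources in dependencies.items()
                  if needs_rebuild(out, sources, mtime))
```

The `mtime` parameter exists only so the check can be exercised without touching the file system. A content hash could be substituted for the timestamp if timestamps prove unreliable.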

Link Management Operations

There are two important aspects to link management:

  • Link adjustments (keep them from breaking)
  • Version-aware behavior (keep the VCS informed)

Link Adjustments

There are several kinds of structure changes to take into account, with somewhat different operations for each:

  • File Rename: Examine the link lists for all files in the site. When the target file’s name is found in one of the lists, fix the links in the file associated with that list.
  • File Move: Perform the file rename operations. In addition, adjust all relative links in the moved file.
  • Directory Rename: For all files in the directory and its subdirectories, use the File Rename code to identify candidate files that might need to change. For those files, adjust all links that include the directory name in the path.
  • Directory Move: Perform the directory rename operations. In addition, for all files in the directory and its subdirectories, adjust relative links. (Somewhat tricky. Relative links within the relocated hierarchy do not need to change. Only links that go outside the hierarchy need to be changed.)
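The tricky part of a directory move can be sketched as a filter over (source, target) pairs. The sketch assumes every link is relative and that both ends have already been normalized to site paths as they stood before the move. A link needs new text only when exactly one end lies inside the moved hierarchy. The function names are hypothetical.

```python
def is_inside(path, directory):
    """True if site path `path` is `directory` or lies somewhere beneath it."""
    return path == directory or path.startswith(directory.rstrip("/") + "/")


def links_to_adjust(moved_dir, links):
    """Return the (source, target) pairs whose link text must change.

    Links with both ends inside `moved_dir` travel together and keep their
    relative text; links with both ends outside are untouched by the move.
    """
    return [(source, target) for source, target in links
            if is_inside(source, moved_dir) != is_inside(target, moved_dir)]


links = [
    ("docs/a.html", "docs/sub/b.html"),  # inside -> inside: unchanged
    ("docs/a.html", "img/logo.png"),     # inside -> outside: adjust
    ("index.html", "docs/a.html"),       # outside -> inside: adjust
    ("index.html", "img/logo.png"),      # outside -> outside: unchanged
]
print(links_to_adjust("docs", links))
```

Site-rooted (absolute) links would not follow this rule, since an inside-to-inside absolute link does change when the directory moves.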

Version-Aware Behavior

Two kinds of “pre-commit” processing are needed:

  • List Management: When files are submitted to the VCS, the links they contain need to be parsed out and added to the appropriate link lists.
  • Change Reflection: After links have been adjusted in a file, the changes need to be reflected in the Version Control System (VCS). The command used to submit a change depends on which VCS is in use.
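For the list-management step, a sketch of parsing links out of a file might look like the following. It uses Python's standard `html.parser` and assumes that links live in `href`, `src`, or `conref` attributes. That attribute set is a guess and would need adjusting for the formats actually in use.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumed set of link-bearing attributes; extend as the site's formats require.
LINK_ATTRIBUTES = {"href", "src", "conref"}


class LinkCollector(HTMLParser):
    """Collects the values of link-bearing attributes, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in LINK_ATTRIBUTES and value:
                self.links.append(value)


def extract_local_links(markup):
    """Return local link targets found in `markup`, with fragments removed."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    local = []
    for link in collector.links:
        parts = urlsplit(link)
        if parts.scheme or parts.netloc or not parts.path:
            continue  # external URL or same-page anchor: not managed here
        local.append(parts.path)
    return local
```

The returned paths are still relative to the file they came from, so they would need to be normalized before being stored in the link lists.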

Ideally, the system would identify the VCS dynamically and use the appropriate commands to find out which files have been modified by the user, and to submit automatically generated changes. Alternatively, the type of VCS could be declared in a configuration file for the site map. At worst, the commands themselves could be configured in an installation step, but the tool would then need to be reinstalled to work with a different VCS.
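One way the dynamic identification could work, sketched under the assumption that the working copy carries a marker directory (`.git`, `.hg`, or `.svn`) at or above the site directory. The command table covers only those three systems and is illustrative rather than complete.

```python
import os

# Marker directory -> commands, each an argument list suitable for subprocess.run.
# The commit command expects the message to be appended as the final argument.
VCS_COMMANDS = {
    ".git": {"name": "git",
             "status": ["git", "status", "--porcelain"],
             "commit": ["git", "commit", "-a", "-m"]},
    ".hg": {"name": "hg",
            "status": ["hg", "status"],
            "commit": ["hg", "commit", "-m"]},
    ".svn": {"name": "svn",
             "status": ["svn", "status"],
             "commit": ["svn", "commit", "-m"]},
}


def detect_vcs(start_dir):
    """Walk upward from `start_dir` until a known VCS marker is found."""
    current = os.path.abspath(start_dir)
    while True:
        for marker, commands in VCS_COMMANDS.items():
            if os.path.exists(os.path.join(current, marker)):
                return commands
        parent = os.path.dirname(current)
        if parent == current:
            return None  # reached the file-system root without finding a marker
        current = parent
```

If detection returns nothing, the tool could fall back to the configuration-file declaration described above.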

Problems to Solve

The Collision Problem

The downside of a distributed-renaming strategy is that it greatly increases the possibility of collisions, where you check out one document and I another, and we each rename our respective documents before checking them in. If each document links to the other, each rename requires a change to a document that the other person holds. In that deadly embrace, you could not check in your document, nor I mine.

There are undoubtedly ways to solve that problem, however.

The first step in any approach would probably be to ensure that renames and moves never occur at the same time as content edits. A simple solution might then block name and location changes if any of the dependent documents are checked out by others.
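The simple blocking rule could be sketched as follows, assuming a lock-style checkout model in which the system knows who holds each file, and taking "dependent documents" to mean the files that link to the one being renamed. The function and parameter names are hypothetical.

```python
def can_rename(path, incoming, checked_out, user):
    """Decide whether `user` may rename or move `path`.

    incoming:    dict mapping a file to the set of files that link to it
    checked_out: dict mapping a file to the user who has it checked out
    Returns (allowed, blockers), where blockers lists the dependent files
    currently held by someone else.
    """
    blockers = sorted(
        dependent for dependent in incoming.get(path, ())
        if checked_out.get(dependent, user) != user
    )
    return (not blockers, blockers)


incoming = {"a.html": {"b.html", "c.html"}}
print(can_rename("a.html", incoming, {"b.html": "you"}, "me"))
```

Returning the blockers along with the verdict lets the tool tell the user which checkouts are standing in the way.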

A more sophisticated approach might be to queue up the renames and moves and make the required changes as items are checked in. The queue would be adjusted continually: items are removed once they have been fixed after content is checked in, and the queue is modified when additional moves and renames are checked in. The proviso is that some move and rename operations may not be permitted if they conflict with queued, pending operations.
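A sketch of such a queue, under a deliberately conservative conflict rule chosen for illustration: a new rename is rejected if either of its paths already appears in a pending rename. The class name and the rule are assumptions, not a worked-out design.

```python
class RenameQueue:
    """Pending renames whose link fixes wait on files held by other users."""

    def __init__(self):
        self.pending = {}  # old path -> (new path, set of files still to fix)

    def enqueue(self, old, new, files_to_fix):
        """Queue a rename; return False if it conflicts with a pending one."""
        for queued_old, (queued_new, _) in self.pending.items():
            if old in (queued_old, queued_new) or new in (queued_old, queued_new):
                return False
        if files_to_fix:  # nothing to wait for means nothing to queue
            self.pending[old] = (new, set(files_to_fix))
        return True

    def on_checkin(self, path):
        """Return the (old, new) renames to apply to `path` at check-in."""
        fixes = []
        for old in sorted(self.pending):
            new, waiting = self.pending[old]
            if path in waiting:
                waiting.discard(path)
                fixes.append((old, new))
                if not waiting:
                    del self.pending[old]  # every dependent file is fixed
        return fixes
```

A less restrictive rule could allow chained renames by composing them in the queue, at the cost of more bookkeeping.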

Link Normalization

One of the trickier aspects of the code is normalizing relative file paths to absolute form, making the change, and then replacing the original link with the new relative path. For example, ../../x becomes root/abc/x, which is modified to root/abc/y and then restored to ../../y.
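The round trip can be sketched with Python's standard `posixpath` module. This is an illustration only, not the LinkCheck algorithm, and it assumes site paths use forward slashes and are expressed relative to the site root. The function names are hypothetical.

```python
import posixpath


def resolve(source_file, link):
    """Normalize a relative link found in `source_file` to a site-root path."""
    return posixpath.normpath(posixpath.join(posixpath.dirname(source_file), link))


def relativize(source_file, target):
    """Express site-root path `target` relative to the directory of `source_file`."""
    # A leading "/" anchors both paths so the result ignores the working directory.
    start = "/" + posixpath.dirname(source_file)
    return posixpath.relpath("/" + target, start)


def rewrite_link(source_file, link, old_target, new_target):
    """Return new link text if `link` points at `old_target`, else `link` unchanged."""
    if resolve(source_file, link) != old_target:
        return link
    return relativize(source_file, new_target)


def relink_after_move(old_source, new_source, link):
    """Recompute a link's text when the file containing it moves."""
    return relativize(new_source, resolve(old_source, link))


page = "root/abc/d/e/page.html"
print(resolve(page, "../../x"))                                   # root/abc/x
print(rewrite_link(page, "../../x", "root/abc/x", "root/abc/y"))  # ../../y
```

Comparing links in normalized form also means that two differently written links to the same file are recognized as the same target.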

I coded such a normalization algorithm for the LinkCheck tool, written in Java. That code should ideally be converted to Ruby, and the normalization code reused here.
