Encoding Source in XML

A Strategic Analysis

By Eric Armstrong
26 Jun '00 -- Added "comment conversions" section under Encoding Issues. Extended the summary.
25 Jun '00 -- Added final note on Comment Styles section. Cleaned up the text throughout.
24 Jun '00 -- Created


Storing source code in XML provides many benefits that cannot be achieved any other way, including hierarchical structures, literate programming style, links to explanations, choosable display styles, and the elimination of braces, semicolons, and end-comment marks. The advantages are described, and possible strategies for encoding source in XML are evaluated. A "plain" encoding style using one, or at most two element types is found to be most desirable for editing, despite the value of using multiple elements (one per language component) for compiling and other automated processing. This paper also identifies the problems that must be solved when translating a hierarchical version of source code to and from plain text. It ends with an analysis of the deficiencies in XML which make it impossible to construct an elegant and, more importantly, an easily editable encoding.
(8800 words / 15 pgs)


This document looks at the reasons and methods for encoding program source code in XML. It also describes the major issues that must be solved for the encoding to be successful.

This document contains the following sections:

Motivation for XML-based Encoding

Storing source code using XML data structures has a number of advantages that are difficult to obtain using plain text. Those advantages include:

Hierarchical structures
Today, no one would want to go back to the "flat directory" structures of the past. We are too used to the branching directory trees that are today's norm. Those trees carve up the name space nicely, so we don't have to work so hard at naming things. But more importantly, they organize directory contents, and make it possible to get to files quickly.

I recently found, to my surprise, that I had 1700 directories in my directory tree, and who knows how many files in them. The directory tree just never seemed that big, because I had maybe 10 or 15 directories at the top level, each with several directories under it, and so on. Instead of having to do a linear scan to find files, you "drill down" through the directory tree to find them.

The combination of intellectual organization and speed of access conspires to make even very large structures feel "small" and manageable. Hierarchical structures for source code can work that same magic. Instead of looking at page after page of code, online, you might see a single list of method names. Expanding the method name brings up the code for that method, just as expanding a directory entry shows the files under it.

For example:
           public class MyClass {
              + public String myFirstMethod() {
              + public String mySecondMethod() {
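The collapsed view above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `<node>` element with a `text` attribute is a hypothetical encoding (not a format defined by this paper), and `render` and `max_depth` are names invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding: each line of source is a <node> whose "text"
# attribute holds the line; nested lines are child elements.
SOURCE = """
<node text="public class MyClass">
  <node text="public String myFirstMethod()">
    <node text="return first"/>
  </node>
  <node text="public String mySecondMethod()">
    <node text="return second"/>
  </node>
</node>
"""

def render(node, max_depth, depth=0):
    """Return outline lines, collapsing any node at or below max_depth."""
    collapsed = len(node) > 0 and depth >= max_depth
    marker = "+ " if collapsed else "  "
    lines = ["    " * depth + marker + node.get("text")]
    if not collapsed:
        for child in node:
            lines.extend(render(child, max_depth, depth + 1))
    return lines

root = ET.fromstring(SOURCE.strip())
print("\n".join(render(root, max_depth=1)))
```

Raising `max_depth` expands the outline one level at a time, much as expanding a directory entry shows the files under it.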
Literate programming style
Perhaps the greatest advantage of hierarchical structures in programs is the ability to tuck code away and out of sight, under comments. When viewing a method, then, what you might see is a list of comments, each of which describes one of the steps in the algorithm. Viewing only the comments would provide an outline of the algorithm implemented in that method. In other words, you would see the same picture the author had in his head before he started coding. The result would be the ability to create highly literate programs -- programs that read well, in addition to executing efficiently.

The original "literate programming" espoused by Donald Knuth includes the ability to invoke subroutines using their literate names -- in effect, their comments. While that capability is potentially desirable, it would not result from the simple act of storing source in XML. It may make movement in that direction more feasible, however.

Here is an example:
          public class MyClass {  
              + // Variables          
              + public String myFirstMethod() {
                  + // Access the database
                  + // Calculate the result
                  + // Check for validity
                  + // Return the result
              + public String mySecondMethod() {
Links to Explanations
Outliners have been around since the mid-eighties. HTML linking has been around since the early 90's. Yet source code developers use plain text. Not only does that preclude the advantages that stem from hierarchy, it also precludes the advantages that stem from linking.

Linking in source code documents is particularly useful for providing explanations. When looking at source code, the hardest question to answer is why. Why does the code do this? What is the reason for that odd looking bit of code? Why, why, why? Some estimates put the cost of maintenance at over 5 times the cost of original development. And something like 50-90% of a maintainer's time is spent acquiring enough understanding to make an intelligent change. Links to explanations can go a long way towards shortening that time.

Although some comments can be put in the code, others are simply too long. Design documents, for example, contain lengthy explanations of algorithms and the reasons for adopting them. Links can point to those documents. Or, if another author's code was used as the basis for a class or method, a link can serve as both an attribution and a pointer to the "seed" from which the current project developed.

Finally, bug fixes have a way of producing multiple, odd insertions in the code. You insert a variable here, set its value over there, and test it somewhere else. Then you do something in response to that test. In contrast to the simple explanations that sufficed for the general outline of the algorithm, it may take a paragraph or a page of text to explain the nature of the bug, the complex conditions that cause it to come to life, and how the patch solves the problem. If lengthy explanations are included in the code, the code eventually becomes unreadable -- you can't find it for all the comments. If the explanation isn't included, the result is "mystery code" -- why is that there? More importantly, if the patch touches several different locations, the explanation would need to be repeated multiple times.

The ability to link to external explanations solves all of those problems. In addition, the ability of an explanation to link back to multiple code points delineates a logical "thread" that runs through the code. Those threads are often orthogonal (at right angles) to the class structure and the code's normal flow of control.
Choosable Display Styles
Having source code in XML would make it possible to edit the source code in the format you like best. Number of spaces for indentation, placement of newlines, and other features would be controllable -- given good, style-controllable editors. Today, editors like XMetal give you control over the formatting of elements using HTML-style capabilities. Although such features are not totally ideal for source code editing, an XML encoding makes the development of such editors possible. When they are available, coming up to speed on a new program would no longer be impeded by the need to get used to someone else's coding conventions.
Elimination of Braces, Semicolons, and End-Comment Marks
Given that a hierarchical structure fully describes the nesting of program statements, and the fact that the nesting is depicted by indentation in the editor or browser, braces become superfluous. Eliminating them also has the desirable consequence of preventing those hard-to-find errors that say, in effect, "somewhere in your 2,000 line file, you are missing an ending brace". (Python made a good start at that process by using indentation instead of braces to indicate nesting. But in Python, you have to get the spacing just right, because you are still using a plain text editor. One developer had his emacs editor set up in such a way that tabs were invisible, which led to a plethora of mysterious compilation errors! For Python especially, then, encoding source language in XML would seem to make a great deal of sense.) Similarly, since XML elements are well-delimited, the need for semicolons and even end-comment marks would disappear.
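As a rough sketch of why the punctuation becomes redundant, the following Python fragment regenerates braces and semicolons purely from the nesting. It assumes a hypothetical `<node text="...">` encoding and a deliberately simplified rule set: a node with children is a block, a leaf is a statement, and code stored under a "//" comment is emitted at the comment's own depth. A real translator would need to know much more about the language.

```python
import xml.etree.ElementTree as ET

def to_java(node, depth=0):
    """Emit plain-text lines, deriving braces and semicolons from nesting."""
    pad = "    " * depth
    text = node.get("text")
    if text.startswith("//"):
        # Code tucked under a comment is not indented in the plain text.
        lines = [pad + text]
        for child in node:
            lines.extend(to_java(child, depth))
        return lines
    if len(node) > 0:
        # A node with children is a block: open and close it with braces.
        lines = [pad + text + " {"]
        for child in node:
            lines.extend(to_java(child, depth + 1))
        lines.append(pad + "}")
        return lines
    # A leaf is a simple statement.
    return [pad + text + ";"]

doc = ET.fromstring(
    '<node text="public String myFirstMethod()">'
    '<node text="// Return the result">'
    '<node text="return result"/>'
    '</node>'
    '</node>'
)
print("\n".join(to_java(doc)))
```

Because every closing brace is generated from the tree, a "missing ending brace" cannot occur in the output of such a translator.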

For those reasons, it makes sense to consider an XML encoding. The question is, what format to use for the XML data structures?

Overview of Encoding Strategies

The goal is to encode source language documents in XML in such a way that they are:

  1. Editable using standard XML editors
  2. (Eventually) directly compilable and subject to source control
  3. (For now) translatable to/from plain text, so the source can be compiled and exchanged with other developers, typically through a source control repository.

There are many possible strategies to use for the encoding. Several of them were discussed on the http://extende.sourceforge.net developer's list. [Note: In the analysis that follows, I have tried to acknowledge the contributors to the discussion. If I have overlooked anyone, or misstated anyone's position, please send a note to the developer's list, and I will correct it forthwith!]

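Those strategies include:

  1. Using a "graphical" approach
  2. Creating an external structure of pointers
  3. Defining a generic language
  4. Defining a language-specific vocabulary
  5. Using general purpose ("plain") nodes, as in an outliner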

For editing and acceptance by developers, I suspect that it is the plain node (outline) style that will work best. However, there is already an excellent definition for a more language-centric format at http://sds.sourceforge.net. Their code structure format (CSF) will be useful for many automated tools. It may be that developers will be won over to this format, and will gravitate towards it, once really good XML editors make their way to the market. However, as will be argued below, even with the best of editors, the task of editing will be rendered more difficult with such an encoding strategy. Since use of XML for source code depends in large part upon its acceptance by developers, I suspect that a more "natural" approach will have the highest rate of acceptance.

The remainder of this paper discusses the short term vs. the long term outlook for encoding source in XML, after which it provides an analysis of the various encoding options. It then lists the salient issues that have to be taken into account when storing and processing source code in XML structures, and ends with a look at the shortcomings of XML for that purpose. (With those shortcomings rectified, the XML-encoding option begins to approach elegance. But those solutions are not anywhere on the horizon.)

Short Term vs. Long Term Outlook

The short term presents some interesting complexities that ideally will not exist in the long term. In the short term, we are faced with the fact that source code developers have been using plain text editors and a "plain text" encoding of source code for the last 40 years or more, ever since computer languages were first invented. Developers are, therefore, used to that process. But aside from the normal human resistance to trying something new, developers are understandably wary of storing source code in new forms.

Suppose, for example, that you tried some early development environment that used databases and binary formats to store source code. You might have found that other tools you took for granted (like "grep", a search tool you used to find every file that contained a particular variable) no longer worked. Although it was part of your working style when source was in flat files, you suddenly discovered that you could no longer use that tool. You would have felt like a carpenter whose hammer was missing. Powerless.

Or you might have entrusted your source code to that system, and then found you could not easily share it with others. That prevented them from building on your work. Or, worst of all, you might have found that a power failure corrupted the database and wiped out all your code -- not just the file you were working on at the time. (That example is not too far-fetched. I was using a production email program for over a year, when the database got corrupted. It turned out that the manufacturer had not thought to include any recovery or analysis tools. So every email in that system was lost.)

So, developers have been historically averse to trying new formats, with considerable justification. However, there is a very real possibility that XML will become the ubiquitous data format of the next century. If that occurs, developers will be using XML editors to write documents, send email, and do most other tasks they perform in their working day. If that happens, the resistance to storing source in XML will likely diminish. (See Design Notes for an XML Editor.)

After all, it must have felt very strange to the first programmer who typed a program into plain text. It must have seemed much less safe than plugging wires into a board, or punching holes in cards. One of those programs was real. It was solid. You could hold it in your hand, and know it was safe. Putting it into a computer file would have taken a whole lot of trust. Over time, though, experience with plain text editing proved reliable, and the advantages over punched cards were enormous, so the new encoding medium took root. I expect a similar acceptance curve for source code in XML.

Analysis of Encoding Options

This section considers and discusses each of the encoding options.

Using a "graphical" approach

This approach was proposed by Howard Golden on the eXtenDE developer's list. In reality, of course, this is more of a UI (user interface) issue than an encoding issue. Any encoding scheme could be treated graphically, so that the system had more of a UML flavor than a source-language flavor. The possibility of using graphics is an interesting one, though, that deserves to be addressed. This seemed like a reasonable place to do so.

Despite the value of using UML diagrams for understanding and communicating the broad strokes of a system design, there are a few factors that seem to imply its unsuitability for large projects. Among those factors are:

Graphic Complexity
Graphics work well for visualizing small systems, or a very high-level view of a large system. All of the demos for graphical systems do one or the other. But when systems get large and complex, full graphical treatments break down very quickly. What works best, of course, is the combination of graphics and text -- diagrams of specific subsections, attached to text (in the case of a diagram) or source code (in the case of a program). XML encoding of source, and the potential for linking and image-inclusion that results, therefore provides the most likely prospect for dealing with complex systems effectively.
History of Hardware Development
Hardware development started out as a graphic process. For decades, graphic tools were improved and bettered, so that designers could image their designs and have them translated into silicon. But in the last decade, the trend has been away from graphic designs, and toward languages. The reasons include the inability to visualize 7-layer boards with 3-dimensional interconnections, as well as the ability to easily reuse various routines stored in source-language form.
Lack of Hierarchical Graphics Tools
While multi-dimensional interconnections appear unavoidable in hardware design, the goal of object-oriented development is the production of more modular systems. In principle, then, "visualization difficulty" need not be a limiting factor. In practice, though, there is a definite lack of good, hierarchical graphic tools at the software designer's disposal.

Typically, when we think of a graphic hierarchy, we think of a tree of graphic objects, with lower level objects connected to parent objects by lines. But in the context of a design, a graphic hierarchy requires nesting. At a high level, you might see 3 or 4 major components connected to each other. "Drilling down" into one of them might then show the subsystems comprising that component.

However, mere "drilling down" is not enough. The outliner equivalent of that is like a directory tree where all you see is the topmost level of directories, and when you "drill down", that view is replaced by the directories it contains. Although the display system is "hierarchical", the loss of surrounding context at each view places too many demands on the viewer, who must keep the interconnections in his or her head in order to relate the current view to other views.

What is needed in such systems is the ability to view a diagram at multiple levels. Just as you can expand or collapse an outline to see one level deep or multiple levels deep (as in a directory tree), you need the ability to display multiple levels of a graphic hierarchy. At the topmost level, then, you might see 3 or 4 major components with "thick pipes" between them. But when you expand that view, you would see the objects inside those components, as well. The interconnections would then show the communication paths between those objects. Those smaller communication paths would also be contained in the larger "thick pipes", so that they were organized and labeled at the higher level as well as at the lower level.

For example, imagine a thick pipe labeled "input". That pipe might contain a path from a text object that goes to a text processor, and one from a scrollbar that goes to a percentage-value processor. The major components in this case might be UI, Processing, and Document, with graphical widgets in the UI component, and various processors in the Processing component, and a database system or file system in the Document component.

To be fair, it must also be recognized that there exist UML tools like Together/J that do "round-trip" engineering -- from UML diagrams to source code, and from source code edits back to UML diagrams. There is also a rabid cadre of engineers that use those tools. However, even though the UML tools show the structure of the system, they do not encode all of its details. Given the present state of the art, it would be simply too difficult to encode every line of every method graphically. And if you did, you probably wouldn't be able to live with the result.

If good, hierarchical graphic tools existed, treating programs graphically might be conceivable. However, it is also likely that the complexity of real life programs would make them too difficult to follow in a graphic layout. Still, there is an important potential for graphics. If the design patterns could be selected from a graphics palette, and then instantiated in one's code, it would simplify development tremendously. And if the code linked back to the explanation of those patterns, it would be all the more understandable.

Creating an external structure of pointers

Another alternative recommended by Dennis Hamilton on the eXtenDE developers list was that of leaving the source code intact and making an XML structure that consists of pointers to the source code. That approach lets developers continue to use their plain text editors, while providing some of the benefits of hierarchical structuring and linking.

That approach is being used, in fact, by current tools that convert source code to HTML documents. After translation, program elements become links. So, when a method is invoked, the method name links to that method. The link can then be traversed to see the code or comments for that method -- in particular, to see the required parameters, along with their definitions and datatypes. Similarly, a variable can link to the place where it was defined, along with the comments that explain what it is for.

That such tools are viewed as highly useful, I think, points to the paucity of the plain text systems that developers are currently using. Those tools are valuable, just because they provide the linking capability that is so egregiously missing from plain text. But by the same token, they do not provide the benefits of hierarchical structuring.

Using external pointers in an XML file would provide the same benefits as translating to HTML, with the addition of hierarchical viewing capabilities. That mechanism would, as a result, represent an incremental advantage over current systems. With a really good XML editor (such editors are becoming increasingly available), the developer would be able to collapse and expand sections, and so browse a more "literate" version of the code. However, that proposal causes serious difficulties with respect to editing.

The first, most obvious disadvantage is that existing XML editors would be useless. Changing the XML would have no effect on the underlying source, so making changes in a normal XML editor would be pointless. That means a custom editor would be required. However, that editor would be doubly complicated. It would not only have to make changes in the XML structures, it would have to replicate those changes in the text version of the document. So, while such a system would improve one's ability to view source code, it would not constitute the "next generation" integrated editor/browser advance that will make it possible to develop code more efficiently.
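To make the pointer approach concrete, here is a minimal Python sketch. The `<ptr>` element and its `label`, `start`, and `end` attributes are hypothetical, invented for this illustration; the proposal discussed on the list did not specify a format.

```python
import xml.etree.ElementTree as ET

# The untouched plain-text source, one list entry per line.
SOURCE_LINES = [
    "public class MyClass {",
    "    public String myFirstMethod() {",
    '        return "first";',
    "    }",
    "}",
]

# Hypothetical pointer file: each <ptr> names a 1-based, inclusive line range.
POINTERS = """
<ptr label="MyClass" start="1" end="5">
  <ptr label="myFirstMethod" start="2" end="4"/>
</ptr>
"""

def resolve(ptr, lines):
    """Return the source lines that a pointer element refers to."""
    start, end = int(ptr.get("start")), int(ptr.get("end"))
    return lines[start - 1:end]

root = ET.fromstring(POINTERS.strip())
method = root.find("ptr")
print("\n".join(resolve(method, SOURCE_LINES)))
```

The sketch also hints at the editing problem: the XML holds only positions, so inserting a single line at the top of the source file silently shifts every range unless some tool rewrites the pointers as well.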

Defining a generic language

While defining a generic, "uberlanguage" that could be translated into Lisp, Smalltalk, Python, or Java is clearly not feasible, if not computationally insolvable, the folks at the Software Development Foundation (sds.sourceforge.net) have come up with an interesting approach. They have defined a generic DTD (Document Type Definition) for a family of similar procedural languages, including Java, C, and Python.

Around that definition, they are building and/or planning a whole suite of development tools, including compilers, debuggers, syntax checking tools, pretty printers, and documentation generators. However, as valuable as that tool is for automated processing, I suspect that it poses some problems, as well -- mostly with respect to editing.

One problem with that standard (for editing, not for any other purpose) is that it appears to throw away the extra spaces and newlines that add to readability. "Pretty printers" can make the newlines appear in some consistent manner, and it would be possible for a "pretty printing" (style-controllable) editor to do so, as well. However, spaces that were added in order to make variable names and comments on them line up, for example, would disappear.

The desire to dictate style, for example with respect to spacing, lies in direct contrast to the desire to make the style viewer-controlled, as for example with indentation and line breaks. There is a tension between these two requirements that must be taken into account in the final design of the system. Possibilities for resolving the tension include separating the two concerns (line breaks and indentation controlled by user, extra spaces controlled by author), or creating even more intelligent display options. For example: "line up variables and comments on adjacent lines, when doing so will keep the results on a single-displayable line" and "when wrapping an assignment statement onto multiple lines, indent successive lines so that they start to the right of the assignment symbol (an equals sign, in Java)".
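The first of those display rules, lining up comments on adjacent lines, could be sketched roughly as follows. This is an illustrative Python fragment, not part of any existing editor; `align_comments` is a made-up name, and the naive split on "//" would be fooled by a "//" inside a string literal.

```python
def align_comments(lines):
    """Pad code so trailing // comments on adjacent lines share one column."""
    split = [line.partition("//") for line in lines]
    # Width of the longest code portion among the lines that carry a comment.
    width = max((len(code.rstrip()) for code, sep, _ in split if sep), default=0)
    out = []
    for code, sep, comment in split:
        if sep:
            out.append(code.rstrip().ljust(width) + "  //" + comment)
        else:
            out.append(code.rstrip())
    return out

aligned = align_comments([
    "int x; // width",
    "String name; // label",
])
print("\n".join(aligned))
```

Since the alignment is computed at display time, the stored document would not need to carry the author's extra spaces at all, which is one way of separating the two concerns.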

But if we assume that display problems are solvable, or at least livable, the editing problems still remain. The major problems stem from the need to continually specify the element type when adding statements to the program. You could select them from a palette, but continually moving the cursor to get them is going to be a drag. Or you could right click and select from a list, but that is still a lot of cursor movements for every single statement in a program. Alternatively, you might have control-key combinations to select elements. But that makes a lot of control-key combinations to memorize. Besides, isn't it easier to type "if" than hit "ctrl+I"?

One interesting solution to this problem is for newly added lines to always default to some generic element, say <node>. That element might then be changed by the editor depending on what the user types. Blank lines, comments, and language elements would be recognized, and a mis-typed language element could be identified immediately. However, here again we are talking about the need for a language-specific editor that parses and understands the text the user types. Generic XML editors would be of no use.
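A minimal sketch of that retagging step, in Python, might look like the following. The element names ("blank", "comment", and one per keyword) and the keyword list are assumptions made for the example; they are not taken from CSF or any other DTD.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical subset of keywords that have their own element type.
KEYWORDS = {"if", "else", "for", "while", "return", "try", "catch"}

def classify(text):
    """Pick an element name for a line the user has just typed."""
    stripped = text.strip()
    if not stripped:
        return "blank"
    if stripped.startswith("//") or stripped.startswith("/*"):
        return "comment"
    first_word = re.match(r"[A-Za-z_]\w*", stripped)
    if first_word and first_word.group(0) in KEYWORDS:
        return first_word.group(0)
    return "node"  # stay generic when nothing is recognized

def retag(element):
    """Change a generic element's tag to match its typed text."""
    element.tag = classify(element.get("text", ""))
    return element

line = ET.Element("node", {"text": "if (count > 0)"})
print(retag(line).tag)
```

Note that this sketch recognizes only the first word. Catching a mis-typed keyword such as "esle" would take a real parser for the language, which is exactly the language-specific machinery that a generic XML editor lacks.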

In addition, DTD-directed editors may disallow intermediate invalid states. That makes it more difficult to move things around and insert things in the order you think of them, as opposed to the order the program needs them in. Many a syntax-editor has fallen into disrepute because it did not allow the kinds of invalid states you typically move through when editing a program. You want to find out about them before you finish, but you don't want to break your train of thought to accommodate the editor during the writing process. (A DTD-directed editor that did its syntax checks at the end might solve those problems, but it's not clear how many do, or will, operate in that manner.)

A more serious problem with using a generic specification for editing is that it may allow one to express statements that either cannot be translated into the current language, or cannot be done so efficiently. Significantly, even the CSF format at sds.sourceforge.net expects to receive source code input in plain text files. It does not appear to be intended as an editing format. So, even though an existing Java program or Python program can be nicely expressed in that format, when you turn it around and go the other way, you may run into problems.

When you go from plain source to CSF, the plain source is already a legal program. So, if the CSF format is a union of Java and Python constructs, the result of translating a Java program into CSF would only contain Java constructs. It would therefore translate back nicely. But if you edited a program using that DTD, you might add Python-based constructs to the program. Those constructs might not translate at all (although one hopes that CSF's developers have made sure that they do), or else they may represent a construct which is easily expressed in Python, but which does not map into Java code nicely. The result could be a program that performs inefficiently, or which is much harder to read in plain text form by someone accustomed to Java idioms.

In summary, CSF appears highly beneficial for automated processing. But the jury is still out on the mechanics of editing. Even if the structural problems can be solved, there is still the matter of usability and programmer acceptance. Over time, it is possible that all of the problems will be solved. But for the next five or six years, I suspect that a "plain encoding" that looks more like a standard outliner will have greater appeal when it comes to editing.

Defining a language-specific vocabulary

Rather than defining a generic language, one might choose elements that have a one-for-one correspondence with structures in a chosen language. For example, the <if> tag would encode Java's if statement, the <catch> tag would encode an exception-handling block, etc. This approach would eliminate the potential for defining programs that are either impossible to translate, or impossible to translate efficiently. However, it would suffer from all of the other problems attendant upon a language-based encoding.

Even if this approach were desirable, however, the existence of the Code Structure Format makes it moot. The few benefits that would be derived from a single-language encoding pale beside the benefits to be derived from using the existing standard. Using CSF makes a lot more sense. In addition to saving the time and work necessary to define the vocabulary you need, using CSF makes it possible to utilize any editors or other tools that are built around that standard.

Using general purpose ("plain") nodes, as in an outliner

The alternative to using a language-specific or even a generic-language encoding is to use one that is language-neutral. If the document contains only <node> elements, for example, the DTD becomes the picture of simplicity -- at least conceptually. Although it will become more complex as the issues discussed in the next section are addressed, it will still be many times simpler than a language-oriented DTD.

In effect, such an encoding uses XML to replicate the outliner utilities that enjoyed a brief spurt of popularity in the mid-eighties. But XML adds the capability for links and attributes that the structured encoding needs to interact well with utilities that are driven by plain text. For example, compilers and programs currently produce errors that give line numbers. Eventually, it would be nice to see them converted so they provide XML pointers that could be clicked to go directly to the source. But in the meantime, it will be necessary for the editors to provide "go to line" functions that can be used in place of links. Those line numbers will need to be stored as attributes in the XML structure, or else calculated on the fly in a way that accounts for multiple-line wrapping when the plain text version is generated from the XML.
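The "calculated on the fly" option can be sketched as follows: the same walk that produces the plain text also records which element produced each line, so a "go to line" function is just a lookup. This Python fragment assumes a hypothetical `<node text="...">` encoding and one output line per node; it does not handle the multiple-line wrapping mentioned above.

```python
import xml.etree.ElementTree as ET

def emit(node, out, depth=0):
    """Append (line_text, source_node) pairs in plain-text order."""
    pad = "    " * depth
    if len(node) > 0:
        out.append((pad + node.get("text") + " {", node))
        for child in node:
            emit(child, out, depth + 1)
        out.append((pad + "}", node))  # a closing brace maps back to its block
    else:
        out.append((pad + node.get("text") + ";", node))
    return out

def node_for_line(root, line_number):
    """Return the element that generated a 1-based plain-text line number."""
    return emit(root, [])[line_number - 1][1]

root = ET.fromstring(
    '<node text="public class MyClass">'
    '<node text="public String myFirstMethod()">'
    '<node text="String a = load()"/>'
    '<node text="return a"/>'
    '</node>'
    '</node>'
)
# A compiler error reported at line 4 of the generated text:
print(node_for_line(root, 4).get("text"))
```

Storing the numbers as attributes instead would avoid regenerating the text for each lookup, at the cost of renumbering after every edit.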

Using such a "plain" encoding makes it possible to use standard XML editors on the source. That makes it possible for others to read the code (and add comments, for example), without requiring a custom editor to do it. (An editor that understands line numbers will still be needed to translate the line numbers on compilation and runtime error messages, but that is a fairly trivial hack.)

Such an encoding will also feel the most comfortable to current-day hackers. The editor will already be introducing new hierarchical display and manipulation capabilities that will take some getting used to. Plus, syntactic elements like braces and semicolons will have disappeared. At least the programmer will still be able to type "if" and "else" to enter statements!

So a plain encoding seems to be the most desirable. For Python, it really seems like the way to go. For Java, though, one more issue remains: Is a special element type needed for Javadoc comments? (Javadoc comments start with /** instead of /*. They are processed by the Javadoc program to generate API documentation.) That question will be taken up at the end of the next section, which covers encoding issues.

Discussion of Major Encoding Issues

These are the major issues that must be taken into account when encoding a source language in XML.

Line Numbers
Current compilers do not operate on XML, so the program must be translated to plain text for compilation. Compilers generate line numbers for the error messages they generate, and encode those numbers in the program for runtime error messages. In the short term then, it must be possible to use those line numbers to get to the statements they represent. A goto line number function will be needed in the editor. The editor must then either calculate those numbers on the fly or store them as attributes in the XML structure.
Using XML will make it possible to define links to other documents, and include sections of those documents inline. A mechanism is needed that includes the copied information when converting to plain text, but which reconstructs the original links when converting back. (This, like line numbering, is an issue that hopefully goes away over time, but which must be taken into account at the moment.)

Another issue which needs to be addressed is the production of links for variables, methods, and class names. One of the advantages of HTML-versions of source code is the ability to click a link to go to the place where such elements are defined. It would be interesting to include that functionality in an XML encoding. However, since links to classes and methods can go site-wide, it would require an intelligent, link-caching translation system. That doesn't make sense for a "one at a time" cycle of converting a plain text file to XML, editing, and converting it back to plain text for storage. But it might make sense if the XML version was kept around. That, in turn, would require the system to deal with the source control repository, in order to update the XML version with changes (an idea originally suggested by Lee Iversion at the collaborative system design meetings held at SRI in early 2000). That way, only file differences would need to be processed to generate links. Such functionality probably makes a lot of sense in version 2 or 3 of the system. It's probably best avoided in version 1, however.
Comment-Conversion Issues
Converting comments to and from plain text poses a real challenge. The task of implementing an ideal solution may prove insurmountable, in fact. The issues are:
(1) In the hierarchy, we want code to fall "under" major comments. But the indentation that implies should not be reflected in the plain text version. On input, then, the parser must construct the hierarchy intelligently.
(2) In plain text, multiple "//" lines in a row may represent either a word-wrapped paragraph or independent lines -- for example, a list of steps that outline an algorithm. On input, line breaks need to be removed from the word-wrapped paragraph, yet preserved for multiple, independent comment lines (or converted to multiple comments, which is less desirable, but equally difficult). [Note: On output, the right "re-encoding" may need to be specified. But that clutters up the plain text version and provides no help for initial input.]
(3) In the hierarchy, "/*" can comment out an entire block of code. That block can contain other "/*" comments, as well, because the extent of the block is well defined. The hierarchy does not need to worry about "*/" from an inner comment prematurely terminating the outer one. Once again, the issues arise when translating to and from plain text:
(a) In the original plain text file, code may be commented out with either a series of "//" marks, or with "/*...*/" marks. It would be nice if the parser could produce the nested code, but that may be impossible in practice. (For example, how do you distinguish code that is part of the method from a code fragment that demonstrates how to use a method? If that distinction is needed, the parser could determine whether the code is inside or outside of a method, which probably solves most of the problems -- but not all.)
(b) When converting back to plain text, should the translator use "//" or "/*" comments? The "//" comments work all the time, but make it harder for a plain-text author to uncomment the code. (They may have originally used "/*", if the code block contained no "/*" comments.) If converted using "/*" comments, the translator has to distinguish between code under a comment -- which should be output "as is" -- and indented commentary, which should be output with a leading space, an asterisk, and another space (" * ") so that the comment block stands out in the way that it should.
(c) In the hierarchy, it is easy to use the outlining features to add multiple comments, rather than putting them all in the same node with line breaks. That leads directly to issue (b). When converting back from plain text, though, the program should probably identify cases where it needs to recreate a comment hierarchy, and cases where it needs to create a single CDATA node, in order to preserve the original hierarchy after translation to and back from plain text.
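The choice raised in issue (3b) can be reduced to a simple rule of thumb, sketched here in Python. The function name comment_out is invented for illustration, and the rule itself (prefer "/*...*/" unless the block already contains a "*/" that would end the outer comment early) is one plausible policy, not the only one.

```python
def comment_out(code_lines):
    """Comment out a block of code lines for plain-text output.

    Uses /* ... */ when safe, so a plain-text author can uncomment
    the block by deleting two markers; falls back to // on every
    line when the block already contains a "*/" that would end the
    outer comment early.
    """
    if any("*/" in line for line in code_lines):
        return ["// " + line for line in code_lines]
    return ["/*"] + list(code_lines) + ["*/"]
```

The sketch deliberately ignores the harder part of (3b), which is telling code under a comment apart from indented commentary that should get the " * " prefix.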
If / Else statements
One of the advantages of hierarchies is that when you drag an element, everything under it goes along for the ride. However, if and else statements usually occur in "parallel" in a program. The else is typically at the same level as the if, in other words, rather than indented under it -- especially in chains of else-if statements. That means it is possible to move the if and leave the else behind.

Theoretically, that seems like not such a good idea. In practice, however, it is usually desirable. You might want to reorder a series of else-if statements, for example. Or you might want to invert the logic of the conditional test, so you would drag the if below the else, then convert the if to an else, change the else to an if, and change the conditional expression to its boolean opposite. (Note that in a syntax-directed editor, you have to go through a number of intermediate "invalid" states -- first with the if below the else, then with two else's in a row -- before you arrive at the final valid state. Those are the kind of operations that turn syntax-controlled editors into a headache to use.)
Special characters
Special characters like the angle brackets (<, >) and ampersands (&) in conditional expressions have to be converted into legal XML. Otherwise they foul up the processor. One option is to convert them to their predefined entity equivalents: &lt;, &gt;, and &amp;. That makes the resulting code a bit more difficult to edit by hand, however, and hand editing may be desirable on occasion. Another option is to use CDATA sections, which solve other problems as well.
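To make the entity option concrete, here is a small Python sketch using the standard library's xml.sax.saxutils.escape. The sample statement and the bare <node> wrapper are assumptions made for the illustration.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# A conditional containing all three troublesome characters.
statement = "if (a < b && b > 0) {"

# escape() replaces &, < and > with their predefined entities.
encoded = escape(statement)

# Any XML parser turns the entities back into the original characters.
xml_text = "<node>" + encoded + "</node>"
decoded = ET.fromstring(xml_text).text
```

The encoded form is what a person editing the raw XML would have to read and type, which is the drawback noted above.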
Preserving line breaks
I have a particular format I like for complex conditional expressions. I like to put the if statement on one line, and put the "&&" (and) or "||" (or) symbols at the start of the next line. Unless a style-directed editor is capable of parsing the contents of the file and formatting the conditionals the way I like them (unlikely), the system will need to preserve line breaks. One way to do that in XML is to use the XHTML element, <br/>. However, that requires the editor to understand what to do when it sees that element. An alternative is to use a CDATA section, which every XML parser is required to understand.
Preserving spacing
Then there is the issue of preserving spacing. When you continue an assignment statement on multiple lines or line up variables and comments for readability, those spaces are going to go away unless either (a) you have an awfully smart editor that parses the code and can be told what to do, or (b) the XML encoding uses CDATA sections.

At this point, CDATA sections look like the odds-on candidate. They solve at least three special problems, and require no special understanding on the part of the editor. They do fly in the face of "browser-controlled formatting", to some extent, but the kinds of spacing they provide are arguably beneficial, and not the kind of formatting that programmers typically have stylistic fights over.

Granted, CDATA sections are ugly. In XML, they are encoded as <![CDATA[...]]>. The result of using CDATA sections on every node in the program in order to accommodate spacing, line breaks, and special characters will be an XML program that is hard to look at in a plain text editor, much less edit. This limitation is one of the shortcomings of XML for the present purpose. For more on the subject, see the next section, Shortcomings of XML for Source Encoding.
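A brief Python sketch shows both the benefit and the ugliness. The standard delimiters are <![CDATA[ and ]]>; the helper name wrap_cdata is invented here, and the splitting of an embedded "]]>" is the usual workaround for the one sequence a CDATA section cannot contain.

```python
import xml.etree.ElementTree as ET

def wrap_cdata(text):
    """Wrap text in a CDATA section, splitting any embedded "]]>"."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# Line break, alignment spaces, "<", "&&" and even "]]>" all survive.
code = "if (a < b\n    && x[y[0]]> 1) {"
xml_text = "<node>" + wrap_cdata(code) + "</node>"
restored = ET.fromstring(xml_text).text
```

The parser hands back the statement exactly as typed, but the raw XML now carries two CDATA sections for one line of code.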
Comment styles (//, /*, /**)
XML makes it very easy to handle the comment styles that are commonly found in most procedural languages. Those styles are comment-to-end-of-line (//) and comment-to-end-comment-mark (/*).

One interesting advantage of XML is the fact that multi-line comments no longer need "//" at the beginning of every line. Because the end of the element is well-defined, a single "//" at the start of a node can comment out the whole element, line breaks and all. The editor/browser may still want to display "//" on each line for readability, and the structure would certainly need to be translated to plain text that way, but it would not be necessary to type in the "//" characters. In addition, long comments would wrap automatically. Editing long comments would therefore be simplified, as well.

Since "//" suffices for a multi-line comment in XML, "/*" can be used to comment out entire structures. Where the "//" comment is delimited by the end of the element's content, the "/*" element would be delimited by the end of the element's sublist. That makes it possible to comment out, or put back in a whole block of text by changing a single character! Changing "//" to "/*" takes it out. Changing it back to "//" puts it back in. Very useful.
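The difference between the two markers can be sketched in Python as a translation back to plain text. The function name emit and the sample document are hypothetical, and the sketch drops indentation for brevity: a "//" at the start of a node covers that node's own text, while a "/*" also covers every node under it.

```python
import xml.etree.ElementTree as ET

def emit(node, out, dead=False):
    """Append plain-text lines for a <node>; dead means commented out."""
    text = (node.findtext("content") or "").strip()
    own_comment = text.startswith("//")
    if text.startswith("/*"):
        dead = True  # "/*" disables this node and its whole sublist
    if own_comment or text.startswith("/*"):
        text = text[2:].strip()
    prefix = "// " if (dead or own_comment) else ""
    # One marker in the XML becomes a marker on every plain-text line.
    for line in text.split("\n"):
        out.append(prefix + line.strip())
    for child in node.findall("node"):
        emit(child, out, dead)
    return out

# Hypothetical sample: a two-line comment with one statement under it.
SRC = ("<node><content>// Compute the total\nfor every item</content>"
       "<node><content>total += price;</content></node></node>")

live = emit(ET.fromstring(SRC), [])
disabled = emit(ET.fromstring(SRC.replace("//", "/*", 1)), [])
```

Swapping the single marker in SRC is all it takes to turn the statement under the comment on or off in the generated text.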

Note, though, that the only way to achieve that functionality is to create a distinction between a comment's content (the text of the comment) and its substructure (any elements under it). The inability to define that difference nicely is the second major limitation of XML for our purposes. To accommodate that limitation, every source <node> element must contain a <content> element, as well as any <node> elements that constitute its substructure. The result, in conjunction with CDATA sections, is even more ugly. (It won't look that bad, but it makes good editors much harder to write.)

Even worse, that mapping, which is forced on us by necessity, produces a very bad editing experience. A normal, everyday XML editor is going to display the structure exactly as coded: a <node> element with nothing in it, under which the <content> and structure elements occur. This is not a code-editing environment that developers are going to fall in love with, if only because so much vertical whitespace is wasted with the extra <node> elements.

There are several possible solutions. One, of course, is a custom editor that knows enough to suppress the <node> element and display the <content> text where the <node> element appears, and save any changes in that <content> element. Another, more remote, possibility is that XML editors will understand that this is a genuine limitation of XML, and set up style controls that make the processing automatic. That will allow editing of the source code with generic editing tools. A third (even more remote) possibility is to fix this defect in XML. We'll discuss the subject more in the next section, Shortcomings of XML for Source Encoding. But to anticipate that discussion, we'll see that every XML document runs into the same editing problem. There is, therefore, considerable motivation for a generic, XML-based solution.

The final comment style to consider is that used by Javadoc comments: /**. These comments pose particularly evil problems for XML encoding. (As you may have guessed, the solutions are ugly.) The big problem posed by Javadoc comments is that they can incorporate HTML tags -- and HTML is not nicely structured, like XML.

Now, it would be nice if HTML structures like lists could be displayed using the same hierarchical structures that the XML editor provides for the source code. But HTML tends to make that sort of thing difficult. It's hard to process, because <li> could be terminated by </ul> as well as </li>. And it can be hard to discern the intended hierarchy, since <h1> can be used as a style tag within a paragraph! As a result, attempting to render the HTML as fully hierarchical structures, while desirable, is darn hard to do. You wind up needing a full HTML parser as well as an XML parser, and the HTML parser is much the harder of the two, due to HTML's irregularities.
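The unterminated <li> problem is easy to demonstrate with Python's standard html.parser module. The class name ListItems and the helper list_items are invented for this sketch, which handles only flat lists; nested lists, the case that really matters for a hierarchy, need considerably more bookkeeping.

```python
from html.parser import HTMLParser

class ListItems(HTMLParser):
    """Collect the text of each <li>, tolerating missing </li> tags."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._open = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # A new <li> implicitly ends the previous one.
            self.items.append("")
            self._open = True

    def handle_endtag(self, tag):
        # </li>, </ul> and </ol> can each terminate an open item.
        if tag in ("li", "ul", "ol"):
            self._open = False

    def handle_data(self, data):
        if self._open:
            self.items[-1] += data

def list_items(html):
    parser = ListItems()
    parser.feed(html)
    parser.close()
    return parser.items

items = list_items("<ul><li>first<li>second</li><li>third</ul>")
```

An XML parser would reject that input outright, which is why a separate, more forgiving HTML parser is needed.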

In addition to HTML codes, Javadoc comments include special tags like @param (to define parameter entries) and @return (to define return values). Javadoc also provides the ability to create HTML links using @see and @link tags. Except for @link tags, each of the special tags tends to occur on its own line, at the start of the line. Any attempt to create an XML substructure for Javadoc comments would have to take those tags into account, as well.
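Because the block tags sit at the start of a line, a first pass at separating them from the description is not hard. The Python sketch below is an illustration only: the function name split_javadoc is invented, it treats any line beginning with "@" as a block tag, and it leaves inline {@link ...} tags and HTML untouched inside the text.

```python
import re

BLOCK_TAG = re.compile(r"^@(\w+)\s*(.*)$")

def split_javadoc(comment):
    """Split a Javadoc comment into (description, [(tag, text), ...])."""
    description, tags = [], []
    for raw in comment.strip().splitlines():
        line = raw.strip()
        # Drop the comment delimiters and the leading "*" decoration.
        line = re.sub(r"^/\*\*|\*/$", "", line).strip()
        line = re.sub(r"^\*\s?", "", line)
        if not line:
            continue
        match = BLOCK_TAG.match(line)
        if match:
            tags.append((match.group(1), match.group(2)))
        elif tags:
            # Continuation line of the most recent block tag.
            name, text = tags[-1]
            tags[-1] = (name, (text + " " + line).strip())
        else:
            description.append(line)
    return " ".join(description), tags

SAMPLE = """/**
 * Returns the larger of two values; see {@link Math#max}.
 * @param a the first value
 * @param b the second value
 * @return the larger of a and b
 */"""

description, tags = split_javadoc(SAMPLE)
```

Each (tag, text) pair could then become its own element under the comment, which is where the extra substructure comes from.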

For a first cut, then, it makes sense to encode all Javadoc comments in CDATA sections, as part of the <content> element for the node. That makes them harder to edit, but it significantly enhances the prospect of getting a successful first version out the door. In the long term, several options are possible. Maybe Javadoc comments get replaced with links to other documents. Or maybe the compiler starts enforcing well-formed XHTML, instead of HTML, which allows for full XML processing. Or maybe someone bites the bullet and produces an editor that accurately handles the HTML and special tags in Javadoc comments. (A <doccmt> tag might need to be added to the encoding scheme, for that purpose.) Maybe the editor even converts the @see and @link tags into real links that can be traversed from the XML version.

Personally, though, I'm betting that the existence of really good editing and document-integration environments will cause API comments to be placed more and more in external locations. After all, large blocks of comment text get in the way of seeing the code. They really should be external. In the ideal editing environment, they would show up in a separate window that could be sized and minimized as desired. As the developer traverses either file, the other window would be automatically synchronized. That is probably the fundamentally right way to keep API documentation adjacent to code. (But given the dearth of high-quality editing environments, it is no wonder that Java chose to make API comments part of the code.)

The problem with including Javadoc comments in the code is that, since they have their own structure, any given <node> element may have two hierarchies under it -- one for the Javadoc comment, and one for the code that occurs under that node. The <content> element provides a way to encapsulate the second element. Without it, a node that consists of Javadoc comments could not have code under it -- which could diminish the "literate programming" aspects of the system. There are several alternatives, however. One is adding a special <doccmt> element, in addition to the normal <node> element. However, as previously argued, it is desirable to avoid having multiple element types, if possible. Another alternative is using XML "includes" to point to Javadoc comments contained in external documents. The comments would still appear "inline" when viewed by the developer, but would not constitute a parallel structure under a <node>. That alternative could obviate the need for a <content> element. It would also make it possible for writers to edit those comments, with no chance of modifying code inadvertently. Another alternative is stylistic: A developer would be expected to create a short summary line in a normal comment, under which the Javadoc comment exists in one node, as well as the code (in a parallel node). That way, the program still reads like a series of short comments when collapsed, allowing the user to expand comments and/or code as they see fit. (In fact, that stylistic approach makes sense even when the Javadoc comments are in external documents.)

Shortcomings of XML for Source Encoding

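As described in the previous section, under the headings of "Preserving spacing" and "Comment styles", XML has two major shortcomings that make the process of source encoding more difficult:

The verbose syntax required for CDATA sections
The lack of any way to distinguish an element's content from its structure
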
The second problem in particular affects every XML document -- not just source code documents. This section examines those limitations in a bit more detail.

CDATA Syntax

As we saw earlier, the need to handle special characters, as well as to preserve spaces and line breaks, implies the need for a continuous series of CDATA sections throughout the document. Virtually every element will have a CDATA section, so the section-delimiting tags <![CDATA[...]]> will appear over and over again in the XML structure. The extra syntax will turn the XML structures into something you don't want to edit by hand unless you really need to.

One possible solution would allow the DTD or schema specification to declare "this element always contains unparsed character data (CDATA)". The parser would then proceed to ignore any and all special characters, and pass on any line breaks, until it saw the exact sequence of characters necessary to terminate that element.

The problem arises, of course, that you may want to discuss "</node>" inside of a node element, without terminating that node. Possible solutions to that problem include escaping the / character, as <//node> or <&slsh;node>. Since such instances would be exceedingly rare, however, the escape would seldom be needed.
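To see how little machinery the proposal needs, here is a Python sketch of a scanner for such an "always CDATA" element. Everything in it is hypothetical, since no XML parser works this way: the function name read_raw_node is invented, "<//node>" is the escape suggested above, and nesting of <node> elements is ignored.

```python
def read_raw_node(text, start=0):
    """Read <node>...</node> treating everything inside as raw text.

    Returns (content, index_after_close). "<//node>" is a hypothetical
    escape that stands for a literal "</node>" in the content.
    """
    open_tag, close_tag, escaped = "<node>", "</node>", "<//node>"
    if not text.startswith(open_tag, start):
        raise ValueError("expected <node> at offset %d" % start)
    begin = start + len(open_tag)
    # Only the exact terminating sequence ends the element.
    end = text.find(close_tag, begin)
    if end < 0:
        raise ValueError("unterminated <node>")
    content = text[begin:end].replace(escaped, close_tag)
    return content, end + len(close_tag)

text = '<node>if (a < b && s.equals("<//node>")) {</node>rest'
content, pos = read_raw_node(text)
```

The "<" and "&&" pass through untouched, and the escaped tag comes back as a literal "</node>" in the content.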

A bigger problem concerns the interaction of the "automatic CDATA" mechanism and the solution to the problem described in the next section. First, let's look at that problem...

Distinguishing Content and Structure elements

As we saw in the section on comment styles, encoding source code in XML requires both <content> and structure elements in every node. The reason: There is no other way to make sure that no text occurs in what should otherwise contain only structure tags.

XML's "mixed content model" allows text and tags to be mixed. That's swell for a paragraph. It means that bold and italic tags can be mixed in with the text. The inverse is also true: It means that text can occur between tags in the file. So the structure can be seen as: <b>...</b> ...some text here... <i>...</i> -- that is, as a structure containing two elements that have text between them.

While that arrangement makes perfect sense in a paragraph, it doesn't make any sense in a list. So this:
<li>...</li>...some text here...<li>...</li>
would mean what? That text is obviously not part of any list item, so it makes no sense.

In XML, you can't allow any text in an element without allowing it everywhere in that element. If you allow text to occur at the beginning of an element, you have to allow it between any of the elements in that structure. Again, for tags like <b>...</b>, that makes sense. But for tags like <li>...</li>, it doesn't. The difference between those two kinds of tags is the difference between content and structure. (In XHTML and DocBook, content tags are defined as inline tags. However, that distinction means nothing to an XML parser.)

Even without considering content tags at all, we still saw the need for distinguishing content from structure when we considered the "//" comment. The content of the element is the text that comes after it. The structure of that element consists of the <node> elements under it, which contain programming language statements and subordinate comments.

However, in XML, if we were to define a single <node> that could contain both text and other <node> elements, then the "mixed content" specification would allow text to be freely intermixed between the subnodes. And that would not be a legal program! The only way to get around that problem is to introduce a <content> element under <node>. As we have seen, though, that makes editing more difficult. A straightforward display of the XML includes all the textless <node> elements, while a more intelligent display complicates the editor and requires additional style controls to identify the elements that should "disappear" when the data is displayed.
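The rule being enforced here, content first and structure after with no stray text in between, is simple to state as a check. The Python sketch below assumes the <node>/<content> element names used in this paper; the function name is_valid_node is invented, and whitespace between elements is tolerated as ordinary pretty-printing.

```python
import xml.etree.ElementTree as ET

def is_valid_node(node):
    """True if node is: optional <content>, then zero or more <node>."""
    if node.tag != "node":
        return False
    # No text directly inside <node>; whitespace-only text is tolerated.
    if node.text and node.text.strip():
        return False
    for index, child in enumerate(list(node)):
        if child.tail and child.tail.strip():
            return False  # stray text between structure elements
        if child.tag == "content":
            if index != 0:
                return False  # <content> must come first, at most once
        elif not is_valid_node(child):
            return False
    return True

GOOD = ("<node><content>// total</content>"
        "<node><content>x = 1;</content></node></node>")
BAD = ("<node><content>// total</content>"
       "<node><content>x = 1;</content></node>stray text<node/></node>")

good_ok = is_valid_node(ET.fromstring(GOOD))
bad_ok = is_valid_node(ET.fromstring(BAD))
```

With the extra <content> element in place, a validating parser can apply the same rule from the DTD alone, which is the whole reason for introducing it.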

The fact is, this problem affects every document, not just source code. Consider the document you are reading, for example. A heading consists of text, followed by substructure elements like <p> or subheadings. Text certainly does not occur between the substructure elements -- only before those elements. (A heading may also contain various "inline" tags like <i> or links. So, to be complete, the concept of content needs to include those tags as well as text.)

DocBook is the SGML standard for defining books, articles, magazines, journals, and most any other kind of document you can name. The designers of DocBook, in both its SGML and XML versions, faced the same problem, and solved it in the same way. Each <sect1> and <sect2> element, for example, contains a <title> element that holds the content of that heading.

But adding an extra tag to solve the dilemma is not, in my opinion, the ideal solution. With more XML editors coming online every day, there is a real chance to turn structured XML data into the ubiquitous data/text format -- something that replaces plain text the same way that plain text replaced punched cards and plug boards. But one of the things that makes plain text so ubiquitous is that it is so easy to view and edit, with any number of tools designed for that purpose.

For XML to achieve the same level of ubiquity, editors and viewing tools have to be as readily available as their plain text cousins. If XML had the capability to declare, as part of the DTD or schema, that particular tags were inline (content) tags, and a way for a validating parser to verify that any given element contains content (text plus inline tags) followed by structure (other tags), then XML might just achieve that ubiquity. The ability to distinguish content from structure would make it possible for any XML editor to intelligently display, edit, and validate the data.

Other alternatives include adding an attribute to each element definition, and adding that attribute to each and every data element in the file -- but that is an awful lot of extra work for something that could easily be specified in the schema. Another alternative is to make all content an "attribute" of an element. The current XML specification does not allow that, however, because attributes may not contain subelements. The next version of XML apparently will allow that, though. The XML data that results, of course, may be the ugliest thing yet -- but at least the problem will be solvable in a way that gives any unspecified editor a chance of doing the right thing. (At the moment, most do not handle attributes nicely. But if a standard attribute like "content" or "text" were defined, perhaps they could do better. That standard would allow them to display the content in the main window, instead of in a separate form, as they typically do for attributes.)

Interaction of CDATA and Content/Structure Solutions

An interesting problem occurs when we try to solve both the CDATA and Content/Structure problems at the same time. The CDATA solution implies that once <node> is seen, its text is terminated only by </node>. But the XML document contains multiple nested <node> elements. Meanwhile, the Content/Structure solution implies that the text of a section terminates when the first <node> (or other structure element) is seen.

Taking those two in combination, therefore, implies that the CDATA part of a <node> would have to be terminated by </node> or by any structure element defined in the document schema. (In our case, that's just another <node>. But for a general XML solution, that could be structure elements like <h2> or <ol> in an XHTML document, or <size> and <color> in an order-entry document.)

Combining both solutions, then, means that many more tags besides </node> would have to be escaped in order to carry on a discussion about them. And that would lead to many more escapes than the original CDATA solution suggested. It may therefore be unwise to attempt both solutions at the same time. (The need for additional escapes arises regardless of whether the content exists as text under the element, or as an attribute of it.)

Of the two, the more pressing problem is the one that interferes with ubiquitous, intelligent editing of XML documents. That is the need for distinguishing content from structure in some standard way. Given that, the CDATA issue can be lived with for special cases like source code. In most other cases, it's not that big an issue.


The use of XML for encoding source language statements would be highly desirable. Using a "plain" encoding seems to be the most desirable format for editing, with a generic format like CSF coming in a close second -- and possibly (but not necessarily) overtaking it in the long term. Barring improvements in the XML standard itself, the desired structure looks like this:


In DTD parlance, the definition calls for an optional <content> element (or it could be required, but empty) and zero or more <node> elements:


where &inline; is the definition of the inline tags defined in the XHTML DTD. (The inline tags won't be needed as long as javadoc comments are treated as CDATA sections, but could come in handy later on if more interesting structures are allowed.)

(This is the bare outline. For the full DTD, see xmlsource.dtd.)