SVDIG, June 2007 - Tips and Techniques on Migrating to DITA

Ann Adams of Kyocera

5 writers, 2 software products.

1 mature product -- a universal printer driver -- has HTML Help and a User Guide in unstructured Frame. Help is translated into 31 languages, the User Guide is not translated.

The 2nd product -- network printer management -- has just had its first release. A User Guide in unstructured Frame is being used for Help. Translated into 6 languages.

The plan is to single source both products and output HTML for Help and PDFs for User Guides.

We had 3 days of DITA authoring training with Anna van Raaphorst and Dick Johnson of VR Communications. They helped us install the Open Toolkit and walked us through the garage and grocery shopping examples. Anna and our printer driver SME discussed the structure and organization of the existing Help. We then started rewriting that Help in DITA during the class and the author has been working on it since then.

We have signed up with DocZone for hosted content and translation memory management. As I mentioned, their team will be working on analyzing our data and loading it into the system so that we can start working from a database and implementing workflow processes. Wednesday will be their first day on-site, so I hope to have an update for the meeting on what we have accomplished.

Tips from Anna van Raaphorst and Dick Johnson of VR Communications, Inc.

Because of the fragility and incompatibility of tools (that would include authoring tools, CMSs, GMSs, migration tools, etc., and this situation will improve over time), we are recommending that organizations migrating to DITA learn SOMETHING about how the Toolkit works, so they can process content using it if necessary. This doesn't mean that EVERYONE in their group needs to be an expert, but SOMEBODY needs to be.

A related point is that all DITA teams need information architecture, writing, editing, and technical expertise. These skills don't need to be wrapped up in a single person, but a collaborative environment with all skills represented is essential. You can hire a consultant to help you get started, but someone in the group needs to be there to take the baton when it's passed.

There are good points for and against starting a legacy migration by rearchitecting existing content OR converting first and then rearchitecting. I tend to favor starting with a solid architecture and a small number of topics (up to 50) and then adding topics and adjusting the architecture using an iterative approach over time (up to 1-3 months, depending on the project size). Starting with the conversion would be a better option for a project that must be completed in a short time, but you might be left with a project that is difficult or impossible to scale or extend over time.

Notes from Megan Bock and Jenifer Schlotfeldt of IBM

General guidelines for structured-to-structured migration

  • Make sure that you understand element mappings.
  • Remove obsolete content before beginning migration.
  • Develop precise instructions for the prep, conversion, and cleanup work
    • Test and refine the instructions using at least two projects.
    • Identify corrective and preventative work that has a lower cost on the front end than on the back. Some things are easier to fix in the initial language; others in DITA.
    • During the migration, take frequent snapshots (or make frequent forks of working directories) so that you can fall back to an intermediate stage if you discover a problem.
    • Understand the probable impact on translation memory.
  • Talk to your translation vendors early.
  • Develop realistic sizings.
  • Do not rip the entire book apart unless you know that you have time to put it all back together. Work in sections. - "Migrating HTML to DITA, Part 1: Simple steps to move from HTML to DITA"

Brief notes from Scott Prentice of Leximation

You don't have to go all the way to "pure" DITA topic files in one step. If you're currently in unstructured Frame (or any chapter-based model), you can always go to chapter-based DITA files where you have nested topics. This is perfectly valid DITA and can be a good interim target for a conversion if you don't have the time to restructure and rewrite. Most documentation sets will fit into DITA's topic structure (even linear non-topic types of documentation). You won't really gain the full benefits of all that DITA has to offer, but it will allow you to work in structure/XML while keeping your filesystem and other processes intact while you test the remainder of the conversion and wait for another window in your schedule.

If you're migrating from unstructured Frame, take the time to get your conversion tables as close as possible to generating real DITA .. don't rush things.

After running your conversion table and applying the new EDD/template .. keep the files in FM binary format as long as possible. There are all kinds of things you can quickly do in Frame across a whole book of files that are much more difficult to do once you're in XML (unless you're an XSLT wizard).

I highly recommend getting a copy of FrameSLT .. this plugin lets you do XSLT-like processing in Frame and is a huge time saver!

DITA Migration Strategy from Eric Armstrong of Sun

This writeup describes a strategy derived from the conversion of 24 heavily interlinked HTML documents to 35 DITA topics wrapped by three DITA maps. Some of the details in this writeup are HTML-specific, but the general concepts apply to almost any conversion.

Note: This was all pretty set in my mind, until the Silicon Valley DITA Interest Group had a roundtable devoted to the topic. What I learned from the other members heavily influenced my thinking and forced me to revisit some of my operating assumptions (see below, under "Two-Stage Conversion").

General Notes

Get a CMS

I had heard this advice before, but until I did a project in DITA, I didn't fully appreciate the reasons:

  • DITA is like HTML links on steroids. Imagine having to manage all of your HTML links manually. DITA is much worse. You have:

    • Metadata tags in filenames

      • Similar topics have the same name, except for their metadata. Example: install_solaris vs. install_linux
      • You can't use directories for metadata names, because topics have multiple tags (product and platform), and a shared topic can have multiple values (platform=solaris and platform=linux). So the only alternative is to encode enough metadata into the filename to keep topics distinguished from one another.
    • Topic type encoded in directory name (concept/) or in file name (c_xyz, xyz_concept)

      • It helps to keep different topics distinguished from one another
      • However you do that, it affects the file path
    • Topicrefs that are sensitive to location and name

      • If you get it all 100% right at the outset, you're golden. But if you need to make a change, you've got to keep everything sync'd up
    • Conditionalized topicrefs that depend on the metadata, as well as conditionalized content

      • So you need to keep your conditionals sync'd up, too.
    • plus image links, topic links, and external references

  • None of the editors handle things for you, the way DreamWeaver handles links for HTML
  • So you need some kind of CMS to work effectively. Inexpensive choices:

    • An open source CMS--but Ann Adams reported that none of them really seemed ready for prime time, as of today
    • DocZone--a remote-host solution that is pre-integrated with all major localization systems
    • XDocs--a feature-rich CMS with department-level pricing: $5k for a single-server, 4-writer system with unlimited reviewers, $10k for anything more than 4. Although the main repository is a database, each writer has a working set on their filesystem. So they can use any editor and any other tools they're comfortable with.
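
To make the filename bookkeeping described above concrete, here is a minimal Python sketch of one possible naming convention. The `topic_filename` helper, the `c`/`t`/`r` prefixes, and the underscore separator are all assumptions made for illustration; neither DITA nor the Open Toolkit requires any of them.

```python
# Hypothetical naming helper: encodes topic type and metadata values in a filename.
# The prefixes and separator are assumed conventions, not DITA requirements.
TYPE_PREFIX = {"concept": "c", "task": "t", "reference": "r"}


def topic_filename(topic_type, subject, metadata=()):
    """Build a name such as 't_install_solaris.dita'."""
    prefix = TYPE_PREFIX[topic_type]
    # Sort the metadata values so the same set of tags always yields the same name.
    parts = [prefix, subject, *sorted(metadata)]
    return "_".join(parts) + ".dita"


if __name__ == "__main__":
    print(topic_filename("task", "install", ["solaris"]))
    print(topic_filename("task", "install", ["solaris", "linux"]))
```

Every topicref, link, and conref that points at such a file has to change whenever the subject, type, or tags change, which is the maintenance burden a CMS is meant to take off your hands.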
Four Kinds of Activities
  • There are 4 kinds of activities in the process:

    • Architecture: Analyzing, defining metadata, building maps
    • Authoring: Creating topics
    • Production: Processing maps to produce deliverable content
    • Branding & Styling: Adding boilerplate headers & footers, nav bars, specialized CSS
  • Getting something out the door requires a lot of work in each area. Ideally, you'll have at least one person for each area. If it's a solo effort, there are many hats to wear.
Work Agile
  • To work agile-style, create something tiny and work it through every stage of the process. Then add to it.
  • I tried to do one stage at a time. That wasn't ideal, because making the right decision early on often requires knowing what's going to happen later, as a result of that decision. (For example: link type influences the directory structure of Open Toolkit output.)
  • Going through the full process with minimum data gives you the information you need
  • In my case, working agile would have meant running into the late-stage production issues much sooner, and it would have produced tangible results for management much sooner.
Take Courses, Hire Consultants
  • Courses and consultants are another good way to get the benefit of knowledge that comes from experience.
  • For consultants, hire one that consults: someone who answers questions and gives advice. Do the work yourself. If you let the consultant do it all, you wind up learning nothing.
  • For courses, the DITA Bootcamp was terrific for learning how to architect and use DITA. But it was far short of what we needed on the production side of the equation. (For that, we needed either a CMS, consultants like Anna van Raaphorst and Dick Johnson, or more time to master the Open Toolkit.)
The Ideal Starter Set
  • The ideal starter set for an agile process would look something like this:

    • Three topics: one concept, one reference, one task
    • Two maps
    • One topic shared between the two maps
    • One metadata tag, with two possible values
    • One case where the value is included, one where it's excluded
    • One common-entity in a snippets file, conref'd one or more places
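
The starter set above can be sketched as data plus a couple of small generators. This is only an illustration: every file name, id, and tag value below is a hypothetical placeholder, and the map and filter markup is kept to the bare minimum (no DOCTYPE declarations, which a real toolchain would normally expect).

```python
import xml.etree.ElementTree as ET

# All file names, ids, and tag values below are hypothetical placeholders.
TOPICS = {
    "c_overview.dita": "concept",
    "r_options.dita": "reference",
    "t_install.dita": "task",
}
MAPS = {
    "product_a.ditamap": ["c_overview.dita", "t_install.dita"],
    "product_b.ditamap": ["c_overview.dita", "r_options.dita"],
}
# One reusable phrase kept in a snippets file and pulled in by conref.
CONREF_EXAMPLE = '<ph conref="snippets.dita#snippets/product_name"/>'


def build_map(title, hrefs):
    """Return a minimal map with one topicref per topic file."""
    root = ET.Element("map", {"title": title})
    for href in hrefs:
        ET.SubElement(root, "topicref", {"href": href})
    return ET.tostring(root, encoding="unicode")


def build_ditaval(att, included, excluded):
    """Return a filter file that includes one value and excludes the other."""
    root = ET.Element("val")
    ET.SubElement(root, "prop", {"att": att, "val": included, "action": "include"})
    ET.SubElement(root, "prop", {"att": att, "val": excluded, "action": "exclude"})
    return ET.tostring(root, encoding="unicode")


def shared_topics(maps):
    """Return the topics referenced by every map."""
    return set.intersection(*(set(hrefs) for hrefs in maps.values()))


if __name__ == "__main__":
    for name, hrefs in MAPS.items():
        print(name, build_map(name, hrefs))
    print(build_ditaval("platform", "solaris", "linux"))
    print("shared:", sorted(shared_topics(MAPS)))
```

Pushing even this tiny set through every stage (authoring, filtering, production, styling) should surface most of the late-stage surprises while the cost of changing course is still low.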
Two-Stage Conversion
  • In particular, Scott Prentice made a good case for doing a simple format conversion first, using nested topics. You can then do the refactoring later on, at your leisure, to create an architecture consisting of simple topic-components. (Megan Bock further underscored that point when she said that she would rather re-architect a book in DITA than attempt to convert a collection of structured topics written in SGML.)
  • Just to be explicit, the two stages are:
    • Reformat: A straight conversion to nested topics
    • Rearchitect: Unwind into separate topics, refactoring to eliminate redundancy
  • The nice thing about a two-stage conversion is that it is more agile in nature. You deliver output sooner in the process. In retrospect, that would have been a good idea for my project. It would have meant showing results much earlier.
  • If you follow that procedure, the formatting details described here won't apply. But the re-architecting principles will still be useful.
Time Estimates
  • Getting documents re-architected and converted to DITA took about 1 hr per printed page.
  • We still have topic cleanup and final production work to do. (Time TBD, but could be another 15-20 minutes/page.)
  • At times, I only had 3 or 4 hours a week to spend on the process, so be sure to factor in the real amount of time in your workweek before you publish a schedule based on these estimates.
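
As a rough worked example of the arithmetic above, the sketch below turns the per-page figures into total hours and calendar weeks. The 100-page document set and the 20 hr/week case are made-up inputs for illustration; the per-page rates are the ones quoted above, with cleanup taken at the high end of the range.

```python
def estimate(pages, hours_per_week, convert_hours_per_page=1.0, cleanup_minutes_per_page=20):
    """Return (total_hours, calendar_weeks) for a conversion project."""
    hours_per_page = convert_hours_per_page + cleanup_minutes_per_page / 60
    total_hours = pages * hours_per_page
    return total_hours, total_hours / hours_per_week


if __name__ == "__main__":
    # Hypothetical 100-page doc set at different levels of weekly availability.
    for hours_per_week in (3, 4, 20):
        total, weeks = estimate(100, hours_per_week)
        print(f"{hours_per_week:>2} hr/week: {total:.0f} hours, about {weeks:.0f} weeks")
```

The point of the comparison is that the same page count stretches over a very different number of calendar weeks depending on how many hours you can really spend.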
Track Your Time
  • It helps to know how long things took. (Your estimates will probably differ from mine.)
Keep Notes
  • Write down the steps you used to do things in a given editor
  • You can share them with others later, and you'll refer to them yourself
  • If you try multiple editors, your notes will be the basis for your evaluation.

Conversion Process (Quick List)

  • Identify Deliverables
  • Identify Metadata
  • Create a Topic List (Worksheet)
  • Create Topic Maps
  • Identify Topics with Conditionals
  • Create Pseudo Topics
  • Convert to DITA Topics
  • Clean Up the Topics
  • Generate Output
  • Add Branding and Styling

Process Outline (Explanations)

Identify Deliverables
  • List the things you plan to deliver. That list is the foundation of your metadata. (If it doesn't result in a deliverable, there shouldn't be a metadata value for it.)
Identify Metadata
  • Metadata may not match existing terminology exactly
  • Example: We had pages named "Install Solaris" & "Install Solaris-64"
  • So I defined metadata to match (solaris and solaris_64). But later I began to wonder: does something tagged "solaris" include solaris_64, or exclude it?
  • It became clear that I needed a hierarchy: solaris, solaris/32, and solaris/64. That would be ideal. But DITA doesn't allow for metadata hierarchies.
  • I solved the problem with a "poor man's hierarchy": solaris, solaris_32, and solaris_64--a simulated hierarchy formed by concatenating the metadata values.
  • There is a processing implication: For the solaris-64 version of a document, I needed to include "solaris", as well as "solaris_64". Similarly for the solaris-32 version.
  • But it's an easy system for the writers. They only need to select one tag. That factor dominated the design decision.
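
The processing implication of the "poor man's hierarchy" can be sketched in a few lines of Python. The `include_values` and `is_included` helpers are hypothetical, and they assume that an underscore only ever separates hierarchy levels (so no level name contains one).

```python
def include_values(target, sep="_"):
    """Expand a build target into every tag value it must include.

    'solaris_64' expands to 'solaris' and 'solaris_64'.
    """
    parts = target.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]


def is_included(topic_tag, target):
    """True if content carrying the single tag `topic_tag` belongs in the build for `target`."""
    return topic_tag in include_values(target)


if __name__ == "__main__":
    print(include_values("solaris_64"))
    print(is_included("solaris", "solaris_64"))
    print(is_included("solaris_32", "solaris_64"))
```

Writers still pick exactly one tag per piece of content; the expansion happens once, at build time, when the include list for a deliverable is assembled.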
Create a Topic List (Worksheet)
  • This turned out to be a much-needed abstraction. Trying to go directly from documents to topics worked for one document, but when I began working on the second, I quickly found myself wallowing in confusion, wondering what to do next. Making the topic list let me go through the set of documents quickly, creating a guide for later activity.
  • The idea for the worksheet came from the Bootcamp class. Fortunately, a co-worker who also attended that class reminded me of it. (I mean to tell you, I was really lost.)
  • Here are the columns for the worksheet:

    Source document -- topic type -- subject (topic name) -- metadata columns

  • I had one column per metadata tag. For example, columns headed S, S32, and S64 were for solaris, solaris_32, and solaris_64. Then I put an X in the appropriate column.
  • In my conversion, I left out the source document column. That turned out to be a mistake. Later on, I'd be wondering where a particular topic came from, and I had no way to find out. Was it the document I'm looking at now? Or was it some other document I should be looking at to combine this with?
  • Note 1:

    You want subject names here, not file names:

    • If you're putting topics in the file system, the filenames will need to include metadata tags to keep them separate. (You can't use metadata to define directories, because a topic will be used in multiple places and belong to several categories.)
    • So you want a subject identifier here that names the topic, minus metadata values. That strategy lets you sort by subject later on, to find candidates for merging.
  • Note 2:

    While doing this high-level analysis, it's a good time to think about topic titles:

    • Many of the titles in our existing docs were horrific. For example, one was "Registering the Plugin"--something you're liable to skip. But the first sentence informed you that the plugin wouldn't work unless it was "registered" (and registration was just a matter of creating a symlink). That got changed to "Enabling the Plugin".
    • The DITA Bootcamp features JoAnn Hackos' seminar on "minimalist" (user-centric, task-oriented) documentation. Worth the price of admission.
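
For anyone who prefers a file to a spreadsheet, the worksheet columns described above can be sketched as a small CSV writer. The `WorksheetRow` class and the sample row are hypothetical; the column headings and the S/S32/S64 shorthand come from the description above.

```python
import csv
import io
from dataclasses import dataclass, field

# One column per metadata tag: solaris, solaris_32, solaris_64.
METADATA_COLUMNS = ["S", "S32", "S64"]


@dataclass
class WorksheetRow:
    source_document: str  # keep this column; it tells you where a topic came from
    topic_type: str       # concept, task, or reference
    subject: str          # subject name, not a file name
    tags: set = field(default_factory=set)


def to_csv(rows):
    """Render the worksheet as CSV text, with an X under each tag that applies."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source document", "Topic type", "Subject", *METADATA_COLUMNS])
    for row in rows:
        marks = ["X" if col in row.tags else "" for col in METADATA_COLUMNS]
        writer.writerow([row.source_document, row.topic_type, row.subject, *marks])
    return buf.getvalue()


if __name__ == "__main__":
    rows = [WorksheetRow("install_solaris64.html", "task", "install", {"S64"})]
    print(to_csv(rows))
```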
Create Topic Maps
  • They let you think about how the metadata will be used in practice.
  • They let you start creating deliverables as soon as possible.
  • The process may adjust your thinking about metadata. (I started out thinking I needed 12 maps, one for each deliverable. Then I tried to do it with one. I finally figured out that we needed three maps: One each for Solaris, Linux, and Windows. The reason: There was almost nothing in common between those maps. There were no topics that appeared in two of them. But the Solaris-64 topic set was a superset of the Solaris-32 topics, and the Java Development Kit (JDK) topic set was a superset of the Runtime Environment (JRE) topics, so it made a lot of sense to have one map for those combinations.)
Identify Topics with Conditionals
  • Sort the topic list by subject identifier.
  • Topics that have identical identifiers are candidates for merging. Inspect the source documents to see:

    • If most of the content is the same, use conditional metadata and make one topic
    • If most of the content is different, keep them as separate topics.
  • Process the merge-candidates first, or flag them so you look at the other source docs when you get to them:

    • Pick the best version to use for a starting point
    • Create pseudo topics, as described below.
    • Process each version and add conditional markers to the pseudo topic
  • Then sort the list by source document & proceed with conversion, one source at a time.
  • Note: Sorting the list and then resorting it could conceivably lose information. After the original analysis, the topics will appear in the table in the same order they appear in the source files. That makes it easy to step through the source file, extracting topics. After sorting ,
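
The sort-and-group steps above can be sketched as follows. Recording a sequence number before the first sort is my own assumption about how to keep the original order recoverable; the dictionary keys (`source`, `subject`, `seq`) are hypothetical names.

```python
from collections import defaultdict


def number_rows(rows):
    """Stamp each row with its original position before any sorting happens."""
    return [dict(row, seq=i) for i, row in enumerate(rows, start=1)]


def merge_candidates(rows):
    """Group source documents by subject; subjects seen more than once are candidates."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["subject"]].append(row["source"])
    return {subject: sources for subject, sources in groups.items() if len(sources) > 1}


def conversion_order(rows):
    """Sort by source document, then by original position within that document."""
    return sorted(rows, key=lambda r: (r["source"], r["seq"]))


if __name__ == "__main__":
    worksheet = number_rows([
        {"source": "a.html", "subject": "intro"},
        {"source": "a.html", "subject": "install"},
        {"source": "b.html", "subject": "install"},
        {"source": "b.html", "subject": "uninstall"},
    ])
    by_subject = sorted(worksheet, key=lambda r: r["subject"])
    print(merge_candidates(by_subject))
    print([r["subject"] for r in conversion_order(by_subject)])
```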

Create Pseudo Topics
  • Work in the source document format as long as possible, to minimize the confusion factor

    • I started by cutting sections out of HTML documents and pasting them into DITA topics. But I was going crazy working in two different editors, with different tags, different menus, tabs, and buttons, all while trying to keep track of where I was in the process.
  • Process

    • Make a copy of the doc set
    • Cut material out to make a topic, don't copy it (makes it easy to see what you've already done)
    • Delete a file when there's nothing left in it (makes it easy to see which files are left)
    • Add a "Done" column to the topic worksheet. Mark the topics as you create them to keep track of where you are.
  • Flag places where conditional metadata needs to be added:

    • __SOLARIS:__ Here is an example of a tagged sentence.
    • Try to flag complete sentences, not individual words, for translation's sake
    • Put the sentence in a <ph> tag so you can use it anywhere.
  • Flag places where conrefs need to be inserted:

    • __REFERENCE: name-of-material-to-insert__
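
Once the pseudo topics have been converted, the flags above can be turned into markup mechanically. The sketch below is one way to do that, under several assumptions of mine: every conditional flag names a `platform` value, a flag covers the text up to the first sentence-ending punctuation mark, reference names contain no underscores, and the snippets live in a hypothetical `snippets.dita` file whose topic id is `snippets`.

```python
import re

# __SOLARIS:__ followed by one sentence (assumed to end at the first ., !, or ?).
COND = re.compile(r"__([A-Z][A-Z0-9_]*?):__\s*([^.!?]+[.!?])")
# __REFERENCE: some-name__ (tolerates a stray third trailing underscore).
CONREF = re.compile(r"__REFERENCE:\s*([A-Za-z0-9-]+)_{2,3}")


def apply_flags(text, snippets="snippets.dita#snippets"):
    """Replace pseudo-topic flags with <ph> elements carrying platform or conref attributes."""
    # Conditional flags first, so a reference flag inside a flagged sentence survives intact.
    text = COND.sub(
        lambda m: f'<ph platform="{m.group(1).lower()}">{m.group(2)}</ph>', text
    )
    return CONREF.sub(lambda m: f'<ph conref="{snippets}/{m.group(1)}"/>', text)


if __name__ == "__main__":
    print(apply_flags("__SOLARIS:__ Here is an example of a tagged sentence. Untagged text."))
    print(apply_flags("Install __REFERENCE: product-name__ first."))
```

Review the result by hand afterwards; a regular expression has no idea whether a sentence really ended where the punctuation says it did.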
Convert to DITA Topics
  • Run h2d on the HTML pseudo topics to create DITA topics

    • Converts HTML to DITA
    • You tell it whether to create a concept, task, reference, or generic topic
    • Does a pretty decent job
    • Found in Open Toolkit's demo/ subdirectory
  • Review the conversion, make corrections as needed to pseudo-docs and rerun the conversion
Clean up the Topics
  • Make final corrections to topics
  • Insert conrefs and conditional metadata
Generate Output
  • Choices:

    • Open Toolkit (produces a site that mirrors the source directories)
    • CMS production tools (better, if you have a CMS)
    • Editor production tools (one at a time)
  • Compare output to original versions
  • Make final corrections
Add Branding and Styling
  • Choices:

    • Configure Open Toolkit processing script with pointers to CSS, headers, etc.
    • Use a "pipeline" approach, where output from Open Toolkit is fed into downstream processes
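
As an illustration of the "pipeline" option, here is a minimal Python sketch of a downstream step that wraps each generated HTML page with a header and footer. The function names, the output directory, and the header and footer markup are all hypothetical, and the regular expressions assume each page has a single ordinary `<body>` element.

```python
import re
from pathlib import Path


def add_branding(html, header, footer):
    """Insert `header` just after the opening <body> tag and `footer` just before </body>."""
    html = re.sub(
        r"(<body\b[^>]*>)", lambda m: m.group(1) + header, html, count=1, flags=re.IGNORECASE
    )
    return re.sub(
        r"(</body>)", lambda m: footer + m.group(1), html, count=1, flags=re.IGNORECASE
    )


def brand_tree(out_dir, header, footer):
    """Apply add_branding to every .html file under out_dir; return the number of files touched."""
    count = 0
    for page in Path(out_dir).rglob("*.html"):
        original = page.read_text(encoding="utf-8")
        page.write_text(add_branding(original, header, footer), encoding="utf-8")
        count += 1
    return count


if __name__ == "__main__":
    sample = '<html><body class="topic"><p>Hello</p></body></html>'
    print(add_branding(sample, '<div id="header">Docs</div>', '<div id="footer">Footer</div>'))
```

Keeping the branding in a separate step means the toolkit output can be regenerated at any time without redoing the styling by hand.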