DoctorMyhill:Migration/Jan-Apr 09 Initial Move

Migration Overview

Note that this is now a historical document, retained only for reference.

This website is a migration of the current Doctor Myhill website. In its current version this website is a pre-production test website whose purpose is to evaluate the effectiveness of the MediaWiki engine (the software that runs Wikipedia) as a delivery vehicle for Sarah Myhill's content.

The migration process is based on a set of Perl migration scripts which:

  • Use a URL/HTML download to "scrape" the contents of the existing site into an intermediate, file-based local copy
  • Use a set of clean-up macros to sanitise the content in this legacy format and convert it to wiki-compatible content
  • Convert this content into the MediaWiki XML import format for loading into the Wiki engine.
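
The final conversion step can be illustrated with a minimal sketch. It is written in Python rather than the Perl of the actual migration scripts, which are not reproduced here. The element names follow the general shape of MediaWiki's XML export format, but the namespace and schema version attributes that a real import may require are omitted, and `build_import_xml` and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_import_xml(pages):
    """Wrap {title: wikitext} pairs in a minimal MediaWiki-style XML document.

    Assumption: one <page> element per article, each holding a single
    <revision> whose <text> is the converted wiki markup.
    """
    root = ET.Element("mediawiki")
    for title, wikitext in pages.items():
        page = ET.SubElement(root, "page")
        ET.SubElement(page, "title").text = title
        revision = ET.SubElement(page, "revision")
        ET.SubElement(revision, "text").text = wikitext
    # ElementTree escapes &, < and > in the text for us.
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample content, for illustration only.
xml_doc = build_import_xml(
    {"Allergy and Elimination Dieting": "'''Some''' wiki text & more"}
)
print(xml_doc)
```

Building the document through an XML library rather than string concatenation means special characters in the article text cannot break the upload file.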

These scripts were developed and initially run against a LAMP VM based on Ubuntu JeOS running under VirtualBox. We have subsequently established a shared web service to host this instance, using the Webfusion shared service offering. This service is intended to facilitate multi-user testing by Dr Myhill's team and to prove the capability of running this site as a Wiki on such a shared service. This service is not guaranteed to meet full live load requirements.

There are three methods of making changes to a MediaWiki website:

  1. Script it as part of upload. This works if you can define a simple rule for the change that you want to make (e.g. here's a spreadsheet of the old and new names, or if you find character X in the text then replace it with character Y). Advantages: the grunt work can be automated and done by the computer as part of the upload from the old website. Disadvantages: we need to be able to define the rule, and I need to program it. You can get the "false positives" effect: say 95% of hits apply the change correctly, but 5% apply it incorrectly or change something that you didn't want to change. You also can't do this once you've started to change the wiki pages manually.
  2. Script as a batch process on the Wiki. This is really a hybrid of 1 and 3, where I write a script which goes through the wiki interface to change content automatically. Advantages: really a combination of those of 1 and 3. Disadvantages: again largely a combination of those of 1 and 3, although this type of process can be done at any time.
  3. Use the standard Wiki functionality. Advantages: it's standard functionality and therefore the Myhill team can do what they want without IT support from someone such as me. Disadvantages: if someone wants to make the same change to 200 pages then it's 200 times the work.
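
The kind of rule described under method 1, and the "false positives" risk that comes with it, can be sketched as follows. This is an illustrative Python sketch, not the migration code itself; the function names, the rename row and the sample sentence are all hypothetical.

```python
import csv
import io
import re


def load_rename_map(csv_text):
    """Read 'old name,new name' rows (e.g. exported from a spreadsheet)."""
    rows = csv.reader(io.StringIO(csv_text))
    return {old.strip(): new.strip() for old, new in rows}


def replace_naive(text, old, new):
    """Replace every occurrence, including those inside longer words."""
    return text.replace(old, new)


def replace_whole_word(text, old, new):
    """Replace only stand-alone occurrences, which avoids some false positives."""
    return re.sub(r"\b" + re.escape(old) + r"\b", new, text)


# Hypothetical data, for illustration only.
renames = load_rename_map("Old Article Name,New Article Name\n")
sample = "Dr Myhill sees ill patients"

print(renames)
print(replace_naive(sample, "ill", "unwell"))       # also changes "Myhill"
print(replace_whole_word(sample, "ill", "unwell"))  # leaves "Myhill" alone
```

A tighter rule reduces false positives but rarely removes them entirely, which is why some per-page checking is still expected after a scripted change.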

The trick to minimising work is to use the right approach at the right time. So which do we use and when? With this migration, we will be running in three different phases of site use, and which approach you use depends on which phase we are in:

  • Prototyping. The "master" database is the live "legacy" database. The wiki copy is ephemeral; that is, we are using it purely as a "try it and see" test site so that we can: (a) learn about the features and potential of the wiki software; (b) sort out style templates; and (c) work out any rules that we want to apply (such as bulk renaming or more automated reformatting rules). In the meantime, we continue to make any necessary changes to the live legacy database. We are still in this phase, so the master repository is still the "legacy" site. We are planning to cut over to Phase 2 within the next week or so.
  • Preparation. At the end of Prototyping, I will apply any new rules and lessons learnt by reimporting the live legacy database into the Wiki. I can apply any upload scripting at this point. I will then rename the existing wiki to /proto and create a second "pilot" wiki which will sit alongside the prototype and take the name /wiki. That way we are free to export any specific page changes from the prototype wiki and import them into the pilot. At this point the pilot wiki becomes the master database and we switch working to that. The legacy website is largely frozen, and where we do need to make changes to it, we need to manually replicate them in the pilot wiki.
  • Wiki Live. The Wiki becomes the live website and the legacy website is taken offline. All work must be done using live batch (2) or standard manual (3) processes.

My current work is to ensure that we can do the best automated conversion within our resource limits.

Requirements

  1. All articles and tests must be exported from the existing site and moved to the new Wiki on a page by page basis.
  2. No text from the articles and test content can be lost in the migration process.
  3. Where practical, reasonable assumptions and mappings of content formats will be applied. This conversion is intended to be reasonable rather than optimal, and some per-page post-migration clean-up is anticipated (say 10 mins per article).
  4. Articles will be optionally renamed on conversion to fall into a more consistent article naming style.
    • We need to decide whether we have short names for cross-referencing purposes. For example, one article is called "Allergy and Elimination Dieting — when the diet fails". Do we prefer the shorter title "Allergy and Elimination Dieting", with the byline removed?
  5. All article and test references within the text will be remapped to internal wiki references.
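
Requirement 5 could be approached along these lines. This is a Python sketch under stated assumptions: the legacy URL shown in `URL_TO_TITLE` is a made-up placeholder, the real lookup would come from the article renaming table, and `remap_links` is a hypothetical helper rather than part of the actual scripts.

```python
import re

# Hypothetical lookup from legacy URLs to new wiki page titles.
URL_TO_TITLE = {
    "article.cfm?id=101": "Allergy and Elimination Dieting",
}

# Matches simple anchors of the form <a href="...">label</a>.
LINK_RE = re.compile(r'<a\s+href="([^"]+)"\s*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def remap_links(html, url_to_title):
    """Turn known legacy links into [[Title|label]] internal wiki links."""

    def to_wikilink(match):
        url, label = match.group(1), match.group(2).strip()
        title = url_to_title.get(url)
        if title is None:
            return match.group(0)  # unknown target: leave for manual review
        if label == title:
            return "[[" + title + "]]"
        return "[[" + title + "|" + label + "]]"

    return LINK_RE.sub(to_wikilink, html)


print(remap_links('See <a href="article.cfm?id=101">the diet page</a>.', URL_TO_TITLE))
```

Links whose target is not in the lookup are deliberately left unchanged, so that nothing is silently rewritten to the wrong page.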

Specific Implementation Points

Issues

Poor Syntax in existing HTML markup

The HTML markup in the existing content has not been created with a syntax-checking editor. As a result, the standard HTML to Wiki converter fails to process the markup in the way anticipated. Examples include:

  • Incorrect nesting of tags. A typical one is the use of the <b> and <i> tags:
<b><i>Some Title</b></i>              Wrong
<b><i>Some Title</i></b>              Right
Solution: Any adjacent bold / italic tags will be converted to the second format. (? Do you mean "heading"? Or does this mean something else? Hania 14:35, 18 March 2009 (UTC))
  • Inconsistent bulleting. Some bullets are introduced with <li> tags, others with *, >> or square block characters.
Solution: All special character entries in column 1 will be treated as <li> tags and converted to the same.
  • Incorrect list framing. Sequences of <li> tags should be bracketed by list definition tags (e.g. <ol> or <ul>). These are usually omitted, and when they are included extra text is sometimes inserted before or after bullets. This causes the converter a lot of problems and this text is sometimes dropped from the converted output.
Solution: All list tags (<ol> </ol> <ul> </ul>) will be removed, and new open and close list tags (<ul> </ul>) will be automatically inserted before and after each sequence. Numbered lists will be converted into bullet lists to avoid number break and continuation issues.
  • Use of emphasis tags (<b>, etc.) for titles
Solution: Any text lines of fewer than 100 characters which are bracketed by emphasis tags covering the whole line length will be converted to level 3 headings.
  • Use of inappropriate tags such as <title> and <body>
Solution: Luckily the converter ignores them
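
The clean-up rules listed above can be sketched roughly as follows. This is a simplified Python illustration, not the actual clean-up macros: it works line by line, the set of bullet characters and emphasis tags is an assumption, and `fix_nesting` and `clean_markup` are hypothetical names.

```python
import re

# A line counts as a bullet if it starts with <li>, *, >> or a square block.
BULLET_RE = re.compile(r"^(?:<li>|\*|>>|[\u25a0\u25aa])\s*(.*?)(?:</li>)?\s*$", re.IGNORECASE)
LIST_TAG_RE = re.compile(r"</?(?:ol|ul)\b[^>]*>", re.IGNORECASE)
# A whole line wrapped in emphasis tags, with no other tags inside.
EMPHASIS_LINE_RE = re.compile(
    r"^(?:<(?:b|i|strong|em)>)+([^<]+)(?:</(?:b|i|strong|em)>)+$", re.IGNORECASE
)


def fix_nesting(html):
    """Rewrite <b><i>x</b></i> as <b><i>x</i></b> (and the <i><b> mirror case)."""
    html = re.sub(r"<b><i>([^<]*)</b></i>", r"<b><i>\1</i></b>", html, flags=re.IGNORECASE)
    return re.sub(r"<i><b>([^<]*)</i></b>", r"<i><b>\1</b></i>", html, flags=re.IGNORECASE)


def clean_markup(html):
    """Apply the nesting, bullet, list-framing and heading rules."""
    out, in_list = [], False
    for raw in fix_nesting(html).splitlines():
        line = LIST_TAG_RE.sub("", raw).strip()  # drop existing <ol>/<ul> tags
        if not line and raw.strip():
            continue  # the line held nothing but list tags
        bullet = BULLET_RE.match(line)
        if bullet:
            if not in_list:
                out.append("<ul>")
                in_list = True
            out.append("<li>" + bullet.group(1) + "</li>")
            continue
        if in_list:
            out.append("</ul>")
            in_list = False
        heading = EMPHASIS_LINE_RE.match(line)
        if heading and len(line) < 100:
            out.append("=== " + heading.group(1) + " ===")  # level 3 wiki heading
        else:
            out.append(line)
    if in_list:
        out.append("</ul>")
    return "\n".join(out)


sample = "<b><i>Some Title</b></i>\n<li>One\n* Two\n>> Three\nPlain text"
print(clean_markup(sample))
```

Because every bullet sequence is re-framed with <ul> tags, numbered lists come out as bullet lists, matching the trade-off described above.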

The way that I have tried to pick up all of these conversion issues is to take the pre-conversion and post-conversion text and remove all markup and punctuation, leaving only the word sequence of the content. Where the two sequences are not identical, I have compared them to make sure no material text has been lost.
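
A check of this kind might look like the following Python sketch. It assumes that stripping HTML tags and then keeping only alphanumeric runs is a good enough approximation of "the word sequence"; `word_sequence` and `lost_words` are hypothetical names, and the sample strings are invented.

```python
import difflib
import re


def word_sequence(text):
    """Drop HTML tags, then keep only lowercase alphanumeric words."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[^\W_]+", text.lower())


def lost_words(before_html, after_wiki):
    """Return words present in the original but missing after conversion."""
    before = word_sequence(before_html)
    after = word_sequence(after_wiki)
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    lost = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            lost.extend(before[i1:i2])
    return lost


# Invented example: the word "more" was dropped by the conversion.
before = "<b>Eat <i>more</i> vegetables</b> every day."
after = "'''Eat vegetables''' every day"
print(lost_words(before, after))
```

An empty result means the two word sequences agree; anything else marks a page that needs a manual comparison.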

As a consequence, I've had to remove a little more formatting than I would have liked (e.g. numbered lists become unnumbered), but at least this way I am not losing content because the converter gets confused by the invalid HTML syntax.