Multi-language support
Introduction
Documents in Daisy can exist in multiple variants. There are two types of variants: branches and languages. Different document variants are technically pretty much like different documents, but they are identified by the same document ID, thus giving a logical grouping.
While on a basic level Daisy has support for storing multiple language variants of one document, it is useful to have some higher-level functionality to work with, and manage, the language variants.
Multi-lingual content
We can distinguish different types of multi-lingualness:
-
one set of documents in which some documents happen to be in language A, some others in language B, ... In Daisy, we wouldn't make use of the language variants for this case: all documents would belong to the same language variant ('default'). Since the variant doesn't identify the content's language, a meta data field that specifies the language could be added.
-
a set of documents translated into multiple languages: for all (or most) documents, there is a translation in one or more other languages. In Daisy, this setup would make use of the language variants functionality.
-
maybe some shared resources, such as images, only exist in one language variant ('default')
-
mixed language usage within a document. This is not very common, but it occurs, for example, when including a quotation or when writing about another language. The language could be indicated with markup (e.g. to allow for spell-checking)
The second case is the most interesting one, on which the remainder of these notes focus.
Terminology
reference variant: the document variant containing the original written text, from which the other language variants are translated. Authors work on the reference variant, translators translate the reference variant to other languages. The reference variant could also be called the source variant or the master variant.
translated variant: a variant containing a translation of the reference variant.
Desirable features for working with multi-lingual content
Listed below are various ideas for things which could be added to Daisy to improve working with language variants. It is not the intent to necessarily solve all these issues at once, but rather to have a view on what could be possible.
Management related
Imagine a document written in language EN, which needs translations to languages FR and NL. The initial translations are made, possibly in a number of cycles (causing multiple versions), until the translations are finished.
After a while, or possibly while the initial translations are still being made, the EN variant gets some updates. The translations now need to be brought up to date. The end result we want to achieve is that the live versions of the translated variants match the live version of the reference variant.
The 'management related' items here are about keeping track of what translations are missing or out-of-date.
-
keeping track of which language variant is the reference language for the other ones. This should be stored on a per-document level, since the reference language does not have to be the same for all documents.
-
For the reference variant:
-
keep track of which versions contain changes that invalidate the translated variants. Thus, for each version: does it contain only typo fixes, or real content changes?
-
alternatively, we could explicitly store per variant which version is the one with which translations should be kept in sync. This has the advantage that the user can more easily control how things work. It also makes searching easier, though for this purpose we could automatically derive this version number from 'the most recent version with content changes'.
-
For the translated variants:
-
keep track of which version of the translated document variant corresponds to which version of the reference language variant.
-
getting a report of out-of-date translations then becomes possible
-
allows diffing what has changed since the previous translation, so that the translator can focus on those areas
-
alternatively, keep track for the variant as a whole of which version of the reference language it is in sync with. Again, this is easy for searching, but it might also be auto-assigned from the last version which has this information.
-
triggering of translation tasks (probably in the form of workflow tasks): upon each relevant edit, or only at certain times? E.g. product manual: at each release, website: continuously.
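To make the bookkeeping above more concrete, here is a minimal sketch in Python of how the status of a translated variant could be derived. It is not Daisy's data model or API: the names `Version`, `Document`, `major_change` and `synced_with` are made up for illustration, and live versus non-live versions are ignored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Version:
    number: int
    # Reference variant: True for real content changes, False for typo
    # fixes that should not invalidate the translations (hypothetical flag).
    major_change: bool = True
    # Translated variants: the reference version this version was
    # translated from (hypothetical field).
    synced_with: Optional[int] = None


@dataclass
class Document:
    doc_id: str
    reference_language: str  # stored per document, not globally
    variants: Dict[str, List[Version]] = field(default_factory=dict)


def reference_sync_target(doc):
    """Most recent reference version that contains real content changes."""
    versions = doc.variants[doc.reference_language]
    # Assumption: if nothing is flagged, version 1 is the target.
    return max((v.number for v in versions if v.major_change), default=1)


def translation_status(doc, language):
    """Return 'missing', 'out-of-date' or 'up-to-date' for one language."""
    versions = doc.variants.get(language, [])
    if not versions:
        return "missing"
    # Derive the variant-level value from the last version that carries it.
    synced = [v.synced_with for v in versions if v.synced_with is not None]
    if not synced or synced[-1] < reference_sync_target(doc):
        return "out-of-date"
    return "up-to-date"


doc = Document("123", "en", {
    "en": [Version(1), Version(2, major_change=False), Version(3)],
    "fr": [Version(1, synced_with=1)],
    "nl": [Version(1, synced_with=1), Version(2, synced_with=3)],
})
report = {lang: translation_status(doc, lang) for lang in ("fr", "nl", "de")}
```

With this information stored, a report of missing and out-of-date translations is a simple loop over documents and languages.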
Content related functionality
-
support special markup to indicate non-translatable content (pre blocks, ...).
-
This might also be rule-based: all pre-blocks with class 'sourcecode'
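A rule such as "all pre-blocks with class 'sourcecode'" could be applied when extracting the text to translate. The sketch below uses Python's standard `html.parser` module. The rule list `NO_TRANSLATE_RULES` and the function `translatable_text` are hypothetical names, and the input is assumed to be well-formed HTML.

```python
from html.parser import HTMLParser

# Hypothetical rule set: (tag, required class, or None for "any class").
NO_TRANSLATE_RULES = [("pre", "sourcecode")]


class TranslatableTextExtractor(HTMLParser):
    """Collects text, skipping elements matched by NO_TRANSLATE_RULES."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a non-translatable element
        self.skip_tag = None
        self.chunks = []

    def _matches(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return any(tag == t and (c is None or c in classes)
                   for t, c in NO_TRANSLATE_RULES)

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            # Track nesting of the same tag so we resume at the right end tag.
            if tag == self.skip_tag:
                self.skip_depth += 1
        elif self._matches(tag, attrs):
            self.skip_tag = tag
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if self.skip_depth and tag == self.skip_tag:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def translatable_text(html):
    """Return the text chunks of `html` that should go to the translator."""
    parser = TranslatableTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Keeping the rules in a list rather than in code would let a site administrator add further cases without touching the extractor.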
Editing related
These items are only relevant when doing translations “in-house”, without use of dedicated tools.
-
split-screen view of the translation and the reference language
-
when saving the source language, ask for the type of changes: typo fixes (don't make the translations out-of-date), real content changes (require translations to be updated)
-
when saving a translation, ask to what extent it has been brought up-to-date
-
[probably goes too far to support this directly in Daisy] support for segment-based editing, with automatic translation based on previous version and/or translation memory.
-
add markup for non-translatable content
-
spell check, aware of language markup
-
consult translation memory, terminology database, ... : should be made possible by extension infrastructure
Publishing related
-
auto-fallback between languages: if a document is not available in language A, show it in language B, then language C, ...
-
should this fallback also happen on document includes?
-
determining the optimal language for a visitor and redirecting to appropriate site. Possibly switching the GUI language together with the content language.
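One possible way to combine both ideas is sketched below: the visitor's preferences are read from the HTTP Accept-Language header, and the site's fallback chain is tried afterwards. The function names are made up for illustration, the header parsing is deliberately simplified, and nothing here reflects how Daisy actually resolves languages.

```python
def preferred_languages(accept_language):
    """Parse an Accept-Language header into language codes, best first."""
    ranked = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        lang, _, params = piece.partition(";")
        q = 1.0  # default weight when no q-value is given
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        ranked.append((q, lang.strip().lower()))
    # list.sort is stable, so entries with equal weight keep header order.
    ranked.sort(key=lambda item: -item[0])
    return [lang for q, lang in ranked if q > 0]


def pick_language(available, accept_language, fallbacks):
    """Pick the language variant to show, or None if nothing is available.

    available: set of language variants that exist for the document
    fallbacks: the site's fallback chain (language B, then language C, ...)
    """
    candidates = []
    for lang in preferred_languages(accept_language):
        candidates.append(lang)
        candidates.append(lang.split("-")[0])  # "nl-be" also accepts "nl"
    for lang in candidates + list(fallbacks):
        if lang in available:
            return lang
    return None
```

For example, `pick_language({"en", "fr"}, "nl-BE, nl;q=0.9, fr;q=0.5", ["en"])` selects "fr", because no NL variant exists and FR ranks above the site fallback.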
Other issues
-
language-independent fields. E.g. fields containing dates or numeric values, or using a selection list with labels for different languages
-
navigation tree translation: one tree definition with labels in multiple languages, or multiple language-variants of the tree?
-
language-specific behavior fulltext index
-
documentation: we need a document describing how to do a multilingual setup: defining the languages, setting up sites (one per language), collections (one for all languages), ...
Existing standards, terminology, technologies
A very short overview of some of the basic technologies used by the translation industry. Much more detailed information can be found on the Web.
Documents to be translated are split into 'segments', which are the pieces of text to be translated. Often a segment corresponds to one sentence. Some of the issues involved are finding out where a sentence starts and ends, and handling sentence-crossing markup. A related standard is SRX.
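As an illustration of why sentence boundaries are not trivial, here is a naive segmenter in Python. It splits after sentence-ending punctuation and then re-joins pieces that were cut after a known abbreviation. The abbreviation list is a made-up example, markup is not handled at all, and an abbreviation that really ends a sentence is joined incorrectly; real tools use much richer rule sets.

```python
import re

# Hypothetical exception list; a real rule set would be much longer.
ABBREVIATIONS = ("e.g.", "i.e.", "etc.", "Mr.", "Dr.")


def segment(text):
    """Split plain text into sentence-like segments."""
    candidates = re.split(r"(?<=[.!?])\s+", text.strip())
    segments = []
    for piece in candidates:
        if not piece:
            continue
        # Re-join when the previous piece ended in a known abbreviation.
        if segments and segments[-1].endswith(ABBREVIATIONS):
            segments[-1] += " " + piece
        else:
            segments.append(piece)
    return segments
```

Whatever rules are used, they have to stay the same between runs, otherwise previously translated segments no longer line up with the new ones.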
Translators translate the segments. The translated segments are stored in a database, called the translation memory. When a new document needs to be translated, the translation can be automatically prepared by consulting the translation memory for identical or similar segments that were previously translated. This also shows the importance of having a stable segmentation process.
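The lookup for identical or similar segments can be sketched with Python's standard `difflib` module. The class `TranslationMemory` and the threshold of 0.75 are assumptions for illustration only. Real translation memories use their own matching algorithms and indexes rather than a linear scan.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Toy in-memory translation memory for one language pair."""

    def __init__(self):
        self._entries = {}  # source segment -> translated segment

    def add(self, source, target):
        self._entries[source] = target

    def lookup(self, segment, threshold=0.75):
        """Return (translation, score) for the best match, or None.

        score is 1.0 for an identical segment and lower for similar
        ("fuzzy") matches; matches below `threshold` are discarded.
        """
        best = None
        for source, target in self._entries.items():
            score = SequenceMatcher(None, segment, source).ratio()
            if best is None or score > best[1]:
                best = (target, score)
        if best is not None and best[1] >= threshold:
            return best
        return None


tm = TranslationMemory()
tm.add("Save the document.", "Enregistrez le document.")
exact = tm.lookup("Save the document.")   # identical segment
fuzzy = tm.lookup("Save the documents.")  # similar segment
```

An identical match can be inserted automatically, while a fuzzy match is only a suggestion that the human translator still has to review.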
Since the translation memory is an important asset, it is important that it stays in the hands of the customer so that it can be reused as one moves from one translation agency to another. So there needs to be some standard for translation memories; one is TMX.
Other things:
-
XLIFF: interchange format for translation work. Contains segmented content, one or more translations, skeleton into which to insert the translated content to get back the original file.
-
W3C ITS (http://www.w3.org/International/its/): i18n related markup
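To give an idea of what an XLIFF file looks like, the sketch below builds a minimal document with Python's standard `xml.etree.ElementTree`. The element and attribute names follow XLIFF 1.2 as we understand it. The skeleton is left out, the function name `build_xliff` is made up, and the output has not been validated against the XLIFF schema.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def build_xliff(original, source_lang, target_lang, segments):
    """Serialize segments to a minimal XLIFF 1.2 string.

    segments: list of (source_text, target_text_or_None) pairs
    """
    ET.register_namespace("", XLIFF_NS)  # emit XLIFF as the default namespace
    root = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(root, f"{{{XLIFF_NS}}}file", {
        "original": original,
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    for index, (source, target) in enumerate(segments, start=1):
        unit = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit",
                             {"id": str(index)})
        ET.SubElement(unit, f"{{{XLIFF_NS}}}source").text = source
        # Untranslated segments simply have no <target> element yet.
        if target is not None:
            ET.SubElement(unit, f"{{{XLIFF_NS}}}target").text = target
    return ET.tostring(root, encoding="unicode")


xml_text = build_xliff("doc-123.txt", "en", "fr", [
    ("Save the document.", "Enregistrez le document."),
    ("Close the editor.", None),
])
```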
Typical translation workflow
There are a few possible scenarios depending on:
-
do the translators work directly in Daisy, or do they use their own systems? The first case will mostly be used for smaller document-bases where the translations are done without help of external translators. The second case requires exporting the content and providing it to the system of the translator, and importing it again after the translation is finished.
-
do translations happen continuously as documents are created, or do translations happen in batch, e.g. once every few months? The first is useful for normal websites, the second for product manuals that only need to be updated for each release.
Scenario: use of external translation agency + do translations in batch:
-
Content in reference language is written in Daisy
-
Content is exported from Daisy to a set of files
-
if Daisy has knowledge about which translations are still up-to-date, the export can be limited to the files that actually need translating, reducing the translation cost.
-
cheap solution: this could be based simply on the last modification date? Thus if we know when translations were last submitted, simply check for files changed since then. (assuming the previous translation was complete)
-
Files are provided to translation agency (via a webservice, FTP upload, email, ...)
-
workflow within the translation agency: files are segmented, translation memory is used to prepare the translation, human translator finishes the work.
-
Get back translated files, import them into Daisy
In this scenario, there is no need for use of workflow within Daisy, except maybe for review. There only needs to be a person responsible for triggering the export to the translation agency and doing the import when the results come back.
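The "cheap solution" for limiting the export could look like the following Python sketch. It assumes we can obtain, for each document, the last modification time of its reference variant, and that the previous translation round was complete. The function name and the input format are made up for illustration.

```python
from datetime import datetime


def select_for_export(documents, last_submitted):
    """Return the IDs of documents changed since the last submission.

    documents: iterable of (doc_id, last_modified) pairs for the
               reference variant
    last_submitted: when files were last sent to the translation agency
    """
    return sorted(doc_id for doc_id, modified in documents
                  if modified > last_submitted)


documents = [
    ("10", datetime(2024, 1, 5)),
    ("11", datetime(2024, 3, 20)),
    ("12", datetime(2024, 4, 2)),
]
to_export = select_for_export(documents, datetime(2024, 3, 1))
```

Note that this approach also re-exports documents that only received typo fixes, since the modification date alone cannot tell those apart from real content changes.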
Scenario: translation work is done in Daisy (continuously or in batch):
-
content is written in reference language
-
translation work can be triggered in multiple ways:
-
author or another person starts workflow task when a translation is desired
-
pull: translator checks page with translation work to be done (missing & out-of-date translations). This can be enough for simple cases where the work does not need to be divided between multiple persons.
-
push: a process regularly queries for missing translations and creates workflow tasks
-
How do we know which translation work needs to be done? There are a number of scenarios:
-
One scenario is that an article gets written just once, a translation workflow is started, and once that's done, everything is finished. If the source variant changes, it is the responsibility of the persons making those changes to start new translation workflows.
-
Another scenario is that we query the system for missing or out-of-date translations. Translators could either check for translation work, or a process could run and create translation workflows (see triggers above).
A problem is that we need to know if translations are outdated. For this we need to keep special meta data, as described in the 'management related' topics above.


