Representing the Original

When we encode documents with TEI-XML, we are more concerned with content than appearance. Indeed, one of the benefits of using XML is that it separates content from how that content will ultimately be presented. 

However, for the purposes of this project, we are interested in recording the appearance, or formatting, of the original document itself, to the extent we are able and within certain technical and practical limitations.

We use the elements and attributes on this page to describe how things appear in the original, not how we believe they should be shown in any eventual rendering of the XML document.

We attempt through the "transcription" view on our TEI-Boilerplate site to replicate the features of the original document in our output. We eliminate much of this in the "reading version" view, where our formatting is regularized and driven more by content than by the incidental facts of the original.

To see how many of the elements on this page play out in our interface, view this demonstration document. Switch back and forth between the "transcription" and "reading version" views to compare.

A. Structural elements

1. Page break

<pb n="1" facs="../images/ew_a1_342_001.jpg"/>

The value of the attribute n should be the page number. If there is no explicit pagination in the document, you can impose it here, numbering sequentially from 1.

The value of the attribute facs should be the filename of the corresponding image, if we have permission to use it in our edition (please ask Dr. McCarl if you are not sure). The URL you see here is a relative path to the "images" folder in our TEI-Boilerplate website. If the image is already online elsewhere and we have permission to use it, this can also be an absolute URL to that location.
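
As a rough sketch of the two rules above, the following shows a document with no printed page numbers, so the pages are simply numbered from 1. The image file names and the example.org address are invented for illustration; the second facs value only shows what an absolute URL would look like.

```xml
<!-- File names and URL below are hypothetical placeholders. -->
<pb n="1" facs="../images/sample_letter_001.jpg"/>
<p>Text of the first page.</p>

<pb n="2" facs="https://example.org/images/sample_letter_002.jpg"/>
<p>Text of the second page.</p>
```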

2. Paragraph break


<p></p>

encloses a paragraph. Putting the opening and closing tags on their own separate lines can make solving problems later much easier:

<p>
   This is the text of a paragraph.
</p>


As noted below, the style attribute can be used to indicate that a paragraph is indented in the original:

<p style="indent">

3. Line break


<lb/>

marks the place where a line break occurs in the transcription. When a word is divided across the line, we use the break attribute to indicate this:

<lb break="no"/>

Note: To avoid creating an empty space in the published transcription (this is a defect of our current interface), we continue our transcription on the same line in the editor when we use <lb break="no"/>. For instance:

<p>
   This document has several<lb/>
   lines of text. Most lines end<lb/>
   neatly at the end of a word,<lb/>
   but this line, in con<lb break="no"/>trast, does not.<lb/>
</p>

If "contrast" were hyphenated in the original, you would not need to transcribe the hyphen here.

4. Column break


<cb/>

marks the beginning of a column on a page that has more than one column. See this example.

The text will not display in multiple columns in the transcription view, but a token will be inserted into the text indicating each time a new column begins. 
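
Here is a minimal sketch of a page with two columns, assuming the column break element <cb/> is placed immediately before the first line of each column. The wording of the text is invented for illustration.

```xml
<p>
   <cb/>
   The first column of the page<lb/>
   is transcribed here.<lb/>
   <cb/>
   The second column of the page<lb/>
   is transcribed here.<lb/>
</p>
```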

5. Openings and Closings of Letters

Many of the documents we are editing are letters, which generally have openings and closings. These elements are both structural and semantic, but we'll put them in this section. 

We encode these using <opener> and <closer>, respectively.

These elements do not go inside <p> or <head> elements. These do not need to appear in a <div>, but placing each in its own <div> may simplify matters when trying to manipulate them in terms of layout. Also, <opener> is assumed to be the first element in a <div>. Therefore, if the first page of your document is an image of an envelope (this is the case with many of the letters provided by UNF--see also the next section below), you should put that envelope within its own <div></div> on page 1, and then start a new <div> on page 2, which will begin with the opener (you may need to adjust this scheme, depending on the actual facts of your document, of course).

Openers gather together elements including <dateline> (which can contain <date> and <address>, <name>, etc.) and <salute> (which will often include <name>).

Closers generally include <salute> and <signed> (which will often include <name>).

Here are examples of how to mark up the opening and closing of a letter:

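   <opener>
      <dateline>
         <name type="place" subtype="city"></name>
         <date when="1900-01-01">
            Jan. 1, 1900
         </date>
      </dateline>
      <salute>
         My dear friend,
      </salute>
   </opener>

   <closer>
      <salute>
         Your sister in Christ.
      </salute>
      <signed>
         <name type="person">
            Sarah Best
         </name>
      </signed>
   </closer>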

6. Envelopes

Often the images of the corresponding envelopes are included with those of a letter itself. We will consider an envelope a "page" in the corresponding XML document. TEI-XML P5 does not appear to contemplate this situation, so just encode each address block (sender, receiver) as a <p>, with <lb/> at the end of each line.  

If there is a postage stamp, please use <stamp> with a value of "postage_stamp" for type. Embed a <figure> element, with a nested <figDesc> element containing a description of the stamp's visual appearance. See this example:

   <stamp type="postage_stamp">
      <figure>
         <figDesc>
            Stamp representing the Statue of Liberty.
         </figDesc>
      </figure>
   </stamp>

If you can read any of the text that was impressed onto the envelope when the stamp was canceled, please include another <stamp> element (we'll represent the physical stamp and the cancellation of it as two separate <stamp> elements). Use a value of "postmark" for type, and transcribe the contents, as follows:

   <stamp type="postmark">
      <name type="place">Patterson, N.J.</name>
      <date when="1934-08-01">Aug. 1, 1934</date>
   </stamp>

If your document also has an opener and/or a closer, you'll need to have two <div> elements in your document. One will contain the text corresponding to the envelope, and the other will contain the opener, the text of the letter, and the closer.
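
The overall shape of such a document might look like the sketch below. The image file names and the address lines are placeholders, and placing each <pb/> immediately before its <div> is an assumption, chosen so that <opener> remains the first element in its <div>. Adjust to the actual facts of your document.

```xml
<body>
   <!-- Page 1: the envelope, in its own div. File names are placeholders. -->
   <pb n="1" facs="../images/sample_letter_001.jpg"/>
   <div>
      <p>
         Name of recipient<lb/>
         Street address<lb/>
         City and state<lb/>
      </p>
   </div>

   <!-- Page 2: the letter itself, beginning with the opener. -->
   <pb n="2" facs="../images/sample_letter_002.jpg"/>
   <div>
      <opener>
         <salute>My dear friend,</salute>
      </opener>
      <p>
         Text of the letter.<lb/>
      </p>
      <closer>
         <signed>
            <name type="person">Sarah Best</name>
         </signed>
      </closer>
   </div>
</body>
```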


B. Representing layout/formatting of original document

For the purposes of thinking about layout and formatting, we will divide text up into two types: Headings and Body Text.

1. Headings


A <head> element has to be contained inside a <div>, as in this example:

<div>
   <head>Section 1</head>
   <p>This is the text of section 1.</p>
</div>

We consider a heading to be the initial title of a document, or the title of a sub-section. Headings are most often present in newspaper articles, monographic works, pamphlets or other such formal or structured writing. Letters generally will not contain headings.

In the transcription view within our interface, we display different levels of headings in a standardized fashion: an initial heading (<body><head>) will be extra-large, a first-level subheading (<body><div><head>) will be large, a second-level subheading (<body><div><div><head>) will be medium, and a third-level subheading (<body><div><div><div><head>) will be small. Each subheading must be within its own nested <div>, which will also contain all of the body text corresponding to that subsection.
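
The nesting described above can be sketched as follows. The heading and paragraph text is invented for illustration, and the comments note the display size we would expect for each level.

```xml
<body>
   <head>Title of the document</head>              <!-- extra-large -->
   <div>
      <head>First-level subheading</head>          <!-- large -->
      <p>Body text of this subsection.</p>
      <div>
         <head>Second-level subheading</head>      <!-- medium -->
         <p>Body text of this subsection.</p>
         <div>
            <head>Third-level subheading</head>    <!-- small -->
            <p>Body text of this subsection.</p>
         </div>
      </div>
   </div>
</body>
```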

To indicate the positioning in the original, we use a value of "left," "center," "right" or "justify" for the style attribute on the respective <head> element(s), as in:

<head style="left">
<head style="center">
<head style="right">
<head style="justify">

2. Body text

We consider body text to be any text that is not a heading. In our documents, such text is generally contained in a paragraph (<p>), an <opener> or a <closer>.

a. Indentation

The first line of a paragraph is assumed to begin at the left margin. To indicate instead that the first line is indented, we use a value of "indent" for the style attribute on the <p> element, as in <p style="indent">.

b. Alignment of text

We assume the text in paragraphs to be left-aligned. To indicate otherwise, we use a value of "center" or "right" for the style attribute on the <p>, <opener> or <closer> element (or on <div>, or other elements), as in:

<p style="center">
<p style="right">

c. Horizontal positioning of openers/closers of letters

We assume that these begin at the left margin.  When that is not the case, use the style attribute to indicate the approximate positioning, according to the following options: 

<opener style="one-fourth">
<opener style="mid-page">
<closer style="three-fourths">

d. Superscript

<hi style="superscript"></hi> 

encloses text that is raised in the original.
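
For instance, a date written with a raised ordinal suffix might be transcribed as in this small sketch (the text is invented for illustration):

```xml
<p>
   We arrived on the 1<hi style="superscript">st</hi> of January.<lb/>
</p>
```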

e. Italics and underlining

<hi style="italics"></hi>

<hi style="underline"></hi>

enclose text that is italicized or underlined in the original.

f. All caps

If you encounter text in all caps in the original, transcribe it in the case in which it should appear in the reading version (sentence case, or title case, as appropriate), and enclose it within the following:

<hi style="allcaps"></hi>

For example:

<hi style="allcaps">
Indigent Hospital Patients
</hi>

(This text appears in the original as "INDIGENT HOSPITAL PATIENTS.")

g. Non-contiguous blocks of text

Sometimes you may find that the text in a document jumps around on a page, or even between different pages of a document. This can present a challenge when it comes to transcription, particularly when we're trying to represent the text as it appears on the page as closely as possible.

Here is my recommendation:

Transcribe the text as a continuous flow of text, but each time the text jumps to a new place, use <note type="transcription"> to explain what is happening. Such notes will only appear in the "Transcription" view. For more information, see Doubts, Comments, Annotations and Additions.

Here's an example of what that might look like:


   [...] were going downtown<lb/>

   <note type="transcription">The following text is written sideways into the top margin of the page</note>

to the movies [...] <lb/>

C. Deletions, Insertions and Changes of Hand

1. Struck text

<del type="strikeout"></del>

encloses any material that has been crossed out.

<del type="overwritten"></del> 

encloses any material that has been written over.  

2. Added text

<add type="caret"></add>

encloses any material that has been added above the line, or between lines, with a caret or other mark indicating its point of insertion in the original.

<add type="no_caret"></add>

encloses any material that’s been added, with the point of insertion not indicated (meaning you’ll have to make an educated guess about where it goes).
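
A correction in the original, where one word is struck out and its replacement is written in above the line, might be encoded along the lines of this sketch. It combines <del> from the previous section with both kinds of <add>; the sentences are invented for illustration.

```xml
<p>
   We left for the <del type="strikeout">city</del> <add type="caret">coast</add> on Monday<lb/>
   and arrived <add type="no_caret">safely</add> that same night.<lb/>
</p>
```

In the second line, the writer gave no mark for where "safely" belongs, so its position reflects the transcriber's best guess.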

3. Marginal notes


<note></note>

can be used to transcribe any material you find hand-written into the margin of a document, using the place attribute to indicate which margin. We use a value of "authorial" for the type attribute in order to differentiate these notes from our own editorial annotations, which will also be done with the <note> element.

<note type="authorial" place="marginLeft"></note>
<note type="authorial" place="marginRight"></note>
<note type="authorial" place="marginTop"></note>
<note type="authorial" place="marginBottom"></note>

If the note is in the left or right margin, you should locate this element as close as possible to the place to which it corresponds in the top-to-bottom flow of the text. 
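
As a sketch of that placement, a short note written in the left margin beside the second line of a paragraph could sit just before that line in the transcription. The text of the paragraph and of the note is invented for illustration.

```xml
<p>
   The meeting was held on Tuesday evening<lb/>
   <note type="authorial" place="marginLeft">check this date</note>
   and was well attended by the members.<lb/>
</p>
```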

4. Changes of hand


<handShift/>

can be used to mark a place in the text where we see a change in handwriting (as in, for instance, when someone other than the primary author is writing in the document).

It can also be used to mark a place where one writer switches mediums, such as from pencil to pen. In this case, the medium attribute can be used to explain the shift. For more information see this topic in the TEI-XML P5 guidelines.
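
A change of medium might be marked as in the sketch below, assuming the TEI P5 <handShift/> element is the one used for this purpose. The value "pencil" and the text are illustrative only.

```xml
<p>
   I will write again when I have more news.<lb/>
   <!-- From this point on, the original is written in pencil. -->
   <handShift medium="pencil"/>
   The children send their love.<lb/>
</p>
```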

D. Gaps in the text

<gap reason="page missing from document"/> 

can be used to mark a place where there is an unrecoverable gap in the original, using the reason attribute to explain the circumstances. This will often have to do with damage to the original or missing pages. If the gap is small, and you believe you can accurately deduce the missing letters or words, see Material you add yourself on the page Doubts, Comments, Annotations and Additions.
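
Within running text, a gap caused by damage might look like this sketch. The wording of the reason attribute and the surrounding text are invented for illustration.

```xml
<p>
   We hope to see you again in the<lb/>
   <gap reason="lower corner of page torn away"/>
</p>
```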

E. Special characters

Some characters have to be transcribed in XML with character references. These include:

° (the degree sign, as in 98° Fahrenheit), which must be represented in your text as follows:

&#176;

& (the ampersand, meaning “and”), which must be represented as follows:

&amp;
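
In running text, these references might look like the sketch below. The sentences are invented for illustration, and &#176; is used on the assumption that a decimal numeric reference is acceptable for the degree sign; &amp; is the standard XML reference for the ampersand.

```xml
<p>
   The thermometer read 98&#176; at noon.<lb/>
   We bought our supplies at Smith &amp; Sons.<lb/>
</p>
```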