Subject: Center for Electronic Texts in the Humanities
From: Stephen Ferguson <0629212@PUCC.BITNET>
Date: Mon, 14 Sep 1992 14:24:37 EDT
Message-id: <"IY19z1.0.K54.MOBCn"@sul2>
Sender: Rare Books and Special Collections Forum <EXLIBRIS@RUTVM1.BITNET>
I thought this might interest some Exlibris readers. It's forwarded
from the Text Encoding Initiative bulletin board --
------ --------- ----------- -------------- ------------- --------
A Report on CETH Seminar on Textual Analysis
Princeton University
August 9-21, 1992
The Text Encoding Initiative (TEI)
was prominently featured at the first seminar on textual analysis
sponsored by the Center for Electronic Texts in the Humanities
(CETH). CETH was established in late 1991 by Rutgers and Princeton
Universities to act as a central organization to assist in the
creation, dissemination and use of electronic texts in the humanities.
In addition to creating an inventory of machine-readable texts and
making them available through the Internet, the Center is
committed to offering educational seminars on various aspects of
electronic texts.
The two instructors were Susan Hockey, CETH Director, and Willard
McCarty, Assistant Director of the University of Toronto's
Centre for Computing in the Humanities. Susan chairs the TEI Steering
Committee, and was formerly the Director of the United Kingdom's
Computers in Teaching Initiative (CTI) Centre for Textual Studies,
located at Oxford University. Willard is a member of the TEI Verse
work group, the founding editor of the _Humanist_, and is currently
working in the area of classical studies, in particular on Ovid's
_Metamorphoses_.
An international group of librarians, literary, linguistic and social
science scholars, and computer and information scientists made up the
class. Librarians and library graduate students from the Association
of Research Libraries and from universities at Arizona State, Columbia,
Indiana, Iowa, Manitoba, Maryland, NYU, Princeton, Rutgers, Texas, and
Wesleyan attended. Literary scholars and students from Spain,
Virginia, New York State and Wooster, Ohio ranged in their specialties
from Old English to _Piers Plowman_ and modern English and Russian
fiction. Linguistic scholars from England and Canada were working in
computational linguistics and discourse analysis. Social scientists
from Israel, Missouri, and Illinois brought backgrounds in the history
of Judaism and Zionism, Sri Lanka, modern Western social theory, and
U.S. urban development; a Princeton art historian brought a background
in the Princeton Cyprus expedition. Computer scientists and a
mathematician from
Rutgers and Wisconsin brought a familiarity with higher level
programming techniques and interests in analyzing literature.
The seminar provided historical information on electronic texts,
including their development in the U.S., Europe, and elsewhere.
Existing resources such as ARTFL, the Dante Database, the Thesaurus
Linguae Graecae, and the Oxford English Dictionary Version II were
described. Robert Hollander, Professor of Comparative Literature at
Princeton and Dante Database creator, demonstrated the Dante Database. Toby
Paff and Hannah Kaufmann of Princeton's Humanities Computer Center
demonstrated ARTFL and the OED2. The need for additional effectively
structured online dictionaries was expressed.
Other electronic texts were made available for individual
perusal (Intelex's _Pastmasters_, Georgetown's _The Phenomenology of
the Mind_). Each of these pioneer projects includes textual analysis
software with which to analyze the text; they are not aimed at the
casual browser, in part due to copyright restrictions.
A number of issues were identified as of continuing concern: the need
for collaboration in the creation of electronic texts, ample space for
their storage, their easy retrieval, and widespread access to
texts. Better user interfaces, improved presentation of individual
and parallel texts, hypertext (see below), and dynamic, graphic
displays were also deemed desirable.
Susan and Willard reviewed two textual analysis programs, one
public domain and the other proprietary -- TACT and MICRO-OCP. Their
common features include the creation of alphabetical frequency
lists of all words, concordances (all the occurrences of a word or
phrase, in context), and collocations (co-occurrences of words and
phrases). Susan described several studies using stylistic analysis --
Mosteller and Wallace's study of _The Federalist Papers_, Morton's study
of Greek texts and their disputed authorship by St. Paul, Kenny's work
on _The Aristotelian Ethics_, and Burrows on Jane Austen. We also
explored the statistical tests used to summarize the findings in these
studies.
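For readers unfamiliar with these three outputs, here is a minimal
sketch in Python. It is not how TACT or Micro-OCP work internally; the
function names, the sample sentence, and the window sizes are all
assumptions made for illustration only.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens):
    """Alphabetical list of (word, count) pairs."""
    return sorted(Counter(tokens).items())

def concordance(tokens, keyword, width=2):
    """Keyword-in-context lines: each hit with `width` words on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def collocates(tokens, keyword, span=2):
    """Count the words found within `span` positions of each keyword hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - span):i])
            counts.update(tokens[i + 1:i + 1 + span])
    return counts

# Hypothetical sample text, invented for this illustration.
sample = "The cat sat on the mat and the dog sat on the rug"
tokens = tokenize(sample)

for word, count in frequency_list(tokens):
    print(f"{word:6} {count}")
for line in concordance(tokens, "sat"):
    print(line)
print(collocates(tokens, "sat").most_common(3))
```

A real study would also have to settle questions this sketch ignores,
such as lemmatization, punctuation, and how wide a context to count as
a co-occurrence.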
Beyond stylistic analysis, we looked at linguistic and lexical
analysis. Means of using TACT to undertake simple analyses of this
type were described. Linguistic and lexical analysis are important for
studying language and developing printed and electronic dictionaries.
Of even greater significance is their potential for improving
information retrieval. As the rules of language are systematized in a
manner that computers can understand, computers can apply these rules
in interpreting new textual material. The complexity of the task was
revealed in the demonstration of a program to automatically parse
several sentences. It was successful with one sentence, but completely
fell apart when faced with a particularly ambiguous phrase. (By the
end of the workshop, we all were freely talking about the difficulty
of "disambiguating" words.) Much additional development in the area
of automated recognition and analysis of "fuzzy" matches, names,
concept relations and figures of speech such as metaphors was
desired.
Computer assistance in creating critical editions was explored.
Those interested in this topic had the opportunity to try out the
Collate program prepared by the chair of the TEI Text Criticism work
group, Peter Robinson.
Susan presented the TEI to the participants, many of whom were familiar
with its general principles. TEI's advantages were described as its
transportability across different platforms, ease of sharing texts and
their analyses, and superior analytical tools. Some of those present
expressed reservations about the labor intensiveness of marking up
texts, and the desire to analyze a "clean" text free of the
interpretation implicit in any mark-up system. A major constraint is
the lack of existing software with which to ease the mark-up process
and to exploit the mark-up for analysis. Such software is being
developed or is used for selective applications. PAT takes advantage
of the OED's SGML mark-up, while Dynatext, which was demonstrated to
the group, uses SGML to create the links in its hypertext electronic
books. These applications are presently too limited or too expensive
for general use, and much additional effort is needed in this area. In
spite of the reservations expressed, the need for a standard means of
encoding and sharing texts seemed to be accepted.
About half of us had brought texts to analyze using these tools.
Afternoons and late evenings in the dormitory basement were devoted to
this task. Texts treated included the poetry of Canadian Margaret
Avison, "Piers Plowman" (B), Shakespeare's tragedies, "My Dinner With
Andre," classified ads from modern British newspapers, English
translations of French and Egyptian fiction, 15th century Russian
chronicles, Durkheim's works, the diary of Robert Knox (a 17th century
British sea captain's son imprisoned on Ceylon), and an early issue of
_The Catholic Worker_, a progressive activist Catholic newspaper. One
student created a program for Latin morphological analysis (and taking
a cue from Julius Caesar, proposed that Latin be adopted as Europe's
common language).
The projects aptly demonstrated the challenges involved in analyzing
electronic texts. In some cases, the difficulty lay in creating or
obtaining access to an electronic text. Several attempts at scanning
were unsuccessful, especially on older books such as the _History of
the British Royal Society_. Where technology was not a problem,
obtaining publishers' approval to convert copyrighted texts such as
_The Book of Mormon_ and _Lolita_ was.
Stylistic analysis required choosing characteristic features of style.
For example, Nabokov's language in _Pale Fire_ was compared to the
poetry of Alexander Pope and Robert Frost which it parodied, raising
questions about the appropriateness of semantic and lexical analysis as
the basis of comparison. An interesting study of professional and
amateur English translations of French authors Theophile Gautier and
Eugene Sue for "stylistic fingerprints" included too small a sample
from which to draw conclusions. A successful outcome occurred when a
student studying Egyptian short stories and their translations found that
TACT and Micro-OCP speeded up the analysis he had begun years before
without these tools.
Conceptual and thematic analysis called for hard decisions about the
relationship between complicated concepts and words and brief phrases
(the basic units of TACT and Micro-OCP). A study of the use of the
words "sin", "redemption," and "atonement" in _The Book of Mormon_
revealed interesting information about the Mormons' connection of these
concepts. Willard's description of his work with Ovid's
_Metamorphoses_ augmented by his explanatory article in the _Tact
Exemplar_ demonstrated how TACT could be used to unveil important
themes. Although the results of these analyses were quite interesting,
the need for additional development of analytical and auxiliary tools
was widely agreed upon.
Don Walker, a member of the TEI Steering Committee and chair of the
Association for Computational Linguistics (ACL), described the
extensive work being done worldwide with electronic texts, especially
in linguistics and lexical analysis. The ACL Data Collection
Initiative is amassing electronic transcriptions of written and spoken
English, a portion of which is available on CD-ROM. The Network of
European Corpora is developing standards to guide the individual
European nations in the creation of language corpora. The Consortium
for Lexical Research and Linguistic Data Consortium have formed to
enhance cooperation among the many projects in progress.
Hypertext, the newest frontier in electronic texts, was discussed
and debated. Its advantages in integrating different sources of
information were acknowledged. Its effect on the behavior of student
and scholar is not yet understood, however. Will it stimulate the
student or scholar to investigate sources other than those included
in the hypertext package, or create the perception that the most
important sources are included in the package?
Elli Mylonas, chair of the TEI Performance Texts work group, gave a
presentation on Pandora, a new text retrieval program she and others
have developed to search the Thesaurus Linguae Graecae, and on Perseus.
Perseus, developed by a consortium of universities and located at
Harvard, is a multi-media educational Macintosh product that includes
Greek/English texts from the classical period, a Greek/English lexicon,
a classical encyclopedia, and a wealth of photographs of artifacts and
sites.
Elli also demonstrated two types of electronic hypertext fiction. The
first type is represented by Voyager Company's books, which tend to
treat text in a traditional manner, although they include analytical
and note-taking tools for those who want to analyze Sara Paretsky and
the like. The second, Story Space fiction, was created explicitly for
the electronic medium and uses the interweaving allowed by hypertext as
part of its literary strategy.
Ann Okerson of the Association of Research Libraries (ARL) described
ARL's efforts in exploring the extent and advantages of electronic
journals, newsletters, and bulletin boards. Scholarly communication
has speeded up with the advent of the computer, and collaboration has
become a greater possibility with the ease of the electronic medium.
The ARL has produced the "Directory of Electronic Journals, Newsletters
and Academic Discussion Lists," and Ann described her greater
appreciation of publishers' efforts after completing this project.
Finally, Andreas Bjorklind of Sweden offered a presentation on
Wide Area Information Servers (WAIS), the new communication system
which allows individuals to search and retrieve electronic databases
across the world.
The seminar offered a great variety of information about electronic
texts and textual analysis, as well as a relaxed setting in which to
study. It was enlightening to learn how many universities and
libraries are already involved in offering and analyzing electronic
texts. Many of the scholars attending were involved in establishing
humanities computing centers or services within their institutions,
libraries or departments. The professional and personal relationships
established, the understanding gained of textual analysis techniques,
and the appreciation of the need for additional hardware and software
for more sophisticated analysis were the highlights of the session.