The orthography of machine-readable Neolatin texts: A plaidoyer for minimal intervention

Coping with orthographical diversity: Designing a search machine

Standardization is needed for many tasks. It is the advantage of an electronic text that standardization need no longer be the duty of the editor. It can be delegated to a middle layer between text and user, where the user's question is transformed into a query adequate for the text. The task of the editor of an electronic text can then be to empower the reader to make informed choices. We should not disenfranchise the user by preempting decisions about the text.

One obvious use is searching. Google, while meritorious and used by us all, is unsuited to searches in heavily inflected languages such as Latin. It neither indexes long texts completely nor, contrary to a widely held belief, supports the wildcard (*) in search queries.

We need search algorithms which, for example, transform a search for cael- into one for coel-/cel-/cael-, and one for committ- into conm-/comm-; which correctly handle the ampersand, the æ and œ ligatures (if retained), i/j/y, u/v and gracia/gratia; and which cope with hyphenation and word breaks. Some such algorithms were developed for the NLW, more broadly for the Salutati-CDROM, and are implicit in Camena's reglat-tool.
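
How such a middle layer might work can be illustrated with a minimal sketch in Python. It is not the algorithm of the NLW, the Salutati-CDROM or Camena's reglat-tool; the rule list FOLDS and the functions fold, join_line_breaks and search are hypothetical names invented for this illustration. The idea is that both the query and every word of the untouched text are reduced to the same crude comparison key, so that the text itself never has to be standardized.

```python
import re

# Hypothetical folding rules, applied in this order. They are illustrative
# assumptions, not the rule set of any existing tool.
FOLDS = [
    ("æ", "ae"), ("œ", "oe"),   # ligatures, if the edition retains them
    ("&", "et"),                # ampersand
    ("j", "i"), ("y", "i"),     # i/j/y
    ("v", "u"),                 # u/v
    ("ae", "e"), ("oe", "e"),   # cael-/coel-/cel-
    ("conm", "comm"),           # conm-/comm-
]


def fold(word: str) -> str:
    """Reduce a spelling to a comparison key shared by its variants."""
    key = word.lower()
    for old, new in FOLDS:
        key = key.replace(old, new)
    # gracia/gratia: treat "ci" before a vowel like "ti".
    # This needs the following vowel, so it cannot fire at the very end
    # of a truncated query such as "graci-".
    return re.sub(r"ci(?=[aeiou])", "ti", key)


def join_line_breaks(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    return re.sub(r"-\s*\n\s*", "", text)


def search(prefix: str, text: str) -> list[str]:
    """Return the words of text, as spelled there, matching the prefix."""
    wanted = fold(prefix.rstrip("-"))
    # Words are runs of letters; a lone ampersand also counts as a word.
    words = re.findall(r"[^\W\d_]+|&", join_line_breaks(text))
    return [w for w in words if fold(w).startswith(wanted)]


sample = "In cœlo et in coe-\nlis; Caelum, celi. Conmittere & committo."
print(search("cael-", sample))
print(search("committ-", sample))
```

With the sample line above, the query cael- finds cœlo, coelis (rejoined across the line break), Caelum and celi, and the query committ- finds Conmittere and committo, each reported in the spelling of the text. Folding of this kind deliberately errs on the side of finding too much: cael- would also retrieve celer. A real search machine would therefore have to let the user switch individual rules on and off, which is precisely the kind of informed choice that should be left to the reader.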

Such a search machine could even be opened to texts spread over other sites, which follow wildly differing conventions and are now retrievable only by chance. Similar mechanisms could be applied to other tasks, allowing for a flexible and adaptive model of text retrieval.