12‐enhance‐titles
In this stage, we begin to move from minimum viable product towards a more advanced research interface. So far our list of article titles provides only the title itself and the publication date, but for a researcher that doesn’t take us very far. If we want to expose more information to make the archive more usable, we should think about the information we want to expose to a user, the order we want to impose on that information, and the way we want to present it. Here is a sample of the titles list we currently have:
And here’s a sample of the list we will create by the end of this stage:
This stage improves the range of information, although it is still far from complete with respect to presentation, and we’ll continue to develop it in subsequent lessons. The goal of building this Articles page over a series of branch stages is to demonstrate the kind of iterative development that any non-trivial project (and therefore any meaningful digital edition) undergoes. In our role as end-users we are accustomed to seeing digital editions as final and authoritative research objects, but there are typically many iterations a project must go through before reaching that stage, most of which a public user will never see.
To enrich the interface for the list of articles in this Stage we will add the following additional information (where “formatted” means “not copied verbatim from the source XML”):
- formatted title field
- article identifier
- publication title field
- formatted date field
- incipit field
We touched on why fields are useful when we first introduced them, but in this Stage we rely on them for a wider variety of reasons. First, for reasons we’ll explore below, using fields makes the information easier to reuse in other parts of the application. Second, some of the field values will become essential parts of the search feature in Stage 15. We can think of fields as reusable derived information that, insofar as it is reusable, can make the development of future features more streamlined.
Let’s review the implementation of fields before we jump back into editing them. Remember that we write the XQuery for fields in a separate file in the root of our application directory, collection.xconf. Databases, including eXist-db, achieve quick response time because they build persistent indexes that let them know where to retrieve information, so that they can go to that location immediately without having to look for it. This is similar to the way a back-of-the-book index works in a paper publication: instead of having to leaf through all of the pages looking for the topic you care about, you can find the topic quickly in a (sorted) alphabetized list that will direct you to the sorted (sequential) pages that mention the topic. The collection.xconf file lets you tell eXist-db what to index and how, where your decisions are based on what you want to be able to retrieve quickly. eXist-db uses indexes when it executes XQuery scripts, and because the fields you create become part of the indexes, they can also be retrieved quickly.
The collection.xconf file is part of your app, but when you install the app into a running instance of eXist-db your pre-install.xq file copies it into a different location, outside your app hierarchy, and builds the indexes as part of the installation process. This means that editing collection.xconf inside the app filespace won’t have any effect on your indexes; if you change collection.xconf you need to rebuild your xar file by typing ant at the command line and then reinstall your app into eXist-db. The most common way to install an xar file into eXist-db is to launch the eXist-db Package manager from the eXist-db Dashboard, but you can also perform the installation from the command line if you use the xst utility.
Below we provide step-by-step instructions for modifying collection.xconf to create the formatted title field. We encourage you to develop features like this iteratively, as we do below, and to test the data at the model and view level by making only one change at a time.
We know we want to write some XQuery that will format our titles so that definite and indefinite articles at the beginning of a title won’t interfere when we sort the titles alphabetically in a later Step. There are only three articles (“A”, “An”, and “The”) and we care about them only when they appear at the very beginning of the title. For example, “A Ghost” (just one indefinite article) should appear as “Ghost, A” and “A Ghost, a bear, or a devil” should be transformed to “Ghost, a bear, or a devil, A” (three indefinite articles, but we care only about the one at the very beginning of the title).
We find it helpful to describe what we want to accomplish in human language:
- If the first word in a title matches a definite or indefinite article, then move that word to the end after a comma and capitalize the new first letter.
- If the first word of the title is not an article, do nothing.
Regular expressions, or regex, can be used in some XPath functions that manipulate strings. Regex is a pattern-matching language that will let us describe, in a machine-actionable way, how the processing can recognize when a title begins with a definite or indefinite article. If you are not already comfortable with regex, a good place to start is the Regularexpressions.info quickstart page.
In a tmp folder or using eXide, try to write XQuery that will process titles as described above. Once we’re happy with the code we’ll integrate it into the app, but we find it helpful to explore the details in a scratch XQuery file first. Here is our starting point, which constructs some helpful representative titles and, for now, returns them unchanged.
xquery version "3.1";
let $data as element() :=
    <titles>
        <title>A ghost, a bear, or a devil</title>
        <title>The Ghost</title>
        <title>There once was a Ghost</title>
        <title>Ghost story</title>
        <title>Science: A new ghost</title>
    </titles>
for $title in $data//title
return $title
As reflected in the human-language summary above, we think of the task as “if a title begins with a definite or indefinite article, then move the article to the end after a comma and space and capitalize the new first letter, or else do nothing”. In the return clause of the FLWOR expression you can play around with this construction to see how it will work.
for $title in $data//title
return
    if (matches($title, '(The|An|A)'))
    then $title
    else "none"
This says: if the title contains one of the strings “The”, “An”, or “A”, return the title unchanged; if it doesn’t, return the string “none”. You don’t actually want to replace non-matching strings with “none” in the final product, but we do that now to make it easier to see what does and does not match our regular expression. And you’ll see below that we don’t actually use an if/then/else construction in the final version of our code, but we nonetheless find it useful at this early stage.
The XPath matches() function takes two arguments: the place where we’re searching (a title) and a regex that we’re searching for (we sometimes refer to these as the haystack and the needle). The code correctly recognizes that “Ghost story” does not need to be modified, but it incorrectly thinks that both “There once was a Ghost” and “Science: A new ghost” do. Before looking at the answers below, can you spot the reason that our regex is matching things we don’t want it to match?
“Science: A new ghost” contains a capitalized indefinite article, but it’s in the middle of the title, and we care about definite and indefinite articles only at the very beginning of a title. We can fix that by adding a regex anchor to our pattern, so that it will match only at the very beginning of a title:
for $title in $data//title
return
    if (matches($title, '^(The|An|A)'))
    then $title
    else "none"
The issue with “There once was a Ghost” is that although it begins with the letters “The”, it doesn’t begin with the definite article “The”. We can fix that by requiring a space character after the definite or indefinite article:
for $title in $data//title
return
    if (matches($title, '^(The|An|A) '))
    then $title
    else "none"
Now that we’re happy with what we’re matching it’s time to focus on the replacement:
for $title in $data//title
return
    if (matches($title, '^(The|An|A) '))
    then replace($title, '^(The|An|A) (.+)', '$2, $1')
    else "none"
The regex pattern ('^(The|An|A) (.+)') has two capture groups. The first, (The|An|A), is for the definite or indefinite article, then comes a space that we don’t capture, and then the second group, (.+), captures the remainder of the title. The replacement expression ('$2, $1') says to output the second capture group, then a literal comma and space character, and then the first capture group. The output of the two titles that begin with definite or indefinite articles now looks like:
ghost, a bear, or a devil, A
Ghost, The
This is significantly closer to our desired output; the only lingering issue is that although the second title correctly begins with an uppercase letter, the first doesn’t. The way to fix that, in human language, is to separate the first letter of the interim result from the rest, capitalize the first letter, and stitch it back together with the unchanged remainder. We can use the XPath substring() function to perform the separation, the upper-case() function to perform the capitalization, and the concat() function to reunite the parts:
for $title in $data//title
return
    if (matches($title, '^(The|An|A) '))
    then replace($title, '^(The|An|A) (.+)', '$2, $1')
        ! concat(upper-case(substring(., 1, 1)), substring(., 2))
    else "none"
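To see what each of the three functions contributes, it can help to run the capitalization step on its own. The following is a minimal sketch for a scratch file; the variable names $interim, $first, and $rest are ours and do not appear in the app code.

```xquery
xquery version "3.1";
(: $interim is a sample value: the output of the replace() step
   for the title "A ghost, a bear, or a devil" :)
let $interim as xs:string := 'ghost, a bear, or a devil, A'
(: substring($s, 1, 1) returns only the first character :)
let $first as xs:string := substring($interim, 1, 1)
(: substring($s, 2) returns everything from the second character onward :)
let $rest as xs:string := substring($interim, 2)
(: capitalize the first character and stitch the two parts back together :)
return concat(upper-case($first), $rest)
```

This should return the string "Ghost, a bear, or a devil, A". Note that the comma-separated articles later in the title are left alone, because only the first character passes through upper-case().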
Now that this does what we want, we can add back in the titles that do not start with a definite or indefinite article:
for $title in $data//title
return
    if (matches($title, '^(The|An|A) '))
    then replace($title, '^(The|An|A) (.+)', '$2, $1')
        ! concat(upper-case(substring(., 1, 1)), substring(., 2))
    else $title
We actually don’t need to use an if/then/else expression because the replace() function changes only titles that the regex matches. This means that if we apply it to a title that doesn’t match the pattern because it doesn’t begin with a definite or indefinite article, the function will return the title unchanged. The final version of our code looks like:
for $title in $data//title
return
    replace($title, '^(The|An|A) (.+)', '$2, $1')
    ! concat(upper-case(substring(., 1, 1)), substring(., 2))
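If you want to convince yourself that replace() really does leave non-matching input alone, you can try it in isolation on a few literal strings. This is a small scratch sketch, not part of the app code:

```xquery
xquery version "3.1";
(: Only the first string begins with an article followed by a space,
   so only the first string is changed by replace() :)
for $title in ('The Ghost', 'There once was a Ghost', 'Ghost story')
return replace($title, '^(The|An|A) (.+)', '$2, $1')
```

The result should be "Ghost, The", followed by the other two strings exactly as they were entered.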
Now that we know this XQuery works and returns what we want, we can move it into a function and start to use it. We do that by adding the following code inside functions.xqm:
(: Text functions :)
(:~
 : hoax:format-title() moves definite and indefinite articles to
 : the end of the title for rendering
 : @param $title : xs:string any article title
 : @return xs:string
 :)
declare function hoax:format-title($title as xs:string) as xs:string {
    replace($title, '^(The|An|A) (.+)', '$2, $1')
    ! concat(upper-case(substring(., 1, 1)), substring(., 2))
};
The comments at the top label a new subsection and describe the function signature (name, input parameters, output) and what it does. You can try out this function from the modules/functions.xqm library by importing it into a scratch XQuery file (in eXide, for example) and using it from there:
xquery version "3.1";
import module namespace hoax="http://www.obdurodon.org/hoaxed" at "/db/apps/hoaXed/modules/functions.xqm";
let $data as element() :=
    <titles>
        <title>A ghost, a bear, or a devil</title>
        <title>The Ghost</title>
        <title>There once was a Ghost</title>
        <title>Ghost story</title>
        <title>Science: A new ghost</title>
    </titles>
for $title in $data//title
return hoax:format-title($title)
The output should look like:
"Ghost, a bear, or a devil, A"
"Ghost, The"
"There once was a Ghost"
"Ghost story"
"Science: A new ghost"
We could have written the code to format the titles directly inside collection.xconf, but saving it in a library module has two advantages:
- We can use it elsewhere.
- It makes collection.xconf less cluttered, and therefore easier to read.
With the function installed and working inside the library module, we now want to edit the collection.xconf file in the root directory of the repository to use the new function to create the formatted-title field. As we mention above, whenever we edit collection.xconf we need to reinstall the app so that eXist-db, as part of the installation process, will rebuild the indexes using the new instructions. The first step is to add the following code inside the <text qname="tei:TEI"> element in collection.xconf:
<!-- ==================================================== -->
<!-- Article title (field) -->
<!-- -->
<!-- format-title() moves definite and indefinite article -->
<!-- to end, e.g., from "A Ghost" to "Ghost, A" -->
<!-- ==================================================== -->
<field name="formatted-title"
expression="descendant::tei:titleStmt/tei:title
! hoax:format-title(.)"/>
<!-- ==================================================== -->
We add this field at the level of the <TEI> element so that we get a title for every article. We add comments for developer readability. Self-documenting code and clear comments are not just for others who may read your code; they are gifts you give your future self when you return to a codebase to reuse or repurpose it, or to fix something that breaks. In this case the name of the field describes what it does, and the expression uses XPath to find the title you want to format and applies your hoax:format-title() function to it. We use the simple map operator (instead of wrapping the function around the XPath expression) for two reasons:
- If a text has more than one title, this version will format each separately. If we were to wrap the function around an expression that returned a sequence of more than one item, it would throw an error because the function signature accepts only a single item.
- We find the simple map operator easier to read than nesting a long XPath expression inside function parentheses.
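The difference between the two approaches can be explored in a scratch file. In the sketch below, local:format-title() is a stand-in copy of hoax:format-title() so that the example runs without importing the app's library module, and the two sample titles are invented for illustration.

```xquery
xquery version "3.1";
(: local:format-title() is a stand-in for hoax:format-title();
   the function body is the same as in functions.xqm :)
declare function local:format-title($title as xs:string) as xs:string {
    replace($title, '^(The|An|A) (.+)', '$2, $1')
    ! concat(upper-case(substring(., 1, 1)), substring(., 2))
};
(: a sequence of two titles, as if one text had two title elements :)
let $titles as xs:string+ := ('A Ghost', 'The Bear')
return
    (: simple map: the function is called once for each item :)
    $titles ! local:format-title(.)
    (: wrapping instead, as local:format-title($titles), would raise a
       type error (XPTY0004) because the parameter accepts one string :)
```

With the simple map operator this should return the two strings "Ghost, A" and "Bear, The".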
Now that we’ve added the code to create the field, we need to rebuild, reinstall, and verify that the field has been created in the indexes and contains populated data. We do this with Monex, a developer tool that is available from the eXist-db dashboard. After launching the Monex app, we navigate to Indexes and then find the index we want to view. When we do that, we see:
With our field constructed and ready to use, we can now update our model and view code so that the titles page will use it. Because of the way fields are stored in the eXist-db indexes, we need to use a function called ft:query() in a predicate to access them. You can read more about querying the index with ft:query() at the eXist-db Full Text Index resource. You can find the full code with the other features in titles.xql in the branch for this section, but here’s the snippet that is relevant to this specific field.
<m:articles>{
    for $article in $articles[ft:query(., (), map{'fields':('formatted-title')})]
    let $id as xs:string := $article/@xml:id ! string()
    let $title as xs:string := ft:field($article, "formatted-title")
    return
        <m:article>
            <m:id>{$id}</m:id>
            <m:title>{$title}</m:title>
        </m:article>
}</m:articles>
Now that we’re returning an element called <m:articles> with <m:article> children, we also need to update titles-to-html.xql so it can create the appropriate HTML elements and present them to the user. The logic for this is similar to our previous view code: we need to replace elements from our model namespace with HTML elements and incorporate other HTML markup to format the page the way we want. Instead of stringing together the title and date as a single list item, which is what we did earlier, we now create separate list items for each piece of information. Here is simplified view code that renders only the title and identifier:
<html:section>{
    for $item in $data//m:article
    let $id as xs:string := $item/m:id ! string()
    let $title as xs:string := $item/m:title ! string()
    return
        <html:ul>
            <html:li>{$title}</html:li>
            <html:li>{$id}</html:li>
        </html:ul>
}</html:section>
In the real code for this branch you will also notice that date, publisher, and incipit have their own list items. We’re treating these the same way we’ve treated the titles: we create fields for them, which we then retrieve with ft:query() and ft:field() in the model, and we then transform the model to create simple HTML lists. In subsequent stages, we’ll continue to build on this title view to add links to the articles, and eventually this page will become a searching and filtering interface. The interim processing steps that were added in this stage lay the groundwork for our searching and filtering, but they also make available information that users will need for an intermediate result that builds on the earlier “minimum viable product” and brings us closer to our final target.
In this section, we wrote a new function to format titles. To keep our focus on the pipeline we ignored temporarily a development step that we covered in Stage 11, writing tests for functions. This important step should be part of your development journey, so we want to practice writing a test for a new function here as well, even if it’s not the main pedagogical goal of this section.
To some extent, we did some testing in our temporary version of the function, but we need to make this testing repeatable, so that as we add more functionality we can run our entire test suite and ensure that we haven’t accidentally introduced a regression, that is, broken code that worked previously. You will recall from Stage 11 that we use the tests directory and two files, test-suite.xq and test-runner.xq, which were created by the eXistentializer app. To write a test for the hoax:format-title() function, we open the test-suite.xq file and add the following:
(: ==========
Tests for fixing definite and indefinite articles
========== :)
declare
    %test:arg('input', 'The Big Sleep')
    %test:assertEquals('Big Sleep, The')
    %test:arg('input', 'An Unusual Life')
    %test:assertEquals('Unusual Life, An')
    %test:arg('input', 'A Boring Life')
    %test:assertEquals('Boring Life, A')
    %test:arg('input', 'Andrea and Andrew')
    %test:assertEquals('Andrea and Andrew')
    %test:arg('input', 'A ghost, a bear, or a devil')
    %test:assertEquals('Ghost, a bear, or a devil, A')
    function tests:test-format-title($input as xs:string) as xs:string {
        hoax:format-title($input)
    };
In this test, we create arguments with inputs that test the behaviors we expect from our function, and provide assertions about our expected output. We test for all three articles at the beginning of a title, for a leading word that begins with the letters of an article (“Andrea”) but is not itself an article, and for a title that contains several articles. If our corpus had significantly more variation, we might introduce even more test cases.
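As one possible extension, the titles from our earlier scratch file that should pass through unchanged could be added as their own test. The fragment below is only a sketch: it is meant to be pasted into test-suite.xq and assumes that the tests and hoax prefixes are already declared there, as they must be for the test above. The function name tests:test-format-title-unchanged is our own invention.

```xquery
(: ==========
Sketch: titles that format-title() should return unchanged
(hypothetical test name; relies on the prefixes declared in test-suite.xq)
========== :)
declare
    %test:arg('input', 'There once was a Ghost')
    %test:assertEquals('There once was a Ghost')
    %test:arg('input', 'Science: A new ghost')
    %test:assertEquals('Science: A new ghost')
    %test:arg('input', 'Ghost story')
    %test:assertEquals('Ghost story')
    function tests:test-format-title-unchanged($input as xs:string) as xs:string {
        hoax:format-title($input)
    };
```

Keeping the unchanged cases in a separate test function is a matter of taste; it makes it easier to see at a glance which inputs are expected to be rewritten and which are not.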
Whenever you write a new test you should save it and then open and execute test-runner.xq. This will run your entire test suite and report any failures.
The goal of this section was to expand the titles list features to include formatted titles, identifiers, dates, publishers, and incipita, and we situated that new work in an iterative development cycle and pipeline. That process included developing and testing a new function, adding a new field to our index that invoked that function, and updating the model and view files to include enhanced information provided by the new function. We walked through the steps for this process only with respect to formatted titles, but the general process is the same for any use of fields, and we encourage you now to try developing the other fields by following the steps we completed here. In the next sections we continue to build on these features as we create a reading view and an XML view, which are ultimately accessed via links we’ll add to this articles list interface for use as a discovery tool.