12‐enhance‐titles
In this stage, we begin to move from a minimum viable product towards a more advanced research interface. Our list of article titles provides only the title and date of each article, which doesn’t take a researcher very far. To expose more information and make the archive more usable, we should think about what information we want to show a user, in what order, and how it is presented. Here is a sample of the list we currently have:
And here's a sample of the list we will create by the end of this stage:
This stage is still very intermediate. The goal of building this Articles page over a series of branch stages is to demonstrate the kind of iterative development all digital editions undergo. We are accustomed to seeing these products as final and authoritative research objects. In most cases, project development does reach that stage, but there are many iterations a project must go through before then, most of which a public user will never see.
To advance this Articles interface, we will add the following features:
- formatted title field
- article ID
- publication title field
- formatted date field
- incipit field
We explored the reasons for using fields a bit when we introduced them a few stages back, but in this case we want to use fields for a wider variety of reasons. First, fields make the information more reusable in other parts of the application. Second, they will become an essential part of the search feature in stage 15. We want to build extensible parts in order to make development of future features more streamlined.
Let's review the implementation of fields before we jump back into editing them. Remember that we write the XQuery for fields in a separate file in the root of our application directory, collection.xconf, which runs these queries and stores the results in memory when we build and install the application in eXist-db. This means that our development cycle is slightly different than it is when we build modules; any change to the fields in collection.xconf requires a rebuild using ant at the command line and a reinstall in the web interface of eXist-db. Here’s a new command line utility for interacting with eXist-db that might make this process a little simpler: https://github.com/eXist-db/xst
For the sake of clarity and brevity in this tutorial, we will provide step-by-step instructions on creating the formatted title field here and provide a link to the stage with the rest of the fields implemented. We highly encourage you to develop each feature iteratively, and to test the data at the model and view level, by making only one change at a time.
We know we want to write some XQuery that will format our titles so the most relevant information is presented first and leading articles don’t interfere if we decide to sort alphabetically in the future. There are only three articles (“A”, “An”, and “The”) and we only care about them when they appear at the beginning of the title. For example, “A Ghost” should appear as “Ghost, A” but “A Ghost, a bear, or a devil” shouldn’t become “Ghost, bear, or devil, A, a, a”, because we are only interested in alphabetizing the list by the first word, not in rearranging the rest of the title. We also want this formatted title to be readable for users, and “Ghost, bear, or devil, A, a, a” is not the way a person would write a title.
This gives us some human language to describe what we want to accomplish. If the first word in a title is one of the articles, then move that word to the end after a comma and capitalize the new first letter. If the first word is not an article, do nothing.
Regular expressions, or regex, are often used in XPath functions to manipulate strings. Since we are updating the formatting of these titles, we need to use regular expressions. If you are not already familiar with this, you can find guidance on using regex here: https://www.regular-expressions.info/quickstart.html
In a tmp folder or using eXide, try to write XQuery that will accomplish this with the examples we provided above.
Here is our starting point, which constructs some helpful example titles and returns them as-is.
xquery version "3.1";

let $data as element() :=
    <titles>
        <title>A ghost, a bear, or a devil</title>
        <title>The Ghost</title>
        <title>Ghost story</title>
        <title>Science: A new ghost</title>
    </titles>
for $title in $data//title
return
    $title
As you might have noticed in our human language summary of the task we want to complete, we used an “if/then/else” statement. In the return portion of FLWOR, we can play around with this construction to see how it will work.
for $title in $data//title
return
    if (matches($title, '(The|An|A)'))
    then $title
    else "none"
This yields our titles, but in place of “Ghost story” we see "none". It matched the articles, returned titles for those with an article, and returned "none" for the one without an article. Play around with the data to see how capitalization and position change the outcomes for this expression.
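One way to explore how capitalization changes the outcome is to compare the case-sensitive match with a case-insensitive one. The sketch below is only an illustration: the sample strings are invented for this purpose, and the 'i' flag is the standard case-insensitive option of matches().

```xquery
xquery version "3.1";

(: Hypothetical sample titles, invented for this illustration only :)
for $sample in ("The Ghost", "the ghost", "Ghost dance", "Science: a new ghost")
return
    concat(
        $sample,
        " | case-sensitive: ", matches($sample, '(The|An|A)'),
        " | case-insensitive: ", matches($sample, '(The|An|A)', 'i')
    )
```

With the 'i' flag even "Ghost dance" reports a match, because the unanchored pattern finds the "a" inside "dance". That is a hint that both case and position matter for this pattern.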
Next, we can try to return only those with an article as the first word of the title, leaving off “Science: A new ghost”.
for $title in $data//title
return
    if (matches($title, '^(The|An|A)'))
    then $title
    else "none"
Now that we are matching only the items we care to change, we can start to build out the then structure.
for $title in $data//title
return
    if (matches($title, '^(The|An|A)'))
    then replace($title, '^(The|An|A)(.+)', '$2, $1')
    else "none"
Here we use replace and regex capture groups to isolate the article from the rest of the title and switch the order. The output has a leading space, though. We can fix that with more capture groups.
for $title in $data//title
return
    if (matches($title, '^(The|An|A)'))
    then replace($title, '^(The|An|A)( )(.+)', '$3, $1')
    else "none"
The results are significantly closer to our desired output. Our titles don’t yet look consistent, though, as one starts with a lowercase letter. To fix that, we can use the simple map (!) operator and some string surgery to make our capitalization consistent.
for $title in $data//title
return
    if (matches($title, '^(The|An|A)'))
    then replace($title, '^(The|An|A)( )(.+)', '$3, $1')
        ! concat(upper-case(substring(., 1, 1)), substring(., 2))
    else "none"
In human language, this says “for every title in the data, check if it has an article at the beginning. If it does, then move that article to the back. For every title that has a moved article, find its first letter, capitalize it, and concatenate it with the rest of the title. If it doesn’t start with an article, then return "none".”
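The steps in that summary can also be traced one at a time for a single title. This is a minimal sketch rather than part of the application code: the variable names are made up for the illustration, and it uses the standard analyze-string() function to show what each numbered capture group holds.

```xquery
xquery version "3.1";

let $title := "A ghost, a bear, or a devil"
let $pattern := '^(The|An|A)( )(.+)'
(: analyze-string() returns fn:group elements, one per capture group :)
let $groups :=
    analyze-string($title, $pattern)//fn:group
        ! concat("$", @nr, " = [", ., "]")
(: move the article to the end :)
let $moved := replace($title, $pattern, '$3, $1')
(: capitalize the new first letter :)
let $capitalized :=
    concat(upper-case(substring($moved, 1, 1)), substring($moved, 2))
return
    ($groups, $moved, $capitalized)
```

The first three items show the article, the single space, and the rest of the title as groups 1 to 3. The last two items show the title after the move and after capitalization.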
Now that this does what we want, we can add back in the titles that do not start with an article and clean up the spaces.
for $title in $data//title
return
    if (matches($title, '^(The|An|A) '))
    then replace($title, '^(The|An|A)( )(.+)', '$3, $1')
        ! concat(upper-case(substring(., 1, 1)), substring(., 2)) => normalize-space()
    else normalize-space($title)
Now that we know this XQuery works and returns what we want, we can move it into a function and start to use it.
In functions.xqm we can add the following function declaration:
(: Text functions :)
(:~
: hoax:format-title() moves definite and indefinite article to
: the end of the title for rendering
: @param $title : xs:string any article title
: @return xs:string
:)
declare function hoax:format-title($title as xs:string) as xs:string {
    if (matches($title, '^(The|An|A) '))
    then replace($title, '^(The|An|A)( )(.+)', '$3, $1')
        ! concat(upper-case(substring(., 1, 1)), substring(., 2)) => normalize-space()
    else normalize-space($title)
};
The comments at the top provide a new subsection and describe the function's purpose, parameters, and output. We can test invoking this function in our tmp file. Just remember that you'll need to declare and import the module namespace in that file.
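A tmp file that calls the function might look roughly like the sketch below. Both the namespace URI and the module path are placeholders assumed for this illustration; replace them with the values declared at the top of your own functions.xqm and the location where your app is installed.

```xquery
xquery version "3.1";

(: ASSUMPTION: the namespace URI and the path are placeholders.
   Copy the real URI from the module declaration in functions.xqm
   and adjust the path to where your app is installed. :)
import module namespace hoax = "http://example.org/ns/hoax"
    at "xmldb:exist:///db/apps/hoaXed/modules/functions.xqm";

(: Call the function once per sample title :)
("A Ghost", "Ghost story") ! hoax:format-title(.)
```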
Next, we open and edit the collection.xconf file in the root directory of the repository. This file is read and executed when the application is installed, so the app stores the information we put here for faster retrieval later on. That means we need to build and install the app again to test any change we make to this file. Because the formatting logic lives in a function that we can troubleshoot without rebuilding the app, all we need to do here is declare the field and invoke the function, which leaves much less to troubleshoot.
Inside the <text qname="tei:TEI"> element in collection.xconf, add the following code:
<!-- ==================================================== -->
<!-- Article title (field) -->
<!-- -->
<!-- format-title() moves definite and indefinite article -->
<!-- to end, e.g., from "A Ghost" to "Ghost, A" -->
<!-- ==================================================== -->
<field
    name="formatted-title"
    expression="descendant::tei:titleStmt/tei:title
                ! hoax:format-title(.)"/>
<!-- ==================================================== -->
We add this field on the TEI element level so we get titles for every article. We add comments for developer readability. Self-documenting comments are not just good form for public work, they are gifts you give your future self when you return to a codebase to reuse or repurpose, or to fix something that breaks.
The name matches the meaning of the expression. We provide the XPath expression first, then use the simple map operator to call the function and pass the titles in as its input. This format is used throughout collection.xconf for two reasons: if a text has more than one title, this will format each separately, and this is more readable for future you and other developers.
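The first of those reasons can be seen in a small, self-contained sketch. The TEI fragment below is invented for the illustration, and upper-case() stands in for hoax:format-title() so that the snippet runs without importing the module. Like our function, it accepts only a single string.

```xquery
xquery version "3.1";

declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Hypothetical document with two titles, invented for this example :)
let $doc :=
    <tei:TEI>
        <tei:teiHeader>
            <tei:fileDesc>
                <tei:titleStmt>
                    <tei:title>The Ghost</tei:title>
                    <tei:title>A second title</tei:title>
                </tei:titleStmt>
            </tei:fileDesc>
        </tei:teiHeader>
    </tei:TEI>
return
    (: The simple map operator calls the function once per title.
       Passing both titles in one call, as in
       upper-case($doc/descendant::tei:titleStmt/tei:title),
       would raise a type error because the function expects one string. :)
    $doc/descendant::tei:titleStmt/tei:title ! upper-case(.)
```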
Now that we've added the field, we should rebuild, reinstall, and check Monex to see if our field has populated data. Monex is available from the eXist-db dashboard. After launching the Monex app, we navigate to Indexes, and then find the index we want to view. In this case, you should be able to find yours here, provided you installed the app under db/apps: http://localhost:8080/exist/apps/monex/field.html?collection=/db/apps/hoaXed&field=formatted-title&node-name=tei:TEI
With our fields constructed and ready to use, we can now update our model and view code for the titles page to use them. To do this we use a function called ft:query() in a predicate, which queries the specific index we built to retrieve those stored values. You can read more about querying the index on eXist-db's Full Text Index resource. You can find the full code with the other features in titles.xql in the branch for this section, but here's the snippet that is relevant to this specific field.
<m:articles>
    {for $article in $articles[ft:query(., (), map{'fields':('formatted-title')})]
    let $id as xs:string :=
        $article/@xml:id ! string()
    let $title as xs:string :=
        ft:field($article, "formatted-title")
    return
        <m:article>
            <m:id>{$id}</m:id>
            <m:title>{$title}</m:title>
        </m:article>
}</m:articles>
Now that we’re returning an element called <m:articles> with <m:article> children, we also need to update titles-to-html.xql so it can create the appropriate HTML elements and present them to the user. The logic for this is similar to our previous view code, as we merely changed the element names and formatted the text earlier in the pipeline. Instead of stringing together one list item, we have separate list items for each piece of information.
<html:section>
    {for $item in $data//m:article
    let $id as xs:string := $item/m:id ! string()
    let $title as xs:string := $item/m:title ! string()
    return
        <html:ul>
            <html:li>{$title}</html:li>
            <html:li>{$id}</html:li>
        </html:ul>
    }
</html:section>
In the real code for this branch, you will also notice that date, publisher, and incipit have their own list items. We’re treating these the same way we’ve treated the titles: we parse them using XPath or a function that we call in the index and the values are stored, we then call those values using ft:query() in the model, and then we process the model to create simple HTML lists. In subsequent stages, we’ll continue to build on this title view to add links to the articles, and eventually this page will become a searching and filtering interface. The interim processing steps that were added in this stage lay the groundwork for our searching and filtering, but they also make available information that users will need for a “minimum viable product” right now.
In this section, we wrote a new function to format titles. To keep our focus on the pipeline and avoid confusion, we left out the part we covered in branch 11, writing tests for functions. This important step should be part of your development journey, so we want to practice writing a test for a new function here as well, even if it’s not the main pedagogical goal of this section.
To some extent, we did some testing in our tmp version of the function, but we need to make this testing repeatable (and maybe even automatable!). You will recall from section 11 that we use the tests directory and two files, test-suite.xq and test-runner.xq, which were automatically created by the eXistentializer app creation script. To write a test for the hoax:format-title() function, we will open the test-suite.xq file first.
(: ==========
Tests for fixing definite and indefinite articles
========== :)
declare
    %test:arg('input', 'The Big Sleep')
    %test:assertEquals('Big Sleep, The')
    %test:arg('input', 'An Unusual Life')
    %test:assertEquals('Unusual Life, An')
    %test:arg('input', 'A Boring Life')
    %test:assertEquals('Boring Life, A')
    %test:arg('input', 'Andrea and Andrew')
    %test:assertEquals('Andrea and Andrew')
    %test:arg('input', 'A ghost, a bear, or a devil')
    %test:assertEquals('Ghost, a bear, or a devil, A')
function tests:test-format-title($input as xs:string) as xs:string {
    hoax:format-title($input)
};
In this test, we create arguments with inputs that test the behaviors we expect from our function, and provide assertions with our expected output. We test for all three articles at the beginning, a leading word that starts with “A” but is not an article, and a test case with several articles in a single title. If our corpus had significantly more variation, we might introduce even more test cases.
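Before adding more assertions, it can help to probe edge cases in a tmp file. The sketch below copies the body of hoax:format-title() into a local function so the snippet runs without importing the module. The name local:format-title and the three inputs are assumptions made only for this illustration.

```xquery
xquery version "3.1";

(: Stand-in copy of hoax:format-title(), declared locally so this
   snippet is self-contained :)
declare function local:format-title($title as xs:string) as xs:string {
    if (matches($title, '^(The|An|A) '))
    then replace($title, '^(The|An|A)( )(.+)', '$3, $1')
        ! concat(upper-case(substring(., 1, 1)), substring(., 2)) => normalize-space()
    else normalize-space($title)
};

(: Invented inputs: a lowercase article, leading whitespace,
   and a doubled space after the article :)
for $input in ('a ghost', ' The Ghost', 'An  Unusual Life')
return
    concat('[', $input, '] becomes [', local:format-title($input), ']')
```

The lowercase article and the title with leading whitespace keep their article in front, because the pattern is case-sensitive and anchored to the start of the string. Whether that is the desired behavior is a decision to record as a test case.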
When you write a new test, remember to save it and then open and execute test-runner.xq.
In this section, we wanted to expand the titles list features to include formatted titles, dates, publisher, and incipit. We covered the iterative development cycle and pipeline required to add the titles and ID. This included developing a new function, adding a new field to our index that invokes that function, and updating the model and view files to include this information. While we did not complete every step of this iterative cycle for the other features listed, we challenge you to try developing those features without looking at the branch code by following the same steps we completed here. They may not be exactly the same, and you should ask yourself if you need a function or a field while you consider the best way to approach the problem. In the next sections, we continue to build on these features as we create a reading view and XML view, which are ultimately accessed via this articles list interface as a discovery tool.