Time for Genre: Temporal Expressions as Features for the Classification of Literary Subgenres

Henny-Krahmer, Ulrike
University of Cologne, Germany
ulrike.henny@uni-koeln.de

1. Introduction

One of the major concerns of digital literary studies when using machine learning methods is the classification of texts by genre. Several different types of features have been used to this end, among which most frequent words and topics in particular have been employed successfully (Underwood 2015; Hettinger et al. 2016; Schöch 2017). There are two main goals of using supervised learning for this task: to be able to determine the genre of unseen texts in large collections and to better understand which textual features are decisive for the distinction of genres.

This proposal pursues the second goal with a type of feature that has so far not been a focus of genre classification: temporal expressions. A temporal expression is a sequence of tokens in a text that refers to time and expresses, for example, when something happened, how long it lasted, or how often it occurred (Ferro et al. 2005: 5), as in “on October 5th, 1850”, “for three hours”, or “every Tuesday”. Temporal tagging is a well-researched area (Strötgen and Gertz 2015, 2016), and studies concerned with the annotation of temporal expressions in literary texts do exist (Bögel et al. 2015; Fischer and Strötgen 2015a, 2015b; Gius and Jacke 2015), but these annotations have so far not been used for subgenre classification.

In this contribution, subgenres of the novel are analyzed with a corpus of 19th-century Spanish American novels (Henny-Krahmer 2021). Time plays a role for different types of novels on various levels. First, the narration relates to what is narrated with a certain temporal perspective: the narrated events can be located in a precise historical past, as in historical novels (Fernández Prieto 1996; Spang 1998; Lefere 2013), or they can be temporally close to the narration, as in social novels or novels of customs (Calderón 2005; Janik 2008: 60-63). In other types of novels, the temporal location of the events might not have a high priority. Second, different kinds of temporal expressions are relevant for the description of events that are located precisely in time (e.g. dates) vs. others that are more vague (e.g. referring to seasons or times of the day). The hypothesis is that precise temporal expressions are more distinctive for the style of subgenres for which the temporal location of the events is decisive (such as the historical novel) and that fuzzy or unspecified temporal expressions are more frequent in subgenres for which a clear anchoring in time is not constitutive (such as sentimental novels). The two main goals of this proposal are to test how useful temporal features are in general for the classification of novels by subgenre and to test the hypotheses concerning the relevance of the different temporal expressions for different subgenres.

2. Corpus and methods

The corpus used consists of 256 novels from Argentina (99), Cuba (49), and Mexico (108), which were published between 1830 and 1910 and written by 121 different authors. The three most frequent thematic subgenres were analyzed: historical novels (67), sentimental novels (55), and novels of customs (50). The remaining novels were treated as one group of other novels. The subgenre labels were collected from literary historical sources and explicit mentions in subtitles. Six subgenre constellations were analyzed: historical novels vs. other, sentimental novels vs. other, novels of customs vs. other, historical vs. sentimental novels, historical novels vs. novels of customs, and sentimental novels vs. novels of customs.

Temporal tagging was performed with HeidelTime 2.2.1 (Strötgen and Gertz 2015) and linguistic annotation with TreeTagger 3.2.3 (Schmid 1995) and FreeLing 4.0 (Padró and Stanilovsky 2012). A set of 499 different features was created based on the temporal and linguistic tagging, including basic types of temporal expressions (e.g. DATE, TIME, DURATION, or SET expressions, see table 1 below), more elaborate subtypes of the basic types (e.g. fully specified vs. unspecified dates), and the most frequent temporal expressions of different types (e.g. the counts of “hoy”, “octubre”, “las 9 de la noche”, “dos horas”, “cada día”). Counts of different verb tenses were also used as features related to time. In addition, a feature set of the 4,000 most frequent words (MFW) was created for comparison because the classification of novels by subgenre works well with this number of MFW (Hettinger et al. 2016; Henny-Krahmer forthcoming). For the temporal expression features, values relative to text length and values proportional to the overall number of temporal expressions in the texts were used; for the MFW features, tf-idf values were used.
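The two ways of scaling the temporal expression counts can be illustrated with a minimal Python sketch. The function name `normalize_counts`, the counts, and the token total are hypothetical and not taken from the published code; the sketch only shows one plausible reading of "relative to text length" and "proportional to the overall number of expressions".

```python
def normalize_counts(type_counts, n_tokens):
    """Return two normalized versions of raw temporal expression counts.

    type_counts: dict mapping an expression type (e.g. "DATE") to its raw count
    n_tokens:    number of tokens in the novel (text length)
    """
    total_expressions = sum(type_counts.values())
    # Variant 1: count relative to the length of the text.
    per_token = {t: c / n_tokens for t, c in type_counts.items()}
    # Variant 2: share among all temporal expressions of the text
    # (0.0 if the text contains no temporal expression at all).
    per_expression = {
        t: (c / total_expressions if total_expressions else 0.0)
        for t, c in type_counts.items()
    }
    return per_token, per_expression


# Hypothetical counts for one novel of 80,000 tokens.
counts = {"DATE": 300, "TIME": 120, "DURATION": 60, "SET": 20}
per_token, per_expression = normalize_counts(counts, n_tokens=80_000)
print(per_token["DATE"], per_expression["DATE"])
```

The first variant makes long and short novels comparable, while the second describes the internal mix of temporal expressions independently of how many of them a novel contains.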

Type of temporal expression   Description
DATE                          a calendar time, e.g. “last week” or “Thursday, March 4th, 1886”
TIME                          a time of day, e.g. “half past nine”, “3 p.m.”
DURATION                      a duration, e.g. “eight years”
SET                           a set of times, e.g. “once a month”, “every four hours”

Table 1: Basic types of temporal expressions according to the annotation standard TimeML
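How the four basic types in table 1 can be turned into counts is sketched below in Python. TimeML marks temporal expressions with TIMEX3 elements that carry a `type` and a `value` attribute; the annotated snippet, its attribute values, and the helper name `extract_timex` are invented for illustration and are not actual HeidelTime output. The test for a fully specified date (a value of the form YYYY-MM-DD) is likewise an assumption about how such a subtype could be detected.

```python
import re
from collections import Counter

# Hypothetical TimeML-style snippet (invented sentence and attribute values).
ANNOTATED = (
    'Llegó el <TIMEX3 tid="t1" type="DATE" value="1850-10-05">5 de octubre de 1850</TIMEX3> '
    'a <TIMEX3 tid="t2" type="TIME" value="XXXX-XX-XXT21:00">las 9 de la noche</TIMEX3>, '
    'esperó <TIMEX3 tid="t3" type="DURATION" value="PT2H">dos horas</TIMEX3> '
    'y volvió <TIMEX3 tid="t4" type="SET" value="P1D" quant="EVERY">cada día</TIMEX3>.'
)

TAG = re.compile(r"<TIMEX3\b([^>]*)>(.*?)</TIMEX3>", re.S)
ATTR = re.compile(r'(\w+)="([^"]*)"')
FULL_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # day, month, and year all given


def extract_timex(annotated):
    """Return (type, value, text) for every TIMEX3 element in the string."""
    results = []
    for attrs, text in TAG.findall(annotated):
        attributes = dict(ATTR.findall(attrs))
        results.append((attributes.get("type"), attributes.get("value"), text))
    return results


expressions = extract_timex(ANNOTATED)
type_counts = Counter(t for t, _, _ in expressions)
full_dates = [
    text for t, value, text in expressions
    if t == "DATE" and value and FULL_DATE.match(value)
]
print(type_counts)
print(full_dates)
```

A regular expression is enough for this toy string; for real annotation output, an XML parser would be the more robust choice.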

The classification was performed with a linear SVM as implemented in scikit-learn (Pedregosa et al. 2011), using a C parameter value of 100. For each subgenre constellation, undersampling ensured that the two classes being compared always had the same size, so that a baseline of 0.5 can be assumed. The novels retained in the undersampling were selected randomly, and the selection process was repeated 10 times. A 10-fold cross-validation was applied (see note 1).
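The evaluation setup described here can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the published code: the data are synthetic, the helper names `undersample` and `repeated_balanced_cv` are hypothetical, `SVC(kernel="linear")` stands in for whichever linear SVM class was actually used, and the stratified, shuffled folds are a choice made for this sketch.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC


def undersample(X, y, rng):
    """Randomly reduce the larger class so that both classes have equal size."""
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    n = min(len(idx0), len(idx1))
    keep = np.concatenate([
        rng.choice(idx0, size=n, replace=False),
        rng.choice(idx1, size=n, replace=False),
    ])
    return X[keep], y[keep]


def repeated_balanced_cv(X, y, n_repeats=10, n_folds=10, C=100, seed=0):
    """Mean accuracy over repeated undersampling, each with k-fold CV."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        X_bal, y_bal = undersample(X, y, rng)
        clf = SVC(kernel="linear", C=C)
        cv = StratifiedKFold(
            n_splits=n_folds, shuffle=True,
            random_state=int(rng.integers(1_000_000)),
        )
        scores.extend(cross_val_score(clf, X_bal, y_bal, cv=cv, scoring="accuracy"))
    return float(np.mean(scores))


# Synthetic stand-in for "historical novels (67) vs. other (189)".
data_rng = np.random.default_rng(42)
y = np.array([1] * 67 + [0] * 189)
X = data_rng.normal(size=(256, 20)) + y[:, None] * 1.0
mean_accuracy = repeated_balanced_cv(X, y)
print(round(mean_accuracy, 2))
```

Because the classes are balanced before every cross-validation run, an accuracy of 0.5 corresponds to guessing, which is what makes the reported accuracies directly comparable across constellations.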

3. Results and discussion

The classification results were evaluated for temporal features alone, for MFW-based features, and for temporal and MFW-based features combined (see the mean accuracies for the six subgenre constellations in table 2).

Subgenres                                 Temporal features   MFW    Temporal + MFW
historical novel vs. other                0.70                0.83   0.85
sentimental novel vs. other               0.64                0.78   0.77
novel of customs vs. other                0.62                0.72   0.73
historical novel vs. sentimental novel    0.74                0.92   0.91
historical novel vs. novel of customs     0.76                0.86   0.89
sentimental novel vs. novel of customs    0.59                0.74   0.74

Table 2: Results for classification by subgenre with different feature sets

For temporal features alone, the best mean accuracies are achieved for historical novels vs. novels of customs (0.76), historical vs. sentimental novels (0.74), and historical vs. other novels (0.70), so the historical novel is clearly the subgenre that is easiest to distinguish from the others. That the results for sentimental novels and novels of customs are generally lower than for historical novels might be related to the fact that novels of customs tended to combine descriptive parts with sentimental plot elements (Janik 2008: 67-77). All results are above the baseline of 0.50 but lower than the results with MFW. This is not surprising because temporal expressions are much less frequent in the novels than words in general, and the temporal feature set was considerably smaller (499 features vs. 4,000 MFW).

With temporal features and MFW combined, the results improve for the constellations historical novels vs. other (0.85), historical novels vs. novels of customs (0.89), and novels of customs vs. other (0.73). They do not change or are even slightly worse when the sentimental novel is involved. So although temporal features alone do not outperform MFW, combining the two improves the results for specific subgenres and leaves them unchanged or slightly worse for others. This means that the temporal features add relevant information for the distinction of certain subgenres, in this case historical novels and novels of customs, but not for subgenres of the novel in general.
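The size of these changes can be read off table 2 directly. The short Python sketch below copies the MFW and combined accuracies from the table and computes the difference per constellation; the variable and function names are chosen for illustration only.

```python
# Mean accuracies copied from table 2: (MFW, temporal + MFW).
TABLE_2 = {
    "historical novel vs. other": (0.83, 0.85),
    "sentimental novel vs. other": (0.78, 0.77),
    "novel of customs vs. other": (0.72, 0.73),
    "historical novel vs. sentimental novel": (0.92, 0.91),
    "historical novel vs. novel of customs": (0.86, 0.89),
    "sentimental novel vs. novel of customs": (0.74, 0.74),
}


def accuracy_changes(table):
    """Difference between combined features and MFW alone, per constellation."""
    return {
        name: round(combined - mfw, 2)
        for name, (mfw, combined) in table.items()
    }


changes = accuracy_changes(TABLE_2)
for name, delta in changes.items():
    print(f"{name}: {delta:+.2f}")
```

The largest gain (0.03) is found for historical novels vs. novels of customs, while all three constellations involving the sentimental novel show a change of 0.00 or -0.01.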

The results confirm the hypotheses formulated for the relevance of temporal features for historical novels and novels of customs. For sentimental novels, however, the inclusion of the temporal features seems to confuse the classifier, which needs to be further investigated. In figure 1, the 25 most important features for historical novels vs. novels of customs are shown for temporal features + MFW.

Figure 1: Top 25 (average) feature weights for novels of customs vs. historical novels

The plot shows that the top four features are temporal features: fully specified dates (with day, month, and year) are distinctive for historical novels and time expressions referring to times of the day (e.g. morning, afternoon, night) are typical for novels of customs. Also, dates with at least one specified part are among the top features for historical novels. The other top features belong to the MFW feature set.
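A ranking of this kind can be derived from the coefficients of a linear classifier. The following Python sketch assumes that "average" means the mean of the per-feature weights over several trained models (for example, one per cross-validation fold) and that features are ranked by the absolute value of that mean; both points are assumptions, and the feature names and weights are invented. Here, positive weights are arbitrarily taken to point to historical novels and negative ones to novels of customs.

```python
from statistics import mean


def top_features(fold_weights, k=25):
    """Average per-feature weights over models and return the k strongest.

    fold_weights: list of dicts, one per trained model, mapping a feature
                  name to its weight in that model
    """
    names = fold_weights[0].keys()
    averaged = {name: mean(w[name] for w in fold_weights) for name in names}
    # Rank by magnitude so that strong features of both classes appear.
    ranked = sorted(averaged.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


# Hypothetical weights from three models.
folds = [
    {"date_full": 0.9, "time_of_day": -0.7, "mfw_rey": 0.3},
    {"date_full": 1.1, "time_of_day": -0.5, "mfw_rey": 0.1},
    {"date_full": 1.0, "time_of_day": -0.6, "mfw_rey": 0.2},
]
ranking = top_features(folds, k=2)
for name, weight in ranking:
    print(f"{name}: {weight:+.2f}")
```

Ranking by absolute value keeps the sign of each weight, so the same list shows which class a feature speaks for and how strongly.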

4. Conclusions

Using temporal features for the classification of novels by subgenre revealed that they alone yield classification results that are above the baseline of 50 % for the different subgenre constellations but that temporal features alone are not as good as MFW features. When both types of features are combined, however, the classification results improve for historical novels and novels of customs, showing that temporal features add relevant information for the distinction of these subgenres, while they worsen the results for sentimental novels. A look into the feature weights confirms that temporal features are useful for the classification of individual types of subgenres, but also that only a few very specific temporal features are selected when temporal features and MFW are combined (e.g. fully specified dates and time expressions referring to times of the day).

A more general conclusion that can be drawn for genre classification is that features which are more specific than MFW or topics make the classification harder because they are sparser. In combination with frequent features, they can improve the results for certain types of genres, but not for others, confirming that different genres are not all defined on the same textual and linguistic levels. As next steps, the quality of the temporal tagging should be evaluated to make sure that there is no bias resulting from the annotation procedure. In addition, temporal features could be used to classify literary texts of other genres and from different linguistic and historical contexts.

Appendix A

Bibliography
  1. Bögel, Thomas / Strötgen, Jannik / Gertz, Michael (2015): “A Hybrid Approach to Extract Temporal Signals from Narratives”, in: Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL’15), Duisburg-Essen, September-October 2015: 106-107 <http://web.archive.org/web/20210616084719/https://konvens.org/proceedings/2015/GSCL-201514.pdf> [15.06.2021].
  2. Calderón, Mario (2005): “La novela costumbrista mexicana”, in: Clark de Lara, Belem / Speckman Guerra, Elisa (eds.): La república de las letras. Asomos a la cultura escrita del México decimonónico. Vol. 1: Ambientes, asociaciones y grupos. Movimientos, temas y géneros literarios. México: Universidad Nacional Autónoma de México: 315-324.
  3. Fernández Prieto, Celia (1996): “Poética de la novela histórica como género literario”, in: Signa. Revista de la Asociación Española de Semiótica 5: 185-202.
  4. Ferro, Lisa / Gerber, Laurie / Mani, Inderjeet / Sundheim, Beth / Wilson, George (2005): TIDES. 2005 Standard for the Annotation of Temporal Expressions <http://web.archive.org/web/20200716215051/https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-timex2-guidelines-v0.1.pdf> [15.06.2021].
  5. Fischer, Frank / Strötgen, Jannik (2015a): “Un calendario de la literatura española (aplicación para Android e iOS)”, in: LINHD (ed.): Humanidades Digitales Hispánicas (HDH’15), Madrid, October 2015 <http://web.archive.org/web/20210126221259/https://dbs.ifi.uni-heidelberg.de/files/Team/jannik/hdh2015-madrid-slides.pdf> [15.06.2021].
  6. Fischer, Frank / Strötgen, Jannik (2015b): “When Does German Literature Take Place? – On the Analysis of Temporal Expressions in Large Corpora”, in: ADHO (ed.): DH 2015: Annual Conference of the Alliance of Digital Humanities Organizations, Sydney, June-July 2015 <http://web.archive.org/web/20210616084951/https://dbs.ifi.uni-heidelberg.de/files/Team/jannik/publications/fischer-stroetgen_temporal-expressions-in-literary-corpora_dh2015_final_2015-03-01.pdf> [15.06.2021].
  7. Gius, Evelyn / Jacke, Janina (2015): Zur Annotation narratologischer Kategorien der Zeit. Guidelines zur Nutzung des CATMA-Tagsets. Hamburg <http://web.archive.org/web/20201212151209/http://heureclea.de/wp-content/uploads/2016/11/guidelinesV2.pdf> [15.06.2021].
  8. Henny-Krahmer, Ulrike (ed.) (2021): Corpus de novelas hispanoamericanas del siglo XIX (conha19). Version 1.0.1. Github.com <https://github.com/cligs/conha19> [15.06.2021]. DOI: 10.5281/zenodo.4766987.
  9. Henny-Krahmer, Ulrike [Forthcoming]: Genre Analysis and Corpus Design: 19th Century Spanish American Novels (1830-1910). Ph.D. thesis, University of Würzburg.
  10. Hettinger, Lena / Reger, Isabella / Jannidis, Fotis / Hotho, Andreas (2016): “Classification of Literary Subgenres”, in: DHd2016. Konferenzabstracts, Leipzig, March 2016: 160-164. DOI: 10.5281/zenodo.4645369.
  11. Janik, Dieter (2008): Hispanoamerikanische Literaturen. Von der Unabhängigkeit bis zu den Avantgarden (1810-1930). Tübingen: Narr Francke Attempto.
  12. Lefere, Robin (2013): La novela histórica: (re)definición, caracterización, tipología. Madrid: Visor Libros.
  13. Padró, Lluis / Stanilovsky, Evgeny (2012): “FreeLing 3.0. Towards Wider Multilinguality”, in: ELRA: Proceedings of the Language Resources and Evaluation Conference (LREC 2012) Istanbul, May 2012 <http://web.archive.org/web/20210117151054/http://nlp.lsi.upc.edu/publications/papers/padro12.pdf> [15.06.2021].
  14. Pedregosa, Fabian / Varoquaux, Gaël / Gramfort, Alexandre / Michel, Vincent / Thirion, Bertrand / Grisel, Olivier / Blondel, Mathieu / Prettenhofer, Peter / Weiss, Ron / Dubourg, Vincent / Vanderplas, Jake / Passos, Alexandre / Cournapeau, David / Brucher, Matthieu / Perrot, Matthieu / Duchesnay, Édouard (2011): “Scikit-learn: Machine Learning in Python”, in: Journal of Machine Learning Research 12: 2825-2830 <http://web.archive.org/web/20210616092859/https://jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf> [15.06.2021].
  15. Schmid, Helmut (1995): “Probabilistic Part-of-Speech Tagging Using Decision Trees”, in: Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK, 1994 <http://web.archive.org/web/20210616092524/https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger1.pdf> [15.06.2021].
  16. Schöch, Christof (2017): “Topic Modeling Genre: An Exploration of French Classical and Enlightenment Drama”, in: Digital Humanities Quarterly 11, 2 <http://web.archive.org/web/20210616092707/http://www.digitalhumanities.org/dhq/vol/11/2/000291/000291.html> [15.06.2021].
  17. Spang, Kurt (1998): “Apuntes para una definición de la novela histórica”, in: Spang, Kurt / Arellano, Ignacio / Mata, Carlos (eds.): La novela histórica. Teoría y comentarios. Pamplona: EUNSA: 63-125.
  18. Strötgen, Jannik / Gertz, Michael (2015): “A Baseline Temporal Tagger for all Languages”, in: Association for Computational Linguistics (ed.): Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, September 2015: 541-547. DOI: 10.18653/v1/D15-1063.
  19. Strötgen, Jannik / Gertz, Michael (2016): Domain-Sensitive Temporal Tagging. London: S. L.: Morgan & Claypool.
  20. Underwood, Ted (2015): Understanding Genre in a Collection of a Million Volumes. White Paper Report 109365. Urbana-Champaign: University of Illinois. DOI: 10.17613/M6W07V.

Notes
  1. The data and code related to this contribution are available at https://github.com/hennyu/time_for_genre_eadh21 (version 1.0).