Notebook for Lemmatization¶
Setup¶
- Check your Python version and install the CLTK library.
!python --version
!which python
!pip install cltk
!pip install --upgrade jupyter ipywidgets
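To confirm the installation before going further, the installed version can be read from the package metadata (a minimal check using only the standard library):
from importlib.metadata import version

# Print the installed CLTK version to confirm the setup
print("CLTK version:", version("cltk"))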
Processing a Single Chapter¶
Familiarize yourself with the document you want to lemmatize.
Read the document carefully.
Document analysis:
- List the first 80 characters.
- Count the total number of characters (letters, spaces, punctuation).
- Tokenize the document to count the number of words.
with open("texts/C_II_all.txt", encoding="utf-8") as f:
Chapter2_full = f.read()
snippet = Chapter2_full[:80]
chars = len(Chapter2_full)
tokens = len(Chapter2_full.split())
print("80 first worlds:", snippet)
print("Character count:", chars)
print("Approximate token count:", tokens)
Running the Lemmatization Pipeline¶
- Import the NLP module.
- Apply lemmatization to the document.
- Print the first 20 lemmatized words as a quick test.
from cltk import NLP
cltk_nlp = NLP(language="lat")
%time cltk_doc = cltk_nlp.analyze(text=Chapter2_full)
# List of lemmas
print(cltk_doc.lemmata[:20])
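Beyond the raw lemma list, each analyzed token is available as a Word object on cltk_doc.words. Printing the surface form next to its lemma and part-of-speech tag makes for a quicker sanity check (a sketch; string, lemma, and upos are the attribute names on Word objects in CLTK 1.x):
# Compare surface forms with their lemmas and POS tags for the first 10 tokens
for word in cltk_doc.words[:10]:
    print(word.string, "→", word.lemma, f"({word.upos})")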
Formatting the Document for Data Visualization¶
- Create a .txt file with one word per line (for archival purposes).
- Create a .txt file containing the fully lemmatized chapter as continuous text, for visualization (e.g., co-occurrence, topic modeling).
# Assuming cltk_doc.lemmata contains the lemmata data
lemmata = cltk_doc.lemmata
# Specify the file path where you want to save the result
file_path = "C_II_lem_col.txt"
# Open the file in write mode ('w'), this will overwrite any existing file with the same name
with open(file_path, 'w') as file:
# Write each lemma to the file
for lemma in lemmata:
file.write(f"{lemma}\n") # Each lemma on a new line
print(f"Results saved to {file_path}")
Reconstitute the text for co-occurrence analysis
# Assuming cltk_doc.lemmata contains the lemmata data
lemmata = cltk_doc.lemmata
# Specify the file path where you want to save the result
file_path = "C_II_lem.txt"
# Open the file in write mode ('w'), this will overwrite any existing file with the same name
with open(file_path, 'w') as file:
# Join the lemmata with spaces between them and write to the file
file.write(" ".join(lemmata)) # All words in one line, separated by spaces
print(f"Results saved to {file_path}")
Processing a Batch of Documents¶
- Work with a large dataset using the glob function. Example path: "texts/*/*.txt"
- Create the necessary folders and subfolders.
- Run the NLP pipeline.
import glob
import os
from cltk import NLP
# Initialize CLTK only once
cltk_nlp = NLP(language="lat")
# Input folder pattern: all .txt files inside subfolders, e.g. "texts/*/*.txt", or a single folder, e.g. "texts/Hyperius/*.txt"
INPUT_PATTERN = "texts/Unbekannt/*.txt"
# Output folder
OUTPUT_DIR = "lemmatized_outputs"
os.makedirs(OUTPUT_DIR, exist_ok=True)
print("Setup complete.")
𐤀 CLTK version '1.4.0'. When using the CLTK in research, please cite: https://aclanthology.org/2021.acl-demo.3/
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`, `LatinEmbeddingsProcess`, `StopsProcess`, `LatinLexiconProcess`.
⸖ ``LatinStanzaProcess`` using Stanza model from the Stanford NLP Group: https://stanfordnlp.github.io/stanza/ . Please cite: https://arxiv.org/abs/2003.07082
⸖ ``LatinEmbeddingsProcess`` using word2vec model by University of Oslo from http://vectors.nlpl.eu/ . Please cite: https://aclanthology.org/W17-0237/
⸖ ``LatinLexiconProcess`` using Lewis's *An Elementary Latin Dictionary* (1890).
⸎ To suppress these messages, instantiate ``NLP()`` with ``suppress_banner=True``.
Setup complete.
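As the banner itself notes, the citation messages can be silenced by instantiating the pipeline with suppress_banner=True:
# Re-create the pipeline without the citation banner
cltk_nlp = NLP(language="lat", suppress_banner=True)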
Verify Documents to be Lemmatized¶
- Display the list of documents to be processed.
files = glob.glob(INPUT_PATTERN)
print(f"Found {len(files)} text files:")
for f in files:
print(" -", f)
Found 12 text files:
 - texts/Unbekannt/C_II_v2_cl.txt
 - texts/Unbekannt/C_II_v8_cl.txt
 - texts/Unbekannt/C_II_v6-7_cl.txt
 - texts/Unbekannt/C_II_v1_cl.txt
 - texts/Unbekannt/C_II_v13-14_cl.txt
 - texts/Unbekannt/C_II_cl.txt
 - texts/Unbekannt/C_II_v5-6_cl.txt
 - texts/Unbekannt/C_II_v11-12_cl.txt
 - texts/Unbekannt/C_II_v15_cl.txt
 - texts/Unbekannt/C_II_v9_cl.txt
 - texts/Unbekannt/C_II_v10_cl.txt
 - texts/Unbekannt/C_II_v3-4_cl.txt
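A small guard here prevents the batch run from silently doing nothing when the pattern is mistyped (an optional defensive check):
# Stop early if the glob pattern matched no files
if not files:
    raise SystemExit(f"No files matched {INPUT_PATTERN}; check the path and pattern.")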
Lemmatizing a Dataset¶
- Create a loop to process multiple documents.
- Choose a consistent naming convention for output files, e.g. filename = "FolderName_FileName_lem.txt".
- Run the NLP pipeline on each document.
- Save the results in a designated folder.
# Loop over each input file
for file_path in files:
print("\nProcessing:", file_path)
# -------------------------------------------------------------------
# Extract folder name + filename
# -------------------------------------------------------------------
folder = os.path.basename(os.path.dirname(file_path)) # e.g. "Bullinger"
filename = os.path.basename(file_path) # e.g. "C_II_v5-7_cl.txt"
# Clean base name: remove extension + replace hyphens
base = os.path.splitext(filename)[0].replace("-", "_")
# Output name format: Folder_Filename_lem.txt
output_name = f"{folder}_{base}_lem.txt"
output_path = os.path.join(OUTPUT_DIR, output_name)
# -------------------------------------------------------------------
# Read the file
# -------------------------------------------------------------------
with open(file_path, "r", encoding="utf-8") as f:
text = f.read()
# -------------------------------------------------------------------
# CLTK NLP
# -------------------------------------------------------------------
cltk_doc = cltk_nlp.analyze(text=text)
lemmas = [w.lemma for w in cltk_doc.words]
# -------------------------------------------------------------------
# Save output
# -------------------------------------------------------------------
with open(output_path, "w", encoding="utf-8") as out:
out.write("\n".join(lemmas))
print("Saved →", output_path)
print("\nAll files processed!")
Processing: texts/Unbekannt/C_II_v2_cl.txt
Saved → lemmatized_outputs/Unbekannt_C_II_v2_cl_lem.txt

Processing: texts/Unbekannt/C_II_v8_cl.txt
Saved → lemmatized_outputs/Unbekannt_C_II_v8_cl_lem.txt

Processing: texts/Unbekannt/C_II_v6-7_cl.txt
Saved → lemmatized_outputs/Unbekannt_C_II_v6_7_cl_lem.txt

Processing: texts/Unbekannt/C_II_v1_cl.txt
Saved → lemmatized_outputs/Unbekannt_C_II_v1_cl_lem.txt

Processing: texts/Unbekannt/C_II_v13-14_cl.txt
Saved → lemmatized_outputs/Unbekannt_C_II_v13_14_cl_lem.txt

Processing: texts/Unbekannt/C_II_cl.txt
Saved → lemmatized_outputs/Unbekannt_C_II_cl_lem.txt

Processing: texts/Unbekannt/C_II_v5-6_cl.txt
Saved → lemmatized_outputs/Unbekannt_C_II_v5_6_cl_lem.txt

Processing: texts/Unbekannt/C_II_v11-12_cl.txt
Saved → lemmatized_outputs/Unbekannt_C_II_v11_12_cl_lem.txt

Processing: texts/Unbekannt/C_II_v15_cl.txt
Saved → lemmatized_outputs/Unbekannt_C_II_v15_cl_lem.txt

Processing: texts/Unbekannt/C_II_v9_cl.txt
Saved → lemmatized_outputs/Unbekannt_C_II_v9_cl_lem.txt

Processing: texts/Unbekannt/C_II_v10_cl.txt
Saved → lemmatized_outputs/Unbekannt_C_II_v10_cl_lem.txt

Processing: texts/Unbekannt/C_II_v3-4_cl.txt
Saved → lemmatized_outputs/Unbekannt_C_II_v3_4_cl_lem.txt

All files processed!
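On larger or noisier datasets, a single malformed file would abort the whole loop. A variant with basic error handling skips such files and reports them at the end (a sketch mirroring the loop above; the naming and saving steps are elided):
# Variant of the loop above that skips files the pipeline cannot handle
failed = []
for file_path in files:
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            text = f.read()
        cltk_doc = cltk_nlp.analyze(text=text)
        # ... extract lemmas and save exactly as in the loop above ...
    except Exception as exc:
        failed.append(file_path)
        print("Skipped:", file_path, "→", exc)
print(f"\nDone. {len(failed)} file(s) skipped.")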
List Processed Documents¶
- Keep track of the documents that have been successfully lemmatized.
print("Generated files:")
for f in sorted(os.listdir(OUTPUT_DIR)):
print(" -", f)
Generated files:
 - Aretius_C_II_cl_lem.txt
 - Aretius_C_II_v10_cl_lem.txt
 - Aretius_C_II_v11_cl_lem.txt
 - Aretius_C_II_v12_cl_lem.txt
 - Aretius_C_II_v13_cl_lem.txt
 - Aretius_C_II_v14_cl_lem.txt
 - Aretius_C_II_v15_cl_lem.txt
 - Aretius_C_II_v1_cl_lem.txt
 - Aretius_C_II_v1b_cl_lem.txt
 - Aretius_C_II_v2_cl_lem.txt
 - Aretius_C_II_v2b_cl_lem.txt
 - Aretius_C_II_v3_cl_lem.txt
 - Aretius_C_II_v6_cl_lem.txt
 - Aretius_C_II_v6b_cl_lem.txt
 - Aretius_C_II_v7_cl_lem.txt
 - Aretius_C_II_v8_cl_lem.txt
 - Aretius_C_II_v9_cl_lem.txt
 - Bugenhagen_C_II_cl_lem.txt
 - Bugenhagen_C_II_v11_cl_lem.txt
 - Bugenhagen_C_II_v1_cl_lem.txt
 - Bugenhagen_C_II_v4_cl_lem.txt
 - Bugenhagen_C_II_v5_cl_lem.txt
 - Bugenhagen_C_II_v6_cl_lem.txt
 - Bugenhagen_C_II_v8_cl_lem.txt
 - Bugenhagen_C_II_v8b_cl_lem.txt
 - Bullinger_C_II_cl_lem.txt
 - Bullinger_C_II_v11_15_cl_lem.txt
 - Bullinger_C_II_v15ep_cl_lem.txt
 - Bullinger_C_II_v15epb_cl_lem.txt
 - Bullinger_C_II_v1_2_cl_lem.txt
 - Bullinger_C_II_v1_cl_lem.txt
 - Bullinger_C_II_v3_4_cl_lem.txt
 - Bullinger_C_II_v5_7_cl_lem.txt
 - Bullinger_C_II_v8_cl_lem.txt
 - Bullinger_C_II_v9_10_cl_lem.txt
 - Cajetan_C_II_cl_lem.txt
 - Calvin_C_II_v11_15_cl_lem.txt
 - Calvin_C_II_v1_2_cl_lem.txt
 - Calvin_C_II_v2_4_cl_lem.txt
 - Calvin_C_II_v5_7_cl_lem.txt
 - Calvin_C_II_v8_10_cl_lem.txt
 - Hyperius_C_II_cl_lem.txt
 - Hyperius_C_II_v11_12_cl_lem.txt
 - Hyperius_C_II_v13_14_cl_lem.txt
 - Hyperius_C_II_v15_cl_lem.txt
 - Hyperius_C_II_v15ep_cl_lem.txt
 - Hyperius_C_II_v1_2_cl_lem.txt
 - Hyperius_C_II_v3_4_cl_lem.txt
 - Hyperius_C_II_v5_6_cl_lem.txt
 - Hyperius_C_II_v7_cl_lem.txt
 - Hyperius_C_II_v8_10_cl_lem.txt
 - Hyperius_C_II_v8_cl_lem.txt
 - Lambertus_C_II_cl_lem.txt
 - Lambertus_C_II_v10_cl_lem.txt
 - Lambertus_C_II_v11_cl_lem.txt
 - Lambertus_C_II_v12_cl_lem.txt
 - Lambertus_C_II_v12b_cl_lem.txt
 - Lambertus_C_II_v13_cl_lem.txt
 - Lambertus_C_II_v14_cl_lem.txt
 - Lambertus_C_II_v15_cl_lem.txt
 - Lambertus_C_II_v1_cl_lem.txt
 - Lambertus_C_II_v2_cl_lem.txt
 - Lambertus_C_II_v3_cl_lem.txt
 - Lambertus_C_II_v4_cl_lem.txt
 - Lambertus_C_II_v5_cl_lem.txt
 - Lambertus_C_II_v6_cl_lem.txt
 - Lambertus_C_II_v7_cl_lem.txt
 - Lambertus_C_II_v8_cl_lem.txt
 - Lambertus_C_II_v9_cl_lem.txt
 - Lefevre_C_II_cl_lem.txt
 - Pellicanus_C_II_v11_12_cl_lem.txt
 - Pellicanus_C_II_v13_14_cl_lem.txt
 - Pellicanus_C_II_v15_cl_lem.txt
 - Pellicanus_C_II_v1_2_cl_lem.txt
 - Pellicanus_C_II_v2_cl_lem.txt
 - Pellicanus_C_II_v3_4_cl_lem.txt
 - Pellicanus_C_II_v5_6_cl_lem.txt
 - Pellicanus_C_II_v6_7_cl_lem.txt
 - Pellicanus_C_II_v8_cl_lem.txt
 - Pellicanus_C_II_v9_10_cl_lem.txt
 - Unbekannt_C_II_cl_lem.txt
 - Unbekannt_C_II_v10_cl_lem.txt
 - Unbekannt_C_II_v11_12_cl_lem.txt
 - Unbekannt_C_II_v13_14_cl_lem.txt
 - Unbekannt_C_II_v15_cl_lem.txt
 - Unbekannt_C_II_v1_cl_lem.txt
 - Unbekannt_C_II_v2_cl_lem.txt
 - Unbekannt_C_II_v3_4_cl_lem.txt
 - Unbekannt_C_II_v5_6_cl_lem.txt
 - Unbekannt_C_II_v6_7_cl_lem.txt
 - Unbekannt_C_II_v8_cl_lem.txt
 - Unbekannt_C_II_v9_cl_lem.txt
Formatting Documents for Data Visualization¶
- Create a .txt file with one word per line (for archival purposes).
- Create a .txt file containing each verse and its commentary as fully lemmatized text (for co-occurrence and topic-modeling visualization).
# Directory with lemma files
LEMMA_DIR = OUTPUT_DIR
# Directory for continuous text output
CONTINUOUS_DIR = "lemmatized"
os.makedirs(CONTINUOUS_DIR, exist_ok=True)
# Loop over all lemma files
for filename in os.listdir(LEMMA_DIR):
if filename.endswith("_lem.txt"):
input_path = os.path.join(LEMMA_DIR, filename)
output_path = os.path.join(CONTINUOUS_DIR, filename)
# Read lemma file
with open(input_path, "r", encoding="utf-8") as f:
lemmas = f.read().splitlines()
# Join lemmas into continuous text
continuous_text = " ".join(lemmas)
# Save
with open(output_path, "w", encoding="utf-8") as out:
out.write(continuous_text)
print("Converted →", output_path)
print("\nAll lemma converted to continuous text!")
Converted → lemmatized/Aretius_C_II_v14_cl_lem.txt
Converted → lemmatized/Lambertus_C_II_v4_cl_lem.txt
Converted → lemmatized/Bullinger_C_II_v5_7_cl_lem.txt
Converted → lemmatized/Hyperius_C_II_v3_4_cl_lem.txt
Converted → lemmatized/Calvin_C_II_v5_7_cl_lem.txt
Converted → lemmatized/Unbekannt_C_II_v10_cl_lem.txt
Converted → lemmatized/Hyperius_C_II_v8_10_cl_lem.txt
Converted → lemmatized/Bugenhagen_C_II_v5_cl_lem.txt
Converted → lemmatized/Pellicanus_C_II_v3_4_cl_lem.txt
Converted → lemmatized/Calvin_C_II_v2_4_cl_lem.txt
Converted → lemmatized/Aretius_C_II_v2b_cl_lem.txt
Converted → lemmatized/Bullinger_C_II_cl_lem.txt
Converted → lemmatized/Lambertus_C_II_v14_cl_lem.txt
Converted → lemmatized/Bugenhagen_C_II_v4_cl_lem.txt
Converted → lemmatized/Aretius_C_II_v8_cl_lem.txt
Converted → lemmatized/Aretius_C_II_v1_cl_lem.txt
Converted → lemmatized/Lefevre_C_II_cl_lem.txt
Converted → lemmatized/Lambertus_C_II_v13_cl_lem.txt
Converted → lemmatized/Calvin_C_II_v11_15_cl_lem.txt
Converted → lemmatized/Lambertus_C_II_v5_cl_lem.txt
Converted → lemmatized/Lambertus_C_II_v10_cl_lem.txt
Converted → lemmatized/Bullinger_C_II_v15epb_cl_lem.txt
Converted → lemmatized/Pellicanus_C_II_v2_cl_lem.txt
Converted → lemmatized/Aretius_C_II_v6b_cl_lem.txt
Converted → lemmatized/Pellicanus_C_II_v5_6_cl_lem.txt
Converted → lemmatized/Hyperius_C_II_v8_cl_lem.txt
Converted → lemmatized/Aretius_C_II_v7_cl_lem.txt
Converted → lemmatized/Aretius_C_II_v3_cl_lem.txt
Converted → lemmatized/Hyperius_C_II_v1_2_cl_lem.txt
Converted → lemmatized/Unbekannt_C_II_cl_lem.txt
Converted → lemmatized/Bullinger_C_II_v1_cl_lem.txt
Converted → lemmatized/Unbekannt_C_II_v2_cl_lem.txt
Converted → lemmatized/Bullinger_C_II_v1_2_cl_lem.txt
Converted → lemmatized/Calvin_C_II_v8_10_cl_lem.txt
Converted → lemmatized/Unbekannt_C_II_v13_14_cl_lem.txt
Converted → lemmatized/Hyperius_C_II_v5_6_cl_lem.txt
Converted → lemmatized/Aretius_C_II_v1b_cl_lem.txt
Converted → lemmatized/Pellicanus_C_II_v1_2_cl_lem.txt
Converted → lemmatized/Bullinger_C_II_v3_4_cl_lem.txt
Converted → lemmatized/Hyperius_C_II_v13_14_cl_lem.txt
Converted → lemmatized/Pellicanus_C_II_v6_7_cl_lem.txt
Converted → lemmatized/Aretius_C_II_v6_cl_lem.txt
Converted → lemmatized/Unbekannt_C_II_v15_cl_lem.txt
Converted → lemmatized/Hyperius_C_II_cl_lem.txt
Converted → lemmatized/Cajetan_C_II_cl_lem.txt
Converted → lemmatized/Lambertus_C_II_v1_cl_lem.txt
Converted → lemmatized/Lambertus_C_II_v2_cl_lem.txt
Converted → lemmatized/Pellicanus_C_II_v9_10_cl_lem.txt
Converted → lemmatized/Aretius_C_II_v9_cl_lem.txt
Converted → lemmatized/Aretius_C_II_v10_cl_lem.txt
Converted → lemmatized/Unbekannt_C_II_v3_4_cl_lem.txt
Converted → lemmatized/Bugenhagen_C_II_v8_cl_lem.txt
Converted → lemmatized/Bullinger_C_II_v8_cl_lem.txt
Converted → lemmatized/Bugenhagen_C_II_v8b_cl_lem.txt
Converted → lemmatized/Pellicanus_C_II_v15_cl_lem.txt
Converted → lemmatized/Pellicanus_C_II_v8_cl_lem.txt
Converted → lemmatized/Bugenhagen_C_II_v11_cl_lem.txt
Converted → lemmatized/Aretius_C_II_v11_cl_lem.txt
Converted → lemmatized/Lambertus_C_II_v3_cl_lem.txt
Converted → lemmatized/Lambertus_C_II_v6_cl_lem.txt
Converted → lemmatized/Aretius_C_II_v15_cl_lem.txt
Converted → lemmatized/Lambertus_C_II_v11_cl_lem.txt
Converted → lemmatized/Lambertus_C_II_v7_cl_lem.txt
Converted → lemmatized/Lambertus_C_II_v8_cl_lem.txt
Converted → lemmatized/Unbekannt_C_II_v5_6_cl_lem.txt
Converted → lemmatized/Bullinger_C_II_v9_10_cl_lem.txt
Converted → lemmatized/Bullinger_C_II_v15ep_cl_lem.txt
Converted → lemmatized/Aretius_C_II_v12_cl_lem.txt
Converted → lemmatized/Calvin_C_II_v1_2_cl_lem.txt
Converted → lemmatized/Bugenhagen_C_II_v6_cl_lem.txt
Converted → lemmatized/Lambertus_C_II_v9_cl_lem.txt
Converted → lemmatized/Bugenhagen_C_II_v1_cl_lem.txt
Converted → lemmatized/Lambertus_C_II_v12_cl_lem.txt
Converted → lemmatized/Aretius_C_II_v2_cl_lem.txt
Converted → lemmatized/Hyperius_C_II_v15_cl_lem.txt
Converted → lemmatized/Hyperius_C_II_v7_cl_lem.txt
Converted → lemmatized/Bugenhagen_C_II_cl_lem.txt
Converted → lemmatized/Hyperius_C_II_v15ep_cl_lem.txt
Converted → lemmatized/Aretius_C_II_cl_lem.txt
Converted → lemmatized/Aretius_C_II_v13_cl_lem.txt
Converted → lemmatized/Hyperius_C_II_v11_12_cl_lem.txt
Converted → lemmatized/Lambertus_C_II_v15_cl_lem.txt
Converted → lemmatized/Lambertus_C_II_cl_lem.txt
Converted → lemmatized/Lambertus_C_II_v12b_cl_lem.txt
Converted → lemmatized/Unbekannt_C_II_v9_cl_lem.txt
Converted → lemmatized/Unbekannt_C_II_v1_cl_lem.txt
Converted → lemmatized/Unbekannt_C_II_v6_7_cl_lem.txt
Converted → lemmatized/Unbekannt_C_II_v8_cl_lem.txt
Converted → lemmatized/Bullinger_C_II_v11_15_cl_lem.txt
Converted → lemmatized/Unbekannt_C_II_v11_12_cl_lem.txt
Converted → lemmatized/Pellicanus_C_II_v11_12_cl_lem.txt
Converted → lemmatized/Pellicanus_C_II_v13_14_cl_lem.txt

All lemmas converted to continuous text!
- Reconstitute the chapter by combining all .txt files that belong to the same text, in alphanumeric order.
import os
from natsort import natsorted  # pip install natsort
# Folder containing your .txt files
folder = "lemmatized"
# Get all .txt files in folder
files = [f for f in os.listdir(folder) if f.endswith(".txt")]
# Extract prefix before first underscore
# Example: Aretius_C_II_v1 → prefix = "Aretius"
prefixes = {}
for f in files:
prefix = f.split("_")[0]
prefixes.setdefault(prefix, []).append(f)
# Process each prefix group
for prefix, grouped_files in prefixes.items():
# Sort alphanumerically
sorted_files = natsorted(grouped_files)
# Output filename
outfile = f"{prefix}_all_C_II_cl_lem.txt"
print(f"Creating {outfile} from {len(sorted_files)} files...")
with open(outfile, "w", encoding="utf-8") as out:
for fname in sorted_files:
path = os.path.join(folder, fname)
with open(path, "r", encoding="utf-8") as infile:
out.write(infile.read())
out.write("\n") # optional separator
print("Done.")
Creating Aretius_all_C_II_cl_lem.txt from 17 files...
Creating Lambertus_all_C_II_cl_lem.txt from 17 files...
Creating Bullinger_all_C_II_cl_lem.txt from 10 files...
Creating Hyperius_all_C_II_cl_lem.txt from 11 files...
Creating Calvin_all_C_II_cl_lem.txt from 5 files...
Creating Unbekannt_all_C_II_cl_lem.txt from 12 files...
Creating Bugenhagen_all_C_II_cl_lem.txt from 8 files...
Creating Pellicanus_all_C_II_cl_lem.txt from 10 files...
Creating Lefevre_all_C_II_cl_lem.txt from 1 files...
Creating Cajetan_all_C_II_cl_lem.txt from 1 files...
Done.
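To verify the reconstituted chapters, a quick token count over the combined files (written to the current directory under the naming convention above) is enough:
import glob

# Count tokens in each reconstituted chapter file
for path in sorted(glob.glob("*_all_C_II_cl_lem.txt")):
    with open(path, encoding="utf-8") as f:
        print(path, "→", len(f.read().split()), "tokens")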
Documentation¶
- Full documentation is available in the CLTK documentation: https://docs.cltk.org/
Citation¶
When using the CLTK, please cite the following publication:
Johnson, Kyle P., Patrick J. Burns, John Stewart, Todd Cook, Clément Besnier, and William J. B. Mattingly. "The Classical Language Toolkit: An NLP Framework for Pre-Modern Languages." In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pp. 20-29. 2021. DOI: 10.18653/v1/2021.acl-demo.3
BibTeX entry:
@inproceedings{johnson-etal-2021-classical,
title = "The {C}lassical {L}anguage {T}oolkit: {A}n {NLP} Framework for Pre-Modern Languages",
author = "Johnson, Kyle P. and
Burns, Patrick J. and
Stewart, John and
Cook, Todd and
Besnier, Cl{\'e}ment and
Mattingly, William J. B.",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-demo.3",
doi = "10.18653/v1/2021.acl-demo.3",
pages = "20--29",
}