Data Cleaning and Preparation Script for Lemmatization¶

Cleaning TXT Files¶

Steps¶

  1. Reconstruct words split by hyphens in the original document.
  2. Remove punctuation marks except for periods.
  3. Convert uppercase letters to lowercase.
  4. Remove accents.
  5. Remove words in Greek.
  6. Remove numbers.
  7. Correct recurring OCR errors using a .csv dictionary.
  8. Normalize spelling of words from Neo-Latin to Classical Latin (only for recurring words) using a .csv dictionary.
  9. Resolve general abbreviations using a .csv dictionary.
  10. Resolve abbreviations for book names of the Bible using a .csv dictionary.
  11. Remove floating periods.
  12. Add a line break after each period (not required for lemmatization, but improves document readability).

Method¶

  • Raw data cleaning is performed using regex.

  • Word corrections are handled using CSV dictionaries, which can be easily expanded with new entries.

  • Dictionaries are located at /Dictionaries/*.csv:

    • abreviation_solved_new.csv
    • biblicalab_solved_new.csv
    • corpus_standardize.csv
    • old_dict
    • recurrent_ocr_failure_new.csv

Tips¶

  • The notebook allows step-by-step visualization of the data processing.
  • Alternatively, the script cleaning.py can be used for batch processing.

Verify your python setup¶

In [1]:
! python --version
! which python
Python 3.12.3
/home/floriane/Documents/visual_code/.venvocc/bin/python

Cleaning step¶

  1. Reconstruct words split by hyphens in the original document.
In [2]:
import re

def clean_text(text):
    # 1. Fix inline hyphens: "agnitio- nem" → "agnitionem"
    text = re.sub(r'(\w)-\s+(\w)', r'\1\2', text)

    # 2. Remove soft hyphens (invisible OCR hyphens)
    text = text.replace('\u00AD', '')

    # 3. Remove hyphen + newline (in case those also appear)
    text = re.sub(r'-\s*\r?\n\s*', '', text)
    
    return text
  1. Remove punctuation marks except for periods.
In [3]:
import re
import string

def clean_punctuation(text):
    # 1. Replace ':' with a space before removing it
    text = text.replace(":", " ")

    # 2. Build list of punctuation to remove (everything except '.')
    punctuation_to_remove = string.punctuation.replace(".", "")
    
    # Add extra characters you want removed
    punctuation_to_remove += "„»"

    # 3. Remove all unwanted punctuation
    text = re.sub(f"[{re.escape(punctuation_to_remove)}]", "", text)

    return text
  1. Convert uppercase letters to lowercase.
In [4]:
def to_lowercase(text):
    # First convert uppercase V → U
    text = text.replace("V", "U")
    # Then lowercase everything else (U will become u)
    text = text.lower()
    return text
  1. Remove accents.
In [5]:
import unicodedata

def remove_accents(text):
    # Normalize text to NFD form (decomposed)
    nfkd = unicodedata.normalize("NFD", text)
    # Keep only characters that are NOT combining marks (accents)
    return "".join([c for c in nfkd if not unicodedata.combining(c)])
  1. Remove words in Greek.
In [6]:
import re

def remove_greek(text):
    # Remove characters in Greek Unicode ranges
    return re.sub(r'[\u0370-\u03FF\u1F00-\u1FFF]+', '', text)
  1. Remove numbers.
In [7]:
import re

def remove_numbers(text):
    # remove all digits 0-9
    return re.sub(r'\d+', '', text)
  1. Correct recurring OCR errors using a .csv dictionary.
In [8]:
import pandas as pd
import re 
def ocr_failure (text, csv_path="Dictionnaries/recurrent_ocr_failure_new.csv"):
    df = pd.read_csv(csv_path)
    
    for _, row in df.iterrows():
        pattern = row['mistake']
        repl = row['correction']
        text = re.sub(pattern, repl, text)
    
    return text
  1. Normalize spelling of words from Neo-Latin to Classical Latin (only for recurring words) using a .csv dictionary.
In [9]:
import pandas as pd
import re 
def standardize_latin (text, csv_path="Dictionnaries/corpus_standardize.csv"):
    df = pd.read_csv(csv_path)
    
    for _, row in df.iterrows():
        pattern = row['original']
        repl = row['standardize']
        text = re.sub(pattern, repl, text)
    
    return text
  1. Resolve general abbreviations using a .csv dictionary.
In [10]:
import pandas as pd
import re

def solved_abreviation(text, csv_path="Dictionnaries/abreviation_solved_new.csv"):
    # Load CSV into a DataFrame
    df = pd.read_csv(csv_path)
    
    for _, row in df.iterrows():
        pattern = row['abreviation']
        repl = row['resolution']
        text = re.sub(pattern, repl, text)
    
    return text
  1. Resolve abbreviations for book names of the Bible using a .csv dictionary.
In [11]:
import pandas as pd
import re

def solved_biblicalab(text, csv_path="Dictionnaries/biblicalab_solved_new.csv"):
    # Load CSV into a DataFrame
    df = pd.read_csv(csv_path)
    
    for _, row in df.iterrows():
        pattern = row['abreviation2']
        repl = row['resolution2']
        text = re.sub(pattern, repl, text)
    
    return text
  1. Remove floating periods.
In [12]:
def remove_dot_with_spaces(text):
    return text.replace(" . ", " ")
    
# Example
s = "This is a test . With some dots . "
cleaned = remove_dot_with_spaces(s)
print(cleaned)  # Output: "This is a test With some dots "
This is a test With some dots 
In [13]:
def remove_double_spaces(text):
    """
    Replace all consecutive spaces in the text with a single space.
    """
    # Split by whitespace and join back with single spaces
    return ' '.join(text.split())

# Example usage
text = "This  is   an example  with  double spaces."
cleaned_text = remove_double_spaces(text)
print(cleaned_text)
This is an example with double spaces.
  1. Add a line break after each period (not required for lemmatization, but improves document readability).
In [14]:
def add_newline_after_dot(text):
    """
    Add a newline character after each period in the text.
    """
    # Use regex to handle period followed by optional spaces
    import re
    # Replace ". " or "." followed by end of text with ".\n"
    return re.sub(r"\.\s*", ".\n", text)

# Example usage
text = "This is a sentence. This is another sentence. And one more."
result = add_newline_after_dot(text)
print(result)
This is a sentence.
This is another sentence.
And one more.

In [15]:
def clean_all(text):
    text = clean_text(text)
    print("clean_text: ok")

    text = clean_punctuation(text)
    print("clean_punctuation_and_newlines: ok")    
    
    text = to_lowercase(text)
    print("to_lowercase: ok")

    text = remove_accents(text)
    print("remove_accents: ok")

    text = remove_greek(text)
    print("remove_greek: ok")

    text = remove_numbers(text)
    print("remove_numbers: ok")

    text= ocr_failure(text)
    print("ocr_failure_corrected : ok")
    text = standardize_latin(text)
    print("standardize_latin: ok")

    text = solved_abreviation(text)
    print("solved_abreviation: ok")
    
    text = solved_biblicalab(text)
    print("solved biblcal abreviation : ok")
    
    text = remove_dot_with_spaces(text)
    print("remove_dot_with_spaces: ok")

    text = remove_double_spaces(text)
    print("remove_double_spaces:ok")
    
    text = add_newline_after_dot(text)
    print("add_newline_after_dot: ok")

    return text
In [16]:
import glob
import os

paths = glob.glob("test/**/*.txt", recursive=True)

for path in paths:
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

    cleaned = clean_all(text)

    # create new filename: file.txt → file_cl.txt
    base, ext = os.path.splitext(path)
    new_path = base + "_cl" + ext

    with open(new_path, "w", encoding="utf-8") as f:
        f.write(cleaned)

    print(f"Saved cleaned file: {new_path}")
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Pellicanus/C_II_v1-2_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Pellicanus/C_II_v2_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Pellicanus/C_II_v8_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Pellicanus/C_II_v11-12_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Pellicanus/C_II_v15_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Pellicanus/C_II_v5-6_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Pellicanus/C_II_v6-7_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Pellicanus/C_II_v3-4_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Pellicanus/C_II_v9-10_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Pellicanus/C_II_v13-14_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Unbekannt/C_II_v2_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Unbekannt/C_II_v9_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Unbekannt/C_II_v8_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Unbekannt/C_II_v1_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Unbekannt/C_II_v11-12_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Unbekannt/C_II_v15_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Unbekannt/C_II_v5-6_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Unbekannt/C_II_v6-7_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Unbekannt/C_II_v3-4_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Unbekannt/C_II_v10_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Unbekannt/C_II_v13-14_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Unbekannt/C_II_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bullinger/C_II_v1-2_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bullinger/C_II_v15epb_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bullinger/C_II_v8_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bullinger/C_II_v1_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bullinger/C_II_v11-15_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bullinger/C_II_v3-4_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bullinger/C_II_v15ep_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bullinger/C_II_v9-10_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bullinger/C_II_v5-7_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bullinger/C_II_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Aretius/C_II_v13_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Aretius/C_II_v2_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Aretius/C_II_v14_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Aretius/C_II_v1b_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Aretius/C_II_v12_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Aretius/C_II_v9_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Aretius/C_II_v2b_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Aretius/C_II_v8_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Aretius/C_II_v6b_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Aretius/C_II_v1_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Aretius/C_II_v15_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Aretius/C_II_v6_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Aretius/C_II_v11_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Aretius/C_II_v3_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Aretius/C_II_v10_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Aretius/C_II_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Aretius/C_II_v7_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Hyperius/C_II_v1-2_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Hyperius/C_II_v8_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Hyperius/C_II_v11-12_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Hyperius/C_II_v15_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Hyperius/C_II_v5-6_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Hyperius/C_II_v3-4_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Hyperius/C_II_v8-10_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Hyperius/C_II_v15ep_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Hyperius/C_II_v13-14_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Hyperius/C_II_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Hyperius/C_II_v7_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Calvin/C_II_v1-2_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Calvin/C_II_v2-4_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Calvin/C_II_v11-15_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Calvin/C_II_v8-10_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Calvin/C_II_v5-7_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lefevre/C_II_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bugenhagen/C_II_v5_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bugenhagen/C_II_v8_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bugenhagen/C_II_v4_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bugenhagen/C_II_v1_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bugenhagen/C_II_v6_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bugenhagen/C_II_v11_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bugenhagen/C_II_v8b_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Bugenhagen/C_II_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Cajetan/C_II_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lambertus/C_II_v5_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lambertus/C_II_v13_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lambertus/C_II_v2_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lambertus/C_II_v14_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lambertus/C_II_v12_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lambertus/C_II_v9_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lambertus/C_II_v8_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lambertus/C_II_v4_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lambertus/C_II_v1_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lambertus/C_II_v15_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lambertus/C_II_v6_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lambertus/C_II_v11_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lambertus/C_II_v12b_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lambertus/C_II_v3_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lambertus/C_II_v10_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lambertus/C_II_cl.txt
clean_text: ok
clean_punctuation_and_newlines: ok
to_lowercase: ok
remove_accents: ok
remove_greek: ok
remove_numbers: ok
ocr_failure_corrected : ok
standardize_latin: ok
solved_abreviation: ok
solved biblcal abreviation : ok
remove_dot_with_spaces: ok
remove_double_spaces:ok
add_newline_after_dot: ok
Saved cleaned file: test/Lambertus/C_II_v7_cl.txt

For an other data processing¶

reconstitue the chapter by merging all the file.txt in ordre.

In [17]:
import re
import glob
import os

def sort_key(filename):
    """
    Return a tuple for sorting:
    1. start number after 'v' (0 if none)
    2. end number after '-' (same as start if no range)
    3. ep_flag: 0 = none, 1 = ep, 2 = epb
    """
    basename = os.path.basename(filename)
    match = re.search(r'_v(\d+)(?:-(\d+))?(epb|ep)?', basename)
    if match:
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        ep_flag = 0
        if match.group(3) == "ep":
            ep_flag = 1
        elif match.group(3) == "epb":
            ep_flag = 2
        return (start, end, ep_flag)
    else:
        return (-1, -1, 0)

# Get all subfolders under 'test'
folders = [f for f in glob.glob("test/*") if os.path.isdir(f)]

for folder in folders:
    # Find all txt files ending with '_cl.txt' in this folder
    files = glob.glob(os.path.join(folder, "*_cl.txt"))
    files_sorted = sorted(files, key=sort_key)
    
    # Output file name is based on the folder
    folder_name = os.path.basename(folder)
    output_file = f"{folder_name}_all_cl.txt"
    
    with open(output_file, "w", encoding="utf-8") as outfile:
        for fname in files_sorted:
            with open(fname, "r", encoding="utf-8") as infile:
                outfile.write(infile.read())
                outfile.write("\n")  # newline between files
    
    print(f"Merged files in folder '{folder_name}' into '{output_file}'")
Merged files in folder 'Pellicanus' into 'Pellicanus_all_cl.txt'
Merged files in folder 'Unbekannt' into 'Unbekannt_all_cl.txt'
Merged files in folder 'Bullinger' into 'Bullinger_all_cl.txt'
Merged files in folder 'Aretius' into 'Aretius_all_cl.txt'
Merged files in folder 'Hyperius' into 'Hyperius_all_cl.txt'
Merged files in folder 'Calvin' into 'Calvin_all_cl.txt'
Merged files in folder 'Lefevre' into 'Lefevre_all_cl.txt'
Merged files in folder 'Bugenhagen' into 'Bugenhagen_all_cl.txt'
Merged files in folder 'Cajetan' into 'Cajetan_all_cl.txt'
Merged files in folder 'Lambertus' into 'Lambertus_all_cl.txt'
In [18]:
"""""""""
import csv

# Input: old CSV
with open('Dictionnaries/recurrent_ocr_failure.csv', newline='', encoding='utf-8') as infile:
    reader = csv.reader(infile)
    rows = [row for row in reader]

# Output: new CSV
with open('new_dict.csv', 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for row in rows:
        pattern = row[0]
        replacement = row[1] + " "  # Add trailing space
        writer.writerow([pattern, replacement])

print("CSV updated to new format with trailing spaces!")

"""""""""
Out[18]:
'\nimport csv\n\n# Input: old CSV\nwith open(\'Dictionnaries/recurrent_ocr_failure.csv\', newline=\'\', encoding=\'utf-8\') as infile:\n    reader = csv.reader(infile)\n    rows = [row for row in reader]\n\n# Output: new CSV\nwith open(\'new_dict.csv\', \'w\', newline=\'\', encoding=\'utf-8\') as outfile:\n    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)\n    for row in rows:\n        pattern = row[0]\n        replacement = row[1] + " "  # Add trailing space\n        writer.writerow([pattern, replacement])\n\nprint("CSV updated to new format with trailing spaces!")\n\n'