| Title: | Compute Semantic Distance Between Text Constituents |
|---|---|
| Description: | Cleans and formats language transcripts guided by a series of transformation options (e.g., lemmatize words, omit stopwords, split strings across rows). 'SemanticDistance' computes two distinct metrics of cosine semantic distance (experiential and embedding). These values reflect pairwise cosine distance between different elements or chunks of a language sample. 'SemanticDistance' can process monologues (e.g., stories, ordered text), dialogues (e.g., conversation transcripts), word pairs arrayed in columns, and unordered word lists. Users specify options for how they wish to chunk distance calculations. These options include: rolling ngram-to-word distance (window of n-words to each new word), ngram-to-ngram distance (2-word chunk to the next 2-word chunk), pairwise distance between words arrayed in columns, matrix comparisons (i.e., all possible pairwise distances between words in an unordered list), and turn-by-turn distance (talker to talker in a dialogue transcript). 'SemanticDistance' includes visualization options for analyzing distances as time series data and simple semantic network dynamics (e.g., clustering, undirected graph network). |
| Authors: | Jamie Reilly [aut, cre] (ORCID: <https://orcid.org/0000-0002-0891-438X>), Emily B. Myers [aut], Hannah R. Mechtenberg [aut], Jonathan E. Peelle [aut] |
| Maintainer: | Jamie Reilly <[email protected]> |
| License: | LGPL (>= 3) |
| Version: | 0.1.1 |
| Built: | 2026-05-29 09:13:38 UTC |
| Source: | https://github.com/reilly-conceptscognitionlab/semanticdistance |
Cleans a transcript where there are two or more talkers. User specifies the dataframe and column name where target text is stored, in addition to a factor variable identifying the person producing each piece of text. Users also specify cleaning parameters for stopword removal and lemmatization (both defaulting to TRUE). Function splits and unlists text so that the output is in a one-row-per-word format marked by a unique numeric identifier (i.e., 'id_orig'). Function appends a turn_count sequence used for aggregating all the words within each turn. If a speaker generates no complete observations because of stopword removal, the turn counter will not increment until a talker switch AND a complete observation is observed.
clean_dialogue(dat, wordcol, who_talking, omit_stops = TRUE, lemmatize = TRUE)
dat |
a dataframe with at least one target column of string data |
wordcol |
quoted column name storing the strings that will be cleaned and split |
who_talking |
quoted column name with speaker/talker identities (will be converted to a factor) |
omit_stops |
T/F user wishes to remove stopwords (default is TRUE) |
lemmatize |
T/F user wishes to lemmatize each string (default is TRUE) |
a dataframe
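The turn-counting rule described above can be illustrated with a small base R sketch. This is not the package's implementation: `count_turns()`, the `toy` data, and the choice to carry the current count onto NA rows are all assumptions made for illustration.

```r
# Toy one-row-per-word data; NA marks a word removed as a stopword.
toy <- data.frame(
  talker     = c("Mary", "Mary", "Peter", "Mary", "Mary", "Peter"),
  word_clean = c("dog",  "bark", NA,      "cat",  NA,     "mouse"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the counter increments only when the talker differs
# from the talker of the last complete observation AND the current row is
# itself a complete observation.
count_turns <- function(talker, word) {
  turn <- integer(length(word))
  current <- 0L
  last_talker <- NA_character_
  for (i in seq_along(word)) {
    complete <- !is.na(word[i])
    switched <- is.na(last_talker) || talker[i] != last_talker
    if (complete && switched) {
      current <- current + 1L
      last_talker <- talker[i]
    }
    turn[i] <- current  # assumption: NA rows carry the current count
  }
  turn
}

toy$turn_count <- count_turns(toy$talker, toy$word_clean)
print(toy)
```

In this toy transcript Peter's first contribution is entirely removed, so Mary's words on either side of it stay in the same turn.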
Cleans and formats text. User specifies the dataframe and column name where target text is stored as arguments to the function. Default option is to lemmatize strings. Function splits and unlists text so that the output is in a one-row-per-word format marked by a unique numeric identifier (i.e., 'id_orig').
clean_monologue_or_list(dat, wordcol, omit_stops = TRUE, lemmatize = TRUE)
dat |
a dataframe with at least one target column of string data |
wordcol |
quoted column name storing the strings that will be cleaned and split |
omit_stops |
option for omitting stopwords (default is TRUE) |
lemmatize |
option for lemmatizing strings (default is TRUE) |
a dataframe
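As a rough picture of the one-row-per-word output, the base R sketch below splits a single text cell and marks removed stopwords as NA. It is not the package's code: the column name `word_orig`, the two-word stopword list, and the lowercase step are assumptions, and lemmatization is left out.

```r
# One cell of unsplit text, as in a monologue sample.
dat <- data.frame(mytext = "The girl walked down the street",
                  stringsAsFactors = FALSE)

# Split on whitespace and unlist so each word gets its own row.
words <- unlist(strsplit(tolower(dat$mytext), "\\s+"))

# Hypothetical tiny stopword list, for illustration only.
stops <- c("the", "down")

out <- data.frame(id_orig = seq_along(words),
                  word_orig = words,
                  stringsAsFactors = FALSE)

# Stopwords become NA rather than being dropped, so id_orig stays intact.
out$word_clean <- ifelse(out$word_orig %in% stops, NA_character_, out$word_orig)
print(out)
```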
Cleans a transcript where word pairs are arrayed in two columns.
clean_paired_cols(dat, wordcol1, wordcol2, lemmatize = TRUE)
dat |
a dataframe with two columns of words you want pairwise distance for |
wordcol1 |
quoted column name storing the first string for comparison |
wordcol2 |
quoted column name storing the second string for comparison |
lemmatize |
T/F user wishes to lemmatize each string (default is TRUE) |
a dataframe
A sample dyadic conversation transcript where two people are conversing.
Dialogue_Typical
## "Dialogue_Typical" A data frame with 5 rows and 2 columns:
fictional text from a language transcript
Mary or Peter: fictional speaker identities
...
Function takes dataframe cleaned using 'clean_monologue_or_list' and computes the distance between a fixed anchor (the initial chunk of words, sized by anchor_size) and each new word as the sample unfolds.
dist_anchor(dat, anchor_size = 10)
dat |
a dataframe prepped using 'clean_monologue_or_list' fn |
anchor_size |
an integer specifying the number of words in the initial chunk for comparison to new words as the sample unfolds |
a dataframe
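The anchor comparison can be sketched in base R as follows. The two-dimensional vectors are invented, and treating the anchor as the mean vector of the first `anchor_size` words is an assumption rather than a documented detail of `dist_anchor`.

```r
# Cosine distance between two numeric vectors.
cos_dist <- function(a, b) 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical 2-dimensional "embeddings", one row per word, in text order.
vecs <- rbind(dog = c(1, 0), cat = c(1, 0), wolf = c(1, 1), car = c(0, 1))
anchor_size <- 2

# Assumption: the anchor is the mean vector of the first anchor_size words.
anchor <- colMeans(vecs[seq_len(anchor_size), , drop = FALSE])

# Every later word is compared with the same fixed anchor.
later <- vecs[-seq_len(anchor_size), , drop = FALSE]
dist_to_anchor <- apply(later, 1, cos_dist, b = anchor)
print(round(dist_to_anchor, 3))
```

Because the anchor never moves, rising values indicate drift away from the opening topic rather than local word-to-word change.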
Function takes dataframe cleaned using 'clean_dialogue' and computes two metrics of semantic distance turn-to-turn, indexing a 'talker' column. Sums the semantic vectors of all words within each turn, then computes the cosine distance to the next turn's composite vector.
dist_dialogue(dat, who_talking)
dat |
a dataframe prepped using 'clean_dialogue' fn with talker data and turncount appended |
who_talking |
factor variable with two levels specifying an ID for the person producing the text in 'word_clean' |
a dataframe
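A minimal base R sketch of the turn-to-turn computation described above: word vectors are summed within each turn and each composite is compared with the next. The column names `d1` and `d2` and all values are made up; the package's own column layout may differ.

```r
# Cosine distance between two numeric vectors.
cos_dist <- function(a, b) 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy one-row-per-word data with a turn counter and 2 hypothetical dimensions.
toy <- data.frame(
  turn_count = c(1, 1, 2, 3, 3),
  d1 = c(1, 0, 1, 0, 0),
  d2 = c(0, 1, 1, 1, 2)
)

# Sum the word vectors within each turn to get one composite vector per turn.
turn_vecs <- rowsum(as.matrix(toy[, c("d1", "d2")]), group = toy$turn_count)

# Cosine distance from each turn's composite vector to the next turn's.
n <- nrow(turn_vecs)
turn_dist <- vapply(seq_len(n - 1),
                    function(i) cos_dist(turn_vecs[i, ], turn_vecs[i + 1, ]),
                    numeric(1))
print(round(turn_dist, 3))
```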
Function takes dataframe cleaned using 'clean_monologue_or_list' and computes rolling chunk-to-chunk distance between chunks of a user-specified ngram size (e.g., 2-word chunks).
dist_ngram2ngram(dat, ngram)
dat |
a dataframe prepped using 'clean_monologue_or_list' fn |
ngram |
an integer specifying the chunk size (number of words) used for each chunk-to-chunk comparison |
a dataframe
Function takes dataframe cleaned using 'clean_monologue_or_list' and computes two metrics of semantic distance for each word relative to the average of the semantic vectors within an n-word window appearing before each word. User specifies the window (ngram) size. The window 'rolls' across the language sample, providing distance metrics for each new word.
dist_ngram2word(dat, ngram)
dat |
a dataframe prepped using 'clean_monologue_or_list' fn |
ngram |
an integer specifying the window size (in words) for computing distance to a target word; the window extends backward, skipping NAs, until the number of content words equals the ngram size |
a dataframe
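The rolling window, including the backward skip over NA rows, might look like the base R sketch below. The vectors are invented, and returning NA until a full window of content words is available is an assumption about edge handling, not confirmed package behavior.

```r
# Cosine distance between two numeric vectors.
cos_dist <- function(a, b) 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical word vectors in text order; row 2 is a removed stopword (NA).
vecs <- rbind(c(1, 0), c(NA, NA), c(0, 1), c(1, 1), c(1, 0))
ngram <- 2

roll_dist <- rep(NA_real_, nrow(vecs))
ok <- which(complete.cases(vecs))  # rows that hold content words

for (i in ok) {
  prior <- ok[ok < i]              # earlier content words, NAs skipped
  if (length(prior) >= ngram) {    # assumption: NA until the window is full
    window <- tail(prior, ngram)
    window_mean <- colMeans(vecs[window, , drop = FALSE])
    roll_dist[i] <- cos_dist(vecs[i, ], window_mean)
  }
}
print(round(roll_dist, 3))
```

For the word in row 4, the window reaches back past the NA in row 2 and averages rows 1 and 3.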
Function takes dataframe cleaned using 'clean_paired_cols' and computes two metrics of semantic distance for each word pair arrayed in Col1 vs. Col2.
dist_paired_cols(dat)
dat |
a dataframe prepped using 'clean_paired_cols' with word pairs arrayed in two columns |
a dataframe
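Row-wise distance for word pairs reduces to one cosine distance per row. The base R sketch below uses an invented three-word lookup table `emb` and hypothetical column names `word1` and `word2`; it only illustrates the calculation, not the output format of `dist_paired_cols`.

```r
# Cosine distance between two numeric vectors.
cos_dist <- function(a, b) 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical embedding lookup: one named row per word.
emb <- rbind(dog = c(1, 0, 0), cat = c(1, 1, 0), banana = c(0, 0, 1))

# Word pairs arrayed in two columns.
pairs <- data.frame(word1 = c("dog", "dog"),
                    word2 = c("cat", "banana"),
                    stringsAsFactors = FALSE)

# One distance per row: look up both words, then compare their vectors.
pairs$cos_dist <- unname(mapply(function(w1, w2) cos_dist(emb[w1, ], emb[w2, ]),
                                pairs$word1, pairs$word2))
print(pairs)
```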
Word embeddings (300 hyperparameter dimensions, 59061 words). Each word is one row.
glowca_25
## "glowca_25" A data frame with 59061 observations of 301 variables
word characterized across embeddings
hyperparameter number 1
hyperparameter number 300
...
A monologue discourse sample. Grandfather Passage is a well-known test of reading aloud.
Grandfather_Passage
## "Grandfather_Passage" A data frame with 1 observation of 1 variable:
text from the Grandfather Passage unsplit
...
Load all .rda files from a GitHub data folder into the package environment
load_github_data( repo = "Reilly-ConceptsCognitionLab/SemanticDistance_Data", branch = "main", data_folder = "data", envir = parent.frame() )
repo |
GitHub repository (e.g., "username/repo") |
branch |
Branch name (default: "main") |
data_folder |
Remote folder containing .rda files (default: "data") |
envir |
Environment to load into (default: parent.frame(), the calling environment) |
Nothing is returned; loads data (as .rda files) from the GitHub repository needed for other package functions
Dataframe with ordered text squashed into a single cell.
Monologue_Typical
## "Monologue_Typical" A data frame with 1 row and 1 column
text from a language transcript
...
Experiential semantic norms (15 dimensions, 25,050 words). Each word is one row.
SD15_2025_complete
## "SD15_2025_complete" A data frame with 25,050 observations of 16 variables
word characterized across 15 ratings
z-score of auditory salience from Lancaster Sensorimotor Norms
z-score of gustatory salience from Lancaster Sensorimotor Norms
z-score of haptic salience from Lancaster Sensorimotor Norms
z-score of interoceptive salience from Lancaster Sensorimotor Norms
z-score of visual salience from Lancaster Sensorimotor Norms
z-score of olfactory salience from Lancaster Sensorimotor Norms
z-score of handarm motor salience from Lancaster Sensorimotor Norms
z-score of excitement salience from affectvec
z-score of surprise salience from affectvec
z-score of fear salience from affectvec
z-score of anger salience from affectvec
z-score of disgust salience from affectvec
z-score of sadness salience from affectvec
z-score of happiness salience from affectvec
z-score of contempt salience from affectvec
...
List of stopwords
Temple_stops25
## "Temple_stops25" A data frame with 829 observations of 4 variables
numeric identifier
stopword target
length in words
universal part-of-speech tag
...
No talker delineated. List of 17 words spanning 4 semantic categories. Good for examining clustering.
Unordered_List
## "Unordered_List" A data frame with 1 row and 1 column:
unsplit list of words containing musical instruments, weapons, fruits, emotions
Word pairs arrayed in two columns: first target word for computing distance in one column, second word in another column.
Word_Pairs
## "Word_Pairs" A data frame with 27 rows and 2 columns:
text corresponding to the first word in a pair to contrast
text corresponding to the second word in a pair to contrast
...
Takes a vector of words with semantic distance ratings, converts to a square matrix, then to a Euclidean distance matrix (all word pairs), then plots the words in either a cluster dendrogram or a simple igraph network.
wordlist_to_network( dat, wordcol, output = "dendrogram", dist_type = "embedding" )
dat |
dataframe with text in it (cleaned using the clean_monologue_or_list function) |
wordcol |
quoted argument identifying column in dataframe with target text |
output |
quoted argument for type of output; default is 'dendrogram', alternative is 'network' |
dist_type |
quoted argument naming the semantic norms used to build the distance matrix; default is 'embedding', the other option is 'SD15' |
This function internally calls eval_kmeans_clustersize for cluster evaluation. The dendrogram visualization is based on hierarchical clustering of semantic distances.
a plot of a dendrogram or an igraph network AND a cosine distance matrix
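The matrix-then-cluster pipeline can be approximated in base R as below. The four-word `emb` table, the average linkage method, and the two-cluster cut are assumptions chosen for illustration; the sketch does not call `wordlist_to_network` or `eval_kmeans_clustersize`.

```r
# Hypothetical 2-dimensional vectors for a short unordered word list.
emb <- rbind(guitar = c(1.0, 0.1), piano = c(0.9, 0.2),
             apple  = c(0.1, 1.0), pear  = c(0.2, 0.9))

# Square matrix of all pairwise cosine distances.
norms <- sqrt(rowSums(emb^2))
cos_mat <- 1 - (emb %*% t(emb)) / (norms %o% norms)

# Euclidean distances between the rows of the cosine matrix, then
# hierarchical clustering (the linkage method here is an assumption).
hc <- hclust(dist(cos_mat, method = "euclidean"), method = "average")

# Cut the tree into two groups; plot(hc) would draw the dendrogram.
clusters <- cutree(hc, k = 2)
print(clusters)
```

With these toy vectors the instruments and the fruits fall into separate clusters.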