Title: | Bayesian Clustering Using the Table Invitation Prior (TIP) |
---|---|
Description: | Cluster data without specifying the number of clusters using the Table Invitation Prior (TIP) introduced in the paper "Clustering Gene Expression Using the Table Invitation Prior" by Charles W. Harrison, Qing He, and Hsin-Hsiung Huang (2022) <doi:10.3390/genes13112036>. TIP is a Bayesian prior that uses pairwise distance and similarity information to cluster vectors, matrices, or tensors. |
Authors: | Charles W. Harrison [cre, aut, cph], Qing He [aut, cph], Hsin-Hsiung Huang [aut, cph] |
Maintainer: | Charles W. Harrison <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2025-03-03 03:03:13 UTC |
Source: | https://github.com/charleswharrison/tip |
An S4 class to store the results of the Gibbs sampler.
An object of class bcm.
n
Positive integer: the sample size (i.e., the number of subjects).
burn
Non-negative integer: the number of burn-in iterations in the Gibbs sampler.
samples
Positive integer: the number of sampling iterations in the Gibbs sampler.
posterior_assignments
List of vectors of positive integers: a list of vectors of cluster assignments (i.e., positive integers) for each sampling iteration in the Gibbs sampler.
posterior_similarity_matrix
Matrix: a matrix where the (i,j)th element is the posterior probability that subject i and subject j belong to the same cluster.
posterior_number_of_clusters
Vector of positive integers: each vector element is the number of clusters after posterior sampling for each sampling iteration in the Gibbs sampler.
prior_name
Character: the name of the prior used.
likelihood_name
Character: the name of the likelihood used.
Estimate the number of similar subjects using univariate multiple change point detection (i.e., binary segmentation in the changepoint package).
get_cpt_neighbors(.distance_matrix)
get_cpt_neighbors(.distance_matrix)
.distance_matrix |
Matrix: a symmetric n x n matrix of distance values. |
Vector of positive integers: a vector of positive integers where the (i)th integer corresponds to the number of subjects (observations) that are similar to the (i)th subject.
# Import the tip library library(tip) # Choose an arbitrary random seed set.seed(007) # Generate some data (i.e., 20 subjects described by a 5 x 1 vector) X <- matrix(rnorm(10*10),nrow=20,ncol=5) # Compute the pairwise distances between the subjects distance_matrix <- data.matrix(dist(X)) # For each subject, find the estimate for the number of similar subjects get_cpt_neighbors(.distance_matrix = distance_matrix)
# Import the tip library library(tip) # Choose an arbitrary random seed set.seed(007) # Generate some data (i.e., 20 subjects described by a 5 x 1 vector) X <- matrix(rnorm(10*10),nrow=20,ncol=5) # Compute the pairwise distances between the subjects distance_matrix <- data.matrix(dist(X)) # For each subject, find the estimate for the number of similar subjects get_cpt_neighbors(.distance_matrix = distance_matrix)
A function that produces a ggnet2 network plot to visualize the posterior similarity matrix (i.e., the matrix of posterior probabilities).
ggnet2_network_plot( .matrix_graph, .subject_names = vector(), .subject_class_names = vector(), .class_colors, .class_shapes, .random_seed = 7, .node_size = 6, .add_node_labels = TRUE )
ggnet2_network_plot( .matrix_graph, .subject_names = vector(), .subject_class_names = vector(), .class_colors, .class_shapes, .random_seed = 7, .node_size = 6, .add_node_labels = TRUE )
.matrix_graph |
Matrix: a matrix M where each element Mij corresponds to the posterior probability that the (i)th subject and the (j)th subject are in the same cluster. |
.subject_names |
Vector of characters: an optional vector of subject names that will appear in the graph plot. |
.subject_class_names |
Vector of characters: an optional vector of class names corresponding to each subject (i.e. vertex in the graph) which influences each vertex's color and shape. For example, the subject class names can be the true label (for the purpose of research) or it can be any other label that analyst chooses. |
.class_colors |
Named vector of characters: an optional named vector of colors that
correspond to each unique value in |
.class_shapes |
Named vector of integers: an optional named vector of shapes that correspond
to each unique value in the |
.random_seed |
Numeric: the plot uses the random layout, so set a seed for reproducibility. |
.node_size |
Positive integer: the size of each node (i.e., vertex) in the graph plot. |
.add_node_labels |
Boolean (i.e., TRUE or FALSE): should individual node labels be added to each node (i.e., vertex) in the graph plot? |
ggnet2 network plot: a network plot with respect to the undirected network given by .matrix_graph. This is used to visualize the posterior similarity matrix.
# Import the tip library library(tip) # Choose an arbitrary random seed to generate the data set.seed(4*8*15*16*23*42) # Generate a symmetric posterior probability matrix # Each element is the probability that the two subjects belong # to the same cluster n1 <- 10 posterior_prob_matrix <- matrix(NA, nrow = n1, ncol = n1) for(i in 1:n1){ for(j in i:n1){ if(i != j){ posterior_prob_matrix[i,j] <- runif(n=1,min=0,max=1) posterior_prob_matrix[j,i] <- posterior_prob_matrix[i,j] }else{ posterior_prob_matrix[i,j] <- 1.0 } } } # --- BEGIN GRAPH PLOT 1: NO COLORS OR NODE LABELS --- # Set an arbitrary random seed random_seed <- 815 # Set add_node_labels to FALSE add_node_labels <- FALSE # Set the node size node_size <- 6 # Construct the graph plot ggnet2_network_plot(.matrix_graph = posterior_prob_matrix, .subject_names = vector(), .subject_class_names = vector(), .class_colors = vector(), .class_shapes = vector(), .random_seed = random_seed, .node_size = node_size, .add_node_labels = add_node_labels) # --- END GRAPH PLOT 1: NO COLORS OR NODE LABELS --- # --- BEGIN GRAPH PLOT 2: NO COLORS, BUT ADD NODE LABELS --- # Set an arbitrary random seed random_seed <- 815 # Add node labels to the plot add_node_labels <- TRUE # Set graph nodes (i.e. subject identifier) a larger size node_size <- 6 # Add subject names to the plot subject_names <- paste("S", 1:n1, sep = "") # Construct the graph plot ggnet2_network_plot(.matrix_graph = posterior_prob_matrix, .subject_names = subject_names, .subject_class_names = vector(), .class_colors = vector(), .class_shapes = vector(), .random_seed = random_seed, .node_size = 6, .add_node_labels = TRUE) # --- END GRAPH PLOT 2: NO COLORS, BUT ADD NODE LABELS --- # --- BEGIN GRAPH PLOT 3: ADD COLORS AND NODE LABELS --- # Set an arbitrary random seed random_seed <- 815 # Add node labels to the plot add_node_labels <- TRUE # Set graph nodes (i.e. subject identifier) a larger size node_size <- 10 # Add subject names to the plot subject_names <- paste("S", 1:n1, sep = "") # Create a vector of class labels subject_class_names <- c("Class2","Class2","Class1","Class2","Class1", "Class1","Class2","Class1","Class1","Class2") # Assign a color to each class; this can be a character value or a hex value class_colors <- c("Class1" = "skyblue", "Class2" = "#FF9999") # Assign a pch integer value to each class class_shapes <- c("Class1" = 16, "Class2" = 17) # Generate the plot ggnet2_network_plot(.matrix_graph = posterior_prob_matrix, .subject_names = subject_names, .subject_class_names = subject_class_names, .class_colors = class_colors, .class_shapes = class_shapes, .random_seed = random_seed, .node_size = node_size, .add_node_labels = add_node_labels) # --- END GRAPH PLOT 3: ADD COLORS AND NODE LABELS ---
# Import the tip library library(tip) # Choose an arbitrary random seed to generate the data set.seed(4*8*15*16*23*42) # Generate a symmetric posterior probability matrix # Each element is the probability that the two subjects belong # to the same cluster n1 <- 10 posterior_prob_matrix <- matrix(NA, nrow = n1, ncol = n1) for(i in 1:n1){ for(j in i:n1){ if(i != j){ posterior_prob_matrix[i,j] <- runif(n=1,min=0,max=1) posterior_prob_matrix[j,i] <- posterior_prob_matrix[i,j] }else{ posterior_prob_matrix[i,j] <- 1.0 } } } # --- BEGIN GRAPH PLOT 1: NO COLORS OR NODE LABELS --- # Set an arbitrary random seed random_seed <- 815 # Set add_node_labels to FALSE add_node_labels <- FALSE # Set the node size node_size <- 6 # Construct the graph plot ggnet2_network_plot(.matrix_graph = posterior_prob_matrix, .subject_names = vector(), .subject_class_names = vector(), .class_colors = vector(), .class_shapes = vector(), .random_seed = random_seed, .node_size = node_size, .add_node_labels = add_node_labels) # --- END GRAPH PLOT 1: NO COLORS OR NODE LABELS --- # --- BEGIN GRAPH PLOT 2: NO COLORS, BUT ADD NODE LABELS --- # Set an arbitrary random seed random_seed <- 815 # Add node labels to the plot add_node_labels <- TRUE # Set graph nodes (i.e. subject identifier) a larger size node_size <- 6 # Add subject names to the plot subject_names <- paste("S", 1:n1, sep = "") # Construct the graph plot ggnet2_network_plot(.matrix_graph = posterior_prob_matrix, .subject_names = subject_names, .subject_class_names = vector(), .class_colors = vector(), .class_shapes = vector(), .random_seed = random_seed, .node_size = 6, .add_node_labels = TRUE) # --- END GRAPH PLOT 2: NO COLORS, BUT ADD NODE LABELS --- # --- BEGIN GRAPH PLOT 3: ADD COLORS AND NODE LABELS --- # Set an arbitrary random seed random_seed <- 815 # Add node labels to the plot add_node_labels <- TRUE # Set graph nodes (i.e. subject identifier) a larger size node_size <- 10 # Add subject names to the plot subject_names <- paste("S", 1:n1, sep = "") # Create a vector of class labels subject_class_names <- c("Class2","Class2","Class1","Class2","Class1", "Class1","Class2","Class1","Class1","Class2") # Assign a color to each class; this can be a character value or a hex value class_colors <- c("Class1" = "skyblue", "Class2" = "#FF9999") # Assign a pch integer value to each class class_shapes <- c("Class1" = 16, "Class2" = 17) # Generate the plot ggnet2_network_plot(.matrix_graph = posterior_prob_matrix, .subject_names = subject_names, .subject_class_names = subject_class_names, .class_colors = class_colors, .class_shapes = class_shapes, .random_seed = random_seed, .node_size = node_size, .add_node_labels = add_node_labels) # --- END GRAPH PLOT 3: ADD COLORS AND NODE LABELS ---
A function to that produces a ggplot2 plot of .y versus .x where points are added via geom_point() and the points are connected via geom_line().
ggplot_line_point(.x, .y, .xlab = "", .ylab = "")
ggplot_line_point(.x, .y, .xlab = "", .ylab = "")
.x |
The variable on the horizontal axis. |
.y |
The variable on the vertical axis. |
.xlab |
Character: the label on the horizontal axis. |
.ylab |
Character: the label on the vertical axis. |
ggplot2 geom_line + geom_point plot: a ggplot2 plot of .y versus .x with a label .xlab on the horizontal axis and label .ylab on the vertical axis.
# Import the tip library library(tip) # Create the variable that appears on the horizontal axis x <- 1:10 # Create the variable that appears on the vertical axis y <- rnorm(n = length(x), mean = 3, sd = 1) # Create a label that appears on the horizontal axis xlab <- "x" # Create a label that appears on the vertical axis ylab <- "y" # Create the plot of y versus x with ggplot_line_point(.x = x, .y = y, .xlab = xlab, .ylab = ylab)
# Import the tip library library(tip) # Create the variable that appears on the horizontal axis x <- 1:10 # Create the variable that appears on the vertical axis y <- rnorm(n = length(x), mean = 3, sd = 1) # Create a label that appears on the horizontal axis xlab <- "x" # Create a label that appears on the vertical axis ylab <- "y" # Create the plot of y versus x with ggplot_line_point(.x = x, .y = y, .xlab = xlab, .ylab = ylab)
A function that produces a ggplot bar chart (i.e., geom_bar) that corresponds to the posterior number of clusters. The vertical axis is normalized so that it displays the posterior probability.
ggplot_number_of_clusters_hist(.posterior_number_of_clusters)
ggplot_number_of_clusters_hist(.posterior_number_of_clusters)
.posterior_number_of_clusters |
Vector of positive integers: each integer corresponds to the number of clusters after posterior sampling for a given sampling iteration in the Gibbs sampler. |
ggplot2 geom_bar plot: a plot of the distribution of the posterior number of clusters computed after each sampling iteration in the Gibbs sampler.
# Import the tip library library(tip) # Generate a vector of positive integers # Example: the posterior number of clusters computed after posterior # sampling in each sampling iteration of the Gibbs sampler. num_clusters <- c(1,2,2,2,2,3,3,1,2,3,3,3,1,3) # Generate the plot of the posterior number of clusters versus the # sampling iteration number in the Gibbs sampler. ggplot_number_of_clusters_hist(.posterior_number_of_clusters = num_clusters)
# Import the tip library library(tip) # Generate a vector of positive integers # Example: the posterior number of clusters computed after posterior # sampling in each sampling iteration of the Gibbs sampler. num_clusters <- c(1,2,2,2,2,3,3,1,2,3,3,3,1,3) # Generate the plot of the posterior number of clusters versus the # sampling iteration number in the Gibbs sampler. ggplot_number_of_clusters_hist(.posterior_number_of_clusters = num_clusters)
A function that produces a ggplot2 trace plot (i.e., geom_line) with respect to the posterior number of clusters.
ggplot_number_of_clusters_trace(.posterior_number_of_clusters)
ggplot_number_of_clusters_trace(.posterior_number_of_clusters)
.posterior_number_of_clusters |
Vector of positive integers: each (s)th element denotes the number of clusters after posterior sampling
for each iteration s = 1, 2, ..., |
ggplot2 geom_line plot: a plot of the posterior number of clusters in each Gibbs sampling iteration versus the Gibbs sampling iteration number.
# Import the tip library library(tip) # Generate a vector of positive integers # Example: the posterior number of clusters computed after posterior # sampling in each sampling iteration of the Gibbs sampler. num_clusters <- c(1,2,2,2,2,3,3,1,2,3,3,3,1,3) # Generate the plot of the posterior number of clusters versus the # sampling iteration number in the Gibbs sampler. ggplot_number_of_clusters_trace(.posterior_number_of_clusters = num_clusters)
# Import the tip library library(tip) # Generate a vector of positive integers # Example: the posterior number of clusters computed after posterior # sampling in each sampling iteration of the Gibbs sampler. num_clusters <- c(1,2,2,2,2,3,3,1,2,3,3,3,1,3) # Generate the plot of the posterior number of clusters versus the # sampling iteration number in the Gibbs sampler. ggplot_number_of_clusters_trace(.posterior_number_of_clusters = num_clusters)
A function that iteratively applies the transformation max(0, .graph_matrix - cutoff) until there are <.num_components> graph components where cutoff = cutoff + .step_size. This is used to generate the one-cluster graph and plot.
partition_undirected_graph(.graph_matrix, .num_components, .step_size)
partition_undirected_graph(.graph_matrix, .num_components, .step_size)
.graph_matrix |
Matrix: a symmetric matrix that the analyst wishes to decompose into <.num_components> components. |
.num_components |
Positive integer: the number of components that the analyst wishes to decompose <.graph_matrix> into. |
.step_size |
Positive numeric: the size of the update for the cutoff in the transformation max(0, .graph_matrix - cutoff) where cutoff = cutoff + .step_size. |
List with three elements:
graph_component_members |
Vector. A vector of positive integers: the (i)th element is the graph component assignment for the (i)th subject. |
cutoff |
Numeric. The value max(0, g_i,j - cutoff) so that there are < |
partitioned_graph_matrix |
Matrix. The graph with < |
# Import the tip library library(tip) # Choose an arbitrary random seed to generate the data set.seed(4*8*15*16*23*42) # Generate a symmetric posterior probability matrix # Each element is the probability that the two subjects belong # to the same cluster n1 <- 10 posterior_prob_matrix <- matrix(NA, nrow = n1, ncol = n1) for(i in 1:n1){ for(j in i:n1){ if(i != j){ posterior_prob_matrix[i,j] <- runif(n=1,min=0,max=1) posterior_prob_matrix[j,i] <- posterior_prob_matrix[i,j] }else{ posterior_prob_matrix[i,j] <- 1.0 } } } # Generate a one-cluster graph (i.e., partitioned_graph_matrix) partition_undirected_graph(.graph_matrix = posterior_prob_matrix, .num_components = 1, .step_size = 0.001)
# Import the tip library library(tip) # Choose an arbitrary random seed to generate the data set.seed(4*8*15*16*23*42) # Generate a symmetric posterior probability matrix # Each element is the probability that the two subjects belong # to the same cluster n1 <- 10 posterior_prob_matrix <- matrix(NA, nrow = n1, ncol = n1) for(i in 1:n1){ for(j in i:n1){ if(i != j){ posterior_prob_matrix[i,j] <- runif(n=1,min=0,max=1) posterior_prob_matrix[j,i] <- posterior_prob_matrix[i,j] }else{ posterior_prob_matrix[i,j] <- 1.0 } } } # Generate a one-cluster graph (i.e., partitioned_graph_matrix) partition_undirected_graph(.graph_matrix = posterior_prob_matrix, .num_components = 1, .step_size = 0.001)
Generate plots from a Bayesian Clustering Model (bcm) object
## S4 method for signature 'bcm,missing' plot(x, y, ...)
## S4 method for signature 'bcm,missing' plot(x, y, ...)
x |
bcm object: a Bayesian Clustering Model (bcm) object fit to a dataset |
y |
Not used. |
... |
Not used. |
List: a list of two ggplot2 plots that are constructed using the information contained in an object of class bcm (Bayesian Clustering Model). A bcm object contains the clustering results from a clustering model that uses the TIP prior.
trace_plot_posterior_number_of_clusters |
ggplot2 Plot: a plot of the posterior number of clusters (sampling iterations only) versus the corresponding sampling iteration number from the Gibbs sampler. |
histogram_posterior_number_of_clusters |
ggplot2 Plot: a bar plot of the posterior number of clusters (sampling iterations only) from the Gibbs sampler. |
# Import the tip library library(tip) # Import the iris dataset data(iris) # The first 4 columns are the data whereas # the 5th column refers to the true labels X <- data.matrix(iris[,c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]) # Extract the true labels (optional) # True labels are only necessary for constructing network # graphs that incorporate the true labels; this is often # for research. true_labels <- iris[,"Species"] # Compute the distance matrix distance_matrix <- data.matrix(dist(X)) # Compute the temperature parameter estiamte temperature <- 1/median(distance_matrix[upper.tri(distance_matrix)]) # For each subject, compute the point estimate for the number of similar # subjects using univariate multiple change point detection (i.e.) init_num_neighbors = get_cpt_neighbors(.distance_matrix = distance_matrix) # Set the number of burn-in iterations in the Gibbs samlper # RECOMMENDATION: burn >= 1000 burn <- 1000 # Set the number of sampling iterations in the Gibbs sampler # RECOMMENDATION: burn >= 1000 samples <- 1000 # Set the subject names names_subjects <- paste(1:dim(iris)[1]) # Run TIP clustering using only the prior # --> That is, the likelihood function is constant tip1 <- tip(.data = data.matrix(X), .burn = burn, .samples = samples, .similarity_matrix = exp(-1.0*temperature*distance_matrix), .init_num_neighbors = init_num_neighbors, .likelihood_model = "NIW", .subject_names = names_subjects, .num_cores = 1) # Produce plots for the Bayesian Clustering Model tip_plots <- plot(tip1)
# Import the tip library library(tip) # Import the iris dataset data(iris) # The first 4 columns are the data whereas # the 5th column refers to the true labels X <- data.matrix(iris[,c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]) # Extract the true labels (optional) # True labels are only necessary for constructing network # graphs that incorporate the true labels; this is often # for research. true_labels <- iris[,"Species"] # Compute the distance matrix distance_matrix <- data.matrix(dist(X)) # Compute the temperature parameter estiamte temperature <- 1/median(distance_matrix[upper.tri(distance_matrix)]) # For each subject, compute the point estimate for the number of similar # subjects using univariate multiple change point detection (i.e.) init_num_neighbors = get_cpt_neighbors(.distance_matrix = distance_matrix) # Set the number of burn-in iterations in the Gibbs samlper # RECOMMENDATION: burn >= 1000 burn <- 1000 # Set the number of sampling iterations in the Gibbs sampler # RECOMMENDATION: burn >= 1000 samples <- 1000 # Set the subject names names_subjects <- paste(1:dim(iris)[1]) # Run TIP clustering using only the prior # --> That is, the likelihood function is constant tip1 <- tip(.data = data.matrix(X), .burn = burn, .samples = samples, .similarity_matrix = exp(-1.0*temperature*distance_matrix), .init_num_neighbors = init_num_neighbors, .likelihood_model = "NIW", .subject_names = names_subjects, .num_cores = 1) # Produce plots for the Bayesian Clustering Model tip_plots <- plot(tip1)
Bayesian clustering with the Table Invitation Prior (TIP) and optional likelihood functions.
tip( .data = list(), .burn = 1000, .samples = 1000, .similarity_matrix, .init_num_neighbors, .likelihood_model = "CONSTANT", .subject_names = vector(), .num_cores = 1, .step_size = 0.001 )
tip( .data = list(), .burn = 1000, .samples = 1000, .similarity_matrix, .init_num_neighbors, .likelihood_model = "CONSTANT", .subject_names = vector(), .num_cores = 1, .step_size = 0.001 )
.data |
Data frame (vectors comprise a row in a data frame; NIW only) or a list of matrices (MNIW only) that the analyst wishes to cluster. Note: if .likelihood_model = "CONSTANT", then the .data argument has no effect. |
.burn |
Non-negative integer: the number of burn-in iterations in the Gibbs sampler. |
.samples |
Positive integer: the number of sampling iterations in the Gibbs sampler. |
.similarity_matrix |
Matrix: an n x n matrix of similarity values. |
.init_num_neighbors |
Vector of positive integers: each (i)th positive integer corresponds to the estimate of the number of subjects that are similar to the (i)th subject. |
.likelihood_model |
Character: the name of the likelihood model used to compute the posterior probabilities. Options: "NIW" (vectors; .data is a dataframe), "MNIW" (matrices; .data is a list of matrices), or "CONSTANT" (vector, matrices, and tensors; .data is an empty list) |
.subject_names |
Vector of characters: an optional vector of names for the individual subjects. This is useful for the plotting function. |
.num_cores |
Positive integer: the number of cores to use. |
.step_size |
Positive numeric: A parameter used to ensure matrices are invertible. A small number is iteratively added to a matrix diagonal (if necessary) until the matrix is invertible. |
Object of class bcm: bcm denotes a "Bayesian Clustering Model" object that contains the results from a clustering model that uses the TIP prior.
n |
Positive integer: the sample size or number of subjects. |
burn |
Non-negative integer: the number of burn-in iterations in the Gibbs sampler. |
samples |
Positive integer: the number of sampling iterations in the Gibbs sampler. |
posterior_assignments |
List: a list of < |
posterior_similarity_matrix |
Matrix: an |
posterior_number_of_clusters |
Vector of positive integers: a vector where the jth element is the number of clusters after posterior sampling (i.e., the posterior number of clusters). |
prior_name |
Character: the name of the prior used. |
likelihood_name |
Character: the name of the likelihood used. |
##### BEGIN EXAMPLE 1: Clustering the Iris Dataset (vector clustering) ##### ##### Prior Distribution: Table Invitation Prior (TIP) ##### Likelihood Model: Normal Inverse Wishart (NIW) # Import the tip library library(tip) # Import the iris dataset data(iris) # The first 4 columns are the data whereas # the 5th column refers to the true labels X <- data.matrix(iris[,c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]) # Extract the true labels (optional) # True labels are only necessary for constructing network # graphs that incorporate the true labels; this is often # for research. true_labels <- iris[,"Species"] # Compute the distance matrix distance_matrix <- data.matrix(dist(X)) # Compute the temperature parameter estiamte temperature <- 1/median(distance_matrix[upper.tri(distance_matrix)]) # For each subject, compute the point estimate for the number of similar # subjects using univariate multiple change point detection (i.e.) init_num_neighbors = get_cpt_neighbors(.distance_matrix = distance_matrix) # Set the number of burn-in iterations in the Gibbs samlper # RECOMMENDATION: burn >= 1000 burn <- 1000 # Set the number of sampling iterations in the Gibbs sampler # RECOMMENDATION: samples >= 1000 samples <- 1000 # Set the subject names names_subjects <- paste(1:dim(iris)[1]) # Run TIP clustering using only the prior # --> That is, the likelihood function is constant tip1 <- tip(.data = data.matrix(X), .burn = burn, .samples = samples, .similarity_matrix = exp(-1.0*temperature*distance_matrix), .init_num_neighbors = init_num_neighbors, .likelihood_model = "NIW", .subject_names = names_subjects, .num_cores = 1) # Produce plots for the Bayesian Clustering Model tip_plots <- plot(tip1) # View the posterior distribution of the number of clusters tip_plots$histogram_posterior_number_of_clusters # View the trace plot with respect to the posterior number of clusters tip_plots$trace_plot_posterior_number_of_clusters # Extract posterior cluster assignments using the Posterior Expected Adjusted Rand (PEAR) index cluster_assignments <- mcclust::maxpear(psm = tip1@posterior_similarity_matrix)$cl # If the true labels are available, then show the cluster result via a contingency table table(data.frame(true_label = true_labels, cluster_assignment = cluster_assignments)) # Create the one component graph with minimum entropy partition_list <- partition_undirected_graph(.graph_matrix = tip1@posterior_similarity_matrix, .num_components = 1, .step_size = 0.001) # Associate class labels and colors for the plot class_palette_colors <- c("setosa" = "blue", "versicolor" = 'green', "virginica" = "orange") # Associate class labels and shapes for the plot class_palette_shapes <- c("setosa" = 19, "versicolor" = 18, "virginica" = 17) # Visualize the posterior similarity matrix by constructing a graph plot of # the one-cluster graph. The true labels are used here (below they are not). ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = NA, .subject_class_names = true_labels, .class_colors = class_palette_colors, .class_shapes = class_palette_shapes, .node_size = 2, .add_node_labels = FALSE) # If true labels are not available, then construct a network plot # of the one-cluster graph without any class labels. ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .node_size = 2, .add_node_labels = TRUE) # If true labels are not available, then construct a network plot # of the one-cluster graph without any class labels. Also, suppress # the subject labels. ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .node_size = 2, .add_node_labels = FALSE) ##### END EXAMPLE 1: Vector Clustering (NIW) ##### ##### BEGIN EXAMPLE 2: Clustering the US Arrests Dataset (vector clustering) ##### ##### Prior Distribution: Table Invitation Prior (TIP) ##### Likelihood Model: Normal Inverse Wishart (NIW) # Import the TIP library library(tip) # Import the US Arrests dataset data(USArrests) # Convert the data to a matrix X <- data.matrix(USArrests) # Compute the distance matrix distance_matrix <- data.matrix(dist(X)) # Compute the temperature parameter estiamte temperature <- 1/median(distance_matrix[upper.tri(distance_matrix)]) # For each subject, compute the point estimate for the number of similar # subjects using univariate multiple change point detection (i.e.) init_num_neighbors = get_cpt_neighbors(.distance_matrix = distance_matrix) # Set the number of burn-in iterations in the Gibbs samlper # RECOMENDATION: *** burn >= 1000 *** burn <- 1000 # Set the number of sampling iterations in the Gibbs sampler # RECOMENDATION: *** samples >= 1000 *** samples <- 1000 # Extract the state names names_subjects <- rownames(USArrests) # Run TIP clustering using only the prior # --> That is, the likelihood function is constant tip1 <- tip(.data = data.matrix(X), .burn = burn, .samples = samples, .similarity_matrix = exp(-1.0*temperature*distance_matrix), .init_num_neighbors = init_num_neighbors, .likelihood_model = "NIW", .subject_names = names_subjects, .num_cores = 1) # Produce plots for the Bayesian Clustering Model tip_plots <- plot(tip1) # View the posterior distribution of the number of clusters tip_plots$trace_plot_posterior_number_of_clusters # View the trace plot with respect to the posterior number of clusters tip_plots$trace_plot_posterior_number_of_clusters # Extract posterior cluster assignments using the Posterior Expected Adjusted Rand (PEAR) index cluster_assignments <- mcclust::maxpear(psm = tip1@posterior_similarity_matrix)$cl # Create a list where each element stores the cluster assignments cluster_assignment_list <- list() for(k in 1:length(unique(cluster_assignments))){ cluster_assignment_list[[k]] <- names_subjects[which(cluster_assignments == k)] } cluster_assignment_list # Create the one component graph with minimum entropy partition_list <- partition_undirected_graph(.graph_matrix = tip1@posterior_similarity_matrix, .num_components = 1, .step_size = 0.001) # View the state names # names_subjects # Create a vector of true region labels to see if there is a pattern true_region <- c("Southeast", "West", "Southwest", "Southeast", "West", "West", "Northeast", "Northeast", "Southeast", "Southeast", "West", "West", "Midwest", "Midwest", "Midwest", "Midwest", "Southeast", "Southeast", "Northeast", "Northeast", "Northeast", "Midwest", "Midwest", "Southeast", "Midwest", "West", "Midwest", "West", "Northeast", "Northeast", "Southwest", "Northeast", "Southeast", "Midwest", "Midwest", "Southwest", "West", "Northeast", "Northeast", "Southeast", "Midwest", "Southeast", "Southwest", "West", "Northeast", "Southeast", "West", "Southeast", "Midwest", "West") names_subjects # Associate class labels and colors for the plot class_palette_colors <- c("Northeast" = "blue", "Southeast" = "red", "Midwest" = "purple", "Southwest" = "orange", "West" = "green") # Associate class labels and shapes for the plot class_palette_shapes <- c("Northeast" = 15, "Southeast" = 16, "Midwest" = 17, "Southwest" = 18, "West" = 19) # Visualize the posterior similarity matrix by constructing a graph plot of # the one-cluster graph. The true labels are used here (below they are not). ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .subject_class_names = true_region, .class_colors = class_palette_colors, .class_shapes = class_palette_shapes, .node_size = 2, .add_node_labels = TRUE) # Visualize the posterior similarity matrix by constructing a graph plot of # the one-cluster graph. The true labels are used here (below they are not). # Remove the subject names with .add_node_labels = FALSE ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .subject_class_names = true_region, .class_colors = class_palette_colors, .class_shapes = class_palette_shapes, .node_size = 2, .add_node_labels = FALSE) # Construct a network plot without class labels ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .node_size = 2, .add_node_labels = TRUE) # Construct a network plot without class labels. Also, suppress # the subject labels. ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .node_size = 2, .add_node_labels = FALSE) ##### END EXAMPLE 2: Clustering the US Arrests Dataset (vector clustering) ##### ## Not run: ##### BEGIN EXAMPLE 3: Clustering gene expression data (vector clustering) ##### ##### Prior Distribution: Table Invitation Prior (TIP) ##### Likelihood Model: Normal Inverse Wishart (NIW) # Import the TIP library library(tip) # ----- Dataset information ----- # The data were accessed from the UCI machine learning repository # Original link: https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq # Source: Samuele Fiorini, samuele.fiorini '@' dibris.unige.it, # University of Genoa, redistributed under Creative Commons license # (http://creativecommons.org/licenses/by/3.0/legalcode) # from https://www.synapse.org/#!Synapse:syn4301332. # Data Set Information: Samples (instances) are stored row-wise. Variables # (attributes) of each sample are RNA-Seq gene expression levels measured by # illumina HiSeq platform. # Relevant Papers: Weinstein, John N., et al. 'The cancer genome atlas pan-cancer # analysis project.' Nature genetics 45.10 (2013): 1113-1120. # ------------------------------- # Import the data (see the provided link above) X <- read.csv("data.csv") true_labels <- read.csv("labels.csv") # Extract the true indices subject_names <- true_labels$X # Extract the true classes true_labels <- true_labels$Class # Convert the dataset into a matrix X <- data.matrix(X) ##### BEGIN: Apply PCA to the dataset ##### # Step 1: perform Prinicpal Components Analysis (PCA) on the dataset pca1 <- prcomp(X) # Step 2: compute summary information summary1 <- summary(pca1) # Step 3: plot the cumulative percentage of the variance # explained against the number of principal components tip::ggplot_line_point(.x = 1:length(summary1$importance['Cumulative Proportion',]), .y = summary1$importance['Cumulative Proportion',], .xlab = "Number of Principal Components", .ylab = "Cumulative Percentage of the Variance Explained") # The number of principal components chosen is 7, and # the 7 principal components explain roughly 80% of # the variance num_principal_components <- which(summary1$importance['Cumulative Proportion',] <= 0.8) # The clustering is applied to the principal component dataset X <- pca1$x[,as.numeric(num_principal_components)] ##### END: Apply PCA to the dataset ##### # Compute the distance matrix distance_matrix <- data.matrix(dist(X)) # Compute the temperature parameter estiamte temperature <- 1/median(distance_matrix[upper.tri(distance_matrix)]) # For each subject, compute the point estimate for the number of similar # subjects using univariate multiple change point detection (i.e.) init_num_neighbors = get_cpt_neighbors(.distance_matrix = distance_matrix) # Set the number of burn-in iterations in the Gibbs samlper # RECOMENDATION: *** burn >= 1000 *** burn <- 1000 # Set the number of sampling iterations in the Gibbs sampler # RECOMENDATION: *** samples >= 1000 *** samples <- 1000 # Run TIP clustering using only the prior # --> That is, the likelihood function is constant tip1 <- tip(.data = data.matrix(X), .burn = burn, .samples = samples, .similarity_matrix = exp(-1.0*temperature*distance_matrix), .init_num_neighbors = init_num_neighbors, .likelihood_model = "NIW", .subject_names = c(), .num_cores = 1) # Produce plots for the Bayesian Clustering Model tip_plots <- plot(tip1) # View the posterior distribution of the number of clusters tip_plots$trace_plot_posterior_number_of_clusters # View the trace plot with respect to the posterior number of clusters tip_plots$trace_plot_posterior_number_of_clusters # Extract posterior cluster assignments using the Posterior Expected Adjusted Rand (PEAR) index cluster_assignments <- mcclust::maxpear(psm = tip1@posterior_similarity_matrix)$cl # Create a list where each element stores the cluster assignments cluster_assignment_list <- list() for(k in 1:length(unique(cluster_assignments))){ cluster_assignment_list[[k]] <- true_labels[which(cluster_assignments == k)] } cluster_assignment_list # Create the one component graph with minimum entropy partition_list <- partition_undirected_graph(.graph_matrix = tip1@posterior_similarity_matrix, .num_components = 1, .step_size = 0.001) # Associate class labels and shapes for the plot class_palette_shapes <- c("PRAD" = 19, "BRCA" = 18, "KIRC" = 17, "LUAD" = 16, "COAD" = 15) # Associate class labels and colors for the plot class_palette_colors <- c("PRAD" = "blue", "BRCA" = "red", "KIRC" = "black", "LUAD" = "green", "COAD" = "orange") # Visualize the posterior similarity matrix by constructing a graph plot of # the one-cluster graph. The true labels are used here (below they are not). ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = subject_names, .subject_class_names = true_labels, .class_colors = class_palette_colors, .class_shapes = class_palette_shapes, .node_size = 2, .add_node_labels = TRUE) # Visualize the posterior similarity matrix by constructing a graph plot of # the one-cluster graph. The true labels are used here (below they are not). # Remove the subject names with .add_node_labels = FALSE ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = subject_names, .subject_class_names = true_labels, .class_colors = class_palette_colors, .class_shapes = class_palette_shapes, .node_size = 2, .add_node_labels = FALSE) # Construct a network plot without class labels but with subject labels ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = subject_names, .node_size = 2, .add_node_labels = TRUE) # Construct a network plot without class labels and subject labels ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = subject_names, .node_size = 2, .add_node_labels = FALSE) ##### END EXAMPLE 3: Clustering gene expression data (vector clustering) ##### ## End(Not run) ##### BEGIN EXAMPLE 4: Matrix Clustering (MNIW) ##### ##### Prior Distribution: Table Invitation Prior ##### Likelihood Model: Matrix Normal Inverse Wishart (MNIW) # Import the tip library library(tip) # A function to generate random matrices from a matrix normal distribution random_mat_normal <- function(mu, num_rows, num_cols){ LaplacesDemon::rmatrixnorm(M = matrix(mu, nrow = num_rows, ncol = num_cols), U = diag(num_rows), V = diag(num_cols)) } # Generate 3 clusters of matrices p <- 5 m <- 3 c1 <- lapply(1:20, function(x) random_mat_normal(mu = 0, num_rows = m, num_cols = p)) c2 <- lapply(1:25, function(x) random_mat_normal(mu = 5, num_rows = m, num_cols = p)) c3 <- lapply(1:30, function(x) random_mat_normal(mu = -5, num_rows = m, num_cols = p)) # Put all the data into a list data_list <- c(c1,c2,c3) # Create a vector of true labels. True labels are only necessary # for constructing network graphs that incorporate the true labels; # this is often useful for research. true_labels <- c(rep("Cluster 1", 20), rep("Cluster 2", 25), rep("Cluster 3", 30)) distance_matrix <- matrix(NA, nrow = length(true_labels), ncol = length(true_labels)) # Distance matrix for(i in 1:length(true_labels)){ for(j in i:length(true_labels)){ distance_matrix[i,j] <- SMFilter::FDist2(mX = data_list[[i]], mY = data_list[[j]]) distance_matrix[j,i] <- distance_matrix[i,j] } } # Compute the temperature parameter estiamte temperature <- 1/median(distance_matrix[upper.tri(distance_matrix)]) # For each subject, compute the point estimate for the number of similar # subjects using univariate multiple change point detection (i.e.) init_num_neighbors = get_cpt_neighbors(.distance_matrix = distance_matrix) # Set the number of burn-in iterations in the Gibbs samlper # RECOMMENDATION: burn >= 1000 burn <- 1000 # Set the number of sampling iterations in the Gibbs sampler # RECOMMENDATION: samples >= 1000 samples <- 1000 # Set the subject names names_subjects <- paste(1:dim(distance_matrix)[1]) # Run TIP clustering using only the prior # --> That is, the likelihood function is constant tip1 <- tip(.data = data_list, .burn = burn, .samples = samples, .similarity_matrix = exp(-1.0*temperature*distance_matrix), .init_num_neighbors = init_num_neighbors, .likelihood_model = "MNIW", .subject_names = names_subjects, .num_cores = 1) # Produce plots for the Bayesian Clustering Model tip_plots <- plot(tip1) # View the posterior distribution of the number of clusters tip_plots$histogram_posterior_number_of_clusters # View the trace plot with respect to the posterior number of clusters tip_plots$trace_plot_posterior_number_of_clusters # Extract posterior cluster assignments using the Posterior Expected Adjusted Rand (PEAR) index cluster_assignments <- mcclust::maxpear(psm = tip1@posterior_similarity_matrix)$cl # If the true labels are available, then show the cluster result via a contigency table table(data.frame(true_label = true_labels, cluster_assignment = cluster_assignments)) # Create the one component graph with minimum entropy partition_list <- partition_undirected_graph(.graph_matrix = tip1@posterior_similarity_matrix, .num_components = 1, .step_size = 0.001) # Associate class labels and colors for the plot class_palette_colors <- c("Cluster 1" = "blue", "Cluster 2" = 'green', "Cluster 3" = "red") # Associate class labels and shapes for the plot class_palette_shapes <- c("Cluster 1" = 19, "Cluster 2" = 18, "Cluster 3" = 17) # Visualize the posterior similarity matrix by constructing a graph plot of # the one-cluster graph. The true labels are used here (below they are not). ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = NA, .subject_class_names = true_labels, .class_colors = class_palette_colors, .class_shapes = class_palette_shapes, .node_size = 2, .add_node_labels = FALSE) # If true labels are not available, then construct a network plot # of the one-cluster graph without any class labels. ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .node_size = 2, .add_node_labels = TRUE) # If true labels are not available, then construct a network plot # of the one-cluster graph without any class labels. Also, suppress the # subject labels. ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .node_size = 2, .add_node_labels = FALSE) ##### END EXAMPLE 4: Matrix Clustering (MNIW) ##### ##### BEGIN EXAMPLE 5: Tensor Clustering ##### ##### Prior Distribution: Table Invitation Prior ##### Likelihood Model: CONSTANT # Import the tip library library(tip) # ################## NOTE ################## # Order 3 Tensor dimension: c(d1,d2,d3) # d1: number of rows # d2: number of columns # d3: number of slices ############################################ # Set a random seed for reproducibility set.seed(007) # A function to generate an order-3 tensor generate_gaussian_tensor <- function(.tensor_dimension, .mean = 0, .sd = 1){ array(data = c(rnorm(n = prod(.tensor_dimension), mean = .mean, sd = .sd)), dim = .tensor_dimension) } # Define the tensor dimension tensor_dimension <- c(256,256,3) # Generate clusters of tensors c1 <- lapply(1:30, function(x) generate_gaussian_tensor(.tensor_dimension = tensor_dimension, .mean = 0, .sd = 1)) # Generate clusters of tensors c2 <- lapply(1:40, function(x) generate_gaussian_tensor(.tensor_dimension = tensor_dimension, .mean = 5, .sd = 1)) # Generate clusters of tensors c3 <- lapply(1:50, function(x) generate_gaussian_tensor(.tensor_dimension = tensor_dimension, .mean = -5, .sd = 1)) # Make a list of tensors X <- c(c1, c2, c3) # Compute the number of subjects for each cluster n1 <- length(c1) n2 <- length(c2) n3 <- length(c3) # Create a vector of true labels. True labels are only necessary # for constructing network graphs that incorporate the true labels; # this is often useful for research. true_labels <- c(rep("Cluster 1", n1), rep("Cluster 2", n2), rep("Cluster 3", n3)) # Compute the total number of subjects n <- length(X) # Construct the distance matrix distance_matrix <- matrix(data = NA, nrow = n, ncol = n) for(i in 1:n){ for(j in i:n){ distance_matrix[i,j] <- sqrt(sum((X[[i]] - X[[j]])^2)) distance_matrix[j,i] <- distance_matrix[i,j] } } # Compute the temperature parameter estiamte temperature <- 1/median(distance_matrix[upper.tri(distance_matrix)]) # For each subject, compute the point estimate for the number of similar # subjects using univariate multiple change point detection (i.e.) init_num_neighbors = get_cpt_neighbors(.distance_matrix = distance_matrix) # Set the number of burn-in iterations in the Gibbs samlper # RECOMMENDATION: burn >= 1000 burn <- 1000 # Set the number of sampling iterations in the Gibbs sampler # RECOMMENDATION: samples >= 1000 samples <- 1000 # Set the subject names names_subjects <- paste(1:n) # Run TIP clustering using only the prior # --> That is, the likelihood function is constant tip1 <- tip(.data = list(), .burn = burn, .samples = samples, .similarity_matrix = exp(-1.0*temperature*distance_matrix), .init_num_neighbors = init_num_neighbors, .likelihood_model = "CONSTANT", .subject_names = names_subjects, .num_cores = 1) # Produce plots for the Bayesian Clustering Model tip_plots <- plot(tip1) # View the posterior distribution of the number of clusters tip_plots$histogram_posterior_number_of_clusters # View the trace plot with respect to the posterior number of clusters tip_plots$trace_plot_posterior_number_of_clusters # Extract posterior cluster assignments using the Posterior Expected Adjusted Rand (PEAR) index cluster_assignments <- mcclust::maxpear(psm = tip1@posterior_similarity_matrix)$cl # If the true labels are available, then show the cluster result via a contigency table table(data.frame(true_label = true_labels, cluster_assignment = cluster_assignments)) # Create the one component graph with minimum entropy partition_list <- partition_undirected_graph(.graph_matrix = tip1@posterior_similarity_matrix, .num_components = 1, .step_size = 0.001) # Associate class labels and colors for the plot class_palette_colors <- c("Cluster 1" = "blue", "Cluster 2" = 'green', "Cluster 3" = "red") # Associate class labels and shapes for the plot class_palette_shapes <- c("Cluster 1" = 19, "Cluster 2" = 18, "Cluster 3" = 17) # Visualize the posterior similarity matrix by constructing a graph plot of # the one-cluster graph. The true labels are used here (below they are not). ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = NA, .subject_class_names = true_labels, .class_colors = class_palette_colors, .class_shapes = class_palette_shapes, .node_size = 2, .add_node_labels = FALSE) # If true labels are not available, then construct a network plot # of the one-cluster graph without any class labels. ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .node_size = 2, .add_node_labels = TRUE) # If true labels are not available, then construct a network plot # of the one-cluster graph without any class labels. Also, suppress # the subject labels. ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .node_size = 2, .add_node_labels = FALSE) ##### END EXAMPLE 5: Tensor Clustering #####
##### BEGIN EXAMPLE 1: Clustering the Iris Dataset (vector clustering) ##### ##### Prior Distribution: Table Invitation Prior (TIP) ##### Likelihood Model: Normal Inverse Wishart (NIW) # Import the tip library library(tip) # Import the iris dataset data(iris) # The first 4 columns are the data whereas # the 5th column refers to the true labels X <- data.matrix(iris[,c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]) # Extract the true labels (optional) # True labels are only necessary for constructing network # graphs that incorporate the true labels; this is often # for research. true_labels <- iris[,"Species"] # Compute the distance matrix distance_matrix <- data.matrix(dist(X)) # Compute the temperature parameter estiamte temperature <- 1/median(distance_matrix[upper.tri(distance_matrix)]) # For each subject, compute the point estimate for the number of similar # subjects using univariate multiple change point detection (i.e.) init_num_neighbors = get_cpt_neighbors(.distance_matrix = distance_matrix) # Set the number of burn-in iterations in the Gibbs samlper # RECOMMENDATION: burn >= 1000 burn <- 1000 # Set the number of sampling iterations in the Gibbs sampler # RECOMMENDATION: samples >= 1000 samples <- 1000 # Set the subject names names_subjects <- paste(1:dim(iris)[1]) # Run TIP clustering using only the prior # --> That is, the likelihood function is constant tip1 <- tip(.data = data.matrix(X), .burn = burn, .samples = samples, .similarity_matrix = exp(-1.0*temperature*distance_matrix), .init_num_neighbors = init_num_neighbors, .likelihood_model = "NIW", .subject_names = names_subjects, .num_cores = 1) # Produce plots for the Bayesian Clustering Model tip_plots <- plot(tip1) # View the posterior distribution of the number of clusters tip_plots$histogram_posterior_number_of_clusters # View the trace plot with respect to the posterior number of clusters tip_plots$trace_plot_posterior_number_of_clusters # Extract posterior cluster assignments using the Posterior Expected Adjusted Rand (PEAR) index cluster_assignments <- mcclust::maxpear(psm = tip1@posterior_similarity_matrix)$cl # If the true labels are available, then show the cluster result via a contingency table table(data.frame(true_label = true_labels, cluster_assignment = cluster_assignments)) # Create the one component graph with minimum entropy partition_list <- partition_undirected_graph(.graph_matrix = tip1@posterior_similarity_matrix, .num_components = 1, .step_size = 0.001) # Associate class labels and colors for the plot class_palette_colors <- c("setosa" = "blue", "versicolor" = 'green', "virginica" = "orange") # Associate class labels and shapes for the plot class_palette_shapes <- c("setosa" = 19, "versicolor" = 18, "virginica" = 17) # Visualize the posterior similarity matrix by constructing a graph plot of # the one-cluster graph. The true labels are used here (below they are not). ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = NA, .subject_class_names = true_labels, .class_colors = class_palette_colors, .class_shapes = class_palette_shapes, .node_size = 2, .add_node_labels = FALSE) # If true labels are not available, then construct a network plot # of the one-cluster graph without any class labels. ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .node_size = 2, .add_node_labels = TRUE) # If true labels are not available, then construct a network plot # of the one-cluster graph without any class labels. Also, suppress # the subject labels. ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .node_size = 2, .add_node_labels = FALSE) ##### END EXAMPLE 1: Vector Clustering (NIW) ##### ##### BEGIN EXAMPLE 2: Clustering the US Arrests Dataset (vector clustering) ##### ##### Prior Distribution: Table Invitation Prior (TIP) ##### Likelihood Model: Normal Inverse Wishart (NIW) # Import the TIP library library(tip) # Import the US Arrests dataset data(USArrests) # Convert the data to a matrix X <- data.matrix(USArrests) # Compute the distance matrix distance_matrix <- data.matrix(dist(X)) # Compute the temperature parameter estiamte temperature <- 1/median(distance_matrix[upper.tri(distance_matrix)]) # For each subject, compute the point estimate for the number of similar # subjects using univariate multiple change point detection (i.e.) init_num_neighbors = get_cpt_neighbors(.distance_matrix = distance_matrix) # Set the number of burn-in iterations in the Gibbs samlper # RECOMENDATION: *** burn >= 1000 *** burn <- 1000 # Set the number of sampling iterations in the Gibbs sampler # RECOMENDATION: *** samples >= 1000 *** samples <- 1000 # Extract the state names names_subjects <- rownames(USArrests) # Run TIP clustering using only the prior # --> That is, the likelihood function is constant tip1 <- tip(.data = data.matrix(X), .burn = burn, .samples = samples, .similarity_matrix = exp(-1.0*temperature*distance_matrix), .init_num_neighbors = init_num_neighbors, .likelihood_model = "NIW", .subject_names = names_subjects, .num_cores = 1) # Produce plots for the Bayesian Clustering Model tip_plots <- plot(tip1) # View the posterior distribution of the number of clusters tip_plots$trace_plot_posterior_number_of_clusters # View the trace plot with respect to the posterior number of clusters tip_plots$trace_plot_posterior_number_of_clusters # Extract posterior cluster assignments using the Posterior Expected Adjusted Rand (PEAR) index cluster_assignments <- mcclust::maxpear(psm = tip1@posterior_similarity_matrix)$cl # Create a list where each element stores the cluster assignments cluster_assignment_list <- list() for(k in 1:length(unique(cluster_assignments))){ cluster_assignment_list[[k]] <- names_subjects[which(cluster_assignments == k)] } cluster_assignment_list # Create the one component graph with minimum entropy partition_list <- partition_undirected_graph(.graph_matrix = tip1@posterior_similarity_matrix, .num_components = 1, .step_size = 0.001) # View the state names # names_subjects # Create a vector of true region labels to see if there is a pattern true_region <- c("Southeast", "West", "Southwest", "Southeast", "West", "West", "Northeast", "Northeast", "Southeast", "Southeast", "West", "West", "Midwest", "Midwest", "Midwest", "Midwest", "Southeast", "Southeast", "Northeast", "Northeast", "Northeast", "Midwest", "Midwest", "Southeast", "Midwest", "West", "Midwest", "West", "Northeast", "Northeast", "Southwest", "Northeast", "Southeast", "Midwest", "Midwest", "Southwest", "West", "Northeast", "Northeast", "Southeast", "Midwest", "Southeast", "Southwest", "West", "Northeast", "Southeast", "West", "Southeast", "Midwest", "West") names_subjects # Associate class labels and colors for the plot class_palette_colors <- c("Northeast" = "blue", "Southeast" = "red", "Midwest" = "purple", "Southwest" = "orange", "West" = "green") # Associate class labels and shapes for the plot class_palette_shapes <- c("Northeast" = 15, "Southeast" = 16, "Midwest" = 17, "Southwest" = 18, "West" = 19) # Visualize the posterior similarity matrix by constructing a graph plot of # the one-cluster graph. The true labels are used here (below they are not). ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .subject_class_names = true_region, .class_colors = class_palette_colors, .class_shapes = class_palette_shapes, .node_size = 2, .add_node_labels = TRUE) # Visualize the posterior similarity matrix by constructing a graph plot of # the one-cluster graph. The true labels are used here (below they are not). # Remove the subject names with .add_node_labels = FALSE ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .subject_class_names = true_region, .class_colors = class_palette_colors, .class_shapes = class_palette_shapes, .node_size = 2, .add_node_labels = FALSE) # Construct a network plot without class labels ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .node_size = 2, .add_node_labels = TRUE) # Construct a network plot without class labels. Also, suppress # the subject labels. ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .node_size = 2, .add_node_labels = FALSE) ##### END EXAMPLE 2: Clustering the US Arrests Dataset (vector clustering) ##### ## Not run: ##### BEGIN EXAMPLE 3: Clustering gene expression data (vector clustering) ##### ##### Prior Distribution: Table Invitation Prior (TIP) ##### Likelihood Model: Normal Inverse Wishart (NIW) # Import the TIP library library(tip) # ----- Dataset information ----- # The data were accessed from the UCI machine learning repository # Original link: https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq # Source: Samuele Fiorini, samuele.fiorini '@' dibris.unige.it, # University of Genoa, redistributed under Creative Commons license # (http://creativecommons.org/licenses/by/3.0/legalcode) # from https://www.synapse.org/#!Synapse:syn4301332. # Data Set Information: Samples (instances) are stored row-wise. Variables # (attributes) of each sample are RNA-Seq gene expression levels measured by # illumina HiSeq platform. # Relevant Papers: Weinstein, John N., et al. 'The cancer genome atlas pan-cancer # analysis project.' Nature genetics 45.10 (2013): 1113-1120. # ------------------------------- # Import the data (see the provided link above) X <- read.csv("data.csv") true_labels <- read.csv("labels.csv") # Extract the true indices subject_names <- true_labels$X # Extract the true classes true_labels <- true_labels$Class # Convert the dataset into a matrix X <- data.matrix(X) ##### BEGIN: Apply PCA to the dataset ##### # Step 1: perform Prinicpal Components Analysis (PCA) on the dataset pca1 <- prcomp(X) # Step 2: compute summary information summary1 <- summary(pca1) # Step 3: plot the cumulative percentage of the variance # explained against the number of principal components tip::ggplot_line_point(.x = 1:length(summary1$importance['Cumulative Proportion',]), .y = summary1$importance['Cumulative Proportion',], .xlab = "Number of Principal Components", .ylab = "Cumulative Percentage of the Variance Explained") # The number of principal components chosen is 7, and # the 7 principal components explain roughly 80% of # the variance num_principal_components <- which(summary1$importance['Cumulative Proportion',] <= 0.8) # The clustering is applied to the principal component dataset X <- pca1$x[,as.numeric(num_principal_components)] ##### END: Apply PCA to the dataset ##### # Compute the distance matrix distance_matrix <- data.matrix(dist(X)) # Compute the temperature parameter estiamte temperature <- 1/median(distance_matrix[upper.tri(distance_matrix)]) # For each subject, compute the point estimate for the number of similar # subjects using univariate multiple change point detection (i.e.) init_num_neighbors = get_cpt_neighbors(.distance_matrix = distance_matrix) # Set the number of burn-in iterations in the Gibbs samlper # RECOMENDATION: *** burn >= 1000 *** burn <- 1000 # Set the number of sampling iterations in the Gibbs sampler # RECOMENDATION: *** samples >= 1000 *** samples <- 1000 # Run TIP clustering using only the prior # --> That is, the likelihood function is constant tip1 <- tip(.data = data.matrix(X), .burn = burn, .samples = samples, .similarity_matrix = exp(-1.0*temperature*distance_matrix), .init_num_neighbors = init_num_neighbors, .likelihood_model = "NIW", .subject_names = c(), .num_cores = 1) # Produce plots for the Bayesian Clustering Model tip_plots <- plot(tip1) # View the posterior distribution of the number of clusters tip_plots$trace_plot_posterior_number_of_clusters # View the trace plot with respect to the posterior number of clusters tip_plots$trace_plot_posterior_number_of_clusters # Extract posterior cluster assignments using the Posterior Expected Adjusted Rand (PEAR) index cluster_assignments <- mcclust::maxpear(psm = tip1@posterior_similarity_matrix)$cl # Create a list where each element stores the cluster assignments cluster_assignment_list <- list() for(k in 1:length(unique(cluster_assignments))){ cluster_assignment_list[[k]] <- true_labels[which(cluster_assignments == k)] } cluster_assignment_list # Create the one component graph with minimum entropy partition_list <- partition_undirected_graph(.graph_matrix = tip1@posterior_similarity_matrix, .num_components = 1, .step_size = 0.001) # Associate class labels and shapes for the plot class_palette_shapes <- c("PRAD" = 19, "BRCA" = 18, "KIRC" = 17, "LUAD" = 16, "COAD" = 15) # Associate class labels and colors for the plot class_palette_colors <- c("PRAD" = "blue", "BRCA" = "red", "KIRC" = "black", "LUAD" = "green", "COAD" = "orange") # Visualize the posterior similarity matrix by constructing a graph plot of # the one-cluster graph. The true labels are used here (below they are not). ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = subject_names, .subject_class_names = true_labels, .class_colors = class_palette_colors, .class_shapes = class_palette_shapes, .node_size = 2, .add_node_labels = TRUE) # Visualize the posterior similarity matrix by constructing a graph plot of # the one-cluster graph. The true labels are used here (below they are not). # Remove the subject names with .add_node_labels = FALSE ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = subject_names, .subject_class_names = true_labels, .class_colors = class_palette_colors, .class_shapes = class_palette_shapes, .node_size = 2, .add_node_labels = FALSE) # Construct a network plot without class labels but with subject labels ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = subject_names, .node_size = 2, .add_node_labels = TRUE) # Construct a network plot without class labels and subject labels ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = subject_names, .node_size = 2, .add_node_labels = FALSE) ##### END EXAMPLE 3: Clustering gene expression data (vector clustering) ##### ## End(Not run) ##### BEGIN EXAMPLE 4: Matrix Clustering (MNIW) ##### ##### Prior Distribution: Table Invitation Prior ##### Likelihood Model: Matrix Normal Inverse Wishart (MNIW) # Import the tip library library(tip) # A function to generate random matrices from a matrix normal distribution random_mat_normal <- function(mu, num_rows, num_cols){ LaplacesDemon::rmatrixnorm(M = matrix(mu, nrow = num_rows, ncol = num_cols), U = diag(num_rows), V = diag(num_cols)) } # Generate 3 clusters of matrices p <- 5 m <- 3 c1 <- lapply(1:20, function(x) random_mat_normal(mu = 0, num_rows = m, num_cols = p)) c2 <- lapply(1:25, function(x) random_mat_normal(mu = 5, num_rows = m, num_cols = p)) c3 <- lapply(1:30, function(x) random_mat_normal(mu = -5, num_rows = m, num_cols = p)) # Put all the data into a list data_list <- c(c1,c2,c3) # Create a vector of true labels. True labels are only necessary # for constructing network graphs that incorporate the true labels; # this is often useful for research. true_labels <- c(rep("Cluster 1", 20), rep("Cluster 2", 25), rep("Cluster 3", 30)) distance_matrix <- matrix(NA, nrow = length(true_labels), ncol = length(true_labels)) # Distance matrix for(i in 1:length(true_labels)){ for(j in i:length(true_labels)){ distance_matrix[i,j] <- SMFilter::FDist2(mX = data_list[[i]], mY = data_list[[j]]) distance_matrix[j,i] <- distance_matrix[i,j] } } # Compute the temperature parameter estiamte temperature <- 1/median(distance_matrix[upper.tri(distance_matrix)]) # For each subject, compute the point estimate for the number of similar # subjects using univariate multiple change point detection (i.e.) init_num_neighbors = get_cpt_neighbors(.distance_matrix = distance_matrix) # Set the number of burn-in iterations in the Gibbs samlper # RECOMMENDATION: burn >= 1000 burn <- 1000 # Set the number of sampling iterations in the Gibbs sampler # RECOMMENDATION: samples >= 1000 samples <- 1000 # Set the subject names names_subjects <- paste(1:dim(distance_matrix)[1]) # Run TIP clustering using only the prior # --> That is, the likelihood function is constant tip1 <- tip(.data = data_list, .burn = burn, .samples = samples, .similarity_matrix = exp(-1.0*temperature*distance_matrix), .init_num_neighbors = init_num_neighbors, .likelihood_model = "MNIW", .subject_names = names_subjects, .num_cores = 1) # Produce plots for the Bayesian Clustering Model tip_plots <- plot(tip1) # View the posterior distribution of the number of clusters tip_plots$histogram_posterior_number_of_clusters # View the trace plot with respect to the posterior number of clusters tip_plots$trace_plot_posterior_number_of_clusters # Extract posterior cluster assignments using the Posterior Expected Adjusted Rand (PEAR) index cluster_assignments <- mcclust::maxpear(psm = tip1@posterior_similarity_matrix)$cl # If the true labels are available, then show the cluster result via a contigency table table(data.frame(true_label = true_labels, cluster_assignment = cluster_assignments)) # Create the one component graph with minimum entropy partition_list <- partition_undirected_graph(.graph_matrix = tip1@posterior_similarity_matrix, .num_components = 1, .step_size = 0.001) # Associate class labels and colors for the plot class_palette_colors <- c("Cluster 1" = "blue", "Cluster 2" = 'green', "Cluster 3" = "red") # Associate class labels and shapes for the plot class_palette_shapes <- c("Cluster 1" = 19, "Cluster 2" = 18, "Cluster 3" = 17) # Visualize the posterior similarity matrix by constructing a graph plot of # the one-cluster graph. The true labels are used here (below they are not). ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = NA, .subject_class_names = true_labels, .class_colors = class_palette_colors, .class_shapes = class_palette_shapes, .node_size = 2, .add_node_labels = FALSE) # If true labels are not available, then construct a network plot # of the one-cluster graph without any class labels. ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .node_size = 2, .add_node_labels = TRUE) # If true labels are not available, then construct a network plot # of the one-cluster graph without any class labels. Also, suppress the # subject labels. ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .node_size = 2, .add_node_labels = FALSE) ##### END EXAMPLE 4: Matrix Clustering (MNIW) ##### ##### BEGIN EXAMPLE 5: Tensor Clustering ##### ##### Prior Distribution: Table Invitation Prior ##### Likelihood Model: CONSTANT # Import the tip library library(tip) # ################## NOTE ################## # Order 3 Tensor dimension: c(d1,d2,d3) # d1: number of rows # d2: number of columns # d3: number of slices ############################################ # Set a random seed for reproducibility set.seed(007) # A function to generate an order-3 tensor generate_gaussian_tensor <- function(.tensor_dimension, .mean = 0, .sd = 1){ array(data = c(rnorm(n = prod(.tensor_dimension), mean = .mean, sd = .sd)), dim = .tensor_dimension) } # Define the tensor dimension tensor_dimension <- c(256,256,3) # Generate clusters of tensors c1 <- lapply(1:30, function(x) generate_gaussian_tensor(.tensor_dimension = tensor_dimension, .mean = 0, .sd = 1)) # Generate clusters of tensors c2 <- lapply(1:40, function(x) generate_gaussian_tensor(.tensor_dimension = tensor_dimension, .mean = 5, .sd = 1)) # Generate clusters of tensors c3 <- lapply(1:50, function(x) generate_gaussian_tensor(.tensor_dimension = tensor_dimension, .mean = -5, .sd = 1)) # Make a list of tensors X <- c(c1, c2, c3) # Compute the number of subjects for each cluster n1 <- length(c1) n2 <- length(c2) n3 <- length(c3) # Create a vector of true labels. True labels are only necessary # for constructing network graphs that incorporate the true labels; # this is often useful for research. true_labels <- c(rep("Cluster 1", n1), rep("Cluster 2", n2), rep("Cluster 3", n3)) # Compute the total number of subjects n <- length(X) # Construct the distance matrix distance_matrix <- matrix(data = NA, nrow = n, ncol = n) for(i in 1:n){ for(j in i:n){ distance_matrix[i,j] <- sqrt(sum((X[[i]] - X[[j]])^2)) distance_matrix[j,i] <- distance_matrix[i,j] } } # Compute the temperature parameter estiamte temperature <- 1/median(distance_matrix[upper.tri(distance_matrix)]) # For each subject, compute the point estimate for the number of similar # subjects using univariate multiple change point detection (i.e.) init_num_neighbors = get_cpt_neighbors(.distance_matrix = distance_matrix) # Set the number of burn-in iterations in the Gibbs samlper # RECOMMENDATION: burn >= 1000 burn <- 1000 # Set the number of sampling iterations in the Gibbs sampler # RECOMMENDATION: samples >= 1000 samples <- 1000 # Set the subject names names_subjects <- paste(1:n) # Run TIP clustering using only the prior # --> That is, the likelihood function is constant tip1 <- tip(.data = list(), .burn = burn, .samples = samples, .similarity_matrix = exp(-1.0*temperature*distance_matrix), .init_num_neighbors = init_num_neighbors, .likelihood_model = "CONSTANT", .subject_names = names_subjects, .num_cores = 1) # Produce plots for the Bayesian Clustering Model tip_plots <- plot(tip1) # View the posterior distribution of the number of clusters tip_plots$histogram_posterior_number_of_clusters # View the trace plot with respect to the posterior number of clusters tip_plots$trace_plot_posterior_number_of_clusters # Extract posterior cluster assignments using the Posterior Expected Adjusted Rand (PEAR) index cluster_assignments <- mcclust::maxpear(psm = tip1@posterior_similarity_matrix)$cl # If the true labels are available, then show the cluster result via a contigency table table(data.frame(true_label = true_labels, cluster_assignment = cluster_assignments)) # Create the one component graph with minimum entropy partition_list <- partition_undirected_graph(.graph_matrix = tip1@posterior_similarity_matrix, .num_components = 1, .step_size = 0.001) # Associate class labels and colors for the plot class_palette_colors <- c("Cluster 1" = "blue", "Cluster 2" = 'green', "Cluster 3" = "red") # Associate class labels and shapes for the plot class_palette_shapes <- c("Cluster 1" = 19, "Cluster 2" = 18, "Cluster 3" = 17) # Visualize the posterior similarity matrix by constructing a graph plot of # the one-cluster graph. The true labels are used here (below they are not). ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = NA, .subject_class_names = true_labels, .class_colors = class_palette_colors, .class_shapes = class_palette_shapes, .node_size = 2, .add_node_labels = FALSE) # If true labels are not available, then construct a network plot # of the one-cluster graph without any class labels. ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .node_size = 2, .add_node_labels = TRUE) # If true labels are not available, then construct a network plot # of the one-cluster graph without any class labels. Also, suppress # the subject labels. ggnet2_network_plot(.matrix_graph = partition_list$partitioned_graph_matrix, .subject_names = names_subjects, .node_size = 2, .add_node_labels = FALSE) ##### END EXAMPLE 5: Tensor Clustering #####