Annotating NMF factors with sample metadata
Learning bird species communities within the Hawaiian islands
Annotating NMF factors
NMF learns an interpretable low-rank representation of data. However, how do we make sense of the factors in this low-rank latent model? A great way to begin annotating a latent space is to simply map it back to known sample and feature traits.
This vignette demonstrates these concepts using an NMF model of bird species communities throughout the Hawaiian islands.
Install RcppML
Install the RcppML R package from CRAN or the development version from GitHub. Also install the accompanying Machine Learning datasets (MLdata) package:
install.packages('RcppML') # install CRAN version
# devtools::install_github("zdebruine/RcppML") # compile dev version
devtools::install_github("zdebruine/MLdata")
library(RcppML)
library(MLdata)
library(ggplot2)
library(cowplot)
library(viridis)
library(ggrepel)
library(uwot)
The hawaiibirds dataset
The MLdata::hawaiibirds
dataset gives the frequency of bird species in small geographical grids throughout the state of Hawaii.
data(hawaiibirds)
hawaiibirds$counts[1:4, 1:4]
## 4 x 4 sparse Matrix of class "dgCMatrix"
## grid1 grid2 grid3 grid4
## Common Myna 0.32432432 0.19230769 0.242753623 0.80208333
## Black-crowned Night-Heron 0.06756757 0.07692308 0.007246377 0.03819444
## Black Noddy . 0.26923077 0.188405797 .
## Brown Noddy . 0.38461538 . .
A separate metadata_h
matrix gives the geographical coordinates and the corresponding island for each grid.
head(hawaiibirds$metadata_h)
## grid island lat lng
## 1 grid1 Maui 20.87 -156.44
## 2 grid2 Oahu 21.33 -157.66
## 3 grid3 Hawaii 19.33 -155.19
## 4 grid4 Oahu 21.37 -157.94
## 5 grid5 Hawaii 19.72 -155.11
## 6 grid6 Maui 20.74 -156.24
And a separate metadata_w
matrix gives taxonomic information about each species in the database.
head(hawaiibirds$metadata_w)
## species order
## 1 Common Myna Passeriformes
## 2 Black-crowned Night-Heron Pelecaniformes
## 3 Black Noddy Charadriiformes
## 4 Brown Noddy Charadriiformes
## 5 Bulwer's Petrel Procellariiformes
## 6 Sooty Tern Charadriiformes
## family category status
## 1 Sturnidae (Starlings) perching birds introduced
## 2 Ardeidae (Herons, Egrets, and Bitterns) waders native
## 3 Laridae (Gulls, Terns, and Skimmers) shorebirds native
## 4 Laridae (Gulls, Terns, and Skimmers) shorebirds native
## 5 Procellariidae (Shearwaters and Petrels) seabirds native
## 6 Laridae (Gulls, Terns, and Skimmers) shorebirds native
Cross-validation for Rank Determination
We can learn an NMF model to describe linear combinations of species across geographical grids. First we need to choose a rank.
The rank of a factorization is a crucial hyperparameter. One way to help decide on a rank is cross-validation. This is made easy using the crossValidate
function. See ?crossValidate
for details on methods.
For many applications, there is no “optimal” rank. In this case, we do expect some amount of distinct biodiversity across the various islands, but within the islands there will be a continuum of habitat niches confounding rank of the signal. Additionally, there may be a number of “missing” observations where surveys were incomplete, which will confound signal separation.
Here we cross-validate across 3 independent replicates and plot the result:
plot(crossValidate(hawaiibirds$counts, k = c(1:10, 12, 15, 20, 25, 30), reps = 3, verbose = FALSE))
We’ll choose a rank of k = 10
since this seems to capture much of the signal while giving identifiable factors.
Run robust NMF
Let’s generate a high-quality NMF model across 10 random restarts at very low tolerance:
model <- nmf(hawaiibirds$counts, k = 10, seed = 1:10, tol = 1e-6)
model
## 183 x 1183 x 10 factor model of class "nmf"
## $ w
## nmf1 nmf2 nmf3 nmf4 nmf5
## Common Myna 0.146640316 0.094888073 0.074299917 0.04111043 0
## Black-crowned Night-Heron 0.006777407 0.004027294 0.005781633 0.00000000 0
## Black Noddy 0.000000000 0.006376501 0.000000000 0.00000000 0
## Brown Noddy 0.000000000 0.000000000 0.000000000 0.00000000 0
## Bulwer's Petrel 0.000000000 0.000000000 0.000000000 0.00000000 0
## ...suppressing 178 rows and 5 columns
##
## $ d
## [1] 1350.6295 1256.2949 1153.2389 911.6537 834.4069
## ...suppressing 5 values
##
## $ h
## grid1 grid2 grid3 grid4 grid5
## nmf1 0.0009274172 0.0004718018 0.0005553570 0.003512579 0.0006238265
## nmf2 0.0001676291 0.0002334082 0.0009073722 0.000000000 0.0018705609
## nmf3 0.0005758524 0.0000000000 0.0000000000 0.000000000 0.0005398256
## nmf4 0.0000000000 0.0003021981 0.0000000000 0.003822848 0.0000000000
## nmf5 0.0000000000 0.0000000000 0.0011624112 0.000000000 0.0000000000
## ...suppressing 5 rows and 1178 columns
##
## $ tol: 8.238107e-07
## $ iter: 67
## $ runtime: 0.9558601 sec
In the w
matrix we have factors describing communities of co-occuring bird species.
In the h
matrix we have the association of these bird communities in each surveyed geographical grid.
Geographic focus on NMF factors
What does each NMF factor tell us?
The sample embeddings matrix (h
) gives information about the geographical representation of each NMF factor across all grids. We’ll look at just the first four factors:
plots <- list()
for(i in 1:4){
df <- data.frame(
"lat" = hawaiibirds$metadata_h$lat,
"lng" = hawaiibirds$metadata_h$lng,
"nmf_factor" = model$h[i, ])
plots[[i]] <- ggplot(df, aes(x = lng, y = lat, color = nmf_factor)) +
geom_point() +
scale_color_viridis(option = "B") +
theme_void() +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
ggtitle(paste0("Factor ", i))
}
plot_grid(plotlist = plots, nrow = 2)
Metadata enrichment in factors
Factor 2 is localized largely to the island of Hawaii, factor 3 to the island of Kauai, and factor 4 to Oahu.
Quantitatively, the summary
method for the nmf
S3 class makes it easy to annotate factors using metadata about samples or features. See ?summary.nmf
for info.
In this case, we will use summary
to map factor enrichment in grids corresponding to each Hawaiian island, and species enrichment corresponding to each category.
plot(summary(model, group_by = hawaiibirds$metadata_h$island, stat = "sum"))
In general, grids separate based on the island to which they belong – consistent with the expectation that islands contain distinct species communities.
Notice how several factors explain variation within the big island, “Hawaii”, consistent with the objective of NMF and the biological diversity within that island.
Due to our normalization method (sum
) very small islands with minor contribution to the model objective (i.e. Puuwai) are hardly represented.
plot(summary(model, group_by = hawaiibirds$metadata_w$category, stat = "mean"))
Clearly, there is the greatest signal complexity among “perching birds”. nmf10
is describing “seabirds” while nmf9
is capturing much of the “non-perching birds” information.
NMF biplots
Compare species composition in two factors that are both primarily restricted to the island of Hawaii, factors 7 and 8. The biplot
S3 method for nmf
makes this easy:
biplot(model, factors = c(7, 8), matrix = "w", group_by = hawaiibirds$metadata_w$category) +
scale_y_continuous(trans = "sqrt") +
scale_x_continuous(trans = "sqrt") +
geom_text_repel(size = 2.5, seed = 123, max.overlaps = 15)
Factor 7 describes a wet rainforest community while Factor 8 describes dry rainforest/shrubland communities. Both factors are overwhelmingly focused on “perching birds”.
UMAP on NMF embeddings
We might also be interested in visualizing how factors in \(w\) capture similarities among bird species using UMAP.
set.seed(123)
umap <- data.frame(uwot::umap(model$w))
umap$taxon <- hawaiibirds$metadata_w$category
umap$status <- hawaiibirds$metadata_w$status
plot_grid(
ggplot(umap, aes(x = umap[,1], y = umap[,2], color = taxon)) +
geom_point() + theme_void(),
ggplot(umap, aes(x = umap[,1], y = umap[,2], color = status)) +
geom_point() + theme_void(),
nrow = 1
)
Species are classified based on habitat niche and taxonomic membership. Notice “seabirds” on the left, “perching birds” in the center mixed with “non-perching birds”, and a mix of “waders”, “waterfowl”, and “shorebirds” in the bottom right. There are also two distinct groups of “shorebirds” and “waterfowl”, consistent with distinct inland and shoreline communities.
Hawaii is extinction kingdom. For instance, more than 20 species of endemic honeycreeper have gone extinct in the past two centuries due to the establishment of introduced species and habitat devastation. Few remain. In the UMAP plot above on the right, we can observe that introduced species dominate habitat niches occupied by native perching and non-perching birds, a problem underlying historic and ongoing mass extinction events.
set.seed(123)
umap <- data.frame(uwot::umap(t(model$h)))
umap$group <- hawaiibirds$metadata_h$island
ggplot(umap, aes(x = umap[,1], y = umap[,2], color = group)) +
geom_point() + theme_void()
Islands are also well-defined by the NMF model.
Defining the “Palila” species niche
The Palila is a highly endangered species that survives in small numbers on the eastern slopes of Mauna Kea on the dry island, in a shrubby dry “rainforest” biome. This biome is unique on the island of Hawaii.
What species coexist with the Palila?
Let’s find the highest factorization resolution at which a single factor describes the distribution of the Palila.
palila <- list()
for(rank in 1:20)
palila[[rank]] <- data.frame(
"value" = nmf(hawaiibirds$counts, k = rank, seed = 123, v = F)$w["Palila", ],
"rank" = rep(rank, rank)
)
palila <- do.call(rbind, palila)
ggplot(palila, aes(x = rank, y = value, color = factor(rank))) + geom_jitter(width = 0.1) + theme_classic() +
scale_color_manual(values = rep(c("#F8766D", "#00BDD0"), 10)) + labs("Palila loading in factor") + theme(legend.position = "none")
The model with a rank of 15 contains a factor in which the Palila is both important and specific.
Let’s have a look at the species composition in factor 15, specifically identifying which species are introduced and which are native:
model <- nmf(hawaiibirds$counts, k = 15, seed = 123, v = F)
df <- data.frame("value" = model$w[, which.max(model$w["Palila", ])])
df$status <- hawaiibirds$metadata_w$status
df <- df[order(-df$value), ]
df <- df[df$value > 0.001, ]
df
## value status
## Hawaii Amakihi 0.353110902 native
## Warbling White-eye 0.188662129 native
## House Finch 0.182132197 introduced
## Erckel's Francolin 0.066866618 introduced
## Yellow-fronted Canary 0.045745828 introduced
## California Quail 0.043372489 native
## Palila 0.038791926 native
## Eurasian Skylark 0.032193106 introduced
## Hawaii Elepaio 0.028948444 native
## Red-billed Leiothrix 0.008899639 introduced
## Chukar 0.004825709 introduced
## Indian Peafowl 0.003265103 native
## Chinese Hwamei 0.002520935 introduced
The diet of the Palilla is largely seeds from the “mamame” tree, but also naio berries and mamame flowers, buds, and young leaves. What introduced perching birds may be competing with the Palila for these resources?
perching_birds <- hawaiibirds$metadata_w$species[hawaiibirds$metadata_w$category == "perching birds"]
df[which(rownames(df) %in% perching_birds & df$status == "introduced"), ]
## value status
## House Finch 0.182132197 introduced
## Yellow-fronted Canary 0.045745828 introduced
## Eurasian Skylark 0.032193106 introduced
## Red-billed Leiothrix 0.008899639 introduced
## Chinese Hwamei 0.002520935 introduced
The “House Finch” and “Yellow-fronted Canary” seem to be the most significant competitors in the Palila habitat niche.