Annotating NMF factors with sample metadata

Learning bird species communities within the Hawaiian islands

Annotating NMF factors

NMF learns an interpretable low-rank representation of data. However, how do we make sense of the factors in this low-rank latent model? A great way to begin annotating a latent space is to simply map it back to known sample and feature traits.

This vignette demonstrates these concepts using an NMF model of bird species communities throughout the Hawaiian islands.

Install RcppML

Install the RcppML R package from CRAN or the development version from GitHub. Also install the accompanying Machine Learning datasets (MLdata) package:

install.packages('RcppML')                     # install CRAN version
# devtools::install_github("zdebruine/RcppML") # compile dev version
devtools::install_github("zdebruine/MLdata")
library(RcppML)
library(MLdata)
library(ggplot2)
library(cowplot)
library(viridis)
library(ggrepel)
library(uwot)

The hawaiibirds dataset

The MLdata::hawaiibirds dataset gives the frequency of bird species in small geographical grids throughout the state of Hawaii.

data(hawaiibirds)
hawaiibirds$counts[1:4, 1:4]
## 4 x 4 sparse Matrix of class "dgCMatrix"
##                                grid1      grid2       grid3      grid4
## Common Myna               0.32432432 0.19230769 0.242753623 0.80208333
## Black-crowned Night-Heron 0.06756757 0.07692308 0.007246377 0.03819444
## Black Noddy               .          0.26923077 0.188405797 .         
## Brown Noddy               .          0.38461538 .           .

A separate metadata_h matrix gives the geographical coordinates and the corresponding island for each grid.

head(hawaiibirds$metadata_h)
##    grid island   lat     lng
## 1 grid1   Maui 20.87 -156.44
## 2 grid2   Oahu 21.33 -157.66
## 3 grid3 Hawaii 19.33 -155.19
## 4 grid4   Oahu 21.37 -157.94
## 5 grid5 Hawaii 19.72 -155.11
## 6 grid6   Maui 20.74 -156.24

And a separate metadata_w matrix gives taxonomic information about each species in the database.

head(hawaiibirds$metadata_w)
##                     species             order
## 1               Common Myna     Passeriformes
## 2 Black-crowned Night-Heron    Pelecaniformes
## 3               Black Noddy   Charadriiformes
## 4               Brown Noddy   Charadriiformes
## 5           Bulwer's Petrel Procellariiformes
## 6                Sooty Tern   Charadriiformes
##                                     family       category     status
## 1                    Sturnidae (Starlings) perching birds introduced
## 2  Ardeidae (Herons, Egrets, and Bitterns)         waders     native
## 3     Laridae (Gulls, Terns, and Skimmers)     shorebirds     native
## 4     Laridae (Gulls, Terns, and Skimmers)     shorebirds     native
## 5 Procellariidae (Shearwaters and Petrels)       seabirds     native
## 6     Laridae (Gulls, Terns, and Skimmers)     shorebirds     native

Cross-validation for Rank Determination

We can learn an NMF model to describe linear combinations of species across geographical grids. First we need to choose a rank.

The rank of a factorization is a crucial hyperparameter. One way to help decide on a rank is cross-validation. This is made easy using the crossValidate function. See ?crossValidate for details on methods.

For many applications, there is no “optimal” rank. In this case, we do expect some amount of distinct biodiversity across the various islands, but within the islands there will be a continuum of habitat niches confounding rank of the signal. Additionally, there may be a number of “missing” observations where surveys were incomplete, which will confound signal separation.

Here we cross-validate across 3 independent replicates and plot the result:

plot(crossValidate(hawaiibirds$counts, k = c(1:10, 12, 15, 20, 25, 30), reps = 3, verbose = FALSE))

We’ll choose a rank of k = 10 since this seems to capture much of the signal while giving identifiable factors.

Run robust NMF

Let’s generate a high-quality NMF model across 10 random restarts at very low tolerance:

model <- nmf(hawaiibirds$counts, k = 10, seed = 1:10, tol = 1e-6)
model
## 183 x 1183 x 10 factor model of class "nmf"
## $ w
##                                  nmf1        nmf2        nmf3       nmf4 nmf5
## Common Myna               0.146640316 0.094888073 0.074299917 0.04111043    0
## Black-crowned Night-Heron 0.006777407 0.004027294 0.005781633 0.00000000    0
## Black Noddy               0.000000000 0.006376501 0.000000000 0.00000000    0
## Brown Noddy               0.000000000 0.000000000 0.000000000 0.00000000    0
## Bulwer's Petrel           0.000000000 0.000000000 0.000000000 0.00000000    0
## ...suppressing 178 rows and 5 columns
## 
## $ d
## [1] 1350.6295 1256.2949 1153.2389  911.6537  834.4069
## ...suppressing 5 values
## 
## $ h
##             grid1        grid2        grid3       grid4        grid5
## nmf1 0.0009274172 0.0004718018 0.0005553570 0.003512579 0.0006238265
## nmf2 0.0001676291 0.0002334082 0.0009073722 0.000000000 0.0018705609
## nmf3 0.0005758524 0.0000000000 0.0000000000 0.000000000 0.0005398256
## nmf4 0.0000000000 0.0003021981 0.0000000000 0.003822848 0.0000000000
## nmf5 0.0000000000 0.0000000000 0.0011624112 0.000000000 0.0000000000
## ...suppressing 5 rows and 1178 columns
## 
## $ tol: 8.238107e-07 
## $ iter: 67 
## $ runtime: 0.9558601 sec

In the w matrix we have factors describing communities of co-occuring bird species.

In the h matrix we have the association of these bird communities in each surveyed geographical grid.

Geographic focus on NMF factors

What does each NMF factor tell us?

The sample embeddings matrix (h) gives information about the geographical representation of each NMF factor across all grids. We’ll look at just the first four factors:

plots <- list()
for(i in 1:4){
  df <- data.frame(
    "lat" = hawaiibirds$metadata_h$lat,
    "lng" = hawaiibirds$metadata_h$lng,
    "nmf_factor" = model$h[i, ])
  plots[[i]] <- ggplot(df, aes(x = lng, y = lat, color = nmf_factor)) +
    geom_point() +
    scale_color_viridis(option = "B") +
    theme_void() +
    theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) + 
    ggtitle(paste0("Factor ", i))
}
plot_grid(plotlist = plots, nrow = 2)

Metadata enrichment in factors

Factor 2 is localized largely to the island of Hawaii, factor 3 to the island of Kauai, and factor 4 to Oahu.

Quantitatively, the summary method for the nmf S3 class makes it easy to annotate factors using metadata about samples or features. See ?summary.nmf for info.

In this case, we will use summary to map factor enrichment in grids corresponding to each Hawaiian island, and species enrichment corresponding to each category.

plot(summary(model, group_by = hawaiibirds$metadata_h$island, stat = "sum"))

In general, grids separate based on the island to which they belong – consistent with the expectation that islands contain distinct species communities.

Notice how several factors explain variation within the big island, “Hawaii”, consistent with the objective of NMF and the biological diversity within that island.

Due to our normalization method (sum) very small islands with minor contribution to the model objective (i.e. Puuwai) are hardly represented.

plot(summary(model, group_by = hawaiibirds$metadata_w$category, stat = "mean"))

Clearly, there is the greatest signal complexity among “perching birds”. nmf10 is describing “seabirds” while nmf9 is capturing much of the “non-perching birds” information.

NMF biplots

Compare species composition in two factors that are both primarily restricted to the island of Hawaii, factors 7 and 8. The biplot S3 method for nmf makes this easy:

biplot(model, factors = c(7, 8), matrix = "w", group_by = hawaiibirds$metadata_w$category) + 
  scale_y_continuous(trans = "sqrt") + 
  scale_x_continuous(trans = "sqrt") +
  geom_text_repel(size = 2.5, seed = 123, max.overlaps = 15)

Factor 7 describes a wet rainforest community while Factor 8 describes dry rainforest/shrubland communities. Both factors are overwhelmingly focused on “perching birds”.

UMAP on NMF embeddings

We might also be interested in visualizing how factors in \(w\) capture similarities among bird species using UMAP.

set.seed(123)
umap <- data.frame(uwot::umap(model$w))
umap$taxon <- hawaiibirds$metadata_w$category
umap$status <- hawaiibirds$metadata_w$status
plot_grid(
  ggplot(umap, aes(x = umap[,1], y = umap[,2], color = taxon)) +
    geom_point() + theme_void(),
  ggplot(umap, aes(x = umap[,1], y = umap[,2], color = status)) +
    geom_point() + theme_void(),
  nrow = 1
)

Species are classified based on habitat niche and taxonomic membership. Notice “seabirds” on the left, “perching birds” in the center mixed with “non-perching birds”, and a mix of “waders”, “waterfowl”, and “shorebirds” in the bottom right. There are also two distinct groups of “shorebirds” and “waterfowl”, consistent with distinct inland and shoreline communities.

Hawaii is extinction kingdom. For instance, more than 20 species of endemic honeycreeper have gone extinct in the past two centuries due to the establishment of introduced species and habitat devastation. Few remain. In the UMAP plot above on the right, we can observe that introduced species dominate habitat niches occupied by native perching and non-perching birds, a problem underlying historic and ongoing mass extinction events.

set.seed(123)
umap <- data.frame(uwot::umap(t(model$h)))
umap$group <- hawaiibirds$metadata_h$island
ggplot(umap, aes(x = umap[,1], y = umap[,2], color = group)) +
  geom_point() + theme_void()

Islands are also well-defined by the NMF model.

Defining the “Palila” species niche

The Palila is a highly endangered species that survives in small numbers on the eastern slopes of Mauna Kea on the dry island, in a shrubby dry “rainforest” biome. This biome is unique on the island of Hawaii.

What species coexist with the Palila?

Let’s find the highest factorization resolution at which a single factor describes the distribution of the Palila.

palila <- list()
for(rank in 1:20)
  palila[[rank]] <- data.frame(
    "value" = nmf(hawaiibirds$counts, k = rank, seed = 123, v = F)$w["Palila", ],
    "rank" = rep(rank, rank)
  )
palila <- do.call(rbind, palila)
ggplot(palila, aes(x = rank, y = value, color = factor(rank))) + geom_jitter(width = 0.1) + theme_classic() +
  scale_color_manual(values = rep(c("#F8766D", "#00BDD0"), 10)) + labs("Palila loading in factor") + theme(legend.position = "none")

The model with a rank of 15 contains a factor in which the Palila is both important and specific.

Let’s have a look at the species composition in factor 15, specifically identifying which species are introduced and which are native:

model <- nmf(hawaiibirds$counts, k = 15, seed = 123, v = F)
df <- data.frame("value" = model$w[, which.max(model$w["Palila", ])])
df$status <- hawaiibirds$metadata_w$status
df <- df[order(-df$value), ]
df <- df[df$value > 0.001, ]
df
##                             value     status
## Hawaii Amakihi        0.353110902     native
## Warbling White-eye    0.188662129     native
## House Finch           0.182132197 introduced
## Erckel's Francolin    0.066866618 introduced
## Yellow-fronted Canary 0.045745828 introduced
## California Quail      0.043372489     native
## Palila                0.038791926     native
## Eurasian Skylark      0.032193106 introduced
## Hawaii Elepaio        0.028948444     native
## Red-billed Leiothrix  0.008899639 introduced
## Chukar                0.004825709 introduced
## Indian Peafowl        0.003265103     native
## Chinese Hwamei        0.002520935 introduced

The diet of the Palilla is largely seeds from the “mamame” tree, but also naio berries and mamame flowers, buds, and young leaves. What introduced perching birds may be competing with the Palila for these resources?

perching_birds <- hawaiibirds$metadata_w$species[hawaiibirds$metadata_w$category == "perching birds"]
df[which(rownames(df) %in% perching_birds & df$status == "introduced"), ]
##                             value     status
## House Finch           0.182132197 introduced
## Yellow-fronted Canary 0.045745828 introduced
## Eurasian Skylark      0.032193106 introduced
## Red-billed Leiothrix  0.008899639 introduced
## Chinese Hwamei        0.002520935 introduced

The “House Finch” and “Yellow-fronted Canary” seem to be the most significant competitors in the Palila habitat niche.

Zach DeBruine
Zach DeBruine
Assistant Professor of Bioinformatics

Assistant Professor of Bioinformatics at Grand Valley State University. Interested in single-cell experiments and dimension reduction. I love simple, fast, and common sense machine learning.