Modeling Kelp Observations as a Zero-Inflated Poisson Process

13 May 2022 - Dean Goldman

Introduction

An important part of wildlife monitoring is assessing whether or not a species population has increased or decreased over time. In this post, I share a project that applies statistical modeling to kelp observational data from KEEN (Kelp Ecosystem Ecology Network), with the goal of determining whether there has been a statistically significant change over time in the rate of kelp observations. My goal for this post is to share a useful application of statistical modeling to wildlife data, and to break down some of the methods I used along with some code I put together.

Kelp Observation Dataset

In the KEEN dataset, divers collected marine life sightings from 13 kelp forest sites in regions around the Gulf of Maine between 2014 and 2018. The KEEN dataset captures a variety of wildlife including kelp, fish, algae, and invertebrates. This research will focus solely on the number of kelp sightings per “quadrat”. A quadrat refers to a one square meter sample within a 40 meter transect of marine area at a general kelp forest site. Each transect contains six quadrats spaced eight meters apart, alternating on either side of the transect line.

[Figure: quadrat sampling protocol]
[Figure: kelp forest sites around the Gulf of Maine]

Modeling Kelp Populations as a Zero-Inflated Poisson Process

To study the behavior of a population of kelp over time, it is useful to describe the observed data with a theoretical probability distribution. From this model, whose parameters are estimated from our data, we can draw probabilistic conclusions about the population’s past behavior and form expectations about its future behavior. This particular analysis attempts to express the probability of a given number of kelp observations occurring in a quadrat. The distribution for this kind of question is the Poisson distribution, which more generally expresses the probability of a given number of independent events occurring in a fixed interval. You can read more about the Poisson distribution in my other blogpost: https://deanjgoldman.github.io/2022/04/15/Poisson-Distribution-Binomial.html.

One important thing to be aware of when using a Poisson distribution is sparsity: if the data contain a lot of 0 counts, a useful modification of the Poisson is the Zero-Inflated Poisson (ZIP) [3][4]. In most of these kelp sites, there are far more 0 counts than a Poisson would predict, even with the leanest rate parameter \(\lambda=1\). A ZIP distribution inflates the expected number of zeros, so it seems like a pretty good choice.
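As a quick illustration of what zero inflation buys us, here is a small sketch comparing the probability of an empty quadrat under a plain Poisson with \(\lambda=1\) against a ZIP with the same rate. The structural-zero probability of 0.4 is made up for the example:

```python
import math

lam = 1.0  # the "leanest" Poisson rate mentioned above

# Probability of observing zero events under a plain Poisson(lam)
p_zero_poisson = math.exp(-lam)

# Under a ZIP, zeros come from two sources: a structural-zero component
# (with hypothetical probability 0.4 here) and the Poisson component.
p_struct = 0.4
p_zero_zip = p_struct + (1 - p_struct) * math.exp(-lam)

print(f"P(0) under Poisson(1):   {p_zero_poisson:.3f}")  # 0.368
print(f"P(0) under ZIP(0.4, 1):  {p_zero_zip:.3f}")      # 0.621
```

Even with the same rate, the ZIP puts substantially more mass on zero, which is the behavior we see in the site histograms.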

[Figure: kelp observation frequencies per site]


[Figure: Poisson distribution]
[Figure: ZIP example]

Exploratory Analysis

A good starting point for analysis is to compute the empirical statistics of kelp observations per site, e.g. sample mean, variance, etc.
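A minimal sketch of computing those statistics with the standard library. The site names match the dataset, but the per-quadrat counts here are invented for illustration:

```python
import statistics

# Hypothetical per-quadrat kelp counts grouped by site; in the real
# analysis these would come from the KEEN data files.
counts_by_site = {
    "Baker North": [0, 0, 3, 0, 12, 1, 0, 5],
    "Nahant":      [2, 0, 0, 7, 0, 1, 0, 0],
}

for site, counts in counts_by_site.items():
    mean = statistics.mean(counts)
    var = statistics.variance(counts)  # sample variance (n - 1 denominator)
    sd = statistics.stdev(counts)
    print(f"{site}: mean={mean:.2f} var={var:.2f} sd={sd:.2f}")
```

Even on toy numbers like these, the variance far exceeds the mean, which is the overdispersion pattern discussed below.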


From the sample statistics, we can see that a key assumption of the Poisson (the average rate \(\lambda\) equals the variance) is violated. In this case, \(\lambda\) is much closer to the standard deviation than to the variance. This is explained by the fact that most sites are zero-inflated and tend to have a good number of outlier counts, meaning that some quadrats contained a relatively high number of kelp observations. We can still move forward with the Poisson analysis, but this is an important observation to flag, and it probably deserves its own accounting in future modeling.

If we move forward with our analysis, we do see that the average rate of kelp observations tends to go down year over year in several sites (e.g. Baker North, Baker South, and Nahant). This is a good starting point for a hypothesis test to determine whether or not this apparent decline is a statistically significant decay in the rate of the population's Poisson process.

Parameterizing the Model

The basic Zero-Inflated Poisson distribution requires two parameters: \(p\) and \(\lambda\).

\[\begin{aligned} &P(Y=0) = p + (1 - p)e^{-\lambda}\\ &P(Y=k) = (1 - p)\frac{\lambda^{k}e^{-\lambda}}{k!} \end{aligned} \tag{1}\]

The ZIP distribution’s \(p\) parameter is the probability in a Bernoulli distribution that determines whether an interval contains 0 events or a number of events drawn from a Poisson with parameter \(\lambda\); this is what “inflates” the frequency of 0 counts in a ZIP distribution. \(\lambda\) is the expected rate of occurrence in the Poisson process, in our case the expected number of kelp in a quadrat. A common way to determine these values is to find the parameters under which the probability of the observed data is higher than under all other values. This is called Maximum Likelihood Estimation (MLE), a general method for finding the parameters of a model. For a very simple Poisson model, finding \(\lambda\) through MLE is equivalent to taking the sample mean [2], but the ZIP distribution deviates slightly from this.
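Here is a rough sketch of fitting the two ZIP parameters by MLE on synthetic data. The grid search is deliberately crude (a real analysis would use an EM algorithm or a numerical optimizer), but it makes the idea concrete: pick the \((p, \lambda)\) pair that makes the observed counts most probable.

```python
import math

def zip_pmf(k, p, lam):
    """ZIP PMF from equation (1): a structural zero with probability p,
    otherwise a Poisson(lam) draw."""
    pois = math.exp(-lam) * lam ** k / math.factorial(k)
    return p + (1 - p) * pois if k == 0 else (1 - p) * pois

def zip_log_likelihood(data, p, lam):
    return sum(math.log(zip_pmf(k, p, lam)) for k in data)

def fit_zip_grid(data):
    """Crude MLE by grid search. p stops short of 1 and lam stays above 0
    so the log-likelihood never hits log(0)."""
    p_grid = [i / 50 for i in range(50)]        # 0.00 .. 0.98
    lam_grid = [i / 10 for i in range(5, 101)]  # 0.5 .. 10.0
    return max(
        ((p, lam) for p in p_grid for lam in lam_grid),
        key=lambda t: zip_log_likelihood(data, *t),
    )

# Synthetic counts: many structural zeros plus a Poisson-ish bump near 4
data = [0] * 30 + [3, 4, 4, 5, 3, 4, 6, 2, 4, 5]
p_hat, lam_hat = fit_zip_grid(data)
print(p_hat, lam_hat)  # roughly p ≈ 0.74, lambda ≈ 3.9 on this data
```

Note that the fitted \(\lambda\) is close to the mean of the nonzero counts, not the overall sample mean, which is the sense in which ZIP fitting "slightly deviates" from plain Poisson MLE.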

Determining Goodness-of-Fit Between the Theoretical Model and Observed Data

Although it seems plausible that kelp observations per quadrat meet the criteria of a ZIP process, we can’t assume the probability distribution of the number of kelp per quadrat neatly fits that model. It may be the case that the events (kelp plant sightings) aren’t independent: the existence of one or more kelp plants may increase or decrease the probability of others nearby. A critical step in this analysis is the Chi-Square goodness-of-fit test, which compares the observed distribution (the kelp counts) with the expected distribution (the Zero-Inflated Poisson). The Chi-Square test for goodness of fit is performed by setting up a hypothesis test:

\[\begin{aligned} &H_{0}: X \sim \text{ZIP}(p, \lambda)\\ &vs.\\ &H_{A}: X \not\sim \text{ZIP}(p, \lambda)\\ &\\ &\text{Reject } H_{0} \text{ if } \chi^{2} > \chi^{2}{}_{1-\alpha,k-c} \end{aligned}\]

\(\chi^{2}\) is the normalized sum of squared differences between the observed frequencies of each number of events and their expected frequencies. That is, a single value in \(O\) is the number of times the dataset shows \(x\) events occurring in an interval, and the corresponding value in \(E\) is the expected frequency of \(x\) events per interval, assuming the null hypothesis is true. When the data fit the distribution well, i.e. when the differences between \(O\) and \(E\) are small, \(\chi^{2}\) will be relatively small.

\[\chi^{2} = \Sigma\frac{(O-E)^{2}}{E} \tag{2}\]

We’re interested in comparing \(\chi^{2}\) with a theoretical number called the critical value, given by \(\chi^{2}{}_{1 - \alpha,k - c}\). The critical value is derived from the Chi-Square distribution, the distribution of the sum of the squares of independent standard normal variables. In the goodness-of-fit setting, the degrees of freedom are \(k - c\), where \(k\) is the number of count categories (bins) and \(c\) is the number of parameters estimated from the data plus one; since the ZIP has two estimated parameters (\(p\) and \(\lambda\)), \(c = 3\) in our case.


The critical value is also the threshold at which, given that our \(\chi^{2}\) exceeds it, we are \((1-\alpha) \cdot 100\%\) confident that we are correctly rejecting our null hypothesis. Another way of running this hypothesis test is to compare the \(p\)-value derived from \(\chi^{2}\) with our significance level \(\alpha\). The \(p\)-value is the probability of observing a test statistic at least as large as \(\chi^{2}\) given that the null hypothesis is true [2]. If \(p < \alpha\), that probability is low enough that we cannot attribute the deviation to chance, and we may reject the null hypothesis. If the \(p\)-value is higher than \(\alpha\), the deviation between observed and expected is plausibly within the range of ordinary sampling error, so we fail to reject the null hypothesis. At a high level, this is a bit like comparing two histograms: the histogram of observed kelp sightings per quadrat, and a theoretical ZIP pmf.

Implementation

I wrote out some Python code for running the Chi-Square goodness-of-fit test for a ZIP distribution.
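A minimal, self-contained sketch of such a test. The binning strategy (one bin per count value, with the upper tail folded into the top bin) and the hard-coded \(\alpha = 0.05\) critical values are illustrative choices, and the counts below are invented:

```python
import math

def zip_pmf(k, p, lam):
    """ZIP PMF from equation (1)."""
    pois = math.exp(-lam) * lam ** k / math.factorial(k)
    return p + (1 - p) * pois if k == 0 else (1 - p) * pois

def chi_square_gof(counts, p, lam):
    """Chi-square goodness-of-fit of per-quadrat counts against ZIP(p, lam).

    Bins are the count values 0..max(counts); the expected frequency of
    the top bin absorbs the upper tail, so expected frequencies sum to n."""
    n = len(counts)
    k_max = max(counts)
    observed = [counts.count(k) for k in range(k_max + 1)]
    expected = [n * zip_pmf(k, p, lam) for k in range(k_max)]
    expected.append(n * (1 - sum(zip_pmf(k, p, lam) for k in range(k_max))))
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # degrees of freedom: bins minus estimated parameters (p, lam) minus 1
    df = max((k_max + 1) - 2 - 1, 1)
    # upper-tail critical values at alpha = 0.05; a real implementation
    # would compute this from the chi-square survival function
    critical = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}
    reject = chi2 > critical[min(df, 5)]
    return chi2, df, reject

counts = [0] * 30 + [1, 2, 2, 3, 3, 3, 4, 4, 5, 4]
chi2, df, reject = chi_square_gof(counts, p=0.72, lam=3.2)
print(chi2, df, reject)
```

A common practical refinement is to merge bins whose expected frequency falls below about 5, which this sketch skips for brevity.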


From the histograms, the observed frequencies are pretty close to the expected frequencies, and in a good number of cases we fail to reject the null hypothesis that the counts follow a ZIP distribution. For the purposes of moving forward, let’s call this satisfactory. Note: the rest of this post performs analysis on all of the sites, not just the sites with a satisfactory goodness-of-fit, mostly because this is just a research endeavor.

Exact Hypothesis Test (E-Test)

We’ll now compare the earliest and latest recorded years of observed data for each kelp forest site. If we can determine a statistically significant difference between the two distributions, we can conclude that there has been a change in the rate of kelp observations between the earliest and latest years at a site. This can have important implications for decision making in wildlife monitoring: if we’re highly confident there has been a decrease in kelp, that may indicate a priority for wildlife management organizations. The Exact Hypothesis Test (E-test) is useful for this purpose. In most implementations on the internet, a Poisson E-test compares the two populations’ Poisson rate parameters. In the code below, I reference the source code for the Poisson E-test in the Python statsmodels library, and hack in some functionality for the zero-inflated PMF.
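The statsmodels modification isn't reproduced here, but the core idea of comparing two Poisson rates can be sketched with a simpler normal (score) approximation, which is what the exact E-test refines by summing exact Poisson probabilities. The per-quadrat counts below are invented:

```python
import math

def poisson_rate_z_test(counts_a, counts_b):
    """Two-sample comparison of Poisson rates via a normal (score)
    approximation, a simpler stand-in for the exact E-test.
    counts_a and counts_b are per-quadrat counts for the two years."""
    x_a, n_a = sum(counts_a), len(counts_a)
    x_b, n_b = sum(counts_b), len(counts_b)
    # pooled rate estimate under H0: the two rates are equal
    lam_pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(lam_pooled * (1 / n_a + 1 / n_b))
    z = (x_a / n_a - x_b / n_b) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

early = [4, 6, 3, 5, 7, 4]  # hypothetical earliest-year quadrat counts
late = [1, 0, 2, 0, 1, 1]   # hypothetical latest-year quadrat counts
z, p = poisson_rate_z_test(early, late)
print(z, p)
```

With a drop this large the approximation rejects easily; near the boundary, the exact E-test is the safer choice, which is why the analysis uses it.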


Observations

Visually, we can see that in the plots where the null hypothesis is rejected, the orange histograms tend to have a more positive skew than the blue. This indicates that the concentration of frequencies around zero increased in the later year, which is consistent with the result of the E-test: the rate parameters fit to the later years are significantly smaller. For sites that reject the null hypothesis, there seems to be a good case that the site's kelp population has declined.

Discussion

One thing I noticed while running these experiments is that certain settings can produce widely different results, even flipping whether a hypothesis test passes or fails. I think the takeaway is that you need good reasons for your testing approach, and it may be useful to do some ablation studies or analysis within the experiment’s parameter bounds. The idea is that rather than having just a single test statistic, you have a range of statistics, and can analyze what happens as you traverse the experiment’s parameter space.

Another important thing to mention is that this analysis assumes independence of events, meaning the model assumes that one kelp observation is independent of the next. This is probably a little specious. There is certainly room for further work analyzing the competition between organisms, patterns of kelp reproduction, the makeup of the ocean floor, and whatever other forces encourage density or sparsity among kelp. That said, if the model calls all of that a wash and still has predictive power, it could still be a useful model.

For future experiments, I think this analysis would benefit from a modeling approach that deals with the large outlier frequencies. Why do some quadrats have a much higher number of kelp than the average? There is likely some explanation to be found by investigating the data collection methods, but an interesting approach could be a modification to the Poisson that allows for some of those high counts, maybe a “Tail-Inflated Poisson”?

Thanks for reading 🤙!

Acknowledgements

Gratitude for the people at KEEN who have started and maintained their project, and Austin Rochford for sharing an implementation of a ZIP Likelihood model.

References