Over the last few months, I have been recording leaves, or the pins left standing after the first ball is thrown. In all, I have recorded 774 leaves, many of which are repeated. In fact, of these, only 155 are unique. Every week of our league, I record approximately another 70 leaves. Early on in the process, many new spare combinations were recorded. As the database grew, however, unique spares became increasingly rare. Now, I will record maybe 1-5 new leaves for every 70 added to the database.

Given enough time and a large enough sample, I could in theory observe all possible spare combinations, but to be honest, I don't really want to. This would require an absolutely enormous sample, and I don't want to keep doing this for the rest of my life. Instead, it is possible to use observed trends to estimate the number of possible spares with some degree of precision.

First, I'll give you the answer. If you want to know how I solved this problem, I explain below. In short, my best estimate is that 283 different spare combinations can exist, with a plausible range of 269 to 297. There are 1,023 spare combinations in theory (every non-empty subset of the ten pins, or 2^10 - 1), so only a little over 1/4 of them can exist in reality.

To answer this question, I begin with a symmetry assumption:

*If a particular leave has been observed to be possible, then its mirror image is also possible*. For example, if a 9-10 leave has been observed, then a 7-8 leave is also assumed to be possible even if it has not been observed. So, the first step in the process is to create a mirror image of every observed leave and add it to the dataset. This doubles the sample from 774 to 1,548 leaves, of which 187 are unique.
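A minimal sketch of the mirroring step in Python, assuming the standard ten-pin numbering (1 at the front, 7 through 10 across the back row). The names `MIRROR`, `mirror`, and `add_mirrors` are made up for illustration, and the three-leave sample is hypothetical, not league data.

```python
# Standard ten-pin numbering, viewed from the bowler:
#   7 8 9 10
#    4 5 6
#     2 3
#      1
# Reflecting left-to-right swaps these pins; 1 and 5 sit on the centre line.
MIRROR = {1: 1, 2: 3, 3: 2, 4: 6, 5: 5, 6: 4, 7: 10, 8: 9, 9: 8, 10: 7}


def mirror(leave):
    """Return the mirror image of a leave, given as a collection of pin numbers."""
    return frozenset(MIRROR[pin] for pin in leave)


def add_mirrors(leaves):
    """Append the mirror image of every recorded leave, doubling the sample."""
    leaves = [frozenset(leave) for leave in leaves]
    return leaves + [mirror(leave) for leave in leaves]


# Hypothetical three-frame sample, not real league data.
sample = [{9, 10}, {5}, {9, 10}]
augmented = add_mirrors(sample)
print(len(augmented), len(set(augmented)))  # sample size, unique leaves
```

Note that a leave that is its own mirror image, such as a lone 5 pin, is still counted twice here. That matches a sample that exactly doubles, as 774 to 1,548 does.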

There are some interesting patterns in this dataset. The figures above show the frequency distribution of observed spares relative to the number of pins left standing. The greatest diversity of observed spare combinations is for three pin leaves, of which I have observed 48. I have observed 45 combinations with four pins remaining and 39 with two. This differs dramatically from the theoretical leaves. In theory, five pin leaves have the greatest diversity with 252 possible combinations, but in actuality, I have only observed 28 different five pin combinations. When viewed as the ratio of observed to possible leaves, there is a very clear pattern. All 10 possible single pin leaves have been observed. Of the 45 possible two pin leaves, 39 (86.7%) have been observed. This ratio slowly declines to seven pin leaves, of which only 2 of the 120 possible have been observed. The ratio then climbs to ten pin leaves, of which only one is possible (a gutterball), and it has unfortunately been observed. I pulled off this feat twice last week. Ugh.
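The theoretical counts quoted above are binomial coefficients: the number of ways to leave k of the ten pins standing. A short Python check, using only the observed counts stated in the text (pin counts not quoted are left out rather than guessed; the names `observed` and `observed_ratio` are just for illustration):

```python
from math import comb

# Observed counts quoted in the text, keyed by number of pins left standing.
observed = {1: 10, 2: 39, 3: 48, 4: 45, 5: 28, 7: 2, 10: 1}

# Every non-empty subset of ten pins is a theoretical leave: 2**10 - 1.
total_possible = sum(comb(10, k) for k in range(1, 11))


def observed_ratio(k):
    """Fraction of the theoretically possible k-pin leaves that were observed."""
    return observed[k] / comb(10, k)


print(total_possible)
for k in sorted(observed):
    # pins standing, observed, possible, ratio
    print(k, observed[k], comb(10, k), f"{observed_ratio(k):.1%}")
```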

Here's how I solved for the total number possible. There is a clear relationship between sample size and diversity. As shown above, as you observe more leaves, the sample of unique pin combinations grows. This curve is increasing asymptotically, which means that there is a finite number of spare combinations that exist. The curve is approaching that number. At first, it approaches quickly, but as the sample grows, the rate of increase declines. The reason is simple. Early on, you observe a lot of very common leaves and some rare ones. Once you have observed the most common leaves, only rare ones remain. There are likely some pin combinations that are exceedingly rare, combinations that only occur say 1 in every 10,000 frames. To observe these, you need a huge sample size, and I don't have the time or interest in waiting around for them to happen.
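The curve described here is simply a running count of unique leaves. A minimal Python sketch, with a hypothetical list of observations standing in for the real database (`accumulation_curve` is an illustrative name, not the macro used for this post):

```python
def accumulation_curve(leaves):
    """Return the running count of unique leaves after each observation."""
    seen = set()
    curve = []
    for leave in leaves:
        seen.add(frozenset(leave))
        curve.append(len(seen))
    return curve


# Hypothetical observations in recording order, not real league data.
observations = [{10}, {7}, {10}, {3, 10}, {7}, {10}, {6, 10}, {10}]
print(accumulation_curve(observations))  # [1, 2, 2, 3, 3, 3, 4, 4]
```

Common leaves like the 10 pin show up early and then stop adding to the count, which is why the curve flattens as the sample grows.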

So, the solution to the problem is to use regression. I sought an asymptotic function that could be fit to the curve shown above. I found one here. This function describes the behavior of an electrical circuit component called a limiter, but it also seems to describe the relationship between sample size and diversity in bowling spares incredibly well. The fit is quite amazing. Anyway, it is possible to solve for the values of the coefficients that best fit the observed curves, and one of these is the asymptote, or the total number of possible spares. I use a very simple procedure. First, I randomize the list of spares. Then, I fit this function to the curve (like the one shown above) and solve for the asymptote. I repeat this process over and over again. Each time I do this, I get slightly different estimates. Then, I take those estimates and create a 95% confidence interval for the total number of spares that can exist. When I do this, I get a range between 269 and 297 with a mean of 283.
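The shuffle, fit, and repeat procedure can be sketched in Python as follows. Several things here are assumptions, not the method actually used: the limiter function is not given in the text, so a simple saturating hyperbola y = a*n/(b + n) stands in for it; the interval is taken from the 2.5th and 97.5th percentiles of the estimates; and the data are randomly generated stand-ins. All function names are made up for illustration.

```python
import random
import statistics


def accumulation_curve(leaves):
    """Running count of unique leaves after each observation."""
    seen, curve = set(), []
    for leave in leaves:
        seen.add(leave)
        curve.append(len(seen))
    return curve


def fit_asymptote(curve, grid_size=200):
    """Least-squares fit of y = a*n/(b + n) to the curve; return the asymptote a.

    ASSUMPTION: this hyperbola is a stand-in for the limiter function.
    For a fixed b the best a has a closed form, so only b is searched,
    over a geometric grid from 1 to 1000 times the sample size.
    """
    size = len(curve)
    best_a, best_err = None, float("inf")
    for i in range(grid_size):
        b = (1000.0 * size) ** (i / (grid_size - 1))
        f = [n / (b + n) for n in range(1, size + 1)]
        a = sum(y * fi for y, fi in zip(curve, f)) / sum(fi * fi for fi in f)
        err = sum((y - a * fi) ** 2 for y, fi in zip(curve, f))
        if err < best_err:
            best_a, best_err = a, err
    return best_a


def asymptote_interval(leaves, repeats=200, seed=0):
    """Shuffle the observation order, refit, and summarise the asymptotes."""
    rng = random.Random(seed)
    leaves = list(leaves)
    estimates = []
    for _ in range(repeats):
        rng.shuffle(leaves)
        estimates.append(fit_asymptote(accumulation_curve(leaves)))
    estimates.sort()
    # ASSUMPTION: a percentile interval; the post does not say how the
    # 95% interval was computed.
    low = estimates[int(0.025 * (repeats - 1))]
    high = estimates[int(0.975 * (repeats - 1))]
    return low, statistics.mean(estimates), high


# Hypothetical data: 200 draws from 40 made-up leave ids, common ones first.
demo_rng = random.Random(1)
pool = list(range(40))
weights = [1 / (rank + 1) for rank in pool]
demo_leaves = demo_rng.choices(pool, weights=weights, k=200)
low, mean, high = asymptote_interval(demo_leaves, repeats=20)
print(f"{low:.1f} {mean:.1f} {high:.1f}")
```

The grid search is crude but needs nothing beyond the standard library. A proper nonlinear least-squares routine would give a finer estimate of the asymptote.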

The regression model also gives an idea of how long it would take to observe all possible spares. According to this model, if I had a sample of 20,000 observed spares, approximately 263 of them would be unique. If I recorded another 20,000, I would observe only four new unique pin combinations. This is why I am happy to stop here. Maybe somebody more insane than myself can pick up the torch.

great post, however i have one question on your regression, what do you mean by randomizing your list of spares? you have 1548 data points right?

do you mean you add a gaussian term to the 1548 data points? permute the possible number of spares? (which i dont understand since your spares are actual observed right)?

Thanks for your question. I can't believe somebody actually read this. It always makes me happy to know that I am not the only person interested in this stuff.

The randomization comes in prior to performing the regression. I wrote an Excel macro that counts up the number of unique spares as a function of sample size. So, it builds the curve as shown in the 2nd figure above. The exact shape of the curve is somewhat dependent on the order in which each spare was observed. I don't know if the order in which I observed these spares is fairly typical or somewhat unusual.

One way around this problem is to randomize the order in which these spares were observed. So, I just randomly sort the list of spares, build the curve again, and repeat the regression. While the starting and ending points always have the same y-axis values (1 and 187, respectively), the shape of the curve changes.

In essence, I am simulating the observation of the same 774 spares but changing the order in which they were observed.

Does that make sense?