FILMS AND STATISTICS:
Give and Take

 

Question 1: Median or Mean
by Yuri Tsivian


A good place to start is to take stock of the statistical evidence currently in use. Is the average shot length (ASL), the first variable film scholars normally look at, the best way to compare and contrast cutting rates across films?

The question has two sides to it which I want to outline briefly before I hand it over to you four. On the one hand, ASL has worked for generations of film scholars dating back to one day in 1916 when Harvard psychologist Hugo Muensterberg walked into a movie theater, looked at his pocket watch, counted some shots (then called "scenes"), calculated the mean and came up with the following diagnosis:

If the scene changes too often and no movement is carried on without a break, the [photo]play may irritate us by its nervous jerking from place to place. Near the end of the Theda Bara edition of Carmen [1915] the scene changed one hundred and seventy times in ten minutes, an average of a little more than three seconds for each scene. We follow Don José and Carmen and the toreador in ever new phases of the dramatic action and are constantly carried back to Don José's home village where his mother waits for him. There indeed the dramatic tension has an element of nervousness, in contrast to the Geraldine Farrar version of Carmen [1915] which allows a more unbroken development of the single action.
(Hugo Muensterberg, The Photoplay: A Psychological Study (New York, London: D. Appleton and Company, 1916), pp. 45-6.)
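
Muensterberg's arithmetic is easy to retrace. The sketch below is only an illustration: the function name `average_shot_length` is hypothetical, and it assumes that his 170 scene changes can be treated as 170 shots.

```python
def average_shot_length(total_seconds: float, shot_count: int) -> float:
    """Mean shot length: running time divided by the number of shots."""
    if shot_count <= 0:
        raise ValueError("shot_count must be positive")
    return total_seconds / shot_count


# Muensterberg's figures: 170 scene changes in ten minutes.
# Assumption: the 170 scene changes are counted as 170 shots.
carmen_asl = average_shot_length(total_seconds=10 * 60, shot_count=170)
print(f"ASL: {carmen_asl:.2f} seconds")  # ASL: 3.53 seconds
```

The result, roughly 3.5 seconds, agrees with his "a little more than three seconds for each scene."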

On the other hand, a number of modern-day statisticians question the usefulness of the arithmetic mean on the grounds that it is too sensitive to outliers. Their position is neatly summarized in Nick Redfern's latest study "Statistics and the Analysis of Film Style" (though the study is as yet unpublished, I have Nick's kind permission to use parts of it in this conversation):

The most commonly cited statistic is the 'average shot length' (ASL). The ASL is typically described as 'the length of the film in seconds (feet) divided by the number of shots in it.' Clearly, in this context the 'average' referred to is the mean. Unfortunately, a worse choice of a statistic to describe film style could not have been made: the mean is not an appropriate measure of central tendency for a skewed data set with a number of outliers, and these are precisely the characteristics of the distribution of shot lengths. The mean shot length is used to compare how quickly two films are edited, or for comparing the cutting rate of groups of films. However, the fact the mean is not robust to deviations from normality means that these conclusions are not valid and the conclusion of the researcher will clearly be flawed.

Drawing on the opinion of statisticians whose methods have been shaped by working with non-film data, Redfern suggests that film studies shift its focus from mean to median values:

The median shot length provides a simple robust alternative to the mean because it will locate the centre of any distribution irrespective of its shape and is not affected by the presence of outliers in the data. The mean is affected by outliers, being pulled away from the mass of the data in the direction of the outliers; whereas the median is not affected by the presence of outliers and is, therefore, resistant to their influence. Outliers occur in shot length distributions as shots that are exceptionally long relative to the rest of the shots in a film; and it is for this reason those researchers outside film studies used the median shot length to describe film style and not the mean shot length.
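
To make the contrast concrete, here is a minimal sketch using Python's standard `statistics` module. The shot lengths are invented for illustration and do not come from any real film; the variable names are likewise hypothetical.

```python
from statistics import mean, median

# Hypothetical shot lengths in seconds (invented, not measured from a film).
shots = [2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0]

# The same shots plus one exceptionally long take.
long_take = 90.0
with_long_take = shots + [long_take]

print(mean(shots), median(shots))                    # 4.0 4.0
print(mean(with_long_take), median(with_long_take))  # 14.75 4.25
```

In this toy case a single 90-second take more than triples the mean, while the median moves only from 4.0 to 4.25 seconds. The median is resistant to the long take rather than literally untouched by it.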

Even though Cinemetrics as a website displays both ASL and MSL (median shot length) for each submission, the question Nick Redfern raises is relevant for cinemetrics as a field. Even those who see no good reason for film studies to wipe the slate clean and start from scratch may want to take Nick's warning as an occasion to revisit our statistical toolkit and to get a better sense of the nature of film-related data. What are outliers in our particular case? What price do we pay for ignoring outliers, and what price for heeding them too much?

Recently, Barry Salt gave some thought to this in his 2011 study "The Metrics in Cinemetrics". Let me quote what Barry says about unexpectedly long shots in films like Ride Lonesome (1959):

These long take shots could reasonably be referred to as "outliers" in this particular case, but to disregard their existence in an investigation is to shut your eyes to the very thing that makes this film special.

And about ASL in general:

In any case, the mean exists as a basic characteristic of any distribution, and the ASL has been adopted by many other people as a standard measure for film statistics since I introduced it 35 years ago. This is partly because it is easy to get. You just have to know how many shots there are in a film, and the film's length, to work it out. That is how I come to have a database of nearly 10,000 ASLs from complete films, which is very useful for stylistic comparisons. I consciously chose to call it the Average Shot Length, rather than the Mean Shot Length, because I reckoned that a smaller number of the many rather innumerate people in film studies would be put off by the former name. You can only get the median by listing all the shot lengths in a film, as in Cinemetrics. It is worth remarking that if you only consider the median shot length, you can be seriously misled about the distribution of shot lengths in the film you are considering. For instance, both The Lights of New York (1928) and The New World (2005) have median shot lengths of 5.1 seconds, so on this ground alone you might think they have similar distributions, but when you look at their other features it turns out they are very different.
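
Salt's last point, that equal medians can hide very different distributions, can be sketched as follows. The two lists are invented for illustration; they are not the shot lengths of The Lights of New York or The New World, and `summarize` is a hypothetical helper.

```python
from statistics import mean, median

# Two hypothetical films (invented data) sharing a median of 5.1 seconds.
film_a = [4.0, 4.8, 5.1, 5.5, 6.0]    # shots clustered around the median
film_b = [1.0, 2.0, 5.1, 20.0, 60.0]  # very short shots mixed with long takes


def summarize(shot_lengths):
    """Return a few basic descriptive statistics for a list of shot lengths."""
    return {
        "median": median(shot_lengths),
        "mean": mean(shot_lengths),
        "longest": max(shot_lengths),
    }


for name, shot_lengths in (("film_a", film_a), ("film_b", film_b)):
    print(name, summarize(shot_lengths))
```

Both lists report the same median, yet their means and longest shots differ widely, which is why neither statistic alone describes a distribution.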

So: the median shot length, the average shot length, or maybe both? And is there a line beyond which film studies should not ignore the presence of outliers, unless of course they were caused by measurement errors? If my summary sounds accurate enough, let me submit these questions for your consideration.