Computational Modeling of Music Perception

Introduction: An Inevitably Inter- or Multi-Disciplinary Approach

Early research with a skeptical attitude toward the computer modeling of music perception by Desain et al. (1998) tries to explain why modeling is usually insufficient to describe a matter as complex as music. The authors note that inter- or multi-disciplinary work can link various topics in one study, and that at the intersection of the latest research from these fields, issues can be explained as they could not have been by a monodisciplinary approach. The authors suggest that modeling musical knowledge and modeling music cognition could be complementary but need not be, which I cannot entirely agree with, as I believe these two domains should always cooperate to provide meaningful and testable results. An extreme example would be a musicologist trying to model rhythm and beat perception without knowing the role of the basal ganglia in how people perceive rhythm and beat. In this example, if some subjects happened to have Parkinson’s disease, which impairs the basal ganglia and thereby beat induction, any results yielded from the study would not generalize to the music cognition of the population at large. An opposite example relates to “historical listening” in Pearce and Eerola (2017), where cognitive aspects of music listening experiences, such as the impact of polyphony on timbre, voice-leading principles, and modality or tonality, are ignored in order to create a model for predicting the next musical note or interval. In this example, theoretical and stylistic musical knowledge is somewhat neglected, and the models might suffer from over-fitting. These two examples show why multiple disciplines need to work together: so that the core findings of each field are not ignored, and the conclusions of multi- or interdisciplinary work remain inclusive of all of them.
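The kind of next-note prediction mentioned above is, at its core, statistical sequence modeling. Pearce and Eerola’s actual models are far more sophisticated, but the basic idea can be sketched as a first-order Markov (bigram) predictor; the function names and the toy corpus below are my own illustration, not from the paper:

```python
from collections import Counter, defaultdict

def train_bigram(melodies):
    """Count how often each pitch follows each preceding pitch."""
    counts = defaultdict(Counter)
    for melody in melodies:
        for prev, nxt in zip(melody, melody[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Return the most frequent continuation of `prev`, or None if unseen."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]

# Toy corpus of melodies as MIDI pitch sequences.
corpus = [[60, 62, 64, 65, 64, 62, 60],
          [60, 62, 64, 62, 60],
          [62, 64, 65, 64]]
model = train_bigram(corpus)
print(predict_next(model, 62))  # 64 (follows 62 three times, versus 60 twice)
```

Such a model also illustrates the over-fitting worry raised above: it reproduces the statistics of its corpus but knows nothing about voice leading, tonality, or style.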

Musical structures such as pitch or rhythm can provide the basis for cross-disciplinary work. Desain et al. (1998) mention that the pioneering work combining music theory and computer science can be improved by adding psychology and neuroscience to the convergence of these domains. Experimental psychology, whether formal or informal, can investigate music theory, and researchers across these domains can model music perception with computer programs that provide the foundation for further neuroscientific research. Although many computer models of music perception and cognition exist, the psychological validation of these models lacks detail. The authors criticize models of beat induction and attribute the lack of cross-model comparisons to different initial questions or disciplinary starting points (p. 153). A similar case occurs in Pearce and Eerola’s (2017) work, where the researchers’ initial concern is to fit a model created in a previously published study to a significantly different musical context, even though this approach requires eliminating critical variables related to musical experience. These observations suggest how difficult it is to arrive at useful and valid results when multiple fields are involved, especially when each of these complex fields possesses significant limitations of measurement and modeling on its own.

Music offers excellent potential for multi-disciplinary work, but music itself is not easy to model. Although many modules of music might be placed in the same category at first sight, they are usually not compatible with each other. For example, Desain et al. (1998) give grouping and meter as examples of rhythmic structure and place them under one category, even though these aspects of music differ significantly. Purwins et al. (2008), by contrast, divide rhythm and grouping into different categories in their initial model, which is a slightly better way of thinking. The accumulation of such categorization errors in a study inevitably grounded in multiple fields can be a hurdle to drawing justifiable conclusions.

Fundamental Problems of Cognitive Science

There are some fundamental problems in cognitive research generally, not specific to music research; as Desain et al. (1998) state on page 154: “The large variety of incompatible models within music cognition, and the unclear link between them are manifestations of a real problem in cognitive science itself.” One of the reasons for this fundamental problem is the elimination of variables, or isolation. For instance, earlier music cognition research was primarily interested in sine waves, yet almost no music consists of a single sine wave; it is instead a sum of many different waves. Such an oversimplified treatment of a complex matter might yield some insights, but usually it does not. This fundamental problem can be related to a philosophical argument similar to the Chinese Room Argument (1980) by John Searle, a thought experiment in which a person imagines themselves in a room following a computer program for responding to Chinese characters slipped under the door. This person understands no Chinese; yet by following the program for manipulating the symbols and numerals just as a computer does, the person sends appropriate strings of Chinese characters back out under the door, leading those outside to mistakenly suppose there is a Chinese speaker in the room. The argument suggests that a programmed interface may appear to understand an issue without producing real comprehension: a model can simulate comprehension instead of creating understanding.
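The Chinese Room intuition can be caricatured in a few lines of code: a lookup table produces appropriate replies without anything we would call understanding. The rule book entries below are invented for illustration:

```python
# A "room" that maps incoming strings to canned replies.
# It answers appropriately without understanding either language.
RULE_BOOK = {
    "ni hao": "ni hao",        # "hello" -> "hello"
    "ni hao ma": "wo hen hao", # "how are you" -> "I am fine"
}

def room_reply(message: str) -> str:
    """Follow the rule book; fall back to a stock apology if unseen."""
    return RULE_BOOK.get(message, "dui bu qi")

print(room_reply("ni hao ma"))  # "wo hen hao": fluent output, zero comprehension
```

The point carries over to cognitive modeling: a model that maps inputs to plausible outputs demonstrates only that the mapping works, not that comprehension has been captured.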

It is possible to observe an extension of this matter when new data sets are introduced to a model that is accurate on a specific data set. The authors explain this issue on page 156: “Even assuming that such a reasonable mapping exists between input and output of the model and output of the human subject, and assuming that the model builder has succeeded in creating an algorithm that exhibits behavior that agrees with the empirical data, we still cannot say much about the architecture of human cognition.” Just because a model can explain a limited study does not mean that the same model applies more broadly, either to new data or to the fundamentals of human cognition.
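A toy sketch makes the gap concrete: a “model” that simply memorizes its training data fits that data perfectly yet reveals nothing about the rule behind it. The names and the data here are invented for illustration:

```python
def memorize(pairs):
    """A 'model' that merely stores every training example verbatim."""
    return dict(pairs)

def accuracy(model, pairs):
    """Fraction of (x, y) pairs the model answers correctly."""
    hits = sum(1 for x, y in pairs if model.get(x) == y)
    return hits / len(pairs)

# Underlying rule (unknown to the model): y = x squared.
train = [(x, x * x) for x in range(5)]
test = [(x, x * x) for x in range(5, 10)]

model = memorize(train)
print(accuracy(model, train))  # 1.0: perfect fit on the data it has seen
print(accuracy(model, test))   # 0.0: no grasp of the underlying rule
```

Real cognitive models fail less starkly, but the principle is the same: agreement with one empirical sample licenses few conclusions about the architecture that generated it.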

One Size Does Not Fit All: A Case Study on Model Selection

Honing (2006) discusses two different models (kinematic and perceptual) under explicit model selection criteria for modeling the ritardandi in music performance, an appropriate approach for exploring critical aspects of the computational modeling of music cognition. The kinematic approach focuses on commonality in timing patterns and how they relate to the laws of physical motion. In contrast, the perceptual approach predicts the degree of expressive freedom a performer has in interpreting a rhythmic fragment before the listener misinterprets it as a different rhythm (p. 365). The kinematic model, or K model, is good at modeling familiar movements and can therefore better predict the performance’s ending. This model is derived from deceleration in motion, using velocity, time, and initial velocity; these variables are adapted to musical notation accordingly and normalized when necessary. The P model uses two fundamentally different components: perceived regularity, or the tempo tracker, and rhythmic categorization, or the quantizer. The tempo tracker uses an adaptive oscillator and can predict period and phase separately at any given time. The quantizer, the second component, takes the timing pattern after it has been interpreted by the tempo tracker and outputs a prediction of perceived duration as a range. Approaching the same issue with different models is highly beneficial and might be crucial to a better understanding of music cognition, as the disadvantages or weak aspects of various models can be minimized in a complementary manner.
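To give a flavor of the kinematic side, one commonly cited form of this family of models (following Friberg and Sundberg’s constant-braking-force idea, on which Honing’s K model builds; my paraphrase, not Honing’s exact notation) expresses normalized tempo as a function of normalized score position:

```python
def kinematic_tempo(x: float, w: float, q: float = 2.0) -> float:
    """Normalized tempo v(x) at normalized score position x in [0, 1].

    w is the final tempo as a fraction of the pre-ritard tempo, and q
    shapes the curvature; q = 2 corresponds to constant braking force.
    """
    return (1 + (w**q - 1) * x) ** (1 / q)

# Tempo decreases smoothly from 1.0 down to w = 0.5 across the ritard.
print(round(kinematic_tempo(0.0, 0.5), 2))  # 1.0
print(round(kinematic_tempo(1.0, 0.5), 2))  # 0.5
```

Note what the P model’s adaptive oscillator would add and this curve cannot: sensitivity to the unfolding performance itself rather than to the notated score alone.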

As a general comparison, the P model is slightly more complex while being more realistic. One critical flaw of the K model is that it is notation-dependent: it takes the musical notation as input, whereas the P model takes a performance or audio file. This is a frequent problem in the computational modeling of music cognition: when music is treated as if it exists only in notation, the impact of style, performer, interpretation, instrumentation, orchestration, and timbre is eliminated. Another possible problem is the difference in the models’ outputs, as the K model predicts the ending moment while the P model predicts a tempo range for the ending. The author suggests that these can be “easily” interpreted as a tempo prediction; however, one should remember that the inputs of the two models also differ. Another way to put it: the K model asks, “When will any performer end a musical piece, based on the musical notation?”, while the P model asks, “How might a specific performer finish a specific musical piece, based on the performer’s ongoing performance of it?”. In my view, these models are answering different questions and should not be interpreted as equivalent, a common pitfall of confusing musical notation with the performance of the piece. A more grounded comparison might be possible if the K model were kept unchanged while the P model also relied on the notation, or if the P model were kept the same while the K model also depended on the performance or audio file.

Comparing different models with different selection criteria is an essential step towards post-positivist research, as any model might inevitably carry biases at any given stage of research. Honing (2006) mentions that only ritardandi with smooth shapes were selected from Sundberg and Verrillo (1980), a set of twenty harpsichord performances of compositions by J. S. Bach used as the data set for the comparison. Using a data set of ritardandi with smooth shapes might create a model that is poor at predicting ritardandi with nonsmooth shapes.

Honing (2006) uses goodness-of-fit, model simplicity, and the degree of surprise in the predictions of the K and P models as selection criteria. Goodness-of-fit measures the precision with which a model fits a particular sample of observed data; on this criterion, the author cannot conclude that one model fits the empirical data better than the other. Model simplicity weighs a model’s complexity against its degree of success in making a good fit. The P model, despite its more complicated architecture, shows less flexibility: it can accommodate a narrower range of data than the K model. The degree of surprise is measured by whether a model makes a limited range of predictions and can predict rough or uneven shapes. The P model gives better predictions than the K model on this criterion, as shown in Figure 1. Based on these findings, we can conclude that different models have different strengths and weaknesses, and when these differences are addressed precisely, they might help build composite models for further research.
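Goodness-of-fit can be quantified in several ways; root-mean-square error between predicted and observed tempo curves is one common choice (Honing’s specific statistic may differ; the data below are hypothetical, for illustration only):

```python
import math

def rmse(predicted, observed):
    """Root-mean-square error between predicted and observed tempo values."""
    assert len(predicted) == len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed))
                     / len(observed))

# Hypothetical normalized tempo measurements from one performance...
observed = [1.00, 0.92, 0.81, 0.67, 0.50]
# ...and two hypothetical models' predictions at the same score positions.
model_a = [1.00, 0.90, 0.79, 0.66, 0.50]
model_b = [1.00, 0.95, 0.88, 0.75, 0.55]

print(rmse(model_a, observed) < rmse(model_b, observed))  # True
```

Fit alone is not decisive, which is exactly why Honing pairs it with simplicity and surprise: a highly flexible model can achieve a low error on almost any sample.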

Fig. 1. Schematic comparison diagram of K model (top) and P model (bottom), based on possible, plausible, and predicted final ritards from Honing (2006).

Final Remarks

This literature review focuses on the fundamental concepts of the computer modeling of music perception in order to address core aspects of this inter- or multi-disciplinary field. The paper explains why and how different areas such as musicology, computer science, psychology, and neuroscience should work together in a complementary manner. The difficulties of such combinations are also tackled by noting how the differing research focuses of scholars from multiple fields might yield unclear results, an issue illustrated by the example of accumulating categorization errors across areas. The fundamental problems of computer science and cognitive science are described to further illustrate possible initial issues for research on the computer modeling of music perception. For example, a model might generate an understanding of the specific data set on which it was trained yet be quite inefficient when a new data set is introduced. These structural viewpoints are followed by an emphasis on the model selection criteria applied to different approaches for modeling ritardandi. The importance of addressing the same issue with different models is highlighted, and the contribution towards a post-positivist research attitude is brought up. Finally, some possible flaws of these different modeling approaches are briefly mentioned, and the significance of combining the solid aspects of different approaches in further studies is accentuated.


Cole, David (2020). “The Chinese Room Argument”. The Stanford Encyclopedia of Philosophy (Winter 2020 Edition), Edward N. Zalta (ed.).

Desain, P., Honing, H., Vanthienen, H., & Windsor, L. (1998). Computational Modeling of Music Cognition: Problem or Solution? Music Perception: An Interdisciplinary Journal, 16(1), 151-166. doi:10.2307/40285783.

Honing, H. (2006). Computational Modeling of Music Cognition: A Case Study on Model Selection. Music Perception, 23(5), 365–376. doi:10.1525/mp.2006.23.5.365.

Pearce, M., & Eerola, T. (2017). Music perception in historical audiences: Towards predictive models of music perception in historical audiences. Journal of Interdisciplinary Music Studies, 8(1-2), 91-120.

Purwins, Hendrik & Herrera, Perfecto & Grachten, Maarten & Hazan, Amaury & Marxer, Ricard & Serra, Xavier. (2008). Computational models of music perception and cognition I: The perceptual and cognitive processing chain. Physics of Life Reviews. 5. 151-168. doi: 10.1016/j.plrev.2008.03.004.

Purwins, Hendrik & Grachten, Maarten & Herrera, Perfecto & Hazan, Amaury & Marxer, Ricard & Serra, Xavier. (2008). Computational models of music perception and cognition II: Domain-specific music processing. Physics of Life Reviews. 5. 169-182. doi: 10.1016/j.plrev.2008.03.005.

Responses to Pearce and Eerola (2017): Modeling historical audiences: What can be inferred? Journal of Interdisciplinary Music Studies, 8(1-2), 132-140.

Thanks for the supervision of Dr. Jane Ellen Harrison at İstanbul Technical University, Center for Advanced Studies in Music. I wrote this final paper for MYL 5032E Psychology and Neuroscience of Music course during the 2021 Spring semester.