Similarity, Data Compression and a Dead Composer


  • Jetse Koopmans, University of Amsterdam
  • Daan van den Berg, FNWI
  • Vadim Zaytsev, FNWI


Domenico Scarlatti (1685-1759) is well known for his 555 keyboard sonatas. Although his work is greatly revered by many professional musicians, some claim that it does not show any compositional development. In this paper, his sonatas are clustered by normalized compression distance (NCD), an algorithmic similarity metric that uses no musical background knowledge. NCD is rooted in Kolmogorov complexity (KC) and captures the similarity between any two sonatas in a single number. The results show clusters of similar sonatas and suggest that Scarlatti’s work does show compositional development, and even contains ‘milestone sonatas’ marking changes in artistic style during his lifetime.
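
To make the metric concrete: NCD is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed length of s and xy is the concatenation of x and y. The following is a minimal sketch of that formula, not the authors' implementation. It assumes zlib as the compressor and raw bytes as the input representation; the abstract does not state which compressor or file format was used, and the function names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input.

    Assumption: zlib stands in for the compressor C; the paper's
    actual compressor is not named in the abstract.
    """
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their structure;
    values near 1 suggest they share very little.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(items: list[bytes]) -> list[list[float]]:
    """Pairwise NCD values, the usual input to a clustering step."""
    n = len(items)
    return [[ncd(items[i], items[j]) for j in range(n)] for i in range(n)]
```

In practice each item would be the byte content of one encoded sonata, and the resulting matrix would be handed to a clustering method of choice. Because a real compressor only approximates Kolmogorov complexity, the values are approximate: the distance of a file to itself is typically small but not exactly zero.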






How to Cite

Koopmans, J., van den Berg, D., & Zaytsev, V. (2015). Similarity, Data Compression and a Dead Composer. Student Undergraduate Research E-Journal!, 1. Retrieved from



Economics & Social Sciences