ISSN 1989-1938
Revista de pensament musical en V.O.

The PHENICX project at ESMUC



Modern digital multimedia and internet technology have radically changed the ways people experience entertainment and discover new interests online, seemingly without any physical or social barriers. Such new access paradigms are in sharp contrast with the traditional means of entertainment. An illustrative example is the live music concert, which is still attended largely by dedicated audiences.

The PHENICX project aims to bridge the gap between the online and offline entertainment worlds. It will make use of the state-of-the-art digital multimedia and internet technology to make the traditional concert experiences rich and universally accessible: concerts will become multimodal, multi-perspective and multilayer digital artifacts that can be easily explored, customized, personalized, (re)enjoyed and shared among the users. The main goal is twofold: (a) to make live concerts appealing to potential new audience and (b) to maximize the quality of concert experience for everyone.

The scientific objectives of PHENICX are (i) to generate a reliable and effective set of techniques for multimodal enrichment of live music concert recordings, suitable for implementation and deployment in real-world situations, and (ii) to find ways to offer the resulting multi-faceted digital artifacts as engaging digital experiences to a wide range of users. The project will establish a methodological framework to map these scientific objectives onto a solid implementation platform that will be progressively developed and tested in real-life use settings.

PHENICX will mainly focus on classical music. However, findings from PHENICX will be relevant to live concert situations in any genre. With an innovative technology partner, as well as two authoritative professional music stakeholders in the consortium, the project has strong immediate impact and dissemination potential.

PHENICX (Performances as Highly Enriched and Interactive Concert Experiences) is a three-year research project, started in February 2013, within the EU’s Seventh Framework Programme for Research (FP7). This paper describes the role of the Escola Superior de Música de Catalunya (ESMUC) in the project.


Concept and objectives

Motivation: the outlook for music as a live performing art

Traditionally, classical music concerts are given in enclosed physical spaces in which musicians perform for a privileged audience that has bought tickets in advance. In this context, the concert is a social event that creates a small community in a physical place for a limited period of time. As a consequence, it is difficult for people interested in exploring live performances of unfamiliar music genres to enter this community, and they remain ‘outsiders’ to the music and its entourage.

In recent decades, access to recorded music has become easier for music lovers. Nevertheless, the situation is skewed: even though new music is easier to reach through physical CDs, online digital stores and streaming services, it is still fairly unusual to buy a ticket for a concert of an unfamiliar music genre. There are also personal reasons (e.g. cost, distance to the concert hall) and social reasons (e.g. finding a concert partner, generational gap) that make audiences less inclined to spend money on these live performances.

In this context, the musical concert performance may remain confined. Although this isolation and exclusiveness may be a valuable part of the social aspect of the concert itself, it is our mission to find new ways of keeping existing audiences engaged while also reaching new potential audience groups and, why not, to use technology to do so.

Present-day technologies should be able to offer many more possibilities than that. As soon as we get connected to the Web, we do not just have access to recordings: a wealth of supporting information is at our fingertips, ranging from artist information to scores and lead sheets. Apart from these “obvious” and general supporting sources, there are other digitized sources of information that can help us deepen our understanding and experience of the music that is being played. For example, if we hear an interesting sound coming out of an instrument we do not know well, we can look up a Wikipedia article on that instrument, and then find videos of other people playing that same instrument.

As mentioned above, current technologies offer many possibilities to music lovers. The Internet provides direct access to recordings and supporting information (scores, reviews of specific past performances, etc.). But technologies should be able to provide other services as well. For instance, listeners could access information linked to a specific performance and tailored to their profile (professional performer, music lover, enthusiast, professional from the editing industry, etc.), and share the experience over social media.

These possibilities have been studied ([Knight Foundation, 2002], [Wolf, 2006], [Rizkallah, 2009]) but not really implemented. We at the PHENICX consortium are convinced that they are true opportunities to preserve the traditions of live music performance in a way that speaks to present-day audiences. Research advances in the sound and music computing field provide the tools for enriching live music performances.


PHENICX: towards a digital rebirth of the music concert performance

The focus of the PHENICX project is to propose a methodological and technical framework to transform live music concert performances. It focuses on converting concerts into a multimodal, multi-perspective and multilayer experience that the user can enjoy before, during and after the performance.

As specified in the project proposal, there are two main scientific and technological objectives to be addressed:

  1. The first objective deals with transforming live music concert performances into enriched multimodal, multi-perspective and multilayer digital artifacts. For this, state-of-the-art automated techniques need to be adapted to the given domain and improved so that they meet the practical demands of the objective.
  2. The second objective deals with presenting the multimodal, multi-perspective and multilayer digital music performance artifacts as true digital experiences that can be explored, (re)enjoyed and shared in many customizable and personalized ways.


Consortium and the role of ESMUC

The PHENICX consortium is composed of seven partners from three different European countries. They include research groups from universities, cultural institutions and the multimedia industry. Partners have different functions according to their expertise:

  • Research providers: Research centers and universities with expertise in technology and multimedia research.
  • Technology integration: Industrial company that gathers the results from all the other partners and builds prototypes.
  • Content provider and end users: Cultural institutions that provide content and access to real users, supplying data and feedback to the research partners.

Figure 1 shows the partner composition of the PHENICX project and their geographical distribution.

Figure 1: Composition of the PHENICX project and their geographical distribution.


In this context, the role of the ESMUC is twofold. First, the ESMUC provides content to the research and industrial partners in the form of multimodal recordings of its concerts. These recordings are designed according to research requirements, using microphones at specific instrument locations and capturing the gestures of the orchestra conductor. This is the main difference with respect to the other content provider, the Royal Concertgebouw Orchestra, which can offer professional recordings of its performances but made with traditional methodologies. In this sense, the two partners complement each other. Second, the ESMUC is also involved in research activity: it is responsible for creating technologies that can explain expressivity (in terms of loudness, tempo and articulation) using and improving state-of-the-art motion capture (MOCAP) technologies. These two contributions are explained in detail in the following sections.


Concert recordings


As mentioned above, the ESMUC provides content to the research and industrial partners through multimodal recordings. In the context of PHENICX, multimodal recordings include data of different kinds, namely audio, video and MOCAP streams. The technical requirements for these recordings are in line with some of the research lines proposed by the PHENICX project.

First, our partners at the UPF focus on source separation research. Source separation techniques aim to automatically extract individual audio sources from a mix. The state of the art in source separation, although far from fully operative, can already be applied to areas such as noise reduction for speech, sound event detection and music/speech separation. In recent years, music-specific processing has provided novel tools for audio restoration, karaoke and music remixing. Technologically speaking, current techniques either employ limited information, e.g. the melody from a MIDI score [Bosch et al., 2012], or consider a single instrument, such as separating the left and right hands in a piano recording [Ewert, 2012]. In the PHENICX context, the techniques used are specialized for orchestra recordings, in which a complex signal comprises a large number of simultaneous instruments and the score is available.

On the other hand, the ESMUC is also active in research on impersonation techniques. The goal here is to allow offline spectators to recreate the experience of performing the concert as if they were musicians. Using the audio data from the live performance and its score, the PHENICX prototype allows the spectator to control and transform the audio generated by a particular instrument of the ensemble. In other words, the user must imitate the movements of the performer in a credible way without the instrument (currently, this is only implemented for the conductor). This is possible by using a specific camera that records the movements of the user and extracts their key elements. These elements are used to control high-level properties of variations of the instrument score, which are synthesized in real time while being followed by the rest of the orchestra.

In the following sections, we summarize the recorded corpora that required special attention for the above specified research purposes.


Summary of recorded corpora

ESMUC Symphonic Orchestra

The aim of this recording was to provide the partners with a full set of multimodal data from a full orchestra. This corpus is built with the recordings of the concert given by the ESMUC Symphonic Orchestra on June 21st, 2013, at L’Auditori, Barcelona, conducted by George Pehlivanian.

The program for this concert focused on Wagner pieces, but other pieces were also included:

  • Part 1:
    • Giuseppe Verdi, La forza del destino, Overture
    • Joan Albert Amargós, Pax Haganum
    • Carl Maria von Weber, Clarinet Concerto No. 1
  • Part 2:
    • Richard Wagner, Die Walküre, Walkürenritt
    • Richard Wagner, Tristan und Isolde, Prelude
    • Richard Wagner, Götterdämmerung
      • Siegfrieds Tod und Trauermarsch
      • Siegfrieds Rheinfahrt
      • Brünnhildes Grabszene

Angel Belda was the clarinet soloist in Weber’s Clarinet Concerto No. 1.

Figure 2: Computer in the backstage of L’Auditori controlling the MOCAP acquisition system.


For this concert, we recorded audio using near and far microphone techniques (to provide data for the Source Separation team at the UPF), video, and MOCAP information based on the Kinect (to provide data to the ESMUC and to PhD students at the UPF). For the audio data, we recorded the full concert with specific microphone techniques according to the requirements of the Source Separation team, and created a stereo mix for concert publishing. The audio files are stored in WAV format (16 bits, 44,100 Hz) in a ProTools session. Video was recorded using four video cameras in MTS format. The original video data was not synchronized with the audio and MOCAP data. MOCAP data was captured using a Microsoft Kinect for Xbox 360. This device performs skeleton tracking of humans using its depth image. Figure 2 shows the computer in the backstage of L’Auditori controlling the MOCAP acquisition system.


Beethoven 9th Symphony

The aim of this recording was to provide raw data for the source separation and conductor gesture recognition research lines in the PHENICX project. The novelty of this recording was the collaboration with the Orquestra Simfònica del Vallès (OSV), based in Sabadell, close to Barcelona. Even though the resources used were the same as in the previous concerts recorded by ESMUC, we needed to move all the equipment, which forced us to redesign part of the acquisition system to make it light and compact.

This dataset was built with the Beethoven 9th Symphony recorded on May 25th 2014, conducted by Rubén Gimeno, at the Kursaal theatre in Manresa, Spain. The Symphony is divided into four movements:

  • Allegro ma non troppo, un poco maestoso
  • Molto vivace
  • Adagio molto e cantabile
  • Presto

For the 4th movement, the invited choirs were the Cor de Cambra del Palau de la Música and the Polifònica de Puig-Reig.

Figure 3: General view of the orchestra at the Kursaal theatre in Manresa.


Figure 3 shows a general view of the orchestra at the Kursaal theatre in Manresa. This corpus is built with gesture, audio and video data. For the concert, the audio setup was based on 32 channels using near microphone techniques to cover all the different sections of the orchestra. We created a stereo mix of the concert, and the audio files are stored in WAV format (16 bits, 44,100 Hz) in a ProTools session. We also recorded gesture data from the conductor using a Kinect camera, and one domestic-quality video recording as a visual reference. Audio, video and MOCAP data have been manually aligned.



Monteverdi’s L’Orfeo

The aim of this recording was to provide a corpus for the recognition of body gestures related to the performance of musicians. The piece, L’Orfeo by Monteverdi, was chosen because of the lack of a conductor: the orchestra played in the context of a camerata. As all the other MOCAP recordings were related to the conductor, we found it interesting to test the proposed technologies in this new scenario.

This corpus was created on April 22nd and 23rd, 2015, during the rehearsals of L’Orfeo, performed by the students of the Early Music department and conducted by Xavier Díaz-Latorre at the Teatre de Sarrià, in Barcelona. Figure 4 shows the musicians during the performance of L’Orfeo at the Teatre de Sarrià.

The opera is divided into a prologue and five acts.

Figure 4: Representation of L’Orfeo at Teatre de Sarrià.


Gesture was measured on the viola da gamba player, Júlia Garcia-Arévalo, specifically by measuring the center of gravity of her chair. This instrument is one of the most representative in this piece, and it is fully controlled by the performer’s movements because it is held between the performer’s knees. The chair’s center of gravity is therefore a good measure of her performance.

MOCAP data was captured in both rehearsals using a hand-made system based on force sensors and an Arduino board that sent the data to the computer. The data was stored in .txt format, including a time stamp and the x, y and z coordinates of the movement, corresponding to left-right, front-back and overall weight, respectively. However, this data was asynchronous, as the Arduino board cannot guarantee a fixed sampling rate. For this reason, a simple sonification technique was developed using SuperCollider. It consists of generating two tones in a stereo file centered at f = 440 Hz: one of them (left channel) is frequency-modulated by the “x” value and the other (right channel) is frequency-modulated by the “y” value. Both sinusoidal signals are amplitude-modulated by the “z” value.
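The sonification described above can be sketched as follows. This is only an illustration in Python, not the SuperCollider patch used in the project. The function names, the linear interpolation of the irregular samples, the 100 Hz frequency deviation and the assumption that x and y are normalized to [-1, 1] and z to [0, 1] are all hypothetical choices made for the example.

```python
import math
from bisect import bisect_right

def interp(t, times, values):
    """Linearly interpolate irregularly sampled values at time t."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    i = bisect_right(times, t)          # first sample strictly after t
    t0, t1 = times[i - 1], times[i]
    w = (t - t0) / (t1 - t0)
    return values[i - 1] + w * (values[i] - values[i - 1])

def sonify(times, xs, ys, zs, sr=8000, f0=440.0, dev=100.0):
    """Turn asynchronous (t, x, y, z) samples into a uniform stereo signal.

    Left channel: tone around f0, frequency-modulated by x.
    Right channel: tone around f0, frequency-modulated by y.
    Both channels are amplitude-modulated by z.
    Assumed ranges (not from the source): x, y in [-1, 1], z in [0, 1].
    """
    n = int((times[-1] - times[0]) * sr)
    left, right = [], []
    phase_l = phase_r = 0.0
    for k in range(n):
        t = times[0] + k / sr
        x = interp(t, times, xs)
        y = interp(t, times, ys)
        z = interp(t, times, zs)
        # Accumulate phase so that the instantaneous frequency is f0 + dev * value.
        phase_l += 2 * math.pi * (f0 + dev * x) / sr
        phase_r += 2 * math.pi * (f0 + dev * y) / sr
        left.append(z * math.sin(phase_l))
        right.append(z * math.sin(phase_r))
    return left, right
```

Because the output is sampled on a fixed audio-rate grid, the irregular timing of the sensor data no longer matters downstream: the resulting stereo signal can be stored and aligned like any other audio stream.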


Crowdsourced filming experience

This concert, also known as the “PHENICX collaborative video festival”, aimed to create video content from footage that the audience recorded on their smartphones and uploaded through the wide-band Wi-Fi provided by the ESMUC. The concert was planned outdoors to attract a larger audience. It was based on jazz music to expand the boundaries of the PHENICX project.

This corpus was created from the programme played by the ESMUC Big Band on May 26th 2015, conducted by Lluís Vidal, teacher at ESMUC, on the outdoor stage of L’Auditori:

  • TIPTOE, Thad Jones
  • LEAVING, Richie Beirach
  • GROOVE MERCHANT, Jerome Richardson
  • EL MARINER, Lluís Vidal
  • STRAIGHTEN UP AND FLY RIGHT, Nat King Cole, Irving Mills
  • DON’T WORRY ‘BOUT ME, Rube Bloom, Ted Koehler
  • CAN’T WE BE FRIENDS? Paul James, Kay Swift
  • ARMCHAIR, Django Bates
  • CHORO DANÇADO, Maria Schneider
  • SWING OUT, Bob Mintzer
  • ELVIN’S MAMBO, Bob Mintzer

Specifically, the audience was asked to focus on the first three themes because of the limited battery life of smartphones. In addition, audio from a fixed stereo microphone pair close to the mixing console was recorded. As the concert was only partly amplified (voices, piano and bass), a full session with separate tracks is not available. Video data was provided by the audience. Figure 5 shows the audience recording the concert with their smartphones.

Figure 5: Audience recording the concert with their smartphones.


The video corpus is built on data from 34 audience members, who recorded 164 videos. Some of this data was discarded due to low quality, short duration, or other technical criteria. For the first three themes, on which we focused, we have the following usable data:

  • TIPTOE: 17 video recordings
  • LEAVING: 12 video recordings
  • GROOVE MERCHANT: 9 video recordings

For the other themes, about fifty video recordings are available. They were not processed because there are not enough video excerpts to keep continuity in the mix. As predicted, recording video drains smartphone batteries quickly.

As mentioned above, video streams were provided by the audience. Audience members were asked to record about one or two minutes of video and upload it to a dedicated server managed by VideoDock, the industrial partner in the PHENICX consortium. The ESMUC provided the wide-band Wi-Fi network, with a login and password for all users. Moreover, VideoDock designed an HTML5 web page to ease the uploading process. Figure 6 shows the web page that users connected to in order to upload their videos. The reference audio file described above was used to synchronize all the videos uploaded by the audience to the server. Only videos with a minimum length and quality were used for the mix.
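The text does not say how the uploaded videos were aligned with the reference audio. One common approach, shown here purely as an assumed illustration, is to slide the audio track of each clip along the reference recording and keep the position with the highest normalized cross-correlation. The function names and the brute-force search are hypothetical; a real system would typically work on downsampled audio or audio fingerprints and use an FFT-based correlation.

```python
import math

def best_offset(reference, clip):
    """Return the sample index in `reference` where `clip` matches best.

    The score is the normalized cross-correlation (cosine similarity), so a
    clip recorded at a different gain than the reference still matches.
    """
    n = len(clip)
    clip_norm = math.sqrt(sum(c * c for c in clip))
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(reference) - n + 1):
        window = reference[lag:lag + n]
        win_norm = math.sqrt(sum(w * w for w in window))
        if win_norm == 0.0 or clip_norm == 0.0:
            continue  # silent window or clip: correlation is undefined
        score = sum(w * c for w, c in zip(window, clip)) / (win_norm * clip_norm)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def offset_seconds(reference, clip, sample_rate):
    """Start time of the clip on the reference timeline, in seconds."""
    return best_offset(reference, clip) / sample_rate
```

Once every clip has an offset on the reference timeline, the clips can be placed side by side (as in a grid view) or cut together while keeping the reference audio as the single soundtrack.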

Figure 6: Webpage created to allow users to upload their videos.


Results were uploaded to YouTube and shared with all the contributors. Two versions of each theme are available: the Grid version, which shows all the uploaded videos that meet a minimum quality and duration, and the final result. In both cases, audio and video were synchronized with the reference audio.

    • Grid version: <>
    • Full version: <>
    • Grid version: <>
    • Full version: <>
    • Grid version: <>
    • Full version: <>


Repovizz: A platform for managing multimodal data

Recordings with multimodal corpora were processed and uploaded to Repovizz[1], a data repository and visualization tool for structured storage and user-friendly browsing of music performance multimodal recordings [Mayor 2013]. The primary purpose of RepoVizz is to offer the scientific community a means of gaining online access to a music performance multimodal database shared among researchers. Repovizz was developed by the Music Technology Group at the Universitat Pompeu Fabra, one of the partners of the project. Figure 7 shows a screenshot of a possible configuration in Repovizz. Some demos can be found on YouTube[2].

Figure 7: A screenshot of the Repovizz platform.


Specifically, RepoVizz is designed to hold synchronized streams of heterogeneous data (audio, video, motion capture, physiological measures, extracted descriptors, etc.), annotations, and musical scores. Data is structured by customizable XML skeleton files enabling meaningful retrieval. Multitrack data visualization is done remotely via a powerful HTML5-based environment that enables web-driven editing (add annotations, extract descriptors) and downloading of datasets.

In the context of the PHENICX project, Repovizz was used to store and organize the generated corpora described above, providing access to researchers and allowing the automatic computation of low and high level descriptors from data coming from different streams. Part of the recorded corpora described in this paper is available to the musical and scientific communities for research purposes.


Becoming the Maestro


One of the prototypes that brings together multiple research outcomes of the PHENICX project is the so-called Becoming the Maestro. Roughly speaking, Becoming the Maestro is a game in which the user conducts the orchestra and the system reacts to the user’s movements. The idea is based on the widely known Guitar Hero or Band Hero games, but focused on classical music with real audio and video recordings.

Gesture recognition is studied in the PHENICX project to enable interactive music-making scenarios. There is a large body of literature on the recognition of gestures of the body or of body parts. Two approaches can be distinguished: machine learning techniques (usually supervised) and analytical description techniques. The former traditionally focus on detecting discrete gestures from a previously trained dataset [Wobbrock et al., 2007]. Analytical techniques, on the other hand, are traditionally based on computing multiple descriptors from gestures and finding correlations with given models that allow a certain flexibility [Caramiaux et al., 2010][Bevilacqua et al., 2010]. The approach proposed here takes the best of both worlds to build a robust and computationally efficient gesture recognition solution for conducting the orchestra.

In terms of hardware, we wanted to use a cheap, contactless and unobtrusive gesture capture system that is easily available to end users. The device that best fits these requirements is the popular Kinect camera by Microsoft, which includes depth sensing as one of its features.

In the research scenario proposed here, conductor gestures from the recordings were used to reconstruct the conductor’s skeleton, as shown in Figure 8, and to derive a set of time-dependent features, as detailed in Table 1 and Table 2.

Figure 8: Skeleton reconstruction from data provided by Kinect camera.


Table 1: Summary of joint descriptors extracted from the skeleton.


Table 2: Summary of general descriptors extracted from the skeleton.



To achieve the goals set out in the previous section, a strategy based on understanding the gestures of both professional and untrained users was proposed. To this end, the recording sessions that provided data for this learning process were divided into two main groups:

First, the learning process from professional conductors was carried out through the multimodal recording sessions described above, specifically with the ESMUC Festival in spring 2013 and the Orquestra Simfònica del Vallès in spring 2014. As detailed, audio, video and gesture data were synchronized and uploaded to Repovizz for further analysis by the ESMUC and UPF researchers.

Second, some recordings of untrained users were made. These recordings focused on three aspects of the users’ gestures, corresponding to the three properties that end users had to control in the prototype: dynamics, tempo and articulation.

In the following sections, we describe step by step how the prototype can extract these three properties (dynamics, tempo and articulation) from gesture data.


Analysis of loudness-related gestures from untrained users

For the study of the loudness-conducting behavior of naïve users, twenty-five subjects were asked to conduct a specific piece without further instructions. The repertoire consisted of three fragments from a performance of the first movement of Beethoven’s Eroica played by the Royal Concertgebouw Orchestra. The subjects were not actually controlling the music. This was intentional, and necessary in order to study spontaneous conducting movements without any predefined rules for control. From the data collected in this recording, the conductor skeleton was reconstructed and the descriptors described in Table 1 and Table 2 were computed. Statistical analysis based on the correlation between the loudness and the general descriptors shows two clear tendencies in movement: some subjects present a high correlation between loudness (L) and Quantity of Movement (QoM) (i.e. their body movement increases with the loudness), while others present a high correlation between loudness and the vertical position of their hands (Ymax) (i.e. they raise their hands as the loudness increases). These two behaviors are detailed in Figure 9.

Figure 9: Subjects clustered by correlation of loudness (L) to QoM and Ymax.


These results suggest that a strategy based on a single general model is not appropriate. Therefore, in future prototypes, both models will be implemented, and the selection between them will be made automatically by learning from the first movements of the user.
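The correlation analysis and the automatic choice between the two models could look roughly like the sketch below. It assumes that the loudness, QoM and Ymax series have already been sampled at the same time instants. The definition of QoM used here (sum of joint displacements between consecutive frames) and all function names are assumptions made for the example, since the actual descriptor definitions are those of Table 1 and Table 2.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_a * var_b)

def quantity_of_movement(frames):
    """Assumed QoM: total joint displacement between consecutive frames.

    `frames` is a list of skeleton frames, each a list of (x, y, z) joints.
    The first frame has no predecessor, so its QoM is set to 0.
    """
    qom = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        qom.append(sum(math.dist(p, c) for p, c in zip(prev, cur)))
    return qom

def choose_loudness_model(loudness, qom, ymax):
    """Pick the descriptor that correlates best with loudness for this user."""
    r_qom = pearson(loudness, qom)
    r_ymax = pearson(loudness, ymax)
    return "QoM" if r_qom >= r_ymax else "Ymax"
```

Running `choose_loudness_model` on the first seconds of a session would assign the user to one of the two behaviors, after which the corresponding descriptor can drive the loudness control.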

Analysis of tempo-related gestures from untrained users

A strategy similar to the one used in the loudness analysis was applied to the study of tempo-related gestures from untrained users. Starting from a sample of twenty-five subjects conducting along with a set of selected audio excerpts, it is possible to derive the correlation between tempo and the recorded gestures through the set of descriptors detailed above.

Here again, user behavior falls into two subgroups. First, some users changed from downward to upward motion (maximum acceleration along the y axis), simulating a stroke in the air on every beat. Trained users, on the other hand, drew the standard 3/4 figure shown in Figure 10 (3/4 is the time signature of the given audio excerpt), in which beats correspond to the changes in the y trajectory (not acceleration).

Figure 10: Standard 3/4 time signature figure.


In this case, the maxima of acceleration and trajectory were computed for all the joints in the skeleton. These values were also used to automatically predict the hand that best followed the tempo.
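For the first subgroup, a beat corresponds to the instant at which the hand stops moving down and starts moving up. A minimal sketch of that idea follows, assuming a list of time stamps and the vertical position of one hand. The function names are hypothetical, and real Kinect data would need smoothing before this kind of turning-point detection is reliable.

```python
def detect_beats(times, ys):
    """Return the times at which vertical motion turns from downward to upward."""
    beats = []
    for i in range(1, len(ys) - 1):
        going_down = ys[i] < ys[i - 1]      # hand was descending into this sample
        going_up = ys[i + 1] > ys[i]        # hand ascends right after it
        if going_down and going_up:
            beats.append(times[i])
    return beats

def tempo_bpm(beats):
    """Average tempo in beats per minute from consecutive beat times."""
    if len(beats) < 2:
        raise ValueError("at least two beats are needed to estimate tempo")
    intervals = [b - a for a, b in zip(beats, beats[1:])]
    return 60.0 / (sum(intervals) / len(intervals))
```

Running the same detector on both hands and comparing the regularity of the resulting beat intervals is one possible way to decide which hand follows the tempo best.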

An interesting effect that must be taken into account is that some subjects anticipate the beat in different ways, losing synchronicity with the audio onsets. As this is one of the characteristics professional conductors can control, this possibility must be included in the analysis. Figure 11 shows, for three subjects, the error distribution of the detected onsets relative to the manually annotated ground truth. The black distribution is narrow but not centered around zero. The red distribution is similar but centered close to zero. Finally, the blue distribution is very wide and is not informative about any specific pattern in the beat positions. The black and red distributions can be considered to represent good tempo following.

Figure 11: Error distribution of time differences from three subjects.


The designed model is able to estimate tempo for users in either of the two subgroups and to detect the anticipations described here.
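The distinction between narrow and wide error distributions can be made concrete with two numbers per subject: the mean of the timing errors (a constant anticipation or delay) and their spread. The sketch below is an assumed illustration of that reasoning. The nearest-beat matching, the function names and the 50 ms spread threshold are hypothetical and not taken from the project.

```python
import math

def timing_errors(detected, annotated):
    """Signed error (seconds) from each annotated beat to its nearest detected beat.

    Negative values mean the gesture came before the annotated onset.
    """
    return [min((d - a for d in detected), key=abs) for a in annotated]

def summarize(errors):
    """Return (mean, standard deviation) of the timing errors."""
    mean = sum(errors) / len(errors)
    std = math.sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))
    return mean, std

def follows_tempo(errors, max_spread=0.05):
    """True when the errors are narrowly distributed, whatever their offset.

    A narrow distribution with a non-zero mean is treated as consistent
    anticipation, not as a failure to follow the tempo.
    """
    _, std = summarize(errors)
    return std <= max_spread
```

With this view, the mean can be subtracted as a per-user anticipation offset, while a large spread signals that the detected beats carry little tempo information.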

Building a model of articulation

The third part of the gesture analysis deals with articulation. Usually, music conducting systems have paid much attention to the control of tempo and dynamics. However, the control of music articulation remains unexplored. Detecting articulation in gesture is not as straightforward as detecting beats. One reason for this lack of attention to articulation is that, although the activity of conducting is highly codified and extensively taught, each conductor develops a personal style and communicates expressive intentions (including articulation) with different nuances. In this sense, the best strategy to tackle this problem seems to be to detect the idiosyncratic gesture variations that encode articulation.

Figure 12: Schema of the learning procedure.


Based on the principles of Interactive Machine Learning, a system was built to learn a user-specific model of gesture articulation. More concretely, the user provided a gesture example (a shape drawn in the air), from which dynamic features (velocity and acceleration) were extracted. These features were then fed to a probabilistic model based on a Gaussian Mixture Model (GMM). The GMM was used in a supervised mode, that is, the algorithm was provided with the training dataset and a label for each articulation. Once the model was trained, a new incoming gesture was analyzed online by extracting the same features. The model then assigned it a continuous value representing its relative distance to each articulation.

Figure 12 shows the schema of this learning procedure. Input examples are represented in the velocity-acceleration feature space and associated with an articulation label. This representation feeds a GMM initialized with the means of each class and adapted using Expectation-Maximization. New samples are projected into this feature space and the model decides to which cluster they belong.
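The learning procedure can be illustrated with a small self-contained sketch. It assumes one Gaussian component per articulation label, diagonal covariances, and the posterior probability of a component as the continuous output value. These simplifications, the function names and the variance floor are choices made for the example and are not taken from the system built in the project.

```python
import math

def gauss(p, mean, var):
    """Density of a diagonal-covariance Gaussian at point p."""
    d = 1.0
    for x, m, v in zip(p, mean, var):
        d *= math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
    return d

def responsibilities(p, weights, means, variances):
    """Posterior probability of each component given point p."""
    dens = [w * gauss(p, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(dens) or 1e-300   # guard against numerical underflow
    return [d / total for d in dens]

def fit_gmm(points, labels, n_classes, iters=20, min_var=1e-3):
    """Fit a GMM whose components are initialized from the labeled classes."""
    dim, n = len(points[0]), len(points)
    weights, means, variances = [], [], []
    for c in range(n_classes):    # supervised initialization, one component per label
        members = [p for p, lab in zip(points, labels) if lab == c]
        mean = [sum(p[j] for p in members) / len(members) for j in range(dim)]
        var = [max(min_var, sum((p[j] - mean[j]) ** 2 for p in members) / len(members))
               for j in range(dim)]
        weights.append(len(members) / n)
        means.append(mean)
        variances.append(var)
    for _ in range(iters):        # Expectation-Maximization refinement
        resp = [responsibilities(p, weights, means, variances) for p in points]
        for c in range(n_classes):
            nc = max(sum(r[c] for r in resp), 1e-12)
            weights[c] = nc / n
            means[c] = [sum(r[c] * p[j] for r, p in zip(resp, points)) / nc
                        for j in range(dim)]
            variances[c] = [max(min_var, sum(r[c] * (p[j] - means[c][j]) ** 2
                                             for r, p in zip(resp, points)) / nc)
                            for j in range(dim)]
    return weights, means, variances

def articulation_value(p, model, target=1):
    """Continuous value in [0, 1]: how strongly p belongs to class `target`."""
    return responsibilities(p, *model)[target]
```

With label 0 for legato and label 1 for staccato, `articulation_value` returns a number close to 0 for gestures near the legato examples and close to 1 for gestures near the staccato examples, which can then be mapped to the articulation of the synthesized melody.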

A user study with 20 subjects (10 musicians and 10 non-musicians), who could control the articulation of a synthesized melody (from totally legato to totally staccato), confirmed that the model provides a satisfying interaction scheme for controlling music articulation.

The prototype

At the time of writing, there is a first version of a fully functional prototype playing some excerpts of Beethoven’s 3rd Symphony, with audio based on MIDI data. Moreover, we are close to finishing the inclusion of other Beethoven symphonies with real audio and video data, recorded specifically in the context of the PHENICX project. Figure 13 shows a user playing with the prototype at the Music Technology Group (UPF) labs. We expect these prototypes to be available to audiences at different places in Europe soon. This will be announced on the PHENICX web page and on the project’s social networks.

Figure 13: A player testing the Becoming the Maestro prototype.



Conclusions

This paper presented the ESMUC contribution to the PHENICX project. This contribution can be split into two main areas: the content provided through multimodal recordings, and the research related to conductors’ gestures. The recordings provided have been used by the research community involved in the project, and they are partly open to the music and scientific communities for research purposes. In addition, the Becoming the Maestro prototype, one of the tangible products that incorporates several research outcomes of the project, will soon be available in some specific locations with the purpose of disseminating classical music to scholars and other potential audiences.

In terms of the whole project, the reader can explore the first global prototype, including visualization and navigation features, and test the PHENICX experience through this link: <>. This is the first integrated prototype, and it has to be understood as a joint contribution from the different partners in the consortium. Moreover, the PHENICX consortium organized a series of concerts showing these technologies. The most recent one took place in Seville in March 2015, in the context of the Singularity University Summit, at which over 600 entrepreneurs, CEOs and researchers from across Europe and the Middle East learned from Silicon Valley's top experts and gained insights on how life, society and industries will be disrupted and reshaped through technologies that are in development today. Figure 14 shows a snapshot of the event. The reader can see the concert video of Beethoven's Overture to Prometheus through this link: <>

Figure 14: Snapshot of the concert organized for the Singularity University Summit event, in March 2015.


Going back to the ESMUC contribution: although this project is centered on classical music, and specifically on orchestral music, experiments with other musical genres have also been carried out. We have to recognize that one of the internal goals of ESMUC was not to focus on orchestras alone. Recordings of early music, jazz, and the large choir in Beethoven's 9th Symphony attest to that. Since its origins, ESMUC has promoted a cross-disciplinary approach to musical knowledge through its curricula and activities. The PHENICX project is a good example of that.

On the other hand, the analysis of musicians' gestures showed that, even though the research results described here are a good contribution to the scientific community, we are still not able to capture, model or reproduce the intrinsic magic of music. Maybe this is good news for musicians and music enthusiasts. The bad news is that the sound and music computing community is getting better and better results in the understanding of emotion, movement and expressivity in music. It is only a matter of time before models can express emotions as performers do.


This work was supported by the European Union Seventh Framework Programme FP7 / 2007-2013 through the PHENICX project (grant agreement no. 601166).

*  *  *

[1] <>
[2] <>

*  *  *


  • [Knight Foundation, 2002] Knight Foundation. (2002). Classical Music Consumer Segmentation Study: How Americans Relate to Classical Music and Their Local Orchestras. Miami: Knight Foundation.
  • [Wolf, 2006] Wolf, T. (2006). The Search for Shining Eyes: Audiences, Leadership and Change in the Symphony Orchestra Field. Miami: Knight Foundation.
  • [Rizkallah, 2009] Rizkallah, E. G. (2009). A Non-Classical Marketing Approach For Classical Music Performing Organizations: An Empirical Perspective. Journal of Business & Economics Research, vol. 7, no. 4.
  • [Bosch, 2012] Bosch, J-J., Kondo, K., Marxer, R. and Janer, J. (2012). Score-informed and Timbre Independent Lead Instrument Separation in Real-world Scenarios. Proc. of the 20th European Signal Processing Conference.
  • [Ewert, 2012] Ewert, S., Müller, M. (2012). Using Score-Informed Constraints For NMF-Based Source Separation. Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP).
  • [Mayor, 2013] Mayor, O., Llimona, Q., Marchini, M., Papiotis, P. & Maestre, E. (2013). repoVizz: a framework for remote storage, browsing, annotation, and exchange of multimodal data. Proc. of the 21st ACM International Conference on Multimedia.
  • [Wobbrock et al., 2007] Wobbrock, J. O., Wilson, A. D., & Li, Y. (2007). Gestures without libraries, toolkits or training: a $1 recogniser for user interface prototypes. Proceedings of the 20th annual ACM symposium on User interface software and technology – UIST '07. New York: ACM Press.
  • [Caramiaux et al., 2010] Caramiaux, B., Bevilacqua, F. and Schnell, N. (2010). Towards a gesture-sound cross-modal analysis. Gesture in Embodied Communication and Human-Computer Interaction. Lecture Notes in Computer Science, Volume 5934/2010, 158-170.
  • [Bevilacqua et al., 2010] Bevilacqua, F., Zamborlin, B., & Sypniewski, A. (2010). Continuous realtime gesture following and recognition. Gesture in Embodied Communication and Human-Computer Interaction. Lecture Notes in Computer Science, Volume 5934/2010, 158-170.