Efficient Audio Enhancement with a Differentiable Psychoacoustic Loss

1SMT, DEL/Poli & PEE/COPPE, Federal University of Rio de Janeiro, Brazil
2LCTI, Télécom Paris, IP Paris, France
Submitted to the Journal of the Audio Engineering Society.

Abstract

Audio enhancement aims to improve the perceived quality of audio signals. To address bandwidth extension, this work first proposes $\textrm{AEROMamba}_{\textrm{P}}$, an efficient variant of the $\textrm{AERO}$ super-resolution architecture in which the attention and LSTM layers are replaced by the Mamba state-space model, and which incorporates a newly developed differentiable perceptual loss derived from the Perceptual Audio Quality Measure (PAQM). During training, the architecture requires approximately 2–4x less GPU memory than the baseline; during inference, it achieves a 14x speedup while using only one-fifth of the GPU memory. When upsampling both a piano dataset and MUSDB18 from 11.025 kHz to 44.1 kHz, subjective listening tests show that $\textrm{AEROMamba}_{\textrm{P}}$ outperforms $\textrm{AERO}$ by 15% in perceived quality scores. Next, to enhance audio signals that have been heavily compressed by lossy coding, this work further proposes $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$, which applies the same framework but replaces the STFT reconstruction losses with the PAQM loss, specifically to enhance MP3-encoded audio at 32 kbps. In listening evaluations, $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$ achieves a 52% higher quality rating than $\textrm{AEROMamba}_{\textrm{P}}$ when restoring compressed audio. These results demonstrate that PAQM-driven training coupled with lightweight state-space modeling yields high perceptual quality and computational efficiency in both band-limited and compressed audio scenarios.

Results: Super-resolution of Bandlimited Audio

Results for the MUSDB and PianoEval datasets, comparing ViSQOL, LSD, and subjective scores, together with performance metrics measured on an NVIDIA RTX 3090 GPU for 10-second samples.
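
For reference, LSD (log-spectral distance) is commonly defined as shown below. This is the usual formulation in the audio super-resolution literature and is given here as an assumption, since this page does not state the exact variant used. The symbols are introduced only for illustration: $S$ and $\hat{S}$ denote the STFTs of the reference and estimated signals, $T$ the number of frames, and $K$ the number of frequency bins.

$$\mathrm{LSD}(S, \hat{S}) = \frac{1}{T} \sum_{t=1}^{T} \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left( \log_{10} \frac{|S(t,k)|^2}{|\hat{S}(t,k)|^2} \right)^2}$$

Under this definition, identical spectra give an LSD of zero, which is consistent with the ↓ marker on the LSD columns in the tables.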

MUSDB Results

Objective and Subjective (Score) metrics for MUSDB (bandlimited)
Model ViSQOL ↑ LSD ↓ Score ↑
$\textrm{Low-Resolution}$ 1.82 3.98 38.22
$\textrm{AERO}$ 2.90 1.34 60.03
$\textrm{AEROMamba}$ 2.93 1.23 66.47
$\textrm{AEROMamba}_{\textrm{P}}$ 3.04 1.19 79.26
$\textrm{AudioSR}$ 3.01 - -

PianoEval Results

Objective and Subjective (Score) metrics for PianoEval
Model ViSQOL ↑ LSD ↓ Score ↑
$\textrm{Low-Resolution}$ 4.36 1.09 72.92
$\textrm{AERO}$ 4.38 0.99 76.89
$\textrm{AEROMamba}-{\textrm{HQ}}$ 4.38 1.00 84.41
$\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$ 4.41 0.90 78.76

Models labeled with $\textrm{-HQ}$ were trained on PianoEval-HQ.

Performance Comparison (NVIDIA RTX 3090)

Inference performance for a 10-second audio sample.
Method GPU Usage (MB) Time (s) Parameters
$\textrm{AERO}$ 17091 1.246 19,432,958
$\textrm{AEROMamba}$ 3000 0.087 20,964,190
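
The ratios implied by the table can be checked with a few lines of arithmetic. The sketch below only re-uses the numbers listed above; the dictionary and variable names are illustrative and not part of the project code.

```python
# Values copied from the inference table above (10-second sample, RTX 3090).
# The dictionary layout and key names are illustrative assumptions.
aero = {"gpu_mb": 17091, "time_s": 1.246, "params": 19_432_958}
aeromamba = {"gpu_mb": 3000, "time_s": 0.087, "params": 20_964_190}

# How many times faster AEROMamba runs than AERO.
speedup = aero["time_s"] / aeromamba["time_s"]
# Fraction of AERO's GPU memory that AEROMamba needs.
memory_fraction = aeromamba["gpu_mb"] / aero["gpu_mb"]
# Relative growth in parameter count.
param_increase = aeromamba["params"] / aero["params"] - 1.0

print(f"Inference speedup: {speedup:.1f}x")           # 14.3x
print(f"GPU memory fraction: {memory_fraction:.3f}")  # 0.176
print(f"Parameter increase: {param_increase:.1%}")    # 7.9%
```

The speedup and memory fraction are roughly in line with the rounded figures quoted in the abstract, and they are obtained despite a slightly larger parameter count.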

Subjective Score Distributions

MUSDB

Violin plot of subjective scores for the MUSDB bandlimited experiment.
(A) $\textrm{Low-Resolution}$. (B) $\textrm{AERO}$ (C) $\textrm{AEROMamba}$ (D) $\textrm{AEROMamba}_{\textrm{P}}$

PianoEval

Violin plot of subjective scores for the PianoEval bandlimited experiment.
(A) $\textrm{Low-Resolution}$. (B) $\textrm{AERO}$ (C) $\textrm{AEROMamba}-{\textrm{HQ}}$ (D) $\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$

Statistical Tests (Mann-Whitney U)

Pairwise comparisons of subjective scores (p-values). Values < 0.05 are considered statistically significant.
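
As a minimal sketch of how such a pairwise comparison can be computed, the pure-Python function below implements a two-sided Mann-Whitney U test using the normal approximation with tie and continuity corrections. The function name `mann_whitney_u` and the score lists are hypothetical; this is not the authors' evaluation code, and the exact test variant behind the reported p-values is not specified on this page.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U statistic of x, p-value). Ties receive average ranks and
    the variance is corrected for them; a 0.5 continuity correction is used.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 share ranks i+1..j; use their average.
        avg_rank = (i + 1 + j) / 2.0
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0
    z = max(abs(u_x - mean_u) - 0.5, 0.0) / math.sqrt(var_u)
    # Two-sided tail probability of the standard normal distribution.
    p = math.erfc(z / math.sqrt(2.0))
    return u_x, min(p, 1.0)


# Hypothetical listener scores, for illustration only (not from the study).
scores_a = [55, 60, 62, 58, 65, 61, 57, 63]
scores_b = [75, 80, 78, 82, 79, 77, 81, 76]
u_stat, p_value = mann_whitney_u(scores_a, scores_b)
print(f"U = {u_stat}, p = {p_value:.4f}, significant = {p_value < 0.05}")
```

In practice, a library routine such as `scipy.stats.mannwhitneyu` would normally be preferred, particularly for small samples where an exact p-value is available.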

MUSDB (Subjective)

MUSDB pairwise p-values
Comparison p-value
$\textrm{Low-Resolution}$ vs. $\textrm{AEROMamba}$ < 0.0001
$\textrm{Low-Resolution}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$ < 0.0001
$\textrm{Low-Resolution}$ vs. $\textrm{AERO}$ < 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$ < 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AERO}$ 0.0089
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{AERO}$ < 0.0001

PianoEval (Subjective)

PianoEval pairwise p-values
Comparison p-value
$\textrm{Low-Resolution}$ vs. $\textrm{AEROMamba}-{\textrm{HQ}}$ < 0.0001
$\textrm{Low-Resolution}$ vs. $\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$ 0.0587
$\textrm{Low-Resolution}$ vs. $\textrm{AERO}$ 0.3399
$\textrm{AEROMamba}-{\textrm{HQ}}$ vs. $\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$ 0.0101
$\textrm{AEROMamba}-{\textrm{HQ}}$ vs. $\textrm{AERO}$ 0.0003
$\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$ vs. $\textrm{AERO}$ 0.2975

Pairwise comparisons of ViSQOL scores (p-values).

MUSDB (ViSQOL)

MUSDB ViSQOL pairwise p-values
Comparison p-value
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$ < 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AERO}$ 0.0007
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{AERO}$ < 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{Low-Resolution}$ < 0.0001
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{Low-Resolution}$ < 0.0001
$\textrm{AERO}$ vs. $\textrm{Low-Resolution}$ < 0.0001
$\textrm{AudioSR}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$ 0.2178

PianoEval (ViSQOL)

PianoEval ViSQOL pairwise p-values
Comparison p-value
$\textrm{Low-Resolution}$ vs. $\textrm{AERO}$ < 0.0001
$\textrm{Low-Resolution}$ vs. $\textrm{AERO-HQ}$ < 0.0001
$\textrm{Low-Resolution}$ vs. $\textrm{AEROMamba}$ 0.2989
$\textrm{Low-Resolution}$ vs. $\textrm{AEROMamba}-{\textrm{HQ}}$ < 0.0001
$\textrm{Low-Resolution}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$ < 0.0001
$\textrm{Low-Resolution}$ vs. $\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$ < 0.0001
$\textrm{Low-Resolution}$ vs. $\textrm{AudioSR}$ < 0.0001
$\textrm{AERO}$ vs. $\textrm{AERO-HQ}$ 0.1077
$\textrm{AERO}$ vs. $\textrm{AEROMamba}$ < 0.0001
$\textrm{AERO}$ vs. $\textrm{AEROMamba}-{\textrm{HQ}}$ 0.0215
$\textrm{AERO}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$ 0.6028
$\textrm{AERO}$ vs. $\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$ 0.6917
$\textrm{AERO}$ vs. $\textrm{AudioSR}$ < 0.0001
$\textrm{AERO-HQ}$ vs. $\textrm{AEROMamba}$ < 0.0001
$\textrm{AERO-HQ}$ vs. $\textrm{AEROMamba}-{\textrm{HQ}}$ 0.0002
$\textrm{AERO-HQ}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$ 0.2574
$\textrm{AERO-HQ}$ vs. $\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$ 0.2083
$\textrm{AERO-HQ}$ vs. $\textrm{AudioSR}$ < 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}-{\textrm{HQ}}$ < 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$ < 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$ < 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AudioSR}$ < 0.0001
$\textrm{AEROMamba}-{\textrm{HQ}}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$ 0.1172
$\textrm{AEROMamba}-{\textrm{HQ}}$ vs. $\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$ 0.0806
$\textrm{AEROMamba}-{\textrm{HQ}}$ vs. $\textrm{AudioSR}$ < 0.0001
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$ 0.8168
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{AudioSR}$ < 0.0001
$\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$ vs. $\textrm{AudioSR}$ < 0.0001

Audio Examples: MUSDB

Tracks upsampled from 11.025 kHz to 44.1 kHz

Audio samples comparing $\textrm{Low-Resolution}$, $\textrm{Original (High-Res)}$, $\textrm{AERO}$, $\textrm{AEROMamba}$, and $\textrm{AEROMamba}_{\textrm{P}}$.
Track Original ($\textrm{Low-Res}$, 11.025 kHz) Original ($\textrm{High-Res}$, 44.1 kHz) $\textrm{AERO}$ (11.025 → 44.1 kHz) $\textrm{AEROMamba}$ (11.025 → 44.1 kHz) $\textrm{AEROMamba}_{\textrm{P}}$ (11.025 → 44.1 kHz)
459
480
826
625

Results: Enhancement of Heavily Compressed Audio

Objective and subjective scores for low-bitrate (MP3 at 32 kbps) signals and various models evaluated on MUSDB and PianoEval.

MUSDB Results

Objective and Subjective (Score) metrics for MUSDB (MP3 32 kbps)
System ViSQOL ↑ LSD ↓ Score ↑
$\textrm{Low-Bitrate}$ 1.80 2.02 50.7
$\textrm{AEROMamba}$ 2.45 1.24 49.8
$\textrm{AEROMamba}_{\textrm{P}}$ 2.99 1.27 49.7
$\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$ 2.90 1.23 75.6
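
As a worked check, and assuming the 52% figure quoted in the abstract refers to the mean MUSDB subjective scores in this table, the relative gain of $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$ over $\textrm{AEROMamba}_{\textrm{P}}$ is:

$$\frac{75.6 - 49.7}{49.7} = \frac{25.9}{49.7} \approx 0.52$$

That is, roughly a 52% higher subjective score.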

PianoEval Results

Objective and Subjective (Score) metrics for PianoEval (MP3 32 kbps)
System ViSQOL ↑ LSD ↓ Score ↑
$\textrm{Low-Bitrate}$ 4.35 2.33 69.5
$\textrm{AEROMamba}$ 4.22 1.14 83.4
$\textrm{AEROMamba}_{\textrm{P}}$ 4.24 1.12 84.1
$\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$ 4.41 1.13 85.5

Subjective Score Distributions

MUSDB

Violin plot of subjective scores for the MUSDB compressed experiment.
(A) $\textrm{Low-Bitrate}$. (B) $\textrm{AEROMamba}$ (C) $\textrm{AEROMamba}_{\textrm{P}}$ (D) $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$

PianoEval

Violin plot of subjective scores for the PianoEval compressed experiment.
(A) $\textrm{Low-Bitrate}$. (B) $\textrm{AEROMamba}$ (C) $\textrm{AEROMamba}_{\textrm{P}}$ (D) $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$

Statistical Tests (Mann-Whitney U)

Pairwise comparisons of subjective scores (p-values). Values < 0.05 are considered statistically significant.

MUSDB (Subjective)

MUSDB pairwise p-values (MP3)
Comparison p-value
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}$ 0.7390
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$ 0.8233
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$ < 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$ 0.8751
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$ < 0.0001
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$ < 0.0001

PianoEval (Subjective)

PianoEval pairwise p-values (MP3)
Comparison p-value
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}$ < 0.0001
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$ < 0.0001
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$ < 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$ 0.9193
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$ 0.7168
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$ 0.7034

Pairwise comparisons of ViSQOL scores (p-values).

MUSDB (ViSQOL)

MUSDB ViSQOL (MP3) pairwise p-values
Comparison p-value
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}$ < 0.0001
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$ < 0.0001
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$ < 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$ < 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$ < 0.0001
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$ < 0.0001

PianoEval (ViSQOL)

PianoEval ViSQOL (MP3) pairwise p-values
Comparison p-value
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}$ < 0.0001
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$ < 0.0001
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$ 0.0052
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$ 0.3122
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$ < 0.0001
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$ < 0.0001

Audio Examples: MUSDB (Compressed)

Tracks restored from 32 kbps MP3 to 44.1 kHz

Audio samples comparing $\textrm{Low-Bitrate}$, $\textrm{Original (High-Res)}$, $\textrm{AEROMamba}$, $\textrm{AEROMamba}_{\textrm{P}}$, and $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$.
Track Low-Bitrate (MP3 32 kbps) High-Res (44.1 kHz) $\textrm{AEROMamba}$ $\textrm{AEROMamba}_{\textrm{P}}$ $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$
459
480
826
625

Extra: Super-resolution of Degraded Historical Recordings

Super-resolution of Alfred Cortot's historical piano recordings. This section illustrates how the predictions of the bandlimited-audio super-resolution models differ depending on whether they were trained on data that includes noisy recordings or only on high-quality recordings. The input tracks contain much more noise than any of the training data in either the HQ or the non-HQ dataset.

Historical recordings from 1925 and 1942 processed through various model configurations.
Recording Low Resolution High Resolution $\textrm{AEROMamba}$ $\textrm{AEROMamba}\text{-}\textrm{HQ}$ $\textrm{AEROMamba}_{\textrm{P}}$ $\textrm{AEROMamba}_{\textrm{P}}\text{-}\textrm{HQ}$
Cortot (1925)
Cortot (1942)

PianoEval Dataset Metadata

The collected PianoEval dataset consists of two parts. The first comprises the 24 Preludes for Piano, Op. 28, by Chopin, performed by 33 pianists in 45 different recordings available on CD (Compact Disc), totaling approximately 22 hours. The second contains excerpts of Ligeti piano études, a Schumann sonata, and the Barber sonata, played by three different performers, respectively, totaling approximately 3.5 hours. The performers, record labels, and recording years are detailed in the tables below.

Train/Validation

Train/Validation split for PianoEval dataset
Pianist Label Year
Arrau, C. Columbia 1950/1
Arrau, C. Philips 1973
Argerich, M. DG 1975
Ashkenazy, V. Decca 1976
Ashkenazy, V. Decca 1992
Bolet, J. RCA 1974
Blechacz, R. DG 2007
Cherkassky, S. ASV 1968
Cortot, A. HMV 1926
Cortot, A. HMV 1933/4
Cortot, A. Gramophone 1942
Cortot, A. Archipel 1955
Cortot, A. EMI 1957
Davidovich, B. Decca 1979
de Larrocha, A. Decca 1974
Duchable, F. Erato 1988
Dutra, G. Yellow Tail 1997
El Bacha, A. R. Forlane 1999
François, S. EMI 1959
Freire, N. Columbia 1970
Harasiewicz, A. Philips 1963
Katsaris, C. Sony 1992
Kissin, Y. RCA 1999
Lima, A. M. Caras 1981
Lucchesini, A. EMI 1988
Magaloff, N. Philips 1975
Novaes, G. M&A 1949
Ohlsson, G. EMI 1974
Ohlsson, G. Hyperion 1989
Perahia, M. Columbia 1975
Petri, E. Columbia 1942
Pires, M. Erato 1975
Pires, M. DG 1992
Pogorelich, I. DG 1989
Pollini, M. DG 1974
Pollini, M. DG 2011
Proença, M. Delphos 1999
Rubinstein, A. RCA 1946
Switala, W. NIFC 2006/7
Tiempo, S. Victor 1990
Varsi, D. Genuin 1988

Test

Test split for PianoEval dataset
Pianist Label Year
B. Glemser Naxos 1993
D. Pollack Naxos 1995
P. L. Aimard Sony 1995

Subjective Test Tracklist

The following table maps the Question IDs (QID) used during the subjective listening tests to the corresponding audio tracks from the MUSDB and PianoEval datasets.

Question IDs and corresponding track names.
QID Track QID Track
1 electronic01 13 electronic02
2 rock01 14 rock02
3 pop01 15 pop02
4 hiphop01 16 hiphop02
5 latin01 17 reggae01
6 other01 18 other02
7 02Barber 19 04Barber
8 14Ligeti 20 17Ligeti
9 05Ligeti 21 15Ligeti
10 07Barber 22 08Barber
11 03Schumann 23 04Schumann
12 02Schumann 24 15Schumann