Efficient Audio Super-Resolution with a Differentiable Psychoacoustic Loss

1 SMT, DEL/Poli & PEE/COPPE, Federal University of Rio de Janeiro, Brazil
2 LCTI, Télécom Paris, IP Paris, France
Submitted to the Journal of the Audio Engineering Society.

Abstract

Audio super-resolution is commonly seen as the task of enhancing low-bitrate audio signals by creating missing high-frequency content. This work proposes AEROMamba-PAQM, an efficient variant of the AERO super-resolution architecture in which attention and LSTM layers are replaced by the Mamba state-space model, and which incorporates a newly developed differentiable perceptual loss derived from the Perceptual Audio Quality Measure (PAQM). During training, the architecture requires approximately 2–4x less GPU memory than the baseline; during inference, it achieves a 14x speedup while using only about one-fifth of the GPU memory. When upsampling both a piano dataset and MUSDB18 from 11.025 kHz to 44.1 kHz, subjective listening tests show that AEROMamba-PAQM outperforms AERO by 15% in perceived quality scores. To address the broader problem of improving audio that has been heavily compressed by lossy coding, we further propose AEROMamba-PAQM++, which applies the same framework but replaces the STFT reconstruction losses with the PAQM loss, specifically to enhance MP3-encoded audio at 32 kbps. In listening evaluations, AEROMamba-PAQM++ achieves a 52% higher quality rating than AEROMamba-PAQM when restoring compressed audio. These results demonstrate that PAQM-driven training coupled with lightweight state-space modeling yields high perceptual quality and computational efficiency in both band-limited and compressed audio scenarios.

Results: Super-resolution of Bandlimited Audio

Results for the MUSDB and PianoEval datasets, comparing ViSQOL, LSD, and subjective scores, as well as performance metrics measured on an NVIDIA RTX 3090 GPU for 10-second samples.

MUSDB Results

Objective and Subjective (Score) metrics for MUSDB (bandlimited)
Model ViSQOL ↑ LSD ↓ Score ↑
Low-Resolution 1.82 3.98 38.22
AERO 2.90 1.34 60.03
AEROMamba 2.93 1.23 66.47
AEROMamba-PAQM 3.04 1.19 79.26
AudioSR 3.01 - -

PianoEval Results

Objective and Subjective (Score) metrics for PianoEval
Model ViSQOL ↑ LSD ↓ Score ↑
Low-Resolution 4.36 1.09 72.92
AERO 4.38 0.99 76.89
AEROMamba-HQ 4.38 1.00 84.41
AEROMamba-PAQM-HQ 4.41 0.90 78.76

Models labeled with `-HQ` were trained on PianoEval-HQ.
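
The abstract quotes a 15% gain over AERO in perceived quality without breaking it down per dataset. One reading that is consistent with the two tables above is that the subjective scores were averaged over MUSDB and PianoEval before comparison; this is an assumption on our part, not something stated on this page. The Python sketch below computes both the per-dataset gains and the pooled one (all names are illustrative):

```python
# Subjective scores copied from the two tables above.
# For PianoEval, the AEROMamba-PAQM entry is the -HQ model.
scores = {
    "MUSDB": {"AERO": 60.03, "AEROMamba-PAQM": 79.26},
    "PianoEval": {"AERO": 76.89, "AEROMamba-PAQM": 78.76},
}


def relative_gain(new, old):
    """Relative improvement of `new` over `old` (0.10 means +10%)."""
    return (new - old) / old


# Gain computed separately on each dataset.
per_dataset = {
    name: relative_gain(s["AEROMamba-PAQM"], s["AERO"])
    for name, s in scores.items()
}

# Assumption: pool the datasets by averaging the scores first.
mean_aero = sum(s["AERO"] for s in scores.values()) / len(scores)
mean_paqm = sum(s["AEROMamba-PAQM"] for s in scores.values()) / len(scores)
pooled = relative_gain(mean_paqm, mean_aero)

for name, gain in per_dataset.items():
    print(f"{name}: {gain:+.1%}")
print(f"pooled: {pooled:+.1%}")
```

Under this reading the pooled gain is roughly 15%, driven mostly by MUSDB (about 32%), while the PianoEval gain is small (about 2%).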

Performance Comparison (NVIDIA RTX 3090)

Inference performance for a 10-second audio sample.
Method GPU Usage (MB) Time (s) Parameters
AERO 17091 1.246 19,432,958
AEROMamba 3000 0.087 20,964,190
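
The speedup and memory figures quoted in the abstract can be checked directly against the table above. The following Python sketch only divides the tabulated values; the variable names are illustrative and not taken from the project code.

```python
# Values copied from the inference table above (10-second sample, RTX 3090).
aero = {"gpu_mb": 17091, "time_s": 1.246, "params": 19_432_958}
aeromamba = {"gpu_mb": 3000, "time_s": 0.087, "params": 20_964_190}

# How many times faster AEROMamba runs than AERO.
speedup = aero["time_s"] / aeromamba["time_s"]
# Share of AERO's GPU memory that AEROMamba needs.
memory_fraction = aeromamba["gpu_mb"] / aero["gpu_mb"]
# Relative change in parameter count.
param_increase = aeromamba["params"] / aero["params"] - 1.0

print(f"speedup: {speedup:.1f}x")
print(f"GPU memory: {memory_fraction:.1%} of AERO")
print(f"parameters: {param_increase:+.1%}")
```

The ratios come out at roughly 14x for inference time and a little under one-fifth for GPU memory, with a parameter count about 8% higher than AERO.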

Subjective Score Distributions

MUSDB

Violin plot of subjective scores for the MUSDB bandlimited experiment.
(A) Low Res. (B) AERO (C) AEROMamba (D) AEROMamba-PAQM

PianoEval

Violin plot of subjective scores for the PianoEval bandlimited experiment.
(A) Low Res. (B) AERO (C) AEROMamba-HQ (D) AEROMamba-PAQM-HQ

Statistical Tests (Mann-Whitney U)

Pairwise comparisons of subjective scores (p-values). Values < 0.05 are considered statistically significant.
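
For readers unfamiliar with the test, the sketch below shows how a two-sided Mann-Whitney U p-value can be obtained for two groups of listener ratings. It is a minimal pure-Python version using the normal approximation with tie and continuity corrections. The ratings in the example are hypothetical, and the page does not state which implementation or settings the authors used; in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be preferred.

```python
import math
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U statistic for x, p-value). Sketch only: uses the
    asymptotic distribution, so it is inexact for very small samples.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y],
                    key=lambda t: t[0])

    # Assign average ranks to tied values and accumulate the tie term.
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j

    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0

    # Continuity correction of 0.5; clamp at zero so p never exceeds 1.
    z = max((abs(u_x - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    p = 2 * (1 - NormalDist().cdf(z))
    return u_x, min(p, 1.0)


# Hypothetical ratings for two systems (not data from the listening test).
system_a = [62, 55, 70, 58, 66, 61, 59, 64]
system_b = [80, 77, 85, 74, 79, 82, 76, 88]

u, p = mann_whitney_u(system_a, system_b)
print(f"U = {u:.1f}, p = {p:.4f}, significant at 0.05: {p < 0.05}")
```

Because the test is rank-based, it makes no normality assumption about the ratings, which suits bounded listening-test scores.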

MUSDB (Subjective)

MUSDB pairwise p-values
Comparison p-value
Low Res vs. AEROMamba < 0.0001
Low Res vs. AEROMamba-PAQM < 0.0001
Low Res vs. AERO < 0.0001
AEROMamba vs. AEROMamba-PAQM < 0.0001
AEROMamba vs. AERO 0.0089
AEROMamba-PAQM vs. AERO < 0.0001

PianoEval (Subjective)

PianoEval pairwise p-values
Comparison p-value
Low Res vs. AEROMamba-HQ < 0.0001
Low Res vs. AEROMamba-PAQM-HQ 0.0587
Low Res vs. AERO 0.3399
AEROMamba-HQ vs. AEROMamba-PAQM-HQ 0.0101
AEROMamba-HQ vs. AERO 0.0003
AEROMamba-PAQM-HQ vs. AERO 0.2975

Pairwise comparisons of ViSQOL scores (p-values).

MUSDB (ViSQOL)

MUSDB ViSQOL pairwise p-values
Comparison p-value
AEROMamba vs. AEROMamba-PAQM < 0.0001
AEROMamba vs. AERO 0.0007
AEROMamba-PAQM vs. AERO < 0.0001
AEROMamba vs. Low Resolution < 0.0001
AEROMamba-PAQM vs. Low Resolution < 0.0001
AERO vs. Low Resolution < 0.0001
AudioSR vs. AEROMamba-PAQM 0.2178

PianoEval (ViSQOL)

PianoEval ViSQOL pairwise p-values
Comparison p-value
Low Res vs. AERO < 0.0001
Low Res vs. AERO-HQ < 0.0001
Low Res vs. AEROMamba 0.2989
Low Res vs. AEROMamba-HQ < 0.0001
Low Res vs. AEROMamba-PAQM < 0.0001
Low Res vs. AEROMamba-PAQM-HQ < 0.0001
Low Res vs. AudioSR < 0.0001
AERO vs. AERO-HQ 0.1077
AERO vs. AEROMamba < 0.0001
AERO vs. AEROMamba-HQ 0.0215
AERO vs. AEROMamba-PAQM 0.6028
AERO vs. AEROMamba-PAQM-HQ 0.6917
AERO vs. AudioSR < 0.0001
AERO-HQ vs. AEROMamba < 0.0001
AERO-HQ vs. AEROMamba-HQ 0.0002
AERO-HQ vs. AEROMamba-PAQM 0.2574
AERO-HQ vs. AEROMamba-PAQM-HQ 0.2083
AERO-HQ vs. AudioSR < 0.0001
AEROMamba vs. AEROMamba-HQ < 0.0001
AEROMamba vs. AEROMamba-PAQM < 0.0001
AEROMamba vs. AEROMamba-PAQM-HQ < 0.0001
AEROMamba vs. AudioSR < 0.0001
AEROMamba-HQ vs. AEROMamba-PAQM 0.1172
AEROMamba-HQ vs. AEROMamba-PAQM-HQ 0.0806
AEROMamba-HQ vs. AudioSR < 0.0001
AEROMamba-PAQM vs. AEROMamba-PAQM-HQ 0.8168
AEROMamba-PAQM vs. AudioSR < 0.0001
AEROMamba-PAQM-HQ vs. AudioSR < 0.0001

Audio Examples: MUSDB

Tracks upsampled from 11.025 kHz to 44.1 kHz

Audio samples comparing Low-Res, High-Res (Ground Truth), AERO, AEROMamba, and AEROMamba-PAQM.
Track Original (Low-Res, 11.025 kHz) Original (High-Res, 44.1 kHz) AERO (11.025 → 44.1 kHz) AEROMamba (11.025 → 44.1 kHz) AEROMamba-PAQM (11.025 → 44.1 kHz)
459
480
826
625

Results: Super-resolution of Heavily Compressed Audio

Objective and subjective scores for low-bitrate (MP3, 32 kbps) signals and various models evaluated on MUSDB and PianoEval.

MUSDB Results

Objective and Subjective (Score) metrics for MUSDB (MP3, 32 kbps)
System ViSQOL ↑ LSD ↓ Score ↑
Low-Bitrate 1.80 2.02 50.7
AEROMamba 2.45 1.24 49.8
AEROMamba-PAQM 2.99 1.27 49.7
AEROMamba-PAQM++ 2.90 1.23 75.6

PianoEval Results

Objective and Subjective (Score) metrics for PianoEval (MP3, 32 kbps)
System ViSQOL ↑ LSD ↓ Score ↑
Low-Bitrate 4.35 2.33 69.5
AEROMamba 4.22 1.14 83.4
AEROMamba-PAQM 4.24 1.12 84.1
AEROMamba-PAQM++ 4.41 1.13 85.5
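
The 52% improvement quoted in the abstract appears to correspond to the MUSDB subjective scores above (49.7 for AEROMamba-PAQM versus 75.6 for AEROMamba-PAQM++). This mapping is our reading of the tables, not something the page states explicitly. A short Python check, with illustrative names:

```python
# Subjective scores copied from the MP3 32 kbps tables above.
mp3_scores = {
    "MUSDB": {"AEROMamba-PAQM": 49.7, "AEROMamba-PAQM++": 75.6},
    "PianoEval": {"AEROMamba-PAQM": 84.1, "AEROMamba-PAQM++": 85.5},
}

# Relative gain of AEROMamba-PAQM++ over AEROMamba-PAQM on each dataset.
gains = {
    dataset: s["AEROMamba-PAQM++"] / s["AEROMamba-PAQM"] - 1.0
    for dataset, s in mp3_scores.items()
}

for dataset, gain in gains.items():
    print(f"{dataset}: {gain:+.1%}")
```

The MUSDB gain is about 52%, whereas the PianoEval gain is under 2%, in line with the non-significant p-value (0.7034) reported for that pair in the PianoEval subjective comparison.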

Subjective Score Distributions

MUSDB

Violin plot of subjective scores for the MUSDB compressed experiment.
(A) Low-Bit. (B) AEROMamba (C) AEROMamba-PAQM (D) AEROMamba-PAQM++

PianoEval

Violin plot of subjective scores for the PianoEval compressed experiment.
(A) Low-Bit. (B) AEROMamba (C) AEROMamba-PAQM (D) AEROMamba-PAQM++

Statistical Tests (Mann-Whitney U)

Pairwise comparisons of subjective scores (p-values). Values < 0.05 are considered statistically significant.

MUSDB (Subjective)

MUSDB pairwise p-values (MP3)
Comparison p-value
Low Res vs. AEROMamba 0.7390
Low Res vs. AEROMamba-PAQM 0.8233
Low Res vs. AEROMamba-PAQM++ < 0.0001
AEROMamba vs. AEROMamba-PAQM 0.8751
AEROMamba vs. AEROMamba-PAQM++ < 0.0001
AEROMamba-PAQM vs. AEROMamba-PAQM++ < 0.0001

PianoEval (Subjective)

PianoEval pairwise p-values (MP3)
Comparison p-value
Low Res vs. AEROMamba < 0.0001
Low Res vs. AEROMamba-PAQM < 0.0001
Low Res vs. AEROMamba-PAQM++ < 0.0001
AEROMamba vs. AEROMamba-PAQM 0.9193
AEROMamba vs. AEROMamba-PAQM++ 0.7168
AEROMamba-PAQM vs. AEROMamba-PAQM++ 0.7034

Pairwise comparisons of ViSQOL scores (p-values).

MUSDB (ViSQOL)

MUSDB ViSQOL (MP3) pairwise p-values
Comparison p-value
Low Res vs. AEROMamba < 0.0001
Low Res vs. AEROMamba-PAQM < 0.0001
Low Res vs. AEROMamba-PAQM++ < 0.0001
AEROMamba vs. AEROMamba-PAQM < 0.0001
AEROMamba vs. AEROMamba-PAQM++ < 0.0001
AEROMamba-PAQM vs. AEROMamba-PAQM++ < 0.0001

PianoEval (ViSQOL)

PianoEval ViSQOL (MP3) pairwise p-values
Comparison p-value
Low Res vs. AEROMamba < 0.0001
Low Res vs. AEROMamba-PAQM < 0.0001
Low Res vs. AEROMamba-PAQM++ 0.0052
AEROMamba vs. AEROMamba-PAQM 0.3122
AEROMamba vs. AEROMamba-PAQM++ < 0.0001
AEROMamba-PAQM vs. AEROMamba-PAQM++ < 0.0001

Audio Examples: MUSDB

Tracks restored from 32 kbps MP3 to 44.1 kHz

Audio samples comparing Low-Bitrate (MP3), High-Res (Ground Truth), and AEROMamba variants.
Track Low-Bitrate (MP3, 32 kbps) High-Res (44.1 kHz) AEROMamba AEROMamba-PAQM AEROMamba-PAQM++
459
480
826
625

PianoEval Dataset Metadata

We collected the PianoEval dataset, which consists of two parts. The first comprises the 24 Preludes for Piano, Op. 28, by Chopin, performed by 33 pianists in 45 different recordings available on CD (Compact Disc), totaling approximately 22 hours. The second part contains excerpts of Ligeti piano études, a Schumann sonata, and the Barber sonata, each played by a different performer, totaling approximately 3.5 hours. Each file is stored in WAV format, in stereo, sampled at 44.1 kHz. The performers, record labels, and recording years are detailed in the tables below.

Train/Validation

Train/Validation split for PianoEval dataset
Pianist Record label Year
Arrau, C. Columbia 1950/1
Arrau, C. Philips 1973
Argerich, M. Deutsche Grammophon 1975
Ashkenazy, V. Decca 1976
Ashkenazy, V. Decca 1992
Bolet, J. RCA 1974
Blechacz, R. Deutsche Grammophon 2007
Cherkassky, S. ASV 1968
Cortot, A. HMV 1926
Cortot, A. HMV 1933/4
Cortot, A. Gramophone 1942
Cortot, A. Archipel [live] 1955
Cortot, A. EMI 1957
Davidovich, B. Decca 1979
de Larrocha, A. Decca 1974
Duchable, F. Erato 1988
Dutra, G. Yellow Tail 1997
El Bacha, A. R. Forlane 1999
François, S. EMI 1959
Freire, N. Columbia 1970
Harasiewicz, A. Philips 1963
Katsaris, C. Sony 1992
Kissin, Y. RCA 1999
Lima, A. M. Caras¹ 1981
Lucchesini, A. EMI 1988²
Magaloff, N. Philips 1975
Novaes, G. Music and Arts [live] 1949
Ohlsson, G. EMI 1974
Ohlsson, G. Hyperion 1989
Perahia, M. Columbia 1975
Petri, E. Columbia 1942
Pires, M. Erato 1975
Pires, M. Deutsche Grammophon 1992
Pogorelich, I. Deutsche Grammophon 1989
Pollini, M. Deutsche Grammophon 1974
Pollini, M. Deutsche Grammophon 2011
Proença, M. Delphos 1999
Rubinstein, A. RCA 1946
Switala, W. NIFC 2006/7
Tiempo, S. Victor 1990
Varsi, D. Genuin 1988

¹ Refers to a magazine.
² Refers to the release year, not the recording year.

Test

Test split for PianoEval dataset
Pianist Record label Year
B. Glemser Naxos 1993
D. Pollack Naxos 1995
P. L. Aimard Sony 1995

Subjective Test Tracklist

The following table maps the Question IDs (QID) used during the subjective listening tests to the corresponding audio tracks from the MUSDB and PianoEval datasets.

Question IDs and corresponding track names.
QID Track QID Track
1 electronic01 13 electronic02
2 rock01 14 rock02
3 pop01 15 pop02
4 hiphop01 16 hiphop02
5 latin01 17 reggae01
6 other01 18 other02
7 02Barber 19 04Barber
8 14Ligeti 20 17Ligeti
9 05Ligeti 21 15Ligeti
10 07Barber 22 08Barber
11 03Schumann 23 04Schumann
12 02Schumann 24 15Schumann