Efficient Audio Super-Resolution with a Differentiable Psychoacoustic Loss

Abstract

Audio enhancement consists of improving the perceived quality of audio signals. Initially, with the aim of addressing bandwidth extension, this work proposes $\textrm{AEROMamba}_{\textrm{P}}$, an efficient variant of the $\textrm{AERO}$ super-resolution architecture where attention and LSTM layers are replaced by the Mamba state-space model, and which incorporates a newly developed differentiable perceptual loss derived from the Perceptual Audio Quality Measure (PAQM). During training, the architecture requires approximately 2–4x less GPU memory than the baseline; during inference, it achieves a 14x speedup while using only one-fifth of the GPU memory. When upsampling both a piano dataset and MUSDB18 from 11.025 kHz to 44.1 kHz, subjective listening tests show that $\textrm{AEROMamba}_{\textrm{P}}$ outperforms $\textrm{AERO}$ by 15% in perceived quality scores. Next, to handle the enhancement of audio signals that have been highly compressed by lossy coding, it is further proposed $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$, which applies the same framework but replaces STFT reconstruction losses with the PAQM loss, specifically to enhance MP3 encoded audio at 32 kbps. In listening evaluations, $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$ achieves 52% higher quality rating than $\textrm{AEROMamba}_{\textrm{P}}$ when restoring compressed audio. These results demonstrate that PAQM-driven training coupled with lightweight state-space modeling yields high perceptual quality and computational efficiency in both band-limited and compressed audio scenarios.

Results: Super-resolution of Bandlimited Audio

Results for the MUSDB and PianoEval datasets comparing ViSQOL, LSD, and subjective scores, as well as performance metrics on a NVIDIA RTX 3090 GPU for 10-second samples.

MUSDB Results

Objective and Subjective (Score) metrics for MUSDB (bandlimited)
Model	ViSQOL ↑	LSD ↓	Score ↑
$\textrm{Low-Resolution}$	1.82	3.98	38.22
$\textrm{AERO}$	2.90	1.34	60.03
$\textrm{AEROMamba}$	2.93	1.23	66.47
$\textrm{AEROMamba}_{\textrm{P}}$	3.04	1.19	79.26
$\textrm{AudioSR}$	3.01	-	-

PianoEval Results

Objective and Subjective (Score) metrics for PianoEval
Model	ViSQOL ↑	LSD ↓	Score ↑
$\textrm{Low-Resolution}$	4.36	1.09	72.92
$\textrm{AERO}$	4.38	0.99	76.89
$\textrm{AEROMamba}-{\textrm{HQ}}$	4.38	1.00	84.41
$\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$	4.41	0.90	78.76

Models labeled with $\textrm{-HQ}$ were trained on PianoEval-HQ.

Performance Comparison (NVIDIA RTX 3090)

Inference performance for a 10-second audio sample.
Method	GPU Usage (MB)	Time (s)	Parameters
$\textrm{AERO}$	17091	1.246	19,432,958
$\textrm{AEROMamba}$	3000	0.087	20,964,190

Subjective Score Distributions

MUSDB

PianoEval

Statistical Tests (Mann-Whitney U)

Pairwise comparisons of subjective scores (p-values). Values < 0.05 are considered statistically significant.

MUSDB (Subjective)

MUSDB pairwise p-values
Comparison	p-value
$\textrm{Low-Resolution}$ vs. $\textrm{AEROMamba}$	< 0.0001
$\textrm{Low-Resolution}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$	< 0.0001
$\textrm{Low-Resolution}$ vs. $\textrm{AERO}$	< 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$	< 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AERO}$	0.0089
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{AERO}$	< 0.0001

PianoEval (Subjective)

PianoEval pairwise p-values
Comparison	p-value
$\textrm{Low-Resolution}$ vs. $\textrm{AEROMamba}-{\textrm{HQ}}$	< 0.0001
$\textrm{Low-Resolution}$ vs. $\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$	0.0587
$\textrm{Low-Resolution}$ vs. $\textrm{AERO}$	0.3399
$\textrm{AEROMamba}-{\textrm{HQ}}$ vs. $\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$	0.0101
$\textrm{AEROMamba}-{\textrm{HQ}}$ vs. $\textrm{AERO}$	0.0003
$\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$ vs. $\textrm{AERO}$	0.2975

Pairwise comparisons of ViSQOL scores (p-values).

MUSDB (ViSQOL)

MUSDB ViSQOL pairwise p-values
Comparison	p-value
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$	< 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AERO}$	0.0007
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{AERO}$	< 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{Low-Resolution}$	< 0.0001
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{Low-Resolution}$	< 0.0001
$\textrm{AERO}$ vs. $\textrm{Low-Resolution}$	< 0.0001
$\textrm{AudioSR}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$	0.2178

PianoEval (ViSQOL)

PianoEval ViSQOL pairwise p-values
Comparison	p-value
$\textrm{Low-Resolution}$ vs. $\textrm{AERO}$	< 0.0001
$\textrm{Low-Resolution}$ vs. $\textrm{AERO-HQ}$	< 0.0001
$\textrm{Low-Resolution}$ vs. $\textrm{AEROMamba}$	0.2989
$\textrm{Low-Resolution}$ vs. $\textrm{AEROMamba}-{\textrm{HQ}}$	< 0.0001
$\textrm{Low-Resolution}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$	< 0.0001
$\textrm{Low-Resolution}$ vs. $\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$	< 0.0001
$\textrm{Low-Resolution}$ vs. $\textrm{AudioSR}$	< 0.0001
$\textrm{AERO}$ vs. $\textrm{AERO-HQ}$	0.1077
$\textrm{AERO}$ vs. $\textrm{AEROMamba}$	< 0.0001
$\textrm{AERO}$ vs. $\textrm{AEROMamba}-{\textrm{HQ}}$	0.0215
$\textrm{AERO}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$	0.6028
$\textrm{AERO}$ vs. $\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$	0.6917
$\textrm{AERO}$ vs. $\textrm{AudioSR}$	< 0.0001
$\textrm{AERO-HQ}$ vs. $\textrm{AEROMamba}$	< 0.0001
$\textrm{AERO-HQ}$ vs. $\textrm{AEROMamba}-{\textrm{HQ}}$	0.0002
$\textrm{AERO-HQ}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$	0.2574
$\textrm{AERO-HQ}$ vs. $\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$	0.2083
$\textrm{AERO-HQ}$ vs. $\textrm{AudioSR}$	< 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}-{\textrm{HQ}}$	< 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$	< 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$	< 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AudioSR}$	< 0.0001
$\textrm{AEROMamba}-{\textrm{HQ}}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$	0.1172
$\textrm{AEROMamba}-{\textrm{HQ}}$ vs. $\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$	0.0806
$\textrm{AEROMamba}-{\textrm{HQ}}$ vs. $\textrm{AudioSR}$	< 0.0001
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$	0.8168
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{AudioSR}$	< 0.0001
$\textrm{AEROMamba}_{\textrm{P}}-\textrm{HQ}$ vs. $\textrm{AudioSR}$	< 0.0001

Audio Examples: MUSDB

Tracks upsampled from 11.025kHz to 44.1kHz

Audio samples comparing $\textrm{Low-Resolution}$, $\textrm{Original (High-Res)}$, $\textrm{AERO}$, $\textrm{AEROMamba}$, and $\textrm{AEROMamba}_{\textrm{P}}$.
Track	Original ($\textrm{Low-Res}$) 11.025 kHz	Original ($\textrm{High-Res}$) 44.1 kHz	$\textrm{AERO}$ 11.025 → 44.1 kHz	$\textrm{AEROMamba}$ 11.025 → 44.1 kHz	$\textrm{AEROMamba}_{\textrm{P}}$ 11.025 → 44.1 kHz
459
480
826
625

Results: Enhancement of Heavily Compressed Audio

Objective and subjective scores for low-bitrate (MP3 32kbps) signals and various models evaluated on MUSDB and PianoEval.

MUSDB Results

Objective and Subjective (Score) metrics for MUSDB (MP3 32kbps)
System	ViSQOL ↑	LSD ↓	Score ↑
$\textrm{Low-Bitrate}$	1.80	2.02	50.7
$\textrm{AEROMamba}$	2.45	1.24	49.8
$\textrm{AEROMamba}_{\textrm{P}}$	2.99	1.27	49.7
$\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$	2.90	1.23	75.6

PianoEval Results

Objective and Subjective (Score) metrics for PianoEval (MP3 32kbps)
System	ViSQOL ↑	LSD ↓	Score ↑
$\textrm{Low-Bitrate}$	4.35	2.33	69.5
$\textrm{AEROMamba}$	4.22	1.14	83.4
$\textrm{AEROMamba}_{\textrm{P}}$	4.24	1.12	84.1
$\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$	4.41	1.13	85.5

Subjective Score Distributions

MUSDB

PianoEval

Statistical Tests (Mann-Whitney U)

Pairwise comparisons of subjective scores (p-values). Values < 0.05 are considered statistically significant.

MUSDB (Subjective)

MUSDB pairwise p-values (MP3)
Comparison	p-value
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}$	0.7390
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$	0.8233
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$	< 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$	0.8751
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$	< 0.0001
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$	< 0.0001

PianoEval (Subjective)

PianoEval pairwise p-values (MP3)
Comparison	p-value
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}$	< 0.0001
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$	< 0.0001
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$	< 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$	0.9193
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$	0.7168
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$	0.7034

Pairwise comparisons of ViSQOL scores (p-values).

MUSDB (ViSQOL)

MUSDB ViSQOL (MP3) pairwise p-values
Comparison	p-value
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}$	< 0.0001
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$	< 0.0001
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$	< 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$	< 0.0001
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$	< 0.0001
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$	< 0.0001

PianoEval (ViSQOL)

PianoEval ViSQOL (MP3) pairwise p-values
Comparison	p-value
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}$	< 0.0001
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$	< 0.0001
$\textrm{Low-Bitrate}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$	0.0052
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P}}$	0.3122
$\textrm{AEROMamba}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$	< 0.0001
$\textrm{AEROMamba}_{\textrm{P}}$ vs. $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$	< 0.0001

Audio Examples: MUSDB (Compressed)

Tracks restored from 32kbps MP3 to 44.1kHz

Audio samples comparing $\textrm{Low-Bitrate}$, $\textrm{Original (High-Res)}$, $\textrm{AEROMamba}$, $\textrm{AEROMamba}_{\textrm{P}}$, and $\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$.
Track	Low-Bitrate MP3 32kbps	High-Res 44.1 kHz	$\textrm{AEROMamba}$	$\textrm{AEROMamba}_{\textrm{P}}$	$\textrm{AEROMamba}_{\textrm{P} \bar{\textrm{S}}}$
459
480
826
625

Extra: Super-resolution of Degraded Historical Recordings

Super-resolution of Alfred Cortot's historical piano recordings. This section showcases the difference between the predictions of super-resolution of bandlimited audio models when trained with noisy or only high-quality recordings. The input tracks contain much more noise than any of the training data from both HQ and non-HQ datasets.

Historical recordings from 1925 and 1942 processed through various model configurations.
Recording	Low Resolution	High Resolution	$\textrm{AEROMamba}$	$\textrm{AEROMamba}\text{-}\textrm{HQ}$	$\textrm{AEROMamba}_{\textrm{P}}$	$\textrm{AEROMamba}_{\textrm{P}}\text{-}\textrm{HQ}$
Cortot (1925)
Cortot (1942)

PianoEval Dataset Metadata

The collected PianoEval dataset consists of two parts. The first is composed of the 24 Preludes for Piano, op. 28, by Chopin performed by 33 pianists in 45 different recordings available on CD (Compact Disc), totaling approximately 22 hours. The second part contains excerpts of Ligeti piano études, a Schumann sonata, and the Barber sonata, played by three different performers, respectively, totaling approximately 3.5 hours. Information about performers, record label and year of recording are detailed in the tables below.

Train/Validation

Train/Validation split for PianoEval dataset
Pianist	Label	Year
Arrau, C.	Columbia	1950/1
Arrau, C.	Philips	1973
Argerich, M.	DG	1975
Ashkenazy, V.	Decca	1976
Ashkenazy, V.	Decca	1992
Bolet, J.	RCA	1974
Blechacz, R.	DG	2007
Cherkassky, S.	ASV	1968
Cortot, A.	HMV	1926
Cortot, A.	HMV	1933/4
Cortot, A.	Gramophone	1942
Cortot, A.	Archipel	1955
Cortot, A.	EMI	1957
Davidovich, B.	Decca	1979
de Larrocha, A.	Decca	1974
Duchable, F.	Erato	1988
Dutra, G.	Yellow Tail	1997
El Bacha, A. R.	Forlane	1999
François, S.	EMI	1959
Freire, N.	Columbia	1970
Harasiewicz, A.	Philips	1963
Katsaris, C.	Sony	1992
Kissin, Y.	RCA	1999
Lima, A. M.	Caras	1981
Lucchesini, A.	EMI	1988
Magaloff, N.	Philips	1975
Novaes, G.	M&A	1949
Ohlsson, G.	EMI	1974
Ohlsson, G.	Hyperion	1989
Perahia, M.	Columbia	1975
Petri, E.	Columbia	1942
Pires, M.	Erato	1975
Pires, M.	DG	1992
Pogorelich, I.	DG	1989
Pollini, M.	DG	1974
Pollini, M.	DG	2011
Proença, M.	Delphos	1999
Rubinstein, A.	RCA	1946
Switala, W.	NIFC	2006/7
Tiempo, S.	Victor	1990
Varsi, D.	Genuin	1988

Test

Test split for PianoEval dataset
Pianist	Label	Year
B. Glemser	Naxos	1993
D. Pollack	Naxos	1995
P. L. Aimard	Sony	1995

Subjective Test Tracklist

The following table maps the Question IDs (QID) used during the subjective listening tests to the corresponding audio tracks from the MUSDB and PianoEval datasets.

Question IDs and corresponding track names.
QID	Track	QID	Track
1	electronic01	13	electronic02
2	rock01	14	rock02
3	pop01	15	pop02
4	hiphop01	16	hiphop02
5	latin01	17	reggae01
6	other01	18	other02
7	02Barber	19	04Barber
8	14Ligeti	20	17Ligeti
9	05Ligeti	21	15Ligeti
10	07Barber	22	08Barber
11	03Schumann	23	04Schumann
12	02Schumann	24	15Schumann

Efficient Audio Enhancement with a Differentiable Psychoacoustic Loss

Abstract

Results: Super-resolution of Bandlimited Audio

MUSDB Results

PianoEval Results

Performance Comparison (NVIDIA RTX 3090)

Subjective Score Distributions

MUSDB

PianoEval

MUSDB (Subjective)

PianoEval (Subjective)

MUSDB (ViSQOL)

PianoEval (ViSQOL)

Audio Examples: MUSDB

Results: Enhancement of Heavily Compressed Audio

MUSDB Results

PianoEval Results

Subjective Score Distributions

MUSDB

PianoEval

MUSDB (Subjective)

PianoEval (Subjective)

MUSDB (ViSQOL)

PianoEval (ViSQOL)

Audio Examples: MUSDB (Compressed)

Extra: Super-resolution of Degraded Historical Recordings

Train/Validation

Test