Description
While working with the DFN3 training and model implementation, I came across a couple of points regarding the local SNR-based switching mechanism that I'd like to ask for clarification on.
1. About the lsnr_dropout Configuration Flag
I noticed the lsnr_dropout flag in the training configuration. This setting defaults to False and is also explicitly set to False in all of the provided DFN3 training config.ini files.
Was this feature actively used during the final experiments reported in the paper?
2. Discrepancy in LSNR-based Decoder Switching Logic
The DFN3 paper describes a multi-stage switching mechanism based on the predicted local SNR. Specifically, the paper states:
> We further predict the local SNR ξ ∈ [−15, 35] dB within the encoder network on frame level. This allows to completely disable the ERB or the DF decoder depending on the current noise conditions. That is, we define the following criteria:
>
> - ξ < −10 dB: Disable both decoders, return silent spectrum.
> - ξ > 20 dB: Disable DF decoder. Only low noise condition, enhancing the periodicity is not necessary.
> - else: Run all stages for best noise reduction.
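For reference, the paper's three-way criterion could be sketched as follows. This is a minimal illustration of the decision logic only, not the repository's implementation; the function name and tensor layout are my own.

```python
import torch

def lsnr_gate(lsnr: torch.Tensor):
    """Per-frame decoder gating as described in the DFN3 paper.

    Returns three boolean masks over the time axis:
      silence  - frames where both decoders are disabled (xi < -10 dB)
      erb_only - frames where the DF decoder is disabled (xi > 20 dB)
      full     - frames where both the ERB and DF stages run
    """
    silence = lsnr < -10.0
    erb_only = lsnr > 20.0
    full = ~(silence | erb_only)
    return silence, erb_only, full

# Example: three frames, one in each regime
lsnr = torch.tensor([-12.0, 5.0, 25.0])
silence, erb_only, full = lsnr_gate(lsnr)
```

Note that every frame falls into exactly one of the three masks, which is what I would expect the inference-time switching to look like.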
When reviewing the implementation, I was able to find the -10 dB threshold being used to create a boolean mask for active frames. Here is the relevant code snippet I'm looking at:
In `deepfilternet3.DfNet.forward()`:

```python
if self.lsnr_droput:
    idcs = lsnr.squeeze() > -10.0
    b, t = (spec.shape[0], spec.shape[2])
    m = torch.zeros((b, 1, t, self.erb_bins), device=spec.device)
    df_coefs = torch.zeros((b, t, self.nb_df, self.df_order * 2))
    spec_m = spec.clone()
    emb = emb[:, idcs]
    e0 = e0[:, :, idcs]
    e1 = e1[:, :, idcs]
    e2 = e2[:, :, idcs]
    e3 = e3[:, :, idcs]
    c0 = c0[:, :, idcs]
if self.run_erb:
    if self.lsnr_droput:
        m[:, :, idcs] = self.erb_dec(emb, e3, e2, e1, e0)
    else:
        m = self.erb_dec(emb, e3, e2, e1, e0)
    spec_m = self.mask(spec, m)
else:
    m = torch.zeros((), device=spec.device)
    spec_m = torch.zeros_like(spec)
if self.run_df:
    if self.lsnr_droput:
        df_coefs[:, idcs] = self.df_dec(emb, c0)
    else:
        df_coefs = self.df_dec(emb, c0)
    df_coefs = self.df_out_transform(df_coefs)
    spec_e = self.df_op(spec.clone(), df_coefs)
    spec_e[..., self.nb_df :, :] = spec_m[..., self.nb_df :, :]
else:
    df_coefs = torch.zeros((), device=spec.device)
    spec_e = spec_m
```
`idcs = lsnr.squeeze() > -10.0` creates a boolean mask over the time axis that selects every frame whose LSNR is greater than −10 dB. However, the same `idcs` is then used to scatter the DF coefficients, so all frames above −10 dB are sent through both decoders, while frames below the threshold are zeroed out. The DF decoder is therefore never disabled for high-SNR frames; the ξ > 20 dB criterion from the paper does not appear anywhere in the code.
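To make the discrepancy concrete, here is a minimal check (with a hypothetical per-frame LSNR tensor) showing that the single −10 dB mask leaves the DF branch active even for frames above 20 dB:

```python
import torch

# Hypothetical per-frame local SNR values in dB
lsnr = torch.tensor([-12.0, 5.0, 25.0])

# Mask as computed in the snippet above: one threshold shared by both decoders
idcs = lsnr > -10.0

# Frames that actually get DF coefficients under the current code
df_active = idcs

# Frames that should get DF coefficients under the paper's criteria
# (active only between -10 dB and 20 dB)
df_per_paper = (lsnr > -10.0) & (lsnr <= 20.0)
```

The 25 dB frame is selected by `idcs` and thus processed by the DF decoder, even though the paper states the DF stage should be disabled above 20 dB.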
I assume this could cause some artefacts, perhaps like the one mentioned here: #657
Also, if lsnr_dropout isn't used, why train the lsnr_fc layer at all? The LSNR loss only serves to improve lsnr_fc, but if its output is never used at inference time, what is the purpose of training it?