Abstracts of papers published in 2025
NeurIPS 2025
Dr. RAW: Towards General High-Level Vision from RAW with Efficient Task Conditioning.
Wenjun Huang*, Ziteng Cui* (*: co-first author), Yinqiang Zheng, Yirui He, Tatsuya Harada, Mohsen Imani
We introduce Dr.RAW, a unified and tuning-efficient framework for high-level computer vision tasks directly operating on camera RAW data. Unlike previous approaches that optimize image signal processing (ISP) pipelines and fully fine-tune networks for each task, Dr.RAW achieves state-of-the-art performance with minimal parameter updates and frozen backbone weights. At the input stage, we apply lightweight pre-processing steps, including sensor and illumination mapping, along with re-mosaicing, to mitigate data inconsistencies stemming from sensor variations and lighting conditions. At the network level, we introduce task-specific adaptation through two modules: Sensor Prior Prompts (SPP) and task-specific Low-Rank Adaptation (LoRA). SPP injects sensor-aware conditioning into the network via learnable prompts derived from RAW pixel distribution priors, while LoRA enables efficient task-specific tuning by updating only low-rank matrices in key backbone layers. Despite minimal tuning, Dr.RAW delivers superior results across four RAW-based tasks (object detection, semantic segmentation, instance segmentation, and pose estimation) on nine datasets encompassing various light conditions. By harnessing the intrinsic physical cues of RAW alongside parameter-efficient techniques, Dr.RAW advances RAW-based vision systems, achieving both high accuracy and computational economy. We will release our source code.
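The LoRA component described above can be illustrated with a minimal sketch: a frozen linear layer augmented with a trainable low-rank update, so that only the small rank-r matrices receive gradients. This is a generic PyTorch illustration under our own naming (`LoRALinear`, `rank`, `alpha`); it is not the authors' implementation, nor their choice of which backbone layers to adapt.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # backbone weights stay frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # frozen path + low-rank, task-specific path
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

# Usage: wrap selected layers of a frozen backbone, then train only the LoRA parameters.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
out = layer(torch.randn(4, 768))
```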
SIGGRAPH Asia 2025
ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model
Xuangeng Chu, Nabarun Goswami, Ziteng Cui, Hanqin Wang, Tatsuya Harada
Speech-driven 3D facial animation aims to generate realistic lip movements and facial expressions for 3D head models from audio clips. Although existing diffusion-based methods are capable of producing natural motions, their slow generation speed limits their application potential. In this paper, we introduce a novel autoregressive model that achieves real-time generation of highly synchronized lip movements and realistic head poses and eye blinks by learning a mapping from speech to a multi-scale motion codebook. Furthermore, our model can adapt to unseen speaking styles using sample motion sequences, enabling the creation of 3D talking avatars with unique personal styles beyond the identities seen during training. Extensive evaluations and user studies demonstrate that our method outperforms existing approaches in lip synchronization and perceived quality.
Pacific Graphics 2025
Improved 3D Scene Stylization via Text-Guided Generative Image Editing with Region-Based Control
Haruo Fujiwara, Yusuke Mukuta, Tatsuya Harada
Recent advances in text-driven 3D scene editing and stylization, which leverage the powerful capabilities of 2D generative models, have demonstrated promising outcomes. However, challenges remain in ensuring high-quality stylization and view consistency simultaneously. Moreover, applying style consistently to different regions or objects in the scene with semantic correspondence is a challenging task. To address these limitations, we introduce techniques that enhance the quality of 3D stylization while maintaining view consistency and providing optional region-controlled style transfer. Our method achieves stylization by re-training an initial 3D representation using stylized multi-view 2D images of the source views. Therefore, ensuring both style consistency and view consistency of stylized multi-view images is crucial. We achieve this by extending the style-aligned depth-conditioned view generation framework, replacing the fully shared attention mechanism with a single reference-based attention-sharing mechanism, which effectively aligns style across different viewpoints. Additionally, inspired by recent 3D inpainting methods, we utilize a grid of multiple depth maps as a single-image reference to further strengthen view consistency among stylized images. Finally, we propose Multi-Region Importance-Weighted Sliced Wasserstein Distance Loss, allowing styles to be applied to distinct image regions using segmentation masks from off-the-shelf models. We demonstrate that this optional feature enhances the faithfulness of style transfer and enables the mixing of different styles across distinct regions of the scene. Experimental evaluations, both qualitative and quantitative, demonstrate that our pipeline effectively improves the results of text-driven 3D stylization.
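The region-controlled style loss can be sketched as a sliced Wasserstein distance computed separately per segmentation mask and summed with per-region weights. The code below is a simplified, generic version of that idea in PyTorch; the function names, the quantile-based length matching, and the use of one style feature set per region are our own simplifications, not the paper's importance-weighted formulation.

```python
import torch

def sliced_wasserstein(feat_a, feat_b, n_proj=64):
    """Sliced Wasserstein distance between point sets of shape (N, C) and (M, C)."""
    c = feat_a.shape[1]
    proj = torch.randn(c, n_proj, device=feat_a.device)
    proj = proj / proj.norm(dim=0, keepdim=True)       # random unit directions
    pa = (feat_a @ proj).sort(dim=0).values             # sorted 1D projections
    pb = (feat_b @ proj).sort(dim=0).values
    n = min(pa.shape[0], pb.shape[0])                   # crude quantile matching
    idx_a = torch.linspace(0, pa.shape[0] - 1, n).long()
    idx_b = torch.linspace(0, pb.shape[0] - 1, n).long()
    return ((pa[idx_a] - pb[idx_b]) ** 2).mean()

def region_style_loss(feats, style_feats_per_region, masks, weights):
    """Sum of per-region terms; `masks` are flattened boolean masks over pixels."""
    loss = 0.0
    for mask, style_feats, w in zip(masks, style_feats_per_region, weights):
        loss = loss + w * sliced_wasserstein(feats[mask], style_feats)
    return loss
```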
24th International Conference on Humanoid Robots (Humanoids 2025)
Few-shot Imitation Learning by Variable-length Trajectory Retrieval from a Large and Diverse Dataset
Tomoyuki Araki, Yusuke Mukuta, Takayuki Osa, Tatsuya Harada
Imitation learning offers an effective framework for enabling robots to acquire complex skills, but typically requires a large number of labeled demonstrations, making data collection costly. In contrast, large-scale unlabeled datasets containing diverse trajectories can be obtained more easily, motivating methods that leverage such data to enable learning from few demonstrations. A promising approach in this direction is to retrieve trajectories similar to demonstrations from unlabeled datasets. However, existing retrieval-based methods often rely on frame-level comparisons or fixed-length trajectory embeddings, limiting their ability to capture the temporal structure of behaviors and to generalize across different execution speeds.
To address these limitations, this paper proposes a method that embeds variable-length trajectories using Self-Attention to capture sequence information and robustly retrieve semantically similar behaviors. Using a small number of demonstrations, relevant trajectories are extracted from unlabeled datasets and used alongside demonstrations to train the agent, thereby reducing data collection costs.
Experiments on large-scale, diverse datasets demonstrate that the proposed method achieves higher retrieval accuracy and task success rates than existing retrieval-based imitation learning methods under realistic conditions.
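The core mechanism can be sketched as a self-attention encoder that pools variable-length trajectories into fixed-size embeddings, followed by cosine-similarity retrieval against the demonstration embeddings. This is a generic PyTorch sketch under our own names (`TrajectoryEncoder`, `retrieve`) and is not the paper's architecture or retrieval criterion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryEncoder(nn.Module):
    """Embed variable-length trajectories with self-attention and mean pooling."""
    def __init__(self, obs_dim, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, traj, pad_mask):
        # traj: (B, T, obs_dim); pad_mask: (B, T), True where padded
        h = self.encoder(self.proj(traj), src_key_padding_mask=pad_mask)
        h = h.masked_fill(pad_mask.unsqueeze(-1), 0.0)
        lengths = (~pad_mask).sum(dim=1, keepdim=True).clamp(min=1)
        return h.sum(dim=1) / lengths                   # mean over valid steps

def retrieve(demo_emb, dataset_emb, k=100):
    """Indices of the k unlabeled trajectories most similar to any demonstration."""
    sim = F.normalize(dataset_emb, dim=-1) @ F.normalize(demo_emb, dim=-1).T
    return sim.max(dim=1).values.topk(k).indices
```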
The 28th Meeting on Image Recognition and Understanding (MIRU2025), Oral Presentation, Peer-Reviewed
A Study on Scalable Construction Methods for High-Quality Japanese Multimodal Datasets
Toshiki Katsube, Taiga Fukuhara, Kohei Uehara, Kenichiro Ando, Yusuke Mukuta, Tatsuya Harada
In recent years, advances in machine learning have led to remarkable progress in Vision & Language (V&L) models, which integrate and process visual and linguistic information. Training V&L models requires V&L datasets of paired images and text, but for languages other than English, such as Japanese, such datasets are lacking in both quantity and quality.
In this work, we propose a scalable method for constructing a high-quality image captioning dataset that reflects knowledge and culture specific to Japan. The proposed method consists of three stages: downloading images and their alt text, detecting the objects contained in the images, and refining the alt text with an LLM. Refining the text with an LLM yields a dataset of higher quality than previous automatically constructed datasets. We trained V&L models on the constructed dataset and confirmed that it contributes to improving their Japanese-language performance.
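The three-stage pipeline can be sketched as plain Python glue code. The helpers `detector` and `llm` below are placeholders for whatever detection model and LLM the pipeline actually uses; the prompt wording and function names are our own illustration, not the paper's recipe.

```python
import requests

def download_image(url: str, path: str) -> bool:
    """Stage 1: download an image by URL (its alt text is assumed to come from the crawl)."""
    resp = requests.get(url, timeout=10)
    if resp.status_code != 200:
        return False
    with open(path, "wb") as f:
        f.write(resp.content)
    return True

def build_caption(image_path: str, alt_text: str, detector, llm) -> str:
    """Stages 2-3: detect objects in the image, then have an LLM rewrite the alt text.
    `detector` and `llm` are hypothetical callables standing in for the actual models."""
    objects = detector(image_path)              # e.g. ["torii gate", "lantern"]
    prompt = (
        "Rewrite the following alt text as a fluent Japanese caption, "
        f"consistent with the detected objects {objects}:\n{alt_text}"
    )
    return llm(prompt)
```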
Interspeech 2025
FUSE: Universal Speech Enhancement using Multi-Stage Fusion of Sparse Compression and Token Generation Models for the URGENT 2025 Challenge
Nabarun Goswami, Tatsuya Harada
We propose a multi-stage framework for universal speech enhancement, designed for the Interspeech 2025 URGENT Challenge. Our system first employs a Sparse Compression Network to robustly separate sources and extract an initial clean speech estimate from noisy inputs. This is followed by an efficient mask-predict generative model that refines speech quality by leveraging self-supervised features and optimizing a masked language modeling objective on acoustic tokens derived from a neural audio codec. In the final stage, a fusion network integrates the outputs of the first two stages with the original noisy signal, achieving a balanced improvement in both signal fidelity and perceptual quality. Additionally, a shift trick that aggregates multiple time-shifted predictions, along with output blending, further boosts performance. Experimental results on challenging multilingual datasets with variable sampling rates and diverse distortion types validate the effectiveness of our approach.
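The shift trick mentioned above can be sketched as running the enhancer on several time-shifted copies of the input, undoing each shift, and averaging. The sketch below uses a circular roll for simplicity, which is our simplification rather than the system's exact shifting and blending scheme.

```python
import torch

def shift_aggregate(model, noisy, shifts=(0, 80, 160)):
    """Average enhanced outputs over several time shifts of the input (sketch).
    noisy: (B, T) waveform; model: callable mapping (B, T) -> (B, T)."""
    outputs = []
    for s in shifts:
        shifted = torch.roll(noisy, shifts=s, dims=-1)              # shift in time
        enhanced = model(shifted)
        outputs.append(torch.roll(enhanced, shifts=-s, dims=-1))    # undo the shift
    return torch.stack(outputs, dim=0).mean(dim=0)
```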
Transactions on Machine Learning Research (TMLR), 2025
Enhancing Plaque Segmentation in CCTA with Prompt-based Diffusion Data Augmentation
Yizhe Ruan, Xuangeng Chu, Ziteng Cui, Yusuke Kurose, Junichi Iho, Yoji Tokunaga, Makoto Horie, Yusaku Hayashi, Keisuke Nishizawa, Yasushi Koyama, Tatsuya Harada
Coronary computed tomography angiography (CCTA) is essential for non-invasive assessment of coronary artery disease (CAD). However, accurate segmentation of atherosclerotic plaques remains challenging due to data scarcity, severe class imbalance, and significant variability between calcified and non-calcified plaques. Inspired by DiffTumor’s tumor synthesis and PromptIR’s adaptive restoration framework, we introduce PromptLesion, a prompt-conditioned diffusion model for multi-class lesion synthesis. Unlike single-class methods, our approach integrates lesion-specific prompts within the diffusion generation process, enhancing diversity and anatomical realism in synthetic data. We validate PromptLesion on a private CCTA dataset and multi-organ tumor segmentation tasks (kidney, liver, pancreas) using public datasets, achieving superior performance compared to baseline methods. Models trained with our prompt-guided synthetic augmentation significantly improve Dice Similarity Coefficient (DSC) scores for both plaque and tumor segmentation. Extensive evaluations and ablation studies confirm the effectiveness of prompt conditioning.
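One generic way to condition a diffusion denoiser on a lesion-class prompt is FiLM-style feature modulation: the prompt embedding produces a per-channel scale and shift applied to the denoiser's features. The sketch below shows that general mechanism only; the class names, shapes, and module structure are our assumptions and not PromptLesion's exact conditioning.

```python
import torch
import torch.nn as nn

class LesionPromptConditioning(nn.Module):
    """Map a lesion-class prompt to a scale/shift that modulates denoiser features."""
    def __init__(self, n_classes: int, feat_dim: int):
        super().__init__()
        self.embed = nn.Embedding(n_classes, feat_dim)
        self.to_scale_shift = nn.Linear(feat_dim, 2 * feat_dim)

    def forward(self, features, lesion_class):
        # features: (B, C, D, H, W) volumetric feature map; lesion_class: (B,)
        scale, shift = self.to_scale_shift(self.embed(lesion_class)).chunk(2, dim=-1)
        scale = scale.view(*scale.shape, 1, 1, 1)      # broadcast over D, H, W
        shift = shift.view(*shift.shape, 1, 1, 1)
        return features * (1 + scale) + shift
```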
HyperVQ: MLR-based Vector Quantization in Hyperbolic Space
Nabarun Goswami, Yusuke Mukuta, Tatsuya Harada
The success of models operating on tokenized data has heightened the need for effective tokenization methods, particularly in vision and auditory tasks where inputs are naturally continuous. A common solution is to employ Vector Quantization (VQ) within VQ Variational Autoencoders (VQVAEs), transforming inputs into discrete tokens by clustering embeddings in Euclidean space. However, Euclidean embeddings not only suffer from inefficient packing and limited separation—due to their polynomial volume growth—but are also prone to codebook collapse, where only a small subset of codebook vectors are effectively utilized. To address these limitations, we introduce HyperVQ, a novel approach that formulates VQ as a hyperbolic Multinomial Logistic Regression (MLR) problem, leveraging the exponential volume growth in hyperbolic space to mitigate collapse and improve cluster separability. Additionally, HyperVQ represents codebook vectors as geometric representatives of hyperbolic decision hyperplanes, encouraging disentangled and robust latent representations. Our experiments demonstrate that HyperVQ matches traditional VQ in generative and reconstruction tasks, while surpassing it in discriminative performance and yielding a more efficient and disentangled codebook.
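To make the hyperbolic setting concrete, the sketch below quantizes latents by nearest-neighbor assignment under the Poincaré-ball geodesic distance. Note this is a simplified stand-in: HyperVQ itself formulates quantization as a hyperbolic MLR over decision hyperplanes, not nearest-neighbor search, and the projection step and codebook handling here are our own.

```python
import torch

def poincare_distance(x, y, eps=1e-5):
    """Geodesic distance on the Poincare ball (curvature -1).
    x: (N, 1, D), y: (1, K, D), both assumed to lie inside the unit ball."""
    sq = ((x - y) ** 2).sum(-1)
    xn = (1 - (x ** 2).sum(-1)).clamp(min=eps)
    yn = (1 - (y ** 2).sum(-1)).clamp(min=eps)
    return torch.acosh(1 + 2 * sq / (xn * yn))

def quantize(z, codebook, max_norm=0.99):
    """Assign each latent to its nearest codebook vector in hyperbolic distance."""
    # project latents inside the ball; codebook entries are assumed to be inside already
    z = z * (max_norm / z.norm(dim=-1, keepdim=True).clamp(min=max_norm))
    d = poincare_distance(z.unsqueeze(1), codebook.unsqueeze(0))   # (N, K)
    return d.argmin(dim=1)                                         # token indices
```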
EDM-TTS: Efficient Dual-Stage Masked Modeling for Alignment-Free Text-to-Speech Synthesis
Nabarun Goswami, Hanqin Wang, Tatsuya Harada
Tokenized speech modeling has significantly advanced zero-shot text-to-speech (TTS) capabilities. The de facto approach involves a dual-stage process: text-to-semantic (T2S) followed by semantic-to-acoustic (S2A) generation. Several auto-regressive (AR) and non-autoregressive (NAR) methods have been explored in the literature for both stages. While AR models achieve state-of-the-art performance, their token-by-token generation makes inference inefficient; NAR methods, while more efficient, require explicit alignment for upsampling intermediate representations, which constrains the model's capability for more natural prosody. To overcome these issues, we propose an Efficient Dual-stage Masked TTS (EDM-TTS) model that employs an alignment-free masked generative approach for the T2S stage, removing the need for an explicit aligner while retaining the efficiency of NAR methods. For the S2A stage, we introduce a novel NAR Injection Conformer architecture that effectively models the conditional dependence among different acoustic quantization levels and is optimized with a masked language modeling objective, enabling zero-shot speech generation. Our evaluations demonstrate not only the superior inference efficiency of EDM-TTS but also its state-of-the-art zero-shot speech quality, naturalness, and speaker similarity.
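Masked generative decoding of the kind used in such models can be sketched as confidence-based iterative refinement: fill all masked positions, keep the most confident predictions, and re-mask the rest for the next pass. The sketch below shows that generic procedure (assuming batch size 1 and a linear unmasking schedule); it is not EDM-TTS's exact decoding schedule or model interface.

```python
import torch

@torch.no_grad()
def mask_predict(model, tokens, mask_id, n_iters=8):
    """Confidence-based iterative masked decoding (sketch, batch size 1)."""
    n_masked0 = int((tokens == mask_id).sum())
    for it in range(n_iters):
        cur_mask = tokens == mask_id                      # positions still masked
        logits = model(tokens)                            # (1, T, V)
        conf, pred = logits.softmax(-1).max(-1)
        tokens = torch.where(cur_mask, pred, tokens)      # fill masked slots only
        n_remask = int(n_masked0 * (1 - (it + 1) / n_iters))
        if n_remask == 0:
            break
        scores = conf.masked_fill(~cur_mask, float("inf"))  # never re-mask kept tokens
        idx = scores.flatten().topk(n_remask, largest=False).indices
        tokens.view(-1)[idx] = mask_id                    # re-mask least confident
    return tokens
```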
Proceedings of the Reinforcement Learning Conference (RLC), 2025
Offline Reinforcement Learning with Wasserstein Regularization via Optimal Transport Maps
Motoki Omura, Yusuke Mukuta, Kazuki Ota, Takayuki Osa, Tatsuya Harada
Offline reinforcement learning (RL) aims to learn an optimal policy from a static dataset, making it particularly valuable in scenarios where data collection is costly, such as robotics. A major challenge in offline RL is distributional shift, where the learned policy deviates from the dataset distribution, potentially leading to unreliable out-of-distribution actions. To mitigate this issue, regularization techniques have been employed. While many existing methods utilize density ratio-based measures, such as the f-divergence, for regularization, we propose an approach that utilizes the Wasserstein distance, which is robust to out-of-distribution data and captures the similarity between actions. Our method employs input-convex neural networks (ICNNs) to model optimal transport maps, enabling the computation of the Wasserstein distance in a discriminator-free manner, thereby avoiding adversarial training and ensuring stable learning. Our approach demonstrates comparable or superior performance to widely used existing methods on the D4RL benchmark dataset.
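An input-convex neural network parameterizes a convex potential whose gradient gives a transport map, which is the ingredient the abstract relies on for discriminator-free Wasserstein computation. The sketch below shows a generic ICNN and the gradient-based map; the architecture, weight clamping, and function names are illustrative assumptions, not the paper's exact model or training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """Input-convex network: non-negative hidden-to-hidden weights plus a convex,
    non-decreasing activation make the scalar output convex in the input."""
    def __init__(self, dim, hidden=64, n_layers=3):
        super().__init__()
        self.wx = nn.ModuleList([nn.Linear(dim, hidden) for _ in range(n_layers)])
        self.wz = nn.ModuleList([nn.Linear(hidden, hidden, bias=False)
                                 for _ in range(n_layers - 1)])
        self.out = nn.Linear(hidden, 1, bias=False)

    def forward(self, x):
        z = F.softplus(self.wx[0](x))
        for wx, wz in zip(self.wx[1:], self.wz):
            # clamping keeps hidden-to-hidden weights non-negative, preserving convexity
            z = F.softplus(wx(x) + F.linear(z, wz.weight.clamp(min=0)))
        return F.linear(z, self.out.weight.clamp(min=0)).squeeze(-1)

def transport(psi, a):
    """Optimal transport map as the gradient of the convex potential: T(a) = grad psi(a)."""
    a = a.requires_grad_(True)
    return torch.autograd.grad(psi(a).sum(), a, create_graph=True)[0]
```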
Proceedings of the International Conference on Machine Learning (ICML), 2025
Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning
Motoki Omura, Kazuki Ota, Takayuki Osa, Yusuke Mukuta, Tatsuya Harada
For continuous action spaces, actor-critic methods are widely used in online reinforcement learning (RL). However, unlike RL algorithms for discrete actions, which generally model the optimal value function using the Bellman optimality operator, RL algorithms for continuous actions typically model Q-values for the current policy using the Bellman operator. These algorithms for continuous actions rely exclusively on policy updates for improvement, which often results in low sample efficiency. This study examines the effectiveness of incorporating the Bellman optimality operator into actor-critic frameworks. Experiments in a simple environment show that modeling optimal values accelerates learning but leads to overestimation bias. To address this, we propose an annealing approach that gradually transitions from the Bellman optimality operator to the Bellman operator, thereby accelerating learning while mitigating bias. Our method, combined with TD3 and SAC, significantly outperforms existing approaches across various locomotion and manipulation tasks, demonstrating improved performance and robustness to hyperparameters related to optimality.
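The annealed target can be written as a convex combination of the two operators' backup values, with the mixing weight decayed from 1 toward 0 over training. The sketch below illustrates the idea with a sampled approximation of the max (assuming a stochastic policy); the coefficient schedule and the exact integration into TD3/SAC critics follow the paper and are not reproduced here.

```python
import torch

def annealed_target(q_net, policy, next_obs, reward, done, gamma, lam, n_samples=10):
    """Blend the Bellman optimality backup (max over sampled actions) with the
    standard Bellman backup (policy action), weighted by `lam` in [0, 1]."""
    with torch.no_grad():
        # approximate max_a Q(s', a) by sampling several actions from the policy
        acts = [policy(next_obs) for _ in range(n_samples)]
        qs = torch.stack([q_net(next_obs, a) for a in acts], dim=0)
        q_max = qs.max(dim=0).values
        q_pi = q_net(next_obs, policy(next_obs))
        q_next = lam * q_max + (1 - lam) * q_pi
        return reward + gamma * (1 - done) * q_next
```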
The IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025
A Theory of Learning Unified Model via Knowledge Integration from Label Space Varying Domains
Dexuan Zhang, Thomas Westfechtel, Tatsuya Harada
Luminance-GS: Adapting 3D Gaussian Splatting to Challenging Lighting Conditions with View-Adaptive Curve Adjustment
Ziteng Cui, Xuangeng Chu, Tatsuya Harada
Capturing high-quality photographs across diverse real-world lighting conditions is challenging, as both natural lighting (e.g., low-light) and camera exposure settings (e.g., exposure time) strongly influence image quality. This difficulty intensifies in multi-view scenarios, where each viewpoint can have distinct lighting and image signal processor (ISP) settings, causing photometric inconsistencies between views. These lighting degradations and view variations pose significant challenges to both NeRF- and 3D Gaussian Splatting (3DGS)-based novel view synthesis (NVS) frameworks.
To address this, we introduce Luminance-GS, a novel approach to achieving high-quality novel view synthesis under diverse and challenging lighting conditions using 3DGS. By adopting per-view color space mapping and view-adaptive curve adjustments, Luminance-GS achieves state-of-the-art (SOTA) results across various lighting conditions—including low-light, overexposure, and varying exposure—without altering the original 3DGS explicit representation. Compared to previous NeRF- and 3DGS-based baselines, Luminance-GS provides real-time rendering speed with improved reconstruction quality. We will release the source code.
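Per-view photometric adaptation can be sketched as a learnable color mapping and tone curve indexed by the view, applied to the rendered image before the photometric loss. The sketch below uses a 3x3 color matrix and a single per-view gamma as a simplified stand-in for Luminance-GS's color space mapping and view-adaptive curve adjustment; all parameter names are illustrative.

```python
import torch
import torch.nn as nn

class PerViewColorAdjust(nn.Module):
    """Per-view 3x3 color mapping followed by a per-view gamma-style curve (sketch)."""
    def __init__(self, n_views: int):
        super().__init__()
        self.color_mats = nn.Parameter(torch.eye(3).repeat(n_views, 1, 1))
        self.log_gamma = nn.Parameter(torch.zeros(n_views))    # gamma = exp(log_gamma)

    def forward(self, rendered, view_idx):
        # rendered: (3, H, W) image produced by the 3DGS renderer for view `view_idx`
        mat = self.color_mats[view_idx]                         # (3, 3)
        img = torch.einsum("ij,jhw->ihw", mat, rendered)
        gamma = self.log_gamma[view_idx].exp()
        return img.clamp(min=1e-6) ** gamma                    # monotone tone curve
```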
The Thirteenth International Conference on Learning Representations, ICLR 2025
T2V2: A Unified Non-Autoregressive Model for Speech Recognition and Synthesis via Multitask Learning
Nabarun Goswami, Hanqin Wang, Tatsuya Harada
We introduce T2V2 (Text to Voice and Voice to Text), a unified non-autoregressive model capable of performing both automatic speech recognition (ASR) and text-to-speech (TTS) synthesis within the same framework. T2V2 uses a shared Conformer backbone with rotary positional embeddings to efficiently handle these core tasks, with ASR trained using Connectionist Temporal Classification (CTC) loss and TTS using masked language modeling (MLM) loss. The model operates on discrete tokens, where speech tokens are generated by clustering features from a self-supervised learning model. To further enhance performance, we introduce auxiliary tasks: CTC error correction to refine raw ASR outputs using contextual information from speech embeddings, and unconditional speech MLM, enabling classifier-free guidance to improve TTS. Our method is self-contained, leveraging intermediate CTC outputs to align text and speech using Monotonic Alignment Search, without relying on external aligners. We perform extensive experimental evaluation to verify the efficacy of the T2V2 framework, achieving state-of-the-art performance on the TTS task and competitive performance in discrete ASR.
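The core multitask objective couples a CTC loss for ASR with a masked-LM cross-entropy for TTS computed only at masked speech-token positions. The sketch below shows how such a combined loss could be assembled in PyTorch; the tensor layouts, equal loss weighting, and argument names are our assumptions rather than the paper's exact recipe (which also includes the auxiliary tasks described above).

```python
import torch
import torch.nn.functional as F

def multitask_loss(asr_logits, asr_targets, in_lens, tgt_lens,
                   tts_logits, tts_targets, tts_mask, ctc_blank=0):
    """Joint objective sketch: CTC loss for ASR plus masked-LM loss for TTS."""
    # asr_logits: (T, B, V_text); CTC expects log-probs over text tokens
    ctc = F.ctc_loss(asr_logits.log_softmax(-1), asr_targets,
                     in_lens, tgt_lens, blank=ctc_blank)
    # tts_logits: (B, T, V_speech); tts_mask: True at masked speech-token positions
    mlm = F.cross_entropy(tts_logits[tts_mask], tts_targets[tts_mask])
    return ctc + mlm
```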