Published in IEEE RA-L, 2025
In this study, we propose a novel training method that leverages both learning-based and n-gram-based automatic evaluation metrics as rewards to generate free-form mobile manipulation instructions.
Recommended citation: K. Katsumata, M. Kambara, D. Yashima, R. Korekata, and K. Sugiura, "Mobile Manipulation Instruction Generation from Multiple Images with Automatic Metric Enhancement", IEEE RA-L, vol. 10, no. 3, pp. 3022–3029, 2025.
Download Paper
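As a rough illustration of the metric-as-reward idea above, the sketch below plugs a toy n-gram precision (standing in for an n-gram metric such as BLEU) and a placeholder learning-based metric into a self-critical REINFORCE loss. All names and the baseline scheme are assumptions for illustration, not the paper's actual training loop.

```python
# Illustrative only: metric scores used as REINFORCE rewards with a
# self-critical (greedy-decode) baseline. Not the paper's exact method.
import torch

def ngram_precision(hyp: list[str], ref: list[str], n: int = 2) -> float:
    """Toy n-gram precision standing in for an n-gram metric like BLEU."""
    hyp_ngrams = [tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1)]
    ref_ngrams = set(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    if not hyp_ngrams:
        return 0.0
    return sum(g in ref_ngrams for g in hyp_ngrams) / len(hyp_ngrams)

def reinforce_loss(log_probs: torch.Tensor, sampled: list[str],
                   greedy: list[str], ref: list[str],
                   learned_metric=lambda h, r: 0.0) -> torch.Tensor:
    """log_probs: (T,) token log-probs of the sampled instruction.
    learned_metric: placeholder for a learning-based metric score."""
    def reward(hyp):
        return ngram_precision(hyp, ref) + learned_metric(hyp, ref)
    # Self-critical baseline: only reward improvements over greedy decoding.
    advantage = reward(sampled) - reward(greedy)
    return -advantage * log_probs.sum()
```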
Published in IEEE RA-L, 2025
In this study, we propose RelaX-Former, a method that exploits unlabeled positive samples through a double relaxed contrastive learning approach, handling both unlabeled positive and negative samples to improve the alignment between images and text.
Recommended citation: D. Yashima, R. Korekata, and K. Sugiura, "Open-Vocabulary Mobile Manipulation Based on Double Relaxed Contrastive Learning With Dense Labeling", IEEE RA-L, vol. 10, no. 2, pp. 1728–1735, 2025.
Download Paper
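One way to picture a "relaxed" contrastive objective: instead of fully repelling every off-diagonal image-text pair, pairs that already look similar (possible unlabeled positives) are down-weighted in the softmax. The sketch below shows a single illustrative relaxation with assumed thresholds and weights, not RelaX-Former's double relaxed formulation.

```python
# Illustrative "relaxed" image-text contrastive loss; thresholds, weights,
# and the single-relaxation scheme are assumptions, not the paper's loss.
import torch
import torch.nn.functional as F

def relaxed_contrastive_loss(img: torch.Tensor, txt: torch.Tensor,
                             tau: float = 0.07, relax_thresh: float = 0.9,
                             relax_weight: float = 0.1) -> torch.Tensor:
    """img, txt: (B, D) paired embeddings; row i of each is a labeled pair."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / tau                    # (B, B) scaled similarities
    with torch.no_grad():
        cos = logits * tau                          # recover raw cosine values
        eye = torch.eye(len(img), dtype=torch.bool, device=img.device)
        probable_pos = (cos > relax_thresh) & ~eye  # likely unlabeled positives
    weights = torch.ones_like(logits)
    weights[probable_pos] = relax_weight
    # Adding log-weights shrinks the softmax-denominator contribution of
    # probable positives, i.e. they are repelled less strongly ("relaxed").
    targets = torch.arange(len(img), device=img.device)
    return F.cross_entropy(logits + weights.log(), targets)
```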
Published in arXiv, 2025
We present the AIRoA MoMa Dataset, a large-scale hierarchical dataset designed to advance research in mobile manipulation within indoor environments.
Recommended citation: R. Takanami, P. Khrapchenkov, S. Morikuni, J. Arima, Y. Takaba, S. Maeda, T. Okubo, G. Sano, S. Sekioka, A. Kadoya, M. Kambara, N. Nishiura, H. Suzuki, T. Yoshimoto, K. Sakamoto, S. Ono, H. Yang, D. Yashima, A. Horo, T. Motoda, K. Chiyoma, H. Ito, K. Fukuda, A. Goto, K. Morinaga, Y. Ikeda, R. Kawada, M. Yoshikawa, N. Kosuge, Y. Noguchi, K. Ota, T. Matsushima, Y. Iwasawa, Y. Matsuo, and T. Ogata, "AIRoA MoMa Dataset: A Large-Scale Hierarchical Dataset for Mobile Manipulation", arXiv preprint arXiv:2509.25032, 2025.
Download Paper
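A hierarchical mobile-manipulation dataset typically nests episodes, subtasks, and timesteps. The dataclasses below sketch one plausible layout for such a hierarchy; every field name here is hypothetical and not taken from the AIRoA MoMa release.

```python
# Hypothetical layout for a hierarchical manipulation dataset
# (episodes -> subtasks -> steps). Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Step:
    image_path: str
    joint_positions: list[float]
    base_velocity: tuple[float, float, float]

@dataclass
class Subtask:
    instruction: str                      # e.g. "open the drawer"
    steps: list[Step] = field(default_factory=list)

@dataclass
class Episode:
    task_description: str                 # high-level goal for the episode
    subtasks: list[Subtask] = field(default_factory=list)

def iter_steps(episode: Episode):
    """Flatten the hierarchy for step-level policy training."""
    for subtask in episode.subtasks:
        for step in subtask.steps:
            yield episode.task_description, subtask.instruction, step
```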
Published in CVPR 2026 Findings, 2026
We propose NaiLIA, a multimodal retrieval method that retrieves nail design images aligned with both dense intent descriptions and palette queries.
Recommended citation: K. Amemiya, D. Yashima, K. Katsumata, T. Komatsu, R. Korekata, S. Otsuki, and K. Sugiura, "NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries", CVPR Findings, 2026.
Download Paper
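A minimal sketch of the fused retrieval idea above: embed the dense intent description and the color palette, combine them into a single query, and rank gallery images by cosine similarity. The fusion weight and scoring scheme are assumptions, not NaiLIA's architecture.

```python
# Illustrative late-fusion retrieval; alpha and the linear fusion are
# assumptions, not NaiLIA's actual query encoder.
import torch
import torch.nn.functional as F

def retrieve(text_emb: torch.Tensor,     # (D,) dense intent description
             palette_emb: torch.Tensor,  # (D,) embedded color palette
             gallery: torch.Tensor,      # (N, D) nail design image embeddings
             alpha: float = 0.7, k: int = 5) -> torch.Tensor:
    """Return indices of the top-k gallery images for the fused query."""
    query = F.normalize(alpha * text_emb + (1 - alpha) * palette_emb, dim=-1)
    scores = F.normalize(gallery, dim=-1) @ query   # (N,) cosine similarities
    return scores.topk(k).indices
```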
Published in CVPR 2026, 2026
We propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations, using sparse RGB keyframes for appearance and a refined motion representation for temporal dynamics.
Recommended citation: D. Yashima, S. Kurita, Y. Oda, and K. Sugiura, "ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding", CVPR, 2026.
Download Paper
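The sketch below shows one reading of "operating on compressed representations": project sparse RGB keyframe tokens and motion tokens into the LLM embedding space and concatenate them as a single visual prefix. Shapes and the fusion scheme are assumptions, not ReMoRa's design.

```python
# Illustrative two-stream fusion of keyframe and motion tokens; the
# projection-and-concatenate scheme is an assumption for this sketch.
import torch
import torch.nn as nn

class KeyframeMotionFusion(nn.Module):
    def __init__(self, rgb_dim: int, motion_dim: int, llm_dim: int):
        super().__init__()
        self.rgb_proj = nn.Linear(rgb_dim, llm_dim)
        self.motion_proj = nn.Linear(motion_dim, llm_dim)

    def forward(self, rgb_tokens: torch.Tensor,    # (B, K, rgb_dim) keyframes
                motion_tokens: torch.Tensor        # (B, M, motion_dim) motion
                ) -> torch.Tensor:
        # Project both streams into the LLM embedding space and join them
        # into one visual prefix sequence of length K + M.
        return torch.cat([self.rgb_proj(rgb_tokens),
                          self.motion_proj(motion_tokens)], dim=1)
```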
Published in arXiv, 2026
We propose AnoleVLA, a lightweight VLA that uses a deep state space model to process multimodal sequences efficiently, outperforming a representative large-scale VLA by 21 points in task success rate while achieving approximately three times faster inference.
Recommended citation: Y. Takagi, M. Kambara, D. Yashima, K. Seno, K. Tokura, and K. Sugiura, "AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation", arXiv preprint arXiv:2603.15046, 2026.
Download Paper
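The efficiency claim above rests on the linear-time recurrence of deep state space models. Below is a deliberately slow but explicit version of that recurrence; parameter shapes are illustrative, and AnoleVLA's actual blocks are not reproduced here.

```python
# Illustrative linear state-space recurrence (the core of Mamba-style SSMs),
# written as an explicit Python loop for clarity rather than speed.
import torch

def ssm_scan(x: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
             C: torch.Tensor) -> torch.Tensor:
    """h_t = A * h_{t-1} + B * x_t ;  y_t = sum_N(C * h_t).

    x: (T, D) input sequence; A, B, C: (D, N) per-channel SSM parameters.
    Cost is linear in T, which is the efficiency argument for SSM-based
    VLAs and MLLMs over quadratic self-attention.
    """
    T, D = x.shape
    h = x.new_zeros(D, A.shape[-1])
    ys = []
    for t in range(T):
        h = A * h + B * x[t, :, None]   # (D, N) elementwise state update
        ys.append((h * C).sum(-1))      # (D,) readout at step t
    return torch.stack(ys)              # (T, D)
```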
Published in arXiv, 2026
We propose HiFlow, a tokenization-free coarse-to-fine autoregressive policy that operates directly on raw continuous actions via flow matching, eliminating the need for discrete action tokenizers.
Recommended citation: D. Yashima, K. Seno, S. Kurita, Y. Oda, and K. Sugiura, "HiFlow: Tokenization-Free Scale-Wise Autoregressive Policy Learning via Flow Matching", arXiv, 2026.
Download Paper
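For reference, the basic conditional flow-matching objective such a policy regresses is sketched below on raw continuous actions: interpolate between noise and the target action, then regress the velocity of that path. The coarse-to-fine, scale-wise autoregression is omitted, and the model interface is an assumption.

```python
# Illustrative conditional flow-matching loss on continuous action chunks;
# the model signature is assumed, and HiFlow's scale-wise scheme is omitted.
import torch

def flow_matching_loss(model, actions: torch.Tensor,
                       obs: torch.Tensor) -> torch.Tensor:
    """actions: (B, A) target action chunk; obs: (B, O) conditioning."""
    noise = torch.randn_like(actions)                      # x_0 ~ N(0, I)
    t = torch.rand(actions.size(0), 1, device=actions.device)
    x_t = (1 - t) * noise + t * actions                    # linear path
    target_v = actions - noise                             # dx_t/dt on the path
    pred_v = model(x_t, t, obs)                            # velocity regression
    return (pred_v - target_v).pow(2).mean()
```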
Published in ICPR 2026, 2026
We propose ABMamba, a fully open MLLM based on Deep State Space Models with linear computational complexity, enabling scalable video captioning with competitive performance at approximately three times higher throughput.
Recommended citation: D. Yashima, S. Kurita, Y. Oda, S. Suzuki, S. Otsuki, and K. Sugiura, "ABMamba: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning", ICPR, 2026.
Download Paper
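A bidirectional scan can be pictured as running a causal linear-time scan (such as the ssm_scan sketch above) over the token sequence in both directions and summing the two passes, as below; ABMamba's aligned hierarchical variant is more involved than this.

```python
# Illustrative bidirectional wrapper around any causal sequence scan;
# the additive combination is an assumption for this sketch.
import torch

def bidirectional_scan(x: torch.Tensor, scan_fn) -> torch.Tensor:
    """x: (T, D); scan_fn: a causal op like ssm_scan, partially applied
    with its parameters. Flipping twice restores the original order."""
    forward = scan_fn(x)
    backward = scan_fn(x.flip(0)).flip(0)
    return forward + backward
```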