Sitemap
A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.
Pages
Daichi Yashima
Daichi Yashima is a robotics researcher at Keio University focused on foundation models, multimodal language understanding, and embodied AI.
Posts
portfolio
publications
Mobile Manipulation Instruction Generation from Multiple Images with Automatic Metric Enhancement
Published in IEEE RA-L, 2025
In this study, we propose a novel training method that leverages both learning-based and n-gram-based automatic evaluation metrics as rewards to generate free-form mobile manipulation instructions. A minimal sketch of the mixed-reward idea follows this entry.
Recommended citation: K. Katsumata, M. Kambara, D. Yashima, R. Korekata, and K. Sugiura, "Mobile Manipulation Instruction Generation from Multiple Images with Automatic Metric Enhancement", IEEE RA-L, vol. 10, no. 3, pp. 3022–3029, 2025.
Download Paper
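The paper's actual reward formulation isn't reproduced here. Purely as a rough illustration of mixing an n-gram metric with a learning-based one, the sketch below blends a BLEU-style modified n-gram precision with a placeholder `learned_metric` (a hypothetical stand-in for a trained scorer); the weighting `alpha` is likewise an assumption.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """BLEU-style modified n-gram precision between token lists (no brevity penalty)."""
    cand = list(zip(*[candidate[i:] for i in range(n)]))
    ref = Counter(zip(*[reference[i:] for i in range(n)]))
    if not cand:
        return 0.0
    matched = sum(min(c, ref[g]) for g, c in Counter(cand).items())
    return matched / len(cand)

def learned_metric(candidate, reference):
    # Hypothetical stand-in for a learning-based metric (e.g., a trained scorer);
    # plain token overlap here so the sketch runs end to end.
    return len(set(candidate) & set(reference)) / max(len(set(reference)), 1)

def mixed_reward(candidate, reference, alpha=0.5):
    # A scalar reward blending both signals, e.g., for policy-gradient fine-tuning.
    return (alpha * learned_metric(candidate, reference)
            + (1 - alpha) * ngram_precision(candidate, reference))

cand = "pick up the red cup and move to the kitchen".split()
ref = "grab the red cup and carry it to the kitchen".split()
print(f"reward = {mixed_reward(cand, ref):.3f}")
```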
Open-Vocabulary Mobile Manipulation Based on Double Relaxed Contrastive Learning With Dense Labeling
Published in IEEE RA-L, 2025
In this study, we propose RelaX-Former, a method that leverages unlabeled positive samples and introduces a double relaxed contrastive learning approach to handle unlabeled positive and negative samples, improving the alignment between images and text. A generic relaxed contrastive loss is sketched after this entry.
Recommended citation: D. Yashima, R. Korekata, and K. Sugiura, "Open-Vocabulary Mobile Manipulation Based on Double Relaxed Contrastive Learning With Dense Labeling", IEEE RA-L, vol. 10, no. 2, pp. 1728–1735, 2025.
Download Paper
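The exact double relaxed objective is defined in the paper; the sketch below only illustrates one plausible reading of "relaxed contrastive" learning: off-diagonal image-text pairs carry a soft weight estimating how likely they are to be unlabeled positives, so they are pushed apart less aggressively than confident negatives. The `pos_weight` matrix and the weighting scheme are assumptions, not the paper's loss.

```python
import torch
import torch.nn.functional as F

def relaxed_info_nce(img_emb, txt_emb, pos_weight, tau=0.07):
    """InfoNCE with softened negatives: pos_weight[i, j] in [0, 1] estimates how
    likely the off-diagonal pair (i, j) is an unlabeled positive; higher means
    it contributes less repulsion to the denominator."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / tau                         # (B, B) scaled cosine similarities
    eye = torch.eye(len(img))
    weights = (1.0 - eye) * (1.0 - pos_weight) + eye   # diagonal (true pairs) keeps weight 1
    exp = torch.exp(logits) * weights
    loss = -torch.log(torch.diag(exp) / exp.sum(dim=1))
    return loss.mean()

B, D = 4, 32
loss = relaxed_info_nce(torch.randn(B, D), torch.randn(B, D),
                        pos_weight=torch.rand(B, B) * 0.3)
print(f"loss = {loss.item():.3f}")
```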
AIRoA MoMa Dataset: A Large-Scale Hierarchical Dataset for Mobile Manipulation
Published in arXiv, 2025
We present the AIRoA MoMa Dataset, a large-scale hierarchical dataset designed to advance research in mobile manipulation within indoor environments. A hypothetical schema illustrating the hierarchy is sketched after this entry.
Recommended citation: R. Takanami, P. Khrapchenkov, S. Morikuni, J. Arima, Y. Takaba, S. Maeda, T. Okubo, G. Sano, S. Sekioka, A. Kadoya, M. Kambara, N. Nishiura, H. Suzuki, T. Yoshimoto, K. Sakamoto, S. Ono, H. Yang, D. Yashima, A. Horo, T. Motoda, K. Chiyoma, H. Ito, K. Fukuda, A. Goto, K. Morinaga, Y. Ikeda, R. Kawada, M. Yoshikawa, N. Kosuge, Y. Noguchi, K. Ota, T. Matsushima, Y. Iwasawa, Y. Matsuo, and T. Ogata, "AIRoA MoMa Dataset: A Large-Scale Hierarchical Dataset for Mobile Manipulation", arXiv preprint arXiv:2509.25032, 2025.
Download Paper
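The dataset's real format is specified by the release itself and is not reproduced here. Purely as an illustration of what "hierarchical" can mean for mobile manipulation (a high-level task decomposed into language-annotated subtasks over low-level robot states), here is a hypothetical schema; every name in it (`Step`, `Subtask`, `Episode`, the fields) is invented for this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    timestamp: float
    joint_positions: list[float]              # low-level robot state
    base_velocity: tuple[float, float, float]

@dataclass
class Subtask:
    instruction: str                          # e.g. "open the drawer"
    steps: list[Step] = field(default_factory=list)

@dataclass
class Episode:
    task: str                                 # high-level goal, e.g. "tidy the desk"
    subtasks: list[Subtask] = field(default_factory=list)

ep = Episode(task="tidy the desk",
             subtasks=[Subtask("pick up the mug",
                               [Step(0.0, [0.1] * 7, (0.0, 0.0, 0.0))])])
print(ep.task, "->", [s.instruction for s in ep.subtasks])
```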
NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries
Published in CVPR 2026 Findings, 2026
We propose NaiLIA, a multimodal retrieval method for nail design images that comprehensively aligns with dense intent descriptions and palette queries. A toy scoring sketch follows this entry.
Recommended citation: K. Amemiya, D. Yashima, K. Katsumata, T. Komatsu, R. Korekata, S. Otsuki, and K. Sugiura, "NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries", CVPR Findings, 2026.
Download Paper
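NaiLIA's actual model is not described in this listing. The toy sketch below only shows one way a retrieval score could combine a dense text-intent embedding with a color-palette query: cosine similarity for the text side plus a nearest-color match for the palette. The embeddings, the RGB palette distance, and the 0.5 weighting are all placeholders.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def palette_score(query_rgb, design_rgb):
    """Mean over query colors of best-match closeness to the design's palette (1 = identical)."""
    q = np.asarray(query_rgb, dtype=float) / 255.0
    d = np.asarray(design_rgb, dtype=float) / 255.0
    dists = np.linalg.norm(q[:, None, :] - d[None, :, :], axis=-1)  # (|q|, |d|)
    return float(1.0 - dists.min(axis=1).mean() / np.sqrt(3))       # sqrt(3) = max RGB distance

def retrieval_score(text_q, text_e, pal_q, pal_e, w=0.5):
    # Weighted sum of text-intent similarity and palette agreement.
    return w * cosine(text_q, text_e) + (1 - w) * palette_score(pal_q, pal_e)

rng = np.random.default_rng(0)
score = retrieval_score(rng.normal(size=64), rng.normal(size=64),
                        [(255, 0, 80), (240, 240, 240)],
                        [(250, 10, 90), (230, 235, 240), (10, 10, 10)])
print(f"score = {score:.3f}")
```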
ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding
Published in CVPR 2026, 2026
We propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations, using sparse RGB keyframes for appearance and a refined motion representation for temporal dynamics. A simplified keyframe/motion split is sketched after this entry.
Recommended citation: D. Yashima, S. Kurita, Y. Oda, and K. Sugiura, "ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding", CVPR, 2026.
Download Paper
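As a simplified reading of "sparse RGB keyframes for appearance plus a motion representation for temporal dynamics", the sketch below strides over the frames for keyframes and approximates motion with frame differences. A real compressed-domain pipeline would instead pull I-frames and codec motion vectors; the stride value and the difference-based motion proxy are assumptions.

```python
import numpy as np

def split_video(frames, keyframe_stride=8):
    """frames: (T, H, W, 3) uint8 array.
    Returns sparse RGB keyframes plus a coarse per-frame motion magnitude."""
    keyframes = frames[::keyframe_stride]                     # appearance stream
    diffs = np.abs(np.diff(frames.astype(np.int16), axis=0))  # temporal dynamics
    motion = diffs.mean(axis=(1, 2, 3))                       # (T-1,) motion magnitude
    return keyframes, motion

video = np.random.randint(0, 256, size=(32, 8, 8, 3), dtype=np.uint8)
keys, motion = split_video(video)
print(keys.shape, motion.shape)   # (4, 8, 8, 3) (31,)
```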
AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation
Published in arXiv, 2026
We propose AnoleVLA, a lightweight VLA that uses a deep state space model to process multimodal sequences efficiently, outperforming a representative large-scale VLA by 21 points in task success rate while achieving approximately three times faster inference. A toy state-space layer illustrating the linear-time idea follows this entry.
Recommended citation: Y. Takagi, M. Kambara, D. Yashima, K. Seno, K. Tokura, and K. Sugiura, "AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation", arXiv preprint arXiv:2603.15046, 2026.
Download Paper
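AnoleVLA's architecture is not reproduced here; the toy layer below only illustrates why a deep state space model scales well: a linear recurrence mixes a token sequence in O(T·D) time, versus O(T²·D) for self-attention. The diagonal parameterization and the parameter values are placeholders.

```python
import numpy as np

def ssm_layer(x, A, B, C):
    """x: (T, D) input tokens; A, B, C: (D,) diagonal SSM parameters.
    Recurrence: h_t = A * h_{t-1} + B * x_t;  y_t = C * h_t."""
    h = np.zeros_like(x[0])
    ys = []
    for x_t in x:                 # a real implementation would use a parallel scan
        h = A * h + B * x_t
        ys.append(C * h)
    return np.stack(ys)

T, D = 16, 8
rng = np.random.default_rng(0)
tokens = rng.normal(size=(T, D))  # e.g. fused vision + language tokens
y = ssm_layer(tokens, A=np.full(D, 0.9), B=np.ones(D), C=np.ones(D))
print(y.shape)                    # (16, 8); the last step could feed an action head
```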
HiFlow: Tokenization-Free Scale-Wise Autoregressive Policy Learning via Flow Matching
Published in arXiv, 2026
We propose HiFlow, a tokenization-free coarse-to-fine autoregressive policy that operates directly on raw continuous actions via flow matching, eliminating the need for discrete action tokenizers. A minimal flow-matching loss is sketched after this entry.
Recommended citation: D. Yashima, K. Seno, S. Kurita, Y. Oda, and K. Sugiura, "HiFlow: Tokenization-Free Scale-Wise Autoregressive Policy Learning via Flow Matching", arXiv preprint, 2026.
Download Paper
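The scale-wise, coarse-to-fine part of HiFlow is specific to the paper; the sketch below shows only the underlying flow-matching objective on raw continuous actions: interpolate between a noise sample and an expert action, then regress the constant velocity of that path. The 7-dimensional action space and the small MLP are assumptions.

```python
import torch
import torch.nn as nn

# Velocity network: maps an interpolated action plus a time scalar to a velocity.
net = nn.Sequential(nn.Linear(7 + 1, 64), nn.ReLU(), nn.Linear(64, 7))

def flow_matching_loss(actions):
    """actions: (B, 7) expert actions, treated as the target distribution x1."""
    x0 = torch.randn_like(actions)              # noise sample
    t = torch.rand(len(actions), 1)             # random time in [0, 1]
    xt = (1 - t) * x0 + t * actions             # point on the linear path x0 -> x1
    v_target = actions - x0                     # the path's constant velocity
    v_pred = net(torch.cat([xt, t], dim=-1))    # predict velocity at (xt, t)
    return ((v_pred - v_target) ** 2).mean()

loss = flow_matching_loss(torch.randn(32, 7))
loss.backward()
print(f"loss = {loss.item():.3f}")
```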
ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning
Published in ICPR 2026, 2026
We propose ABMamba, a fully open MLLM based on Deep State Space Models with linear computational complexity that enables scalable video captioning, achieving competitive performance with approximately three times higher throughput. A bidirectional-scan sketch follows this entry.
Recommended citation: D. Yashima, S. Kurita, Y. Oda, S. Suzuki, S. Otsuki, and K. Sugiura, "ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning", ICPR, 2026.
Download Paper
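ABMamba's aligned hierarchical scan is paper-specific; the sketch below illustrates just the bidirectional-scan building block: run a causal linear recurrence forward and over the reversed sequence, then merge, so every position sees both directions in linear time. The EMA-style recurrence and the additive merge are assumptions.

```python
import numpy as np

def scan(x, decay=0.9):
    """Causal exponential-moving-average scan over x: (T, D)."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t, x_t in enumerate(x):
        h = decay * h + (1 - decay) * x_t
        out[t] = h
    return out

def bidirectional_scan(x):
    forward = scan(x)
    backward = scan(x[::-1])[::-1]  # reverse, scan, reverse back
    return forward + backward       # each position now sees both directions

tokens = np.random.default_rng(0).normal(size=(12, 4))  # e.g. video-frame tokens
print(bidirectional_scan(tokens).shape)                 # (12, 4)
```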
