Daichi Yashima

I am a Ph.D. student in Computer Science at Keio University, advised by Prof. Komei Sugiura. I am supported by the JSPS Research Fellowship for Young Scientists (DC1). I started my Ph.D. in April 2026 after completing the Master's program in one year.

My research focuses on foundation models and multimodal language understanding for embodied AI: building systems that can execute complex tasks in the physical world. I work on multimodal large language models, vision-language-action models, video understanding, and mobile manipulation.

News

2026/07 I will be co-organizing the LIMIT workshop at ECCV 2026.
2026/06 New paper "Flow as Flow" is out!
2026/06 Our paper has been accepted to IROS 2026.
2026/06 Our paper has been accepted to INTERSPEECH 2026.
2026/04 Awarded JSPS Research Fellowship for Young Scientists (DC1).
2026/04 Started my Ph.D. at Keio University (completed the Master's program one year early).
2026/03 Our paper has been accepted to ICPR 2026.
2026/02 Our papers have been accepted to CVPR 2026 and CVPR 2026 Findings.
2025/03 Our paper has been accepted to IEEE RA-L.
2025/02 Our paper has been accepted to IEEE RA-L.

Publications

2026

RIGEL: Self-Distilled Score Adaptation for Image and Video Captioning Evaluation

S. Koyama, K. Matsuda, Y. Wada, S. Hirano, D. Yashima, and K. Sugiura

Preprint

[paper]

Flow as Flow: Modeling Robot Velocity Fields as Probability Velocity Fields for Flow-Based Object Manipulation

K. Seno, D. Yashima, Y. Takagi, K. Tokura, and K. Sugiura

Preprint

[paper] [project]

HiFlow: Tokenization-Free Scale-Wise Autoregressive Policy Learning via Flow Matching

D. Yashima, K. Seno, S. Kurita, Y. Oda, and K. Sugiura

IROS 2026 (Acceptance Rate: 36%, h5-index: 92)

[paper] [project]

ELSA: Acoustic Event-Level Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation

S. Suzuki*, K. Tokura*, D. Yashima*, K. Amemiya*, K. Sugiura, and S. Takamichi

INTERSPEECH 2026 (h5-index: 112)

[paper] [project] [code]

MLLM-as-a-Judge Exhibits Model Preference Bias

S. Koyama*, Y. Wada*, D. Yashima*, and K. Sugiura

Preprint

[paper] [project]

ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning

D. Yashima, S. Kurita, Y. Oda, S. Suzuki, S. Otsuki, and K. Sugiura

ICPR 2026 (h5-index: 68)

[paper]

AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation

Y. Takagi, M. Kambara, D. Yashima, K. Seno, K. Tokura, and K. Sugiura

Preprint

[paper] [project]

ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding

D. Yashima, S. Kurita, Y. Oda, and K. Sugiura

CVPR 2026 (Acceptance Rate: 25.42%, h5-index: 450)

[paper] [project] [code]

NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries

K. Amemiya, D. Yashima, K. Katsumata, T. Komatsu, R. Korekata, S. Otsuki, and K. Sugiura

CVPR 2026 Findings (Acceptance Rate (main + findings): 36%, h5-index: 450)

[paper] [project] [dataset]

2025

AIRoA MoMa Dataset: A Large-Scale Hierarchical Dataset for Mobile Manipulation

R. Takanami, P. Khrapchenkov, S. Morikuni, J. Arima, Y. Takaba, S. Maeda, T. Okubo, G. Sano, S. Sekioka, A. Kadoya, M. Kambara, N. Nishiura, H. Suzuki, T. Yoshimoto, K. Sakamoto, S. Ono, H. Yang, D. Yashima, A. Horo, T. Motoda, K. Chiyoma, H. Ito, K. Fukuda, A. Goto, K. Morinaga, Y. Ikeda, R. Kawada, M. Yoshikawa, N. Kosuge, Y. Noguchi, K. Ota, T. Matsushima, Y. Iwasawa, Y. Matsuo, and T. Ogata

Preprint

[paper] [code]