Zixin Guo

I'm a fourth-year PhD student in the Department of Computer Science at Aalto University. My current research interests include:

  • Multimodal Learning, advised by Jorma Laaksonen (Aalto University, Finland 🇫🇮) and in collaboration with Min Cao (Soochow University, China 🇨🇳).
  • Eye Tracking, in close collaboration with Yue Jiang (University of Utah, USA 🇺🇸).

Email  /  Scholar  /  GitHub  /  LinkedIn

Selected Publications

* = Equal contribution     # = Mentoring role

Imagine How To Change: Explicit Procedure Modeling for Change Captioning
Jiayang Sun*, Zixin Guo*#, Min Cao, Guibo Zhu, Jorma Laaksonen
ICLR, 2026  (CCF A)
paper / code
SeekUI: Predicting Visual Search Behavior on Graphical User Interfaces with a Reward-Augmented Vision Language Model
Zixin Guo*, Yue Jiang*, Luis A. Leiva, Antti Oulasvirta
CHI, 2026  (CCF A)
paper / code
Learning to Describe Implicit Changes: Noise-Robust Pre-training for Image Difference Captioning
Zixin Guo, Jiayang Sun, Tzu-Jui Julius Wang, Abduljalil Radman, Selen Pehlivan, Min Cao, Jorma Laaksonen
Findings of EMNLP, 2025  (CCF B)
paper
EyeFormer: Predicting Scanpaths in Free-Viewing Tasks with Transformer-Guided Reinforcement Learning
Yue Jiang*, Zixin Guo*, Hamed Rezazadegan Tavakoli, Luis A. Leiva, Antti Oulasvirta
UIST, 2024  (CCF A)
paper / code
PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting
Zixin Guo, Tzu-Jui Julius Wang, Selen Pehlivan, Abduljalil Radman, Jorma Laaksonen
SIGIR, 2023  (Short Paper)  (CCF A)
paper
CLIP4IDC: CLIP for Image Difference Captioning
Zixin Guo, Tzu-Jui Julius Wang, Jorma Laaksonen
AACL, 2022  (Short Paper)  (50+ Citations)
paper / code

Other Publications

Valor32k-AVQA v2.0: Open-Ended Audio-Visual Question Answering Dataset and Benchmark
Ines Riahi, Abduljalil Radman, Zixin Guo, Rachid Hedjam, Jorma Laaksonen
ACM MM, 2025
paper
Prompt-based Weakly-supervised Vision-language Pre-training
Zixin Guo, Tzu-Jui Julius Wang, Selen Pehlivan, Abduljalil Radman, Min Cao, Jorma Laaksonen
Pattern Recognition Letters, 2025
paper
Efficient Text-to-video Retrieval via Multi-modal Multi-tagger Derived Pre-screening
Yingjia Xu, Mengxia Wu, Zixin Guo, Min Cao, Mang Ye, Jorma Laaksonen
Visual Intelligence, 2025
paper
Diffusion-based Multimodal Video Captioning
Jaakko Kainulainen, Zixin Guo#, Jorma Laaksonen
ACCV, 2024
paper
Impact of Design Decisions in Scanpath Modeling
Parvin Emami, Yue Jiang, Zixin Guo, Luis A. Leiva
ETRA, 2024
paper
Post-Attention Modulator for Dense Video Captioning
Zixin Guo, Tzu-Jui Julius Wang, Jorma Laaksonen
ICPR, 2022
paper

Website template from Jon Barron.