My work studies how large language models reason, act, refuse unsafe requests, and portray psychological traits. I turn these questions into benchmarks, open-source systems, and training techniques for more reliable AI applications.

Multimodal, tool-using, and collaborative agents for complex, long-horizon tasks.

ArXiv 2026

MuSEAgent: A Multimodal Reasoning Agent with Stateful Experiences

Abstracts interaction data into atomic decision experiences via hindsight reasoning, then retrieves them at inference through policy-driven wide- and deep-search strategies, improving fine-grained visual perception and multimodal reasoning over trajectory-level retrieval baselines.

ArXiv 2026

MMSkills: Towards Multimodal Skills for General Visual Agents

Represents reusable multimodal procedures as compact, state-conditioned skill packages — a textual procedure paired with runtime state cards and multi-view keyframes — generated from public trajectories and consulted by a branch-loaded skill agent at runtime.

WWW 2026

DeepAgent: A General Reasoning Agent with Scalable Toolsets

A reasoning agent that tackles general tasks by searching for and invoking the appropriate tools from over 16,000 RapidAPIs within a single end-to-end agentic reasoning process.

ArXiv 2026

OmniGAIA

A benchmark for evaluating omni-modal general AI assistants.

EMNLP 2024

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

The Multi-Agent Debate (MAD) framework addresses the Degeneration-of-Thought problem and explores divergent chains of thought through structured agent interaction.

Mathematical, reflective, and efficient long-chain reasoning.

ICLR 2026

REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Large Reasoning Models

Tackles overthinking in large reasoning models with a small reflection model that enables parallel sampling and sequential revision during online RL, plus a reflection reward that preserves reflection ability — reducing inference cost by ~35% without sacrificing accuracy.

ICLR 2026

DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains

A dual-reward RL framework that classifies problems as Simple or Hard in real time and adaptively shortens or extends Chain-of-Thought, improving both accuracy and token efficiency on challenging math benchmarks.

EMNLP 2024

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Defines the Degeneration-of-Thought problem in self-reflection and addresses it with multi-agent debate over divergent chains of thought.

Risk awareness, jailbreak robustness, multilingual safety, and refusal behavior.

NeurIPS 2025 D&B

Towards Evaluating Proactive Risk Awareness of Multimodal Language Models

ACL 2025

Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training (DeRTa)

ACL 2025 (Findings)

Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step

ICLR 2024

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

CipherChat examines whether safety alignment generalizes to non-natural languages such as ciphers; it finds that GPT-4 understands ciphers well enough that cipher-encoded conversations can elicit unsafe outputs.

Emotion, personality, and psychological portrayals in conversational AI.

ICLR 2024 Oral

On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs

PPBench evaluates diverse psychological aspects of LLMs, including personality traits, interpersonal relationships, motivational tests, and emotional abilities.

NeurIPS 2024

Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench

Preprint

ChatGPT an ENFJ, Bard an ISTJ: Empirical Study on Personalities of Large Language Models