My research studies how large language models reason, act as agents, refuse unsafe requests, and portray psychological traits. I develop benchmarks, open-source systems, and training/alignment methods to diagnose model failures and build more reliable AI applications.

Multimodal, tool-using, and collaborative agents for complex, long-horizon tasks.

MuSEAgent: A Multimodal Reasoning Agent with Stateful Experiences

Abstracts interaction data into atomic decision experiences via hindsight reasoning, then retrieves them at inference through policy-driven wide- and deep-search strategies, improving fine-grained visual perception and multimodal reasoning over trajectory-level retrieval baselines.

MMSkills: Towards Multimodal Skills for General Visual Agents

Represents reusable multimodal procedures as compact, state-conditioned skill packages — a textual procedure paired with runtime state cards and multi-view keyframes — generated from public trajectories and consulted by a branch-loaded skill agent at runtime.

DeepAgent: A General Reasoning Agent with Scalable Toolsets

A reasoning agent capable of tackling general tasks by searching for and using the appropriate tools from over 16,000 RapidAPIs in an end-to-end agentic reasoning process.

OmniGAIA: Towards Native Omni-Modal AI Agents

A benchmark and foundation agent for omni-modal AI assistants. OmniGAIA synthesizes multi-hop queries across video, audio, and image via an omni-modal event graph; the accompanying OmniAtlas agent uses active omni-modal perception trained with hindsight-guided tree exploration and OmniDPO.

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

The Multi-Agent Debate (MAD) framework addresses the Degeneration-of-Thought problem and explores divergent chains of thought through structured agent interaction.
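
The debate loop can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Debater` and `Judge` interfaces, the `run_debate` function, the round budget, and the stub agents are all hypothetical names introduced here, with the stubs standing in for real LLM calls.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces (assumptions, not from the paper): a debater maps the
# topic and the transcript so far to its next argument; a judge returns a final
# answer, or None to request another round.
Debater = Callable[[str, List[str]], str]
Judge = Callable[[str, List[str]], Optional[str]]


def run_debate(topic: str, affirmative: Debater, negative: Debater,
               judge: Judge, max_rounds: int = 3) -> str:
    """Alternate two debaters until the judge settles or rounds run out."""
    transcript: List[str] = []
    for _ in range(max_rounds):
        transcript.append("AFF: " + affirmative(topic, transcript))
        transcript.append("NEG: " + negative(topic, transcript))
        verdict = judge(topic, transcript)
        if verdict is not None:
            return verdict
    # Round budget exhausted: fall back to the last argument made.
    return transcript[-1].split(": ", 1)[1]


# Stub agents standing in for LLM calls (purely illustrative).
def stub_affirmative(topic: str, transcript: List[str]) -> str:
    return "The answer is 5."


def stub_negative(topic: str, transcript: List[str]) -> str:
    return "The answer is 10."


def stub_judge(topic: str, transcript: List[str]) -> Optional[str]:
    # Settle only after two full rounds have been heard.
    return "10" if len(transcript) >= 4 else None
```

Calling `run_debate("some question", stub_affirmative, stub_negative, stub_judge)` runs two rounds before the stub judge settles. Keeping the judge separate from the debaters is one plausible way to stop a single model from simply reaffirming its own first answer.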

Mathematical, reflective, and efficient long-chain reasoning.

REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Large Reasoning Models

Tackles overthinking in large reasoning models with a small reflection model that enables parallel sampling and sequential revision during online RL, plus a reflection reward that preserves reflection ability — reducing inference cost by ~35% without sacrificing accuracy.

DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains

A dual-reward RL framework that classifies problems as Simple or Hard in real time and adaptively shortens or extends Chain-of-Thought, improving both accuracy and token efficiency on challenging math benchmarks.
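
One way to picture a difficulty-dependent length reward is the sketch below. It rests on assumptions that do not come from the paper: difficulty is judged by the pass rate within a group of sampled responses against a fixed `threshold`, lengths are min-max normalized within the group, and the length term is added to a 0/1 correctness reward with a hypothetical `weight`.

```python
from typing import List


def length_rewards(lengths: List[int], correct: List[bool],
                   threshold: float = 0.5, weight: float = 0.2) -> List[float]:
    """Correctness reward plus a length term whose direction depends on difficulty.

    Assumption: a problem counts as Simple when the group's pass rate is at
    least `threshold`, and as Hard otherwise.
    """
    pass_rate = sum(correct) / len(correct)
    is_simple = pass_rate >= threshold

    lo, hi = min(lengths), max(lengths)
    span = (hi - lo) or 1  # avoid division by zero when all lengths match

    rewards = []
    for n, ok in zip(lengths, correct):
        norm = (n - lo) / span  # 0 = shortest response, 1 = longest
        # Simple problems favor shorter chains; Hard problems favor longer ones.
        length_term = (1 - norm) if is_simple else norm
        rewards.append((1.0 if ok else 0.0) + weight * length_term)
    return rewards
```

With three correct responses of lengths 100, 200, and 300 tokens, the problem is treated as Simple and the shortest response receives the largest bonus; if only the longest response is correct, the problem is treated as Hard and the bonus grows with length instead.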

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Defines the Degeneration-of-Thought problem in self-reflection and addresses it with multi-agent debate over divergent chains of thought.

Risk awareness, jailbreak robustness, multilingual safety, and refusal behavior.

Towards Evaluating Proactive Risk Awareness of Multimodal Language Models

PaSBench evaluates proactive safety across 416 multimodal scenarios in five safety-critical domains. Top models such as Gemini-2.5-pro reach 64–71% accuracy but miss 45–55% of risks across repeated trials; failure analysis traces this to unstable proactive reasoning rather than missing knowledge.

Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training (DeRTa)

Identifies a refusal position bias in safety-tuning data and proposes Decoupled Refusal Training: MLE with a harmful response prefix plus Reinforced Transition Optimization, letting LLaMA-3 and Mistral models refuse at any position throughout a harmful response without hurting performance.

Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step

Decomposes a malicious image-generation query into innocuous sub-queries and iteratively edits the output; bypasses safeguards on GPT-4V, GPT-4o, Gemini 1.5, and Gemini 1.5 Pro in over 60% of cases. A companion Think Twice Prompting defense blocks more than 95% of these attacks.

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

CipherChat examines whether safety alignment generalizes to non-natural languages such as ciphers; GPT-4 understands ciphers well enough to produce unsafe outputs.
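
To make "non-natural language" concrete, here is a minimal sketch of one classic cipher, a Caesar shift, applied to a benign message. The choice of cipher, the `caesar` function name, and the shift value are illustrative assumptions and are not taken from the description above.

```python
def caesar(text: str, shift: int) -> str:
    """Shift each ASCII letter by `shift` positions, leaving other characters alone."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)  # digits, spaces, punctuation pass through unchanged
    return "".join(out)


encoded = caesar("Hello, World!", 3)   # "Khoor, Zruog!"
decoded = caesar(encoded, -3)          # back to "Hello, World!"
```

Text encoded this way no longer resembles natural language on the surface, which is why it is a useful probe of whether alignment learned on natural-language data carries over.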

Emotion, personality, and psychological portrayals in conversational AI.

On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs

PsychoBench evaluates diverse psychological aspects of LLMs, including personality traits, interpersonal relationships, motivational tests, and emotional abilities.

Apathetic or Empathetic? Evaluating LLMs' Emotional Alignments with Humans

Uses emotion appraisal theory to test how LLMs' feelings shift across 400+ situations grouped into 36 factors, benchmarked against responses from 1,200+ human subjects. Models including GPT-4, Mixtral-8x22B, and LLaMA-3.1 respond appropriately in some cases but fail to align with human emotional behavior or connect similar situations.

On the Reliability of Psychological Scales on Large Language Models

Across 2,500 settings per model on GPT-3.5/4, Gemini-Pro, and LLaMA-3.1, shows that LLMs respond consistently to the Big Five Inventory. Further demonstrates GPT-3.5 can emulate diverse personalities and represent specific population groups when given targeted prompts.
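
Response consistency of this kind can be quantified roughly as below. The sketch assumes a 1-to-5 Likert scale with some reverse-keyed items and uses the standard deviation of a trait score across prompt settings as the consistency measure. The function names, the sample responses, and the choice of statistic are assumptions made for illustration, not the paper's protocol.

```python
from statistics import mean, pstdev
from typing import List, Set


def trait_score(responses: List[int], reverse_keyed: Set[int],
                scale_max: int = 5) -> float:
    """Average one trait's items, flipping reverse-keyed ones (1 <-> 5 on a 5-point scale)."""
    adjusted = [
        (scale_max + 1 - r) if i in reverse_keyed else r
        for i, r in enumerate(responses)
    ]
    return mean(adjusted)


def consistency(settings: List[List[int]], reverse_keyed: Set[int]) -> float:
    """Population standard deviation of the trait score across settings (lower = more stable)."""
    return pstdev(trait_score(s, reverse_keyed) for s in settings)


# Hypothetical responses to three items of one trait under three prompt settings;
# item index 1 is reverse-keyed.
settings = [[4, 2, 4], [5, 1, 4], [4, 2, 5]]
spread = consistency(settings, reverse_keyed={1})
```

A spread near zero means the model's trait score barely moves when the prompt wording, item order, or language changes, which is the sense of "consistent" used above.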