Reasoning-Driven Amodal Completion: Collaborative Agents and Perceptual Evaluation

Hongxing Fan1, Shuyu Zhao1, Jiayang Ao2, Lu Sheng1

1Beihang University    2The University of Melbourne

Corresponding Author

Teaser
Our framework tackles complex occlusions through these key capabilities: (1) Structural & Semantic Reasoning, which recovers geometric continuity (e.g., hidden limbs) and contextual details (e.g., text) beyond pixel clues; and (2) Diverse Hypothesis Generation, which models the multimodal nature of invisible regions (e.g., diverse plushie states). Furthermore, we introduce (3) the MAC-Score, a human-aligned evaluation metric. As shown in the bottom-right, it resolves the paradox where traditional metrics favor incomplete results, providing a robust assessment of amodal completion.

Abstract

Amodal completion, the task of inferring invisible object parts, faces significant challenges in maintaining semantic consistency and structural integrity. Prior progressive approaches are inherently limited by error accumulation and inference instability. To tackle these limitations, we present a Collaborative Multi-Agent Reasoning Framework that explicitly decouples Semantic Planning from Visual Synthesis. By employing specialized agents for upfront reasoning, our method generates a structured, explicit plan before pixel generation, enabling visually and semantically coherent single-pass synthesis. We integrate this framework with two critical mechanisms: (1) a self-correcting Verification Agent that employs Chain-of-Thought reasoning to rectify visible region segmentation and identify residual occluders strictly within the Semantic Planning phase, and (2) a Diverse Hypothesis Generator that addresses the ambiguity of invisible regions by offering diverse, plausible semantic interpretations, surpassing the limited pixel-level variations of standard random seed sampling. Furthermore, addressing the limitations of traditional metrics in assessing inferred invisible content, we introduce the MAC-Score (MLLM Amodal Completion Score), a novel human-aligned evaluation metric. Validated against human judgment and ground truth, this metric establishes a robust standard for assessing structural completeness and consistency with visible context. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods across multiple datasets.

Method

Method overview
Overview of the proposed Closed-Loop Collaborative Multi-Agent Reasoning Framework. The pipeline fundamentally decouples semantic planning from visual synthesis through three coordinated stages: (1) Holistic Collaborative Reasoning: A coalition of agents collaborates to parse the scene’s geometry, forming an initial spatial plan. (2) Closed-Loop Verification: A self-correcting mechanism where a Verification Agent iteratively scrutinizes the initial plan to rectify segmentation errors and recover overlooked occluders. (3) Hypothesis Generation: To address the uncertainty of invisible regions, the Hypothesis Agent leverages the refined context to propose diverse semantic descriptions. Finally, the Inpainting Agent executes the verified plan to synthesize the high-fidelity amodal result in a single pass.
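For clarity, a minimal sketch of this plan-then-synthesize flow is given below. The agent interfaces (`reasoner`, `verifier`, `hypothesizer`, `inpainter`) and the `Plan` structure are hypothetical stand-ins used only to illustrate how semantic planning is decoupled from visual synthesis; they are not the released implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    """Explicit semantic plan produced before any pixel generation."""
    visible_mask: object                           # corrected visible-region segmentation
    occluders: list = field(default_factory=list)  # residual occluders to be removed
    notes: str = ""                                # free-form scene reasoning

def amodal_complete(image, target_mask, agents, max_rounds=3, k=3):
    # Stage 1: Holistic Collaborative Reasoning -> initial spatial plan.
    plan = agents["reasoner"](image, target_mask)

    # Stage 2: Closed-Loop Verification -> the Verification Agent critiques the
    # plan (Chain-of-Thought) and the plan is rectified *before* synthesis,
    # so errors are corrected during planning rather than accumulated.
    for _ in range(max_rounds):
        feedback = agents["verifier"](image, plan)
        if feedback.get("consistent", False):
            break
        plan.visible_mask = feedback.get("mask", plan.visible_mask)
        plan.occluders += feedback.get("new_occluders", [])

    # Stage 3: Diverse Hypothesis Generation -> k plausible semantic
    # descriptions of the invisible region, conditioned on the verified plan.
    hypotheses = agents["hypothesizer"](image, plan, k)

    # Single-pass synthesis: the Inpainting Agent executes the verified plan
    # once per hypothesis.
    return [agents["inpainter"](image, plan, h) for h in hypotheses]
```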

Amodal Completion with Guidance

Qualitative results
Qualitative comparison against SOTA amodal completion approaches. We compare with Pix2Gestalt, PD-MC, and OWAAC. Annotations denote MAC-Consistency scores and MAC-Completeness states (✓/×). Row 1 (Anatomy): Baselines truncate the tail or hallucinate unrealistic limbs (e.g., OWAAC’s legs), while we recover the natural bird shape. Row 2 (Texture): Baselines leave significant “ghost” artifacts (e.g., yellow blobs), whereas we cleanly recover the texture. Rows 3 & 4 (Reasoning): Baselines struggle with reasoning. In Row 3, PD-MC hallucinates an unnatural object; in Row 4, OWAAC misinterprets the dog’s pose. Our method correctly infers both geometry and posture. Row 5 (Geometry): Competitors fail to extrapolate the taxi’s length, resulting in distorted/truncated bodies; ours maintains the taxi’s structural completeness. Row 6 (Text): Only our method accurately recovers missing semantic text characters. Overall, our superior visual quality aligns with the higher quantitative scores, supporting the validity of our proposed metrics.

Results Images

Metrics

View completeness prompt
You are an expert in visual perception and object recognition.

You will be given **two images**:
- The **first image** is the original image containing the scene.
- The **second image** is the **segmented result** of the main object, obtained from an **Amodal Completion** task.

Your task is to determine whether the segmented object (the second image) represents a **complete and intact** version of the object seen in the original image.

Definitions:

- **"Complete"** means that the object is entirely visible within the image frame, not partially cut off, hidden, or distorted. The segmented result should contain the object in its natural, full form as it appears in the real world.
- **"Incomplete"** means the object is missing parts, truncated at the edges, occluded, or not consistent with the full object that should exist in the original image.

**Important:**
- Focus on comparing the segmented result (second image) with the original image (first image).  
  The segmented object should correspond to the same object visible in the original image and should not miss essential parts.
- If the segmented object appears cut off, has missing limbs or edges, or is inconsistent with the object’s full structure in the original image, it should be classified as **Incomplete**.

Instructions:

1. Carefully compare the segmented object (second image) with the original image (first image).
2. Determine if the segmented object is **Complete** or **Incomplete**.
3. Provide your decision in this strict JSON format:

{
  "object_status": "Complete" | "Incomplete",
  "explanation": "A short sentence explaining why you made this decision, focusing on missing parts, truncation, or mismatch with the original image."
}

Note:
- Only use "Complete" or "Incomplete" as the categories.
- Focus on whether the segmented result accurately represents the complete form of the object seen in the original image.
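
Below is a minimal sketch of how the completeness prompt above can be issued to an MLLM judge and its JSON verdict parsed. It assumes an OpenAI-compatible vision chat endpoint; the `gpt-4o` model name, file paths, and the placeholder prompt string are illustrative assumptions, not a fixed part of the metric.

```python
import base64
import json
from openai import OpenAI  # assumes an OpenAI-compatible vision/chat endpoint

COMPLETENESS_PROMPT = "..."  # paste the full completeness prompt shown above

def to_data_url(path):
    """Encode a local image as a data URL accepted by the chat API."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def mac_completeness(original_path, completed_path, model="gpt-4o"):
    """Ask the MLLM judge whether the completed object is 'Complete'."""
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,  # placeholder judge model, not pinned by the prompt itself
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": COMPLETENESS_PROMPT},
                {"type": "image_url", "image_url": {"url": to_data_url(original_path)}},
                {"type": "image_url", "image_url": {"url": to_data_url(completed_path)}},
            ],
        }],
    )
    # Assumes the judge follows the strict JSON format requested in the prompt.
    verdict = json.loads(resp.choices[0].message.content.strip())
    return verdict["object_status"] == "Complete", verdict["explanation"]
```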
View consistency prompt
You are an evaluator comparing two images:

The original image, which contains the visible part of the object (partially occluded).
The completed image, which shows the object after amodal completion.

Amodal completion is the process of inferring and representing an object’s occluded parts so the object is understood as a **complete, closed whole**.

Your task is to rate how consistent and realistic the completed object itself is.

**Critical Definition of "Incomplete":**
If the completed object still looks cut off, truncated, or has a straight "image border" edge where it should be round or continuous, it is considered **Incomplete**. This is a structural failure.

Evaluation Dimensions:

**1. Structural Continuity (0–4 points)**
*Focus on the closure and logical continuation of the shape.*
* **0: The object boundary is abruptly cut off, forming a straight line or unnatural truncation (looks like the original occluded input). The shape is NOT closed.**
* 1: The object attempts to close the shape but the boundary is severely distorted, jagged, or structurally impossible.
* 2: Generally continuous but with noticeable misalignment or irregularities in the completed region.
* 3: Contours flow seamlessly and align well between completed and visible regions.
* 4: Boundaries are perfectly closed, continuous, and fully consistent with the visible parts.

**2. Semantic Consistency (0–4 points)**
* 0: The completed region introduces incorrect or unrelated elements.
* 1: Roughly matches the object but contains major semantic errors (e.g., wrong parts or unrealistic details).
* 2: Generally consistent but with notable smaller semantic inaccuracies.
* 3: Mostly consistent, with only very minor or negligible semantic differences.
* 4: Perfect match to the original object’s type, structure, and expected real-world form.

**3. Object Realism (0–2 points)**
* 0: The completed object does not resemble a plausible real-world version of the object (e.g., a half-object is not realistic).
* 1: Somewhat realistic but with small inconsistencies.
* 2: Perfectly realistic and faithful to how this object should appear in reality.

Scoring:
Add up the points from all categories.
score = Structural Continuity + Semantic Consistency + Object Realism

Output Format:
{
  "score": X,
  "dimension_scores": {
    "structural_continuity": Y,
    "semantic_consistency": Z,
    "object_realism": A
  },
  "explanation": "One or two sentences summarizing why you gave this score."
}
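
As a reference for how the rubric is totaled, the sketch below sums the three dimensions (0–4, 0–4, and 0–2, giving a 0–10 consistency score) and reports dataset-level numbers as a completeness rate alongside the mean consistency. The aggregation choices here are illustrative assumptions rather than a prescribed protocol.

```python
def mac_consistency_total(dimension_scores):
    """Sum the rubric: Structural Continuity (0-4) + Semantic Consistency (0-4)
    + Object Realism (0-2), giving a 0-10 consistency score."""
    return (dimension_scores["structural_continuity"]
            + dimension_scores["semantic_consistency"]
            + dimension_scores["object_realism"])

def aggregate_mac(results):
    """results: one dict per sample, holding the judge's 'object_status'
    (from the completeness prompt) and 'dimension_scores' (from this prompt)."""
    n = len(results)
    completeness_rate = sum(r["object_status"] == "Complete" for r in results) / n
    mean_consistency = sum(mac_consistency_total(r["dimension_scores"]) for r in results) / n
    return {"MAC-Completeness": completeness_rate, "MAC-Consistency": mean_consistency}
```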

BibTeX