Amodal completion, the task of inferring invisible object parts, faces significant challenges in maintaining semantic consistency and structural integrity. Prior progressive approaches are inherently limited by error accumulation and inference instability. To tackle these limitations, we present a Collaborative Multi-Agent Reasoning Framework that explicitly decouples Semantic Planning from Visual Synthesis. By employing specialized agents for upfront reasoning, our method generates a structured, explicit plan before pixel generation, enabling visually and semantically coherent single-pass synthesis. We integrate this framework with two critical mechanisms: (1) a self-correcting Verification Agent that employs Chain-of-Thought reasoning to rectify visible-region segmentation and identify residual occluders strictly within the Semantic Planning phase, and (2) a Diverse Hypothesis Generator that addresses the ambiguity of invisible regions by offering diverse, plausible semantic interpretations, surpassing the limited pixel-level variations of standard random-seed sampling. Furthermore, addressing the limitations of traditional metrics in assessing inferred invisible content, we introduce the MAC-Score (MLLM Amodal Completion Score), a novel human-aligned evaluation metric. Validated against human judgment and ground truth, the MAC-Score establishes a robust standard for assessing structural completeness and consistency with the visible context. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods across multiple datasets.
You are an expert in visual perception and object recognition.
You will be given **two images**:
- The **first image** is the original image containing the scene.
- The **second image** is the **segmented result** of the main object, obtained from an **Amodal Completion** task.
Your task is to determine whether the segmented object (the second image) represents a **complete and intact** version of the object seen in the original image.
Definitions:
- **"Complete"** means that the object is entirely visible within the image frame, not partially cut off, hidden, or distorted. The segmented result should contain the object in its natural, full form as it appears in the real world.
- **"Incomplete"** means the object is missing parts, truncated at the edges, occluded, or not consistent with the full object that should exist in the original image.
**Important:**
- Focus on comparing the segmented result (second image) with the original image (first image). The segmented object should correspond to the same object visible in the original image and should not be missing essential parts.
- If the segmented object appears cut off, has missing limbs or edges, or is inconsistent with the object’s full structure in the original image, it should be classified as **Incomplete**.
Instructions:
1. Carefully compare the segmented object (second image) with the original image (first image).
2. Determine if the segmented object is **Complete** or **Incomplete**.
3. Provide your decision in this strict JSON format:
{
"object_status": "Complete" | "Incomplete",
"explanation": "A short sentence explaining why you made this decision, focusing on missing parts, truncation, or mismatch with the original image."
}
Note:
- Only use "Complete" or "Incomplete" as the categories.
- Focus on whether the segmented result accurately represents the complete form of the object seen in the original image.
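The prompt above requires the evaluator to answer in a strict JSON format with exactly two categories. As a minimal sketch of how a caller might validate such a response before counting it toward an accuracy statistic, the helper below (`parse_completeness_verdict` is a hypothetical name, not part of our pipeline) rejects any reply that deviates from the schema:

```python
import json

VALID_STATUSES = {"Complete", "Incomplete"}

def parse_completeness_verdict(raw: str) -> dict:
    """Parse and validate the strict-JSON verdict returned by the evaluator.

    Raises ValueError if the response does not follow the required schema:
    {"object_status": "Complete" | "Incomplete", "explanation": "<non-empty>"}
    """
    verdict = json.loads(raw)
    status = verdict.get("object_status")
    if status not in VALID_STATUSES:
        raise ValueError(f"object_status must be one of {VALID_STATUSES}, got {status!r}")
    explanation = verdict.get("explanation")
    if not isinstance(explanation, str) or not explanation.strip():
        raise ValueError("explanation must be a non-empty string")
    return verdict

# Example: a well-formed evaluator response is accepted as-is.
raw = '{"object_status": "Incomplete", "explanation": "The object is truncated at the mask boundary."}'
print(parse_completeness_verdict(raw)["object_status"])  # Incomplete
```

Rejecting malformed replies up front (rather than pattern-matching keywords in free text) keeps the Complete/Incomplete tally unambiguous when the evaluation is run at scale.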
You are an evaluator comparing two images:
- The **original image**, which contains the visible part of the object (partially occluded).
- The **completed image**, which shows the object after amodal completion.
Amodal completion is the process of inferring and representing an object’s occluded parts so the object is understood as a **complete, closed whole**.
Your task is to rate how consistent and realistic the completed object itself is.
**Critical Definition of "Incomplete":**
If the completed object still looks cut off, truncated, or has a straight "image border" edge where it should be round or continuous, it is considered **Incomplete**. This is a structural failure.
Evaluation Dimensions:
**1. Structural Continuity (0–4 points)**
*Focus on the closure and logical continuation of the shape.*
* **0: The object boundary is abruptly cut off, forming a straight line or unnatural truncation (looks like the original occluded input). The shape is NOT closed.**
* 1: The object attempts to close the shape but the boundary is severely distorted, jagged, or structurally impossible.
* 2: Generally continuous but with noticeable misalignment or irregularities in the completed region.
* 3: Contours flow seamlessly and align well between completed and visible regions.
* 4: Boundaries are perfectly closed, continuous, and fully consistent with the visible parts.
**2. Semantic Consistency (0–4 points)**
* 0: The completed region introduces incorrect or unrelated elements.
* 1: Roughly matches the object but contains major semantic errors (e.g., wrong parts or unrealistic details).
* 2: Generally consistent but with notable smaller semantic inaccuracies.
* 3: Mostly consistent, with only very minor or negligible semantic differences.
* 4: Perfect match to the original object’s type, structure, and expected real-world form.
**3. Object Realism (0–2 points)**
* 0: The completed object does not resemble a plausible real-world version of the object (e.g., a half-object is not realistic).
* 1: Somewhat realistic but with small inconsistencies.
* 2: Perfectly realistic and faithful to how this object should appear in reality.
Scoring:
Add up the points from all categories.
score = Structural Continuity + Semantic Consistency + Object Realism
Output Format:
{
"score": X,
"dimension_scores": {
"structural_continuity": Y,
"semantic_consistency": Z,
"object_realism": A
},
"explanation": "One or two sentences summarizing why you gave this score."
}
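The rubric caps Structural Continuity and Semantic Consistency at 4 points each and Object Realism at 2, so a valid total lies in [0, 10] and must equal the sum of the three dimensions. As a hedged sketch (the helper name `validate_mac_score` is illustrative, not part of the released code), a caller could enforce both constraints on the evaluator's JSON output:

```python
import json

# Allowed (min, max) per rubric dimension, mirroring the prompt above.
DIMENSION_RANGES = {
    "structural_continuity": (0, 4),
    "semantic_consistency": (0, 4),
    "object_realism": (0, 2),
}

def validate_mac_score(raw: str) -> int:
    """Validate a MAC-Score JSON response and return its total score (0-10).

    Raises ValueError if any dimension is out of range or if the reported
    total disagrees with the sum of the dimension scores.
    """
    result = json.loads(raw)
    dims = result["dimension_scores"]
    for name, (lo, hi) in DIMENSION_RANGES.items():
        value = dims[name]
        if not (lo <= value <= hi):
            raise ValueError(f"{name}={value} outside allowed range [{lo}, {hi}]")
    total = sum(dims[name] for name in DIMENSION_RANGES)
    if result["score"] != total:
        raise ValueError(f"reported score {result['score']} != dimension sum {total}")
    return total

# Example: a well-formed response scoring 3 + 4 + 2.
raw = ('{"score": 9, "dimension_scores": {"structural_continuity": 3, '
       '"semantic_consistency": 4, "object_realism": 2}, '
       '"explanation": "Contours close naturally with minor semantic drift."}')
print(validate_mac_score(raw))  # 9
```

Checking that the reported `score` equals the dimension sum catches a common failure mode where the evaluator hallucinates a total inconsistent with its own per-dimension ratings.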