In animation generation that combines paintings with live-action footage, the core challenge of cross-modal visual content creation is effectively reconciling feature discrepancies across multimodal ...