Timely and accurate maps of fine-grained urban villages (UVs) are essential for rational urban planning. This highlights the importance of automatic recognition methods as an alternative to labor-intensive land surveys, especially in large cities with high-density urban areas, where UV maps cannot be updated frequently. However, it is challenging to achieve recognition of UVs from remote sensing images that is both accurate and fine-grained in high-density cities, because the remote sensing features exhibited by UVs are poorly discriminative. To address this issue, we propose a hierarchical recognition framework that integrates remote and social sensing data to recognize fine-grained UVs. The framework follows the human cognition process, and each step has an explicit geographical meaning, which ensures interpretability. In addition, remote and social sensing data can be fused easily within the framework, so that the abstract concept of a UV is sufficiently characterized at both coarse and fine scales. To validate the proposed approach, we conducted extensive experiments in Shenzhen, a typical high-density megacity in China with complicated UVs, and obtained a fine-grained map with a spatial resolution of 2.5~m. The proposed approach achieved an overall accuracy of 96.23% and a Kappa coefficient of 0.920. Furthermore, comparative assessments and ablation studies demonstrate the effectiveness of both the hierarchical recognition framework and the fusion of remote and social sensing data.