Urban region function recognition is key to rational urban planning and management. Owing to the complex socioeconomic nature of functional land use, recognizing urban region functions in high-density cities from remote sensing images alone is difficult. Incorporating social sensing data has the potential to improve function classification performance; however, effectively integrating multi-source, multi-modal remote and social sensing data remains technically challenging. In this paper, we propose a novel end-to-end deep-learning-based fusion model for remote and social sensing data to address this issue. Two neural-network-based methods, one based on a one-dimensional convolutional neural network (CNN) and the other on a long short-term memory (LSTM) network, are developed to automatically extract discriminative time-dependent social sensing signature features, which are fused with remote sensing image features extracted by a residual neural network. A major difficulty in jointly exploiting social and remote sensing data is that the two sources are asynchronous, so one modality may be unavailable for a given region. We address this missing-modality problem with a deep-learning-based strategy that enforces cross-modal feature consistency (CMFC) and cross-modal triplet (CMT) constraints. The model is trained end to end by simultaneously optimizing three costs: the classification cost, the CMFC cost, and the CMT cost. Extensive experiments on publicly available datasets demonstrate the effectiveness of the proposed method in fusing remote and social sensing data for urban region function recognition. The results show that the seemingly unrelated physically sensed image data and socially sensed activity signatures can indeed complement each other and enhance the accuracy of urban region function recognition.
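To make the joint objective concrete, the sketch below illustrates one plausible form of the three-cost optimization described above: a classification cost combined with a CMFC term (pulling same-region remote and social sensing features together) and a CMT term (a triplet hinge across modalities). This is an illustrative sketch only, not the paper's implementation; the squared-L2 distances, the weights `w_cmfc` and `w_cmt`, and the `margin` value are all assumed for demonstration.

```python
import numpy as np

def cmfc_cost(f_rs, f_ss):
    """Cross-modal feature consistency (CMFC) term: a squared-L2 penalty
    pulling the remote sensing (f_rs) and social sensing (f_ss) feature
    vectors of the same region together. (Assumed form, for illustration.)"""
    return float(np.sum((f_rs - f_ss) ** 2))

def cmt_cost(anchor, positive, negative, margin=1.0):
    """Cross-modal triplet (CMT) term: a standard triplet hinge loss with
    the anchor from one modality and the positive/negative samples from
    the other. The margin is an illustrative hyperparameter."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return float(max(0.0, margin + d_pos - d_neg))

def total_cost(cls_cost, f_rs, f_ss, neg_ss, w_cmfc=0.1, w_cmt=0.1):
    """Joint objective: classification cost plus weighted CMFC and CMT
    terms, optimized simultaneously. Weights are assumed values."""
    return cls_cost + w_cmfc * cmfc_cost(f_rs, f_ss) \
                    + w_cmt * cmt_cost(f_rs, f_ss, neg_ss)

# Toy usage: 2-D features for one region (positive pair) and a negative
# social sensing feature from a different region.
f_rs = np.array([1.0, 0.0])   # remote sensing feature
f_ss = np.array([0.8, 0.1])   # same-region social sensing feature
neg  = np.array([-1.0, 1.0])  # social sensing feature of another region
loss = total_cost(0.5, f_rs, f_ss, neg)
```

Because the CMFC term aligns the two modality-specific feature spaces during training, a region whose social sensing signature is missing at inference time can still be classified from its remote sensing feature alone, which is the intuition behind using such constraints for the asynchronous-data problem.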