Introduction
With the rapid development of the mobile Internet and digital technology, short video platforms have become some of the largest and most active social media applications in the world. Platforms such as Douyin, TikTok, and Kuaishou have more than 600 million daily active users, and average daily usage time per user generally exceeds 90 minutes (He et al., 2024; Jiang et al., 2022). Their reach in content dissemination and their social influence are reshaping how information is consumed and, more broadly, the digital cultural ecosystem. However, as platform content becomes increasingly saturated and user interests grow more diverse and changeable, platforms face severe challenges in user retention and content distribution efficiency (Jing & Qing, 2024; Ping & Yue, 2024). Reported data indicate that the 7-day retention rate of new users is generally below 25%, while the recommendation click-through rate among active users continues to decline; some users also show high swipe-away (bounce) rates and a weakened willingness to interact (Jiang et al., 2022; Sannidhan et al., 2023). These problems suggest that current recommendation systems have clear bottlenecks in understanding and predicting user behavior. In this context, an in-depth understanding of users' decision-making behavior on short video platforms (such as content selection, interactive participation, and social sharing) is not only key to improving recommendation quality, enhancing user experience, and extending platform life cycles, but also an important academic topic for revealing the information dissemination mechanisms of social media and unlocking the potential of the digital economy.
Existing research has predominantly applied traditional machine learning techniques or unimodal data analysis to investigate user behavior, for example using text mining for sentiment analysis of comments or predicting video popularity from visual features alone, and these approaches exhibit notable limitations (Cai et al., 2022). First, short videos inherently combine visual, audio, and textual elements, so unimodal analysis cannot fully capture the semantic and emotional depth of the content (Goodwin et al., 2024; Zhou et al., 2021). Second, the user decision-making process is dynamic and sequential, shaped by historical behavior, social relationships, and real-time context, and traditional models cannot effectively capture temporal dependencies or the evolution of user preferences.
Multimodal deep learning can characterize short video content more comprehensively by integrating heterogeneous data such as images, text, and audio (He & Li, 2024; Xing et al., 2025). Reinforcement learning, which focuses on dynamic decision-making, optimizes a strategy through repeated interaction between an agent and its environment, making it well suited to simulating user decisions on platforms (He et al., 2025; Mubarak et al., 2021). The remaining challenge is to combine multimodal data processing, temporal modeling, and reinforcement learning effectively in order to build intelligent models that accurately reflect user decision-making (He et al., 2021; Rezaee et al., 2024).
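To make the agent-environment loop concrete, the following minimal sketch applies tabular Q-learning to a toy problem. It is an illustration under stated assumptions, not the model studied in this paper. The two-state environment (a "bored" or "engaged" user), the two actions (show familiar or novel content), the reward values, and all names such as `NEXT_STATE`, `REWARD`, `q_learning`, and `greedy_policy` are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy environment (an assumption for illustration only).
# States:  0 = user is bored, 1 = user is engaged.
# Actions: 0 = show familiar content, 1 = show novel content.
N_STATES, N_ACTIONS = 2, 2
NEXT_STATE = [[0, 1],   # bored:   familiar -> bored,   novel -> engaged
              [1, 0]]   # engaged: familiar -> engaged, novel -> bored
REWARD = [[0.0, 1.0],   # reward received for each (state, action) pair
          [1.0, 0.0]]


def q_learning(episodes=500, horizon=20, alpha=0.1, gamma=0.9,
               epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state = rng.randrange(N_STATES)
        for _ in range(horizon):
            # Explore with probability epsilon, otherwise act greedily.
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)
            else:
                action = max(range(N_ACTIONS), key=lambda a: q[state][a])
            reward = REWARD[state][action]
            next_state = NEXT_STATE[state][action]
            # Temporal-difference update toward r + gamma * max_a' Q(s', a').
            td_target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (td_target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Return the index of the highest-valued action for each state."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in q]


if __name__ == "__main__":
    q_table = q_learning()
    print("Q-table:", q_table)
    print("Greedy policy per state:", greedy_policy(q_table))
```

In this toy setting the learned greedy policy shows novel content to a bored user and familiar content to an engaged one. A deep Q-network replaces the table `q` with a neural network, so that states can be high-dimensional feature vectors instead of a small set of indices.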
This paper introduces MT-DQN (Multimodal Temporal Deep Q-Network), a model for analyzing user decision-making behavior on short video platforms that addresses these challenges. The model comprises three key components: a Transformer-based multimodal fusion module (Zhao et al., 2023) that enables cross-modal semantic understanding of video content; a temporal graph neural network module (Su & Wu, 2025) that captures evolving patterns of user behavior and social interaction; and a deep Q-network module (Yan et al., 2021) that refines decision-making strategies through reinforcement learning. Together, these components form a robust framework for user behavior analysis that covers content perception, behavior modeling, and decision optimization. The contributions of this work are summarized as follows: