Cite Article

MLA

Wang, Jinmeiyang, et al. "Research on Short-Video Platform User Decision-Making via Multimodal Temporal Modeling and Reinforcement Learning: Deep Learning for User Decision Behavior." JOEUC vol.37, no.1 2025: pp.1-24. https://doi.org/10.4018/JOEUC.389737

APA

Wang, J., Dong, J., & Zhou, L. (2025). Research on Short-Video Platform User Decision-Making via Multimodal Temporal Modeling and Reinforcement Learning: Deep Learning for User Decision Behavior. Journal of Organizational and End User Computing (JOEUC), 37(1), 1-24. https://doi.org/10.4018/JOEUC.389737

Chicago

Wang, Jinmeiyang, Jing Dong, and Li Zhou. "Research on Short-Video Platform User Decision-Making via Multimodal Temporal Modeling and Reinforcement Learning: Deep Learning for User Decision Behavior," Journal of Organizational and End User Computing (JOEUC) 37, no.1: 1-24. https://doi.org/10.4018/JOEUC.389737

Research on Short-Video Platform User Decision-Making via Multimodal Temporal Modeling and Reinforcement Learning: Deep Learning for User Decision Behavior

Jinmeiyang Wang (Shanghai Jiao Tong University, China), Jing Dong (Columbia University, USA), and Li Zhou (McGill University, Canada)

Source Title: Journal of Organizational and End User Computing (JOEUC) 37(1)

DOI: 10.4018/JOEUC.389737

Abstract

This paper proposes the MT-DQN model, which integrates a Transformer, a Temporal Graph Neural Network (TGNN), and a Deep Q-Network (DQN) to address the challenges of predicting user behavior and optimizing recommendation strategies in short-video environments. Experiments demonstrate that MT-DQN consistently outperforms traditional concatenation-based models, such as Concat-Modal, achieving an average F1-score improvement of 10.97% and an average NDCG@5 improvement of 8.3%. Compared with the classic reinforcement learning model Vanilla-DQN, MT-DQN reduces MSE by 34.8% and MAE by 26.5%. Nonetheless, the authors also recognize challenges in deploying MT-DQN in real-world scenarios, such as its computational cost and latency sensitivity during online inference, which will be addressed through future architectural optimization.
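The ranking and error figures quoted above can be made concrete with a small sketch. The code below shows one common way to compute NDCG@k and a relative error reduction. The function names, the linear-gain form of DCG, and the example numbers are illustrative assumptions; the abstract does not state which NDCG variant or evaluation code the authors used.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list.

    Assumption: linear gain (rel / log2(rank + 1)); some papers use 2**rel - 1.
    """
    # enumerate() is 0-based, so position 1 gets a discount of log2(2) = 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=5):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def relative_reduction(baseline_error, model_error):
    """Fractional reduction of an error metric (e.g. MSE) versus a baseline."""
    return (baseline_error - model_error) / baseline_error


# Hypothetical relevance labels for five recommended videos, in ranked order.
print(ndcg_at_k([3, 2, 0, 1, 0], k=5))
# Made-up error values chosen only to illustrate what a 34.8% reduction means.
print(relative_reduction(1.0, 0.652))
```

A ranking that already lists items in descending relevance scores 1.0, and any misordering lowers the score, which is why NDCG@5 is a natural fit for a feed where only the top few recommendations are seen.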

Introduction

With the rapid development of the mobile Internet and digital technology, short-video platforms have become some of the social media applications with the largest user bases and highest activity levels in the world. Platforms such as Douyin, TikTok, and Kuaishou have more than 600 million daily active users, and average daily usage time generally exceeds 90 minutes (He et al., 2024; Jiang et al., 2022). Their content dissemination and social influence are profoundly reshaping information consumption and the digital cultural ecosystem. However, as platform content becomes increasingly saturated and user interests grow more diverse and changeable, platforms face severe challenges in user retention and content distribution efficiency (Jing & Qing, 2024; Ping & Yue, 2024). Relevant data show that the 7-day retention rate of new users is generally below 25%, while the recommendation click-through rate of active users continues to decline. Some users show high swipe-away bounce rates and a weakened willingness to interact (Jiang et al., 2022; Sannidhan et al., 2023). These problems indicate that current recommendation systems have clear bottlenecks in understanding and predicting user behavior. In this context, an in-depth understanding of users' decision-making behavior on short-video platforms (such as content selection, interactive participation, and social sharing) is not only key to improving recommendation quality, enhancing user experience, and extending the platform life cycle, but also an important academic topic for revealing the information dissemination mechanisms of social media and unlocking the potential of the digital economy.

Existing research has predominantly applied traditional machine learning techniques or unimodal data analyses to investigate user behavior, for example text mining for sentiment analysis of comments or predicting video popularity from visual features alone, and these approaches exhibit notable limitations (Cai et al., 2022). Short videos inherently combine visual, audio, and textual elements, so unimodal analysis cannot fully capture the content's semantic and emotional depth (Goodwin et al., 2024; Zhou et al., 2021). The user decision-making process is also inherently dynamic and sequential, influenced by historical behavior, social relationships, and real-time context. Traditional models cannot effectively capture temporal dependencies or the dynamic evolution of user preferences.

Multimodal deep learning can characterize short-video content more comprehensively by integrating heterogeneous data such as images, text, and audio (He & Li, 2024; Xing et al., 2025). Reinforcement learning, which focuses on dynamic decision-making, optimizes strategies through interactions between an agent and its environment, making it well suited to simulating user decisions on platforms (He et al., 2025; Mubarak et al., 2021). The challenge remains to effectively combine multimodal data processing, temporal modeling, and reinforcement learning to develop intelligent models that accurately reflect user decision-making (He et al., 2021; Rezaee et al., 2024).
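To illustrate the agent-environment loop described above, here is a minimal sketch of a one-step Q-learning update in tabular form. It is not the MT-DQN model: a deep Q-network would replace the table with a neural network, and the state names, the two-action setup, the reward value, and the hyperparameters `alpha` and `gamma` are all hypothetical choices made for this example.

```python
def td_target(reward, next_q_values, gamma=0.9, done=False):
    """One-step Q-learning target: r + gamma * max over next-state Q-values."""
    if done:
        # No future value once the episode (e.g. a viewing session) has ended.
        return reward
    return reward + gamma * max(next_q_values)


def q_update(q_table, state, action, reward, next_state,
             alpha=0.5, gamma=0.9, done=False):
    """Move Q(state, action) a fraction alpha toward the TD target, in place."""
    target = td_target(reward, q_table[next_state], gamma, done)
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]


# Hypothetical setup: two session states, two candidate videos per state.
q_table = {
    "session_start": [0.0, 0.0],
    "engaged": [1.0, 2.0],
}

# The agent recommended video 1, the user interacted (reward 1.0, an assumed
# value), and the session moved to the "engaged" state.
new_value = q_update(q_table, "session_start", action=1,
                     reward=1.0, next_state="engaged")
print(new_value)
```

The update pulls the estimate for the chosen recommendation toward the observed reward plus the discounted value of the best follow-up action, which is how feedback from later interactions propagates back to earlier decisions.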

This paper introduces the MT-DQN (Multimodal Temporal Deep Q-Network) model as a solution to these challenges, focusing on the analysis of user decision-making behaviors on short-video platforms. The proposed model comprises three key components: a Transformer-based multimodal fusion module (Zhao et al., 2023) that enables cross-modal semantic understanding of video content; a temporal graph neural network module (Su & Wu, 2025) designed to capture the evolving patterns of user behavior and social interactions; and a deep Q-network module (Yan et al., 2021) that refines decision-making strategies through reinforcement learning. Together, these components form a robust framework for user behavior analysis, addressing content perception, behavior modeling, and decision optimization. The contributions of this work are summarized as follows:
