Computer Science ETDs

Publication Date

Spring 3-11-2020


Chinese social media applications such as Sina Weibo (Chinese Twitter), the most popular social network in China, are widely known for monitoring and deleting posts to conform to Chinese government requirements. Censorship of Chinese social media is a complex process that involves many factors. There are multiple stakeholders with many different interests (economic, political, legal, personal, etc.), which means that there is no single strategy dictated by a single government authority. Moreover, Chinese social media companies sometimes do not follow government directives, out of concern that they are censoring more strictly than their competitors.

One crucial question in this context is: what kinds of features make a given post likely to be censored? Previous work on this question (1) ignores the multi-modal nature of social networks and focuses only on text content, and (2) relies on narrow datasets collected by tracking a small number of users over a few months rather than years. Thus, these approaches produce results that are limited and biased toward whatever was trending during the collection period.

My thesis is that censors pay more attention to three factors than to others: the user who posted the content, the number of reposts, and the sentiment of the text content, with the first being the strongest. I attempt to support this thesis by using data mining techniques to uncover censors' policies and priorities in Chinese social networks, specifically Sina Weibo. I take a multi-modal approach that accounts for text content, image content, metadata, and other factors such as sentiment. The goals of my thesis are to: (1) investigate how different factors, such as text, image, and metadata, correlate with censorship, and how consistently and quickly different topics are censored; (2) determine to what extent censorship is based on the person being posted about; (3) determine to what extent censorship is based on the person who wrote the post; and (4) predict censorship by considering all available information.




Keywords

Social networks, censorship, machine learning, deep learning, NLP

Document Type


Degree Name

Computer Science

Level of Degree


Department Name

Department of Computer Science

First Committee Member (Chair)

Jedidiah R. Crandall

Second Committee Member

Abdullah Mueen

Third Committee Member

Marina Kogan

Fourth Committee Member

Michael Tschantz