Computer Science ETDs

Author

Yaojia Zhu

Publication Date

12-1-2013

Abstract

The stochastic block model is a powerful tool for inferring community structure from network topology. However, the simple block model considers community structure as the only underlying attribute for forming the relational interactions among the nodes. This makes it prefer a Poisson degree distribution within each community, while most real-world networks have a heavy-tailed degree distribution. This is essentially because the simple assumption underlying the traditional block model is not consistent with some real-world circumstances, where factors other than community membership, such as overall popularity, also heavily affect the pattern of relational interactions. The degree-corrected block model can accommodate arbitrary degree distributions within communities by taking nodes' popularity or degree into account. But since it takes the vertex degrees as parameters rather than generating them, it cannot use them to help it classify the vertices, and its natural generalization to directed graphs cannot even use the orientations of the edges. We developed several variants of the block model with the best of both worlds: they can use vertex degrees and edge orientations in the classification process, while tolerating heavy-tailed degree distributions within communities. We show that for some networks, including synthetic networks and networks of word adjacencies in English text, these new block models achieve a higher accuracy than either standard or degree-corrected block models.
Another part of my work is to develop even more generalized block models, which incorporate other attributes of the nodes. Many data sets contain rich information about objects, as well as pairwise relations between them. For instance, in networks of websites, scientific papers, patents and other documents, each node has content consisting of a collection of words, as well as hyperlinks or citations to other nodes. In order to perform inference on such data sets, and make predictions and recommendations, it is useful to have models that are able to capture the processes which generate the text at each node as well as the links between them. Our work combines classic ideas in topic modeling with a variant of the mixed-membership block model recently developed in the statistical physics community. The resulting model has the advantage that its parameters, including the mixture of topics of each document and the resulting overlapping communities, can be inferred with a simple and scalable expectation-maximization algorithm. We test our model on three data sets, performing unsupervised topic classification and link prediction. For both tasks, our model outperforms several existing state-of-the-art methods, achieving higher accuracy with significantly less computation: it analyzes a data set with 1.3 million words and 44 thousand links in a few minutes.

Language

English

Keywords

complex networks, community detection, generative model, stochastic block model, topic modeling, document classification, link prediction

Document Type

Dissertation

Degree Name

Computer Science

Level of Degree

Doctoral

Department Name

Department of Computer Science

First Advisor

Moore, Cristopher

First Committee Member (Chair)

Saia, Jared

Second Committee Member

Forrest, Stephanie

Third Committee Member

Clauset, Aaron

Fourth Committee Member

Moore, Cristopher

Project Sponsors

McDonnell Foundation, AFOSR, DARPA
