Paper Recommendation for November 24 (with Download Link)
Paper title:
Camel: Content-Aware and Meta-path Augmented Metric Learning for Author Identification
Authors:
Chuxu Zhang, Chao Huang, Lu Yu, Xiangliang Zhang, Nitesh V. Chawla
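Why we recommend it:
This paper addresses the problem of using historical data to find the likely authors of an anonymous paper. It has two main components. The first is metric learning: the authors compute word embeddings for a paper's abstract and then encode them with a GRU into a d-dimensional embedding. The model is trained by pulling each paper closer to its true authors and pushing it away from false authors, so that, given an abstract, it can report which authors in the historical data are closest to that paper. The second is the meta-path walk, which models the heterogeneous network formed by authors, institutions, papers, and publishers. Walks are generated on this network under a given strategy, with the type of each node dictated by a meta-path. The resulting walks serve as supervision, and a Skipgram model uses them to augment the metric-learning result. This approach exploits not only direct supervision such as "paper-author" pairs but also various kinds of indirect information, for example citation relations of the form "author-paper-paper-author".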
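To make the meta-path walk more concrete, here is a minimal sketch of a walk whose node types follow the "author-paper-paper-author" pattern mentioned above. Everything in it is an assumption for illustration: the toy graph, the node names, and the helpers `meta_path_walk` and `paper_author_pairs` are hypothetical and are not taken from the paper, whose actual walk strategy and Skipgram objective may differ.

```python
import random

# Toy heterogeneous graph (hypothetical data): node -> neighbours.
# The first letter encodes the node type: A = author, P = paper.
# Paper-paper edges stand for citations.
GRAPH = {
    "A1": ["P1", "P2"],
    "A2": ["P2"],
    "A3": ["P3"],
    "P1": ["A1", "P3"],
    "P2": ["A1", "A2", "P3"],
    "P3": ["A3", "P1", "P2"],
}


def node_type(node):
    """Return the type letter of a node name, e.g. 'A' for 'A1'."""
    return node[0]


def meta_path_walk(graph, start, meta_path, walk_length, rng):
    """Random walk whose node types follow `meta_path` cyclically.

    `meta_path` is a string such as "APPA". Its first and last types must
    match so that the pattern can repeat: A-P-P-A-P-P-A ...
    """
    if meta_path[0] != meta_path[-1]:
        raise ValueError("meta-path must start and end with the same type")
    if node_type(start) != meta_path[0]:
        raise ValueError("start node type does not match the meta-path")
    pattern = meta_path[:-1]  # "APP", repeated along the walk
    walk = [start]
    while len(walk) < walk_length:
        wanted = pattern[len(walk) % len(pattern)]
        candidates = [n for n in graph[walk[-1]] if node_type(n) == wanted]
        if not candidates:  # dead end: stop the walk early
            break
        walk.append(rng.choice(candidates))
    return walk


def paper_author_pairs(walk, window):
    """Collect (paper, author) pairs that co-occur within `window` steps.

    Such pairs could serve as indirect training signal for a
    Skipgram-style objective (assumed form, not the paper's exact one).
    """
    pairs = []
    for i, node in enumerate(walk):
        if node_type(node) != "P":
            continue
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((node, other) for other in walk[lo:hi]
                     if node_type(other) == "A")
    return pairs


if __name__ == "__main__":
    walk = meta_path_walk(GRAPH, "A1", "APPA", 7, random.Random(0))
    print(walk)
    print(paper_author_pairs(walk, window=2))
```

In this sketch a paper can be paired with an author who never wrote it but is reachable through a citation, which is the kind of indirect relation the walk is meant to surface.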
Abstract
In this paper, we study the problem of author identification in big scholarly data, which is to effectively rank potential authors for each anonymous paper by using historical data. Most of the existing deanonymization approaches predict relevance score of paper-author pair via feature engineering, which is not only time and storage consuming, but also introduces irrelevant and redundant features or miss important attributes. Representation learning can automate the feature generation process by learning node embeddings in academic network to infer the correlation of paper-author pair.
However, the learned embeddings are often for general purpose (independent of the specific task), or based on network structure only (without considering the node content). To address these issues and make a further progress in solving the author identification problem, we propose Camel, a content-aware and meta-path augmented metric learning model. Specifically, first, the directly correlated paper-author pairs are modeled based on distance metric learning by introducing a push loss function. Next, the paper content embedding encoded by the gated recurrent neural network is integrated into the distance loss. Moreover, the historical bibliographic data of papers is utilized to construct an academic heterogeneous network, wherein a meta-path guided walk integrative learning module based on the task-dependent and content-aware Skipgram model is designed to formulate the correlations between each paper and its indirect author neighbors, and further augments the model. Extensive experiments demonstrate that Camel outperforms the state-of-the-art baselines. It achieves an average improvement of 6.3% over the best baseline method.
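The distance-based training idea in the abstract can be illustrated with a generic margin-based hinge loss over (paper, true author, false author) triplets. This is only a sketch under stated assumptions: the exact form of the paper's loss is not given here, the GRU encoder is omitted and replaced by hand-written vectors, and the names `squared_distance`, `triplet_hinge_loss`, and `rank_authors` are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_hinge_loss(paper, true_author, false_author, margin=1.0):
    """Generic margin-based hinge loss (assumed form, not the paper's exact loss).

    The loss is zero once the false author is at least `margin` farther
    from the paper than the true author (in squared distance).
    """
    pos = squared_distance(paper, true_author)
    neg = squared_distance(paper, false_author)
    return max(0.0, margin + pos - neg)


def rank_authors(paper, author_embeddings):
    """Rank candidate authors by distance to the paper embedding, closest first."""
    return sorted(
        author_embeddings,
        key=lambda name: squared_distance(paper, author_embeddings[name]),
    )


if __name__ == "__main__":
    # Hand-written 2-d vectors standing in for learned embeddings.
    paper = [0.0, 0.0]
    authors = {"alice": [0.1, 0.0], "bob": [2.0, 0.0]}
    print(triplet_hinge_loss(paper, authors["alice"], authors["bob"]))
    print(rank_authors(paper, authors))
```

At inference time only the ranking step is needed: an anonymous paper is embedded, and candidate authors are sorted by distance to it.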
Paper download link
https://www3.nd.edu/~dial/publications/zhang2018camel.pdf