Paper Recommendation for December 22 (with download instructions)
Paper title:
Anatomy of a Privacy-Safe Large-Scale Information Extraction System Over Email
Authors:
Ying Sheng (Google)
Sandeep Tata (Google)
James B. Wendt (Google)
Jing Xie (Google)
Qi Zhao (Google)
Marc Najork (Google)
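Why we recommend it:
"Anatomy of a Privacy-Safe Large-Scale Information Extraction System Over Email" is an Applied Data Science Track paper, and one of several practical papers the Google Gmail team has published in recent years. I recall that at KDD 2010 the same team had a paper on recommending relevant recipients; the method was very simple and practical, was quickly deployed in the product, and has since become a standard Gmail feature. This new paper extracts structured information from the content of Gmail messages, such as personal bill information and flight itineraries. Unlike traditional information extraction, the key concerns here are operating at scale and handling the privacy issues that can arise during extraction. The figure below shows the architecture of the whole extraction system.
The new extraction architecture is called Juicer. Technically, it has two core aspects: it extends the traditional template-based approach, and it builds privacy protection into the extraction process. For example, when templates are extracted, they are anonymized using k-anonymity, and the initial labeled data is a very small set voluntarily donated by users. Another major challenge is data quality: because the training data is small, it is heavily biased. Based on some observations, for example that the bias leans mainly toward long-time users, the system is trained mainly on data from older/long-time users, which corrects the bias to some extent. Finally, the system achieves very good results on several different extraction case studies.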
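To make the k-anonymity idea concrete, here is a minimal sketch of one way such a threshold could be applied to templates: a template is kept only if it was observed in the mail of at least k distinct users, so a template can never reflect a single person's message. This is an illustration under stated assumptions, not Juicer's actual implementation; the function name `k_anonymous_templates`, the `(template_id, user_id)` input format, and the sample ids are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def k_anonymous_templates(
    observations: Iterable[Tuple[str, str]], k: int
) -> Set[str]:
    """Return the template ids observed for at least k distinct users.

    `observations` is a hypothetical stream of (template_id, user_id) pairs;
    the recommended paper's real data model is not described in this post.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    users_per_template: Dict[str, Set[str]] = defaultdict(set)
    for template_id, user_id in observations:
        # A set counts each user once, however many emails they received.
        users_per_template[template_id].add(user_id)
    return {t for t, users in users_per_template.items() if len(users) >= k}


# Hypothetical sample data: one template shared by three users,
# and one that only ever appears for a single user.
observations = [
    ("airline_itinerary_v1", "user_a"),
    ("airline_itinerary_v1", "user_b"),
    ("airline_itinerary_v1", "user_c"),
    ("personal_note", "user_a"),
    ("personal_note", "user_a"),
]
kept = k_anonymous_templates(observations, k=3)
print(sorted(kept))
```

With k=3, only the template shared by three distinct users survives; the one seen for a single user is dropped even though it appears twice, because the count is over users rather than emails.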
Abstract
Extracting structured data from emails can enable several assistive experiences, such as reminding the user when a bill payment is due, answering queries about the departure time of a booked flight, or proactively surfacing an emailed discount coupon while the user is at that store.
This paper presents Juicer, a system for extracting information from email that is serving over a billion Gmail users daily. We describe how the design of the system was informed by three key principles: scaling to a planet-wide email service, isolating the complexity to provide a simple experience for the developer, and safeguarding the privacy of users (our team and the developers we support are not allowed to view any single email). We describe the design tradeoffs made in building this system, the challenges faced and the approaches used to tackle them. We present case studies of three extraction tasks implemented on this platform (bill reminders, commercial offers, and hotel reservations) to illustrate the effectiveness of the platform despite challenges unique to each task. Finally, we outline several areas of ongoing research in large-scale machine-learned information extraction from email.
How to get the paper: reply "20181222" to this account.