
Processing Large Graphs with Big Data Technologies
2015-05-07

Processing extremely large graphs has been and remains a challenge, but recent advances in Big Data technologies have made this task more practical. Tapad, a startup based in NYC focused on cross-device content delivery, has made graph processing the heart of their business model using Big Data to scale to terabytes of data.

Social networks like Facebook or Twitter contain data that naturally lends itself to a graph representation. But graphs can be used to represent less obvious data, as in the case of Tapad’s device graph. Dag Liodden, Tapad’s co-founder and CTO, describes why using a graph representation for devices makes sense:

"Tapad takes a graph-oriented approach to modeling relationships between devices. Anonymous identifiers (such as cookie IDs) are represented as nodes in our Device Graph and we track marketing information to these nodes. Edges between the nodes are scored / weighted using a combination of deterministic data and probabilistic statistical models / machine learning techniques. The concept of a “device” is defined as a starting device / node (let’s say the cookie ID of a browser) and the collections of nodes (let’s say the cookie IDs of a Tablet and a Connected TV) that are reachable from that starting point given a customizable set of edge constraints. Having an actual graph structure, as opposed to just aggregated information in a single node, gives us the flexibility to balance accuracy and scale dynamically as well as more easily augment the graph with new edge inference models."
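
This reachability-based definition of a "device" can be sketched as a constrained traversal. The following is a minimal illustration, not Tapad's actual model: the node IDs, edge scores, and the `min_score` constraint are all hypothetical stand-ins for the "customizable set of edge constraints" Dag describes.

```python
from collections import deque

def device_for(start, edges, min_score):
    """Return the start node plus every node reachable over edges
    whose score satisfies the constraint (breadth-first traversal)."""
    reachable = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor, score in edges.get(node, []):
            if score >= min_score and neighbor not in reachable:
                reachable.add(neighbor)
                queue.append(neighbor)
    return reachable

# Toy graph: a browser cookie strongly linked to a tablet
# and weakly linked to a connected TV.
edges = {
    "browser-cookie": [("tablet-cookie", 0.9), ("tv-cookie", 0.4)],
    "tablet-cookie": [("browser-cookie", 0.9)],
}
print(sorted(device_for("browser-cookie", edges, min_score=0.5)))
# → ['browser-cookie', 'tablet-cookie']
```

Tightening or loosening `min_score` trades accuracy against the size of the resulting "device", which is the accuracy/scale balance the quote refers to.
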

Using the right tool for the right job is important, and the same goes for graph processing: there is no need to use Big Data technologies for graphs that can be handled by more traditional workloads. As Dag says:

"“Big Data” to me is the threshold where you no longer can use a small set of general purpose, off-the-shelf tools to store and analyze your data, but instead have to tailor different technologies to address specific use cases. These thresholds keep moving every year as software and hardware solutions evolve and mature, but so does the size of the data sets we deal with and the level of sophistication of the analysis we need to perform."

For Facebook, this threshold is in the single-digit petabytes, as detailed in their presentation at the 2013 ACM SIGMOD conference in NYC. For Tapad, the amount of data in the graph is smaller but would still be impossible to process using traditional methods:

"The US graph currently has about 1.1 billion nodes, representing mobile phones, tablets, laptops, gaming consoles and TVs. Some of these nodes are transient, for instance due to a browser with non-persistent cookies, and thus have little data and no edges. The non-transient nodes have about five edges on average and around 500 discrete pieces of information, such as behavioral segments, associated with them. The live graph data weighs in at multiple TB and we read / write from / to it several hundred thousand times per second across multiple data centers. Updates to the graph are geographically cross-replicated and each data center is currently serving off of servers backed by 20 TB of Flash SSD storage and 2 TB of RAM."

Recent years have seen a surge in the number of technologies used to process graphs at scale; 2013 in particular saw several new additions to the ecosystem. There are two classes of systems to consider:

Graph databases for OLTP workloads for quick low-latency access to small portions of graph data.
Graph processing engines for OLAP workloads allowing batch processing of large portions of a graph.

The list of graph databases is already very long, but several projects have emerged and differentiated themselves recently. Neo4j is one of the oldest and most mature graph databases, but still suffers from scalability issues since it doesn’t support sharding yet. Another database that, albeit pretty young, has been gaining a lot of popularity in 2013 is Titan. As a backend-agnostic graph database, it can leverage both HBase and Cassandra’s scalable architecture, and it uses an optimized vertex and edge representation internally to allow it to scale to billions of edges, as reported in a blog post in 2013.

But one does not need to use a graph-specific database; more generic scalable NoSQL databases can also be an effective solution to the problem. Apache Accumulo, a technology based on Google’s BigTable and open-sourced in 2011, is one example of a generic database that is a good fit for storing graphs at scale: its records are flexible and can store graphs with typed, weighted edges, and according to a technical report published in 2013 it is actually used by the NSA. Cassandra and Aerospike are other examples of databases that, with a proper data model, can effectively model a graph with vertices, edges and weights. Facebook also built their own solution using MySQL and Memcache in a system called Tao, which is used to serve the social graph to its users. And according to Dag, Tapad applied the same philosophy in the design of their device graph:
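
To make the idea concrete, here is a minimal sketch of how a weighted, typed graph can be laid out in a generic key-value store, in the spirit of the Accumulo/Cassandra approach the article describes. A plain dict stands in for the store, and the `vertex:<id>` / `edges:<id>` key layout is an illustrative assumption, not any particular system's schema.

```python
# A dict simulating a key-value / wide-column store.
store = {}

def put_vertex(vid, properties):
    # One record per vertex, holding its discrete pieces of information.
    store["vertex:" + vid] = properties

def put_edge(src, dst, edge_type, weight):
    # One record per source vertex; each entry is a typed, weighted edge.
    store.setdefault("edges:" + src, {})[(dst, edge_type)] = weight

def neighbors(vid):
    # A lookup by key returns all outgoing edges of a vertex at once.
    return store.get("edges:" + vid, {})

put_vertex("phone-1", {"segment": "sports"})
put_vertex("tablet-1", {"segment": "news"})
put_edge("phone-1", "tablet-1", "same-household", 0.8)
```

Because every vertex's adjacency list lives under a single key, the fast lookups and one-hop traversals that OLTP workloads need become single reads, which is the property the next quote relies on.
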

"The live graph lives in a key-value store to allow for fast traversals and updates. We regularly snapshot the graph into HDFS where it can be retrieved for more advanced graph processing and augmented with other data streams. The results are later fed back into the “live graph”. There are advantages to using a graph-specific database, but our current setup, with extremely fast, simple traversals of the graph in our key-value store and slow but very flexible traversal and analysis on Hadoop, is serving us well, at least for now."

Even with a graph stored in a database, the operations that can be performed at scale will likely be limited to lookups and small traversals. More complex analysis over larger portions of a graph calls for distributed batch-processing frameworks. For the best performance, the GraphLab framework uses the Message Passing Interface (MPI) model to scale and run complex algorithms on data stored in HDFS. More recent frameworks like Apache Giraph and Apache Hama are based on the Bulk Synchronous Parallel (BSP) paradigm popularized by Google’s Pregel project. The latest additions to the ecosystem are the GraphX project, which runs on top of Spark (unveiled in 2013), and Faunus, which runs Hadoop MapReduce jobs to process graphs stored in a Titan database. Tapad is using these new technologies to process their offline graph data. According to Dag:
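
The BSP model that Pregel popularized and that Giraph and Hama implement can be sketched in a few lines: each vertex runs a compute step, exchanges messages with its neighbors, and a global barrier separates one superstep from the next. The single-process simulation below uses label-propagation connected components, a standard textbook example of a vertex program; it is an illustration of the paradigm, not Tapad's or any framework's actual code.

```python
def connected_components(adjacency):
    """BSP-style connected components: vertices repeatedly adopt the
    smallest label heard from their neighbors until no label changes."""
    labels = {v: v for v in adjacency}   # each vertex starts in its own component
    active = set(adjacency)              # vertices whose label changed last superstep
    while active:
        # Communication phase: active vertices message their neighbors.
        inbox = {v: [] for v in adjacency}
        for v in active:
            for neighbor in adjacency[v]:
                inbox[neighbor].append(labels[v])
        # Barrier, then the compute phase of the next superstep.
        active = set()
        for v, messages in inbox.items():
            smallest = min(messages, default=labels[v])
            if smallest < labels[v]:
                labels[v] = smallest
                active.add(v)
    return labels

components = connected_components(
    {"phone": ["tablet"], "tablet": ["phone", "tv"], "tv": ["tablet"]}
)
# All three nodes are connected, so every node ends up labeled "phone".
```

A real BSP framework distributes the vertices across workers and makes the barrier a cluster-wide synchronization point, but the per-vertex compute/message/barrier structure is the same.
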

"Currently, our main graph processing framework is Apache Giraph, but we are experimenting with Spark GraphX and GraphLab as well. All of these frameworks are still pretty young, the learning curve is pretty steep, and all come with their own pros, cons and caveats. For instance, Giraph and GraphX are convenient as they fit nicely into our Hadoop infrastructure, but GraphLab is very appealing due to its sheer performance."

Some projects are attempting to provide a unified framework to answer both OLTP and OLAP queries. Dendrite from Lab41 is one such project: it leverages GraphLab on top of Titan for storage and processing, and AngularJS for visualization. It is still a very young project, unveiled in early 2014, so community reaction has been limited, but the fact that it attempts to cover every use case should help drive adoption.