Parsing Very Large XML Documents with Go

Shared by LinuxBoy on 2019-03-31 06:03:18

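I have recently been working with some Wikipedia XML files, and some of them are very large: the latest revisions file, for example, is 36 GB uncompressed. I have experimented with XML parsing in several languages, and in the end I found Go to be a very good fit.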

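Go ships with a general-purpose XML parsing library (encoding/xml) that also makes encoding convenient. A simple way to handle XML is to parse the whole document into memory in one go, but that is not feasible for a 36 GB file.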
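
For contrast, the load-everything approach might look like the minimal sketch below. The Dump wrapper type, the parseAll helper, and the "mediawiki" root element name are assumptions made for illustration; they are not taken from the original code. Every page ends up in one slice, which is why memory use grows with the size of the document.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Page mirrors a single <page> element (only the title is kept here).
type Page struct {
	Title string `xml:"title"`
}

// Dump is a hypothetical wrapper for the document root; the element
// name "mediawiki" is an assumption made for this sketch.
type Dump struct {
	XMLName xml.Name `xml:"mediawiki"`
	Pages   []Page   `xml:"page"`
}

// parseAll decodes the entire document at once, so every page is
// held in memory at the same time.
func parseAll(data []byte) ([]Page, error) {
	var d Dump
	if err := xml.Unmarshal(data, &d); err != nil {
		return nil, err
	}
	return d.Pages, nil
}

func main() {
	doc := []byte(`<mediawiki>
  <page><title>Apollo 11</title></page>
  <page><title>Apollo 12</title></page>
</mediawiki>`)

	pages, err := parseAll(doc)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, p := range pages {
		fmt.Println(p.Title)
	}
}
```

This is fine for small inputs, but the input bytes and the decoded slice both have to fit in memory.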

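We can also parse the document as a stream, but the examples available online are few and rather simplistic, so here is my sample code for parsing Wikipedia. (Full example code at https://github.com/dps/go-xml-parse/blob/master/go-xml-parse.go)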

Here is a fragment of the Wikipedia XML.

// <page>
//     <title>Apollo 11</title>
//      <redirect title="Foo bar" />
//     ...
//     <revision>
//     ...
//       <text xml:space="preserve">
//       {{Infobox Space mission
//       |mission_name=
//       |insignia=Apollo_11_insignia.png
//     ...
//       </text>
//     </revision>
// </page>

In our Go code, we define a struct to match the <page> element.

type Redirect struct {
    Title string `xml:"title,attr"`
}

type Page struct {
    Title string `xml:"title"`
    Redir Redirect `xml:"redirect"`
    Text string `xml:"revision>text"`
}
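
To see how these struct tags map onto the XML, here is a small self-contained sketch that unmarshals one shortened <page> fragment. The decodePage helper and the shortened fragment are illustrative assumptions; the two struct definitions are the ones shown above. The "title,attr" tag reads the title attribute of <redirect>, and "revision>text" reaches into the nested <text> element.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

type Redirect struct {
	Title string `xml:"title,attr"`
}

type Page struct {
	Title string   `xml:"title"`
	Redir Redirect `xml:"redirect"`
	Text  string   `xml:"revision>text"`
}

// decodePage is a hypothetical helper that unmarshals a single
// <page> fragment into a Page value.
func decodePage(fragment string) (Page, error) {
	var p Page
	err := xml.Unmarshal([]byte(fragment), &p)
	return p, err
}

func main() {
	// A shortened version of the fragment shown earlier.
	const fragment = `<page>
  <title>Apollo 11</title>
  <redirect title="Foo bar" />
  <revision>
    <text xml:space="preserve">{{Infobox Space mission}}</text>
  </revision>
</page>`

	p, err := decodePage(fragment)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("title=%q redirect=%q text=%q\n", p.Title, p.Redir.Title, p.Text)
}
```

Elements and attributes that have no matching field, such as xml:space here, are simply ignored by the decoder.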

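Now we tell the parser that the Wikipedia document contains a series of <page> elements and try to read it; let's see how this works in streaming fashion. It is actually very simple once you understand the principle: walk through the tokens in the file, and whenever you hit the StartElement for a <page> tag, use the magical decoder.DecodeElement API to unmarshal the whole element into an object, then move on to the next one.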

decoder := xml.NewDecoder(xmlFile)

for {
    // Read tokens from the XML document in a stream.
    t, err := decoder.Token()
    if err != nil {
        // io.EOF signals the normal end of input; any other
        // error means the XML could not be read.
        break
    }
    // Inspect the type of the token just read.
    switch se := t.(type) {
    case xml.StartElement:
        // If we just read a StartElement token
        // ...and its name is "page"
        if se.Name.Local == "page" {
            var p Page
            // decode a whole chunk of following XML into the
            // variable p which is a Page (see above)
            if err := decoder.DecodeElement(&p, &se); err != nil {
                log.Fatal(err) // needs the "log" package imported
            }
            // Do some stuff with the page.
            p.Title = CanonicalizeTitle(p.Title)
            ...
        }
...

I hope this saves you some time when you need to parse a large XML file yourself.
