More easily extract pieces out of HTML documents using XPath and CSS selectors. CSS selectors are particularly useful in conjunction with http://selectorgadget.com/: it makes it easy to find exactly which selector you should be using. If you haven't used CSS selectors before, work your way through the fun tutorial at http://flukeout.github.io/

html_nodes(x, css, xpath)

html_node(x, css, xpath)

Arguments

x

Either a document, a node set or a single node.

css, xpath

Nodes to select. Supply one of css or xpath depending on whether you want to use a CSS or XPath 1.0 selector.

html_node vs html_nodes

html_node is like [[: it always extracts exactly one element. When given a list of nodes, html_node will always return a list of the same length, whereas the result of html_nodes might be longer or shorter.
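
The length difference can be seen on a small inline document. This is only a sketch: the HTML fragment and the variable names (doc, items, first_bold, all_bold) are made up for illustration, and it assumes a version of xml2 that supports missing nodes.

```r
library(rvest)

# Hypothetical fragment: three <li> items holding 2, 0 and 2 <b> children
doc <- read_html(
  "<ul>
     <li><b>a</b><b>b</b></li>
     <li>plain</li>
     <li><b>c</b><b>d</b></li>
   </ul>"
)
items <- html_nodes(doc, "li")

# html_node(): one result per input node, with a missing node where nothing matches
first_bold <- html_node(items, "b")

# html_nodes(): every match, collapsed into a single node set
all_bold <- html_nodes(items, "b")

length(items)
length(first_bold)
length(all_bold)
html_text(first_bold)
```

Here first_bold lines up element by element with items, which is convenient when each input node corresponds to a row of data.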

CSS selector support

CSS selectors are translated to XPath selectors by the selectr package, which is a port of the Python cssselect library (https://pythonhosted.org/cssselect/).

It implements the majority of CSS3 selectors, as described in http://www.w3.org/TR/2011/REC-css3-selectors-20110929/. The exceptions and extensions are listed below:

  • Pseudo selectors that require interactivity are ignored: :hover, :active, :focus, :target, :visited

  • The following pseudo classes don't work with the wild card element, *: *:first-of-type, *:last-of-type, *:nth-of-type, *:nth-last-of-type, *:only-of-type

  • It supports :contains(text)

  • You can use !=, [foo!=bar] is the same as :not([foo=bar])

  • :not() accepts a sequence of simple selectors, not just a single simple selector.
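
A few of the points above can be tried on a small inline document. This is a rough sketch, not a full tour of selectr: the HTML fragment and variable names are invented for illustration, and it assumes the selectr package is installed (rvest relies on it for CSS translation).

```r
library(rvest)

# Hypothetical fragment with two classed paragraphs and one without a class
doc <- read_html(
  "<p class='a'>alpha</p><p class='b'>beta</p><p>gamma</p>"
)

# The XPath expression that a CSS selector is translated to
xp <- selectr::css_to_xpath("p.a")
xp

# [foo!=bar] is the same as :not([foo=bar])
ne <- html_nodes(doc, "p[class!='a']")
negated <- html_nodes(doc, "p:not([class='a'])")
html_text(ne)
html_text(negated)

# :contains(text) matches on the text content of an element
has_text <- html_nodes(doc, "p:contains('alp')")
html_text(has_text)
```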

Examples

# CSS selectors ----------------------------------------------
ateam <- read_html("http://www.boxofficemojo.com/movies/?id=ateam.htm")
html_nodes(ateam, "center")
#> {xml_nodeset (1)}
#> [1] <center><table border="0" cellspacing="1" cellpadding="4" bgcolor="#dcdcd ...
html_nodes(ateam, "center font")
#> {xml_nodeset (1)}
#> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>
html_nodes(ateam, "center font b")
#> {xml_nodeset (1)}
#> [1] <b>$77,222,099</b>
# But html_node is best used in conjunction with %>% from magrittr
# You can chain subsetting:
ateam %>% html_nodes("center") %>% html_nodes("td")
#> {xml_nodeset (7)}
#> [1] <td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$7 ...
#> [2] <td valign="top">Distributor: <b><a href="/studio/chart/?studio=fox.htm"> ...
#> [3] <td valign="top">Release Date: <b><nobr><a href="/schedule/?view=bydate&a ...
#> [4] <td valign="top">Genre: <b>Action</b>\n</td>\n
#> [5] <td valign="top">Runtime: <b>1 hrs. 57 min.</b>\n</td>
#> [6] <td valign="top">MPAA Rating: <b>PG-13</b>\n</td>\n
#> [7] <td valign="top">Production Budget: <b>$110 million</b>\n</td>
ateam %>% html_nodes("center") %>% html_nodes("font")
#> {xml_nodeset (1)}
#> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>
td <- ateam %>% html_nodes("center") %>% html_nodes("td")
td
#> {xml_nodeset (7)}
#> [1] <td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$7 ...
#> [2] <td valign="top">Distributor: <b><a href="/studio/chart/?studio=fox.htm"> ...
#> [3] <td valign="top">Release Date: <b><nobr><a href="/schedule/?view=bydate&a ...
#> [4] <td valign="top">Genre: <b>Action</b>\n</td>\n
#> [5] <td valign="top">Runtime: <b>1 hrs. 57 min.</b>\n</td>
#> [6] <td valign="top">MPAA Rating: <b>PG-13</b>\n</td>\n
#> [7] <td valign="top">Production Budget: <b>$110 million</b>\n</td>
# When applied to a list of nodes, html_nodes() returns all nodes,
# collapsing results into a new nodelist.
td %>% html_nodes("font")
#> {xml_nodeset (1)}
#> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>
# html_node() returns the first matching node. If there are no matching
# nodes, it returns a "missing" node
if (utils::packageVersion("xml2") > "0.1.2") {
  td %>% html_node("font")
}
#> {xml_nodeset (7)}
#> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>
#> [2] <NA>
#> [3] <NA>
#> [4] <NA>
#> [5] <NA>
#> [6] <NA>
#> [7] <NA>
# To pick out an element at specified position, use magrittr::extract2
# which is an alias for [[
library(magrittr)
#>
#> Attaching package: ‘magrittr’
#> The following objects are masked from ‘package:testthat’:
#>
#>     equals, is_less_than, not
ateam %>% html_nodes("table") %>% extract2(1) %>% html_nodes("img")
#> {xml_nodeset (6)}
#> [1] <img src="https://m.media-amazon.com/images/M/MV5BMTc4ODc4NTQ1N15BMl5BanB ...
#> [2] <img src="//www.assoc-amazon.com/e/ir?t=boxofficemojo-20&amp;l=as2&amp;o= ...
#> [3] <img src="//www.assoc-amazon.com/e/ir?t=boxofficemojo-20&amp;l=as2&amp;o= ...
#> [4] <img src="/img/misc/bom_logo1.png" width="245" height="56" alt="Box Offic ...
#> [5] <img src="/img/misc/IMDbSm.png" width="34" height="16" alt="IMDb" valign= ...
#> [6] <img src="http://b.scorecardresearch.com/p?c1=2&amp;c2=6034961&amp;cv=2.0 ...
ateam %>% html_nodes("table") %>% `[[`(1) %>% html_nodes("img")
#> {xml_nodeset (6)}
#> [1] <img src="https://m.media-amazon.com/images/M/MV5BMTc4ODc4NTQ1N15BMl5BanB ...
#> [2] <img src="//www.assoc-amazon.com/e/ir?t=boxofficemojo-20&amp;l=as2&amp;o= ...
#> [3] <img src="//www.assoc-amazon.com/e/ir?t=boxofficemojo-20&amp;l=as2&amp;o= ...
#> [4] <img src="/img/misc/bom_logo1.png" width="245" height="56" alt="Box Offic ...
#> [5] <img src="/img/misc/IMDbSm.png" width="34" height="16" alt="IMDb" valign= ...
#> [6] <img src="http://b.scorecardresearch.com/p?c1=2&amp;c2=6034961&amp;cv=2.0 ...
# Find all images contained in the first two tables
ateam %>% html_nodes("table") %>% `[`(1:2) %>% html_nodes("img")
#> {xml_nodeset (6)}
#> [1] <img src="https://m.media-amazon.com/images/M/MV5BMTc4ODc4NTQ1N15BMl5BanB ...
#> [2] <img src="//www.assoc-amazon.com/e/ir?t=boxofficemojo-20&amp;l=as2&amp;o= ...
#> [3] <img src="//www.assoc-amazon.com/e/ir?t=boxofficemojo-20&amp;l=as2&amp;o= ...
#> [4] <img src="/img/misc/bom_logo1.png" width="245" height="56" alt="Box Offic ...
#> [5] <img src="/img/misc/IMDbSm.png" width="34" height="16" alt="IMDb" valign= ...
#> [6] <img src="http://b.scorecardresearch.com/p?c1=2&amp;c2=6034961&amp;cv=2.0 ...
ateam %>% html_nodes("table") %>% extract(1:2) %>% html_nodes("img")
#> {xml_nodeset (6)}
#> [1] <img src="https://m.media-amazon.com/images/M/MV5BMTc4ODc4NTQ1N15BMl5BanB ...
#> [2] <img src="//www.assoc-amazon.com/e/ir?t=boxofficemojo-20&amp;l=as2&amp;o= ...
#> [3] <img src="//www.assoc-amazon.com/e/ir?t=boxofficemojo-20&amp;l=as2&amp;o= ...
#> [4] <img src="/img/misc/bom_logo1.png" width="245" height="56" alt="Box Offic ...
#> [5] <img src="/img/misc/IMDbSm.png" width="34" height="16" alt="IMDb" valign= ...
#> [6] <img src="http://b.scorecardresearch.com/p?c1=2&amp;c2=6034961&amp;cv=2.0 ...
# XPath selectors ---------------------------------------------
# chaining with XPath is a little trickier - you may need to vary
# the prefix you're using - // always selects from the root node
# regardless of where you currently are in the doc
ateam %>%
  html_nodes(xpath = "//center//font//b") %>%
  html_nodes(xpath = "//b")
#> {xml_nodeset (21)}
#>  [1] <b>Adjuster:</b>
#>  [2] <b>The A-Team</b>
#>  [3] <b>$77,222,099</b>
#>  [4] <b><a href="/studio/chart/?studio=fox.htm">Fox</a></b>
#>  [5] <b><nobr><a href="/schedule/?view=bydate&amp;release=theatrical&amp;date ...
#>  [6] <b>Action</b>
#>  [7] <b>1 hrs. 57 min.</b>
#>  [8] <b>PG-13</b>
#>  [9] <b>$110 million</b>
#> [10] <b>Domestic:</b>
#> [11] <b>$77,222,099</b>
#> [12] <b>43.6%</b>
#> [13] <b>Worldwide:</b>
#> [14] <b>$177,238,796</b>
#> [15] <b>&gt; View All 14 Weekends</b>
#> [16] <b>Showdown: 'Men-on-a-Mission'</b>
#> [17] <b>4</b>
#> [18] <b>Chart</b>
#> [19] <b>Rank</b>
#> [20] <b>Charts (Premier Pass Users Only)</b>
#> ...
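
The effect of the // prefix when chaining can be reproduced offline. The sketch below uses an invented two-div fragment and illustrative variable names. It contrasts //b, which searches from the document root, with .//b, which is standard XPath 1.0 for "descendants of the current node".

```r
library(rvest)

# Hypothetical fragment: one <b> inside div#a, two inside div#b
doc <- read_html(
  "<div id='a'><b>1</b></div><div id='b'><b>2</b><b>3</b></div>"
)
first_div <- html_nodes(doc, xpath = "//div[@id='a']")

# "//b" ignores the current node and matches every <b> in the document
from_root <- html_nodes(first_div, xpath = "//b")

# ".//b" stays inside the current node
relative <- html_nodes(first_div, xpath = ".//b")

length(from_root)
length(relative)
html_text(relative)
```

When chaining XPath calls, a leading dot is therefore usually what keeps the second step scoped to the nodes selected by the first.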