html_node() and html_nodes() find HTML tags (nodes) using CSS selectors or XPath expressions.

html_nodes(x, css, xpath)

html_node(x, css, xpath)

Arguments

x

Either a document, a node set or a single node.

css, xpath

Nodes to select. Supply one of css or xpath depending on whether you want to use a CSS or XPath 1.0 selector.
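As a rough illustration of choosing between the two arguments, the sketch below selects the same nodes once with css and once with xpath. The inline HTML string and the class name "note" are made up for this example; read_html() is called on a literal string so the code runs offline.

```r
library(rvest)

# Hypothetical inline document, used only for illustration.
doc <- read_html('<div>
  <p class="note">first note</p>
  <p>plain paragraph</p>
  <p class="note">second note</p>
</div>')

# Select by CSS selector ...
by_css <- html_nodes(doc, css = "p.note")

# ... or by an XPath 1.0 expression that matches the same elements here.
by_xpath <- html_nodes(doc, xpath = "//p[@class='note']")

html_text(by_css)
html_text(by_xpath)
```

In this small document both calls pick out the two paragraphs carrying the class; on real pages the CSS form is usually shorter, while XPath can express conditions that CSS cannot.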

Value

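html_node() returns a nodeset the same length as the input, with a "missing" node wherever an input node has no match. html_nodes() flattens the output into a single nodeset of all matches, so there's no direct way to map the output back to the input.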
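The difference in return shape can be seen on a small, made-up document. This is only a sketch: the list markup below is an assumption chosen so that one input node has no match and the others have more than one.

```r
library(rvest)

# Hypothetical inline document: the second <li> has no <b> child.
doc <- read_html("<ul>
  <li><b>a</b> <b>b</b></li>
  <li>no bold here</li>
  <li><b>c</b> <b>d</b></li>
</ul>")

items <- html_nodes(doc, "li")

# One result per input node: the first <b> in each <li>, or a missing node.
first_b <- html_node(items, "b")

# Every <b> beneath any <li>, flattened into one nodeset.
all_b <- html_nodes(items, "b")

length(items)
length(first_b)
length(all_b)

html_text(first_b)
html_text(all_b)
```

Because first_b lines up with items element by element, it can be placed next to other per-item columns (for example in a data frame), whereas all_b cannot.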

Details

CSS selectors are particularly useful in conjunction with https://selectorgadget.com/, which makes it very easy to discover the selector you need. If you haven't used CSS selectors before, I'd recommend starting with the fun tutorial at http://flukeout.github.io/.

CSS selector support

CSS selectors are translated to XPath selectors by the selectr package, which is a port of the Python cssselect library, https://pythonhosted.org/cssselect/.

It implements the majority of CSS3 selectors, as described in http://www.w3.org/TR/2011/REC-css3-selectors-20110929/. The exceptions and extensions are listed below:

  • Pseudo selectors that require interactivity are ignored: :hover, :active, :focus, :target, :visited.

  • The following pseudo classes don't work with the wildcard element, *: *:first-of-type, *:last-of-type, *:nth-of-type, *:nth-last-of-type, *:only-of-type.

  • It supports :contains(text).

  • You can use !=: [foo!=bar] is the same as :not([foo=bar]).

  • :not() accepts a sequence of simple selectors, not just a single simple selector.
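The translation step and the non-standard selectors in the list above can be tried out on a small document. The sketch below assumes the selectr package is installed (it is the translator named in the text) and uses an invented list of items; the class and lang values are hypothetical.

```r
library(rvest)

# Inspect the XPath that selectr produces for a CSS selector.
selectr::css_to_xpath("center font b")

# Hypothetical inline document, used only for illustration.
doc <- read_html('<ul>
  <li class="a" lang="en">apple pie</li>
  <li class="b" lang="en">banana</li>
  <li class="a" lang="fr">tarte aux pommes</li>
</ul>')

# :contains(text) matches elements whose text contains the given string.
contains_apple <- html_text(html_nodes(doc, "li:contains(apple)"))

# [foo!=bar] and :not([foo=bar]) select the same elements.
not_equal <- html_text(html_nodes(doc, "li[class!=a]"))
negated <- html_text(html_nodes(doc, "li:not([class=a])"))

# :not() with a sequence of simple selectors (class plus attribute).
not_sequence <- html_text(html_nodes(doc, "li:not(.a[lang=fr])"))

contains_apple
not_equal
negated
not_sequence
```

Printing the translated XPath is a convenient way to check how a selector is interpreted when a CSS query returns unexpected nodes.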

Examples

library(rvest)

url <- paste0(
  "https://web.archive.org/web/20190202054736/",
  "https://www.boxofficemojo.com/movies/?id=ateam.htm"
)
ateam <- read_html(url)
html_nodes(ateam, "center")
#> {xml_nodeset (1)}
#> [1] <center><table border="0" cellspacing="1" cellpadding="4" bgcolor="#dcdcd ...
html_nodes(ateam, "center font")
#> {xml_nodeset (1)}
#> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>
html_nodes(ateam, "center font b")
#> {xml_nodeset (1)}
#> [1] <b>$77,222,099</b>

# html_nodes() is well suited to use with the pipe
ateam %>% html_nodes("center") %>% html_nodes("td")
#> {xml_nodeset (7)}
#> [1] <td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$7 ...
#> [2] <td valign="top">Distributor: <b><a href="/web/20190202054736/https://www ...
#> [3] <td valign="top">Release Date: <b><nobr><a href="/web/20190202054736/http ...
#> [4] <td valign="top">Genre: <b>Action</b>\n</td>\n
#> [5] <td valign="top">Runtime: <b>1 hrs. 57 min.</b>\n</td>
#> [6] <td valign="top">MPAA Rating: <b>PG-13</b>\n</td>\n
#> [7] <td valign="top">Production Budget: <b>$110 million</b>\n</td>
ateam %>% html_nodes("center") %>% html_nodes("font")
#> {xml_nodeset (1)}
#> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>
td <- ateam %>% html_nodes("center") %>% html_nodes("td")
td
#> {xml_nodeset (7)}
#> [1] <td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$7 ...
#> [2] <td valign="top">Distributor: <b><a href="/web/20190202054736/https://www ...
#> [3] <td valign="top">Release Date: <b><nobr><a href="/web/20190202054736/http ...
#> [4] <td valign="top">Genre: <b>Action</b>\n</td>\n
#> [5] <td valign="top">Runtime: <b>1 hrs. 57 min.</b>\n</td>
#> [6] <td valign="top">MPAA Rating: <b>PG-13</b>\n</td>\n
#> [7] <td valign="top">Production Budget: <b>$110 million</b>\n</td>

# When applied to a list of nodes, html_nodes() returns all matching nodes
# beneath any of the elements, flattening results into a new nodelist.
td %>% html_nodes("font")
#> {xml_nodeset (1)}
#> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>

# For each input node, html_node() returns the first matching node. If there
# are no matching nodes, it returns a "missing" node.
td %>% html_node("font")
#> {xml_nodeset (7)}
#> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>
#> [2] <NA>
#> [3] <NA>
#> [4] <NA>
#> [5] <NA>
#> [6] <NA>
#> [7] <NA>

# To pick out an element or elements at specified positions, use [[ and [
ateam %>% html_nodes("table") %>% .[[1]] %>% html_nodes("img")
#> {xml_nodeset (6)}
#> [1] <img src="https://web.archive.org/web/20190202054736im_/https://m.media-a ...
#> [2] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
#> [3] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
#> [4] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/b ...
#> [5] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/I ...
#> [6] <img src="https://web.archive.org/web/20190202054736im_/http://b.scorecar ...
ateam %>% html_nodes("table") %>% .[1:2] %>% html_nodes("img")
#> {xml_nodeset (6)}
#> [1] <img src="https://web.archive.org/web/20190202054736im_/https://m.media-a ...
#> [2] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
#> [3] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
#> [4] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/b ...
#> [5] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/I ...
#> [6] <img src="https://web.archive.org/web/20190202054736im_/http://b.scorecar ...
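
The examples above need network access to the archived page. As an offline sketch of the same position-based selection, the code below builds a small document from a literal string; the tables and image file names are invented for illustration.

```r
library(rvest)

# Hypothetical inline document with three tables.
doc <- read_html('<table><tr><td><img src="a.png"></td></tr></table>
<table><tr><td><img src="b.png"><img src="c.png"></td></tr></table>
<table><tr><td>no images</td></tr></table>')

tables <- html_nodes(doc, "table")

# [[ extracts a single node; html_nodes() then searches beneath it only.
first_imgs <- html_nodes(tables[[1]], "img")

# [ keeps a nodeset; matches from both tables are flattened together.
two_imgs <- html_nodes(tables[1:2], "img")

html_attr(first_imgs, "src")
html_attr(two_imgs, "src")
```

Indexing the nodeset directly, as here, is equivalent to the piped .[[1]] and .[1:2] forms used above.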