html_node() and html_nodes() find HTML tags (nodes) using CSS selectors or XPath expressions.
html_nodes(x, css, xpath)
html_node(x, css, xpath)
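| Argument | Description |
|---|---|
| x | Either a document, a node set or a single node. |
| css, xpath | Nodes to select. Supply one of css or xpath depending on whether you want to use a CSS or XPath 1.0 selector. |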
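As a minimal sketch of the two selection arguments, the snippet below picks the same nodes once with css and once with xpath. The inline HTML document and the variable names are made up for illustration; only read_html(), html_nodes() and html_text() come from rvest.

```r
library(rvest)

# Hypothetical inline document, used only for illustration
doc <- read_html('<html><body>
  <div class="a"><p>one</p></div>
  <div class="b"><p>two</p><p>three</p></div>
</body></html>')

# Same selection expressed as a CSS selector and as an XPath expression
by_css   <- html_nodes(doc, css = "div.b p")
by_xpath <- html_nodes(doc, xpath = "//div[@class='b']//p")

html_text(by_css)
html_text(by_xpath)
```

Both calls should select the two paragraphs inside the second div, so which argument to use is mostly a matter of which syntax expresses the query more easily.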
html_node() returns a nodeset the same length as the input. html_nodes() flattens the output so there's no direct way to map the output to the input.
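The difference in return length can be seen on a small, self-contained document. This is only a sketch: the HTML fragment and variable names below are hypothetical and chosen so that the two functions give different lengths.

```r
library(rvest)

# Hypothetical list: the first item has one <b>, the second none, the third three
doc <- read_html('<ul>
  <li><b>a</b></li>
  <li>plain</li>
  <li><b>c</b><b>d</b><b>e</b></li>
</ul>')

items <- html_nodes(doc, "li")          # three input nodes

# html_node(): one result per input node, with a missing node where nothing matches
one_per_item <- html_node(items, "b")
length(one_per_item)
html_text(one_per_item)

# html_nodes(): every match under any input node, flattened into one nodeset
all_matches <- html_nodes(items, "b")
length(all_matches)
html_text(all_matches)
```

Because html_node() keeps one slot per input, its result lines up with items and can be used, for example, as a column alongside other per-item values. The flattened result of html_nodes() cannot be aligned that way.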
CSS selectors are particularly useful in conjunction with https://selectorgadget.com/, which makes it very easy to discover the selector you need. If you haven't used CSS selectors before, I'd recommend starting with the fun tutorial at http://flukeout.github.io/.
CSS selectors are translated to XPath selectors by the selectr package, which is a port of the Python cssselect library, https://pythonhosted.org/cssselect/.
It implements the majority of CSS3 selectors, as described in http://www.w3.org/TR/2011/REC-css3-selectors-20110929/. The exceptions are listed below:
Pseudo selectors that require interactivity are ignored: :hover, :active, :focus, :target, :visited.
The following pseudo classes don't work with the wild card element, *: *:first-of-type, *:last-of-type, *:nth-of-type, *:nth-last-of-type, *:only-of-type.
It supports :contains(text).
You can use !=: [foo!=bar] is the same as :not([foo=bar]).
:not() accepts a sequence of simple selectors, not just a single simple selector.
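The extensions listed above can be tried on a small document. This is a sketch under assumptions: the HTML fragment, attribute values and variable names are invented for illustration, and selectr::css_to_xpath() is called directly only to inspect the XPath that a selector is translated into.

```r
library(rvest)

# Hypothetical fragment, used only for illustration
doc <- read_html('<div>
  <p class="note" lang="en">Total gross</p>
  <p class="note" lang="fr">Budget</p>
  <p lang="en">Runtime</p>
</div>')

# :contains(text) keeps nodes whose text includes the given string
with_text <- html_nodes(doc, "p:contains('gross')")
html_text(with_text)

# [lang!=en] is meant to behave like :not([lang=en])
neq <- html_nodes(doc, "p[lang!=en]")
neg <- html_nodes(doc, "p:not([lang=en])")
html_text(neq)
html_text(neg)

# :not() with a sequence of simple selectors (a class plus an attribute test)
html_text(html_nodes(doc, "p:not(.note[lang=en])"))

# Inspect the XPath expression generated for a CSS selector
selectr::css_to_xpath("p[lang!=en]")
```

Printing the translated XPath is a quick way to check how an unfamiliar selector is interpreted before running it against a real page.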
library(rvest)

url <- paste0(
  "https://web.archive.org/web/20190202054736/",
  "https://www.boxofficemojo.com/movies/?id=ateam.htm"
)
ateam <- read_html(url)

html_nodes(ateam, "center")
#> {xml_nodeset (1)}
#> [1] <center><table border="0" cellspacing="1" cellpadding="4" bgcolor="#dcdcd ...
html_nodes(ateam, "center font")
#> {xml_nodeset (1)}
#> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>
html_nodes(ateam, "center font b")
#> {xml_nodeset (1)}
#> [1] <b>$77,222,099</b>

# html_nodes() is well suited to use with the pipe
ateam %>% html_nodes("center") %>% html_nodes("td")
#> {xml_nodeset (7)}
#> [1] <td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$7 ...
#> [2] <td valign="top">Distributor: <b><a href="/web/20190202054736/https://www ...
#> [3] <td valign="top">Release Date: <b><nobr><a href="/web/20190202054736/http ...
#> [4] <td valign="top">Genre: <b>Action</b>\n</td>\n
#> [5] <td valign="top">Runtime: <b>1 hrs. 57 min.</b>\n</td>
#> [6] <td valign="top">MPAA Rating: <b>PG-13</b>\n</td>\n
#> [7] <td valign="top">Production Budget: <b>$110 million</b>\n</td>
ateam %>% html_nodes("center") %>% html_nodes("font")
#> {xml_nodeset (1)}
#> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>

td <- ateam %>% html_nodes("center") %>% html_nodes("td")
td
#> {xml_nodeset (7)}
#> [1] <td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$7 ...
#> [2] <td valign="top">Distributor: <b><a href="/web/20190202054736/https://www ...
#> [3] <td valign="top">Release Date: <b><nobr><a href="/web/20190202054736/http ...
#> [4] <td valign="top">Genre: <b>Action</b>\n</td>\n
#> [5] <td valign="top">Runtime: <b>1 hrs. 57 min.</b>\n</td>
#> [6] <td valign="top">MPAA Rating: <b>PG-13</b>\n</td>\n
#> [7] <td valign="top">Production Budget: <b>$110 million</b>\n</td>

# When applied to a list of nodes, html_nodes() returns all matching nodes
# beneath any of the elements, flattening results into a new nodelist.
td %>% html_nodes("font")
#> {xml_nodeset (1)}
#> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>

# html_node() returns the first matching node. If there are no matching
# nodes, it returns a "missing" node
td %>% html_node("font")
#> {xml_nodeset (7)}
#> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>
#> [2] <NA>
#> [3] <NA>
#> [4] <NA>
#> [5] <NA>
#> [6] <NA>
#> [7] <NA>

# To pick out an element or elements at specified positions, use [[ and [
ateam %>% html_nodes("table") %>% .[[1]] %>% html_nodes("img")
#> {xml_nodeset (6)}
#> [1] <img src="https://web.archive.org/web/20190202054736im_/https://m.media-a ...
#> [2] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
#> [3] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
#> [4] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/b ...
#> [5] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/I ...
#> [6] <img src="https://web.archive.org/web/20190202054736im_/http://b.scorecar ...
ateam %>% html_nodes("table") %>% .[1:2] %>% html_nodes("img")
#> {xml_nodeset (6)}
#> [1] <img src="https://web.archive.org/web/20190202054736im_/https://m.media-a ...
#> [2] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
#> [3] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
#> [4] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/b ...
#> [5] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/I ...
#> [6] <img src="https://web.archive.org/web/20190202054736im_/http://b.scorecar ...