read_html() operates on the HTML source code downloaded from the server.
This works for most websites but can fail if the site uses JavaScript to
generate the HTML. read_html_live() provides an alternative interface
that runs a live web browser (Chrome) in the background. This allows you to
access elements of the HTML page that are generated dynamically by JavaScript
and to interact with the live page by clicking on buttons or typing in
forms.
Behind the scenes, this function uses the chromote package, which requires that you have a copy of Google Chrome installed on your machine.
Value
read_html_live() returns an R6 LiveHTML object. You can interact
with this object using the usual rvest functions, or call its methods,
like $click(), $scroll_to(), and $type(), to interact with the live
page like a human would.
Examples
if (FALSE) {
  library(rvest)

  # When we retrieve the raw HTML for this site, it doesn't contain the
  # data we're interested in:
  static <- read_html("https://www.forbes.com/top-colleges/")
  static %>% html_elements(".TopColleges2023_tableRow__BYOSU")

  # Instead, we need to run the site in a real web browser, causing it to
  # download a JSON file and then dynamically generate the HTML:
  sess <- read_html_live("https://www.forbes.com/top-colleges/")
  sess$view()  # show a live view of the page for inspection
  rows <- sess %>% html_elements(".TopColleges2023_tableRow__BYOSU")
  rows %>% html_element(".TopColleges2023_organizationName__J1lEV") %>% html_text()
  rows %>% html_element(".grant-aid") %>% html_text()
}
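
The $click() method described under Value can be combined with the usual rvest extractors. The following is a minimal sketch rather than part of the package: scrape_live_rows() is a hypothetical helper, the URL and CSS selectors in the usage call are placeholders, and the fixed pause is an assumed, crude way of giving the page time to render after each click.

```r
# Hypothetical helper (not part of rvest): open a page in a live browser,
# click a "load more" style button a few times, then return the text of
# the elements matching `row_css`.
scrape_live_rows <- function(url, row_css, button_css, n_clicks = 2, wait = 1) {
  sess <- rvest::read_html_live(url)

  for (i in seq_len(n_clicks)) {
    sess$click(button_css)
    Sys.sleep(wait)  # assumed pause so newly generated rows can render
  }

  rows <- rvest::html_elements(sess, row_css)
  rvest::html_text2(rows)
}

if (FALSE) {
  # Placeholder URL and selectors; substitute those of the real site.
  scrape_live_rows("https://example.com/listing", ".row", ".load-more")
}
```

As with the example above, the call is wrapped in if (FALSE) because it needs a network connection and a local Chrome installation.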