(to be released as rvest 1.0.0)
html_text2() provides a more natural rendering of HTML nodes into text, converting <br> into “\n” and removing non-significant whitespace (#175). By default, it also converts &nbsp; into regular spaces, which you can suppress with preserve_nbsp = TRUE (#284).
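As a rough illustration of the difference, the sketch below compares html_text() and html_text2() on a small inline document built with minimal_html(). The markup and variable names are made up for this example.

```r
library(rvest)

# Hypothetical markup: a <br>, a run of collapsible spaces, and an &nbsp;
html <- minimal_html("<p>First line<br>Second   line</p><div>a&nbsp;b</div>")
p <- html_element(html, "p")
div <- html_element(html, "div")

raw_text <- html_text(p)    # text nodes pasted together as-is
nice_text <- html_text2(p)  # <br> becomes a newline, extra spaces collapse

# &nbsp; becomes a regular space unless preserve_nbsp = TRUE
plain <- html_text2(div)
kept <- html_text2(div, preserve_nbsp = TRUE)
```

Printing raw_text and nice_text side by side is the quickest way to see which whitespace each function treats as significant.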
html_table() has been re-written from scratch to more closely mimic the algorithm that browsers use for parsing tables. This should mean that there are far fewer tables for which it fails to produce some output (#63, #204,
The fill argument has been deprecated since it is no longer needed.
html_table() now returns a tibble rather than a data frame to be compatible with the rest of the tidyverse (#199). Its performance has been considerably improved (#237). It also gains a na.strings argument to control what values are converted to NA (#107), and a convert argument to control whether to run the conversion (#311).
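A minimal sketch of the two new arguments, assuming a small hand-written table (the column names and the "n/a" marker are invented for illustration):

```r
library(rvest)

# Hypothetical table in which "n/a" marks a missing score
html <- minimal_html("
  <table>
    <tr><th>id</th><th>score</th></tr>
    <tr><td>1</td><td>10</td></tr>
    <tr><td>2</td><td>n/a</td></tr>
  </table>
")
table_node <- html_element(html, "table")

# Treat "n/a" as missing while converting column types
converted <- html_table(table_node, na.strings = c("NA", "n/a"))

# Skip type conversion entirely and keep every cell as text
as_text <- html_table(table_node, convert = FALSE)
```

Using convert = FALSE can be handy when you would rather parse the columns yourself, for example when numbers carry units or thousands separators.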
rvest is now licensed as MIT (#287).
Since this is the 1.0.0 release, I included a large number of API changes to make rvest more compatible with current tidyverse conventions. Older functions have been deprecated, so existing code will continue to work (albeit with a few new warnings).
rvest now imports xml2 rather than depending on it. This is cleaner because it avoids attaching all the xml2 functions that you’re less likely to use. To reduce the chance of breakage, rvest re-exports some xml2 functions, including url_absolute(), but your code may now need to attach xml2 explicitly.
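For code that relied on xml2 being attached automatically, one possible fix is sketched below. The HTML string is a made-up example; xml_find_first() and xml_name() are xml2 functions that rvest does not re-export.

```r
library(rvest)
library(xml2)  # attach xml2 explicitly for functions rvest does not re-export

# Hypothetical literal HTML, parsed without touching the network
doc <- read_html("<p>Hello <b>world</b></p>")

# These two calls come from xml2, not rvest
bold <- xml_find_first(doc, "//b")
tag <- xml_name(bold)
```

An alternative that avoids attaching the package is to qualify the calls, as in xml2::xml_name(bold).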
html_form() now returns an object with class rvest_form (instead of form). Fields within a form now have class rvest_field, instead of a variety of classes that lacked the rvest_ prefix. All functions for working with forms now share a common prefix.
submit_form() was renamed to
session_submit() because it returns a session.
xml() functions have been removed.
minimal_html() (which doesn’t appear to be used by any other package) has had its arguments flipped to make it more intuitive.
The “harvesting the web” vignette has been rewritten to focus more on the basics of rvest, eliminating the screenshots to keep the installed package as svelte as possible. It’s also been renamed to vignette("rvest") since it’s the vignette that you should read first.
The SelectorGadget vignette is now a web-only article, https://rvest.tidyverse.org/articles/articles/selectorgadget.html, so we can be more generous with screenshots since they’re no longer bundled with every install of the package. Together with the rewrite of the other vignette, this means that rvest is now ~90 Kb instead of ~1.1 Mb.
All uses of IMDB have been eliminated since the site explicitly prohibits scraping (#195).
html_form_set() can now accept character vectors, allowing you to select multiple checkboxes in a set or select multiple values from a multi-<select> (#127, with help from @juba). It also uses dynamic dots so that you can use !!! if you have a list of values (#189).
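The dynamic-dots behaviour can be sketched as follows, assuming a hypothetical search form with two text fields (the field names q and lang are invented for this example):

```r
library(rvest)

# Hypothetical form, parsed from an inline document rather than a live site
html <- minimal_html('
  <form action="/search" method="get">
    <input type="text" name="q">
    <input type="text" name="lang">
    <input type="submit" value="Go">
  </form>
')
form <- html_form(html)[[1]]

# Values held in a list can be spliced in with !!!
values <- list(q = "rvest", lang = "en")
filled <- html_form_set(form, !!!values)
```

This is equivalent to calling html_form_set(form, q = "rvest", lang = "en"), but is more convenient when the field values are assembled programmatically.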
If you’re using xml2 1.0.0,
html_node() will now return a “missing node”.
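A small sketch of what a “missing node” looks like in practice, assuming a recent xml2 and a made-up document that contains no h1 element:

```r
library(rvest)

# Hypothetical document with no <h1>
html <- minimal_html("<p>Only a paragraph</p>")

# No match: the result is a missing node rather than an error
node <- html_node(html, "h1")
is_missing <- inherits(node, "xml_missing")

# Extracting text from a missing node yields NA
text <- html_text(node)
```

Because the result is NA rather than an error, a selector that matches in only some pages can still be mapped over many documents without extra error handling.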
Parse rowspans and colspans by repeating cell values from left to right (for colspan) and from top to bottom (for rowspan) (#111).
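The filling rule can be seen on a small made-up table, sketched here with html_table() from a current rvest release (the cell contents are placeholders):

```r
library(rvest)

# Hypothetical table: "x" spans two columns, "p" spans two rows
html <- minimal_html("
  <table>
    <tr><th>a</th><th>b</th><th>c</th></tr>
    <tr><td colspan='2'>x</td><td>y</td></tr>
    <tr><td rowspan='2'>p</td><td>q</td><td>r</td></tr>
    <tr><td>s</td><td>t</td></tr>
  </table>
")
spans <- html_table(html_element(html, "table"))

# The spanned values are repeated so that every row has three cells
col_a <- spans$a
col_b <- spans$b
```

Here "x" should appear in both a and b for the first data row, and "p" should appear in column a for the last two rows.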
Updated a few examples and demos where the website structure has changed.
Made compatible with both xml2 0.1.2 and 1.0.0.
rvest has been rewritten to take advantage of the new xml2 package. xml2 provides a fresh binding to libxml2, avoiding many of the work-arounds previously needed for the XML package. Now rvest depends on the xml2 package, so all the xml functions are available, and rvest adds a thin wrapper for html.
A number of functions have changed names. The old versions still work, but are deprecated and will be removed in rvest 0.4.0.
html_node() now throws an error if there are no matches, and a warning if there’s more than one match. I think this should make it more likely to fail clearly when the structure of the page changes.
xml_structure() has been moved to xml2. New
html_structure() (also in xml2) highlights id and class attributes (#78).
submit_request() (and hence submit_form()) is now case-insensitive, and so will also find <input type=SUBMIT>.
submit_request() (and hence submit_form()) recognizes forms with <input type="image"> as a valid form submission button.
xml_structure(): new function that displays the structure (i.e. tag and attribute names) of an XML/HTML object (#10).
html_node() method for session.