Guess and repair faulty character encoding.

These functions help you respond to web pages that declare incorrect encodings. You can use guess_encoding to figure out what the real encoding is (and then supply that to the encoding argument of read_html), or use repair_encoding to fix character vectors after the fact.

guess_encoding(x)

repair_encoding(x, from = NULL)

Arguments

x	A character vector.
from	The encoding that the string is actually in. If `NULL`, `guess_encoding` will be used.
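
As a rough illustration of what supplying `from` means, the sketch below re-encodes a Latin-1 byte string to UTF-8 with base R's `iconv`. This is a stand-in chosen to avoid extra dependencies, not a description of how `repair_encoding` works internally, and the variable names `raw_text` and `fixed` are made up for the example.

```r
# Bytes for "Émigré" stored as ISO-8859-1 (Latin-1). In a UTF-8 session
# this string typically prints with escapes, because the bytes are not
# valid UTF-8.
raw_text <- "\xc9migr\xe9"

# State the encoding the bytes are really in and convert them to UTF-8.
fixed <- iconv(raw_text, from = "ISO-8859-1", to = "UTF-8")

fixed
```

If the wrong `from` is given, the conversion can still succeed but produce the wrong characters, which is why guessing the encoding first is useful.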

stringi

These functions are wrappers around tools from the fantastic stringi package, so make sure you have it installed.
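
Because of that dependency, it can be worth checking that stringi is available before relying on these helpers. The sketch below does so and then calls `stringi::stri_enc_detect` directly on some Latin-1 bytes. Treating that function as the detector underneath `guess_encoding` is an assumption here, and the sample bytes and the names `sample_bytes` and `guesses` are invented for the example.

```r
if (requireNamespace("stringi", quietly = TRUE)) {
  # Latin-1 bytes for a short French phrase, passed as a raw vector.
  sample_bytes <- charToRaw("\xc9migr\xe9 cause c\xe9l\xe8bre d\xe9j\xe0 vu.")

  # stri_enc_detect() returns a list with one data frame of candidate
  # encodings per input; take the data frame for our single input.
  guesses <- stringi::stri_enc_detect(sample_bytes)[[1]]
  print(head(guesses, 3))
} else {
  message("stringi is not installed; try install.packages(\"stringi\")")
}
```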

Examples

# A file with bad encoding included in the package
path <- system.file("html-ex", "bad-encoding.html", package = "rvest")
x <- read_html(path)
x %>% html_nodes("p") %>% html_text()
#> [1] "\xc9migré cause célèbre déjà vu."

guess_encoding(x)
#>     encoding language confidence
#> 1 ISO-8859-1       fr       0.31
#> 2 ISO-8859-2       ro       0.22
#> 3   UTF-16BE                0.10
#> 4   UTF-16LE                0.10
#> 5    GB18030       zh       0.10
#> 6       Big5       zh       0.10
#> 7 ISO-8859-9       tr       0.06
#> 8 IBM424_rtl       he       0.01
#> 9 IBM424_ltr       he       0.01

# Two valid encodings, only one of which is correct
read_html(path, encoding = "ISO-8859-1") %>% html_nodes("p") %>% html_text()
#> [1] "Émigré cause célèbre déjà vu."
read_html(path, encoding = "ISO-8859-2") %>% html_nodes("p") %>% html_text()
#> [1] "Émigré cause célčbre déjŕ vu."
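
The table of guesses can also be used programmatically. The sketch below picks the candidate with the highest confidence; the data frame is copied by hand from the first two rows of the output above rather than computed, and the name `best` is hypothetical.

```r
# First two rows of the guess_encoding() output shown above,
# re-entered by hand so the example is self-contained.
guesses <- data.frame(
  encoding = c("ISO-8859-1", "ISO-8859-2"),
  confidence = c(0.31, 0.22),
  stringsAsFactors = FALSE
)

# Select the encoding with the highest confidence score.
best <- guesses$encoding[which.max(guesses$confidence)]
best
#> [1] "ISO-8859-1"
```

The top-ranked guess is not guaranteed to be right, so it is still worth reading the decoded text to confirm it looks sensible.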