Get element text
There are two ways to retrieve text from an element: html_text() and html_text2().
html_text() is a thin wrapper around xml2::xml_text(), which returns just the raw underlying text.
html_text2() simulates how text looks in a browser. Roughly speaking, it converts <br /> to "\n", adds blank lines around <p> tags, and lightly formats tabular data.

html_text2() is usually what you want, but it is much slower than html_text(), so for simple applications where performance is important you may want to use html_text() instead.
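
To get a feel for the speed trade-off described above, the following is a minimal sketch that times both functions on the same node set. The synthetic document, the paragraph count, and the variable names are assumptions made for illustration; the absolute timings depend on your machine and on the shape of the real page.

```r
library(rvest)

# Build a synthetic document with many short paragraphs (made-up data).
n_paragraphs <- 500
body <- paste0(
  "<p>Paragraph ", seq_len(n_paragraphs), "<br>second line</p>",
  collapse = ""
)
html <- minimal_html(body)
nodes <- html_elements(html, "p")

# Time each approach on the same nodes. system.time() evaluates the
# expression in the calling environment, so the assignments persist.
time_raw <- system.time(text_raw <- html_text(nodes))
time_browser <- system.time(text_browser <- html_text2(nodes))

# Elapsed seconds for each function; no particular ratio is guaranteed.
time_raw[["elapsed"]]
time_browser[["elapsed"]]
```

If the two elapsed times are close for your documents, the browser-like output of html_text2() is probably the more convenient choice.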
x: A document, node, or node set.
trim: If TRUE, leading and trailing spaces are trimmed.
preserve_nbsp: Should non-breaking spaces be preserved? By default, html_text2() converts them to ordinary spaces to ease further computation. When preserve_nbsp is TRUE, &nbsp; will appear in strings as "\ua0". This often causes confusion because it prints the same way as " ".
library(rvest)

# To understand the difference between html_text() and html_text2()
# take the following html:
html <- minimal_html(
  "<p>This is a paragraph.
    This another sentence.<br>This should start on a new line"
)

# html_text() returns the raw underlying text, which includes whitespace
# that would be ignored by a browser, and ignores the <br>
html %>% html_element("p") %>% html_text() %>% writeLines()
#> This is a paragraph.
#>     This another sentence.This should start on a new line

# html_text2() simulates what a browser would display. Non-significant
# whitespace is collapsed, and <br> is turned into a line break
html %>% html_element("p") %>% html_text2() %>% writeLines()
#> This is a paragraph. This another sentence.
#> This should start on a new line

# By default, html_text2() also converts non-breaking spaces to regular
# spaces:
html <- minimal_html("<p>x&nbsp;y</p>")
x1 <- html %>% html_element("p") %>% html_text()
x2 <- html %>% html_element("p") %>% html_text2()

# When printed, non-breaking spaces look exactly like regular spaces
x1
#> [1] "x y"
x2
#> [1] "x y"

# But aren't actually the same:
x1 == x2
#> [1] FALSE

# Which you can confirm by looking at their underlying binary
# representation:
charToRaw(x1)
#> [1] 78 c2 a0 79
charToRaw(x2)
#> [1] 78 20 79
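
As a further sketch of the trim and preserve_nbsp arguments, the snippet below shows one way to check whether a string still contains a non-breaking space and how to normalise it afterwards. The sample markup and the variable names are made up for illustration, and the gsub() clean-up step is just one possible approach, not something the package requires.

```r
library(rvest)

# Sample markup (an assumption for this sketch): padded text with one &nbsp;
html <- minimal_html("<p>  x&nbsp;y  </p>")
p <- html_element(html, "p")

# trim = TRUE strips leading and trailing whitespace from the raw text.
untrimmed <- html_text(p)
trimmed <- html_text(p, trim = TRUE)

# Default behaviour: the non-breaking space becomes an ordinary space.
plain <- html_text2(p)

# With preserve_nbsp = TRUE the non-breaking space is kept as "\ua0".
kept <- html_text2(p, preserve_nbsp = TRUE)

# Printing cannot tell the two apart, so test for the character explicitly.
has_nbsp_plain <- grepl("\ua0", plain, fixed = TRUE)
has_nbsp_kept <- grepl("\ua0", kept, fixed = TRUE)

# One way to normalise a preserved string later on.
normalised <- gsub("\ua0", " ", kept, fixed = TRUE)
```

Checking with grepl() rather than by eye avoids the confusion caused by the two kinds of space printing identically.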