(tada <- emo::ji("tada"))
🎉
February 15, 2024
httr2 is an amazing 📦 from the r-lib team. Built on top of the strong foundations of curl, experience from the previous incarnation with httr, and tidy principles and design, httr2 is an easy go-to for anything API related.
But … I found a bug 🪲, or let’s call it a missed opportunity. TL;DR it was fixed. Let’s rewind.
🎉 aka :tada: is the best emoji, this is not open for debate.
Let’s dissect it with the help of the 📦 utf8splain and uni, which I had totally forgotten about. tada is a single code point emoji, U+1F389, aka "\U1F389" in R:
🎉
# A tibble: 1 × 7
      id rune    description     block                 countries languages type
   <int> <chr>   <chr>           <chr>                 <chr>     <chr>     <chr>
1 127881 U+1F389 " Party Popper" miscellaneous-symbol… <NA>      <NA>      <NA>
In utf-8, i.e. the encoding to rule them all, 🎉 is encoded with 4 bytes that follow the convention explained in the UTF-8 wikipedia page. The first byte, 11110000, starts with 11110 to indicate that it is a 4-byte encoded code point (or rune), followed by 3 continuation bytes that start with 10: 10011111, 10001110, 10001001.
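The same four bytes can be checked in base R without any extra package. This is only a small sketch: to_binary() is a helper name made up for the illustration, it is not part of utf8splain or uni.

```r
# Build the emoji from its code point so the string is UTF-8 on any platform
tada <- intToUtf8(0x1F389)

# The raw bytes of its UTF-8 encoding
bytes <- charToRaw(tada)

# Hypothetical helper: turn one byte into its 8 bits, most significant first
# (rawToBits() returns the least significant bit first, hence rev())
to_binary <- function(byte) {
  paste(rev(as.integer(rawToBits(byte))), collapse = "")
}

bits <- vapply(
  seq_along(bytes),
  function(i) to_binary(bytes[i]),
  character(1)
)

print(bytes)
print(bits)
```

The first element of bits carries the 11110 prefix, the other three the 10 prefix of continuation bytes.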
(I still don’t know how to reveal the ansi escape codes in quarto, so using a screenshot instead so that you have colors.)
Just like 🎉, many characters are encoded using more than one byte in utf-8 and other encodings.
While we’re in ThinkR realm (uni and utf8splain are weekend 📦 we developed when I was working with them), let’s look at what started this side quest of fixing a bug in httr2. In the tada::verse() post I introduced a function to compose 📦 poems with ChatGPT via the mlverse/chattr package, and was annoyed that the function would not work to write a golem poem.
> chattr::chattr("Can you write a poem about the R package called 'golem'. Please add a bunch of emojis.")
Sure! Here's a poem about the R package 'golem' with a bunch of emojis:
Error in `discard()`:
ℹ In index: 1.
Caused by error:
! `.p()` must return a single `TRUE` or `FALSE`, not `NA`.
Run `rlang::last_trace()` to see where the error occurred.
Warning messages:
1: In strsplit(., "data: ") :
unable to translate 'data: {"id":"chatcmpl-8nsIIxlfPHfY8BhhuUu7NFsIO57AC","object":"chat.completion.chunk","created":1706897470,"model":"gpt-3.5-turbo-0613","system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_re...' to a wide string
2: In strsplit(., "data: ") : input string 1 is invalid
That was embarrassing and curbed my enthusiasm about sharing the poem with the team. I still did, but I had to use the normal ChatGPT app like a human instead of the api.
mlverse/chattr is not the only R 📦 that can speak to ChatGPT, and I successfully used irudnyts/openai for another similar quest with the valentine package that writes roses are red … poems about packages. This does work, e.g.
Roses are red, 🌹
Golem is neat,
With R package power, 💪
Coding dreams complete! ✨
The advantage of mlverse/chattr though is that it uses streaming to get tokens faster rather than wait for the whole poem to be composed.
So naturally, I went for a dive into how mlverse/chattr works, using snitch to get some understanding of its implementation, and sending a bottle in the issues in case the chattr team wanted to spare me the quest.
I sent a first clunky pull request that did the job, while looking kind of ugly and hacky. When that happens, that’s usually a good sign that this is a solution to the wrong problem, so I abandoned that PR and decided to go earlier in the 📦 chain and look at r-lib/httr2, because chattr uses httr2::req_perform_stream() to … process the stream.
The stream from ChatGPT is processed in fixed-size chunks of bytes, so the problem was that on occasion these chunks cut an emoji in the middle, which causes issues down the line. This confused other parts of the mlverse/chattr codebase.
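To make the failure concrete, here is a small base R simulation. It does not call ChatGPT or httr2: the "data: " line and the chunk size of 8 bytes are made up for the demo, chosen so that the cut lands in the middle of the emoji.

```r
# A made-up streamed line containing the emoji, as raw bytes
msg <- paste0("data: ", intToUtf8(0x1F389), "\n")
bytes <- charToRaw(msg)

# Pretend the stream is read in fixed-size chunks of 8 bytes
chunk_size <- 8
first  <- bytes[seq_len(chunk_size)]
second <- bytes[-seq_len(chunk_size)]

# The whole message is valid UTF-8, but each chunk taken alone is not:
# the first ends with half an emoji, the second starts with the other half
whole_ok  <- validUTF8(rawToChar(bytes))
first_ok  <- validUTF8(rawToChar(first))
second_ok <- validUTF8(rawToChar(second))

print(c(whole = whole_ok, first = first_ok, second = second_ok))
```

Any code that converts each chunk to a string and then calls something like strsplit() on it is exposed to this kind of invalid input.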
Now that this was reframed as a missed r-lib/httr2 opportunity, and I had been looking for an excuse to peep at how httr2 works, I deep dived and opened a pull request last week. Hadley started to review it the next day and we went back and forth on it, iterating a few times until we were happy with it.
It is now merged, and so will be released as part of the next httr2 release, but you can take it for a spin with pak::pak("r-lib/httr2").
My initial proposal was to add a req_perform_stream_lines(), based on the idea that if we know the stream is text encoded in utf-8, then instead of streaming all the bytes and taking the risk that chunks might cut emojis or other characters mid rune, we can buffer the bytes and process them line by line.
This kind of worked, but we ended up having the two sister functions req_perform_stream() and req_perform_stream_lines() that shared a lot of logic but were different. Something was off.
We continued to iterate, and Hadley has been, as usual, generous with reviewing and improving the pull request. Hadley even contributed the tests that allowed us to run the last kilometer.
We settled on adding the extra argument round= to the req_perform_stream() function, so that instead of processing fixed-size chunks of bytes, the callback function could receive a truncated sequence of bytes.
Here is the updated documentation for req_perform_stream():
The default behavior remains round = "byte" so that the risk of the pull request being disruptive is minimal: by default, the full chunk of buffer_kb kilobytes is sent to the callback.
The added value of the pull request though is that you can now use round = "line" so that the stream is buffered and cut at the last newline character. A newline is a character that is encoded in a single byte, i.e. its utf-8 representation is the same as its ascii one: 00001010.
utf-8 encoded string with 1 runes
U+000A 0A 00001010 New Line (Nl) : line feed (lf) : end of line (eol) : LF
# A tibble: 1 × 4
     id byte  decimal binary
  <int> <raw>   <int> <chr>
1     1 0a         10 00001010
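The buffering idea behind round = "line" can be sketched in base R. This is a simulation under stated assumptions, not the httr2 implementation: stream_lines() and collect() are hypothetical names, and the chunks come from an in-memory raw vector rather than a connection.

```r
# Hypothetical helper: feed chunks into a buffer and only hand the callback
# the bytes up to (and including) the last newline seen so far
stream_lines <- function(chunks, callback) {
  buffer <- raw()
  newline <- charToRaw("\n")
  for (chunk in chunks) {
    buffer <- c(buffer, chunk)
    cut <- which(buffer == newline)
    if (length(cut) > 0) {
      last <- cut[length(cut)]
      callback(buffer[seq_len(last)])
      # keep the incomplete tail for the next round
      buffer <- buffer[-seq_len(last)]
    }
  }
  invisible(buffer)
}

# Two made-up lines, split into 8-byte chunks that cut the emoji in two
text <- paste0("data: ", intToUtf8(0x1F389), "\n", "data: done\n")
bytes <- charToRaw(text)
starts <- seq(1, length(bytes), by = 8)
chunks <- lapply(starts, function(i) bytes[i:min(i + 7, length(bytes))])

# Collect what the callback receives
received <- character()
collect <- function(bytes) {
  received <<- c(received, rawToChar(bytes))
  TRUE
}
leftover <- stream_lines(chunks, collect)

print(validUTF8(received))
```

Because a newline is a single byte that never appears inside a multi-byte sequence, cutting there cannot split a rune, which is why the pieces handed to the callback are all valid utf-8.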
We also contemplated implementing round = "utf8" to round at the last valid utf-8 sequence, but we eventually decided that it’s probably not worth it at this stage.
But round = is flexible enough to accommodate other ways of rounding, and is passed through the internal httr2::as_round_function():
as_round_function <- function(round = c("byte", "line"),
                              error_call = caller_env()) {
  if (is.function(round)) {
    check_function2(round, args = "bytes")
    round
  } else if (is.character(round)) {
    round <- arg_match(round, error_call = error_call)
    switch(round,
      byte = function(bytes) length(bytes),
      line = function(bytes) which(bytes == charToRaw("\n"))
    )
  } else {
    cli::cli_abort(
      '{.arg round} must be "byte", "line" or a function.',
      call = error_call
    )
  }
}
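Since round can also be a function with a bytes argument, the "utf8" rounding we decided not to ship could be approximated by hand. The sketch below is hypothetical: round_utf8() is not part of httr2, and it assumes, as the built-in "line" rounder suggests, that a rounding function returns candidate cut positions in the buffer. It also assumes the stream itself is valid utf-8.

```r
# Hypothetical rounding function: cut after the last complete UTF-8 sequence
round_utf8 <- function(bytes) {
  ints <- as.integer(bytes)

  # Positions of bytes that are not continuation bytes (10xxxxxx)
  starts <- which(bitwAnd(ints, 0xC0L) != 0x80L)
  if (length(starts) == 0) {
    return(integer())
  }

  # Expected width of the last sequence, read from its leading byte
  last <- starts[length(starts)]
  lead <- ints[last]
  width <- if (lead < 0x80L) {
    1L
  } else if (lead >= 0xF0L) {
    4L
  } else if (lead >= 0xE0L) {
    3L
  } else {
    2L
  }

  # Keep everything if the last sequence is complete,
  # otherwise cut just before its leading byte
  cut <- if (last + width - 1L <= length(bytes)) length(bytes) else last - 1L
  if (cut > 0L) cut else integer()
}

bytes <- charToRaw(paste0("ab", intToUtf8(0x1F389)))
print(round_utf8(bytes))       # complete emoji
print(round_utf8(bytes[1:4]))  # emoji cut after two of its four bytes
```

Under those assumptions it could presumably be passed as round = round_utf8, but I have not tried it against a real stream.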
I’ll talk about cli_abort(), arg_match() and error_call some other day. Working on this pull request was great and I believe we ended up with the right solution.
With req_perform_stream(round = "line") it becomes much easier to fix the initial problem, so I could send a second pull request there, and now with the dev version of httr2 and the pull request of chattr we can finally enjoy the golem poem:
── chattr
• Provider: Open AI - Chat Completions
• Path/URL: https://api.openai.com/v1/chat/completions
• Model: gpt-3.5-turbo
Sure! Here's a poem about the R package 'golem' with a bunch of emojis:
๐ In the land of R, a package was born,
๐ง Its name was 'golem', a tool to adorn.
๐๏ธ With ๐งฑ and ๐๏ธ, it built apps with ease,
๐ Adding colors and interactivity, oh please!
📦 'Golem' wrapped up shiny, like a gift,
๐ Making web apps with a magical lift.
๐ It brought the power of the web to R,
🖥️ Creating interfaces that would take you far.
๐ฎ With 'golem', your app could be grand,
๐จ Customizing the UI with a wave of your hand.
๐ Visualizations, charts, and graphs,
๐ All made possible with 'golem's' crafts.
๐ Security was 'golem's' top priority,
๐ Protecting your app with utmost sincerity.
๐ Continuous integration, deployment made smooth,
๐ Launching your app with a confident groove.
๐ So, if you seek to build apps with flair,
๐ง 'Golem' is the package that's beyond compare.
๐๏ธ With its help, your dreams will come true,
๐ Creating web apps that will surely woo!