class: center, middle, inverse, title-slide # Performant R ## Doing things faster with R ### Dr. Richard J. Acton ### Babraham Institute ### 2024-10-05 --- <style> .slug { color: var(--slug_grey); font-size: 0.5em; position: absolute; bottom: 0; left: 0; bottom: 0; padding: 2rem 64px 1.5rem 64px; } </style> # Slides & Code .right-column[ ###
git repo [github.com/RichardJActon/performantR](https://github.com/RichardJActon/performantR) `git clone https://github.com/RichardJActon/performantR.git` Pre-configured cloud environment for exercises: [![launch - renku](https://renkulab.io/renku-badge.svg)](https://renkulab.io/projects/racton/performantR/sessions/new?autostart=1) ### Slides Directly [richardjacton.github.io/performantR/Performant_R.html](https://richardjacton.github.io/performantR/Performant_R.html) ] .left-column[ <img src="Performant_R_files/figure-html/unnamed-chunk-1-1.png" width="100%" style="display: block; margin: auto;" /> - [*p*] presenter view - [*o*] overview - [*f*] fullscreen - [*h*] help/more ] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- # Performant R - Overview - Profiling R to find bottlenecks with {[profvis](https://rstudio.github.io/profvis/)} & Other Measurements - Beware premature optimisation! > "The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; __premature optimization is the root of all evil (or at least most of it) in programming.__" - *Donald Knuth, The Art of Computer Programming* - Performance Tips & Tricks - Vectorise all the things - Don't Grow objects - Slightly hacky hash tables - Coding efficiently != Writing efficient code - Functional Programming in R - Parallelism, Concurrency, & Asynchrony with {[future](https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html)}s & {[promises](https://rstudio.github.io/promises/index.html)} - Caching computationally expensive results - faster save and load with {[qs](https://github.com/traversc/qs)} & {[fst](https://www.fstpackage.org/)} - Using pipeline managers {[targets](https://books.ropensci.org/targets/)} .slug[HDBI | Dr. Richard J. 
Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- # Profiling with {[profvis](https://rstudio.github.io/profvis/)} and RStudio .panelset[ .panel[.panel-name[Why Profile your code?] - Don't waste your time optimizing the wrong thing - Know Your bottlenecks! - Don't assume you know what is taking the most time, test it! - See what is taking the most time &/or memory - Prioritise these things to make your code faster - How? – Profiling with {[profvis](https://rstudio.github.io/profvis/)} - Other Simple timing/benchmarking tools e.g. {[bench](https://github.com/r-lib/bench)}, {[microbenchmark](https://github.com/joshuaulrich/microbenchmark/)} & {[tictoc](https://collectivemedia.github.io/tictoc/)} - in RStudio: - __Profile > Profile Selected Line(s)__ ] .panel[.panel-name[Code] ```r profvis::profvis(height = "400px", interval = 0.005, { n <- 1e6 df <- data.frame( group = letters[1:10], * value = rnorm(n) ) plot <- ggplot(df, aes(value)) + geom_density() + facet_wrap(~group) }) ``` ] .panel[.panel-name[Output]
] .panel[.panel-name[Tips] - The `interval` argument to `profvis` can be set to profile something that is too fast for the default `0.01`s - At < `0.005` (5 ms) timing accuracy becomes unreliable - Profiling and its visualisation can be separated - E.g. profiling a script on a headless server - `utils::Rprof` performs the profiling - Output can be saved to file and viewed with `profvis` using the `prof_input` argument ] .panel[.panel-name[video demo] <center> <iframe width="80%" height="400px" src="https://www.youtube-nocookie.com/embed/rmnee9I2dvk" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> </center> ] ] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? Figure out what the actual bottlenecks are in your code so you know what to prioritise when trying to make things go faster. There is no point spending hours making the wrong thing go fast if it is bottlenecked by something else. In Depth Video on profiling: https://www.youtube.com/watch?v=LR8S2lrH_XM Other resource: [R Programming for Data Science - Roger D. Peng Ch 19](https://bookdown.org/rdpeng/rprogdatascience/profiling-r-code.html) ? cover other benchmarking tools, microbenchmark, tictoc etc.? --- ## Benchmarking - Profiling may be overkill when you just want a quick execution-time check; {[tictoc](https://collectivemedia.github.io/tictoc/)} is great for this. `tic()` to start timing, `toc()` to stop (You can do this with the built-in `Sys.time()` and taking the start/stop diff) - If you need very high temporal precision &/or the ability to repeat a benchmark many times {[microbenchmark](https://github.com/joshuaulrich/microbenchmark/)} is your friend - {[bench](https://github.com/r-lib/bench)} provides results in a nice *tidy* format and has built-in visualisations .slug[HDBI | Dr. Richard J. 
Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## $> Exercise: Profiling/Benchmarking Pick a measurement tool {[profvis](https://rstudio.github.io/profvis/)}, {[bench](https://github.com/r-lib/bench)}, {[microbenchmark](https://github.com/joshuaulrich/microbenchmark/)}, {[tictoc](https://collectivemedia.github.io/tictoc/)} or just use `Sys.time()` to determine which of these 2 pieces of code takes longer to run & which part takes longest? ```r sample_names <- c("WT1","WT2","Mut1","Mut2"); n <- 2e4 gene_names <- unique(purrr::map_chr(1:n, ~paste0( c(sample(c(letters, LETTERS), sample(3:6, 1)),"-", sample(1:7, 1)), collapse = "" ))); n <- length(gene_names) ``` .pull-left[ ```r counts_df <- data.frame( rpois(n, 5), rpois(n, 5), rpois(n, 5), rpois(n, 5), row.names = gene_names ) colnames(counts_df) <- sample_names cond_means_df <- data.frame( WT = rowMeans(counts_df[,1:2]), Mut = rowMeans(counts_df[,3:4]) ) ``` ] .pull-right[ ```r counts_m <- matrix( rpois(n*4, 5), dimnames = list( gene_names, sample_names ), nrow = n ) cond_means_m <- cbind( rowMeans(counts_m[,1:2]), rowMeans(counts_m[,3:4]) ) colnames(cond_means_m) <- c("WT", "Mut") ``` ] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? 
https://www.youtube.com/watch?v=fRv3-LnEUjg --- ## $> Possible Solution: Profiling/Benchmarking .pull-left[ ```r mbres <- microbenchmark( "df" = { counts_df <- data.frame( rpois(n, 5), rpois(n, 5), rpois(n, 5), rpois(n, 5), row.names = gene_names ) colnames(counts_df) <- sample_names # counts_df[1:2,] condition_means_df <- data.frame( WT = rowMeans(counts_df[,1:2]), Mut = rowMeans(counts_df[,3:4]) ) # head(condition_means_df, 2) }, "mat" = { counts_m <- matrix( rpois(n*4, 5), dimnames = list( gene_names, sample_names ), nrow = n ) # counts_m[1:2,] condition_means_m <- cbind( rowMeans(counts_m[,1:2]), rowMeans(counts_m[,3:4]) ) colnames(condition_means_m) <- c("WT", "Mut") # head(condition_means_m, 2) } ) ``` ] .pull-right[ ``` ## Unit: milliseconds ## expr min lq mean median uq max neval ## df 8.403657 8.714039 10.17626 8.868798 9.774115 20.53731 100 ## mat 4.017615 4.205887 5.02517 4.253904 4.470900 18.61002 100 ``` .can-edit[ - Observations? ] ] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? - calling rpois repeatedly is more expensive than calling it once - matrices are 'cheaper' than `data.frames` - profvis requires a certain minimum interval --- ## Profiling & Benchmarking - a note on JIT > R 3.4 introduced JIT (just-in-time compilation): the R interpreter will now by default cache the compiled byte code of certain R code calls, making subsequent calls faster. This means the first time you run something it may run slower than on subsequent runs of the same code within an R session. You can disable JIT compilation if you want to benchmark the un-compiled version of your code e.g. to optimise for speed on first run. ```r old_jit <- compiler::enableJIT(-1) # -ve gets current level compiler::enableJIT(0) # 0 is off # Code ... 
compiler::enableJIT(old_jit) ``` For more on the byte code compiler {[compiler](https://stat.ethz.ch/R-manual/R-devel/library/compiler/html/compile.html)} --- # Performance Tips & Tricks - Vectorise all the things - DON’T Grow vectors (This used to matter a lot more) - Slightly Hacky hash tables - Give up and Write C++ - Miscellaneous - efficient representation Lgl, Int, factors, RLE - `ifelse`, `if_else` or `if {} else {}`? - `anyNA()` vs `any(is.na())` - partial sorting .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] .footnote[ See also: [The R Inferno](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf) ] --- ## Vectorise all the things - *Rule of Thumb*: __a vector is faster than a loop__ - Applies to matrices/arrays too as these are secretly vectors - Matrix arithmetic/linear algebra is very fast - If you can cast your problem in these terms you can get a speedup ```r mat <- matrix(1:1e6, nrow = 1000) tic() x <- integer(length = nrow(mat)) for (i in 1:nrow(mat)) { x[i] <- sum(mat[i,]) } toc() ``` ``` ## 0.012 sec elapsed ``` ```r tic() x <- rowSums(mat) toc() ``` ``` ## 0.004 sec elapsed ``` ??? https://r-coder.com/matrix-operations-r/ https://shainarace.github.io/LinearAlgebra/multapp.html --- ### $> Exercise: Vectorise! .panelset[.panel[.panel-name[Data] .pull-left[ ```r prices <- c( "cookies(x12)" = 4, "cakes" = 10, "pies" = 8 ) sales <- matrix( c(3,7,2,5,3,5,1,5,4,4,3,2,3,0,4), dimnames = list( c("Mon","Tue","Wed","Thu", "Fri"), names(prices) ), nrow = 5 ) ``` ] .pull-right[ ``` ## cookies(x12) cakes pies ## 4 10 8 ## cookies(x12) cakes pies ## Mon 3 5 3 ## Tue 7 1 2 ## Wed 2 5 3 ## Thu 5 4 0 ## Fri 3 4 4 ``` ] ] .panel[.panel-name[You can do better...] 
.pull-left[ ```r (earnings <- apply(sales, 1, \(x) x * prices)) ## Mon Tue Wed Thu Fri ## cookies(x12) 12 28 8 20 12 ## cakes 50 10 50 40 40 ## pies 24 16 24 0 32 (daily_total <- apply(earnings, 2, sum)) ## Mon Tue Wed Thu Fri ## 86 54 82 60 84 (weekly_total_by_good <- apply(earnings, 1, sum)) ## cookies(x12) cakes pies ## 80 190 96 (grand_total <- sum(apply(earnings, 1, sum))) ## [1] 366 ``` ] .pull-right[ ``` ## # A tibble: 1 × 2 ## min median ## <bch:tm> <bch:tm> ## 1 86.9µs 102µs ``` ] ] ] --- ### $> Vectorise! Solution ```r (earnings <- t(sales) * prices) ## Mon Tue Wed Thu Fri ## cookies(x12) 12 28 8 20 12 ## cakes 50 10 50 40 40 ## pies 24 16 24 0 32 (daily_total <- colSums(earnings)) ## Mon Tue Wed Thu Fri ## 86 54 82 60 84 (weekly_total_by_good <- rowSums(earnings)) ## cookies(x12) cakes pies ## 80 190 96 (grand_total <- sum(earnings)) ## [1] 366 ``` ``` ## # A tibble: 1 × 2 ## min median ## <bch:tm> <bch:tm> ## 1 10.7µs 11.9µs ``` --- ## Growing Objects .pull-left[ ```r # Grow vec <- numeric(0) for(i in 1:n) vec <- c(vec, i) # pre-allocate vec <- numeric(n) for(i in 1:n) vec[i] <- i #direct vec <- 1:n ``` ] .pull-right[ <img src="Performant_R_files/figure-html/unnamed-chunk-16-1.png" width="100%" /> ] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? --- ## DON’T Grow vectors (?) > __Before R 3.4, growing a vector in a loop was expensive__ - each time the vector grew by one, R would copy the vector to a new block of memory, with the new item at the end. However, as of R 3.4, R tries to put vectors in a memory location with room to grow, and if there exists only one reference to the vector, it can grow in place, except in the uncommon case where there’s no more room to grow. (If there are two references to the vector, then R must make a copy.) 
This means that __growing a vector is a fast operation, as long as there’s only one reference to it.__ > Although __growing a vector is now fast, growing a data frame is still an expensive operation__ and results in copying all of the data. [^1] The old advice now applies mostly just to `data.frame`s - Append == BAD - If your __output__ is __of known length declare__ the __data structure__ to house your results __first__ - If you have an __upper bound__ make the __larger__ data structure and __shrink__ it to just the populated values when done [^1]: https://rstudio.github.io/profvis/examples.html .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## Object Growth Benchmarking .pull-left[ - Vector pre-allocation has slight benefits for vectors (length > ~ 165 *for my system) - Relatively small when compared to the impact on `data.frame`s ```r # grow df <- data.frame(A = integer()) for(i in 1:n) df <- rbind(df, data.frame(A = i)) # pre-allocate df <- data.frame(A = integer(length = n)) for(i in 1:n) df[i, 1] <- i # direct df <- data.frame(A = 1:n) ``` ] .pull-right[
] --- # $> EXERCISE - profiling The following code is for moving a square in a simulated game of Monopoly. Profiling the larger codebase for simulating the game revealed that rolling the dice was taking up a lot of the total run time. This is an action which happens many times in the execution of the code, so optimising here is high leverage. Use {[profvis](https://rstudio.github.io/profvis/)} to find the slow parts and try to make it faster. *Note: square 11 is Jail* This example is lightly adapted from 'Efficient R programming' by Colin Gillespie & Robin Lovelace [^2] [^2]: https://csgillespie.github.io/efficientR/performance.html#example-optimising-the-move_square-function --- .panelset[ .panel[.panel-name[Code] ```r profvis(height = "450px", interval = 0.005, { initial <- sample(1:40, 1e3, replace = TRUE) new <- integer(length = 1e3) for (i in seq_along(initial)) { current <- initial[i] df <- data.frame( d1 = sample(seq(1, 6), 3, replace = TRUE), d2 = sample(seq(1, 6), 3, replace = TRUE) ) df$Total <- apply(df, 1, sum) df$IsDouble <- df$d1 == df$d2 if(df$IsDouble[1] & df$IsDouble[2] & df$IsDouble[3]) { current <- 11 } else if (df$IsDouble[1] & df$IsDouble[2]) { current <- current + sum(df$Total[1:3]) } else if (df$IsDouble[1]) { current <- current + sum(df$Total[1:2]) } else { current <- current + sum(df$Total[1]) } if(current %% 40 == 0) {new[i] <- 40} else {new[i] <- current %% 40} } }) ``` ] .panel[.panel-name[Output]
] ] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? `move_square` function body, calling in functional form did not give linewise results for some reason. --- .panelset[ .panel[.panel-name[Code] ```r profvis(height = "400px", interval = 0.005, { initial <- sample(1:40, 1e3, replace = TRUE) new <- integer(length = 1e3) for (i in seq_along(initial)) { current <- initial[i] * rolls <- matrix(sample(seq(1, 6), 6, replace = TRUE), ncol = 2) # data.frame -> matrix * Total <- rowSums(rolls) # apply -> rowSums IsDouble <- rolls[,1] == rolls[,2] * if(IsDouble[1] && IsDouble[2] && IsDouble[3]) { # & to && current <- 11 # Go to Jail! * } else if (IsDouble[1] && IsDouble[2]) { current <- current + sum(Total[1:3]) } else if (IsDouble[1]) { current <- current + sum(Total[1:2]) } else { current <- current + sum(Total[1]) } if(current %% 40 == 0) {new[i] <- 40} else {new[i] <- current %% 40} } }) ``` ] .panel[.panel-name[Output]
] .panel[.panel-name[Explanation] ### Example improvements - `sample` is only run once in this version - Only calling sample once to generate all the rolls has less overhead than making two separate function calls - `matrix` is often faster than `data.frame` - matrices are secretly vectors and data frames are secretly lists of vectors. Matrices, being simpler and less flexible data structures, have less overhead than data.frames. - Vectorise with `rowSums` instead of `apply` - `rowSums` is an efficient C implementation whereas apply does its iteration in R, which will be slower - `&&` is faster than `&` and is valid for single comparisons - `&` & `|` are vectorised, `&&` & `||` only evaluate the first element - __BONUS__ Double and triple rolls occur relatively rarely, can you get even more performance by only rolling them when needed? ] ] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? > R 3.4 introduced JIT (Just in time compilation), the R interpreter will now by default cache the compiled byte code of certain R code calls making subsequent calls faster. (I think this may also alter some profvis outputs so that JIT compiled code does not always have granular line by line times? I may be wrong about this but the behaviour does not always match what I'd expect from some examples especially user defined function calls) ```r df <- data.frame(a = c(1,2,3), b = c("A","B","C")) # mixed types df as.list(df) attributes(df) str(df) ``` ```r mat <- matrix(c(1,2,3,"A","B","C"), byrow = TRUE, nrow = 2) mat attributes(mat) # all one type str(mat) ``` --- ## What are hash tables? 
Hash tables let you rapidly find things by name, here's how: <center> <iframe width="65%" height="375px" src="https://www.youtube-nocookie.com/embed/KyUTuwz_b7Q" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> </center> .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## Slightly Hacky hash tables - R does not have an 'ordinary' data structure that implements hash tables - R `environments` however can be hashed ```r myenv <- new.env(hash = TRUE) myenv$tmp <- "test" myenv$tmp ``` ``` ## [1] "test" ``` - `myenv` Can be used much like a named list but cannot be subset by index - nor can they be renamed like a list ```r myenv[1]; names(myenv)[1] <- "renamed" ``` ``` ## Error in myenv[1]: object of type 'environment' is not subsettable ``` ``` ## Error in names(myenv)[1] <- "renamed": names() applied to a non-vector ``` .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## Constructing hashed environments - Environments must be constructed by iteration not directly from vectors ```r myenv <- new.env(hash = TRUE) keys <- c("A", "B", "C"); values <- 1:3 purrr::walk2(keys, values, ~{myenv[[.x]] <- .y}) ls(myenv) ``` ``` ## [1] "A" "B" "C" ``` - Constructing hashed environments is slow - Retrieving from them is very fast, irrespective of size .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? 
hashtables / hashmaps explanations Maybe show one of these in the longer workshop session, probably not a compsci audience so might be valuable for understanding https://www.youtube.com/watch?v=KyUTuwz_b7Q https://www.youtube.com/watch?v=FhNJ6aikTVI hashtable performance in R https://www.r-bloggers.com/2015/03/hash-table-performance-in-r-part-i/ https://blog.dominodatalab.com/a-quick-benchmark-of-hashtable-implementations-in-r --- ## Performance Trade-offs in `R` hashtables - It is very quick to find things in a hashtable but... - It takes a long time to construct a hashtable
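This trade-off can be sketched with a small illustration (the key names and sizes here are arbitrary, hypothetical choices): building the hashed environment requires one assignment per key, but a single retrieval stays fast however many keys there are.

```r
# Sketch of the trade-off: slow to build, fast to query
n <- 1e5
keys <- sprintf("key%06d", seq_len(n))
vals <- seq_len(n)

# Named vector: fast to build, but lookup by name scans the names
named_vec <- setNames(vals, keys)

# Hashed environment: built by iteration, one assignment per key
hashed_env <- new.env(hash = TRUE, size = n)
for (k in seq_along(keys)) hashed_env[[keys[k]]] <- vals[k]

# Retrieval hashes the key rather than scanning every name
named_vec[["key099999"]]
hashed_env[["key099999"]]
```

Once the table is built the environment lookup wins on repeated queries; for a handful of lookups on a small vector the construction cost will dominate, so the hack only pays off when the same index is queried many times.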
--- ## $> EXERCISE: Using hashes as an index - Construct a data.frame with a character column - Construct a hashed index in an environment - Use the environment to get the row indices and subset the data.frame - Compare the speed of this to the conventional approach .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## Give up and Write C++ - But in R! - The {[Rcpp](https://www.rcpp.org/)} library makes this easy *(The R integration not the C++)* ```r cppFunction( "bool isOddCpp(int num = 10) { bool result = (num % 2 == 1); return result; }" ) isOddCpp(42L); isOddCpp(69L) ``` ``` ## [1] FALSE ``` ``` ## [1] TRUE ``` - The `cppFunction` lets you write simple in-line C++ functions and use them in R - This is possible with C, FORTRAN and even Rust 🦀 but a lot more work .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## $> EXERCISE: Write a basic Rcpp function Just using the `cppFunction()` for now. Rcpp for everyone: https://teuder.github.io/rcpp4everyone_en/ Re-writing R in Cpp: https://adv-r.hadley.nz/rcpp.html .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ### {Rcpp} beyond `cppFunction()` - For longer code use a file with the `.cpp` extension and the `sourceCpp()` function. - The header should contain: ```cpp #include <Rcpp.h> using namespace Rcpp; ``` - Functions to export to `R` should be preceded by: `// [[Rcpp::export]]`. (like `@export` in Roxygen) - R code in special comments will run when the file is sourced, this is useful for testing e.g. 
```cpp /*** # R Code x <- isOddCpp(3) */ ``` - Example file: [Rcpp_examples.cpp](Rcpp_examples.cpp) - In R markdown use the chunk argument: `engine = 'Rcpp'` to write C++ inline --- ### monopoly example in {Rcpp} .panelset[ .panel[.panel-name[code] ```r cppFunction("int move_square_Rcpp(int current) { IntegerVector dice {1,2,3,4,5,6}; IntegerVector rolls = sample(dice, 2, true); bool isDouble = rolls(0) == rolls(1); if(isDouble) { IntegerVector rolls2 = sample(dice, 2, true); // only run 1/6 times bool isDouble2 = rolls2(0) == rolls2(1); if(isDouble2) { IntegerVector rolls3 = sample(dice, 2, true); // only run 1/36 times bool isDouble3 = rolls3(0) == rolls3(1); if(isDouble3) { current = 11; return(current); } // Go to Jail! current += sum(rolls2) + sum(rolls3); // double-double: move by all three rolls } else { current += sum(rolls2); } } current += sum(rolls); if(current % 40 == 0) { return current; } else { return current % 40; } }") ``` __Remember C++ has 0 based indexing! (unlike R)__ ] .panel[.panel-name[Profiling {Rcpp} monopoly] The probability of a double is `\(\frac{1}{6}\)` and of a triple `\(\frac{1}{36}\)` so save time by only rolling them when they are needed. Too fast for the profiler to measure well with the default interval! ```r profvis(height = "400px", interval = 0.005, { initial <- sample(1:40, 1e3, replace = TRUE) sapply(initial, move_square_Rcpp) }) ``` ``` ## Error in parse_rprof(prof_output, expr_source): No parsing data available. Maybe your function was too fast? ``` ] .panel[.panel-name[Trying {microbenchmark}] ```r initial <- sample(1:40, 1e3, replace = TRUE) res <- microbenchmark::microbenchmark( Rcpp_monopoly = { sapply(initial, move_square_Rcpp) } ) res ``` ``` ## Unit: milliseconds ## expr min lq mean median uq max neval ## Rcpp_monopoly 1.902882 1.993653 2.507664 2.034512 2.111219 11.7337 100 ``` ~2ms down from ~15ms in improved `R` and from ~225ms un-optimised `R` ] ] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? 
Rcpp for everyone https://teuder.github.io/rcpp4everyone_en/ Re-writing R in Cpp: https://adv-r.hadley.nz/rcpp.html Nice applied biological example: https://www.youtube.com/watch?v=spVVjrgHHNU --- ### Take care when choosing the C++ Route 🦶 🔫 For those of us used to the safety of an interpreted language like `R`, C++ can be a bit of a foot-gun if we try to get fancy. All kinds of nightmarish memory bugs lurk in the shadows to accost the unwary traveller in these foreign lands. Only venture here when needs must, and preferably accompanied by a seasoned veteran of the `C` languages who can sniff out a dangling pointer from 3 miles hence. Ah! A memory leak! 🧠 💦 Run! I exaggerate, but if you don't know the language you can spend hours trying to fix something relatively simple when a little clever tweak in R might have got you almost as much performance gain. Developer time and the ability of future people (including future you) to maintain your code must be considered as well as compute time. .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## But... 
.panelset[ .panel[.panel-name[more optimal R function] ```r move_square_prob <- function(current) { dice <- 1L:6L; rolls <- sample(dice, 2L, replace = TRUE) isDouble <- rolls[1] == rolls[2] if (isDouble) { rolls2 <- sample(dice, 2L, replace = TRUE) isDouble2 <- rolls2[1] == rolls2[2] if (isDouble2) { rolls3 <- sample(dice, 2L, replace = TRUE) isDouble3 <- rolls3[1] == rolls3[2] if(isDouble3) { current <- 11L; return(current) } current <- current + sum(rolls2) + sum(rolls3) # double-double: move by all three rolls } else { current <- current + sum(rolls2) } } current <- current + sum(rolls) if(current %% 40L == 0L) { return(current) } else { return(current %% 40L) } } ``` ] .panel[.panel-name[All Results] ``` ## Unit: milliseconds ## expr min lq mean median uq ## Rcpp_monopoly 1.090453 1.282960 1.464381 1.338911 1.424408 ## bad_R_monopoly 297.483057 317.702966 328.277304 325.240488 331.920142 ## better_R_monopoly 15.963452 18.222063 20.327268 18.828246 20.869306 ## opt_R_monopoly 4.736132 5.466059 6.136475 5.614250 5.845586 ## max neval ## 6.847772 100 ## 462.145343 100 ## 46.562341 100 ## 19.961067 100 ``` ] ] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## Miscellaneous A grab bag of tidbits .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ### Smaller Representations .panelset[ .panel[.panel-name[integers] If you can use integers, do; they are smaller than doubles, but logicals don't gain you anything over integers. 
```r d <- double(length = 1e6); format(object.size(d), "Kb") ## [1] "7812.5 Kb" i <- integer(length = 1e6); format(object.size(i), "Kb") ## [1] "3906.3 Kb" ir <- sample(1:1e6, 1e6); format(object.size(ir), "Kb") ## [1] "3906.3 Kb" l <- logical(length = 1e6); format(object.size(l), "Kb") ## [1] "3906.3 Kb" ``` ] .panel[.panel-name[factors] - `character`s are big; they can be made a lot smaller with alternative representations - factors are secretly integers, each name is stored only once ```r mychar <- rep(purrr::map_chr(1:5, ~paste(sample(letters, 10, replace = TRUE), collapse = "")), 2e4) format(object.size(mychar), "Kb") ``` ``` ## [1] "781.6 Kb" ``` ```r myfac <- as.factor(mychar); str(myfac); format(object.size(myfac), "Kb") ``` ``` ## Factor w/ 5 levels "avblmsxhct","kwzrlnawqs",..: 4 5 1 2 3 4 5 1 2 3 ... ``` ``` ## [1] "391.4 Kb" ``` ] .panel[.panel-name[Run Length Encoding `rle`] - Very effective if order does not matter - Also good when data is frequently repetitive - Stores each value once along with the length of its run, so a long vector can usually be represented by a much shorter integer vector plus the unique values ```r mychar_sorted <- sort(mychar); mychar_rle <- rle(mychar_sorted); str(mychar_rle) ## List of 2 ## $ lengths: int [1:5] 20000 20000 20000 20000 20000 ## $ values : chr [1:5] "avblmsxhct" "kwzrlnawqs" "symjvlhqwn" "wakudpmdlx" ... ## - attr(*, "class")= chr "rle" format(object.size(mychar_sorted), "Kb"); format(object.size(mychar_rle), "Kb") ## [1] "781.6 Kb" ## [1] "1 Kb" ``` ] .panel[.panel-name[Sparse matrices] If your data has many 0 values a sparse matrix from the {[Matrix](https://cran.r-project.org/web/packages/Matrix/index.html)} package can reduce its size. If it's *very* big, a file-backed sparse matrix from {[bigsparser](https://github.com/privefl/bigsparser)} may also be a solution. .pull-left[ ```r n <- 1e3; i <- sample(1:n,70); j <- sample(1:n,70); x <- rpois(70,2) MS <- Matrix::sparseMatrix(i,j,x=x,dims = c(n,n)); MS[1:2,1:10] ``` ``` ## 2 x 10 sparse Matrix of class "dgCMatrix" ## ## [1,] . . 
. . . . . . . . ## [2,] . . . . . . . . . . ``` ```r format(object.size(MS), "Kb") ``` ``` ## [1] "6.2 Kb" ``` ] .pull-right[ ```r M <- matrix(nrow = n, ncol = n) for (q in seq_along(x)) { M[i[q], j[q]] <- x[q] } format(object.size(M), "Kb") ``` ``` ## [1] "3906.5 Kb" ``` ] ] ] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ### Approaches to `if else` .pull-left[ - `dplyr::if_else()` is faster than `base::ifelse()` - Vectorised operations can be faster than either - Looping over `if(){}else{}` and `switch` is slow - But may have an edge when `length == 1` ```r marks <- sample(1:100, 1000, replace = TRUE) n <- 1000 # Vector - fastest solution results <- rep("fail", length(marks[1:n])) results[marks[1:n] >= 40] <- "pass" ``` ] .pull-right[
] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ### `anyNA()` is (usually) faster than `any(is.na())` .pull-left[ - Speedup varies with length and sparseness ```r nvec <- sample(1:100, 1e6, replace = TRUE) nvec[sample(1:1e6, 5e3)] <- NA_integer_ ``` ] .pull-right[ ```r tic() any(is.na(nvec)) ``` ``` ## [1] TRUE ``` ```r toc() ``` ``` ## 0.006 sec elapsed ``` ```r tic() anyNA(nvec) ``` ``` ## [1] TRUE ``` ```r toc() ``` ``` ## 0.002 sec elapsed ``` ] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ### Partial Sorts - The default sorting algorithm 'radix' is the fastest of the sort methods - `sort(x, decreasing = TRUE)` may be marginally quicker than `rev(sort(x))`, negligible in my testing - If you just need part of the sort done, e.g. the top 10, this can be faster, especially for strings: `sort(x, partial = 1:10)`
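A minimal sketch of the partial-sort trick (the vector here is an arbitrary example): asking only for positions 1 to 10 gives the same ten smallest values as a full sort, without sorting the rest of the vector.

```r
set.seed(42)
x <- sample(1e6)  # a random permutation of 1:1e6

smallest_full <- sort(x)[1:10]                  # sorts all 1e6 values
smallest_part <- sort(x, partial = 1:10)[1:10]  # only guarantees positions 1-10

identical(smallest_full, smallest_part)
```

For the 10 *largest* values the same idea applies with `partial = (length(x) - 9):length(x)`. Note that partial sorting discards names, so it is a poor fit when you need the whole vector ordered or the names kept.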
.slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- # ‘Efficient’ coding is not all about performance - If future you / future maintainers of your code can’t understand your optimised code to fix it when it breaks, is it really efficient? - Consider development time as well as execution time. - Is it a good use of time to spend 5 hours re-writing in C++ a piece of code that you will run 5 times total and that takes 5 minutes to run in R, so that it runs in 20s? - `R` is fast to write, and relatively easy to comprehend, but can be slow to run - optimise for developer and maintainer time when writing your code - Simple and self-contained - __Well Documented__ - Unit tests > devtime vs runtime .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? devtime vs runtime https://www.youtube.com/watch?v=ttJCwzfsAc8 --- # Functional Programming in R - R is often written in *imperative* style - Typically involves loops and modifications to global state - Good for many problems but often does not scale well to larger code bases - namespace gets cluttered with many objects - Out of order execution in interactive environments can cause inconsistent state - S, the language from which R evolved, began as purely *functional* (loops were added later). - So `R` has a functional heritage but is not optimised for recursion like e.g. `haskell` - Functional Style - Makes heavier use of lexical scope, thus fewer globals - Makes a series of transformations to inputs 'pipeline-able' - Map/Reduce & The {purrr} library - more consistent API than `apply` - better 'type safety' (insofar as that is a thing in R) .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? 
"The Joy of Functional Programming (for Data Science)" with Hadley Wickham https://www.youtube.com/watch?v=bzUmK0Y07ck [DSC 4.0] What Makes a Good Function - Hadley Wickham keynote https://www.youtube.com/watch?v=Qne86lxjgtg --- ## Imperative Vs Functional .pull-left[ Imperative Style ```r resulti <- character(length = length(letters)) names(resulti) <- LETTERS for (i in seq_along(letters)) { resulti[i] <- paste0(letters[i], i) } head(resulti) ``` ``` ## A B C D E F ## "a1" "b2" "c3" "d4" "e5" "f6" ``` 3 assignments, 2 subsets, 1 set length, 3 'lines' ] .pull-right[ 'Functional' Style ```r resultf <- letters %>% purrr::set_names(LETTERS) %>% purrr::map2_chr( ., seq_along(.), ~paste0(.x, .y) ) head(resultf) ``` ``` ## A B C D E F ## "a1" "b2" "c3" "d4" "e5" "f6" ``` 1 assignment, 0 subsets, 0 set length, 1 'line' ] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? --- ## Advantages of Functions ### Why Write Functions? - Testable - Reusable / Portable / Self-contained - lexical scoping reduces the complexity of the state of the program which often grows very large in imperative style ### Good Functions - __Referential Transparency__ (no globals or side effects just params) - Endomorphism (same form as input and output) - Simple / Single Purpose .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## What makes a good function? 
.pull-left[ __BAD__ Function 😡 ```r globalx <- 0L; globaly <- 20 fun1 <- function(x = 1) { while (globalx < globaly) { globalx <<- globalx + x } } fun1() globalx ``` ``` ## [1] 20 ``` Not referentially transparent, has global side effects ] .pull-right[ __GOOD__ Function 😇 ```r globalx <- 0L; globaly <- 20 fun2 <- function( init = 0L, increment = 1, maxv = 20 ) { while (init < maxv) { init <- init + increment } init } globalx <- fun2(globalx, maxv = globaly) globalx ``` ``` ## [1] 20 ``` Referentially transparent, has no global side effects ] Nothing except the parameter values affects the output of the function. Nothing in the environment is changed by the function, except by its return value. .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## Good Functions are *usually* easier to parallelise Many side effects can be a problem if they happen in the wrong order or try to modify the same thing. Global values are not always accessible to parallel processes so specifying everything as a parameter avoids this issue. .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## `%>%` ceci n'est pas une pipe {[magrittr](https://magrittr.tidyverse.org/)} Pipes are a useful tool for taking the output of one function and giving it to another. Let's say we want to do a series of operations on a vector, each of which is a function. ```r x <- rnorm(10) sum(sqrt(abs(x))) ``` ``` ## [1] 5.277688 ``` We take the absolute value of x with `abs()`, the square root of each with `sqrt()`, and then the sum of the square roots with `sum()`. In the code the order is reversed from its order in the verbal description. Reading starts from the outside with `sum` but the logical order of the operations starts with x and works outwards.
```r x %>% abs() %>% sqrt() %>% sum() ``` ``` ## [1] 5.277688 ``` Using the pipe `%>%` from the {[magrittr](https://magrittr.tidyverse.org/)} package you can invert this order, and instead write your code in the order in which you might describe or __speak__ what you are doing. .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? https://magrittr.tidyverse.org/logo.png --- # Parallelism, Asynchrony, & Concurrency Nomenclature and mental models 🧠 - **Parallelism** - *Solution*, making things faster by doing parts of them at the same time - You choose it 🤦 - Correctness is the same output as sequential - **Concurrency** - *Problem*, multiple possible things happening, potentially, all at once. - It chooses you 😮 (e.g. Shiny) - Correctness is context dependent - **Asynchrony** (related to 'event driven' / 'reactive' programming) - Generally a consequence of parallelism/concurrency - Non-blocking (lets you do other things while you wait) - Handling - Waiting for the things you need to move on to the next step - Dealing with things that may show up in an unexpected order .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? Source for how to think about Parallelism, Asynchrony, & Concurrency Jonathan Worthington.
Keynote: Perl 6 Concurrency https://www.youtube.com/watch?v=hGyzsviI48M Slides: https://jnthn.net/papers/2019-perlcon-concurrency.pdf Tip: The Raku/perl6 community is a great one to follow to learn cool programming things as it's full of massive language design nerds Async Programming in R https://www.youtube.com/watch?v=Vt6f3qMehXI --- # Task Parallelism Vs Data Parallelism - **Task parallel** - Do different things which don't have a path dependency at the same time - **Data parallel** - Do the same thing to a lot of chunks of data - Operation must be independent of the results from other chunks of the data .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? `map` is trivially parallelisable, `reduce` is not --- # *More Power! ⚡* Need speed? just add threads! - Parallel processing with Futures - How not to use all your RAM - Futures as an interface to HPC Schedulers with {[future.batchtools](https://future.batchtools.futureverse.org/)} - {[bigstatsr](https://privefl.github.io/bigstatsr/)} On disc matrices for massive models etc. .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## Parallelism with Futures .panelset[ .panel[.panel-name[parallel processing] - There are a variety of ways to do parallel and asynchronous processing in R - {[future](https://future.futureverse.org/)} provides a nice consistent interface to many of them; it is usually best to use it for this reason. - {[foreach](https://cran.r-project.org/web/packages/foreach/vignettes/foreach.html)} remains popular and has compatibility with {[future](https://future.futureverse.org/)} - Avoid using base {parallel} `mclapply` etc.
as these are not cross-platform - With {[future](https://future.futureverse.org/)} the __USER__ defines the parallel runtime / backend - With {[progressr](https://progressr.futureverse.org/)} the __USER__ defines how they would like to receive progress updates - {[clustermq](https://mschubert.github.io/clustermq/)} has less overhead on HPC clusters than {[future](https://future.futureverse.org/)} but requires support from the cluster. ] .panel[.panel-name[{foreach} and why to use the {future} API] <center> <iframe width="65%" height="375px" src="https://www.youtube-nocookie.com/embed/GC5VXjc52t0" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> </center> ] .panel[.panel-name[{[future](https://future.futureverse.org/)} & {[progressr](https://progressr.futureverse.org/)}] <center> <iframe width="65%" height="375px" src="https://www.youtube-nocookie.com/embed/FRF0c0Wls3E" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> </center> ] ] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? longer more detailed talk on future https://www.youtube.com/watch?v=2ZlpFkFMy7E&t=600s furrr https://www.youtube.com/watch?v=hlx3-iv6HIQ advanced asynchronous processing with promises: https://rstudio.github.io/promises/articles/overview.html (most useful for Shiny apps) Common problems, non-exportable objects https://cran.r-project.org/web/packages/future/vignettes/future-4-non-exportable-objects.html foreach with a future backend https://cran.r-project.org/web/packages/doFuture/vignettes/doFuture.html --- ## Parallelism with Futures .panelset[ .panel[.panel-name[Wait... What?]
- Coding for futures - {[future.apply](https://future.apply.futureverse.org/)} parallel versions of the base `apply` function family - {[furrr](https://furrr.futureverse.org/)} parallel versions of the {[purrr](https://purrr.tidyverse.org/)} map functions - %<-% Future assignment, RHS is evaluated in a subprocess ```r tic() result <- furrr::future_map(letters[1:5], ~Sys.sleep(1)) toc() ``` ``` ## 5.052 sec elapsed ``` Wait... Why did it take 5 seconds to wait 5 seconds in parallel? Shouldn't it only take ~1? Yes, but first we need a `plan()`! The default behavior of the `future_*` functions is to run sequentially, just like their serial counterparts. ] .panel[.panel-name[We need a plan!] Plans ```r plan(multisession, workers = 5) tic() result <- furrr::future_map(letters[1:5], ~Sys.sleep(1)) toc() ``` ``` ## 2.395 sec elapsed ``` ```r plan(sequential) ``` Running `plan(sequential)` at the end resets things to normal and closes remaining child processes. Note that there is some overhead associated with making child processes: it took nearly 2.5 seconds to wait 5 seconds in parallel. ] ] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## $> Exercise: parallelise .panelset[ .panel[.panel-name[Exercise] Parallelise the optimised R implementation of the monopoly `move_square` function and determine approximately how many simulated moves are needed before the overhead of parallelisation becomes worth it. ] .panel[.panel-name[Integers] ```r move_square <- function(current){ rolls <- matrix(sample(seq(1, 6), 6, replace = TRUE), ncol = 2) Total <- rowSums(rolls) IsDouble <- rolls[,1] == rolls[,2] if(IsDouble[1] && IsDouble[2] && IsDouble[3]) { * current <- 11L # integer!
} else if (IsDouble[1] && IsDouble[2]) { current <- current + sum(Total[1:3]) } else if (IsDouble[1]) { current <- current + sum(Total[1:2]) } else { current <- current + sum(Total[1]) } *if(current %% 40L == 0L) {current <- 40L} else {current <- current %% 40L} *as.integer(current) # ensure the result is in integer form } ``` ] .panel[.panel-name[solution] .pull-left[ Note that parallel processes can cause problems for random number generation, so make use of parallel-safe RNG ```r initial <- sample(1:40, 1e5, replace = TRUE) move <- purrr::map_int(initial, move_square) plan(multisession) move <- furrr::future_map_int( initial, move_square, *.options = furrr_options(seed = TRUE) ) plan(sequential) ``` ] .pull-right[
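One way to estimate the break-even point (a sketch; it assumes `move_square` from the previous panel is defined, `benchmark_moves` is a made-up helper name, and the exact numbers are machine-dependent):

```r
library(future)

# Time serial vs parallel versions at increasing input sizes;
# parallelism pays off once its elapsed time drops below serial's
benchmark_moves <- function(n) {
  initial <- sample(1:40, n, replace = TRUE)
  serial <- system.time(
    purrr::map_int(initial, move_square)
  )[["elapsed"]]
  plan(multisession)
  parallel <- system.time(furrr::future_map_int(
    initial, move_square, .options = furrr::furrr_options(seed = TRUE)
  ))[["elapsed"]]
  plan(sequential)
  c(n = n, serial = serial, parallel = parallel)
}

# e.g. purrr::map(c(1e3, 1e4, 1e5), benchmark_moves)
```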
] ] ] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## Evaluation topologies - more complex plans .pull-left[ A complex plan: ```r plan(list( tweak(remote, workers = "cheops"), # name of a remote host to ssh into # e.g. cheops1 tweak( batchtools_slurm, resources = list( ntasks = 5, mem_per_cpu = 16000L, walltime = "06:00:00" )), tweak(multisession, workers = 10) )) ``` ] .pull-right[ The code: ```r chromosomes <- as.roman(1:5); samples <- 1:10 result %<-% # ssh furrr::future_map(chromosomes, ~{ # Separate Slurm Jobs chr <- .x furrr::future_map(samples, ~{ # Individual cores on the compute nodes data <- readr::read_tsv( paste0(chr, "_", .x, ".tsv")) }) }) ``` ] __I DO NOT recommend doing this over ssh: if the connection is lost before the job is finished, the results will not be collected!__ .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? Note that using a remote host requires that you can connect to the host without a password prompt. This can be achieved with a password protected ssh key using an ssh agent which keeps the key unlocked on your local system for a period of time after an initial password prompt. Where I placed 'cheops' in the above example you can substitute: <username>@remote.host e.g. racton@cheops.rrz.uni-koeln.de or define a convenient alias in your `~/.ssh/config` file ``` Host cheops # set the connection name / alias Hostname cheops.rrz.uni-koeln.de User racton # YOUR USERNAME IdentityFile ~/.ssh/id_rsa # this is the default and can be left out ForwardX11 yes # other optional ssh connection settings ``` Add your ssh key to the agent in a terminal session: e.g.
`ssh-add ~/.ssh/id_rsa` --- ### Configuring HPC for {future.batchtools} with Slurm To begin dispatching jobs from within R scripts you need to configure {future.batchtools}. The `R` module needs to be loaded by default. Batchtools needs two config files: `~/.batchtools.conf.R` to set your default job resource requests, and `~/.batchtools.slurm.tmpl` as a Slurm job template. Templates for these files are in [R_on_cheops](./R_on_cheops) ??? If you want to work interactively with the scheduler and use RStudio then I would recommend starting RStudio on a head node or in an interactive job in a terminal multiplexer such as [screen](https://www.putorius.net/linux-screen-command.html) or [tmux](https://www.howtogeek.com/671422/how-to-use-tmux-on-linux-and-why-its-better-than-screen/) then connecting via ssh port forwarding. SSH port forwarding means that the rendering of the UI is done client-side not server-side as with X11 forwarding so the experience of using a GUI is much smoother. The terminal multiplexer will keep the RStudio server job running even if your connection is lost so you can reconnect without losing your session. see: R_on_cheops/cheops_R_remote.nb.html --- # Async, Blocking & Promises - A simple `future` assignment is blocking - The main R process is tied up waiting for the future to finish and return a result - How do we get our main R process back to do other things while we wait? {[promises](https://rstudio.github.io/promises/index.html)} .pull-left[ ```r res <- NULL prms <- future_promise({ Sys.sleep(1); "Woke Up!" }) then(prms, function(value) {res <<- value}) prms # Not done yet ``` ``` ## <Promise [pending]> ``` ```r res # Still NULL ``` ``` ## NULL ``` ```r Sys.sleep(3) ``` ] .pull-right[ ```r 2+2 # Free to do other things ``` ``` ## [1] 4 ``` ```r prms # All Done! ``` ``` ## <Promise [pending]> ``` ```r res # Now has the promised value ``` ``` ## NULL ``` ] .small[Note: a promise only resolves once R's event loop gets a chance to run (see {later}), which is why the knitted output above still shows a pending promise and a `NULL` result; interactively, `prms` would be resolved and `res` would hold the value.] .slug[HDBI | Dr. Richard J.
Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## $> EXERCISE: Play with promises Try out the `%...>%` operator. ```r px <- NULL fx <- \(x){Sys.sleep(x); x} future_promise(2) %...>% fx() %...>% {px <<- .} px ``` ``` ## NULL ``` They are a bit counter-intuitive: once you have made a promise it's hard to escape, as you always get another promise back. This is intentional so that you can chain them together; they are often used in contexts where you don't explicitly need the result back, you are just giving a promise to something that understands how to handle them. For instance {Shiny} is 'aware' of promises and using them in conjunction with {future} can be very effective at reducing unresponsive UI issues in {Shiny} apps. This is covered specifically in this vignette: [Using `promises` with Shiny](https://rstudio.github.io/promises/articles/shiny.html) .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## How not to use all your RAM - R likes to copy things into memory, do not pass big objects to subprocesses or you will eat all your RAM! - Do not reference large globals from within code that will be executed in a subprocess. - Subset first to be safe - Beware of lazy evaluation! - 😞 `function(x[i])` - 😃 `y <- x[i]; function(y)` - In the first example the entirety of `x` is passed to the function's environment and the sub-setting happens there. - Read from file directly from the sub-process and don't pass an object through the parent environment. The default upper limit on the size of objects passed between sessions is ~500Mb, though you can adjust this with `options(future.globals.maxSize = Inf)`, or a finite value in bytes. If you are doing this, though, you may want to rethink your approach as it is likely to be inefficient.
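A minimal sketch of the subset-first pattern with a future assignment (the object sizes and names here are illustrative):

```r
library(future)
plan(multisession, workers = 2)

big <- rnorm(1e7) # ~80 MB in the parent session

# Risky: referencing `big` inside the future means the whole
# object is identified as a global and exported to the worker
# res %<-% sum(big[1:100])

# Safer: subset first, so only the small piece is exported
y <- big[1:100]
res %<-% sum(y)
res

plan(sequential)
```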
.small[_*Tip:*_ if you remove (`rm()`) a big object from an R session you won't immediately get your memory back. Running the garbage collector manually with `gc()` can speed this up.] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- # Larger than memory matrices with {[bigstatsr](https://privefl.github.io/bigstatsr/)} There is an upper limit on the max size of a matrix in R. A matrix cannot hold more than ~2,147,483,647 elements even if there is enough RAM to do so. Matrices are just vectors with dimensions, and vectors are indexed with an R integer, thus the max value of an R integer is the max number of elements in an R matrix (sort of). {[bigstatsr](https://privefl.github.io/bigstatsr/)} makes use of memory mapping to store a file backed matrix on disk in a binary file format that allows subsets of it to be read into memory on demand: ```r fbm_demo <- tempfile() X <- bigstatsr::FBM(1e3, 1e3, type = "integer", backingfile = fbm_demo) X[1:3, 1:5] # Only select manageable subsets ``` ``` ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0 0 0 0 0 ## [2,] 0 0 0 0 0 ## [3,] 0 0 0 0 0 ``` ```r *# X[] - Not a good idea! # X[1,] or X[,1] may also not be a good idea depending on the dimensions ``` .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## Penalised regression on larger than memory matrices with {[bigstatsr](https://privefl.github.io/bigstatsr/)} .panelset[ .panel[.panel-name[Regression] The `bigstatsr::big_spLogReg()` & `bigstatsr::big_spLinReg()` functions allow the fitting of lasso/elastic net/ridge penalised regression models on these large matrices.
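A sketch of what fitting one of these models can look like (the dimensions and the simulated data here are made up for illustration):

```r
library(bigstatsr)

# A small file-backed matrix standing in for one too large for RAM
X <- FBM(500, 20, init = rnorm(500 * 20))
y <- X[, 1] + rnorm(500) # response with some signal in column 1

# Lasso-penalised linear regression via cross-model selection
fit <- big_spLinReg(X, y)
preds <- predict(fit, X)
```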
Reading a matrix into base R loads all the data into memory by default, so you need to construct big matrices by streaming the contents of the file through R into an FBM using approaches described in the vignette [reading files to a FBM](https://privefl.github.io/bigstatsr/articles/read-FBM-from-file.html). ] .panel[.panel-name[Video] <center> <iframe width="80%" height="400px" src="https://www.youtube-nocookie.com/embed/UXKYqpAo6cs" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> </center> ] ] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? In depth video from the package author: https://www.youtube.com/watch?v=UXKYqpAo6cs bigstatsr parallelism https://privefl.github.io/blog/a-guide-to-parallelism-in-r/ --- # Fast serialisation with {[qs](https://github.com/traversc/qs)} & {[fst](https://www.fstpackage.org/)} - 😡 `save(); load()` - unwanted side effects from namespace clashes and unexpected objects - 😐 `saveRDS(x); x <- readRDS()` - Saves a single object, the name is defined by assignment - 😎 `qs::qsave(); qs::qread()` - similar to the `RDS` functions but uses the highly performant `lz4` / `zstd` compression libraries to speed things up - Specifically for `data.frame` like objects `fst::write_fst(), fst::read_fst()` - Uses the same compression libraries as `qs` but only works on tabular data __No List Columns__ - However this allows on disk row/column reading 🤯 - Doing: `fst::read_fst("data.fst", columns = c("x", "y"), from = 1, to = 50)` only reads columns `x` & `y` and rows 1 to 50 into memory. - With other methods you have to read in everything and drop the things you don't need afterwards. .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ???
With fst you can do things somewhat similarly to a big matrix: only reading the rows/columns of data that you need into the process, and parallelising over chunks of your data by passing the workers just the coordinates from which to load the data from file, not the data itself. --- ## $> EXERCISE: Benchmark {[qs](https://github.com/traversc/qs)}/{[fst](https://www.fstpackage.org/)} Compare `saveRDS()`/`readRDS()` to their equivalents in {[qs](https://github.com/traversc/qs)}/{[fst](https://www.fstpackage.org/)} Read only the first column of a data.frame saved as an `.fst` file and compare the speed / memory use of this to trying to do the equivalent with an ordinary flat file such as a `.csv` .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- # The {[targets](https://docs.ropensci.org/targets/)} pipeline manager .panelset[ .panel[.panel-name[targets] - Many large analyses have slow computationally expensive steps - Many outputs e.g. graphics require tweaking but not re-running the whole pipeline - figuring out what to rerun, when, and whether you've already done it becomes complex to keep track of - reproducible analysis that you don't have to re-run from scratch to guarantee it works - Easy to run any independent processes in parallel To perform some of the benchmarks I presented here I used [_targets.R](./_targets.R) because some of the benchmarks take a very long time to run and I wanted to cache the results. You can see this targets pipeline in the [_targets.R](./_targets.R) file in this presentation's git repo.
] .panel[.panel-name[targets] ```r library(targets) tar_script({ list( # A targets pipeline consists of a list of 'targets' tar_target(data_file, "data.csv", format = "file"), tar_target(data, read_csv(data_file)), tar_target(params, c(0.01, 0.1, 1)), tar_target(model, fit_model(data, params), pattern = map(params)), tar_target(plot, plot_results(model), pattern = map(model)) ) }) tar_make() # Only runs 'out of date' targets and their dependencies ``` ] .panel[.panel-name[interactive use] Many of the graphs in this presentation are targets loaded into the R session that builds this presentation ```r x <- tar_read(target) tar_load(target2); target2 ``` ] .panel[.panel-name[graph] ```r targets::tar_visnetwork() ``` ``` ## * The library is already synchronized with the lockfile. ```
] .panel[.panel-name[Video] <center> <iframe width="80%" height="400px" src="https://www.youtube-nocookie.com/embed/FODSavXGjYg" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> </center> ] ] .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? The targets book https://books.ropensci.org/targets/ --- ## Free task parallelism! Because {targets} understands the dependency relationships between individual steps in an analysis it can automatically evaluate independent tasks in parallel. If you declare a plan in your `_targets.R` file before your list of targets and then use the `tar_make_future()` function instead of `tar_make()`, that is all that's needed. Parallelism within jobs is also possible, just include multiple levels in your plan and use {future} functions in your targets. .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- ## $> EXERCISE: create a simple targets pipeline BONUS: - Make a target that reads an input file and becomes invalid when the file changes. - Include an Rmarkdown report as a target output (Note that changing the `.Rmd` file invalidates the target) .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] --- # Additional Resources There are references and additional resources in the presenter's notes throughout this presentation as well as some extra stuff in the github repo.
In-depth walk-through example of optimising `R` code for a network analysis <center> <iframe width="80%" height="400px" src="https://www.youtube-nocookie.com/embed/78icyDMZJyQ" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> </center> .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)] ??? return early --- # Feedback, Questions & Extra material - Leave comments, questions & constructive criticism at the [
github discussion section](https://github.com/RichardJActon/performantR/discussions) for this presentation's repo. https://github.com/RichardJActon/performantR/discussions ### Like my slides? Made with: - {[rmarkdown](https://rmarkdown.rstudio.com/)} - {[xaringan](https://github.com/yihui/xaringan/)} - {[xaringanExtra](https://github.com/gadenbuie/xaringanExtra)} - Icons & emoji from {[fontawesome](https://rstudio.github.io/fontawesome/)} & {[emo](https://github.com/hadley/emo)} [Tutorial from user2021](https://presentable-user2021.netlify.app/) .slug[HDBI | Dr. Richard J. Acton | Babraham Institute | 2024-10-05 | [✉️Richard.Acton@babraham.ac.uk](mailto:Richard.Acton@babraham.ac.uk)]