Beyond Single-Core R

Jonathan Dursi
http://github.com/ljdursi/beyond-single-core-R

Today's Outline

Today will look something like this:

  • How to think about scaling
  • Parallel Package
    • Multicore
      • mcparallel/mccollect/mclapply
      • parallel RNG
      • load balancing, chunking
      • pvec
    • Snow
      • makeCluster/stopCluster/clusterExport
      • clusterSplit
  • Foreach
    • chunking, iterators
  • Scalable Data Analysis Best Practices

Extra material online

  • R and Memory
  • Data file formats
  • BigMemory
  • Rdsm
  • pbdR

Not Covered

  • R in other frameworks
    • SparkR (R + Apache Spark)
    • RHadoop
    • Cool but more about the other framework than R

Thinking about scaling

Some hardware terms

Hardware:

  • Node: A single motherboard, with possibly multiple sockets
  • Processor/Socket: the silicon containing likely multiple cores
  • Core: the unit of computation; often has hardware support for pseudo-cores
  • Pseudo-cores (hardware threads): appear to the OS as multiple cores but share much functionality with the other pseudo-cores on the same core

Sockets, Cores, and Hardware threads
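
To see how these terms map onto the machine you are using, a minimal sketch with the parallel package's detectCores() is shown here. The variable names are illustrative only. On some platforms the logical argument is not honoured, so both calls may return the same number, and either may return NA if R cannot determine the count.

```r
library(parallel)

# Logical cores: what the OS reports, including pseudo-cores (hardware threads)
logical_cores <- detectCores(logical = TRUE)

# Physical cores: may equal the logical count, or be NA, depending on platform
physical_cores <- detectCores(logical = FALSE)

cat("Logical cores: ", logical_cores, "\n")
cat("Physical cores:", physical_cores, "\n")
```

If the two numbers differ, the difference is the pseudo-cores; for compute-heavy R work the physical count is usually the safer guide to how many workers to start.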

Some software terms

Processes and threads:

  • Process: Data and code in memory
  • One or more threads of execution within a process
  • Threads in the same process can see most of the same memory
  • Processes generally cannot peer into another process's memory

Interpreted languages: generally you can only directly work with processes

Can call libraries that invoke threads (BLAS/LAPACK)
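
The process model above can be illustrated from within R. This is a rough sketch, assuming a machine that can start two local worker processes over sockets. The variable `secret` is a made-up name for illustration. Each worker is a separate process with its own ID and its own memory, so a variable defined in the parent is invisible to the workers until it is copied over explicitly.

```r
library(parallel)

cl <- makeCluster(2)                  # start two separate R worker processes

parent_pid  <- Sys.getpid()
worker_pids <- unlist(clusterCall(cl, Sys.getpid))

secret <- 42                          # hypothetical variable; lives only in the parent
seen_before <- unlist(clusterEvalQ(cl, exists("secret")))

# Explicitly copy the variable into each worker's memory
clusterExport(cl, "secret", envir = environment())
seen_after <- unlist(clusterEvalQ(cl, exists("secret")))

stopCluster(cl)                       # always shut the workers down

cat("Parent PID: ", parent_pid, "\n")
cat("Worker PIDs:", worker_pids, "\n")
cat("Visible before export:", seen_before, "\n")
cat("Visible after export: ", seen_after, "\n")
```

The copy made by clusterExport is a snapshot: later changes to `secret` in the parent are not seen by the workers, which is the price of working with processes rather than threads.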