Beyond Single-Core R

Jonathan Dursi
http://github.com/ljdursi/beyond-single-core-R

Today's Outline

Today will look something like this:

  • How to think about scaling
  • Parallel Package
    • Multicore
      • mcparallel/mccollect/mclapply
      • parallel RNG
      • load balancing, chunking
      • pvec
    • Snow
      • makeCluster/stopCluster/clusterExport
      • clusterSplit
  • Foreach
    • chunking, iterators
  • Scalable Data Analysis Best Practices

Extra material online

  • R and Memory
  • Data file formats
  • BigMemory
  • Rdsm
  • pbdR

Not Covered

  • R in other frameworks
    • SparkR (R + Apache Spark)
    • RHadoop
    • Cool but more about the other framework than R

Thinking about scaling

Some hardware terms

Hardware:

  • Node: A single motherboard, with possibly multiple sockets
  • Processor/Socket: the silicon containing likely multiple cores
  • Core: the unit of computation; often has hardware support for pseudo-cores
  • Pseudo-cores (hardware threads): can appear to the OS as multiple cores, but share much functionality with the other pseudo-cores on the same core
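
A quick way to see the difference between cores and pseudo-cores from inside R is to ask the parallel package what it can detect. This is a minimal sketch: the `logical` argument is only honoured on some platforms, so the two numbers may come out the same, and either can be `NA`.

```r
library(parallel)

# Logical CPUs: what the OS sees, including pseudo-cores (hardware threads).
n_logical <- detectCores(logical = TRUE)

# Physical cores, where the platform can report them; on platforms that
# ignore this argument the result matches the logical count.
n_physical <- detectCores(logical = FALSE)

# detectCores() may return NA if the count cannot be determined.
cat("logical CPUs:", n_logical, " physical cores:", n_physical, "\n")
```

The detected count is an upper bound on what the machine has, not necessarily the number of cores your job is allowed to use (for example, under a batch scheduler).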

Sockets, Cores, and Hardware threads

Some software terms

Processes and threads:

  • Process: Data and code in memory
  • One or more threads of execution within a process
  • Threads in the same process can see most of the same memory
  • Processes generally cannot peer into another process's memory

Interpreted languages: generally you can only directly work with processes

Can call libraries that invoke threads (BLAS/LAPACK)
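
The process model can be seen directly from R. The sketch below assumes a Unix-alike, where `mclapply()` forks worker processes; on Windows it falls back to running serially in the parent, so the separation shown here does not apply. Each forked worker has its own process ID and its own copy of `x`.

```r
library(parallel)

# mclapply() forks on Unix-alikes; on Windows it only runs with mc.cores = 1.
on_unix <- .Platform$OS.type != "windows"
n_workers <- if (on_unix) 2L else 1L

parent_pid <- Sys.getpid()
x <- 1  # lives in the parent process's memory

worker_pids <- unlist(mclapply(1:2, function(i) {
  x <<- 100      # when forked, changes only the worker's own copy of x
  Sys.getpid()
}, mc.cores = n_workers))

print(parent_pid)
print(worker_pids)  # when forked, these differ from parent_pid
print(x)            # when forked, the parent's x is unchanged
```

Because the workers are separate processes, anything they compute has to be returned explicitly; assignments made inside a worker are lost when it exits.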

Processes vs threads

Parallel computing: faster, bigger, more

One turns to parallel computing to solve one of three problems:

My program is too slow.

Perhaps using more cores — e.g., all cores on my desktop — will make things faster.

  • Compute bound.
  • Tools:
    • parallel/multicore
    • Rdsm
    • GPUs
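
For the compute-bound case, a minimal sketch of using more cores on one machine with the parallel package follows. `slow_task` is a made-up stand-in for real work, two workers is an arbitrary choice, and the timings will vary from machine to machine.

```r
library(parallel)

# A deliberately CPU-heavy, deterministic toy task (hypothetical).
slow_task <- function(n) sum(sqrt(seq_len(n * 5e5)))

inputs <- 1:8
# Forking is unavailable on Windows, so fall back to one worker there.
n_workers <- if (.Platform$OS.type == "windows") 1L else 2L

serial_time <- system.time(
  serial_res <- lapply(inputs, slow_task)
)
parallel_time <- system.time(
  parallel_res <- mclapply(inputs, slow_task, mc.cores = n_workers)
)

# Same answers either way; only the elapsed time should differ.
print(identical(serial_res, parallel_res))
print(serial_time["elapsed"])
print(parallel_time["elapsed"])
```

For tasks this small, the cost of starting workers and collecting results can outweigh the gain, so it is worth timing both versions on the real workload.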

Rack of Computers

Parallel computing: faster, bigger, more

My problem is too big.

Perhaps splitting the problem up onto multiple computers in a cluster will give it access to enough memory to run effectively.

  • Memory bound
  • Tools:
    • parallel/snow
    • pbdR
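
For the memory-bound case, the idea is that each worker holds and processes only its own piece of the data. The sketch below uses the snow-style functions in the parallel package with two local worker processes as a stand-in; on a real cluster the workers would be started on other nodes, and the toy vector here is only for illustration.

```r
library(parallel)

# Two local worker processes (assumption; a real run would use cluster nodes).
cl <- makeCluster(2)

data_idx <- 1:10                        # stand-in for a large data set
chunks <- clusterSplit(cl, data_idx)    # one consecutive chunk per worker

# Each worker receives and reduces only its own chunk.
partial_sums <- clusterApply(cl, chunks, sum)
total <- sum(unlist(partial_sums))

stopCluster(cl)
print(total)
```

In this sketch the master still creates the full data and ships the chunks out; when the data truly does not fit on one node, each worker would instead read its own piece directly.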

Rack of Computers

Parallel computing: faster, bigger, more

There are too many computations to do - one task runs in a reasonable amount of time, but I have to run thousands!

Perhaps spreading the tasks across multiple computers in a cluster will let many of them run at the same time.

  • Tools:
    • gnu-parallel
    • parallel
    • job queues…
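
For the many-tasks case, one common pattern is to write the R script so that it runs a single task chosen by a command-line argument, and let an outside tool launch it many times. A minimal sketch, where the file name `run_task.R` and the body of `run_task` are hypothetical:

```r
# run_task.R (hypothetical file name): run one independent task per invocation.

run_task <- function(task_id) {
  set.seed(task_id)        # make each task reproducible on its own
  mean(rnorm(1000))        # stand-in for the real computation
}

# Read the task number from the command line; default to 1 if none is given.
args <- commandArgs(trailingOnly = TRUE)
task_id <- if (length(args) >= 1) as.integer(args[1]) else 1L

result <- run_task(task_id)
cat(sprintf("task %d: %f\n", task_id, result))
```

With GNU parallel, something like `seq 1000 | parallel Rscript run_task.R {}` would then run the tasks a few at a time; a job queue's array-job feature can play the same role.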