
appendix F Working with large datasets

R holds all of its objects in virtual memory. For most of us, this design decision has led to a zippy interactive experience, but for analysts working with large datasets, it can lead to slow program execution and memory-related errors.

Memory limits depend primarily on the R build (32- versus 64-bit) and the OS version involved. Error messages starting with “cannot allocate vector of size” typically indicate a failure to obtain sufficient contiguous memory, whereas error messages starting with “cannot allocate vector of length” indicate that an address limit has been exceeded. When working with large datasets, use a 64-bit build if at all possible. See ?Memory for more information.

There are three issues to consider when working with large datasets: efficient programming to speed execution, storing data externally to limit memory issues, and using specialized statistical routines designed to efficiently analyze massive amounts of data. First we’ll consider simple solutions for each. Then we’ll turn to more comprehensive (and complex) solutions for working with big data.

F.1 Efficient programming

A number of programming tips can help you improve performance when working with large datasets:

Vectorize calculations when possible. Use R’s built-in functions for manipulating vectors, matrices, and lists (for example, ifelse, colMeans, and rowSums), and avoid loops (for and while) when feasible.
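
For instance, here is a minimal sketch (with made-up data) contrasting an element-by-element loop with the equivalent vectorized call:

    set.seed(1234)
    x <- runif(1e6)

    # Loop: one R-level call per element
    loop_total <- 0
    for (i in seq_along(x)) loop_total <- loop_total + log(x[i])

    # Vectorized: the work is done in compiled code
    vec_total <- sum(log(x))

    all.equal(loop_total, vec_total)    # TRUE, but the vectorized form is far faster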

Use matrices rather than data frames (they have less overhead).
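
As a rough illustration (sizes will vary), a numeric matrix carries less structural overhead than the equivalent data frame, and element access is cheaper:

    m <- matrix(rnorm(1e6), ncol = 10)
    d <- as.data.frame(m)

    object.size(m)    # one contiguous atomic block plus a dim attribute
    object.size(d)    # adds column names, row names, and per-column structure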

When using the read.table() family of functions to input external data into data frames, specify the colClasses and nrows options explicitly, set comment.char = "", and specify "NULL" for columns that aren’t needed. This will decrease memory usage and speed up processing considerably. When reading external data into a matrix, use the scan() function instead.
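
A hedged sketch, assuming a hypothetical comma-separated file sales.csv whose third column isn’t needed, and a purely numeric file values.txt:

    # Explicit column classes avoid type guessing; "NULL" drops the column
    dat <- read.table("sales.csv", header = TRUE, sep = ",",
                      colClasses = c("integer", "numeric", "NULL", "character"),
                      nrows = 500000, comment.char = "")

    # For an all-numeric matrix, scan() is faster than read.table()
    m <- matrix(scan("values.txt", what = double(), quiet = TRUE),
                ncol = 10, byrow = TRUE)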


Correctly size objects initially, rather than growing them from smaller objects by appending values.
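
For example (a toy comparison), growing a vector inside a loop forces repeated copying, whereas allocating it at full length up front does not:

    n <- 1e5

    # Grown object: each c() call copies everything accumulated so far
    grown <- numeric(0)
    for (i in 1:n) grown <- c(grown, i^2)

    # Correctly sized object: values are written in place
    sized <- numeric(n)
    for (i in 1:n) sized[i] <- i^2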

Use parallelization for repetitive, independent, and numerically intensive tasks.
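
A minimal sketch using the parallel package included with R (the bootstrap task and core count are illustrative):

    library(parallel)
    cl <- makeCluster(detectCores() - 1)    # leave one core free

    x <- rnorm(1e6)
    boot_mean <- function(i, v) mean(sample(v, length(v), replace = TRUE))

    # 100 independent bootstrap replicates, spread across the cluster
    results <- parLapply(cl, 1:100, boot_mean, v = x)
    stopCluster(cl)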

Test programs on a sample of the data, in order to optimize code and remove bugs, before attempting a run on the full dataset.

Delete temporary objects and objects that are no longer needed. The call rm(list=ls()) removes all objects from memory, providing a clean slate. Specific objects can be removed with rm(object). After removing large objects, a call to gc() will initiate garbage collection, ensuring that the objects are removed from memory.
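
For example (the object here is a throwaway):

    scratch <- matrix(rnorm(1e7), ncol = 100)    # large temporary object
    # ... work with scratch ...
    rm(scratch)                                  # drop the reference
    gc()                                         # prompt garbage collection so the memory is returned

    # rm(list = ls()) clears every object in the workspace; use it with care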

Use the function .ls.objects() described in Jeromy Anglim’s blog entry “Memory Management in R: A Few Tips and Tricks” (jeromyanglim.blogspot.com) to list all workspace objects sorted by size (MB). This function will help you find and deal with memory hogs.

Profile your programs to see how much time is being spent in each function. You can accomplish this with the Rprof() and summaryRprof() functions. The system.time() function can also help. The profr and proftools packages provide functions that can help in analyzing profiling output.
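
A brief sketch of the base profiling tools (the function being profiled is a placeholder):

    slow_task <- function(n) {
        m <- matrix(rnorm(n * n), n, n)
        solve(m %*% t(m))                  # deliberately expensive
    }

    system.time(slow_task(500))            # overall timing

    Rprof("profile.out")                   # start collecting samples
    invisible(slow_task(500))
    Rprof(NULL)                            # stop profiling
    summaryRprof("profile.out")$by.self    # time attributed to each function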

Use compiled external routines to speed up program execution. You can use the Rcpp package to transfer R objects to C++ functions and back when more optimized subroutines are needed.
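
As a small illustration (not from the book), Rcpp’s cppFunction() compiles an inline C++ routine that can then be called like any R function:

    library(Rcpp)

    cppFunction('
    NumericVector cumsum_cpp(NumericVector x) {
        NumericVector out(x.size());
        double total = 0;
        for (int i = 0; i < x.size(); i++) {
            total += x[i];
            out[i] = total;
        }
        return out;
    }')

    x <- runif(10)
    all.equal(cumsum_cpp(x), cumsum(x))    # same result as the base R function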

Section 20.4 offers examples of vectorization, efficient data input, correctly sizing objects, and parallelization.

With large datasets, increasing code efficiency will only get you so far. When you bump up against memory limits, you can also store your data externally and use specialized analysis routines.

F.2 Storing data outside of RAM

Several packages are available for storing data outside of R’s main memory. The strategy involves storing data in external databases or in binary flat files on disk and then accessing portions as needed. Several useful packages are described in table F.1.

Table F.1 R packages for accessing large datasets

Package                        Description

bigmemory                      Supports the creation, storage, access, and manipulation of
                               massive matrices. Matrices are allocated to shared memory and
                               memory-mapped files.

ff                             Provides data structures that are stored on disk but behave as
                               if they’re in RAM.

filehash                       Implements a simple key-value database where character string
                               keys are associated with data values stored on disk.

ncdf, ncdf4                    Provide an interface to Unidata netCDF data files.

RODBC, RMySQL, ROracle,        Each provides access to external relational database
RPostgreSQL, RSQLite           management systems.

These packages help overcome R’s memory limits on data storage. But you also need specialized methods when you attempt to analyze large datasets in a reasonable length of time. Some of the most useful are described next.

F.3 Analytic packages for out-of-memory data

R provides several packages for the analysis of large datasets:

The biglm and speedglm packages fit linear and generalized linear models to large datasets in a memory-efficient manner. This offers lm() and glm() type functionality when dealing with massive datasets.
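
A minimal sketch of updating a biglm fit chunk by chunk (the chunks here are simulated; in practice they’d be read from disk a piece at a time):

    library(biglm)

    chunk1 <- data.frame(y = rnorm(1e5), x1 = rnorm(1e5), x2 = rnorm(1e5))
    chunk2 <- data.frame(y = rnorm(1e5), x1 = rnorm(1e5), x2 = rnorm(1e5))

    fit <- biglm(y ~ x1 + x2, data = chunk1)    # fit on the first chunk
    fit <- update(fit, chunk2)                  # fold in the next chunk
    summary(fit)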

Several packages offer analytic functions for working with the massive matrices produced by the bigmemory package. The biganalytics package offers k-means clustering, column statistics, and a wrapper to biglm. The bigrf package can be used to fit classification and regression forests. The bigtabulate package provides table(), split(), and tapply() functionality, and the bigalgebra package provides advanced linear algebra functions.
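
For example, a hedged sketch of a file-backed big.matrix analyzed with biganalytics (the file names and data are arbitrary):

    library(bigmemory)
    library(biganalytics)

    x <- big.matrix(nrow = 1e6, ncol = 3, type = "double",
                    backingfile = "big.bin", descriptorfile = "big.desc")
    x[, 1] <- rnorm(1e6)
    x[, 2] <- rnorm(1e6)
    x[, 3] <- rnorm(1e6)

    colmean(x)                          # column statistics without loading the matrix into RAM
    fit <- bigkmeans(x, centers = 3)    # k-means directly on the big.matrix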

The biglars package offers least-angle regression, lasso, and stepwise regression for datasets that are too large to be held in memory, when used in conjunction with the ff package.

The data.table package provides an enhanced version of data.frame that includes faster aggregation; faster ordered and overlapping range joins; and faster column addition, modification, and deletion by reference by group (without copies). You can use the data.table structure with large datasets (for example, 100 GB in RAM), and it’s compatible with any R function expecting a data frame.
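
A short sketch of the data.table syntax (the table here is simulated and small enough to fit in memory):

    library(data.table)

    dt <- data.table(store = sample(LETTERS[1:5], 1e6, replace = TRUE),
                     units = rpois(1e6, 10),
                     price = runif(1e6, 1, 50))

    dt[, revenue := units * price]                        # add a column by reference (no copy)
    dt[, .(total = sum(revenue), n = .N), by = store]     # fast grouped aggregation
    setkey(dt, store)                                     # keyed subsets and joins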

Each of these packages accommodates large datasets for specific purposes and is relatively easy to use. More comprehensive solutions for analyzing data in the terabyte range are described next.

F.4 Comprehensive solutions for working with enormous datasets

At least five projects have been designed to facilitate the use of R with terabyte-class datasets. Three are free and open source (RHIPE, RHadoop, and pbdR), and two are commercial products (Revolution R Enterprise with RevoScaleR and Oracle R Enterprise). Each requires some familiarity with high-performance computing.

The RHIPE package (www.datadr.org/) provides a programming environment that deeply integrates R and Hadoop (a free Java-based software framework for the processing of large datasets in a distributed computing environment). Additional software from the same authors provides “divide and recombine” methods and data visualization for very large datasets.

The RHadoop project offers a collection of R packages for managing and analyzing data with Hadoop. The rmr package provides Hadoop MapReduce functionality from within R, and the rhdfs and rhbase packages support access to the HDFS file system and HBase datastores. A wiki (https://github.com/RevolutionAnalytics/RHadoop/wiki) describes the project and provides tutorials. Note that RHadoop packages must be installed from GitHub rather than CRAN.

The pbdR (Programming with Big Data in R) project enables high-level data parallelism in R through a simple interface to scalable, high-performance libraries (such as MPI, ScaLAPACK, and netCDF4). The pbdR software also supports the single program, multiple data (SPMD) model on large-scale computing clusters. See http://r-pbd.org/ for details.

Revolution R Enterprise (www.revolutionanalytics.com) is a commercial version of R that includes RevoScaleR, a package supporting scalable data analyses and high-performance computing. RevoScaleR uses a binary XDF data file format to optimize streaming data from disk to memory, and it provides a series of big-data algorithms for common statistical analyses. You can perform data-management tasks and obtain summary statistics, cross-tabulations, correlations and covariances, nonparametric statistics, linear and generalized linear regression, stepwise regression, k-means clustering, and classification and regression trees on terabyte-sized datasets. Additionally, Revolution R Enterprise can be integrated with Hadoop (via RHadoop packages) and IBM Netezza (via a plug-in for IBM PureData System for Analytics). At the time of this writing, students and professors in academic settings can obtain a free software subscription (excluding the IBM components).

Finally, Oracle R Enterprise (www.oracle.com) is a commercial product that makes the R environment available for use with massive datasets stored in Oracle databases and Hadoop. Oracle R Enterprise is part of Oracle Advanced Analytics, and it requires an installation of Oracle Database Enterprise Edition. Virtually all of R’s functionality, including the thousands of contributed packages, can be applied to terabyte-sized data problems using the Oracle R Enterprise interface. This is a relatively expensive but comprehensive solution, and it will appeal primarily to large organizations with deep pockets.

Working with datasets in the gigabyte-to-terabyte range can be challenging in any language. Each of these approaches comes with a significant learning curve. Of these, RevoScaleR is perhaps the easiest to learn and install. (Important disclaimer: I teach Revolution R courses as an adjunct instructor and may be biased.)

Additional information on the analysis of large datasets is available in the CRAN task view “High-Performance and Parallel Computing with R” (http://cran.r-project.org/web/views). This is an area of rapid change and development, so be sure to check back often.
