Today I got a license for the new Stata/MP 13 (dual core), so I decided to run some quick comparisons with R (RStudio). Many more tests will come in the following weeks, but today I focused only on the basics: processing text files, that is, reading and writing raw datasets. I have to confess that the results surprised me: R outperformed Stata in most of the tasks I ran. The Stata version I'm using is a multicore one, not the basic and cheaper Intercooled version (Stata/IC), yet the results still go against Stata. Interestingly, my experience as a Stata and R user had made me believe that Stata was much faster than R at trivial tasks, including loading data tables into memory. The evidence I got today contradicts that opinion. Of course, a multicore version of the package doesn't help much here, since parallel computation pays off only for repetitive tasks that take at least one or two seconds each (there are quite a few posts about this topic, including my own here).

The Results:

Any simple job starts by feeding the statistical package with raw data. I tested how quickly these packages get through a semicolon-delimited text file of about 450MB. As the following output shows, Stata took 134.37 seconds to read this raw data, while R imported the same file in only 102.49 seconds. In this simple but critical task, then, R outperformed Stata, loading the plain-text file in about 24% less time.
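
A minimal sketch of how an import like this can be timed in R. It assumes the base read.table function with sep = ";" (the post does not show the exact call or its options), and it generates a small stand-in file, since the real 450MB dataset is not available here. The names demo and demo_file are hypothetical.

```r
# Build a small semicolon-delimited file as a stand-in for the real dataset.
set.seed(1)
demo <- data.frame(
  id    = 1:10000,
  value = rnorm(10000),
  label = sample(c("a", "b", "c"), 10000, replace = TRUE)
)
demo_file <- tempfile(fileext = ".txt")
write.table(demo, demo_file, sep = ";", row.names = FALSE, quote = FALSE)

# Time the import; the "elapsed" entry is wall-clock seconds.
timing <- system.time(
  imported <- read.table(demo_file, sep = ";", header = TRUE,
                         stringsAsFactors = FALSE)
)
print(timing["elapsed"])
```

To time a real file, point demo_file at it and keep the same system.time() wrapper.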

[Screenshots: Stata importing output; R importing output]

Reading raw text into memory can be tricky, since each package may use a different strategy for loading different sorts of data. But how about testing for differences in loading their native formats? I tested how quickly they load the same data after converting it to their own file formats (.Rdata and .dta). R did quite well, loading its file in 19.23 seconds, while Stata took 89.66 seconds. This means that R needed about 79% less time than Stata or, to put it differently, my Stata/MP 13 took about 4.7 times as long as R to load the "same data".
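
The native-format test on the R side can be sketched as follows, assuming base save() and load() (the post does not say which calls were used). The data frame demo and the file name rdata_file are hypothetical stand-ins for the real dataset.

```r
# Stand-in data frame; the real dataset is not available here.
set.seed(1)
demo <- data.frame(id = 1:10000, value = rnorm(10000))

# Save in R's native format, then drop the object from memory.
rdata_file <- tempfile(fileext = ".RData")
save(demo, file = rdata_file)
rm(demo)

# load() restores the object under its original name and returns that name.
timing <- system.time(restored <- load(rdata_file))
print(timing["elapsed"])
```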

How about exporting data already in memory to disk? When it comes to writing data from memory back to disk as delimited text, Stata finally outperformed R. While Stata took 67.25 seconds to write a 458MB raw text file, R needed 5.7 seconds more to do the same (72.93 seconds). Stata therefore exported the data about 8% faster than R did.

Finally, when exporting data from memory to disk in their native formats, R outperformed Stata again, this time by more than a minute. While Stata took 118.35 seconds, R took only 42.53 seconds. That is, R took roughly 2/3 of a minute to do the job, while Stata took roughly 2 minutes. Put differently, the Stata Corporation package took 2.78 times as long as the free one. The difference is huge in percentage terms: R needed about 64% less time than Stata, since (118.35 - 42.53) / 118.35 is roughly 0.64. It is certainly not trivial.
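
The ratios and percentages can be recomputed from the timings reported in the post. The sketch below assumes one particular definition of "faster": the time saved by the quicker package as a share of the slower package's time. The column names are hypothetical.

```r
# Timings in seconds, copied from the post.
timings <- data.frame(
  task  = c("import text", "load native", "export text", "save native"),
  stata = c(134.37, 89.66, 67.25, 118.35),
  r     = c(102.49, 19.23, 72.93, 42.53)
)

# Ratio > 1 means Stata took longer than R.
timings$stata_over_r <- round(timings$stata / timings$r, 2)

# Time saved by the faster package, as a percentage of the slower one's time.
slower <- pmax(timings$stata, timings$r)
faster <- pmin(timings$stata, timings$r)
timings$pct_less_time <- round(100 * (slower - faster) / slower, 1)

print(timings)
```

Other definitions (for example, throughput gained relative to the slower package) give different percentages, so it helps to state which one is meant.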

R and Stata are comparable in the tests I performed because both packages have to load all the data into memory before performing any analysis. However, load times may differ because R and Stata use distinct native formats, which also affects the final size of the file. Actually, one of the things I like most about R is how compactly it stores data. For instance, the example dataset I'm using for these tests takes 458MB on disk as a raw text file. If I store it in Stata format, the resulting file needs 1.16GB of disk, which is 2.59 times as much space for the same amount of information. Stored in R format (.Rdata), however, it needs only 54.3MB.
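
A small sketch of the size comparison on the R side, assuming base write.table() and save(); save() compresses its output by default, which is one reason .Rdata files tend to be small. The data and column names here are hypothetical, so the ratio it prints will not match the figures above.

```r
# Stand-in data with repetitive values (hypothetical columns).
set.seed(1)
demo <- data.frame(
  id    = 1:50000,
  state = sample(c("SP", "RJ", "MG"), 50000, replace = TRUE),
  votes = sample(0:500, 50000, replace = TRUE)
)

txt_file   <- tempfile(fileext = ".txt")
rdata_file <- tempfile(fileext = ".RData")

# Write the same data as delimited text and as a native R file.
write.table(demo, txt_file, sep = ";", row.names = FALSE, quote = FALSE)
save(demo, file = rdata_file)  # compressed by default

# Compare the sizes on disk, in bytes.
sizes <- c(text = file.size(txt_file), rdata = file.size(rdata_file))
print(sizes)
print(round(sizes[["text"]] / sizes[["rdata"]], 2))
```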

All in all, R outperformed Stata in 3 out of 4 trivial tasks. Stata outperformed R only when writing data from memory to disk as delimited text, and the difference wasn't that big: 5.7 seconds is too small for Stata to celebrate. Can anyone prove me wrong? Note that the results reported here are based on a one-shot test only; the ideal design for benchmarking differences in performance would be several repetitions of the same task, so as to obtain an average. The idea has been set forth.
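
A repeated-run design along those lines could look like the sketch below. It assumes base read.table and an arbitrary choice of 5 runs; the stand-in file and the names n_runs and elapsed are hypothetical.

```r
# Stand-in file; replace with the real dataset to reproduce the comparison.
set.seed(1)
demo <- data.frame(id = 1:10000, value = rnorm(10000))
demo_file <- tempfile(fileext = ".txt")
write.table(demo, demo_file, sep = ";", row.names = FALSE)

# Repeat the import several times and keep the elapsed time of each run.
n_runs <- 5
elapsed <- replicate(
  n_runs,
  system.time(read.table(demo_file, sep = ";", header = TRUE))[["elapsed"]]
)

# Summarise the runs; with a real file, the first run may be slower
# because the operating system has not cached it yet.
print(elapsed)
print(c(mean = mean(elapsed), sd = sd(elapsed)))
```

Reporting the individual runs alongside the mean keeps the first, uncached run visible instead of averaging it away.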

14 Thoughts on “R vs Stata: Importing and Saving Datasets”

  1. Thomas Speidel on January 7, 2014 at 4:17 am said:

    Interesting results. Please consider posting this on Statalist. I'm sure the folks at Stata will take a look.

  2. Chris Kennedy on January 7, 2014 at 4:29 am said:

    This is an interesting line of review, glad that you have brought it up. However, your methodology can easily be improved by increasing the n size of such a test - no reliable benchmark is based on an n of 1. For example, you are vulnerable to random differences in background CPU usage, or any other latent process just like any other statistical measurement. Increase your sample size to 50-500 replications and run a statistical test to compare the distributions, then you'll be on the right track for benchmarking statistical software.

    I see that you reference this constraint at the conclusion of the article - this caveat should be at the beginning instead, so that it isn't missed by readers. It is somewhat naive to evaluate statistical software without actually using statistical analysis, as the people who will read this article will know its methodology is entirely vulnerable to reporting point estimates that are completely within the standard errors of the distributions.

    • Daniel on January 7, 2014 at 1:19 pm said:

      Thanks for stopping by and commenting, Chris. Basically, I'm aware of the limitations, but I got so excited by the first results that I wanted to spread the word. I guess the next step is to design a proper test.

    • Matt Dowle on January 7, 2014 at 2:03 pm said:

      Chris, iiuc the file takes of the order 2 minutes to load. Sure, a second run is often faster due to various cache effects, and a third consecutive repeat of loading the same file may be instructive. But does he really need to repeat loading the file 50-500 times? It's often the first run (without the benefit of cache effects) that's the most relevant to a user loading a file once or loading a set of files. I think 3 consecutive runs are sufficient to benchmark this task.

      • Daniel on January 7, 2014 at 4:03 pm said:

        Matt, if what you're arguing holds for all packages, then it doesn't matter much for the comparison itself, since both R and Stata will benefit from learning the data structure. The critical thing here, and this is my flaw, is to obtain an average over the runs.

  3. Stas K on January 7, 2014 at 2:20 pm said:

    The difference in the sizes of the stored data sets suggests that you did not try to optimize the storage format for Stata. You may have read everything as the -double- type, which takes 8 bytes per element, and this is very wasteful; you can do better with more economical storage types like -int- and -byte-. (SPSS is even worse, it does not seem to be aware of these economical types at all.) Type -compress- in Stata before saving, and compare the differences. The difference in the size of the saved data set would take care of the difference in I/O times.

    As another, more subtle statistical comment on top of what Chris said -- if you are a proficient R user with years of experience, you should be comparing your workflow with that of a proficient Stata user that would know a comparable suite of data management and statistical analysis tricks. I am a novice R user, and in my hands, R simply cannot beat Stata because I am so good in Stata :).

    • Daniel on January 7, 2014 at 3:58 pm said:

      Thanks for commenting. I may produce a more systematic comparison with a replicable example in the next days.

    • In this case the compress function should be part of the evaluation process, which would slow Stata down significantly further.

      I use Stata in my daily work life, but I am slightly annoyed by the name chosen for the function, 'compress'. Stata's dta format is such a space hog. Using 7zip on an output dta file gives me compression rates of more than 90% (99% in extreme cases, e.g. 431.6MB down to 1.3MB), and that is AFTER using 'compress'. Nice for mailing, but I need to send a warning about the size it's gonna expand to once unpacked.

      The test done here is especially interesting because Stata's program structure necessitates much more reading and writing of data from the hard drive than R does once you do data manipulation, which apparently aggravates the lack of speed.

      So thank you for the test.

      • Daniel on January 13, 2014 at 5:58 pm said:

        Thanks for stopping by, Felix. I plan to do a later test incorporating the alternatives and comments given by readers.

  4. In R, try this:
    library(data.table)
    system.time(cand_Brazil<-fread("cand_Brazil.txt"))

    =D

    • Daniel on January 7, 2014 at 4:57 pm said:

      Yes, I know data.table is much faster than the default read.table function, though the goal wasn't to find the optimal way to import data, but to compare Stata and R. Thanks for pointing that out; using data.table::fread the time drops to 10 seconds. That is, about 10 times faster than the base function.

  5. Billy B on January 7, 2014 at 8:09 pm said:

    In my experience, I would disagree completely with the results you've found. On several occasions, I've crashed RStudio trying to read in a larger dataset (e.g., 500,000 - 1,000,000 observations with ~ 200 variables), although I did not have the same problem working with the file in Stata. Storage format optimization is certainly going to be a major determining factor in any type of I/O benchmarking, as well as the complexity of the data being stored (e.g., is everything a string format, double/quad precision, byte formats, etc…). If your goal is to simulate something along those lines, you may want to vary a few different factors to get a reasonable result from your experiment. For example, you would want to vary the number of variables in the file, the number of observations, the proportion of string to numeric variables, the length of string variables, the encoding of the strings, the proportion of integer vs floating point numeric variables, # of processors being used for the I/O operations (keep in mind that not everything in MP is parallelized), amount of RAM available, etc…

    Although I've not had the greatest experiences with R, I would hardly allow a single experience to color my judgment of it as a platform overall. The danger of things like this is that often naive users will fail to understand that a single test is insufficient in terms of the quality of the inference that can be based on it.

    • Daniel on January 8, 2014 at 2:33 am said:

      Thanks for your thoughts, Billy. The motivation for writing this post, despite such a naive experiment, was that the result surprised me. Before yesterday I had a firm opinion about Stata's superiority at reading files. Two years ago, I suffered a lot importing files of 800K - 1300K observations. R simply couldn't handle them on the machine I had, while Stata could. Somehow, the recent improvement in R has been great.

      • Same here. I was using Stata a lot a few years ago, and every time I tried to replicate a large-N analysis in R, it would crash it or slow down the whole machine. It would be worse in RStudio, and better in Terminal, running R from the shell.

        All this has also gone away in my experience, and RStudio now opens a million lines from a .rda file in very little time (and you gotta love the fact that .rda files are compressed). I still run R from the shell for the heavier stuff.
