As part of a suite of articles on big data, Szymon Walkowiak, who worked with us as a data scientist, writes about the capabilities of R for researchers who wish to process and analyse big data.
R, the open-source data analysis environment and programming language, allows users to carry out a number of tasks that are essential for the effective processing and analysis of big data. One of the advantages of R is that it facilitates data management processes such as transformations, subsetting and “cleaning”, and helps users carry out exploratory data analysis and prepare the data for statistical testing. R also contains numerous ready-to-use machine learning and statistical modelling algorithms, which allow users to create reproducible research and develop data products.
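To make these data management steps concrete, here is a minimal sketch in base R of the kind of transformation, subsetting and cleaning workflow described above; the data frame and all variable names are made up for illustration.

```r
# Illustrative data frame with some missing values (all names are hypothetical)
df <- data.frame(
  id    = 1:6,
  score = c(4.5, NA, 3.2, 5.1, NA, 2.8),
  group = c("a", "b", "a", "b", "a", "b")
)

# "Cleaning": drop rows with missing scores
clean <- df[!is.na(df$score), ]

# Subsetting: keep only observations from group "a"
group_a <- subset(clean, group == "a")

# Transformation: add a centred score column
group_a$centred <- group_a$score - mean(group_a$score)

# Quick exploratory look at the result
summary(group_a$score)
```

Each of these steps uses only base R; in practice many analysts reach for packages such as dplyr or data.table for the same operations on larger data sets.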
R has grown in popularity over the past few years due to its unrivalled statistical capabilities and its active community of users. Although the rise of big data temporarily threatened this growth – R traditionally struggled with large data sets because of its in-memory limitations – R is now one of the 20 most popular general-purpose programming languages (according to the TIOBE Programming Community Index 2014). It is also, along with Python and SQL, among the analytical tools most favoured by data science recruiters (for more information visit KDnuggets). Recent developments in R packages, open-source big data projects such as Apache Hadoop, and cloud computing platforms – for example, Microsoft Azure or Amazon Web Services – have made it possible for individual users and researchers to manage, analyse and visualise terabytes or even petabytes of data in a user-friendly, inexpensive and resource-efficient manner.