Apr 9, 2018

Working With Large Datasets

Working with very large datasets, ones too big to fit into memory, is a common and important task. Below are some techniques and tools for working efficiently with large datasets:

1) Consider exploring large datasets on a machine with more RAM, a faster processor and/or more cores. Instead of investing in physical hardware, consider the often cheaper option of renting virtual machines in the cloud.

2) When building models with complex Machine Learning algorithms on huge data, write smarter code:
 a) Vectorize your code and avoid for loops.
 b) Allocate memory efficiently by creating variables with the appropriate number of elements up front, rather than growing them inside a loop.
 Example in R: var1 <- numeric(10000) (if you know the number of elements required in advance)
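
The same ideas carry over to Python with NumPy. A minimal sketch (the array sizes are arbitrary):

  import numpy as np

  # Vectorized: one call operates on the whole array,
  # instead of a Python-level for loop over elements.
  x = np.random.rand(1_000_000)
  y = x * 2.0 + 1.0              # no explicit loop

  # Pre-allocate when values must be filled incrementally.
  out = np.empty(10_000)         # allocate once, size known in advance
  for i in range(10_000):
      out[i] = i ** 2            # fill in place, no repeated re-allocation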

3) You can work around the memory constraint by storing your data in a database and loading only selected subsets that fit in memory.
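
For example, SQLite plus pandas lets the database do the filtering, so only the rows you need ever reach memory. A minimal sketch (the file sales.db, the table transactions and its columns are hypothetical):

  import sqlite3
  import pandas as pd

  # Connect to an on-disk database; the full table never has to fit in RAM.
  conn = sqlite3.connect("sales.db")

  # Let the database filter, and load only the rows and columns you need.
  query = "SELECT customer_id, amount FROM transactions WHERE amount > 1000"
  subset = pd.read_sql_query(query, conn)
  conn.close()

  print(subset.shape)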

4) When exploring data or training a first model, you may not need the complete dataset. Use sub-samples of the available data instead of holding the whole dataset in memory at once, but ensure that your sub-sample is representative of the complete dataset.
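
For instance, pandas can read a random sample of a large CSV without ever loading the whole file. A minimal sketch (large.csv and the 10% sampling rate are made up):

  import random
  import pandas as pd

  # Keep the header (row 0) and keep each data row with probability 0.10,
  # so roughly 10% of the rows end up in memory.
  sample = pd.read_csv(
      "large.csv",
      skiprows=lambda i: i > 0 and random.random() > 0.10,
  )

  print(len(sample))

To check that the sub-sample is representative, compare the proportions of important groups against the full data, or sample within each group (stratified sampling).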

5) Divide and conquer: split the data into fixed-size chunks that fit into memory, analyze them one at a time, and combine the results.
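
For example, pandas can stream a CSV in fixed-size chunks and combine the per-chunk results. A minimal sketch (large.csv and the column amount are hypothetical):

  import pandas as pd

  total = 0.0
  count = 0

  # Only one chunk of 100,000 rows is held in memory at a time.
  for chunk in pd.read_csv("large.csv", chunksize=100_000):
      total += chunk["amount"].sum()
      count += len(chunk)

  print("overall mean:", total / count)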

6) Use parallel computation instead of serial computation, for example with split-apply-combine or MapReduce techniques.
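
A minimal split-apply-combine sketch using Python's multiprocessing module (the worker function and the data are made up):

  from multiprocessing import Pool

  import numpy as np

  def chunk_sum(chunk):
      # "Apply" step: runs in a separate worker process.
      return float(np.sum(chunk))

  if __name__ == "__main__":
      data = np.random.rand(4_000_000)

      # "Split" step: break the data into independent pieces.
      chunks = np.array_split(data, 8)

      # Apply in parallel, then "combine" the partial results.
      with Pool(processes=4) as pool:
          partial_sums = pool.map(chunk_sum, chunks)

      print("total:", sum(partial_sums))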

7) If your data is very large, consider tools like Spark and HDFS. The Spark shell enables interactive data analysis using Python, and machine learning is supported by the MLlib library. If the dataset is so huge that R or Pandas cannot handle it, and analyzing a sub-sample is not an option for you, then consider using Spark.
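
A minimal PySpark sketch (big.csv and its column names are hypothetical; assumes pyspark is installed):

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  # Spark keeps the data partitioned, so it does not need to fit
  # into the memory of a single machine.
  spark = SparkSession.builder.appName("large-data-demo").getOrCreate()

  df = spark.read.csv("big.csv", header=True, inferSchema=True)

  # Aggregations run in parallel across the partitions.
  summary = df.groupBy("category").agg(F.avg("amount").alias("avg_amount"))
  summary.show()

  spark.stop()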

If you know of any other tools or techniques for working with very large datasets, feel free to comment below.
