How to download really big data sets for big data testing

Sandipan Ghosh
3 min read · Aug 3, 2020

For a long time, I have been working with big data technologies like MapReduce, Spark, and Hive, and very recently I have started working on AI/ML. For different types of big data framework testing and text analysis, I have to do a large amount of data processing. We have a Hadoop cluster where we usually do this.

However, recently I had a situation where I had to crunch 100 GB of data on my laptop. I didn't have the opportunity to put this data on our cluster, since it would require a lot of approvals, working with the admins to get space, opening up the firewall, etc.

So I took up the challenge to get it done using my laptop. My system only has 16 GB of RAM and an i5 processor. Another challenge was that I do not have admin access, so I cannot install any software without approval. Luckily, though, I had Docker installed.

For processing the data, I can use Spark in local mode, since Spark supports parallel processing across CPU cores. As the i5 has 4 cores (8 threads with hyper-threading), Spark could run the entire job as up to 8 parallel tasks.
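To make that concrete, here is a minimal sketch of starting Spark in local mode (assuming PySpark is available, e.g. inside a Docker container; the app name, memory setting, and file path are placeholders, not from the original setup):

from pyspark.sql import SparkSession

# "local[*]" runs Spark with one worker thread per available CPU core.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("laptop-big-data")           # placeholder name
    .config("spark.driver.memory", "8g")  # leave headroom on a 16 GB laptop
    .getOrCreate()
)

# Example read: one of the monthly yellow cab CSVs downloaded below.
df = spark.read.csv("yellow_taxi_data_2019-01.csv", header=True, inferSchema=True)
print(df.count())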

How to get the Data: Yellow cab data

Now to the real topic: where do we get really big open-source data, around 100 GB in size? We need both structured (CSV) and semi-structured (JSON) data.

Source 1: After a little research, I found out that we can download the entire yellow cab trip data from the NYC government data site. Here is the link.

This does need a little bit of effort, as all the data is split into monthly CSV files, each about 2 GB in size. So I wrote a Python program that downloads each month's CSV from the website into a local directory and shows a little progress bar on the screen.

import urllib.request
from tqdm import tqdm

# tqdm-based progress bar that urlretrieve can report into.
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)

def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

# One CSV per month, from 2009 through 2019.
year = list(range(2009, 2020))
month = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
for x, y in [(x, y) for x in year for y in month]:
    print("fetching data for %s, %s" % (x, y))
    link = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_%s-%s.csv" % (x, y)
    file_name = '/home/sandipan/Documents/yellow_taxi/yellow_taxi_data_%s-%s.csv' % (x, y)
    print(link, file_name)
    download_url(link, file_name)
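One optional tweak, not part of the script above: the loop fetches well over a hundred large files, so it may get interrupted partway through. Skipping files that already exist makes it safe to simply re-run:

import os

for x, y in [(x, y) for x in year for y in month]:
    file_name = '/home/sandipan/Documents/yellow_taxi/yellow_taxi_data_%s-%s.csv' % (x, y)
    if os.path.exists(file_name):  # already fetched in a previous run
        continue                   # note: delete any partially downloaded file first
    link = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_%s-%s.csv" % (x, y)
    download_url(link, file_name)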

JSON data

How about the semi-structured data? Well, we can use ‘Open Library’ data. The Open Library is an initiative intended to create “one web page for every book ever published.” You can download their dataset, which is about 20 GB of compressed data.

We can download the data very easily using wget, then decompress it with gunzip (the file is gzip-compressed, not a zip archive).

wget --continue http://openlibrary.org/data/ol_cdump_latest.txt.gz
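Then decompress it:

gunzip ol_cdump_latest.txt.gz

To get a feel for the records, here is a small sketch that reads the first line of the decompressed dump. It assumes each line is tab-separated with the JSON document in the last field; treat that as an assumption and check against your copy of the file.

import json

# Assumption: each line is tab-separated and the JSON record is the last field.
with open('ol_cdump_latest.txt', 'r', encoding='utf-8') as f:
    first_line = f.readline()
    record = json.loads(first_line.split('\t')[-1])
    print(record.get('key'), sorted(record.keys()))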

Well, that's all. Have fun with all the data.

In my next post, I will show how to process the data locally.


Sandipan Ghosh

Big data solution architect & lead data engineer with experience in building data-intensive applications and tackling challenging architectural and scalability problems.