Cloud Computations — Quick data analysis with AWS Athena, Glue and Databricks Spark

Sandipan Ghosh
6 min read · Oct 3, 2021

Throughout my career, I have repeatedly found myself fixing failing production jobs. Most of the time, debugging involved analyzing the input data to figure out what was wrong with the raw data.

For the last ten years, I have also been doing data analysis to provide quick business insights. This often involves running complex queries on large sets of data.

Most of the time, we do not have access to the production environment to debug a job or install the required packages. It’s also advisable not to debug jobs in the production environment as it might have a negative performance impact or completely break the job.

We have been using a few tools for debugging, mainly Hive, Presto, Tableau, etc. These tools are not always the best option, as debugging jobs that fail because of data issues often requires custom code, SerDes, or packages. I like to use Spark; however, Spark does not have a pretty GUI, which data analysts love. We can definitely use an IPython notebook or Zeppelin, but both tools require a considerable amount of time to set up, and they are often not available in production.

So, I was looking for an alternative that can be set up quickly, has a pretty GUI, and scales easily.

I was looking into the cloud (AWS) and came across Athena and Glue, so I tried them on my use case. For input data, I used Twitter data (5.5 GB at first, and later 20 GB to test scalability) that I had collected last year. This data is complex enough to serve our test purpose.

Athena:

  • Serverless, so no need to set up and manage the infrastructure
  • Supports standard SQL, so there is no need to learn a new language
  • Stores the data in S3 and uses that same data for analysis in Athena; no need to pre-process the data
  • Hive in the backend, so it supports all Hive properties and a wide variety of data formats (CSV, JSON, ORC, Avro, Parquet, etc.)
  • We can easily use a Hive serializer/deserializer (SerDe) to parse custom data formats (a DDL sketch follows this list)
  • As this is a managed service, the performance might not stay the same over time
  • External table — by default it creates an external table. You cannot create an internal (managed) table, even if you change the table-creation statement. This is done to avoid accidentally deleting the table and dropping the entire dataset from S3
  • Athena saves the results of every query we run in a temp/work dir under the user-specified S3 location. We need some mechanism to delete these dirs/files at regular intervals, or this will turn into a monster eating the entire S3 budget
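
As a rough illustration of the external-table and SerDe points above, here is a minimal sketch of a CREATE EXTERNAL TABLE statement over the raw JSON, submitted through boto3. The database name, region, result location, and the handful of columns shown (out of the ~300 in the Twitter data) are hypothetical placeholders; the same DDL can just as well be pasted into the Athena query editor.

import boto3

# Hypothetical external table over the raw Twitter JSON in S3, parsed by a Hive JSON SerDe.
# Only a few of the ~300 columns are listed here for brevity.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS corona_tweets (
  id_str string,
  text string,
  timestamp_ms string,
  `user` struct<screen_name:string,location:string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://sandipan-atena/corona/'
"""

athena = boto3.client("athena", region_name="us-east-1")  # region is an assumption
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "default"},
    # Every query writes its results under this location; clean it up regularly.
    ResultConfiguration={"OutputLocation": "s3://sandipan-atena/athena-results/"},
)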

To test this service, I kept the input data in an S3 bucket and then created an Athena table from it. I chose JSON as the data format, and the UI asked me to enter the column names manually. This is a challenging job for my use case, as the JSON has over 300 columns, and entering them all by hand is tedious. I had hoped the UI could parse the file (or use the first row as column names for CSV) to infer the schema; that would have been an excellent facility for creating a table with many columns. However, scanning files to infer a schema can take longer and cost more if you have a sizeable amount of data. I've found that writing the DDL manually is a better option than using the GUI.

I also noticed that the GUI cannot run multiple queries together like a script. This is ridiculous; if I want to run a script, I must run one query at a time.

Fig 1: Athena with creating columns manually

I started looking for an alternative solution and came across AWS GLUE.

GLUE:

  • Serverless, so no need to set up and manage the infrastructure
  • It is a data integration/ETL/metadata management tool, so it can be used in a wide range of use cases
  • Can read data from S3, parse it, and load it into an Athena table
  • Supports a wide variety of data formats (CSV, JSON, ORC, Avro, Parquet, etc.)
  • Can schedule a crawler to pick up new data from a pre-defined location, then parse and load it into the proper Hive location or partition (a boto3 sketch follows this list)
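
To make the crawler step concrete, here is a hedged boto3 sketch of creating and starting a crawler over the raw JSON. The crawler name, IAM role, catalog database, and schedule are hypothetical placeholders; in practice the same thing can be configured through the Glue console.

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

# Create a crawler that scans the raw JSON in S3 and registers the resulting
# table in the Glue Data Catalog, where Athena can query it.
glue.create_crawler(
    Name="corona-json-crawler",                # hypothetical name
    Role="AWSGlueServiceRole-corona",          # hypothetical IAM role with S3 access
    DatabaseName="corona_db",                  # target catalog database
    Targets={"S3Targets": [{"Path": "s3://sandipan-atena/corona/"}]},
    Schedule="cron(0 2 * * ? *)",              # optional: pick up new data daily at 02:00 UTC
)

# Run it once immediately instead of waiting for the schedule.
glue.start_crawler(Name="corona-json-crawler")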

So I set up a Glue crawler to load the JSON data into a Hive table and then queried that table in Athena. However, Glue did not parse the nested JSON the way I was expecting. I could have written a Hive query to parse the nested data and flatten the structure, but that would have taken more time and would quickly become unmanageable. Also, querying the table in Athena produced the error below; it looks like the data was not parsed correctly (I know this is a valid file, as I had parsed it before and used it for machine learning).

Fig 2: Schema problem with the table created by Glue

I had used Databricks before and wanted to try it for this use case. Databricks has a free Community Edition and can be configured to read data from S3. The Community Edition has a data limit of 5 GB, but if we keep our data in S3, this limitation does not apply.

Once you log into the Community Edition, create a cluster. I chose the configuration below. This step takes 5–10 minutes to finish.

Fig 3: Creating Databricks community edition cluster

For the Databricks cluster to access S3, we need an IAM user/role with read and write access to S3. We need both read and write because we want Databricks to write the output data back to S3.
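
If you prefer to script this instead of clicking through the console (the console flow is shown in the figures below), a minimal sketch with boto3 might look like the following. The policy name is made up, and the bucket is the one mounted in the notebook code further down; adjust both to your setup.

import json
import boto3

# Minimal S3 read/write policy for the bucket used later in the notebook.
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::sandipan-atena"],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": ["arn:aws:s3:::sandipan-atena/*"],
        },
    ],
}

# Create the policy, then attach it to the IAM user/role whose keys Databricks will use.
iam = boto3.client("iam")
iam.create_policy(
    PolicyName="databricks-s3-read-write",  # hypothetical name
    PolicyDocument=json.dumps(s3_policy),
)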

Fig 4: IAM role page in AWS
Fig 5: IAM — policies in AWS

We can use the access key ID and secret key to read/write data from S3. Here is the code to do this.

import urllib.parse

# unmount if already mounted
# dbutils.fs.unmount("/mnt/corona_s3")

# mount the S3 bucket to a local Databricks dir
access_key = "xxxxxxxxx"
secret_key = "xxxxxxxx"
# URL-encode the secret key to escape special characters
encoded_secret_key = urllib.parse.quote(secret_key).replace("/", "%2F")
aws_bucket_name = "sandipan-atena"
local_mount_name = "corona_s3"
dbutils.fs.mount("s3a://%s:%s@%s" % (access_key, encoded_secret_key, aws_bucket_name), "/mnt/%s" % local_mount_name)

# display the files
display(dbutils.fs.ls("/mnt/corona_s3/corona/"))

# read the JSON and register it as a temp view so we can run SQL on it
data_loc = "/mnt/corona_s3/corona/*.json"
data = spark.read.json(data_loc)
data.createOrReplaceTempView("data")
data.printSchema()

# Query to extract the data points, flattening the nested fields
extract_data = spark.sql("""select user.screen_name as user_screen_name,
    user.location as user_location,
    extended_tweet.full_text as extended_tweet_full_text,
    retweeted_status.user.screen_name as retweeted_status_user_screen_name,
    retweeted_status.text as retweeted_status_text,
    retweeted_status.extended_tweet.full_text as retweeted_status_extended_tweet_full_text,
    quoted_status.user.screen_name as quoted_status_user_screen_name,
    quoted_status.text as quoted_status_text,
    quoted_status.extended_tweet.full_text as quoted_status_extended_tweet_full_text,
    place.country as place_country,
    place.country_code as place_country_code,
    timestamp_ms
from data""")

# some sample data
extract_data.show(10, False)

# write the final data back to S3 as Parquet
extract_data.coalesce(2).write.mode("overwrite").parquet("/mnt/corona_s3/corona_extract/")

Fig 6: Running query
Fig 7: Query running with stages
Fig 8: Writing the output data
Fig 9: Final output

Once the output is ready, we can use an AWS Glue crawler to crawl the data and create/update an Athena (Hive) table that is exposed to business users.

I have used the Databricks Community Edition, which does not offer consistent performance because it is free. However, the paid version can easily replace the free version.

You can get the entire code from the git location: https://github.com/ghoshm21/spark_book/tree/main/ch06_Clud_computation

Please contact contact@sandipan.tech for any suggestions or help.


Sandipan Ghosh

Big data solution architect & lead data engineer with experience in building data-intensive applications, tackling challenging architectural and scalability problems.