HOW TO PARSE XML DATA TO A SPARK DATAFRAME

Sandipan Ghosh
2 min readMar 14, 2022

Purpose

In one of my projects, I had a ton of XML data to parse and process. XML is an excellent format built around tags, much like key-value pairs. JSON is similar, but is more like a stripped-down version of XML, so JSON is very lightweight while XML is heavy.
Initially, we thought of using Python to parse the data and convert it to JSON for Spark to process. However, the challenge was the size of the data: the entire 566 GB would take Python a long time to parse.
So the obvious choice was PySpark. We want to parse the data against a schema into a DataFrame for post-processing. However, as far as I know, PySpark does not support the XML format out of the box.
This document will demonstrate how to work with XML in PySpark. The same method should work in Spark with Scala without significant changes.
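To make the rejected approach concrete, here is a minimal sketch of what the pure-Python route would look like: parse each XML record with the standard library and emit JSON lines for Spark to read. The tag and attribute names here are hypothetical, chosen to match the SMS example used later in this article.

```python
import json
import xml.etree.ElementTree as ET

# A tiny sample in the same shape as the SMS data shown later
# (the exact fields are hypothetical).
xml_data = """
<SmsRecords>
    <sms id="1" sender="alice" body="hello"/>
    <sms id="2" sender="bob" body="hi there"/>
</SmsRecords>
"""

# Parse each <sms> row and convert it to one JSON line for Spark.
root = ET.fromstring(xml_data)
rows = [dict(sms.attrib) for sms in root.iter("sms")]
json_lines = "\n".join(json.dumps(row) for row in rows)
print(json_lines)
```

This works fine for small files, but looping over hundreds of gigabytes in a single Python process is exactly the bottleneck described above.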

Option 1:-

Use the spark-xml parser from Databricks.
Databricks provides two spark-xml builds: one for Spark compiled with Scala 2.11 and another for Scala 2.12.
Please make sure you use the correct one.
I have Spark compiled with Scala 2.11.
We can include the dependency in a pom or sbt file, or launch spark-shell (or spark-submit) with "--packages com.databricks:spark-xml_2.11:0.6.0".
Ref: https://stackoverflow.com/questions/50429315/read-xml-in-spark

Option 2:-

If, like my current organization, you have difficulties compiling the code with Maven or sbt because dependencies must be downloaded from the internet, you can instead download the jar file directly:

Scala 2.11: https://repo1.maven.org/maven2/com/databricks/spark-xml_2.11/0.6.0/spark-xml_2.11-0.6.0.jar
Scala 2.12: https://repo1.maven.org/maven2/com/databricks/spark-xml_2.12/0.6.0/spark-xml_2.12-0.6.0.jar

While launching spark-shell or spark-submit, include the JAR with --jars full/path/of/the/jar

pyspark --jars /home/sandipan/Downloads/spark_jars/spark-xml_2.11-0.6.0.jar

How to parse the data:-

Very simple: read the XML with the format option, and Spark should infer the schema.

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "SmsRecords") \
    .option("rowTag", "sms") \
    .load("full/path/of/the/xml")

Note:- the "rootTag" is the outermost (root) tag of the XML document, while the "rowTag" is the tag that marks each record; every occurrence of it becomes one row in the DataFrame.
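For the options used above, the input file would be shaped roughly like this (the attribute names are hypothetical, for illustration only):

```xml
<SmsRecords>                       <!-- rootTag -->
    <sms id="1" body="hello"/>     <!-- each rowTag becomes one row -->
    <sms id="2" body="hi there"/>
</SmsRecords>
```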

[Image: Input data]
[Image: Code and schema]
[Image: After parsing the data]


Sandipan Ghosh

Big data solution architect & lead data engineer with experience in building data-intensive applications, tackling challenging architectural and scalability problems.