Parsing an XML file into an OLAP Database in SQLite

Cole Crescas
3 min readAug 19, 2021

An XML file is a common extensible markup language file used to store and transport data. Often used as the back end data storage from a website, it can be used for extracting, transforming & loading data into a database. For this example I choose to perform this ETL process on a PubMed Article file containing various information about this scholarly articles. The end goal was to analyze trends for the amount of articles published per quarter and year, grouped by the author & journal published.

Below is a preview of the xml file showing some of the tags we will be extracting for this case.

Sample XML file

The first step is to design a database schema used for our analysis. I choose to create a star schema around the article fact table with two summary tables for analysis as well. An image of the ERD is attached below.

ERD for Schema

Following the database design I setup the SQLite database using SQL commands in Rstudio. The following code outlines the table names, attributes and datatypes for the schema. I then used the dbWriteTable command to convert R dataframes created in the following code sections into SQLite tables in the database.

In order to create the tables in R, I extracted the xml information into dataframes using Xpath and used a for loop to parse through each indivual article. Finally, using the rbind function, I was able to combine the datalist into one dataframe. This was then used to create other dataframes and input into SQLite.

Extracting xml info into

To create the final summary tables I used the dbGetQuery function of the DBI package in R. This allowed me to query the already existing tables using SQL and create another dataframe that could be uploaded to the SQLite database.

Using these summary tables, I was able to create final graphs to view the relationship between the number of published articles and quarters of the year. For clarification, I broke up the quarters as months 1–3 were quarter 1, etc. instead of the regular financial quarters. This could be a reason for the skew in the data to have less publications in the spring time of the year.

Graph of the number of published articles by quarter

Finally, I checked my online version of SQLite to see the tables that were uploaded to the database. An image of the tables database can be seen here:

SQLite Database

Full code & XML file can be found here : https://github.com/colaso96/OLAP_database/blob/main/XML_ETL.Rmd

If you enjoyed this please like and follow for more content! Thank you!

--

--