Parsing an XML file into an OLAP Database in SQLite
An XML file is a common extensible markup language file used to store and transport data. Often used as the back end data storage from a website, it can be used for extracting, transforming & loading data into a database. For this example I choose to perform this ETL process on a PubMed Article file containing various information about this scholarly articles. The end goal was to analyze trends for the amount of articles published per quarter and year, grouped by the author & journal published.
Below is a preview of the xml file showing some of the tags we will be extracting for this case.
The first step is to design a database schema used for our analysis. I choose to create a star schema around the article fact table with two summary tables for analysis as well. An image of the ERD is attached below.
Following the database design I setup the SQLite database using SQL commands in Rstudio. The following code outlines the table names, attributes and datatypes for the schema. I then used the dbWriteTable command to convert R dataframes created in the following code sections into SQLite tables in the database.
In order to create the tables in R, I extracted the xml information into dataframes using Xpath and used a for loop to parse through each indivual article. Finally, using the rbind function, I was able to combine the datalist into one dataframe. This was then used to create other dataframes and input into SQLite.
To create the final summary tables I used the dbGetQuery function of the DBI package in R. This allowed me to query the already existing tables using SQL and create another dataframe that could be uploaded to the SQLite database.
Using these summary tables, I was able to create final graphs to view the relationship between the number of published articles and quarters of the year. For clarification, I broke up the quarters as months 1–3 were quarter 1, etc. instead of the regular financial quarters. This could be a reason for the skew in the data to have less publications in the spring time of the year.
Finally, I checked my online version of SQLite to see the tables that were uploaded to the database. An image of the tables database can be seen here:
Full code & XML file can be found here : https://github.com/colaso96/OLAP_database/blob/main/XML_ETL.Rmd