Learning Matplotlib: Graphing Weather Patterns
As an engineering major at Hopkins, I had worked only with Java, C/C++, and MATLAB through my junior year. Because Python is so widely used in data science and ML, I decided to learn it on my own through online courses. One of them, Applied Plotting, Charting & Data Representation in Python, included an assignment to plot temperature trends. We were given a dataset from the National Centers for Environmental Information that records the daily low and high temperatures from several weather stations around Ann Arbor, Michigan. The assignment was as follows:
Read the documentation and familiarize yourself with the dataset, then write some python code which returns a line graph of the record high and record low temperatures by day of the year over the period 2005-2014. The area between the record high and record low temperatures for each day should be shaded.
Overlay a scatter of the 2015 data for any points (highs and lows) for which the ten year record (2005-2014) record high or record low was broken in 2015.
Watch out for leap days (i.e. February 29th), it is reasonable to remove these points from the dataset for the purpose of this visualization.
Make the visual nice! Leverage principles from the first module in this course when developing your solution. Consider issues such as legends, labels, and chart junk.
Coming from Java, I always put my import statements at the top even though I don't have to in Python.
The first thing I like to do when I have CSV files is to check the datatypes of the data and also see how the data is laid out.
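A minimal sketch of that first look is below. The rows are made up and stand in for the real CSV, and the column names Date and Element are my assumptions (only ID and Data_Value are named in this post), so treat it as an illustration rather than the actual file.

```python
import io

import pandas as pd

# Made-up rows standing in for the real CSV; station IDs and values are illustrative.
# Column names "Date" and "Element" are assumptions, "ID" and "Data_Value" come from the post.
sample_csv = io.StringIO(
    "ID,Date,Element,Data_Value\n"
    "STATION_A,2014-11-12,TMAX,22\n"
    "STATION_B,2009-04-29,TMIN,56\n"
    "STATION_C,2008-05-26,TMAX,278\n"
    "STATION_A,2005-11-11,TMAX,139\n"
)
df = pd.read_csv(sample_csv)

print(df.head())   # how the rows are laid out (note they are not sorted by date)
print(df.dtypes)   # one dtype per column
```

With the real file, the io.StringIO object would simply be replaced by the path to the CSV.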
From here we can see that three columns of the CSV are "objects" (strings) and one column, the Data_Value column, contains int64 data. The integers in the Data_Value column represent high and low temperatures in tenths of a degree Celsius, so we will have to divide by ten to get degrees Celsius. Matplotlib works with standard-library datetime objects (recent versions also accept NumPy datetime64 values), so we will have to convert the date strings into proper date objects later; the parse_dates parameter of read_csv could do this at load time as well. We can also see that the data is not sorted by date, so we will have to sort it. Lastly, there is an ID column that identifies which weather station reported each temperature. We can ignore those IDs because we know that all of the data came from weather stations around Ann Arbor.
The assignment also came with the chunk of code below that allows us to visualize where the weather stations are located.
We see that the weather stations are near Ann Arbor, Michigan, with the majority of them located south and west of the city. For many real-life projects, it's important to know where your data is coming from.
Next, I decided to manipulate the main dataframe. First, I converted the string dates into pandas datetime objects. To plot these dates with Matplotlib, I planned to convert them into standard-library Python datetime objects rather than keep them as NumPy dates. This was very confusing to me at first because I didn't realize there were multiple types of dates in Python. Next, I sorted the dataframe by date so that the 2005 data was at the top and the 2015 data was at the bottom. Then I changed the index into a multi-index of month and day. After that, I removed any data from February 29th, the leap day, because not every year has one. Lastly, I converted the data from tenths of a degree Celsius to degrees Celsius.
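The cleaning steps above can be sketched roughly as follows. This uses a tiny made-up dataframe, and the Date, Element, Month, Day, and Year column names are assumptions for illustration; it is not the exact code from my notebook.

```python
import pandas as pd

# Toy rows, not real observations; "Date" and "Element" are assumed column names.
df = pd.DataFrame({
    "ID": ["STATION_A", "STATION_B", "STATION_A", "STATION_B"],
    "Date": ["2015-03-01", "2008-02-29", "2005-07-04", "2012-02-29"],
    "Element": ["TMAX", "TMIN", "TMAX", "TMAX"],
    "Data_Value": [105, -50, 311, 44],
})

df["Date"] = pd.to_datetime(df["Date"])   # strings -> pandas Timestamps
df = df.sort_values("Date")               # 2005 at the top, 2015 at the bottom

# Drop leap days so every year contributes the same 365 calendar days
is_leap_day = (df["Date"].dt.month == 2) & (df["Date"].dt.day == 29)
df = df[~is_leap_day].copy()

df["Data_Value"] = df["Data_Value"] / 10  # tenths of a degree C -> degrees C

# Index by (month, day) so the same calendar day lines up across years
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df = df.set_index(["Month", "Day"])
```

Keeping Year as an ordinary column makes it easy to split 2015 from the earlier years afterwards.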
Next, I created four more dataframes: the high and low temperatures from 2015, and the record high and record low temperatures from 2005-2014, as the assignment directed. The assignment asks for a scatter plot of only those 2015 temperatures that were hotter than the 2005-2014 record high or colder than the 2005-2014 record low for the same day. Lastly, I changed the index of each dataframe to a new date range covering 2015 so that all four shared the same index.
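Here is one way the split and the record comparison could look, as a sketch on synthetic data. The helper by_day, the Temp_C column, and the TMAX/TMIN labels are hypothetical names chosen for this example.

```python
import pandas as pd

# Synthetic temperatures (deg C) for two calendar days; not real observations.
rows = [
    ("2005-01-01", "TMAX", 2.0), ("2010-01-01", "TMAX", 5.0), ("2015-01-01", "TMAX", 7.5),
    ("2005-01-01", "TMIN", -9.0), ("2010-01-01", "TMIN", -12.0), ("2015-01-01", "TMIN", -10.0),
    ("2005-01-02", "TMAX", 3.0), ("2010-01-02", "TMAX", 4.0), ("2015-01-02", "TMAX", 1.0),
    ("2005-01-02", "TMIN", -8.0), ("2010-01-02", "TMIN", -6.0), ("2015-01-02", "TMIN", -15.0),
]
df = pd.DataFrame(rows, columns=["Date", "Element", "Temp_C"])
df["Date"] = pd.to_datetime(df["Date"])
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day

is_2015 = df["Date"].dt.year == 2015   # everything else here is 2005-2014
is_high = df["Element"] == "TMAX"


def by_day(frame, how):
    """Aggregate Temp_C per calendar day and index the result by 2015 dates."""
    out = frame.groupby(["Month", "Day"])["Temp_C"].agg(how)
    out.index = pd.DatetimeIndex(
        [pd.Timestamp(year=2015, month=int(m), day=int(d)) for m, d in out.index]
    )
    return out


record_high = by_day(df[~is_2015 & is_high], "max")
record_low = by_day(df[~is_2015 & ~is_high], "min")
high_2015 = by_day(df[is_2015 & is_high], "max")
low_2015 = by_day(df[is_2015 & ~is_high], "min")

# Keep only the 2015 days that beat the ten-year record
broken_high = high_2015[high_2015 > record_high]
broken_low = low_2015[low_2015 < record_low]
```

With a full year of data and leap days removed, assigning pd.date_range("2015-01-01", "2015-12-31") as the index gives the same result as the list comprehension in by_day.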
My next step was to format the plot so that it would look nice yet have minimal "chart junk." I wanted the X-axis to show months rather than individual dates. I opted to get rid of the minor tick marks and the gridlines. I also added a legend using the "best" location setting, and I added the degree symbol to the Y-axis label. Finally, I made the figure 10 x 6 inches so that it is larger and wider than the default figure size.
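A sketch of that formatting is below. The placeholder line exists only so the axis has dates and the legend has an entry, and the label text is my own choice for this example.

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))  # width x height in inches

# Placeholder line (not real data) so the axis holds dates and the legend has an entry
start = datetime.datetime(2015, 1, 1)
end = datetime.datetime(2015, 12, 31)
ax.plot([start, end], [-10, 30], label="placeholder series")
ax.set_xlim(start, end)

ax.xaxis.set_major_locator(mdates.MonthLocator())         # one major tick per month
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b"))  # abbreviated month names
ax.minorticks_off()                                       # no minor tick marks
ax.grid(False)                                            # no gridlines
ax.set_xlabel("Month")
ax.set_ylabel("Temperature (°C)")
ax.legend(loc="best")                                     # let Matplotlib pick the spot
```

MonthLocator and DateFormatter both live in matplotlib.dates; the "%b" format code prints abbreviated month names.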
The last step was to actually plot the data. I am a little colorblind, so I took my color scheme from the palette Google publishes for its Material Design framework. I made the low temperatures cooler colors (shades of blue) and the high temperatures warmer colors (shades of red), with the area in between a mid-tone color (light green). I also drew the 2015 markers as triangles that point in the direction in which the record was broken (i.e., a marker points up if that day's temperature was hotter than the previous years' high and down if it was colder than the previous years' low).
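To illustrate the plotting calls, here is a sketch on synthetic seasonal curves. The data is generated, not the Ann Arbor records, and the hex colors are illustrative picks in the spirit of the Material Design palette rather than the exact ones I used.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

days = pd.date_range("2015-01-01", "2015-12-31")
t = np.arange(len(days))

# Synthetic curves standing in for the real records (deg C)
record_high = pd.Series(23 - 18 * np.cos(2 * np.pi * t / 365), index=days)
record_low = record_high - 25
high_2015 = record_high - 3
low_2015 = record_low + 3
high_2015.iloc[[40, 200]] += 5   # pretend two days broke the record high
low_2015.iloc[[50]] -= 6         # pretend one day broke the record low

broken_high = high_2015[high_2015 > record_high]
broken_low = low_2015[low_2015 < record_low]

fig, ax = plt.subplots(figsize=(10, 6))
x = days.to_pydatetime()  # standard-library datetime objects

ax.plot(x, record_high.values, color="#E57373", label="Record high (2005-2014)")
ax.plot(x, record_low.values, color="#64B5F6", label="Record low (2005-2014)")
ax.fill_between(x, record_low.values, record_high.values, color="#C8E6C9", alpha=0.5)

# Triangles point in the direction the record was broken
ax.scatter(broken_high.index.to_pydatetime(), broken_high.values,
           marker="^", color="#B71C1C", label="2015 broke record high")
ax.scatter(broken_low.index.to_pydatetime(), broken_low.values,
           marker="v", color="#0D47A1", label="2015 broke record low")

ax.set_ylabel("Temperature (°C)")
ax.legend(loc="best")
```

Calling plt.show() in a notebook, or fig.savefig with a file name in a script, renders the figure.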
Finally, we have a plot!
Throughout this project, I learned a lot about Python's Matplotlib library. I am impressed by how powerful and customizable it is, especially in comparison to programs like Excel. At the same time, this exercise made me appreciate how much faster it is to make a nice-looking plot in Excel, even though Excel doesn't offer nearly as many customizations. Prior to this project, the only tools I had used to create graphs in code were Plotly, a beautiful web-based graphing library for Python, and MATLAB. MATLAB was definitely more intuitive to me and faster to learn. Still, it was interesting that although John Hunter, the original author of Matplotlib, modeled it on MATLAB, Matplotlib offers much more customization, from tick mark options to legend styles.
Full Code: https://github.com/tchanxx/Coursera/blob/master/Applied%20Plotting%2C%20Charting%20%26%20Data%20Representation/Assignment%202%20-%20Plotting%20Weather%20Patterns/Assignment2.ipynb