DS3 Project Committee June Updates

Kashika Rathore
9 min read · Aug 25, 2021

The DS3 Projects Subcommittee is a department-student collaboration founded to assist students in gaining practical data-science experience by working on real-world problems with both faculty and industry. The following updates are the initial work of this year’s project committees, and this article will continue to be updated as more progress is made.

Netflix Recommendations

Research Question: How can we build a Netflix Recommendation system best tailored to a user based on their previous watch history and ratings for various films/shows?

So far, we have spent a lot of time exploring the Netflix Prize dataset, formulating a research question, and developing a long-term plan for the rest of this summer. When we first started delving into the dataset, we hit a couple of hiccups: it has fewer features than we would ideally have liked. The only feature provided is each reviewer’s rating for each movie, whereas we had hoped for more information about each movie, such as genre, director, and actors, and even information about each user’s history, to build a recommender system. Nevertheless, we were undeterred and are currently performing basic EDA. We are also seeing how far we can get by merging this dataset with an IMDB dataset to add more features, so we can create more complex recommender models.

Additionally, we have found multiple Kaggle movie datasets, featuring films beyond those shown on Netflix, with metadata about each one. Ultimately, we would like to create an overall database that combines the metadata for each film with the user ratings from the Netflix dataset. Our goal for the upcoming month is to continue searching for and exploring movie-level features, and to assemble a definitive dataset that can be used to perform content-based and user-based filtering to start our recommender system.
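Joining the ratings to movie metadata is a straightforward pandas merge. The sketch below uses tiny hypothetical frames (the column names and values are illustrative, not the actual Netflix Prize or IMDB schemas):

```python
import pandas as pd

# Hypothetical slice of the Netflix Prize ratings (movie_id, user_id, rating)
ratings = pd.DataFrame({
    "movie_id": [1, 1, 2, 3],
    "user_id": [10, 11, 10, 12],
    "rating": [4, 5, 3, 2],
})

# Hypothetical IMDB-style metadata keyed by the same movie ids
metadata = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "genre": ["Drama", "Comedy", "Action"],
    "director": ["A", "B", "C"],
})

# Left-join so every rating keeps its row even if metadata is missing
enriched = ratings.merge(metadata, on="movie_id", how="left")
print(enriched.shape)  # (4, 5)
```

In practice the hard part is aligning the two datasets' movie identifiers (titles and years rather than a shared id), but once a key exists, `how="left"` preserves every rating row.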

Spotify Songs and Behaviors

Research Question: Can we predict whether or not a user will skip a song based on its features?

With the first month of work under our belts, my group and I have progressed through the basic exploratory data analysis required to tackle the question we’ve posed. Since a majority of our group members are underclassmen, we spent a large portion of this month making sure everyone had the basic knowledge and skills for introductory exploratory data analysis (pandas, Matplotlib, seaborn, NumPy, etc.). Because the full dataset is so large, most of our analysis has been done on a sample. We’ve investigated different features in the dataset, such as a song’s popularity score and its relationship with tempo, key, beat, and so on. We’ve also looked into which values are missing and what could be causing those gaps. Luckily, we have not run into any major hiccups or issues beyond making sure everyone is caught up on the skills needed to address our research question.
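The missing-value and feature-correlation checks described above can be sketched as follows; the column names and the synthetic data are stand-ins, not the project's actual Spotify schema:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical sample of track features; names are illustrative only
tracks = pd.DataFrame({
    "tempo": rng.uniform(60, 180, n),
    "key": rng.integers(0, 12, n),
    "popularity": rng.uniform(0, 100, n),
    "skipped": rng.integers(0, 2, n),
})
# Simulate some gaps in one feature
tracks.loc[rng.choice(n, 10, replace=False), "tempo"] = np.nan

# Count missing values per column, then check pairwise feature correlations
missing = tracks.isna().sum()
corr = tracks.corr()
print(missing["tempo"], round(corr.loc["tempo", "popularity"], 3))
```

`DataFrame.corr()` handles the NaNs pairwise, so the missing-value audit and the correlation pass can run on the same sample.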

NASA/Mars Imagery

Research Question: How can we classify images taken by the Mars rovers?

This month, our team worked on setting project goals, creating a project schedule (including a Gantt chart), and building a baseline model. We have decided to first implement an image classification model using a convolutional neural network (CNN). From there, we will explore other possible tasks, such as identifying day versus night on Mars, classifying the weather, and tracking the Mars rovers’ activity timelines through the images they took. Since our team members have different levels of experience with deep learning, we are constantly searching for online resources to better understand its application. Currently, we are working on using transfer learning to build a baseline model.

Smartphone Human Activity Recognition

Research Question: What is the most effective model to predict human activity from smartphone sensors?

We started by analyzing our data and its features, and from there we decided on the direction we wanted to take the project by developing our research question. We decided to focus on four different models: K-Nearest Neighbors, Random Forest, Decision Tree, and Naïve Bayes. We experienced a slight hiccup when we switched from Google Colab to locally run notebooks. We then each created several different visualizations to help us understand the dataset, and learned the process of exploratory analysis along the way. After EDA, we converted our categorical activity variable into a numerical one, labeling the activities from 1 to 6. To reduce our features from 500 to 3, we applied PCA, which allowed us to test our first model, K-Nearest Neighbors, on our training set; it achieved a 90.5% accuracy score on the test set. This is only a baseline model, and we plan to continue testing the three other models mentioned above. For each model, we will also perform a grid search to find the hyperparameters that yield the highest accuracy.
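The PCA-then-KNN baseline can be sketched with scikit-learn; synthetic data stands in for the real sensor readings, so the dimensions match the description (500 features, 6 activity classes) but the accuracy here is not the project's reported 90.5%:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sensor data: 500 features, 6 activity classes
X, y = make_classification(n_samples=600, n_features=500, n_informative=20,
                           n_classes=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale, project down to 3 principal components, then fit K-Nearest Neighbors
model = make_pipeline(StandardScaler(), PCA(n_components=3),
                      KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(round(acc, 3))
```

Wrapping the steps in a pipeline also makes the planned grid search straightforward, e.g. `GridSearchCV(model, {"kneighborsclassifier__n_neighbors": [3, 5, 7]})`.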

Rotten Tomatoes

Research Question: Have natural language features of critic and audience reviews on Rotten Tomatoes changed over the last 20 years, and is there an association between a movie’s overall rating and its box office success?

In our project, we hope to look for qualities of critic and audience reviews of top movies and see whether those qualities have changed over time. Additionally, we want to investigate the relationship between Rotten Tomatoes reviews and the box office success of these movies. So far, our group has completed the web scraping portion of the project. We first decided to scrape the top 100 movies from each of the last 20 years, since this made for a convenient sample that could be easily isolated and scraped from the website. Although these movies all generally have higher reviews, we can use that fact to look for the positive features that may have contributed to their success, and the 20-year window means we can also do a time series analysis on the data.

Following this, we scraped all of the metadata available on each movie’s Rotten Tomatoes page, including audience ratings, the Rotten Tomatoes score, run time, maturity rating, and genres. We then scraped 10 audience reviews and 20 critic reviews for each movie. The result is a dataset containing a list of movies, their corresponding metadata, and their audience and critic reviews. Using this dataset, we plan to perform natural language processing on the reviews and use the results to build some form of predictive model or regression on other factors such as box office revenue. We suspect the associations found may differ between audience and critic reviews, and we also hope to see whether there is a relationship between when a movie was released and the factors that are prevalent in its reviews.
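Extracting per-movie metadata from fetched pages typically looks like the BeautifulSoup sketch below. The HTML fragment and its class names are purely hypothetical stand-ins; the real Rotten Tomatoes markup differs, so the selectors are illustrative only:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for one fetched movie page
html = """
<div class="movie">
  <span class="title">Example Movie</span>
  <span class="tomatometer">91%</span>
  <span class="audience-score">85%</span>
  <span class="runtime">1h 52m</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Pull each metadata field into one flat record per movie
record = {
    "title": soup.select_one(".title").get_text(strip=True),
    "critic_score": soup.select_one(".tomatometer").get_text(strip=True),
    "audience_score": soup.select_one(".audience-score").get_text(strip=True),
    "runtime": soup.select_one(".runtime").get_text(strip=True),
}
print(record)
```

Collecting one such record per movie, plus lists of review texts, yields exactly the movies-plus-metadata-plus-reviews dataset described above.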

Actor/Speech Recognition

Research Question: How can we use a deep learning model to identify emotion in soundbites from the TV sitcom Friends?

The primary focus of our project is to explore and analyze audio data and create a model that will successfully predict certain characteristics of dialogue. Given the popularity of the show “Friends”, we thought the MELD dataset would be an interesting choice. This dataset consists of two main resources: MP4 files of dialogues from the show, and CSV files of annotated data containing speaker, emotion, and sentiment information. With this data, it is possible to perform both speaker recognition and emotion/sentiment detection; we have decided to aim for the latter and have taken the following steps with that in mind.

Data pre-processing: Librosa is our main package for audio feature extraction and analysis, and it works much more readily with WAV files than MP4 ones, so we used the MoviePy package to convert our audio files to the desired format.

EDA: We explored both the annotated data and the raw audio files. EDA on the annotated data gave us a better understanding of how our audio files are distributed among the various emotion/sentiment categories, while Librosa let us visualize the raw audio as spectrograms and as time-series plots of amplitude. We also used the Fast Fourier Transform (FFT) to create some frequency-domain plots. In parallel, we are continuing to research different digital signal processing methods, including but not limited to Mel-Frequency Cepstral Coefficients (MFCCs), a very important feature extraction technique. These will be important as we move on to our next phase, baseline model building.
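As a minimal sketch of the frequency-domain step, the snippet below applies NumPy's FFT to a synthetic tone rather than an actual MELD clip; the 440 Hz signal and the sample rate are stand-in assumptions:

```python
import numpy as np

# Synthetic 440 Hz tone standing in for one extracted WAV clip
sr = 22050              # sample rate (Hz)
t = np.arange(sr) / sr  # one second of samples
signal = np.sin(2 * np.pi * 440 * t)

# Real-valued FFT gives the frequency-domain magnitudes
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# The strongest bin recovers the tone's frequency
dominant = freqs[np.argmax(spectrum)]
print(round(dominant, 1))  # 440.0
```

The same magnitudes, binned on a mel scale and log-compressed, are the starting point for the MFCC features mentioned above.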

NFT Blockchain Analysis

Research Question: Can we detect illegitimate trades in the WAX transaction network, particularly in the NFT space?

The dataset that we want to work with is decentralized and lives on the WAX-EOS blockchain, so getting our data is a task in itself. In this brief update, we chronicle what we’ve done so far, where we’ve hit roadblocks, and what we’ve learned. Our first approach was to use the primary public API endpoint for the WAX mainnet listed in the documentation (chain.wax.io). Our team learned a lot about interfacing with the web and interacting with APIs using Python, but ultimately ran into issues with this API. We found that we could only pull a single block per request, which is a very small amount of data: for reference, the WAX blockchain is over 121 million blocks long and generates multiple blocks per second at the time of writing. Furthermore, we found that the API is a Remote Procedure Call (RPC) API, as opposed to the more commonly found REST style. This means that with every API call, the server runs an actual procedure rather than simply pulling data from a database. Because of this, the API rate-limited us pretty quickly, not allowing more than 30 requests per hour. We also found lists of other endpoints running on different nodes in the WAX network, but ultimately discovered that they run the same underlying node software, meaning they are also RPC APIs with low rate limits. For these reasons, we realized it would be infeasible to acquire a meaningful amount of data through these APIs, and explored other options. Our team has also been exploring running our own node on the WAX blockchain to acquire the data. We ran into issues running the node software on our own computers, so we turned to cloud computing on Microsoft Azure. We all have limited prior experience with cloud computing platforms, and have recently been familiarizing ourselves with virtual machines and container instances.
Eventually, we were able to successfully build EOS nodeos, the core service daemon that all nodes in the WAX network depend on. We ran into issues interfacing with live blockchain networks and are currently trying to work through them. We are also exploring other cloud services such as AWS, specifically as a workaround for the daemon issues.
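The one-block-per-request RPC pattern described above looks roughly like the sketch below (based on the standard EOSIO chain API shape; the specific block number is just an example, and the actual network call is shown but not executed):

```python
import json

# chain.wax.io exposes the EOSIO-style chain API; get_block is an RPC
# call that takes a single block number (or id) per request
API_URL = "https://chain.wax.io/v1/chain/get_block"

def build_get_block_request(block_num):
    """Build the JSON body for one get_block RPC call."""
    return json.dumps({"block_num_or_id": block_num})

body = build_get_block_request(121_000_000)
print(body)

# Sending it would be e.g.:
#   requests.post(API_URL, data=body)
# One request per block, under a ~30 requests/hour rate limit, is what
# made pulling 121+ million blocks this way infeasible.
```

At multiple blocks produced per second, the chain grows faster than such a client could ever read it, which is why running a node looked more promising.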

As for our analysis, our group remains committed to the objective of detecting illegitimate trades in the WAX transaction network, specifically focusing on NFTs. Given that the NFT market consists of transactions between users, we feel that graph theory offers an intuitive and appropriate approach. Specifically, we plan to model our data as a weighted undirected graph: each node represents an individual user, an edge between two nodes signifies that the corresponding users have transacted, and the weight of an edge is the number of transactions made between that pair of users. To identify transactions that appear illegitimate, we aim to isolate the edges with abnormally high weights, which may indicate a series of transactions meant to artificially inflate the value of an NFT. One method we plan to use for this analysis is a modified Kruskal’s algorithm, which will reduce our densely connected graph to its maximum spanning tree.
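The modification to Kruskal's algorithm is simply sorting edges by descending weight, so the tree it builds keeps the heaviest edges first. A minimal sketch on a toy transaction graph (the users and counts are made up):

```python
# Toy transaction graph: (user_a, user_b, num_transactions)
edges = [
    ("alice", "bob", 3),
    ("bob", "carol", 1),
    ("alice", "carol", 7),  # unusually heavy pair
    ("carol", "dave", 2),
    ("bob", "dave", 5),
]

parent = {}

def find(x):
    """Union-find root lookup with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

# Kruskal's algorithm on edges sorted by DESCENDING weight yields the
# maximum spanning tree instead of the usual minimum one
tree = []
for a, b, w in sorted(edges, key=lambda e: -e[2]):
    ra, rb = find(a), find(b)
    if ra != rb:            # adding this edge creates no cycle
        parent[ra] = rb
        tree.append((a, b, w))

print(tree)  # [('alice', 'carol', 7), ('bob', 'dave', 5), ('alice', 'bob', 3)]
```

The heaviest edges that survive into the tree, like the alice-carol pair here, are exactly the candidates for closer inspection as possible wash trading.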

USGS Wildfires

Research Question: What are some factors that contribute to the size and intensity of wildfires?

The end goal of our project is to create an interactive dashboard that is both interpretable and insightful regarding wildfires in the United States over the last few decades. We have data on roughly 1.8 million wildfires, including details such as the total number of acres burned, the date of discovery, the statistical cause of the fire, and a longitude-latitude coordinate pair for each fire. We plan to analyze and explore the fires both spatially and temporally, using geospatial analysis and time-series prediction methods. One interesting finding is that although the dataset contains roughly 1.8 million “wildfires”, a large majority of them (roughly 1.6 million) burned under 10 acres. We visualized this skew with a horizontal bar plot of fire size class frequencies, noting that most of the data falls within classes A and B. We then explored the causes of these relatively small fires and noticed that debris burning, arson, and lightning were among the more common descriptions, as shown in our stacked bar plot of the causes of fires resulting in under 10 acres burned.
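The size-class tally behind that bar plot is a one-line pandas aggregation. The frame below is a tiny hypothetical sample mimicking the real skew (the class labels and acreages are illustrative, not rows from the USGS data):

```python
import pandas as pd

# Hypothetical sample mirroring the skew: most fires sit in the small
# size classes A and B (under 10 acres burned)
fires = pd.DataFrame({
    "fire_size_class": list("AAAABBBCDE"),
    "acres": [0.1, 0.2, 0.1, 0.2, 2, 4, 8, 50, 150, 500],
})

# Counts per size class (the bar plot's underlying data), plus the
# fraction of fires under 10 acres
class_counts = fires["fire_size_class"].value_counts()
under_10 = (fires["acres"] < 10).mean()
print(class_counts.to_dict(), under_10)  # 70% of this sample is under 10 acres
```

On the real data the same two lines reproduce the finding above: classes A and B dominate, with roughly 1.6 of the 1.8 million fires under 10 acres.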
