Week 2 Thoughts

From now on we are on our own. From here on, our projects will be individual projects. This week we were introduced to our new project, in which we need to scrape data and perform linear regression on it. As of right now, I am choosing to predict a movie’s second-week box office performance. I got this idea from noticing that most movies experience a significant decline in box office revenue from the first week to the second, while a minority see an increase. I want to look into the variables that are associated with this. I have scraped data from Boxofficemojo and pulled in API data from Omdbapi.

My biggest issue this week was scraping data. I was able to get the links on Boxofficemojo for all the movies I wanted to look at pretty easily with BeautifulSoup. Then, based on a suggestion by our TA, I switched to Scrapy and tried to get each movie’s data. It took me a while to understand and use Scrapy, and especially to find the right xpaths to get the data I wanted. On Monday and Tuesday of this week, I started to seriously doubt how well my project was going to go, and I doubted a bit whether I was ready for the bootcamp. But I pushed through, and after many attempts with various xpath strings, and with some help from fellow students, I have all the data I need to run linear regression. Since I already know a lot about linear regression, I don’t expect to run into any more major hurdles.
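For anyone curious, the link-collecting step looks roughly like the sketch below. It is only an illustration: the HTML snippet, the `/movies/` prefix, and the `collect_movie_links` helper are made up for the example, and the real listing pages are laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a listing page; the real markup will differ.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/movies/?id=example1.htm">Example Movie One</a></td></tr>
  <tr><td><a href="/movies/?id=example2.htm">Example Movie Two</a></td></tr>
  <tr><td><a href="/about/">About</a></td></tr>
</table>
"""


def collect_movie_links(html, prefix="/movies/"):
    """Return (title, href) pairs for anchors whose href starts with prefix."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that have no href attribute at all.
    for anchor in soup.find_all("a", href=True):
        if anchor["href"].startswith(prefix):
            links.append((anchor.get_text(strip=True), anchor["href"]))
    return links


movie_links = collect_movie_links(SAMPLE_HTML)
for title, href in movie_links:
    print(title, href)
```

Filtering on the href prefix keeps navigation links like the "About" one out of the list, which is the part that is easy to get wrong on a busy page.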

Every single day this week I have been staying in the classroom to keep working until about 7:30 or 8 and then going home and coding until 10:30. And…I actually enjoy it. I have realized again that I wanted to do data science because I get to problem solve and think deeply about the data I’m analyzing. I can’t think of the last time I have worked this hard, especially for such sustained periods of time. I credit this to liking what I am doing and to strict project deadlines.

Other thoughts:

I have made it a priority to get around 8 hours of sleep. When I first started the bootcamp I got between 4 and 6 hours, and I couldn’t process much of the lessons or deeply analyze the data I was looking at. I have always thought sleep is important, and I don’t think it is more productive to work more but sleep less.

Data science is not for people who like fast results or who are not willing to start over when something doesn’t work. I screwed up a small part of my code, which meant I was losing a bunch of movies’ information when I loaded it into pandas. Thankfully, a classmate caught the error early on, and it only cost me about 2 hours. Mistakes are inevitable.
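One way to catch this kind of silent row loss early is to check which rows fail to match when two tables are combined. The sketch below is not the actual bug or fix. It assumes, purely for illustration, that scraped box office rows are merged with API metadata on a `title` column, and every title and number in it is invented.

```python
import pandas as pd

# Hypothetical scraped rows; titles and numbers are invented for illustration.
box_office = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "week1_gross": [100.0, 250.0, 80.0],
})

# Hypothetical API metadata that is missing one of the movies.
metadata = pd.DataFrame({
    "title": ["Movie A", "Movie C"],
    "runtime_min": [120, 95],
})

# A left merge keeps every scraped movie; indicator=True adds a "_merge"
# column recording whether each row found a match in the metadata.
merged = box_office.merge(metadata, on="title", how="left", indicator=True)

# Rows marked "left_only" would have vanished under the default inner merge.
missing = merged.loc[merged["_merge"] == "left_only", "title"].tolist()
print(f"{len(box_office)} scraped, {len(missing)} without metadata: {missing}")
```

Printing the count before and after each loading step takes seconds and makes dropped movies obvious right away.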

Appreciate small wins. Some of my happiest moments are when my code just works. 

Written on April 16, 2017