New York Forest Rangers
Verifying Volunteer Entries to the 2015 NYC Street Trees Census
Never has a walk around the block been so important. As the pandemic continues to wreak havoc on everyone’s social wellbeing, New Yorkers can find some shred of solace in the over 650,000 publicly owned trees that grace their streets. And thanks to NYC Open Data, I was able to analyze trends in the health of street trees and what factors affect it. I set out to find helpful changes for the next census, as well as policy goals for healthier trees.
The data
In 2015, the NYC Dept. of Parks & Recreation conducted a census of all the street trees in the five boroughs. That’s every tree that isn’t in a park or privately owned, which totals 652,173 (not counting stumps and dead trees). I looked closely at the health of the trees, which was split into three categories: Good (82% of the data), Fair (14%), and Poor (4%). Already, it’s clear that one of the big challenges for my model will be this large class imbalance.
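To make the imbalance concrete, here is a minimal sketch that turns the rounded class shares above into inverse-frequency weights. The rule used, 1 / (number of classes × class share), is the heuristic scikit-learn documents for class_weight='balanced'. The shares are the rounded percentages from this post rather than exact counts, and the function name is made up for illustration.

```python
# Approximate class shares reported above (rounded percentages, not exact counts).
class_shares = {"Good": 0.82, "Fair": 0.14, "Poor": 0.04}


def balanced_weights(shares):
    """Inverse-frequency weights: 1 / (n_classes * share_of_class)."""
    n_classes = len(shares)
    return {label: 1.0 / (n_classes * share) for label, share in shares.items()}


weights = balanced_weights(class_shares)
for label, weight in weights.items():
    print(f"{label}: share={class_shares[label]:.0%}, weight={weight:.2f}")
```

Under this weighting, a misclassified Poor tree counts roughly twenty times as much as a misclassified Good tree, which is one way to keep a model from ignoring the rare classes.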
There were 41 other variables in the data, including tree diameter, species type, any problems with roots or branches or trunk, whether or not they have tree guards, as well as location information of the tree — coordinates and information like community board, state assembly/senate district, city council district, etc.
About 30% of the data was collected by volunteers, with the rest collected by professional parks staff. You can see in the maps below that volunteers didn’t make it into the outer depths of the boroughs.
Manhattan and Brooklyn were overrepresented in the volunteer-collected data, and if you were Rain Man, you might see that volunteers tended to rate trees more poorly on average. With this in mind, I began to wonder whether the volunteers’ lack of training and limited reach affected how they rated a tree and the veracity of that rating.
Outline
After cleaning my data, I conducted extensive EDA, fit a baseline model, engineered some geospatial features, and created dummy variables. I looked closely at location, tree species, particular problems, and infrastructure. Using Folium, I made maps to discover areas in need of a street tree revamp.
Finally, I developed a model to act as a sort of validation on the work of volunteers. For example, if the model predicts that a tree is in poor condition, but a volunteer lists it as fair or good, one could flag that listing to be checked on by a professional.
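The flagging rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the health labels are treated as ordered (Poor < Fair < Good), and the entries, tree IDs, and predictions are hypothetical.

```python
# Ordinal ranking of the three health labels used in the census.
HEALTH_RANK = {"Poor": 0, "Fair": 1, "Good": 2}


def flag_for_review(volunteer_label, predicted_label):
    """Return True when the model rates the tree worse than the volunteer did."""
    return HEALTH_RANK[predicted_label] < HEALTH_RANK[volunteer_label]


# Hypothetical volunteer entries paired with hypothetical model predictions.
entries = [
    {"tree_id": 1, "volunteer": "Good", "predicted": "Poor"},
    {"tree_id": 2, "volunteer": "Fair", "predicted": "Fair"},
    {"tree_id": 3, "volunteer": "Fair", "predicted": "Poor"},
    {"tree_id": 4, "volunteer": "Poor", "predicted": "Good"},
]

flagged = [
    e["tree_id"]
    for e in entries
    if flag_for_review(e["volunteer"], e["predicted"])
]
print(flagged)  # [1, 3]
```

Tree 4 is not flagged because the volunteer was the more pessimistic of the two. Whether disagreements in that direction also deserve a second look is a policy choice the sketch leaves open.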
EDA
During EDA, I found that the species of tree is fairly significant, with Norway maple having the most Poor health entries and sawtooth oak having the most Good entries. I would recommend sticking to the healthier varieties toward the right of this graph.
Problems with the tree itself were obviously significant. Unfortunately, problems that were listed vaguely as “other” (shown in the graphs below) appear to hold the highest significance.
I was surprised to see branch problems having the largest correlation with poorer health statuses, though not much can be said regarding causation. These may simply be the most quickly and easily identified problems, since one generally cannot see the roots of a tree. Even though “other” is rather vague, I suggest conducting more regular maintenance, as well as pushing for environmental protections. Hopefully that plastic bag ban will result in less “other” in the branches.
And in future censuses, I would recommend either having columns with more specificity or a custom notes column. This latter option could be analyzed using natural language processing (NLP).
Feature Engineering
While the dataset was fairly complete, I was still able to engineer two variables that turned out to be very important in my final model: the number of trees on a block and the distance to the nearest tree. I figured a block that is too crowded with trees may result in poorer health on average. Similarly, a tree that is very close to another tree may have a more difficult time maintaining good health. While I didn’t find any discernible relationships between these variables during EDA, it does seem they helped at some point in the Random Forest model’s decision nodes.
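As a rough illustration of these two features (not the project's actual code), the sketch below counts trees per block and finds each tree's nearest neighbor by brute force. The field names, block IDs, and coordinates are hypothetical. A brute-force search like this is only practical for a small sample; for all 650,000+ trees a spatial index such as a k-d tree would be the more realistic choice.

```python
import math
from collections import Counter

# Hypothetical sample; block IDs and coordinates are illustrative only.
trees = [
    {"tree_id": 1, "block_id": "A", "lat": 40.7230, "lon": -73.8442},
    {"tree_id": 2, "block_id": "A", "lat": 40.7231, "lon": -73.8442},
    {"tree_id": 3, "block_id": "A", "lat": 40.7240, "lon": -73.8442},
    {"tree_id": 4, "block_id": "B", "lat": 40.7300, "lon": -73.8500},
]


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    radius_m = 6_371_000  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * radius_m * math.asin(math.sqrt(a))


# Feature 1: number of trees sharing each block.
block_counts = Counter(t["block_id"] for t in trees)

for tree in trees:
    tree["trees_on_block"] = block_counts[tree["block_id"]]
    # Feature 2: distance to the closest other tree (brute force, O(n^2)).
    tree["nearest_tree_m"] = min(
        haversine_m(tree["lat"], tree["lon"], other["lat"], other["lon"])
        for other in trees
        if other["tree_id"] != tree["tree_id"]
    )

for tree in trees:
    print(tree["tree_id"], tree["trees_on_block"], round(tree["nearest_tree_m"], 1))
```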
Beyond that, I created dummy columns for tree species, borough, number of stewards (volunteers that monitor a tree’s health), presence of tree-guards (those mini-fences around the tree’s dirt patch), and community board in which the tree is located. In the end, my model used 147 total features.
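For readers unfamiliar with dummy columns, here is a small sketch using pandas' get_dummies. The column names and category labels are assumptions chosen for illustration and may not match the census file; the real project produced far more columns than this toy example.

```python
import pandas as pd

# Hypothetical rows; column names and category labels are assumptions.
df = pd.DataFrame(
    {
        "species": ["London planetree", "honeylocust", "Norway maple"],
        "borough": ["Queens", "Brooklyn", "Queens"],
        "guards": ["None", "Helpful", "None"],
        "diameter": [21, 11, 3],
    }
)

# One 0/1 column per category level; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["species", "borough", "guards"], dtype=int)

print(encoded.shape)
print(sorted(encoded.columns))
```

Each categorical column expands into one indicator column per level, which is how a few dozen raw variables can grow into well over a hundred model features.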
Modeling
I trained and tested several models, though I was limited due to the size of the data and my computational power. I found that the Random Forest models gave me the best results. I tried resampling with SMOTE, but that only made my model too complex and less accurate. My vanilla model — without my engineered features or any tuning — gave me a weighted F1 score of 61.5%. This score improved to 82.1% with an untuned Random Forest model (other than using a class weight of balanced). Finally, after a GridSearch and some trial and error, I built a model that gave me an F1 score of 74.5% yet a much better spread of true positives (see confusion matrices below).
It may seem strange to prefer a model with lower accuracy and F1 score, but the untuned model heavily overpredicted the majority class. I introduced a custom metric by looking at the precision of only the Good predictions, which helped me to penalize this overprediction and build a model with a more even spread. My final model still predicts the majority class too often, but at a much lower rate. The final model also predicted the Fair class much more accurately, with only a minor hit on Poor predictions. And more of the Poor health trees were predicted as Fair, which is closer to their true value.
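The idea behind that custom metric can be sketched as follows. This is one possible reading, assuming the metric is simply the precision of the Good class; the function name and the ten example labels are hypothetical.

```python
def precision_for_class(y_true, y_pred, label):
    """Share of predictions of `label` that were actually `label`."""
    predicted = [t for t, p in zip(y_true, y_pred) if p == label]
    if not predicted:
        return 0.0  # no predictions of this class; treat precision as 0
    return sum(t == label for t in predicted) / len(predicted)


# Hypothetical true labels for ten trees.
y_true = ["Good"] * 6 + ["Fair"] * 2 + ["Poor"] * 2

# A model that predicts the majority class for everything...
y_lazy = ["Good"] * 10
# ...versus one that spreads its predictions across the classes.
y_spread = [
    "Good", "Good", "Good", "Good", "Good", "Fair",
    "Fair", "Good", "Poor", "Fair",
]

print(precision_for_class(y_true, y_lazy, "Good"))    # 0.6
print(precision_for_class(y_true, y_spread, "Good"))  # 5 of 6 Good predictions are correct
```

A model that labels everything Good is dragged down by every Fair or Poor tree it sweeps into that class, so this number rewards models that are more selective about predicting Good.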
For the curious, my hyperparameters were:
-- class_weight = 'balanced'
-- max_features = 11
-- min_samples_leaf = 3
-- max_depth = 55
-- n_estimators = 500
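Assuming the model was built with scikit-learn's RandomForestClassifier (the parameter names above match its API), these settings would look roughly like the sketch below. The data here is synthetic and only stands in for the real tree features; the random seed is an assumption, since none is stated in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the tree features (the real model used 147 features).
X, y = make_classification(
    n_samples=300,
    n_features=20,
    n_informative=6,
    n_classes=3,
    weights=[0.82, 0.14, 0.04],  # mimic the Good/Fair/Poor imbalance
    random_state=0,
)

model = RandomForestClassifier(
    class_weight="balanced",
    max_features=11,
    min_samples_leaf=3,
    max_depth=55,
    n_estimators=500,
    random_state=0,  # assumption: the post does not state a seed
)
model.fit(X, y)

print(model.predict(X[:5]))
```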
Top features
Let’s take a look at which features accounted for the largest decreases in Gini impurity.
Surprisingly, the only three continuous variables (distance to the nearest tree, number of trees on block, and diameter) had the largest impact on the model. Sidewalk damage and other problems with the tree itself were also important. It’s worth mentioning that none of these variables showed much significance during EDA, other than tree diameter, which showed some correlation between thinner trees and Poor health. Still, these other variables must have had solid predictive power once combined with other features within the Random Forest’s trees.
Also present here are some of the more common species, all of which were in the top 5 most commonly planted trees in NYC. Norway maples had the fewest trees in Good health, whereas London planetrees and honeylocusts were among the tree species with the most Good ratings. Finally, it’s notable that neighborhood (for this project, I used the community board number in which a tree was located) barely made the top 15; borough did not make the cut.
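For anyone wondering what a "decrease in Gini impurity" means, here is a small worked example for a single split. Roughly speaking, a Random Forest's importance score for a feature adds up decreases like this one across every split that uses the feature, weighted by how many samples reach the split. The node contents below are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Hypothetical node: eight trees split on some feature threshold.
parent = ["Good"] * 4 + ["Fair"] * 2 + ["Poor"] * 2
left = ["Good"] * 4                  # pure child
right = ["Fair"] * 2 + ["Poor"] * 2  # evenly mixed child

print(gini(parent))                            # 0.625
print(impurity_decrease(parent, left, right))  # 0.375
```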
Conclusions
There is a lot of useful information within this dataset that can benefit New York City’s streets for years to come. While it’s perhaps a no-brainer, I would recommend sticking to the healthier varieties in the species graph seen earlier and avoiding the less healthy varieties. I also would suggest conducting more regular maintenance, as well as pushing for environmental protections, in the style of the plastic bag ban.
Community outreach and engagement in some of the neighborhoods with a lower ratio of trees in Good health may help to reverse those trends.
And continue the practices in the neighborhoods with the highest proportions of trees in Good health:
Finally, since there is always room for improvement, ideally the next census would include columns with more specificity or a notes column. These would provide greater opportunities for analysis and help the Parks department determine which problems tend to lead to a decline in tree health. That said, I feel lucky to live in a city that takes the health of its trees so seriously and among people who are willing to volunteer to make that happen.
Future considerations
Multiclass classification is generally more difficult than binary classification, and in the future I would consider reframing this project as a binary Good or Not Good tree health analysis. Not only should this make my model stronger, but it would also align more closely to the project’s goal of checking in on potentially misclassified volunteer entries.
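The binary reframing amounts to a one-line relabeling, sketched here with hypothetical labels and a made-up function name:

```python
def to_binary(label):
    """Collapse the three-level health label into Good vs. Not Good."""
    return "Good" if label == "Good" else "Not Good"


health = ["Good", "Fair", "Poor", "Good"]  # hypothetical labels
binary_health = [to_binary(h) for h in health]
print(binary_health)  # ['Good', 'Not Good', 'Not Good', 'Good']
```

Using the rounded shares from the census, this would turn an 82/14/4 split into roughly 82/18, which is still imbalanced but leaves the model a single boundary to learn.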
In the future, I would like to train the model I created, as well as the binary one I just discussed, on all of the professional entries. Then I can see how the model’s predictions line up with the volunteer entries and create a list of trees that may need a second look from a professional.
Project repo
You can check out my project repo on GitHub:
https://github.com/p-szymo/nyc_trees