Thesis Proposal
Contents
2.2 The negative binomial distribution
2.6 The auto negative binomial model
3.2.3 Percent of Urban Population
3.2.4 Percent of persons living under poverty
3.2.5 Driving under the influence
3.3 Transportation infrastructure related Factors
3.3.2 Number of miles by functional class
4 Analysis of Preliminary results
5.3 Variables and functional forms selection
Tables
Table 1 Description of variables.
Table 2 Linear models for natural log of fatal crashes (sample size = 67)
Table 3 Spatial Log-linear CAR model of Fatal Crashes 2002.
Table 4 Analysis of Parameter Estimates, Negative Binomial Model.
Table 5 Availability of data by year at October 15, 2004
Figures
Figure 1 Number of Fatal Crashes reported by County in the State of Pennsylvania Year 2000
Figure 2 Predicted Total Precipitation Surface Year 2000
According to the World Health Organization/World Bank report “The Global Burden of Disease” (Murray and Lopez, 1996), deaths from non-communicable diseases are expected to climb 77% from 1990 to 2020 (from 28.1 million to 49.7 million) and traffic accidents are the main cause of this rise. Road traffic injuries are expected to take third place in the rank order of disease burden by the year 2020.
Another study of the World Health Organization, the publication “Injury: A Leading Cause of the Global Burden of Disease” (1999), reports that the leading injury-related cause of death among people aged 15 to 44 years is traffic injuries. Of the 5.8 million people who died of injuries in 1998, 1,170,694 died as a direct result of injuries sustained in a motor vehicle accident.
It is clear that deaths due to car crashes are one of the biggest public health problems around the world, and developed countries are no exception. In 2002 alone, 42,815 persons died in 38,309 crashes in the USA (FARS, 2004), 1,614 of them in the state of Pennsylvania.
Motor-vehicle crashes cause not only deaths but also injuries ranging from minor to severe. In 2001, 117,860 persons were injured in reported crashes in the state of Pennsylvania alone (PennDOT, 2002). Of this total, 5,039 persons had major injuries, 23,292 had moderate injuries, 76,796 had minor injuries, and 12,733 had injuries of unknown severity.
Given these facts, it is clear that traffic safety must play an important role for any transportation policy-making body, and the U.S. Department of Transportation (USDOT) is no exception. Safety is one of the five current strategic goals of the USDOT. In 2002 alone, the budget of the National Highway Traffic Safety Administration (NHTSA), one of several USDOT agencies concerned with traffic safety, reached 424 million dollars (NHTSA, 2004). However, these resources are scarce; therefore, a better understanding of road crashes is needed in order to direct them to the most vulnerable areas (e.g. counties or districts), facility types (e.g. intersections, rural highways, and ramps), and user groups (e.g. pedestrians and drivers).
Good engineering begins by understanding the problem at hand. As Haight (1988) commented on traffic safety, “Many of us have heard demands that we ‘do something,’ but it is only recently that there have been suggestions that we should ‘know what we are doing’ before we begin to do it.” Consequently, traffic safety modeling may be used as a tool to increase our understanding of traffic crashes, with the ultimate aim of reducing their impact on society as a whole and on road users as individuals.
Although roadway crashes are by nature determined by individuals (e.g. the driver), it is practically impossible to study, at the individual level, the influence of spatially defined factors such as land use, demographic characteristics, and traffic volume. In most roadway accident studies, the data are grouped into spatial units that range from the intersection or road section level to the zip code or county level. Several studies of road crashes at different area levels, ranging from local census tracts to counties, have been performed recently (Amoros et al, 2003; Miaou et al, 2003; Noland and Oh, 2004; Noland and Quddus, 2004; MacNab, 2004). However, of these studies, only those by Miaou et al and MacNab explicitly model spatial autocorrelation, using Bayesian modeling techniques.
The recent development of spatial modeling techniques has enabled researchers to investigate important issues related to risk estimation, unmeasured confounding variables, and spatial dependence (Richardson, 1992). In general, spatial correlation is analogous to temporal correlation: dependence among observations increases the variance of the estimates, so standard errors are underestimated if the dependence is not recognized and addressed.
Another important advantage of spatial models is that spatial effects may reflect unmeasured confounding variables. This is particularly useful for unmeasured confounders that vary in space, such as weather and population. More important still, as MacNab (2004) noted, “The methods also facilitate spatial smoothing and data pooling when regions under investigation involve small-population areas.” Here the term ‘small-population areas’ refers to areas that record very few events because the events themselves, such as roadway crashes, are rare.
Previous research has dealt with the spatial component of road crashes in different ways. For example, Levine et al (1995a) and Jones et al (1996) modeled crashes as point events. Other recent studies, such as Shankar et al (1996), Amoros et al (2003), Miaou et al (2003), Noland and Oh (2004), and MacNab (2004), modeled road crashes at different area levels, ranging from road sections to local census tracts or counties.
In the study by Levine et al (1995a), the crashes were geocoded to the nearest intersection or ramp. Once the location of each accident was assigned, different ‘spatial’ statistics were calculated from the x and y coordinates of the accidents, including the mean center, the standard distance deviation based on “great circle” distance, the standard deviational ellipse (1st and 2nd principal components), and the Nearest Neighbor Index. The work concentrated on developing spatial probability ellipses for different categories of crashes, e.g. all crashes or alcohol-related crashes, and crashes involving one, two, or three or more vehicles. The main shortcoming of this analysis is its descriptive rather than predictive approach. In addition, the statistical assumptions on which the work is tacitly based are commonly violated. For example, the assumption that points are normally distributed in the x and y coordinates implies the absence of clustering, which is clearly violated by crash data.
The work by Jones et al (1996) is another example of the use of spatial point pattern analysis in traffic accidents. This paper presents a classical K-function analysis of the residuals of a logit model in which the log-odds are those of a casualty being a fatality as opposed to a serious injury. The variables of the model were age, type of user (pedestrian, bicyclist, motor vehicle driver), and number of casualties. With this ad hoc approach, the authors found that, once the trend was removed from the data, the residuals exhibited clustering. Although this study, unlike the work by Levine et al (1995a), analyzes certain contributing factors, it still fails to incorporate the spatial correlation directly into the coefficient estimation.
Levine et al (1995b) also estimated a spatial model at the census block level, known as a “spatial lag” model. The spatial lag model is defined as:
$$Y_i = \rho\, W(Y_j) + X\beta + \varepsilon$$
where Yi is an N by 1 vector of observations on the dependent variable for all locations i, W(Yj) is an N by 1 vector of weighted values of the dependent variable summed over all locations j, where i≠j (the “spatial lag”), ρ is the coefficient of the spatial lag (the spatial autoregressive term), X is an N by K matrix of observations on the explanatory variables, β is a K by 1 vector of regression coefficients, and ε is an N by 1 vector of normally distributed random error terms with mean 0 and constant variance σ². The explanatory variables included in the model were: freeway crossing the block (dummy), miles of arterials or highways, miles of minor roads, miles of freeways, population, and employment. Even though the model takes into account the spatial correlation of the data, it assumes that the number of crashes is normally distributed rather than Poisson or negative binomial.
Another researcher who has analyzed crashes while taking the spatial component into consideration is Robert B. Noland. In “A spatially disaggregate analysis of road casualties in England” (Noland and Quddus, 2004), the analysis is performed at the “ward” (census tract) level. Negative binomial models were estimated with total fatalities, serious injuries, and slight injuries as dependent variables. The independent variables were classified into four categories: land use indicator variables (employment and population density), road characteristics, demographic characteristics (age cohorts), and traffic flow proxies (proximate and total employment). Some limitations of this study are the use of cross-sectional data, the use of proxy variables for traffic flow, and the lack of spatial correlation analysis.
In other work with county-level data, Noland and Oh (2004) estimated the expected number of crashes using infrastructure characteristics and demographic indicators as independent variables in a negative binomial model. Although this article improved on the former study by using four years of data, the absence of traffic flow data and of spatial correlation analysis persists. Similarly, the study by Amoros et al (2003) presents negative binomial models of road crash incidence and severity at the county level, using road type and its interactions with county as independent variables.
The most advanced work in terms of spatial modeling of traffic crashes, to the knowledge of the author, was developed by Miaou et al (2003). The authors developed a series of spatial models of crashes at county level for data from the state of Texas. Poisson-based full hierarchical Bayes models of fatal (K), incapacitating (A), and non-incapacitating (B) injuries were estimated using both frequency and rate values (using VMT as an offset term). A Conditional Auto-Regressive (CAR) model was used to model spatial correlation, and Markov Chain Monte Carlo (MCMC) was used to sample the posterior probability distribution. The main drawback of this work is the use of surrogate variables for the percent of time that the road is wet, sharp horizontal curves, and roadside hazards.
These variables were estimated by proportions of crashes. For example, for percent of time that the road is wet, the variable was estimated by dividing the number of crashes that occurred under wet pavement by the total number of crashes. These estimators are clearly biased in the direction of the effect. Given the poor definition of contributing factors in the model, it is likely that the spatial correlation is overestimated.
This project aims to estimate spatial models while controlling for known contributing factors of traffic crashes. It is expected that using better explanatory variables, like the ones proposed in this project, will result in a better estimation of spatial interactions. In addition, hypotheses about possible contributing factors that have not been tested before, such as mean travel time to work or the number of arrests for driving under the influence, will be tested. Another research question addressed in this work is the effect that including spatially related variables, such as environmental and population-related variables, has on the spatial correlation of the data. In other words, the question is how much spatial correlation, if any, can still be detected once spatially distributed variables like population and weather are included in the model.
In this study the contributing factors are divided into three main categories: socioeconomic, transportation infrastructure-related, and environmental factors. The first category attempts to describe the user characteristics, while the transportation infrastructure factors seek to explain the system characteristics. Environmental factors include weather-related variables such as number of days with snow and total precipitation. User characteristics involve drivers and other vehicle occupants, as well as pedestrians and other users of the transportation facilities. Transportation-related factors involve the characteristics of the transportation infrastructure, such as vehicle-miles traveled and road mileage.
The purpose of this research is to develop spatial models of road crashes for the State of Pennsylvania at county level while controlling for socioeconomic, transportation-related, and environmental factors. Different combinations of contributing factors and possible interaction terms will be tested. The work is organized as follows: the next section presents the methodology, followed by a description of the sources and nature of the data analyzed in this study, the preliminary results, and finally a timetable of activities.
When data arise as counts, the Poisson distribution is typically used to model them. Traffic crashes are a clear example of count data; therefore, a Poisson distribution is a useful starting point. In the words of Shankar et al (1995), it presents two important advantages: “(i) it lends itself well to modeling of count data by virtue of its discrete, nonnegative integer-distribution characteristics and (ii) can be generalized to more flexible distributional forms.”
The probability function for a Poisson distribution is:
$$\Pr(z) = \frac{e^{-\lambda}\lambda^{z}}{z!}, \qquad z = 0, 1, \ldots \tag{1}$$
where Pr(z) is the probability of z number of events and λ is the expected number of events.
Now, a model based on a Poisson distribution can be written as
$$\ln \lambda = X\beta \tag{2}$$
where X is the vector of covariates and β is the vector of coefficients.
An important characteristic of the Poisson distribution is that its variance is equal to its mean.
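The equality of mean and variance can be checked numerically. The following is a minimal sketch, assuming an illustrative rate of λ = 4 and a truncation of the support at z = 100; the function names are hypothetical and are not part of the proposal.

```python
import math


def poisson_pmf(z: int, lam: float) -> float:
    """Pr(z) = exp(-lam) * lam**z / z!  for z = 0, 1, ..."""
    return math.exp(-lam) * lam ** z / math.factorial(z)


def poisson_moments(lam: float, z_max: int = 100) -> tuple:
    """Mean and variance computed from the pmf, truncated at z_max."""
    probs = [poisson_pmf(z, lam) for z in range(z_max + 1)]
    mean = sum(z * p for z, p in enumerate(probs))
    var = sum((z - mean) ** 2 * p for z, p in enumerate(probs))
    return mean, var


# Illustrative rate only; any moderate lam behaves the same way.
mean, var = poisson_moments(lam=4.0)
print(mean, var)  # both values should agree with lam up to rounding error
```

Crash counts whose sample variance clearly exceeds the sample mean violate this property, which is the motivation for the negative binomial distribution.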
Several authors, including Shankar et al (1995), Noland and Quddus (2004), and Lord et al (2003), have argued that car crashes are best represented by negative binomial distributions. A negative binomial distribution is a nonnegative count distribution, generated by a Poisson process, with variance greater than the mean. This key feature makes the negative binomial distribution preferable to the Poisson distribution, where the variance is equal to the mean. In the words of Shankar et al (1995), “It is well known, based on the findings of many previous research efforts, that accident frequency data tend to be overdispersed,…”
A negative binomial distribution can be considered as a Poisson distribution with mean λ, which is itself a gamma random variable. The model can be written as
$$\ln \lambda = X\beta + \varepsilon \tag{3}$$
where X is the vector of covariates, β is the vector of coefficients, and ε is the error term.
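The marginal probability distribution, after mixing over λ, is:
$$\Pr(Z = z) = \int_{0}^{\infty} \frac{e^{-\lambda}\lambda^{z}}{z!}\, f(\lambda;\mu,\beta)\, d\lambda, \qquad z = 0, 1, \ldots \tag{4}$$
where f(λ; μ, β) is the density of the gamma random variable λ with parameters β and μ, and the integral evaluates to a closed form in terms of the Gamma function Γ(·) (note that the parameter β in equation (4) is different from the vector of coefficients β in equations (2) and (3)).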
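As an illustration of the mixing step, the sketch below assumes that λ follows a gamma distribution with shape μ and rate β. The proposal does not state which parameterization is intended, so this choice is an assumption. Under it, the Poisson-gamma mixture has the closed form $\Pr(Z=z)=\frac{\Gamma(z+\mu)}{\Gamma(\mu)\,z!}\left(\frac{\beta}{1+\beta}\right)^{\mu}\left(\frac{1}{1+\beta}\right)^{z}$, with mean μ/β and variance μ/β + μ/β², so the variance always exceeds the mean. The parameter values are illustrative only.

```python
import math


def negbin_pmf(z: int, mu: float, beta: float) -> float:
    """Poisson-gamma mixture pmf, assuming gamma shape mu and rate beta."""
    log_p = (math.lgamma(z + mu) - math.lgamma(mu) - math.lgamma(z + 1)
             + mu * math.log(beta / (1.0 + beta))
             - z * math.log(1.0 + beta))
    return math.exp(log_p)


def negbin_moments(mu: float, beta: float, z_max: int = 1000) -> tuple:
    """Mean and variance computed from the pmf, truncated at z_max."""
    probs = [negbin_pmf(z, mu, beta) for z in range(z_max + 1)]
    mean = sum(z * p for z, p in enumerate(probs))
    var = sum((z - mean) ** 2 * p for z, p in enumerate(probs))
    return mean, var


# Hypothetical parameter values chosen for illustration.
mean, var = negbin_moments(mu=2.0, beta=0.5)
print(mean, var)  # mean close to mu/beta, variance close to mu/beta + mu/beta**2
```

With these values the mean is about 4 and the variance about 12, whereas a Poisson distribution with the same mean would have variance 4.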
A general spatial model can be described as:
$$\{ Z(s) : s \in D \} \tag{5}$$
where s = (x, y) denotes the coordinates of a sample site, Z(s) denotes the variable of interest at location s, and D is a set of spatial locations at which data can be obtained.
A lattice process is a finite collection of n elements D = {s1, s2, ···,sn}.
Now the objective is to build a model for the joint distribution of the data
Z(s1), Z(s2), ···, Z(sn).
For that purpose, a class of auto-models will be considered.
Just as we used conditional distributions to model time series data, we will use conditional distributions to model spatial data on a lattice. Assume:
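$$\Pr\big(Z(s_i) = z_i \mid \{z_j : j \neq i\}\big) = \Pr\big(Z(s_i) = z_i \mid \{z_j : j \in N_i\}\big), \qquad i = 1, 2, \ldots, n; \tag{6}$$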
that is, the conditional probability of Z(si) = zi given the realization of the data at all remaining sites depends only on the data at sites sj belonging to the collection of sites Ni.
A site sj is defined to be a neighbor of site si if the conditional distribution of Z(si) given the data at all remaining sites depends on the realization zj of Z(sj):
Ni = {j : sj is a neighbor of si}
The neighboring structure can be built in multiple ways, depending on the research objectives. In this study, given the irregular nature of the lattice (counties), it is convenient to define as a neighbor of county si any county sj that shares a boundary with si.
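The shared-boundary rule translates directly into a binary connectivity matrix. The sketch below is a minimal illustration on a hypothetical four-county layout (counties A to D); the county names, the adjacency list, and the function name are assumptions made for the example, not data from this study.

```python
def connectivity_matrix(neighbors: dict) -> tuple:
    """Build a symmetric 0/1 matrix from a mapping site -> set of neighbors."""
    sites = sorted(neighbors)
    index = {s: i for i, s in enumerate(sites)}
    n = len(sites)
    matrix = [[0] * n for _ in range(n)]
    for s, nbrs in neighbors.items():
        for t in nbrs:
            # Sharing a boundary is a symmetric relation.
            matrix[index[s]][index[t]] = 1
            matrix[index[t]][index[s]] = 1
    return sites, matrix


# Hypothetical layout: A-B, A-C, B-C and C-D share boundaries.
toy_neighbors = {"A": {"B", "C"}, "B": {"A", "C"},
                 "C": {"A", "B", "D"}, "D": {"C"}}
sites, W = connectivity_matrix(toy_neighbors)
for site, row in zip(sites, W):
    print(site, row)
```

The diagonal stays at zero because a site is not its own neighbor, and row i lists the members of Ni.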
Then, {Z(si) : i = 1, 2, …, n} is a Markov Random Field if the conditional distribution of the data at any given site, given the realization of the data at the remaining sites, depends only on the realization of the data at the neighbor sites.
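Assuming that, conditional on {Z(sj) : j ≠ i}, the data Z(si) are Gaussian distributed and that there are pairwise spatial interactions only, the conditional distribution of Z(si) is:
$$Z(s_i) \mid \{z_j : j \neq i\} \sim N\Big(\mu_i + \sum_{j \in N_i} c_{ij}\,(z_j - \mu_j),\; \sigma^2\Big) \tag{7}$$
where μi is the mean at site si, σ² is the conditional variance, and cij are the spatial autocorrelation coefficients.
Maximum likelihood estimation requires the joint distribution of the data vector z. This joint distribution is:
$$Z \sim N\big(\mu,\; \sigma^2 (I - C)^{-1}\big) \tag{8}$$
where I is the identity matrix and C = (cij) is the connectivity matrix based on the defined neighboring structure.
In a multiple regression model
$$\mu = X\beta \tag{9}$$
where X is the vector of covariates and β is the vector of coefficients.
Then
$$Z \sim N\big(X\beta,\; \sigma^2 (I - C)^{-1}\big) \tag{10}$$
and the expected value of Z(si), given the data at the remaining sites, is
$$E\big[Z(s_i) \mid \{z_j : j \neq i\}\big] = x_i^{\top}\beta + \sum_{j \in N_i} c_{ij}\,(z_j - x_j^{\top}\beta) \tag{11}$$
with xi denoting the covariates at site si.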
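To make the conditional structure concrete, the sketch below evaluates a CAR-type conditional mean of the form μi + Σj cij (zj − μj) at one site. It assumes cij = γ·wij with a single dependence parameter γ and a binary neighbor matrix wij; the four-site layout, the value γ = 0.1, and the data values are hypothetical.

```python
def car_conditional_mean(i: int, z: list, mu: list, c: list) -> float:
    """mu_i + sum_j c_ij * (z_j - mu_j), summing over all sites j != i."""
    return mu[i] + sum(c[i][j] * (z[j] - mu[j])
                       for j in range(len(z)) if j != i)


# Hypothetical binary neighbor matrix for four sites.
W = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
gamma = 0.1  # hypothetical single spatial dependence parameter
C = [[gamma * w for w in row] for row in W]

z = [3.0, 2.0, 4.0, 1.0]    # illustrative observed values (e.g. log counts)
mu = [2.5, 2.5, 3.0, 2.0]   # illustrative large-scale means (e.g. X * beta)

m0 = car_conditional_mean(0, z, mu, C)
print(m0)
```

Site 0 has neighbors 1 and 2, whose deviations from their means are −0.5 and +1.0, so its conditional mean is pulled from 2.5 up to about 2.55. Non-neighbors contribute nothing because their cij are zero.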
Now, recalling the model proposed by Levine et al (1995b):
$$Y_i = \rho\, W(Y_j) + X\beta + \varepsilon \tag{12}$$
where W(Yj) is an N by 1 vector of weighted values of the dependent variable summed over all locations j.
If cij = ρ, a single spatial autocorrelation coefficient, equations (11) and (12) take the same form; therefore, the method proposed by Levine et al (1995b) is just a special case of the Conditional Autoregressive (CAR) model.
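The auto Poisson conditional specification, assuming pairwise-only dependence between sites, is (Cressie, 1993):
$$\Pr\big(z_i \mid \{z_j : j \in N_i\}\big) = \frac{e^{-\lambda_i}\lambda_i^{z_i}}{z_i!}, \qquad z_i = 0, 1, 2, \ldots \tag{13}$$
where
$$\lambda_i = \exp\Big(\alpha_i + \sum_{j \in N_i} \theta_{ij}\, z_j\Big), \tag{14}$$
where θij = θji, θii = 0, and the θ’s are the spatial autocorrelation coefficients.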
If trend or large-scale variation is introduced, then:
α = Xβ, (15)
where, as in equation (2), X is the vector of covariates and β is the vector of coefficients.
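A small numerical sketch may help to read the auto Poisson specification. It assumes the conditional rate λi = exp(αi + Σj θij zj) and evaluates it for the middle site of three hypothetical sites on a line; all values are illustrative. The θ values are taken as negative here because the auto Poisson model is generally understood to admit only non-positive spatial dependence.

```python
import math


def auto_poisson_rate(i: int, z: list, alpha: list, theta: list) -> float:
    """Conditional rate exp(alpha_i + sum_j theta_ij * z_j), j != i."""
    eta = alpha[i] + sum(theta[i][j] * z[j]
                         for j in range(len(z)) if j != i)
    return math.exp(eta)


def conditional_pmf(zi: int, lam: float) -> float:
    """Poisson probability of zi given the conditional rate lam."""
    return math.exp(-lam) * lam ** zi / math.factorial(zi)


# Three hypothetical sites on a line: 0 - 1 - 2.
t = -0.05  # hypothetical common interaction, theta_ij = theta_ji
theta = [[0.0, t, 0.0],
         [t, 0.0, t],
         [0.0, t, 0.0]]
alpha = [1.0, 1.0, 1.0]  # would equal X * beta when a trend is introduced
z = [2, 4, 1]            # illustrative counts

lam_mid = auto_poisson_rate(1, z, alpha, theta)
print(lam_mid, conditional_pmf(4, lam_mid))
```

Here the neighbors of the middle site hold 2 + 1 = 3 events, so its linear predictor is 1.0 − 0.05·3 = 0.85 and its conditional rate is exp(0.85).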
An auto negative binomial model is a conditionally specified spatial model that, like a negative binomial model, describes count processes with overdispersion but takes into account spatial correlation among different sites. The conditional specification of an auto negative binomial model can be defined as (Cressie, 1993):
$$\Pr\big(z_i \mid \{z_j : j \in N_i\}\big) = \binom{k + z_i - 1}{z_i}\, p_i^{\,z_i}\,(1 - p_i)^{k}, \qquad z_i = 0, 1, 2, \ldots \tag{16}$$
where k > 0 is the scale parameter and 0 < pi < 1 is the conditional probability parameter at site si.
If pairwise-only dependence between sites is assumed, then
$$\ln p_i = \alpha_i + \sum_{j \in N_i} \theta_{ij}\, z_j, \tag{17}$$
where θij = θji, θii = 0, and the θ’s are the spatial autocorrelation coefficients.
Finally, if trend or large-scale variation is introduced, then α = Xβ, where, as in equation (2), X is the vector of covariates and β is the vector of coefficients.
Given this conditional distribution, one can find the joint distribution, which is required for maximum likelihood estimation of the model parameters.
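As mentioned before, in this study the risk factors are divided into three main categories: socioeconomic, transportation infrastructure-related, and environmental factors.
Among socioeconomic factors, the following will be studied in this work: age (percent of persons in each age cohort), sex (percent of males), percent of urban population, percent of persons living under poverty, and driving under the influence (number of DUI arrests).
The transportation-related factors are: vehicle-miles traveled, number of miles by functional class, and mean travel time to work.
The environmental factors are: the amount of, and number of days with, rain and snow.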
The data will be collected from several sources, including the US Census Bureau, the Pennsylvania Department of Transportation, and the National Climatic Data Center (NOAA). Each of the variables considered in the analysis is explained below, including its specific source.
The crash data consist of fatal crash reports from the FARS database. FARS is the Fatality Analysis Reporting System, part of the National Center for Statistics and Analysis of the National Highway Traffic Safety Administration. The system is presented in a web-based encyclopedia format, accessible on the Internet for querying the years 1994 to 2002, and in database format for download for the years 1974 to 1993. The database includes records for each crash, vehicle, and person involved in a fatal crash. The system defines a Fatal Crash as “A police-reported crash involving a motor vehicle in transport on a trafficway in which at least one person dies within 30 days of the crash” (FARS, 2004). Figure 1 presents the map of fatal crashes by county in the year 2000.
These factors try to explain differences in the risk of car crashes to which persons are subject, given their individual social and economic differences. They may be described as the ‘human factors’ that contribute to car crashes. Authors such as Shinar (1978) and Evans (1991) have suggested factors including age, sex, and personality to explain car crash risk. In this work, the socioeconomic factors considered try to reflect these individual differences at the aggregate county level.
Different authors, including Evans (1991) and Kam (2003), have shown that young and old drivers have a higher risk of car accidents. Therefore, the variables percent of persons between 16 and 24 and percent of persons over 65 will be included in the analysis. However, it should be noted that these variables correspond to the total population rather than the population of drivers, because this is the only data available.
Younger population groups are often associated with higher risk of road accidents (Noland and Quddus, 2004). The higher risk is associated with higher exposure as pedestrians, given their lack of awareness of the dangers of the road. A similar situation is expected for elderly pedestrians, which may be reflected in the variable percent of persons over 65.
The source of data for percent of persons under each age cohort is the US Census Bureau.
Sex has been found to be a key factor in car crash risk by Evans (1991) and Kam (2003). According to their findings, male drivers have a higher crash risk than female drivers. In this study, the percent of males will be used to capture this effect. Again, it must be highlighted that the variable is based on the whole population rather than the population of drivers.
The source of data for percent of males is the US Census Bureau.
Although the percent of urban population is clearly a socioeconomic variable, it indicates the level of urbanization of the county and is therefore an indicator of land use intensity, which is more closely related to transportation infrastructure. Higher land use intensities are normally associated with higher car accident risk. The results of Noland and Quddus (2004) showed that land use is associated with car accidents.
The source of data for percent of urban population is the US Census Bureau.
Figure 1 Number of Fatal Crashes reported by County in the State of Pennsylvania Year 2000
In this study the percent of persons living under poverty is used as an indicator of area deprivation. Area deprivation has been found to be positively related to car crashes by Chichester et al (1998), Abdalla et al (1997), and Noland and Quddus (2004).
The source of data for percent of persons living under poverty is also the US Census Bureau, specifically its Small Area Income & Poverty Estimates office.
According to Evans (1991), “From the earliest days of motorization, alcohol has been recognized as a factor leading to increased crash risk.” Gary et al (2003) found that in 1998 dry counties in Kentucky (those where the sale of alcohol is prohibited) had fewer alcohol-related traffic crashes and fewer driving under the influence (DUI) arrests per 1000 licensed drivers. In this work, the number of DUI arrests is used to relate alcohol consumption to crash frequency.
The information on the number of DUI arrests was supplied by the Uniform Crime Reporting Unit of the Pennsylvania State Police.
Among transportation-related factors, vehicle-miles traveled is often used as an exposure indicator (Miaou et al, 2003), along with the number of miles of different functional classes per county (Noland and Quddus, 2004). In addition to these factors, the mean travel time to work will be tested. The hypothesis is that higher travel times to work result in higher exposure and therefore higher risk.
Vehicle-Miles Traveled (VMT) is a performance measure related to the level of usage of a particular section or group of highways. Crash risk is expected to increase with VMT because of the increase in exposure. Miaou et al (2003) use VMT by county, among other variables, to predict crash frequency in the state of Texas. A different approach is to use VMT as the denominator or normalizing variable for the dependent variable (i.e. number of crashes per VMT). However, in this study VMT will be modeled explicitly as one of the risk factors, in order to quantify its contribution to crash risk relative to the other risk factors in the model.
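The two treatments of VMT can be contrasted in log-link notation. The sketch below assumes a log link and a logged VMT covariate; the coefficient symbol β_V is introduced here only for illustration and does not appear elsewhere in the proposal.

```latex
% Rate (offset) formulation: the coefficient of ln(VMT) is fixed at 1
\ln \lambda_i = \ln(\mathrm{VMT}_i) + X_i \beta
\quad\Longleftrightarrow\quad
\frac{\lambda_i}{\mathrm{VMT}_i} = \exp(X_i \beta)

% Explicit-covariate formulation: the coefficient of ln(VMT) is estimated
\ln \lambda_i = \beta_V \ln(\mathrm{VMT}_i) + X_i \beta
```

Under these assumptions the offset formulation is the special case β_V = 1, so estimating β_V allows the data to indicate whether crash frequency grows proportionally, less than proportionally, or more than proportionally with VMT.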
The Daily Vehicle-Miles Traveled (DVMT) was obtained from the Annual Highway Statistics Report published by PennDOT.
Different functional classes have different design and operational standards, with higher standards for higher functional classes. Therefore, it is expected that higher functional categories will be associated with fewer crashes. Noland and Quddus (2004) tested this effect for roads in England; however, they could not find statistically significant differences in crash frequency attributable to the roadway functional classification. This work will test whether the number of miles of higher functional class roads is significant with respect to crash risk.
The number of miles by functional category was also obtained from the Annual Highway Statistics Report published by PennDOT.
The mean travel time to work by county is expected to be related to higher crash risk because of the increase in exposure. In addition, higher mean travel times to work are associated with more intensive land uses. Dense urban areas often present high levels of traffic congestion, which is related to higher travel times. To the knowledge of the author, travel time has not previously been used as a predictor of crash risk; therefore, this variable is one of the most interesting in the analysis.
The data on travel time to work were derived from answers to the long-form questionnaire of the US Census, administered to one in six households in 1990 and 2000. The elapsed time includes time spent waiting for public transportation, picking up passengers in carpools, and time spent in other activities related to getting to work. The dataset is available on the US Census web page (US Census Bureau, 2004).
Many environmental factors can be associated with higher crash risk. Some examples of environmental factors are rain, snow, and darkness. The higher risk may be associated with reduced driver performance (e.g. sight distance) and reduced vehicle performance (e.g. on wet pavement).
Given the data accessibility constraints of the project, the environmental factors considered are solely weather-related. Weather-related factors have been investigated in the past by Shankar et al (1995) and Edwards (1996). Both studies found a positive correlation between weather hazards and crash frequency.
The amount of, and number of days with, rain and snow will be included in the model. The data source for these variables is the National Climatic Data Center (NCDC) of the National Oceanic and Atmospheric Administration (NOAA). Data from hundreds of weather stations will be used to generate predicted surfaces for each variable, and the surfaces will then be summarized at the county level so that a single value per county enters the database. An example is presented in Figure 2, which shows the predicted total precipitation surface for the year 2000.
Figure 2 Predicted Total Precipitation Surface Year 2000
A previous unpublished work by the author (Aguero, 2004) developed a log-linear relationship between fatal crash frequency and some of the predictor variables mentioned before. The variables included into the analysis were:
Table 1 Description of variables.
| Variable | Description |
|---|---|
| Dependent Variable | |
| Lcrashes | Ln of total fatal crashes in 2002 |
| Socioeconomic Variables | |
| P_pov | Percent of population under poverty in 2000 |
| P16 | Percent of population under 16 in 2000 |
| P16_24 | Percent of population between 16 and 24 in 2000 |
| P65 | Percent of population over 65 |
| Pmales | Percent of males in 2000 |
| P_urban | Percent of urban population in 2000 |
| LDUI | Ln of Driving Under Influence Arrests in 2002 |
| Transportation Related Variables | |
| LDVMT | Ln of Daily Vehicle-Miles Traveled in 2002 |
| Lfed_aid | Ln of miles of federal aid roads in 2002 |
| Lnonfed_aid | Ln of miles of non-federal aid roads in 2002 |
| Ltravel_t | Ln of Mean travel time to work (minutes), workers age 16+, 2000 |
| Ltotal | Ln of miles of roads (federal and non-federal aid) in 2002 |
| Pfed_aid | Percent of miles of federal aid roads in 2002 |
Although biased estimators are expected given the misspecification of the model (log-normal linear regression instead of a count data model like Poisson or negative binomial), this model can be seen as a good first approximation to the problem. Table 2 presents the results.
The results in Table 2 are promising. Not only do the models indicate a very good fit to the data, but many variables of interest are also statistically significant. However, more data need to be incorporated. The current sample size is only 67 because just one year of data was used. The goal is to include data for at least one or two more years, depending on data availability.
Table 2 Linear models for natural log of fatal crashes (sample size = 67)
| Variable | Statistic | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|---|
| Intercept | Estimate | -12.2579 | -9.6169 | -9.5382 | -9.6221 |
| | S.E. | 4.7518 | 1.4727 | 1.4641 | 1.4311 |
| | p-value | 0.0126 | 0.0000 | 0.0000 | 0.0000 |
| P_pov | Estimate | 0.0564 | 0.0430 | 0.0423 | 0.0412 |
| | S.E. | 0.0198 | 0.0174 | 0.0173 | 0.0168 |
| | p-value | 0.0062 | 0.0163 | 0.0178 | 0.0177 |
| P16 | Estimate | 0.0165 | | | |
| | S.E. | 0.0601 | | | |
| | p-value | 0.7851 | | | |
| P16_24 | Estimate | -0.0095 | | | |
| | S.E. | 0.0352 | | | |
| | p-value | 0.7874 | | | |
| P65 | Estimate | -0.0250 | | | |
| | S.E. | 0.0485 | | | |
| | p-value | 0.6072 | | | |
| Pmales | Estimate | 0.0595 | | | |
| | S.E. | 0.0433 | | | |
| | p-value | 0.1753 | | | |
| P_urban | Estimate | 0.0020 | | | |
| | S.E. | 0.0044 | | | |
| | p-value | 0.6545 | | | |
| LDUI | Estimate | 0.1884 | 0.2473 | 0.2420 | 0.2325 |
| | S.E. | 0.1272 | 0.1064 | 0.1058 | 0.1011 |
| | p-value | 0.1443 | 0.0238 | 0.0261 | 0.0253 |
| LDVMT | Estimate | 0.4555 | 0.4547 | 0.4481 | 0.4376 |
| | S.E. | 0.1701 | 0.1374 | 0.1373 | 0.1326 |
| | p-value | 0.0097 | 0.0017 | 0.0019 | 0.0017 |
| Lfed_aid | Estimate | 0.0717 | -0.0459 | | |
| | S.E. | 0.2758 | 0.2446 | | |
| | p-value | 0.7958 | 0.8516 | | |
| Lnonfed_aid | Estimate | 0.2669 | 0.3132 | | |
| | S.E. | 0.2195 | 0.1797 | | |
| | p-value | 0.2292 | 0.0869 | | |
| Ltravel_t | Estimate | 0.4628 | 0.5705 | 0.5686 | 0.5643 |
| | S.E. | 0.3338 | 0.2239 | 0.2241 | 0.2221 |
| | p-value | 0.1713 | 0.0137 | 0.0140 | 0.0139 |
| Ltotal | Estimate | | | 0.2869 | 0.3177 |
| | S.E. | | | 0.1647 | 0.1344 |
| | p-value | | | 0.0871 | 0.0216 |
| Pfed_aid | Estimate | | | -0.4084 | |
| | S.E. | | | 1.2435 | |
| | p-value | | | 0.7442 | |
| R² | | 0.895 | 0.886 | 0.886 | 0.886 |
| Adjusted R² | | 0.874 | 0.874 | 0.874 | 0.876 |
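Because both the dependent variable and several covariates in Table 2 are in natural logs, the corresponding coefficients can be read as elasticities. The sketch below uses the Model 4 estimates from Table 2; the helper names and the scenarios (doubling DVMT, a one-point rise in the poverty percentage) are illustrative assumptions.

```python
import math

# Model 4 estimates copied from Table 2.
MODEL4 = {"P_pov": 0.0412, "LDUI": 0.2325, "LDVMT": 0.4376,
          "Ltravel_t": 0.5643, "Ltotal": 0.3177}


def multiplier_logged(coef: float, factor: float) -> float:
    """Effect on fitted crashes of multiplying a logged covariate by factor."""
    return factor ** coef


def multiplier_unlogged(coef: float, delta: float) -> float:
    """Effect on fitted crashes of adding delta to an unlogged covariate."""
    return math.exp(coef * delta)


double_dvmt = multiplier_logged(MODEL4["LDVMT"], 2.0)
one_point_pov = multiplier_unlogged(MODEL4["P_pov"], 1.0)
print(round(double_dvmt, 3), round(one_point_pov, 3))
```

Under Model 4, doubling daily vehicle-miles traveled multiplies the fitted number of fatal crashes by roughly 1.35 rather than 2, and each additional percentage point of population under poverty raises it by roughly 4 percent, holding the other variables fixed.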
A preliminary spatial model was also tested under the assumption of log-normality. The log transformation was possible because there were no zeros in the crash frequency data for the year 2002. However, this is not the case for other years; therefore, the assumption of a Gaussian distribution cannot be used for the final model. Table 3 presents the results for the Gaussian log-linear Conditional Autoregressive (CAR) model (see section 2.4 for details of the model specification). The software R (R Development Core Team, 2003) and functions by Rathbun (2004) were used to estimate the parameters.
Table 3 Spatial Log-linear CAR model of Fatal Crashes 2002.
| | estimate | S.E | t | p-value |
| intercept | -10.251 | 1.40963 | 7.272 | 0.0000 |
| Povperc | 0.060329 | 0.01617 | 3.732 | 0.0004 |
| LDVMT | 0.584798 | 0.11839 | 4.940 | 0.0000 |
| Ltravelt | 0.651254 | 0.22693 | 2.870 | 0.0056 |
| LDUI | 0.283772 | 0.09353 | 3.034 | 0.0035 |
| | | | Z | |
| gamma | 0.125247 | 0.03555 | 3.523 | 0.0002 |
| sigmasq | 0.088247 | 0.01573 | 5.610 | 0.0000 |
Here, gamma (γ) is the single spatial dependence parameter in the model. Some variables from the initial non-spatial model were removed because their high multicollinearity caused convergence problems. The spatial parameter is significant, which indicates spatial correlation. This is not surprising given the clustering of the data visible in Figure 1. Again, the preliminary results suggest that the work is heading in the right direction.
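As a quick arithmetic check, the test statistics in Table 3 can be reproduced as the ratio of each estimate to its standard error. The sketch below only re-uses the values printed in the table; the function and variable names are illustrative and are not part of the original analysis.

```python
# Estimates and standard errors copied from Table 3.
table3 = {
    "intercept": (-10.251, 1.40963),
    "Povperc": (0.060329, 0.01617),
    "LDVMT": (0.584798, 0.11839),
    "Ltravelt": (0.651254, 0.22693),
    "LDUI": (0.283772, 0.09353),
    "gamma": (0.125247, 0.03555),
    "sigmasq": (0.088247, 0.01573),
}


def wald_statistic(estimate, std_error):
    """Ratio of an estimate to its standard error."""
    return estimate / std_error


# One statistic per parameter, keyed by the parameter name used in Table 3.
statistics = {
    name: wald_statistic(est, se) for name, (est, se) in table3.items()
}

for name, value in statistics.items():
    print(f"{name:10s} {value:8.3f}")
```

The ratio for the intercept is negative, so Table 3 appears to report its magnitude; the remaining ratios agree with the tabulated values up to rounding.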
Table 4 shows a preliminary Negative Binomial model without spatial dependence, fitted to a different dataset covering the years 1999 and 2000. The dependent variable, as before, was the number of fatal crashes. The parameters were estimated with SAS.
Table 4 Analysis of Parameter Estimates, Negative Binomial Model.
| Parameter | DF | Estimate | Standard Error | Wald 95% Lower Limit | Wald 95% Upper Limit | Chi-Square | Pr > ChiSq |
| Intercept | 1 | -0.1074 | 11.4913 | -22.63 | 22.4153 | 0.00 | 0.9925 |
| DVMT_T | 1 | 0.0269 | 0.0121 | 0.0032 | 0.0507 | 4.97 | 0.0259 |
| L_DUI | 1 | 0.5360 | 0.0573 | 0.4238 | 0.6483 | 87.55 | <.0001 |
| L_P_POV | 1 | 0.2222 | 0.1383 | -0.0488 | 0.4932 | 2.58 | 0.1080 |
| L_P0_14 | 1 | 0.0941 | 0.7784 | -1.4316 | 1.6198 | 0.01 | 0.9038 |
| L_P15_24 | 1 | -0.6779 | 0.3895 | -1.4413 | 0.0854 | 3.03 | 0.0818 |
| L_P65 | 1 | -0.9877 | 0.5536 | -2.0728 | 0.0973 | 3.18 | 0.0744 |
| L_P_MALES | 1 | 0.8594 | 2.1220 | -3.2998 | 5.0185 | 0.16 | 0.6855 |
| Dispersion | 1 | 0.0643 | 0.0158 | 0.0398 | 0.1042 | | |
Although several of the variables in this model are not significant, even at the 90% confidence level, the dispersion parameter is significantly different from zero (its 95% Wald confidence interval, 0.0398 to 0.1042, excludes zero), which indicates overdispersion in the crash frequency data.
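To illustrate what a dispersion estimate of this size implies, the sketch below computes the variance-to-mean ratio of a negative binomial count, assuming the common parameterization Var(Y) = μ + kμ² with k the dispersion parameter. The parameterization and the mean values are assumptions made for illustration only; a ratio of 1 corresponds to the Poisson case.

```python
DISPERSION = 0.0643  # dispersion estimate reported in Table 4


def nb_variance(mean, dispersion):
    """Negative binomial variance, assuming Var(Y) = mean + dispersion * mean**2."""
    return mean + dispersion * mean ** 2


def variance_to_mean_ratio(mean, dispersion):
    """Equals 1 for a Poisson model and exceeds 1 under overdispersion."""
    return nb_variance(mean, dispersion) / mean


# Hypothetical mean crash counts, chosen only for illustration.
for mean in (5.0, 20.0, 50.0):
    ratio = variance_to_mean_ratio(mean, DISPERSION)
    print(f"mean = {mean:5.1f}  variance/mean = {ratio:.3f}")
```

Under this assumption the ratio grows linearly with the mean, so the departure from the Poisson model is largest for the counties with the most crashes.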
The future tasks and analyses are listed below, each with a description of the task, its objectives, and the expected outcome or product.
This task is the starting point of the whole research effort. Table 5 summarizes the current state of data collection.
Table 5 Availability of data by year at October 15, 2004
| | Year | | | | | | |
| Variable | 1990 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 |
| Crashes | X | X | X | X | X | X | X |
| Population related | X | X | X | X | X | X | X |
| Poverty and Income | X(89) | X | X | X | X | X | |
| DUI | X | X | X | X | X | X | X |
| travel time (census) | X | | | | | X | |
| DVMT | | | | X | X | X | X |
| Miles of road | X | X | X | X | X | X | X |
| Weather | | | | | X | X | |
It is important to note that Mean Travel Time to Work is available only for the census years 1990 and 2000. Other census-based information, such as population and income, is available as estimates and projections from the Census Bureau.
Poverty and income data were obtained from the Small Area Income and Poverty Estimates (SAIPE) program of the U.S. Census Bureau. The next release of these estimates is scheduled for November 2004 and will, it is hoped, include data for 2001.
We are currently working to obtain the missing DVMT and miles-of-road data from PennDOT.
The weather data is already available from the Internet; therefore, it will be downloaded as soon as the remaining data is collected.
Weather data from hundreds of stations that are part of the NCDC will be used to generate a predicted surface for each variable. Each surface will then be summarized at the county level so that a single value per county enters the database. The surfaces will be generated with the geostatistical models described by Cressie (1993).
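The county-level summary step can be sketched as follows, assuming the predicted surface is available as grid cells already labeled with the county they fall in, and assuming the mean is the chosen summary (the proposal does not state which summary statistic will be used). The county names, values, and function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_county(cells):
    """Average the predicted grid-cell values within each county.

    cells: iterable of (county_name, predicted_value) pairs, one per grid cell.
    Returns a dict mapping each county name to its mean predicted value.
    """
    values = defaultdict(list)
    for county, value in cells:
        values[county].append(value)
    return {county: mean(vals) for county, vals in values.items()}


# Hypothetical grid cells: (county, predicted total precipitation).
predicted_cells = [
    ("Centre", 40.0), ("Centre", 42.0),
    ("Erie", 38.0), ("Erie", 41.0), ("Erie", 44.0),
]
county_values = summarize_by_county(predicted_cells)
print(county_values)
```

Other summaries (for example, an area-weighted mean) would follow the same pattern with a different aggregation function.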
An example is presented in Figure 2, which shows the predicted total precipitation surface for the year 2000. This surface was generated using geoR (Ribeiro and Diggle, 2001), a package for R (R Development Core Team, 2003).
The Total Precipitation surfaces already generated for 1999 and 2000 represent the simplest case, because Total Precipitation is normally distributed and substantially different from zero throughout the State of Pennsylvania and its surroundings.
The case of Total Snow is more complex because some areas of the state have very low or zero total snow in at least one year. In addition, the data is skewed to the left. Bayesian Kriging models for continuous, non-normally distributed data will be used for this case.
The most complex case is that of the count variables: number of days with precipitation and number of days with snow. For these variables, a fully Bayesian inference approach will be used, as described by Rathbun (2004).
For the last two cases, we plan to use a PC cluster run by the High Performance Computing Group of Penn State's Academic Services and Emerging Technologies (ASET). This cluster will be able to process the large number of calculations and iterations required by the Bayesian inference methods within a feasible time frame. However, it should be clear that the effort to model weather-related variables for this project will be undertaken only as far as is feasible within the time constraints.
Different transformations of the independent variables may be tested. In particular, Noland and Oh (2004) recommend logarithmic transformations of the independent variables to minimize heteroskedasticity in the data; therefore, this transformation will be tested. Other functional forms will also be used, such as dummy variables derived from discrete or continuous variables using a threshold value. In addition, interactions between variables, like those tested by Shankar et al. (1995), can also be incorporated into the models.
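The three kinds of functional form mentioned above (logarithmic transformation, threshold dummy, and interaction) can be sketched as follows. The variable names, the sample values, and the threshold of 10 are hypothetical and serve only to show the mechanics.

```python
import math


def log_transform(value):
    """Natural logarithm; defined only for strictly positive values."""
    if value <= 0:
        raise ValueError("log transformation requires a positive value")
    return math.log(value)


def threshold_dummy(value, threshold):
    """Return 1 if the value exceeds the threshold, otherwise 0."""
    return 1 if value > threshold else 0


def interaction(a, b):
    """Product of two variables, used as an interaction term."""
    return a * b


# Hypothetical county record (values are illustrative, not real data).
county = {"dvmt": 2500.0, "p_pov": 12.5}

features = {
    "l_dvmt": log_transform(county["dvmt"]),
    "high_pov": threshold_dummy(county["p_pov"], threshold=10.0),
}
features["l_dvmt_x_high_pov"] = interaction(
    features["l_dvmt"], features["high_pov"]
)
print(features)
```

The explicit check in `log_transform` reflects the constraint that a variable with zero values cannot be log-transformed directly.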
The modeling process will follow this sequence:
· Development of descriptive statistics for all the variables in the model.
· Development of NB models without spatial interaction.
· Construction of spatial NB models for the data.
· Model comparison and analysis.
The final parameter estimates may be obtained in R by maximum likelihood. This requires developing a set of specific functions, because R does not have these algorithms built in. An alternative is the WinBUGS package (WinBUGS, 2004), software developed to fit fully Bayesian models using Markov Chain Monte Carlo (MCMC).
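As an indication of the kind of function that would need to be written, the sketch below evaluates a non-spatial negative binomial log-likelihood with a log link. It assumes the parameterization with mean μ and dispersion k, where Var(Y) = μ + kμ², and a single hypothetical covariate; the data values and function names are illustrative, and the spatial component is not included.

```python
import math


def nb_log_pmf(y, mu, k):
    """Log-probability of count y under a negative binomial distribution.

    Assumes mean mu > 0 and dispersion k > 0, with Var(Y) = mu + k * mu**2.
    """
    inv_k = 1.0 / k
    return (
        math.lgamma(y + inv_k) - math.lgamma(inv_k) - math.lgamma(y + 1)
        + y * math.log(k * mu)
        - (y + inv_k) * math.log(1.0 + k * mu)
    )


def nb_log_likelihood(beta, k, xs, ys):
    """Log-likelihood for the model log(mu_i) = beta[0] + beta[1] * x_i."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = math.exp(beta[0] + beta[1] * x)
        total += nb_log_pmf(y, mu, k)
    return total


# Hypothetical covariate values and crash counts.
xs = [1.0, 2.0, 3.0]
ys = [2, 4, 3]
log_lik = nb_log_likelihood((0.5, 0.3), 0.0643, xs, ys)
print(round(log_lik, 4))
```

Maximum likelihood estimates would be obtained by passing the negative of this function to a numerical optimizer; that step is omitted here.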
Table 6 presents the tentative schedule of work, meetings, submissions, and other activities necessary to complete the Master of Science thesis as required by the Graduate School.
Abdalla, I.M., Raeside, R., Barker, D., McGuigan, D.R.D., 1997. An investigation into the relationships between area social characteristics and road accident casualties. Accident Analysis and Prevention, Vol. 29.
Aguero, J., 2004. Consideration of socioeconomic and transportation related variables in fatal crash frequency analysis. Unpublished.
Amoros, E., Martin, J.L., Laumon, B., 2003. Comparison of road crashes incidence and severity between some French counties. Accident Analysis and Prevention, Vol. 35.
Chichester, B.M., Gregan, J.A., Anderson, D.P., Kerr, J.M., 1998. Associations between road traffic accidents and socio-economic deprivation on Scotland’s west coast. Scot. Med. J., Vol. 43.
Cressie, N.A.C., 1993. Statistics for Spatial Data. Revised Edition, John Wiley and Sons, USA.
Edwards, J.B., 1996. Weather-related road accidents in England and Wales: a spatial analysis. Journal of Transport Geography, Vol. 4.
Evans, L., 1991. Traffic safety and the driver. Van Nostrand Reinhold, New York.
FARS Fatal Accident Report System Web-Based Encyclopedia, 2004. Web page http://www-fars.nhtsa.dot.gov, visited 9/6/2004.
Haight, F. H., 1988. Research and theory in traffic safety. Paper presented to International Symposium on Traffic Safety Theory and Research Methods, sponsored by SWOV, Amsterdam, Netherlands.
Jones, A.P., Langford, I.H., Bentham, G., 1996. The Application of K-function Analysis to the Geographical Distribution of Road Traffic Accident Outcomes in Norfolk, England. Social Science and Medicine, Vol. 42, No 6, pp 879-885.
Levine, N., Kim, K.E., Nitz, L.H., 1995. Spatial Analysis of Honolulu Motor Vehicle Crashes: I. Spatial Patterns. Accident Analysis and Prevention, Vol. 27, No 5, pp 663-674.
Levine, N., Kim, K.E., Nitz, L.H., 1995. Spatial Analysis of Honolulu Motor Vehicle Crashes: II. Zonal Generators. Accident Analysis and Prevention, Vol. 27, No 5, pp 675-685.
Lord, D., Washington, S.P., Ivan, J.N., 2003. Statistical challenges with modeling motor vehicle crashes: understanding the implications of alternative approaches. Presented at the 2004 Transportation Research Board Meeting, Washington D.C.
MacNab, Y.C., 2004. Bayesian spatial and ecological models for small-area accident and injury analysis. Accident Analysis and Prevention, Article in press.
Miaou, S., Song, J.J., Mallick, B.K., 2003. Roadway Traffic Crash Mapping: A Space-Time Modeling Approach. Journal of Transportation and Statistics, Vol. 6, No 1.
Murray, C., Lopez, A. (editors), 1996. The Global Burden of Disease. Harvard University Press.
The National Highway Traffic Safety Administration (NHTSA), 2004. Web page http://www.dot.gov/bib2004/nhtsa.html, visited 9/8/2004.
Noland, R.B., Oh, L., 2004. The effect of infrastructure and demographic change on traffic-related fatalities and crashes: a case study of Illinois county-level data. Accident Analysis and Prevention, Vol. 36.
Noland, R.B., Quddus, M.A., 2004. A spatially disaggregate analysis of road casualties in England. Accident Analysis and Prevention, Article in press.
Pennsylvania Department of Transportation, 2002. Pennsylvania Crash Facts and Statistics: 2001.
Rathbun, S.L., 2004. Class notes: Spatial Statistics, The Pennsylvania State University, spring 2004.
R Development Core Team, 2003. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org.
Ribeiro Jr., P.J., Diggle, P.J., 2001. geoR: A package for geostatistical analysis. R-NEWS, Vol. 1, No 2.
Richardson, S., 1992. Statistical Methods for geographical correlation studies. In: Elliott, P., Cuzick, J., English, D., Stern, R. (Eds.), Geographical and Environmental Epidemiology: Methods for Small-Area Studies. Oxford University Press, London, pp. 181–204.
Shankar, V., Mannering, F., Barfield, W., 1995. Effect of roadway geometrics and environmental factors on rural freeway accident frequencies. Accident Analysis and Prevention, Vol. 27.
Shinar, D., 1978. Psychology on the Road, the Human Factor in Traffic Safety. John Wiley and Sons. USA.
US Census Bureau, 2004. American FactFinder, US Census Bureau Web page http://factfinder.census.gov/ visited 6/8/2004.
WinBUGS, 2004. Imperial College and Medical Research Center, United Kingdom.
World Health Organization, 1999. Injury: A Leading Cause of the Global Burden of Disease. Web page http://www.who.int/violence_injury_prevention/injury/gbi/gbi8/en/index1.html, visited 9/7/2004.