This week saw the release of IMPLAN’s Panel Data, which opens up new opportunities for statistical, shift share, or industry growth analyses—to name just a few.
Now, in the interest of full disclosure, I didn’t know very much about panel data before preparing to interview Jenny Thorvaldson, Chief Economist and Director of Data Development at IMPLAN Group, and Jimmy Squibb, Data Development Specialist and Regional Economist at IMPLAN Group. But the process for compiling the data and making it available in a practical format is completely fascinating. Here’s what I learned:
Tim French: What I understand of panel data in the traditional sense is this: Way back in the day you had a business and all you could know about that business was what they sold and spent money on. Essentially, what you had was a list of debits and credits in a checkbook. And that was kind of a one-dimensional perspective on what that company did. Then, the following year if you were to put those checkbooks side-by-side, you’d have an added dimension. With more and more checkbooks, you eventually have, instead of one column of data, a panel of data; that’s where the term comes from. Is that essentially what we’re talking about when we’re talking about panel data?
Jimmy Squibb: Not exactly. What you described is more customarily called “time series data.” So, if you have one business—whether it’s debits or credits or some data about one business once—that’s just a data point or several observations. If you have that same business and the same types of observations over time, you have time series data. Alternatively, one could have, only for one year, observations from several businesses which would be called cross-sectional data. And then if you kind of combine those two (looking at several businesses over several years) then you have panel data.
TF: So, where time series is looking at an individual entity over time, panel data is comparing multiple entities over time.
JS: That’s right. We initially called this thing a time series data set, but from our perspective its defining feature was that it used consistent measurements over time. So, IMPLAN has always been kind of cross-sectional, so to speak, insofar as each year you can get data and there are several different entities within one time period. Now you can have several different entities over time (and an entity is an industry in an area).
Jenny Thorvaldson: And that’s part of what spurred us to do this—our clients were wanting to use IMPLAN data in kind of a time series sense. They would be comparing our models from five years ago to this year’s model. And we would say, you know, they’re not really comparable because we’ve changed sectoring schemes, we’ve gotten new raw data sources, we’ve improved our methodologies for estimating missing numbers, etc., so you’re not really comparing apples to apples. After getting that question so many times, we thought, "well, gosh. Why don’t we make a 'time series version' of our IMPLAN data?" So, that’s where it all began. Once we started wanting to market this, that’s when we decided, well, it’s actually more accurate to call this “panel data” because someone could buy several geographies for all of those years. In that case, it would be panel data. If they did just buy one geography for all those years then that would be time series data. But really, the product as a whole is better described as being “panel.”
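The distinction Jimmy and Jenny draw between time series, cross-sectional, and panel data can be sketched in a few lines of Python. This is only an illustration: the county names, years, and employment figures are invented, and the tuple layout is an assumption, not the format of the IMPLAN product.

```python
# Hypothetical observations: (region, year, employment). All values are invented.
observations = [
    ("County A", 2014, 1200),
    ("County A", 2015, 1250),
    ("County A", 2016, 1310),
    ("County B", 2014, 800),
    ("County B", 2015, 790),
    ("County B", 2016, 820),
]


def time_series(rows, region):
    """One entity followed over time: {year: value}."""
    return {year: value for r, year, value in rows if r == region}


def cross_section(rows, year):
    """Several entities observed in a single year: {region: value}."""
    return {r: value for r, y, value in rows if y == year}


def panel(rows):
    """Several entities over several years: {(region, year): value}."""
    return {(r, y): value for r, y, value in rows}


print(time_series(observations, "County A"))   # one geography, all years
print(cross_section(observations, 2015))       # all geographies, one year
print(len(panel(observations)))                # every geography-year pair
```

Buying one geography for all years corresponds to `time_series`; buying several geographies for all years corresponds to `panel`.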
TF: So, as a client of IMPLAN, can you create your own panel?
TF: What do you think will be the most common use case for the panel data?
JT: Well, I’m not sure what would necessarily be the most common. But, if I were a researcher, I would be using these data in a statistical analysis. So, I would pull them into another software system and do some sort of study that relates various things to other things. Jimmy and I actually have an example of how we did that. We used the data to do a study on how economic diversity relates to unemployment rates in a region. I’m excited to share that the findings have been accepted for publication this year in The Journal of Regional Analysis and Policy, so that would be one example that people interested in our panel data could look to for how these data could be used.
JS: Our customers are always more inventive than we are in terms of using IMPLAN data. But Jenny’s right: we wrote a paper that studied the relationship between economic diversity (basically, whether you have a representation of lots of industries in an area) and unemployment rates. One can do that with cross-sectional data; one can do that with time series data. But you get a lot more options in terms of statistical methods to use when you have panel data. You can therefore get more reliable conclusions most of the time. So, I would think that statistical analysis of whatever questions are of interest to the researcher would be at the top of the list. Similarly, someone could use it for descriptive analysis of trends over time, to identify which industries are growing or declining in an area, or which compensation rates are increasing or declining in different industries in an area. Someone might want to make charts and graphs for a report, perhaps, all of which would be facilitated by the panel data.
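One example of the extra statistical options that panel data opens up is a fixed-effects ("within") estimator, which compares each region only with itself over time. The sketch below is not the method used in the paper mentioned above (the interview does not say which method that was), and the diversity and unemployment numbers are invented so that the contrast with a pooled regression is easy to see.

```python
from collections import defaultdict

# Hypothetical panel: (region, year, diversity_index, unemployment_rate).
# Values are invented; region B has a higher baseline unemployment rate.
rows = [
    ("A", 2014, 1.0, 8.0), ("A", 2015, 2.0, 6.0), ("A", 2016, 3.0, 4.0),
    ("B", 2014, 4.0, 12.0), ("B", 2015, 5.0, 10.0), ("B", 2016, 6.0, 8.0),
]


def slope(xs, ys):
    """Ordinary least squares slope of ys on xs (single regressor)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def pooled_slope(rows):
    """Ignore the panel structure and regress on all observations at once."""
    return slope([r[2] for r in rows], [r[3] for r in rows])


def within_slope(rows):
    """Subtract each region's own means first, then regress (fixed effects)."""
    groups = defaultdict(list)
    for region, _year, x, y in rows:
        groups[region].append((x, y))
    dx, dy = [], []
    for pairs in groups.values():
        mx = sum(x for x, _ in pairs) / len(pairs)
        my = sum(y for _, y in pairs) / len(pairs)
        dx.extend(x - mx for x, _ in pairs)
        dy.extend(y - my for _, y in pairs)
    return slope(dx, dy)


print("pooled slope:", pooled_slope(rows))
print("within slope:", within_slope(rows))
```

In this made-up data the pooled slope is positive while the within slope is negative, because the pooled regression confuses differences between regions with changes inside each region. That kind of separation is only possible when the same entities are observed over several years.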
JT: I would just, as a comment, add that right now, at least as we’re selling this panel data product, it isn’t every single possible variable that is an item in an IMPLAN model. There are some things that we’re not including. We’re including all of the things that we think are the most important, and that we, as the data team, estimate without the software. So, just note that not every single number that you find in an IMPLAN model is part of the panel data product.
Jimmy pointed out trends and making graphs and all this and just one named type of analysis I’d want to throw in there is “shift share”, where you're looking at trends in one place compared to trends in another place and who’s growing and shrinking at what rates. Those get pretty interesting. They’re really good visualizations because it's basically mapping data on a graph with four quadrants and then you have a scatter plot of data and you're paying attention not only to where you are on the plot (which quadrant), but how big your circle is. And then, if you do that for all years of the data set, then those circles move and grow and shrink and it's just a very neat way to represent the data.
JS: So, you might put shift share analysis data on such a chart, and you have performance in your area and industry relative to that same industry in the country. So, are you performing differently versus the other place? And then, on the other axis, are you performing differently versus the trend over time? So, in a way it ends up looking like a SWOT chart but with numeric values attached to each of your positions vis-a-vis the coordinate system.
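For readers who have not met shift share before, here is a minimal sketch of the classic textbook decomposition for one industry in one region. It is an assumption that this standard three-part version is what one would chart; the interview does not specify a formula, and all employment figures and parameter names below are invented.

```python
def shift_share(e0, e1, nat_ind0, nat_ind1, nat_tot0, nat_tot1):
    """Classic shift-share decomposition of a regional industry's job change.

    e0, e1             -- regional employment in the industry (base, end year)
    nat_ind0, nat_ind1 -- national employment in the same industry
    nat_tot0, nat_tot1 -- national employment across all industries
    """
    national_growth = nat_tot1 / nat_tot0 - 1
    industry_growth = nat_ind1 / nat_ind0 - 1
    regional_growth = e1 / e0 - 1
    return {
        # Jobs expected if the region simply tracked the whole national economy.
        "national_share": e0 * national_growth,
        # Extra jobs because the industry grew faster or slower nationally.
        "industry_mix": e0 * (industry_growth - national_growth),
        # Extra jobs because the region out- or under-performed its industry.
        "regional_shift": e0 * (regional_growth - industry_growth),
    }


# Hypothetical figures for one industry in one region.
parts = shift_share(e0=1000, e1=1150,
                    nat_ind0=100_000, nat_ind1=108_000,
                    nat_tot0=1_000_000, nat_tot1=1_050_000)
for name, jobs in parts.items():
    print(f"{name}: {jobs:.1f}")
print("total change:", round(sum(parts.values()), 1))
```

The three components always add up to the actual change in regional employment (`e1 - e0`). Repeating the calculation for each pair of years in a panel gives the moving, growing, and shrinking circles described above.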
TF: When releasing the 2016 data set, will that retroactively affect the compatibility of the panel data or from this point on, will everything fit together tightly?
JT: That's a great question and we have discussed that quite a bit. The plan right now is to go back and estimate all years’ data and have a whole new panel data product every five years, whenever the BEA releases its benchmark I-O tables. So, the closer we get to that fifth year, that year’s annual data set is going to differ more and more from the rest of the panel data as we continue to do little tweaks to our data processes. So, the tightest that they'll fit is in that first year when we redo everything and then there will be slight variations going forward.
JS: I was going to add that there will be these variations for reasons that are outside of our control. Our sources will continue to revise. In 2013, the Bureau of Economic Analysis made a revision to its definition of GDP and therefore revised the estimate of GDP in 1920, so those things will occur. But also, we’ll continue to improve our own methods for putting all the data together (which is in our control), and those will still change, and that is why we want to reissue panel data every so often.
JT: Very true, Jimmy. And even a third reason is that when we go back in time and estimate these older years again, we use all current, up-to-date, and revised raw data. Whereas, when we're doing one of our annual data set releases, for several of our data sources the raw data that we have to use for that year are data from the previous year, and we have to project them into the current year. So, even just the year of the raw data being used is a little bit different. Thus, each additional year will still be a valid part of the panel data set but will be somewhat less tight with the rest.
TF: It sounds like every day it gets better! I was going to mention that five years seems actually kind of ambitious to me. In conversing with clients, it sounds like those who are producing impact analyses on a cyclical basis are revisiting things every five to ten years anyway. So, for those who would be using the panel data through time from IMPLAN, updating every five years is, at least in my anecdotal and limited experience, a pretty comfortable number.
JS: Yeah, if we tried to spread that out longer we would lose some measure of familiarity and institutional knowledge of what we've done, so it helps keep everything in order to make the updates somewhat more frequently.
TF: Were there any surprises for you guys while working on this project?
JS: No, it's just a big effort to make sure that everything is consistent. That's my global observation about it. It's making sure that we've appropriately integrated all the improvements we've made over the years into the time series data. And, as Jenny mentioned before, in our annual data process we'll produce, say, IMPLAN 2016 data to describe the state of the economy in 2016. We use data or measurements of 2016, but we’ll also use things that are measurements or extrapolations from 2015. And then, we'll have to project that and say, “what do we think that's gonna be in 2016?” So, we have to disentangle all of that use of projected data from the processes we used to build the panel data, so that is maybe one of the bigger challenges. So ultimately, it's a simpler process insofar as we don't have to project, but we have to adapt a bunch of processes that are, well, adjusted to using older data.
TF: So, in five years will you end up having to reinvent the wheel?
JS: No. It’ll be way simpler. It won't be so trivial as pressing a button, but you know we did it once, so we have a bunch of existing programs that process data and that kind of recognize, “I am a time series program. I am not going to do any projections. I am going to do the same thing for many years at once.” And we’re trying, as we make an update to our annual process, to add that to our time series process. But, we’ll still have to do some curating of that. Jenny did all the work though.
JT: Jimmy did help in a non-trivial way, yes. And he's described this perfectly. Probably the biggest challenge for me was working with data that were in different sectoring schemes—for example, different NAICS codes—those were very, very painful—but yeah, having done all this the first time around… the second time will, no question, be easier by far. So, that's good.
TF: What kind of quality controls do you end up using in a project as complex as this?
JT: That’s a really, really good question and I have a couple things I can say. Jimmy, if you want to go first, please go ahead.
JS: You go first Jenny and then I’ll add to that.
JT: Okay. I’ll start off by saying that it helps to have several eyes on the data, and even though I did the majority of the estimating, I got a lot of feedback from Jimmy, from Drew, and from the rest of the team, who would compare my estimates to raw data sources. Drew knows a lot about certain areas in Louisiana, so he'd be building models in IMPLAN with the Louisiana data and just giving it a gut-check for how things would (or should) look, looking for anomalies or things that are surprising. We definitely looked (and Jimmy talked about this being one of the uses of the time series data) at trends. So, we looked at trends and said, “Is any year popping up as a huge change?” If any caught our eye, we knew that we might want to investigate them to make sure that those changes are real. For some things like that, we use the same quality control processes that we use in our annual data. So, we have a lot of quality control checks built in, making sure that every county sums up to its state and that the states sum up to the U.S. and all those kinds of checks. And there can’t be negative employment, and just gobs of those kinds of checks. So, it went through, I would say, maybe not quite as rigorous a check as we do every year (just because there are so many checks and it would be hard to do that for that many years) but with those kinds of checks we have built into our system. Jimmy?
JS: Yeah, so I think I agree with all that and I'll add: I’d say there are two goals when we’re trying to make the data be “any good.” One is about what we're doing with the data. We ask ourselves, "Are we, in fact, doing what we think we're doing?", which isn't always so clear. We have a bunch of computer scripts that process data, and we wrote them and we think they do something, and if we made a typo or misunderstood something or simply did not think about something correctly, they might not actually be doing what we think. So, making sure that we’re actually adhering to our own declared method is kind of goal number one. Goal number two is that the data match external data that we expect them to match (or are appropriately different from external data in ways that we expect). And then, how we go about identifying whether we’ve met either of those two goals is all the methods that Jenny mentioned: looking for big changes from year to year, doing internal consistency checks, making sure things add up. Basically, anytime we can think of a rule, we will review that the rule is upheld. But we’ll also just look at things that seem anomalous and check out whether they are appropriately anomalous or problematically anomalous.
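The rule-based checks that Jenny and Jimmy mention (counties summing to states, states summing to the U.S., no negative employment) can be sketched roughly as follows. This is an illustration only, not IMPLAN's actual quality control code; the function names, data layout, tolerance, and figures are all assumptions.

```python
from collections import defaultdict

# Hypothetical employment figures keyed by (state, county). Values are invented.
county_employment = {
    ("LA", "Parish 1"): 500.0,
    ("LA", "Parish 2"): 300.0,
    ("TX", "County 1"): 900.0,
}
state_employment = {"LA": 800.0, "TX": 900.0}
us_employment = 1700.0


def negative_entries(values):
    """Return the keys whose value is negative (employment cannot be)."""
    return [key for key, value in values.items() if value < 0]


def states_not_matching_counties(counties, states, tol=1e-6):
    """Return the states whose counties do not sum to the state figure."""
    totals = defaultdict(float)
    for (state, _county), value in counties.items():
        totals[state] += value
    return [s for s, value in states.items() if abs(totals[s] - value) > tol]


def states_sum_to_us(states, us_total, tol=1e-6):
    """True if the state figures add up to the national figure."""
    return abs(sum(states.values()) - us_total) <= tol


print(negative_entries(county_employment))                               # rule 1
print(states_not_matching_counties(county_employment, state_employment)) # rule 2
print(states_sum_to_us(state_employment, us_employment))                 # rule 3
```

Each function encodes one rule and returns the offenders, so a new rule can be added as another small function and run over every year of the panel.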
JT: Thank you, Jimmy, that was a great description and I'll just add one more thing to that. We estimate a lot of numbers that don't exist anywhere in the raw data sources that are public, so, not all of our numbers can be compared to something that is somewhere out there as a raw number. So, that is where the comparing year to year becomes really important and where, if we see a big change from year to year, we go back and investigate and ask, "Okay, how did we come up with this estimate?" or "Did the program do what we expected it to do?" just like Jimmy said. And if so, do we still like what the program is doing and are we okay with this estimate it gave us?
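The year-to-year comparison Jenny describes could be automated along these lines. It is a hedged sketch: the 25 percent threshold, the function name, and the sample series are invented for illustration, and a flagged year is only a prompt to investigate, not proof of an error.

```python
def flag_large_changes(series, threshold=0.25):
    """Return (year, fractional_change) pairs whose change from the prior
    year exceeds the threshold in absolute value.

    series    -- {year: value} for one industry in one area
    threshold -- hypothetical cutoff; 0.25 means a 25 percent swing
    """
    years = sorted(series)
    flags = []
    for prev, cur in zip(years, years[1:]):
        base = series[prev]
        if base == 0:
            # Percent change is undefined from a zero base; review by hand.
            continue
        change = (series[cur] - base) / base
        if abs(change) > threshold:
            flags.append((cur, change))
    return flags


# Hypothetical estimate series with one suspicious jump in 2014.
estimates = {2012: 100.0, 2013: 104.0, 2014: 160.0, 2015: 158.0}
for year, change in flag_large_changes(estimates):
    print(f"{year}: {change:+.1%} versus prior year, worth a second look")
```

For estimates that have no public raw number to compare against, a flag like this is the point at which someone goes back and asks how the estimate was produced and whether the program did what was expected.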