Spreadsheet and disclaimers

In the process of making a first-time hybrid buyer’s guide as a companion piece to my road guide, I ended up doing a fair bit of market research. The hybrid market is substantially more fragmented than the entry-level road market, and I wasn't as familiar with common offerings and groupset ranking. I ended up going through the hybrid offerings of 19 manufacturers selling bikes in the US and compiling a master spreadsheet of components and other identifying parameters. While I'm feeding a more general analysis into a first-time hybrid/commuter buying guide, I wanted to also share the spreadsheet/base data for those of you who can put it to good use. 

Link to download the spreadsheet 

Some housekeeping at the top: the information in this spreadsheet came exclusively from bike manufacturer and component-maker websites. It is therefore public information, and the spreadsheet is freely available for anyone to use in any way they wish. However, please note that while I spent a considerable amount of time checking for accuracy in my data entry, I don't make any guarantees that I did a perfect job. The spreadsheet was generated in Excel 2010 for Windows for ease of general use; if any formatting seems to be breaking I'd check for compatibility issues in that direction first. 

Graphs and stats

I'm really happy with these graphs and diagrams generated in Flourish, so we'll talk about those first before going on to methodology so they're more visible. Sorry to people interested in methodology, I promise there's a ton of that below. 

Sankey diagram - front derailleur specification on the left, rear derailleur on the right

Sankey diagram - front derailleur specification on the left, rear derailleur on the right

As I said above, most of this spreadsheet was intended for use as a base dataset for others and for my own market guide. In light of that, the analysis here will not be comprehensive, and will assume a fair bit of context knowledge. We'll be focusing here on individual statistics and data patterns, starting with this Sankey diagram of front versus rear derailleur lines. 

This diagram shows front derailleur specification on the left and rear derailleur on the right. Squarespace does not play nice with the kind of embed link Flourish can provide, but command-clicking any of these diagrams should take you to the public Flourish page, where you can interact with the models and create your own graphs and diagrams from the data if you’d like. This particular diagram should highlight a specific link when you hover over it. Going through like that, you can see what the common pairings are. Clicking on a specific link should freeze that link as the opaque one.

'None' denotes 1-by models without a front derailleur. The groupsets are ranked by price at which the front derailleur first appears (Tourney being the front derailleur found in the least expensive model). Subsets of a groupset (Tourney TY310) are classified by just their line name so trends can be more visible.

What I think is interesting here is that front derailleurs tend not to link down. While an Altus rear derailleur has a good chance of coming with a Tourney front derailleur (Tourney being a lower tier in our list), an Altus front derailleur is not paired with a Tourney rear derailleur in our dataset. 

Scatter plot - price versus rear derailleur line

Scatter plot - price versus rear derailleur line

Rear derailleur choice seemed to be a good indicator for the general component tier of a hybrid bike. I believe this is mostly because a significant part of the market now has no front derailleur, and shifters seem to come in more variations than do rear derailleurs, and for me out those three tend to have the most significant say in how a hybrid drivetrain feels. Anyway, this graph plots rear derailleur against list price. 

The interactivity of this one via the Flourish URL is in manufacturers - clicking on an individual manufacturer in the key removes it from the graph (it's a little tedious to get to a single manufacturer this way, but this setup offers more flexibility overall). Mousing over any of the dots should also get you more information about that data point. 

From a purely rear-derailleur-for-the-buck perspective, there are some clearly bad choices. Many of the outliers can be explained by frame material (the grid-of-charts for this was extremely unpleasant to the eye so we're skipping that visualisation), but in general the pricing window for a given rear derailleur is fairly set. 

I'm intrigued by the general ordering of groupsets here. You may notice the order in this graph is different from the order in the Sankey diagram - these groupsets were ordered by price at which they first appear, as in the Sankey, but why 105 comes in a hundred US dollars lower than Tiagra I couldn't say. 

Sunburst diagram - inner layer is categories, outer layer is frame material within that category

Sunburst diagram - inner layer is categories, outer layer is frame material within that category

Here we have categories and constituent frame types. I delineated aluminum-alloy where it was specified, it's probably debatable whether that was a good idea. Either way, these are some of the marketing words used by manufacturers to describe these models. 

Many models had more than one category descriptor - in those cases, I tried to use the one that was least broad. If a model was described as Fitness-Urban, that would go under Urban. The hierarchical clustering wasn't super strong methodologically, but it might still give a reasonable trend overview.  

In general, this diagram came about because I find it silly how many different descriptors are used to sell basically the same bike. While I understand the benefit of separating your truly mountain-focused "Adventure" bike from your cruiser-like "Comfort" bike, industry standardisation would probably be reasonably helpful here. If I'm showing my friend options for their first hybrid, I don't want to have to explain that they should be looking for pavement, city, urban, and fitness bikes but not comfort or dual sport, for example. Within a brand, this wording might be helpful in breaking down options, but as a whole I think it leads to consumer confusion, especially since we see that most of these bikes come in at a more general-consumer price point. 

Scatter plot - price versus gear ratio range

Scatter plot - price versus gear ratio range

This last(ish) Flourish chart shows price versus gear ratio range. Drivetrain tiers may affect consistency/smoothness of ride feel and mechanical longevity, but gear ratios directly influence how many times my sister will need to get off her bike and walk up a hill. 

This plots gear ratio range, which is just highest gear ratio minus lowest gear ratio. While this is not a perfect way to track which bikes offer reasonable ratio choices, models with higher gear ratio ranges tend to have better low- and high-end options. Up on the y-axis is generally better, left on the x-axis is generally better, so we want to be in the top left-hand corner. 

So at the very low end of gear ratio we have our one-point-somethings. These tend to be 1-by systems, which is reasonably understandable, but I run 1-by on my road bike now and my gear ratio range is substantially larger than that, so that's not really an excuse. That grouping is also at the low end of the market price-wise, though. 

The other worst offenders are the Trek FX Sport Carbon 4 (a 2.68 gear ratio range with a 1700 USD price tag) and the BMC Alpenchallenge 01 THREE (also 2.68 GRR but a 1600 USD price tag, likely 100 dollars off for having to put up with that ridiculous name). Marin and Fuji, on the other hand, have some very inexpensive models coming in at the highest section of GRR.

Like the first scatter plot, this one is interactive on manufacturer. It also includes trend lines per manufacturer, which are less useful when the sample size is small but perhaps interesting for larger model count manufacturers. We expect trend lines going up and right, or staying flat. 

Scatter plot - gear ratio range versus price

Scatter plot - gear ratio range versus price

Fun fact, if you graph these data points with the axes flipped and price on the Y axis, you see something you could reasonably call a graph of perfectly elastic gear ratio range trend lines. In price elasticity, we usually talk about the extremes as being perfectly inelastic demand (sell me this no matter the price) and perfectly elastic demand (I don't want this at all

unless it's at this exact price). 

In the normal 'price on the Y axis' kind of supply-demand graph, perfectly inelastic demand is a vertical line and perfectly elastic is a horizontal line. Perfect elasticity/inelasticity is mostly hypothetical, but here we see something kind of resembling it. If we consider these trend lines to be supply lines, we posit that a supplier will try to make whatever gear ratio a customer wants as long as the customer pays a specific price. 

Manufacturers seem to recognise consumer preference for gear ratio ranges - if there were no preference, then we would expect manufacturers to just make whatever one or two ranges were most convenient. Instead, they offer many models across the spectrum, which suggests that there is consumer demand for diversification in that dimension. So our conclusion would seem to be that suppliers recognise differentiated demand but see no difference in manufacturing cost, so supply in regards to GRR is almost perfectly elastic for some of these manufacturers (Jamis, Marin, and Cannondale being a few specific cases). 

From a purely market research perspective, this is a moderately silly waste of time, but I thought it was interesting how foundational economic principles and scatterplots could surface a decently well-reasoned hypothesis for component pricing in manufacture. It would maybe be interesting to run this for other parameters in the dataset and see what surfaces as price-affecting and not (all else being equal). 

Breakdown of price range concentration by manufacturer

And a bit about the in-spreadsheet stats, the first of which is percent of models in each price bracket by manufacturer. Before we go on, all of these are calculated as decimals; if your personal preference for percentages differs from mine you can multiply all of those functions by 100. 

None of this section was wildly surprising to me, but I do find it funny that Canyon only sells bikes over a thousand dollars. Liv and Giant are well-distributed, but Giant does seem to sell slightly more mid-tier bikes (650-999 USD) than Liv, which is interesting given how close the companies are. 

In groups by market position, it looks like: Liv, Giant, Marin, and Jamis at the entry-level price point; Raleigh, Fuji, and Felt at the entry- to mid-level; Specialized, Cannondale, Trek, and Ribble at the mid-level; and the rest at specific price points. I would expect to see a manufacturer excel where they have the strongest market positioning (as opposed to in models priced significantly above or below their line average), but that is mostly conjecture at this point. 

Gear ratio stats by manufacturer

If you're looking for the easiest gearing possible, consider bikes from Kona (0.65 minimum) or Bianchi (0.69 minimum). Kona makes sense to me because of the mountain pedigree, but Bianchi's a surprise. Also, it looks like with the exception of Specialized all of these manufacturers have sub-0.9 offerings, which is a decent spinning gear as is. For descenders, maybe avoid Diamondback and Kona. 

Rear derailleur concentration by price bracket

Brake type concentration by price bracket

Gear ratio stats by price bracket

While the majority of the market is aluminum frames, you see some steel on the low and high end and carbon only after 1500 USD. Half of the available models have Tourney and Altus rear derailleurs, with another small hot spot with Acera. Hydraulic disc brakes form the majority of brake options by 650 USD and are by far the most prevalent type starting at 800 USD. You get substantially better GRR as you go up in price brackets, but most of the gains look like they extend the high end of the gear ratio range rather than the low end. 

Data gathering methodology

A few more general philosophy notes before we go further. In the process of cleaning up the spreadsheet and doing some basic stats work on it, I tried to make some of the component-specific columns more readable by standardising the names for various components. I did not create or extrapolate any information not available on the manufacturer's model page, but in some fringe cases the information given was far more detailed than for other manufacturers so I removed the most specific modifiers.

I also excluded the following categories: fatbikes, beach cruisers (as a general rule; there are some fairly cruiser-y models in the spreadsheet), touring, anything marketed as a mountain or road bike, electric (including electric-assist), cargo, gravel, anything with drop bars. All of these exclusions had reasoning behind them, but for explanation length concerns please send me an email if you'd like to hear that reasoning. 

Spreadsheet layout and information presentation methodology

It would probably be best to now consider the spreadsheet in terms of its constituent tabs. First up is "All," which does what it says on the tin. This is the main repository of component information, probably most useful for people who just want to filter by component parts. 

Within "All," most of the fields are self-explanatory. 'Marketing' is a marker for whether the model was specifically targeted toward a single gender. Mostly this comes as W tags for women's-specific bikes. I have a lot of thoughts about the practice of gender-specific marketing in the bike industry in general, but we can table that for now. This column did not mark models which were step-through or otherwise had features traditionally marketed toward women but which did not include that in their marketing. 

Information that is not provided but was expected is noted with a dash. Fields that wouldn't necessarily have data are left empty if no data is provided. 

'BB' and 'BB type' came about because I know next to nothing about bottom brackets but think they're important. If a manufacturer specified a part number or product line, that went in 'BB'. Descriptors like threaded or cartridge went in 'BB type'. At least one of these fields was expected, so both are dashed if empty.

A note for people concerned about my use of ‘crankset’ versus ‘chainrings’ or another term: I share your concern but feel like you probably get what I’m talking about with that one.

Fields for front and rear derailleur numbers list the number of cogs, then the gearing. 'LGR' and 'HGR' are the lowest and highest gear ratio numbers, and 'GR range' is a simple HGR-LGR calculation. The conditional formatting on the LGR and HGR columns highlights values more than one standard deviation above or below the mean, and 'GR range' just does three-color lowest to highest value.

For 'Wheel note', there may be some errors with what is listed as 650b versus 650c, but the '650' part (versus road-standard 700) is what interested me with those so I did not go back when I learned 650b and 650c were different standards. I apologise if the distinction is integral to your use case. 

"Manuf" gives you basic details on the manufacturers included. These are not statistics on the actual bikes, but information about whether bikes are available to purchase online in the US/Canada as well as how much data I was able to gather from each manufacturer. Some manufacturers had more useful websites for data gathering (Ribble, Cannondale, Canyon), while some were substantially less helpful (Specialized, Giordano). In total, I considered 285 models over 19 manufacturers, with a highest model count of 51 (Cannondale) and five manufacturers under five models (Diamondback, Tommaso, BMC, Giordano, Surly). There was a hard upper cap at 2500 USD, which mainly affected Canyon. 

The next ten tabs are manufacturer-specific. The only change here from the constituent rows in "All" is the conditional formatting on gear ratios now affects only values from within the manufacturer specified, in case you have a prefered manufacturer. Manufacturers with six or fewer models were grouped into "Misc manufs," for cleanliness reasons. 

As the remaining four tabs are stats aggregation tabs, we'll transition to a new section to deal with those so there can remain some hope of navigating this explainer. 

Stats methodology

We talk about basic statistical analysis earlier on - in this section we'll explore specifically the methodology behind how these numbers were generated. 

First let's consider the "Stats" tab, which is sorted by manufacturer. For the vast majority of these cells, data was gathered with COUNTIF functions and their sister COUNT/COUNTA count...erparts. The only fields generated entirely without number bases were the 'mens/womens models same specs' fields, for which there was so little data that it made sense to just compare manually and post only the results. All of the other fields should be traceable to the underlying data.

The dotted borders in this tab denote the sections on which the conditional formatting was run. In the section that tracks what percentage of a manufacturer's models fall into a certain price point, the conditional formatting shows the highest concentration per manufacturer. However, in the following 'median price' section, the conditional formatting shows which manufacturer has the highest median price. 'Percent of models priced' is largely a checksum field. 

Tire size v. price correlation coefficient was generated only for the manufacturers with more than 10 priced models. Bianchi, while it gave specs for 13 fitness-focused flat-bar models, did not offer pricing for literally anything that I researched, so it was not included in this. I am not bitter nor am I frustrated. 

I split the price brackets up mostly arbitrarily based on my own experience and how the pricing generally seemed to fall, but graphing it out those brackets did appear to make sense. Most of the market concentration is in the 500-799 USD range, and those two constituent brackets are smaller than the ones on the higher end. 

The 'all' column typically treats those statistics for the full dataset in the "All" tab (sorry). Mostly what I mean is that it's not taking the median or sum of the columns to the left of it, it's generating those same statistics for the larger dataset. 

"Brackets" contains the same data as "All" but sorted on price with brackets delineated by thick borders. This was mostly done so "All" could stay unsorted, for more stable functions. 

In "Brackets stats," most of the same general methodology applies, with a few quirks regarding groupset hieirarchy and rear derailleur/brake choice. 

Since it was interesting for me to see what the groupset trends were over different price points, I sorted the groupsets based on a rough per-manufacturer-category ranking. Shimano publishes rankings for both its road and mountain groupsets, so I took the relative ranking of groupsets within those categories and used that to merge. For example, if 105 was ranked 5 of 7 road groupsets, it was assigned the value .714, so it would rank higher than a groupset that was, say, 4 of 9 (.44) in Shimano's mountain category. 

This presented the obvious challenge of being not at all based on how any normal person would compare groupsets. Moreover, it required deciding whether Tourney was primarily a road or mountain groupset (I went with road in the end). Still, as a very rough estimate it appeared to work reasonably well for our purposes. If you have a different preferred ranking, the functions are hard-mapped to cells in the "Brackets" tab, so feel free to reposition those rows since the calculations should stay the same. 

For the Tourney issue described above, I didn't draw much attention to the road- versus mountain-branded drivetrain percentages in rows 20 and 21. 

The other major difference in methodology between the "Stats" tab and "Brackets stats" is the mode function in rows 35 and 36. This function’s syntax was taken from someone on the internet, and should be outputting the modal value in the referenced dataset. However, even excepting the lack of research into the function, there is a possible issue in generalisation. If there are two Tourney TY310 and two Tourney TY510 listings, for example, and three Sora listings, Sora would be listed as the modal component for that dataset (even if all three were different Sora part numbers or generations, depending on what information the manufacturer provided).

'Non-700cc wheels' was just generated by counting all of the occurrences of a non-blank 'Wheel notes' column in "Brackets," since that column only contained notes when the wheel size was not 700cc. 

Below that comes a not-easily-traceable set of stats. These I pulled by filtering "Brackets" on whatever category I was considering, then highlighting the prices and looking at the quick functions Excel 2010 shows in the bottom right when you highlight data. I would be surprised if these are not the numbers generated by doing this in a more reproducible manner, but of course the possibility exists. In these rows, Tourney includes sub-specifications like Tourney TY310. 

"Completeness" just tracks how many fields in a manufacturer are filled with dashes. This tab holds the traceable functions for the numbers given in the "Manuf" tab. 

Further possibilities

I don’t have a STATA license anymore, but would love to see some sort of analysis on the relative price increases of a specific groupset or component. Some of the statistics included vaguely start to look at differences between component lines, but nothing at the scale that I think might be able to be derived from the raw data. It would be interesting to me how much, for example, bumping up from a Tourney rear derailleur to Alivio increases price ceteris paribus, or whether gear ratios actually do anything to price.

Also in the realm of possible further data aggregation is reviews. I didn’t touch at all on any market reviews for the compilation of this spreadsheet (I will for the hybrid buyer’s guide, but not in any rigorous way), it might yield notable results to map even overall/average review scores to certain model parameters. Or use the incidence rate of reviews on certain models to map market share or demand, perhaps.

These are not things I have the time or ability to put together at this point, but if you do any of this or anything else fun with the data, please do let me know! I’d be thrilled to see what other trends might or might not surface.

Last notes

I hope you've enjoyed your time here. Please let me know if you see any errors in the spreadsheet or in any of the analysis. As I said, this will get bundled into my hybrid/commuter buying guide, but the data itself is free for you to use. Any edits to the dataset will be listed below. You can also access previous version downloads from the list, if you’d like.

Current data version: 2020.01

2020.01 - original data