Site Selection Scoring Models: A Defensible Guide

Q: What is a site selection scoring model?

A site selection scoring model is a structured method for ranking candidate locations using criteria, weights, transformations, gates, and overlays.

Q: What is a location score?

A location score is a numeric summary of how well a candidate site fits an operator’s site selection criteria.

Q: What is the difference between a site score and a sales forecast?

A site score measures strategic fit, while a sales forecast estimates expected performance. The final recommendation should use both.

Q: What is the best site selection scoring method?

For most multi-unit operators, the best practical method is hard gates plus a transparent weighted score, forecast range, network overlays, sensitivity analysis, and post-opening validation.

Q: What criteria should be included in a site selection scorecard?

Common criteria include trade area population, income, daytime population, category spend, competition, traffic, accessibility, co-tenancy, feasibility, rent, customer origins, cannibalization, and saturation.

Q: How should site selection criteria be weighted?

Weights should reflect strategy, be set before candidates are scored, be documented, and be tested for sensitivity.

Q: What is AHP in site selection?

AHP, or Analytic Hierarchy Process, is a multi-criteria decision method that derives weights from pairwise comparisons and checks consistency.

Q: What is TOPSIS in site selection?

TOPSIS ranks candidate sites by comparing each one to an ideal site and a worst-case site.

Q: Should cannibalization be part of the score?

Cannibalization should usually be shown as a network overlay rather than hidden inside the core score.

A complete guide to weighted scoring, AHP, TOPSIS, thresholds, gates, normalization, forecasts, overlays, confidence, and validation.

📖 28 min read · Last updated May 2026

Executive summary

A site score measures strategic fit. It shows whether a candidate matches the operator's reach, demand, competition, access, format, and growth strategy.
A score is different from a forecast. A score ranks fit. A forecast estimates sales, visits, orders, members, patients, or deposits.
The strongest architecture is gates + weighted score + forecast + overlays + recommendation. Gates remove blocked sites, scores rank eligible sites, forecasts estimate performance, and overlays add portfolio context.
Site scoring has a nearly century-long lineage. Reilly, Converse, Christaller, Applebaum, Huff, GIS suitability analysis, MCDA, and modern ML all address the same core decision problem.
Weighted scores are useful, but fragile. Rank reversal, normalization instability, compensatory aggregation, double-counting, and weight gaming can change the recommendation.
Hard requirements should be gates. Zoning, drive-thru feasibility, payer mix, rent ceiling, minimum trade area, and operational blockers should not be averaged away.

A site score should make a decision easier to challenge.

That sounds backwards until a candidate reaches the real estate committee.

A black-box score can look impressive in a pipeline review. It can rank the right sites, produce a confident number, and make the deck feel more analytical. But the moment someone asks why the site scored 82, why a failed store would have scored well, why a high-traffic corridor lost points, why cannibalization was ignored, or why the model changed since last quarter, the score has to become more than a number.

It has to become a decision record.

A defensible site selection scoring model shows how a candidate location fits the operator's strategy. It exposes the criteria, weights, gates, thresholds, transformations, assumptions, data sources, missing fields, model version, confidence level, forecast range, and network impact.

Most importantly, it separates four artifacts that often get collapsed into one vague "AI score."

Artifact	Question it answers
Score	Does this site fit our strategy?
Forecast	How large could this location be?
Overlay	What network or execution context changes the decision?
Recommendation	What should we do next?

The score organizes the evidence. The forecast estimates performance. The overlays explain risk. The recommendation turns the analysis into a decision.

What is a site selection scoring model?

A site selection scoring model is a structured method for comparing candidate locations against a defined set of criteria.

For a retailer, the model might evaluate trade area population, income, category spend, co-tenancy, competition, access, foot traffic, visibility, and cannibalization. For a restaurant, it may add daypart demand, drive-thru feasibility, pickup flow, delivery radius, commute direction, and kitchen capacity. For healthcare, it may include payer mix, provider capacity, patient access, service-line demand, referral leakage, and regulatory constraints.

A scoring model helps answer:

Which sites should advance to deeper review?
Which candidates best match the expansion strategy?
Which variables drive the recommendation?
Which sites fail hard requirements?
Which high-scoring sites need more research?
Which candidates create network risk through cannibalization or saturation?
Which recommendations are high-confidence, and which are fragile?

A good scorecard gives every stakeholder a way to interrogate the recommendation.

The real estate team can ask about access and trade area assumptions. Finance can ask about forecast range and rent-to-sales risk. Operations can ask about feasibility. Franchise teams can ask about encroachment. Executives can ask whether the site advances the strategy or simply looks good on a map.

The score should organize that conversation, not end it.

Score vs forecast vs overlay vs recommendation

The most common mistake in site selection scoring is turning four different artifacts into one number.

Artifact	Purpose	Example output
Score	Measures strategic fit	74 / 100 with component breakdown
Forecast	Estimates expected performance	P10 / P50 / P90 sales, visits, orders, members, patients
Overlay	Adds network and execution context	Cannibalization, saturation, feasibility, confidence
Recommendation	Converts evidence into action	Advance, reject, research, revise, hold

A site can score well and forecast small. It may be strategically perfect but located in a limited trade area. A site can forecast large and score poorly. It may have high traffic and demand, but the wrong customer profile, bad access, weak economics, or heavy cannibalization.

A clean decision package looks like this:

Core score:
Reach + Demand + Competition + Accessibility

Forecast:
Expected sales / visits / members / patients, with range

Overlays:
Cannibalization
Saturation
Feasibility
Confidence
Validation requirements

Recommendation:
Advance / reject / research / revise

This distinction is especially important for AI-assisted scoring. A single "AI score" can hide whether the model is ranking strategic fit, expected sales, market potential, analog similarity, network impact, or all of those at once. The more concepts a score absorbs, the harder it becomes to defend.

The history of location scoring

Site selection scoring did not begin with AI. The modern scorecard sits on nearly a century of retail geography, spatial interaction modeling, analog reasoning, GIS suitability analysis, and multi-criteria decision analysis.

Reilly's Law of Retail Gravitation (1931)

William J. Reilly's 1931 The Law of Retail Gravitation applied a gravity analogy to retail trade areas. The idea was simple: larger retail centers attract customers from farther away, while distance weakens that attraction.

The classic attraction-balance relationship is:

PA / dA² = PB / dB²

Rearranged:

dA / dB = √(PA / PB)

Where:

Term	Meaning
`PA`, `PB`	population or size proxy for retail centers A and B
`dA`	distance from the breakpoint to center A
`dB`	distance from the breakpoint to center B

If A is larger than B, the breakpoint is farther from A and closer to B. That means A's trade area extends farther toward B. This is the point that often gets muddled when the formula is presented without clear variable definitions.

Reilly's model solved an early problem: how to draw deterministic trade area boundaries between competing cities or retail centers. Its limits are obvious today. It assumes simple geography, ignores road networks, treats consumers as homogeneous, and draws hard boundaries where real trade areas overlap. But the core insight still matters: demand, attraction, and distance can be modeled. (Wikipedia)

Converse and the breaking-point formula (1949)

Paul Converse later rearranged Reilly's logic into a more practical breaking-point formula:

dBP from B = DAB / (1 + √(PA / PB))

Where:

Term	Meaning
`DAB`	distance between centers A and B
`PA`, `PB`	population or size proxy for A and B
`dBP from B`	breakpoint distance from center B

Example:

City A population = 120,000
City B population = 30,000
Distance between cities = 60 miles

dBP from B = 60 / (1 + √(120,000 / 30,000))
dBP from B = 60 / (1 + 2)
dBP from B = 20 miles

The model says the smaller city's trade area extends about 20 miles toward the larger city. The breakpoint is 40 miles from the larger city. That is the expected result: the larger center pulls from farther away. (Wikipedia)
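The arithmetic above can be reproduced with a minimal sketch. The function name `breakpoint_from_b` is illustrative, not part of any library, and it assumes positive populations and a positive distance.

```python
import math

def breakpoint_from_b(distance_ab: float, pop_a: float, pop_b: float) -> float:
    """Converse breaking point, measured from center B (hypothetical helper)."""
    if distance_ab <= 0 or pop_a <= 0 or pop_b <= 0:
        raise ValueError("distance and populations must be positive")
    return distance_ab / (1 + math.sqrt(pop_a / pop_b))

# Worked example from the text: A = 120,000, B = 30,000, 60 miles apart.
d_from_b = breakpoint_from_b(60, 120_000, 30_000)
d_from_a = 60 - d_from_b
print(d_from_b, d_from_a)  # 20.0 40.0

# Consistency check against Reilly's ratio: dA / dB = sqrt(PA / PB).
print(d_from_a / d_from_b)  # 2.0
```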

For modern operators, the value is historical and conceptual. Converse made gravity theory usable for trade-area boundary drawing. Modern scoring models have moved beyond deterministic breakpoints, but they still ask where a site's reach begins to fade.

Christaller, threshold, and range (1933)

Walter Christaller's central place theory added two ideas that still sit underneath site scoring: threshold and range. In central place theory, threshold is the minimum market needed to support a good or service, and range is the maximum distance customers are willing to travel for it. (Wikipedia)

Concept	Meaning in site selection
Threshold	minimum demand needed to support a location
Range	maximum distance or travel time customers will tolerate

Every supportable-unit model, minimum-demand gate, drive-time screen, and trade area threshold inherits this logic.

A market may contain population, but the location only works if enough demand exists inside the range customers will actually travel. A clinic may have a strong need profile, but if patients cannot reach it within the service standard, the demand is not operationally useful. A delivery unit may have addressable households, but if it cannot serve them inside the delivery promise, the theoretical market is larger than the practical market.

Christaller's threshold/range pair is still inside the questions operators ask every day:

Does the catchment contain enough target customers?
Will those customers travel far enough to use the site?

Applebaum and the analog method (1966)

William Applebaum's 1966 work in the Journal of Marketing Research on store trade areas, market penetration, and potential sales formalized one of the most important operator-side methods in site selection: compare a candidate location with existing stores that have similar trade area characteristics.

The analog method asks:

Which existing stores look most like this candidate?
How did those stores perform?
What capture rate did they achieve?
What changed when similar stores opened nearby?

Applebaum's method dominated much of U.S. retail site selection from the 1960s through the 1990s because it matched how operators reason. If a proposed site resembles three existing stores, and those stores all reached similar average unit volumes (AUVs) or patient volumes, the comparison gives the committee a grounded forecast.

Analog methods still matter because they are explainable. A real estate team can inspect the stores behind the estimate.

Their weakness is data dependency. Analogs break when:

the brand has too few stores
the prototype changes
the market type is new
customer behavior shifts
the candidate is out of distribution
first-party customer-origin data is missing

Analog scoring is powerful when the portfolio is mature and comparable. It becomes fragile when the next site is unlike the past.

Huff and probabilistic trade areas (1963, 1964)

David Huff's 1963 and 1964 work replaced deterministic boundaries with probability. A customer does not simply "belong" to one trade area. Instead, the model estimates the probability that a customer at origin i chooses store j.

A common Huff formulation:

Pij = (Aj^α × Dij^-β) / Σ(Ak^α × Dik^-β)

Where:

Term	Meaning
`Pij`	probability that demand from origin `i` chooses store `j`
`Aj`	attractiveness of store `j`
`Dij`	distance, drive time, or travel cost from `i` to `j`
`α`	attractiveness sensitivity
`β`	distance-decay sensitivity
`k`	all competing locations in the choice set

The Huff model changed site selection because it allowed overlapping trade areas. A customer could have some probability of choosing Store A, Store B, a competitor, or no purchase. Esri describes the Huff model as a spatial interaction model where probability depends on distance, site attractiveness, and the distance and attractiveness of competing sites. Esri also notes that calibration is needed because default exponent values may not apply to the specific trade area being modeled. (ArcGIS Pro)

That matters for scoring because a site's value depends on choice probabilities, not just the number of people inside a polygon. A candidate can sit in a dense market and still be weak if competitors are more attractive or easier to reach. A site can sit farther away and still capture demand if its access and format are superior.

Lakshmanan and Hansen: market potential (1965)

Lakshmanan and Hansen's 1965 retail market potential work extended the spatial interaction tradition into shopping center sales and market aggregation. The practical move was to connect origin demand, travel friction, and destination attractiveness to potential sales.

That lineage shows up in modern location scoring whenever a model asks:

How much demand is available?
How likely is this site to capture it?
How much is already allocated to competitors or existing stores?

Nakanishi and Cooper: multiplicative competitive interaction (1974)

Masao Nakanishi and Lee Cooper's 1974 multiplicative competitive interaction model extended Huff by replacing a single attractiveness variable, such as store size, with a bundle of attractiveness attributes.

A simplified multiplicative attractiveness structure:

Aj = X1j^β1 × X2j^β2 × X3j^β3 × ... × Xnj^βn

Where each X is a site or store attribute and each β is an elasticity.

Those attributes might include:

store size
parking
price
assortment
frontage
co-tenancy
brand strength
reviews
operating hours
delivery coverage
format
local awareness

This is the bridge from gravity models to modern multivariate site scoring. A store's attractiveness is not one variable. It is a weighted bundle of factors.

Modern scoring models still use this logic, even when they do not call it MCI. They combine multiple attributes into a site fit score or sales forecast.

Regression-based new-store forecasting

As chains built larger portfolios, regression and statistical forecasting entered site selection. Existing locations became training data. Candidate-site features became predictors.

The question shifted from:

Which store does this site resemble?

to:

Given these trade area, access, competition, and format attributes,
what range of outcomes should we expect?

Regression, generalized linear models, and later machine learning helped operators move beyond pure analog reasoning. But these methods introduce their own risks: overfitting, sample-size limits, stationarity assumptions, and weak performance on new formats or new geographies.

A chain with 80 stores and 30 variables does not have "big data." It has a small modeling problem that needs discipline.

GIS and suitability analysis (1980s-1990s)

GIS made location scoring repeatable. Drive-time polygons, demographic overlays, competitor layers, traffic counts, POI data, and suitability maps allowed in-house teams to run hundreds of analyses instead of commissioning one-off studies.

Esri's suitability analysis workflow is a good public example of the modern GIS approach. It identifies sites that meet user-defined criteria, preprocesses variables onto comparable scales, applies weights, combines them, and scales final scores. It also supports positive, inverse, ideal, and target-site influence types. (Esri Documentation)

That is the site scoring pattern most teams recognize:

Define criteria.
Transform variables.
Weight criteria.
Combine scores.
Rank candidates.

MCDA and decision science

Multi-criteria decision analysis, or MCDA, gave scoring models a formal decision framework. Weighted sums, AHP, TOPSIS, ELECTRE, PROMETHEE, sensitivity analysis, and rank-stability checks all address the same issue: how to make a decision when multiple criteria matter and no single metric is enough.

This matters because site selection is inherently multi-criteria. A candidate can be strong on demand, weak on access, moderate on competition, and uncertain on feasibility. The model has to combine those signals without hiding deal-breakers.

Mobility, machine learning, and AI scores (2010s-2020s)

The 2010s and 2020s added foot-traffic panels, mobile-device data, gradient boosting, store embeddings, and AI-assisted workflows. These tools can improve scoring when used carefully, especially for observed trade areas, visit behavior, analog selection, and feature attribution.

But AI scores are the newest aggregation layer, not a replacement for older questions.

Every scoring model still has to answer:

What criteria matter?
How are variables transformed?
What is being forecast?
What is being scored?
How is cannibalization handled?
What data is missing?
How is the model validated?
Can the committee challenge the recommendation?

The history matters because it keeps the modern score honest. Site selection has always been a structured argument about demand, access, competition, and choice. AI does not remove that structure. It raises the cost of hiding it.

The architecture of a defensible location score

A defensible location score has five layers.

Layer	Purpose	Example
Gates	Remove ineligible or blocked sites	zoning failure, drive-thru impossible, rent ceiling exceeded
Core score	Rank eligible sites against strategy	Reach + Demand + Competition + Accessibility
Forecast	Estimate expected performance	P10 / P50 / P90 sales, visits, patients, members
Overlays	Add portfolio and execution context	cannibalization, saturation, feasibility, confidence
Recommendation	Convert evidence into action	advance, reject, research, revise

This architecture prevents one number from doing too much work.

Gates

Gates answer whether a site is eligible. Failed gates should not be averaged away.

Examples:

zoning does not allow the use
drive-thru is required and impossible
parking is below prototype requirement
rent-to-sales economics fail threshold
trade area is below minimum demand
payer mix is unacceptable
franchise territory rights block the site

A gate-failed site should be labeled:

Status: Not eligible under current criteria
Reason: Drive-thru feasibility failed

It should not receive a misleading score of 58.
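One way to keep gates out of the averaging step is to evaluate them before any score is computed. The sketch below is illustrative only: the `Candidate` fields, the gate list, and the 10% rent-to-sales ceiling are hypothetical placeholders, not recommended values.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    zoning_ok: bool
    drive_thru_feasible: bool
    rent_to_sales: float  # projected rent divided by projected sales

# Hypothetical gates: (failure reason, predicate that returns True when the gate passes).
GATES = [
    ("Zoning does not allow the use", lambda c: c.zoning_ok),
    ("Drive-thru feasibility failed", lambda c: c.drive_thru_feasible),
    ("Rent-to-sales above ceiling", lambda c: c.rent_to_sales <= 0.10),
]

def failed_gates(candidate: Candidate) -> list:
    """Return the reasons for every gate the candidate fails."""
    return [reason for reason, passes in GATES if not passes(candidate)]

def status(candidate: Candidate) -> str:
    """Label the candidate instead of giving it a score when any gate fails."""
    failures = failed_gates(candidate)
    if failures:
        return "Not eligible under current criteria: " + "; ".join(failures)
    return "Eligible for scoring"

print(status(Candidate("Site X", True, False, 0.08)))
# Not eligible under current criteria: Drive-thru feasibility failed
```

Because a failed candidate never reaches the scoring function, there is no number for the committee to misread.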

Core score

The core score ranks eligible candidates. It should be simple enough to explain and stable enough to compare.

A common structure:

Site score =
Reach contribution
+ Demand contribution
+ Competition contribution or penalty
+ Accessibility contribution

Forecast

A forecast estimates expected performance. It may use analog stores, regression, machine learning, demand allocation, or scenario modeling.

Forecasts should be presented as ranges:

P10: $1.8M
P50: $2.4M
P90: $3.1M

New-store forecasts deserve wide intervals because the site has no operating history.
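One crude way to produce such a range, assuming a set of comparable analog stores exists, is to read percentiles off their observed sales. The sketch below is illustrative: the analog sales figures are made up, and empirical percentiles from a handful of stores are a starting point, not a calibrated forecast.

```python
import statistics

# Hypothetical annual sales ($M) for ten analog stores.
analog_sales = [1.7, 1.9, 2.1, 2.2, 2.4, 2.5, 2.6, 2.9, 3.0, 3.3]

# statistics.quantiles with n=10 returns the nine decile cut points.
deciles = statistics.quantiles(analog_sales, n=10, method="inclusive")
p10, p50, p90 = deciles[0], deciles[4], deciles[8]

print(f"P10: ${p10:.2f}M  P50: ${p50:.2f}M  P90: ${p90:.2f}M")
```

A small analog set produces a narrow-looking range that understates real uncertainty, which is one reason to report how many analogs sit behind the numbers.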

Overlays

Overlays capture context that belongs next to the score.

Overlay	Why it matters
Cannibalization	A strong site may transfer demand from existing units.
Saturation	A market may have demand but weak marginal returns.
Feasibility	A good market can fail on real estate, access, labor, buildout, or operations.
Confidence	A high score from weak or missing data should not be treated like a high score from strong evidence.
Validation	A model improves only if actual openings are measured.

Domino's fortressing language is a good example of why overlays belong outside the core score. Domino's describes adding stores in existing markets to condense delivery areas and get closer to carryout customers, while also warning that fortressing may negatively affect existing-store sales and can lead to closures if executed too rapidly. (SEC)

Sweetgreen's delivery-radius language makes the same point in channel-specific form. Sweetgreen says new restaurants in or near existing markets can affect existing restaurant sales, and it specifically warns that cannibalization may become significant when an existing restaurant's delivery radius overlaps with a new restaurant's delivery radius. (SEC)

Those are network effects. They should shape the recommendation, but they should not be hidden inside the core site fit score.

Recommendation

The recommendation is the final decision statement.

Possible outputs:

Advance to LOI.
Reject.
Advance only under constrained lease economics.
Research customer-origin data before approval.
Revise prototype.
Hold until better site supply.

The recommendation should name the key reason, not just repeat the score.

The core formulas

Weighted sum

The simplest defensible score is a weighted sum:

Scorej = Σ(wi × xij)

Where:

Term	Meaning
`Scorej`	score for site `j`
`wi`	weight for criterion `i`
`xij`	normalized score of site `j` on criterion `i`

Example:

Component	Normalized score	Weight	Contribution
Reach fit	86	30%	25.8
Demand fit	78	30%	23.4
Competition fit	58	25%	14.5
Accessibility fit	70	15%	10.5
Total			74.2

This model is easy to explain and easy to misuse. It assumes criteria can compensate for each other. Strong demand can offset weak access. Strong reach can offset competition. Sometimes that is appropriate. Sometimes it hides a fatal flaw.
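The table above can be reproduced in a few lines. This is a minimal sketch; the function name and the check that weights sum to 1 are illustrative choices.

```python
# Normalized score (0-100) and weight for each component, from the example table.
components = {
    "Reach fit": (86, 0.30),
    "Demand fit": (78, 0.30),
    "Competition fit": (58, 0.25),
    "Accessibility fit": (70, 0.15),
}

def weighted_sum(components: dict) -> float:
    """Sum of weight x normalized score; refuses weights that do not sum to 1."""
    total_weight = sum(weight for _, weight in components.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(score * weight for score, weight in components.values())

score = weighted_sum(components)
print(round(score, 1))  # 74.2
```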

Weighted geometric mean

When balance matters, a geometric mean can penalize unbalanced sites.

Scorej = Π(xij^wi)

The geometric mean reduces the ability of one very strong criterion to fully compensate for a very weak one.

Esri's suitability analysis supports Product and Geometric Mean combination methods. Esri describes geometric mean as useful when criteria are on different scales and when high final scores should require high values in multiple criteria, because an extreme value in one criterion will not disproportionately determine the result. (Esri Documentation)

The Human Development Index also moved to a geometric mean in 2010, which is often used as a teaching example for reducing substitutability across dimensions. The same idea applies to site selection: if demand, access, and feasibility all matter, a site that is excellent on demand but terrible on feasibility should be penalized more heavily than a simple arithmetic average would suggest. (Wikipedia)

Hard requirements should still be gates. The geometric mean reduces compensation, but it does not replace pass/fail eligibility.
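A small comparison shows the effect. The two sites below are hypothetical and use equal weights: both have the same arithmetic average, but the geometric mean marks down the unbalanced one.

```python
import math

def weighted_arithmetic(scores, weights):
    return sum(w * x for x, w in zip(scores, weights))

def weighted_geometric(scores, weights):
    """Weighted geometric mean; any zero score drives the result to zero."""
    if any(x <= 0 for x in scores):
        return 0.0
    return math.exp(sum(w * math.log(x) for x, w in zip(scores, weights)))

weights = [1 / 3, 1 / 3, 1 / 3]   # demand, access, feasibility
unbalanced = [95, 90, 10]         # excellent demand, terrible feasibility
balanced = [65, 65, 65]

print(round(weighted_arithmetic(unbalanced, weights), 1))  # 65.0
print(round(weighted_arithmetic(balanced, weights), 1))    # 65.0
print(round(weighted_geometric(unbalanced, weights), 1))   # roughly 44
print(round(weighted_geometric(balanced, weights), 1))     # 65.0
```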

AHP consistency

AHP derives weights from pairwise comparisons. It also checks whether those comparisons are internally consistent.

The consistency index:

CI = (λmax - n) / (n - 1)

The consistency ratio:

CR = CI / RI

Where:

Term	Meaning
`λmax`	principal eigenvalue of the pairwise comparison matrix
`n`	number of criteria
`RI`	random index for matrix size `n`

Common RI values:

n	2	3	4	5	6	7	8	9
RI	0	0.58	0.90	1.12	1.24	1.32	1.41	1.45

A common AHP convention is:

CR ≤ 0.10: judgments are acceptably consistent
CR > 0.10: revisit pairwise comparisons

For site selection, this matters because it prevents a committee from saying inconsistent things like:

Demand is much more important than Access.
Access is much more important than Competition.
Competition is much more important than Demand.

AHP does not eliminate judgment. It makes judgment auditable. (Wikipedia)
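The consistency check can be sketched without external libraries by approximating the principal eigenvalue with power iteration. The function names are illustrative, the RI values are the ones from the table above, and the two matrices are hypothetical: one perfectly consistent, one encoding the circular judgments listed above (Demand, Access, Competition, each rated 5 over the next).

```python
# Random index by matrix size, from the table above.
RI = {2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def ahp_weights(matrix, iterations=100):
    """Approximate the principal eigenvector (weights) and eigenvalue by power iteration."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iterations):
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(v)
        w = [x / total for x in v]
    aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / w[i] for i in range(n)) / n
    return w, lambda_max

def consistency_ratio(matrix):
    """CR = CI / RI, with CI = (lambda_max - n) / (n - 1)."""
    n = len(matrix)
    _, lambda_max = ahp_weights(matrix)
    ci = (lambda_max - n) / (n - 1)
    return 0.0 if RI[n] == 0 else ci / RI[n]

# Rows and columns: Demand, Access, Competition.
consistent = [[1, 2, 4], [1 / 2, 1, 2], [1 / 4, 1 / 2, 1]]
circular = [[1, 5, 1 / 5], [1 / 5, 1, 5], [5, 1 / 5, 1]]

print(consistency_ratio(consistent) <= 0.10)  # True
print(consistency_ratio(circular) <= 0.10)    # False
```

The circular matrix fails the 0.10 convention by a wide margin, which is the signal to send the committee back to its pairwise comparisons.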

TOPSIS closeness

TOPSIS ranks candidates by distance from an ideal site and a worst-case site.

Cj = dj- / (dj+ + dj-)

Where:

Term	Meaning
`dj+`	distance from site `j` to the ideal site
`dj-`	distance from site `j` to the negative ideal site
`Cj`	closeness coefficient

A higher Cj means the candidate is closer to the ideal and farther from the negative ideal.

TOPSIS is useful when comparing finalists:

Which candidate is most like our ideal site profile?

Its weakness is normalization sensitivity. Different preprocessing methods can change distances, and therefore rankings. The MCDM literature continues to emphasize that reference-type methods such as TOPSIS can produce different rankings depending on how reference solutions, distance measures, and normalization are defined. (arXiv)
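A minimal TOPSIS sketch follows, assuming vector normalization and Euclidean distance, which are common choices but not the only ones. The finalists, criteria, and weights are hypothetical, and the function does not guard against degenerate input such as an all-zero column or identical candidates.

```python
import math

def topsis(matrix, weights, benefit):
    """Closeness coefficient per row. benefit[j] is True when higher is better."""
    n_crit = len(weights)
    # Vector-normalize each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    columns = list(zip(*v))
    ideal = [max(col) if benefit[j] else min(col) for j, col in enumerate(columns)]
    worst = [min(col) if benefit[j] else max(col) for j, col in enumerate(columns)]
    closeness = []
    for row in v:
        d_plus = math.dist(row, ideal)    # distance to the ideal site
        d_minus = math.dist(row, worst)   # distance to the negative ideal site
        closeness.append(d_minus / (d_plus + d_minus))
    return closeness

# Hypothetical finalists: population, competitor count, access score.
finalists = [[120_000, 12, 70], [90_000, 4, 80], [45_000, 1, 60]]
weights = [0.4, 0.3, 0.3]
benefit = [True, False, True]  # fewer competitors is better

for name, c in zip(["Site A", "Site B", "Site C"], topsis(finalists, weights, benefit)):
    print(name, round(c, 3))
```

Swapping the normalization step for min-max or percentile scaling and comparing the resulting order is a quick way to test the sensitivity described above.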

Huff probability

Huff-style models can support scoring when the question is demand allocation.

Pij = (Aj^α × Dij^-β) / Σ(Ak^α × Dik^-β)

Where:

Term	Meaning
`Pij`	probability that demand from origin `i` chooses site `j`
`Aj`	attractiveness of site `j`
`Dij`	travel time, distance, or travel cost
`α`	attractiveness sensitivity
`β`	travel-cost sensitivity
`k`	all options in the choice set

For a scoring model, this helps estimate:

expected capture from an origin
competitor pressure
same-brand transfer
market share potential
cannibalization-adjusted demand

The key move is comparing allocation before and after a candidate enters the market.
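The before-and-after comparison can be sketched for a single demand origin. Everything here is illustrative: the site names, attractiveness values, and travel times are hypothetical, and the exponents `alpha=1.0` and `beta=2.0` are placeholder defaults that would need calibration for a real trade area.

```python
def huff_shares(attractiveness, travel_cost, alpha=1.0, beta=2.0):
    """Choice probabilities for one origin across every site in the choice set."""
    utility = {
        site: attractiveness[site] ** alpha * travel_cost[site] ** (-beta)
        for site in attractiveness
    }
    total = sum(utility.values())
    return {site: u / total for site, u in utility.items()}

# One origin, travel times in minutes.
attract = {"existing_store": 10.0, "competitor": 10.0}
minutes = {"existing_store": 10.0, "competitor": 10.0}
before = huff_shares(attract, minutes)

# Add a same-brand candidate that is closer to this origin.
attract_after = {**attract, "candidate": 10.0}
minutes_after = {**minutes, "candidate": 5.0}
after = huff_shares(attract_after, minutes_after)

transfer = before["existing_store"] - after["existing_store"]
print(round(before["existing_store"], 3))  # 0.5
print(round(after["candidate"], 3))        # 0.667
print(round(transfer, 3))                  # 0.333 of this origin's demand leaves the existing store
```

Summing the transfer across all origins, weighted by each origin's demand, gives a first-pass cannibalization estimate for the overlay.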

MCI attractiveness

The multiplicative competitive interaction model generalizes attractiveness across multiple attributes:

Aj = Πm(Xmj^βm)

Where:

Term	Meaning
`Aj`	attractiveness of site or store `j`
`Xmj`	attribute `m` for site `j`
`βm`	elasticity or importance of attribute `m`

This is useful because store attractiveness is rarely one variable. It can include frontage, parking, co-tenancy, price, hours, brand strength, and format.
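As a minimal sketch, the product can be computed directly. The attribute names and elasticities below are hypothetical, and the attributes are assumed to be positive and already on sensible scales.

```python
import math

def mci_attractiveness(attributes: dict, elasticities: dict) -> float:
    """A_j as the product of each attribute raised to its elasticity."""
    return math.prod(attributes[m] ** elasticities[m] for m in elasticities)

# Hypothetical site: size index 4.0, parking index 9.0, both with elasticity 0.5.
site = {"size": 4.0, "parking": 9.0}
betas = {"size": 0.5, "parking": 0.5}
print(mci_attractiveness(site, betas))  # 6.0
```

The result can stand in for Aj in a Huff-style allocation, which is how the two models connect.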

MCDA methods: weighted sum, AHP, TOPSIS, and outranking

A site scorecard is a multi-criteria decision model. There is no universally best method. The right method depends on the audience, data, governance burden, and decision stage.

Weighted sum model

The weighted sum model is the most common.

It works well when:

stakeholders need transparency
criteria are measurable
the model is used for ranking
hard requirements are handled as gates
criteria are reasonably independent

It fails when:

raw variables are not normalized
weights are changed after seeing the answer
correlated inputs are double-counted
hard requirements are averaged into the score
the score is treated as a forecast

AHP

AHP is useful when stakeholders need to set weights explicitly. Instead of asking "what should Demand weigh?", AHP asks stakeholders to compare criteria pair by pair:

Is Demand more important than Accessibility?
Is Competition more important than Reach?
How much more important?

This is useful for governance because it exposes trade-offs. It becomes cumbersome when there are too many criteria.

AHP also has a known failure mode: rank reversal. Adding a new alternative, including a duplicate or near-duplicate, can change the ranking of existing alternatives. (Wikipedia)

TOPSIS

TOPSIS is useful for short lists. It asks:

Which site is closest to the ideal and farthest from the worst case?

This is intuitive for comparing finalists, but it can be sensitive to normalization choices. If the same candidates rank differently under min-max, vector, or percentile normalization, the model should surface that instability rather than hide it.

Outranking methods

ELECTRE and PROMETHEE compare candidates pair by pair and can include veto thresholds. They are more complex, but they align with how real estate decisions often work:

Site A can outrank Site B unless Site A fails a critical requirement.

Outranking methods are useful when the organization wants partial ordering instead of false precision. They are harder to explain to a non-technical committee.

Practical architecture

For most multi-unit operators, the strongest architecture is:

Hard gates
+ transparent weighted score
+ forecast range
+ network overlays
+ sensitivity analysis
+ post-opening validation

This gives the committee a score that is understandable, a forecast that is honest, and a recommendation that can be challenged.

Why scoring models fail

A site scoring model can look rigorous and still be fragile. The main failure modes have names.

Rank reversal

Rank reversal occurs when adding or removing a candidate changes the relative order of the candidates that were already being compared.

Operational example:

Initial ranking:
1. Site A
2. Site B
3. Site C

After adding Site D:
1. Site B
2. Site A
3. Site D
4. Site C

If Site D is not truly relevant to the comparison between A and B, the committee will ask why the original ranking changed.

Rank reversal can occur in AHP, TOPSIS, and other MCDA methods depending on how alternatives are normalized and compared. It does not make those methods useless. It means the team running a scoring model should know the conditions under which its rankings are stable. (Wikipedia)
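The same effect appears in a plain weighted sum when scores are min-max normalized over the candidate set. The sketch below uses hypothetical demand and access values with equal weights. Adding Site D stretches the demand range, which shrinks Site A's lead on demand and flips A and B.

```python
def minmax_scores(sites: dict, weights: list) -> dict:
    """Weighted sum over criteria that are min-max normalized across the current set."""
    n_crit = len(weights)
    lows = [min(values[j] for values in sites.values()) for j in range(n_crit)]
    highs = [max(values[j] for values in sites.values()) for j in range(n_crit)]
    return {
        name: sum(
            w * (values[j] - lows[j]) / (highs[j] - lows[j])
            for j, w in enumerate(weights)
        )
        for name, values in sites.items()
    }

def ranking(sites: dict, weights: list) -> list:
    scores = minmax_scores(sites, weights)
    return sorted(scores, key=scores.get, reverse=True)

weights = [0.5, 0.5]  # demand, access
initial = {"Site A": (100, 65), "Site B": (80, 80), "Site C": (60, 40)}
with_d = {**initial, "Site D": (150, 40)}

print(ranking(initial, weights))  # ['Site A', 'Site B', 'Site C']
print(ranking(with_d, weights))   # ['Site B', 'Site A', 'Site D', 'Site C']
```

Normalizing against fixed reference ranges, such as portfolio-wide minimums and maximums, instead of the current candidate list is one way to avoid this particular instability.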

Normalization instability

The same raw data can produce different rankings depending on normalization.

Example:

Candidate	Population	Competitors
Site A	120,000	12
Site B	90,000	4
Site C	45,000	1

A min-max transformation, percentile rank, z-score transformation, and vector normalization can each tell a different story.

Esri's suitability analysis documentation treats preprocessing as a separate step for exactly this reason. It includes MinMax, Percentile, ZScore, and Raw preprocessing options, with notes about outlier sensitivity, skewed distributions, and when raw values are appropriate. (Esri Documentation)

The model should document:

which normalization method was used
why that method fits the variable
whether rankings are stable under alternative transformations
whether outliers distort the result
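A stability check of this kind can be sketched with the three sites from the example table. The 60/40 weights are a hypothetical choice, and only min-max and rank-based scaling are compared. With these inputs the top-ranked site changes with the transformation.

```python
names = ["Site A", "Site B", "Site C"]
population = [120_000, 90_000, 45_000]
competitors = [12, 4, 1]

def minmax(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def rank_scale(values):
    """Rank position scaled to 0-1 (ties would share the lower rank)."""
    ordered = sorted(values)
    return [ordered.index(v) / (len(values) - 1) for v in values]

def rank_sites(transform, w_pop=0.6, w_comp=0.4):
    pop = transform(population)
    comp = [1 - x for x in transform(competitors)]  # fewer competitors is better
    scores = {n: w_pop * p + w_comp * c for n, p, c in zip(names, pop, comp)}
    return sorted(scores, key=scores.get, reverse=True)

print(rank_sites(minmax))      # ['Site B', 'Site A', 'Site C']
print(rank_sites(rank_scale))  # ['Site A', 'Site B', 'Site C']
```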

Compensatory aggregation

Weighted sums are compensatory. A strong score in one criterion can offset a weak score in another.

That is appropriate when criteria are genuinely substitutable:

Two demand proxies can be averaged.
Two access measures can be combined.

It is dangerous when criteria are essential:

Zoning cannot be fixed by strong income.
No drive-thru cannot be fixed by high traffic.
Unacceptable payer mix cannot be fixed by strong patient need.

This is the deeper reason gates matter. Essential criteria should be gates or multiplicative penalties, not additive preferences.

The geometric-mean example is useful here. A geometric mean reduces the ability of one strong dimension to fully compensate for a weak one. In site selection, that means a site with excellent demand and poor feasibility should look more fragile than a simple arithmetic average would suggest. Esri's suitability analysis describes geometric mean as a combination method that requires high values in multiple criteria for high final scores. (Esri Documentation)

Double-counting correlated variables

Many site variables measure the same underlying signal.

Examples:

population, households, and density
income and education
daytime population and employment density
competitor count and saturation
traffic count and road hierarchy
corner lot and visibility score

If each is weighted independently, the model may overweight one concept because it appears in several columns.

A simple correlation audit helps. When two variables are highly correlated, combine them, drop one, or explicitly justify why both belong.

A practical rule:

If the absolute correlation between two variables is greater than 0.7, review them for redundancy.

That is a heuristic, not a law. The point is to prevent accidental double-counting.
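A correlation audit can be sketched with a plain Pearson coefficient. The column values below are hypothetical, and the 0.7 threshold is the heuristic from the text, applied to the absolute value.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def redundant_pairs(columns: dict, threshold: float = 0.7) -> list:
    """Pairs of variables whose absolute correlation exceeds the threshold."""
    flagged = []
    for (name_1, x), (name_2, y) in combinations(columns.items(), 2):
        r = pearson(x, y)
        if abs(r) > threshold:
            flagged.append((name_1, name_2, round(r, 2)))
    return flagged

# Hypothetical trade-area values (thousands) for five candidate sites.
columns = {
    "population": [45, 90, 120, 60, 150],
    "households": [18, 36, 47, 25, 60],
    "traffic": [30, 12, 25, 40, 18],
}
flagged = redundant_pairs(columns)
print([(a, b) for a, b, _ in flagged])  # [('population', 'households')]
```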

Weight gaming

Weights should express strategy before the site list is scored. If stakeholders change weights after seeing the output, the model becomes a negotiation tool.

A defensible process documents:

who set the weights
when weights were set
why weights were chosen
what changed since the prior version
how sensitive rankings are to those weights

Out-of-distribution candidates

A scoring model trained or calibrated on suburban drive-thru restaurants may not work for urban walk-up stores. A model built on traditional gyms may not work for boutique fitness studios. A healthcare model calibrated on primary care clinics may not work for imaging centers.

A high score is credible when the candidate sits inside the model's historical range. The same score is low-confidence when the candidate sits outside it.

The brief should say when a candidate is out of distribution.

Normalization, thresholds, penalties, and gates

Raw variables rarely belong directly in a weighted score.

Population may range from 5,000 to 200,000. Income may range from $35,000 to $180,000. Competitor counts may range from 0 to 60. Traffic counts may range from 2,000 to 80,000 vehicles per day. If those values are added directly, the variable with the largest numeric range dominates.

Common transformations

Method	How it works	Best use	Risk
Min-max	scales values from 0 to 1	bounded variables, intuitive scoring	outliers distort range
Percentile rank	scores by rank position	skewed data, committee communication	loses magnitude differences
Z-score	standard deviations from mean	portfolio benchmarking	less intuitive
Raw	uses original value	already comparable fields	can dominate model
Log transform	compresses skewed values	income, density, spending	harder to explain
Fit-band scoring	rewards values near ideal band	traffic, income, spacing	requires calibration
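A short sketch of three of these transformations, using invented traffic counts with one outlier. The tie-handling rule in `percentile_rank` (share of values strictly below) is one convention among several and is an assumption here.

```python
from statistics import mean, pstdev

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def percentile_rank(values):
    # Share of the other values that fall strictly below each value.
    n = len(values)
    return [sum(1 for other in values if other < v) / (n - 1) for v in values]

def z_score(values):
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical daily traffic counts; the last site is an outlier.
traffic = [8000, 12000, 15000, 18000, 80000]

print([round(v, 2) for v in min_max(traffic)])
print(percentile_rank(traffic))
print([round(v, 2) for v in z_score(traffic)])
```

Under min-max, the outlier compresses the other four sites into the bottom of the range, while percentile rank spreads them evenly but discards how far apart they are. That is the trade-off listed in the Risk column.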

Fit-band scoring

Some variables are non-monotonic.

Traffic is a good example. Low traffic may not provide enough exposure. Extremely high-speed or high-volume traffic can make access difficult. In QSR scoring, a site may score best in a target band rather than simply scoring higher as traffic rises.

A simple fit-band function:

Traffic score = 0 below minimum
Traffic score rises to peak in target band
Traffic score falls as access friction increases

This is more realistic than treating traffic as "higher is always better."
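One simple way to implement a fit band is a trapezoid. The sketch below assumes a linear rise and a linear fall to zero, and the breakpoints are hypothetical; an operator would calibrate both the shape and the numbers against its own openings.

```python
def fit_band_score(value, minimum, band_low, band_high, maximum):
    """Trapezoidal fit band.

    Returns 0 at or below `minimum`, rises linearly to 1 at `band_low`,
    stays at 1 through `band_high`, then falls linearly to 0 at `maximum`.
    Assumes minimum < band_low <= band_high < maximum.
    """
    if value <= minimum or value >= maximum:
        return 0.0
    if value < band_low:
        return (value - minimum) / (band_low - minimum)
    if value <= band_high:
        return 1.0
    return (maximum - value) / (maximum - band_high)

# Hypothetical traffic breakpoints in vehicles per day.
TRAFFIC_BAND = dict(minimum=10_000, band_low=25_000, band_high=45_000, maximum=80_000)

for adt in (5_000, 17_500, 30_000, 62_500):
    print(adt, fit_band_score(adt, **TRAFFIC_BAND))
```

The same function can score income or store spacing by swapping the breakpoints.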

Gates

Gates are pass/fail requirements.

Gate	Why it belongs outside the score
Zoning allows use	A high demand score cannot fix illegal use.
Drive-thru feasible	A drive-thru-dependent prototype cannot average this away.
Rent below ceiling	Unit economics can block approval.
Minimum trade area population	A market may be too small regardless of access.
Parking requirement	Operations may be infeasible without it.
Healthcare payer mix	Demand without reimbursement can be weak.
Franchise territory	Legal or contractual constraints can block the site.

A gate-failed site should be surfaced separately:

Gate result: Failed
Reason: Drive-thru feasibility
Recommendation: revise prototype or reject
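A minimal sketch of gates evaluated before any weighted score is computed. The `Candidate` fields, the rent ceiling, and the gate list are hypothetical placeholders for an operator's own requirements.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    zoning_allows_use: bool
    drive_thru_feasible: bool
    annual_rent: float

RENT_CEILING = 180_000  # hypothetical ceiling in dollars per year

# Each gate is a label plus a pass/fail check; nothing here is weighted.
GATES = [
    ("Zoning allows use", lambda c: c.zoning_allows_use),
    ("Drive-thru feasibility", lambda c: c.drive_thru_feasible),
    ("Rent below ceiling", lambda c: c.annual_rent <= RENT_CEILING),
]

def gate_result(candidate):
    failed = [label for label, check in GATES if not check(candidate)]
    return {
        "site": candidate.name,
        "result": "Failed" if failed else "Passed",
        "reasons": failed,
    }

site = Candidate("Site C", zoning_allows_use=True,
                 drive_thru_feasible=False, annual_rent=150_000)
print(gate_result(site))
```

Only candidates whose result is "Passed" would move on to the weighted score; failed candidates are reported with their reasons rather than given a low number.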

Penalties

Penalties reduce a score when risk is present but not fatal.

Examples:

Penalty	Use case
Cannibalization penalty	Candidate overlaps existing store demand.
Saturation penalty	Recent cohorts underperform in market.
Access penalty	Poor ingress, left-turn friction, or one-way constraint.
Confidence penalty	Missing or stale critical inputs.
Competitor penalty	Direct competitors dominate the target trade area.

Penalties should be visible. Hidden penalties create committee confusion.

Missing data

Do not silently impute critical inputs.

Missing data should be handled in one of three ways:

Missing-data approach	Use case
Research required	critical field missing, such as rent or zoning
Confidence reduction	useful field missing, such as traffic or customer-origin data
Documented imputation	low-risk variable, clear method, strong rationale

The site brief should say:

Confidence: Medium
Reason: traffic data unavailable; score relies on road class and observed nearby demand

A high score with low confidence is not an approval. It is a research priority.
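The three-way rule above can be sketched as a small input check. The field groupings and the High / Medium / Low cut-offs below are assumptions for illustration, not a prescribed standard.

```python
# Hypothetical field groupings; an operator would define its own.
CRITICAL_FIELDS = {"rent", "zoning"}
USEFUL_FIELDS = {"traffic", "customer_origins"}

def assess_inputs(inputs):
    """Classify a candidate's inputs; a value of None means the field is missing."""
    missing = {name for name, value in inputs.items() if value is None}
    missing_critical = sorted(missing & CRITICAL_FIELDS)
    missing_useful = sorted(missing & USEFUL_FIELDS)

    if missing_critical:
        # Critical gaps block scoring instead of being imputed silently.
        return {"status": "Research required", "confidence": None,
                "missing": missing_critical}

    if not missing_useful:
        confidence = "High"
    elif len(missing_useful) == 1:
        confidence = "Medium"
    else:
        confidence = "Low"
    return {"status": "Scorable", "confidence": confidence,
            "missing": missing_useful}

brief = assess_inputs({"rent": 140_000, "zoning": "C-2",
                       "traffic": None, "customer_origins": "loyalty file"})
print(brief)
```

The returned `missing` list is what feeds the "Reason" line in the site brief.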

Weights, sensitivity, and rank stability

Weights express strategy.

A coffee concept may weight morning access and worker density heavily. A grocery concept may weight household density, income, vehicle access, basket opportunity, and co-tenancy. An urgent care operator may weight patient access, payer mix, provider capacity, and service-line need.

Weights should be set before candidate sites are scored.

Common ways to choose weights

Method	Description	Best use
Equal weights	Every component has the same influence	early model, no strong strategy yet
Expert weights	Leadership or analysts assign weights	small teams, simple model
AHP	Pairwise comparisons produce weights	governance-heavy teams
Historical calibration	Weights tuned against past openings	mature operators with data
Segment-specific weights	Different weights by format or market type	multi-format operators
Hybrid	Expert weights refined by validation	most practical approach

Esri's suitability analysis documentation states that weights significantly affect resulting scores, that weight selection is subjective, and that weights should be backed by strong rationale and subject-matter expertise. (Esri Documentation)

Sensitivity analysis

Sensitivity analysis asks:

If we change the weights, does the recommendation change?

Examples:

If Competition weight increases from 25% to 35%, does Site A still rank first?
If Demand weight decreases by 10 points, does Site B move ahead?
If missing traffic data is added, does Site C change materially?
If cannibalization transfer is 30% instead of 20%, does the site still clear the hurdle?

A good scorecard should show:

Sensitivity question	Output
What weight change would flip the top two sites?	stability interval
Which criterion drives the ranking most?	driver analysis
Which site is second-best under alternate weights?	robustness check
Which inputs are missing or low confidence?	confidence note
Which site remains top under multiple methods?	consensus ranking

Rank instability does not automatically invalidate a model. It tells the committee that the decision is sensitive and deserves more evidence.

Stability intervals

A stability interval shows how much a weight can change before the ranking changes.

Example:

Demand weight: 30%
Ranking remains stable if Demand weight stays between 24% and 38%.
If Demand weight drops below 24%, Site B overtakes Site A.

That is a much more useful committee statement than:

Site A scores 82.

It tells decision-makers whether the recommendation is robust or fragile.
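A stability interval can be approximated with a grid search: move one weight in small steps, rescale the others so they still sum to 1, and record where the top-ranked site changes. The sketch below assumes proportional rescaling and uses three hypothetical sites; other rescaling rules would give different intervals.

```python
BASE_WEIGHTS = {"reach": 0.30, "demand": 0.30, "competition": 0.25, "accessibility": 0.15}

# Hypothetical normalized component scores (0-100, higher is better).
SITES = {
    "Site A": {"reach": 80, "demand": 85, "competition": 70, "accessibility": 75},
    "Site B": {"reach": 88, "demand": 62, "competition": 82, "accessibility": 80},
    "Site C": {"reach": 60, "demand": 98, "competition": 50, "accessibility": 60},
}

def reweight(base, criterion, new_weight):
    """Set one weight and rescale the rest proportionally so the total stays 1."""
    scale = (1.0 - new_weight) / (1.0 - base[criterion])
    return {k: (new_weight if k == criterion else w * scale) for k, w in base.items()}

def top_site(sites, weights):
    totals = {name: sum(weights[k] * scores[k] for k in weights)
              for name, scores in sites.items()}
    return max(totals, key=totals.get)

def stability_interval(sites, base, criterion, step_pct=1):
    """Return (top site, low %, high %) over which the top site is unchanged."""
    baseline = top_site(sites, base)
    low = high = round(base[criterion] * 100)
    while low - step_pct >= 0 and \
            top_site(sites, reweight(base, criterion, (low - step_pct) / 100)) == baseline:
        low -= step_pct
    while high + step_pct <= 100 and \
            top_site(sites, reweight(base, criterion, (high + step_pct) / 100)) == baseline:
        high += step_pct
    return baseline, low, high

print(stability_interval(SITES, BASE_WEIGHTS, "demand"))
```

For these invented scores, Site A stays on top while the Demand weight is between 28% and 59%; below that Site B overtakes it, and above that Site C does.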

Deep vertical examples

"Site scoring" is not one universal checklist. Different categories require different gates, weights, transformations, and overlays.

The reference points below are practical scoring heuristics drawn from industry sources and operator disclosures. They are not universal thresholds. Every operator should calibrate them against its format, geography, pricing, customer profile, and post-opening outcomes.

QSR and fast casual

QSR scoring is usually driven by frequency, access, speed, daypart, and channel mix.

Common scoring criteria:

Criterion	Why it matters
Drive-time reach	QSR catchments are short and convenience-driven.
Traffic and route access	Exposure matters, but access friction can overwhelm volume.
Drive-thru feasibility	For many QSR formats, drive-thru is a major revenue channel.
Lunch and dinner demand	Daypart mix can determine viability.
Worker and commuter density	Especially important for breakfast and lunch.
Delivery coverage	Affects kitchen assignment and order density.
Competitor clustering	Some clustering signals demand; too much creates saturation.
Cannibalization	High-frequency categories may tolerate some transfer but must measure it.

Useful QSR heuristics:

Signal	Practical scoring treatment
~25,000+ ADT near the site	common freestanding QSR traffic floor, but not a guarantee
Very high-speed / very high-volume road	fit-band penalty if drivers cannot decelerate or turn
QSR trade area around 5-7 minutes	starting point for car-oriented formats
Fast casual around 7-10 minutes	often broader than traditional QSR
Drive-thru can account for 70% or more of revenue at many traditional QSR formats	drive-thru feasibility may be a gate
Lunch may drive 35-40% of daily revenue in many QSR models	daypart demand should be scored, not averaged away
$40k-$80k household income may index strongly for QSR frequency in some models	income should often use a fit band, not "higher is always better"

QSR trade-area and revenue benchmarks are highly format-specific. A coffee drive-thru, burger drive-thru, chicken concept, fast casual bowl concept, and suburban pizza unit should not share the same scoring model.

A defensible QSR scorecard should use fit bands:

Traffic: fit band, not "higher is always better"
Income: concept fit band, not "higher is always better"
Competition: nonlinear, because some clustering is beneficial
Cannibalization: overlay, because transfer can be intentional or harmful
Drive-thru feasibility: gate

Example QSR scoring structure:

Layer	Example
Gate	drive-thru feasible, parking adequate, ingress acceptable
Core score	reach, demand, competition, accessibility
Forecast	P10 / P50 / P90 AUV with analog stores
Overlay	delivery overlap, same-brand transfer, market saturation
Recommendation	advance only if drive-thru geometry and lease terms hold

Urgent care and healthcare

Healthcare scoring is less about retail sales and more about access, capacity, reimbursement, and service-line fit.

Common scoring criteria:

Criterion	Why it matters
Patient access	Travel time and barriers shape actual utilization.
Payer mix	Demand without reimbursement is not equivalent to demand.
Provider capacity	Saturation depends on supply, not just population.
Service-line demand	Urgent care, primary care, imaging, dental, and specialty care differ.
Referral leakage	A new clinic may retain demand inside the system.
Regulatory constraints	Licensing and certificate requirements can be gates.
Labor and provider availability	A site cannot operate without staff.

Useful urgent care heuristics:

Signal	Practical scoring treatment
3-5 mile radius or 12-15 minute drive-time catchment	starting point, not a universal rule
2,800-3,500 square foot footprint	common urgent care prototype range
~20,000 population per existing urgent care	rough saturation reference
Household income $50k-$100k	common urgent care fit band
Payer mix	gate or high-weight criterion
Provider availability	feasibility gate
Highways, rivers, undeveloped land, income gradients	barriers that distort catchment shape

A healthcare scorecard should separate need from reachable, reimbursable, staffed demand.

Example healthcare scoring structure:

Layer	Example
Gate	licensing, payer threshold, provider availability, minimum footprint
Core score	access, demand, competition/supply, accessibility
Forecast	visits, patient panels, appointment utilization
Overlay	leakage recapture, system transfer, capacity relief
Recommendation	advance if payer mix, staffing, and access pass threshold

A healthcare model should also distinguish access gaps from business opportunity. An underserved area may have high community need but weak reimbursement, limited staffing, or regulatory constraints. That may still be strategically important, but the scoring model should make the trade-off explicit.

Fitness clubs

Fitness scoring is driven by membership penetration, income, lifestyle fit, commute patterns, co-tenancy, churn, and peak-hour capacity.

Common scoring criteria:

Criterion	Why it matters
Target adult population	Sets the membership pool.
Income and lifestyle fit	Membership type and price point vary by concept.
Drive-time / walk-time reach	Convenience affects join rate and retention.
Competitor capacity	Many markets have demand but limited remaining white space.
Co-tenancy	Grocery, health-focused retail, and daily-use centers can help.
Peak-hour access	Parking and commute patterns matter at morning and evening peaks.
Churn and retention	A site that improves convenience may reduce churn.

Useful fitness heuristics, drawn from the Health & Fitness Association's 2024-2025 reporting:

Signal	Practical scoring treatment
U.S. gym membership penetration around 24.9% of the age-6+ population (HFA 2024)	starting demand multiplier
Industry retention benchmark around 66.4% in HFA 2025 reporting	churn and retention matter, not only new joins
25-44 age segment is the largest membership group at roughly one-third of memberships	age-weighted demand scoring
Members with household income above $75k represent just over half of all memberships	income and lifestyle fit matter
Boutique studios, big-box clubs, and high-value low-price formats serve different customer profiles	do not use one scoring model for all fitness concepts

A simple fitness demand screen:

Potential members =
Catchment population
× category penetration
× target customer fit
- competitor member capacity
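As a worked example of the screen above, the sketch below plugs in invented inputs: a 60,000-person catchment, the roughly 24.9% penetration figure from the table, a 60% target-customer fit, and 5,000 members of existing competitor capacity. Flooring the result at zero is an added assumption so a saturated market does not report negative members.

```python
def potential_members(catchment_population, category_penetration,
                      target_fit, competitor_capacity):
    gross_demand = catchment_population * category_penetration * target_fit
    # Assumption: floor at zero so saturated markets read as "no remaining pool".
    return max(0.0, gross_demand - competitor_capacity)

# Hypothetical catchment; only the penetration rate comes from the table above.
estimate = potential_members(
    catchment_population=60_000,
    category_penetration=0.249,
    target_fit=0.60,
    competitor_capacity=5_000,
)
print(round(estimate))  # 3964
```

This is a screen, not a forecast: it says whether a member pool plausibly exists before the site moves to analog-based forecasting.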

A more defensible scorecard includes churn reduction and peak utilization:

Network value =
new memberships
+ churn reduction
+ capacity relief
- transfer from existing clubs
- added rent and labor

Example fitness scoring structure:

Layer	Example
Gate	footprint, parking, rent ceiling, zoning
Core score	target population, lifestyle fit, access, competition
Forecast	members, ramp curve, contribution
Overlay	member transfer, churn reduction, capacity relief
Recommendation	advance if member pool and retention benefit clear threshold

Scorecards vs spreadsheets

Spreadsheets are a reasonable starting point.

They work when:

the company has a small footprint
one or two people own the model
the concept has one format
opening cadence is slow
assumptions rarely change
cannibalization is minimal
the decision does not require an audit trail

They begin to fail when:

multiple stakeholders edit weights
versions diverge
candidate lists change weekly
the company has many existing stores
portfolio overlap matters
data sources need timestamps
the committee needs consistent briefs
the model must be validated against outcomes

The Federal Reserve's SR 11-7 guidance is written for banking model risk, but its definition is useful for any high-stakes quantitative decision system. It defines a model as a quantitative method, system, or approach that applies theory, assumptions, and data to produce quantitative estimates. The definition also covers quantitative approaches with qualitative or expert-judgment inputs when the output is quantitative. (Federal Reserve)

That definition covers many site selection scorecards.

The most important SR 11-7 lesson for site scoring is its warning about spreadsheets. The guidance says, "User-developed applications, such as spreadsheets or ad hoc database applications used to generate quantitative estimates, are particularly prone to model risk." (Federal Reserve)

A real estate committee often asks questions that spreadsheets struggle to answer:

Which model version produced this score?
What changed since last quarter?
Who changed the weights?
What data vintage was used?
Which variables are missing?
What would flip the recommendation?
How did sites like this perform historically?
What is the cannibalization impact?
Did the last five high-scoring sites actually perform?

When those questions become routine, the spreadsheet has become infrastructure.

Governance, documentation, and model risk

A site score does not have to be regulated like a banking model to benefit from model governance.

The best governance ideas are practical:

define the model's intended use
document the data sources
document weights and transformations
version the model
show missing data
track assumptions
measure outcomes
update the model when the market changes
maintain a decision history

SR 11-7 emphasizes effective challenge: objective, informed review by people who can identify model limitations and produce appropriate changes. It also describes validation as conceptual soundness, ongoing monitoring, and outcomes analysis. (Federal Reserve)

That is exactly what a real estate committee should do.

The committee should be able to ask:

Why did this site score well?
Which inputs drive the score?
Which assumptions are fragile?
What would change the recommendation?
What did similar past openings do?

Model cards for site selection

Model Cards were proposed as short documents that accompany trained machine learning models. The original paper describes them as disclosing intended use, performance characteristics, evaluation procedures, limitations, and other information that supports transparent model reporting. (arXiv)

A site scoring model card should include:

Field	Example
Model name	QSR Suburban Site Score v2.1
Intended use	Screening U.S. suburban QSR candidates
Out-of-scope use	Urban walk-up sites, airport sites, ghost kitchens
Components	Reach, Demand, Competition, Accessibility
Weights	30 / 30 / 25 / 15
Gates	drive-thru required, rent ceiling, minimum parking
Data sources	ACS, POI, routing, customer origins
Data vintage	ACS 2019-2023, POI May 2026
Calibration cohort	87 openings from 2021-2025
Validation metrics	AUV by score decile, MAPE, bias, cannibalization error
Known limitations	weak rural performance, limited mobility data
Last updated	May 2026
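A model card is easier to keep current when it is stored as structured data next to the model version. The sketch below is one possible shape, assuming a Python dataclass; the class name and the weights check are illustrative, and the field values are copied from the example table above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SiteModelCard:
    model_name: str
    intended_use: str
    out_of_scope_use: tuple
    weights: dict            # component name -> weight in percent
    gates: tuple
    data_vintage: dict
    known_limitations: tuple
    last_updated: str

    def __post_init__(self):
        # Reject a card whose documented weights do not add up.
        total = sum(self.weights.values())
        if total != 100:
            raise ValueError(f"weights sum to {total}, expected 100")

card = SiteModelCard(
    model_name="QSR Suburban Site Score v2.1",
    intended_use="Screening U.S. suburban QSR candidates",
    out_of_scope_use=("Urban walk-up sites", "airport sites", "ghost kitchens"),
    weights={"Reach": 30, "Demand": 30, "Competition": 25, "Accessibility": 15},
    gates=("drive-thru required", "rent ceiling", "minimum parking"),
    data_vintage={"ACS": "2019-2023", "POI": "May 2026"},
    known_limitations=("weak rural performance", "limited mobility data"),
    last_updated="May 2026",
)
print(card.model_name, card.weights)
```

Storing the card this way lets every site brief cite the exact model version and weights that produced its score.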

That may sound formal. It is also the type of documentation that lets a decision stand months later.

SHAP and LIME

When a scoring system uses machine learning, explanation tools can help.

SHAP assigns feature-importance values to individual predictions using a unified additive explanation framework. LIME explains individual predictions by learning an interpretable local model around the prediction. (arXiv)

In site selection, these methods are most useful for forecasts or ML-driven components, not for simple weighted scorecards. A transparent weighted score usually does not need SHAP. But if the model uses gradient boosting, random forests, or another opaque predictor, the brief should explain why a particular site received its forecast or risk estimate.

NIST AI RMF and EU AI Act Article 13

AI governance standards point in the same direction: transparency, interpretability, appropriate use, documentation, and ongoing evaluation.

NIST says its AI Risk Management Framework is intended to help organizations incorporate trustworthiness considerations into the design, development, use, and evaluation of AI systems. (NIST) The EU AI Act's Article 13 requires high-risk AI systems to be transparent enough for deployers to interpret outputs and use them appropriately, with instructions that include intended purpose, accuracy metrics, limitations, input specifications, and information to help interpret outputs. (EUR-Lex)

Even if a location scoring model is not legally classified as high-risk AI, the practical standard still applies. A model that affects capital allocation should explain itself.

For site selection, that means:

Intended use
Data sources
Transformations
Weights
Gates
Limitations
Validation results
Confidence level
Human review process

That is the difference between a score and a defensible score.

Calibration and post-opening validation

A score that never gets compared with outcomes is an untested hypothesis.

Validation should answer three questions:

Did high-scoring sites perform better than low-scoring sites?
Did the model correctly identify weak sites?
Did the score explain the right things?

What to validate

Validation target	Metric
Site performance	sales, AUV, visits, members, patients, deposits
Forecast accuracy	MAPE, bias, prediction-interval coverage
Score monotonicity	performance by score decile
Cannibalization	transfer from nearby existing stores
Saturation	cohort AUV decay, same-store impact
Feasibility	sites that failed because of real estate or operations
Confidence	whether low-confidence scores had wider errors

Score decile validation

A simple validation chart, using fixed score bands as a stand-in for true deciles:

Score band	Mature AUV	Expected pattern
90-100	Highest	strongest performance
80-89	High	above average
70-79	Medium	acceptable
60-69	Weak	watchlist
<60	Lowest	rejected or underperforming

The pattern does not need to be perfect. Real operations are messy. But if high-scoring sites do not outperform lower-scoring sites over time, the model needs recalibration.
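The check can be sketched by grouping past openings into the bands from the table above and testing whether mean mature AUV rises with the band. The openings below are invented (score, AUV in thousands of dollars), and the strict band-by-band test is only one possible definition of "outperform".

```python
from collections import defaultdict
from statistics import mean

BANDS = ["<60", "60-69", "70-79", "80-89", "90-100"]  # lowest to highest

def band(score):
    if score >= 90:
        return "90-100"
    if score >= 80:
        return "80-89"
    if score >= 70:
        return "70-79"
    if score >= 60:
        return "60-69"
    return "<60"

def mean_auv_by_band(openings):
    groups = defaultdict(list)
    for score, auv in openings:
        groups[band(score)].append(auv)
    # Keep lowest-to-highest band order and skip bands with no openings.
    return {b: mean(groups[b]) for b in BANDS if b in groups}

def is_monotonic(band_means):
    values = list(band_means.values())
    return all(lower <= higher for lower, higher in zip(values, values[1:]))

# Hypothetical (score, mature AUV in $ thousands) pairs.
openings = [(55, 1100), (58, 1300), (64, 1500), (67, 1400), (72, 1900),
            (78, 1700), (83, 2200), (88, 2000), (93, 2600)]

band_means = mean_auv_by_band(openings)
print(band_means)
print("Monotonic:", is_monotonic(band_means))
```

Individual sites are noisy here (the 67-point site underperforms the 64-point one), yet the band averages still rise, which is the pattern the model needs to show.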

Forecast validation

Forecasts should be evaluated separately from scores.

Common metrics:

MAPE = mean(|actual - forecast| / actual)

Bias = mean(forecast - actual)

Coverage = share of sites where actual result falls inside forecast interval
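The three metrics can be computed directly from the definitions above. The sketch uses four hypothetical openings and a symmetric interval of plus or minus $350K around each forecast; both are assumptions for illustration.

```python
def mape(actuals, forecasts):
    # Mean absolute percentage error; assumes every actual is non-zero.
    return sum(abs(a - f) / a for a, f in zip(actuals, forecasts)) / len(actuals)

def bias(actuals, forecasts):
    # Positive bias means forecasts ran high on average.
    return sum(f - a for a, f in zip(actuals, forecasts)) / len(actuals)

def coverage(actuals, intervals):
    inside = sum(1 for a, (lo, hi) in zip(actuals, intervals) if lo <= a <= hi)
    return inside / len(actuals)

# Hypothetical first-year sales and pre-opening forecasts, in dollars.
actuals = [2_000_000, 2_500_000, 1_600_000, 3_000_000]
forecasts = [2_400_000, 2_400_000, 2_000_000, 2_700_000]
intervals = [(f - 350_000, f + 350_000) for f in forecasts]

print(f"MAPE: {mape(actuals, forecasts):.2%}")
print(f"Bias: {bias(actuals, forecasts):,.0f}")
print(f"Coverage: {coverage(actuals, intervals):.0%}")
```

Here the forecasts run about $100K high on average and only half of the actuals land inside the stated interval, which would suggest the interval is too narrow for this cohort.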

For new sites, forecast intervals are more honest than point estimates. A forecast of $2.4M ± $700K is less satisfying than $2.4M, but it is easier to defend.

New-store forecast accuracy should not be compared casually with existing-store demand forecasting. Existing-store forecasts often use operating history, SKU history, seasonality, and recent sales data. A new location has none of that. The honest output is a scenario range or prediction interval, ideally backed by named analogs.

Validation cadence

Validation timing should reflect the format.

Timing	What to check
30 days	operational opening issues, early traffic, data sanity
90 days	ramp pattern, early transfer, channel mix
180 days	emerging trade area, same-store impact
12 months	first-year performance, forecast error
18 months	more stable cannibalization and trade-area evidence for many retail and restaurant formats
24-36 months	mature performance, cohort calibration

QSR and convenience formats may reveal meaningful signals faster than fitness clubs, healthcare clinics, or membership-based businesses. Fitness needs enough time to observe join rate, retention, churn, and peak utilization. Healthcare needs enough time to observe patient panel growth, payer mix, referral patterns, and appointment utilization.

Post-opening validation turns site selection into a learning system.

What to do with validation results

Validation should change the model.

Finding	Model response
High-scoring sites underperform	inspect weights, gates, and omitted variables
Low-confidence sites have large errors	tighten confidence rules
Forecasts are systematically high	recalibrate intercept or analog assumptions
Cannibalization underestimated	increase transfer overlay or change trade area method
One format behaves differently	create separate score profile
Rural-calibrated model misjudges urban sites	segment by market type

A scoring model should become more accurate after each opening.

Public examples of scoring logic

Commercial site scorecards are often proprietary, but several public and academic examples show how formal scoring appears in practice.

Academic AHP studies

AHP-based location studies often publish criteria weights. These are not plug-and-play templates for operators, but they show what explicit criteria weighting looks like.

Context	Example criteria and weights
Gas station siting	One AHP study (Wu, Chen & Pan, ISPRS International Journal of Geo-Information, 2024) weighted population density at 0.633, gas station supply capacity at 0.261, and road network density at 0.106.
Hospital site selection	Şahin, Ocak & Top (Health Policy and Technology, 2019) rank demand, accessibility, competitors, government policy, related industry, and environmental conditions.
Pandemic hospital siting	Boyacı & Şişman (Environmental Science and Pollution Research, 2022) use Pythagorean fuzzy AHP to weight distance to transportation, population centers, land slope, land use, distance to other hospitals, and hazards.
Bank branch siting	Basar, Kabak & Topcu (Socio-Economic Planning Sciences, 2017) weight demographics, cost, competition, transportation, flexibility, and access to public facilities.
Shopping mall criteria	AHP studies of mall performance often find tenant satisfaction among the highest-weighted criteria, sometimes above 0.35.

The lesson is not that a QSR operator should copy a hospital siting model. The lesson is that a defensible scoring model should make criteria, weights, and trade-offs visible.

Public-sector scoring rubrics

Economic development and public-sector site scoring often use a two-stage structure:

Threshold review:
pass / fail

Scored review:
weighted criteria and narrative ranking

That is a good model for commercial site selection. Hard requirements should be handled first. Weighted scoring should rank sites that remain eligible.

Franchise disclosure documents

Franchise Disclosure Document Item 11 often describes site-selection assistance. It may mention demographics, population density, income, geography, physical boundaries, competition, and other site factors. Most FDDs do not disclose the exact scoring matrix.

That tells us something important: even when public disclosure is high-level, site selection criteria are part of the franchise system's operating logic. A serious franchise operator should be able to explain how those criteria become a site approval recommendation.

Current industry state

Vendor methodology transparency varies widely.

GIS suitability tools

GIS suitability tools are often the most transparent because they show preprocessing, weighting, transformation, and combination methods. Esri's suitability analysis documentation describes criteria, influence settings, preprocessing methods, combination methods, score scaling, and weighting, including MinMax, Percentile, ZScore, Raw, Sum, Mean, Product, and Geometric Mean options. (Esri Documentation)

That is good governance. Users can see how variables enter the model.

Predictive analytics vendors

Predictive analytics vendors may use analog stores, regression, machine learning, or proprietary site scores. Some offer useful forecast ranges and comparable-store visibility. Others present a single score without exposing enough detail about features, weights, missing-data handling, validation, or calibration.

The question is not whether the score uses AI. The questions are:

What does the score measure?
What data trained or calibrated it?
How does it treat missing data?
How does it separate fit from forecast?
How does it account for cannibalization?
How has it performed after openings?

AI scores

"AI score" is a marketing term, not a methodology.

A serious AI-assisted site scoring system should still show:

intended use
model type
data sources
feature transformations
training or calibration cohort
validation metrics
confidence level
limitations
local explanation for a given site
human review process

The serious end of the market is moving toward explicit scoring rubrics, model documentation, visible assumptions, and explainable ML where appropriate. The marketing end is moving toward "AI" as a brand. Expansion leaders are getting better at telling the difference.

A Geod-style site scorecard

Geod's public methodology documents a default score built around four components: Reach, Demand, Competition, and Accessibility, with default weights of 30%, 30%, 25%, and 15%. The score is a transparent weighted linear model with documented sources, snapshot dates, and visible components.

There are two clean ways to present this kind of score.

Textbook weighted model

In a clean weighted-score display, every component is normalized so higher is better. Competition is therefore expressed as Competition Fit, meaning lower competitive pressure or better competitive context.

Component	Normalized score	Weight	Contribution
Reach fit	86	30%	25.8
Demand fit	78	30%	23.4
Competition fit	58	25%	14.5
Accessibility fit	70	15%	10.5
Total			74.2

Formula:

Score =
0.30 × Reach fit
+ 0.30 × Demand fit
+ 0.25 × Competition fit
+ 0.15 × Accessibility fit
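The table and formula above can be reproduced in a few lines. This is a sketch of the arithmetic only, using the weights and normalized scores from the example; the variable names are illustrative.

```python
WEIGHTS = {"reach": 0.30, "demand": 0.30, "competition": 0.25, "accessibility": 0.15}

# Normalized fit scores from the example table (0-100, higher is better).
fit_scores = {"reach": 86, "demand": 78, "competition": 58, "accessibility": 70}

contributions = {name: WEIGHTS[name] * fit_scores[name] for name in WEIGHTS}
score = sum(contributions.values())

for name, value in contributions.items():
    print(f"{name:<14}{value:5.1f}")
print(f"{'total':<14}{score:5.1f}")  # total 74.2
```

Printing the per-component contributions alongside the total is what keeps the score explainable: a committee can see that Competition fit, not Reach or Demand, is what holds this site at 74.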

This is the easiest version to explain to a committee.

Signed contribution display

Some products show signed contributions instead of normalized sub-scores.

Site score =
Reach contribution
+ Demand contribution
- Competition penalty
+ Accessibility contribution

Example:

Component	Signed contribution
Reach	+28
Demand	+31
Competition pressure	-12
Accessibility	+27
Site score	74

In this display, "Competition" is not a 0-100 component score. It is a penalty contribution that reduces the final score. That distinction should be explicit in the brief so a careful reader does not try to reconcile a negative contribution with a 0-100 normalized weighted-score formula.

Decision overlays

The score should then be paired with overlays.

Overlay	Example output
Cannibalization	Moderate transfer risk from Store 14
Saturation	Market cohort AUV declining; conditional approval
Feasibility	Conditional: rent and ingress need confirmation
Confidence	Medium-high: strong ACS/POI data, limited first-party origins
Validation plan	Compare ramp, affected-store sales, and customer origins at 90 and 180 days

This preserves the simplicity of the score while giving the committee the context it needs.

Evaluate vs Strategize

A useful product architecture separates first-site evaluation from portfolio strategy.

Mode	What it should do
Evaluate	Apply a defensible score to candidate sites using consistent criteria, weights, thresholds, and source-dated data.
Strategize	Apply the same strategy across a portfolio with model versioning, custom weights, batch scoring, cannibalization, saturation, feasibility, confidence, and decision history.

Geod's Evaluate tool scores specific sites and exports explainable reports. Strategize gives operators a way to apply the same strategy systematically across a multi-unit portfolio.

The Evaluate tier teaches a team that scoring can be repeatable and defensible. The Strategize tier turns that scoring into an institutional system. That matters for operators whose strategy is already in someone's head, in a consultant deck, and in three spreadsheets, but not yet versioned, consistently applied, or easy to defend.

The defensible claim is grounded:

Your expansion strategy becomes a repeatable scoring system,
applied consistently to every candidate you evaluate,
with portfolio-aware overlays and a decision record.

What a defensible site brief should show

A defensible site brief should contain the score and the argument behind it.

Recommended structure:

Site Score Summary

1. Candidate address and prototype
2. Core score and component breakdown
3. Gates passed / failed
4. Criteria weights and model version
5. Data sources and snapshot dates
6. Trade area method and time window
7. Demand assumptions
8. Competition assumptions
9. Accessibility assumptions
10. Forecast range, if available
11. Cannibalization and saturation overlays
12. Feasibility gate
13. Confidence level
14. Sensitivity notes
15. Recommendation
16. Post-opening validation plan

Example brief language

The candidate scores 74 under the QSR Suburban Scorecard v2.1. Reach and Demand are strong because the 10-minute drive-time catchment contains above-threshold target population and income for the concept. Competition reduces the score because the catchment includes a high density of direct fast-casual competitors. Accessibility is favorable, but the feasibility gate remains conditional pending ingress review and rent confirmation. The site should advance to constrained review, with cannibalization against Store 14 measured before final approval.

That paragraph is more useful than:

Site score: 74

Common mistakes

Mistake 1: Treating the score as the recommendation

The score is evidence. The recommendation is a decision.

Mistake 2: Mixing forecast and fit

A site can be strategically strong and forecast modestly. Another can forecast large and fail strategy. Keep these separate.

Mistake 3: Averaging away deal-breakers

Hard requirements belong in gates.

Mistake 4: Scoring raw variables directly

Normalize before weighting. Raw numeric ranges can dominate.

Mistake 5: Hiding weights

Weights are strategy. If they are hidden, the strategy is hidden.

Mistake 6: Double-counting correlated variables

Population, households, density, and daytime population may overlap. Competition count, saturation, and cannibalization can also overlap. Audit correlations.

Mistake 7: Changing weights after seeing the answer

Set weights before scoring candidates. Otherwise the model becomes a negotiation tool.

Mistake 8: Ignoring uncertainty

A high score with low confidence should trigger research, not approval.

Mistake 9: Treating competitors as all negative

Some competitors signal demand or create destination effects. Competition should be category-specific.

Mistake 10: Skipping validation

A model that is never compared with actual openings becomes a permanent assumption.

Mistake 11: Treating AI scoring as a methodology

AI scoring is not itself a methodology. The methodology has to be visible: what the model measures, what data it uses, how it was validated, and what it should be used for.

Mistake 12: Using one model for every format

A suburban drive-thru, an urban pickup store, a full-service clinic, and a boutique fitness studio should not share the same thresholds and weights.

FAQ

What is a site selection scoring model?

A site selection scoring model is a structured method for ranking candidate locations using criteria, weights, transformations, gates, and overlays. It helps operators compare sites consistently and explain why a candidate should advance, be rejected, or require more research.

What is a location score?

A location score is a numeric summary of how well a candidate site fits an operator's site selection criteria. A defensible score should be broken into visible components so decision-makers can see why the site scored well or poorly.

What is the difference between a site score and a sales forecast?

A site score measures strategic fit. A sales forecast estimates expected performance. A site can score well and carry a small forecast, or carry a large forecast and score poorly. The final recommendation should use both.

What is the best site selection scoring method?

For most multi-unit operators, the best practical method is hard gates plus a transparent weighted score, forecast range, network overlays, sensitivity analysis, and post-opening validation. More complex MCDA methods like AHP, TOPSIS, ELECTRE, and PROMETHEE can help in specific governance or short-list situations.

What criteria should be included in a site selection scorecard?

Common criteria include trade area population, income, daytime population, category spend, competition, traffic, accessibility, co-tenancy, site feasibility, parking, rent, customer origins, cannibalization, and saturation. The exact criteria should vary by category and prototype.

How should site selection criteria be weighted?

Weights should reflect strategy and should be set before candidates are scored. Operators can use expert judgment, AHP, historical calibration, or a hybrid approach. Weight choices should be documented and tested for sensitivity.
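A simple sensitivity test can be sketched as follows: shift each weight up and down by a fixed proportion, renormalize, and record any shift that changes the top-ranked site. The 10% shift, the weights, and the component scores are hypothetical, and the function names are illustrative.

```python
def renormalize(weights):
    """Rescale weights so they sum to 1."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def rank(sites, weights):
    """Order site names by weighted sum of component scores, best first."""
    def score(name):
        return sum(weights[c] * sites[name][c] for c in weights)
    return sorted(sites, key=score, reverse=True)


def sensitivity(sites, weights, delta=0.10):
    """Shift each weight by +/- delta and list shifts that change the top site."""
    base_top = rank(sites, weights)[0]
    flips = []
    for criterion in weights:
        for sign in (1, -1):
            shifted = dict(weights)
            shifted[criterion] = weights[criterion] * (1 + sign * delta)
            top = rank(sites, renormalize(shifted))[0]
            if top != base_top:
                flips.append((criterion, sign * delta, top))
    return base_top, flips


# Hypothetical 0-100 component scores and weights.
sites = {
    "A": {"demand": 80, "competition": 50, "access": 70},
    "B": {"demand": 70, "competition": 66, "access": 70},
}
weights = {"demand": 0.5, "competition": 0.3, "access": 0.2}

base_top, flips = sensitivity(sites, weights)
print(base_top, flips)
```

In this example Site A leads under the base weights, but lowering the demand weight or raising the competition weight by 10% puts Site B on top. A result like that suggests the ranking is fragile and the weight rationale deserves extra documentation.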

What is AHP in site selection?

AHP, or Analytic Hierarchy Process, is a multi-criteria decision method that derives weights from pairwise comparisons. It can help teams make trade-offs explicit and check consistency.
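The sketch below shows the mechanics on a three-criterion example. It derives weights with the row geometric mean, a common approximation to AHP's principal-eigenvector weights, and computes a consistency ratio using commonly cited random index values. The pairwise judgments (demand is 3 times as important as competition, and so on) are invented for illustration.

```python
import math

# Commonly cited random consistency index values by matrix size.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}


def ahp_weights(matrix):
    """Approximate AHP weights using the normalized row geometric mean."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(matrix, weights):
    """Estimate lambda_max, then CR = ((lambda_max - n) / (n - 1)) / RI."""
    n = len(matrix)
    weighted = [
        sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)
    ]
    lambda_max = sum(weighted[i] / weights[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]


# Hypothetical judgments. Row and column order: demand, competition, access.
# Entry [i][j] says how much more important criterion i is than criterion j.
pairwise = [
    [1,     3,     5],
    [1 / 3, 1,     2],
    [1 / 5, 1 / 2, 1],
]

weights = ahp_weights(pairwise)
cr = consistency_ratio(pairwise, weights)
print([round(w, 3) for w in weights], round(cr, 4))
```

A consistency ratio below about 0.10 is the conventional rule of thumb for acceptable judgments. A higher value suggests the team should revisit its pairwise comparisons before using the weights.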

What is TOPSIS in site selection?

TOPSIS ranks candidate sites by comparing each one to an ideal site and a worst-case site. It is useful for short-list comparisons but can be sensitive to normalization.
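A compact TOPSIS sketch follows. It assumes vector normalization, which is one common variant, and treats population as a benefit criterion and competitor count and rent as cost criteria. The sites, values, and weights are hypothetical.

```python
import math


def topsis(matrix, weights, benefit):
    """Return each row's closeness to the ideal (1 is best, 0 is worst).

    matrix:  one row per site, one column per criterion (raw values)
    benefit: True where higher is better, False where lower is better
    Assumes the sites are not all identical, so distances are not all zero.
    """
    n_crit = len(weights)
    # Vector normalization: divide each column by its Euclidean norm.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    weighted = [
        [weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix
    ]
    columns = list(zip(*weighted))
    ideal = [max(col) if benefit[j] else min(col) for j, col in enumerate(columns)]
    worst = [min(col) if benefit[j] else max(col) for j, col in enumerate(columns)]
    closeness = []
    for row in weighted:
        d_ideal = math.dist(row, ideal)
        d_worst = math.dist(row, worst)
        closeness.append(d_worst / (d_ideal + d_worst))
    return closeness


# Hypothetical short list. Columns: population, direct competitors, rent per sq ft.
site_names = ["A", "B", "C"]
matrix = [
    [60000, 6, 40],
    [45000, 3, 30],
    [40000, 8, 45],
]
weights = [0.5, 0.3, 0.2]
benefit = [True, False, False]

closeness = topsis(matrix, weights, benefit)
for name, value in zip(site_names, closeness):
    print(name, round(value, 3))
```

Site C is worst on every criterion, so it coincides with the worst-case point and its closeness is 0. Sites A and B trade off population against competition and rent, and their order depends on the normalization and weights chosen.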

What is rank reversal?

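Rank reversal occurs when adding or removing a candidate changes the relative order of the candidates that were already being compared. It is a known failure mode in several MCDA methods, often because scores are normalized relative to the current candidate set, and it should be tested whenever the candidate list changes during a scoring process.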
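The sketch below reproduces the effect with a weighted sum over min-max normalized criteria. The sites, the two unnamed criteria, and the equal weights are hypothetical. Adding Site D stretches the range of the second criterion, which changes how A and B compare with each other.

```python
def min_max(values):
    """Rescale values to the 0-1 range; a constant column maps to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def rank_sites(sites, weights):
    """Rank sites by weighted sum of criteria normalized within this set."""
    names = list(sites)
    n_crit = len(weights)
    columns = [min_max([sites[n][j] for n in names]) for j in range(n_crit)]
    scores = {
        name: sum(weights[j] * columns[j][i] for j in range(n_crit))
        for i, name in enumerate(names)
    }
    return sorted(names, key=scores.get, reverse=True)


# Hypothetical raw values on two higher-is-better criteria.
sites = {"A": (10, 4), "B": (6, 8), "C": (2, 2)}
weights = (0.5, 0.5)

before = rank_sites(sites, weights)
# Site D is very strong on the second criterion, stretching its range.
after = rank_sites({**sites, "D": (9, 20)}, weights)
print(before, after)
```

Before D is added, B ranks ahead of A. Afterward, A ranks ahead of B even though neither site's data changed. Normalizing against fixed reference ranges rather than the current candidate list is one way to limit this.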

Should cannibalization be part of the score?

Cannibalization should usually be shown as a network overlay rather than hidden inside the core score. The core score measures site fit. The overlay shows whether the site creates net-new demand or transfers demand from existing units.

How do you validate a site selection scoring model?

Compare scores against post-opening outcomes. Track performance by score decile, forecast error, same-store impact, cannibalization, confidence level, and outcome by market type. Recalibrate when results drift.
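A basic validation report might look like the sketch below. It groups opened sites into score bands rather than deciles, because the sample is tiny, then compares actual results and forecast error by band. The band cut-offs, field names, and figures are hypothetical.

```python
from statistics import mean


def score_band(score):
    """Assign a score to a band; the cut-offs here are illustrative."""
    if score >= 75:
        return "high"
    if score >= 60:
        return "medium"
    return "low"


def validate(openings):
    """Summarize actual results and forecast error by score band."""
    bands = {}
    for opening in openings:
        bands.setdefault(score_band(opening["score"]), []).append(opening)
    report = {}
    for band, rows in bands.items():
        report[band] = {
            "n": len(rows),
            "mean_actual": mean(r["actual"] for r in rows),
            # Mean absolute percentage error of the forecast.
            "mape": mean(abs(r["actual"] - r["forecast"]) / r["actual"] for r in rows),
        }
    return report


# Hypothetical openings: score at approval, forecast and actual sales ($M).
openings = [
    {"score": 82, "forecast": 1.20, "actual": 1.10},
    {"score": 78, "forecast": 1.10, "actual": 1.25},
    {"score": 66, "forecast": 0.95, "actual": 0.90},
    {"score": 61, "forecast": 1.00, "actual": 0.80},
    {"score": 55, "forecast": 0.85, "actual": 0.70},
    {"score": 48, "forecast": 0.80, "actual": 0.75},
]

report = validate(openings)
for band in ("high", "medium", "low"):
    print(band, report[band])
```

If higher bands do not outperform lower bands, or forecast error grows over time, that is a signal to recalibrate. In practice the same summary would also be split by market type and confidence level.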

When does a site scoring spreadsheet stop working?

A spreadsheet becomes risky when multiple people change weights, versions diverge, data gets stale, portfolio overlap matters, or the committee needs repeatable briefs and audit history.

Glossary

Site selection scoring model

A structured method for ranking candidate locations using criteria, weights, transformations, and rules.

Location score

A numeric summary of a candidate site's fit against the operator's criteria.

Site scorecard

A table or brief showing the score, components, weights, gates, overlays, and recommendation.

Weighted sum model

A scoring method that multiplies each normalized criterion score by a weight and sums the results.

Weighted geometric mean

A scoring method that multiplies normalized criterion scores together, each raised to the power of its weight. It reduces the ability of one very strong criterion to compensate for a very weak one.
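The two aggregation rules can be written side by side. The notation is an assumption introduced here: $x_i$ is the normalized score on criterion $i$, scaled to $(0, 1]$, and $w_i$ is its weight, with the weights summing to 1.

```latex
S_{\text{sum}} = \sum_{i=1}^{n} w_i \, x_i
\qquad
S_{\text{geo}} = \prod_{i=1}^{n} x_i^{\,w_i},
\qquad \sum_{i=1}^{n} w_i = 1

% Worked example: two equally weighted criteria, one strong and one weak.
% w = (0.5, 0.5), x = (0.9, 0.1)
S_{\text{sum}} = 0.5(0.9) + 0.5(0.1) = 0.50
\qquad
S_{\text{geo}} = 0.9^{0.5} \times 0.1^{0.5} = \sqrt{0.09} = 0.30
```

The weak criterion pulls the geometric mean well below the weighted sum, which is why the geometric form is less forgiving of a single poor component.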

MCDA / MCDM

Multi-criteria decision analysis or multi-criteria decision-making, a family of methods for evaluating alternatives across many criteria.

AHP

Analytic Hierarchy Process, a method that uses pairwise comparisons to derive weights and check consistency.

TOPSIS

Technique for Order Preference by Similarity to Ideal Solution, a method that ranks options by how close each is to an ideal alternative and how far it is from a worst-case alternative.

Gate

A hard requirement that a site must pass before it is scored or advanced.

Threshold

A minimum, maximum, or target value used to classify or filter a criterion.

Penalty

A visible score reduction applied when a risk is present but not fatal.

Normalization

The process of converting raw variables into comparable scales.

Percentile score

A score based on how a candidate ranks relative to a comparison set.

Z-score

A normalized value showing how many standard deviations a value is above or below the mean.

Forecast

An estimate of expected performance, such as sales, visits, members, patients, deposits, or orders.

Overlay

Additional context layered on top of the score, such as cannibalization, saturation, feasibility, confidence, or validation status.

Confidence level

A high, medium, or low assessment of data quality, model fit, missing inputs, and evidence strength.

Model card

A structured document describing a model's intended use, data, performance, limitations, and governance.

Sensitivity analysis

Testing how changes to weights, assumptions, or data inputs affect the ranking or recommendation.

Rank reversal

A multi-criteria decision failure mode where adding or removing alternatives changes the ranking of existing alternatives.

Post-opening validation

Comparing model predictions and scores against actual results after a location opens.

Conclusion

A defensible site score is a structured argument.

The score shows how the site performs against strategy. The forecast estimates performance. The overlays expose network and execution risk. The recommendation explains what to do next.

Before a candidate advances, the site brief should answer six questions:

1. Which gates did the site pass or fail?
2. How did the site score by component?
3. What assumptions and data sources produced the score?
4. What forecast range applies, if any?
5. What overlays change the decision?
6. What evidence would prove the model right or wrong after opening?

A location score survives committee when the team can take it apart and put it back together.

That is the standard for modern site selection.

See how Geod turns scoring into a defensible site brief

Geod helps operators define scoring criteria, set weights and thresholds, evaluate candidate sites, and export committee-ready briefs with maps, methodology, component breakdowns, source dates, cannibalization, saturation, feasibility, confidence, and validation context.

Evaluate gives teams a defensible way to score candidate sites. Strategize turns that scoring into a repeatable system across a portfolio, with custom weights, versioned models, batch evaluation, portfolio-aware overlays, and decision records.

The goal is not just to rank sites. It is to make every recommendation explainable enough to defend.
