A site score should make a decision easier to challenge.
That sounds backwards until a candidate reaches the real estate committee.
A black-box score can look impressive in a pipeline review. It can rank the right sites, produce a confident number, and make the deck feel more analytical. But the moment someone asks why the site scored 82, why a failed store would have scored well, why a high-traffic corridor lost points, why cannibalization was ignored, or why the model changed since last quarter, the score has to become more than a number.
It has to become a decision record.
A defensible site selection scoring model shows how a candidate location fits the operator's strategy. It exposes the criteria, weights, gates, thresholds, transformations, assumptions, data sources, missing fields, model version, confidence level, forecast range, and network impact.
Most importantly, it separates four artifacts that often get collapsed into one vague "AI score."
| Artifact | Question it answers |
|---|---|
| Score | Does this site fit our strategy? |
| Forecast | How large could this location be? |
| Overlay | What network or execution context changes the decision? |
| Recommendation | What should we do next? |
The score organizes the evidence. The forecast estimates performance. The overlays explain risk. The recommendation turns the analysis into a decision.
What is a site selection scoring model?
A site selection scoring model is a structured method for comparing candidate locations against a defined set of criteria.
For a retailer, the model might evaluate trade area population, income, category spend, co-tenancy, competition, access, foot traffic, visibility, and cannibalization. For a restaurant, it may add daypart demand, drive-thru feasibility, pickup flow, delivery radius, commute direction, and kitchen capacity. For healthcare, it may include payer mix, provider capacity, patient access, service-line demand, referral leakage, and regulatory constraints.
A scoring model helps answer:
- Which sites should advance to deeper review?
- Which candidates best match the expansion strategy?
- Which variables drive the recommendation?
- Which sites fail hard requirements?
- Which high-scoring sites need more research?
- Which candidates create network risk through cannibalization or saturation?
- Which recommendations are high-confidence, and which are fragile?
A good scorecard gives every stakeholder a way to interrogate the recommendation.
The real estate team can ask about access and trade area assumptions. Finance can ask about forecast range and rent-to-sales risk. Operations can ask about feasibility. Franchise teams can ask about encroachment. Executives can ask whether the site advances the strategy or simply looks good on a map.
The score should organize that conversation, not end it.
Score vs forecast vs overlay vs recommendation
The most common mistake in site selection scoring is turning four different artifacts into one number.
| Artifact | Purpose | Example output |
|---|---|---|
| Score | Measures strategic fit | 74 / 100 with component breakdown |
| Forecast | Estimates expected performance | P10 / P50 / P90 sales, visits, orders, members, patients |
| Overlay | Adds network and execution context | Cannibalization, saturation, feasibility, confidence |
| Recommendation | Converts evidence into action | Advance, reject, research, revise, hold |
A site can score well and forecast small. It may be strategically perfect but located in a limited trade area. A site can forecast large and score poorly. It may have high traffic and demand, but the wrong customer profile, bad access, weak economics, or heavy cannibalization.
A clean decision package looks like this:
Core score:
Reach + Demand + Competition + Accessibility
Forecast:
Expected sales / visits / members / patients, with range
Overlays:
Cannibalization
Saturation
Feasibility
Confidence
Validation requirements
Recommendation:
Advance / reject / research / revise
This distinction is especially important for AI-assisted scoring. A single "AI score" can hide whether the model is ranking strategic fit, expected sales, market potential, analog similarity, network impact, or all of those at once. The more concepts a score absorbs, the harder it becomes to defend.
The history of location scoring
Site selection scoring did not begin with AI. The modern scorecard sits on nearly a century of retail geography, spatial interaction modeling, analog reasoning, GIS suitability analysis, and multi-criteria decision analysis.
Reilly's Law of Retail Gravitation (1931)
William J. Reilly's 1931 The Law of Retail Gravitation applied a gravity analogy to retail trade areas. The idea was simple: larger retail centers attract customers from farther away, while distance weakens that attraction.
The classic attraction-balance relationship is:
PA / dA² = PB / dB²
Rearranged:
dA / dB = √(PA / PB)
Where:
| Term | Meaning |
|---|---|
| PA, PB | population or size proxy for retail centers A and B |
| dA | distance from the breakpoint to center A |
| dB | distance from the breakpoint to center B |
If A is larger than B, the breakpoint is farther from A and closer to B. That means A's trade area extends farther toward B. This is the point that often gets muddled when the formula is presented without clear variable definitions.
Reilly's model solved an early problem: how to draw deterministic trade area boundaries between competing cities or retail centers. Its limits are obvious today. It assumes simple geography, ignores road networks, treats consumers as homogeneous, and draws hard boundaries where real trade areas overlap. But the core insight still matters: demand, attraction, and distance can be modeled. (Wikipedia)
Converse and the breaking-point formula (1949)
Paul Converse later rearranged Reilly's logic into a more practical breaking-point formula:
dBP from B = DAB / (1 + √(PA / PB))
Where:
| Term | Meaning |
|---|---|
| DAB | distance between centers A and B |
| PA, PB | population or size proxy for A and B |
| dBP from B | breakpoint distance from center B |
Example:
City A population = 120,000
City B population = 30,000
Distance between cities = 60 miles
dBP from B = 60 / (1 + √(120,000 / 30,000))
dBP from B = 60 / (1 + 2)
dBP from B = 20 miles
The model says the smaller city's trade area extends about 20 miles toward the larger city. The breakpoint is 40 miles from the larger city. That is the expected result: the larger center pulls from farther away. (Wikipedia)
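The arithmetic is small enough to check in code. The following is a minimal Python sketch of Converse's formula using the example's figures; the function name `breakpoint_from_b` is hypothetical, chosen here for illustration.

```python
import math

def breakpoint_from_b(dist_ab: float, pop_a: float, pop_b: float) -> float:
    """Converse breaking point, measured from center B toward center A."""
    if dist_ab <= 0 or pop_a <= 0 or pop_b <= 0:
        raise ValueError("distance and populations must be positive")
    return dist_ab / (1 + math.sqrt(pop_a / pop_b))

# Figures from the worked example: A = 120,000, B = 30,000, 60 miles apart.
d_b = breakpoint_from_b(60, 120_000, 30_000)
d_a = 60 - d_b
print(d_b, d_a)  # 20.0 40.0

# Reilly's balance condition PA / dA^2 = PB / dB^2 holds at the breakpoint.
print(120_000 / d_a**2, 30_000 / d_b**2)  # 75.0 75.0
```

The second print confirms that the breakpoint from Converse's rearrangement satisfies Reilly's original attraction balance.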
For modern operators, the value is historical and conceptual. Converse made gravity theory usable for trade-area boundary drawing. Modern scoring models have moved beyond deterministic breakpoints, but they still ask where a site's reach begins to fade.
Christaller, threshold, and range (1933)
Walter Christaller's central place theory added two ideas that still sit underneath site scoring: threshold and range. In central place theory, threshold is the minimum market needed to support a good or service, and range is the maximum distance customers are willing to travel for it. (Wikipedia)
| Concept | Meaning in site selection |
|---|---|
| Threshold | minimum demand needed to support a location |
| Range | maximum distance or travel time customers will tolerate |
Every supportable-unit model, minimum-demand gate, drive-time screen, and trade area threshold inherits this logic.
A market may contain population, but the location only works if enough demand exists inside the range customers will actually travel. A clinic may have a strong need profile, but if patients cannot reach it within the service standard, the demand is not operationally useful. A delivery unit may have addressable households, but if it cannot serve them inside the delivery promise, the theoretical market is larger than the practical market.
Christaller's threshold/range pair is still inside the questions operators ask every day:
Does the catchment contain enough target customers?
Will those customers travel far enough to use the site?
Applebaum and the analog method (1966)
William Applebaum's 1966 work in the Journal of Marketing Research on store trade areas, market penetration, and potential sales formalized one of the most important operator-side methods in site selection: compare a candidate location with existing stores that have similar trade area characteristics.
The analog method asks:
Which existing stores look most like this candidate?
How did those stores perform?
What capture rate did they achieve?
What changed when similar stores opened nearby?
Applebaum's method dominated much of U.S. retail site selection from the 1960s through the 1990s because it matched how operators reason. If a proposed site resembles three existing stores, and those stores all reached similar average unit volumes (AUVs) or patient volumes, the comparison gives the committee a grounded forecast.
Analog methods still matter because they are explainable. A real estate team can inspect the stores behind the estimate.
Their weakness is data dependency. Analogs break when:
- the brand has too few stores
- the prototype changes
- the market type is new
- customer behavior shifts
- the candidate is out of distribution
- first-party customer-origin data is missing
Analog scoring is powerful when the portfolio is mature and comparable. It becomes fragile when the next site is unlike the past.
Huff and probabilistic trade areas (1963, 1964)
David Huff's 1963 and 1964 work replaced deterministic boundaries with probability. A customer does not simply "belong" to one trade area. Instead, the model estimates the probability that a customer at origin i chooses store j.
A common Huff formulation:
Pij = (Aj^α × Dij^-β) / Σ(Ak^α × Dik^-β)
Where:
| Term | Meaning |
|---|---|
| Pij | probability that demand from origin i chooses store j |
| Aj | attractiveness of store j |
| Dij | distance, drive time, or travel cost from i to j |
| α | attractiveness sensitivity |
| β | distance-decay sensitivity |
| k | all competing locations in the choice set |
The Huff model changed site selection because it allowed overlapping trade areas. A customer could have some probability of choosing Store A, Store B, a competitor, or no purchase. Esri describes the Huff model as a spatial interaction model where probability depends on distance, site attractiveness, and the distance and attractiveness of competing sites. Esri also notes that calibration is needed because default exponent values may not apply to the specific trade area being modeled. (ArcGIS Pro)
That matters for scoring because a site's value depends on choice probabilities, not just the number of people inside a polygon. A candidate can sit in a dense market and still be weak if competitors are more attractive or easier to reach. A site can sit farther away and still capture demand if its access and format are superior.
Lakshmanan and Hansen: market potential (1965)
Lakshmanan and Hansen's 1965 retail market potential work extended the spatial interaction tradition into shopping center sales and market aggregation. The practical move was to connect origin demand, travel friction, and destination attractiveness to potential sales.
That lineage shows up in modern location scoring whenever a model asks:
How much demand is available?
How likely is this site to capture it?
How much is already allocated to competitors or existing stores?
Nakanishi and Cooper: multiplicative competitive interaction (1974)
Masao Nakanishi and Lee Cooper's 1974 multiplicative competitive interaction model extended Huff by replacing a single attractiveness variable, such as store size, with a bundle of attractiveness attributes.
A simplified multiplicative attractiveness structure:
Aj = X1j^β1 × X2j^β2 × X3j^β3 × ... × Xnj^βn
Where each X is a site or store attribute and each β is an elasticity.
Those attributes might include:
- store size
- parking
- price
- assortment
- frontage
- co-tenancy
- brand strength
- reviews
- operating hours
- delivery coverage
- format
- local awareness
This is the bridge from gravity models to modern multivariate site scoring. A store's attractiveness is not one variable. It is a weighted bundle of factors.
Modern scoring models still use this logic, even when they do not call it MCI. They combine multiple attributes into a site fit score or sales forecast.
Regression-based new-store forecasting
As chains built larger portfolios, regression and statistical forecasting entered site selection. Existing locations became training data. Candidate-site features became predictors.
The question shifted from:
Which store does this site resemble?
to:
Given these trade area, access, competition, and format attributes,
what range of outcomes should we expect?
Regression, generalized linear models, and later machine learning helped operators move beyond pure analog reasoning. But these methods introduce their own risks: overfitting, sample-size limits, stationarity assumptions, and weak performance on new formats or new geographies.
A chain with 80 stores and 30 variables does not have "big data." It has a small-sample modeling problem that needs discipline.
GIS and suitability analysis (1980s-1990s)
GIS made location scoring repeatable. Drive-time polygons, demographic overlays, competitor layers, traffic counts, POI data, and suitability maps allowed in-house teams to run hundreds of analyses instead of commissioning one-off studies.
Esri's suitability analysis workflow is a good public example of the modern GIS approach. It identifies sites that meet user-defined criteria, preprocesses variables onto comparable scales, applies weights, combines them, and scales final scores. It also supports positive, inverse, ideal, and target-site influence types. (Esri Documentation)
That is the site scoring pattern most teams recognize:
Define criteria.
Transform variables.
Weight criteria.
Combine scores.
Rank candidates.
MCDA and decision science
Multi-criteria decision analysis, or MCDA, gave scoring models a formal decision framework. Weighted sums, AHP, TOPSIS, ELECTRE, PROMETHEE, sensitivity analysis, and rank-stability checks all address the same issue: how to make a decision when multiple criteria matter and no single metric is enough.
This matters because site selection is inherently multi-criteria. A candidate can be strong on demand, weak on access, moderate on competition, and uncertain on feasibility. The model has to combine those signals without hiding deal-breakers.
Mobility, machine learning, and AI scores (2010s-2020s)
The 2010s and 2020s added foot-traffic panels, mobile-device data, gradient boosting, store embeddings, and AI-assisted workflows. These tools can improve scoring when used carefully, especially for observed trade areas, visit behavior, analog selection, and feature attribution.
But AI scores are the newest aggregation layer, not a replacement for older questions.
Every scoring model still has to answer:
- What criteria matter?
- How are variables transformed?
- What is being forecast?
- What is being scored?
- How is cannibalization handled?
- What data is missing?
- How is the model validated?
- Can the committee challenge the recommendation?
The history matters because it keeps the modern score honest. Site selection has always been a structured argument about demand, access, competition, and choice. AI does not remove that structure. It raises the cost of hiding it.
The architecture of a defensible location score
A defensible location score has five layers.
| Layer | Purpose | Example |
|---|---|---|
| Gates | Remove ineligible or blocked sites | zoning failure, drive-thru impossible, rent ceiling exceeded |
| Core score | Rank eligible sites against strategy | Reach + Demand + Competition + Accessibility |
| Forecast | Estimate expected performance | P10 / P50 / P90 sales, visits, patients, members |
| Overlays | Add portfolio and execution context | cannibalization, saturation, feasibility, confidence |
| Recommendation | Convert evidence into action | advance, reject, research, revise |
This architecture prevents one number from doing too much work.
Gates
Gates answer whether a site is eligible. Failed gates should not be averaged away.
Examples:
- zoning does not allow the use
- drive-thru is required and impossible
- parking is below prototype requirement
- rent-to-sales economics fail threshold
- trade area is below minimum demand
- payer mix is unacceptable
- franchise territory rights block the site
A gate-failed site should be labeled:
Status: Not eligible under current criteria
Reason: Drive-thru feasibility failed
It should not receive a misleading score of 58.
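One way to keep gates from being averaged away is to run them before any scoring and return a status instead of a number. The sketch below assumes hypothetical field names (`zoning_ok`, `drive_thru_feasible`, `annual_rent`, `rent_ceiling`) and a hypothetical `evaluate_site` helper; a real gate list would come from the operator's own criteria.

```python
from typing import Callable, Dict, List

Gate = Callable[[dict], bool]

# Hypothetical gates; each returns True when the site passes.
GATES: Dict[str, Gate] = {
    "Zoning allows use": lambda s: s["zoning_ok"],
    "Drive-thru feasibility": lambda s: s["drive_thru_feasible"],
    "Rent within ceiling": lambda s: s["annual_rent"] <= s["rent_ceiling"],
}

def evaluate_site(site: dict, score_fn: Callable[[dict], float]) -> dict:
    """Run gates first; only eligible sites receive a core score."""
    failed: List[str] = [name for name, passes in GATES.items() if not passes(site)]
    if failed:
        return {"status": "Not eligible under current criteria",
                "reasons": failed, "score": None}
    return {"status": "Eligible", "reasons": [], "score": score_fn(site)}

candidate = {"zoning_ok": True, "drive_thru_feasible": False,
             "annual_rent": 180_000, "rent_ceiling": 200_000}
# The scoring function is a stand-in; it is never called for a gate-failed site.
result = evaluate_site(candidate, score_fn=lambda s: 58.0)
print(result["status"], result["reasons"], result["score"])
```

Because the score is `None` for a gate-failed site, nothing downstream can rank it against eligible candidates by accident.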
Core score
The core score ranks eligible candidates. It should be simple enough to explain and stable enough to compare.
A common structure:
Site score =
Reach contribution
+ Demand contribution
+ Competition contribution or penalty
+ Accessibility contribution
Forecast
A forecast estimates expected performance. It may use analog stores, regression, machine learning, demand allocation, or scenario modeling.
Forecasts should be presented as ranges:
P10: $1.8M
P50: $2.4M
P90: $3.1M
New-store forecasts deserve wide intervals because the site has no operating history.
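As a rough illustration, a range like this can be read off a set of analog stores using empirical quantiles. The sketch below assumes eleven hypothetical analog stores and a hypothetical `forecast_range` helper. The spread across analogs is a crude stand-in for uncertainty, not a calibrated prediction interval.

```python
import statistics

def forecast_range(analog_sales):
    """Return (P10, P50, P90) from analog-store sales using inclusive quantiles."""
    deciles = statistics.quantiles(analog_sales, n=10, method="inclusive")
    return deciles[0], deciles[4], deciles[8]

# Hypothetical first-year sales ($M) for eleven analog stores.
analogs = [1.6, 1.8, 2.0, 2.1, 2.3, 2.4, 2.5, 2.7, 2.9, 3.1, 3.4]
p10, p50, p90 = forecast_range(analogs)
print(f"P10: ${p10:.1f}M  P50: ${p50:.1f}M  P90: ${p90:.1f}M")
```

With fewer or less comparable analogs, the same calculation produces a range that looks precise but carries little evidence, which is one reason the confidence overlay sits beside the forecast.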
Overlays
Overlays capture context that belongs next to the score.
| Overlay | Why it matters |
|---|---|
| Cannibalization | A strong site may transfer demand from existing units. |
| Saturation | A market may have demand but weak marginal returns. |
| Feasibility | A good market can fail on real estate, access, labor, buildout, or operations. |
| Confidence | A high score from weak or missing data should not be treated like a high score from strong evidence. |
| Validation | A model improves only if actual openings are measured. |
Domino's fortressing language is a good example of why overlays belong outside the core score. Domino's describes adding stores in existing markets to condense delivery areas and get closer to carryout customers, while also warning that fortressing may negatively affect existing-store sales and can lead to closures if executed too rapidly. (SEC)
Sweetgreen's delivery-radius language makes the same point in channel-specific form. Sweetgreen says new restaurants in or near existing markets can affect existing restaurant sales, and it specifically warns that cannibalization may become significant when an existing restaurant's delivery radius overlaps with a new restaurant's delivery radius. (SEC)
Those are network effects. They should shape the recommendation, but they should not be hidden inside the core site fit score.
Recommendation
The recommendation is the final decision statement.
Possible outputs:
Advance to LOI.
Reject.
Advance only under constrained lease economics.
Research customer-origin data before approval.
Revise prototype.
Hold until better site supply.
The recommendation should name the key reason, not just repeat the score.
The core formulas
Weighted sum
The simplest defensible score is a weighted sum:
Scorej = Σ(wi × xij)
Where:
| Term | Meaning |
|---|---|
| Scorej | score for site j |
| wi | weight for criterion i |
| xij | normalized score of site j on criterion i |
Example:
| Component | Normalized score | Weight | Contribution |
|---|---|---|---|
| Reach fit | 86 | 30% | 25.8 |
| Demand fit | 78 | 30% | 23.4 |
| Competition fit | 58 | 25% | 14.5 |
| Accessibility fit | 70 | 15% | 10.5 |
| Total | | 100% | 74.2 |
This model is easy to explain and easy to misuse. It assumes criteria can compensate for each other. Strong demand can offset weak access. Strong reach can offset competition. Sometimes that is appropriate. Sometimes it hides a fatal flaw.
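The table above can be reproduced in a few lines. This is a minimal sketch; the `weighted_sum` helper is hypothetical, and it returns per-criterion contributions so the breakdown stays visible alongside the total.

```python
def weighted_sum(scores: dict, weights: dict):
    """Return (total, contributions) for normalized scores on a 0-100 scale."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    contributions = {k: weights[k] * scores[k] for k in weights}
    return sum(contributions.values()), contributions

# Values from the example table.
scores = {"reach": 86, "demand": 78, "competition": 58, "accessibility": 70}
weights = {"reach": 0.30, "demand": 0.30, "competition": 0.25, "accessibility": 0.15}

total, parts = weighted_sum(scores, weights)
print(round(total, 1))  # 74.2
for name, value in parts.items():
    print(f"{name}: {value:.1f}")
```

Keeping the contributions next to the total is what lets a committee see that competition, not demand, is holding this site back.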
Weighted geometric mean
When balance matters, a geometric mean can penalize unbalanced sites.
Scorej = Π(xij^wi)
The geometric mean reduces the ability of one very strong criterion to fully compensate for a very weak one.
Esri's suitability analysis supports Product and Geometric Mean combination methods. Esri describes geometric mean as useful when criteria are on different scales and when high final scores should require high values in multiple criteria, because an extreme value in one criterion will not disproportionately determine the result. (Esri Documentation)
The Human Development Index also moved to a geometric mean in 2010, which is often used as a teaching example for reducing substitutability across dimensions. The same idea applies to site selection: if demand, access, and feasibility all matter, a site that is excellent on demand but terrible on feasibility should be penalized more heavily than a simple arithmetic average would suggest. (Wikipedia)
Hard requirements should still be gates. The geometric mean reduces compensation, but it does not replace pass/fail eligibility.
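A small comparison shows the effect. The sketch below assumes weights that sum to 1 and strictly positive scores (a zero on any criterion drives the product to zero); the two sites and the helper names are hypothetical.

```python
import math

def weighted_arithmetic(scores: dict, weights: dict) -> float:
    return sum(weights[k] * scores[k] for k in weights)

def weighted_geometric(scores: dict, weights: dict) -> float:
    """Product of score ** weight; weights should sum to 1, scores must be > 0."""
    return math.prod(scores[k] ** weights[k] for k in weights)

weights = {"demand": 1 / 3, "access": 1 / 3, "feasibility": 1 / 3}
balanced = {"demand": 70, "access": 70, "feasibility": 70}
lopsided = {"demand": 100, "access": 100, "feasibility": 10}

for label, site in (("balanced", balanced), ("lopsided", lopsided)):
    print(label,
          round(weighted_arithmetic(site, weights), 1),
          round(weighted_geometric(site, weights), 1))
```

Both sites average to about 70 arithmetically, but the lopsided site falls to roughly 46 under the geometric mean because its weak feasibility score cannot be bought back by demand and access.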
AHP consistency
AHP derives weights from pairwise comparisons. It also checks whether those comparisons are internally consistent.
The consistency index:
CI = (λmax - n) / (n - 1)
The consistency ratio:
CR = CI / RI
Where:
| Term | Meaning |
|---|---|
| λmax | principal eigenvalue of the pairwise comparison matrix |
| n | number of criteria |
| RI | random index for matrix size n |
Common RI values:
| n | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|
| RI | 0 | 0.58 | 0.90 | 1.12 | 1.24 | 1.32 | 1.41 | 1.45 |
A common AHP convention is:
CR ≤ 0.10: judgments are acceptably consistent
CR > 0.10: revisit pairwise comparisons
For site selection, this matters because it prevents a committee from saying inconsistent things like:
Demand is much more important than Access.
Access is much more important than Competition.
Competition is much more important than Demand.
AHP does not eliminate judgment. It makes judgment auditable. (Wikipedia)
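The consistency check can be sketched without any numerical library by estimating the principal eigenvector with power iteration. The function name `ahp_weights_and_cr` and both matrices below are hypothetical; the second matrix encodes the circular "much more important" judgments from the example, with 5 standing in for "much more important" as an assumption.

```python
# Random index values for matrix sizes 3 through 9, as tabulated above.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def ahp_weights_and_cr(matrix, iterations=100):
    """Return (weights, consistency ratio) for a pairwise comparison matrix."""
    n = len(matrix)
    if n not in RANDOM_INDEX:
        raise ValueError("this sketch supports 3 to 9 criteria")

    def times(w):
        return [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]

    w = [1.0 / n] * n
    for _ in range(iterations):  # power iteration toward the principal eigenvector
        v = times(w)
        total = sum(v)
        w = [x / total for x in v]
    aw = times(w)
    lambda_max = sum(aw[i] / w[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)
    return w, ci / RANDOM_INDEX[n]

# Order: Demand, Access, Competition. Perfectly consistent 4 : 2 : 1 judgments.
consistent = [[1, 2, 4], [1 / 2, 1, 2], [1 / 4, 1 / 2, 1]]
# Circular judgments: Demand > Access > Competition > Demand.
circular = [[1, 5, 1 / 5], [1 / 5, 1, 5], [5, 1 / 5, 1]]

w_ok, cr_ok = ahp_weights_and_cr(consistent)
w_bad, cr_bad = ahp_weights_and_cr(circular)
print([round(x, 3) for x in w_ok], round(cr_ok, 3))
print([round(x, 3) for x in w_bad], round(cr_bad, 2))
```

The consistent matrix yields a ratio of essentially zero, while the circular one lands far above the 0.10 convention and produces equal weights, which is the signal to revisit the comparisons rather than use them.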
TOPSIS closeness
TOPSIS ranks candidates by distance from an ideal site and a worst-case site.
Cj = dj- / (dj+ + dj-)
Where:
| Term | Meaning |
|---|---|
| dj+ | distance from site j to the ideal site |
| dj- | distance from site j to the negative ideal site |
| Cj | closeness coefficient |
A higher Cj means the candidate is closer to the ideal and farther from the negative ideal.
TOPSIS is useful when comparing finalists:
Which candidate is most like our ideal site profile?
Its weakness is normalization sensitivity. Different preprocessing methods can change distances, and therefore rankings. The MCDM literature continues to emphasize that reference-type methods such as TOPSIS can produce different rankings depending on how reference solutions, distance measures, and normalization are defined. (arXiv)
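A compact sketch of the closeness calculation follows. It assumes vector normalization and Euclidean distance, which are common choices but not the only ones, and the three finalists and equal weights are hypothetical.

```python
import math

def topsis(matrix, weights, benefit):
    """Closeness coefficients; rows are sites, columns are criteria.

    benefit[j] is True when higher values of criterion j are better.
    Assumes at least two sites that differ on some criterion.
    """
    n = len(weights)
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]
    cols = list(zip(*v))
    ideal = [max(c) if b else min(c) for c, b in zip(cols, benefit)]
    worst = [min(c) if b else max(c) for c, b in zip(cols, benefit)]
    scores = []
    for row in v:
        d_plus, d_minus = math.dist(row, ideal), math.dist(row, worst)
        scores.append(d_minus / (d_plus + d_minus))
    return scores

# Hypothetical finalists: (trade area population, competitor count).
sites = ["Site A", "Site B", "Site C"]
data = [[120_000, 12], [90_000, 4], [45_000, 1]]
closeness = topsis(data, weights=[0.5, 0.5], benefit=[True, False])
ranking = sorted(zip(sites, closeness), key=lambda p: p[1], reverse=True)
for name, c in ranking:
    print(name, round(c, 3))
```

With these inputs Site B ranks first, ahead of Site C and Site A. Swapping the normalization step for min-max or percentile scaling is a quick way to test whether that ordering is stable.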
Huff probability
Huff-style models can support scoring when the question is demand allocation.
Pij = (Aj^α × Dij^-β) / Σ(Ak^α × Dik^-β)
Where:
| Term | Meaning |
|---|---|
| Pij | probability that demand from origin i chooses site j |
| Aj | attractiveness of site j |
| Dij | travel time, distance, or travel cost |
| α | attractiveness sensitivity |
| β | travel-cost sensitivity |
| k | all options in the choice set |
For a scoring model, this helps estimate:
- expected capture from an origin
- competitor pressure
- same-brand transfer
- market share potential
- cannibalization-adjusted demand
The key move is comparing allocation before and after a candidate enters the market.
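The before-and-after comparison can be sketched for a single origin. The exponents (α = 1, β = 2), the attractiveness values, and the travel times below are illustrative assumptions, not calibrated parameters, and the sketch omits a "no purchase" option.

```python
def huff_shares(attract: dict, travel: dict, alpha: float = 1.0, beta: float = 2.0) -> dict:
    """Huff choice probabilities for one origin across the stores in `attract`."""
    utility = {s: attract[s] ** alpha * travel[s] ** -beta for s in attract}
    total = sum(utility.values())
    return {s: u / total for s, u in utility.items()}

# Hypothetical origin: one same-brand store and one competitor, both 10 minutes away.
attract = {"existing_store": 100.0, "competitor": 100.0}
travel = {"existing_store": 10.0, "competitor": 10.0}
before = huff_shares(attract, travel)

# Add a same-brand candidate 5 minutes from the origin.
after = huff_shares({**attract, "candidate": 100.0}, {**travel, "candidate": 5.0})

transfer = before["existing_store"] - after["existing_store"]
brand_gain = (after["existing_store"] + after["candidate"]) - before["existing_store"]
print(round(after["candidate"], 3), round(transfer, 3), round(brand_gain, 3))
```

In this toy case the candidate captures about two thirds of the origin's demand, but half of that is transferred from the existing store, so the net gain to the brand is about one third. Summing the same comparison across all origins, weighted by demand, is what turns choice probabilities into a cannibalization overlay.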
MCI attractiveness
The multiplicative competitive interaction model generalizes attractiveness across multiple attributes:
Aj = Πm(Xmj^βm)
Where:
| Term | Meaning |
|---|---|
| Aj | attractiveness of site or store j |
| Xmj | attribute m for site j |
| βm | elasticity or importance of attribute m |
This is useful because store attractiveness is rarely one variable. It can include frontage, parking, co-tenancy, price, hours, brand strength, and format.
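A minimal sketch of the attractiveness product follows. The attribute names and elasticities are hypothetical; in practice the β values would be estimated from observed choices rather than assumed. Attributes must be strictly positive for the product form to make sense.

```python
import math

def mci_attractiveness(attributes: dict, elasticities: dict) -> float:
    """A_j = product over m of X_mj ** beta_m; attributes must be positive."""
    if any(attributes[m] <= 0 for m in elasticities):
        raise ValueError("MCI attributes must be strictly positive")
    return math.prod(attributes[m] ** beta for m, beta in elasticities.items())

# Hypothetical elasticities: size helps, parking helps less, price hurts.
betas = {"floor_area_sqft": 1.0, "parking_spaces": 0.5, "price_index": -1.0}

store = {"floor_area_sqft": 4000, "parking_spaces": 25, "price_index": 1.0}
pricier = {"floor_area_sqft": 4000, "parking_spaces": 25, "price_index": 1.25}

print(mci_attractiveness(store, betas))    # 20000.0
print(mci_attractiveness(pricier, betas))  # about 16000, a 20% drop
```

Taking logarithms turns the product into a sum of β × log(X) terms, which is why each β reads as an elasticity. The resulting Aj plugs directly into the Huff probability in place of a single size variable.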
MCDA methods: weighted sum, AHP, TOPSIS, and outranking
A site scorecard is a multi-criteria decision model. There is no universally best method. The right method depends on the audience, data, governance burden, and decision stage.
Weighted sum model
The weighted sum model is the most common.
It works well when:
- stakeholders need transparency
- criteria are measurable
- the model is used for ranking
- hard requirements are handled as gates
- criteria are reasonably independent
It fails when:
- raw variables are not normalized
- weights are changed after seeing the answer
- correlated inputs are double-counted
- hard requirements are averaged into the score
- the score is treated as a forecast
AHP
AHP is useful when stakeholders need to set weights explicitly. Instead of asking "what should Demand weigh?", AHP asks stakeholders to compare criteria pair by pair:
Is Demand more important than Accessibility?
Is Competition more important than Reach?
How much more important?
This is useful for governance because it exposes trade-offs. It becomes cumbersome when there are too many criteria.
AHP also has a known failure mode: rank reversal. Adding a new alternative, including a duplicate or near-duplicate, can change the ranking of existing alternatives. (Wikipedia)
TOPSIS
TOPSIS is useful for short lists. It asks:
Which site is closest to the ideal and farthest from the worst case?
This is intuitive for comparing finalists, but it can be sensitive to normalization choices. If the same candidates rank differently under min-max, vector, or percentile normalization, the model should surface that instability rather than hide it.
Outranking methods
ELECTRE and PROMETHEE compare candidates pair by pair and can include veto thresholds. They are more complex, but they align with how real estate decisions often work:
Site A can outrank Site B unless Site A fails a critical requirement.
Outranking methods are useful when the organization wants partial ordering instead of false precision. They are harder to explain to a non-technical committee.
Practical architecture
For most multi-unit operators, the strongest architecture is:
Hard gates
+ transparent weighted score
+ forecast range
+ network overlays
+ sensitivity analysis
+ post-opening validation
This gives the committee a score that is understandable, a forecast that is honest, and a recommendation that can be challenged.
Why scoring models fail
A site scoring model can look rigorous and still be fragile. The main failure modes have names.
Rank reversal
Rank reversal occurs when adding or removing a candidate changes the ranking of existing candidates.
Operational example:
Initial ranking:
1. Site A
2. Site B
3. Site C
After adding Site D:
1. Site B
2. Site A
3. Site D
4. Site C
If Site D is not truly relevant to the comparison between A and B, the committee will ask why the original ranking changed.
Rank reversal can occur in AHP, TOPSIS, and other MCDA methods depending on how alternatives are normalized and compared. It does not make those methods useless. It means the team behind a scoring model should know the conditions under which its rankings are stable. (Wikipedia)
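Rank reversal is easy to reproduce with min-max scaling, because adding a candidate can stretch the range of a criterion. The sketch below uses hypothetical demand and access indexes and equal weights; Site D is deliberately weak, yet adding it flips the order of A and B.

```python
def min_max_scores(sites: dict, weights: dict) -> dict:
    """Weighted sum of min-max scaled criteria (higher is better for all)."""
    totals = {name: 0.0 for name in sites}
    for crit, w in weights.items():
        values = [s[crit] for s in sites.values()]
        lo, hi = min(values), max(values)
        for name, s in sites.items():
            totals[name] += w * (s[crit] - lo) / (hi - lo)
    return totals

def ranking(sites: dict, weights: dict) -> list:
    totals = min_max_scores(sites, weights)
    return sorted(totals, key=totals.get, reverse=True)

weights = {"demand": 0.5, "access": 0.5}
sites = {
    "Site A": {"demand": 90, "access": 62},
    "Site B": {"demand": 70, "access": 70},
    "Site C": {"demand": 50, "access": 50},
}
before = ranking(sites, weights)
after = ranking({**sites, "Site D": {"demand": 10, "access": 55}}, weights)
print(before)  # ['Site A', 'Site B', 'Site C']
print(after)   # ['Site B', 'Site A', 'Site C', 'Site D']
```

Site D lowers the demand minimum from 50 to 10, which compresses A's demand advantage over B while leaving the access scale untouched. Scaling against a fixed reference range instead of the current candidate list is one way to avoid this.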
Normalization instability
The same raw data can produce different rankings depending on normalization.
Example:
| Candidate | Population | Competitors |
|---|---|---|
| Site A | 120,000 | 12 |
| Site B | 90,000 | 4 |
| Site C | 45,000 | 1 |
Min-max scaling, percentile rank, z-score standardization, and vector normalization can each tell a different story.
Esri's suitability analysis documentation treats preprocessing as a separate step for exactly this reason. It includes MinMax, Percentile, ZScore, and Raw preprocessing options, with notes about outlier sensitivity, skewed distributions, and when raw values are appropriate. (Esri Documentation)
The model should document:
- which normalization method was used
- why that method fits the variable
- whether rankings are stable under alternative transformations
- whether outliers distort the result
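A stability check can be as simple as ranking the same candidates under two transformations. The sketch below uses the three candidates from the table with an assumed 60/40 weighting of population and competition; the weights and helper names are hypothetical.

```python
def min_max(values, higher_is_better=True):
    lo, hi = min(values), max(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1 - s for s in scaled]

def rank_scale(values, higher_is_better=True):
    """Rank position scaled to 0..1; assumes no tied values."""
    order = sorted(values)
    scaled = [order.index(v) / (len(values) - 1) for v in values]
    return scaled if higher_is_better else [1 - s for s in scaled]

def rank_sites(names, population, competitors, transform, w_pop=0.6, w_comp=0.4):
    pop = transform(population)
    comp = transform(competitors, higher_is_better=False)  # fewer competitors is better
    totals = {n: w_pop * p + w_comp * c for n, p, c in zip(names, pop, comp)}
    return sorted(totals, key=totals.get, reverse=True)

names = ["Site A", "Site B", "Site C"]
population = [120_000, 90_000, 45_000]
competitors = [12, 4, 1]

by_min_max = rank_sites(names, population, competitors, min_max)
by_rank = rank_sites(names, population, competitors, rank_scale)
print(by_min_max)  # ['Site B', 'Site A', 'Site C']
print(by_rank)     # ['Site A', 'Site B', 'Site C']
```

Min-max rewards Site B for being much closer to the best competitor count than to the worst, while rank scaling discards that magnitude and puts Site A first. When the top candidate changes with the transformation, the brief should say so.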
Compensatory aggregation
Weighted sums are compensatory. A strong score in one criterion can offset a weak score in another.
That is appropriate when criteria are genuinely substitutable:
Two demand proxies can be averaged.
Two access measures can be combined.
It is dangerous when criteria are essential:
Zoning cannot be fixed by strong income.
No drive-thru cannot be fixed by high traffic.
Unacceptable payer mix cannot be fixed by strong patient need.
This is the deeper reason gates matter. Essential criteria should be gates or multiplicative penalties, not additive preferences.
The geometric-mean example is useful here. A geometric mean reduces the ability of one strong dimension to fully compensate for a weak one. In site selection, that means a site with excellent demand and poor feasibility should look more fragile than a simple arithmetic average would suggest. Esri's suitability analysis describes geometric mean as a combination method that requires high values in multiple criteria for high final scores. (Esri Documentation)
Double-counting correlated variables
Many site variables measure the same underlying signal.
Examples:
- population, households, and density
- income and education
- daytime population and employment density
- competitor count and saturation
- traffic count and road hierarchy
- corner lot and visibility score
If each is weighted independently, the model may overweight one concept because it appears in several columns.
A simple correlation audit helps. When two variables are highly correlated, combine them, drop one, or explicitly justify why both belong.
A practical rule:
If correlation is greater than 0.7, review for redundancy.
That is a heuristic, not a law. The point is to prevent accidental double-counting.
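The audit itself is a few lines of code. The sketch below computes Pearson correlations across candidate sites and flags pairs above the 0.7 review threshold; the column names and values are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def redundancy_review(columns: dict, threshold: float = 0.7):
    """Return (variable_a, variable_b, r) for pairs with |r| above the threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            flagged.append((a, b, round(r, 2)))
    return flagged

# Hypothetical values for five candidate trade areas.
columns = {
    "population_k": [45, 60, 90, 120, 150],
    "households_k": [18, 25, 36, 47, 61],
    "competitors": [8, 1, 12, 3, 6],
}
flagged = redundancy_review(columns)
print(flagged)
```

Here population and households are flagged as near-duplicates, while competitor count is not. A flag is a prompt to combine, drop, or justify, not an automatic removal.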
Weight gaming
Weights should express strategy before the site list is scored. If stakeholders change weights after seeing the output, the model becomes a negotiation tool.
A defensible process documents:
- who set the weights
- when weights were set
- why weights were chosen
- what changed since the prior version
- how sensitive rankings are to those weights
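The last item can be tested mechanically: nudge each weight, renormalize, and see whether the top-ranked site changes. The sketch below assumes a two-criterion weighted sum, a five-point shift, and two hypothetical sites chosen to be close.

```python
def rank_sites(sites: dict, weights: dict) -> list:
    totals = {name: sum(weights[c] * s[c] for c in weights) for name, s in sites.items()}
    return sorted(totals, key=totals.get, reverse=True)

def weight_sensitivity(sites: dict, weights: dict, shift: float = 0.05):
    """Return the base ranking and the (criterion, shift) pairs that change the top site."""
    base = rank_sites(sites, weights)
    flips = []
    for crit in weights:
        for delta in (-shift, shift):
            trial = dict(weights)
            trial[crit] = max(0.0, trial[crit] + delta)
            total = sum(trial.values())
            trial = {c: w / total for c, w in trial.items()}  # renormalize to sum to 1
            if rank_sites(sites, trial)[0] != base[0]:
                flips.append((crit, delta))
    return base, flips

sites = {
    "Site A": {"demand": 90, "access": 60},
    "Site B": {"demand": 70, "access": 85},
}
weights = {"demand": 0.55, "access": 0.45}

base, flips = weight_sensitivity(sites, weights)
print(base)   # ['Site B', 'Site A']
print(flips)  # [('demand', 0.05), ('access', -0.05)]
```

Site B leads under the agreed weights, but a five-point move toward demand or away from access puts Site A first. A recommendation that fragile should be presented as a close call, and the weights should be locked before the committee sees which site benefits.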
Out-of-distribution candidates
A scoring model trained or calibrated on suburban drive-thru restaurants may not work for urban walk-up stores. A model built on traditional gyms may not work for boutique fitness studios. A healthcare model calibrated on primary care clinics may not work for imaging centers.
A candidate can score high and still be low-confidence if its attributes fall outside the range the model was calibrated on.
The brief should say when a candidate is out of distribution.
Normalization, thresholds, penalties, and gates
Raw variables rarely belong directly in a weighted score.
Population may range from 5,000 to 200,000. Income may range from $35,000 to $180,000. Competitor counts may range from 0 to 60. Traffic counts may range from 2,000 to 80,000 vehicles per day. If those values are added directly, the variable with the largest numeric range dominates.
Common transformations
| Method | How it works | Best use | Risk |
|---|---|---|---|
| Min-max | scales values from 0 to 1 | bounded variables, intuitive scoring | outliers distort range |
| Percentile rank | scores by rank position | skewed data, committee communication | loses magnitude differences |
| Z-score | standard deviations from mean | portfolio benchmarking | less intuitive |
| Raw | uses original value | already comparable fields | can dominate model |
| Log transform | compresses skewed values | income, density, spending | harder to explain |
| Fit-band scoring | rewards values near ideal band | traffic, income, spacing | requires calibration |
Fit-band scoring
Some variables are non-monotonic.
Traffic is a good example. Low traffic may not provide enough exposure. Extremely high-speed or high-volume traffic can make access difficult. In QSR scoring, a site may score best in a target band rather than simply scoring higher as traffic rises.
A simple fit-band function:
Traffic score = 0 below minimum
Traffic score rises to peak in target band
Traffic score falls as access friction increases
This is more realistic than treating traffic as "higher is always better."
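One simple way to encode this shape is a trapezoid. The sketch below assumes a linear rise and a linear fall to zero at a maximum, and the thresholds (in vehicles per day) are hypothetical values that would need calibration against the operator's own stores.

```python
def fit_band_score(value: float, minimum: float, band_low: float,
                   band_high: float, maximum: float) -> float:
    """Trapezoidal score: 0 outside [minimum, maximum], 1 inside the target band."""
    if value <= minimum or value >= maximum:
        return 0.0
    if value < band_low:
        return (value - minimum) / (band_low - minimum)   # rising toward the band
    if value <= band_high:
        return 1.0                                        # inside the target band
    return (maximum - value) / (maximum - band_high)      # falling as friction grows

# Hypothetical QSR traffic thresholds (vehicles per day).
band = dict(minimum=10_000, band_low=25_000, band_high=45_000, maximum=80_000)

for traffic in (5_000, 17_500, 30_000, 62_500, 90_000):
    print(traffic, fit_band_score(traffic, **band))
```

The five sample values score 0.0, 0.5, 1.0, 0.5, and 0.0, so a 62,500-vehicle corridor is treated the same as a 17,500-vehicle street rather than as the best site on the list.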
Gates
Gates are pass/fail requirements.
| Gate | Why it belongs outside the score |
|---|---|
| Zoning allows use | A high demand score cannot fix illegal use. |
| Drive-thru feasible | A drive-thru-dependent prototype cannot average this away. |
| Rent below ceiling | Unit economics can block approval. |
| Minimum trade area population | A market may be too small regardless of access. |
| Parking requirement | Operations may be infeasible without it. |
| Healthcare payer mix | Patient demand without adequate reimbursement may not be viable. |
| Franchise territory | Legal or contractual constraints can block the site. |
A gate-failed site should be surfaced separately:
Gate result: Failed
Reason: Drive-thru feasibility
Recommendation: revise prototype or reject
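A gate check can run before any weighted scoring and report its own result. The sketch below is one possible shape, assuming a site is a plain dictionary; the field names (`zoning_allows_use`, `drive_thru_feasible`, `annual_rent`, `rent_ceiling`) and the values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Gate:
    name: str
    passes: Callable[[dict], bool]


# Hypothetical gates for a drive-thru-dependent prototype.
GATES = [
    Gate("Zoning allows use", lambda s: s["zoning_allows_use"]),
    Gate("Drive-thru feasibility", lambda s: s["drive_thru_feasible"]),
    Gate("Rent below ceiling", lambda s: s["annual_rent"] <= s["rent_ceiling"]),
]


def evaluate_gates(site, gates=GATES):
    """Return the gate result and every failed gate, independent of the score."""
    failed = [gate.name for gate in gates if not gate.passes(site)]
    return {"gate_result": "Failed" if failed else "Passed", "reasons": failed}


candidate = {
    "zoning_allows_use": True,
    "drive_thru_feasible": False,
    "annual_rent": 180_000,
    "rent_ceiling": 200_000,
}
gate_report = evaluate_gates(candidate)
```

Because the gate result is returned separately, a strong demand score cannot average a failed gate away.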
Penalties
Penalties reduce a score when risk is present but not fatal.
Examples:
| Penalty | Use case |
|---|---|
| Cannibalization penalty | Candidate overlaps existing store demand. |
| Saturation penalty | Recent cohorts underperform in market. |
| Access penalty | Poor ingress, left-turn friction, or one-way constraint. |
| Confidence penalty | Missing or stale critical inputs. |
| Competitor penalty | Direct competitors dominate the target trade area. |
Penalties should be visible. Hidden penalties create committee confusion.
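One way to keep penalties visible is to return them as a ledger next to the final score rather than folding them into a component. This is a minimal sketch; the penalty sizes are hypothetical, and flooring the result at zero is an assumption.

```python
def apply_penalties(base_score, penalties):
    """Apply named penalties and return the final score plus an itemized ledger."""
    ledger = [(name, -abs(points)) for name, points in penalties]
    # Assumption: the final score is floored at 0.
    final = max(0.0, base_score + sum(points for _, points in ledger))
    return final, ledger


# Hypothetical penalty sizes, in score points.
final_score, ledger = apply_penalties(
    82.0,
    [("Cannibalization", 5.0), ("Access", 3.0)],
)
```

The ledger can be printed in the brief so the committee sees which risks moved the score and by how much.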
Missing data
Do not silently impute critical inputs.
Missing data should be handled in one of three ways:
| Missing-data approach | Use case |
|---|---|
| Research required | critical field missing, such as rent or zoning |
| Confidence reduction | useful field missing, such as traffic or customer-origin data |
| Documented imputation | low-risk variable, clear method, strong rationale |
The site brief should say:
Confidence: Medium
Reason: traffic data unavailable; score relies on road class and observed nearby demand
A high score with low confidence is not an approval. It is a research priority.
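The three approaches can be encoded as an explicit policy per field, so a missing input always produces a visible consequence. The sketch below is illustrative: the field names, the policy assignments, and the mapping to Low, Medium, and High confidence are assumptions.

```python
# Hypothetical policy: which missing fields block a decision and which
# only reduce confidence.
FIELD_POLICY = {
    "rent": "research_required",
    "zoning": "research_required",
    "traffic": "confidence_reduction",
    "customer_origins": "confidence_reduction",
}


def review_missing(site):
    """Classify missing fields and derive a confidence label for the brief."""
    missing = [field for field in FIELD_POLICY if site.get(field) is None]
    research = [f for f in missing if FIELD_POLICY[f] == "research_required"]
    reduced = [f for f in missing if FIELD_POLICY[f] == "confidence_reduction"]
    if research:
        confidence = "Low"      # assumption: a critical gap caps confidence
    elif reduced:
        confidence = "Medium"
    else:
        confidence = "High"
    return {
        "confidence": confidence,
        "research_required": research,
        "confidence_reduced_by": reduced,
    }


site = {"rent": 185_000, "zoning": "C-2", "traffic": None, "customer_origins": [0.4, 0.6]}
missing_report = review_missing(site)
```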
Weights, sensitivity, and rank stability
Weights express strategy.
A coffee concept may weight morning access and worker density heavily. A grocery concept may weight household density, income, vehicle access, basket opportunity, and co-tenancy. An urgent care operator may weight patient access, payer mix, provider capacity, and service-line need.
Weights should be set before candidate sites are scored.
Common ways to choose weights
| Method | Description | Best use |
|---|---|---|
| Equal weights | Every component has the same influence | early model, no strong strategy yet |
| Expert weights | Leadership or analysts assign weights | small teams, simple model |
| AHP | Pairwise comparisons produce weights | governance-heavy teams |
| Historical calibration | Weights tuned against past openings | mature operators with data |
| Segment-specific weights | Different weights by format or market type | multi-format operators |
| Hybrid | Expert weights refined by validation | most practical approach |
Esri's suitability analysis documentation states that weights significantly affect resulting scores, that weight selection is subjective, and that weights should be backed by strong rationale and subject-matter expertise. (Esri Documentation)
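For the AHP row in the table, the sketch below derives weights from a pairwise comparison matrix. It uses the row geometric-mean approximation rather than the principal eigenvector, which is a deliberate simplification. The comparison judgments are hypothetical, and the random index values are Saaty's commonly cited figures.

```python
from math import prod

# Saaty's random consistency index for small matrices.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}


def ahp_weights(matrix):
    """Approximate AHP weights with normalized row geometric means."""
    n = len(matrix)
    geo_means = [prod(row) ** (1 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(matrix, weights):
    """Estimate the consistency ratio; values under about 0.10 are usually accepted."""
    n = len(matrix)
    weighted_rows = [sum(a * w for a, w in zip(row, weights)) for row in matrix]
    lambda_max = sum(x / w for x, w in zip(weighted_rows, weights)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]


# Hypothetical judgments: Demand is 2x as important as Competition and
# 4x as important as Accessibility; Competition is 2x Accessibility.
criteria = ["Demand", "Competition", "Accessibility"]
pairwise = [
    [1, 2, 4],
    [1 / 2, 1, 2],
    [1 / 4, 1 / 2, 1],
]

weights = ahp_weights(pairwise)
ratio = consistency_ratio(pairwise, weights)
```

These judgments are perfectly consistent, so the weights come out near 4/7, 2/7, and 1/7 and the consistency ratio is effectively zero. Real committee judgments rarely are, which is why the ratio is worth reporting.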
Sensitivity analysis
Sensitivity analysis asks:
If we change the weights, does the recommendation change?
Examples:
- If Competition weight increases from 25% to 35%, does Site A still rank first?
- If Demand weight decreases by 10 points, does Site B move ahead?
- If missing traffic data is added, does Site C change materially?
- If cannibalization transfer is 30% instead of 20%, does the site still clear the hurdle?
A good scorecard should show:
| Sensitivity question | Output |
|---|---|
| What weight change would flip the top two sites? | stability interval |
| Which criterion drives the ranking most? | driver analysis |
| Which site is second-best under alternate weights? | robustness check |
| Which inputs are missing or low confidence? | confidence note |
| Which site remains top under multiple methods? | consensus ranking |
Rank instability does not automatically invalidate a model. It tells the committee that the decision is sensitive and deserves more evidence.
Stability intervals
A stability interval shows how much a weight can change before the ranking changes.
Example:
Demand weight: 30%
Ranking remains stable if Demand weight stays between 24% and 38%.
If Demand weight drops below 24%, Site B overtakes Site A.
That is a much more useful committee statement than:
Site A scores 82.
It tells decision-makers whether the recommendation is robust or fragile.
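A stability interval can be found by sweeping one weight and checking when the top-ranked site changes. The sketch below assumes the other weights are rescaled proportionally so the total stays at 1, and it searches in whole percentage points. The three sites and their component scores are hypothetical.

```python
def weighted_score(components, weights):
    return sum(weights[name] * components[name] for name in weights)


def reweight(weights, criterion, new_weight):
    """Set one weight and rescale the others proportionally so they sum to 1."""
    scale = (1.0 - new_weight) / (1.0 - weights[criterion])
    return {
        name: new_weight if name == criterion else w * scale
        for name, w in weights.items()
    }


def top_site(sites, weights):
    return max(sites, key=lambda name: weighted_score(sites[name], weights))


def stability_interval(sites, weights, criterion):
    """Return (leader, low_pct, high_pct) for which the leader stays on top."""
    leader = top_site(sites, weights)
    base = round(weights[criterion] * 100)

    def still_leads(pct):
        return top_site(sites, reweight(weights, criterion, pct / 100)) == leader

    low = base
    while low > 0 and still_leads(low - 1):
        low -= 1
    high = base
    while high < 100 and still_leads(high + 1):
        high += 1
    return leader, low, high


WEIGHTS = {"reach": 0.30, "demand": 0.30, "competition": 0.25, "accessibility": 0.15}
# Hypothetical normalized component scores (0-100, higher is better).
SITES = {
    "Site A": {"reach": 70, "demand": 92, "competition": 65, "accessibility": 70},
    "Site B": {"reach": 80, "demand": 70, "competition": 75, "accessibility": 75},
    "Site C": {"reach": 60, "demand": 100, "competition": 60, "accessibility": 60},
}

leader, low_pct, high_pct = stability_interval(SITES, WEIGHTS, "demand")
```

With these inputs Site A leads at the 30% baseline but only holds first place while the Demand weight stays between 29% and 50%. The one-point cushion on the low side is the kind of fragility a committee should see.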
Deep vertical examples
"Site scoring" is not one universal checklist. Different categories require different gates, weights, transformations, and overlays.
The reference points below are practical scoring heuristics drawn from industry sources and operator disclosures. They are not universal thresholds. Every operator should calibrate them against its format, geography, pricing, customer profile, and post-opening outcomes.
QSR and fast casual
QSR scoring is usually driven by frequency, access, speed, daypart, and channel mix.
Common scoring criteria:
| Criterion | Why it matters |
|---|---|
| Drive-time reach | QSR catchments are short and convenience-driven. |
| Traffic and route access | Exposure matters, but access friction can overwhelm volume. |
| Drive-thru feasibility | For many QSR formats, drive-thru is a major revenue channel. |
| Lunch and dinner demand | Daypart mix can determine viability. |
| Worker and commuter density | Especially important for breakfast and lunch. |
| Delivery coverage | Affects kitchen assignment and order density. |
| Competitor clustering | Some clustering signals demand; too much creates saturation. |
| Cannibalization | High-frequency categories may tolerate some transfer but must measure it. |
Useful QSR heuristics:
| Signal | Practical scoring treatment |
|---|---|
| ~25,000+ average daily traffic (ADT) near the site | common freestanding QSR traffic floor, but not a guarantee |
| Very high-speed / very high-volume road | fit-band penalty if drivers cannot decelerate or turn |
| QSR trade area around 5-7 minutes | starting point for car-oriented formats |
| Fast casual around 7-10 minutes | often broader than traditional QSR |
| Drive-thru can account for 70% or more of revenue at many traditional QSR formats | drive-thru feasibility may be a gate |
| Lunch may drive 35-40% of daily revenue in many QSR models | daypart demand should be scored, not averaged away |
| $40k-$80k household income may index strongly for QSR frequency in some models | income should often use a fit band, not "higher is always better" |
QSR trade-area and revenue benchmarks are highly format-specific. A coffee drive-thru, burger drive-thru, chicken concept, fast casual bowl concept, and suburban pizza unit should not share the same scoring model.
A defensible QSR scorecard should match each variable to the right treatment:
Traffic: fit band, not "higher is always better"
Income: concept fit band, not "higher is always better"
Competition: nonlinear, because some clustering is beneficial
Cannibalization: overlay, because transfer can be intentional or harmful
Drive-thru feasibility: gate
Example QSR scoring structure:
| Layer | Example |
|---|---|
| Gate | drive-thru feasible, parking adequate, ingress acceptable |
| Core score | reach, demand, competition, accessibility |
| Forecast | P10 / P50 / P90 AUV with analog stores |
| Overlay | delivery overlap, same-brand transfer, market saturation |
| Recommendation | advance only if drive-thru geometry and lease terms hold |
Urgent care and healthcare
Healthcare scoring is less about retail sales and more about access, capacity, reimbursement, and service-line fit.
Common scoring criteria:
| Criterion | Why it matters |
|---|---|
| Patient access | Travel time and barriers shape actual utilization. |
| Payer mix | Demand without reimbursement is not equivalent to demand. |
| Provider capacity | Saturation depends on supply, not just population. |
| Service-line demand | Urgent care, primary care, imaging, dental, and specialty care differ. |
| Referral leakage | A new clinic may retain demand inside the system. |
| Regulatory constraints | Licensing and certificate requirements can be gates. |
| Labor and provider availability | A site cannot operate without staff. |
Useful urgent care heuristics:
| Signal | Practical scoring treatment |
|---|---|
| 3-5 mile radius or 12-15 minute drive-time catchment | starting point, not a universal rule |
| 2,800-3,500 square foot footprint | common urgent care prototype range |
| ~20,000 population per existing urgent care | rough saturation reference |
| Household income $50k-$100k | common urgent care fit band |
| Payer mix | gate or high-weight criterion |
| Provider availability | feasibility gate |
| Highways, rivers, undeveloped land, income gradients | barriers that distort catchment shape |
A healthcare scorecard should separate need from reachable, reimbursable, staffed demand.
Example healthcare scoring structure:
| Layer | Example |
|---|---|
| Gate | licensing, payer threshold, provider availability, minimum footprint |
| Core score | patient reach, demand, competition/supply, accessibility |
| Forecast | visits, patient panels, appointment utilization |
| Overlay | leakage recapture, system transfer, capacity relief |
| Recommendation | advance if payer mix, staffing, and access pass threshold |
A healthcare model should also distinguish access gaps from business opportunity. An underserved area may have high community need but weak reimbursement, limited staffing, or regulatory constraints. That may still be strategically important, but the scoring model should make the trade-off explicit.
Fitness clubs
Fitness scoring is driven by membership penetration, income, lifestyle fit, commute patterns, co-tenancy, churn, and peak-hour capacity.
Common scoring criteria:
| Criterion | Why it matters |
|---|---|
| Target adult population | Sets the membership pool. |
| Income and lifestyle fit | Membership type and price point vary by concept. |
| Drive-time / walk-time reach | Convenience affects join rate and retention. |
| Competitor capacity | Many markets have demand but limited remaining white space. |
| Co-tenancy | Grocery, health-focused retail, and daily-use centers can help. |
| Peak-hour access | Parking and commute patterns matter at morning and evening peaks. |
| Churn and retention | A site that improves convenience may reduce churn. |
Useful fitness heuristics, drawn from the Health & Fitness Association's 2024-2025 reporting:
| Signal | Practical scoring treatment |
|---|---|
| U.S. gym membership penetration around 24.9% of the age-6+ population (HFA 2024) | starting demand multiplier |
| Industry retention benchmark around 66.4% in HFA 2025 reporting | churn and retention matter, not only new joins |
| 25-44 age segment is the largest membership group at roughly one-third of memberships | age-weighted demand scoring |
| Members with household income above $75k represent just over half of all memberships | income and lifestyle fit matter |
| Boutique studios, big-box clubs, and high-value low-price formats serve different customer profiles | do not use one scoring model for all fitness concepts |
A simple fitness demand screen:
Potential members =
Catchment population
× category penetration
× target customer fit
- competitor member capacity
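As a worked example of the screen, the sketch below plugs in the roughly 24.9% penetration figure from the table above. The catchment size, target-fit share, and competitor capacity are hypothetical, and flooring the result at zero is an assumption.

```python
def potential_members(catchment_population, category_penetration,
                      target_customer_fit, competitor_member_capacity):
    """Addressable members after subtracting existing competitor capacity."""
    addressable = catchment_population * category_penetration * target_customer_fit
    # Assumption: a saturated catchment is reported as 0, not a negative pool.
    return max(0.0, addressable - competitor_member_capacity)


# Hypothetical catchment: 60,000 residents, 60% matching the target profile,
# and competitors able to serve about 6,500 members.
pool = potential_members(
    catchment_population=60_000,
    category_penetration=0.249,
    target_customer_fit=0.60,
    competitor_member_capacity=6_500,
)
```

Here 60,000 × 0.249 × 0.60 gives about 8,964 addressable members, leaving roughly 2,464 after competitor capacity.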
A more defensible scorecard includes churn reduction and peak utilization:
Network value =
new memberships
+ churn reduction
+ capacity relief
- transfer from existing clubs
- added rent and labor
Example fitness scoring structure:
| Layer | Example |
|---|---|
| Gate | footprint, parking, rent ceiling, zoning |
| Core score | target population, lifestyle fit, access, competition |
| Forecast | members, ramp curve, contribution |
| Overlay | member transfer, churn reduction, capacity relief |
| Recommendation | advance if member pool and retention benefit clear threshold |
Scorecards vs spreadsheets
Spreadsheets are a reasonable starting point.
They work when:
- the company has a small footprint
- one or two people own the model
- the concept has one format
- opening cadence is slow
- assumptions rarely change
- cannibalization is minimal
- the decision does not require an audit trail
They begin to fail when:
- multiple stakeholders edit weights
- versions diverge
- candidate lists change weekly
- the company has many existing stores
- portfolio overlap matters
- data sources need timestamps
- the committee needs consistent briefs
- the model must be validated against outcomes
The Federal Reserve's SR 11-7 guidance is written for banking model risk, but its definition is useful for any high-stakes quantitative decision system. It defines a model as a quantitative method, system, or approach that applies theory, assumptions, and data to produce quantitative estimates. The definition also covers quantitative approaches with qualitative or expert-judgment inputs when the output is quantitative. (Federal Reserve)
That definition covers many site selection scorecards.
The most important SR 11-7 lesson for site scoring is its warning about spreadsheets. The guidance says, "User-developed applications, such as spreadsheets or ad hoc database applications used to generate quantitative estimates, are particularly prone to model risk." (Federal Reserve)
A real estate committee often asks questions that spreadsheets struggle to answer:
- Which model version produced this score?
- What changed since last quarter?
- Who changed the weights?
- What data vintage was used?
- Which variables are missing?
- What would flip the recommendation?
- How did sites like this perform historically?
- What is the cannibalization impact?
- Did the last five high-scoring sites actually perform?
When those questions become routine, the spreadsheet has become infrastructure.
Governance, documentation, and model risk
A site score does not have to be regulated like a banking model to benefit from model governance.
The best governance ideas are practical:
- define the model's intended use
- document the data sources
- document weights and transformations
- version the model
- show missing data
- track assumptions
- measure outcomes
- update the model when the market changes
- maintain a decision history
SR 11-7 emphasizes effective challenge: objective, informed review by people who can identify model limitations and produce appropriate changes. It also describes validation as having three core elements: evaluation of conceptual soundness, ongoing monitoring, and outcomes analysis. (Federal Reserve)
That is exactly what a real estate committee should do.
The committee should be able to ask:
Why did this site score well?
Which inputs drive the score?
Which assumptions are fragile?
What would change the recommendation?
What did similar past openings do?
Model cards for site selection
Model Cards were proposed as short documents that accompany trained machine learning models. They disclose intended use, performance evaluation procedures, limitations, and other relevant information to support transparent model reporting. (arXiv)
A site scoring model card should include:
| Field | Example |
|---|---|
| Model name | QSR Suburban Site Score v2.1 |
| Intended use | Screening U.S. suburban QSR candidates |
| Out-of-scope use | Urban walk-up sites, airport sites, ghost kitchens |
| Components | Reach, Demand, Competition, Accessibility |
| Weights | 30 / 30 / 25 / 15 |
| Gates | drive-thru required, rent ceiling, minimum parking |
| Data sources | ACS, POI, routing, customer origins |
| Data vintage | ACS 2019-2023, POI May 2026 |
| Calibration cohort | 87 openings from 2021-2025 |
| Validation metrics | AUV by score decile, MAPE, bias, cannibalization error |
| Known limitations | weak rural performance, limited mobility data |
| Last updated | May 2026 |
That may sound formal. It is also the type of documentation that lets a decision stand months later.
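A model card can also live next to the model as a small, versioned record. The sketch below mirrors some of the example fields in the table using a Python dataclass. The class name and structure are hypothetical, and the check that weights total 100 is an assumption about how a team might guard against silent edits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteModelCard:
    name: str
    version: str
    intended_use: str
    out_of_scope: tuple
    weights: dict            # percentage points per component
    gates: tuple
    data_vintage: dict
    known_limitations: tuple

    def __post_init__(self):
        # Assumption: weights are stored as percentage points totalling 100.
        if sum(self.weights.values()) != 100:
            raise ValueError("component weights must sum to 100")


card = SiteModelCard(
    name="QSR Suburban Site Score",
    version="2.1",
    intended_use="Screening U.S. suburban QSR candidates",
    out_of_scope=("Urban walk-up sites", "Airport sites", "Ghost kitchens"),
    weights={"Reach": 30, "Demand": 30, "Competition": 25, "Accessibility": 15},
    gates=("drive-thru required", "rent ceiling", "minimum parking"),
    data_vintage={"ACS": "2019-2023", "POI": "May 2026"},
    known_limitations=("weak rural performance", "limited mobility data"),
)
```

Storing the card with each scored brief makes "which model version produced this score?" a lookup rather than a reconstruction.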
SHAP and LIME
When a scoring system uses machine learning, explanation tools can help.
SHAP assigns feature-importance values to individual predictions using a unified additive explanation framework. LIME explains individual predictions by learning an interpretable local model around the prediction. (arXiv)
In site selection, these methods are most useful for forecasts or ML-driven components, not for simple weighted scorecards. A transparent weighted score usually does not need SHAP. But if the model uses gradient boosting, random forests, or another opaque predictor, the brief should explain why a particular site received its forecast or risk estimate.
NIST AI RMF and EU AI Act Article 13
AI governance standards point in the same direction: transparency, interpretability, appropriate use, documentation, and ongoing evaluation.
NIST says its AI Risk Management Framework is intended to help organizations incorporate trustworthiness considerations into the design, development, use, and evaluation of AI systems. (NIST) The EU AI Act's Article 13 requires high-risk AI systems to be transparent enough for deployers to interpret outputs and use them appropriately, with instructions that include intended purpose, accuracy metrics, limitations, input specifications, and information to help interpret outputs. (EUR-Lex)
Even if a location scoring model is not legally classified as high-risk AI, the practical standard still applies. A model that affects capital allocation should explain itself.
For site selection, that means:
Intended use
Data sources
Transformations
Weights
Gates
Limitations
Validation results
Confidence level
Human review process
That is the difference between a score and a defensible score.
Calibration and post-opening validation
A score that never gets compared with outcomes is an untested hypothesis.
Validation should answer three questions:
- Did high-scoring sites perform better than low-scoring sites?
- Did the model correctly identify weak sites?
- Did the score explain the right things?
What to validate
| Validation target | Metric |
|---|---|
| Site performance | sales, AUV, visits, members, patients, deposits |
| Forecast accuracy | MAPE, bias, prediction-interval coverage |
| Score monotonicity | performance by score decile |
| Cannibalization | transfer from nearby existing stores |
| Saturation | cohort AUV decay, same-store impact |
| Feasibility | sites that failed because of real estate or operations |
| Confidence | whether low-confidence scores had wider errors |
Score decile validation
A simple validation chart, using fixed score bands as a stand-in for true deciles:
| Score band | Mature AUV | Expected pattern |
|---|---|---|
| 90-100 | Highest | strongest performance |
| 80-89 | High | above average |
| 70-79 | Medium | acceptable |
| 60-69 | Weak | watchlist |
| <60 | Lowest | rejected or underperforming |
The pattern does not need to be perfect. Real operations are messy. But if high-scoring sites do not outperform lower-scoring sites over time, the model needs recalibration.
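The band check can be automated once openings have matured. The sketch below groups hypothetical openings into the bands from the table, averages mature AUV per band, and tests whether the averages fall as the band falls. Requiring a strict non-increasing pattern is an assumption; a team may prefer to tolerate small inversions.

```python
from statistics import mean

BANDS = ["90-100", "80-89", "70-79", "60-69", "<60"]  # ordered best to worst


def band_for(score):
    if score >= 90:
        return "90-100"
    if score >= 80:
        return "80-89"
    if score >= 70:
        return "70-79"
    if score >= 60:
        return "60-69"
    return "<60"


def mean_auv_by_band(openings):
    """Average mature AUV per score band, skipping bands with no openings."""
    grouped = {band: [] for band in BANDS}
    for score, auv in openings:
        grouped[band_for(score)].append(auv)
    return {band: mean(values) for band, values in grouped.items() if values}


def is_monotonic(band_means):
    """True if average AUV never rises when moving to a lower score band."""
    ordered = [band_means[band] for band in BANDS if band in band_means]
    return all(higher >= lower for higher, lower in zip(ordered, ordered[1:]))


# Hypothetical (score at approval, mature AUV in $M) pairs.
openings = [(93, 2.9), (91, 2.7), (85, 2.5), (82, 2.3),
            (76, 2.1), (71, 2.2), (64, 1.8), (55, 1.5)]

band_means = mean_auv_by_band(openings)
pattern_holds = is_monotonic(band_means)
```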
Forecast validation
Forecasts should be evaluated separately from scores.
Common metrics:
MAPE = mean(|actual - forecast| / actual)
Bias = mean(forecast - actual)
Coverage = share of sites where actual result falls inside forecast interval
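The three metrics translate directly into code. The sketch below uses hypothetical actuals, point forecasts, and forecast intervals in $M; the function names are illustrative.

```python
from statistics import mean


def mape(actuals, forecasts):
    """Mean absolute percentage error, relative to actual results."""
    return mean([abs(a - f) / a for a, f in zip(actuals, forecasts)])


def bias(actuals, forecasts):
    """Mean signed error; positive means forecasts ran high."""
    return mean([f - a for a, f in zip(actuals, forecasts)])


def coverage(actuals, intervals):
    """Share of sites whose actual result fell inside the forecast interval."""
    hits = [low <= a <= high for a, (low, high) in zip(actuals, intervals)]
    return sum(hits) / len(hits)


# Hypothetical first-year results for four openings, in $M.
actuals = [2.0, 2.5, 1.6, 3.0]
forecasts = [2.4, 2.5, 2.0, 2.7]
intervals = [(2.05, 2.75), (2.15, 2.85), (1.65, 2.35), (2.35, 3.05)]

forecast_mape = mape(actuals, forecasts)
forecast_bias = bias(actuals, forecasts)
interval_coverage = coverage(actuals, intervals)
```

In this example the error averages 13.75%, forecasts run about $0.125M high, and only half of the actual results land inside their intervals. That suggests the intervals are too narrow even though the point error looks tolerable.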
For new sites, forecast intervals are more honest than point estimates. A forecast of $2.4M ± $700K is less satisfying than $2.4M, but it is easier to defend.
New-store forecast accuracy should not be compared casually with existing-store demand forecasting. Existing-store forecasts often use operating history, SKU history, seasonality, and recent sales data. A new location has none of that. The honest output is a scenario range or prediction interval, ideally backed by named analogs.
Validation cadence
Validation timing should reflect the format.
| Timing | What to check |
|---|---|
| 30 days | operational opening issues, early traffic, data sanity |
| 90 days | ramp pattern, early transfer, channel mix |
| 180 days | emerging trade area, same-store impact |
| 12 months | first-year performance, forecast error |
| 18 months | more stable cannibalization and trade-area evidence for many retail and restaurant formats |
| 24-36 months | mature performance, cohort calibration |
QSR and convenience formats may reveal meaningful signals faster than fitness clubs, healthcare clinics, or membership-based businesses. Fitness needs enough time to observe join rate, retention, churn, and peak utilization. Healthcare needs enough time to observe patient panel growth, payer mix, referral patterns, and appointment utilization.
Post-opening validation turns site selection into a learning system.
What to do with validation results
Validation should change the model.
| Finding | Model response |
|---|---|
| High-scoring sites underperform | inspect weights, gates, and omitted variables |
| Low-confidence sites have large errors | tighten confidence rules |
| Forecasts are systematically high | recalibrate intercept or analog assumptions |
| Cannibalization underestimated | increase transfer overlay or change trade area method |
| One format behaves differently | create separate score profile |
| Urban sites fail rural-trained model | segment by market type |
A scoring model should become more accurate after each opening.
Public examples of scoring logic
Commercial site scorecards are usually proprietary, but several public and academic examples show how formal scoring appears in practice.
Academic AHP studies
AHP-based location studies often publish criteria weights. These are not plug-and-play templates for operators, but they show what explicit criteria weighting looks like.
| Context | Example criteria and weights |
|---|---|
| Gas station siting | One AHP study (Wu, Chen & Pan, ISPRS International Journal of Geo-Information, 2024) weighted population density at 0.633, gas station supply capacity at 0.261, and road network density at 0.106. |
| Hospital site selection | Şahin, Ocak & Top (Health Policy and Technology, 2019) rank demand, accessibility, competitors, government policy, related industry, and environmental conditions. |
| Pandemic hospital siting | Boyacı & Şişman (Environmental Science and Pollution Research, 2022) use Pythagorean fuzzy AHP to weight distance to transportation, population centers, land slope, land use, distance to other hospitals, and hazards. |
| Bank branch siting | Basar, Kabak & Topcu (Socio-Economic Planning Sciences, 2017) weight demographics, cost, competition, transportation, flexibility, and access to public facilities. |
| Shopping mall criteria | AHP studies of mall performance often find tenant satisfaction among the highest-weighted criteria, sometimes above 0.35. |
The lesson is not that a QSR operator should copy a hospital siting model. The lesson is that a defensible scoring model should make criteria, weights, and trade-offs visible.
Public-sector scoring rubrics
Economic development and public-sector site scoring often use a two-stage structure:
Threshold review:
pass / fail
Scored review:
weighted criteria and narrative ranking
That is a good model for commercial site selection. Hard requirements should be handled first. Weighted scoring should rank sites that remain eligible.
Franchise disclosure documents
Franchise Disclosure Document Item 11 often describes site-selection assistance. It may mention demographics, population density, income, geography, physical boundaries, competition, and other site factors. Most FDDs do not disclose the exact scoring matrix.
That tells us something important: even when public disclosure is high-level, site selection criteria are part of the franchise system's operating logic. A serious franchise operator should be able to explain how those criteria become a site approval recommendation.
Current industry state
Vendor methodology transparency varies widely.
GIS suitability tools
GIS suitability tools are often the most transparent because they show preprocessing, weighting, transformation, and combination methods. Esri's suitability analysis documentation describes criteria, influence settings, preprocessing methods, combination methods, score scaling, and weighting, including MinMax, Percentile, ZScore, Raw, Sum, Mean, Product, and Geometric Mean options. (Esri Documentation)
That is good governance. Users can see how variables enter the model.
Predictive analytics vendors
Predictive analytics vendors may use analog stores, regression, machine learning, or proprietary site scores. Some offer useful forecast ranges and comparable-store visibility. Others present a single score without exposing enough detail about features, weights, missing-data handling, validation, or calibration.
The question is not whether the score uses AI. The questions are:
What does the score measure?
What data trained or calibrated it?
How does it treat missing data?
How does it separate fit from forecast?
How does it account for cannibalization?
How has it performed after openings?
AI scores
"AI score" is a marketing term, not a methodology.
A serious AI-assisted site scoring system should still show:
- intended use
- model type
- data sources
- feature transformations
- training or calibration cohort
- validation metrics
- confidence level
- limitations
- local explanation for a given site
- human review process
The serious end of the market is moving toward explicit scoring rubrics, model documentation, visible assumptions, and explainable ML where appropriate. The marketing end is moving toward "AI" as a brand. Expansion leaders are getting better at telling the difference.
A Geod-style site scorecard
Geod's public methodology documents a default score built from four components: Reach, Demand, Competition, and Accessibility, with default weights of 30%, 30%, 25%, and 15%. The score is a transparent weighted linear model with documented sources, snapshot dates, and visible components.
There are two clean ways to present this kind of score.
Textbook weighted model
In a clean weighted-score display, every component is normalized so higher is better. Competition is therefore expressed as Competition Fit, meaning lower competitive pressure or better competitive context.
| Component | Normalized score | Weight | Contribution |
|---|---|---|---|
| Reach fit | 86 | 30% | 25.8 |
| Demand fit | 78 | 30% | 23.4 |
| Competition fit | 58 | 25% | 14.5 |
| Accessibility fit | 70 | 15% | 10.5 |
| Total | | 100% | 74.2 |
Formula:
Score =
0.30 × Reach fit
+ 0.30 × Demand fit
+ 0.25 × Competition fit
+ 0.15 × Accessibility fit
This is the easiest version to explain to a committee.
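The table and formula can be reproduced in a few lines, which is a useful check that a brief's contributions add up. This sketch uses the weights and normalized scores from the table above; the function name and the weight-sum check are illustrative.

```python
WEIGHTS = {"reach": 0.30, "demand": 0.30, "competition": 0.25, "accessibility": 0.15}


def contributions(fit_scores, weights=WEIGHTS):
    """Return each component's weighted contribution to the site score."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return {name: weights[name] * fit_scores[name] for name in weights}


# Normalized fit scores from the table (0-100, higher is better).
fit_scores = {"reach": 86, "demand": 78, "competition": 58, "accessibility": 70}

parts = contributions(fit_scores)
site_score = sum(parts.values())
```

The contributions come out to about 25.8, 23.4, 14.5, and 10.5, and the total rounds to 74.2, matching the table.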
Signed contribution display
Some products show signed contributions instead of normalized sub-scores.
Site score =
Reach contribution
+ Demand contribution
- Competition penalty
+ Accessibility contribution
Example:
| Component | Signed contribution |
|---|---|
| Reach | +28 |
| Demand | +31 |
| Competition pressure | -12 |
| Accessibility | +27 |
| Site score | 74 |
In this display, "Competition" is not a 0-100 component score. It is a penalty contribution that reduces the final score. That distinction should be explicit in the brief so a careful reader does not try to reconcile a negative contribution with a 0-100 normalized weighted-score formula.
Decision overlays
The score should then be paired with overlays.
| Overlay | Example output |
|---|---|
| Cannibalization | Moderate transfer risk from Store 14 |
| Saturation | Market cohort AUV declining; conditional approval |
| Feasibility | Conditional: rent and ingress need confirmation |
| Confidence | Medium-high: strong ACS/POI data, limited first-party origins |
| Validation plan | Compare ramp, affected-store sales, and customer origins at 90 and 180 days |
This preserves the simplicity of the score while giving the committee the context it needs.
Evaluate vs Strategize
A useful product architecture separates single-site evaluation from portfolio strategy.
| Mode | What it should do |
|---|---|
| Evaluate | Apply a defensible score to candidate sites using consistent criteria, weights, thresholds, and source-dated data. |
| Strategize | Apply the same strategy across a portfolio with model versioning, custom weights, batch scoring, cannibalization, saturation, feasibility, confidence, and decision history. |
Geod's Evaluate tool scores specific sites and exports explainable reports. Strategize gives expansion teams a way to apply the same strategy systematically across a 30-500 location portfolio.
The Evaluate tier teaches a team that scoring can be repeatable and defensible. The Strategize tier turns that scoring into an institutional system. That matters for operators whose strategy is already in someone's head, in a consultant deck, and in three spreadsheets, but not yet versioned, consistently applied, or easy to defend.
The defensible claim is a grounded one:
Your expansion strategy becomes a repeatable scoring system,
applied consistently to every candidate you evaluate,
with portfolio-aware overlays and a decision record.
What a defensible site brief should show
A defensible site brief should contain the score and the argument behind it.
Recommended structure:
Site Score Summary
1. Candidate address and prototype
2. Core score and component breakdown
3. Gates passed / failed
4. Criteria weights and model version
5. Data sources and snapshot dates
6. Trade area method and time window
7. Demand assumptions
8. Competition assumptions
9. Accessibility assumptions
10. Forecast range, if available
11. Cannibalization and saturation overlays
12. Feasibility gate
13. Confidence level
14. Sensitivity notes
15. Recommendation
16. Post-opening validation plan
Example brief language
The candidate scores 74 under the QSR Suburban Scorecard v2.1. Reach and Demand are strong because the 10-minute drive-time catchment contains above-threshold target population and income for the concept. Competition reduces the score because the catchment includes a high density of direct fast-casual competitors. Accessibility is favorable, but the feasibility gate remains conditional pending ingress review and rent confirmation. The site should advance to constrained review, with cannibalization against Store 14 measured before final approval.
That paragraph is more useful than:
Site score: 74
Common mistakes
Mistake 1: Treating the score as the recommendation
The score is evidence. The recommendation is a decision.
Mistake 2: Mixing forecast and fit
A site can be strategically strong and forecast modestly. Another can forecast large and fail strategy. Keep these separate.
Mistake 3: Averaging away deal-breakers
Hard requirements belong in gates.
Mistake 4: Scoring raw variables directly
Normalize before weighting. Raw numeric ranges can dominate.
Mistake 5: Hiding weights
Weights are strategy. If they are hidden, the strategy is hidden.
Mistake 6: Double-counting correlated variables
Population, households, density, and daytime population may overlap. Competition count, saturation, and cannibalization can also overlap. Audit correlations.
Mistake 7: Changing weights after seeing the answer
Set weights before scoring candidates. Otherwise the model becomes a negotiation tool.
Mistake 8: Ignoring uncertainty
A high score with low confidence should trigger research, not approval.
Mistake 9: Treating competitors as all negative
Some competitors signal demand or create destination effects. Competition should be category-specific.
Mistake 10: Skipping validation
A model that is never compared with actual openings becomes a permanent assumption.
Mistake 11: Treating AI scoring as a methodology
AI scoring is not itself a methodology. The methodology has to be visible: what the model measures, what data it uses, how it was validated, and what it should be used for.
Mistake 12: Using one model for every format
A suburban drive-thru, an urban pickup store, a full-service clinic, and a boutique fitness studio should not share the same thresholds and weights.
Related reading
- Trade area analysis for site selection
- Cannibalization analysis in site selection
- Market saturation analysis in site selection
- White space analysis in site selection
- Explainable site selection
FAQ
What is a site selection scoring model?
A site selection scoring model is a structured method for ranking candidate locations using criteria, weights, transformations, gates, and overlays. It helps expansion teams compare sites consistently and explain why a candidate should advance, be rejected, or require more research.
What is a location score?
A location score is a numeric summary of how well a candidate site fits an operator's site selection criteria. A defensible score should be broken into visible components so decision-makers can see why the site scored well or poorly.
What is the difference between a site score and a sales forecast?
A site score measures strategic fit. A sales forecast estimates expected performance. A site can score well and forecast small, or forecast large and score poorly. The final recommendation should use both.
What is the best site selection scoring method?
For most multi-unit operators, the best practical method is hard gates plus a transparent weighted score, forecast range, network overlays, sensitivity analysis, and post-opening validation. More complex MCDA methods like AHP, TOPSIS, ELECTRE, and PROMETHEE can help in specific governance or short-list situations.
What criteria should be included in a site selection scorecard?
Common criteria include trade area population, income, daytime population, category spend, competition, traffic, accessibility, co-tenancy, site feasibility, parking, rent, customer origins, cannibalization, and saturation. The exact criteria should vary by category and prototype.
How should site selection criteria be weighted?
Weights should reflect strategy and should be set before candidates are scored. Operators can use expert judgment, AHP, historical calibration, or a hybrid approach. Weight choices should be documented and tested for sensitivity.
What is AHP in site selection?
AHP, or Analytic Hierarchy Process, is a multi-criteria decision method that derives weights from pairwise comparisons. It can help teams make trade-offs explicit and check consistency.
What is TOPSIS in site selection?
TOPSIS ranks candidate sites by comparing each one to an ideal site and a worst-case site. It is useful for short-list comparisons but can be sensitive to normalization.
What is rank reversal?
Rank reversal occurs when adding or removing a candidate changes the ranking of existing candidates. It is a known failure mode in several MCDA methods and should be tested when candidate lists change during a scoring process.
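A small illustration, assuming an equal-weight sum of min-max normalized criteria (a simple stand-in, not AHP or TOPSIS themselves) and made-up site values:

```python
def min_max(values):
    """Rescale values to 0-1 using the range of the current candidate list."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.5 for v in values]


def ranking(sites):
    """Rank sites by an equal-weight sum of two min-max normalized criteria."""
    names = list(sites)
    c1 = min_max([sites[n][0] for n in names])
    c2 = min_max([sites[n][1] for n in names])
    totals = {n: 0.5 * a + 0.5 * b for n, a, b in zip(names, c1, c2)}
    return sorted(names, key=totals.get, reverse=True)


# Made-up (criterion_1, criterion_2) values; higher is better on both.
candidates = {"A": (10, 4), "B": (6, 8), "C": (2, 2)}
before = ranking(candidates)

# Site D is extreme on criterion 2, which stretches that criterion's range.
after = ranking({**candidates, "D": (2, 20)})
```

Before D is added, B ranks ahead of A. After D is added, A ranks ahead of B, even though neither site's data changed. The reversal comes from normalizing against the candidate list, so fixing the normalization range to a stable reference set is one way to reduce it.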
Should cannibalization be part of the score?
Cannibalization should usually be shown as a network overlay rather than hidden inside the core score. The core score measures site fit. The overlay shows whether the site creates net-new demand or transfers demand from existing units.
How do you validate a site selection scoring model?
Compare scores against post-opening outcomes. Track performance by score decile, forecast error, same-store impact, cannibalization, confidence level, and outcome by market type. Recalibrate when results drift.
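As a rough sketch of one such check, the snippet below groups opened sites into score bands and compares actual with forecast sales in each band. The bands, field order, and figures are all hypothetical; a real validation would use deciles and more openings than five.

```python
from collections import defaultdict


def score_band(score):
    """Hypothetical bands; real cut-offs depend on the operator's score scale."""
    if score >= 80:
        return "80+"
    if score >= 60:
        return "60-79"
    return "<60"


def actual_vs_forecast_by_band(openings):
    """Mean ratio of actual to forecast sales for each score band."""
    ratios = defaultdict(list)
    for score, forecast, actual in openings:
        ratios[score_band(score)].append(actual / forecast)
    return {band: sum(r) / len(r) for band, r in ratios.items()}


# Made-up (score, forecast_sales, actual_sales) records for opened sites.
openings = [
    (85, 1000, 1100), (82, 1200, 1140),
    (70, 900, 810), (65, 800, 720),
    (55, 700, 490),
]
summary = actual_vs_forecast_by_band(openings)
```

If higher score bands do not show better actual-to-forecast ratios over time, that is a signal to revisit the weights or the forecast.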
When does a site scoring spreadsheet stop working?
A spreadsheet becomes risky when multiple people change weights, versions diverge, data gets stale, portfolio overlap matters, or the committee needs repeatable briefs and audit history.
Glossary
Site selection scoring model
A structured method for ranking candidate locations using criteria, weights, transformations, and rules.
Location score
A numeric summary of a candidate site's fit against the operator's criteria.
Site scorecard
A table or brief showing the score, components, weights, gates, overlays, and recommendation.
Weighted sum model
A scoring method that multiplies each normalized criterion score by a weight and sums the results.
Weighted geometric mean
A scoring method that multiplies normalized criterion scores, each raised to the power of its weight. This reduces the ability of one very strong criterion to compensate for a very weak one.
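The difference between the two methods can be sketched as follows, assuming illustrative weights that sum to 1 and made-up component scores on a 0-100 scale:

```python
from math import prod

WEIGHTS = [0.5, 0.3, 0.2]  # illustrative weights that sum to 1


def weighted_sum(scores, weights=WEIGHTS):
    """Sum of weight * score across criteria."""
    return sum(w * s for w, s in zip(weights, scores))


def weighted_geometric_mean(scores, weights=WEIGHTS):
    """Product of score ** weight; any criterion scored 0 drives the result to 0."""
    return prod(s ** w for w, s in zip(weights, scores))


lopsided = [95, 90, 5]   # two strong criteria, one very weak
balanced = [70, 70, 70]

ws_lopsided, ws_balanced = weighted_sum(lopsided), weighted_sum(balanced)
gm_lopsided = weighted_geometric_mean(lopsided)
gm_balanced = weighted_geometric_mean(balanced)
```

Under the weighted sum the lopsided site beats the balanced one. Under the weighted geometric mean the order flips, because the very weak criterion pulls the product down.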
MCDA / MCDM
Multi-criteria decision analysis or multi-criteria decision-making, a family of methods for evaluating alternatives across many criteria.
AHP
Analytic Hierarchy Process, a method that uses pairwise comparisons to derive weights and check consistency.
TOPSIS
Technique for Order Preference by Similarity to Ideal Solution, a method that ranks options by how close each one is to an ideal alternative and how far it is from a worst-case alternative.
Gate
A hard requirement that a site must pass before it is scored or advanced.
Threshold
A minimum, maximum, or target value used to classify or filter a criterion.
Penalty
A visible score reduction applied when a risk is present but not fatal.
Normalization
The process of converting raw variables into comparable scales.
Percentile score
A score based on how a candidate ranks relative to a comparison set.
Z-score
A normalized value showing how many standard deviations a value is above or below the mean.
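Both transformations can be sketched in a few lines. The percentile convention used here (share of the comparison set at or below the value) is one of several in use, the z-score uses the population standard deviation, and the data are made up.

```python
from statistics import mean, pstdev


def z_score(value, comparison_set):
    """Standard deviations above or below the comparison-set mean."""
    sd = pstdev(comparison_set)
    return 0.0 if sd == 0 else (value - mean(comparison_set)) / sd


def percentile_score(value, comparison_set):
    """Percent of the comparison set at or below the value (one common convention)."""
    at_or_below = sum(1 for v in comparison_set if v <= value)
    return 100.0 * at_or_below / len(comparison_set)


# Made-up trade area populations (thousands) for a comparison set of five sites.
populations = [10, 20, 30, 40, 50]
z = z_score(50, populations)
p = percentile_score(40, populations)
```

Percentiles are bounded and resistant to outliers, while z-scores preserve how far a site sits from the mean. Which one fits depends on how skewed the underlying variable is.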
Forecast
An estimate of expected performance, such as sales, visits, members, patients, deposits, or orders.
Overlay
Additional context layered on top of the score, such as cannibalization, saturation, feasibility, confidence, or validation status.
Confidence level
A high, medium, or low assessment of data quality, model fit, missing inputs, and evidence strength.
Model card
A structured document describing a model's intended use, data, performance, limitations, and governance.
Sensitivity analysis
Testing how changes to weights, assumptions, or data inputs affect the ranking or recommendation.
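A minimal sketch of a weight sensitivity test, assuming a weighted sum model, a shift of 0.10 applied to one weight at a time with renormalization, and made-up site scores. The function names are hypothetical.

```python
def rank(sites, weights):
    """Order site names by weighted sum, best first."""
    totals = {n: sum(weights[c] * s[c] for c in weights) for n, s in sites.items()}
    return sorted(totals, key=totals.get, reverse=True)


def shift_weight(weights, criterion, delta):
    """Move one weight by delta, then renormalize so the weights sum to 1."""
    shifted = dict(weights)
    shifted[criterion] = max(0.0, shifted[criterion] + delta)
    total = sum(shifted.values())
    return {c: v / total for c, v in shifted.items()}


def sensitivity(sites, weights, delta=0.10):
    """Return the baseline top site and every single-weight shift that changes it."""
    base = rank(sites, weights)[0]
    flips = []
    for criterion in weights:
        for d in (-delta, delta):
            top = rank(sites, shift_weight(weights, criterion, d))[0]
            if top != base:
                flips.append((criterion, d, top))
    return base, flips


# Made-up normalized component scores (0-100) for two short-listed sites.
sites = {
    "A": {"demand": 80, "access": 60, "cost": 50},
    "B": {"demand": 65, "access": 85, "cost": 55},
}
weights = {"demand": 0.5, "access": 0.3, "cost": 0.2}
base, flips = sensitivity(sites, weights)
```

In this example site B leads at the baseline weights, but raising the demand weight or lowering the access weight by 0.10 puts site A on top. A ranking that flips under small shifts like these deserves a note in the site brief.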
Rank reversal
A multi-criteria decision failure mode where adding or removing alternatives changes the ranking of existing alternatives.
Post-opening validation
Comparing model predictions and scores against actual results after a location opens.
Conclusion
A defensible site score is a structured argument.
The score shows how the site performs against strategy. The forecast estimates performance. The overlays expose network and execution risk. The recommendation explains what to do next.
Before a candidate advances, the site brief should answer six questions:
- Which gates did the site pass or fail?
- How did the site score by component?
- What assumptions and data sources produced the score?
- What forecast range applies, if any?
- What overlays change the decision?
- What evidence would prove the model right or wrong after opening?
A location score survives committee when the team can take it apart and put it back together.
That is the standard for modern site selection.
See how Geod turns scoring into a defensible site brief
Geod helps expansion teams define scoring criteria, set weights and thresholds, evaluate candidate sites, and export committee-ready briefs with maps, methodology, component breakdowns, source dates, cannibalization, saturation, feasibility, confidence, and validation context.
Evaluate gives teams a defensible way to score candidate sites. Strategize turns that scoring into a repeatable system across a portfolio, with custom weights, versioned models, batch evaluation, portfolio-aware overlays, and decision records.
The goal is not just to rank sites. It is to make every recommendation explainable enough to defend.