
Site Selection Scoring Models: How to Build a Defensible Location Score

A complete guide to weighted scoring, AHP, TOPSIS, thresholds, gates, normalization, forecasts, overlays, confidence, and validation.

📖 28 min read · Last updated May 2026

TL;DR

  • A site score measures strategic fit. It shows whether a candidate matches the operator's reach, demand, competition, access, format, and growth strategy.
  • A score is different from a forecast. A score ranks fit. A forecast estimates sales, visits, orders, members, patients, or deposits.
  • The strongest architecture is gates + weighted score + forecast + overlays + recommendation. Gates remove blocked sites, scores rank eligible sites, forecasts estimate performance, and overlays add portfolio context.
  • Site scoring has a nearly century-long lineage. Reilly, Converse, Christaller, Applebaum, Huff, GIS suitability analysis, MCDA, and modern ML all address the same core decision problem.
  • Weighted scores are useful, but fragile. Rank reversal, normalization instability, compensatory aggregation, double-counting, and weight gaming can change the recommendation.
  • Hard requirements should be gates. Zoning, drive-thru feasibility, payer mix, rent ceiling, minimum trade area, and operational blockers should not be averaged away.

A site score should make a decision easier to challenge.

That sounds backwards until a candidate reaches the real estate committee.

A black-box score can look impressive in a pipeline review. It can rank the right sites, produce a confident number, and make the deck feel more analytical. But the moment someone asks why the site scored 82, why a failed store would have scored well, why a high-traffic corridor lost points, why cannibalization was ignored, or why the model changed since last quarter, the score has to become more than a number.

It has to become a decision record.

A defensible site selection scoring model shows how a candidate location fits the operator's strategy. It exposes the criteria, weights, gates, thresholds, transformations, assumptions, data sources, missing fields, model version, confidence level, forecast range, and network impact.

Most importantly, it separates four artifacts that often get collapsed into one vague "AI score."

Artifact | Question it answers
Score | Does this site fit our strategy?
Forecast | How large could this location be?
Overlay | What network or execution context changes the decision?
Recommendation | What should we do next?

The score organizes the evidence. The forecast estimates performance. The overlays explain risk. The recommendation turns the analysis into a decision.


What is a site selection scoring model?

A site selection scoring model is a structured method for comparing candidate locations against a defined set of criteria.

For a retailer, the model might evaluate trade area population, income, category spend, co-tenancy, competition, access, foot traffic, visibility, and cannibalization. For a restaurant, it may add daypart demand, drive-thru feasibility, pickup flow, delivery radius, commute direction, and kitchen capacity. For healthcare, it may include payer mix, provider capacity, patient access, service-line demand, referral leakage, and regulatory constraints.

A scoring model helps answer:

  • Which sites should advance to deeper review?
  • Which candidates best match the expansion strategy?
  • Which variables drive the recommendation?
  • Which sites fail hard requirements?
  • Which high-scoring sites need more research?
  • Which candidates create network risk through cannibalization or saturation?
  • Which recommendations are high-confidence, and which are fragile?

A good scorecard gives every stakeholder a way to interrogate the recommendation.

The real estate team can ask about access and trade area assumptions. Finance can ask about forecast range and rent-to-sales risk. Operations can ask about feasibility. Franchise teams can ask about encroachment. Executives can ask whether the site advances the strategy or simply looks good on a map.

The score should organize that conversation, not end it.

Score vs forecast vs overlay vs recommendation

The most common mistake in site selection scoring is turning four different artifacts into one number.

Artifact | Purpose | Example output
Score | Measures strategic fit | 74 / 100 with component breakdown
Forecast | Estimates expected performance | P10 / P50 / P90 sales, visits, orders, members, patients
Overlay | Adds network and execution context | Cannibalization, saturation, feasibility, confidence
Recommendation | Converts evidence into action | Advance, reject, research, revise, hold

A site can score well and forecast small. It may be strategically perfect but located in a limited trade area. A site can forecast large and score poorly. It may have high traffic and demand, but the wrong customer profile, bad access, weak economics, or heavy cannibalization.

A clean decision package looks like this:

Core score:
Reach + Demand + Competition + Accessibility

Forecast:
Expected sales / visits / members / patients, with range

Overlays:
Cannibalization
Saturation
Feasibility
Confidence
Validation requirements

Recommendation:
Advance / reject / research / revise

This distinction is especially important for AI-assisted scoring. A single "AI score" can hide whether the model is ranking strategic fit, expected sales, market potential, analog similarity, network impact, or all of those at once. The more concepts a score absorbs, the harder it becomes to defend.

The history of location scoring

Site selection scoring did not begin with AI. The modern scorecard sits on nearly a century of retail geography, spatial interaction modeling, analog reasoning, GIS suitability analysis, and multi-criteria decision analysis.

Reilly's Law of Retail Gravitation (1931)

William J. Reilly's 1931 The Law of Retail Gravitation applied a gravity analogy to retail trade areas. The idea was simple: larger retail centers attract customers from farther away, while distance weakens that attraction.

The classic attraction-balance relationship is:

P_A / d_A² = P_B / d_B²

Rearranged:

d_A / d_B = √(P_A / P_B)

Where:

Term | Meaning
P_A, P_B | population or size proxy for retail centers A and B
d_A | distance from the breakpoint to center A
d_B | distance from the breakpoint to center B

If A is larger than B, the breakpoint is farther from A and closer to B. That means A's trade area extends farther toward B. This is the point that often gets muddled when the formula is presented without clear variable definitions.

Reilly's model solved an early problem: how to draw deterministic trade area boundaries between competing cities or retail centers. Its limits are obvious today. It assumes simple geography, ignores road networks, treats consumers as homogeneous, and draws hard boundaries where real trade areas overlap. But the core insight still matters: demand, attraction, and distance can be modeled. (Wikipedia)

Converse and the breaking-point formula (1949)

Paul Converse later rearranged Reilly's logic into a more practical breaking-point formula:

d_BP (from B) = D_AB / (1 + √(P_A / P_B))

Where:

Term | Meaning
D_AB | distance between centers A and B
P_A, P_B | population or size proxy for A and B
d_BP (from B) | breakpoint distance from center B

Example:

City A population = 120,000
City B population = 30,000
Distance between cities = 60 miles

d_BP (from B) = 60 / (1 + √(120,000 / 30,000))
d_BP (from B) = 60 / (1 + 2)
d_BP (from B) = 20 miles

The model says the smaller city's trade area extends about 20 miles toward the larger city. The breakpoint is 40 miles from the larger city. That is the expected result: the larger center pulls from farther away. (Wikipedia)
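The worked example can be checked in a few lines. This is a minimal Python sketch, not a production tool. The function name converse_breakpoint_from_b and its argument names are illustrative assumptions, and population stands in for the size proxy exactly as in the example.

```python
import math


def converse_breakpoint_from_b(distance_ab: float, pop_a: float, pop_b: float) -> float:
    """Breakpoint distance measured from center B (Converse form)."""
    if distance_ab <= 0 or pop_a <= 0 or pop_b <= 0:
        raise ValueError("distance and populations must be positive")
    return distance_ab / (1 + math.sqrt(pop_a / pop_b))


from_b = converse_breakpoint_from_b(60, 120_000, 30_000)
from_a = 60 - from_b
print(from_b, from_a)  # 20.0 40.0

# Reilly's balance condition holds at the breakpoint: P_A / d_A^2 == P_B / d_B^2
print(120_000 / from_a ** 2, 30_000 / from_b ** 2)  # 75.0 75.0
```

The second print is a sanity check: at the breakpoint, both centers exert the same pull under Reilly's relationship.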

For modern operators, the value is historical and conceptual. Converse made gravity theory usable for trade-area boundary drawing. Modern scoring models have moved beyond deterministic breakpoints, but they still ask where a site's reach begins to fade.

Christaller, threshold, and range (1933)

Walter Christaller's central place theory added two ideas that still sit underneath site scoring: threshold and range. In central place theory, threshold is the minimum market needed to support a good or service, and range is the maximum distance customers are willing to travel for it. (Wikipedia)

Concept | Meaning in site selection
Threshold | minimum demand needed to support a location
Range | maximum distance or travel time customers will tolerate

Every supportable-unit model, minimum-demand gate, drive-time screen, and trade area threshold inherits this logic.

A market may contain population, but the location only works if enough demand exists inside the range customers will actually travel. A clinic may have a strong need profile, but if patients cannot reach it within the service standard, the demand is not operationally useful. A delivery unit may have addressable households, but if it cannot serve them inside the delivery promise, the theoretical market is larger than the practical market.

Christaller's threshold/range pair is still inside the questions operators ask every day:

Does the catchment contain enough target customers?
Will those customers travel far enough to use the site?

Applebaum and the analog method (1966)

William Applebaum's 1966 work in the Journal of Marketing Research on store trade areas, market penetration, and potential sales formalized one of the most important operator-side methods in site selection: compare a candidate location with existing stores that have similar trade area characteristics.

The analog method asks:

Which existing stores look most like this candidate?
How did those stores perform?
What capture rate did they achieve?
What changed when similar stores opened nearby?

Applebaum's method dominated much of U.S. retail site selection from the 1960s through the 1990s because it matched how operators reason. If a proposed site resembles three existing stores, and those stores all reached similar average unit volumes (AUVs) or patient volumes, the comparison gives the committee a grounded forecast.

Analog methods still matter because they are explainable. A real estate team can inspect the stores behind the estimate.

Their weakness is data dependency. Analogs break when:

  • the brand has too few stores
  • the prototype changes
  • the market type is new
  • customer behavior shifts
  • the candidate is out of distribution
  • first-party customer-origin data is missing

Analog scoring is powerful when the portfolio is mature and comparable. It becomes fragile when the next site is unlike the past.

Huff and probabilistic trade areas (1963, 1964)

David Huff's 1963 and 1964 work replaced deterministic boundaries with probability. A customer does not simply "belong" to one trade area. Instead, the model estimates the probability that a customer at origin i chooses store j.

A common Huff formulation:

P_ij = (A_j^α × D_ij^-β) / Σ_k (A_k^α × D_ik^-β)

Where:

Term | Meaning
P_ij | probability that demand from origin i chooses store j
A_j | attractiveness of store j
D_ij | distance, drive time, or travel cost from i to j
α | attractiveness sensitivity
β | distance-decay sensitivity
k | all competing locations in the choice set

The Huff model changed site selection because it allowed overlapping trade areas. A customer could have some probability of choosing Store A, Store B, a competitor, or no purchase. Esri describes the Huff model as a spatial interaction model where probability depends on distance, site attractiveness, and the distance and attractiveness of competing sites. Esri also notes that calibration is needed because default exponent values may not apply to the specific trade area being modeled. (ArcGIS Pro)

That matters for scoring because a site's value depends on choice probabilities, not just the number of people inside a polygon. A candidate can sit in a dense market and still be weak if competitors are more attractive or easier to reach. A site can sit farther away and still capture demand if its access and format are superior.
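A small sketch makes the trade-off between attractiveness and distance concrete. The function below is an illustrative Python version of the Huff formula for a single origin. The default exponents (alpha = 1, beta = 2) and the store values are assumptions for demonstration only; as noted above, exponents need calibration for the trade area being modeled.

```python
def huff_probabilities(attractiveness, travel_cost, alpha=1.0, beta=2.0):
    """Return P_ij for one origin i across every option j in the choice set.

    alpha and beta are placeholder exponents (an assumption of this sketch).
    """
    if any(d <= 0 for d in travel_cost):
        raise ValueError("travel cost must be positive")
    utilities = [a ** alpha * d ** -beta for a, d in zip(attractiveness, travel_cost)]
    total = sum(utilities)
    return [u / total for u in utilities]


# Hypothetical origin: a large store 10 minutes away, a small store 5 minutes
# away, and a mid-size competitor 8 minutes away.
shares = huff_probabilities(attractiveness=[40, 10, 25], travel_cost=[10, 5, 8])
for name, p in zip(["large, far", "small, near", "competitor"], shares):
    print(f"{name}: {p:.2f}")
```

In this made-up case the small nearby store draws about the same share as the store four times its size, because squaring travel time offsets the size advantage.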

Lakshmanan and Hansen: market potential (1965)

Lakshmanan and Hansen's 1965 retail market potential work extended the spatial interaction tradition into shopping center sales and market aggregation. The practical move was to connect origin demand, travel friction, and destination attractiveness to potential sales.

That lineage shows up in modern location scoring whenever a model asks:

How much demand is available?
How likely is this site to capture it?
How much is already allocated to competitors or existing stores?

Nakanishi and Cooper: multiplicative competitive interaction (1974)

Masao Nakanishi and Lee Cooper's 1974 multiplicative competitive interaction model extended Huff by replacing a single attractiveness variable, such as store size, with a bundle of attractiveness attributes.

A simplified multiplicative attractiveness structure:

A_j = X_1j^β_1 × X_2j^β_2 × X_3j^β_3 × ... × X_nj^β_n

Where each X is a site or store attribute and each β is an elasticity.

Those attributes might include:

  • store size
  • parking
  • price
  • assortment
  • frontage
  • co-tenancy
  • brand strength
  • reviews
  • operating hours
  • delivery coverage
  • format
  • local awareness

This is the bridge from gravity models to modern multivariate site scoring. A store's attractiveness is not one variable. It is a weighted bundle of factors.

Modern scoring models still use this logic, even when they do not call it MCI. They combine multiple attributes into a site fit score or sales forecast.

Regression-based new-store forecasting

As chains built larger portfolios, regression and statistical forecasting entered site selection. Existing locations became training data. Candidate-site features became predictors.

The question shifted from:

Which store does this site resemble?

to:

Given these trade area, access, competition, and format attributes,
what range of outcomes should we expect?

Regression, generalized linear models, and later machine learning helped operators move beyond pure analog reasoning. But these methods introduce their own risks: overfitting, sample-size limits, stationarity assumptions, and weak performance on new formats or new geographies.

A chain with 80 stores and 30 variables does not have "big data." It has a small modeling problem that needs discipline.

GIS and suitability analysis (1980s-1990s)

GIS made location scoring repeatable. Drive-time polygons, demographic overlays, competitor layers, traffic counts, POI data, and suitability maps allowed in-house teams to run hundreds of analyses instead of commissioning one-off studies.

Esri's suitability analysis workflow is a good public example of the modern GIS approach. It identifies sites that meet user-defined criteria, preprocesses variables onto comparable scales, applies weights, combines them, and scales final scores. It also supports positive, inverse, ideal, and target-site influence types. (Esri Documentation)

That is the site scoring pattern most teams recognize:

Define criteria.
Transform variables.
Weight criteria.
Combine scores.
Rank candidates.

MCDA and decision science

Multi-criteria decision analysis, or MCDA, gave scoring models a formal decision framework. Weighted sums, AHP, TOPSIS, ELECTRE, PROMETHEE, sensitivity analysis, and rank-stability checks all address the same issue: how to make a decision when multiple criteria matter and no single metric is enough.

This matters because site selection is inherently multi-criteria. A candidate can be strong on demand, weak on access, moderate on competition, and uncertain on feasibility. The model has to combine those signals without hiding deal-breakers.

Mobility, machine learning, and AI scores (2010s-2020s)

The 2010s and 2020s added foot-traffic panels, mobile-device data, gradient boosting, store embeddings, and AI-assisted workflows. These tools can improve scoring when used carefully, especially for observed trade areas, visit behavior, analog selection, and feature attribution.

But AI scores are the newest aggregation layer, not a replacement for older questions.

Every scoring model still has to answer:

  • What criteria matter?
  • How are variables transformed?
  • What is being forecast?
  • What is being scored?
  • How is cannibalization handled?
  • What data is missing?
  • How is the model validated?
  • Can the committee challenge the recommendation?

The history matters because it keeps the modern score honest. Site selection has always been a structured argument about demand, access, competition, and choice. AI does not remove that structure. It raises the cost of hiding it.

The architecture of a defensible location score

A defensible location score has five layers.

Layer | Purpose | Example
Gates | Remove ineligible or blocked sites | zoning failure, drive-thru impossible, rent ceiling exceeded
Core score | Rank eligible sites against strategy | Reach + Demand + Competition + Accessibility
Forecast | Estimate expected performance | P10 / P50 / P90 sales, visits, patients, members
Overlays | Add portfolio and execution context | cannibalization, saturation, feasibility, confidence
Recommendation | Convert evidence into action | advance, reject, research, revise

This architecture prevents one number from doing too much work.

Gates

Gates answer whether a site is eligible. Failed gates should not be averaged away.

Examples:

  • zoning does not allow the use
  • drive-thru is required and impossible
  • parking is below prototype requirement
  • rent-to-sales economics fail threshold
  • trade area is below minimum demand
  • payer mix is unacceptable
  • franchise territory rights block the site

A gate-failed site should be labeled:

Status: Not eligible under current criteria
Reason: Drive-thru feasibility failed

It should not receive a misleading score of 58.
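One way to enforce that rule is to run gates before any score is attached to a site. The sketch below is a hypothetical Python structure, not a prescribed implementation. The Gate class, the field names on the site record, and the three gates shown are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Gate:
    name: str
    passes: Callable[[dict], bool]


# Hypothetical gates and field names; a real list comes from the operator's criteria.
GATES = [
    Gate("Zoning allows use", lambda s: s["zoning_allows_use"]),
    Gate("Drive-thru feasibility", lambda s: s["drive_thru_feasible"]),
    Gate("Rent below ceiling", lambda s: s["annual_rent"] <= s["rent_ceiling"]),
]


def apply_gates(site: dict, core_score: Optional[float]) -> dict:
    """Return a status record; a gate-failed site carries no numeric score."""
    failed = [g.name for g in GATES if not g.passes(site)]
    if failed:
        return {"status": "Not eligible under current criteria",
                "reasons": failed, "score": None}
    return {"status": "Eligible", "reasons": [], "score": core_score}


site = {"zoning_allows_use": True, "drive_thru_feasible": False,
        "annual_rent": 180_000, "rent_ceiling": 200_000}
result = apply_gates(site, core_score=58.0)
print(result["status"])   # Not eligible under current criteria
print(result["reasons"])  # ['Drive-thru feasibility']
print(result["score"])    # None
```

Returning None instead of 58 keeps the failed site out of any ranked list that sorts on score.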

Core score

The core score ranks eligible candidates. It should be simple enough to explain and stable enough to compare.

A common structure:

Site score =
Reach contribution
+ Demand contribution
+ Competition contribution or penalty
+ Accessibility contribution

Forecast

A forecast estimates expected performance. It may use analog stores, regression, machine learning, demand allocation, or scenario modeling.

Forecasts should be presented as ranges:

P10: $1.8M
P50: $2.4M
P90: $3.1M

New-store forecasts deserve wide intervals because the site has no operating history.

Overlays

Overlays capture context that belongs next to the score.

Overlay | Why it matters
Cannibalization | A strong site may transfer demand from existing units.
Saturation | A market may have demand but weak marginal returns.
Feasibility | A good market can fail on real estate, access, labor, buildout, or operations.
Confidence | A high score from weak or missing data should not be treated like a high score from strong evidence.
Validation | A model improves only if actual openings are measured.

Domino's fortressing language is a good example of why overlays belong outside the core score. Domino's describes adding stores in existing markets to condense delivery areas and get closer to carryout customers, while also warning that fortressing may negatively affect existing-store sales and can lead to closures if executed too rapidly. (SEC)

Sweetgreen's delivery-radius language makes the same point in channel-specific form. Sweetgreen says new restaurants in or near existing markets can affect existing restaurant sales, and it specifically warns that cannibalization may become significant when an existing restaurant's delivery radius overlaps with a new restaurant's delivery radius. (SEC)

Those are network effects. They should shape the recommendation, but they should not be hidden inside the core site fit score.

Recommendation

The recommendation is the final decision statement.

Possible outputs:

Advance to LOI.
Reject.
Advance only under constrained lease economics.
Research customer-origin data before approval.
Revise prototype.
Hold until better site supply.

The recommendation should name the key reason, not just repeat the score.

The core formulas

Weighted sum

The simplest defensible score is a weighted sum:

Score_j = Σ_i (w_i × x_ij)

Where:

Term | Meaning
Score_j | score for site j
w_i | weight for criterion i
x_ij | normalized score of site j on criterion i

Example:

Component | Normalized score | Weight | Contribution
Reach fit | 86 | 30% | 25.8
Demand fit | 78 | 30% | 23.4
Competition fit | 58 | 25% | 14.5
Accessibility fit | 70 | 15% | 10.5
Total | | | 74.2

This model is easy to explain and easy to misuse. It assumes criteria can compensate for each other. Strong demand can offset weak access. Strong reach can offset competition. Sometimes that is appropriate. Sometimes it hides a fatal flaw.
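The table's arithmetic can be reproduced directly. This is a minimal Python sketch using the scores and weights from the example. The dictionary layout is an assumption, and the weight check is a small guard worth keeping in any real scorecard.

```python
import math

# (normalized score on a 0-100 scale, weight) taken from the example table
components = {
    "Reach fit": (86, 0.30),
    "Demand fit": (78, 0.30),
    "Competition fit": (58, 0.25),
    "Accessibility fit": (70, 0.15),
}

# Weights should sum to 1 before anything is scored.
assert math.isclose(sum(w for _, w in components.values()), 1.0)

contributions = {name: score * weight for name, (score, weight) in components.items()}
total = sum(contributions.values())

for name, value in contributions.items():
    print(f"{name}: {value:.1f}")
print(f"Total: {total:.1f}")  # Total: 74.2
```

Keeping the per-component contributions alongside the total is what lets a committee ask why the site scored 74 and get an answer.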

Weighted geometric mean

When balance matters, a geometric mean can penalize unbalanced sites.

Score_j = Π_i (x_ij^w_i)

The geometric mean reduces the ability of one very strong criterion to fully compensate for a very weak one.

Esri's suitability analysis supports Product and Geometric Mean combination methods. Esri describes geometric mean as useful when criteria are on different scales and when high final scores should require high values in multiple criteria, because an extreme value in one criterion will not disproportionately determine the result. (Esri Documentation)

The Human Development Index also moved to a geometric mean in 2010, which is often used as a teaching example for reducing substitutability across dimensions. The same idea applies to site selection: if demand, access, and feasibility all matter, a site that is excellent on demand but terrible on feasibility should be penalized more heavily than a simple arithmetic average would suggest. (Wikipedia)

Hard requirements should still be gates. The geometric mean reduces compensation, but it does not replace pass/fail eligibility.
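A side-by-side comparison shows the effect. The sketch below is illustrative Python with made-up scores and equal weights (both assumptions). It compares a balanced site with an unbalanced one that has the same arithmetic average.

```python
import math


def weighted_arithmetic(scores, weights):
    return sum(w * x for x, w in zip(scores, weights))


def weighted_geometric(scores, weights):
    # Scores must be non-negative; a zero on any criterion drives the result to zero.
    return math.prod(x ** w for x, w in zip(scores, weights))


weights = [1 / 3, 1 / 3, 1 / 3]   # demand, access, feasibility (illustrative)
balanced = [70, 70, 70]
unbalanced = [100, 100, 10]       # excellent demand and access, poor feasibility

for label, scores in [("balanced", balanced), ("unbalanced", unbalanced)]:
    print(label,
          round(weighted_arithmetic(scores, weights), 1),
          round(weighted_geometric(scores, weights), 1))
```

Both sites average 70 arithmetically. Under the geometric mean the balanced site stays at 70 while the unbalanced site falls to roughly 46, which is the penalty for imbalance described above.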

AHP consistency

AHP derives weights from pairwise comparisons. It also checks whether those comparisons are internally consistent.

The consistency index:

CI = (λ_max - n) / (n - 1)

The consistency ratio:

CR = CI / RI

Where:

Term | Meaning
λ_max | principal eigenvalue of the pairwise comparison matrix
n | number of criteria
RI | random index for matrix size n

Common RI values:

n | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
RI | 0 | 0.58 | 0.90 | 1.12 | 1.24 | 1.32 | 1.41 | 1.45

A common AHP convention is:

CR ≤ 0.10: judgments are acceptably consistent
CR > 0.10: revisit pairwise comparisons

For site selection, this matters because it flags a committee that has said inconsistent things like:

Demand is much more important than Access.
Access is much more important than Competition.
Competition is much more important than Demand.

AHP does not eliminate judgment. It makes judgment auditable. (Wikipedia)
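The consistency check can be run on the circular judgments quoted above. This is a rough pure-Python sketch that estimates the principal eigenvector by power iteration, which is one common approach and an assumption here rather than the only way to compute AHP weights. The value 5 standing in for "much more important" and the function name are also illustrative.

```python
# Random index values for n = 2..9, as listed in the RI table
RANDOM_INDEX = {2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights_and_cr(matrix, iterations=200):
    """Estimate AHP weights by power iteration; return (weights, lambda_max, CR)."""
    n = len(matrix)
    if n not in RANDOM_INDEX:
        raise ValueError("this sketch supports 2 to 9 criteria")
    w = [1.0 / n] * n
    for _ in range(iterations):
        mw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(mw)
        w = [v / total for v in mw]
    mw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(mw[i] / w[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)
    ri = RANDOM_INDEX[n]
    cr = ci / ri if ri > 0 else 0.0
    return w, lambda_max, cr


# Order: Demand, Access, Competition. 5 stands in for "much more important".
# Circular judgments: Demand > Access, Access > Competition, Competition > Demand.
circular = [
    [1, 5, 1 / 5],
    [1 / 5, 1, 5],
    [5, 1 / 5, 1],
]
# A consistent alternative: Demand vs Competition equals the product of the two steps.
consistent = [
    [1, 2, 6],
    [1 / 2, 1, 3],
    [1 / 6, 1 / 3, 1],
]

for label, m in [("circular", circular), ("consistent", consistent)]:
    weights, lam, cr = ahp_weights_and_cr(m)
    print(label, [round(x, 3) for x in weights], round(lam, 3), round(cr, 3))
```

The circular matrix produces equal weights and a consistency ratio far above 0.10, so the comparisons would go back to the committee. The consistent matrix produces a ratio of essentially zero.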

TOPSIS closeness

TOPSIS ranks candidates by distance from an ideal site and a worst-case site.

C_j = d_j⁻ / (d_j⁺ + d_j⁻)

Where:

Term | Meaning
d_j⁺ | distance from site j to the ideal site
d_j⁻ | distance from site j to the negative ideal site
C_j | closeness coefficient

A higher C_j means the candidate is closer to the ideal and farther from the negative ideal.

TOPSIS is useful when comparing finalists:

Which candidate is most like our ideal site profile?

Its weakness is normalization sensitivity. Different preprocessing methods can change distances, and therefore rankings. The MCDM literature continues to emphasize that reference-type methods such as TOPSIS can produce different rankings depending on how reference solutions, distance measures, and normalization are defined. (arXiv)
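For readers who want to see the mechanics, here is a compact TOPSIS sketch in plain Python. It uses vector normalization and Euclidean distance, which are common choices but still assumptions. The finalist data, weights, and criterion directions are invented for illustration.

```python
import math


def topsis(matrix, weights, benefit):
    """Closeness coefficient for each row (site) of matrix.

    benefit[c] is True when higher is better on criterion c, False when lower
    is better. Vector normalization is one choice among several.
    """
    n_crit = len(weights)
    norms = [math.sqrt(sum(row[c] ** 2 for row in matrix)) for c in range(n_crit)]
    v = [[weights[c] * row[c] / norms[c] for c in range(n_crit)] for row in matrix]
    columns = list(zip(*v))
    ideal = [max(col) if benefit[c] else min(col) for c, col in enumerate(columns)]
    worst = [min(col) if benefit[c] else max(col) for c, col in enumerate(columns)]
    closeness = []
    for row in v:
        d_plus = math.dist(row, ideal)    # distance to the ideal site
        d_minus = math.dist(row, worst)   # distance to the negative ideal site
        closeness.append(d_minus / (d_plus + d_minus))
    return closeness


# Hypothetical finalists: demand index, access score, competitor count
finalists = [[80, 70, 6], [65, 85, 3], [90, 50, 9]]
scores = topsis(finalists, weights=[0.4, 0.3, 0.3], benefit=[True, True, False])
for name, c in zip(["Site A", "Site B", "Site C"], scores):
    print(name, round(c, 3))
```

Because the ideal and negative ideal are built from the candidates themselves, swapping the normalization step or changing the candidate list can move the reference points and the ranking with them.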

Huff probability

Huff-style models can support scoring when the question is demand allocation.

P_ij = (A_j^α × D_ij^-β) / Σ_k (A_k^α × D_ik^-β)

Where:

Term | Meaning
P_ij | probability that demand from origin i chooses site j
A_j | attractiveness of site j
D_ij | travel time, distance, or travel cost
α | attractiveness sensitivity
β | travel-cost sensitivity
k | all options in the choice set

For a scoring model, this helps estimate:

  • expected capture from an origin
  • competitor pressure
  • same-brand transfer
  • market share potential
  • cannibalization-adjusted demand

The key move is comparing allocation before and after a candidate enters the market.
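A before-and-after comparison might look like the sketch below. It is illustrative Python under several stated assumptions: two made-up origins, equal attractiveness for all options, placeholder exponents, and no "no purchase" option, so all demand is allocated to some store.

```python
def huff_shares(attractiveness, travel_cost, alpha=1.0, beta=2.0):
    """Share of one origin's demand going to each option (no outside option)."""
    utility = {name: attractiveness[name] ** alpha * travel_cost[name] ** -beta
               for name in attractiveness}
    total = sum(utility.values())
    return {name: u / total for name, u in utility.items()}


attractiveness = {"existing_store": 20, "competitor": 20, "candidate": 20}
# Hypothetical origins: demand units and travel minutes to each option
origins = [
    {"demand": 1000, "minutes": {"existing_store": 5, "competitor": 10, "candidate": 8}},
    {"demand": 800, "minutes": {"existing_store": 12, "competitor": 6, "candidate": 4}},
]


def allocate(options):
    """Allocate every origin's demand across the given options."""
    totals = {name: 0.0 for name in options}
    for origin in origins:
        shares = huff_shares({n: attractiveness[n] for n in options},
                             {n: origin["minutes"][n] for n in options})
        for name, share in shares.items():
            totals[name] += origin["demand"] * share
    return totals


before = allocate(["existing_store", "competitor"])
after = allocate(["existing_store", "competitor", "candidate"])

transfer = before["existing_store"] - after["existing_store"]   # same-brand transfer
brand_gain = (after["existing_store"] + after["candidate"]) - before["existing_store"]
print(round(after["candidate"]), round(transfer), round(brand_gain))
```

The candidate's gross capture splits into two parts: demand transferred from the existing store and net new demand for the brand. Only the second part is incremental, which is why the comparison matters more than the candidate's standalone number.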

MCI attractiveness

The multiplicative competitive interaction model generalizes attractiveness across multiple attributes:

A_j = Π_m (X_mj^β_m)

Where:

Term | Meaning
A_j | attractiveness of site or store j
X_mj | attribute m for site j
β_m | elasticity or importance of attribute m

This is useful because store attractiveness is rarely one variable. It can include frontage, parking, co-tenancy, price, hours, brand strength, and format.
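As a quick illustration of the product form, the Python sketch below computes attractiveness for two hypothetical sites that differ only in frontage. The attribute names and elasticity values are invented for the example. In practice elasticities are estimated from observed behavior, not assigned by hand.

```python
import math


def mci_attractiveness(attributes, elasticities):
    """Attractiveness as the product of attributes raised to their elasticities.

    Attributes must be positive for the product form to be usable.
    """
    if any(attributes[name] <= 0 for name in elasticities):
        raise ValueError("attributes must be positive")
    return math.prod(attributes[name] ** beta for name, beta in elasticities.items())


# Hypothetical elasticities and attributes
elasticities = {"frontage_ft": 0.3, "parking_spaces": 0.4, "weekly_hours": 0.2}
site_a = {"frontage_ft": 120, "parking_spaces": 40, "weekly_hours": 98}
site_b = {"frontage_ft": 60, "parking_spaces": 40, "weekly_hours": 98}

ratio = mci_attractiveness(site_a, elasticities) / mci_attractiveness(site_b, elasticities)
print(round(ratio, 3))  # doubling frontage scales attractiveness by 2 ** 0.3
```

The elasticity reading is the practical payoff: with an exponent of 0.3, doubling frontage raises attractiveness by about 23 percent, not 100 percent.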

MCDA methods: weighted sum, AHP, TOPSIS, and outranking

A site scorecard is a multi-criteria decision model. There is no universally best method. The right method depends on the audience, data, governance burden, and decision stage.

Weighted sum model

The weighted sum model is the most common.

It works well when:

  • stakeholders need transparency
  • criteria are measurable
  • the model is used for ranking
  • hard requirements are handled as gates
  • criteria are reasonably independent

It fails when:

  • raw variables are not normalized
  • weights are changed after seeing the answer
  • correlated inputs are double-counted
  • hard requirements are averaged into the score
  • the score is treated as a forecast

AHP

AHP is useful when stakeholders need to set weights explicitly. Instead of asking "what should Demand weigh?", AHP asks stakeholders to compare criteria pair by pair:

Is Demand more important than Accessibility?
Is Competition more important than Reach?
How much more important?

This is useful for governance because it exposes trade-offs. It becomes cumbersome when there are too many criteria.

AHP also has a known failure mode: rank reversal. Adding a new alternative, including a duplicate or near-duplicate, can change the ranking of existing alternatives. (Wikipedia)

TOPSIS

TOPSIS is useful for short lists. It asks:

Which site is closest to the ideal and farthest from the worst case?

This is intuitive for comparing finalists, but it can be sensitive to normalization choices. If the same candidates rank differently under min-max, vector, or percentile normalization, the model should surface that instability rather than hide it.

Outranking methods

ELECTRE and PROMETHEE compare candidates pair by pair and can include veto thresholds. They are more complex, but they align with how real estate decisions often work:

Site A can outrank Site B unless Site A fails a critical requirement.

Outranking methods are useful when the organization wants partial ordering instead of false precision. They are harder to explain to a non-technical committee.

Practical architecture

For most multi-unit operators, the strongest architecture is:

Hard gates
+ transparent weighted score
+ forecast range
+ network overlays
+ sensitivity analysis
+ post-opening validation

This gives the committee a score that is understandable, a forecast that is honest, and a recommendation that can be challenged.

Why scoring models fail

A site scoring model can look rigorous and still be fragile. The main failure modes have names.

Rank reversal

Rank reversal occurs when adding or removing a candidate changes the ranking of existing candidates.

Operational example:

Initial ranking:
1. Site A
2. Site B
3. Site C

After adding Site D:
1. Site B
2. Site A
3. Site D
4. Site C

If Site D is not truly relevant to the comparison between A and B, the committee will ask why the original ranking changed.

Rank reversal can occur in AHP, TOPSIS, and other MCDA methods depending on how alternatives are normalized and compared. It does not make those methods useless. It means the team should know, and document, the conditions under which its rankings are stable. (Wikipedia)

Normalization instability

The same raw data can produce different rankings depending on normalization.

Example:

Candidate | Population | Competitors
Site A | 120,000 | 12
Site B | 90,000 | 4
Site C | 45,000 | 1

A min-max transformation, percentile rank, z-score transformation, and vector normalization can each tell a different story.

Esri's suitability analysis documentation treats preprocessing as a separate step for exactly this reason. It includes MinMax, Percentile, ZScore, and Raw preprocessing options, with notes about outlier sensitivity, skewed distributions, and when raw values are appropriate. (Esri Documentation)

The model should document:

  • which normalization method was used
  • why that method fits the variable
  • whether rankings are stable under alternative transformations
  • whether outliers distort the result
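The three-site example is enough to show the instability. The Python sketch below scores the same data under min-max scaling and a simple rank-based scaling. The equal weights, the inversion of competitor count, and the helper names are assumptions made for the demonstration.

```python
# Population and competitor counts from the example table
sites = {"Site A": (120_000, 12), "Site B": (90_000, 4), "Site C": (45_000, 1)}


def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def rank_scale(values):
    # Rank position scaled to 0..1; ties would share the lowest rank in this sketch.
    order = sorted(values)
    return [order.index(v) / (len(values) - 1) for v in values]


def totals(transform):
    names = list(sites)
    pop = transform([sites[n][0] for n in names])
    comp = transform([sites[n][1] for n in names])
    # Equal weights (an assumption); competitors are inverted so fewer is better.
    return {n: 0.5 * p + 0.5 * (1 - c) for n, p, c in zip(names, pop, comp)}


by_min_max = totals(min_max)
by_rank = totals(rank_scale)
print({n: round(s, 3) for n, s in by_min_max.items()})
print({n: round(s, 3) for n, s in by_rank.items()})
```

Under min-max, Site B leads and Sites A and C tie behind it. Under rank scaling, all three sites tie. Nothing about the sites changed, only the preprocessing step.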

Compensatory aggregation

Weighted sums are compensatory. A strong score in one criterion can offset a weak score in another.

That is appropriate when criteria are genuinely substitutable:

Two demand proxies can be averaged.
Two access measures can be combined.

It is dangerous when criteria are essential:

Zoning cannot be fixed by strong income.
No drive-thru cannot be fixed by high traffic.
Unacceptable payer mix cannot be fixed by strong patient need.

This is the deeper reason gates matter. Essential criteria should be gates or multiplicative penalties, not additive preferences.

The geometric-mean example is useful here. A geometric mean reduces the ability of one strong dimension to fully compensate for a weak one. In site selection, that means a site with excellent demand and poor feasibility should look more fragile than a simple arithmetic average would suggest. Esri's suitability analysis describes geometric mean as a combination method that requires high values in multiple criteria for high final scores. (Esri Documentation)

Double-counting correlated variables

Many site variables measure the same underlying signal.

Examples:

  • population, households, and density
  • income and education
  • daytime population and employment density
  • competitor count and saturation
  • traffic count and road hierarchy
  • corner lot and visibility score

If each is weighted independently, the model may overweight one concept because it appears in several columns.

A simple correlation audit helps. When two variables are highly correlated, combine them, drop one, or explicitly justify why both belong.

A practical rule:

If correlation is greater than 0.7, review for redundancy.

That is a heuristic, not a law. The point is to prevent accidental double-counting.
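A correlation audit can be as simple as the sketch below. It is plain Python with invented data for five candidate sites. Flagging on the absolute value of the correlation, so that strong negative relationships are also reviewed, is an added assumption beyond the rule as stated.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def redundancy_review(columns, threshold=0.7):
    """Return variable pairs whose absolute correlation exceeds the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 2)))
    return flagged


# Hypothetical values across five candidate sites (thousands, thousands, count)
columns = {
    "population": [10, 20, 30, 40, 50],
    "households": [4, 8, 13, 16, 21],
    "competitors": [3, 9, 2, 8, 4],
}
for pair in redundancy_review(columns):
    print(pair)
```

Here population and households are flagged as near-duplicates, while competitor count is not. The flagged pair is a prompt for review, not an automatic deletion.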

Weight gaming

Weights should express strategy before the site list is scored. If stakeholders change weights after seeing the output, the model becomes a negotiation tool.

A defensible process documents:

  • who set the weights
  • when weights were set
  • why weights were chosen
  • what changed since the prior version
  • how sensitive rankings are to those weights
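The last item, sensitivity, can be tested mechanically. The sketch below is one simple Python approach: move a few points of weight between every pair of criteria and check whether the top site changes. The 5-point shift, the function names, and the second site's scores are illustrative assumptions.

```python
def weighted_total(scores, weights):
    return sum(scores[c] * weights[c] for c in weights)


def top_site(sites, weights):
    return max(sites, key=lambda name: weighted_total(sites[name], weights))


def top_site_is_stable(sites, weights, shift=0.05):
    """True if moving `shift` of weight between any two criteria keeps the same top site."""
    baseline = top_site(sites, weights)
    for gain in weights:
        for lose in weights:
            if gain == lose or weights[lose] < shift:
                continue
            trial = dict(weights)
            trial[gain] += shift
            trial[lose] -= shift
            if top_site(sites, trial) != baseline:
                return False
    return True


weights = {"reach": 0.30, "demand": 0.30, "competition": 0.25, "access": 0.15}
# Hypothetical normalized scores (0-100) for two finalists
sites = {
    "Site A": {"reach": 86, "demand": 78, "competition": 58, "access": 70},
    "Site B": {"reach": 70, "demand": 75, "competition": 80, "access": 72},
}
print(top_site(sites, weights), top_site_is_stable(sites, weights))  # Site B False
```

Site B leads by a tenth of a point, and a 5-point weight shift flips the order. A result like that should be reported as a fragile ranking, not as a clear winner.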

Out-of-distribution candidates

A scoring model trained or calibrated on suburban drive-thru restaurants may not work for urban walk-up stores. A model built on traditional gyms may not work for boutique fitness studios. A healthcare model calibrated on primary care clinics may not work for imaging centers.

A candidate can score high and still deserve low confidence when it sits outside the range of sites the model was built on.

The brief should say when a candidate is out of distribution.

Normalization, thresholds, penalties, and gates

Raw variables rarely belong directly in a weighted score.

Population may range from 5,000 to 200,000. Income may range from $35,000 to $180,000. Competitor counts may range from 0 to 60. Traffic counts may range from 2,000 to 80,000 vehicles per day. If those values are added directly, the variable with the largest numeric range dominates.

Common transformations

| Method | How it works | Best use | Risk |
| --- | --- | --- | --- |
| Min-max | scales values from 0 to 1 | bounded variables, intuitive scoring | outliers distort range |
| Percentile rank | scores by rank position | skewed data, committee communication | loses magnitude differences |
| Z-score | standard deviations from mean | portfolio benchmarking | less intuitive |
| Raw | uses original value | already comparable fields | can dominate model |
| Log transform | compresses skewed values | income, density, spending | harder to explain |
| Fit-band scoring | rewards values near ideal band | traffic, income, spacing | requires calibration |
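The first three transformations can be sketched in a few lines. This is an illustrative implementation, not a reference one: the percentile rank uses a mid-rank convention for ties, the z-score uses the population standard deviation, and the traffic counts are hypothetical.

```python
from statistics import mean, pstdev

def min_max(values):
    """Scale values to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]  # no spread: assign a neutral score
    return [(v - lo) / (hi - lo) for v in values]

def percentile_rank(values):
    """Share of values below each value, counting ties as half."""
    n = len(values)
    ranks = []
    for v in values:
        below = sum(1 for other in values if other < v)
        equal = sum(1 for other in values if other == v)
        ranks.append((below + 0.5 * equal) / n)
    return ranks

def z_score(values):
    """Standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical daily traffic counts with one outlier.
traffic = [2000, 10000, 20000, 80000]
scaled = min_max(traffic)
ranked = percentile_rank(traffic)
standardized = z_score(traffic)
```

With the 80,000 outlier in the list, min-max compresses the three other sites below 0.25, while percentile rank spaces all four evenly. That is the "outliers distort range" risk in practice.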

Fit-band scoring

Some variables are non-monotonic.

Traffic is a good example. Low traffic may not provide enough exposure. Extremely high-speed or high-volume traffic can make access difficult. In QSR scoring, a site may score best in a target band rather than simply scoring higher as traffic rises.

A simple fit-band function:

Traffic score = 0 below minimum
Traffic score rises to peak in target band
Traffic score falls as access friction increases

This is more realistic than treating traffic as "higher is always better."
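A minimal sketch of that shape as a trapezoid. It assumes linear ramps on both sides of the target band and a score of zero at or beyond an upper limit. All four thresholds are hypothetical and would need calibration per concept.

```python
def fit_band_score(value, minimum, band_low, band_high, maximum):
    """Score 0-1: zero outside [minimum, maximum], 1.0 inside the target band."""
    if not (minimum < band_low <= band_high < maximum):
        raise ValueError("expected minimum < band_low <= band_high < maximum")
    if value <= minimum or value >= maximum:
        return 0.0
    if value < band_low:
        # Rising side: exposure improves as traffic approaches the band.
        return (value - minimum) / (band_low - minimum)
    if value <= band_high:
        return 1.0
    # Falling side: access friction grows as traffic exceeds the band.
    return (maximum - value) / (maximum - band_high)

# Hypothetical thresholds in vehicles per day.
band = dict(minimum=10000, band_low=25000, band_high=45000, maximum=80000)
too_low = fit_band_score(5000, **band)
ramping = fit_band_score(17500, **band)
in_band = fit_band_score(30000, **band)
too_busy = fit_band_score(62500, **band)
```

In this example a site at 17,500 vehicles per day and a site at 62,500 both score 0.5, for opposite reasons. The brief should say which side of the band a site falls on.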

Gates

Gates are pass/fail requirements.

| Gate | Why it belongs outside the score |
| --- | --- |
| Zoning allows use | A high demand score cannot fix illegal use. |
| Drive-thru feasible | A drive-thru-dependent prototype cannot average this away. |
| Rent below ceiling | Unit economics can block approval. |
| Minimum trade area population | A market may be too small regardless of access. |
| Parking requirement | Operations may be infeasible without it. |
| Healthcare payer mix | Demand without reimbursement can be weak. |
| Franchise territory | Legal or contractual constraints can block the site. |

A gate-failed site should be surfaced separately:

Gate result: Failed
Reason: Drive-thru feasibility
Recommendation: revise prototype or reject
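A sketch of gates evaluated before the score is used for ranking. The gate names follow the table above. The site fields and the rule that a failed site carries no rankable score are assumptions for illustration.

```python
def evaluate_gates(site, gates):
    """Run every pass/fail check and report which gates failed."""
    failed = [name for name, passes in gates.items() if not passes(site)]
    return {
        "gate_result": "Failed" if failed else "Passed",
        "failed_gates": failed,
    }

# Hypothetical gate checks keyed by the name shown in the brief.
gates = {
    "Zoning allows use": lambda s: s["zoning_allows_use"],
    "Drive-thru feasibility": lambda s: s["drive_thru_feasible"],
    "Rent below ceiling": lambda s: s["annual_rent"] <= s["rent_ceiling"],
}

site = {
    "zoning_allows_use": True,
    "drive_thru_feasible": False,
    "annual_rent": 180000,
    "rent_ceiling": 200000,
    "weighted_score": 88,
}

result = evaluate_gates(site, gates)
# A gate-failed site is surfaced separately and is not ranked on its score.
score_to_rank = site["weighted_score"] if result["gate_result"] == "Passed" else None
```

The site's weighted score of 88 never enters the ranking, because the drive-thru gate fails first.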

Penalties

Penalties reduce a score when risk is present but not fatal.

Examples:

| Penalty | Use case |
| --- | --- |
| Cannibalization penalty | Candidate overlaps existing store demand. |
| Saturation penalty | Recent cohorts underperform in market. |
| Access penalty | Poor ingress, left-turn friction, or one-way constraint. |
| Confidence penalty | Missing or stale critical inputs. |
| Competitor penalty | Direct competitors dominate the target trade area. |

Penalties should be visible. Hidden penalties create committee confusion.

Missing data

Do not silently impute critical inputs.

Missing data should be handled in one of three ways:

| Missing-data approach | Use case |
| --- | --- |
| Research required | critical field missing, such as rent or zoning |
| Confidence reduction | useful field missing, such as traffic or customer-origin data |
| Documented imputation | low-risk variable, clear method, strong rationale |

The site brief should say:

Confidence: Medium
Reason: traffic data unavailable; score relies on road class and observed nearby demand

A high score with low confidence is not an approval. It is a research priority.

Weights, sensitivity, and rank stability

Weights express strategy.

A coffee concept may weight morning access and worker density heavily. A grocery concept may weight household density, income, vehicle access, basket opportunity, and co-tenancy. An urgent care operator may weight patient access, payer mix, provider capacity, and service-line need.

Weights should be set before candidate sites are scored.

Common ways to choose weights

| Method | Description | Best use |
| --- | --- | --- |
| Equal weights | Every component has the same influence | early model, no strong strategy yet |
| Expert weights | Leadership or analysts assign weights | small teams, simple model |
| AHP | Pairwise comparisons produce weights | governance-heavy teams |
| Historical calibration | Weights tuned against past openings | mature operators with data |
| Segment-specific weights | Different weights by format or market type | multi-format operators |
| Hybrid | Expert weights refined by validation | most practical approach |

Esri's suitability analysis documentation states that weights significantly affect resulting scores, that weight selection is subjective, and that weights should be backed by strong rationale and subject-matter expertise. (Esri Documentation)

Sensitivity analysis

Sensitivity analysis asks:

If we change the weights, does the recommendation change?

Examples:

  • If Competition weight increases from 25% to 35%, does Site A still rank first?
  • If Demand weight decreases by 10 points, does Site B move ahead?
  • If missing traffic data is added, does Site C's score or rank change materially?
  • If cannibalization transfer is 30% instead of 20%, does the site still clear the hurdle?

A good scorecard should show:

| Sensitivity question | Output |
| --- | --- |
| What weight change would flip the top two sites? | stability interval |
| Which criterion drives the ranking most? | driver analysis |
| Which site is second-best under alternate weights? | robustness check |
| Which inputs are missing or low confidence? | confidence note |
| Which site remains top under multiple methods? | consensus ranking |

Rank instability does not automatically invalidate a model. It tells the committee that the decision is sensitive and deserves more evidence.

Stability intervals

A stability interval shows how much a weight can change before the ranking changes.

Example:

Demand weight: 30%
Ranking remains stable if Demand weight stays between 24% and 38%.
If Demand weight drops below 24%, Site B overtakes Site A.

That is a much more useful committee statement than:

Site A scores 82.

It tells decision-makers whether the recommendation is robust or fragile.
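A stability interval can be estimated with a simple weight sweep. The sketch below assumes that when one weight changes, the remaining weights are rescaled proportionally so the total stays at 1. That is one convention among several. The three sites and their component scores are hypothetical, and the sweep assumes the criterion's baseline weight is below 100%.

```python
def weighted_score(components, weights):
    return sum(weights[name] * components[name] for name in weights)

def reweight(weights, criterion, new_weight):
    """Set one weight and rescale the others proportionally to keep the total at 1."""
    scale = (1.0 - new_weight) / (1.0 - weights[criterion])
    return {
        name: new_weight if name == criterion else weight * scale
        for name, weight in weights.items()
    }

def top_site(sites, weights):
    return max(sites, key=lambda name: weighted_score(sites[name], weights))

def stability_interval(sites, weights, criterion):
    """Sweep the weight in 1-point steps; return the range where the top site holds."""
    baseline = top_site(sites, weights)
    stable = [
        pct for pct in range(0, 101)
        if top_site(sites, reweight(weights, criterion, pct / 100)) == baseline
    ]
    return baseline, min(stable), max(stable)

weights = {"reach": 0.30, "demand": 0.30, "competition": 0.25, "accessibility": 0.15}
sites = {
    "Site A": {"reach": 80, "demand": 95, "competition": 60, "accessibility": 70},
    "Site B": {"reach": 85, "demand": 70, "competition": 80, "accessibility": 75},
    "Site C": {"reach": 60, "demand": 100, "competition": 50, "accessibility": 60},
}
leader, low_pct, high_pct = stability_interval(sites, weights, "demand")
```

In this example Site A leads only while the Demand weight stays between 30% and 74%. The baseline weight sits on the lower edge, so a one-point reduction hands the top rank to Site B, which is exactly the fragility a committee should see.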

Deep vertical examples

"Site scoring" is not one universal checklist. Different categories require different gates, weights, transformations, and overlays.

The reference points below are practical scoring heuristics drawn from industry sources and operator disclosures. They are not universal thresholds. Every operator should calibrate them against its format, geography, pricing, customer profile, and post-opening outcomes.

QSR and fast casual

QSR scoring is usually driven by frequency, access, speed, daypart, and channel mix.

Common scoring criteria:

| Criterion | Why it matters |
| --- | --- |
| Drive-time reach | QSR catchments are short and convenience-driven. |
| Traffic and route access | Exposure matters, but access friction can overwhelm volume. |
| Drive-thru feasibility | For many QSR formats, drive-thru is a major revenue channel. |
| Lunch and dinner demand | Daypart mix can determine viability. |
| Worker and commuter density | Especially important for breakfast and lunch. |
| Delivery coverage | Affects kitchen assignment and order density. |
| Competitor clustering | Some clustering signals demand; too much creates saturation. |
| Cannibalization | High-frequency categories may tolerate some transfer but must measure it. |

Useful QSR heuristics:

| Signal | Practical scoring treatment |
| --- | --- |
| ~25,000+ ADT near the site | common freestanding QSR traffic floor, but not a guarantee |
| Very high-speed / very high-volume road | fit-band penalty if drivers cannot decelerate or turn |
| QSR trade area around 5-7 minutes | starting point for car-oriented formats |
| Fast casual around 7-10 minutes | often broader than traditional QSR |
| Drive-thru can account for 70% or more of revenue at many traditional QSR formats | drive-thru feasibility may be a gate |
| Lunch may drive 35-40% of daily revenue in many QSR models | daypart demand should be scored, not averaged away |
| $40k-$80k household income may index strongly for QSR frequency in some models | income should often use a fit band, not "higher is always better" |

QSR trade-area and revenue benchmarks are highly format-specific. A coffee drive-thru, burger drive-thru, chicken concept, fast casual bowl concept, and suburban pizza unit should not share the same scoring model.

A defensible QSR scorecard should use fit bands:

Traffic: fit band, not "higher is always better"
Income: concept fit band, not "higher is always better"
Competition: nonlinear, because some clustering is beneficial
Cannibalization: overlay, because transfer can be intentional or harmful
Drive-thru feasibility: gate

Example QSR scoring structure:

| Layer | Example |
| --- | --- |
| Gate | drive-thru feasible, parking adequate, ingress acceptable |
| Core score | reach, demand, competition, accessibility |
| Forecast | P10 / P50 / P90 AUV with analog stores |
| Overlay | delivery overlap, same-brand transfer, market saturation |
| Recommendation | advance only if drive-thru geometry and lease terms hold |

Urgent care and healthcare

Healthcare scoring is less about retail sales and more about access, capacity, reimbursement, and service-line fit.

Common scoring criteria:

| Criterion | Why it matters |
| --- | --- |
| Patient access | Travel time and barriers shape actual utilization. |
| Payer mix | Demand without reimbursement is not equivalent to demand. |
| Provider capacity | Saturation depends on supply, not just population. |
| Service-line demand | Urgent care, primary care, imaging, dental, and specialty care differ. |
| Referral leakage | A new clinic may retain demand inside the system. |
| Regulatory constraints | Licensing and certificate requirements can be gates. |
| Labor and provider availability | A site cannot operate without staff. |

Useful urgent care heuristics:

| Signal | Practical scoring treatment |
| --- | --- |
| 3-5 mile radius or 12-15 minute drive-time catchment | starting point, not a universal rule |
| 2,800-3,500 square foot footprint | common urgent care prototype range |
| ~20,000 population per existing urgent care | rough saturation reference |
| Household income $50k-$100k | common urgent care fit band |
| Payer mix | gate or high-weight criterion |
| Provider availability | feasibility gate |
| Highways, rivers, undeveloped land, income gradients | barriers that distort catchment shape |

A healthcare scorecard should separate need from reachable, reimbursable, staffed demand.

Example healthcare scoring structure:

| Layer | Example |
| --- | --- |
| Gate | licensing, payer threshold, provider availability, minimum footprint |
| Core score | access, demand, competition/supply, accessibility |
| Forecast | visits, patient panels, appointment utilization |
| Overlay | leakage recapture, system transfer, capacity relief |
| Recommendation | advance if payer mix, staffing, and access pass threshold |

A healthcare model should also distinguish access gaps from business opportunity. An underserved area may have high community need but weak reimbursement, limited staffing, or regulatory constraints. That may still be strategically important, but the scoring model should make the trade-off explicit.

Fitness clubs

Fitness scoring is driven by membership penetration, income, lifestyle fit, commute patterns, co-tenancy, churn, and peak-hour capacity.

Common scoring criteria:

| Criterion | Why it matters |
| --- | --- |
| Target adult population | Sets the membership pool. |
| Income and lifestyle fit | Membership type and price point vary by concept. |
| Drive-time / walk-time reach | Convenience affects join rate and retention. |
| Competitor capacity | Many markets have demand but limited remaining white space. |
| Co-tenancy | Grocery, health-focused retail, and daily-use centers can help. |
| Peak-hour access | Parking and commute patterns matter at morning and evening peaks. |
| Churn and retention | A site that improves convenience may reduce churn. |

Useful fitness heuristics, drawn from the Health & Fitness Association's 2024-2025 reporting:

| Signal | Practical scoring treatment |
| --- | --- |
| U.S. gym membership penetration around 24.9% of the age-6+ population (HFA 2024) | starting demand multiplier |
| Industry retention benchmark around 66.4% in HFA 2025 reporting | churn and retention matter, not only new joins |
| 25-44 age segment is the largest membership group at roughly one-third of memberships | age-weighted demand scoring |
| Members with household income above $75k represent just over half of all memberships | income and lifestyle fit matter |
| Boutique studios, big-box clubs, and high-value low-price formats serve different customer profiles | do not use one scoring model for all fitness concepts |

A simple fitness demand screen:

Potential members =
Catchment population
× category penetration
× target customer fit
- competitor member capacity
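
The screen is simple arithmetic, sketched here with hypothetical inputs. The 24.9% penetration figure comes from the table above. The catchment size, fit share, and competitor capacity are made-up values for illustration.

```python
def potential_members(catchment_population, category_penetration,
                      target_customer_fit, competitor_member_capacity):
    """Addressable members in the catchment, net of existing competitor capacity."""
    addressable = catchment_population * category_penetration * target_customer_fit
    return addressable - competitor_member_capacity

members = potential_members(
    catchment_population=40_000,       # hypothetical catchment
    category_penetration=0.249,        # national penetration reference
    target_customer_fit=0.60,          # hypothetical share matching the concept
    competitor_member_capacity=3_000,  # hypothetical existing capacity
)
```

Here 40,000 × 0.249 × 0.60 gives 5,976 addressable members, and subtracting 3,000 of competitor capacity leaves about 2,976. A negative result would indicate a market with no remaining white space under these assumptions.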

A more defensible scorecard includes churn reduction and peak utilization:

Network value =
new memberships
+ churn reduction
+ capacity relief
- transfer from existing clubs
- added rent and labor

Example fitness scoring structure:

| Layer | Example |
| --- | --- |
| Gate | footprint, parking, rent ceiling, zoning |
| Core score | target population, lifestyle fit, access, competition |
| Forecast | members, ramp curve, contribution |
| Overlay | member transfer, churn reduction, capacity relief |
| Recommendation | advance if member pool and retention benefit clear threshold |

Scorecards vs spreadsheets

Spreadsheets are a reasonable starting point.

They work when:

  • the company has a small footprint
  • one or two people own the model
  • the concept has one format
  • opening cadence is slow
  • assumptions rarely change
  • cannibalization is minimal
  • the decision does not require an audit trail

They begin to fail when:

  • multiple stakeholders edit weights
  • versions diverge
  • candidate lists change weekly
  • the company has many existing stores
  • portfolio overlap matters
  • data sources need timestamps
  • the committee needs consistent briefs
  • the model must be validated against outcomes

The Federal Reserve's SR 11-7 guidance is written for banking model risk, but its definition is useful for any high-stakes quantitative decision system. It defines a model as a quantitative method, system, or approach that applies theory, assumptions, and data to produce quantitative estimates. The definition also covers quantitative approaches with qualitative or expert-judgment inputs when the output is quantitative. (Federal Reserve)

That definition covers many site selection scorecards.

The most important SR 11-7 lesson for site scoring is its warning about spreadsheets. The guidance says, "User-developed applications, such as spreadsheets or ad hoc database applications used to generate quantitative estimates, are particularly prone to model risk." (Federal Reserve)

A real estate committee often asks questions that spreadsheets struggle to answer:

  • Which model version produced this score?
  • What changed since last quarter?
  • Who changed the weights?
  • What data vintage was used?
  • Which variables are missing?
  • What would flip the recommendation?
  • How did sites like this perform historically?
  • What is the cannibalization impact?
  • Did the last five high-scoring sites actually perform?

When those questions become routine, the spreadsheet has become infrastructure.

Governance, documentation, and model risk

A site score does not have to be regulated like a banking model to benefit from model governance.

The best governance ideas are practical:

  • define the model's intended use
  • document the data sources
  • document weights and transformations
  • version the model
  • show missing data
  • track assumptions
  • measure outcomes
  • update the model when the market changes
  • maintain a decision history

SR 11-7 emphasizes effective challenge: objective, informed review by people who can identify model limitations and produce appropriate changes. It also describes validation as conceptual soundness, ongoing monitoring, and outcomes analysis. (Federal Reserve)

That is exactly what a real estate committee should do.

The committee should be able to ask:

Why did this site score well?
Which inputs drive the score?
Which assumptions are fragile?
What would change the recommendation?
What did similar past openings do?

Model cards for site selection

Model Cards were proposed as a way to document trained machine learning models, including intended use, performance characteristics, evaluation procedures, and limitations. The original Model Cards paper describes them as short documents that disclose intended use, performance evaluation, and other relevant information to support transparent model reporting. (arXiv)

A site scoring model card should include:

| Field | Example |
| --- | --- |
| Model name | QSR Suburban Site Score v2.1 |
| Intended use | Screening U.S. suburban QSR candidates |
| Out-of-scope use | Urban walk-up sites, airport sites, ghost kitchens |
| Components | Reach, Demand, Competition, Accessibility |
| Weights | 30 / 30 / 25 / 15 |
| Gates | drive-thru required, rent ceiling, minimum parking |
| Data sources | ACS, POI, routing, customer origins |
| Data vintage | ACS 2019-2023, POI May 2026 |
| Calibration cohort | 87 openings from 2021-2025 |
| Validation metrics | AUV by score decile, MAPE, bias, cannibalization error |
| Known limitations | weak rural performance, limited mobility data |
| Last updated | May 2026 |

That may sound formal. It is also the type of documentation that lets a decision stand months later.

SHAP and LIME

When a scoring system uses machine learning, explanation tools can help.

SHAP assigns feature-importance values to individual predictions using a unified additive explanation framework. LIME explains individual predictions by learning an interpretable local model around the prediction. (arXiv)

In site selection, these methods are most useful for forecasts or ML-driven components, not for simple weighted scorecards. A transparent weighted score usually does not need SHAP. But if the model uses gradient boosting, random forests, or another opaque predictor, the brief should explain why a particular site received its forecast or risk estimate.

NIST AI RMF and EU AI Act Article 13

AI governance standards point in the same direction: transparency, interpretability, appropriate use, documentation, and ongoing evaluation.

NIST says its AI Risk Management Framework is intended to help organizations incorporate trustworthiness considerations into the design, development, use, and evaluation of AI systems. (NIST) The EU AI Act's Article 13 requires high-risk AI systems to be transparent enough for deployers to interpret outputs and use them appropriately, with instructions that include intended purpose, accuracy metrics, limitations, input specifications, and information to help interpret outputs. (EUR-Lex)

Even if a location scoring model is not legally classified as high-risk AI, the practical standard still applies. A model that affects capital allocation should explain itself.

For site selection, that means:

Intended use
Data sources
Transformations
Weights
Gates
Limitations
Validation results
Confidence level
Human review process

That is the difference between a score and a defensible score.

Calibration and post-opening validation

A score that never gets compared with outcomes is an untested hypothesis.

Validation should answer three questions:

  1. Did high-scoring sites perform better than low-scoring sites?
  2. Did the model correctly identify weak sites?
  3. Did the score explain the right things?

What to validate

| Validation target | Metric |
| --- | --- |
| Site performance | sales, AUV, visits, members, patients, deposits |
| Forecast accuracy | MAPE, bias, prediction-interval coverage |
| Score monotonicity | performance by score decile |
| Cannibalization | transfer from nearby existing stores |
| Saturation | cohort AUV decay, same-store impact |
| Feasibility | sites that failed because of real estate or operations |
| Confidence | whether low-confidence scores had wider errors |

Score decile validation

A simple validation chart:

| Score decile | Mature AUV | Expected pattern |
| --- | --- | --- |
| 90-100 | Highest | strongest performance |
| 80-89 | High | above average |
| 70-79 | Medium | acceptable |
| 60-69 | Weak | watchlist |
| <60 | Lowest | rejected or underperforming |

The pattern does not need to be perfect. Real operations are messy. But if high-scoring sites do not outperform lower-scoring sites over time, the model needs recalibration.
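A minimal monotonicity check, assuming each opening is recorded as a (score, mature AUV) pair. It groups openings into the score bands from the table above instead of computing true deciles, and the sample openings, in $M, are hypothetical.

```python
from statistics import mean

# Band floors, highest first; labels match the validation table.
BANDS = [(90, "90-100"), (80, "80-89"), (70, "70-79"), (60, "60-69"), (0, "<60")]

def band_for(score):
    for floor, label in BANDS:
        if score >= floor:
            return label
    return "<60"

def mean_auv_by_band(openings):
    """Average mature AUV per score band, ordered from highest band to lowest."""
    grouped = {}
    for score, auv in openings:
        grouped.setdefault(band_for(score), []).append(auv)
    return [(label, mean(grouped[label])) for _, label in BANDS if label in grouped]

def is_monotonic(band_means):
    """True when no lower band outperforms a higher band on average."""
    values = [auv for _, auv in band_means]
    return all(a >= b for a, b in zip(values, values[1:]))

# Hypothetical (score, mature AUV in $M) pairs from past openings.
openings = [(93, 2.9), (91, 2.7), (84, 2.5), (82, 2.3),
            (76, 2.0), (71, 2.2), (65, 1.7), (52, 1.4)]
band_means = mean_auv_by_band(openings)
monotonic = is_monotonic(band_means)
```

Individual sites can break the pattern, as the 71-point site does here, while the band averages still decline in order. The check is on averages, which matches the tolerance for messy real operations.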

Forecast validation

Forecasts should be evaluated separately from scores.

Common metrics:

MAPE = mean(|actual - forecast| / actual)

Bias = mean(forecast - actual)

Coverage = share of sites where actual result falls inside forecast interval
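
The three metrics translate directly into code. This sketch assumes actuals are nonzero and that each forecast has a (low, high) interval. The sample values, in $M, are hypothetical.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error, as a fraction of actual."""
    return sum(abs(a - f) / a for a, f in zip(actuals, forecasts)) / len(actuals)

def bias(actuals, forecasts):
    """Mean signed error; positive means forecasts ran high."""
    return sum(f - a for a, f in zip(actuals, forecasts)) / len(actuals)

def coverage(actuals, intervals):
    """Share of sites whose actual result fell inside the forecast interval."""
    inside = sum(1 for a, (low, high) in zip(actuals, intervals) if low <= a <= high)
    return inside / len(actuals)

# Hypothetical first-year results for four openings.
actual_auv = [2.0, 2.5, 3.0, 1.5]
forecast_auv = [2.2, 2.4, 2.7, 1.8]
forecast_intervals = [(1.8, 2.6), (2.0, 2.8), (2.3, 3.1), (1.6, 2.0)]

error = mape(actual_auv, forecast_auv)
signed_error = bias(actual_auv, forecast_auv)
hit_rate = coverage(actual_auv, forecast_intervals)
```

For this sample, MAPE is 11%, bias is slightly positive at about $0.025M, and three of the four actuals fall inside their intervals.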

For new sites, forecast intervals are more honest than point estimates. A forecast of $2.4M ± $700K is less satisfying than $2.4M, but it is easier to defend.

New-store forecast accuracy should not be compared casually with existing-store demand forecasting. Existing-store forecasts often use operating history, SKU history, seasonality, and recent sales data. A new location has none of that. The honest output is a scenario range or prediction interval, ideally backed by named analogs.

Validation cadence

Validation timing should reflect the format.

| Timing | What to check |
| --- | --- |
| 30 days | operational opening issues, early traffic, data sanity |
| 90 days | ramp pattern, early transfer, channel mix |
| 180 days | emerging trade area, same-store impact |
| 12 months | first-year performance, forecast error |
| 18 months | more stable cannibalization and trade-area evidence for many retail and restaurant formats |
| 24-36 months | mature performance, cohort calibration |

QSR and convenience formats may reveal meaningful signals faster than fitness clubs, healthcare clinics, or membership-based businesses. Fitness needs enough time to observe join rate, retention, churn, and peak utilization. Healthcare needs enough time to observe patient panel growth, payer mix, referral patterns, and appointment utilization.

Post-opening validation turns site selection into a learning system.

What to do with validation results

Validation should change the model.

| Finding | Model response |
| --- | --- |
| High-scoring sites underperform | inspect weights, gates, and omitted variables |
| Low-confidence sites have large errors | tighten confidence rules |
| Forecasts are systematically high | recalibrate intercept or analog assumptions |
| Cannibalization underestimated | increase transfer overlay or change trade area method |
| One format behaves differently | create separate score profile |
| Urban sites fail rural-trained model | segment by market type |

A scoring model should become more accurate after each opening.

Public examples of scoring logic

Public commercial site scorecards are often proprietary, but several public and academic examples show how formal scoring appears in practice.

Academic AHP studies

AHP-based location studies often publish criteria weights. These are not plug-and-play templates for operators, but they show what explicit criteria weighting looks like.

| Context | Example criteria and weights |
| --- | --- |
| Gas station siting | One AHP study (Wu, Chen & Pan, ISPRS International Journal of Geo-Information, 2024) weighted population density at 0.633, gas station supply capacity at 0.261, and road network density at 0.106. |
| Hospital site selection | Şahin, Ocak & Top (Health Policy and Technology, 2019) rank demand, accessibility, competitors, government policy, related industry, and environmental conditions. |
| Pandemic hospital siting | Boyacı & Şişman (Environmental Science and Pollution Research, 2022) use Pythagorean fuzzy AHP to weight distance to transportation, population centers, land slope, land use, distance to other hospitals, and hazards. |
| Bank branch siting | Basar, Kabak & Topcu (Socio-Economic Planning Sciences, 2017) weight demographics, cost, competition, transportation, flexibility, and access to public facilities. |
| Shopping mall criteria | AHP studies of mall performance often find tenant satisfaction among the highest-weighted criteria, sometimes above 0.35. |

The lesson is not that a QSR operator should copy a hospital siting model. The lesson is that a defensible scoring model should make criteria, weights, and trade-offs visible.

Public-sector scoring rubrics

Economic development and public-sector site scoring often use a two-stage structure:

Threshold review:
pass / fail

Scored review:
weighted criteria and narrative ranking

That is a good model for commercial site selection. Hard requirements should be handled first. Weighted scoring should rank sites that remain eligible.

Franchise disclosure documents

Franchise Disclosure Document Item 11 often describes site-selection assistance. It may mention demographics, population density, income, geography, physical boundaries, competition, and other site factors. Most FDDs do not disclose the exact scoring matrix.

That tells us something important: even when public disclosure is high-level, site selection criteria are part of the franchise system's operating logic. A serious franchise operator should be able to explain how those criteria become a site approval recommendation.

Current industry state

Vendor methodology transparency varies widely.

GIS suitability tools

GIS suitability tools are often the most transparent because they show preprocessing, weighting, transformation, and combination methods. Esri's suitability analysis documentation describes criteria, influence settings, preprocessing methods, combination methods, score scaling, and weighting, including MinMax, Percentile, ZScore, Raw, Sum, Mean, Product, and Geometric Mean options. (Esri Documentation)

That is good governance. Users can see how variables enter the model.

Predictive analytics vendors

Predictive analytics vendors may use analog stores, regression, machine learning, or proprietary site scores. Some offer useful forecast ranges and comparable-store visibility. Others present a single score without exposing enough detail about features, weights, missing-data handling, validation, or calibration.

The question is not whether the score uses AI. The questions are:

What does the score measure?
What data trained or calibrated it?
How does it treat missing data?
How does it separate fit from forecast?
How does it account for cannibalization?
How has it performed after openings?

AI scores

"AI score" is a marketing term, not a methodology.

A serious AI-assisted site scoring system should still show:

  • intended use
  • model type
  • data sources
  • feature transformations
  • training or calibration cohort
  • validation metrics
  • confidence level
  • limitations
  • local explanation for a given site
  • human review process

The serious end of the market is moving toward explicit scoring rubrics, model documentation, visible assumptions, and explainable ML where appropriate. The marketing end is moving toward "AI" as a brand. Expansion leaders are getting better at telling the difference.

A Geod-style site scorecard

Geod's public methodology documents the default score around four components: Reach, Demand, Competition, and Accessibility, with default weights of 30%, 30%, 25%, and 15%. The score is a transparent weighted linear model with documented sources, snapshot dates, and visible components.

There are two clean ways to present this kind of score.

Textbook weighted model

In a clean weighted-score display, every component is normalized so higher is better. Competition is therefore expressed as Competition Fit, meaning lower competitive pressure or better competitive context.

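| Component | Normalized score | Weight | Contribution |
| --- | --- | --- | --- |
| Reach fit | 86 | 30% | 25.8 |
| Demand fit | 78 | 30% | 23.4 |
| Competition fit | 58 | 25% | 14.5 |
| Accessibility fit | 70 | 15% | 10.5 |
| Total | | | 74.2 |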

Formula:

Score =
0.30 × Reach fit
+ 0.30 × Demand fit
+ 0.25 × Competition fit
+ 0.15 × Accessibility fit

This is the easiest version to explain to a committee.
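The same calculation as a short sketch, using the fit scores and default weights from the table above. The function name and dictionary keys are illustrative.

```python
WEIGHTS = {"reach": 0.30, "demand": 0.30, "competition": 0.25, "accessibility": 0.15}

def score_breakdown(fit_scores, weights=WEIGHTS):
    """Return per-component contributions and the total weighted score."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    contributions = {name: weights[name] * fit_scores[name] for name in weights}
    return contributions, sum(contributions.values())

# Normalized 0-100 fit scores where higher is better for every component.
fit_scores = {"reach": 86, "demand": 78, "competition": 58, "accessibility": 70}
contributions, total = score_breakdown(fit_scores)
```

The contributions come to 25.8, 23.4, 14.5, and 10.5, for a total of 74.2. Returning the contributions alongside the total keeps the component breakdown available for the brief.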

Signed contribution display

Some products show signed contributions instead of normalized sub-scores.

Site score =
Reach contribution
+ Demand contribution
- Competition penalty
+ Accessibility contribution

Example:

| Component | Signed contribution |
| --- | --- |
| Reach | +28 |
| Demand | +31 |
| Competition pressure | -12 |
| Accessibility | +27 |
| Site score | 74 |

In this display, "Competition" is not a 0-100 component score. It is a penalty contribution that reduces the final score. That distinction should be explicit in the brief so a careful reader does not try to reconcile a negative contribution with a 0-100 normalized weighted-score formula.

Decision overlays

The score should then be paired with overlays.

| Overlay | Example output |
| --- | --- |
| Cannibalization | Moderate transfer risk from Store 14 |
| Saturation | Market cohort AUV declining; conditional approval |
| Feasibility | Conditional: rent and ingress need confirmation |
| Confidence | Medium-high: strong ACS/POI data, limited first-party origins |
| Validation plan | Compare ramp, affected-store sales, and customer origins at 90 and 180 days |

This preserves the simplicity of the score while giving the committee the context it needs.

Evaluate vs Strategize

A useful product architecture separates first-site evaluation from portfolio strategy.

| Mode | What it should do |
| --- | --- |
| Evaluate | Apply a defensible score to candidate sites using consistent criteria, weights, thresholds, and source-dated data. |
| Strategize | Apply the same strategy across a portfolio with model versioning, custom weights, batch scoring, cannibalization, saturation, feasibility, confidence, and decision history. |

Geod's Evaluate tool scores specific sites and exports explainable reports. Strategize gives expansion teams a way to apply the same strategy systematically across a 30-500 location portfolio.

The Evaluate tier teaches a team that scoring can be repeatable and defensible. The Strategize tier turns that scoring into an institutional system. That matters for operators whose strategy is already in someone's head, in a consultant deck, and in three spreadsheets, but not yet versioned, consistently applied, or easy to defend.

The defensible claim is grounded:

Your expansion strategy becomes a repeatable scoring system,
applied consistently to every candidate you evaluate,
with portfolio-aware overlays and a decision record.

What a defensible site brief should show

A defensible site brief should contain the score and the argument behind it.

Recommended structure:

Site Score Summary

1. Candidate address and prototype
2. Core score and component breakdown
3. Gates passed / failed
4. Criteria weights and model version
5. Data sources and snapshot dates
6. Trade area method and time window
7. Demand assumptions
8. Competition assumptions
9. Accessibility assumptions
10. Forecast range, if available
11. Cannibalization and saturation overlays
12. Feasibility gate
13. Confidence level
14. Sensitivity notes
15. Recommendation
16. Post-opening validation plan

Example brief language

The candidate scores 74 under the QSR Suburban Site Score v2.1. Reach and Demand are strong because the 10-minute drive-time catchment contains above-threshold target population and income for the concept. Competition reduces the score because the catchment includes a high density of direct fast-casual competitors. Accessibility is favorable, but the feasibility gate remains conditional pending ingress review and rent confirmation. The site should advance to constrained review, with cannibalization against Store 14 measured before final approval.

That paragraph is more useful than:

Site score: 74

Common mistakes

Mistake 1: Treating the score as the recommendation

The score is evidence. The recommendation is a decision.

Mistake 2: Mixing forecast and fit

A site can be strategically strong and forecast modestly. Another can forecast large and fail strategy. Keep these separate.

Mistake 3: Averaging away deal-breakers

Hard requirements belong in gates.

Mistake 4: Scoring raw variables directly

Normalize before weighting. Raw numeric ranges can dominate.

Mistake 5: Hiding weights

Weights are strategy. If they are hidden, the strategy is hidden.

Mistake 6: Double-counting correlated variables

Population, households, density, and daytime population may overlap. Competition count, saturation, and cannibalization can also overlap. Audit correlations.

Mistake 7: Changing weights after seeing the answer

Set weights before scoring candidates. Otherwise the model becomes a negotiation tool.

Mistake 8: Ignoring uncertainty

A high score with low confidence should trigger research, not approval.

Mistake 9: Treating competitors as all negative

Some competitors signal demand or create destination effects. Competition should be category-specific.

Mistake 10: Skipping validation

A model that is never compared with actual openings becomes a permanent assumption.

Mistake 11: Treating AI scoring as a methodology

AI scoring is not itself a methodology. The methodology has to be visible: what the model measures, what data it uses, how it was validated, and what it should be used for.

Mistake 12: Using one model for every format

A suburban drive-thru, an urban pickup store, a full-service clinic, and a boutique fitness studio should not share the same thresholds and weights.

FAQ

What is a site selection scoring model?

A site selection scoring model is a structured method for ranking candidate locations using criteria, weights, transformations, gates, and overlays. It helps expansion teams compare sites consistently and explain why a candidate should advance, be rejected, or require more research.

What is a location score?

A location score is a numeric summary of how well a candidate site fits an operator's site selection criteria. A defensible score should be broken into visible components so decision-makers can see why the site scored well or poorly.

What is the difference between a site score and a sales forecast?

A site score measures strategic fit. A sales forecast estimates expected performance. A site can score well and forecast small, or forecast large and score poorly. The final recommendation should use both.

What is the best site selection scoring method?

For most multi-unit operators, the best practical method is hard gates plus a transparent weighted score, forecast range, network overlays, sensitivity analysis, and post-opening validation. More complex MCDA methods like AHP, TOPSIS, ELECTRE, and PROMETHEE can help in specific governance or short-list situations.

What criteria should be included in a site selection scorecard?

Common criteria include trade area population, income, daytime population, category spend, competition, traffic, accessibility, co-tenancy, site feasibility, parking, rent, customer origins, cannibalization, and saturation. The exact criteria should vary by category and prototype.

How should site selection criteria be weighted?

Weights should reflect strategy and should be set before candidates are scored. Operators can use expert judgment, AHP, historical calibration, or a hybrid approach. Weight choices should be documented and tested for sensitivity.
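One way to test weight sensitivity is to nudge each weight up and down, renormalize, and see whether the top-ranked site changes. The sketch below does that with a 10% shift; the shift size, criteria, weights, and site scores are all hypothetical.

```python
def rank(sites, weights):
    """Return site names ordered by weighted sum of component scores, best first."""
    def total(scores):
        return sum(weights[k] * scores[k] for k in weights)
    return sorted(sites, key=lambda name: total(sites[name]), reverse=True)


def sensitivity(sites, weights, shift=0.10):
    """Shift each weight by +/- shift (relative), renormalize, record any new leader."""
    base_top = rank(sites, weights)[0]
    flips = []
    for k in weights:
        for direction in (+1, -1):
            trial = dict(weights)
            trial[k] = weights[k] * (1 + direction * shift)
            s = sum(trial.values())
            trial = {c: w / s for c, w in trial.items()}  # weights sum to 1 again
            top = rank(sites, trial)[0]
            if top != base_top:
                flips.append((k, direction * shift, top))
    return base_top, flips


# Hypothetical normalized component scores (0-1) for two short-listed sites.
sites = {
    "A": {"demand": 0.9, "access": 0.4, "competition": 0.6},
    "B": {"demand": 0.6, "access": 0.8, "competition": 0.7},
}
weights = {"demand": 0.5, "access": 0.3, "competition": 0.2}

base_top, flips = sensitivity(sites, weights)
print("baseline leader:", base_top)
for criterion, change, new_top in flips:
    print(f"{criterion} weight {change:+.0%} -> leader becomes {new_top}")
```

If small shifts flip the leader, as they do for two of the six perturbations in this made-up case, the brief should say so rather than present the ranking as settled.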

What is AHP in site selection?

AHP, or Analytic Hierarchy Process, is a multi-criteria decision method that derives weights from pairwise comparisons. It can help teams make trade-offs explicit and check consistency.
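A minimal sketch of the AHP weighting step, assuming three criteria and a hypothetical pairwise matrix. It uses the row geometric mean, a common approximation to the principal eigenvector, and Saaty's random index values to compute a consistency ratio. A ratio under about 0.10 is the conventional rule of thumb for acceptable consistency.

```python
from math import prod

# Saaty's random consistency index for small matrices.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}


def ahp_weights(matrix):
    """Approximate AHP weights from the geometric mean of each row."""
    n = len(matrix)
    gm = [prod(row) ** (1 / n) for row in matrix]
    total = sum(gm)
    return [g / total for g in gm]


def consistency_ratio(matrix, weights):
    """Estimate lambda_max from A*w / w, then CR = CI / RI."""
    n = len(matrix)
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]


# Hypothetical judgments: entry [i][j] says how much more important
# criterion i is than criterion j on Saaty's 1-9 scale.
criteria = ["demand", "access", "competition"]
pairwise = [
    [1,     3,     5],
    [1 / 3, 1,     2],
    [1 / 5, 1 / 2, 1],
]

w = ahp_weights(pairwise)
cr = consistency_ratio(pairwise, w)
for name, weight in zip(criteria, w):
    print(f"{name}: {weight:.3f}")
print(f"consistency ratio: {cr:.3f}")
```

The resulting weights can feed a weighted score directly. The consistency ratio is what makes the judgments auditable: a high value means the pairwise comparisons contradict each other and should be revisited.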

What is TOPSIS in site selection?

TOPSIS ranks candidate sites by comparing each one to an ideal site and a worst-case site. It is useful for short-list comparisons but can be sensitive to normalization.
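The steps can be sketched in plain Python. This version uses vector normalization, the usual textbook choice, and treats rent as a cost criterion. The sites, values, and weights are made up, and the function assumes no column is all zeros and that the candidates are not all identical.

```python
from math import sqrt


def topsis(matrix, weights, benefit):
    """Return a closeness score in [0, 1] for each row (site) of matrix.

    benefit[j] is True when higher is better for criterion j, False for costs.
    """
    n = len(weights)
    # Vector-normalize each column, then apply the weights.
    norms = [sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]
    # The ideal takes the best value per criterion, the worst takes the poorest.
    cols = [[row[j] for row in v] for j in range(n)]
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_ideal = sqrt(sum((x - i) ** 2 for x, i in zip(row, ideal)))
        d_worst = sqrt(sum((x - w) ** 2 for x, w in zip(row, worst)))
        scores.append(d_worst / (d_ideal + d_worst))
    return scores


# Hypothetical short list: population (thousands), traffic rating, rent per sq ft.
names = ["A", "B", "C"]
matrix = [
    [100, 8, 30],
    [80, 6, 20],
    [60, 9, 25],
]
weights = [0.5, 0.3, 0.2]
benefit = [True, True, False]  # rent is a cost, so lower is better

closeness = topsis(matrix, weights, benefit)
for name, c in sorted(zip(names, closeness), key=lambda p: p[1], reverse=True):
    print(f"{name}: {c:.3f}")
```

Because the column norms and the ideal and worst points are all computed from the current candidate set, changing the set changes every score. That is the normalization sensitivity noted above.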

What is rank reversal?

Rank reversal occurs when adding or removing a candidate changes the relative order of the candidates that were already there. It is a known failure mode in several MCDA methods and should be tested when candidate lists change during a scoring process.
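A small constructed example shows the mechanism under min-max normalization with a weighted sum. The sites, values, and equal weights are invented to trigger the effect. The point is only that a newcomer who stretches one criterion's range can flip two sites whose own data never changed.

```python
def min_max(values):
    """Rescale values to 0-1 across the current candidate set."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def score_sites(candidates, weights):
    """Min-max each criterion over the candidate set, then take a weighted sum."""
    names = list(candidates)
    n_criteria = len(weights)
    columns = [min_max([candidates[n][j] for n in names]) for j in range(n_criteria)]
    return {
        name: sum(weights[j] * columns[j][i] for j in range(n_criteria))
        for i, name in enumerate(names)
    }


weights = [0.5, 0.5]
# Hypothetical (criterion 1, criterion 2) values, both "higher is better".
before = {"A": (100, 48), "B": (80, 60), "C": (60, 40)}
after = dict(before, D=(70, 100))  # D stretches the range of criterion 2

s_before = score_sites(before, weights)
s_after = score_sites(after, weights)
print("before: B ahead of A?", s_before["B"] > s_before["A"])
print("after:  A ahead of B?", s_after["A"] > s_after["B"])
```

One possible mitigation is to normalize against fixed reference ranges rather than the current candidate list, so that adding a site cannot rescale the others.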

Should cannibalization be part of the score?

Cannibalization should usually be shown as a network overlay rather than hidden inside the core score. The core score measures site fit. The overlay shows whether the site creates net-new demand or transfers demand from existing units.

How do you validate a site selection scoring model?

Compare scores against post-opening outcomes. Track performance by score decile, forecast error, same-store impact, cannibalization, confidence level, and outcome by market type. Recalibrate when results drift.
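A minimal validation check, assuming a small table of opened stores with the score and forecast recorded at approval and the actual first-year result. The records are invented, and the sketch groups stores into three score bands rather than deciles because a handful of openings cannot fill ten bins.

```python
from statistics import mean

# Hypothetical score bands: (lower bound inclusive, upper bound exclusive, label).
BANDS = ((80, 101, "high"), (60, 80, "mid"), (0, 60, "low"))


def forecast_mape(records):
    """Mean absolute percentage error of forecast against actual results."""
    return mean(abs(r["actual"] - r["forecast"]) / r["actual"] for r in records)


def mean_actual_by_band(records, bands=BANDS):
    """Average actual performance for the stores in each score band."""
    result = {}
    for lo, hi, label in bands:
        values = [r["actual"] for r in records if lo <= r["score"] < hi]
        if values:
            result[label] = mean(values)
    return result


# Made-up openings: score at approval, forecast and actual sales in $M.
records = [
    {"score": 88, "forecast": 1.9, "actual": 2.0},
    {"score": 84, "forecast": 2.2, "actual": 1.8},
    {"score": 72, "forecast": 1.5, "actual": 1.6},
    {"score": 65, "forecast": 1.4, "actual": 1.2},
    {"score": 55, "forecast": 1.0, "actual": 1.1},
    {"score": 48, "forecast": 0.9, "actual": 0.8},
]

by_band = mean_actual_by_band(records)
mape = forecast_mape(records)
print(by_band)
print(f"forecast MAPE: {mape:.1%}")
```

If higher score bands do not show better average outcomes, or the forecast error grows from one cohort to the next, that is the drift signal that should trigger recalibration.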

When does a site scoring spreadsheet stop working?

A spreadsheet becomes risky when multiple people change weights, versions diverge, data gets stale, portfolio overlap matters, or the committee needs repeatable briefs and audit history.

Glossary

Site selection scoring model

A structured method for ranking candidate locations using criteria, weights, transformations, and rules.

Location score

A numeric summary of a candidate site's fit against the operator's criteria.

Site scorecard

A table or brief showing the score, components, weights, gates, overlays, and recommendation.

Weighted sum model

A scoring method that multiplies each normalized criterion score by a weight and sums the results.

Weighted geometric mean

A scoring method that multiplies criterion scores, each raised to the power of its weight. This limits how far one very strong criterion can compensate for a very weak one.

MCDA / MCDM

Multi-criteria decision analysis or multi-criteria decision-making, a family of methods for evaluating alternatives across many criteria.

AHP

Analytic Hierarchy Process, a method that uses pairwise comparisons to derive weights and check consistency.

TOPSIS

Technique for Order Preference by Similarity to Ideal Solution, a method that ranks options by distance from ideal and worst-case alternatives.

Gate

A hard requirement that a site must pass before it is scored or advanced.

Threshold

A minimum, maximum, or target value used to classify or filter a criterion.

Penalty

A visible score reduction applied when a risk is present but not fatal.

Normalization

The process of converting raw variables into comparable scales.

Percentile score

A score based on how a candidate ranks relative to a comparison set.

Z-score

A normalized value showing how many standard deviations a value is above or below the mean.

Forecast

An estimate of expected performance, such as sales, visits, members, patients, deposits, or orders.

Overlay

Additional context layered on top of the score, such as cannibalization, saturation, feasibility, confidence, or validation status.

Confidence level

A high, medium, or low assessment of data quality, model fit, missing inputs, and evidence strength.

Model card

A structured document describing a model's intended use, data, performance, limitations, and governance.

Sensitivity analysis

Testing how changes to weights, assumptions, or data inputs affect the ranking or recommendation.

Rank reversal

A multi-criteria decision failure mode where adding or removing an alternative changes the relative order of the alternatives that were already being compared.

Post-opening validation

Comparing model predictions and scores against actual results after a location opens.

Conclusion

A defensible site score is a structured argument.

The score shows how the site performs against strategy. The forecast estimates performance. The overlays expose network and execution risk. The recommendation explains what to do next.

Before a candidate advances, the site brief should answer six questions:

  1. Which gates did the site pass or fail?
  2. How did the site score by component?
  3. What assumptions and data sources produced the score?
  4. What forecast range applies, if any?
  5. What overlays change the decision?
  6. What evidence would prove the model right or wrong after opening?

A location score survives committee when the team can take it apart and put it back together.

That is the standard for modern site selection.

See how Geod turns scoring into a defensible site brief

Geod helps expansion teams define scoring criteria, set weights and thresholds, evaluate candidate sites, and export committee-ready briefs with maps, methodology, component breakdowns, source dates, cannibalization, saturation, feasibility, confidence, and validation context.

Evaluate gives teams a defensible way to score candidate sites. Strategize turns that scoring into a repeatable system across a portfolio, with custom weights, versioned models, batch evaluation, portfolio-aware overlays, and decision records.

The goal is not just to rank sites. It is to make every recommendation explainable enough to defend.

