8- GRIDS¶
Nearest-Neighbor Search¶
Goal: Find the data point closest to a target.
Formal Definition: Given data points \(X = {x_1, x_2, ..., x_N}\), target \(x'\), and distance function \(dist(x, y)\), find \(x_i\) that minimizes \(dist(x', x_i)\).
Difference from Binary Search: - Binary search looks for an exact match. - NNS finds the closest match.
Uses: Spatial search (nearest location), similarity search (closest value or string).
Linear Scan for NNS¶
Idea: Check every point, compute its distance to the target, and keep the closest.
Algorithm (Simplified): .. code-block:: none
best = A[0] best_dist = dist(target, best) for each point in A:
- if dist(target, point) < best_dist:
best = point
return best
Works for 1D or multidimensional data.
Complexity: O(N) time, O(1) space.
Simple but inefficient for large datasets.
Spatial Data¶
2D data represented as points:
(x, y).Use Euclidean distance: \(dist = sqrt((x_1 - x_2)^2 + (y_1 - y_2)^2)\)
Sorting by a single dimension (x or y) doesn’t work for 2D nearest-neighbor search.
Must consider all dimensions.
Grids¶
Purpose: Efficiently organize and search 2D data. - Divide space into bins (cells) indexed by (xbin, ybin). - Each bin holds multiple points.
Mapping points to bins: .. math:
x_{bin} = Floor((x - x_{start}) / x_{bin\_width})
y_{bin} = Floor((y - y_{start}) / y_{bin\_width})
Grid Structure (simplified): .. code-block:: text
Grid {
num_x_bins, num_y_bins x_start, x_end, y_start, y_end x_bin_width, y_bin_width bins (2D array of lists)
}
GridPoint: .. code-block:: text
GridPoint { Float x, Float y, GridPoint next }
Grid Operations¶
Insert Point: 1. Compute bin indices (xbin, ybin). 2. Check bounds. 3. Add new point to front of bin’s linked list.
Delete Point: 1. Locate bin. 2. Traverse linked list. 3. Remove first matching point (using approximate equality).
Approximate Equality: .. code-block:: text
- approx_equal(x1, y1, x2, y2):
return abs(x1 - x2) <= threshold and abs(y1 - y2) <= threshold
Grid-Based Search¶
Goal: Speed up nearest-neighbor search by ignoring distant bins. - Once a best distance is found, prune bins that cannot contain closer points.
Distance to a Bin: Compute the minimum distance from the target to the bin’s boundary: If this minimum > best distance, skip the bin.
Helper Function: .. code-block:: text
- MinDistToBin(g, xbin, ybin, x, y):
if out of range: return Inf compute x_min, x_max, y_min, y_max find shortest (x, y) distance to bin edges return sqrt(x_dist^2 + y_dist^2)
Key Takeaways¶
Nearest-Neighbor Search finds closest matches, not exact ones.
Linear Scan is simple but slow.
Grids divide space into regions, improving search efficiency.
Bins hold multiple points (linked lists).
Pruning reduces unnecessary checks.
Foundation for advanced spatial structures (e.g., k-d trees, quadtrees).
Helper: MinDistToBin¶
Purpose: Compute the shortest distance from a target point to a grid bin.
Logic: 1. Check if bin indices are valid — return Inf if invalid. 2. Compute bin’s min/max bounds along x and y. 3. Compare target’s coordinates to bin bounds. 4. Return shortest Euclidean distance to bin edge.
Note: Invalid bins return Inf for pruning safety, though error-throwing may be preferable. Analogy: Neighbor throws the ball just far enough to reach our yard.
Linear Scan Over Bins¶
Algorithm: GridLinearScanNN(g, x, y) 1. Initialize best_dist = Inf, best_candidate = null. 2. Loop over all bins (xbin, ybin). 3. Skip bins if MinDistToBin(…) >= best_dist. 4. For each valid bin, check all points (linked list). 5. Update best candidate if a closer point is found. 6. Return nearest neighbor.
Pros: Simple and clear. Cons: Inefficient on large or sparse grids. Supports: Targets inside or outside grid bounds. Optimization: Prunes bins with min distance > best found distance.
Expanding Search¶
Idea: Prioritize checking bins closest to the target. - Start with the bin containing the target point. - Expand outward to nearby bins (spiral/diamond pattern). - Stop once all unvisited bins are farther than current best point.
Analogy: Searching for lost keys—start nearby, expand outward.
Helper: GridCheckBin¶
Purpose: Return the closest point in a given bin within a distance threshold.
Steps: 1. Verify valid bin indices (return null if invalid). 2. Initialize best_dist = threshold. 3. Scan all points in bin; compute Euclidean distances. 4. Return closest point if distance < threshold, else null.
GridSearchExpanding¶
Algorithm: Expanding nearest-neighbor search.
Compute starting bin (xb, yb) for target; clamp to valid grid range.
Set steps = 0, explore = True.
While explore: - For each horizontal offset xoff, compute yoff. - Use MinDistToBin() to prune distant/invalid bins. - Use GridCheckBin() to find closer points. - Update best_pt and best_d if improved. - Continue until no bins within threshold remain.
Tradeoff: + Avoids checking distant bins → efficient. − More complex control logic (spiral order, bounds, stopping criteria).
Grid Size Tradeoffs¶
Large bins: Fewer bins, more points per bin → slower within-bin scans.
Small bins: Many empty bins → wasted checks.
Optimal bin size: Depends on data density and distribution.
Non-uniform grids: Adapt bin size dynamically.
Beyond 2D & Spatial Data¶
Scaling to Higher Dimensions: - For D dimensions with K bins each → K^D bins (memory cost grows fast). - Inefficient for high-D spaces (many empty bins).
Generalized Euclidean Distance: .. math:
dist = sqrt(sum(w_d * (x_i[d] - x_j[d])^2))
Applications: - Spatial: nearest store, location-based search. - Non-spatial: similarity search (e.g., coffee attributes: strength, acidity). - Use weighted distances to emphasize important dimensions.
Key Takeaways¶
MinDistToBin: Enables pruning by measuring bin proximity.
GridLinearScanNN: Simple baseline using full scan.
Expanding Search: Prioritizes closer bins → efficient pruning.
Grid Size: Critical for performance balance.
Beyond 2D: High dimensions increase complexity and memory.
Generalization: Nearest-neighbor search applies to non-spatial data.
Grids introduce multi-value bins (linked lists), laying groundwork for trees and adaptive structures.