8- GRIDS

Linear Scan for NNS

Idea: Check every point, compute its distance to the target, and keep the closest.

Algorithm (Simplified): .. code-block:: none

best = A[0] best_dist = dist(target, best) for each point in A:

if dist(target, point) < best_dist:

best = point

return best

  • Works for 1D or multidimensional data.

  • Complexity: O(N) time, O(1) space.

  • Simple but inefficient for large datasets.

Spatial Data

  • 2D data represented as points: (x, y).

  • Use Euclidean distance: \(dist = sqrt((x_1 - x_2)^2 + (y_1 - y_2)^2)\)

  • Sorting by a single dimension (x or y) doesn’t work for 2D nearest-neighbor search.

  • Must consider all dimensions.

Grids

Purpose: Efficiently organize and search 2D data. - Divide space into bins (cells) indexed by (xbin, ybin). - Each bin holds multiple points.

Mapping points to bins: .. math:

x_{bin} = Floor((x - x_{start}) / x_{bin\_width})
y_{bin} = Floor((y - y_{start}) / y_{bin\_width})

Grid Structure (simplified): .. code-block:: text

Grid {

num_x_bins, num_y_bins x_start, x_end, y_start, y_end x_bin_width, y_bin_width bins (2D array of lists)

}

GridPoint: .. code-block:: text

GridPoint { Float x, Float y, GridPoint next }

Grid Operations

Insert Point: 1. Compute bin indices (xbin, ybin). 2. Check bounds. 3. Add new point to front of bin’s linked list.

Delete Point: 1. Locate bin. 2. Traverse linked list. 3. Remove first matching point (using approximate equality).

Approximate Equality: .. code-block:: text

approx_equal(x1, y1, x2, y2):

return abs(x1 - x2) <= threshold and abs(y1 - y2) <= threshold

Key Takeaways

  • Nearest-Neighbor Search finds closest matches, not exact ones.

  • Linear Scan is simple but slow.

  • Grids divide space into regions, improving search efficiency.

  • Bins hold multiple points (linked lists).

  • Pruning reduces unnecessary checks.

  • Foundation for advanced spatial structures (e.g., k-d trees, quadtrees).

Helper: MinDistToBin

Purpose: Compute the shortest distance from a target point to a grid bin.

Logic: 1. Check if bin indices are valid — return Inf if invalid. 2. Compute bin’s min/max bounds along x and y. 3. Compare target’s coordinates to bin bounds. 4. Return shortest Euclidean distance to bin edge.

Note: Invalid bins return Inf for pruning safety, though error-throwing may be preferable. Analogy: Neighbor throws the ball just far enough to reach our yard.

Linear Scan Over Bins

Algorithm: GridLinearScanNN(g, x, y) 1. Initialize best_dist = Inf, best_candidate = null. 2. Loop over all bins (xbin, ybin). 3. Skip bins if MinDistToBin(…) >= best_dist. 4. For each valid bin, check all points (linked list). 5. Update best candidate if a closer point is found. 6. Return nearest neighbor.

Pros: Simple and clear. Cons: Inefficient on large or sparse grids. Supports: Targets inside or outside grid bounds. Optimization: Prunes bins with min distance > best found distance.

Helper: GridCheckBin

Purpose: Return the closest point in a given bin within a distance threshold.

Steps: 1. Verify valid bin indices (return null if invalid). 2. Initialize best_dist = threshold. 3. Scan all points in bin; compute Euclidean distances. 4. Return closest point if distance < threshold, else null.

GridSearchExpanding

Algorithm: Expanding nearest-neighbor search.

  1. Compute starting bin (xb, yb) for target; clamp to valid grid range.

  2. Set steps = 0, explore = True.

  3. While explore: - For each horizontal offset xoff, compute yoff. - Use MinDistToBin() to prune distant/invalid bins. - Use GridCheckBin() to find closer points. - Update best_pt and best_d if improved. - Continue until no bins within threshold remain.

Tradeoff: + Avoids checking distant bins → efficient. − More complex control logic (spiral order, bounds, stopping criteria).

Grid Size Tradeoffs

  • Large bins: Fewer bins, more points per bin → slower within-bin scans.

  • Small bins: Many empty bins → wasted checks.

  • Optimal bin size: Depends on data density and distribution.

  • Non-uniform grids: Adapt bin size dynamically.

Beyond 2D & Spatial Data

Scaling to Higher Dimensions: - For D dimensions with K bins each → K^D bins (memory cost grows fast). - Inefficient for high-D spaces (many empty bins).

Generalized Euclidean Distance: .. math:

dist = sqrt(sum(w_d * (x_i[d] - x_j[d])^2))

Applications: - Spatial: nearest store, location-based search. - Non-spatial: similarity search (e.g., coffee attributes: strength, acidity). - Use weighted distances to emphasize important dimensions.

Key Takeaways

  • MinDistToBin: Enables pruning by measuring bin proximity.

  • GridLinearScanNN: Simple baseline using full scan.

  • Expanding Search: Prioritizes closer bins → efficient pruning.

  • Grid Size: Critical for performance balance.

  • Beyond 2D: High dimensions increase complexity and memory.

  • Generalization: Nearest-neighbor search applies to non-spatial data.

  • Grids introduce multi-value bins (linked lists), laying groundwork for trees and adaptive structures.