==================== 8- GRIDS ==================== ------------------------- Nearest-Neighbor Search ------------------------- **Goal:** Find the data point closest to a target. **Formal Definition:** Given data points :math:`X = {x_1, x_2, ..., x_N}`, target :math:`x'`, and distance function :math:`dist(x, y)`, find :math:`x_i` that minimizes :math:`dist(x', x_i)`. **Difference from Binary Search:** - Binary search looks for an exact match. - NNS finds the closest match. **Uses:** Spatial search (nearest location), similarity search (closest value or string). -------------------- Linear Scan for NNS -------------------- **Idea:** Check every point, compute its distance to the target, and keep the closest. **Algorithm (Simplified):** .. code-block:: none best = A[0] best_dist = dist(target, best) for each point in A: if dist(target, point) < best_dist: best = point return best - Works for 1D or multidimensional data. - **Complexity:** O(N) time, O(1) space. - Simple but inefficient for large datasets. -------------------- Spatial Data -------------------- - 2D data represented as points: ``(x, y)``. - Use **Euclidean distance**: :math:`dist = sqrt((x_1 - x_2)^2 + (y_1 - y_2)^2)` - Sorting by a single dimension (x or y) doesn’t work for 2D nearest-neighbor search. - Must consider all dimensions. -------------------- Grids -------------------- **Purpose:** Efficiently organize and search 2D data. - Divide space into **bins (cells)** indexed by (xbin, ybin). - Each bin holds multiple points. **Mapping points to bins:** .. math:: x_{bin} = Floor((x - x_{start}) / x_{bin\_width}) y_{bin} = Floor((y - y_{start}) / y_{bin\_width}) **Grid Structure (simplified):** .. code-block:: text Grid { num_x_bins, num_y_bins x_start, x_end, y_start, y_end x_bin_width, y_bin_width bins (2D array of lists) } **GridPoint:** .. code-block:: text GridPoint { Float x, Float y, GridPoint next } -------------------- Grid Operations -------------------- **Insert Point:** 1. Compute bin indices (xbin, ybin). 2. Check bounds. 3. Add new point to front of bin’s linked list. **Delete Point:** 1. Locate bin. 2. Traverse linked list. 3. Remove first matching point (using approximate equality). **Approximate Equality:** .. code-block:: text approx_equal(x1, y1, x2, y2): return abs(x1 - x2) <= threshold and abs(y1 - y2) <= threshold -------------------- Grid-Based Search -------------------- **Goal:** Speed up nearest-neighbor search by ignoring distant bins. - Once a **best distance** is found, **prune bins** that cannot contain closer points. **Distance to a Bin:** Compute the minimum distance from the target to the bin’s boundary: If this minimum > best distance, skip the bin. **Helper Function:** .. code-block:: text MinDistToBin(g, xbin, ybin, x, y): if out of range: return Inf compute x_min, x_max, y_min, y_max find shortest (x, y) distance to bin edges return sqrt(x_dist^2 + y_dist^2) -------------------- Key Takeaways -------------------- - **Nearest-Neighbor Search** finds closest matches, not exact ones. - **Linear Scan** is simple but slow. - **Grids** divide space into regions, improving search efficiency. - **Bins** hold multiple points (linked lists). - **Pruning** reduces unnecessary checks. - Foundation for advanced spatial structures (e.g., k-d trees, quadtrees). -------------------------- Helper: MinDistToBin -------------------------- **Purpose:** Compute the shortest distance from a target point to a grid bin. **Logic:** 1. Check if bin indices are valid — return `Inf` if invalid. 2. Compute bin’s min/max bounds along x and y. 3. Compare target’s coordinates to bin bounds. 4. Return shortest Euclidean distance to bin edge. **Note:** Invalid bins return `Inf` for pruning safety, though error-throwing may be preferable. Analogy: *Neighbor throws the ball just far enough to reach our yard.* -------------------------- Linear Scan Over Bins -------------------------- **Algorithm:** `GridLinearScanNN(g, x, y)` 1. Initialize `best_dist = Inf`, `best_candidate = null`. 2. Loop over all bins `(xbin, ybin)`. 3. Skip bins if `MinDistToBin(...) >= best_dist`. 4. For each valid bin, check all points (linked list). 5. Update best candidate if a closer point is found. 6. Return nearest neighbor. **Pros:** Simple and clear. **Cons:** Inefficient on large or sparse grids. **Supports:** Targets inside or outside grid bounds. **Optimization:** Prunes bins with min distance > best found distance. -------------------------- Expanding Search -------------------------- **Idea:** Prioritize checking bins *closest* to the target. - Start with the bin containing the target point. - Expand outward to nearby bins (spiral/diamond pattern). - Stop once all unvisited bins are farther than current best point. **Analogy:** Searching for lost keys—start nearby, expand outward. -------------------------- Helper: GridCheckBin -------------------------- **Purpose:** Return the closest point in a given bin within a distance threshold. **Steps:** 1. Verify valid bin indices (return `null` if invalid). 2. Initialize `best_dist = threshold`. 3. Scan all points in bin; compute Euclidean distances. 4. Return closest point if distance < threshold, else `null`. -------------------------- GridSearchExpanding -------------------------- **Algorithm:** Expanding nearest-neighbor search. 1. Compute starting bin `(xb, yb)` for target; clamp to valid grid range. 2. Set `steps = 0`, `explore = True`. 3. While `explore`: - For each horizontal offset `xoff`, compute `yoff`. - Use `MinDistToBin()` to prune distant/invalid bins. - Use `GridCheckBin()` to find closer points. - Update `best_pt` and `best_d` if improved. - Continue until no bins within threshold remain. **Tradeoff:** + Avoids checking distant bins → efficient. − More complex control logic (spiral order, bounds, stopping criteria). -------------------------- Grid Size Tradeoffs -------------------------- - **Large bins:** Fewer bins, more points per bin → slower within-bin scans. - **Small bins:** Many empty bins → wasted checks. - **Optimal bin size:** Depends on data density and distribution. - **Non-uniform grids:** Adapt bin size dynamically. -------------------------- Beyond 2D & Spatial Data -------------------------- **Scaling to Higher Dimensions:** - For D dimensions with K bins each → `K^D` bins (memory cost grows fast). - Inefficient for high-D spaces (many empty bins). **Generalized Euclidean Distance:** .. math:: dist = sqrt(sum(w_d * (x_i[d] - x_j[d])^2)) **Applications:** - Spatial: nearest store, location-based search. - Non-spatial: similarity search (e.g., coffee attributes: strength, acidity). - Use weighted distances to emphasize important dimensions. -------------------------- Key Takeaways -------------------------- - **MinDistToBin:** Enables pruning by measuring bin proximity. - **GridLinearScanNN:** Simple baseline using full scan. - **Expanding Search:** Prioritizes closer bins → efficient pruning. - **Grid Size:** Critical for performance balance. - **Beyond 2D:** High dimensions increase complexity and memory. - **Generalization:** Nearest-neighbor search applies to non-spatial data. - Grids introduce **multi-value bins** (linked lists), laying groundwork for trees and adaptive structures.