No. | Area | Description | number of MBRs | zipped size in MB | coverage | source | used for experiments in |
M1 | L.A. | streets | 131,461 | 1.35 | 0.03 | Tiger | [BKS 93], [BSS 00], [DS 00] |
M2 | L.A. | rivers and railways | 128,971 | 1.99 | 0.22 | Tiger | [BKS 93], [BSS 00], [DS 00] |
M3 | California | streets | 1,888,012 | 16.93 | 0.12 | Tiger | [BSS 00], [DS 00] |
M4 | California | railways | 625,640 | 0.33 | 0.21 | Tiger | |
M5 | California | borders | 234,251 | 2.82 | | Tiger | |
M6 | California | hydrography | 360,330 | 4.12 | | Tiger | |
Description | format description |
d50 | ASCII file containing a sequence of 10'000'000 blank-separated triples. Each triple has the format: operation type, long key, integer payload. Operation type 1 denotes an insert, 2 an update, and 3 a delete. The first 1'000'000 operations are insertions (10% of the data set). The remaining 90% of the file is a mix of insertions, deletions, and updates. The share of each operation is encoded in the file name. For example, the file d50 consists of 1'000'000 insert operations followed by a mix of insertions (4'500'000) and deletions (4'500'000). The file u75 consists of 1'000'000 insert operations followed by a mix of insertions (2'250'000) and updates (6'750'000). Note that the default payload value is 0; replace it as needed for your purpose. In our experiments we replaced it with a 16-byte payload. Version numbers are generated while reading the file line by line. |
u0 | |
u25 | |
u50 | |
u75 | |
u100 |
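The triple format described above can be read with a few lines of Python. This is a minimal sketch, not the loader used in the experiments: the names `Operation`, `parse_operations`, and `count_operations` are hypothetical, and the version number is assumed to be a running counter starting at 1, since the description only says that versions are generated while reading the file.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple

# Operation codes as given in the format description:
# 1 = insert, 2 = update, 3 = delete.
OPERATIONS = {1: "insert", 2: "update", 3: "delete"}


class Operation(NamedTuple):
    kind: str      # "insert", "update" or "delete"
    key: int       # long key from the file
    payload: int   # integer payload (0 by default in the files)
    version: int   # ASSUMPTION: running counter, starting at 1


def parse_operations(lines: Iterable[str]) -> Iterator[Operation]:
    """Yield one Operation per non-empty line of blank-separated triples."""
    version = 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # tolerate empty lines, e.g. at the end of the file
        op_code, key, payload = (int(field) for field in fields)
        version += 1
        yield Operation(OPERATIONS[op_code], key, payload, version)


def count_operations(lines: Iterable[str]) -> Counter:
    """Count how often each operation type occurs, e.g. to check the mix."""
    return Counter(op.kind for op in parse_operations(lines))
```

Because `parse_operations` accepts any iterable of lines, an open text file can be passed directly, for example `count_operations(open("d50"))`, which makes it easy to check the operation mix implied by the file name.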
Description | number of results | format description | zipped size in MB |
M2 & M1 | 85,854 | M2.ID M1.ID (the MBRs of M2 are numbered starting with 10,000,000, the MBRs of M1 with 0) | 0.37 |
M3 & M3 | 9,784,072 |
Description | format description | zipped size in MB |
20-nearest neighbors for each element in M2 | M2.ID M1.ID k 'Euclidean distance' (the centers of the MBRs were used for the computation, the MBRs of M2 are numbered starting with 10,000,000, the MBRs of M1 with 0) | 33.79 |
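A line of the nearest-neighbor file could be parsed roughly as follows. This is a sketch under stated assumptions: the fields are assumed to be whitespace separated with one record per line, `k` is assumed to be the rank of the neighbor, and the names `Neighbor`, `parse_neighbor`, and `is_m2_id` are hypothetical. Only the field order and the ID offset of 10,000,000 come from the table.

```python
from typing import NamedTuple

# From the format description: M2 IDs start at 10,000,000, M1 IDs at 0.
M2_ID_OFFSET = 10_000_000


class Neighbor(NamedTuple):
    m2_id: int        # ID of the query MBR from M2
    m1_id: int        # ID of the neighboring MBR from M1
    k: int            # ASSUMPTION: rank of this neighbor
    distance: float   # Euclidean distance between the MBR centers


def parse_neighbor(line: str) -> Neighbor:
    """Parse one 'M2.ID M1.ID k distance' record (assumed whitespace separated)."""
    m2_id, m1_id, k, distance = line.split()
    return Neighbor(int(m2_id), int(m1_id), int(k), float(distance))


def is_m2_id(mbr_id: int) -> bool:
    """Tell M2 IDs from M1 IDs using the documented numbering offset."""
    return mbr_id >= M2_ID_OFFSET
```

The offset check works because M1 contains far fewer than 10,000,000 MBRs, so the two ID ranges cannot overlap.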
Description | format description | size in MB |
USA-data | Contains the minimum bounding rectangles of all streets from the TIGER files, 72 million rectangles in total.
The file is in Hadoop sequence file format with <NullWritable, DoublePointRectangle> key-value pairs.
For convenience, we also provide a plain data set consisting of rectangles in the format <xlow, ylow, xhigh, yhigh>, each coordinate stored as an 8-byte double-precision floating-point value. The file can be obtained from here. |
3338 |
E-USA-data | Extended USA data set, composed of four copies of USA-data, each obtained by translating the original data set by one of the following vectors: (0.0, 0.0), (75.5, -33.9), (0.0, -33.9), (75.5, -3.9).
A plain file can be obtained from here (see USA-data for the file formats). | 13414 |
qr1 | Query point data set consisting of 722,261 points, obtained by taking the center point of every 100th rectangle from USA-data. The files are in plain format: sequential point data with <x, y> coordinates, each coordinate stored as an 8-byte double-precision floating-point value. | 22 |
qr2 | Query rectangle data set consisting of 722,226 square rectangles, each returning 100 results on average. The file format is the same as for qr1. | 2.2 |
qr3 | Query rectangle data set consisting of 22,856 square rectangles, each returning 1000 results on average. The file format is the same as for qr1. | 0.7 |
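The plain binary files described in this table can be decoded with Python's `struct` module. The following is a minimal sketch, not an official reader: the function names are hypothetical, and the byte order is an assumption. Big-endian is used as the default because that is what Java's data streams write, but the table does not state it, so pass `"<"` if the values look wrong.

```python
import struct
from typing import BinaryIO, Iterator, Tuple


def read_records(stream: BinaryIO, doubles_per_record: int,
                 byte_order: str = ">") -> Iterator[Tuple[float, ...]]:
    """Yield fixed-size records of 8-byte doubles from a binary stream.

    ASSUMPTION: byte_order defaults to big-endian (">"); use "<" for
    little-endian files.
    """
    record = struct.Struct(f"{byte_order}{doubles_per_record}d")
    while True:
        chunk = stream.read(record.size)
        if not chunk:
            return  # clean end of file
        if len(chunk) != record.size:
            raise ValueError("file ends with a truncated record")
        yield record.unpack(chunk)


def read_rectangles(stream: BinaryIO, byte_order: str = ">"):
    """Rectangles are stored as <xlow, ylow, xhigh, yhigh>: 4 doubles, 32 bytes."""
    return read_records(stream, 4, byte_order)


def read_points(stream: BinaryIO, byte_order: str = ">"):
    """Points are stored as <x, y>: 2 doubles, 16 bytes."""
    return read_records(stream, 2, byte_order)
```

Both readers are generators, so even the large rectangle files can be streamed without loading them into memory, for example `for xlow, ylow, xhigh, yhigh in read_rectangles(open(path, "rb")): ...`, where `path` stands for the downloaded plain file.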