{
 "metadata": {
  "language": "Julia",
  "name": "",
  "signature": "sha256:f4950f76220b5b71e6486bc209dd103bbc003a20c3adb92c3c7a601085fd8b6e"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Approximate Nearest Neighbour - Locality Sensitive Hashing"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### BlackNearJulia\n",
      "\n",
      "Indexes and searches vectors using a modular pipeline, the *Engine*, \n",
      "built from four different kinds of objects.\n",
      "\n",
      "<img src=\"Pipeline.png\">\n",
      "\n",
      "#### Hashes\n",
      "Given a single vector as input, hashes generate one or more bucket keys from it. During indexing, the vector is then stored in one specific bucket for each key. During search, the neighbour candidates are collected from all these buckets. \n",
      "Hashes should in general be locality-sensitive, thus preserving the spatial structure to some degree. Close vectors should be put in the same buckets. The pipeline can use one or multiple hashes at the same time.\n",
      "\n",
      "#### Storage\n",
      "Storage adapters store and return bucket contents.\n",
      "\n",
      "#### Distance\n",
      "After candidates have been collected from all the matching buckets in storage, the distance to the query vector is computed for all of them. Which distance measure is used is up to you. There are currently two distances available (euclidean and angular), but you can simply implement your own to customize what \"near\" means in your application. If your application does not need a distance measure because the locality-sensitivity of the hashes is enough, you can set distance=None in the engine constructor.\n",
      "\n",
      "#### Filter\n",
      "The last step in the pipeline is an optional **filter chain**. These filters receive lists of (vector, data) or (vector, data, distance) tuples, depending on whether the pipeline has a distance. They return lists of the same kind, usually subsets of the input. What the filters actually do is up to the implementation."
     ]
    },
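    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The four stages above can be sketched in plain Julia. This is only an illustration under stated assumptions, not BlackNearJulia's implementation: the names `toy_hash`, `toy_store!`, `toy_distance` and `toy_neighbours` are hypothetical, the hash is a stand-in that uses the sign of each component, and the filter simply keeps the N nearest candidates."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Hash (stand-in, assumption): one bit per component, 1 if the component is non-negative\n",
      "toy_hash(v) = join([x >= 0 ? \"1\" : \"0\" for x in v], \"\")\n",
      "\n",
      "# Storage: bucket key => list of vectors stored under that key\n",
      "buckets = Dict()\n",
      "\n",
      "function toy_store!(buckets, v)\n",
      "    key = toy_hash(v)\n",
      "    if !haskey(buckets, key)\n",
      "        buckets[key] = Any[]\n",
      "    end\n",
      "    push!(buckets[key], v)\n",
      "end\n",
      "\n",
      "# Distance: plain Euclidean distance\n",
      "toy_distance(a, b) = sqrt(sum((a - b) .^ 2))\n",
      "\n",
      "# Query: collect candidates from the matching bucket, attach distances,\n",
      "# then filter by keeping the N nearest candidates\n",
      "function toy_neighbours(buckets, q, N)\n",
      "    candidates = get(buckets, toy_hash(q), Any[])\n",
      "    scored = [(v, toy_distance(q, v)) for v in candidates]\n",
      "    sort!(scored, by = t -> t[2])\n",
      "    return scored[1:min(N, length(scored))]\n",
      "end\n",
      "\n",
      "toy_store!(buckets, [1.0, 2.0])\n",
      "toy_store!(buckets, [1.5, 2.5])\n",
      "toy_store!(buckets, [-1.0, 2.0])   # different sign pattern, so a different bucket\n",
      "toy_neighbours(buckets, [1.1, 2.1], 1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },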
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Example Usage"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "using BlackNearJulia\n",
      "using Distance"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Random Projection\n",
      "The random projection method of LSH is designed to approximate the cosine distance between vectors. The basic idea is to choose a random hyperplane at the outset and use it to hash input vectors: each vector is mapped to the side of the hyperplane on which it lies.\n",
      "\n",
      "A single hyperplane therefore produces only a single bit. For two vectors separated by an angle θ, the bits match with probability 1 - θ/π, so the smaller the angle between two vectors, the more likely they share a bit."
     ]
    },
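    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a rough sketch of the idea, the cell below hashes a vector against several random hyperplanes at once, producing one bit per hyperplane. It uses plain Julia only; `sign_hash` and `planes` are hypothetical names, and nothing here is taken from BlackNearJulia's own `RandomBinProjection`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Hypothetical helper: each row of `planes` is the normal vector of one random hyperplane\n",
      "function sign_hash(planes, v)\n",
      "    projections = planes * v    # one dot product per hyperplane\n",
      "    # A bit is 1 if the vector lies on the non-negative side of the hyperplane\n",
      "    return join([p >= 0 ? \"1\" : \"0\" for p in projections], \"\")\n",
      "end\n",
      "\n",
      "planes = randn(10, 5)    # 10 hyperplanes in 5 dimensions, i.e. a 10-bit key\n",
      "a = randn(5)\n",
      "key_a = sign_hash(planes, a)\n",
      "key_scaled = sign_hash(planes, 2.0 * a)    # same direction as a, so the same key"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },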
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Create a random binary hash with 10 bits\n",
      "rbin = RandomBinProjection(\"tst\", 10, 2, randn(10,5));"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ms = MemoryStorage(Dict())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 3,
       "text": [
        "MemoryStorage(Dict{String,Dict{K,V}}())"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "eng = Engine(rbin, Hamming(), ms, NearestFilter(5));"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "data = randn(10,5);"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "[store_vector!(eng, vec(data[i,:])) for i = 1:10];"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "res = neighbours(eng, rbin, [0.785316,-0.426244,-0.221896,-0.0145262,-0.089789]);"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "store_vector!(eng, [0.785316,-0.426244,-0.221896,-0.0145262,-0.059789]);"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "res = neighbours(eng, rbin, [0.785316,-0.426244,-0.221896,-0.0145262,-0.089789]);"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "filter_vector!(NearestFilter(), res)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 10,
       "text": [
        "1-element Array{(Array{Float64,1},Int64),1}:\n",
        " ([0.785316,-0.426244,-0.221896,-0.0145262,-0.059789],1)"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Experiment\n",
      "\n",
      "**BlackNearJulia** is being developed to provide two classes for experiments: *RecallPrecisionExperiment* and *DistanceRatioExperiment*. They allow you to evaluate different engine configurations, hashes, distances and filters on a custom data set.\n",
      "\n",
      "**Distance Ratio**\n",
      "\n",
      "I found that recall and precision are not good measures when it comes to ANN, because they focus on the actual vectors in the result set. In ANN I am more interested in the preservation of spatial structure and do not care too much whether the result set contains all the exact neighbours or not.\n",
      "\n",
      "So in my eyes a much better measure is the average ANN distance ratio of all the vectors in the data set. \n",
      "\n",
      "I do not know if this has been used before, but I find it to be a really good measure to determine how a certain ANN method performs on a given data set.\n",
      "\n",
      "**The distance ratio of an ANN y is its distance to the minimal hypersphere around the query vector x that contains all exact nearest neighbours n, clamped to zero and normalized by this hypersphere's radius R.**\n",
      "\n",
      "<img src=\"distance-ratio-explained.png\">\n",
      "\n",
      "This means that if the average distance ratio is 0.0, all ANNs are within the exact-neighbour hypersphere. A ratio of 1.0 means the average ANN is 2*R away from the query vector."
     ]
    },
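    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The definition can be written as max(0, d(x, y) - R) / R, where R is the largest distance from x to any of its exact nearest neighbours. The cell below is a minimal sketch of that formula in plain Julia, assuming Euclidean distance; `euclid` and `distance_ratio` are hypothetical helpers and not part of *DistanceRatioExperiment*. It also assumes R > 0, i.e. that not every exact neighbour coincides with the query."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "euclid(a, b) = sqrt(sum((a - b) .^ 2))\n",
      "\n",
      "# Hypothetical helper: distance ratio of one approximate neighbour y for query x,\n",
      "# given the exact nearest neighbours of x\n",
      "function distance_ratio(x, y, exact_neighbours)\n",
      "    # Radius of the smallest hypersphere around x that contains all exact neighbours\n",
      "    R = maximum([euclid(x, n) for n in exact_neighbours])\n",
      "    # Distance from y to that hypersphere, clamped to zero and normalized by R\n",
      "    return max(0.0, euclid(x, y) - R) / R\n",
      "end\n",
      "\n",
      "x = [0.0, 0.0]\n",
      "exact = Any[[1.0, 0.0], [0.0, 2.0]]                 # R = 2\n",
      "inside = distance_ratio(x, [0.0, 1.5], exact)       # inside the hypersphere: 0.0\n",
      "outside = distance_ratio(x, [4.0, 0.0], exact)      # 2*R away from x: 1.0"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },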
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Real World Example"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "using DataFrames"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df = readtable(\"/media/removable/DEVELOPMENT/Auswertung/measurementsInCoordinateSystem.csv\", separator='\\t');"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "const measures = array(df[4:8]);\n",
      "const labels = array(df[12]);"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
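    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Hold out about 20% of the rows as a test set\n",
      "n_test = int(length(labels) * 0.2)\n",
      "# Boolean mask with n_test false entries (the test rows), shuffled to randomize the split\n",
      "train_rows = shuffle([1:length(labels)] .> n_test)\n",
      "\n",
      "x_train, x_test = measures[train_rows, :], measures[!train_rows, :];\n",
      "y_train, y_test = labels[train_rows], labels[!train_rows];"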
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "const rbin = RandomBinProjection(\"tst\", 12, 2, randn(700,5));"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "const ms = MemoryStorage(Dict())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 7,
       "text": [
        "MemoryStorage(Dict{String,Dict{K,V}}())"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "const eng = Engine(rbin, Euclidean(), ms, NearestFilter());"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#[store_vector!(eng, vec(measures[i,:])) for i = 1:1453];\n",
      "[store_vector!(eng, vec(x_train[i,:])) for i = 1:size(x_train)[1]];"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "res = neighbours(eng, rbin, vec(measures[1000,:]))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 12,
       "text": [
        "6-element Array{(Array{Float64,1},Float64),1}:\n",
        " ([0.0,0.0,22.0,766.883,0.0],355.8462633851026)\n",
        " ([0.0,0.0,12.0,411.178,0.0],0.0)              \n",
        " ([0.0,0.0,6.0,237.693,0.0],173.58822245868552)\n",
        " ([0.0,0.0,22.0,766.883,0.0],355.8462633851026)\n",
        " ([0.0,0.0,12.0,411.178,0.0],0.0)              \n",
        " ([0.0,0.0,6.0,237.693,0.0],173.58822245868552)"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "filter_vector!(NearestFilter(), res)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 13,
       "text": [
        "5-element Array{(Array{Float64,1},Float64),1}:\n",
        " ([0.0,0.0,12.0,411.178,0.0],0.0)              \n",
        " ([0.0,0.0,12.0,411.178,0.0],0.0)              \n",
        " ([0.0,0.0,6.0,237.693,0.0],173.58822245868552)\n",
        " ([0.0,0.0,6.0,237.693,0.0],173.58822245868552)\n",
        " ([0.0,0.0,22.0,766.883,0.0],355.8462633851026)"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}
