Collaborative project idea? Mindfind - mental Google search

There have been previous threads about having a collaborative project to work on, and I was thinking about an idea I’d had for a while now: using MMI to search the web like you normally would on Google, but without typing in search terms. Maybe if some of you are up for it we could flesh out ideas in this thread?

https://mindfind.net/

At the moment this is just a basic site that fetches 10 URLs from a large database of sites crawled from around the world. I’ll fill in the technical details later, but as a springboard for discussion I can envision needing to discuss design around:

  • The structure of the corpus that contains the data (URLs, metadata, extracted text content etc.) so it’s suited towards being searched with an MMI algorithm.
  • What kind of MMI algorithm(s) could be good for this. I haven’t given it a lot of thought yet past a 1D random walk up and down the list of URLs. (That isn’t supported by the current system I’m using to store the data, and the one “random sampling” feature it did provide – allowing me to plug in a simple RandUniform value – is heavily biased, as you can see: the same URL ends up getting repeated multiple times in the same page load!)
  • How we engineer it all together.
  • And more…

I can provide technical ideas, algorithms and modeling, but not programming (not my area).

As a start:
A 1D random walk is one way to provide a mentally-influenceable list search. However, it’s not easy to get enough resolution to uniquely map to a list with perhaps millions of elements. The terminal point in a random walk is approximately normally distributed with a standard deviation (SD) of the square root of N, where N is the number of steps in the walk. The resulting terminal position must be linearized by converting it to a z-score (divide the coordinate by the SD) and converting z to its cumulative normal distribution probability, p. An MMI generator with a rate of 100,000 bps will provide 20,000 bits in 0.2 seconds (a reasonable mind interaction time). The SD of the terminal point would be 141.42 and the range from lowest to highest is about 6 SD, or about 850 unique points. Clearly this is not adequate for a list that may have hundreds to thousands of times that number of elements. This is one of the issues that would have to be solved, possibly by a two-step search process (a coarse level followed by a fine level). Modeling would be required to test this possibility. Note, this is an algorithm that would be usable for many other applications requiring high-precision floating-point mentally-influenceable results in one or more dimensions.
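To make the resolution arithmetic concrete, here is a quick sketch (JavaScript; the numbers come straight from the paragraph above):

```javascript
// Resolution of a single random walk at a given bit budget.
const bits = 100000 * 0.2;    // 100 kbps generator, 0.2 s interaction -> 20,000 steps
const sd = Math.sqrt(bits);   // SD of the terminal point, ~141.42
const uniquePoints = 6 * sd;  // ~849 points across the ~6 SD usable range
```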

It would be cool if we used Speedie (my language). If you can wait a week, I could put together a package that we can use.

It’s a great language for unifying things because it works on many levels. It’s FAST, and easy to use… so the data-processing stuff won’t bog you down like Python, and you won’t go mad dealing with C++/C kinds of stuff.

It really is fast – both at creating code that runs fast, and at compiling that code. It could be made faster too; it currently relies on gcc/clang, but I think that could be optimised.

We could also build some kinda standard libs for statistics and randomness.

Maybe call it the MMI lib?

Also, I’ll have a lot more time to help if it’s in Speedie.

Maybe, for a start, we should run some kind of topic clustering across the sites? And/or topic embeddings?
I bet MMI will work better if we reduce the number of dimensions.

The total number of records I’m playing with at the moment is 3,464,537,205. I’m using data that’s freely available from the CommonCrawl.org project. Basically this group crawls the web regularly and makes the data available for free. It’s hosted as part of Amazon’s Open Data program. That 3.5 billion is just from the crawl done during April 2021.

The data consists of metadata about the pages/content crawled, the actual data itself and raw text extracts. I’m working with just the metadata to have access to a list of URLs. It’s stored on AWS S3 but can be accessed via SQL using AWS Athena.

@ScottWilber - have you ever tried a multi-step approach (so not just coarse and fine, but multiple steps)? 3.5 billion is a large number. I’m not sure whether to leave the data as is (and leave the heavy lifting to the MMI algorithm) or maybe just re-structure it a little bit, or perhaps modify it/filter it/clusterize it. This could be a massive undertaking in its own right, depending on what engineering approach we take.

What’s a good way of determining a suitable N for a RW?

@wizbiz Maybe Speedie could be used here one day, not sure. For now I’m going to be focusing more on discussion and design rather than coding. Parking Warden sounds like a good first demo of it though.

3.4 billion is a truly huge number – double precision numbers are needed to represent it. That may be the reason you were getting multiple or duplicate returns, a double precision algorithm is required. I have never tried to make an MMI selector with anything like that precision.

I will make a start at modeling multi-level MMI indexing to get an idea what is possible. Beyond the mathematical and algorithmic requirements, I can predict it will be very hard to get meaningful MMI results when trying to obtain so many bits of information (32 bits for an exact single “hit” in 3.4 billion), especially if the domains are randomly ordered, that is, no order at all. To see how linearity impacts the results, consider a linear array representing a one-dimensional location. If the desired position is .80215 times the length, then a returned value of 0.8000 is still very close. If the coordinates are randomized, the adjacent positions could be anywhere along the entire length, so “close” is no longer meaningful. This idea of “closeness” applies broadly to MMI applications – we want a close hit to be actually close.

Sounds like the CommonCrawl data doesn’t provide categories or any type of annotation, which would have been very helpful. Do they include keywords? Any post processing on your part would probably be some type of algorithm for sorting into categories. An alternative, which might be easier to program though probably with diminished performance, would return a block of domains at the first level, and then select one that could be most interesting based on some criteria that would have to be defined. However, as a demonstration we would probably want to use directly available data.

There are many possible algorithms for using random walks. But, since random walks are fundamentally based on integers, one would like the number of unique integers returned by the processing algorithm to be at least as large as the number of items in the list, assuming they are uniformly distributed. The standard deviation of a random walk is the square root of N, and the range of integers available is at most about 6√N (which includes about 99.7% of normally distributed data). However, this pushes the algorithm to the limit, so the results may be somewhat grainy in the tails (those portions of the distribution near the upper and lower limits). Perhaps a better limit would be 4√N: 4 times the square root of N = n, the number of possible integers output (after rounding). Therefore, N = (n/4)^2. Example: for 1024 unique output integers, use N = 65,536. Note, for the 3.46 billion items in a list, N would be 7.5 x 10^17, a completely unachievable number.
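A minimal sketch of that rule of thumb (the function name is mine):

```javascript
// Steps needed so a walk's linearized output spans n unique integers,
// using the 4*sqrt(N) rule above: n = 4*sqrt(N)  =>  N = (n/4)^2.
function stepsForRange(n) {
  return Math.pow(n / 4, 2);
}
stepsForRange(1024);    // 65,536 steps
stepsForRange(3.46e9);  // ~7.5e17 steps: completely unachievable, as noted
```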

It’s fairly easy to make an algorithm or system that appears to work and will satisfy many people. As a personal principle, I pursue implementations that I know and understand have a real chance of working as advertised. I mention this only because there are plenty of commercially successful products that don’t work at all as described. MMI is a subtle and difficult challenge, but it does actually work. It’s just not easy to get it to do what we want it to.

I ran some simulations using random walks to provide a linear index. What I found was:

  • 4 SD converted to linear integers doesn’t give anything close to uniformly distributed numbers. 1 SD is better: 1000 repeated tests give numbers that pass a Kolmogorov–Smirnov test against the uniform distribution; however, I could still see some structure in the data.
  • A 10,000-step RW could provide about 100 unique integers as an index, but with only fair uniformity. A two-stage test using a total of 20,000 MMI bits would provide a unique 10,000-line index.
  • More traction is gained by using many smaller walks: 20 walks of 1024 bits each, about the same total number of bits, would give 32^20 or 1.27 x 10^30 nearly unique indices.

Something in between these two extremes would likely be optimal. I haven’t simulated this yet, and there is still the question of the quality of the resulting index. Very careful testing is always required, and the simplest approach may not yield good enough linearity, but the results are suggestive.
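The resolution arithmetic behind those findings, as a sketch (assuming roughly 1 SD of usable levels per walk, per the simulations above):

```javascript
// Unique levels per walk at ~1 SD, and the combined resolution of many walks.
const levels = (N) => Math.round(Math.sqrt(N)); // ~sqrt(N) usable integers
levels(10000);               // ~100 unique integers from one 10,000-step walk
Math.pow(levels(1024), 20);  // 32^20 ~ 1.27e30 combined indices from 20 walks
```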

Thanks for running those simulations, I’m starting to digest what you’ve written above.

We have access to the raw data: metadata, the raw crawled content (i.e. a download of the pages and their contents directly, graphics etc.), and a simple text extract (HTML tags removed).
I’d have to check about keywords. Technically we could expect to find <meta name="keywords" content="xxxx"> tags inside the pages and could extract them from that. But then “grouping by keywords” becomes a mammoth project in its own right, with so many degrees of freedom.
If we continue with the approach of a 1D list of URLs, then we could manipulate that list to be in order (e.g. alphabetical). This, however, could take a very long time and make the time it takes for results to be presented to the user impractical. I’ll continue to think about efficient and effective ways to sort and deliver the data.

Now I have run a full simulation that provides 10 billion unique integers for an index. A sample of 10,000 of these passes the K-S test for uniformity, so I’m confident it’s pretty good statistically. It uses 10 walks of 1999 bits each for a total of 19,990 bits, or about 0.2 seconds with a 100Kbps generator. This structure is by no means optimized, which will take a lot more thought and testing. Keep in mind that MMI is not just a mathematical algorithm. Billions of degrees of freedom is quite a stretch for any MMI application, but it will definitely function (mathematically) as a general approach for any program that requires higher resolution from an MMI source.

I will have to work on simplifying the algorithm before sharing, because it would be a bit long in a lower-level language like C++. I also want to consider optimization.

Putting the data into some kind of order would make a big improvement in MMI performance. However, that’s much harder than it may sound. Alphabetization based on a primary keyword is a start, but the type of order we would prefer would be based on linearization of meaning or abstract concepts the user would recognize. That way, a close hit would still have a similar meaning relative to the user’s intention. This is a general concept for any search of this type, but it would probably require some pretty sophisticated AI program to achieve.

I suspect there is a lot of work already done in this regard because a good search engine already has some of this type of optimization. Then the user must do the fine tuning by looking at the returned hits. There may be a way to piggyback on a search engine by providing a search term chosen by MMI algorithm, and then narrow it further using a second-layer MMI algorithm.

The first step is always to use the data just as is to see how it works, then perhaps use a list alphabetized by top keyword. However, as you noted, that’s still a bit of an exercise in programming, and we are just discussing possibilities. There are always other approaches to explore.

Words can probably be semantically ordered with something like this: GitHub - anvaka/word2vec-graph: Exploring word2vec embeddings as a graph of nearest neighbors

As a first-stage simplified test I set up the following on the Mindfind site:

  • I’m Feeling Truly Lucky button → obtains 4 different search terms from 4x MED generators and plugs them into a Google search
  • I’m Feeling Pseudo Lucky button → same as above but uses pseudorandom numbers to feed the random walk (for comparative baseline purposes)

The search terms are drawn from a corpus of 1024 words, alphabetically sorted. These words are the most frequently used English words. This is the first part of simplifying for initial testing purposes: have the words grouped so ‘close’ results will indeed appear ‘close’ (because the first few letters will be similar, not necessarily their meanings – a topic for another day). The second simplification is avoiding a multi-step approach (which I know you’re working on now – thank you).

The first 3 search terms come from 3x MED100Kx8s and the last one comes from a MED100KX. Each word’s derivation (index calculation) is handled independently of the others. The current amount of entropy (8192 bytes) I’m reading in from the generators takes about 0.6 secs, so I understand that’s ~3x more than the recommended duration of 200 milliseconds – hopefully okay in this type of testing?

Before I progress further I’d like to check that the math and implementation are right. I’ve attached a JavaScript and a C# version (the C# version is 1D too, and based on @D0ublezer0’s 2D implementation from your discussions with him in the Fatum Project thread). Both scripts output the same results, so they should be implemented in the same way.

  • N (number of steps in the random walk) is 65,536
  • Standard deviation is 256 (SD = √𝑁)
  • n = 1024 (4 * √𝑁)
  • Read 8192 bytes from the MED (multiplying it by 8 gives us the 65,536 bits we need for N)
  • 𝑜𝑛𝑒𝑠 (number of set bits) in the 8192 bytes is 32,976
  • 𝐶𝑇 (the terminal point’s coordinate in 1D) is 416 (𝐶𝑇 = (2 × 𝑜𝑛𝑒𝑠) − 𝑁)
  • Yielding z-score of 1.625 (𝑧−𝑠𝑐𝑜𝑟𝑒 = 𝐶𝑇 / √𝑁)
  • In turn, the z-score linearized gives a cumulative distribution probability p of 0.9479187303997938 – formula omitted from here as referenced in the Fatum Project thread and your Guidelines and Design Examples for Mind-Enabled Applications Part 1 (June 1, 2020) paper’s Appendix.
  • Lastly, multiplying p by n gives us the rounded index of 971, which is the integer used to extract the 971st word in the 1024 word list
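For reference, here is what those steps might look like in JavaScript (a minimal sketch rather than the attached implementation; the erf approximation is the standard Abramowitz–Stegun formula, which may differ from the one in the paper):

```javascript
// Abramowitz & Stegun 7.1.26 approximation of erf(x), accurate to ~1e-7.
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  x = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * x);
  const y = 1 - (((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
            - 0.284496736) * t + 0.254829592) * t) * Math.exp(-x * x);
  return sign * y;
}

// Standard normal cumulative distribution probability.
function cdf(z) {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// bytes: a Uint8Array of 8192 bytes read from the MED generator.
function wordIndex(bytes, n = 1024) {
  const N = bytes.length * 8;   // 8192 bytes -> 65,536 walk steps
  let ones = 0;
  for (let b of bytes) {
    while (b) { ones += b & 1; b >>= 1; } // popcount per byte
  }
  const CT = 2 * ones - N;      // terminal coordinate of the walk
  const z = CT / Math.sqrt(N);  // z-score
  const p = cdf(z);             // linearized cumulative probability
  return Math.round(p * n);     // rounded index into the 1024-word list
}
```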

Do this implementation and the actual values look correct? It’s been many years since I’ve asked somebody to “please check my math”, but like you say programming isn’t your area; math isn’t my strong suit (but I’ve never had more motivation to actually learn stuff I did decades ago!)

If so, would it be reasonable to notice some effect size in the words selected? At least in the first 3, as those generators are 101x bias-amplified. One pattern I did tend to see was pairs of ‘close’ words with the 3rd being something different (e.g. practice person slow – this was before I added the 4th term). I think I saw ‘mind’ and ‘leading’ appear in pairs once each. I think the probability of that would be something small like 0.00009536%.

Linearlize.cs.txt (1.3 KB)
linearize.js.txt (1.9 KB)

The math and the results of the calculations all seem correct based on the numbers shown. The line of code with the variable y can be deleted since it is not used. The results are exact to all 16 digits of p, so the functioning of the algorithm is confirmed. As a note, the actual accuracy of the approximation is only 7+ decimal digits at this value. That’s way more than enough for this index generation, but it will inject some error if the index is in the billions. A more exact calculation might be required at that point, but it’s not a concern now.

1024 is definitely too large an index for the number of bits in your RW. I suggested previously one could use multiples of the SD as the number of linearized output points, but that was before I did some modeling. Actually, 1 SD is about the max, and even then there is some nonlinearity in the center of the range. This shows up in the Fatum Project plot using 10,000 input bits and an output index range of 100 (1 SD). To get good linearity would require about 16,000,000 bits in each walk for your 1024 index range – about SD/4.

16 million bits is unrealistic, but I am now fairly confident in the multi-stage approach, having run dozens of simulations and variations. The 1024 index should be possible with a two-stage algorithm using a total of 32,768 bits – about 0.33 seconds at 100Kbps.

The following algorithm should output 1024 unique index points (0-1023). It requires two random walks using 16,384 bits each. The two walks can be calculated consecutively, but I suspect the MMI results will be better if they are calculated concurrently by alternately taking small blocks of data from the MMI generator and accumulating them in the two counts of ones. That way the effect of intention on the coarse and the fine blocks will happen at the same time.
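A sketch of that concurrent accumulation (the block size is an arbitrary illustrative choice):

```javascript
// Alternate small blocks from one MMI byte stream into two ones-counts,
// so both walks sample the same window of intention.
function interleavedOnes(bytes, blockSize = 64) {
  let ones1 = 0, ones2 = 0, toFirst = true;
  for (let i = 0; i < bytes.length; i += blockSize) {
    let blockOnes = 0;
    for (let j = i; j < Math.min(i + blockSize, bytes.length); j++) {
      let b = bytes[j];
      while (b) { blockOnes += b & 1; b >>= 1; } // popcount per byte
    }
    if (toFirst) ones1 += blockOnes; else ones2 += blockOnes;
    toFirst = !toFirst;
  }
  return [ones1, ones2]; // each feeds CT = 2*ones - N for its walk
}
```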

(Variables:) mm = 32; N = 16384; (N is the number of bits in each RW)
index = mm Floor[mm (zs1 = CT1/Sqrt[N]; cdf[zs1])] + Floor[mm (zs2 = CT2/Sqrt[N]; cdf[zs2])]

Simplified code:
index = mm Floor[mm p1] + Floor[mm p2]
p1 and p2 are the two linearized probability outputs. The max output value is mm*mm-1=1023. Add 1 if the desired range is 1 to 1024.
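In JavaScript that might look like the following (a sketch; the clamps guard against the edge case p = 1.0):

```javascript
// Two-stage index: coarse and fine linearized probabilities combine
// into one index in 0 .. mm*mm - 1 (0..1023 for mm = 32).
function twoStageIndex(p1, p2, mm = 32) {
  const coarse = Math.min(mm - 1, Math.floor(mm * p1));
  const fine   = Math.min(mm - 1, Math.floor(mm * p2));
  return mm * coarse + fine;
}
```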

I have tried many different sampling periods, up to about 1 second or even more. My experience is that one second is too long to expect to hold peak mental focus of intention, even though it seems pretty short. I prefer about half a second or less, but less than 0.2 seconds also seems to waste an optimal sampling of the peak focus. Most of that testing was related to a single bit of information, which may be different from the more abstract concept being discussed here. I suggest using these time frames as a guide to a starting point. Real-world testing is required to see how this kind of MMI system works.

Would you expect any difference if I were to do that in parallel with 2 generators?
I.e. one MED for the coarse stage (p1) and another MED for the finer stage (p2).

Which leads me on to a more metaphysical question that’s been in the back of my mind for months now. Despite very little still being known about the mechanism of action behind MMI, for multi-dimensional extraction of intents from entropy, what is the current thinking on how “the field” (or whatever terminology refers to it – I remember Jamal referring to it as the “intelligence field” at some point) filters/sorts between different algorithmic approaches, like ‘every even bit for the 1st dimension (x), every odd bit for the 2nd dimension (y)’ or ‘every coarse bit from generator #1, every fine bit from generator #2’, etc.?

Is this
index = (mm * Floor(mm * p1) ) + Floor(mm * p2)
or
index = (mm * ( Floor(mm * p1) ) + Floor(mm * p2) )

I’m inferring the usage of [] is similar to () in the context of calling functions like Floor in Mathematica. It’s been 20 years since I did a summer course at university in Mathematica (which was a brilliant course because it taught me the basics of binary/floating-point arithmetic in my computer science programme).

A simple run of this shows it’s generating indexes > 1023 so I’m assuming the first is the correct one.

There are two primary functions of square brackets in Mathematica:

  • list[[i]] – double brackets usually enclose an index to a list
  • Floor[x] – single brackets usually enclose the argument of a function

Parentheses usually indicate the operations enclosed within them are performed prior to external operations – pretty much the same as in most languages.

I think the execution precedence would make
index = (mm * Floor(mm * p1) ) + Floor(mm * p2) = mm * Floor(mm * p1) + Floor(mm * p2), and
index = (mm * ( Floor(mm * p1) ) + Floor(mm * p2) ) = mm * Floor(mm * p1) + Floor(mm * p2)
both the same. However, I’m not familiar with C#.

I believe in C#, Math.Floor(mm * p) is equivalent to Floor[mm * p] in Mathematica, though in Mathematica the arguments don’t have to be declared as double. Since mm is an integer, the product probably has to be cast to double in C#.
mm * Floor[mm * p1] means: mm is first multiplied by p1, then the Floor operation is performed on their product, and finally the result of the Floor operation is multiplied by mm to produce the first term. After the same operations are completed for the second term, the two are added together.

I have no particular reason to expect a different result if you use 2 parallel generators to generate the two random walks. Except, of course, the generation time would be cut in half. Then I would want N to be adjusted so the interaction period is at least 0.2 seconds. In this algorithm it’s always possible to increase N without negative impact.

As always with subtle variations in MMI systems, real world testing is the final way to judge. One could imagine using two generators separated by some distance, and there might be some difference, but I really don’t know.

Some authors have noted that MMI is not generally defeated or much altered by the complexity of the system measuring it. This is borne out by logic, i.e., MMI generators (or REGs) are too complex and fast for any possible conscious following of the generation process. Moreover, a user does not intend to follow the inner processes of the generator when they try to affect the outcome. Therefore, it seems unlikely the added complexity of using two generators, or using a particular pattern of selection from a single generator, will have a notable effect on the outcome.

That being said, I have noted many times when a new method of processing is first tested there seems to be a learning curve to reach previous levels of effect size. This is not a rigorously tested theory, but a personal observation. It may suggest there is a period of adapting to the new process.

In terms of the underlying mechanisms of MMI, there is a distinct appearance of a quantum mechanical principle being involved. For example, consider a user’s intention as one set of information and the observed outcome as a second set of information. The information can be as simple as a single bit, which is what most of the testing in the published literature is based on. The outcome either matches (a hit) or does not match (a miss). From a quantum mechanical perspective, a hit and a miss are in a superposition until they are measured, that is, observed in the mind of the user. Then the quantum state collapses to a determined state. There are a couple of mechanisms by which the intention and the resulting outcome can be connected quantum mechanically within the brain/mind of the user. The linkage provides a better-than-classical chance of getting a hit, for example. One mechanism is called quantum pseudo-telepathy. This is a real phenomenon where no information is transmitted, but there is a shared physical system in an entangled quantum state.

MMI Index Generator for Very Large Index Numbers

I designed an index generator for a very large index to demonstrate the possibility and test the statistical quality. My target was 3.464 billion lines to correspond with the suggested MMI Mindfind project.

The best selection, at least from a statistical perspective, is to use:
nl = 9
N = 5185, i.e. (8 * nl)^2 + 1 to make the number odd
in a 10-stage generator (9^10 = 3,486,784,401: just above the 3.464 billion target resolution).

Writing out the full equation:
index = (nl^9) * Floor[nl * p1] + (nl^8) * Floor[nl * p2] + (nl^7) * Floor[nl * p3] +
(nl^6) * Floor[nl * p4] + (nl^5) * Floor[nl * p5] + (nl^4) * Floor[nl * p6] +
(nl^3) * Floor[nl * p7] + (nl^2) * Floor[nl * p8] + nl * Floor[nl * p9] + Floor[nl * p10]

p1 to p10 are the 10 linearized probabilities produced from 10 random walks of 5185 steps each, for a total of 51,850 MMI bits (0.519 seconds generation time at 100Kbps).

The raw index must be interpolated to reach the exact target number:
interpolated index = Floor[3.464 x 10^9 * index / (nl^10)]

The number of unique index outputs is 3.465 x 10^9, in the range of 0 to (3.465 x 10^9) - 1. The exact interpolation range depends on the exact number of lines in the current count of sites. If the number exceeds the maximum of 9^10, the design of the index generator would have to be changed to provide a 10-stage generator with nl = 10 – a 10-billion-line potential range.
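A compact sketch of the 10-stage generator plus interpolation described above (assuming ps holds the 10 linearized probabilities; the clamp guards the p = 1.0 edge case):

```javascript
// 10-stage index generator: each stage contributes one base-nl digit,
// most significant first; the raw index is then interpolated onto the list.
function bigIndex(ps, nl = 9, target = 3464537205) {
  let index = 0;
  for (const p of ps) {
    index = index * nl + Math.min(nl - 1, Math.floor(nl * p)); // Horner form
  }
  // Interpolate the raw 0 .. nl^10 - 1 index onto 0 .. target - 1.
  return Math.floor(target * index / Math.pow(nl, 10));
}
```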

The KS test for 100,000 samples of the generator gave KS+ = 0.974 and KS- = 0.9993; very close to linear. The graphical output of the test data is shown below:

[plot: empirical cumulative distribution function of 100,000 generator samples]

Note the small kink in the data between 0.4 and 0.6. This nonlinearity means that close to the middle of the index range, when a very large number of indexes are drawn, there will be a slight shift toward higher numbers. This shift in probability is a tiny fraction of a percent. Given the huge size of the index resolution, the shift would at times be significant in terms of hitting an exact target number.

Because MMI seems to inherently compensate for internal complexity of the generation function, the nonlinearity may not be significant. However, given the number of degrees of freedom – in the billions – it seems highly unlikely an MMI application will be able to hit an exact target in one try. This is especially true if the order of the list is random, as it would seem to be in the suggested application. If the list were arranged according to a progression of semantic meaning, a close hit from the index generator would actually be close in meaning as well. These last thoughts are a bit speculative since this type of MMI index generator has never been tried before.

As with any type of Internet search, a block of near index numbers should probably be returned allowing the user to pick the one that seems to best match their intention. This is how one would explore any real-world environment. We look around at hundreds of details and our mind makes sense of what we see in the context of our experience, then we focus on what catches our attention.


I’ve started an implementation on the backend and am working on a way to give some ordered semantic meaning to the list (ordering by domain is the initial quick approach I’m thinking of). I also need to figure out a way to index the list so it’s a consistent/fixed list. Currently, due to the distributed nature of how AWS allows interaction with the data (i.e. whichever compute node is allocated to process a block of data, no doubt dependent on AWS resource availability at the time), the same query can return results in a different order each time – not quite the kind of randomness we’re looking for here. This is deterministic, but known only to the architects of AWS.

A small but important correction: I used the term “degrees of freedom” imprecisely in a few posts, which I have now edited. The correct term should have been bits of “information content.” Information content, in bits, is the log base 2 of the number of possible outcomes or unique lines in a list, usually rounded up to the nearest integer. In MMI, information content of an answer gives an idea how difficult or how much mental effort is required to obtain the answer. A coin toss – heads/tails – is only one bit of information. That’s the easiest type of question to answer. A Powerball lottery requires 29 bits of information to hit exactly – vastly harder but not theoretically impossible.
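As a quick sketch of that measure:

```javascript
// Information content in bits: log2 of the number of possible outcomes,
// rounded up to the nearest integer.
const infoBits = (outcomes) => Math.ceil(Math.log2(outcomes));
infoBits(2);           // 1 bit: a coin toss
infoBits(292201338);   // 29 bits: Powerball's 1-in-292,201,338 odds
infoBits(3464537205);  // 32 bits: one exact hit in the 3.46-billion-line list
```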

Due to the unexpected time and cost of trying to effectively prepare a 3.4 billion item list, I’m thinking of partitioning the search: first into a list of 35,906,101 distinct registered domain names, then continuing the search once domain(s) have been selected. Will continue when I’ve got some free time.
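For scale, the same information-content measure applied to the partitioned search:

```javascript
// Reusing the information-content measure from the previous post.
const infoBits = (outcomes) => Math.ceil(Math.log2(outcomes));
infoBits(35906101);  // 26 bits for the domain stage, vs 32 for the full list
```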