Collaborative project idea? Mindfind - mental Google search

WanderingIshiki · July 25, 2021, 2:28pm

@ScottWilber
Is there a way to calculate on the fly a suitable nl given a 1D list of length len?

ScottWilber · July 25, 2021, 8:57pm

There probably is a way to do what you suggest, but I haven’t worked enough on this method of generating a list index using MMI bits to say what it is. The “best” design is the one that best transforms the mental effort to the correct (intended) result. That is not an easy task, especially when there are no constraints on the number of lines in the list (len). If you can give a realistic range for len, I can suggest a possible design.

WanderingIshiki · July 26, 2021, 12:23pm

I’m thinking something like:

First scan (multiple stages): by domain name.
The range being 1-35,906,101

Second scan (multiple stages): unique URL within that domain
blogspot.com has the most unique URLs, so the range would be 1-16,351,015.

The thing that concerns me with this approach is that if the index generated in the first scan is off by even one, then the second stage could end up being done on a completely different domain.

So I’m still not sure this is a potentially effective approach.

ScottWilber · July 26, 2021, 8:04pm

This is an issue with real MMI applications, and with virtually every task that provides something valuable – it’s not possible to get something for nothing.

To hit exactly one out of 35,906,101 requires 25.1 (effectively 26) bits of information, obtained without error. Parsing the 16,351,015 into a unique line is another 24 bits. Altogether, that’s 50 bits of information. To hit the Powerball exactly takes only 29 bits, a much easier – but still extremely difficult – challenge. I am working on a more complete theoretical analysis of mental effort in MMI and achievable information content, but it’s not ready yet.
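As a quick check of these figures: picking exactly one item out of n equally likely items takes log2(n) bits. The following is only a minimal sketch; the helper name `bits_needed` is illustrative, and the Powerball jackpot odds of 1 in 292,201,338 are an assumption about the figure intended.

```python
import math

def bits_needed(n: int) -> float:
    """Information (in bits) needed to pick exactly one of n equally likely items."""
    return math.log2(n)

domains = 35_906_101     # size of the domain list (first scan)
urls = 16_351_015        # unique URLs in the largest domain (second scan)
powerball = 292_201_338  # ASSUMED jackpot odds: 1 in 292,201,338

for label, n in [("domain", domains), ("URL within domain", urls), ("Powerball", powerball)]:
    b = bits_needed(n)
    # Whole bits are rounded up, since a fraction of a bit cannot be generated.
    print(f"{label}: {b:.2f} bits, rounded up to {math.ceil(b)}")
```

Rounding each stage up to whole bits gives the 26 + 24 = 50 bits quoted for the two-stage search.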

There is just no way to expect anyone will hit the one and only, exact domain name. To make a useful application requires a very clever approach. Not saying I know the exact details, but there are some guidelines.

As I noted before, if the domains were arranged in a list so that near hits are also near in a meaningful way, the chance of getting such a near hit would increase by orders of magnitude. However, making such an ordered list is probably not a reasonable project. The alternative is to use a different approach entirely.

It would be possible to make a very short list of popular keywords, perhaps 64 words. It would then be possible to arrange the list manually so that nearby words are meaningfully related. The list should be arranged so the words at the end are related to the words at the beginning, so there is no discontinuity around the ends of the list. A 64-line list is only 6 bits of information, but the ordering would provide an important advantage. Then use an MMI application to select 2, or even 3, words from the list. Use these keywords to perform a usual online search. Assuming there will be hundreds to millions of returns, make a selection from the top dozen or so, based on the selected keywords, to finally return about 3 or more for the user to look at. (Exact numbers and list sizes are only suggested possibilities.)
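To illustrate the "no discontinuity around the ends" idea: if the 64-word list is treated as a ring, the near hits for any selected position can be found with modular arithmetic. This is only a sketch; the function name and the `radius` parameter are hypothetical, not part of any existing MMI design.

```python
def circular_neighbors(index: int, size: int = 64, radius: int = 1) -> list[int]:
    """Positions within `radius` steps of `index`, wrapping around the list ends."""
    return [(index + offset) % size for offset in range(-radius, radius + 1)]

# A selection at either end of the list still has related words on both sides.
print(circular_neighbors(0))   # neighbours of the first word
print(circular_neighbors(63))  # neighbours of the last word
```

With this arrangement, a selection that misses the intended word by one position still lands on a related word, even at the first or last line of the list.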

Only one of the finally returned sites needs to have content similar to the user’s intention to be a “hit.” As we know from Randonautica, every outing is not a hit. What’s important is that the application captures the user’s interest and is entertaining to use and see what happens.

Doubtless, there are a large number of approaches, and I suggest just picking one and trying it. If you want to try the two stage search with the index sizes you suggested, I will provide the MMI index generator designs as examples.

WanderingIshiki · August 4, 2021, 2:44am

I’m continuing to look into ways to better arrange the indexes, including the suggestions above, but in the meantime could I get help with suggested nl and N values for index lengths of:

466,550 - a rather large English dictionary (which was also the source of the Intent Suggestions in early Randonautica)

and

35,906,101 - the list of domains I’ve currently got

That’ll give us a v1 prototype of Mindfind to at least make available.

ScottWilber · August 4, 2021, 7:38pm

For the 466,550-line index generator, use nl = 9 and N = 6671 bits (834 bytes - 1 bit to make it odd). There will be 6 lines, giving a total output of 9 to the 6th power = 531,441 unique indices. Then use interpolation to get the exact number of indices desired:
interpolated index = Floor(466,550 × (output / 531,441)). This will output 0 to 466,549. The total number of bytes used is 5004 (40,032 bits), produced in 0.4 seconds at a generation rate of 100,000 bps.

For the 35,906,101-line index generator, use nl = 9 and N = 4999 bits (625 bytes - 1 bit to make it odd). There will be 8 lines, giving a total output of 9 to the 8th power = 43,046,721 unique indices. Then use interpolation to get the exact number of indices desired:
interpolated index = Floor(35,906,101 × (output / 43,046,721)). This will output 0 to 35,906,100. The total number of bytes used is 5000 (40,000 bits), produced in 0.4 seconds at a generation rate of 100,000 bps.
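The arithmetic in the two designs above can be sketched as follows. This is a rough illustration, not the actual generator: it assumes each "line" contributes one base-nl digit (0 to nl - 1), that the digits are combined most-significant first, and that the number of lines is the smallest one whose range covers the list. How each digit is derived from the N MMI bits is not shown here, so the digits are taken as given. All function names are hypothetical.

```python
def lines_needed(list_len: int, nl: int) -> int:
    """Smallest number of base-nl digits ("lines") whose range covers list_len (ASSUMED rule)."""
    lines, capacity = 1, nl
    while capacity < list_len:
        capacity *= nl
        lines += 1
    return lines

def combine_digits(digits: list[int], nl: int) -> int:
    """Combine per-line selections into one number, most significant digit first (ASSUMED order)."""
    output = 0
    for d in digits:
        if not 0 <= d < nl:
            raise ValueError(f"digit {d} is outside 0..{nl - 1}")
        output = output * nl + d
    return output

def interpolate_index(output: int, list_len: int, nl: int, lines: int) -> int:
    """Floor(list_len * output / nl**lines), done in exact integer arithmetic."""
    return (list_len * output) // (nl ** lines)

def total_bytes(n_bits: int, lines: int) -> int:
    """Bytes consumed when each line reads (n_bits + 1) / 8 whole bytes and drops one bit."""
    return lines * ((n_bits + 1) // 8)

# Example: the largest possible output for the dictionary design (all six digits equal to 8).
largest = combine_digits([8] * 6, nl=9)
print(interpolate_index(largest, 466_550, nl=9, lines=6))
```

Integer floor division is used instead of floating-point division so that the largest output maps exactly to the last list index without rounding surprises.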

Note that the first list requires 19 bits of information, and the second 26, to get exact hits. There are many ways of producing these indices (including several variations I haven’t mentioned), and there are ways of reducing the difficulty of the task or improving the MMI power of the index-making algorithm. However, this is good enough for a start.

WanderingIshiki · August 7, 2021, 3:24pm

Thank you. I’ve put the latest additions online at:

https://mindfind.net

For now I’m considering this as the first kind of prototype/test version. It needs further testing and tuning as time permits.

Here’s a brief description of the search functionality:

Web Search - searches the index of the 35.9 million uniquely registered domains crawled by CommonCrawl.org in June 2021 and returns a Google-like search result page. It returns a random page from within each domain, but you can also go directly to the domain’s root. Mindfind takes 10 trials from a MED and applies the indexing algorithm discussed above, listing them in z-score order – giving 10 results in total.

(If you get no results after a while, or a white page, try refreshing.)

True Terms - uses three MEDs, one per term, to select each term individually, then combines the 3 words into a normal Google search. The word list here is the 1024 most commonly used words in English.

Pseudo Terms - the same as True Terms, but using a PRNG (pseudo random number generator) instead of Mind-Enabled Devices (MEDs).

Intent Suggestions - the above index searching algorithm applied to a rather large English dictionary (which was also the source of the Intent Suggestions in early Randonautica - 466,550 words in total).
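As a rough idea of what the Pseudo Terms feature described above does, here is a minimal sketch: three words are drawn independently with a PRNG and combined into an ordinary Google search URL. The function name and the tiny word list are hypothetical stand-ins, and this is not the actual Mindfind code.

```python
import random
from typing import Optional
from urllib.parse import urlencode

def pseudo_terms_url(words: list[str], k: int = 3,
                     rng: Optional[random.Random] = None) -> str:
    """Pick k words independently with a PRNG and build a Google search URL."""
    rng = rng or random.Random()
    # Each term is chosen on its own, so the same word may appear more than once.
    terms = [rng.choice(words) for _ in range(k)]
    return "https://www.google.com/search?" + urlencode({"q": " ".join(terms)})

# Hypothetical stand-in for the 1024 most common English words.
sample_words = ["water", "house", "light", "music", "river", "story"]
print(pseudo_terms_url(sample_words))
```

True Terms would follow the same shape, with the PRNG choice replaced by an index obtained from a MED.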

The source code is available here:

ScottWilber · August 7, 2021, 8:24pm

Thanks for your effort to make a working prototype.

I suggest limiting the number of returns for True Terms and Pseudo Terms, because there are up to billions. Perhaps consider limiting the returns to English only. I know there are many potential users whose native language is not English. Most of them also speak English, but not the many other languages that may show up in unconstrained searches.

Several of the most common words, such as “a” or “the”, don’t add information to a Google search. Consider editing the top 1000 word list to exclude such words, and replacing them with more meaningful words from a longer most-used word list (or just make the list shorter).
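One possible way to do this editing automatically is sketched below. The stop-word set and word lists are tiny hypothetical examples; a real list would need manual review.

```python
def refine_word_list(common_words: list[str], longer_list: list[str],
                     stop_words: set[str], target_size: int) -> list[str]:
    """Drop stop words, then refill from a longer frequency-ordered list up to target_size."""
    kept = [w for w in common_words if w.lower() not in stop_words]
    seen = set(kept)
    for w in longer_list:
        if len(kept) >= target_size:
            break
        if w.lower() not in stop_words and w not in seen:
            kept.append(w)
            seen.add(w)
    return kept

# Hypothetical example data.
stop_words = {"a", "the", "of", "and"}
top_words = ["the", "dog", "a", "tree"]
longer_list = ["the", "dog", "river", "cloud", "stone"]
print(refine_word_list(top_words, longer_list, stop_words, target_size=4))
```

Setting `target_size` below the original length simply makes the list shorter, which is the other option mentioned above.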

WanderingIshiki · August 8, 2021, 10:10am

I’ve added a little filter to Web Search so that only English results will be shown. It’s probably not 100% accurate but should be better than nothing.

Do you mean using only one or two search terms instead of 3? (At the moment these 2 features generate 3 search terms combined together.) As all I am doing behind the scenes is opening Google with whatever 3 words were generated, I don’t have any way to limit the number of results Google yields. However, these should generally all be in English, as the list the terms are generated from contains only English words.

Yes, you’re quite right here. The list hasn’t been calibrated at all for the purpose at hand, so it contains the likes of “a”, “the”, etc.

WanderingIshiki · August 8, 2021, 10:21am

So as I’m trying out the new Web Search I get this as my first result:

Not quite what I was thinking, so I call in my 8-year-old son, who was in the room watching a Minecraft video on YouTube (right where the MEDfarm server is located and running), and ask him what he was thinking about these past few minutes… lo and behold, ways to watch Dragon Ball from here in Singapore, which is what shows on this site’s top page:

www.dream-force.tv

ScottWilber · August 8, 2021, 4:24pm

I meant to reduce the number of hits or results returned by the search, since Google may return up to 50 pages (their internal limit). Apparently there may be a way to do this. Try this link for example:
https://www.codeproject.com/Questions/197422/How-to-limit-the-number-of-results-returned-by-goo
A lot of what Google provides is often changed or “deprecated.” I saw one older (10 years ago) search result, but it was unclear if any suggested solution was still valid.