Towards Telesophy: Federating All the World’s Knowledge

MALE SPEAKER: This is my
attempt to increase the sartorial quotient of Google,
and it hasn’t worked at all. On the other hand– well, I noticed you have
a coat on, that’s true. Greg Chesson gets two points
for showing up with a coat. It’s a real pleasure to
introduce Bruce Schatz to you. I’ve known Bruce for
rather a long time. My first introduction to him
came as we both began getting excited about digital libraries
and the possibility of accumulating enormous amounts
of information in digital form that could be
worked on, manipulated by, processed through software that
we hope would augment our brain power. So Bruce has been in the
information game for longer than he’s actually willing
to admit I suspect. He’s currently at the University
of Illinois, Urbana-Champaign. As you will remember, that’s
also the area where the National Center for
Supercomputing Applications is located. Bruce was around at the time
when Marc Andreessen was doing work on the first browsers, the Mosaic versions of the browsers derived from Tim Berners-Lee’s work. Actually, the one thing that
Bruce may not realize he gets credit for is teaching
me how to pronounce Caenorhabditis elegans. I looked at it before and I
couldn’t figure out, and maybe I didn’t even say it
right this time. But this is a tiny little worm
that consists of 50 cells. It was the first living organism
that we actually completely sequenced
the genome for. Then we got interested in
understanding how does the genome actually reflect itself
as this little worm develops from a single fertilized cell. So Bruce introduced me to the
idea of collecting everything that was known about that
particular organism, and to turn it into a database that one
could manipulate and use in order to carry
out research. Well, let me just explain a
little bit more about his background and then turn this
over to him, because you’re here not to listen to his
bio, but to listen to what he has to say. He’s currently director of
something called CANIS– C-A-N-I-S. I thought it
had to do with dogs until I re-read it. It stands for Community Architectures for Network Information Systems. BRUCE SCHATZ: That’s why they
let me in the building. MALE SPEAKER: I’m sorry. BRUCE SCHATZ: That’s why they
let me in the building. MALE SPEAKER: Because
along with the other canines that are here. It’s at the University of
Illinois, Urbana-Champaign, and he’s been working on federating all the world’s knowledge, just like we are, by
building pioneer research systems in industrial and
academic settings. He’s really done a lot of work
over a period of 25 or 30 years in this domain. The title of the talk uses the
term telesophy, which he introduced as a project at
Bellcore in the 1980s. Later on, he worked at UIUC on
something called DeLIver D-E-L-I-V-E-R, and now more
recently on semantics. That’s the reason that I
asked him to come here. He’s working on something called
BeeSpace, which is spelled B-E-E, as in the little
buzzing organism. This is an attempt as I
understand it, but I’m going to learn more, an attempt to
take a concept space and organize it in such a way that
we can assist people thinking through and understanding more
deeply what we know about that particular organism. So this is a deep dive into
a semantic problem. So I’m not going to bore you
with any more biographical material, except to say that
Bruce has about nine million slides to go through, so please
set your modems at 50 gigabits per second because he’s
going to have to go that fast to get through all of it. I’ve asked him to leave some
time at the end for questions. I already have one queued up. So Bruce, with that rather quick
introduction, let me thank you for coming out to
join us at Google and turn this over to you to teach
us about semantics. BRUCE SCHATZ: Thank you. I have one here, so you can
just turn yours off. Thank you. I was asked to give a talk
about semantics, which I supposedly know something
about. So this is going to be both a
talk that’s broad and deep at the same time, and it’s going to
try to do something big and grand, and also try to do
something deep that you can take away with you. So that may mean that it fails
completely and does none of those, or maybe it does
all of those. I’ve actually been giving this
talk for 25 years and– now, of course, it
doesn’t work. Am I not pointing it
in the right place? I’m pushing it but
it’s not going. Oh, there it goes. OK, sorry. Can you flip it back there? Sorry about that. Small technical difficulty, but
the man behind the curtain is fixing it. So I gave this talk first more
than 20 years ago in the hot Silicon Valley research lab
that all the grad students wanted to go to, which was
called Xerox PARC. I think a few people actually
have heard of Xerox PARC. It sort of still exists now. We went down completely? There we go. Thank you very much. I was pushing this idea that you
could federate and search through all the world’s
knowledge, and the uniform reaction was, boy,
that would be great, but it’s not possible. And I said, no, you’re wrong. Here, I’ll show you a system
that searches across multiple sources and goes across
networks, and does pictures and text and follows links, and
I’ll explain each piece about how it works. Then they said, that’s great,
but not in our lifetime. Well, 10 years later was
Mosaic and the web. And 20 years later I’m delighted
to be here, and all of you have actually done it. You’ve done all the world’s
knowledge to some degree. What I want to talk about is how
far are you and what you need to do before you take over
the rest of the world and I die, which is another
20 years. So what’s going to happen
in the next 20 years. The main thing I’m going to
say is a lot’s happened on tele, but not too
much on sophy. So you’re halfway to the hive
mind, and since I’m working on honey bees, at the end you will
see a picture of honey bees and hear something about
hive minds, but it will be very short. Basically, if you look at
Google’s mission, the mission is doing a lot about access and
organization of all the world’s knowledge. Actually, to a degree that’s
possible, you do an excellent job about that. However, you do almost nothing
about the next stages, which are usually called analysis
and synthesis. Solving actual problems, looking
at things in different places, combining stuff
and sharing it. And that’s because if you look
at the graph of research over the years, we’re sort of here,
and you’re doing commercially what was done in the research
area about 10 years ago, but you’re not doing
this stuff yet. So the telesophy system
was about here. Mosaic was about to here. Those are the things, searching across many sources like what I showed, that were really
working pretty well in research labs with
1,000 people. They weren’t working
with 100 million. But if Google’s going to survive
10 more years, you’re going to have to do whatever
research systems do here. So pay attention. This doesn’t work
with students. With students I have
to say I’m going to fail you at the end. But you have a real reason, a
monetary reason, and a moral reason to actually
pay attention. So back to the outline. I’m going to talk about what
are different ways to think about doing all the world’s
knowledge, and how to go through all the levels. I’m going to do all the levels
and sort of say you are here, and then I’m going to
concentrate on the next set of things that you haven’t
quite got to. The two particular things I’m
going to talk about are scalable semantics and concept
navigation, which probably don’t mean anything to you now,
but if I do my job right, 45 minutes, actually now 10 of
them are up, so 35 minutes from now they will
mean something. At the end I’m going to talk
about suppose you cared about this enough to do something,
what kind of big thing would you actually do? I sort of do these big, one-of-a-kind pioneering projects with stuff that doesn’t
quite work just to show it’s really possible. So the overall goal is you
probably all grew up reading cyberspace novels, sort of plugging your head in and being one with all the
world’s knowledge. Trying to sort of get the
concepts in your head to match whatever is actually out there
in a way that you can get what you want. The problem is over
time what the network can do has increased. So in the– I can’t say the old
days, man– in the good days, people worked
on packets and tried to do data transmission. The era that I sort of worked
mostly in was an object era where we tried to give the
information to people to do, [UNINTELLIGIBLE]
to do pictures. All the action in big research
labs now is on concepts, is on trying to do deeper things,
but still have it work like these do. They work everywhere. So you don’t have a specialized
AI program that only works for income taxes. That’s not good enough. No Google person would ever do
something that only works in one case, unless there
was a huge amount of money behind it. I’ll stop making money comments,
but the food is great here. So this is one common layout,
and there’s four or five others, which in the absence
of time, I will omit. But if you want to talk to me
afterwards, there’s lots of points of view about how to get
from here to there, where there is always all the world’s
knowledge, and here is whatever you can do now. Depending on what point of view
you take, it’s possible to go to the next step
differently because you have a different orientation. So the one that I’m going to
do in this talk is the linguistic one, which usually
goes syntax, structure, semantics, pragmatics. So syntax is what’s actually
there, like an actual set of bits in a file, a set of
words in a document. Structure is the parts, not the wholes. So if you parse something into structure, you can tell that this particular thing is a
person’s name, this is the introduction to a paper, this
is the methods part. You can tell what the parts are
and you can search those differentially. Semantics is when you go inside
and you try to get something about the meaning, and
as you’ll see, people have pretty much given up on doing
real meaning, and they pretty much try to do, rather
than meaning, they try to do context. What’s around it in a way that
helps you understand it. Actually, when Google was a
research project, and the people that started it were
actually on the Stanford Digital Library Project, I was
running the Illinois Digital Library Project at the same
time, they said there’s enough context in web links to be able
to really do something. There were a lot of people that
said no, web links are made for all sorts of things,
and they don’t have any semantics, and they’re
not useful at all. But obviously, they were wrong
enough to make this building and employ all of you. The real goal is down here in dealing with actual reality, in doing so-called pragmatics. Pragmatics is sort of when
you use something. So it’s task dependent. The meaning of something is always the same. So if this is a gene that
regulates cancer, it always does that. But lots of time, the task
you’re working on varies what you’re interested in,
what you know. I’m not going to say very much
about pragmatics because people haven’t gotten very far on
it in terms of doing a big grand scale. But I actually know quite
a bit about it. If you really wanted to solve
health care, for example, you’d have to go down the
pragmatic route and try to measure people with as
large a vector as you can possibly get. And again, if people are
interested, that’s a topic I’d be happy to talk about, but it’s
off this particular talk. This particular talk is about
federation, as I said. So what does it mean
to federate each one of those levels? So to do syntax federation,
which is what the telesophy system pioneered, and for the
most part, what Google does in the sense of federating all
the web sources that are crawled, is that it essentially sends the same query into every different
place. So true syntax federation,
which is actually what telesophy did, but not really
what Google does, is you start at your place and you go out to
each one of the sources and they have to remember where
they are on the network. They might go up and down,
and so you might have to retry them. And you have to know what
syntax the queries need. And when the results come back,
you have to know how to handle that. You have to do a lot about
eliminating duplicates when the results come back. So a very common problem is you
send out a query to try to get a certain Beatles song,
and you get back 5,000 of them, but they’re all slightly
different, and they’re in different languages and they
have different syntax. Merging those all together
is really complicated. So that’s what syntax
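[Editor’s note: the duplicate-merging problem described here can be sketched in a few lines of Python. Everything in this sketch, the source callables, the record format, and the normalization rule, is invented for illustration; it is not how the telesophy system or any real federator was built.]

```python
# Hypothetical sketch of syntax federation: send the same query to
# several sources, each returning results in a slightly different
# syntax, then merge near-duplicates on a normalized key.

def normalize(title):
    """Collapse case, punctuation, and whitespace so slightly different
    renderings of the same item compare equal."""
    kept = "".join(c.lower() for c in title if c.isalnum() or c.isspace())
    return " ".join(kept.split())

def federate(query, sources):
    """Send one query to every source; keep one copy of each item."""
    seen = {}
    for search in sources:              # each "source" is just a callable here
        for record in search(query):
            key = normalize(record["title"])
            seen.setdefault(key, record)    # first copy wins
    return list(seen.values())

# Two toy sources returning the same song in different syntaxes.
src_a = lambda q: [{"title": "Let It Be"}]
src_b = lambda q: [{"title": "LET IT BE!"}, {"title": "Yesterday"}]

results = federate("beatles", [src_a, src_b])
# The two "Let It Be" variants merge, leaving two distinct items.
```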
federation is. Structure federation,
which is what this– DELIVER was the DLI, the Digital
Library Initiative project that I ran at the
University of Illinois. It was about engineering literature; it went out to 10 major scientific publisher sites
on the fly and allowed you to do a structured query. So you could say find all the
papers in physics journals that are within the last
10 years that mention nanostructures in the figure caption or in the conclusion. So you’re using the parts of the papers to make the search useful. And scientists at least make a great deal of effort in doing that. In order to do that, you have
to figure out some way of making the mark-up uniform. So you have problems that you
just started to see in the syntactic world, like who’s an author? If you have a physics paper that
has 100 authors, which one of them is the author? It might not be any of them
actually, it might be the organization that did it. Or if you have a movie, who’s
the author of a movie? Is it the producer, the writer, the star, the director? So there’s a lot of problems
there in how you do the mark-up uniformly
and how you make different values the same. For the most part, structure
has not made it into mass systems yet, although there have
been a lot of attempts to try to make languages for
structure like the Semantic Web that Vint and I were talking
about beforehand. But the amount of correctly
marked-up structured text is very small right now. So if you were going to use it
to search the 10 billion items that you can crawl on
the web now, you wouldn’t get very far. Semantics federation, which is
what I’m going to talk about today mostly, is about a
completely different topic. It’s about going inside and
actually looking at the phrases and figuring out the
meaning, as much of the meaning as you can. And then when you have many
small pieces, trying to match something that’s the same here
to something the same here. And doing that uniformly
is the job of semantics federation. So let me now go into
the first of the two technical topics. So the first topic I’m going to
do is how do you actually represent the things, and that’s
going to be a little slow going. Then I’m going to give some
examples of if you’re able to get this deeper level
representation, this deeper level structuring, what kind
of system you can build. It’s in a somewhat specialized
domain. It’s in biology and medicine,
because well, if you’re a professor and you work at a
university, that’s where you can get money to
work on things. You can’t get money to work on
the kind of things that are arbitrarily on the web. So scalable, so we’re now
into scalable semantics. I’ve been using this for 10
years, and every once in a while someone will stand up and say
that’s an oxymoron, it doesn’t make sense because semantics
means really deep, and scalable means really broad,
and those pull in opposite directions. And I said yes, you understood
what the problem is. So in the old days, what
it used to mean is– what semantics used to mean
is you do deep meaning. So you had a deep structure
parser that would go in and figure out yes, this document
was on operating systems that only work on this class of
computers, and only solved this class of physics problem. So it’s on a very narrow,
detailed topic. There were many, many AI systems
made that did that. What happened when the
government started putting large amounts of money into
it– so most of this got developed in the– the base technology got
developed in the DARPA TREC program trying to read newspaper
articles looking for what would now be called
terrorists. What they found basically
is the deep programs were very narrow. If you trained something to
recognize income taxes, or you trained something to recognize
high-powered rifles, it wouldn’t help at all
in the next one. And there were just too many
individual topics to try to pick out the individual
types of sentences and individual slots. So what happened is the broad
ones beat out the deep ones when the machines got really
fast. When it became clear, and I’ll show you some machine
curves, when it became clear that you could actually parse
noun phrases arbitrarily out, then people begin using
noun phrases. When it became clear you could
do what are called entities, in other words, you could
say this phrase is actually a person. This phrase is actually
someone that lives in California. Then people started using it. Basically what happened is
semantics changed from being we know everything about this
particular topic and this phrase means one thing, it’s meaning type 869, to we have 20 kinds of entities, and this is a gene,
and it occurs with this other gene. So we’ll say if you search for
this gene and it doesn’t work, you should search for
this other one. I’ll show you lots of cases
where that sort of guilt by association really helps. I’m not defending it necessarily
as being real semantics, I’m defending it as
something that you can do everywhere. So the upshot is this is
an engineering problem. It’s a question of if you could
do deep parsing and say yes, this person wasn’t– it’s true they said they were
interested in ice cream cones, but they really meant pine cones
when they said cone, then you would do that. But it’s generally not possible
to do that, except in very isolated circumstances. So you end up thinking globally,
thinking about all possible knowledge, but
acting locally. I guess this is a green building
so I’m allowed to make this kind of joke. So you look at a small, narrow
collection, and analyze the context, what occurs with each
other very precisely, and do something there. And that creates one
good situation. In other words, it means now
you’re able to go much deeper, and I’ll show you lots of
examples of going much deeper. But it creates one bad
situation, which is traditionally information
retrieval works like Dialog did in my era, or like
Google does now. You take everything you can get,
and pile it into one big huge server farm, and
then you search it. You index it once in one big
index and you search it. Well, the problem is if you want
to go deeper in semantics that doesn’t work,
because you mixed together too many things. You have to unmix them, and then
you have to worry about how to get from here to there. So you change a central problem
into a distributed problem with all of the hard
features that go with distribution. What this is doing in
terms of– if you want a physical analogy. For many years I taught
at a library school. The way indexes work in the real
world is for really big topics like if you have
electrical engineering, there’s a society that is big
enough and well-defined enough to employ people to
tag every topic. So they say here’s an article
about Windows, this one is about operating systems.
Here’s an article about Windows, this one is about
heat conservation. A person who’s looking at
that, and out of their selection of all the topics,
they say which topics the things are on. That worked fine as long as most
of the information in the world was in these large, but
fairly small number of well-defined databases. That’s not the world we’re
living in now. We’re mostly living
in this world. So there still are a very large
number of big formal databases that are done by
hand, but nearly all the databases, nearly all the
collections, are these informal ones with communities
or groups or individuals. The advance of crawling
technology that’s been able to take all these and collect them
all together into one big place has actually made the
problem worse because now there’s not only apples and
oranges and pears altogether, but there’s lots of things that
aren’t fruit at all and aren’t really anything,
but they’re in there. So there’s many different things
that you don’t know how to deal with, and you
have to do something automatically with them. It’s not the case that
you can get– my daughter who keeps track of
all the cats on the block and has a website with their
pictures, it’s not the case that you can get her to employ
a professional curator from the library school who will tag
those correctly so that someone who’s a cat fancier in
the next town can see them. That’s not true. Need some kind of automatic
support. So I’m going to talk about
the automatic support. I’m doing OK for time. There’s two things. I’m going to talk about entities
and I’m going to talk about concepts. So here are entities. What entities are is trying to
figure out what type of thing something is. So one way is you have
hand-tagged XML, like the mark-up, like the
semantic web. So they take a particular domain
and they say there are 20 types here, and
we’ll mark up each document correctly. So if we’re in the humanities that
might work pretty well. This is a person, this is a
place, this is a type of vase, this is a time period
in Roman history. If you’re out on the web in that
situation where 90% of the stuff is informal, then
even if there was a systematic set of types, the people
aren’t going to do it. So if you have well marked-up
hand ones you’re going to use them, but if you don’t
then you have to do something automatic. The thing that tends to work
automatically is to try to tag
sets, and I’m going to say a little bit about
what that means. First you go into the document
and you pull out the phrases. So you don’t do whole words. And in fact, over time the
experimental systems I’ve built have gotten better the more you
can get away from words and change them into whole
phrases that are the equivalent phrase that works
in that particular domain. Right now search engines
don’t do that. That’s a big part
of the problem. Then you have to recognize
the part of speech. Is it a noun or a verb
or an object? Again, 10 years ago, you needed
a specialized grammar and it only worked in a
particular subject. Now there’s things trained on
enough machine learning algorithms, trained on enough
things that you can get very high accurately, up
in the high 90s, with parts of speech. And in fact, they’re actually
systems, and you can tell this was secretly funded by the CIA
under some other name, that recognize persons, places, and
things pretty accurately. So if you want to recognize
newspaper articles and automatically tag these
correctly, it actually does a pretty good job. Again, commercial search engines
tend not to use those. So here’s an example of
entities in biology. These won’t mean very
much, but they’ll give you the feeling. Here’s a kind of functional
phrase. A gene is a type of an entity
and encodes a chemical. So here’s an example. The foraging gene encodes a
cyclic GMP protein kinase. So this is one of the entities
and this is the other entity. In scientific language things
are very regularized, so there’s lots of sentences that
are actually that easy. Or here’s another one. Chemical causes behaviors. Here’s one that’s
a little harder. I tried to put one
a little harder. This one says gene regulates
behavior, but that’s not in the sentence. What’s actually in the sentence
is this gene, which is an ortholog of this other
gene– so it doesn’t say directly, it says indirectly
it’s a gene– is involved in the regulation,
which is not the same phrase as regulates. So you have to do a little bit
of parsing to get a phrase like gene regulates behaviors. But the natural language
technology is now good enough to do that accurately. I did do a little bit of prep
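[Editor’s note: the step from an indirect sentence to a normalized relation like “gene regulates behavior” can be sketched roughly as pattern matching over tagged entities. The entity lists and the single regular expression below are illustrative stand-ins; real systems use trained taggers, not hand lists.]

```python
# Minimal sketch of entity-pattern relation extraction: tag known entity
# names in a sentence, then match a loose pattern covering both the
# direct phrasing ("regulates") and the indirect one ("is involved in
# the regulation of") to emit one normalized relation.
import re

GENES = {"foraging"}                  # toy entity lists, not real taggers
BEHAVIORS = {"foraging behavior"}

def extract_relation(sentence):
    s = sentence.lower()
    for gene in GENES:
        for behavior in BEHAVIORS:
            pattern = (rf"{gene} gene.*"
                       rf"(regulates|involved in the regulation of).*"
                       rf"{re.escape(behavior)}")
            if re.search(pattern, s):
                return (gene, "regulates", behavior)   # normalized form
    return None

rel = extract_relation(
    "The foraging gene, an ortholog of dgk, "
    "is involved in the regulation of foraging behavior."
)
# rel == ("foraging", "regulates", "foraging behavior")
```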
and look at some of the commercial systems that
were doing this. If you want to ask a question
about those later I’ll make a comment. But they’re all competitors,
they’re not Google, so I didn’t want to say up front. The last comment I’m going to
make about entities is they come in different varieties. That means that sometimes
you’ll do them and sometimes you won’t. So there’s some of
them, and again, these are biology examples. There’s some of them that are
just straight lists, so the names of organisms, like honey
bee or fruit fly are almost always exactly those
same words. So those are easy entities
to tag very accurately. Things like genes or parts of
the body vary somewhat, but there often are tag phrases that
say this is the part of a body and here it is. It’s a wing. Or this is a gene and it’s
the foraging gene. So there are often tags there. If you get training sets
you do pretty well. Then there’s really hard things
like what kind of– these are sort of functional
phrases– what kind of behavior is
the honey bee doing? What kind of function does the
computer operate with? Those ones are almost always
different, so you need a really big training set to
do those accurately. If you were going to try to
do entities across all the world’s knowledge, you would
have two problems. I think that’s the last thing I’m going
to say on this, yes. The first is you would have to
try to make a run at the hard ones, or at least say well,
we’re only going to do these because that’s all we
can do uniformly. The second thing is you have to
realize that the entities are different in each
major subject area. So the biology ones are not the
same as the medicine ones, which are more disease-like, and
the medicine ones aren’t the same as the physics ones,
and the physics ones aren’t the same as the grocery
store ones. My guess is there’s a relatively
limited number of popular ones, and if you’re back in the same style of trying to classify all the web’s knowledge, like Yahoo!– that used to be Yahoo!’s main
strategy, for instance. There are a couple hundred
really important ones and a couple thousand
big ones. So if you had enough money and
enough expert teams and set each one up to make training
sets, you could actually do entities all the way across. A research project can’t muster
that except in one small area. That’s all I’m going to
say about entities. Now, let me explain just a
little bit about what you do with entities, and then
give a big example. So what you do with entities,
you might think you’re going to answer questions
with them, and that’s what the commercial
systems are doing. You can sort of answer
questions, so you can say this gene seems to affect this
behavior in this organism. So you can say what are all
the things that affect foraging in insects and
get out lots of– this is sort of like you have a relational table. You take a document and change
it into a relational database. You can answer that kind of
question, but there’s lots of kinds of questions
you can’t answer. What you can do is after you
extract these entities, these units, is you can compute
these context graphs. You can see in this document how
often do these two things occur together. That one you get a lot of
mileage from, because if you try to search for this one and
you can’t find it, you can search for this other one. Or if you’re trying to search
for this one and you can’t find it, you can go down the
list of the ones it commonly occurs with and it’s sort of
a suggestion facility. People that watch search in
libraries, what they typically comment on is people don’t
know what words to try. They’ll try all the words they
can think of and then they’ll start searching dictionaries or
looking at other papers or asking the people
next to them. So since you can automatically
do suggestion by making this graph of all entities that are
related to all the other entities in terms of how often
they occur together in a collection, then you can
use it for suggestion. This is my computer
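[Editor’s note: the context graph just described, co-occurrence counts over extracted entities, reused as a suggestion facility, can be sketched as follows. The documents here are pre-segmented into toy phrase lists; a real system would first parse the phrases out.]

```python
# Sketch of a co-occurrence "context graph": count how often each pair
# of phrases appears in the same document, then suggest the phrases
# most often seen with a query term.
from collections import Counter
from itertools import combinations

def build_cooccurrence(docs):
    counts = Counter()
    for phrases in docs:
        for a, b in combinations(sorted(set(phrases)), 2):
            counts[(a, b)] += 1        # undirected pair, sorted for a stable key
    return counts

def suggest(term, counts, k=3):
    """Phrases that most often co-occur with `term`, best first."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == term:
            related[b] += n
        elif b == term:
            related[a] += n
    return [p for p, _ in related.most_common(k)]

docs = [
    ["foraging gene", "honey bee", "pkg"],
    ["foraging gene", "honey bee"],
    ["honey bee", "hive"],
]
graph = build_cooccurrence(docs)
# suggest("foraging gene", graph) ranks "honey bee" first, then "pkg".
```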
engineering slide. The other unusual feature
about Google that people didn’t predict is could
you build a big enough supercomputer to handle 10
billion items. And Dialog would have said no because IBM
will not sell you that many platters and that big a thing. Well, what they didn’t realize
was what the rise of PCs would do if you hook things together
and you could partition the problem enough. The research people hit that
same curve a decade earlier. So I was trying to do
these relations– this is my six or seven
year history. These are all how big a
collection you can do and find these entities and these
relations basically on workstations. So this is like a Sun-2, and
that’s a Sun-3, and this is a network of Sun-3’s,
about 10 of them. This one is discovering
supercomputers at NCSA and you could get 1,000 all
at one time. That made a big difference, and
it meant– in fact, this was a big hero experiment. It was the first supercomputer
computation in information retrieval. For quite a while, it was the
biggest computation that NCSA had ever done. They couldn’t figure out why
you’d want to integrate all the world’s knowledge. Why would anybody
want to do that? I think in 1998, Google was
probably about 10 employees. So that question hadn’t
come up yet. The number of articles in
Medline was still much greater than the number of articles
on the web. So here’s what that computation
was like. It had about 280 million
concepts, so that number was big then. It’s now small. However, if you fast forward to
today, the machines are a lot faster, so the server I just
bought for $25,000 has more memory than that supercomputer
eight years ago. These are big memory
computations. You can guess it’s got a big
matrix inside that has all the phrases versus all the phrases
and how often they occur. So the more physical RAM
you have the better. What it turns out is if you’re
able to put a connection graph, so this is a graph of
which terms are related to which other terms
all in memory. And it’s a nice graph, like a
small-world graph, which looks kind of like this. So there’s a group here that’s
all sort of connected and another group here. So it comes in groups. That tends to be true of just
about any kind of text that people have seen. Then you can find all the
inter-relations really fast because you don’t have to look
at this one versus this one because you know that
you can stop here. So there’s a way of ending
the propagation. What that means is you can
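[Editor’s note: the early-stopping idea here, that a clustered, small-world-like term graph lets you end the propagation without scanning the whole matrix, can be sketched as a depth-bounded expansion. The toy term graph below is invented for illustration.]

```python
# Breadth-first expansion from a seed term, cut off at a fixed number of
# hops, so a tightly connected group is explored without touching the
# rest of the graph.
from collections import deque

def related_terms(graph, seed, max_depth=2):
    """Terms reachable from `seed` within `max_depth` hops."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        term, depth = frontier.popleft()
        if depth == max_depth:
            continue                    # stop the propagation here
        for neighbor in graph.get(term, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    seen.discard(seed)
    return seen

graph = {
    "analgesics": ["aspirin", "ibuprofen"],
    "aspirin": ["analgesics", "salicylate"],
    "ibuprofen": ["analgesics"],
    "salicylate": ["aspirin", "willow bark"],
}
# Two hops from "analgesics" reaches salicylate but never "willow bark".
```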
now do things on the fly while you wait. So you don’t have to pre-compute
the collections anymore, which is what
we had to do before. You can do a search, make a new
collection, and then make it semantic. Then make it deep. You can cluster it on the fly
into little chunks, you can find these inter-related
graphs while you wait. And that’s while you wait
with a $25,000 server. If you had something the size
of Google you could not only do all the world’s knowledge,
which isn’t that much proportionally bigger, but
you could also do deeper. So that’s why now I want
to show you what a real concepts-based system looks
like so that you get some feeling as to how different
the interaction is. Generally, there’s two things
in an interspace system. One of them is called
federation– I’ve been talking about
that before. It’s how do you go from one
collection to another. The other is called
the integration. It’s if you have an entity
what can you go out to. We’re talking about going
across collections, and I didn’t mean to say this
was replacing IP. IP is under everything. But what I meant is it’s
replacing words. This was the first interspace
system, the one DARPA paid for. There aren’t words
in this anymore. When you point to
this you get– you get that whole phrase,
simple analgesics, and all the things that are equivalent to
it phrase-wise after you do all the linguistic parsing. So it looks like a bunch
of words, but it isn’t. It’s a bunch of inter-related
concepts and those are uniformly indexed across
all the sources. So you can go from simple
analgesics here to all the concepts, all the phrases that
it’s nearby, to the ones that are nearby there, to the
documents, to which little cluster it’s in. You can sort of go from concept
to concept, the concept across all the
different sources. The main reason I showed this
was to just show that words don't exist anymore. You've got these deeper-level things, which I tried to convince you earlier
was possible to do. Also because this DARPA project
broke up in 2000, just before 9/11; DARPA yanked the plug and decided they didn't want to help analysts anymore. Every person on the project
went to work at Microsoft. So it’s entirely possible that
Windows 2010 is going to have all this stuff in it. Yes, question? AUDIENCE: That’s the
one, this one? AUDIENCE: The hexagon,
[INAUDIBLE]. BRUCE SCHATZ: It is actually,
and let me postpone that because I’m going to answer
it better later. But basically, it’s taking the
document collection and grouping it into individual
groups of documents which have a similar set of phrases
in them. This is just a bad graphical
representation of it. But I’ll give a good example
of it later. So yes, it’s quite meaningful,
and you’ll see in the session what its utility is. So what I’m actually going to
talk about now in the last five minutes of my talk is
this BeeSpace system. It is about honey bees, so
you’re allowed to have cute pictures and cute puns. Yeah, see? The college students don’t
ever laugh at this, but I always thought it was
funny, bee-havior. So much for that. It must be an age thing. So the point of this system is
you make many, many small collections, and you know
something, and you want to use your terminology and
your knowledge to go somewhere else. So you want to go from molecular
biology into bees, into flies, into neuroscience. So I’m working with the person
that actually is the national lead on the honey bee genome. What it does inside is basically
uses this scalable semantics technology to create
and merge spaces– you’ll hear a lot about spaces
in the next five minutes, so I won’t explain right now– to try to find stuff. So it’s complete navigation,
complete abstraction, but finding things when you don’t
know what you started with. Space is a paradigm,
not a metaphor. I hope I’m not offending any
user interface people. I’m not sure if Dan is still
sitting in the back. In other words, there really
are spaces in there. You take a collection, you
make it into a space. You can then merge two of them,
you can pull out part of it, you can break it into parts
and make one part of that the whole space. So it’s like you have all the
world’s knowledge and you’re breaking it into conceptual
spaces which you can manipulate. You personally, plus you can
share them with other people. So it has quite a different
character than you’re trying to do a search and you get
a set of results back. This particular one doesn't do entities very universally; it only does concepts and genes
because that’s all this subject area needed. So please don’t criticize
that particular one. It was chosen narrowly because
we wanted to at least have one that did those uniformly. These are the main operations
I’m now going to show you very quickly through a session. If you go to the BeeSpace site,
which was on that bag. You can use the system and beat it to death, assuming you can read Medline articles,
which you may or may not be able to. So extract is going to take a space and figure out all the special terms that distinguish that space, and have a way of searching. Mapping is going to go
back the other way. It’s going to take a space and
break it into parts, and then you can turn each one
into a space itself. This is space algebra, and this
is the summarization. If you find an entity, it
does something with it. You probably don't care about this example, but it's looking at behavioral maturation. It's looking at a honey bee as
it grows up, it takes on different societal roles. It takes care of the babies,
it goes out and forages for food, and we're looking at that across
different species. So it’s a complicated
question. It’s not one that there’s a
well-defined answer to. So now we’re into the BeeSpace
system, which is running right now. So you type behavioral
maturation, you choose a particular space that was
already made; it's insects, it's about 100,000 articles,
and you do browse. So that gets about 7,000
articles, which are here, which is too much to look at. The problem was behavioral
maturation wasn’t the right term. The first thing the system’s
doing is it’s extracting. It tries to go in and analyze
the terms, the phrases, and get out a more detailed set. So that’s issuing extract. It automatically pulls out the
most discriminating terms in that collection, and
you usually have to edit it a little. That’s what I did here. Then you can take those
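The extract operation can be approximated with a simple discrimination score. This toy Python sketch ranks terms that occur proportionally more often in a space than in a background collection; the scoring and smoothing are illustrative assumptions, not the BeeSpace implementation:

```python
from collections import Counter

def extract_terms(space_docs, background_docs, top_n=3):
    """Rank terms that occur proportionally more often in the space
    than in the background collection (a crude discrimination score)."""
    space = Counter(w for d in space_docs for w in d.split())
    background = Counter(w for d in background_docs for w in d.split())
    n_space = sum(space.values())
    n_back = sum(background.values())

    def score(term):
        p_space = space[term] / n_space
        # add-one smoothing so unseen background terms don't divide by zero
        p_back = (background[term] + 1) / (n_back + len(background))
        return p_space / p_back

    return sorted(space, key=score, reverse=True)[:top_n]

# Hypothetical micro-collections:
space = ["forager bee dances", "forager bee flies"]
background = ["bee hive home", "queen bee rules", "inside the hive"]
print(extract_terms(space, background))
```

A generic word like "bee" scores low because it is common everywhere, which is why the extracted list needs only a little hand editing before you browse again.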
back and browse again. It’s not working. Oh, did it go? Yeah, I’m sorry. There it is. You got more items. 22,000. AUDIENCE: That’s not necessarily
good if you were trying to narrow it down. BRUCE SCHATZ: The problem was
you narrowed it down too much. You didn’t actually get the
articles about behavior maturation because a lot
of them didn’t say it. What you want to get is all of
the things that might be interesting and then
narrow it down. So that first one was trying to
expand it a little bigger. It was doing sort of a semantic version of query expansion. Now the problem is this one is
too many to actually look through, and now I’m going to go
back the other way and sort of answer the question that
was asked before. So this is automatically taking
that collection, and while you wait, it’s breaking
it into a number of different regions. Here it’s about 20– some of
them are off the page. And the regions tend to be– they’re sort of these small
worlds regions that tend to be tightly on the same topic. The topics are kind of hard to describe because they're automatic,
they’re not on some well-defined term, but they tend
to cluster together well. The thing to notice is this was
done while you wait, even with this small server. So the collection was made on
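The mapping step can be mimicked with connected components over a shares-enough-terms relation. This is a toy Python sketch; the real regions come from phrase-level clustering, and the similarity test here is a stand-in assumption:

```python
def map_into_regions(docs, min_shared=2):
    """Group documents into regions: connected components under a
    'shares at least min_shared terms' relation. A toy version of the
    on-the-fly mapping in the demo (real regions use phrase clusters)."""
    words = [set(d.split()) for d in docs]
    n = len(docs)
    parent = list(range(n))

    def find(i):  # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if len(words[i] & words[j]) >= min_shared:
                parent[find(i)] = find(j)

    regions = {}
    for i in range(n):
        regions.setdefault(find(i), []).append(i)
    return sorted(regions.values())

docs = ["bee forager dance", "bee forager hive",
        "fly gene wing", "fly gene eye"]
print(map_into_regions(docs))
```

Because it is just set intersections and a union-find pass, this kind of grouping is cheap enough to run interactively on a freshly made collection.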
the fly, and this mapping was done on the fly. This is an interactive
operation. There was no pre-computation for this; you didn't have to can it. So we take this particular
region, and now we’re going to operate– this is that, just
that one cluster. Now we’re going to save it. And see, now it’s a fully
fledged space just like all the previous ones. So we were sort of navigating
through space, we found this little collection is what we
want, and now we’re making it into a space ourselves. This one is now well-defined,
it’s about behavioral maturation in a large about
insects, and we wanted to look at multiple organisms. So now
we’re going to start doing space algebra. We’re going to start taking
this and merging it with other things. So here I took the new space I
just made and I’m intersecting it with an old space. Currently, that’s just finding
the documents in common, but we're now working on fancier data mining to try to find other patterns. So here's the 21 that
have that feature. If you look at this article,
this article is, in fact, about some basic receptor in
Drosophilidae, which is the fruit fly, an insect,
but it’s about– well, I’m sorry it’s
not fishes. Marine crustaceans are
like lobsters. But it’s something that
lives in the sea. Since you now found something
at the intersection of those two, what you really wanted to
do was describe the genes. Here you can point
to this gene. This was entity-recognized automatically, in green, and the system tried to summarize it. So here it's summarized. You can see the summary parts,
however the problem is this particular intersected
space has hardly any documents in it. So there’s not very
much to summarize. You did get the right gene, but
you didn’t summarize it against the useful space. What you want to do is go switch
this term over into this other space, into the
Drosophilidae space, which has like 50,000 articles, and
then summarize it again. So here’s an article
that has it in it. This one you can see
has more entities automatically selected. Then here’s the gene summary
against that space, again, done on the fly. So this is a general summary
facility that if you have an entity and you have a space,
so you have a specific term and a collection, you want to
see what’s known about it in that collection. This is a type of summary you
can do while you wait. It’s a scalable one. You can break it into a
well-known category. You can rank order the sentences
in those particular categories. This is kind of like a news summary, but it's a semantic type of news summary. Then if you went into that, you
would see that there are lots of entities recognized here. All the things in green were
done automatically, and if you pointed to these you would then
summarize those in this space, or you can go off to
another one and summarize it. So I want to just say each one
of these main features was done dynamically on new
collections, while you wait. And you can basically expand the
searches, you can take a search that’s too big and break
it into pieces, you can make a new space and do algebra,
do intersection on it, or if you find a particular
entity, you can summarize it in different
ways. Those are examples of the kinds
of things that you can all do automatically. So the message is these are all
general, and if you have to do biology you sort of work
up towards the interspace, where you're intersecting all the spaces using these sets of ones, by doing birds and bees
and pigs and cows and brains and behavior. These are actually all projects
I’m working on. I work in the genome
center where this project is going on. So it is a birds-and-bees and pigs-and-cows project in some respect. Let me now conclude to allow
some time for questions by just saying this is actually
quite a different world. It’s not pile all the world’s
knowledge in one big place. It’s have many small little
ones, including ones that are sort of dynamic communities
that are made on the fly. And because of that every person
that’s doing it is actually doing just about
everything there– indexing it, using the system,
they’re making new collections, they’re authoring
materials themselves. And the system itself
could be occurring all in one big server. And ours, of course, does. But it could also occur
in many small places. It’s a very small, localized
kind of system. My guess is if you had to do
this on 10 trillion, which is what’s going to be true in a
decade on the web, then you wouldn’t have four or five big
servers that cover the world. What you’d have is at the end of
every block, or you’d have a hierarchy like the telephone
network used to have, where you'd have servers that actually
handled each set of spaces that they were doing live
manipulation against. It’s quite a different world. It’s much more like the virtual
worlds that the kids today wander around in. Maybe you all are a little bit
too old to spend all your time on Neopets or even
on Second Life. So I promised I would end
with what’s a grand project you could do. So one grand project you could
do is take some set of people, like university is very
convenient because you can force undergraduates to do
just about anything. If you’re at the University
of Illinois, there’s 35,000 of them. There’s quite a few of them. There’s few less because
some of them came here. And you capture all the text,
the library and the courses actually where– our library has just gone into the Google Books program– and all the context, which tries to do the relationships, partially live with this kind of system, and partially by– well, I guess this is actually
OK to say but, if you gave everyone at the University
of Illinois free Gmail and a free Gphone– I guess the Gphone isn’t
announced yet, but there’s lots of rumors on the web
that there will be one. Anyway, if you gave everybody
a free email and phone and said with the proviso that
we’re going to capture everything you ever do and we’re
going to use it for good purposes, not selling you ads,
but trying to relate things together to help you understand
the context, then the university would be
delighted because they’d like to educate, not just the people
on campus, but people all over the world and make
money charging them tuition. People at Google might be
delighted because normally you couldn’t do this experiment
because you would get sued out of existence, even with your
lawyers I would guess, if you tried to surreptitiously capture
all the voice that was coming out of the Gphone. That's not proposed, is it? I've had people tell me the
University of Illinois might refuse to have it done. But if the undergrad takes it,
that’s the deal, right? They take it as long as we’re
going to record everything. You might really be able to
build a semantically-based social network, so you’re not
sharing a YouTube video because it's got the same little tag on top of it, but by some real, deep, scalable semantics
underneath. So that’s all I have to say, and
I did promise I would put some bees at the end. So someday we will do hive mine,
and it probably will be in your guys’ lifetime,
but not in mine. That’s all I have to say. Thank you. [APPLAUSE] Question, yes? AUDIENCE: I was wondering– [SIDE CONVERSATION] AUDIENCE: I was wondering could
you use the semantic relationships that you’ve
built up to debug the language itself? In other words, create some kind
of metric that detects whether the description or the
expression of a particular concept is coherent or
incoherent, and essentially flag places where the
terminology is insufficiently expressive. BRUCE SCHATZ: Could you
hear the question or should I repeat it? OK. The question was can you
regularize the language since you’re now detecting
all these patterns? That’s actually been done quite
a bit with tagging to quite a large degree
of success. So the reason that our digital
library project succeeded, and the one at Elsevier, which is a big publisher, failed, is we had a set of programs that went
through and automatically cleaned up the tagging, the
structure tagging, that was coming back from the publishers
that the authors had provided, and then sent
corrective information to the author’s telling them what
they should have done. But the things that went
into our system were the cleaned up ones. It’s what data mining people
call cleaning the data. It is true that the more regular
things are the better they work, so that if you tried
to do a chat session, like an IM text messaging, it
would work much worse than it did with biology literature,
which is much more regularized. The general experience with
these kinds of systems is that people are much better at
hitting the mark than computers are at handling
variability. So it’s kind of like those
handwriting recognizers that you learned how to write
[UNINTELLIGIBLE]. So my guess is that yes, the
users are trainable. And if I tried to do this with
undergrads, I would certainly do things like fail people that
got in too many– you know, it's like if your programs don't parse correctly then you don't get
a passing grade. It’s a problem though. The more regular the world is,
the better this brand of semantics does. Is there another question? Yes. AUDIENCE: I will start with a
simple practical question. When I go to PubMed and ask
for references including phytic acid, it knows that
phytic acid is inositol hexakisphosphate. Is there any automation in that
process, or is that just a laborious transcription
process on the part of a human being? BRUCE SCHATZ: OK, if you're
asking what PubMed does, the answer is they have a big
translation table with all those wired in. It’s because they’re a
large organization with a lot of libraries. They’re actually able to provide
a large set of common synonyms to things. If you have an automatic system
it can’t do that. Well, actually ours is sort
of a hybrid system. Ours actually uses the synonyms
like that that PubMed has as a boost to finding
equivalent ones. If you’re not able to do that,
there’s a whole set of linguistics processing that
tries to find things that are synonyms to different
degrees of success. It looks for things
that are in the same slots in sentences. It looks for equivalent
sentences that had different subjects that were
used the same. It uses ways that acronym
expansion is commonly done. There's a set of heuristics that
work some of the time, maybe two-thirds of the time in
regularized text like this. But they’re not perfect in the
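One of those heuristics, acronym expansion, is easy to sketch: match a parenthesized short form against the initials of the words just before it. A toy Python version; the pattern and limits are illustrative, not the system's parser:

```python
import re

def find_acronym_pairs(text):
    """Very small heuristic: look for 'long form (ABC)' and check that
    the acronym letters match the initials of the preceding words."""
    pairs = {}
    for match in re.finditer(r"((?:\w+\s+){1,6})\((\w{2,6})\)", text):
        words, acronym = match.group(1).split(), match.group(2)
        if len(words) >= len(acronym):
            candidate = words[-len(acronym):]
            if all(w[0].lower() == a.lower()
                   for w, a in zip(candidate, acronym)):
                pairs[acronym] = " ".join(candidate)
    return pairs

text = ("Levels of juvenile hormone (JH) rise as the "
        "polymerase chain reaction (PCR) runs.")
print(find_acronym_pairs(text))
```

Like the other heuristics, this works some of the time in regularized text and fails on acronyms that aren't simple initialisms, which is the two-thirds-of-the-time behavior described above.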
way– the ones you’re seeing are all human generated, and
that’s why they’re so good. You will always use human
generated ones if you could, and in fact, it’s very likely
when I give a more popular version of this kind of talk,
what people point out is even though the kids on the block
that maintain the cat– the one about cats. You know, the small, specialized
collections. Even though they’re not willing
or probably not able to do semantic mark-up,
they are able to do lots of other creation. They are able to show typical
sentences, they are able to do synonyms. And there may be a lot
of value added that comes in at the bottom that improves
each one of these community collections. I expect that that will be a
big, big area when it becomes a big commercial thing. You’ll need to have the users
helping you to provide better information, better
context. Yes, Greg? AUDIENCE: Remember– don’t
go all the way back– I remember the slide about
functional phrases, and it seemed that in the three
examples that were on that slide, they were of the form of
something I might call a template predicate. In other words, A template
relates to B. You seem to be saying that the system
automatically derives those templates from analyzing
the text. Is that correct? BRUCE SCHATZ: That is correct. AUDIENCE: So my question
then is this. Can you compare and contrast
that technique of producing templates to two other things. Number one, the system that
the Cyc guys did to make– BRUCE SCHATZ: [INAUDIBLE]. AUDIENCE: –to make predicates,
but starting from a different point and ending in
a different point, although they have predicates. That’s comparison number one. Comparison number two is with
respect to, let me just call them template predicates for
lack of a better word. If you have those and you
created them solely from deriving them from
text, then you don’t have world knowledge. You basically have knowledge
that just came from the documents. It seems to me that getting from
the one to the other is what Cyc was trying to do, but
I understand that since they were doing it by hand they
abandoned that and they’re now trying to do automatic
techniques. So that thread of thought
seems to be in the same ballpark as what you’re trying
to do here, but with a different approach. I was wondering if you can
compare and contrast, and maybe there’s a third area of
endeavor trying to get to that next step up that maybe you
could educate us about. BRUCE SCHATZ: Yeah. That is a very, very
good comment. For those of you that don’t know
what Cyc is, C-Y-C. It was a very ambitious attempt at
MCC to try to encode enough common sense knowledge about
all of the world so that it could automatically do
this kind of thing. As Greg said, it was
largely a failure. So let me sort of say what the
spectrum of possible things is as a longer answer. Am I running over my time? Is it OK? MALE SPEAKER: It’s lunchtime
for a lot of these people. Let’s say another five minutes
and then we’ll formally break, and then people who want to hang
out it’s OK, we got it. [SIDE CONVERSATION] BRUCE SCHATZ: I usually get
lunch people by saying there’s free food, but that
doesn’t work here. AUDIENCE: We all
work for food. BRUCE SCHATZ: You all
work for food. So Greg asked a very good
question about where’s the line in automaticness. Well, the old way of solving
this problem used to be that you had a fixed set of templates. Where that essentially hit a wall is that each small subject area needed a different
set of templates, and it was a lot of work
to make the templates. So then they were a set of
people that said you needed, if you had a small amount of
basic world knowledge, you wouldn’t need the templates, you
could automatically make training examples. The problem is that that could
rarely, only in very isolated cases, could do even as good a
tagging as what I am showing. What most of the people do now
and what most of the examples that I was showing are is a
human comes up with a set of training examples of what are
typical sentences with genes in them in this particular
subject domain. Then the system infers, exactly
as you said, the system infers what the grammar
is, what the slots are going to be. There’s a few people
experimenting with two automatic things, and they don’t
work at present, but my belief is in the next year or
two you’ll see research systems with it. If you had a concerted
commercial effort after it you could probably do it and get away with
it, it just wouldn’t work all the time. They’re essentially either
trying to automatically make training sets, so you start out
with the collection and you try to pull out sentences
that clearly have some slots in it and then just infer
things from that. Or they try to automatically
infer tags, infer grammar. So you know some things, like
you know body parts and you know genes, and the question
is can you infer behavior, because you know in slots, you
already have slots in the particular subject domain. My feeling is one of those two
will work well enough so that you can use it automatically
and it will always do some kind of tagging. It won’t be as accurate
as these, which are generally correct. And it could either just be
left as is, so it’s like a baseline of everything is tagged
and 60% of them are correct and 30% of them
are ridiculous, but 60% buys you a lot. Or they could be the input
to humans generating it. So the curators I’m working with
in biology, we already have a couple pieces of
software that do this. They don’t have to
look at all the sentences in all the documents. We give them a fixed set of
sentences that are sort of these are typical sentences that
might be ones you’d want to look At And then they
extract things out. So there’s a human step afterwards that does a selection. Almost all the statistical– I went kind of fast through
it– but almost all the statistical programs don’t
produce correct answers. They produce ranked answers that
the top ones are sort of in the next band. My expectation is the tagging
will be like that. So the practical question is
which things are good for which kind of text. So I guess we have time
for another question if anyone has one. MALE SPEAKER: You actually had
another one because you started with your easy one. BRUCE SCHATZ: Should we take
someone else? MALE SPEAKER: Let me just
suggest, because it is getting close to lunchtime,
let me suggest one last basic question. Given all the information about
bees, have you been able to figure out why they’re
disappearing? BRUCE SCHATZ: It turns out
actually we have a summer workshop on exactly
that topic. And the answer is, like
most things about bees, nobody knows. MALE SPEAKER: So much
for that idea. OK, well thank you
very much, Bruce. We appreciate the time. Those of you who want to hang
out, Bruce has time to stay this afternoon. We can all have lunch
together. BRUCE SCHATZ: I’m generally
hanging out today and tomorrow morning, and there’s a lot of
stuff about the system up on the BeeSpace site, which you’re
welcome to look at. And the slides are also going
to be made available if you want to flip through them. Thank you everyone for staying
through the whole thing.
