One CA Podcast is here to inspire anyone interested in traveling to work with a partner nation's people and leadership to forward U.S. foreign policy. We bring in current or former military, diplomats, development officers, and field agents to discuss their experiences and give recommendations for working the "last three feet" of foreign relations. The show is sponsored by the Civil Affairs Association.
Episodes
Sunday Apr 08, 2018
1: Jon May: Artificial Intelligence for HA/DR Operations - LORELEI
Please welcome Jon May, Research Assistant Professor of Computer Science at the University of Southern California.
Dr. May describes his work on a DARPA-funded artificial intelligence project called Low Resource Languages for Emergent Incidents (LORELEI) and its connections with HA/DR operations for Civil Affairs.
One CA is sponsored by the Civil Affairs Association.
Hosted and edited by John McElligott.
---
Transcript
00:01:00 Introduction
and welcome to the 1CA podcast. My name is John McElligott. We're joined today by Jonathan May. He received his PhD in computer science from USC in 2010. Prior to rejoining USC and the Information Sciences Institute in 2014, he was a research scientist at SDL Language Weaver. Jon's research areas include natural language processing, specifically machine translation, semantic parsing, and formal language theory. Dr. May, thank you very much for your time. Thanks very much for having me. It's great to be here. Sir, before we dive into the program that you're working on and how it relates to humanitarian assistance and disaster response and the civil affairs branch of the military, we want to go through some of the basics of what your field entails. So if you could go into more detail about your background and the natural language processing field. Sure, great. I was a computer science major in college, and I started to become very interested in artificial intelligence.
00:02:09 SPEAKER_04
I thought it was really cool that, you know, we could build systems that could try to mimic the brain, sort of, or play games against humans. And in particular,
00:02:23 SPEAKER_04
I liked the idea of it, and I discovered this field called natural language processing, which is really about how humans and computers can talk to each other: really, how computers can understand human language and then produce human language, and everything that that entails.
00:02:44 SPEAKER_04
And today you see a lot of natural language processing, which is also sometimes known as computational linguistics, in your day-to-day life. So if you're just using, say, Google and typing a search query there, you're using your own words to try to figure out what you want,
00:03:00 SPEAKER_04
and then a computer algorithm somewhere is trying to find a web page that's responsive to you. So that's natural language processing right there. Other areas are determining when you spelled a word wrong. A kind of classic example is Siri, who's listening to you speaking,
00:03:19 SPEAKER_04
understanding the speech patterns, turning those into words, understanding what those words are supposed to mean, and then trying to give you an answer. Automatic translation, which is, you know, where you've got some Chinese webpage and you want to figure out, you know, what does this mean?
00:03:36 SPEAKER_04
You know, maybe it's a train ticket booking page. You need to figure out how to buy your tickets, and they don't have the data; somebody didn't write a translation, so you have to automatically translate these words. And then you can actually engage in commerce there, even though they don't speak your language and you don't speak theirs. So I love all that stuff. It seems to me like a great way, particularly translation, to unify the world, so we're all kind of speaking one language together. And yeah, there have been lots of great accomplishments over the past 20 years or so, and I think there's a lot more still to be done. It seems to be a field that's advancing at a rapid pace right now. Yes, yes. The field has really been around about as long as computers have been around. Pretty much, you know, the early computers developed at the end of the Second World War were first used for calculating missile trajectories,
00:04:31 SPEAKER_04
but then the second use was trying to do automatic translation. In particular, in the early 50s, the U.S. was particularly keen, of course, on translating Russian. And this was way back when, but it wasn't very good for a very long time. But in the modern era, we have volumes of data available to us and really sophisticated, fast hardware that's able to process this data. So we're able to take advantage of all this data and learn statistics about it, and that has led to lots of really practical gains, in the past, say, five to seven years in particular.
00:05:21 SPEAKER_04
You've probably heard about the advent of deep learning, which is the use of a particular kind of technology called neural networks. And they have led to some really stunning developments. Now, sometimes it can be hard to tell whether you're talking to a computer or a human. Wow. And so it's fascinating. And I wanted to ask you about a question that was included in a brief that you had provided to some civil affairs troops recently. The question was, can we leverage artificial intelligence, or AI, to respond to disasters around the world? What inspired you to ask that question? I want to give credit to DARPA for really asking that question before I did. But I saw,
00:06:07 SPEAKER_04
well, I think they saw, and we all saw it together. I was working for... this machine translation company after graduation in 2010.
00:06:16 SPEAKER_04
And I remember, so this was a company, and we were providing translation in many different kinds of languages to companies, for some government projects, and also to help human translators do their job better. And I remember there was the earthquake, I believe, in Haiti. And it was a big humanitarian crisis. Most of the people in Haiti, of course, speak Haitian Creole, which isn't a language that we've historically spent effort on trying to build automatic translation systems for. There's not a lot of data. There are not too many people who actually speak Haitian Creole, just the population of Haiti,
00:06:56 SPEAKER_04
which is relatively small. But I asked my boss at the time, I said, you know, is there anything that we could do? I feel like maybe we could be of some service. And he said, well, I don't think there's much we can do. I mean, you know, these people are in a crisis situation right now. And it takes us quite a bit of time to gather enough data to build a system, and even building the systems takes some time. And by the time we're ready to deploy a translation system to maybe connect, say, USAID providers with the people on the ground who are maybe texting out their requests, it's going to be too late. So we didn't do anything, but there were people who did. And there was a program where they went down,
00:07:37 SPEAKER_04
and there was a team of people who did what I do. But they also brought in native Haitians, expats, and they were trying their best to use what technology they could, and also just scrambling to translate these things as fast as possible. But it would have been better if they had prepared this sort of thing ahead of time.
00:08:00 SPEAKER_04
Well, prior to that, I worked on a team, I think back in, I want to say, 2003. And we were looking into, you know, what if we needed to develop a system in a new language for translation. Or, sometimes translation is fine, but you typically get lots and lots of data thrown at you all at once. I think analysts can receive tens of thousands of documents that they have to sift through a day, and just translating them all is not necessarily going to be that great. There are other techniques that are part of natural language processing: understanding the most important parts of a document, trying to provide a summary, or just identifying the names of the people, the places, and maybe the events that are happening in a big picture, to allow some triage to happen. So we wanted to know, could we build those systems? If we just learned about a language, and somebody said, okay, go, build a system, what could you do in 30 days? And back in 2003, we tried doing this.
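The triage idea described here, surfacing names of people, places, and organizations before anyone translates the full documents, can be sketched with a deliberately crude heuristic. A real system would use a trained named-entity recognizer; the capitalization rule below is only illustrative:

```python
def naive_entities(text):
    """Very rough triage sketch: treat capitalized words that are not
    sentence-initial as candidate names of people, places, or orgs."""
    tokens = text.split()
    ents = []
    for i, tok in enumerate(tokens):
        word = tok.strip(".,!?")
        prev = tokens[i - 1] if i > 0 else ""
        # A word starts a sentence if it's first or follows end punctuation.
        sentence_start = i == 0 or prev.endswith((".", "!", "?"))
        if word[:1].isupper() and not sentence_start:
            ents.append(word)
    return ents

report = "Floods hit Haiti today. Officials in Port-au-Prince asked USAID for help."
print(naive_entities(report))  # ['Haiti', 'Port-au-Prince', 'USAID']
```

This rule misses lowercase names and fires on sentence-medial capitalized words of any kind, which is exactly why the field moved to statistical and neural taggers.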
00:08:58 SPEAKER_04
And I was really kind of taken by how surprisingly well we were able to do with the language at the time, Cebuano,
00:09:06 SPEAKER_04
which is... Where is that spoken? I think it's in the Pacific, in the Pacific Islands region,
00:09:15 SPEAKER_04
and I should look that up. Give me a second,
00:09:19 SPEAKER_04
if that's all right. Maybe Papua New Guinea or someplace like that? So, I'm sorry: the Philippines. Yes, it's an Austronesian language, so it's native to the Philippines. It's the second most spoken language in the Philippines after Tagalog.
00:09:40 SPEAKER_04
That should have been fresh in my mind. But anyway, yes, so it's spoken in the Philippines. But I hadn't studied it before, and most of our team hadn't. And, you know, we did a pretty good job. It was kind of surprising how well we were able to do without too much specific Cebuano data, and we didn't talk to any Cebuano experts. And so I think this idea was sort of stirring around, and then after 2010, DARPA came out with this program,
00:10:12 SPEAKER_04
the name of the program was LORELEI, and it was about trying to be responsive to humanitarian aid and disaster relief needs when you don't have a lot of resources available, in terms of data and in terms of time. So given very limited data in the language that you need to build a system for, and given a very limited amount of time,
00:10:34 SPEAKER_04
really, ideally, 24 hours is what they're aiming for. What kind of systems can we build? What kind of technology can we build? And so that's been a major focus for me, and for a number of researchers around the world, over the past few years. And it's been great because we get to work with people who speak the language but aren't experts in linguistics or experts in computer science,
00:10:57 SPEAKER_04
and they teach us about their language in this really limited time frame. And we're able to build surprisingly sophisticated systems. It was surprising to me at first, actually. And, you know, if you have a little more time, you do a little better, but when you don't have a lot of time, you can still do pretty well. I think there's also been some nice interest in deployment in various agencies. So it's been a pretty nice story.
00:11:28 SPEAKER_04
Right. Yeah, I think 24 hours is very fast for anyone, but especially for civil affairs and for the military. Unless we happen to be on the ground or in country already, if there was a natural disaster or outbreak or some kind of man-made event, it would take a little bit longer for most teams to respond. But if USAID or some other assets were already on their way as a DART team, for example, then we would be coordinating with them, and having a system like this in place would be very helpful. Well, it's really great to hear that 24 hours is a little too fast because, to be honest, if you wait a week, it's a lot better. So, you know, we can do some early triage, but then the more we see how we're doing at the beginning, the better our systems can get. So in our early days, we did give ourselves up to a month. And by the time you're done with a month of training, you've actually got a fairly usable system. It's still not at the same level as, say, a French-English translation system, where we've got billions of words of French and English and we've been studying that problem for years and years.
00:12:47 SPEAKER_04
We do pretty well, and we learn more insights about the language over time, too. So our first year, we were working with Uyghur, which I'm actually kind of pronouncing wrong, I think. But this is a language that's spoken in China, in the Xinjiang region, which is in the northwest. So it's spoken by an ethnic minority. It's a Turkic language, actually. It has no relationship to Mandarin. And so we were working with Uyghur, and we realized after a few days,
00:13:22 SPEAKER_04
maybe a week of working with it, that, hey, this language is actually quite similar to some language that we've already got data for. We had a lot of Uzbek data. And so we were able to develop techniques for pretending that the Uzbek was Uyghur, actually transforming the Uzbek into Uyghur. And that increases the amount of data that you've got available. And this is a major part of this program: trying to look around and see, even though you don't have a lot of resources in the language that you care about, if you have a lot of resources in other related languages (and you can figure out what those related languages are), can you leverage those? Right. And furthermore, to some degree, all languages have things in common, right? So even though
00:14:09 SPEAKER_04
Chinese and English might seem very, very far apart from each other, and in many ways they are, there are still common understandings that underlie all languages, and you can take advantage of these things too. So there are kind of language-universal ideas.
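The earlier Uzbek-to-Uyghur trick can be illustrated at its very simplest as rewriting high-resource-language text so it looks more like the low-resource language. The word pairs below are invented stand-ins, and the real transformation was far more sophisticated than a lookup table; this is only a sketch of the principle:

```python
# Toy substitution table standing in for a learned mapping between two
# closely related languages; these word pairs are invented for illustration.
RELATED_LANGUAGE_MAP = {"kitob": "kitab", "suv": "su"}

def borrow_related_data(sentence, mapping):
    """Rewrite a sentence word by word using the mapping, keeping any
    unmapped words unchanged, to grow the low-resource training data."""
    return " ".join(mapping.get(word, word) for word in sentence.split())

print(borrow_related_data("kitob suv keldi", RELATED_LANGUAGE_MAP))
# kitab su keldi
```

Every sentence converted this way becomes extra (noisy) training material for the language you actually care about.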
00:14:25 SPEAKER_04
So if you have a bunch of news data, say, and it's in some language you don't know at all, maybe you're not even told what the language is, you can still assume that people are probably going to be talking, at some point, about dates, right? You know, days of the week or months or years. Right. And, you know, we do tend to segment our
00:14:51 SPEAKER_04
calendar into, you know, roughly four-week chunks, so there are between 28 and 31 days in every month. And so you can pick up on these common regularities: when you see those numbers between 1 and 28 being used near the same words over and over again, you can maybe guess that those nearby words are names of months. It's kind of like a cryptic puzzle, in a way.
00:15:13 SPEAKER_04
Right. Like the way a linguist would break it down. Exactly, exactly.
00:15:17 SPEAKER_04
Or really like a Rosetta Stone kind of approach, right? You're triangulating words together and really kind of unlocking the logic puzzle.
00:15:27 SPEAKER_00
Right, the syntax of it. Yeah, yeah. And the trick there is, can you write algorithms to do this?
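The month-name heuristic just described can in fact be written down as a tiny algorithm. A minimal sketch, run over an invented corpus in a made-up language where the word "zorn" plays the role of a month name:

```python
from collections import Counter

def guess_month_words(sentences, min_count=3):
    """Count words appearing immediately next to day-of-month numbers
    (1-31); frequent neighbors are candidate month names."""
    neighbor_counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok.isdigit() and 1 <= int(tok) <= 31:
                for j in (i - 1, i + 1):  # word before and after the number
                    if 0 <= j < len(tokens) and not tokens[j].isdigit():
                        neighbor_counts[tokens[j]] += 1
    return [w for w, c in neighbor_counts.most_common() if c >= min_count]

# Invented corpus: "zorn" keeps showing up next to day numbers.
corpus = [
    "li 3 zorn taq".split(),
    "mekt zorn 15 paru".split(),
    "zorn 28 li taq".split(),
    "paru li mekt taq".split(),
]
print(guess_month_words(corpus))  # ['zorn']
```

With real data you would use much larger counts and smarter statistics, but the shape of the idea, exploiting regularities shared by all calendars, is the same.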
00:15:33 SPEAKER_04
And, you know, can you get away with doing it when you have imperfect data, noisy data, not a lot of data, data that's not even related to your task? This is a big part of the HA/DR issue, right? So we want to respond to earthquakes, civil unrest, droughts, floods,
00:15:51 SPEAKER_04
explosions, terrorism. But the data that we have often is not really that.
00:15:55 SPEAKER_04
You know, sometimes the most frequent data you're going to have is the Bible and the Quran. Now, there are floods and earthquakes and violence and uprisings and wars in those documents, but they're written in a very different way from the way people talk about these things nowadays. So in your initial translation engines, you'll often see some very flowery language, and this is usually because you're picking up words and phrases that you've learned from translations of the Bible or the Quran.
00:16:24 SPEAKER_04
And so it's a big challenge to try to figure out what kind of language somebody who's in a disaster situation is using, and to train your systems to be specifically aware of that domain.
00:16:47 SPEAKER_04
Yeah. One thing that might be helpful is to include, as early on as possible, a linguist who understands the language, but also someone from the area who understands the colloquialisms and can tell you, well, this is the way we use this word, and in another part of the country they don't use it at all. Absolutely. And this is a major part of this program, actually. I think it's one of the things that makes this program unique relative to other language technology programs: this express notion, in kind of a controlled study, of getting access to a native speaker who can help you,
00:17:27 SPEAKER_04
We call these the native informants. We like to think of them as like a taxi driver. So it's somebody who is local to the U.S. now but is not from here. Their first language is the language in question, and they speak English as a second language. They don't necessarily have linguistic expertise, but they do know about their country, of course. And so a big research challenge for us is to know how we can use human resources as effectively and efficiently as possible. And actually,
00:18:02 SPEAKER_04
it occurs to me that civil affairs probably... has a lot of strategies for how to engage with local populations and how to acquire information in the right way and ask questions in the style that's appropriate. This is a big problem for us, actually.
00:18:24 SPEAKER_04
We're computer science nerds, right? And I think, like everybody, we're in our own world a lot of the time. We're talking with our own people, and we have our own acronyms for things. And, you know, I might use a term of ours, like, oh, can you tokenize this data and figure out the alignment of the foreign and English words? You know, I don't expect most people to understand what I'm talking about there. These are some very, very specific kinds of things.
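As a concrete illustration of what "tokenize" means in this context, one very common version of it just splits raw text into word and punctuation tokens (real NLP pipelines use more careful, language-specific tokenizers; this regex is a minimal sketch):

```python
import re

def tokenize(text):
    """Split raw text into word and punctuation tokens, one common
    meaning of 'tokenize' in NLP preprocessing."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The earthquake hit Port-au-Prince, Haiti."))
# ['The', 'earthquake', 'hit', 'Port', '-', 'au', '-', 'Prince', ',', 'Haiti', '.']
```

Downstream steps like word alignment, pairing each foreign token with the English token(s) it translates to, operate on these token sequences rather than on raw strings.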
00:18:57 SPEAKER_04
Just to say tokenize. Tokenize, right? Yeah, what does that mean? Right. It has a very specific meaning in the way that I would think about it, and without thinking about it, I would just use that word and assume everybody knows it. And I imagine a similar kind of situation happens with civil affairs, where, you know, you have your own terminology, but if you want to get information
00:19:18 SPEAKER_04
from somebody who's, you know, outside that bubble, right? You have to think about how to engage them. And in fact, it might be nice to have a conversation about the lessons that you've learned, and that civil affairs has learned institutionally, because we are designing systems, we're designing methodologies, for interfacing with our native informant. But it is challenging. We found that even just asking a question like, is this sentence a good translation of this other one? Is this a good English translation of this foreign sentence? It's very hard to get an answer that is both timely and helpful. Because, you know, we want to know: is it just two completely random sentences that could be not alike?
00:20:07 SPEAKER_04
They're like night and day to each other. Or is it that maybe there are some shades of translation difference that aren't quite captured, that we don't really care about? And by asking the question wrong, we can spend 20 or 30 minutes going down a rabbit hole and not really get the data that we need. And so a challenge that we're addressing, and I think getting better at, is just asking the right questions that are going to help us.
00:20:33 SPEAKER_04
We'll be right back after a word from our sponsor. Let me tell you about the Civil Affairs Association, the main sponsor of the 1CA podcast. Established in 1947, the Civil Affairs Association is a veterans organization serving professionals of the U.S. civil affairs community. Members have served or are currently serving in the armed forces or are the descendants of those who served. As a tax-exempt organization, the association operates within the guidelines of Internal Revenue Code Section 501(c)(19). It is organized for educational, professional, fraternal, and social purposes. The association promotes esprit de corps and disseminates relevant information. The CA Association also serves as an advocate for civil affairs within DOD to ensure an adequate capability to perform any mission assigned or tasked to the CA community. Membership costs are low: E1 through E4 pay only $5 a year, E5 through E9 pay $20, cadets and midshipmen pay $10, and officers and civilians pay $25 a year. Life membership is also low, pegged now at $200. So if you're committed to the CA community, then it makes a lot of sense to invest in a life membership and save in the long term.
00:22:05 SPEAKER_04
Hi, welcome back to the 1CA podcast. Dr. May, how do your data sources vary in the test cases you've run so far, from an urban area that may have better newspaper distribution or readability, to more social media access and use, to a more rural area? How does that differ, even in the U.S. if you were going to test it here, but also in other countries in foreign languages? How would that apply? Yeah, that's a great question. So we are tasked with analyzing news data. So for one thing, we have text data, right? So, like, printed material,
00:22:42 SPEAKER_04
and then also audio, spoken material. For the text data, which is the majority of it, we have news articles. We have discussion forums, right? So something like Reddit, the kind of thing where people are having a conversation. It's casual, but sometimes long paragraphs are written. And also social media, much like tweets, basically, or other kinds of short social media.
00:23:10 SPEAKER_04
And then with audio, we have broadcast dialogue and also, I think, broadcast news. And it has been interesting to see how these things have differed. So, like I said, in the first year,
00:23:26 SPEAKER_04
this was two years ago at this point, we were in our surprise language evaluation, where we're actually being tested. This is where the government says go, and then you have X amount of time to build your system. We were using Uyghur there. We found that we had really rich, interesting information in discussion forums and in news articles in particular. The event was about an earthquake, but there was not too much in tweets that even discussed any kind of earthquake or any kind of calamity. And I remember being told by a native informant that it could be politically sensitive to complain too much about something,
00:24:16 SPEAKER_04
you know, earthquakes damaging houses. And so the Uyghur people, according to our native informant, were not too happy to do that. And the news articles themselves often had a bit of an official government feel to them,
00:24:32 SPEAKER_04
I would say. There was a lot of talk about house construction to prevent earthquake damage. There wasn't much of a sense of outright distress, I would say, that we would have expected to see. You didn't see that. So in the next year, we were using two languages at that point.
00:24:58 SPEAKER_04
Those were Tigrinya and Oromo. And these are languages that are spoken in Eritrea and Ethiopia, and also Somalia, I think, so that region. We were getting a lot of political differences; there was a lot of discussion. So there was civil unrest and drought, I believe, being discussed there.
00:25:23 SPEAKER_04
There was a lot of discussion of gang activity, or al-Shabaab activity, I believe,
00:25:27 SPEAKER_04
And I remember you would get one side saying these people were terrorists and the other side saying these people were heroes. And I remember one particular time a native informant reading an article and saying, you know, they're saying this is happening and this is happening, but that's totally not true; they wouldn't do that. And so you have these questions where it's maybe not clear what the truth is. Right. And so,
00:25:57 SPEAKER_04
I suppose, I think that was more of an urban-area kind of setting, and other parts of the region are more rural. Don't hold me to that; I haven't been on the ground in these places. Your job is not to assign a level of validity to the source; it's really just to do the science of the translation. That is true.
00:26:22 SPEAKER_04
So I very much do focus on translation, but the program overall is focused not only on translation but on recognizing the names of people and places and facilities and organizations, highlighting those things, and figuring out when the same person is being referred to multiple times. You know, you can refer to a person, or really anything, in more than one way, right?
00:26:47 SPEAKER_04
You can talk about, you know... Yeah, good guys and bad guys, and whether you're using the proper name or a code name or something else. Exactly. People have nicknames, or they're just referred to with a pronoun. It's not necessarily clear. It may be clear to us as humans,
00:27:05 SPEAKER_04
but it can be difficult for a computer to figure out who the "he" is referring to, especially if two men are mentioned at the same time. But through context, we can usually figure it out based on our knowledge of the world.
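The pronoun ambiguity described here can be sketched with a deliberately naive rule. Real coreference resolvers use much richer context; this "most recent name wins" rule exists only to show exactly where such a shortcut goes wrong:

```python
def resolve_he(tokens, known_men):
    """Toy coreference sketch: link each 'he' to the most recently
    mentioned known male name. Returns one guess per 'he' seen."""
    last_man = None
    links = []
    for tok in tokens:
        if tok in known_men:
            last_man = tok
        elif tok.lower() == "he":
            links.append(last_man)  # may be the wrong man, or None
    return links

tokens = "John met Bob and then he left".split()
print(resolve_he(tokens, {"John", "Bob"}))  # ['Bob'] - but maybe John left!
```

A human reader uses world knowledge and discourse cues to decide between John and Bob; the rule above cannot, which is the point Dr. May is making.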
00:27:21 SPEAKER_04
We can figure out what's going on, but it can be very hard for a computer to do that. And then on top of that,
00:27:27 SPEAKER_04
we're trying to figure out, given a document, more or less what the entire situation is on the ground. So, you know, what is the overall event that's being described and what are the needs that people have and have those needs been met and to what degree is there a sense of urgency?
00:27:47 SPEAKER_04
So essentially you can think of it as an entire analyst summary of maybe a document, or even multiple documents. So this is an extremely challenging job. For that last part in particular, it is helpful to know what the truth is. And so I do focus mostly on the machine translation in there. It's more like, okay, given that there's this document, how should that document be translated, right? And Dr. May, is the long-term intent that DARPA has set out to have a tool that could be on a dashboard somehow, or on a mobile device? Do they have a vision for how the tool could be used by the military or USAID or other U.S. assets? I don't want to put words in DARPA's mouth. Okay. But although we're doing basic research in this program, we are definitely tasked with delivering systems that our clients can use. I think that different clients have had different needs.
00:28:55 SPEAKER_04
It could be adaptable. Many clients want to work in an air-gapped environment.
00:29:03 SPEAKER_04
For security, secure data. And so, you know, we need to be mindful of that. A lot of times, research systems can be built on a variety of computers in a university, and we'll call out to the internet. You know, that's not going to fly for somebody that's working with data that needs clearance. And yes, it's definitely the case that I've heard from clients I've met at demo days that they're operating in the field, so they need to run on a laptop or a Toughbook in the field. I want to be careful; like I said, I don't know exactly what DARPA's particular needs are here, but it definitely varies,
00:29:52 SPEAKER_04
I think. That's fascinating. Well, what is your timeline and your team's involvement? Are you going to be collecting data over the next year or two? Is this a contract that you have that's long-term? Yeah, well, so I think the program term is about four years, and it's divided into phases, and there are checkpoints. Sometimes funding is altered, and sometimes there are downselects. And this is kind of the standard world of government research. The way a lot of these programs work is that there are multiple teams working on more or less the same program all at once. So there are teams at Carnegie Mellon and Johns Hopkins and the University of Washington. Let's see, where else? All over the place. I'm forgetting some of my friends. Everywhere, wow. Do you guys get together once in a while to share notes? So we share notes, and the nice thing about this is that we meet and we learn from each other. We'll go in different directions, and if somebody's technology is working better, then we'll say, oh, okay, why did you do that? And this comes up at academic conferences, and it also comes up at our regular program meetings. There's kind of a nice big sea of ideas there, and it's a nice way to exchange ideas and to also have a common goal that we're all striving for. So anyway, your question: yes, it's about four years. We're in, I think it's called, phase two at this point, and we have another evaluation coming up in June, and the evaluations keep getting harder and harder. So the first year we had 28 days, and the second year we had 21. We might have, I think, fewer this time. We do our best to try to actually get to the end condition even as early as possible.
00:31:45 SPEAKER_04
So even in year one, we're producing a 24-hour system. So you say 28 days, and then ratcheting down to 21 days. That's when DARPA gives you a language and a scenario, and you have to put it together as quickly as you can. As quickly as we can, with checkpoints. In the first year, we had 28 days; that was the whole evaluation period. We got our data packs, and then in seven days, our first system was due. And then after a total of 14 days,
00:32:17 SPEAKER_04
so seven more days, our second was due. And then after 28 days, the third one was due. So the systems would get better and better. And when we meet, we also compare notes with everybody.
00:32:26 SPEAKER_04
You know, everybody who's been evaluating together, we see, oh, you know, what did you do with this period? What did you do with that period? Oh, you know, boy, it was really hard communicating with Native Informant 2. Yeah, but Native Informant 4 was great. And, you know, we learned that we asked this kind of survey question, and it works really well. And it's really great, actually. But so, yeah. You have the most junior postdoc student on coffee duty to keep you alive and moving the whole time? It's something like that, yeah. I mean, I think... I'm at this level. I'm not super senior and I'm not super junior, but I'm super stressed. So, you know, you're spending a lot of hours on native informant duty and making sure,
00:33:05 SPEAKER_04
and making sure, when you're submitting your system to be scored, that all your i's are dotted and all your t's are crossed, right? And I double-check and double-check and double-check that we did everything right, because it's like taking an exam, basically: you want to make sure you've done really well. But of course, the evaluation kind of drives you in a way, right? That's the big adrenaline rush. But it's all in service of developing technology. And the real benefit is there's been some really great progress in developing language-universal tools: systems that are allowing monolingual English speakers to act like bilingual speakers. So one of the big advantages that we've had is we developed this tool that lets us act like we are the native informant and create translation data, because we're so data-hungry.
00:33:57 SPEAKER_04
We'd like to get data that's about the domain in question, in the foreign language and English, so that we can build good translation systems. We just don't have that data at first, so we need to somehow get some. And we don't get too much time with the native informants. And asking people who are not experienced translators to just translate documents for us, it's very tough to do that. So we built a system that allows us to actually, even though I don't speak Oromo or Tigrinya or Uyghur, it kind of merges the artificial,
00:34:37 SPEAKER_04
the machine translation technology, so the AI, with my own human intuition. So if I can look at a bunch of possible translations of a sentence, or possible translations of the words of a sentence, I can kind of get a sense for what the sentence should be about, and then knowing something about world knowledge lets me be a better translator of that sentence. Like at one point, I remember seeing, I can't remember, I want to say Czech... We were trying some trials before our real evaluation. And I don't know anything about Hungarian. But I remember seeing something where they were talking about an earthquake in Japan. And then it was like, you can see it from space. And I was like, that's impossible. And then, no, no, I remember seeing an article about how the earthquake was so powerful that you could actually see its effects from space. And then something kind of clicks.
00:35:30 SPEAKER_04
And then I was able to translate the rest of the document much more easily, because I had access to world knowledge. It's still really hard to teach computers to do that. But I was able to use that to build a little data set. And building that little data set is very helpful in making sure that your systems are good.
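The kind of interface Dr. May describes, where a monolingual English speaker scans the possible translations of each word and uses world knowledge to pick the plausible sentence, can be sketched roughly like this. This is a hypothetical illustration, not the actual LORELEI tool, and the earthquake-themed word options are invented:

```python
import itertools

def candidate_sentences(word_options):
    """Enumerate every sentence you can build by choosing one
    translation option per source word (a tiny word lattice)."""
    return [" ".join(choice) for choice in itertools.product(*word_options)]

# Per-word translation alternatives an annotator might be shown.
options = [
    ["earthquake", "tremor"],
    ["visible", "seen"],
    ["from space"],
]

for sentence in candidate_sentences(options):
    print(sentence)
# Prints four candidates; a reader who recalls the news story picks
# "earthquake visible from space" as the plausible one.
```

The machine supplies the options; the human supplies the world knowledge that no model of that era had.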
00:35:48 SPEAKER_04
You're like Neo in The Matrix. You plug it into the side of your head and you figure it out. We haven't gotten to the point yet where we actually plug the computers into the brains, but we're kind of there. Speaking of brains: as I mentioned, this deep learning technology has been wildly successful. There's been a lot of news about it, the programs that can do really well at Go or chess now. And the thing about these technologies is that they're extremely data-hungry. So they work really well when you've got hundreds of millions or billions of words of examples of translations. And they don't work so well when we don't have a lot of data, when we maybe only have a few hundred thousand words of the Bible translated. So we've been really pushing on this, because when they do work, they work great. We want to be able to have them work when you don't have a lot of data. So we're actively developing techniques for building translation systems that are as fluent as these really nice deep learning models but that don't have to be trained on all the French-English in the world. I would think that as machine processing power improves, and as technology shrinks, or at least the capacity to process on a smaller device improves over time, something more applicable to the field would be more likely. So if we had a civil affairs team somewhere, a pocket-sized approach, instead of needing the processing capabilities that you have at USC. I don't know what it requires right now, but if we could shrink that down over time, it would be much more applicable to teams in the field. Absolutely.
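One widely used remedy for the parallel-data shortage described above is back-translation: use a small reverse-direction model to turn plentiful monolingual English text into synthetic training pairs. The sketch below only illustrates the idea; the function names and the stub "model" are invented, not the project's actual system:

```python
def reverse_translate(english_sentence):
    """Stand-in for a weak English->foreign-language model trained on
    the tiny seed corpus; a real one would emit foreign-language text."""
    return "<synthetic foreign text for: %s>" % english_sentence

def back_translate(seed_pairs, monolingual_english):
    """Augment a small parallel corpus with synthetic (source, target)
    pairs built from abundant monolingual target-side text."""
    synthetic = [(reverse_translate(s), s) for s in monolingual_english]
    return seed_pairs + synthetic

seed = [("<real foreign sentence>", "the bridge is damaged")]
mono = ["we need clean water", "the road to the village is blocked"]

corpus = back_translate(seed, mono)
print(len(corpus))  # 3 training pairs grown from 1 real one
```

The synthetic source side is noisy, but the English side is real, which is often enough to make a fluent-output model trainable with far less genuine parallel text.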
00:37:30 SPEAKER_04
I think we are mindful of that. So you do see a lot of work, especially because these really nice models are pretty big, on shrinking them down, on getting them to fit on a cell phone processor. And, of course, the cell phone processors are getting better as well,
00:37:50 SPEAKER_04
Yeah, that's a key concern. You want to make sure the data can fit on the phone and the processing can too, and that it's not going to drain the battery super quickly, because if you start doing a whole lot of processing on your phone, then all of a sudden the phone goes off as well. And then, to the degree that we're able to be successful there, we have to go outside of AI, right? We have to look at, you know, kind of the
00:38:07 SPEAKER_04
computer engineering and electrical engineering: how do you make hardware better? What kind of core algorithms can you use to shrink stuff down so it can fit? So there's been some really great work on that as well. Absolutely. Well, is there any way that your team would want to get connected to civil affairs, to find out a little bit more about what we do, or maybe psychological operations? Because they're deeply involved with monitoring media, the targets that we have, and the populations that we are mindful of when we're in other countries. So from those two perspectives, it would be very helpful, I think, for your team to understand the types of questions we ask, why we ask them, and why it's important to the Army and the Marine Corps and civil affairs. Well, I think this came up, actually. I wasn't thinking about this ahead of time, but like I said, the interaction with native informants is something where I imagine you might have a book on that already; you guys have way more experience than we do in that kind of engagement.
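The earlier point about shrinking models to fit on a phone can be made concrete with post-training quantization, one common compression trick: store each weight as an 8-bit integer plus a shared scale instead of a 32-bit float, roughly a 4x size reduction. This is a toy sketch of the idea, not the project's actual tooling:

```python
def quantize(weights):
    """Map float weights onto int8 codes in [-127, 127] plus one scale."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(codes, scale):
    """Recover approximate float weights from the int8 codes."""
    return [q * scale for q in codes]

weights = [0.81, -1.27, 0.003, 0.5]   # imagine millions of these
codes, scale = quantize(weights)       # 1 byte each instead of 4
restored = dequantize(codes, scale)

# Rounding error is bounded by half a quantization step.
error = max(abs(w - r) for w, r in zip(weights, restored))
print(error <= scale / 2 + 1e-9)  # True
```

Real toolkits refine this with per-layer scales and outlier handling, but the trade is the same one mentioned in the conversation: a much smaller, battery-friendlier model for a small, bounded loss in precision.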
00:39:28 SPEAKER_04
So I think that would be useful. Another, more fundamental aspect is that you already told me a little bit about how it's important for people operating in the field to have stuff that will fit in a small space, and a lot of times the work that we do is targeted at that; I hear that a lot, right? And the more exposure that we as researchers can have, at least speaking for myself, the more I understand the pain points of the people who actually want to use the technology, the better ideas I have, really, and the more responsive I can be, right? So I think there are pains that you experience that I don't even know about. And if I did know about them, they could open a completely new area of research that I'd get super excited about. So, yeah, just talking with people, seeing how you work, to the degree that that's possible, is, I think,
00:40:31 SPEAKER_04
would be super helpful. I mean, and, you know, it can go both ways to some degree, right? So, like, when we produce some system that doesn't work for you, you know, it's helpful if we've had a conversation already about what we're doing. You know, we don't have the problem where we're talking. own terminologies past each other right we do understand each other's world so that we can you know get the right the right interaction between there and resolve whatever problems we do have absolutely if i'm being clear enough but like i think that sometimes you have really there's really simple issues but they can be complicated by a lack of for better for better expression translation i mean you translate between just one one person one field's experience and the others
00:41:13 SPEAKER_04
Dr. Jonathan May, I want to thank you for your time. Thank you for being here on the 1CA podcast and talking about LORELEI, Low Resource Languages for Emergent Incidents, the program that you're working on for DARPA. It's fascinating, and I'm sure that we'll stay in touch after this to talk about how we can connect your team with civil affairs active duty and reserve elements, and possibly psychological operations as well, for the Army and Marine Corps, to help you make some progress. That sounds great. It's been really a pleasure talking to you today. Thank you for spending some time with us. Please subscribe and come back for another installment of 1CA.