org.apache.commons.codec.language.bm.BeiderMorseEncoder Java code examples

@Override
public Object encode(final Object source) throws EncoderException {
  if (!(source instanceof String)) {
    throw new EncoderException("BeiderMorseEncoder encode parameter is not of type String");
  }
  return encode((String) source);
}

@Test
public void testOOM() throws EncoderException {
  final String phrase = "200697900'-->&#1913348150;</  bceaeef >aadaabcf\"aedfbff<!--\'-->?>cae"
      + "cfaaa><?&#<!--</script>&lang&fc;aadeaf?>>&bdquo<    cc =\"abff\"    /></   afe  >"
      + "<script><!-- f(';<    cf aefbeef = \"bfabadcf\" ebbfeedd = fccabeb >";
  final BeiderMorseEncoder encoder = new BeiderMorseEncoder();
  encoder.setNameType(NameType.GENERIC);
  encoder.setRuleType(RuleType.EXACT);
  encoder.setMaxPhonemes(10);
  final String phonemes = encoder.encode(phrase);
  assertTrue(phonemes.length() > 0);
  final String[] phonemeArr = phonemes.split("\\|");
  assertTrue(phonemeArr.length <= 10);
}

@Test
public void testSetNameTypeAsh() {
  final BeiderMorseEncoder bmpm = new BeiderMorseEncoder();
  bmpm.setNameType(NameType.ASHKENAZI);
  assertEquals("Name type should have been set to ash", NameType.ASHKENAZI, bmpm.getNameType());
}

private BeiderMorseEncoder createGenericApproxEncoder() {
  final BeiderMorseEncoder encoder = new BeiderMorseEncoder();
  encoder.setNameType(NameType.GENERIC);
  encoder.setRuleType(RuleType.APPROX);
  return encoder;
}

@Test(expected = IllegalArgumentException.class)
public void testSetRuleTypeToRulesIllegalArgumentException() {
  final BeiderMorseEncoder bmpm = new BeiderMorseEncoder();
  bmpm.setRuleType(RuleType.RULES);
}

@Test
public void testSetRuleTypeExact() {
  final BeiderMorseEncoder bmpm = new BeiderMorseEncoder();
  bmpm.setRuleType(RuleType.EXACT);
  assertEquals("Rule type should have been set to exact", RuleType.EXACT, bmpm.getRuleType());
}

@Test
public void testSetConcat() {
  final BeiderMorseEncoder bmpm = new BeiderMorseEncoder();
  bmpm.setConcat(false);
  assertFalse("Should be able to set concat to false", bmpm.isConcat());
}

@Override
protected StringEncoder createStringEncoder() {
  return new BeiderMorseEncoder();
}

private void assertNotEmpty(final BeiderMorseEncoder bmpm, final String value) throws EncoderException {
  Assert.assertFalse(value, bmpm.encode(value).equals(""));
}

@Test
public void testSpeedCheck3() throws EncoderException {
  final BeiderMorseEncoder bmpm = this.createGenericApproxEncoder();
  final String phrase = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz";

  for (int i = 1; i <= phrase.length(); i++) {
    bmpm.encode(phrase.subSequence(0, i));
  }
}

@Test
public void testSpeedCheck2() throws EncoderException {
  final BeiderMorseEncoder bmpm = this.createGenericApproxEncoder();
  final String phrase = "ItstheendoftheworldasweknowitandIfeelfine";
  for (int i = 1; i <= phrase.length(); i++) {
    bmpm.encode(phrase.subSequence(0, i));
  }
}

/**
 * Tests we do not blow up.
 *
 * @throws EncoderException
 */
@Test
public void testAllChars() throws EncoderException {
  final BeiderMorseEncoder bmpm = createGenericApproxEncoder();
  for (char c = Character.MIN_VALUE; c < Character.MAX_VALUE; c++) {
    bmpm.encode(Character.toString(c));
  }
}

/**
 * (Un)luckily, the worst-performing test because of the data in {@link #TEST_CHARS}.
 *
 * @throws EncoderException
 */
@Test(/* timeout = 20000L */)
public void testSpeedCheck() throws EncoderException {
  final BeiderMorseEncoder bmpm = this.createGenericApproxEncoder();
  final StringBuilder stringBuffer = new StringBuilder();
  stringBuffer.append(TEST_CHARS[0]);
  for (int i = 0, j = 1; i < 40; i++, j++) {
    if (j == TEST_CHARS.length) {
      j = 0;
    }
    bmpm.encode(stringBuffer.toString());
    stringBuffer.append(TEST_CHARS[j]);
  }
}

/**
 * Tests https://issues.apache.org/jira/browse/CODEC-125?focusedCommentId=13071566&page=com.atlassian.jira.plugin.system.issuetabpanels:
 * comment-tabpanel#comment-13071566
 *
 * @throws EncoderException
 */
@Test
public void testEncodeGna() throws EncoderException {
  final BeiderMorseEncoder bmpm = createGenericApproxEncoder();
  bmpm.encode("gna");
}

@Test(timeout = 10000L)
public void testLongestEnglishSurname() throws EncoderException {
  final BeiderMorseEncoder bmpm = createGenericApproxEncoder();
  bmpm.encode("MacGhilleseatheanaich");
}

Javadoc

Encodes strings into their Beider-Morse phonetic encoding.

Beider-Morse phonetic encodings are optimised for family names. However, they may be useful for a wide range of words.

This encoder is intentionally mutable to allow dynamic configuration through bean properties. As a result, it may not be thread-safe. If you require guaranteed thread-safe encoding, use PhoneticEngine directly.

Encoding overview

Beider-Morse phonetic encoding is a multi-step process. Firstly, a table of rules is consulted to guess what language the word comes from. For example, if the word ends in "ault", it is inferred to be French. Next, the word is translated into a phonetic representation using a language-specific phonetics table. Some runs of letters can be pronounced in multiple ways, and a single run of letters may be broken up into phonemes at different places, so this stage results in a set of possible language-specific phonetic representations. Lastly, these language-specific phonetic representations are processed by a table of rules that rewrites them phonetically, taking into account systematic pronunciation differences between languages, to move them towards a pan-Indo-European phonetic representation. Again, there are sometimes multiple ways this could be done, and sometimes things that can be pronounced in several ways in the source language have only one representation in this average phonetic language, so the result is again a set of phonetic spellings.

Some names are treated as having multiple parts. This can be due to two things. Firstly, they may be hyphenated. In this case, each individual hyphenated word is encoded, and then these are combined end-to-end for the final encoding. Secondly, some names have standard prefixes, for example, "Mac/Mc" in Scottish (English) names. As sometimes it is ambiguous whether the prefix is intended or is an accident of the spelling, the word is encoded once with the prefix and once without it. The resulting encoding contains one and then the other result.

Encoding format

Individual phonetic spellings of an input word are represented in upper- and lower-case roman characters. Where there are multiple possible phonetic representations, these are joined with a pipe (|) character. If multiple hyphenated words were found, or if the word may contain a name prefix, each encoded word is placed in parentheses and these blocks are then joined with hyphens. For example, "d'ortley" has a possible prefix. The form without prefix encodes to "ortlaj|ortlej", while the form with prefix encodes to "dortlaj|dortlej". Thus, the full, combined encoding is "(ortlaj|ortlej)-(dortlaj|dortlej)".
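
The format described above can be taken apart with plain string handling. The following is a minimal sketch, not part of Commons Codec: the class name BmEncodingParser and its parse method are hypothetical, and it assumes the encoded string follows exactly the layout described here (pipe-separated alternatives, optionally wrapped in parenthesised blocks joined by hyphens, as in the "d'ortley" example).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Commons Codec) that splits an encoded
// string into one list of phonetic alternatives per encoded part.
final class BmEncodingParser {

    private BmEncodingParser() {
    }

    static List<List<String>> parse(final String encoded) {
        final String trimmed = encoded.trim();
        final List<List<String>> parts = new ArrayList<>();
        if (trimmed.isEmpty()) {
            return parts;
        }
        if (trimmed.startsWith("(") && trimmed.endsWith(")")) {
            // Multi-part form: strip the outer parentheses, then split
            // on the ")-(" joints between blocks.
            final String inner = trimmed.substring(1, trimmed.length() - 1);
            for (final String block : inner.split("\\)-\\(")) {
                parts.add(Arrays.asList(block.split("\\|")));
            }
        } else {
            // Single-part form: just pipe-separated alternatives.
            parts.add(Arrays.asList(trimmed.split("\\|")));
        }
        return parts;
    }
}
```

For the combined encoding quoted above, parse would yield two parts: one holding "ortlaj" and "ortlej", the other holding "dortlaj" and "dortlej". A single-word encoding yields one part.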

The encoded forms are often quite a bit longer than the input strings. This is because a single input may have many potential phonetic interpretations. For example, "Renault" encodes to "rYnDlt|rYnalt|rYnult|rinDlt|rinalt|rinult". The APPROX rules will tend to produce larger encodings as they consider a wider range of possible, approximate phonetic interpretations of the original word. Down-stream applications may wish to further process the encoding for indexing or lookup purposes, for example, by splitting on pipe (|) and indexing under each of these alternatives.
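
As an illustration of the downstream processing mentioned above, here is a minimal sketch of an in-memory index keyed by each phonetic alternative. The PhoneticIndex class is hypothetical and not part of Commons Codec. It assumes the encoded strings were produced beforehand (for example by BeiderMorseEncoder) and are in the single-word, pipe-separated form without parenthesised blocks.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory index (not part of Commons Codec): every
// pipe-separated alternative of an encoding becomes a lookup key.
final class PhoneticIndex {

    private final Map<String, Set<String>> namesByPhoneme = new LinkedHashMap<>();

    // Registers a name under each alternative of its precomputed encoding.
    void add(final String name, final String encoded) {
        for (final String alternative : encoded.split("\\|")) {
            if (!alternative.isEmpty()) {
                namesByPhoneme.computeIfAbsent(alternative, k -> new TreeSet<>()).add(name);
            }
        }
    }

    // Returns every indexed name that shares at least one alternative
    // with the (already encoded) query.
    Set<String> lookup(final String encodedQuery) {
        final Set<String> hits = new TreeSet<>();
        for (final String alternative : encodedQuery.split("\\|")) {
            final Set<String> names = namesByPhoneme.get(alternative);
            if (names != null) {
                hits.addAll(names);
            }
        }
        return hits;
    }
}
```

With "Renault" added under the encoding quoted above, a query whose encoding contains any one of those six alternatives, such as "rinalt", would match it. Whether matching on a single shared alternative is strict enough is a design choice left to the application.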

Most used methods

encode
<init>
getNameType
Gets the name type currently in operation.
getRuleType
Gets the rule type currently in operation.
isConcat
Discovers if multiple possible encodings are concatenated.
setConcat
Sets how multiple possible phonetic encodings are combined.
setMaxPhonemes
Sets the maximum number of phonemes that shall be considered by the engine.
setNameType
Sets the type of name. Use NameType#GENERIC unless you specifically want phonetic encodings optimize
setRuleType
Sets the rule type to apply. This will widen or narrow the range of phonetic encodings considered.
