// Canonicalize the raw text line before tokenization: unify apostrophe
// variants, then quote/hyphen variants, then whitespace and soft hyphens.
// NOTE(review): this is a fragment of an enclosing method not visible in
// this chunk; `line` is declared elsewhere.
line = TextUtil.normalizeApostrophes(line);
line = TextUtil.normalizeQuotesHyphens(line);
line = TextUtil.normalizeSpacesAndSoftHyphens(line);
// Same normalization chain as elsewhere in this file (apostrophes, quotes
// and hyphens, spaces and soft hyphens), followed by collapsing repeated
// symbols via the local helper removeMultipleSymbols.
// NOTE(review): fragment of an enclosing method not visible in this chunk;
// `s` is declared elsewhere.
s = TextUtil.normalizeApostrophes(s);
s = TextUtil.normalizeQuotesHyphens(s);
s = TextUtil.normalizeSpacesAndSoftHyphens(s);
s = removeMultipleSymbols(s);
/**
 * Runs named-entity recognition over every sentence of the input file and
 * writes the annotated sentences to {@code <inputFileName>.ne} in the
 * output directory.
 *
 * <p>Steps: validate CLI arguments, extract sentences from the input
 * paragraphs, load the morphology and perceptron NER model, then tag each
 * normalized sentence and emit it in the configured annotation style.
 * Timing and throughput statistics are logged at the end.
 *
 * @throws Exception if argument validation, model loading, or file I/O fails
 */
@Override
public void run() throws Exception {
  initializeOutputDir();
  IOUtil.checkDirectoryArgument(modelRoot, "Model Root");
  IOUtil.checkFileArgument(inputPath, "Input File");

  // Output file name mirrors the input name with a ".ne" suffix.
  Path outputFile = outDir.resolve(inputPath.toFile().getName() + ".ne");

  List<String> paragraphs = Files.readAllLines(inputPath, StandardCharsets.UTF_8);
  List<String> sentenceList = TurkishSentenceExtractor.DEFAULT.fromParagraphs(paragraphs);
  Log.info("There are %d lines and about %d sentences", paragraphs.size(), sentenceList.size());

  TurkishMorphology morph = TurkishMorphology.createWithDefaults();
  PerceptronNer nerModel = PerceptronNer.loadModel(modelRoot, morph);

  Stopwatch timer = Stopwatch.createStarted();
  int totalTokens = 0;
  try (PrintWriter writer = new PrintWriter(outputFile.toFile(), "UTF-8")) {
    for (String raw : sentenceList) {
      // Normalize punctuation variants and whitespace before tokenizing.
      String sentence = TextUtil.normalizeApostrophes(raw);
      sentence = TextUtil.normalizeQuotesHyphens(sentence);
      sentence = TextUtil.normalizeSpacesAndSoftHyphens(sentence);

      List<String> tokens = TurkishTokenizer.DEFAULT.tokenizeToStrings(sentence);
      totalTokens += tokens.size();

      NerSentence tagged = nerModel.findNamedEntities(sentence, tokens);
      writer.println(tagged.getAsTrainingSentence(annotationStyle));
    }
  }

  double elapsedSeconds = timer.elapsed(TimeUnit.MILLISECONDS) / 1000d;
  Log.info("Token count = %s", totalTokens);
  Log.info("File processed in %.4f seconds.", elapsedSeconds);
  Log.info("Speed = %.2f tokens/sec", totalTokens / elapsedSeconds);
  Log.info("Result is written in %s", outputFile);
}